# Generate Synonyms Using GPT2 (Jonathan Chua)
This notebook fine-tunes a GPT-2 model on an existing corpus of word-synonym pairs so that it can predict synonyms for new words. The synonyms were taken from https://www.wordsapi.com/ and used as the training data.

The steps for obtaining and preparing the corpus can be found in this notebook: https://colab.research.google.com/drive/1w00viidTv7xXRDotvP-og9X0l2d6u2vK?usp=sharing

## Set up


In [1]:
!pip install -q tensorflow-gpu==1.13.1
!pip install -q gpt_2_simple

In [2]:
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files



In [4]:
gpt2.download_gpt2(model_name="345M")

Fetching checkpoint: 1.05Mit [00:00, 312Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 80.4Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 227Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:09, 143Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 247Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 92.2Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 133Mit/s]                                                       


In [5]:
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## Import corpus of words stored as a text file

In [6]:
file_name = "train_string.txt"

In [43]:
# Read the whole corpus into memory; the context manager closes the file.
with open(file_name, "r", encoding="utf-8") as f:
  string = f.read()

In [44]:
print(len(string))
string

515310
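
Before fine-tuning, it can be worth confirming that the corpus follows the `word <|startgen|> synonyms <|endgen|>` layout that the test set uses later in this notebook. The sketch below is only an illustration: it assumes one entry per line, which this notebook does not confirm, and `check_corpus` and the sample text are hypothetical names and data, not part of the original pipeline.

```python
START, END = "<|startgen|>", "<|endgen|>"

def check_corpus(text):
    """Split entries into (well_formed, malformed), assuming one entry per line."""
    well_formed, malformed = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        head, sep, rest = line.partition(START)
        ok = (
            bool(sep)                       # start marker is present
            and bool(head.strip())          # a word precedes the start marker
            and rest.strip().endswith(END)  # entry ends with the end marker
            and line.count(START) == 1
            and line.count(END) == 1
        )
        (well_formed if ok else malformed).append(line)
    return well_formed, malformed

# Hypothetical sample in the assumed layout; in the notebook, pass `string` instead.
sample = (
    "muse <|startgen|> contemplate meditate ponder <|endgen|>\n"
    "fabric <|startgen|> cloth material textile <|endgen|>\n"
    "broken entry without markers\n"
)
good, bad = check_corpus(sample)
print(len(good), "well-formed,", len(bad), "malformed")
```

If the real corpus separates entries in some other way, the `splitlines()` call is the part to adapt.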




## Training

In [9]:
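sess = gpt2.start_tf_sess()

# Fine-tune the 345M model on the synonym corpus.
gpt2.finetune(sess,
              dataset=file_name,
              model_name='345M',
              steps=2500,
              learning_rate=2e-5,
              restore_from='fresh',  # start from the base GPT-2 weights
              print_every=10,        # log the loss every 10 steps
              sample_every=200,      # print a generated sample every 200 steps
              save_every=500         # write a checkpoint every 500 steps
              )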

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Loading checkpoint models/345M/model.ckpt
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from models/345M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:01<00:00,  1.46s/it]


dataset has 157096 tokens
Training...
[10 | 40.86] loss=2.32 avg=2.32
[20 | 71.45] loss=2.12 avg=2.22
[30 | 101.84] loss=2.13 avg=2.19
[40 | 132.21] loss=2.18 avg=2.19
[50 | 162.74] loss=1.87 avg=2.12
[60 | 193.32] loss=2.26 avg=2.15
[70 | 223.73] loss=1.82 avg=2.10
[80 | 254.08] loss=1.99 avg=2.09
[90 | 284.46] loss=2.00 avg=2.08
[100 | 314.99] loss=2.02 avg=2.07
[110 | 345.53] loss=2.03 avg=2.07
[120 | 375.90] loss=2.05 avg=2.06
[130 | 406.23] loss=1.89 avg=2.05
[140 | 436.70] loss=1.90 avg=2.04
[150 | 467.22] loss=1.98 avg=2.03
[160 | 497.70] loss=1.79 avg=2.02
[170 | 528.02] loss=2.04 avg=2.02
[180 | 558.32] loss=1.81 avg=2.01
[190 | 588.78] loss=1.95 avg=2.00
[200 | 619.20] loss=1.98 avg=2.00
||----|-|--|-------|-------------------------| \______________| |__|_|__|_| |_| | |_| | | __| |_| |_| |_____| \_______ (__)__| __/| |__/| | | ( | \___|\___| \ | \ |_| |/|_|_|

{12}

|_________|___/| ____\__|____\__|_____| \__| |__\/ |__/ |__/( |__/| |___) |_| |_| | |( {12} |_________|___/| \_

KeyboardInterrupt: ignored
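
The training log above can be parsed to estimate throughput. The sketch below assumes each line has the form `[step | elapsed seconds] loss=... avg=...`; reading the second field as elapsed seconds is an assumption about the log format, and `parse_log` is a hypothetical helper. Three lines copied from the log above serve as input.

```python
import re

# Matches lines such as "[10 | 40.86] loss=2.32 avg=2.32".
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_log(text):
    """Return a list of (step, elapsed, loss, avg) tuples found in the log text."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in LOG_LINE.finditer(text)
    ]

log = """
[10 | 40.86] loss=2.32 avg=2.32
[100 | 314.99] loss=2.02 avg=2.07
[200 | 619.20] loss=1.98 avg=2.00
"""

rows = parse_log(log)
first, last = rows[0], rows[-1]
# Elapsed time between the first and last logged step, divided by the steps between them.
sec_per_step = (last[1] - first[1]) / (last[0] - first[0])
print(f"{sec_per_step:.2f} s/step")
```

Under that reading, the run averages roughly 3 seconds per step, so the full 2500 steps would take on the order of two hours.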

In [10]:
ls checkpoint/run1

checkpoint                                   model-1000.index
counter                                      model-1000.meta
encoder.json                                 model-1142.data-00000-of-00001
events.out.tfevents.1595935436.6111c5f2d344  model-1142.index
hparams.json                                 vocab.bpe
model-1000.data-00000-of-00001


In [11]:
gpt2.copy_checkpoint_to_gdrive()

## Generate Text

Load the test set from `test_list.json`.

In [15]:
import random
import json
with open('test_list.json', encoding='utf-8') as f:
  test_list = json.load(f)

len(test_list)

740

In [67]:
test_list[:5]

['bracing <|startgen|> brace brisk fresh refreshful refreshing <|endgen|>',
 'muse <|startgen|> contemplate excogitate meditate mull ponder <|endgen|>',
 'fabric <|startgen|> cloth material textile framework <|endgen|>',
 'disposition <|startgen|> temperament inclination tendency disposal <|endgen|>',
 'imperfect <|startgen|> fallible frail weak progressive <|endgen|>']

Take the first word of each test string as the starting word.

In [16]:
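def generate_and_compare(string):
  """Print five generated synonym lists for the entry's word, then the reference entry."""
  # The word to look up is the first token of the test string.
  word = string.split(' ')[0]

  gpt2.generate(sess,
              length=250,
              temperature=0.7,
              prefix=f"{word} <|startgen|>",  # prompt in the training format
              truncate='<|endgen|>',          # cut each sample at the end marker
              nsamples=5,
              batch_size=5
              )

  print(f"true synonyms are: {string}")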
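
`generate_and_compare` only prints the samples for inspection by eye. As a rough quantitative complement, the sketch below computes the fraction of reference synonyms that appear in at least one generated line. It assumes the generated lines are available as plain strings; `parse_entry` and `synonym_recall` are hypothetical helpers, and the inputs are copied from the "labor" sample in this notebook.

```python
START, END = "<|startgen|>", "<|endgen|>"

def parse_entry(line):
    """Split 'word <|startgen|> a b c <|endgen|>' into (word, [a, b, c])."""
    word, _, rest = line.partition(START)
    synonyms = rest.replace(END, "").split()
    return word.strip(), synonyms

def synonym_recall(generated_lines, reference_line):
    """Fraction of reference synonyms found in at least one generated line."""
    _, reference = parse_entry(reference_line)
    if not reference:
        return 0.0
    predicted = set()
    for line in generated_lines:
        predicted.update(parse_entry(line)[1])
    hits = [s for s in reference if s in predicted]
    return len(hits) / len(reference)

generated = [
    "labor <|startgen|> manual work manualisation ",
    "labor <|startgen|> job process undertaking jut protrude protrude outgrowth ",
]
reference = "labor <|startgen|> labour proletariat project task undertaking <|endgen|>"

score = synonym_recall(generated, reference)
print(score)
```

Exact matching is strict: a plausible synonym that is missing from the reference list (such as "brilliancy" for "brilliance") counts as a miss, so this score understates quality and is best read alongside the printed samples.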

### Some good samples

In [26]:
string = random.choice(test_list)
generate_and_compare(string)

brilliance <|startgen|> brilliancy luster lustre radiance exultation 
brilliance <|startgen|> luminancy brilliance brilliancy dynamism 
brilliance <|startgen|> brilliancy luster lustre radiance 
brilliance <|startgen|> brilliancy brilliance dazzle dazzle 
brilliance <|startgen|> brilliance brilliance fulness brilliancy glories 
true synonyms are: brilliance <|startgen|> genius blaze glare grandeur grandness <|endgen|>


In [65]:
string = random.choice(test_list)
generate_and_compare(string)

tribute <|startgen|> homage respect due thankfulness thank 
tribute <|startgen|> due gratitude thanks thanks 
tribute <|startgen|> gratitude givers giveers show-off 
tribute <|startgen|> amour attendant attendant-at-arms attendant 
tribute <|startgen|> pledge burthen stripe tincture 
true synonyms are: tribute <|startgen|> protection testimonial <|endgen|>


In [52]:
string = random.choice(test_list)
generate_and_compare(string)

labor <|startgen|> manual work manualisation 
labor <|startgen|> work do house care hair 
labor <|startgen|> work do part 
labor <|startgen|> grind grindout buttock buttockwork carver carverage 
labor <|startgen|> job process undertaking jut protrude protrude outgrowth 
true synonyms are: labor <|startgen|> labour proletariat project task undertaking <|endgen|>


In [56]:
string = random.choice(test_list)
generate_and_compare(string)

nurture <|startgen|> nurture nurture raise trained bred breed 
nurture <|startgen|> educate groom train educate train-up 
nurture <|startgen|> educate groom raise model 
nurture <|startgen|> foster raise groom raise progenitor 
nurture <|startgen|> train groom raise educate groom 
true synonyms are: nurture <|startgen|> raising rearing foster nourish sustain <|endgen|>


In [63]:
string = random.choice(test_list)
generate_and_compare(string)

evident <|startgen|> indisputable undeniable solid 
evident <|startgen|> undeniable undeniable unmistakable 
evident <|startgen|> indisputable undeniable undeniable undeniable 
evident <|startgen|> undeniable undeniable unmistakable 
evident <|startgen|> inevitable inevitable of-favor 
true synonyms are: evident <|startgen|> discernible observable apparent manifest palpable <|endgen|>


In [64]:
string = random.choice(test_list)
generate_and_compare(string)

disposition <|startgen|> attitude outlook look 
disposition <|startgen|> mind-set mindset mindset attitude posture 
disposition <|startgen|> attitude attitude disposition 
disposition <|startgen|> disposition feeling 
disposition <|startgen|> state of mind mental status 
true synonyms are: disposition <|startgen|> temperament inclination tendency disposal <|endgen|>


### Some bad samples

In [20]:
string = random.choice(test_list)
generate_and_compare(string)

auxiliary <|startgen|> auxiliary adjunct services 
auxiliary <|startgen|> auxiliary nuncio 
auxiliary <|startgen|> auxiliary supporting helper 
auxiliary <|startgen|> auxiliary help support adjunct 
auxiliary <|startgen|> auxiliary annunciate prophetic 
true synonyms are: auxiliary <|startgen|> subsidiary supplemental supplementary aide accessory <|endgen|>


In [24]:
string = random.choice(test_list)
generate_and_compare(string)

pointer <|startgen|> episode cameo appearance pitting 
pointer <|startgen|> radiative radiant cradle foundation 
pointer <|startgen|> inventor inventor joker joke 
pointer <|startgen|> balmy regular patchwork 
pointer <|startgen|> capture find grab touch 
true synonyms are: pointer <|startgen|> cursor arrow <|endgen|>


In [29]:
string = random.choice(test_list)
generate_and_compare(string)

follow <|startgen|> paucity scantiness scantiness scantiness 
follow <|startgen|> ah! ahh! ahhh! ahh! 
follow <|startgen|> bailey bair bairie chaparral scrub 
follow <|startgen|> leger gehenna har = 
follow <|startgen|> assent accede acquiesce 
true synonyms are: follow <|startgen|> postdate pursue trace adopt espouse <|endgen|>


In [32]:
string = random.choice(test_list)
generate_and_compare(string)

cordial <|startgen|> nebbishy prickly prickly 
cordial <|startgen|> ancestral ancestral inerrant authoritative 
cordial <|startgen|> calling callous callous-fathered hardened 
cordial <|startgen|> pliant tolerant 
cordial <|startgen|> dogmatic dogmaticial dogmaticous 
true synonyms are: cordial <|startgen|> liqueur affable amiable genial <|endgen|>


In [66]:
string = random.choice(test_list)
generate_and_compare(string)

design <|startgen|> repeatedly repeated 
design <|startgen|> tiffin 
design <|startgen|> bender bendancer contortionist contortionist contralto 
design <|startgen|> dab rag doll tell-all 
design <|startgen|> billy billy-crap billy-talk billywhip 
true synonyms are: design <|startgen|> blueprint pattern figure plan aim <|endgen|>
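
Several samples above repeat a word or echo the query word itself (for example `brilliancy brilliance dazzle dazzle`). A minimal post-processing sketch, assuming each generated line is available as a string in the same `word <|startgen|> ...` layout, could drop those repeats. `clean_sample` is a hypothetical helper, not part of the original notebook.

```python
START = "<|startgen|>"

def clean_sample(line):
    """Return the generated tokens in order, without duplicates or the query word."""
    word, _, rest = line.partition(START)
    word = word.strip()
    seen, cleaned = set(), []
    for token in rest.split():
        if token == word or token in seen:
            continue  # skip echoes of the query word and repeated tokens
        seen.add(token)
        cleaned.append(token)
    return cleaned

print(clean_sample("brilliance <|startgen|> brilliancy brilliance dazzle dazzle "))
print(clean_sample("evident <|startgen|> undeniable undeniable unmistakable "))
```

Because the sketch splits on whitespace, a multi-word output such as "state of mind" is treated as three separate tokens; handling phrases would need a different delimiter in the training format.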
