# Deep NLP - Word Embeddings

#### Agenda today
- Review foundations of NLP: Bag of words vs tf-df
- Representing contexts & preserving meaning of words: Continous bag of words
- Implementing word2vec

Think back to NLP as we've understood it so far.

If we've had some luck with NLP modeling, likely with a NaiveBayes algorithm, we were able to illustrate some correlations between words and some other feature of interest.

But to whatever extent that our models were able to make connections and pick up on correlations, they did this *without any understanding of the **meaning** of the words in question*.

Let's think for a minute about words and objective meanings!

We can make sense of meaning for computational purposes by thinking about meaning in terms of similarity, i.e. thinking about meaning *holistically*.

Q. Is there any precedent for this way of thinking about meaning? <br/>
A. [Yes](https://plato.stanford.edu/entries/meaning-holism/#ArgForMeaHol)

So what will this look like for us?

*Remember cosine similarity?*

$\rightarrow$We'll have much the same idea here: Associate each word with values along particular dimensions in a multi-dimensional space. If we had a dimension for *softness*, for example, then pillows and marshmallows would score higher on it than rocks and bricks.

#### Continous Bag of Words (CBOW)
The goal of continous bag of words is to predict _the probability_ of a word given some context. A context may be a single word or a vector of words. Below is the architecture of the model. 

<img src='cbow.jpeg' width = 400>

##### CBOW step by step 
- The input layer and the target, both are one- hot encoded of size [1 X V]

- There are two sets of weights. one is between the input and the hidden layer and second between hidden and output layer.

- Input-Hidden layer matrix size =[V X N] , hidden-Output layer matrix  size =[N X V] where N represents the number of neurons in the hidden layers. 

- The activation function in the hidden layer is linear! 

- The input is multiplied by the input-hidden weights and called hidden activation. It's the corresponding row in the input matrix, very similar to how sum of weights are calculated in MLP.

- The hidden input gets multiplied by hidden- output weights and output is calculated.

- Error between output and target is calculated and backpropagation is used to adjust the weight initialized before. 

- The weight  between the hidden layer and the output layer is taken as the __word vector representation of the word__ -> hence, word to vec!

#### Implementing word2vec

In [3]:
import gensim
import numpy as np

In [4]:
# Reading in the data

import json

with open('JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [5]:
type(data)

list

In [6]:
len(data)

216930

In [7]:
# Let's look at the first element in our list

data[0]

{'category': 'HISTORY',
 'air_date': '2004-12-31',
 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 'value': '$200',
 'answer': 'Copernicus',
 'round': 'Jeopardy!',
 'show_number': '4680'}

In [8]:
# How many words do we have?

len(data[0]['question'])

98

In [9]:
# Let's try that again!



data[0]['question'].split(' ')

["'For",
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life,',
 'Galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 "man's",
 "theory'"]

In [10]:
len(data[0]['question'].split(' '))

18

In [11]:
length = 0
for clue in data:
    length += len(clue['question'].split(' '))
length

3169994

## Using Word2Vec

In [17]:
# Word2Vec requires that our text have the form of a list
# of 'sentences', where each sentence is itself a list of
# words. How can we put our _Jeopardy!_ clues in that shape?

import string
text = []

for clue in data:
    sentence = clue['question'].translate(str.maketrans('', '',
                                                        string.punctuation)).split(' ')
    
    new_sent = []
    for word in sentence:
        new_sent.append(word.lower())
    
    text.append(new_sent)

In [18]:
# Let's check the new structure of our first clue

text[0]

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']

King + Woman - Man = Queen

Brother + Woman - Man = Sister

In [19]:
# Constructing the model is simply a matter of
# instantiating a Word2Vec object.

model = gensim.models.Word2Vec(text, sg=1)

In [15]:
# To train, call 'train()'!

model.train(text, total_examples=model.corpus_count, epochs=model.epochs)

(11335282, 15849970)

In [16]:
model.corpus_total_words

3169994

## model.wv

In [20]:
# The '.wv' attribute stores the word vectors

model.wv

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x1a42cb02e8>

In [21]:
model.wv['child']

array([-0.02013678, -0.768867  , -0.3995168 ,  0.20007966,  0.07782069,
        0.3070786 , -0.02616971,  0.22320455, -0.19875549, -0.48531622,
       -0.3757766 , -0.05396411, -0.04683578, -0.11336355, -0.22662023,
        0.01906323,  0.021305  ,  0.05067752,  0.12077358,  0.11734386,
       -0.516456  , -0.05390289, -0.00173848,  0.3537908 ,  0.1281792 ,
        0.64577883,  0.23364232,  0.40055946, -0.22854975,  0.20258924,
       -0.05578034, -0.10420907, -0.42711267,  0.28573084, -0.17423692,
       -0.41932258,  0.06479436,  0.04241097, -0.19720377, -0.02992231,
       -0.20466733,  0.30786785,  0.17101948,  0.17466737,  0.4910097 ,
       -0.4866816 , -0.12191375, -0.25491777, -0.25815502, -0.18216774,
       -0.11897507,  0.3471193 , -0.71689504, -0.03110847, -0.25038064,
       -0.3100907 ,  0.4050753 ,  0.24704766,  0.16912584,  0.32673478,
       -0.3095468 ,  0.41637126, -0.05078639, -0.28241867, -0.4771867 ,
       -0.11264411, -0.09655994, -0.00761371, -0.01329694, -0.37

### model.wv methods
#### 'most_similar()' and 'similarity()'

In [22]:
model.wv.most_similar('furniture')

[('drip', 0.847236156463623),
 ('pottery', 0.8100799322128296),
 ('ceramic', 0.7932129502296448),
 ('decorative', 0.7931036353111267),
 ('canvas', 0.7903770208358765),
 ('fastener', 0.7888020277023315),
 ('linen', 0.7869349718093872),
 ('trapeze', 0.7865070104598999),
 ('wrap', 0.78385329246521),
 ('artwork', 0.7797967195510864)]

In [23]:
model.wv.similarity('furniture', 'jewelry')

0.77938473

In [24]:
model.wv.most_similar(positive=['cat', 'animal', 'pet', 'mammal'])

[('parrot', 0.873432457447052),
 ('rodent', 0.8573928475379944),
 ('carnivore', 0.8493756651878357),
 ('wading', 0.8472229838371277),
 ('reptile', 0.843704342842102),
 ('marsupial', 0.8394194841384888),
 ('lizard', 0.838648796081543),
 ('shorttailed', 0.8353908061981201),
 ('predatory', 0.8343661427497864),
 ('arachnid', 0.8338077664375305)]

In [25]:
model.wv.most_similar(positive=['cat', 'animal'], negative='pet')

[('mammal', 0.35181742906570435),
 ('species', 0.34889715909957886),
 ('lizard', 0.3367321491241455),
 ('creature', 0.3365228772163391),
 ('insect', 0.3316500186920166),
 ('rodent', 0.3236585259437561),
 ('animals', 0.31926828622817993),
 ('marsupial', 0.31866103410720825),
 ('extinct', 0.30677351355552673),
 ('birds', 0.304746150970459)]

In [191]:
model.wv.most_similar(positive=['king', 'woman'], negative='man', topn=3)

[('empress', 0.49203425645828247),
 ('throne', 0.4811129570007324),
 ('queen', 0.480634868144989)]

In [93]:
model.wv.most_similar(positive='usa')

[('pageant', 0.5714678764343262),
 ('fargo', 0.5668860673904419),
 ('minneapolis', 0.5316091775894165),
 ('90210', 0.5216965079307556),
 ('summer', 0.5198173522949219),
 ('bronx', 0.5194722414016724),
 ('edina', 0.5112684965133667),
 ('tonight', 0.5014094114303589),
 ('yesterday', 0.4972144365310669),
 ('bway', 0.4931708872318268)]

In [94]:
model.wv.most_similar('canada')

[('hawaii', 0.7062503099441528),
 ('switzerland', 0.7037385702133179),
 ('territory', 0.7032467126846313),
 ('alaska', 0.6920301914215088),
 ('peru', 0.6817297339439392),
 ('colombia', 0.6571628451347351),
 ('pakistan', 0.6542963981628418),
 ('morocco', 0.644716739654541),
 ('japan', 0.6415838003158569),
 ('statehood', 0.6406919360160828)]

In [95]:
model.wv.most_similar('shakespeare')

[('hemingway', 0.6982564330101013),
 ('tennyson', 0.6815193295478821),
 ('hamlet', 0.6687659621238708),
 ('dickens', 0.6686815023422241),
 ('poe', 0.6508814692497253),
 ('shakespeares', 0.6488662958145142),
 ('macbeth', 0.6414898633956909),
 ('shelley', 0.6403170824050903),
 ('tolstoy', 0.6312527060508728),
 ('eliot', 0.6235682964324951)]

In [102]:
model.wv.most_similar('greg')

[('kinnear', 0.851963460445404),
 ('hatcher', 0.8268362283706665),
 ('reese', 0.8254306316375732),
 ('2004br', 0.8243862986564636),
 ('det', 0.8158860802650452),
 ('1989br', 0.8155031204223633),
 ('kiefer', 0.8080366253852844),
 ('randy', 0.8043558597564697),
 ('jake', 0.8002606630325317),
 ('bobby', 0.7988304495811462)]

In [109]:
model.wv.most_similar('jefferson')

[('quincy', 0.7526087760925293),
 ('madison', 0.7080703973770142),
 ('sen', 0.6974426507949829),
 ('dewey', 0.6706094741821289),
 ('hw', 0.6679573059082031),
 ('booker', 0.6666173934936523),
 ('josiah', 0.6663467884063721),
 ('marshall', 0.6623543500900269),
 ('rep', 0.6621801257133484),
 ('hoover', 0.6592093706130981)]

In [110]:
model.wv.most_similar('washington')

[('lincoln', 0.6199996471405029),
 ('memorial', 0.6156197786331177),
 ('virginia', 0.5984926819801331),
 ('illinois', 0.5932536125183105),
 ('arlington', 0.5758087635040283),
 ('nebraska', 0.5739896297454834),
 ('missouri', 0.5725917816162109),
 ('texas', 0.5661277770996094),
 ('kansas', 0.565746545791626),
 ('montana', 0.5633023381233215)]

In [99]:
model.wv.most_similar(positive=['president', 'germany'], negative='usa')

[('february', 0.39004337787628174),
 ('remains', 0.38808560371398926),
 ('exile', 0.3841812312602997),
 ('dictator', 0.38287919759750366),
 ('1929', 0.379180908203125),
 ('1918', 0.37386244535446167),
 ('voyage', 0.37059006094932556),
 ('decade', 0.3700261414051056),
 ('russia', 0.3685533106327057),
 ('completed', 0.3664979338645935)]

In [100]:
model.wv.most_similar(positive=['president', 'france'], negative='usa')

[('exile', 0.3980652391910553),
 ('remains', 0.38099491596221924),
 ('voyage', 0.37491631507873535),
 ('shah', 0.36604753136634827),
 ('dictator', 0.35999220609664917),
 ('philippines', 0.3587689697742462),
 ('february', 0.3575618863105774),
 ('date', 0.3540005087852478),
 ('conquest', 0.35296034812927246),
 ('1793', 0.3519843816757202)]

#### 'doesnt_match()'

In [90]:
model.wv.doesnt_match(['breakfast', 'lunch', 'frog', 'food'])

'frog'

In [194]:
model.wv.doesnt_match(['lunch', 'this'])

'lunch'

In [195]:
model.wv.doesnt_match(['tree', 'flower', 'bush', 'plant', 'toothbrush'])

'bush'

In [196]:
model.wv.doesnt_match(['tree', 'flower', 'plant', 'toothbrush'])

'toothbrush'

#### 'closer_than()'

In [91]:
# Which words are closer to 'king' than 'queen' is?

model.wv.closer_than('king', 'queen')

['prince', 'emperor', 'ruler']

#### 'distance()'

In [None]:
# For this it will make more sense to
# normalize our vectors.

model.init_sims(replace=True)

In [199]:
model.wv.distance('king', 'king')

0.0

In [206]:
model.wv.distance('joy', 'happiness')

0.4097934365272522

#### 'evaluate_word_analogies()'

Check out [this text file](https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt)!

In [249]:
relatives = model.wv.evaluate_word_analogies('https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt')[1][4]

In [250]:
len(relatives['correct'])

185

In [251]:
len(relatives['incorrect'])

235

In [253]:
relatives['correct'][:5]

[('BOY', 'GIRL', 'BROTHER', 'SISTER'),
 ('BOY', 'GIRL', 'BROTHERS', 'SISTERS'),
 ('BOY', 'GIRL', 'DAD', 'MOM'),
 ('BOY', 'GIRL', 'FATHER', 'MOTHER'),
 ('BOY', 'GIRL', 'HE', 'SHE')]

In [254]:
relatives['incorrect'][:5]

[('BOY', 'GIRL', 'GRANDFATHER', 'GRANDMOTHER'),
 ('BOY', 'GIRL', 'GRANDPA', 'GRANDMA'),
 ('BOY', 'GIRL', 'GRANDSON', 'GRANDDAUGHTER'),
 ('BOY', 'GIRL', 'GROOM', 'BRIDE'),
 ('BOY', 'GIRL', 'HUSBAND', 'WIFE')]