# Deep NLP - Word Embeddings

Think back to NLP as we've understood it so far.

If we've had some luck with NLP modeling, likely with a NaiveBayes algorithm, we were able to illustrate some correlations between words and some other feature of interest.

But to whatever extent that our models were able to make connections and pick up on correlations, they did this *without any understanding of the **meaning** of the words in question*.

Let's think for a minute about words and objective meanings!

We can make sense of meaning for computational purposes by thinking about meaning in terms of similarity, i.e. thinking about meaning *holistically*.

Q. Is there any precedent for this way of thinking about meaning? <br/>
A. [Yes](https://plato.stanford.edu/entries/meaning-holism/#ArgForMeaHol)

So what will this look like for us?

*Remember cosine similarity?*

$\rightarrow$We'll have much the same idea here: Associate each word with values along particular dimensions in a multi-dimensional space. If we had a dimension for *softness*, for example, then pillows and marshmallows would score higher on it than rocks and bricks.

In [2]:
!pip install --upgrade gensim


Collecting gensim
[?25l  Downloading https://files.pythonhosted.org/packages/c8/3a/32a1edf4f335eba0873021a7ddb3230f05dedd2b5450960118b402ca0771/gensim-3.8.0-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (24.7MB)
[K    100% |████████████████████████████████| 24.7MB 1.4MB/s eta 0:00:01 0% |▎                               | 174kB 1.5MB/s eta 0:00:17    29% |█████████▌                      | 7.4MB 3.3MB/s eta 0:00:06
[?25hCollecting smart-open>=1.7.0 (from gensim)
[?25l  Downloading https://files.pythonhosted.org/packages/37/c0/25d19badc495428dec6a4bf7782de617ee0246a9211af75b302a2681dea7/smart_open-1.8.4.tar.gz (63kB)
[K    100% |████████████████████████████████| 71kB 4.7MB/s ta 0:00:011
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... [?25ldone
[?25h  Stored in directory: /Users/flatironschool/Library/Caches/pip/wheels/5f/ea/fb/5b1a947b369724063b2617011f1540c44eb00e28c3d2

In [3]:
import gensim
import numpy as np

What is Gensim? See [here](https://en.wikipedia.org/wiki/Gensim) and [here](https://radimrehurek.com/gensim/). But, basically, gensim is a package with lots of topic-modeling and NLP tools, inlcuding Word2Vec.

Find the data [here!](https://drive.google.com/file/d/0BwT5wj_P7BKXb2hfM3d2RHU1ckE/view) (Just click 'Download')

In [7]:
# Reading in the data

import json

with open('JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [8]:
# Let's check the datatype of our data
type(data)


list

In [9]:
# And the length
len(data)


216930

In [10]:
# Let's look at the first element in our list
data[0]


{'category': 'HISTORY',
 'air_date': '2004-12-31',
 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 'value': '$200',
 'answer': 'Copernicus',
 'round': 'Jeopardy!',
 'show_number': '4680'}

In [13]:
# How many words do we have in our first question?

len(data[0]['question'].split(sep=' '))

18

In [17]:
# Let's count the total number of
# clue words we have.

sum([len(i['question'].split(sep=' ')) for i in data])

3169994

## Using Word2Vec

In [19]:
# Word2Vec requires that our text have the form of a list
# of 'sentences', where each sentence is itself a list of
# words. How can we put our _Jeopardy!_ clues in that shape?
import string

text= []
for clue in data:
    sentence=clue['question'].translate(str.maketrans('','',
                                                     string.punctuation)).split(' ')   #replace the punctuation with nothing 
    new_sent = []
    for word in sentence:
        new_sent.append(word.lower())
    text.append(new_sent)

In [20]:
# Let's check the new structure of our first clue
text[0]

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']

In [21]:
# Constructing the model is simply a matter of
# instantiating a Word2Vec object. Word2Vec trains a NN but we will use the words that are built not the NN itself.

model = gensim.models.Word2Vec(text, sg=1)    #sg:skipgram, two models inside word2vec. one tries to predict single word most associated w context of words
#skipgram starts w a single word and predicts words in that context . this is based on measuring nearness (words that appear near eachother in a sentence)
#features is basically like topics

#CBOW is the opposite of sg. starts w context and predicts word


Continuous Bag of Words vs. Skipgram

<a href="https://www.researchgate.net/figure/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-models_fig1_281812760"><img src="https://www.researchgate.net/profile/Wang_Ling/publication/281812760/figure/fig1/AS:613966665486361@1523392468791/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-models.png" alt="Illustration of the Skip-gram and Continuous Bag-of-Word (CBOW) models."/></a>

[More on Skipgram](https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b)

In [22]:
model.epochs

5

In [23]:
# To train, call 'train()'!

model.train(text, total_examples=model.corpus_count,
           epochs=model.epochs)

(11337425, 15849970)

In [24]:
# Checking word  count

model.corpus_total_words

3169994

## model.wv

In [25]:
# The '.wv' attribute stores the word vectors

model.wv['apple']   #this is apple represented as linear combination of themes (features or topics)

array([-0.2179959 , -0.13832371,  0.82045907, -0.09228509, -0.31157538,
        0.0558117 ,  0.31941363,  0.13714567,  0.18510565, -0.18914342,
        0.20846863,  0.19939539,  0.17979723, -0.08736508,  0.7754108 ,
       -0.05262485,  0.49912634, -0.6519212 ,  0.02689071,  0.11322739,
        0.02308858,  0.20876797, -0.6626779 , -0.21118072, -0.5116142 ,
       -0.15118527, -0.2473922 ,  0.2358805 , -0.12133547,  0.17606536,
       -0.36533666, -0.16933222,  0.27797   ,  0.15719958, -0.2768855 ,
        0.32499468, -0.58233935,  0.01879555, -0.8135983 ,  0.02875136,
       -0.0355689 , -0.41054052, -0.6825258 ,  0.2764217 ,  0.4702477 ,
       -0.4349774 , -0.27470565,  0.10966807, -0.12253669,  0.46156558,
       -0.26079467, -0.19115892, -0.10137106, -0.28267348,  0.00846066,
        0.28400433, -0.16284879,  0.21241169,  0.4398918 , -0.33026662,
       -0.0697035 , -0.41292527,  0.1949021 ,  0.00706244,  0.28849742,
        0.3122224 , -0.035186  ,  0.56784433, -0.14559768,  0.38

In [28]:
# The vectors are keyed by the words
model.wv['apple'].max()    #cant get the words in that feature , it is built out of words though


0.82045907

### model.wv methods
#### 'most_similar()' and 'similarity()'

In [29]:
model.wv.most_similar('furniture')   #the numbers are output of cosine similarity

[('pottery', 0.7091644406318665),
 ('neoclassical', 0.7081478238105774),
 ('decorative', 0.7044602036476135),
 ('fastener', 0.6931874752044678),
 ('chippendale', 0.6854918599128723),
 ('ceramic', 0.680397629737854),
 ('fasteners', 0.6780710816383362),
 ('accessory', 0.6768207550048828),
 ('cabriole', 0.6768094897270203),
 ('nouveau', 0.6759829521179199)]

In [30]:
model.wv.similarity('furniture', 'jewelry')

0.6321034

In [31]:
# What's most similar to 'cat'?

model.wv.most_similar(positive='cat')   #the numbers are output of cosine similarity, positive is most similar

[('dog', 0.7162733674049377),
 ('cheetah', 0.7162524461746216),
 ('shorthaired', 0.7084864377975464),
 ('carnivore', 0.7036651372909546),
 ('parrot', 0.6998318433761597),
 ('hound', 0.6978839635848999),
 ('pup', 0.6911518573760986),
 ('scavenger', 0.6882691383361816),
 ('feline', 0.6873884201049805),
 ('pachyderm', 0.6840150356292725)]

In [33]:
model.wv.most_similar(negative='cat')   #the dissimilar, absolute numbers matter. not the sign

[('themselves', -0.06060933321714401),
 ('peoples', -0.06555520743131638),
 ('tonight', -0.0809701681137085),
 ('19', -0.0884326696395874),
 ('jews', -0.0942421406507492),
 ('poor', -0.09581827372312546),
 ('fans', -0.10032132267951965),
 ('hearts', -0.10156692564487457),
 ('guest', -0.10171807557344437),
 ('gods', -0.1039452999830246)]

In [34]:
# Let's try the familiar example: King - Man + Woman = Queen

model.wv.most_similar(positive=['king', 'woman'],
                     negative='man')

[('throne', 0.3153429627418518),
 ('empress', 0.27299660444259644),
 ('nun', 0.24938170611858368),
 ('monarch', 0.24484051764011383),
 ('duchess', 0.24148623645305634),
 ('reign', 0.23693794012069702),
 ('heir', 0.2345750480890274),
 ('prince', 0.22962161898612976),
 ('isabella', 0.2274600863456726),
 ('athena', 0.22651103138923645)]

In [35]:
# Shakespeare
model.wv.most_similar('shakespeare')


[('sophocles', 0.7530242800712585),
 ('shakespeares', 0.7390450835227966),
 ('euripides', 0.7144793272018433),
 ('falstaff', 0.7142761945724487),
 ('moliere', 0.6936957836151123),
 ('shaws', 0.6876543164253235),
 ('shakespearean', 0.6849218010902405),
 ('ibsen', 0.684044361114502),
 ('romeo', 0.6758466958999634),
 ('rur', 0.6730616688728333)]

In [36]:
# Greg

model.wv.most_similar('greg')


[('kinnear', 0.8410297632217407),
 ('steely', 0.7922472357749939),
 ('dwayne', 0.78643798828125),
 ('abduljabbar', 0.7844822406768799),
 ('hamlisch', 0.7826417684555054),
 ('bebe', 0.7811517119407654),
 ('gehrig', 0.7807881832122803),
 ('waterstona', 0.7800036072731018),
 ('connelly', 0.7775446176528931),
 ('walston', 0.7745929956436157)]

In [38]:
# Washington

model.wv.most_similar('washington', topn=10)


[('dc', 0.8387752771377563),
 ('arlington', 0.6481685042381287),
 ('dca', 0.6466912031173706),
 ('dcs', 0.6350167989730835),
 ('newseum', 0.6256836652755737),
 ('p3', 0.6181975603103638),
 ('delaware', 0.6124393939971924),
 ('virginia', 0.6111010909080505),
 ('washingtons', 0.6036571264266968),
 ('missouri', 0.6002857089042664)]

In [39]:
model.wv.similarity('seattle', 'washington')

0.3626155

#### 'doesnt_match()'

In [42]:
model.wv.doesnt_match(['breakfast','lunch','food', 'frog'])

'frog'

In [44]:
model.wv.doesnt_match(['good','bad','indifferent'])

'bad'

#### 'closer_than()'

In [45]:
# Which words are closer to 'king' than 'queen' is?

model.wv.closer_than('king','queen')

['prince', 'kings', 'iii', 'throne', 'ruler', 'iv', 'ix', 'haakon', 'olaf']

#### 'distance()'

In [46]:
# For this it will make more sense to
# normalize our vectors.

model.init_sims(replace=True)

In [47]:
model.wv.distance('king', 'king')

0.0

In [48]:
model.wv.distance('joy', 'happiness')

0.43866634368896484

#### 'evaluate_word_analogies()'

Check out [this text file](https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt)!

In [49]:
relatives = model.wv.evaluate_word_analogies('https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt')[1][4]

In [50]:
len(relatives['correct'])

148

In [51]:
len(relatives['incorrect'])

272

In [52]:
relatives['correct'][:5]

[('BOY', 'GIRL', 'BROTHER', 'SISTER'),
 ('BOY', 'GIRL', 'DAD', 'MOM'),
 ('BOY', 'GIRL', 'FATHER', 'MOTHER'),
 ('BOY', 'GIRL', 'HE', 'SHE'),
 ('BOY', 'GIRL', 'HIS', 'HER')]

In [53]:
relatives['incorrect'][:5]

[('BOY', 'GIRL', 'BROTHERS', 'SISTERS'),
 ('BOY', 'GIRL', 'GRANDFATHER', 'GRANDMOTHER'),
 ('BOY', 'GIRL', 'GRANDPA', 'GRANDMA'),
 ('BOY', 'GIRL', 'GRANDSON', 'GRANDDAUGHTER'),
 ('BOY', 'GIRL', 'GROOM', 'BRIDE')]