## 1a. Import the libraries and sample data

In [1]:
from nltk.corpus import brown
import gensim
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk

# Download the required NLTK data packages (skipped if already up to date)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('brown')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/leticiachoo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/leticiachoo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     /Users/leticiachoo/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

## 1b. Display sample data

In [2]:
print('The sentences:\n')

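# Tokens that attach to the previous token (no space before them)
no_space_before = ('`', '.', ',', '\'', '""', ')', ']')
# Opening brackets attach to the next token (no space after them)
no_space_after = ('(', '[')

for i, sent in enumerate(brown.sents()):
    text = ''

    for w in sent:
        needs_space = text and w not in no_space_before and text[-1] not in no_space_after
        text += (' ' if needs_space else '') + w

    print(text, '\n')

    # Stop after the first 16 sentences
    if i == 15:
        break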
        
print('...')

The sentences:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place. 

The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted. 

The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr.. 

`` Only a relative handful of such reports was received '', the jury said, `` considering the widespread interest in the election, the number of voters and the size of this city ''. 

The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous ''. 

It recommended that Fulton legislators a

## 2a. Generate word embeddings

This step trains a Word2Vec model on the [Brown corpus](https://www.nltk.org/book/ch02.html), a sample corpus compiled at Brown University and distributed with NLTK.

Credits: [@alvations](https://www.kaggle.com/alvations/word2vec-embedding-using-gensim-and-nltk)

In [4]:
import gensim
from nltk.corpus import brown

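# Train a Word2Vec model on the tokenized Brown sentences using gensim's default settings
model = gensim.models.Word2Vec(brown.sents())
# Save the trained model so later cells can reload it
model.save('brown.embedding')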

## 2b. Display similarities for a single word

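Similarity between words is scored by the cosine of the angle between their vectors; cosine distance is 1 minus that score: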

<img src="https://miro.medium.com/max/442/1*UODvtQMybHE8c0eL3K5z5A.png" width="300" height="auto" />

(Source: [Prabhu, 2019](https://towardsdatascience.com/understanding-nlp-word-embeddings-text-vectorization-1a23744f7223))

<img src="https://miro.medium.com/max/697/0*XMW5mf81LSHodnTi.png" width="300" height="auto" />

(Source: [Dhruvil Karani, 2018](https://towardsdatascience.com/understanding-nlp-word-embeddings-text-vectorization-1a23744f7223))

For details on `Word2Vec` in gensim, see the [documentation](https://radimrehurek.com/gensim/models/word2vec.html).
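
To make the two columns in the tables below easier to interpret, here is a minimal sketch of both measures in plain Python. It assumes the usual convention that the distance equals 1 minus the similarity. The function names `cosine_similarity` and `cosine_distance` and the toy vectors are illustrative only; they are not part of gensim.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Toy 2-D vectors, made up for illustration (not taken from the model)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposing vectors
```

Vectors pointing in the same direction score close to 1, perpendicular vectors close to 0, and opposing vectors close to -1, regardless of their lengths. The functions accept any sequence of floats, so they should also work on the arrays returned by `model.wv[word]`.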

In [23]:
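# Reload the embeddings saved in step 2a
model = gensim.models.Word2Vec.load('brown.embedding')

word = 'university'
# Hand-picked words expected to be related to `word`
cherry_pick = ('school', 'college', 'technical', 'education', 'degree', 'knowledge', 'research')
# Abbreviate a word's vector to its first and last components for display
trunc_vector = lambda w: [model.wv[w][0], '..', model.wv[w][-1]]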

print(f'Word we are checking "{word}"\n')
print(f'Vector for {len(model.wv[word])} embeddings:\n\n', model.wv[word], '\n')

print(f'Picking words with likely similarity to "{word}":\n\n',
    pd.DataFrame(list(map(lambda w: (
        w, 
        model.wv.similarity(word, w), 
        model.wv.distances(word, [w])[0],
        trunc_vector(w)
    ), cherry_pick)),
    columns=['Word', 'Cos. Score', 'Cos. Distance', 'Vector'])
, '\n')

print(f'Top 20 similar words to "{word}":\n\n',
    pd.DataFrame([(
        wv[0], 
        wv[1],
        model.wv.distances(word, [wv[0]])[0],
        trunc_vector(wv[0])
    ) for wv in model.wv.most_similar(positive=[word], topn=20)],
    columns=['Word', 'Cos. Score', 'Cos. Distance', 'Vector'])
, '\n')

Word we are checking "university"

Vector for 100 embeddings:

 [ 0.11175527  0.25669736  0.20066865  0.11175066 -0.06460495 -0.33296835
  0.20352826  0.3468266  -0.30710742 -0.28286117  0.16939154 -0.22782493
  0.19033255  0.16709751  0.24337387 -0.16267173  0.26980072 -0.14518285
 -0.52309054 -0.51736414  0.2850579  -0.11510846  0.48376077  0.07662556
 -0.05919393 -0.13305968 -0.23038775  0.01760273 -0.21032424  0.23470333
  0.21679243 -0.06695802  0.28115085 -0.37063724 -0.17731045  0.05806558
 -0.1888917  -0.05340989 -0.34433    -0.05935063  0.01267142 -0.2682686
  0.17947511  0.1153377   0.20314875 -0.00816581 -0.01876937 -0.02480987
  0.09931239  0.31092817  0.01604638 -0.2854515  -0.25131583 -0.18981163
 -0.10706858 -0.22966878  0.17690061  0.05214861 -0.06181134 -0.07889867
  0.02951878  0.2008032  -0.03759096 -0.1692586  -0.18953413  0.4518703
  0.02252458  0.284502   -0.2633548   0.36973944  0.2051202   0.1467263
  0.21679464 -0.00282996  0.33435     0.08990233  0.10696647  0