![@mikegchambers](../../images/header.png)

# Word2vec

In this notebook, we explore Word2vec using Gensim, the Natural Language Toolkit (NLTK) and scikit-learn PCA for visualization.

![Words](words.png)

# ! Install Libraries !

We need to ensure that these libraries are installed on the server.  The NLTK library is probably already installed, but the Gensim library is not included by default with the SageMaker Notebook server we are using.

Once these libraries are installed, you can comment these lines out again, as they won't need to be run again on this server.

In [None]:
# ! pip install nltk==3.6.7
# ! pip install gensim==3.8.3

(Note: In the original version of this notebook, these libraries were not pinned to specific version numbers. As newer versions of the libraries were released, the following code broke, so I have now pinned the versions in the code above.  Kevin Schwarz was kind enough to find an alternate solution and submitted a GitHub pull request: https://github.com/learn-mikegchambers-com/aws-mls-c01/pull/2 . See this as an alternative solution for newer libraries.)

# Import Libraries

In [None]:
import nltk
import gensim 

import numpy as np
import random

from sklearn.decomposition import PCA

%matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# The Data

In this build session we're going to use the AWS documentation corpus that we introduced in the LDA build lesson.
Here we load in the corpus, and randomize the order of the documents/lines.

In [None]:
# Read the corpus, one document per line; the with-block closes the file.
with open("corpus.txt", "r") as text_file:
    corpus = text_file.readlines()
random.shuffle(corpus)

## Tokenize Words
Now we process the corpus of documents.  We tokenize the words and convert them all to lower-case using NLTK.  You could modify the code here to add more pre-processing, such as number removal, which I have left out because we expect words like 'ec2' and 's3'.  If NLTK reports that the `punkt` resource is missing, run `nltk.download('punkt')` once first.

https://www.nltk.org/api/nltk.tokenize.html?highlight=word_tokenize#nltk.tokenize.word_tokenize

In [None]:
# Tokenize each document and lower-case every token.
data = [
    [word.lower() for word in nltk.tokenize.word_tokenize(doc)]
    for doc in corpus
]
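
If you did want number removal, one option is to drop only the tokens made up entirely of digits, so that mixed tokens such as 'ec2' and 's3' survive.  The following is a minimal sketch of that idea.  The helper name `drop_numeric_tokens` and the sample data are hypothetical and not part of the original notebook.

```python
def drop_numeric_tokens(tokens):
    """Remove tokens that consist only of digits (hypothetical helper)."""
    return [tok for tok in tokens if not tok.isdigit()]

# Made-up sample in the same shape as `data`: a list of token lists.
sample_data = [
    ['launch', '2', 'ec2', 'instances'],
    ['s3', 'stores', '100', 'objects'],
]

filtered = [drop_numeric_tokens(doc) for doc in sample_data]
print(filtered)
```

Applied to the real corpus, the same list comprehension would run over `data` instead of `sample_data`.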

# The Model

For our model we are using a very popular, easy-to-use Python library called Gensim.

https://radimrehurek.com/gensim/

Here is an extract from the Gensim `Word2Vec` documentation:

- size (int, optional) – Dimensionality of the word vectors.
- window (int, optional) – Maximum distance between the current and predicted word within a sentence.
- min_count (int, optional) – Ignores all words with total frequency lower than this.

- sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.

(Source: https://radimrehurek.com/gensim/models/word2vec.html)
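
To make the `window` parameter more concrete, here is a simplified illustration of how a window of size 1 turns a sentence into (current word, nearby word) pairs.  This is only a sketch with a hypothetical helper, `skipgram_pairs`, and a made-up sentence; Gensim's actual training loop differs in its details.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ['amazon', 's3', 'stores', 'objects']
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

A larger `window` produces more pairs per word, so each word is related to a wider neighbourhood of the sentence.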

### SkipGram
The following line trains a Word2vec model using the skip-gram method, which is popular for large datasets:

In [None]:
model = gensim.models.Word2Vec(data, size=100, window=5, min_count=5, sg=1) 

### CBOW (Continuous Bag of Words)
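The following line trains a Word2vec model using the CBOW method, which is Gensim's default when `sg` is not set.  It is commented out; uncomment it to try CBOW instead of skip-gram: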

In [None]:
# model = gensim.models.Word2Vec(data, size=100, window=5, min_count=1) 

# Testing the Model

We can now use the model to compare the similarity between words that we know will be in the vocabulary.

In [None]:
print(model.wv.similarity('sagemaker', 'algorithm'))
print(model.wv.similarity('ec2', 'ebs') )
print(model.wv.similarity('ec2', 'algorithm') )
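
The score returned by `similarity` is the cosine similarity between the two word vectors: close to 1 when the vectors point in the same direction, and close to 0 when they are unrelated.  Here is a small pure-Python sketch of that calculation.  The three vectors are made up for illustration and do not come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```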

And we can view the vector for a given word.

In [None]:
word_vectors = model.wv
word_vectors.word_vec("amazon") 

And we can list out words that the model has determined are similar to a given word.

In [None]:
print(word_vectors.most_similar('sagemaker'))

# Word Cloud

Now, let's hack some code together to create a 3D graph of words we're interested in.

First we define a list of terms.  I have grouped them together in the code, but it's just one flat list.

In [None]:
vocab = [
    'sagemaker', 'algorithm', 'forecast', 'rekognition', 'textract',
    'ebs', 'ec2', 'elb',
    's3', 'efs',
    'lambda', 'batch',
    'iam', 'policy', 'allow', 'deny', 'access', 'permission',
    'python', 'java',
    'png', 'csv'
]

Now we collect the vectors for all the words in our list.

In [None]:
vectors = []
for v in vocab:
    vectors.append(word_vectors.word_vec(v))
vectors = np.array(vectors)
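
Because the skip-gram model above was trained with `min_count=5`, any word that appears fewer than five times in the corpus has no vector, and looking it up raises a `KeyError`.  If you edit the list of terms, it can help to check membership first.  The sketch below uses a plain dict (`toy_vectors`, with made-up values) as a stand-in for `word_vectors`; Gensim's `KeyedVectors` supports the same `in` test.

```python
# Stand-in for model.wv: a plain dict mapping word -> vector (made-up values).
toy_vectors = {
    'ec2': [0.1, 0.2],
    's3': [0.3, 0.1],
}

wanted = ['ec2', 's3', 'notaword']

# Keep the words that have a vector, and note the ones that do not.
present = [w for w in wanted if w in toy_vectors]
missing = [w for w in wanted if w not in toy_vectors]

print(present)
print(missing)
```

If you filter the list this way, use the filtered list for the plot labels as well, so that labels and points stay aligned.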

Each word has 100 values in the vector (unless you edited the code).  We can't visualize that, so let's use PCA to reduce the dimensionality down to 3D.

In [None]:
pca = PCA(n_components=3)
pca_vectors = pca.fit_transform(vectors)
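
To see roughly what PCA is doing here, the following sketch performs the same kind of projection in plain NumPy: center the data, take a singular value decomposition, and keep the first three directions.  Random numbers stand in for the word vectors, and the variable names are my own.  scikit-learn's results may differ in the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
toy = rng.normal(size=(22, 100))   # 22 made-up "word vectors" of length 100

# PCA works on mean-centered data.
centered = toy - toy.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

projected = centered @ Vt[:3].T          # coordinates on the top 3 components
explained = (S ** 2) / np.sum(S ** 2)    # fraction of variance per component

print(projected.shape)
print(explained[:3].sum())
```

With scikit-learn, the same fraction is available after fitting as `pca.explained_variance_ratio_`, which gives a sense of how much of the original 100-dimensional structure the 3D plot retains.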

Now let's create a 3D graph of the vectors for our list of words.  This time the matplotlib chart is interactive as we included the line `%matplotlib notebook` at the start of this notebook.  Use your mouse to rotate the graph and explore the values.

In [None]:
fig = plt.figure(figsize=(10,9))
ax = fig.add_subplot(111, projection='3d')
for i in range(len(pca_vectors)):
    w = pca_vectors[i]
    ax.scatter(w[0],w[1],w[2])
    ax.text(w[0],w[1],w[2], vocab[i], fontsize=10)

Want more room in the notebook?  Here's a cool hack:

In [None]:
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:100% !important; }</style>"))