# Generating Word Embeddings - Lab

## Introduction

In this lab, we'll learn how to generate our own word embeddings by training our own Word2Vec model, and also by building embedding layers right into our Deep Neural Networks!

## Objectives

You will be able to:

* Demonstrate a basic understanding of the architecture of the Word2Vec model
* Demonstrate an understanding of the various tunable parameters of word2vec such as vector size and window size

## Getting Started

In this lab, we'll start by creating our own word embeddings by making use of the Word2Vec Model. Then, we'll move onto building Neural Networks that make use of **_Embedding Layers_** to accomplish the same end-goal, but directly in our model. 

The easiest way to make use of Word2Vec is to import it from the [Gensim Library](https://radimrehurek.com/gensim/). This model contains a full implementation of Word2Vec, which we can use to begin training immediately. For this lab, we'll be working with the [News Category Dataset from Kaggle](https://www.kaggle.com/rmisra/news-category-dataset/version/2#_=_).  This dataset contains headlines and article descriptions from the news, as well as categories for which type of article they belong to.  In this lab, we'll learn how to train a Word2Vec model on the text data to generate word embeddings for them. In the next lab, we'll then use the vectors created by our Word2Vec model to effectively train a classifier to predict the category of news given the headline and description of each article. In this lab, we won't do any classification, although we will learn how to train a Word2Vec model and explore the relationships between different word vectors in our embedding!

Run the cell below to import everything we'll need for this lab. 

In [2]:
!pip install gensim

Collecting gensim
[?25l  Downloading https://files.pythonhosted.org/packages/ad/63/5a4b694ac7d0dd0a7d061ba6af0dbd057379da21c7ea7efd44ae3299f87d/gensim-3.7.1-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (24.6MB)
[K    100% |████████████████████████████████| 24.6MB 940kB/s ta 0:00:01    93% |██████████████████████████████  | 23.1MB 1.3MB/s eta 0:00:02
Collecting smart-open>=1.7.0 (from gensim)
[?25l  Downloading https://files.pythonhosted.org/packages/ff/c8/de7dcf34d4b5f2ae94fe1055e0d6418fb97a63c9dc3428edd264704983a2/smart_open-1.8.0.tar.gz (40kB)
[K    100% |████████████████████████████████| 40kB 1.4MB/s ta 0:00:01
Collecting bz2file (from smart-open>=1.7.0->gensim)
  Downloading https://files.pythonhosted.org/packages/61/39/122222b5e85cd41c391b68a99ee296584b2a2d1d233e7ee32b4532384f2d/bz2file-0.98.tar.gz
Collecting boto3 (from smart-open>=1.7.0->gensim)
[?25l  Downloading https://files.pythonhosted.org/packages/99/7b/d

In [5]:
import zipfile
import pandas as pd
import numpy as np
np.random.seed(0)
from gensim.models import Word2Vec
from nltk import word_tokenize

Now, we'll import the data. You'll find the data stored in the file `'News_Category_Dataset_v2.json'`.  This file is compressed, so that it can be more easily stored in a github repo. **_Make sure to unzip the file before continuing!_**

In the cell below, use the `read_json` function from pandas to read the dataset into a DataFrame. Be sure to also include the parameter `lines=True` when reading in the dataset!

Once you've loaded in the data, inspect the head of the DataFrame to see what our data looks like. 

In [8]:
zip = zipfile.ZipFile('News_Category_Dataset_v2.zip')
zip.extractall()

In [12]:
df = pd.read_json('News_Category_Dataset_v2.json', lines=True)

## Preparing the Data

Since we're working with text data, we'll still need to do some basic preprocessing and tokenize our data. You'll notice from the sample of the data above that two different columns contain text data--`headline` and `short_description`. The more text data our Word2Vec model has, the better it will perform. Therefore, we'll want to combine the two columns before tokenizing each comment and training our Word2Vec model. 

In the cell below:

* Create a column called `combined_text` that consists of the data from `df.headline` plus a space character (`' '`) plus the data from `df.short_description`.
* Use the `combined_text` column's `map()` function and pass in `word_tokenize`. Store the result returned in `data`.

In [13]:
df['combined_text'] = df.headline + ' ' +  df.short_description
data = df['combined_text'].map(word_tokenize)

Let's inspect the first 5 items in `data` to see how everything looks. 

In [14]:
data[:5]

0    [There, Were, 2, Mass, Shootings, In, Texas, L...
1    [Will, Smith, Joins, Diplo, And, Nicky, Jam, F...
2    [Hugh, Grant, Marries, For, The, First, Time, ...
3    [Jim, Carrey, Blasts, 'Castrato, ', Adam, Schi...
4    [Julianna, Margulies, Uses, Donald, Trump, Poo...
Name: combined_text, dtype: object

You'll notice that although the words are tokenized, they are still in the same order they were in as headlines. This is important, because the words need to be in their original order for Word2Vec to establish the meaning of them. Recall from our previous lesson on how Word2Vec works that we can specify a  **_Window Size_** that tells the model how many words to take into consideration at one time. 

If our window size was 5, then the model would start by looking at the words "Will Smith joins Diplo and", and then slide the window by one, so that it's looking at "Smith joins Diplo and Nicky", and so on, until it had completely processed the text example at index 1 above. By doing this for every piece of text in the entire dataset, the Word2Vec model learns excellent vector representations for each word in an **_Embedding Space_**, where the relationships between vectors capture semantic meaning (recall the vector that captures gender in the previous "king - man + woman = queen" example we saw).

Now that we've prepared our data, let's train our model and explore a bit!

## Training the Model

We'll start by instantiating a Word2Vec Model from gensim below. 

In the cell below:

* Create a `Word2Vec` model and pass in the following arguments:
    * The dataset we'll be training on, `data`
    * The size of the word vectors to create, `size=100`
    * The window size, `window=5`
    * The minimum number of times a word needs to appear in order to be counted in  the model, `min_count=1`.
    * The number of threads to use during training, `workers=4`

In [15]:
model = Word2Vec(data, size=100, window=5, min_count=1, workers=4)

Now, that we've created our Word2Vec model, we still need to train it on our model. 

In the cell below:

* Call `model.train()` and pass in the following parameters:
    * The dataset we'll be training on, `data`
    * The `total_examples`  of sentences in the dataset, which we can find in `model.corpus_count`. 
    * The number of `epochs` we want to train for, which we'll set to `10`

In [16]:
model.train(data, total_examples=model.corpus_count, epochs=10)

(55562858, 67337030)

Great! We now have a fully trained model! The word vectors themselves are stored inside of a `Word2VecKeyedVectors` instance, which we'll find stored inside of `model.wv`. For simplicity's sake, let's go ahead and store this inside of the variable `wv` in order to save ourselves some keystrokes down the line. 

In [17]:
wv = model.wv

## Examining Our Word Vectors

Now that we have a trained Word2Vec model, let's go ahead and explore the relationships between some of the words in our corpus! 

One cool thing we can use Word2Vec for is to get the most similar words to a given word. We can do this passing in the word to `wv.most_similar()`. 

In the cell below, let's try getting the most similar word to `'Texas'`.

In [18]:
wv.most_similar('Texas')

[('Maryland', 0.8250618577003479),
 ('Arkansas', 0.8156073689460754),
 ('Oklahoma', 0.7990760803222656),
 ('Illinois', 0.7949262857437134),
 ('Oregon', 0.7940429449081421),
 ('Pennsylvania', 0.7934418320655823),
 ('Massachusetts', 0.7932147979736328),
 ('Louisiana', 0.7901225090026855),
 ('California', 0.787533164024353),
 ('Ohio', 0.7865491509437561)]

Interesting! All of the most similar words are also states. 

We can also get the least similar vectors to a given word by passing in the word to the `most_similar()` function's `negative` parameter. 

In the cell below, get the least similar words to `'Texas'`.

In [19]:
wv.most_similar(negative='Texas')

[('much-vaunted', 0.4382411539554596),
 ('Headstrong', 0.4316440224647522),
 ('Uglies', 0.4202665090560913),
 ('Parent/Grandparent', 0.4069547653198242),
 ('Warstler', 0.3762126564979553),
 ('Ex-hockey', 0.3660905361175537),
 ('exiling', 0.364873468875885),
 ('Lunatics', 0.36445826292037964),
 ('Rank-and-file', 0.36365893483161926),
 ('Trashbag', 0.36026397347450256)]

These seem like just noise. This is because of the way Word2Vec is computing the similarity between word vectors in the embedding space. Although the word vectors closest to a given word vector are almost certainly going to have similar meaning or connotation with our given word, the word vectors that the model considers 'least similar' are just the word vectors that are farthest away, or have the lowest cosine similarity. It's important to understand that while the closest vectors in the embedding space will almost certainly share some level of semantic meaning with a given word, there is no guarantee that this relationship will hold at large distances. 

We can also get the vector for a given word by passing in the word as if we were passing in a key to a dictionary. 

In the cell below, get the word vector for `'Texas'`.

In [20]:
wv['Texas']

array([ 1.6168797 ,  1.1505369 , -1.8672785 , -1.4434919 ,  1.7127879 ,
        2.8044503 , -0.07990683, -1.4771386 ,  1.3732891 , -2.3244207 ,
       -0.92589015,  0.9747711 ,  1.3837317 ,  1.551782  ,  0.6771376 ,
       -0.59979194, -1.3349247 ,  0.88946813, -0.2548882 ,  2.4079425 ,
        3.108622  ,  2.1763377 , -0.56800586,  0.46032643, -0.2767879 ,
        1.3888208 , -0.98538697, -1.509525  , -0.68201077,  0.60408825,
        2.1673067 ,  1.4260831 , -0.5377123 , -0.8472265 ,  0.31442913,
        1.7815683 , -1.0171872 , -0.43475226,  0.45857567, -0.4957267 ,
       -0.31520665, -0.15205568,  0.9930887 ,  0.39535895, -1.3436544 ,
        0.87974733, -2.2526894 , -1.2446444 , -0.76615703,  1.0446639 ,
       -0.31616879, -2.4223952 , -1.5286347 , -0.9472497 , -0.5148168 ,
       -1.3581412 , -0.4465725 , -0.6727815 ,  0.45265302, -0.0340201 ,
       -0.0771371 ,  0.7302364 ,  3.5413947 , -0.8131564 , -2.0949214 ,
        0.67541766,  0.04654254, -1.2838364 , -1.2376904 ,  0.08

Let's get all of the word vectors from the object at once. We can find these inside of `wv.vectors`.  Do this now in the cell below.  

In [21]:
wv.vectors

array([[ 1.4066353 , -0.81938744,  0.977179  , ...,  0.14406243,
        -1.6500566 , -0.07787284],
       [ 0.04761624,  1.0016757 ,  1.3382974 , ...,  0.31546775,
        -0.74379945, -1.1263111 ],
       [-0.39176315,  0.16908583,  0.9593872 , ..., -0.9907391 ,
        -0.5598413 ,  0.7043345 ],
       ...,
       [ 0.0045312 ,  0.02685374, -0.07468879, ...,  0.15714285,
         0.02821173,  0.05593785],
       [ 0.06224699,  0.04601824, -0.00367321, ...,  0.00754866,
        -0.01500098, -0.07499362],
       [-0.01206802,  0.09020916, -0.09984657, ...,  0.07835849,
        -0.01906109,  0.03957682]], dtype=float32)

As a final exercise, let's try recreating the _'king' - 'man' + 'woman' = 'queen'_ example we've seen before. We can do this by using the `most_similar` function and putting the things we want added together inside of an array passed to the `positive` parameter, and the things we want subtracted as an array passed to the the `negative` parameter. 

Do this now in the cell below. 

As we can see from the output above, our model isn't perfect, but 'Queen' is still in the top 3, and with 'Princess' not too far behind. As we can see from the word in first place, 'reminiscent', our model is far from perfect. This is likely because we didn't give it too much training, or training data. However, for the small amount of training data it was given, the model still performs remarkably well! 

We'll see in the next lab that from a practical standpoint, one of the best things we can do for performance is to start by loading in the weights from an open-sourced model that has been trained for a very long time on a massive amount of data, such as the GloVe model from the Stanford NLP Group. There's not really any benefit from training the model ourselves, unless our text uses different, specialized vocabulary that isn't likely to be well represented inside an open-source model.

## Summary

In this lab, we learned how to train and use a Word2Vec model to created vectorized word embeddings!