## Word Embeddings in Action - GloVe

 - The [Stanford NLP Group](https://nlp.stanford.edu/) has released its own algorithm for training word embeddings, similar in purpose to Word2Vec. It is called [Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/), or GloVe for short.
 - GloVe is widely used as an alternative to Word2Vec, and which of the two gives better results depends on the dataset. In this notebook, you will learn how to use it on a text classification problem.
 
With that expectation set, let's start without much ado!

![](../images/glove.jpg)

### Table of Contents
 
1. About the dataset
2. Comparing GloVe vs Word2Vec
3. Utilizing Stanford NLP's pretrained GloVe
    - Installing
    - Saving embedding matrix to disk
    - Finding most similar words given a context word
    - Contextual relationship between words 
4. Challenge : Build a text classification model using GloVe

### 1. About the dataset

The dataset that you are going to use is a collection of news articles from BBC across 5 major categories, namely:
 
 - Business
 - Entertainment
 - Politics
 - Sport
 - Tech

There are a total of 2225 articles in the dataset, which is a mix of all of the above categories. Let's load the dataset using pandas and have a quick look at some of the articles. 

**Note:** You can get the dataset [here.](https://trainings.analyticsvidhya.com/asset-v1:AnalyticsVidhya+LP_DL_2019+2019_T1+type@asset+block@bbc_news_mixed.csv)

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
bbc_news = pd.read_csv('../datasets/bbc_news_mixed.csv')
bbc_news.head()

Unnamed: 0,text,label
0,Cairn shares slump on oil setback\n\nShares in...,business
1,Egypt to sell off state-owned bank\n\nThe Egyp...,business
2,Cairn shares up on new oil find\n\nShares in C...,business
3,Low-cost airlines hit Eurotunnel\n\nChannel Tu...,business
4,"Parmalat to return to stockmarket\n\nParmalat,...",business


In [None]:
# print first 2 articles
for art in bbc_news.text[:2]:
    print(art[:200])

Now that you have an idea of what your data looks like, let's see the count of each category in the dataset!

In [3]:
# category-wise count
bbc_news.label.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: label, dtype: int64

### 2. Comparing GloVe vs Word2Vec

The following are the major differences between the GloVe and Word2Vec models:

- Both produce dense word vectors, but they learn them differently. Word2Vec is a **predictive model**: it predicts the context given a word, or the word given its context, depending on whether you use the skip-gram or CBOW variant.

- GloVe learns by constructing a **co-occurrence matrix** (words x contexts) that counts how frequently each word appears in each context. Since this matrix is gigantic, it is factorized to obtain a lower-dimensional representation. There are many more details to GloVe, but that is the rough idea.
 
- In Word2Vec, skip-gram models capture co-occurrence **one window at a time**.

- GloVe instead captures the **overall statistics** of the corpus, that is, how often words co-occur across all of it.

In practice, the main difference is that GloVe embeddings work better on some data sets, while word2vec embeddings work better on others. They both do very well at capturing the semantics of analogy, and that takes us, it turns out, a very long way toward lexical semantics in general.
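
To make the co-occurrence idea from the comparison above more concrete, here is a minimal sketch that counts (word, context) pairs on a toy corpus. It is only an illustration: `cooccurrence_counts`, the two sentences and the window size are assumptions made for this example, and the real GloVe training additionally down-weights distant pairs and then fits vectors to the logarithm of the counts, none of which is shown here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # context positions: up to `window` tokens on each side of position i
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# toy corpus, invented for this example
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
counts = cooccurrence_counts(corpus, window=1)

# counts are accumulated over the whole corpus, not one window at a time
print(counts[("sat", "on")])
print(counts[("the", "cat")])
```

Each entry of `counts` corresponds to one cell of the (words x contexts) matrix; on a real corpus most cells would be zero, which is why a sparse structure such as a `Counter` is used in this sketch.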

### 3. Utilizing Stanford NLP's pretrained GloVe

It's time to start using GloVe in your problem statement, but before you can do that you need to download the pre-trained GloVe embeddings.

#### a. Installing

 - Download the GloVe model from [glove.6B.zip](http://nlp.stanford.edu/data/glove.6B.zip). This is the smallest model available. It was trained on Wikipedia 2014 and Gigaword 5 text containing 6 billion tokens, and it has a vocabulary of around 400,000 words. The file is **822 MB** in size, so it will take some time to download.
 - Extract the zip file with the following command:
 
 `unzip glove.6B.zip`
 - Once you have extracted the file, you will see that there are multiple text files
     1. **glove.6B.100d.txt** - Contains 100-dimensional vectors for each word of the vocabulary.
     2. **glove.6B.200d.txt** - Contains 200-dimensional vectors for each word of the vocabulary.
     3. **glove.6B.300d.txt** - Contains 300-dimensional vectors for each word of the vocabulary.
     4. **glove.6B.50d.txt**  - Contains 50-dimensional vectors for each word of the vocabulary.
     
 - You can choose the dimensionality of your vectors based on your requirements. In this case, you will work with the 300-dimensional one.
 - Let's quickly load the vectors in python!

In [5]:
# load glove into a data frame (first column = word, remaining 300 columns = vector)
df = pd.read_csv('../embeddings/glove.6B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
# show the first 5 rows of the data frame
df.head()

0,1,2,3,4,5,6,7,8,9,10,...,291,292,293,294,295,296,297,298,299,300
the,0.04656,0.21318,-0.007436,-0.45854,-0.035639,0.23643,-0.28836,0.21521,-0.13486,-1.6413,...,-0.013064,-0.29686,-0.079913,0.195,0.031549,0.28506,-0.087461,0.009061,-0.20989,0.053913
",",-0.25539,-0.25723,0.13169,-0.042688,0.21817,-0.022702,-0.17854,0.10756,0.058936,-1.3854,...,0.075968,-0.014359,-0.073794,0.22176,0.14652,0.56686,0.053307,-0.2329,-0.12226,0.35499
.,-0.12559,0.01363,0.10306,-0.10123,0.098128,0.13627,-0.10721,0.23697,0.3287,-1.6785,...,0.060148,-0.15619,-0.11949,0.23445,0.081367,0.24618,-0.15242,-0.34224,-0.022394,0.13684
of,-0.076947,-0.021211,0.21271,-0.72232,-0.13988,-0.12234,-0.17521,0.12137,-0.070866,-1.5721,...,-0.36673,-0.38603,0.3029,0.015747,0.34036,0.47841,0.068617,0.18351,-0.29183,-0.046533
to,-0.25756,-0.057132,-0.6719,-0.38082,-0.36421,-0.082155,-0.010955,-0.082047,0.46056,-1.8477,...,-0.012806,-0.59707,0.31734,-0.25267,0.54384,0.063007,-0.049795,-0.16043,0.046744,-0.070621


 - As you can see above, each word is represented by 300 real-valued numbers. These 300 values together form the GloVe vector for that particular word.

 - Before you can use GloVe, it is advisable to convert it into a dictionary-like structure, where each word is a key and its value is the 300-dimensional vector. The following code does that. Note that, depending on your system's RAM, it might take some time to load all the vectors.

In [12]:
# make a dictionary of glove
glove = {key: val.values for key, val in df.T.items()}
# print shape of a vector
print('Shape of vector representation of \'cake\':', len(glove['cake']))

Shape of vector representation of 'cake': 300
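
As an alternative to the pandas route above, the file can also be parsed line by line into the same dictionary structure. The sketch below is not part of the original notebook: `load_glove` is a hypothetical helper name, and it assumes every line has the form `word v1 v2 ... vN` separated by single spaces. One possible advantage is that tokens such as `null` or `nan` stay ordinary strings, whereas `pd.read_csv` with its default settings may interpret them as missing values.

```python
import numpy as np

def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            # skip empty or malformed lines
            continue
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# tiny in-memory stand-in for the real file, invented for this example
sample = ["cake 0.1 -0.2 0.3\n", "null 0.5 0.5 0.5\n"]
toy = load_glove(sample)
print(sorted(toy), toy["cake"].shape)

# With the real file (path taken from the cell above), usage would look like:
# with open('../embeddings/glove.6B.300d.txt', encoding='utf-8') as f:
#     glove = load_glove(f)
```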


#### b. Saving embedding matrix to disk

- You will have noticed that it takes quite some time to load the GloVe vectors and convert them to the dictionary format.    

- To avoid repeating that work, you can save the dictionary to your computer once and load it in a shorter time whenever you need it.
- We will use Python's [pickle](https://docs.python.org/3/library/pickle.html) module to do that. Pickle is a serialization format that lets you save Python objects directly to disk and re-load them in Python exactly as they were. Only load pickle files that you trust, since unpickling can execute arbitrary code.
- Pickle is also used in industry to save a trained model and deploy it.

In [14]:
import pickle

with open('glove.6B.300d.pkl', 'wb') as fp:
    # save the pickle file to disk
    pickle.dump(glove, fp)

#### d. Finding most similar words given a context word

 - Currently, we have created a dictionary of words and their glove vectors. Something like this,

![](../images/glove_dict.png)
 - It would be better if we added an index to each word in our GloVe dictionary. This will help us identify each word by its index when finding the most similar words. The following code returns an `inverse_dictionary` that contains the index as key and the word as value.

In [102]:
def generate_inverse_dictionary():
    # assign an id to each word (its position in the glove dictionary)
    # and map that id back to the word
    inverse_dictionary = dict(enumerate(glove.keys()))
    
    return inverse_dictionary

# call inverse_dictionary
print('First 10 entries in the dictionary:') 
list(generate_inverse_dictionary().items())[:10]

First 10 entries in the dictionary:


[(0, 'the'),
 (1, ','),
 (2, '.'),
 (3, 'of'),
 (4, 'to'),
 (5, 'and'),
 (6, 'in'),
 (7, 'a'),
 (8, '"'),
 (9, "'s")]

 - The similarity score between two words is calculated by taking the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of their vectors. You don't need to worry much about what cosine similarity is right now, but do know that this is the same method that **gensim** uses under the hood.
 - Going by that logic, if you want to find all the words similar to a given word, you have to compute the cosine similarity of this word with all the 400,000 words in the vocabulary.
 - Computing these similarities one pair at a time would work but would take too long. A better way is to use matrix multiplication to compute all the similarities at once and then sort the word indexes (that we created earlier in `inverse_dictionary`) in descending order of their similarity score with the given word.
 - You can then use the same `inverse_dictionary` to look up the words using their indexes.
![](../images/most_sim.png)
 - In the above example, we have taken four words, 'hi', 'hello', 'no' and 'hey', and assumed that each of their vectors has length 3.
 - As you can see, matrix multiplication gave the similarity score of 'hey' with each word in the corpus. Utilizing this, we took the index of the word and got the word name from the `inverse_dictionary`.
 - The following code does that!

In [125]:
# find the words most similar to a given word vector
def most_similar(word_vec, topn=5):
    # fetch inverse dictionary (id -> word)
    inverse_dictionary = generate_inverse_dictionary()
    # stack the glove vectors into a (vocabulary size x 300) matrix
    word_vectors = np.array(list(glove.values()))
    
    # compute the cosine similarity of word_vec with every word at once
    cosine_similarity = (np.dot(word_vectors, word_vec)
           / np.linalg.norm(word_vectors, axis=1)
           / np.linalg.norm(word_vec))
    
    # sort the word ids in descending order based on their similarity score
    word_ids = np.argsort(-cosine_similarity)
    
    # skip the top hit (normally the query word itself), so topn=5
    # returns the next 4 most similar words with their similarity scores
    return [(inverse_dictionary[x], cosine_similarity[x]) for x in word_ids[1:topn]]


In [126]:
# find most similar words to king
most_similar(glove['king'])

[('queen', 0.6336468701479963),
 ('prince', 0.6196623000643996),
 ('monarch', 0.5899620887183682),
 ('kingdom', 0.5791266501891081)]
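
The matrix-multiplication trick used inside `most_similar` can be checked on a tiny vocabulary. The sketch below reuses the four words from the illustration ('hi', 'hello', 'no', 'hey'), but the 3-dimensional vectors are hypothetical values chosen only for this example, and `cosine_to_all` and `cosine_pair` are illustrative helper names. It compares the one-shot matrix product against a pair-by-pair loop using cos(u, v) = u·v / (‖u‖‖v‖).

```python
import numpy as np

# hypothetical 3-dimensional vectors, chosen only for illustration
toy_vectors = {
    "hi":    np.array([1.0, 0.9, 0.1]),
    "hello": np.array([0.9, 1.0, 0.2]),
    "no":    np.array([-1.0, 0.1, 0.8]),
    "hey":   np.array([1.0, 1.0, 0.0]),
}
words = list(toy_vectors)
matrix = np.array([toy_vectors[w] for w in words])  # shape (4, 3)

def cosine_to_all(matrix, vec):
    """Cosine similarity of `vec` with every row of `matrix`, via one matrix product."""
    return matrix @ vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))

def cosine_pair(u, v):
    """Cosine similarity of two vectors, computed one pair at a time."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = toy_vectors["hey"]
fast = cosine_to_all(matrix, query)
slow = np.array([cosine_pair(toy_vectors[w], query) for w in words])

# rank the words by descending similarity to 'hey'
ranked = [words[i] for i in np.argsort(-fast)]
print(ranked)
print(np.allclose(fast, slow))
```

Both routes give the same scores, and the query word itself ranks first, which is why `most_similar` skips the first entry of the sorted ids.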

#### e. Contextual relationship between words

 - One of the impressive things about GloVe is its ability to capture semantic relationships between words. That is why you can do cool stuff like perform linear algebra on words and get a sensible output. Have a look at the following example:

    `airplane - fly + drive = car`

 - If you pass the left-hand side of the above equation to the model, it will give you the right-hand side. This makes sense: what would you get if you removed the ability to fly from an airplane and added the ability to drive? You would get a car!
 - The underlying vectors capture the implicit relationship between airplane and fly, and also how changing the mode of travel changes the machine used to travel.
 - They also capture the analogy that **fly is to airplane as drive is to car.**

In [128]:
# find airplane - fly + drive
result = glove['airplane'] - glove['fly'] + glove['drive']
# find most similar to result
most_similar(result)

[('car', 0.572202522968639),
 ('drives', 0.538060840892117),
 ('vehicle', 0.5147907611059586),
 ('truck', 0.4797217336287158)]
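
Note that `most_similar` only skips the single top-ranked hit, and for an arithmetic query the result vector is not itself a word in the vocabulary. One possible variation, sketched below, is to exclude the three input words explicitly before ranking. This is an assumption-laden illustration rather than the notebook's method: `analogy` is a hypothetical helper, and the 2-dimensional toy vectors are invented so that the arithmetic works out exactly.

```python
import numpy as np

def analogy(vectors, a, b, c, topn=3):
    """Rank words by cosine similarity to vectors[a] - vectors[b] + vectors[c],
    leaving the three input words out of the candidates."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    matrix = np.array([vectors[w] for w in candidates])
    sims = matrix @ target / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(target))
    order = np.argsort(-sims)[:topn]
    return [(candidates[i], float(sims[i])) for i in order]

# invented 2-d vectors: first axis ~ "is a vehicle", second axis ~ "air vs road"
toy = {
    "airplane": np.array([1.0, 1.0]),
    "fly":      np.array([0.0, 1.0]),
    "drive":    np.array([0.0, -1.0]),
    "car":      np.array([1.0, -1.0]),
    "boat":     np.array([1.0, 0.0]),
    "banana":   np.array([-1.0, 0.0]),
}
result_toy = analogy(toy, "airplane", "fly", "drive")
print(result_toy)
```

With the real embeddings the call would be `analogy(glove, 'airplane', 'fly', 'drive')`. This sketch rebuilds the candidate matrix on every call, so it would be slow on the full vocabulary.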

### 4. Challenge : Build a text classification model using GloVe

Now it is time for you to apply everything you have learned so far and build a text classification model on the same BBC News dataset that we used in the Word2Vec notebook. Here are the steps that you need to follow:

1. Load glove from the disk.
2. Create X by generating glove vectors for each news article.
3. Label encode y by using LabelEncoder from sklearn.
4. Split the data into 'train' and 'test' sets.
5. Train a naive bayes model.
6. Compute the accuracy of the model to measure performance.

For your convenience, steps 1 and 2 are already done for you. You may proceed with steps 3 to 6.

**Note: Feel free to refer to the previous Word2Vec notebook for ideas/inspiration/code for this assignment.**

1. Loading GloVe vectors from disk

In [5]:
import pickle

with open('glove.6B.300d.pkl', 'rb') as fp:
    # load glove from disk 
    glove = pickle.load(fp)

2. Create X by generating glove vectors for each news article

In [6]:
# returns the vector representation of a document: the mean of the glove
# vectors of its tokens that are present in the vocabulary
def get_embedding_glove(doc_tokens):
    # collect the vectors of the tokens found in the glove vocabulary
    embeddings = [glove[tok] for tok in doc_tokens if tok in glove]
    if not embeddings:
        # no known token: return a zero vector instead of NaN
        return np.zeros(300)
    # average the word vectors to get the vector of the whole document
    return np.mean(embeddings, axis=0)

In [7]:
from gensim.utils import simple_preprocess

# preprocess all the articles of the data set
preprocessed_bbc = bbc_news.text.apply(lambda x: simple_preprocess(x))

# create X from glove
X = preprocessed_bbc.apply(lambda x: get_embedding_glove(x))
X = pd.DataFrame(X.tolist())
print('X shape:', X.shape)

X shape: (2225, 300)
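
If you get stuck on steps 3 to 6, the outline below shows one possible shape of the pipeline. It runs on random stand-in data rather than on the BBC articles, so it says nothing about the accuracy you should expect. The names `X_demo`, `labels` and `y_demo` are placeholders invented for this sketch. `GaussianNB` is assumed as the naive Bayes variant because averaged GloVe vectors contain negative values, which `MultinomialNB` does not accept.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# stand-in data: 120 random "document vectors" of 20 dimensions in 3 classes
# (in the notebook, X_demo would be X and labels would be bbc_news.label)
rng = np.random.default_rng(0)
labels = np.array(["business", "sport", "tech"] * 40)
centers = {name: rng.normal(size=20) for name in np.unique(labels)}
X_demo = np.array([centers[name] + 0.3 * rng.normal(size=20) for name in labels])

# step 3: label encode y
encoder = LabelEncoder()
y_demo = encoder.fit_transform(labels)

# step 4: split into train and test sets, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# step 5: train a naive Bayes model that handles real-valued features
model = GaussianNB()
model.fit(X_train, y_train)

# step 6: measure accuracy on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print('Accuracy on the stand-in data:', accuracy)
```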
