# GloVe: Global Vectors for Word Representation

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Source: https://nlp.stanford.edu/projects/glove/

In [5]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sn
import pickle

%matplotlib inline

#Import module to split the datasets
from sklearn.model_selection import train_test_split
# Import modules to evaluate the metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix,accuracy_score,roc_auc_score,roc_curve,auc

In [7]:
# Global parameters
#root folder
root_folder='.'
data_folder_name='../data'
glove_filename='glove.6B.50d.txt'

train_filename='train.csv'
# Variable for data directory
DATA_PATH = os.path.abspath(os.path.join(root_folder, data_folder_name))
glove_path = os.path.abspath(os.path.join(DATA_PATH, glove_filename))

# Both train and test set are in the root data directory
train_path = DATA_PATH
test_path = DATA_PATH

#Relevant columns
TEXT_COLUMN = 'text'
TARGET_COLUMN = 'target'

## Word2Vec

Recall that Word2Vec was introduced in Mikolov et al.'s 2013 paper. The algorithm represents every word in the training corpus by a vector (a list of numbers). Words are typically fed to the model as one-hot encodings, and the vectors that are learned for them can initially be completely random. Any two semantically similar words should end up with word vectors that are "close" together.

We move the vector representations of similar words closer together by maximizing the predicted probability of the two words co-occurring. As we do this over and over again, vectors for words that often appear together will end up close, and vectors for words that rarely appear together will end up far apart.
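One common way to make "close" concrete is cosine similarity, which is also the score that Gensim's `most_similar` reports. The sketch below uses hypothetical 3-dimensional vectors invented for illustration (real embeddings have tens or hundreds of dimensions); the helper name `cosine_similarity` is our own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings, NOT taken from any trained model
toy_vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# Words meant to be similar should score higher than unrelated ones
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" scores much higher against "dog" than against "car", which is the behaviour training is meant to produce.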


## GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus. It was created by Pennington et al. in [Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/).

The approach relies on constructing a global co-occurrence matrix of words in the corpus. It has a few steps:

- During training, record the word co-occurrences of all words using a moving context window
- Initialize each word vector randomly (same as word2vec)
- If any two words co-occur more frequently than is justified by their frequency in the corpus, draw their vectors together. If they co-occur less frequently, push their vectors apart.
    - As in word2vec, repeating this over and over draws the vectors of similar words together.
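
The first step above, recording co-occurrences with a moving context window, can be sketched as follows. This is a minimal illustration on a made-up two-sentence corpus, not the GloVe reference implementation: the function name `cooccurrence_counts` and the `window_size` parameter are our own, and the sketch uses plain counts without the distance weighting that GloVe implementations typically apply.

```python
from collections import Counter

def cooccurrence_counts(sentences, window_size=2):
    """Count how often each (word, context word) pair falls inside the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Context = up to window_size tokens on each side of position i
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus, already tokenized
toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

counts = cooccurrence_counts(toy_corpus, window_size=2)
print(counts[("sat", "on")])   # pair seen in both sentences
print(counts[("cat", "rug")])  # pair never seen together
```

Because the window is symmetric, the resulting counts are symmetric too: `counts[(a, b)]` equals `counts[(b, a)]`.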
    
    
## Loading a pre-trained word embedding

Gensim launched its own dataset storage, committed to long-term support, a sane standardized usage API and focused on datasets for unstructured text processing (no images or audio). This [Gensim-data repository](https://github.com/RaRe-Technologies/gensim-data) serves as that storage.

To use, simply install Gensim and use its download API. It will "talk" to this repository automagically.


More details on default downloadable Gensim data models can be found [here](https://github.com/RaRe-Technologies/gensim-data).

In [2]:
# use the default gensim Glove models

import gensim.downloader

model = gensim.downloader.load("glove-twitter-25") 

model.most_similar("cat")



[('dog', 0.9590820074081421),
 ('monkey', 0.9203579425811768),
 ('bear', 0.9143137335777283),
 ('pet', 0.9108031392097473),
 ('girl', 0.8880630731582642),
 ('horse', 0.8872725963592529),
 ('kitty', 0.8870542049407959),
 ('puppy', 0.8867696523666382),
 ('hot', 0.886525571346283),
 ('lady', 0.8845519423484802)]

## Using Downloaded GloVe Model

Files with pre-trained GloVe vectors are available from many sites, such as Kaggle or the Stanford project page linked above. We will use the glove.6B.50d.txt file (the `glove_filename` set in the global parameters), which contains GloVe vectors trained on the Wikipedia and Gigaword datasets.

If we have already downloaded a GloVe file, here is how to load it. We first convert the file containing the word embeddings to the word2vec format for convenience, using the gensim function glove2word2vec.

The following code only needs to be run once. The function glove2word2vec saves the GloVe embeddings in the word2vec format, in a new file that will be loaded next.

In [3]:
from gensim.scripts.glove2word2vec import glove2word2vec

In [8]:
# Convert the GloVe file to word2vec format (only needs to run once)
word2vec_output_file = glove_filename+'.word2vec'
glove2word2vec(glove_path, word2vec_output_file)


(400000, 50)

The returned tuple gives the vocabulary size and the vector dimension: our vocabulary contains 400K words, each represented by a feature vector of 50 dimensions. Now we can load the GloVe embeddings in word2vec format and then analyze some analogies. If we later want to use pre-trained word2vec embeddings instead, we can simply change the filename and reuse all the code below.
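To make the conversion less of a black box: the two text formats differ only in that a word2vec file starts with a header line holding the vocabulary size and the vector dimension, while a GloVe file has no header. The sketch below reproduces that idea on a hypothetical two-word file; `add_word2vec_header` and the toy file names are our own, and the function reads the whole file into memory, which is fine for a toy example but wasteful for a full GloVe file. As an aside, Gensim 4.0 and later can read GloVe text files directly with `KeyedVectors.load_word2vec_format(..., binary=False, no_header=True)`, which makes the conversion step optional.

```python
import os
import tempfile

def add_word2vec_header(glove_file, output_file):
    """Copy a GloVe text file, prepending the '<vocab_size> <dimension>' header."""
    with open(glove_file, encoding="utf-8") as f:
        lines = f.readlines()
    vocab_size = len(lines)
    dimension = len(lines[0].split()) - 1  # first token on a line is the word
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(f"{vocab_size} {dimension}\n")
        f.writelines(lines)
    return vocab_size, dimension

# Hypothetical two-word file in GloVe text format, invented for illustration
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_glove.txt.word2vec")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("king 0.1 0.2 0.3\n")
    f.write("queen 0.2 0.1 0.4\n")

shape = add_word2vec_header(toy_glove, toy_w2v)
print(shape)  # (vocabulary size, vector dimension)
```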

In [13]:
from gensim.models import KeyedVectors
# load the Stanford GloVe model
word2vec_output_file = glove_filename+'.word2vec'
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Show a word embedding
print('King: ',model.get_vector('king'))

# Analogy: king - man + woman
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

print('Most similar word to King - Man + Woman: ', result)

King:  [-0.32307  -0.87616   0.21977   0.25268   0.22976   0.7388   -0.37954
 -0.35307  -0.84369  -1.1113   -0.30266   0.33178  -0.25113   0.30448
 -0.077491 -0.89815   0.092496 -1.1407   -0.58324   0.66869  -0.23122
 -0.95855   0.28262  -0.078848  0.75315   0.26584   0.3422   -0.33949
  0.95608   0.065641  0.45747   0.39835   0.57965   0.39267  -0.21851
  0.58795  -0.55999   0.63368  -0.043983 -0.68731  -0.37841   0.38026
  0.61641  -0.88269  -0.12346  -0.37928  -0.38318   0.23868   0.6685
 -0.43321  -0.11065   0.081723  1.1569    0.78958  -0.21223  -2.3211
 -0.67806   0.44561   0.65707   0.1045    0.46217   0.19912   0.25802
  0.057194  0.53443  -0.43133  -0.34311   0.59789  -0.58417   0.068995
  0.23944  -0.85181   0.30379  -0.34177  -0.25746  -0.031101 -0.16285
  0.45169  -0.91627   0.64521   0.73281  -0.22752   0.30226   0.044801
 -0.83741   0.55006  -0.52506  -1.7357    0.4751   -0.70487   0.056939
 -0.7132    0.089623  0.41394  -1.3363   -0.61915  -0.33089  -0.52881
  0.16483  -

## Analyzing the vector space and finding analogies

We would like to extract some interesting features of our word embeddings. Our words are now numerical vectors, so we can measure and compare distances between words to show some of the properties that these embeddings provide.

For example, we can test some analogies. The most famous is the following: king - man + woman ≈ queen. In other words, adding the vectors associated with the words king and woman while subtracting the vector for man gives a vector close to the one associated with queen. Put differently, subtracting the concept of man from the concept of king leaves a representation of "royalty"; adding the word woman to this concept yields the word "queen". Another example is: france - paris + rome ≈ italy. In this case, the vector difference between france and paris captures the relation between a country and its capital.
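The arithmetic behind such an analogy can be sketched with plain NumPy. The vectors below are hypothetical 3-dimensional values invented for illustration (roughly "royalty", "male", "female"), and the helper `analogy` is our own simplification: Gensim's `most_similar` additionally normalizes the input vectors before combining them, so its scores would differ.

```python
import numpy as np

# Hypothetical toy embeddings, NOT taken from the GloVe file
toy_vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.7, 0.8, 0.2]),
}

def analogy(positive, negative, vectors):
    """Return (word, cosine score) closest to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # skip the query words themselves
        score = float(np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

toy_result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print("king - man + woman =", toy_result)
```

Excluding the query words matters: without that check, the nearest vector to king - man + woman is often king itself.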

Now we will show some of these analogies on different topics.

In [14]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print('King - Man + Woman = ',result)
result = model.most_similar(positive=['rome', 'france'], negative=['paris'], topn=1)
print('France - Paris + Rome = ',result)
result = model.most_similar(positive=['english', 'france'], negative=['french'], topn=1)
print('France - french + english = ',result)
result = model.most_similar(positive=['june', 'december'], negative=['november'], topn=1)
print('December - November + June = ',result)
result = model.most_similar(positive=['sister', 'man'], negative=['woman'], topn=1)
print('Man - Woman + Sister = ',result)


King - Man + Woman =  [('queen', 0.7698541283607483)]
France - Paris + Rome =  [('italy', 0.8295993208885193)]
France - french + english =  [('england', 0.7678162455558777)]
December - November + June =  [('july', 0.9814670085906982)]
Man - Woman + Sister =  [('brother', 0.8288711309432983)]


We can observe how the word vectors include information to relate countries with nationalities, months of the year, family relationships, etc.

But we do not always get the expected results, so each analogy is worth checking. For the family analogy below, the top match is the expected "uncle":

In [16]:
# Nephew - Niece + Aunt
result = model.most_similar(positive=['aunt', 'nephew'], negative=['niece'], topn=1)
print('Nephew - Niece + Aunt = ',result)

Nephew - Niece + Aunt =  [('uncle', 0.8936851620674133)]
