<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/embeddings/embeddings/GloVe%20Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GloVe Embeddings

GloVe (Global Vectors) is another commonly used method for obtaining pre-trained word embeddings. It aims to achieve two goals:

- Create word vectors that capture meaning in vector space
- Take advantage of global count statistics instead of only local information

There is plenty of material online that explains the concepts behind GloVe, so my focus here is on how to use pre-trained GloVe word embeddings. The resources below cover the details.

## Resources:

- [GloVe Paper Explanation](https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representation-explained/)
- [Colyer blog on GloVe](https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/)
- [Code and Pretrained Embeddings](https://nlp.stanford.edu/projects/glove/)
- [Stanford Lecture](https://www.youtube.com/watch?v=ASn7ExxLZws)
- [GloVe Paper](https://www-nlp.stanford.edu/pubs/glove.pdf) 

## Difference between Word2Vec and GloVe

**Global information:** word2vec has no explicit global information built in; it learns only from local context windows. GloVe builds a global co-occurrence matrix over the whole corpus, from which it estimates how likely a given word is to co-occur with other words. In principle, this global information should help GloVe. In practice, the two methods behave very similarly, and people have reported comparable performance with both.

**Presence of Neural Networks:** GloVe does not use a neural network, while word2vec does. In GloVe, the loss is the weighted squared difference between the dot product of two word embeddings (plus bias terms) and the log of their co-occurrence count. We minimize it with stochastic gradient updates (the paper uses AdaGrad), much as we would fit a least-squares regression. In word2vec, we either predict the context from a word (skip-gram) or predict a word from its context (continuous bag of words) using a neural network with one hidden layer.

[source 1](https://www.quora.com/How-is-GloVe-different-from-word2vec)

[source 2](http://deeplearning.lipingyang.org/wp-content/uploads/2017/12/How-is-GloVe-different-from-word2vec_-Quora.pdf)
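
To make the GloVe loss concrete, here is a minimal sketch of the per-pair term from the GloVe paper: a weighting function `f(x)` multiplied by the squared difference between `w_i · w_j + b_i + b_j` and `log(X_ij)`. The defaults `x_max=100` and `alpha=0.75` are the values suggested in the paper. The function names and the toy vectors are made up for illustration and are not taken from any library.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare pairs, caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one (word, context) pair with count x_ij > 0."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Toy 2-dimensional vectors and a toy co-occurrence count, purely illustrative.
w_word, w_context = [0.5, 1.0], [1.0, 0.5]
loss = pair_loss(w_word, w_context, b_i=0.1, b_j=0.2, x_ij=20.0)
print(loss)
```

The full objective sums this term over all word pairs with a non-zero co-occurrence count.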

## Download the pre-trained glove file

I will be using the glove.6B file, which was trained on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, and 300d vectors, 822 MB download). You can find the other files [here](https://nlp.stanford.edu/projects/glove/).

In [0]:
!wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2020-05-18 01:13:05--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2020-05-18 01:13:05--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2020-05-18 01:13:05--  http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [applic

In [0]:
!ls

glove.6B.zip  sample_data


## Upload the Data to Google Drive (Optional)

If the notebook runtime shuts down, the downloaded data is lost with it. To keep the data, we can store it in Google Drive by mounting the drive. Then, instead of downloading the GloVe file again, we can simply mount the drive and read the data from there.

In [0]:
from google.colab import drive

In [2]:
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Run only the first time: move the downloaded zip file to Google Drive
!mv glove.6B.zip "/content/drive/My Drive"

## Extract the contents

In [6]:
# Update the path to the zip file if you did not upload it to Drive
!unzip "./drive/My Drive/glove.6B.zip"

Archive:  ./drive/My Drive/glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [7]:
!ls

drive		   glove.6B.200d.txt  glove.6B.50d.txt
glove.6B.100d.txt  glove.6B.300d.txt  sample_data


I will be using **glove.6B.300d.txt**. The same logic applies to all the other files.
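
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. As a minimal sketch of that layout, the snippet below parses two made-up 3-dimensional lines instead of the real 300d file. `parse_glove_line` is a hypothetical helper, not part of any library.

```python
import io

def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]

# Two made-up lines standing in for the real file contents.
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
embeddings = dict(parse_glove_line(line) for line in sample)
print(embeddings["cat"])  # [0.4, 0.5, -0.6]
```

Reading the real file this way works but is slow for 400K words, which is one reason to load it through a library instead.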

# Using Gensim to load pre-trained GloVe Embeddings

Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.

Gensim includes streamed, parallelized implementations of the fastText, word2vec, and doc2vec algorithms, as well as latent semantic analysis (LSA, LSI, SVD), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), tf-idf, and random projections. [source](https://en.wikipedia.org/wiki/Gensim)

References:

- [My code on Word2Vec](https://github.com/graviraja/100-Days-of-NLP/blob/master/embeddings/Word2Vec.ipynb)
- [Machine Learning Mastery blog on using gensim](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/)


In [0]:
import gensim
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

In [0]:
glove_file_path = "./glove.6B.300d.txt"

In [11]:
# converting glove file to word2vec format so that it can be loaded by gensim 
word2vec_output_file = 'glove.6B.300d.txt.word2vec'
glove2word2vec(glove_file_path, word2vec_output_file)

(400001, 300)
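
The tuple printed by the conversion is the vocabulary size and the vector dimension. The word2vec text format differs from the GloVe format only by a first line holding those two numbers. The sketch below shows the idea on two made-up lines. `add_word2vec_header` is a hypothetical helper written for illustration, not Gensim's implementation.

```python
def add_word2vec_header(glove_lines):
    """Return word2vec-format lines: a 'count dim' header plus the original lines."""
    lines = [line.rstrip("\n") for line in glove_lines if line.strip()]
    dim = len(lines[0].split(" ")) - 1  # the first token is the word itself
    return [f"{len(lines)} {dim}"] + lines

converted = add_word2vec_header(["the 0.1 -0.2 0.3\n", "cat 0.4 0.5 -0.6\n"])
print(converted[0])  # 2 3
```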

In [13]:
!ls

drive		   glove.6B.200d.txt  glove.6B.300d.txt.word2vec  sample_data
glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.50d.txt


In [12]:
# Note that the converted file is ASCII format, not binary, so we set binary=False when loading.
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)


## Word Similarities

Here, we will see how similar two words are to each other.

In [16]:
print(f'Similarity between night and nights: {model.similarity("night", "nights")}')
print(f'Similarity between red and blue: {model.similarity("red", "blue")}')
print(f'Similarity between hello and hi: {model.similarity("hello", "hi")}')
print(f'Similarity between king and queen: {model.similarity("king", "queen")}')
print(f'Similarity between london and moscow: {model.similarity("london", "moscow")}')
print(f'Similarity between car and bike: {model.similarity("car", "bike")}')

Similarity between night and nights: 0.6768945455551147
Similarity between red and blue: 0.6736692786216736
Similarity between hello and hi: 0.3302616477012634
Similarity between king and queen: 0.6336469054222107
Similarity between london and moscow: 0.39354825019836426
Similarity between car and bike: 0.4672122299671173
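
The scores above are cosine similarities between the two word vectors: the dot product divided by the product of the vector lengths. A minimal pure-Python sketch on toy vectors (not real GloVe vectors) looks like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```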



## Most Similar Words

Here, we will ask our model to find the words that are most similar to a given word.

In [17]:
similar = model.most_similar("january")
for i in similar:
    print(i)

('february', 0.9652106761932373)
('december', 0.9620600938796997)
('october', 0.9580933451652527)
('november', 0.9528316855430603)
('september', 0.9462947845458984)
('august', 0.935489296913147)
('april', 0.9315787553787231)
('june', 0.928554356098175)
('july', 0.9246786832809448)
('march', 0.898531436920166)
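
Conceptually, this lookup ranks every other word in the vocabulary by its cosine similarity to the query word and returns the top entries. The sketch below does that over a three-word toy vocabulary. The vectors and the name `toy_most_similar` are made up for illustration, and a real implementation would use vectorized operations rather than a Python loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy = {"january": [1.0, 0.1], "february": [0.9, 0.2], "car": [0.0, 1.0]}
for entry in toy_most_similar("january", toy):
    print(entry)
```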



## Odd-One-Out
Here, we ask our model to give us the word that does not belong to the list!

In [18]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

cereal
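
One way to find the odd one out, and roughly what I understand Gensim to do, is to normalize each vector to unit length, average them, and return the word whose vector is least similar to that average. The sketch below follows this idea on made-up 2-dimensional vectors. `toy_doesnt_match` is a hypothetical name, and the details are an assumption rather than Gensim's exact code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def toy_doesnt_match(words, vectors):
    """Return the word least similar to the mean of the unit-length vectors."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(normed[words[0]])
    mean = unit([sum(normed[w][k] for w in words) / len(words) for k in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Made-up vectors: the three meals point one way, "cereal" points another.
toy = {"breakfast": [1.0, 0.1], "lunch": [0.9, 0.2],
       "dinner": [0.95, 0.15], "cereal": [0.2, 1.0]}
print(toy_doesnt_match(["breakfast", "cereal", "dinner", "lunch"], toy))  # cereal
```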




## Analogy difference

Which word is to women as king is to queen?

In [19]:
model.most_similar(positive=["women", "king"], negative=["queen"])


[('men', 0.6591772437095642),
 ('people', 0.46714138984680176),
 ('who', 0.46234187483787537),
 ('americans', 0.4615159332752228),
 ('young', 0.45295244455337524),
 ('those', 0.4465915262699127),
 ('minorities', 0.44377851486206055),
 ('athletes', 0.42091041803359985),
 ('them', 0.4203292429447174),
 ('others', 0.4188333749771118)]

In [0]:
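def analogy(x1, x2, y1):
    # x1 is to x2 as y1 is to the returned word (vector arithmetic: y1 + x2 - x1)
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]  # top-ranked word, without its score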

In [21]:
analogy('japan', 'japanese', 'china')


'chinese'
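
The analogy query is vector arithmetic followed by a nearest-neighbour search: add the positive vectors, subtract the negative one, and pick the closest remaining word by cosine similarity, excluding the input words. Here is a minimal sketch on made-up 3-dimensional vectors. `toy_analogy` and the vectors are illustrative assumptions, not Gensim code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def toy_analogy(x1, x2, y1, vectors):
    """x1 is to x2 as y1 is to the returned word."""
    target = [c + b - a for a, b, c in
              zip(unit(vectors[x1]), unit(vectors[x2]), unit(vectors[y1]))]
    candidates = [w for w in vectors if w not in (x1, x2, y1)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

toy = {
    "japan": [1.0, 0.0, 0.0], "japanese": [1.0, 0.0, 1.0],
    "china": [0.0, 1.0, 0.0], "chinese": [0.0, 1.0, 1.0],
    "tokyo": [1.0, 0.0, 0.1],
}
print(toy_analogy("japan", "japanese", "china", toy))  # chinese
```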

# Training GloVe Embeddings

If the web datasets above don't match the semantics of your end use case, you can train word vectors on your own corpus.

```bash
git clone http://github.com/stanfordnlp/glove
cd glove && make
./demo.sh
```

The demo.sh script downloads a small corpus consisting of the first 100M characters of Wikipedia. It collects unigram counts, constructs and shuffles co-occurrence data, and trains a simple version of the GloVe model. It also runs a word analogy evaluation script in Python to verify word vector quality. For more details about training on your own corpus, read demo.sh in the [official GloVe repo](https://github.com/stanfordnlp/GloVe).
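
To give a feel for the co-occurrence step, here is a simplified Python sketch that counts symmetric co-occurrences inside a context window. Following the GloVe paper, a pair of words `d` tokens apart contributes `1/d` to the count. The official tools are written in C and handle vocabulary filtering and memory limits, so treat this only as an illustration. The function name and the toy sentence are made up.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Symmetric co-occurrence counts where a pair d tokens apart adds 1/d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                other = tokens[i + d]
                counts[(word, other)] += 1.0 / d
                counts[(other, word)] += 1.0 / d
    return counts

counts = cooccurrence_counts("the cat sat on the mat".split(), window=2)
print(counts[("the", "cat")])  # 1.0 (adjacent once)
print(counts[("cat", "on")])   # 0.5 (two tokens apart once)
```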