## Training Embeddings Using Gensim
Word embeddings are a way of representing text as dense numeric vectors in NLP. In this notebook we demonstrate how to train embeddings using Gensim. [Gensim](https://radimrehurek.com/gensim/index.html) is an open source Python library for natural language processing, with a focus on topic modeling (explained in Chapter 7).

In [1]:
# To install only the requirements of this notebook, uncomment the lines below and run this cell

# ===========================

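# NOTE: the cells below use the Gensim 4.x API (e.g. wv.index_to_key, vector_size), which gensim==3.6.0 does not provide.
# !pip install "gensim>=4.0"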
# !pip install requests==2.23.0

# ===========================

In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [3]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Define training data.
# Gensim's Word2Vec expects a 'list of lists' for training: the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [['dog','bites','man'], ["man", "bites" ,"dog"],["dog","eats","meat"],["man", "eats","food"]]

#Training the model
model_cbow = Word2Vec(corpus, min_count=1,sg=0) #using CBOW Architecture for training
model_skipgram = Word2Vec(corpus, min_count=1,sg=1) #using skipGram Architecture for training
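
The 'list of lists' input format described in the comments above can be built from raw strings with plain Python. The sketch below is illustrative only: `raw_documents`, `tokenize`, and `toy_corpus` are hypothetical names, and the whitespace split is a deliberately naive stand-in for a real tokenizer.

```python
# Hypothetical raw documents; in practice these would come from your own data.
raw_documents = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

def tokenize(text):
    """Lowercase a document and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# One inner list of tokens per document: the 'list of lists' shape used for training.
toy_corpus = [tokenize(doc) for doc in raw_documents]
print(toy_corpus)
```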

## Continuous Bag of Words (CBOW) 
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
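
To make the CBOW task concrete, here is a minimal pure-Python sketch of how (context words, center word) training examples could be framed from one tokenized sentence. The helper name `cbow_examples` and the `window` argument are assumptions made for illustration; this is not Gensim's implementation, which involves additional details during training.

```python
def cbow_examples(sentence, window=1):
    """Return (context_words, center_word) pairs for one tokenized sentence."""
    examples = []
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]        # up to `window` words before the center
        right = sentence[i + 1:i + 1 + window]       # up to `window` words after the center
        context = left + right
        if context:  # skip words that have no neighbours at all
            examples.append((context, center))
    return examples

# Each line shows the context the model sees and the center word it should predict.
for context, center in cbow_examples(["dog", "bites", "man"], window=1):
    print(f"context={context} -> center={center!r}")
```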

In [7]:
#Summarize the loaded model
print(model_cbow)

#Summarize vocabulary
words = list(model_cbow.wv.index_to_key)
print(words)

#Access vector for one word
print(model_cbow.wv['dog'])

Word2Vec<vocab=6, vector_size=100, alpha=0.025>
['man', 'dog', 'eats', 'bites', 'food', 'meat']
[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419385e-03
  7.4669183e-03 -6.1676754e-03  1.1056137e-03  6.0472824e-03
 -2.8400505e-03 -6.1735227e-03 -4.1022300e-04 -8.3689485e-03
 -5.6000124e-03  7.1045388e-03  3.3525396e-03  7.2256695e-03
  6.8002474e-03  7.5307419e-03 -3.7891543e-03 -5.6180597e-04
  2.3483764e-03 -4.5190323e-03  8.3887316e-03 -9.8581640e-03
  6.7646410e-03  2.9144168e-03 -4.9328315e-03  4.3981876e-03
 -1.7395747e-03  6.7113843e-03  9.9648498e-03 -4.3624435e-03
 -5.9933780e-04 -5.6956373e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384968e-03  9.2734173e-03
  7.8980681e-03 -6.9895042e-03 -9.1558648e-03 -3.5575271e-04
 -3.0998408e-03  7.8943167e-03  5.9385742e-03 -1.5456629e-03
  1.5109634e-03  1.7900408e-03  7.8175711e-03 -9.5101865e-03
 -2.0553112e-04  3.4691966e-03 -9.3897223e-04  8.3817719e-03
  9.0107834e-03  6.5365066e-03 -7.1162102e-04  7.7

In [9]:
#Compute similarity 
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.01349709
Similarity between eats and man: -0.052354358


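From the above similarity scores, we can conclude that eats is more similar to bites than to man. Note that both scores are close to zero: a four-sentence corpus is far too small to learn meaningful embeddings, so treat the comparison as illustrative only.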
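
The `similarity` scores above are cosine similarities between the two word vectors. A minimal pure-Python sketch of that computation follows; the function name `cosine_similarity` and the 3-dimensional toy vectors are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the trained model.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]    # points in the same direction as vec_a
vec_c = [-2.0, 1.0, 0.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # same direction: similarity near 1
print(cosine_similarity(vec_a, vec_c))  # orthogonal: similarity near 0
```

Because the score depends only on the angle between the vectors, it always lies between -1 and 1 regardless of vector length.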

In [11]:
#Most similar words
model_cbow.wv.most_similar('meat')

[('food', 0.13887985050678253),
 ('bites', 0.13149003684520721),
 ('eats', 0.06422408670186996),
 ('dog', 0.009391166269779205),
 ('man', -0.05987630784511566)]

In [13]:
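# save model (create the output directory first; saving fails if it does not exist)
import os
os.makedirs('output', exist_ok=True)
model_cbow.save('output/model_cbow.bin')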

# load model
new_model_cbow = Word2Vec.load('output/model_cbow.bin')
print(new_model_cbow)

Word2Vec<vocab=6, vector_size=100, alpha=0.025>


## SkipGram
In skipgram, the task is to predict the context words from the center word.
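
As an illustration of the skip-gram task, the sketch below enumerates the (center word, context word) pairs for one tokenized sentence. The helper name `skipgram_pairs` and the `window` argument are assumptions made for this example; it only shows how the examples are framed and is not Gensim's implementation.

```python
def skipgram_pairs(sentence, window=1):
    """Return (center_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

# Each pair is one prediction: given the center word, predict this context word.
print(skipgram_pairs(["dog", "bites", "man"], window=1))
```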

In [18]:
#Summarize the loaded model
print(model_skipgram)

#Summarize vocabulary
words = list(model_skipgram.wv.index_to_key)
print(words)

#Access vector for one word
print(model_skipgram.wv['dog'])
print(len(model_skipgram.wv['dog']))

Word2Vec<vocab=6, vector_size=100, alpha=0.025>
['man', 'dog', 'eats', 'bites', 'food', 'meat']
[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419385e-03
  7.4669183e-03 -6.1676754e-03  1.1056137e-03  6.0472824e-03
 -2.8400505e-03 -6.1735227e-03 -4.1022300e-04 -8.3689485e-03
 -5.6000124e-03  7.1045388e-03  3.3525396e-03  7.2256695e-03
  6.8002474e-03  7.5307419e-03 -3.7891543e-03 -5.6180597e-04
  2.3483764e-03 -4.5190323e-03  8.3887316e-03 -9.8581640e-03
  6.7646410e-03  2.9144168e-03 -4.9328315e-03  4.3981876e-03
 -1.7395747e-03  6.7113843e-03  9.9648498e-03 -4.3624435e-03
 -5.9933780e-04 -5.6956373e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384968e-03  9.2734173e-03
  7.8980681e-03 -6.9895042e-03 -9.1558648e-03 -3.5575271e-04
 -3.0998408e-03  7.8943167e-03  5.9385742e-03 -1.5456629e-03
  1.5109634e-03  1.7900408e-03  7.8175711e-03 -9.5101865e-03
 -2.0553112e-04  3.4691966e-03 -9.3897223e-04  8.3817719e-03
  9.0107834e-03  6.5365066e-03 -7.1162102e-04  7.7

In [17]:
#Compute similarity 
print("Similarity between eats and bites:",model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_skipgram.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.013518773
Similarity between eats and man: -0.0523451


From the above similarity scores, we can again conclude that eats is more similar to bites than to man.

In [20]:
#Most similar words
model_skipgram.wv.most_similar('meat')

[('food', 0.13887983560562134),
 ('bites', 0.13149002194404602),
 ('eats', 0.06406080722808838),
 ('dog', 0.009391166269779205),
 ('man', -0.059876300394535065)]

In [21]:
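# save model (create the output directory first; saving fails if it does not exist)
import os
os.makedirs('output', exist_ok=True)
model_skipgram.save('output/model_skipgram.bin')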

# load model
new_model_skipgram = Word2Vec.load('output/model_skipgram.bin')
print(new_model_skipgram)

Word2Vec<vocab=6, vector_size=100, alpha=0.025>


## Training Your Embedding on Wiki Corpus

##### The corpus download page: https://dumps.wikimedia.org/enwiki/20200120/
The entire wiki corpus as of 28/04/2020 is just over 16GB in size.
Due to computational constraints, we will use only a part of this corpus to train our word2vec and fasttext embeddings.

The file we use is 294MB, so it can take a while to download.

Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039

In [36]:
import os
import requests

os.makedirs('data/en', exist_ok= True)
# this will keep changing, need to go to this URL to figure it out:
# https://dumps.wikimedia.org/enwiki/latest/
# file_name = "data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2"
# file_name = "data/en/enwiki-latest-pages-articles-multistream-index1.txt-p1p41242.bz2"
file_name = "data/en/enwiki-latest-pages-articles-multistream22.xml-p44496246p44788941.bz2"
file_id = "11804g0GcWnBIVDahjo5fQyc05nQLXGwF"

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

if not os.path.exists(file_name):
    download_file_from_google_drive(file_id, file_name)
else:
    print("file already exists, skipping download")

print(f"File at: {file_name}")

file already exists, skipping download
File at: data/en/enwiki-latest-pages-articles-multistream22.xml-p44496246p44788941.bz2


In [37]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [38]:
#Preparing the Training data
# The lemmatize argument has been deprecated, so it is no longer passed.
# Older Gensim versions used: WikiCorpus(file_name, lemmatize=False, dictionary={})
wiki = WikiCorpus(file_name, dictionary={})
sentences = list(wiki.get_texts())

# If you get a memory error executing the lines above,
# comment them out and uncomment the lines below.
# Loading will be slower, but stable.
# wiki = WikiCorpus(file_name, processes=4, dictionary={})
# sentences = list(wiki.get_texts())

# If you still get a memory error, try setting processes to 1 or 2 and then run it again.
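
Another way to reduce memory pressure is to avoid materializing every sentence in one big list and instead stream tokenized sentences from disk. The sketch below is a hedged, pure-Python illustration of a restartable iterable; the class name `LineSentences`, the one-sentence-per-line file layout, and the temporary demo file are all assumptions, not part of this notebook's pipeline.

```python
import os
import tempfile

class LineSentences:
    """Restartable iterable that yields one whitespace-tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every iteration lets a trainer make several passes.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo on a small temporary file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("dog bites man\n\nman eats food\n")
    tmp_path = tmp.name

stream = LineSentences(tmp_path)
first_pass = list(stream)
second_pass = list(stream)  # iterating again restarts from the top of the file
print(first_pass)
os.remove(tmp_path)
```

Only one line is held in memory at a time, at the cost of re-reading the file on every pass.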

### Hyperparameters


1.   sg - Selects the training algorithm: 1 for skip-gram, 0 for CBOW. The default is CBOW.
2.   min_count - Ignores all words with a total frequency lower than this.

There are many more hyperparameters; the full list can be found in the [official documentation](https://radimrehurek.com/gensim/models/word2vec.html).
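
To see what `min_count` does to the vocabulary, the following pure-Python sketch applies the same frequency threshold to the toy corpus from earlier in this notebook. The helper name `build_vocab` is hypothetical; this mimics only the filtering rule, not Gensim's vocabulary-building code.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=1):
    """Keep only the words whose total corpus frequency is at least min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word for word, freq in counts.items() if freq >= min_count}

toy_corpus = [["dog", "bites", "man"], ["man", "bites", "dog"],
              ["dog", "eats", "meat"], ["man", "eats", "food"]]

print(sorted(build_vocab(toy_corpus, min_count=1)))  # every word is kept
print(sorted(build_vocab(toy_corpus, min_count=2)))  # words seen only once are dropped
```

With `min_count=2`, "meat" and "food" drop out because each occurs only once, which is why the toy models above were trained with `min_count=1`.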


In [39]:
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

CBOW Model Training Complete.
Time taken for training is:0.01 hrs 


In [44]:
#Summarize the loaded model
print(word2vec_cbow)
print("-"*30)

#Summarize vocabulary
words = list(word2vec_cbow.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(word2vec_cbow.wv['film'])}")
print(word2vec_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec<vocab=40190, vector_size=100, alpha=0.025>
------------------------------
Length of vocabulary: 40190
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'as', 'is', 'for', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'also', 'were', 'which', 'are', 'first', 'her', 'this', 'has', 'she', 'be', 'references']
------------------------------
Length of vector: 100
[ 2.219304   -0.8354919   2.6662729  -2.4509525   2.9505837  -1.878043
  0.06650275 -0.9594213   0.9108776   2.0968235  -0.48863474 -3.2809408
 -1.4194704   0.64886856  1.7247549  -0.10405306  1.7054323   0.26500842
 -2.6201446  -0.10065644  1.7148116  -1.6619455  -0.8734362  -1.324647
 -0.43044254  1.4404098  -0.8076793   0.7346869   2.146768   -2.0302885
  1.1566433   1.562745    1.796327    1.3882915  -1.0332848  -2.7455528
  2.944826    0.6361229   1.995394   -4.32989    -3.4562297   2.2285547
  1.339047   -1.2293513  -0.32261008 -2.717089   -1.3406433   2.6678302
 -1.0551729   1.

In [45]:
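# save the word vectors in the word2vec binary format
from gensim.models import Word2Vec, KeyedVectors
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

# load the vectors back (this format stores only the vectors, so load with KeyedVectors rather than Word2Vec.load)
# new_word2vec_cbow = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
# print(new_word2vec_cbow)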

In [46]:
#SkipGram
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

SkipGram Model Training Complete
Time taken for training is:0.02 hrs 


In [47]:
#Summarize the loaded model
print(word2vec_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(word2vec_skipgram.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(word2vec_skipgram.wv['film'])}")
print(word2vec_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec<vocab=40190, vector_size=100, alpha=0.025>
------------------------------
Length of vocabulary: 40190
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'as', 'is', 'for', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'also', 'were', 'which', 'are', 'first', 'her', 'this', 'has', 'she', 'be', 'references']
------------------------------
Length of vector: 100
[ 0.32331878  0.5790854   0.43830907  0.20141493  0.40692985 -0.9307328
  0.38431042  0.47711086 -0.44711447 -0.27486363 -0.03668171 -0.35193586
  0.11449455 -0.0292649   0.02325125 -0.18134807  0.3840576  -0.6193043
 -0.41114882 -0.218237    0.21878621 -0.21981032  0.14389041 -0.05950952
  0.13657631  0.37229234 -0.61114234  0.18194164 -0.1850214   0.06693834
  0.82910615 -0.4441088   0.36599264 -0.0019779  -0.2821699  -0.19844653
  0.34508672  0.01054496 -0.26363865 -0.25678846 -0.08147195  0.15462402
 -0.15726426 -0.01737205 -0.22792116 -0.37410778 -0.37628466  0.04001839
 -0.48213

In [48]:
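# save the word vectors in the word2vec binary format
word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)

# load the vectors back (this format stores only the vectors, so load with KeyedVectors rather than Word2Vec.load)
# from gensim.models import KeyedVectors
# new_word2vec_skipgram = KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True)
# print(new_word2vec_skipgram)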

## FastText

In [49]:
#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText CBOW Model Training Complete
Time taken for training is:0.03 hrs 


In [50]:
#Summarize the loaded model
print(fasttext_cbow)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_cbow.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_cbow.wv['film'])}")
print(fasttext_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

FastText<vocab=40190, vector_size=100, alpha=0.025>
------------------------------
Length of vocabulary: 40190
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'as', 'is', 'for', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'also', 'were', 'which', 'are', 'first', 'her', 'this', 'has', 'she', 'be', 'references']
------------------------------
Length of vector: 100
[  0.49952504   3.9394662    5.977159    -1.0057093   -3.5727525
   0.61577755   0.06744188   0.02997959   4.488094     0.9230299
   0.7005552    0.2315209   -2.176848     0.81778026  -1.2810814
   0.3064063    1.4155692   -1.1717366   -0.5766693    4.689038
  -2.199509     3.9081602    0.41604897   5.528003     1.2504398
   1.3626084   -2.6119337    6.7407537   -1.6044441    0.16712666
   3.2573516    1.6671691    2.1402783   -2.6542192   -1.5304126
  -0.9959465    0.358603     1.7427573    1.713649     2.8426118
  -3.6570632   -0.96193963   1.041286    -4.174572     0.61696625
  -4.

In [51]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText SkipGram Model Training Complete
Time taken for training is:0.04 hrs 


In [53]:
#Summarize the loaded model
print(fasttext_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_skipgram.wv.index_to_key)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_skipgram.wv['film'])}")
print(fasttext_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

FastText<vocab=40190, vector_size=100, alpha=0.025>
------------------------------
Length of vocabulary: 40190
Printing the first 30 words.
['the', 'of', 'and', 'in', 'to', 'was', 'on', 'as', 'is', 'for', 'by', 'with', 'he', 'at', 'from', 'that', 'his', 'it', 'an', 'also', 'were', 'which', 'are', 'first', 'her', 'this', 'has', 'she', 'be', 'references']
------------------------------
Length of vector: 100
[-1.3109466e-01  7.1749598e-01  6.6718823e-01 -6.4493126e-01
  3.0446741e-01  3.5411862e-01  3.4830281e-01  4.4338515e-01
 -3.0940580e-01 -8.1276399e-01 -9.2982106e-02 -1.8297681e-01
 -1.6826162e-01 -1.3484047e-02 -2.5378418e-01 -2.0926823e-01
  4.9818209e-01  3.6550432e-02 -7.5924933e-01 -2.0646293e-02
 -2.9643467e-01  5.3815728e-01  9.8732240e-02  7.7871017e-02
 -1.8261248e-01 -1.2075171e-01 -3.8744116e-01 -2.5337830e-01
 -1.1243188e-01 -1.3551758e-01 -9.3174815e-02 -2.9903981e-01
  2.8691188e-01 -3.3972341e-01 -1.7674230e-01 -4.6721286e-01
  3.8444540e-01  5.1899862e-01 -3.0457595e

#### An interesting observation is that CBOW trains faster than SkipGram in both cases (Word2Vec and FastText).
We leave it to the reader to figure out why. As a hint, revisit how CBOW and SkipGram each work.