## Training Embeddings Using Gensim
Word embeddings are an approach to representing text in NLP. In this notebook we demonstrate how to train embeddings using Gensim. [Gensim](https://radimrehurek.com/gensim/index.html) is an open source Python library for natural language processing, with a focus on topic modeling (explained in chapter 7).

In [1]:
# To install only the requirements of this notebook, run this cell (comment the lines out if the packages are already installed)

# ===========================

!pip install gensim==3.6.0
!pip install requests==2.23.0

# ===========================



In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [3]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Define the training data.
# Gensim's Word2Vec expects a list of lists: the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)      # sg=0: CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1: skip-gram architecture

## Continuous Bag of Words (CBOW) 
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
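
To make the CBOW task concrete, here is a minimal sketch of how (context words, center word) training examples could be formed from the toy corpus used in this notebook. The helper `cbow_examples` and its fixed `window` argument are illustrative assumptions, not part of Gensim; the real training loop adds details (such as negative sampling) that this sketch ignores.

```python
# Toy corpus from this notebook.
corpus = [["dog", "bites", "man"], ["man", "bites", "dog"],
          ["dog", "eats", "meat"], ["man", "eats", "food"]]

def cbow_examples(sentences, window=2):
    """Yield (context_words, center_word) pairs, one per token position.

    Hypothetical helper for illustration only; not a Gensim function.
    """
    for sentence in sentences:
        for i, center in enumerate(sentence):
            left = sentence[max(0, i - window):i]
            right = sentence[i + 1:i + 1 + window]
            context = left + right
            if context:  # a one-word sentence has no context to predict from
                yield context, center

cbow_pairs = list(cbow_examples(corpus))
for context, center in cbow_pairs[:3]:
    print(context, "->", center)
```

Each token position produces exactly one example: the model sees all context words together and has to predict the single center word.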

In [5]:
# Summarize the trained model
print(model_cbow)

# Summarize the vocabulary
words = list(model_cbow.wv.vocab)
print(words)

# Access the vector for one word
print(model_cbow.wv['dog'])

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 2.5211656e-03 -4.7110780e-03  2.8306653e-03  1.2988835e-03
 -2.8097455e-03  3.5701145e-03 -1.0838461e-03 -2.7250070e-03
 -8.1295683e-04  1.5667123e-03  2.4118358e-03 -4.7246739e-03
  4.0631713e-03 -2.5763465e-03 -1.6118506e-04 -1.7885152e-03
 -4.3171244e-03 -2.2182211e-03 -7.9603918e-04 -2.0051922e-03
 -4.0520830e-03  3.5601703e-03 -4.8916014e-03  8.3168899e-04
  1.8691848e-03 -7.2068983e-04  1.7822703e-03  4.8909439e-03
  1.4495224e-04  7.1767234e-04  4.7240019e-04 -4.7488464e-03
  3.2012935e-03 -2.2545501e-03  5.5779086e-04 -8.5462094e-04
 -2.0693108e-03 -3.2387851e-03 -4.7898539e-03  2.9532199e-03
 -2.6267513e-03  4.3456843e-03 -3.6019234e-03 -3.1687848e-03
 -4.4510062e-03  3.2532150e-03 -5.6775135e-04  4.9478044e-03
  3.7174521e-03 -1.2643889e-03  4.7942945e-03  4.2697912e-04
  3.9527160e-03  1.0574544e-06 -4.1726064e-03 -4.9871374e-03
  4.4140723e-03 -3.0756535e-03 -6.5692631e-04  1.4224187e

In [6]:
# Compute similarity
print("Similarity between eats and bites:", model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.17754136
Similarity between eats and man: 0.15367655


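Intuitively, we would expect eats to be more similar to bites than to man, since both are verbs that appear in the same position. The scores above show the opposite: eats scores higher with man (0.15) than with bites (-0.18). With only four three-word sentences there is too little data to learn from, so these scores are most likely noise and should not be read as meaningful.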
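
The `similarity` method used above returns the cosine similarity of the two word vectors. As a rough sketch of what that number means, the snippet below computes cosine similarity by hand on small made-up vectors; the function name `cosine_similarity` and the example vectors are assumptions for illustration, not Gensim code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen only to show the range of possible scores.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Scores range from -1 (opposite directions) through 0 (unrelated directions) to 1 (same direction), so values close to 0, like the ones printed by the toy model, indicate little measurable relationship.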

In [7]:
# Most similar words
model_cbow.wv.most_similar('meat')

[('bites', 0.07873993366956711),
 ('dog', 0.07166969776153564),
 ('food', 0.015941759571433067),
 ('man', -0.06099303811788559),
 ('eats', -0.12372516840696335)]

In [8]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, size=100, alpha=0.025)


## SkipGram
In skipgram, the task is to predict the context words from the center word.
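
As a minimal sketch of the skip-gram task, the snippet below forms (center word, context word) training pairs from the toy corpus. The helper `skipgram_examples` and its fixed `window` argument are illustrative assumptions, not Gensim functions, and the sketch leaves out details of the real training loop such as negative sampling.

```python
# Toy corpus from this notebook.
toy_corpus = [["dog", "bites", "man"], ["man", "bites", "dog"],
              ["dog", "eats", "meat"], ["man", "eats", "food"]]

def skipgram_examples(sentences, window=2):
    """Yield one (center_word, context_word) pair per neighbor in the window.

    Hypothetical helper for illustration only; not a Gensim function.
    """
    for sentence in sentences:
        for i, center in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:  # the center word is not its own context
                    yield center, sentence[j]

skipgram_pairs = list(skipgram_examples(toy_corpus))
for center, context in skipgram_pairs[:4]:
    print(center, "->", context)
```

Here every neighbor inside the window becomes its own training pair, so a single center word contributes several examples.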

In [9]:
# Summarize the trained model
print(model_skipgram)

# Summarize the vocabulary
words = list(model_skipgram.wv.vocab)
print(words)

# Access the vector for one word
print(model_skipgram.wv['dog'])

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 2.5211656e-03 -4.7110780e-03  2.8306653e-03  1.2988835e-03
 -2.8097455e-03  3.5701145e-03 -1.0838461e-03 -2.7250070e-03
 -8.1295683e-04  1.5667123e-03  2.4118358e-03 -4.7246739e-03
  4.0631713e-03 -2.5763465e-03 -1.6118506e-04 -1.7885152e-03
 -4.3171244e-03 -2.2182211e-03 -7.9603918e-04 -2.0051922e-03
 -4.0520830e-03  3.5601703e-03 -4.8916014e-03  8.3168899e-04
  1.8691848e-03 -7.2068983e-04  1.7822703e-03  4.8909439e-03
  1.4495224e-04  7.1767234e-04  4.7240019e-04 -4.7488464e-03
  3.2012935e-03 -2.2545501e-03  5.5779086e-04 -8.5462094e-04
 -2.0693108e-03 -3.2387851e-03 -4.7898539e-03  2.9532199e-03
 -2.6267513e-03  4.3456843e-03 -3.6019234e-03 -3.1687848e-03
 -4.4510062e-03  3.2532150e-03 -5.6775135e-04  4.9478044e-03
  3.7174521e-03 -1.2643889e-03  4.7942945e-03  4.2697912e-04
  3.9527160e-03  1.0574544e-06 -4.1726064e-03 -4.9871374e-03
  4.4140723e-03 -3.0756535e-03 -6.5692631e-04  1.4224187e

In [10]:
# Compute similarity
print("Similarity between eats and bites:", model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_skipgram.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.17754517
Similarity between eats and man: 0.15367937


As with the CBOW model, the scores rank eats closer to man (0.15) than to bites (-0.18), the opposite of what we would expect. The corpus is too small for these scores to be meaningful.

In [11]:
# Most similar words
model_skipgram.wv.most_similar('meat')

[('bites', 0.0787399411201477),
 ('dog', 0.07166968286037445),
 ('food', 0.01594177447259426),
 ('man', -0.060993026942014694),
 ('eats', -0.12379160523414612)]

In [12]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus

##### Corpus download page: https://dumps.wikimedia.org/enwiki/20200120/
The entire wiki corpus as of 28/04/2020 is just over 16GB in size.
Because of computational constraints, we will train our word2vec and fastText embeddings on only a part of this corpus.

That part is a 294MB file, so it can take a while to download.

Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039

In [13]:
import os
import requests

os.makedirs('data/en', exist_ok= True)
file_name = "data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2"
file_id = "11804g0GcWnBIVDahjo5fQyc05nQLXGwF"

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

if not os.path.exists(file_name):
    download_file_from_google_drive(file_id, file_name)
else:
    print("file already exists, skipping download")

print(f"File at: {file_name}")

File at: data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2


In [14]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [15]:
#Preparing the Training data
wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

# If you get a memory error executing the lines above,
# comment them out and uncomment the lines below.
# Loading will be slower, but stable.
# wiki = WikiCorpus(file_name, processes=4, lemmatize=False, dictionary={})
# sentences = list(wiki.get_texts())

# If you still get a memory error, try setting processes to 1 or 2 and run it again.
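
If holding every article in a list is what exhausts memory, another option is to stream the corpus from disk. Gensim's `Word2Vec` accepts any iterable of token lists that can be iterated more than once, so a small class with an `__iter__` method works where a one-shot generator would not. The sketch below is an assumption-laden illustration: the class name `TokenizedLines`, the one-document-per-line file layout, and the demo file are all hypothetical and not part of this notebook's pipeline.

```python
import os
import tempfile

class TokenizedLines:
    """Restartable iterable that yields one list of tokens per line of a text file.

    Hypothetical helper; assumes one whitespace-tokenized document per line.
    """
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be read repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens

# Tiny demo file standing in for a corpus written out one document per line.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_corpus.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("dog bites man\nman bites dog\n")

streamed = TokenizedLines(demo_path)
first_pass = list(streamed)
second_pass = list(streamed)
print(first_pass)
```

To use this idea with the wiki data, one would first write each article from `wiki.get_texts()` to a file as a space-joined line, then pass a `TokenizedLines` instance in place of `sentences`.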

### Hyperparameters


1.   sg - Selects the training algorithm: 1 for skip-gram, 0 for CBOW. The default is CBOW.
2.   min_count - Ignores all words with a total frequency lower than this.<br>
There are many more hyperparameters; the full list is in the official documentation [here](https://radimrehurek.com/gensim/models/word2vec.html).
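
To illustrate the effect of `min_count`, the sketch below counts word frequencies in the toy corpus from earlier in this notebook and keeps only the words that meet the threshold. The helper `filter_vocab` is a hypothetical stand-in for the vocabulary-building step, not Gensim's actual implementation.

```python
from collections import Counter

# Toy corpus from earlier in this notebook.
small_corpus = [["dog", "bites", "man"], ["man", "bites", "dog"],
                ["dog", "eats", "meat"], ["man", "eats", "food"]]

def filter_vocab(sentences, min_count):
    """Return {word: count} for words occurring at least min_count times.

    Hypothetical helper for illustration only.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: count for word, count in counts.items() if count >= min_count}

vocab_all = filter_vocab(small_corpus, min_count=1)
vocab_frequent = filter_vocab(small_corpus, min_count=2)

print(sorted(vocab_all))
print(sorted(vocab_frequent))
```

With `min_count=1` all six words are kept, which is why the toy models above use that setting. Raising it to 2 drops `meat` and `food`, which occur only once each. The wiki models below use `min_count=10` to discard rare words.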


In [16]:
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

CBOW Model Training Complete.
Time taken for training is:0.07 hrs 


In [17]:
#Summarize the loaded model
print(word2vec_cbow)
print("-"*30)

#Summarize vocabulary
words = list(word2vec_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(word2vec_cbow.wv['film'])}")
print(word2vec_cbow.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[ 2.99669266e-01 -2.30107337e-01 -2.83332348e+00  1.79103506e+00
 -2.36173677e+00  1.72570646e+00  1.64380169e+00 -1.66164470e+00
 -3.12124300e+00  3.81150126e-01  1.35081375e+00 -2.21470922e-01
  1.52899250e-01  1.70926428e+00  1.81681314e-03 -9.03216541e-01
 -4.65534389e-01  2.37153435e+00  3.28944230e+00 -4.95558918e-01
 -5.94740689e-01 -1.57181859e+00  4.11749452e-01  1.24216557e+00
 -4.07071066e+00 -2.27307582e+00 -1.66209900e+00 -2.24222684e+00
 -1.05932796e+00  1.05019844e+00 -1.72928619e+00 -4.00999689e+00
  1.46719

In [18]:
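# save the word vectors in the word2vec binary format
from gensim.models import Word2Vec, KeyedVectors
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

# load the word vectors (a file in this format holds only the vectors, not the full model)
# new_word2vec_cbow = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
# print(new_word2vec_cbow)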

In [19]:
#SkipGram
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

SkipGram Model Training Complete
Time taken for training is:0.20 hrs 


In [20]:
#Summarize the loaded model
print(word2vec_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(word2vec_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(word2vec_skipgram.wv['film'])}")
print(word2vec_skipgram.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-0.4237168  -0.27985552 -0.25927952  0.3400125  -0.6667841   0.5012484
 -0.05727514  0.04838533  0.29251036 -0.4952073  -0.11132739 -0.06375849
 -0.10024641 -0.42574143  0.41230604  0.24994987 -0.1766596   0.2103304
 -0.36034128 -0.07717225  0.3511364  -0.3076286  -0.3552336  -0.15659297
 -0.1312314  -0.05537846  0.02508126 -0.66197944  0.02561701 -0.49367282
  0.36917788 -0.5463784  -0.01422684 -0.03367303 -0.1291493  -0.07956319
 -0.6713077   0.46974298  0.02238307 -0.29419112 -0.18262397  0.01855985
  0.37098563  0.1552

In [21]:
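# save the word vectors in the word2vec binary format
word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)

# load the word vectors (a file in this format holds only the vectors, not the full model)
# new_word2vec_skipgram = KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True)
# print(new_word2vec_skipgram)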

## FastText

In [22]:
#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText CBOW Model Training Complete
Time taken for training is:0.23 hrs 


In [23]:
#Summarize the loaded model
print(fasttext_cbow)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(fasttext_cbow.wv['film'])}")
print(fasttext_cbow.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-0.7463835   0.19996189 -0.9979853   4.532985    0.9235125   2.189045
 -0.31808636  3.3484747   2.8904202   5.01124     0.67354566 -5.330121
  1.8081079  -0.06147059 -2.9210417   0.6564837   6.27072    -2.7427194
  3.8220415  -2.6083946  -0.57947993  1.2608125   3.7221222   3.707661
 -1.8702508  -2.0417852  -2.6469572  -1.7629712  -6.2687297  -0.79994273
 -2.7692544  -0.12231357 -0.887583    1.3647972  -0.46990368  2.2163184
  1.366861    4.930337   -3.1075075  -1.6002316   3.0699844   1.11122
 -6.8713775   1.8483716   0.4

In [24]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText SkipGram Model Training Complete
Time taken for training is:0.34 hrs 


In [25]:
#Summarize the loaded model
print(fasttext_skipgram)
print("-"*30)

#Summarize vocabulary
words = list(fasttext_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

# Access the vector for one word
print(f"Length of vector: {len(fasttext_skipgram.wv['film'])}")
print(fasttext_skipgram.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-0.34062257  0.20034881 -0.32438838  0.4757775   0.09694169  0.44387358
  0.374473    0.07984764  0.12626715  0.14559336  0.17070621 -0.1242009
 -0.02354645  0.2784064   0.01321098  0.10675827  0.5270286   0.18560082
  0.1947721  -0.09488335 -0.70429     0.03798589  0.36020663  0.4540548
  0.82588804 -0.03594406 -0.7010492   0.08276882 -0.3600402  -0.09103949
 -0.20812891 -0.0266632   0.18376034 -0.03438908  0.18739544 -0.54230857
  0.11515257 -0.76714796 -0.38413116  0.20995826  0.26379701 -0.14478317
 -0.2507331  -0.2386

#### An interesting observation is that CBOW trains faster than SkipGram in both cases.
We leave it to the reader to figure out why. As a hint, revisit how CBOW and skipgram each work.
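
For readers who want to check their reasoning, the sketch below counts how many prediction steps each architecture would take over the same text under a simplified model: CBOW makes one prediction per center word, while skip-gram makes one per (center, context) pair. The helper `count_prediction_steps`, the fixed window of 5, and the synthetic 20-token sentence are assumptions for illustration and do not reproduce Gensim's internals.

```python
def count_prediction_steps(sentences, window=5):
    """Return (cbow_steps, skipgram_steps) under a simplified fixed-window model.

    Hypothetical helper for illustration only.
    """
    cbow_steps = 0
    skipgram_steps = 0
    for sentence in sentences:
        n = len(sentence)
        for i in range(n):
            # Number of neighbors inside the window, excluding the center word.
            n_context = min(n, i + window + 1) - max(0, i - window) - 1
            if n_context > 0:
                cbow_steps += 1               # one prediction of the center word
                skipgram_steps += n_context   # one prediction per context word
    return cbow_steps, skipgram_steps

# Synthetic 20-token sentence; the tokens themselves do not matter here.
synthetic = [["w%d" % k for k in range(20)]]
cbow_steps, skipgram_steps = count_prediction_steps(synthetic, window=5)
print(cbow_steps, skipgram_steps)
```

Under these assumptions skip-gram performs several times as many prediction steps as CBOW on the same text, which is in line with the longer training times reported above.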