## Training Embeddings Using Gensim
##### Word embeddings are an approach to representing text in NLP. In this notebook we demonstrate how to train embeddings using Gensim. [Gensim](https://radimrehurek.com/gensim/index.html) is an open-source Python library for natural language processing, with a focus on topic modeling (explained in Chapter 7).

In [0]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')


In [0]:
# Define the training data.
# Gensim's Word2Vec expects a list of lists: the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)      # sg=0: CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1: skip-gram architecture



## Continuous Bag of Words (CBOW) 
##### In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
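To make the CBOW task concrete, here is a minimal sketch of the (context words, center word) examples that can be drawn from one tokenized sentence. The function name `cbow_examples` and its `window` argument are illustrative assumptions, not part of Gensim's API, and details of real training such as subsampling and negative sampling are omitted.

```python
def cbow_examples(sentence, window=2):
    """Yield (context_words, center_word) pairs for one tokenized sentence."""
    for i, center in enumerate(sentence):
        # Up to `window` tokens on each side of the center word form the context.
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip sentences that contain a single token
            yield context, center


for context, center in cbow_examples(['dog', 'bites', 'man'], window=1):
    print(context, '->', center)
# ['bites'] -> dog
# ['dog', 'man'] -> bites
# ['bites'] -> man
```

Each center word yields one prediction whose input is the whole context.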

In [3]:
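# Summarize the trained model
print(model_cbow)

# Summarize the vocabulary (Gensim 3.x; in Gensim 4.x use model_cbow.wv.key_to_index)
words = list(model_cbow.wv.vocab)
print(words)

# Access the vector for one word through the model's KeyedVectors (.wv)
print(model_cbow.wv['dog'])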


Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 4.67871130e-03  4.41844948e-03  1.11098308e-03  4.07150527e-03
  8.76159407e-04  4.25393833e-03 -4.97938413e-03  7.77888810e-04
  3.64920101e-03  3.54022253e-03  3.06759612e-03  4.41551115e-03
 -2.13391054e-03 -2.94187572e-03  4.54964954e-03 -6.14416786e-04
  4.92056180e-03  3.07814195e-03 -1.42575079e-03  3.88053618e-03
  3.01008997e-03 -3.33846710e-03 -1.26256561e-03 -9.83517166e-05
 -2.77526304e-03  1.94193004e-03  3.46393650e-03  2.50122650e-03
  1.29059562e-03  3.05426237e-03  2.23466159e-05 -1.49044390e-05
 -3.36047332e-03  4.53123264e-03  4.52359021e-03 -2.36581163e-05
  1.04024504e-04 -2.96281534e-03  4.14645439e-03 -2.10855160e-05
  3.47119896e-03  2.69659492e-03  8.16116866e-04  2.18500663e-03
 -3.57555761e-03 -3.45101487e-03 -2.20458256e-04  1.00601418e-03
  2.72019114e-03 -4.71354183e-03 -3.79113713e-03 -2.71429680e-03
  6.96644012e-04  3.51517531e-03 -9.30680661e-04  4.06164676e-03
 

In [4]:
# Compute cosine similarity between word vectors
print("Similarity between eats and bites:", model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_cbow.wv.similarity('eats', 'man'))


Similarity between eats and bites: -0.12645864
Similarity between eats and man: -0.042087413


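##### Cosine similarity ranges from -1 to 1, and a higher score means more similar. Here eats scores higher with man (-0.042) than with bites (-0.126), and both scores are close to zero, so no meaningful conclusion can be drawn: a four-sentence corpus is far too small to learn reliable embeddings.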
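The similarity scores printed above are cosine similarities between word vectors. The following is a minimal pure-Python sketch of that computation, with short made-up vectors rather than the trained embeddings; the helper name `cosine_similarity` is an illustrative assumption, not a Gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

Because the score depends only on the angle between the vectors, values near zero indicate that the two words are essentially unrelated according to the model.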

In [5]:
# Most similar words
model_cbow.wv.most_similar('meat')

[('eats', 0.25278109312057495),
 ('dog', 0.16689801216125488),
 ('food', 0.109287329018116),
 ('bites', -0.014241887256503105),
 ('man', -0.077211394906044)]

In [6]:
# Save the model
model_cbow.save('model_cbow.bin')

# Load the model from the file saved above
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, size=100, alpha=0.025)


## SkipGram
###### In SkipGram, the task is to predict the context words from the center word.
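As a minimal sketch of this task, the function below lists the (center word, context word) training pairs that can be drawn from one tokenized sentence, assuming a symmetric window. The name `skipgram_pairs` and the `window` argument are illustrative assumptions, not part of Gensim's API, and details of real training such as subsampling and negative sampling are omitted.

```python
def skipgram_pairs(sentence, window=2):
    """Yield (center_word, context_word) pairs for one tokenized sentence."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield center, sentence[j]


for center, context in skipgram_pairs(['dog', 'bites', 'man'], window=1):
    print(center, '->', context)
# dog -> bites
# bites -> dog
# bites -> man
# man -> bites
```

Each center word yields one prediction per context word.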

In [7]:
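# Summarize the trained model
print(model_skipgram)

# Summarize the vocabulary (Gensim 3.x; in Gensim 4.x use model_skipgram.wv.key_to_index)
words = list(model_skipgram.wv.vocab)
print(words)

# Access the vector for one word through the model's KeyedVectors (.wv)
print(model_skipgram.wv['dog'])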


Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 4.67871130e-03  4.41844948e-03  1.11098308e-03  4.07150527e-03
  8.76159407e-04  4.25393833e-03 -4.97938413e-03  7.77888810e-04
  3.64920101e-03  3.54022253e-03  3.06759612e-03  4.41551115e-03
 -2.13391054e-03 -2.94187572e-03  4.54964954e-03 -6.14416786e-04
  4.92056180e-03  3.07814195e-03 -1.42575079e-03  3.88053618e-03
  3.01008997e-03 -3.33846710e-03 -1.26256561e-03 -9.83517166e-05
 -2.77526304e-03  1.94193004e-03  3.46393650e-03  2.50122650e-03
  1.29059562e-03  3.05426237e-03  2.23466159e-05 -1.49044390e-05
 -3.36047332e-03  4.53123264e-03  4.52359021e-03 -2.36581163e-05
  1.04024504e-04 -2.96281534e-03  4.14645439e-03 -2.10855160e-05
  3.47119896e-03  2.69659492e-03  8.16116866e-04  2.18500663e-03
 -3.57555761e-03 -3.45101487e-03 -2.20458256e-04  1.00601418e-03
  2.72019114e-03 -4.71354183e-03 -3.79113713e-03 -2.71429680e-03
  6.96644012e-04  3.51517531e-03 -9.30680661e-04  4.06164676e-03
 

In [8]:
# Compute cosine similarity between word vectors
print("Similarity between eats and bites:", model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_skipgram.wv.similarity('eats', 'man'))


Similarity between eats and bites: -0.12645988
Similarity between eats and man: -0.042082705


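##### As with CBOW, eats scores higher with man (-0.042) than with bites (-0.126). Both scores are close to zero, so the toy corpus is too small to support a conclusion about which word is more similar.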

In [9]:
# Most similar words
model_skipgram.wv.most_similar('meat')

[('eats', 0.25271496176719666),
 ('dog', 0.16689801216125488),
 ('food', 0.1092873215675354),
 ('bites', -0.014241892844438553),
 ('man', -0.077211394906044)]

In [0]:
# Save the model
model_skipgram.save('model_skipgram.bin')

# Load the model from the file saved above and print the loaded copy
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus

##### The corpus download page: https://dumps.wikimedia.org/enwiki/20200120/
The entire English Wikipedia corpus as of 28/04/2020 is just over 16GB in size.
Because of computational constraints, we will use only one part of this corpus to train our word2vec and fastText embeddings.


In [12]:
!mkdir -p data/en/
!wget -P data/en/ https://dumps.wikimedia.org/enwiki/20200120/enwiki-20200120-pages-articles-multistream14.xml-p6197599p7697599.bz2 

--2020-04-27 18:05:07--  https://dumps.wikimedia.org/enwiki/20200120/enwiki-20200120-pages-articles-multistream14.xml-p6197599p7697599.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433386883 (413M) [application/octet-stream]
Saving to: ‘data/en/enwiki-20200120-pages-articles-multistream14.xml-p6197599p7697599.bz2’


2020-04-27 18:06:39 (4.53 MB/s) - ‘data/en/enwiki-20200120-pages-articles-multistream14.xml-p6197599p7697599.bz2’ saved [433386883/433386883]



In [0]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [0]:
#Preparing the Training data
wiki = WikiCorpus('data/en/enwiki-20200120-pages-articles-multistream14.xml-p6197599p7697599.bz2', 
                  lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())


### Hyperparameters


1.   sg - Selects the training algorithm: 1 for skip-gram, 0 for CBOW. The default is CBOW.
2.   min_count - Ignores all words with a total frequency lower than this value.
##### There are many more hyperparameters; the full list can be found in the official documentation [here](https://radimrehurek.com/gensim/models/word2vec.html).
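To illustrate the effect of `min_count`, the sketch below counts token frequencies in the toy corpus used earlier in this notebook and keeps only the words that reach the threshold. It mimics the documented behavior of the parameter in plain Python; the helper `filter_by_min_count` is a hypothetical name and is not how Gensim implements vocabulary pruning internally.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Return the set of tokens whose total corpus frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


toy_corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
              ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

print(sorted(filter_by_min_count(toy_corpus, min_count=1)))
# ['bites', 'dog', 'eats', 'food', 'man', 'meat']
print(sorted(filter_by_min_count(toy_corpus, min_count=2)))
# ['bites', 'dog', 'eats', 'man']
```

With `min_count=2`, the words meat and food (one occurrence each) drop out of the vocabulary, which is why the toy models above were trained with `min_count=1`.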


In [18]:
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


CBOW Model Training Complete.
Time taken for training is:0.18 hrs 


In [14]:
# Summarize the trained model
print(word2vec_cbow)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x; in Gensim 4.x use word2vec_cbow.wv.key_to_index)
words = list(word2vec_cbow.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word through the model's KeyedVectors (.wv)
print(word2vec_cbow.wv['film'])
print("-"*30)

# Compute cosine similarity between word vectors
print("Similarity between film and drama:", word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)


Word2Vec(vocab=161018, size=100, alpha=0.025)
------------------------------
------------------------------
[-1.3190242   1.1493553  -2.1161873  -0.54641986 -0.41654417 -2.575438
 -2.5161462   0.40805933 -2.1835763  -2.2166464   0.21488218  2.6489248
  2.1158683  -0.12322527  2.6342719   2.647537   -1.8351669  -1.0477378
 -0.72323066 -2.5107725  -0.8130578   2.2074096   1.0898734   3.5470395
  4.4321404   3.3782022   2.5920029   0.3260313   0.5037524  -0.43571192
  0.8108539  -2.0022514   1.314795   -3.6556513   3.982691   -0.09781197
 -0.7801926  -1.0659611  -2.1272135   1.6751018  -3.382622   -1.2983025
 -4.414968    1.1984603   2.6722977  -1.6703546   2.2960882  -1.2238866
  0.10023585  2.279665    0.93043685  1.4843141  -2.1319714  -1.1646842
 -0.61533064  0.9556172   3.0366893  -0.8903226   3.6449335  -1.0054256
 -0.168811   -1.1724004  -0.04154419 -1.9287016  -0.17199759 -0.6142712
 -0.7684636  -0.7188736   1.1901233  -0.8688696  -1.3473594  -0.6295633
 -1.1486841  -2.0717368   1

In [0]:
# Save the model
word2vec_cbow.save('word2vec_cbow.bin')

# # Load the model
# new_word2vec_cbow = Word2Vec.load('word2vec_cbow.bin')
# print(new_word2vec_cbow)

In [16]:
#SkipGram
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


SkipGram Model Training Complete
Time taken for training is:0.61 hrs 


In [19]:
# Summarize the trained model
print(word2vec_skipgram)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x; in Gensim 4.x use word2vec_skipgram.wv.key_to_index)
words = list(word2vec_skipgram.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word through the model's KeyedVectors (.wv)
print(word2vec_skipgram.wv['film'])
print("-"*30)

# Compute cosine similarity between word vectors
print("Similarity between film and drama:", word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)


Word2Vec(vocab=161018, size=100, alpha=0.025)
------------------------------
------------------------------
[-0.35990748  0.20537794 -0.08471059  0.3329436  -0.0092362  -0.07147974
 -0.30981717 -0.21653889  0.34842396 -0.28996146  0.45289078  0.64880157
 -0.14537139 -0.6593135   0.46777877  0.01224853 -0.30658954 -0.2638053
  0.09966122 -0.37593904  0.16245331 -0.10233496  0.5540646   0.27518794
  0.25898558  0.2793495  -0.2394661  -0.43676636 -0.43892923  0.04013366
 -0.38420838 -0.8375879  -0.10474149 -0.42934766 -0.07724728  0.32263622
 -0.08061744 -0.3264137  -0.13703062  0.40950722  0.6043124   0.29718038
 -0.24458958 -0.3338308   0.00118772 -1.0104022  -0.26405972  0.15964566
 -0.29186898 -0.21630155  0.32755786  0.24951488 -0.29366866  0.15838994
  0.45849612  0.2894191   0.6303711  -0.26540717 -0.24943331 -0.42347214
 -0.4279108   0.24273829 -0.14081118 -0.4153366  -0.3691591  -0.08119843
 -0.28623915  0.1835074  -0.19500887 -0.20063516  0.34779832  0.31729746
 -0.13560808 -0.2

In [0]:
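# Save the model under its own filename so that the toy model
# saved earlier as 'model_skipgram.bin' is not overwritten
word2vec_skipgram.save('word2vec_skipgram.bin')

# # Load the model
# new_word2vec_skipgram = Word2Vec.load('word2vec_skipgram.bin')
# print(new_word2vec_skipgram)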

## FastText

In [15]:

#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


FastText CBOW Model Training Complete
Time taken for training is:0.39 hrs 


In [16]:
# Summarize the trained model
print(fasttext_cbow)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x; in Gensim 4.x use fasttext_cbow.wv.key_to_index)
words = list(fasttext_cbow.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word through the model's KeyedVectors (.wv)
print(fasttext_cbow.wv['film'])
print("-"*30)

# Compute cosine similarity between word vectors
print("Similarity between film and drama:", fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)


FastText(vocab=161018, size=100, alpha=0.025)
------------------------------
------------------------------
[-3.4158628   3.8855803   0.535912   -5.182645   -0.6096103   6.1879306
  1.9820771  -1.4699645   2.1031582  -1.8087566  -0.6309119  -1.205867
  1.5846244   0.22744241 -2.1966136   2.0511622  -0.6773196  -0.4715227
  3.9407995   6.199335   -2.6367686   1.1709683   2.7057931   4.855923
 -5.096699    7.433429    7.8346696   0.98290753  5.292873   -4.175929
 -4.130687    0.9335608  -5.3310313  -1.5800712   2.984793    0.28918087
  1.4197284  -0.89113504  1.6581714   1.1043363  -0.3220185   2.6870852
 -0.6005217  -2.289015    4.6048236  -0.65780896  2.1253297  -2.1278186
  0.41051725 -0.8623372   3.4963434   3.8041396  -1.9575641  -0.8581801
 -3.1491356  -2.6680999  -0.27547327 -0.25134414 -3.7401705  -0.40602767
 -4.5328755   1.3974704   5.138355   -0.500581   -3.237352    5.123986
 -1.2240024  -1.6047837  -1.6459205  -0.77467674  0.5509822   1.6679264
  4.729095    0.62219006 -2.73

In [17]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))


FastText SkipGram Model Training Complete
Time taken for training is:0.65 hrs 


In [18]:
# Summarize the trained model
print(fasttext_skipgram)
print("-"*30)

# Summarize the vocabulary (Gensim 3.x; in Gensim 4.x use fasttext_skipgram.wv.key_to_index)
words = list(fasttext_skipgram.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word through the model's KeyedVectors (.wv)
print(fasttext_skipgram.wv['film'])
print("-"*30)

# Compute cosine similarity between word vectors
print("Similarity between film and drama:", fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)


FastText(vocab=161018, size=100, alpha=0.025)
------------------------------
------------------------------
[-4.01990078e-02  3.53444159e-01  2.54945934e-01  4.80403066e-01
 -3.21140260e-01  5.59775710e-01 -4.67194825e-01 -1.73437566e-01
 -1.64864406e-01 -2.49177516e-01 -4.84021157e-01  2.12465003e-01
 -2.15262547e-01  7.43608400e-02 -5.25858462e-01 -4.52829629e-01
  6.79721832e-02 -6.40648901e-02 -4.02468592e-01 -2.06037983e-01
 -4.81559843e-01 -3.20515335e-01  1.60730317e-01  5.23487292e-03
 -3.07535052e-01  7.72237599e-01 -5.03421545e-01  4.23307449e-01
 -6.49608374e-01 -1.15924791e-01 -6.47014454e-02 -3.46179813e-01
 -7.23404825e-01  1.36679158e-01  4.55667861e-02 -4.77901548e-01
  3.03246289e-01  3.38047385e-01  8.01058710e-02  1.11218736e-01
 -1.68238163e-01  2.86948115e-01 -1.24533847e-01  1.34248048e-01
  1.36137992e-01  8.41890052e-02  1.00599341e-01  4.17892247e-01
  2.27972612e-01  5.28719008e-01  9.70892459e-02  4.02288616e-01
 -1.97849020e-01 -5.34242280e-02 -3.59556358e-0

#### An interesting observation is that CBOW trains faster than SkipGram in both cases.
We leave it to you to figure out why. As a hint, refer to how CBOW and SkipGram work.