## Training Embeddings Using Gensim
##### Word embeddings are an approach to representing text in NLP. In this notebook we demonstrate how to train embeddings using Gensim. [Gensim](https://radimrehurek.com/gensim/index.html) is an open source Python library for natural language processing, with a focus on topic modeling (explained in chapter 7).

In [0]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')


In [0]:
# Define the training data.
# Gensim's Word2Vec expects a 'list of lists': the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)  # sg=0 selects the CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1 selects the SkipGram architecture
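
The 'list of lists' input shape used for the training corpus can be produced from raw text with a small tokenization step. The sketch below is only an illustration: `raw_documents` and `tokenize` are hypothetical names, and lowercasing plus whitespace splitting is a deliberately naive stand-in for a real tokenizer.

```python
# Hypothetical raw documents; in practice these would come from your own data.
raw_documents = [
    "Dog bites man",
    "Man bites dog",
    "Dog eats meat",
    "Man eats food",
]

def tokenize(text):
    """Lowercase the text and split on whitespace (a naive tokenizer, assumed for illustration)."""
    return text.lower().split()

# One inner list of tokens per document: the 'list of lists' shape.
tokenized_corpus = [tokenize(doc) for doc in raw_documents]
print(tokenized_corpus)
```

The resulting `tokenized_corpus` has the same shape as the hand-written `corpus` and could be passed to `Word2Vec` in its place.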



## Continuous Bag of Words (CBOW) 
##### In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
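
To make the prediction task concrete, the following sketch lists the (context words, center word) examples that CBOW would learn from for one sentence. It is a simplified illustration, not Gensim's internal code: `cbow_pairs` is a hypothetical helper, and the window size of 1 is chosen only to keep the output short.

```python
def cbow_pairs(sentence, window=1):
    """Return (context_words, center_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        # Up to `window` tokens on each side of the center word form the context.
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        pairs.append((left + right, center))
    return pairs

for context, center in cbow_pairs(['dog', 'bites', 'man'], window=1):
    print(context, '->', center)
```

For the middle word, the context `['dog', 'man']` is used to predict `bites`; words at the sentence edges have context on one side only.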

In [29]:
# Summarize the trained model
print(model_cbow)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use model_cbow.wv.key_to_index)
words = list(model_cbow.wv.vocab)
print(words)

# Access the vector for one word
print(model_cbow.wv['dog'])


Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 1.8636956e-03  2.4535074e-03  4.3623373e-03 -3.1415620e-03
 -3.8634820e-03  3.2356704e-05 -3.7326266e-03  2.3274277e-03
 -8.7320386e-04  2.9599380e-03  2.7632974e-03 -4.2401198e-03
  1.5633155e-03 -8.4691070e-04 -1.1048439e-03 -1.4907877e-03
  2.8470224e-03  1.1687911e-03 -8.2325342e-04  4.2131748e-03
  3.5842666e-03 -4.8583136e-03 -3.7786432e-03  1.8674354e-03
 -1.3572449e-03 -2.8515710e-03 -3.1879877e-03  1.2325050e-03
  1.9252286e-03 -2.1336856e-03  5.1382539e-04 -3.5386824e-03
  3.3612479e-03 -4.2027882e-03 -1.3216434e-05  4.1396907e-03
 -2.0779634e-03  1.3280836e-03  2.7333535e-03  2.6307677e-04
  2.3499376e-04  1.9200939e-03 -3.8520133e-03 -2.3576233e-03
 -4.4420473e-03  2.2383786e-03 -4.6484326e-03 -2.7595456e-03
  4.0953648e-03 -1.6580598e-04 -3.6061259e-03 -3.1529986e-03
  3.1407124e-03  1.9609181e-03 -1.1433997e-03 -2.9797696e-03
 -1.8910962e-03 -1.1299493e-04 -1.3544862e-03  3.5765002e

In [30]:
# Compute similarity
print("Similarity between eats and bites:", model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_cbow.wv.similarity('eats', 'man'))


Similarity between eats and bites: 0.113624044
Similarity between eats and man: 0.05556522


##### The similarity scores above suggest that `eats` is closer to `bites` than to `man`, though with such a tiny corpus both similarities are weak.
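
The similarity score reported by Gensim is the cosine similarity between the two word vectors. As a minimal sketch of that computation, the snippet below implements it in plain Python; `cosine_similarity` and the toy vectors are hypothetical and stand in for the 100-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, chosen only for illustration.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: unrelated directions
```

Scores near 1 mean the vectors point in nearly the same direction, scores near 0 mean they are unrelated, and negative scores mean they point in opposing directions.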

In [39]:
# Most similar words
model_cbow.wv.most_similar('meat')

[('eats', 0.12531715631484985),
 ('food', 0.048376552760601044),
 ('bites', -0.015146855264902115),
 ('man', -0.058892399072647095),
 ('dog', -0.14050249755382538)]

In [40]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, size=100, alpha=0.025)


## SkipGram
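
SkipGram reverses the CBOW setup: each center word is used to predict the words around it. The sketch below lists the (center word, context word) pairs for one sentence. As with any simplified illustration, it is not Gensim's internal code: `skipgram_pairs` is a hypothetical helper, and the window size of 1 is an assumption made to keep the output short.

```python
def skipgram_pairs(sentence, window=1):
    """Return (center_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the center word itself
                pairs.append((center, sentence[j]))
    return pairs

for center, context in skipgram_pairs(['dog', 'bites', 'man'], window=1):
    print(center, '->', context)
```

Each pair is one training example, so a single center word with several neighbours contributes several examples.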

In [41]:
# Summarize the trained model
print(model_skipgram)

# Summarize the vocabulary (Gensim 3.x API; in Gensim 4.x use model_skipgram.wv.key_to_index)
words = list(model_skipgram.wv.vocab)
print(words)

# Access the vector for one word
print(model_skipgram.wv['dog'])


Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 1.8636956e-03  2.4535074e-03  4.3623373e-03 -3.1415620e-03
 -3.8634820e-03  3.2356704e-05 -3.7326266e-03  2.3274277e-03
 -8.7320386e-04  2.9599380e-03  2.7632974e-03 -4.2401198e-03
  1.5633155e-03 -8.4691070e-04 -1.1048439e-03 -1.4907877e-03
  2.8470224e-03  1.1687911e-03 -8.2325342e-04  4.2131748e-03
  3.5842666e-03 -4.8583136e-03 -3.7786432e-03  1.8674354e-03
 -1.3572449e-03 -2.8515710e-03 -3.1879877e-03  1.2325050e-03
  1.9252286e-03 -2.1336856e-03  5.1382539e-04 -3.5386824e-03
  3.3612479e-03 -4.2027882e-03 -1.3216434e-05  4.1396907e-03
 -2.0779634e-03  1.3280836e-03  2.7333535e-03  2.6307677e-04
  2.3499376e-04  1.9200939e-03 -3.8520133e-03 -2.3576233e-03
 -4.4420473e-03  2.2383786e-03 -4.6484326e-03 -2.7595456e-03
  4.0953648e-03 -1.6580598e-04 -3.6061259e-03 -3.1529986e-03
  3.1407124e-03  1.9609181e-03 -1.1433997e-03 -2.9797696e-03
 -1.8910962e-03 -1.1299493e-04 -1.3544862e-03  3.5765002e

In [42]:
# Compute similarity
print("Similarity between eats and bites:", model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_skipgram.wv.similarity('eats', 'man'))


Similarity between eats and bites: 0.11362622
Similarity between eats and man: 0.055570077


##### The similarity scores above suggest that `eats` is closer to `bites` than to `man`, though with such a tiny corpus both similarities are weak.

In [43]:
# Most similar words
model_skipgram.wv.most_similar('meat')

[('eats', 0.12524451315402985),
 ('food', 0.04837654158473015),
 ('bites', -0.015146857127547264),
 ('man', -0.058892399072647095),
 ('dog', -0.14050249755382538)]

In [44]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)
