# Word Embeddings

Word embeddings are a class of approaches for representing words and documents using a
dense vector representation. They are an improvement over the more traditional bag-of-words
encoding schemes, where large sparse vectors were used to represent each word or to score each
word within a vector spanning the entire vocabulary. These representations were sparse
because the vocabularies were vast, so a given word or document was represented by a
large vector composed mostly of zero values.
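
To make the sparsity concrete, here is a minimal sketch of a bag-of-words count vector in plain Python. The two toy documents and the helper name `bag_of_words` are assumptions made for illustration; with a realistic vocabulary of tens of thousands of words, nearly every entry would be zero.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy documents (hypothetical, chosen only for illustration).
docs = [
    ["this", "is", "the", "first", "sentence"],
    ["yet", "another", "sentence"],
]

# The vocabulary is every distinct word seen in any document.
vocabulary = sorted({word for doc in docs for word in doc})

for doc in docs:
    vector = bag_of_words(doc, vocabulary)
    print(vector, f"({vector.count(0)} of {len(vector)} entries are zero)")
```

Each vector is as long as the vocabulary, regardless of how short the document is, which is what a dense embedding avoids.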

Instead, in an embedding, words are represented by dense vectors where a vector represents
the projection of the word into a continuous vector space. The position of a word within the
vector space is learned from text and is based on the words that surround the word when it is
used. The position of a word in the learned vector space is referred to as its embedding. Two
popular methods of learning word embeddings from text are:
+ Word2Vec.
+ GloVe.

In addition to these carefully designed methods, a word embedding can be learned as part
of a deep learning model. This can be a slower approach, but tailors the model to a specific
training dataset.
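
The idea that a word's position is learned from "the words that surround the word" can be sketched as a sliding window over a sentence. The helper `context_pairs` below is a hypothetical name, and the sketch covers only the pair-extraction step: real Word2Vec training adds details such as negative sampling and frequent-word subsampling that are left out here.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["yet", "another", "sentence"]
for center, context in context_pairs(sentence, window=1):
    print(center, "->", context)
```

Skip-gram training (`sg=1` in Gensim) predicts each context word from the center word, while CBOW (`sg=0`) predicts the center word from its combined context words. Both consume the same windows.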

In [1]:
# !pip install gensim

In [3]:
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
        ['this', 'is', 'the', 'second', 'sentence'],
        ['yet', 'another', 'sentence'],
        ['one', 'more', 'sentence', 'love'],
        ['and', 'the', 'final', 'sentence', 'solve']]

In [5]:
# train a CBOW model (sg=0); sg=1 selects skip-gram instead
cbow_model = Word2Vec(sentences, vector_size=10, window=3, min_count=1, sg=0)

In [6]:
# summarize the trained model
print(cbow_model)

Word2Vec<vocab=16, vector_size=10, alpha=0.025>


In [7]:
list(cbow_model.wv.key_to_index.keys())   # this is the vocab

['sentence',
 'the',
 'is',
 'this',
 'solve',
 'final',
 'and',
 'love',
 'more',
 'one',
 'another',
 'yet',
 'second',
 'word2vec',
 'for',
 'first']

In [8]:
# access vector for one word
cbow_model.wv.get_vector('love')

array([ 0.05455598,  0.08345091, -0.0145442 , -0.09208831,  0.04371774,
        0.00572208,  0.07440059, -0.00813585, -0.0263755 , -0.08752632],
      dtype=float32)

In [9]:
for key in cbow_model.wv.key_to_index.keys():
    print(key, ':', cbow_model.wv.get_vector(key))

sentence : [-0.00536351  0.00238484  0.05107331  0.09015599 -0.09308276 -0.0711995
  0.06464671  0.08974326 -0.0501915  -0.03765175]
the : [ 0.07379383 -0.01529509 -0.04534442  0.06555367 -0.04861008 -0.0181848
  0.02882639  0.00993654 -0.08292154 -0.0944667 ]
is : [ 0.07311766  0.05070262  0.06757693  0.00762866  0.06350891 -0.03405366
 -0.00946401  0.05768573 -0.07521638 -0.03936104]
this : [-0.07512096 -0.00929068  0.09539422 -0.07316343 -0.02336625 -0.01939589
  0.08080077 -0.0592867   0.00042713 -0.04753667]
solve : [-0.09603605  0.05007694 -0.08758304 -0.04394896 -0.00034404 -0.00295622
 -0.07661133  0.09616364  0.04980589  0.09235525]
final : [-0.08158192  0.04498189 -0.04134833  0.00827747  0.08496136 -0.04464175
  0.04521902 -0.06785722 -0.03552099  0.09398862]
and : [-0.0157806   0.00323172 -0.04137019 -0.0768177  -0.01509309  0.02468751
 -0.00885536  0.05536246 -0.02745937  0.02261946]
love : [ 0.05455598  0.08345091 -0.0145442  -0.09208831  0.04371774  0.00572208
  0.074400

In [10]:
cbow_model.wv.get_vector('analytics')

KeyError: "Key 'analytics' not present"
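
A word that never appeared in the training sentences has no vector, hence the `KeyError`. One way to guard against it is to test membership first. The sketch below uses a plain dict as a stand-in for `cbow_model.wv` so that it runs without a trained model; `safe_vector` is a hypothetical helper, and it relies only on the `in` and `[]` operations, which Gensim's `KeyedVectors` also supports.

```python
def safe_vector(vectors, word, default=None):
    """Return the vector for `word`, or `default` if the word is unknown."""
    return vectors[word] if word in vectors else default

# A dict standing in for `cbow_model.wv` (made-up 2-d values, illustration only).
toy_vectors = {"love": [0.05, 0.08], "solve": [-0.09, 0.05]}

print(safe_vector(toy_vectors, "love"))       # known word: its vector
print(safe_vector(toy_vectors, "analytics"))  # unknown word: the default
```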

In [11]:
# save model
cbow_model.save('model.bin')

In [12]:
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Word2Vec<vocab=16, vector_size=10, alpha=0.025>


In [13]:
# the reloaded model returns the same vector as the original
new_model.wv.get_vector('love')

array([ 0.05455598,  0.08345091, -0.0145442 , -0.09208831,  0.04371774,
        0.00572208,  0.07440059, -0.00813585, -0.0263755 , -0.08752632],
      dtype=float32)

In [14]:
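# train a skip-gram model (sg=1) on the same sentences
sg_model = Word2Vec(sentences, vector_size=10, window=3, min_count=1, sg=1)

# access vector for one word; on this tiny corpus it is nearly identical to the CBOW vector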
sg_model.wv.get_vector('love')

array([ 0.05455508,  0.08348128, -0.01442463, -0.09193361,  0.04362334,
        0.00568476,  0.07447571, -0.00811199, -0.02645334, -0.0874837 ],
      dtype=float32)


### Some computations using Word Embeddings

In [15]:
from gensim.models import KeyedVectors
# load the google word2vec model
path = r'D:\OneDrive\Google Drive Files\Training\1 MASTER\NLP Master\Word Embedding\WV -1'
filename = path + r'\GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

In [18]:
model.get_vector('king')

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [19]:
len(model.get_vector('king'))

300

In [20]:
model.get_vector('queen')

array([ 0.00524902, -0.14355469, -0.06933594,  0.12353516,  0.13183594,
       -0.08886719, -0.07128906, -0.21679688, -0.19726562,  0.05566406,
       -0.07568359, -0.38085938,  0.10400391, -0.00081635,  0.1328125 ,
        0.11279297,  0.07275391, -0.046875  ,  0.06591797,  0.09423828,
        0.19042969,  0.13671875, -0.23632812, -0.11865234,  0.06542969,
       -0.05322266, -0.30859375,  0.09179688,  0.18847656, -0.16699219,
       -0.15625   , -0.13085938, -0.08251953,  0.21289062, -0.35546875,
       -0.13183594,  0.09619141,  0.26367188, -0.09472656,  0.18359375,
        0.10693359, -0.41601562,  0.26953125, -0.02770996,  0.17578125,
       -0.11279297, -0.00411987,  0.14550781,  0.15625   ,  0.26757812,
       -0.01794434,  0.09863281,  0.05297852, -0.03125   , -0.16308594,
       -0.05810547, -0.34375   , -0.17285156,  0.11425781, -0.09033203,
        0.13476562,  0.27929688, -0.04980469,  0.12988281,  0.17578125,
       -0.22167969, -0.01190186,  0.140625  , -0.18164062,  0.11

In [21]:
# What is the female equivalent of 'king'?
a = model.get_vector('king') + model.get_vector('woman') - model.get_vector('man')
model.cosine_similarities(a, [model.get_vector('queen')])

array([0.7300517], dtype=float32)

In [22]:
# calculate: (king - man) + woman = ?  (Queen)
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)
print(result)

[('queen', 0.7118193507194519), ('monarch', 0.6189674735069275), ('princess', 0.5902431011199951)]
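
The manual sum scores `queen` at 0.730, while `most_similar` reports 0.712. The gap is consistent with `most_similar` rescaling each input vector to unit length before combining them, whereas the manual sum uses the raw vectors. The sketch below reproduces that effect in plain Python. The 2-dimensional vectors are made-up values for illustration, not taken from the Google News model, and the helper names are hypothetical.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in v))

def unit(v):
    """Rescale a vector to length 1."""
    n = norm(v)
    return [a / n for a in v]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def analogy_vector(a, b, c):
    """Return a + b - c, component by component."""
    return [x + y - z for x, y, z in zip(a, b, c)]

# Made-up 2-d "embeddings" (illustration only).
king, man, woman, queen = [3.0, 1.0], [1.0, 0.2], [0.2, 1.0], [2.0, 2.0]

raw = analogy_vector(king, woman, man)
normed = analogy_vector(unit(king), unit(woman), unit(man))

print("raw sum:        ", round(cosine(raw, queen), 4))
print("unit-length sum:", round(cosine(normed, queen), 4))
```

The two scores differ because long vectors dominate a raw sum, while unit-length vectors contribute equally.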


In [23]:
# Who is the God of Cricket in India ?
result = model.most_similar(positive=['God', 'cricket', 'India'], negative=[], topn=3)
print(result)

[('cricketing', 0.657913863658905), ('Sachin', 0.6445838809013367), ('cricketers', 0.6345413327217102)]


In [24]:
# what is the female equivalent of the word "man"
result = model.most_similar(positive=['man', 'female'], negative=['male'], topn=3)
print(result)

[('woman', 0.7685461640357971), ('teenage_girl', 0.5872832536697388), ('lady', 0.5742953419685364)]


In [25]:
# Checking how similarity works.
print(model.similarity('strawberry', 'mango'))

0.63029367


In [26]:
print(model.similarity('novel', 'book'))

0.6121936


In [27]:
print(model.similarity('novel', 'mango'))

0.06800091


In [29]:
# Finding the odd one out.
model.doesnt_match('breakfast cereal dinner lunch'.split())
# model.doesnt_match('mango apple banana rose'.split())

'cereal'
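
One way to understand the odd-one-out result: average the unit-length vectors of all the words, then pick the word whose vector is least similar to that average. The sketch below follows that idea in plain Python. It is an approximation of what `doesnt_match` is understood to do, not Gensim's actual code, and the 2-dimensional vectors are made-up values for illustration.

```python
import math

def unit(v):
    """Rescale a vector to length 1."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # For unit vectors the dot product equals the cosine similarity.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up 2-d "embeddings" (illustration only): three meals point one way,
# 'cereal' points another.
toy = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.3],
    "cereal": [0.2, 1.0],
}
print(odd_one_out(toy))
```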

In [40]:
result = model.most_similar(positive=['dog', 'newborn'], topn=3)
print(result)

[('puppy', 0.793572187423706), ('pup', 0.7656400799751282), ('kitten', 0.730669379234314)]


## Using Stanford’s GloVe Embedding

Stanford researchers developed their own word embedding algorithm, similar in purpose to
Word2Vec, called `Global Vectors` for Word Representation, or `GloVe` for short.

You can download the GloVe pre-trained word vectors and load them easily with `Gensim`. The first step is to convert the GloVe file format to the Word2Vec file format. The only difference is the addition of a small header line, which can be added by calling the `glove2word2vec()` function. Once converted, the file can be loaded just like the Word2Vec file above.

You can download the smallest GloVe pre-trained model from the GloVe
website. It is an 822-megabyte zip file with four different models (50-, 100-, 200-, and
300-dimensional vectors) trained on Wikipedia 2014 and Gigaword 5 data with 6 billion tokens
and a 400,000-word vocabulary. The direct download link is http://nlp.stanford.edu/data/glove.6B.zip
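
The "small header line" is simply the number of vectors followed by their dimensionality. The sketch below adds such a header in plain Python, using a tiny temporary file rather than the real download. The helper `add_word2vec_header` and the toy file contents are assumptions for illustration; the sketch reads the whole file into memory, which is fine for a demo but wasteful for the full GloVe files.

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<count> <dimensions>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = src.readlines()
    dim = len(lines[0].split()) - 1  # the first field on each line is the word
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        dst.writelines(lines)
    return len(lines), dim

# Build a two-word GloVe-style file (made-up values) and convert it.
with tempfile.TemporaryDirectory() as tmp:
    glove_file = os.path.join(tmp, "toy_glove.txt")
    w2v_file = os.path.join(tmp, "toy_word2vec.txt")
    with open(glove_file, "w", encoding="utf-8") as f:
        f.write("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
    shape = add_word2vec_header(glove_file, w2v_file)
    with open(w2v_file, encoding="utf-8") as f:
        header = f.readline().strip()

print(shape, header)
```

In Gensim 4.x it should also be possible to skip the conversion and load the GloVe text file directly with `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`.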

In [39]:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# convert glove to word2vec format
path = r'D:\OneDrive\Google Drive Files\Training\1 MASTER\NLP new\Word Embeddings'

glove_input_file = path + r'\glove.6B.100d.txt'  # raw string avoids the invalid '\g' escape
word2vec_output_file = 'word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)


(400000, 100)

In [36]:
# load the converted model
filename = 'word2vec.txt'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

You now have a copy of the `GloVe` model in `Word2Vec` format with the filename
`word2vec.txt`. With it loaded, we can perform the same `(king - man) + woman = ?` test as in the previous section.

In [37]:
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

[('queen', 0.7698541283607483)]


### Further Reading:

#### Word Embeddings
+ Word Embedding on Wikipedia. https://en.wikipedia.org/wiki/Word_embedding
+ Word2Vec on Wikipedia. https://en.wikipedia.org/wiki/Word2vec
+ Google Word2Vec project. https://code.google.com/archive/p/word2vec/
+ Stanford GloVe project. https://nlp.stanford.edu/projects/glove/

#### Articles
+ Messing Around With Word2Vec, 2016. https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/
+ Vector Space Models for the Digital Humanities, 2015. http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html
+ Gensim Word2Vec Tutorial, 2014. https://rare-technologies.com/word2vec-tutorial/

# Facebook's FastText

`fastText` is an extension of `word2vec`. `word2vec` treats each word as an atomic unit when building its representation, whereas `fastText` also uses the character n-grams inside each word. As a result, it can produce a vector even for a word that never appeared in the training data.
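
The character n-grams can be sketched in a few lines of plain Python. The helper `char_ngrams` is a hypothetical name, and the `<` and `>` boundary markers follow the convention described for fastText. The real implementation also hashes n-grams into buckets and may treat boundary characters slightly differently, so take this as an illustration of the idea only.

```python
def char_ngrams(word, min_n, max_n):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Same n-gram lengths as the min_n=1, max_n=2 setting used in this section.
print(char_ngrams("nlp", min_n=1, max_n=2))
```

Because an unseen word still shares n-grams with known words, a vector can be assembled for it from those shared pieces.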

In [31]:
sentences = [['I', 'love', 'nlp'],
             ['I', 'will', 'learn', 'nlp', 'in', '2', 'months'],
             ['nlp', 'is', 'future'],
             ['nlp', 'saves', 'time', 'and', 'solves',
              'lot', 'of', 'industry', 'problems'],
             ['nlp', 'uses', 'machine', 'learning']]

In [32]:
from gensim.models import FastText
fast = FastText(sentences, vector_size=20, window=1, min_count=1, workers=5, min_n=1, max_n=2)

In [33]:
fast.wv.get_vector('future')

array([ 0.00718044,  0.00634451,  0.01015092,  0.00278108,  0.00071975,
        0.01481973, -0.01144717,  0.0085934 ,  0.00387139, -0.00861204,
       -0.01795045, -0.00222266,  0.0043997 ,  0.01099374,  0.00549521,
       -0.02154304,  0.02005067, -0.00923354,  0.00634542,  0.00346849],
      dtype=float32)

In [36]:
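# 'vidhya' is not in the training vocabulary; fastText still builds a vector from its character n-grams
fast.wv.get_vector('vidhya')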

array([-0.00516996,  0.00863833,  0.00496077,  0.0065316 ,  0.01097381,
        0.00640103,  0.0061447 ,  0.00292935, -0.00900661,  0.01208093,
        0.00329964,  0.00117949, -0.00334365,  0.00119693,  0.0070916 ,
       -0.00667899,  0.00937506, -0.0102112 ,  0.00654462, -0.00943527],
      dtype=float32)

In [37]:
fast.wv.key_to_index

{'nlp': 0,
 'I': 1,
 'future': 2,
 'love': 3,
 'will': 4,
 'learn': 5,
 'in': 6,
 '2': 7,
 'months': 8,
 'is': 9,
 'learning': 10,
 'machine': 11,
 'time': 12,
 'and': 13,
 'solves': 14,
 'lot': 15,
 'of': 16,
 'industry': 17,
 'problems': 18,
 'uses': 19,
 'saves': 20}

In [38]:
len(fast.wv.key_to_index)

21