## Word Vectors in the Gensim Library


* **Gensim** is an NLP library similar to spaCy, but it is mainly used for topic modelling.

* All gensim models are listed on this page: https://github.com/RaRe-Technologies/gensim-data

In [3]:
# Import Gensim's downloader, which fetches pretrained word vectors.
from gensim import downloader as api

# Specify which pretrained word embedding to download. One of the embeddings
# available through Gensim is a word2vec model trained on Google News.
wv = api.load('word2vec-google-news-300')



In [4]:
# The model has a 'similarity' method that returns the cosine similarity between two words.
wv.similarity(w1="great", w2="good")

0.72915095
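
The score above is a cosine similarity between the two word vectors. As a rough illustration of what that means, here is a minimal pure-Python sketch. The 3-dimensional vectors are made up for this example (the real Google News vectors have 300 dimensions), and `cosine_similarity` is a hypothetical helper, not part of Gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only (not real embeddings).
great = [0.9, 0.1, 0.3]
good = [0.8, 0.2, 0.4]
table = [-0.2, 0.9, -0.5]

# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(great, good))
# Vectors pointing in different directions score much lower.
print(cosine_similarity(great, table))
```

The score depends only on the direction of the vectors, not on their length, which is why it is a common choice for comparing embeddings.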

In [5]:
# Another method, 'most_similar', lists the words most similar to a given word.
wv.most_similar("good")

[('great', 0.7291510105133057),
 ('bad', 0.7190051078796387),
 ('terrific', 0.6889115571975708),
 ('decent', 0.6837348341941833),
 ('nice', 0.6836092472076416),
 ('excellent', 0.6442928910255432),
 ('fantastic', 0.6407778263092041),
 ('better', 0.6120728850364685),
 ('solid', 0.5806034207344055),
 ('lousy', 0.5764203071594238)]

In [7]:
# We can do arithmetic with the vectors.
# For example: France - Paris + Berlin = Germany, or king - man + woman = queen.
# Let's see:
wv.most_similar(positive=["France", "Berlin"], negative=["Paris"])

[('Germany', 0.7901253700256348),
 ('Austria', 0.6026812791824341),
 ('German', 0.6004959940910339),
 ('Germans', 0.5851002931594849),
 ('Poland', 0.5847075581550598),
 ('Hungary', 0.5271855592727661),
 ('BBC_Tristana_Moore', 0.5249711275100708),
 ('symbol_RSTI', 0.5245768427848816),
 ('Belgium', 0.5221248269081116),
 ('Germnay', 0.5199405550956726)]

In [9]:
# And for king - man + woman:
wv.most_similar(positive=["king", "woman"], negative=["man"])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]
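
The two analogy queries above can be pictured as vector arithmetic followed by a nearest-neighbour search. The sketch below is a simplified illustration under stated assumptions: the 2-dimensional vectors are invented (think of the axes as "royalty" and "gender"), `analogy` is a hypothetical helper, and Gensim's own implementation differs in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The input words themselves are excluded from the results.
    exclude = set(positive) | set(negative)
    scores = {
        word: sum(a * b for a, b in zip(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up toy vectors, for illustration only (not real embeddings).
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "prince": [0.7, 0.7],
    "apple": [-0.8, 0.1],
}

for word, score in analogy(toy, positive=["king", "woman"], negative=["man"]):
    print(word, round(score, 3))
```

With these toy vectors, "queen" ranks first because subtracting "man" and adding "woman" flips the second axis while leaving the first one untouched.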

In [10]:
# Another method is 'doesnt_match': given a list of words, it returns the one that
# doesn't fit with the others.
wv.doesnt_match(["facebook", "cat", "google", "microsoft"])

'cat'

In [11]:
# Another example:
wv.doesnt_match(["dog", "cat", "google", "mouse"])

'google'
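
One way to think about `doesnt_match` is that it averages the vectors of all the words and returns the word that is least similar to that average. The sketch below follows that idea with invented 2-dimensional vectors. `odd_one_out` is a hypothetical helper written for this example, not Gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the group's mean."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    mean = [sum(vec[i] for vec in units.values()) / len(units) for i in range(dim)]
    mean = unit(mean)
    # Cosine similarity of each unit vector to the mean direction.
    scores = {word: sum(a * b for a, b in zip(vec, mean)) for word, vec in units.items()}
    return min(scores, key=scores.get)

# Made-up toy vectors, for illustration only (not real embeddings).
toy = {
    "dog": [0.9, 0.1],
    "cat": [0.8, 0.2],
    "mouse": [0.7, 0.3],
    "google": [0.1, 0.9],
}

print(odd_one_out(toy, ["dog", "cat", "google", "mouse"]))
```

Because three of the four toy vectors point in nearly the same direction, the mean is pulled towards them and "google" ends up furthest from it.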

### Gensim: GloVe
Stanford's page on GloVe: https://nlp.stanford.edu/projects/glove/

In [12]:
# Let's download another model called 'glove-twitter-25'.
glv = api.load("glove-twitter-25")



In [13]:
# Now 'most_similar' gives different results, because these vectors were trained on Twitter data.
glv.most_similar("good")

[('too', 0.9648016691207886),
 ('day', 0.9533665180206299),
 ('well', 0.9503170847892761),
 ('nice', 0.9438973665237427),
 ('better', 0.9425962567329407),
 ('fun', 0.941892683506012),
 ('much', 0.9413353800773621),
 ('this', 0.9387555122375488),
 ('hope', 0.9383508563041687),
 ('great', 0.9378516674041748)]

In [15]:
# Words most similar to 'facebook' in the Twitter-trained vectors:
glv.most_similar("facebook")

[('twitter', 0.9480051398277283),
 ('google', 0.9231430888175964),
 ('instagram', 0.9184154272079468),
 ('internet', 0.9143875241279602),
 ('youtube', 0.9113808274269104),
 ('tumblr', 0.9077149033546448),
 ('link', 0.8995786905288696),
 ('fb', 0.8734269142150879),
 ('post', 0.8671452403068542),
 ('site', 0.8642723560333252)]

In [16]:
# Let's try the 'doesnt_match' method:
glv.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

In [17]:
# The same list of words we tried with the Google News vectors:
glv.doesnt_match("facebook cat google microsoft".split())

'cat'

In [18]:
# Three fruit names and one unrelated word:
glv.doesnt_match("banana grapes orange human".split())

'human'

* By now it should be clear that pretrained word vectors are available on the internet, and that libraries such as **spaCy**, **Gensim**, and **PyTorch** can load them. **word2vec** and **GloVe**, on the other hand, are the techniques (algorithms) used to train those vectors.
* A name like 'glove-twitter-25' is composed of three parts: 'glove' is the algorithm, 'twitter' is the dataset on which the vectors were trained, and '25' is the number of dimensions of each vector.
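
To make the naming pattern concrete, here is a small sketch that splits a model name into its parts. It assumes the pattern seen in the two names used in this notebook (algorithm first, vector size last, dataset in between). `parse_model_name` is a hypothetical helper, and the pattern is an observation, not a guarantee for every model in the list.

```python
def parse_model_name(name):
    """Split a name like 'glove-twitter-25' into algorithm, dataset and vector size."""
    parts = name.split("-")
    return {
        "algorithm": parts[0],            # e.g. 'glove' or 'word2vec'
        "dataset": "-".join(parts[1:-1]),  # e.g. 'twitter' or 'google-news'
        "dimensions": int(parts[-1]),      # length of each word vector
    }

print(parse_model_name("glove-twitter-25"))
print(parse_model_name("word2vec-google-news-300"))
```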