## Gensim Tutorial

## Part A:

In [0]:
import nltk
import numpy as np
import gensim
from gensim.models import Word2Vec
from nltk.data import find
import matplotlib
from sklearn.manifold import TSNE
import string
from urllib.request import urlopen
import re
import pprint
from nltk import word_tokenize
from nltk import tokenize

## Visualize
%matplotlib inline
from matplotlib import pyplot as plt

This tutorial uses Google's pre-trained Word2Vec model, which is a 1.5 GB download. It contains word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset. Each vector has 300 dimensions.

In [0]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2019-03-25 21:10:26--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.112.29
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.112.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2019-03-25 21:10:55 (54.6 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [0]:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

In [0]:
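# load_word2vec_format already returns a KeyedVectors object, so no .wv lookup is needed
word_vectors = model
# Vocabulary lookup table (gensim < 4.0; in gensim 4.x use word_vectors.key_to_index)
word_vectors.vocab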

In [8]:
word_pairs = [
              ('India', 'Pakistan') ,
              ('massive', 'puny'), 
              ('California', 'Sacramento'), 
              ('Iceland', 'volcano'), 
              ('fast', 'faster')
            ]
# model.similarity returns the cosine similarity between the two word vectors
for pair in word_pairs:
    print(model.similarity(pair[0], pair[1]))

0.6706861
0.33377072
0.6366068
0.3833339
0.5216066




In [9]:
words = ['woman', 'man']
for word in words:
  print(len(model[word]), model[word])

300 [ 2.43164062e-01 -7.71484375e-02 -1.03027344e-01 -1.07421875e-01
  1.18164062e-01 -1.07421875e-01 -1.14257812e-01  2.56347656e-02
  1.11816406e-01  4.85839844e-02 -9.71679688e-02 -3.43750000e-01
 -6.29882812e-02 -1.25000000e-01 -2.70996094e-02  9.42382812e-02
 -1.87500000e-01 -5.34667969e-02  6.25000000e-02 -3.05175781e-02
 -2.90527344e-02 -4.80957031e-02 -5.51757812e-02 -4.08203125e-01
  1.01318359e-02 -2.32421875e-01 -1.70898438e-01  2.63671875e-01
  3.49609375e-01 -2.11914062e-01  1.43554688e-01 -6.22558594e-03
 -2.25585938e-01 -1.05468750e-01 -1.16210938e-01  1.23046875e-01
  3.06640625e-01 -4.88281250e-02 -9.57031250e-02  1.99218750e-01
 -1.57226562e-01 -2.80761719e-02  1.58203125e-01 -2.42919922e-02
  1.29882812e-01 -8.98437500e-02 -7.61718750e-02  3.54003906e-02
 -3.06396484e-02  1.52343750e-01  5.24902344e-02  1.60980225e-03
  5.56640625e-02  3.95507812e-02 -7.71484375e-02 -7.12890625e-02
 -9.22851562e-02 -7.03125000e-02  2.03125000e-01  1.53198242e-02
  2.98828125e-01  1.7

Q3: Using NumPy, compute the magnitude (L2 norm) of each vector and subtract the magnitudes within each word pair from Q1. Do the magnitudes of the vectors matter? Justify your answer.


In [10]:
# For each pair, print the difference of the two L2 norms
# followed by the L2 norm of the difference vector.
for pair in word_pairs:
    norm_first = np.linalg.norm(model[pair[0]], ord=2)
    norm_second = np.linalg.norm(model[pair[1]], ord=2)
    distance = np.linalg.norm(model[pair[0]] - model[pair[1]], ord=2)
    print("{}: {:.4f}, {:.4f}".format(pair, norm_first - norm_second, distance))

('India', 'Pakistan'): -0.4565, 2.3936
('massive', 'puny'): -0.7695, 3.0797
('California', 'Sacramento'): -0.5296, 2.5566
('Iceland', 'volcano'): -1.0918, 4.2345
('fast', 'faster'): -0.3869, 2.6692
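
The two numbers printed for each pair measure different things: the first is the difference of the two norms, the second is the norm of the difference vector. The sketch below illustrates the gap between them on hand-made 2-D vectors. The vectors and the helper names `norm_gap` and `distance` are made up for illustration and are not taken from the model.

```python
import numpy as np

# Toy 2-D vectors (hypothetical values, chosen so all three have norm 5).
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])    # same length as a, slightly different direction
c = np.array([-3.0, -4.0])  # same length as a, opposite direction

def norm_gap(u, v):
    """Difference of the two L2 norms: ||u|| - ||v||."""
    return np.linalg.norm(u) - np.linalg.norm(v)

def distance(u, v):
    """L2 norm of the difference vector: ||u - v||."""
    return np.linalg.norm(u - v)

# The norm gap is zero for both pairs, yet the distances differ a lot.
print(norm_gap(a, b), distance(a, b))
print(norm_gap(a, c), distance(a, c))
```

Because the norm gap discards direction, it can be zero for vectors that point in opposite directions. It is also never larger in absolute value than the distance, which is consistent with the output above.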


In [14]:
# `pair` still holds the last tuple from the loop above, so this is the L2 norm of 'fast'
np.linalg.norm(model[pair[0]], ord=2)

2.513517

Q4: What is the relationship between the magnitudes of the individual vectors, the vectors themselves, and the cosine distance for any pair of words? Choose any tuple in Q1 and provide your answer.


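The magnitudes of the vectors do not appear to matter. There is no direct relationship between how close two words are and the difference in their L2 norms. For example, ('fast', 'faster') has the smallest norm difference (-0.3869) but a cosine similarity of only 0.5216, while ('India', 'Pakistan') has a larger norm difference (-0.4565) and the highest similarity (0.6707).
This is because a word (and its semantic meaning) is represented by a vector that has both a length (L2 norm) and a direction. When the word embeddings are updated during training, both the length and the direction of each vector are adjusted. Cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖), divides out the two magnitudes and so depends only on the angle between the vectors. That is why cosine similarity and vector addition/subtraction can be used to find words with similar or opposite meanings, whereas arithmetic on the L2 norms alone is meaningless: it ignores the critical direction information.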
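
A minimal sketch of this point, assuming nothing beyond NumPy: cosine similarity is computed by hand on two made-up vectors, and rescaling either vector changes its norm but leaves the similarity unchanged. The vectors `u` and `v` and the helper `cosine_similarity` are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of the two L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors, not taken from the model.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 1.0, 0.5])

base = cosine_similarity(u, v)
# Stretch u by 10x and shrink v by 10x: the norms change, the angle does not.
scaled = cosine_similarity(10.0 * u, 0.1 * v)

print(np.linalg.norm(u), np.linalg.norm(10.0 * u))
print(base, scaled)
```

The two similarity values agree up to floating-point error even though the norms differ by a factor of ten.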

Q5: Time to explore the power of Word2Vec. Find the 2 closest words for each of the following expressions:
> (King - Queen) <br />
> (bigger - big + small) <br />

In [0]:
# king - queen + man ("man" is added as a second positive term)
print(model.most_similar(positive=["king","man"], negative=["queen"], topn=2))
# bigger - big + small
print(model.most_similar(positive=["bigger","small"], negative=["big"], topn=2))



[('boy', 0.5393532514572144), ('guy', 0.47399765253067017)]
[('larger', 0.7402471899986267), ('smaller', 0.732999324798584)]
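
The idea behind these analogy queries can be sketched on a toy vocabulary. This is a simplified illustration and not gensim's actual implementation: it assumes the query is the mean of the unit-normalized positive vectors and the negated unit-normalized negative vectors, ranks the remaining words by cosine similarity, and skips the query words themselves. The 2-D embeddings and the function `toy_most_similar` are hypothetical.

```python
import numpy as np

# Hand-made 2-D embeddings: axis 0 loosely encodes size, axis 1 the comparative form.
toy_vectors = {
    "big":     np.array([1.0, 0.0]),
    "bigger":  np.array([1.0, 1.0]),
    "small":   np.array([-1.0, 0.0]),
    "smaller": np.array([-1.0, 1.0]),
    "tall":    np.array([0.8, 0.1]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_most_similar(vectors, positive, negative, topn=2):
    """Rank words by cosine similarity to mean(positive) - mean(negative)."""
    terms = [unit(vectors[w]) for w in positive]
    terms += [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(terms, axis=0))
    exclude = set(positive) | set(negative)
    scores = {
        word: float(np.dot(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

result = toy_most_similar(toy_vectors, positive=["bigger", "small"], negative=["big"])
print(result)
```

In this idealized toy space "smaller" ranks first. The real model is noisier, which is consistent with 'larger' scoring marginally above 'smaller' in the output above.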


Q6: Explore and come up with a similar analogy that holds in the same way as (bigger - big + small).

In [0]:
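# taller - tall + Short; the vocabulary is case-sensitive, so "Short" is a different entry from "short"
print(model.most_similar(positive=["taller","Short"], negative=["tall"], topn=2))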



[('shorter', 0.43716010451316833), ('Long', 0.4175928831100464)]
