# Word2Vec Basic Appreciation



##### Introduction

For this exercise, we will use the pretrained word vectors provided by Google.
We also take the opportunity to introduce the GENSIM package, which is widely used for topic modelling and other NLP-related tasks.

- We will borrow Google's massive set of pre-trained word vectors. You can download the dataset GoogleNews-vectors-negative300.bin.gz from https://code.google.com/archive/p/word2vec/ .


- For GENSIM, please run one of the following commands at the command terminal:
pip install --upgrade gensim
OR
conda install -c conda-forge gensim 
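
After installing, it can help to confirm which version of the package is present, because some attribute names differ between gensim 3.x and 4.x. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("gensim")
if version is None:
    print("gensim is not installed")
else:
    print("gensim version:", version)
```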

In [1]:
# NLP tools
import nltk
import gensim

# Data tools
import numpy as np
import pandas as pd


State the path of the Google pretrained vectors file. Note that this is a huge file!

In [2]:

google_vec_file = "/Users/tanpohkeam/Downloads/GoogleNews-vectors-negative300.bin.gz"
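
Since loading is slow, it may be worth checking that the path points to a real file before going further. This is a small sketch using `pathlib`; the helper name `file_size_gb` and the example path are hypothetical.

```python
from pathlib import Path


def file_size_gb(path):
    """Return the file size in gigabytes (10**9 bytes), or None if no such file exists."""
    p = Path(path).expanduser()
    if not p.is_file():
        return None
    return p.stat().st_size / 1e9


# Hypothetical path, used only to show the "missing file" case.
print(file_size_gb("no/such/folder/vectors.bin.gz"))  # None when the file is missing
```

In the notebook, the same helper could be called with `google_vec_file` instead of the hypothetical path.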


#### Loading Google Word2Vec Vectors
* Load the Google vectors into an object `google_model` using `gensim`.
* The KeyedVectors module is GENSIM's implementation of word vectors.
* The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors.
* We will load the Google vectors below. This step will take a while, as it has to load 3 million vectors into the appropriate Word2Vec format.
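
To make the "mapping between entities and vectors" idea concrete, here is a toy stand-in written in plain Python. It is not gensim's implementation; the class name `ToyKeyedVectors` and the numbers are invented, and the attribute names only loosely mirror those used in recent gensim releases.

```python
class ToyKeyedVectors:
    """Minimal word -> vector mapping (illustrative only, not gensim's code)."""

    def __init__(self, words, vectors):
        if len(words) != len(vectors):
            raise ValueError("need exactly one vector per word")
        self.index_to_key = list(words)
        # Each word maps to the row that holds its vector.
        self.key_to_index = {w: i for i, w in enumerate(self.index_to_key)}
        self.vectors = [list(v) for v in vectors]
        self.vector_size = len(self.vectors[0]) if self.vectors else 0

    def get_vector(self, word):
        """Look up the row index for `word`, then return that row."""
        return self.vectors[self.key_to_index[word]]

    def __getitem__(self, word):
        # Square-bracket access is a shortcut for get_vector().
        return self.get_vector(word)

    def __contains__(self, word):
        return word in self.key_to_index

    def __len__(self):
        return len(self.index_to_key)


toy = ToyKeyedVectors(["cat", "dog"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(toy["cat"])
print(toy.get_vector("dog"))
print("cow" in toy)
```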


In [3]:
# Load the Google vectors
google_model = gensim.models.KeyedVectors.load_word2vec_format(google_vec_file, binary=True)
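
To get a feel for what the loader has to do, the sketch below writes and reads a tiny file in a word2vec-style binary layout: a text header with the vocabulary size and vector size, then each word followed by a space and its raw 32-bit floats. This is a simplified illustration under that assumed layout, not gensim's loader; the function names are hypothetical. The `limit` argument here mimics the idea behind the `limit` parameter of gensim's `load_word2vec_format`, which reads only the first N vectors.

```python
import io
import struct


def write_toy_word2vec(vectors):
    """Serialize {word: [floats]} into a word2vec-style byte string."""
    dim = len(next(iter(vectors.values())))
    buf = io.BytesIO()
    buf.write(f"{len(vectors)} {dim}\n".encode("utf-8"))  # header line
    for word, vec in vectors.items():
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dim}f", *vec))  # little-endian float32 values
        buf.write(b"\n")
    return buf.getvalue()


def read_toy_word2vec(data, limit=None):
    """Parse the byte string back into {word: [floats]}, reading at most `limit` words."""
    stream = io.BytesIO(data)
    vocab_size, dim = (int(x) for x in stream.readline().decode("utf-8").split())
    if limit is not None:
        vocab_size = min(vocab_size, limit)
    result = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left over from the previous entry
                chars += ch
        word = chars.decode("utf-8")
        result[word] = list(struct.unpack(f"<{dim}f", stream.read(4 * dim)))
    return result


# Made-up values, chosen so that they are exactly representable as float32.
toy = {"cat": [0.5, -0.25, 1.0], "dog": [0.0, 2.0, -1.5]}
blob = write_toy_word2vec(toy)
print(read_toy_word2vec(blob))
print(read_toy_word2vec(blob, limit=1))
```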

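Google's model contains an extensive vocabulary of 3 million words and phrases, each represented by a 300-dimensional vector. The cells below use the `vocab` attribute of gensim 3.x; in gensim 4.x, `vocab` was removed and `key_to_index` plays the same role.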

In [5]:
type(google_model.vocab)

dict

* Confirm that you have 3 million vectors of length 300.

In [8]:
# Number of Vectors
print("Number of vectors : ", len(google_model.vocab.keys()))

Number of vectors :  3000000


In [11]:
# Size of the Vectors
print("Size of vector : " , google_model.vector_size )

Size of vector :  300
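
These two numbers also explain why loading is slow and memory hungry. The back-of-the-envelope calculation below assumes each component is stored as a 32-bit (4-byte) float and ignores the memory needed for the vocabulary itself.

```python
num_vectors = 3_000_000
vector_size = 300
bytes_per_float = 4  # assumption: 32-bit floats

total_bytes = num_vectors * vector_size * bytes_per_float
print(f"{total_bytes / 1e9:.1f} GB")  # 3.6 GB
```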


### Exploring Word2Vec 
* To see the vector representing a word, pass the word as a string to the `word_vec` method.
* The next cell displays the representation of "make".

In [16]:
# Word Vectors

np.set_printoptions(linewidth=120)
vector = google_model.word_vec('make')
print(vector)

[-0.11328125 -0.03686523  0.09423828  0.00799561  0.02490234 -0.16699219  0.03662109  0.07324219  0.21386719
 -0.03857422 -0.09863281 -0.09033203 -0.06884766  0.34765625  0.05151367 -0.03369141  0.15332031  0.10595703
  0.09765625  0.05078125 -0.03173828  0.01855469  0.24804688  0.18359375  0.00363159 -0.1484375  -0.16113281
 -0.00836182 -0.09082031  0.04736328  0.0012207   0.12597656 -0.13964844  0.04418945 -0.05395508 -0.05639648
  0.07275391  0.14648438  0.19433594  0.15820312  0.17871094 -0.07568359  0.15625    -0.1171875   0.04125977
 -0.03039551 -0.04785156 -0.00136566  0.03149414  0.02624512  0.04370117  0.08984375  0.05859375 -0.21191406
 -0.12011719  0.16503906 -0.06884766 -0.13476562 -0.05517578 -0.03442383  0.02209473 -0.02722168 -0.09082031
 -0.01647949 -0.07275391  0.09667969  0.00915527  0.24902344 -0.06884766  0.06835938  0.09667969  0.11669922
  0.0378418  -0.05200195 -0.16601562 -0.16699219  0.11962891  0.16503906 -0.02929688  0.04882812  0.00328064
 -0.23632812 -0.038

Below is another way to access the vector for a word, using square-bracket indexing:

In [17]:
vector2 = google_model['make']
print(vector2)


[-0.11328125 -0.03686523  0.09423828  0.00799561  0.02490234 -0.16699219  0.03662109  0.07324219  0.21386719
 -0.03857422 -0.09863281 -0.09033203 -0.06884766  0.34765625  0.05151367 -0.03369141  0.15332031  0.10595703
  0.09765625  0.05078125 -0.03173828  0.01855469  0.24804688  0.18359375  0.00363159 -0.1484375  -0.16113281
 -0.00836182 -0.09082031  0.04736328  0.0012207   0.12597656 -0.13964844  0.04418945 -0.05395508 -0.05639648
  0.07275391  0.14648438  0.19433594  0.15820312  0.17871094 -0.07568359  0.15625    -0.1171875   0.04125977
 -0.03039551 -0.04785156 -0.00136566  0.03149414  0.02624512  0.04370117  0.08984375  0.05859375 -0.21191406
 -0.12011719  0.16503906 -0.06884766 -0.13476562 -0.05517578 -0.03442383  0.02209473 -0.02722168 -0.09082031
 -0.01647949 -0.07275391  0.09667969  0.00915527  0.24902344 -0.06884766  0.06835938  0.09667969  0.11669922
  0.0378418  -0.05200195 -0.16601562 -0.16699219  0.11962891  0.16503906 -0.02929688  0.04882812  0.00328064
 -0.23632812 -0.038

In [32]:
vector = google_model.get_vector('make')
print(vector)

[-0.11328125 -0.03686523  0.09423828  0.00799561  0.02490234 -0.16699219  0.03662109  0.07324219  0.21386719
 -0.03857422 -0.09863281 -0.09033203 -0.06884766  0.34765625  0.05151367 -0.03369141  0.15332031  0.10595703
  0.09765625  0.05078125 -0.03173828  0.01855469  0.24804688  0.18359375  0.00363159 -0.1484375  -0.16113281
 -0.00836182 -0.09082031  0.04736328  0.0012207   0.12597656 -0.13964844  0.04418945 -0.05395508 -0.05639648
  0.07275391  0.14648438  0.19433594  0.15820312  0.17871094 -0.07568359  0.15625    -0.1171875   0.04125977
 -0.03039551 -0.04785156 -0.00136566  0.03149414  0.02624512  0.04370117  0.08984375  0.05859375 -0.21191406
 -0.12011719  0.16503906 -0.06884766 -0.13476562 -0.05517578 -0.03442383  0.02209473 -0.02722168 -0.09082031
 -0.01647949 -0.07275391  0.09667969  0.00915527  0.24902344 -0.06884766  0.06835938  0.09667969  0.11669922
  0.0378418  -0.05200195 -0.16601562 -0.16699219  0.11962891  0.16503906 -0.02929688  0.04882812  0.00328064
 -0.23632812 -0.038

### Exercise 1:
  1. Write the code to show the word vector for "cow".
  2. Write the code to show the word vector for a well-known personality such as Lee Kuan Yew.
  3. Write the code to show the word vector for a compound term such as artificial intelligence.

In [18]:
# Your code

In [24]:
# Answer

words = ['cow', 'Lee_Kuan_Yew', 'artificial_intelligence']
for w in words:
    v = google_model[w]
    print("Vector for " , w, "is :" )
    print(v)
    print(' ')


Vector for  cow is :
[ 0.18945312 -0.07519531 -0.15625     0.19921875 -0.18457031  0.20703125 -0.04125977 -0.01428223  0.00363159
 -0.09570312  0.10693359 -0.5859375  -0.08300781 -0.08007812 -0.32421875 -0.03662109 -0.20898438  0.24511719
 -0.25585938 -0.08837891  0.12255859 -0.10742188 -0.00454712  0.06176758  0.00466919  0.04174805 -0.21582031
 -0.04443359  0.28125    -0.2109375  -0.02441406 -0.01190186  0.08154297 -0.03955078 -0.32226562  0.16992188
 -0.078125    0.00653076  0.28320312  0.33398438 -0.06591797 -0.07910156 -0.02441406  0.09179688  0.09082031
 -0.1640625  -0.04223633  0.26953125  0.12792969  0.25       -0.23535156 -0.33203125  0.27929688  0.10107422
 -0.05786133 -0.09814453 -0.06884766 -0.2734375   0.57421875  0.04736328  0.125      -0.07177734 -0.10107422
 -0.02355957  0.11132812 -0.39648438  0.02026367  0.0625     -0.20996094 -0.09521484  0.12255859 -0.11035156
 -0.24902344  0.03222656 -0.09423828  0.11132812  0.01184082  0.01672363 -0.15722656 -0.02368164  0.2597656
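
The answer above joins multi-word terms with underscores, and indexing with a key that is not in the vocabulary raises a `KeyError`. The sketch below shows one way to handle both points. It uses a plain dictionary with made-up numbers as a stand-in for `google_model`, and the helper names `phrase_to_key` and `lookup` are hypothetical.

```python
def phrase_to_key(phrase):
    """Join the words of a phrase with underscores."""
    return "_".join(phrase.split())


def lookup(model, phrase):
    """Return the vector for `phrase`, or None if its key is not in `model`."""
    key = phrase_to_key(phrase)
    return model[key] if key in model else None


# Toy stand-in for google_model; the numbers are invented for illustration.
toy_model = {
    "cow": [0.1, 0.2],
    "artificial_intelligence": [0.3, 0.4],
}

for phrase in ["cow", "artificial intelligence", "purple cow farm"]:
    vec = lookup(toy_model, phrase)
    print(phrase, "->", vec if vec is not None else "not in vocabulary")
```

The `key in model` test also works on a gensim `KeyedVectors` object, so the same helper could be tried with `google_model`.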

### Exercise 2:
Suppose you have a bag of words and need to convert them to word vectors for a downstream NLP task.
Complete the code below to prepare the equivalent bag of vectors. Each word vector should be stored as a dictionary that maps the word to its vector.

In [30]:
bag_of_words = ["happy", "sad", "computer", "milk"]
bag_of_vectors =  []

for w in bag_of_words:
    # continue with your code
    pass  # placeholder so the cell parses; replace with your solution

In [29]:
# Answer
bag_of_words = ["happy", "sad", "computer", "milk"]
bag_of_vectors =  []

for w in bag_of_words:
    v = google_model[w]
    bag_of_vectors.append({w : v})
print(bag_of_vectors)

[{'happy': array([-5.18798828e-04,  1.60156250e-01,  1.60980225e-03,  2.53906250e-02,  9.91210938e-02, -8.59375000e-02,
        3.24218750e-01, -2.17285156e-02,  1.34765625e-01,  1.10351562e-01, -1.04980469e-01, -2.90527344e-02,
       -2.38037109e-02, -4.02832031e-02, -3.68652344e-02,  2.32421875e-01,  3.20312500e-01,  1.01074219e-01,
        5.83496094e-02, -2.91824341e-04, -3.29589844e-02,  2.11914062e-01,  4.32128906e-02, -8.59375000e-02,
        2.81250000e-01, -1.78222656e-02,  3.79943848e-03, -1.71875000e-01,  2.06054688e-01, -1.85546875e-01,
        3.73535156e-02, -1.21459961e-02,  2.04101562e-01, -3.80859375e-02,  3.61328125e-02, -8.15429688e-02,
        8.44726562e-02,  9.37500000e-02,  1.44531250e-01,  7.42187500e-02,  2.51953125e-01, -7.91015625e-02,
        8.69140625e-02,  1.58691406e-02,  1.09375000e-01, -2.23632812e-01, -5.15747070e-03,  1.68945312e-01,
       -1.36718750e-01, -2.51464844e-02, -3.85742188e-02, -1.33056641e-02,  1.38671875e-01,  1.76757812e-01,
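
As a variation on the answer, the sketch below builds the same list of one-entry dictionaries with a comprehension, and also a single flat dictionary, which is often easier to query downstream. It uses a toy dictionary with made-up numbers in place of `google_model`; the toy vocabulary deliberately omits "computer" to show how missing words can be skipped and reported.

```python
# Toy stand-in for google_model; the numbers are invented for illustration.
toy_model = {"happy": [0.1, 0.2], "sad": [-0.1, 0.3], "milk": [0.5, 0.0]}

bag_of_words = ["happy", "sad", "computer", "milk"]

# List of one-entry dictionaries, as the exercise asks for.
bag_of_vectors = [{w: toy_model[w]} for w in bag_of_words if w in toy_model]

# Alternative: a single dictionary mapping every known word to its vector.
word_to_vector = {w: toy_model[w] for w in bag_of_words if w in toy_model}

# Words that had no vector in the toy vocabulary.
missing = [w for w in bag_of_words if w not in toy_model]

print(bag_of_vectors)
print(word_to_vector)
print("Skipped:", missing)
```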
        