# Using Pre-trained Word Embedding Vectors to Find Word Analogies

* This method uses pre-trained word embedding vectors.

In this method, each word in a dictionary is mapped to a vector representation, called a "word embedding".

Solving word analogies is a basic task in natural language processing. Analogy questions have also been used in the GRE and SAT exams in the US to help gauge human English ability!


### Loading pre-trained word vectors

The file `pretrained_word2vec_vectors.txt` contains 100-dimensional word vectors for $N$ words. It was pretrained using the gensim package and the word2vec model.
The first line of the file gives the number of words and the number of dimensions of each word vector.

Represent the model as a dictionary that maps each word to its vector:

In [5]:
import numpy as np

filename = 'C:\\Users\\pooji\\Documents\\Notebooks\\95-865 Spring 2018 Mini-3 - Final Exam\\pretrained_word2vec_vectors.txt'
embeddings_index = {}
with open(filename) as f:
    # each row represents a word vector
    for line in f:
        values = line.split()
        # the first part is word
        word = values[0]
        # the rest of the values form the embedding vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))


Found 15174 word vectors.


Remove the dictionary entry that holds the number of words and the dimension of each vector (it comes from the first line of the file, so its key is `'15173'` and its value is `[100.]`).

In [7]:
k = list(embeddings_index.keys())
k[0]
embeddings_index.pop('15173')

array([ 100.], dtype=float32)
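An alternative that avoids the hard-coded key is to consume the header line before the loop. The following is a minimal sketch, assuming the file follows the layout described above (one header line with the word count and dimension, then one word per line). The function name `load_vectors` and the toy data are hypothetical and only serve as illustration.

```python
import io
import numpy as np

def load_vectors(f):
    """Read word2vec-style text: header 'N D', then 'word v1 ... vD' per line."""
    # Consume the header first so it never enters the dictionary
    n_words, dim = (int(x) for x in f.readline().split())
    index = {}
    for line in f:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = np.asarray(values[1:], dtype='float32')
    return n_words, dim, index

# Toy stand-in for the real file (hypothetical data)
toy = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
n_words, dim, index = load_vectors(toy)
print(n_words, dim, sorted(index))
```

With the real file, the same function could be called on the handle returned by `open(filename)`, and the header values could be compared against `len(index)` as a sanity check.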

Read the word vectors into a numpy array `w2v` with 100 columns and $N$ rows.

Construct a list `w2v_words` of $N$ words ordered according to the row index of their vector in `w2v`.

In [93]:
# Keep the words themselves (not their positions), in the same order as the rows of w2v
w2v_words = list(embeddings_index.keys())
l = list(embeddings_index.values())
w2v = np.array(l)
print(w2v.shape)

(15173, 100)


Normalize each vector by its L2 norm. After normalization, each row of `w2v` has an L2 norm of 1.0.


In [94]:
norm = np.array([np.linalg.norm(v) for v in w2v])
for i in range((w2v.shape[0])):
    w2v[i] = w2v[i]/norm[i]

In [47]:
w2v[0]

array([ 0.05401   ,  0.02648962, -0.13368909, -0.04295449, -0.11561567,
        0.06392961,  0.18777601, -0.03032635, -0.03834768,  0.05719716,
        0.04116647,  0.11574388,  0.05172624, -0.14440101,  0.04050237,
        0.0799432 , -0.1046636 , -0.20766737,  0.01127687, -0.17266241,
       -0.00200854, -0.07678851,  0.22929633,  0.07413639, -0.01899905,
        0.02255801,  0.00419143, -0.11205757,  0.013129  ,  0.10333455,
       -0.20436054,  0.03783913,  0.09512347, -0.07432272,  0.07157487,
        0.17787179,  0.02194348,  0.16391374,  0.21379983, -0.04498867,
        0.05303564, -0.12802587, -0.0914081 , -0.0825167 ,  0.24286807,
       -0.07098598, -0.20567337, -0.09574313,  0.0079521 ,  0.08500044,
        0.09422348, -0.04991087,  0.09977558, -0.14894287,  0.0044581 ,
        0.14001128,  0.14962064,  0.02814773,  0.06051338, -0.14996679,
       -0.077668  , -0.02580158,  0.07543638, -0.01375122, -0.09847303,
        0.06246294,  0.02284519, -0.12523273, -0.10002857,  0.15

Check that the normalization is correct:

In [95]:
np.allclose(np.array([np.linalg.norm(v) for v in w2v]),
                   np.ones(w2v.shape[0]))

True
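The row-by-row normalization can also be written as a single broadcast division. Below is a minimal sketch on a small hypothetical array (`toy` is not part of the notebook), assuming no row is all zeros:

```python
import numpy as np

# Hypothetical 3 x 2 array standing in for w2v
toy = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]], dtype='float32')

# keepdims=True keeps the norms as a (3, 1) column, so the division broadcasts per row
row_norms = np.linalg.norm(toy, axis=1, keepdims=True)
toy_unit = toy / row_norms

# Same check as above: every row should now have L2 norm 1.0
print(np.allclose(np.linalg.norm(toy_unit, axis=1), 1.0))
```

The same two lines would apply to `w2v` directly and avoid the Python-level loop over 15173 rows.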

### Predicting word analogies for a new set using the pretrained model

The `questions-words.txt` file is the test set: we predict the word analogies in it and measure the accuracy. For instance,

```
Athens Greece Baghdad Iraq
```

This reads as "Athens is to Greece as Baghdad is to ?".

The task is to predict the last word in each line. Here, I assess how well the pretrained model predicts that word. Since I know the last word in each line, I can count the number of successes and failures.

Let `a b c d` be the four words in an analogy task. Let `v` be the word embedding, so `v[w]` is the vector for word `w`. 

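Compute `pred = v[c] + (v[b] - v[a])`. If the embedding captures the analogy, this vector should lie near `v[d]`, although in practice it often stays closest to `v[c]` itself.

We assess closeness using the cosine similarity between this computed vector and every vector in the model, and pick the word `y` with the second-highest cosine similarity. This is because the closest word to `pred` will most likely be `c`; hence, we use the second-closest instead. Because every row of `w2v` has unit norm, the dot product of `pred` with a row equals the cosine similarity scaled by the norm of `pred`, so it produces the same ranking.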

If `y` is equal to `d`, count the task as a success. If not, count it as a failure.
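The rule can be traced on a toy vocabulary before running it on the full test set. This is a minimal sketch, not the notebook's implementation: the 2-D vectors, the word `banana`, and the helper `second_closest` are all hypothetical, with the vectors placed by angle so that the geometry is easy to verify by hand.

```python
import numpy as np

# Hypothetical 2-D unit vectors, placed by angle purely for illustration
angles_deg = {'athens': 0, 'greece': 10, 'baghdad': 90, 'iraq': 100, 'banana': 200}
toy_words = list(angles_deg)
theta = np.deg2rad([angles_deg[w] for w in toy_words])
toy_vecs = np.column_stack([np.cos(theta), np.sin(theta)])  # rows already have norm 1

def second_closest(a, b, c):
    """Return the word with the second-highest cosine similarity to v[c] + (v[b] - v[a])."""
    idx = {w: i for i, w in enumerate(toy_words)}
    pred = toy_vecs[idx[c]] + (toy_vecs[idx[b]] - toy_vecs[idx[a]])
    # Dividing by ||pred|| turns the dot products into true cosine similarities
    sims = toy_vecs @ pred / np.linalg.norm(pred)
    ranked = np.argsort(-sims)  # indices from most to least similar
    return toy_words[ranked[1]]

print(second_closest('athens', 'greece', 'baghdad'))
```

In this toy layout the top-ranked word is `baghdad` itself and `iraq` comes second, which is exactly the situation the second-closest rule assumes. With real embeddings the top-ranked word is not guaranteed to be `c`, so the rule can miss even when `d` is nearby.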


In [97]:
suc = 0   # number of successes
fail = 0  # number of failures
filename2 = 'C:\\Users\\pooji\\Documents\\Notebooks\\95-865 Spring 2018 Mini-3 - Final Exam\\questions-words.txt'

# The position of a word in the keys list is its row index in w2v.
# Building the lookup once avoids rebuilding the key list for every line.
vocab = list(embeddings_index.keys())
word_to_index = {w: i for i, w in enumerate(vocab)}

with open(filename2) as f:
    for line in f:
        words = line.split()
        word0_index = word_to_index[words[0]]
        word1_index = word_to_index[words[1]]
        word2_index = word_to_index[words[2]]

        pred = w2v[word2_index] + (w2v[word1_index] - w2v[word0_index])
        # Rows of w2v have unit norm, so these dot products are cosine
        # similarities scaled by ||pred|| and rank the words identically
        cos = w2v @ pred
        # Sort in descending order and take position 1: the second-highest similarity
        ind = np.argsort(-cos)[1]
        pred_word = vocab[ind]
        # Compare the predicted word with the known answer from the test set
        if pred_word == words[3]:
            suc = suc + 1
        else:
            fail = fail + 1


In [98]:
print(words)
print(pred_word)


['think', 'thinks', 'talk', 'talks']
talk


In [99]:
print(suc)
print(fail)

38
5894
