# The Word2Vec Model
Welcome! In the code below I demonstrate the Word2Vec neural network model using PyTorch. To obtain reasonably accurate word embeddings you would need a much larger training corpus (at least 10,000 words) than the one used here. For a large vocabulary you would also want a cheaper output layer, such as hierarchical softmax, because a full softmax computes a score for every word in the vocabulary at every training step.

If you wish to modify any parameters, I recommend following the guidelines given before each code cell.



In [1]:
# Importing required modules
import torch
from torch.autograd import Variable
import numpy as np
import torch.nn.functional as F

### The Training Corpus


*   Defines the list of sentences that the model is trained on.
*   You can add, edit or delete sentences as you like, but DO NOT use punctuation or capitalization _(otherwise the same word would enter the vocabulary more than once, e.g. 'King' and 'king')_.
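
If your own sentences do contain capitals or punctuation, one option is to clean them before adding them to the corpus. The snippet below is only a minimal sketch: `normalize_sentence` is a hypothetical helper that is not part of this notebook, and it assumes that dropping every ASCII punctuation character is acceptable for your text.

```python
import string

def normalize_sentence(sentence):
    """Lower-case a sentence and strip ASCII punctuation (hypothetical helper)."""
    lowered = sentence.lower()
    # str.translate with a deletion table removes every punctuation character
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    # split/join collapses any repeated whitespace left behind
    return ' '.join(no_punct.split())

raw_sentences = ['He is a King.', 'She is a queen!']
cleaned = [normalize_sentence(s) for s in raw_sentences]
print(cleaned)
```

The cleaned strings can then be placed in the corpus list in the same way as the default sentences.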





In [2]:
corpus = [
    'he is a king', 'she is a queen', 'he is a man',
    'i would like some mango juice', 'she is a woman',
    'i would like some orange juice', 'i would like some apple juice',
    'nairobi is the capital of kenya', 'delhi is the capital of india',
    'oslo is the capital of norway',
]

### Vocabulary and Word Indexing


*   Splits corpus sentences into individual word sequences
*   Generates a vocabulary of all words that occur in the training corpus
*   Assigns indices to words
*   Converts sequences of words into their respective index sequences




In [3]:
#returns list of sentences and distinct word vocabulary
def split_words(corpus):
  tokens = []
  vocab=set()
  for x in corpus:
    sep=x.split(' ')
    tokens.append(sep)
    vocab=vocab|set(sep)
  vocab=list(vocab)
  return tokens,vocab
sequences,vocab = split_words(corpus)
#maps word to index and index to word
word_to_ind={}
ind_to_word={}
ind=0
for word in vocab:
  word_to_ind[word]=ind
  ind_to_word[ind]=word
  ind+=1
#indexed sequences
seq_inds=[[word_to_ind[word] for word in sequence] for sequence in sequences]
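
One thing to keep in mind: a Python set has no guaranteed order, so the index given to each word can change from one run to the next. The sketch below repeats the same four steps on a two-sentence toy corpus and sorts the vocabulary so the indices are reproducible. All the `toy_` names are made up for this illustration, and sorting is an assumption, not what the cell above does.

```python
toy_corpus = ['he is a king', 'she is a queen']

# 1. split sentences into word sequences
toy_sequences = [sentence.split() for sentence in toy_corpus]

# 2. build the vocabulary; sorting gives the same order on every run
toy_vocab = sorted({word for seq in toy_sequences for word in seq})

# 3. assign an index to every word, and keep the reverse mapping
toy_word_to_ind = {word: i for i, word in enumerate(toy_vocab)}
toy_ind_to_word = {i: word for word, i in toy_word_to_ind.items()}

# 4. convert word sequences into index sequences
toy_seq_inds = [[toy_word_to_ind[w] for w in seq] for seq in toy_sequences]

# decoding the indices should give back the original words
decoded = [[toy_ind_to_word[i] for i in seq] for seq in toy_seq_inds]
print(toy_vocab)
print(toy_seq_inds)
```

Decoding back to words is a quick sanity check that the two dictionaries are consistent with each other.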

### Window Selection and Context-Target Generation
*  The window size and the mode that sets the range of context words can be chosen:
   *   Bi-directional mode ('bi_dir') takes up to window_size words before and up to window_size words after the target word, so each target has at most twice the window size in context words.
   *   Uni-directional mode ('uni_dir') takes only the words before the target word, up to window_size of them.
*  The loop iterates over the entire corpus and applies the pair_words function to every sequence.
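
To make the two modes concrete, here is a small sketch that works on words instead of indices, so the pairs are easy to read. `context_target_pairs` is a hypothetical helper written for this illustration only; it is meant to follow the same windowing rules described above, with a window size of 1 to keep the output short.

```python
def context_target_pairs(words, window, bidirectional=True):
    """Return (context, target) pairs for one sentence (illustrative helper)."""
    pairs = []
    for t, target in enumerate(words):
        start = max(t - window, 0)
        # bi-directional: also look ahead; uni-directional: stop at the target
        end = min(t + window + 1, len(words)) if bidirectional else t
        for c in range(start, end):
            if c != t:
                pairs.append((words[c], target))
    return pairs

sentence = ['he', 'is', 'a', 'king']
bi_pairs = context_target_pairs(sentence, window=1, bidirectional=True)
uni_pairs = context_target_pairs(sentence, window=1, bidirectional=False)

print(bi_pairs)
# [('is', 'he'), ('he', 'is'), ('a', 'is'), ('is', 'a'), ('king', 'a'), ('a', 'king')]
print(uni_pairs)
# [('he', 'is'), ('is', 'a'), ('a', 'king')]
```

Words at the edges of a sentence get fewer context words because the window is clipped at the sentence boundary.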



In [4]:
window_size=2
mode='bi_dir'  


def pair_words_single(seq_ind,window):
  pairs=[]
  for target_pos in range(len(seq_ind)):
    context_start=max(target_pos-window,0)
    context_end=target_pos
    for context_pos in range(context_start,context_end):
      pairs.append((seq_ind[context_pos],seq_ind[target_pos]))
  return pairs
def pair_words_dual(seq_ind,window):
  pairs=[]
  for target_pos in range(len(seq_ind)):
    context_start=max(target_pos-window,0)
    context_end=min(target_pos+window+1,len(seq_ind))
    for context_pos in range(context_start,context_end):
      if context_pos!=target_pos:
        pairs.append((seq_ind[context_pos],seq_ind[target_pos]))
  return pairs
def pair_words(seq_ind,window_size=2,mode='bi_dir'):       #returns context-target pairs for a sequence of words/indices, given a window size(default=2) and mode(default=bi_dir)
  if mode=='uni_dir':
    return pair_words_single(seq_ind,window_size)
  else:
    return pair_words_dual(seq_ind,window_size)

all_pairs=[]
for seq_ind in seq_inds:                #loop iterates over all indexed sequences to compute all possible target-context pairs wrt each sentence
  pairs_per_sequence=pair_words(seq_ind,window_size,mode)
  for pair in pairs_per_sequence:
    all_pairs.append(pair)
all_pairs=np.array(all_pairs)  #context target pairs

### Training the Model

*   The embedding dimensions, number of epochs and learning rate can be chosen
    *  A larger number of embedding dimensions helps when the vocabulary is large
*   Randomly initializes W1 (the embedding matrix) and W2 (the softmax weights)
*   We won't multiply the W1 matrix by the one-hot context vector; instead we select the corresponding column of W1 using the word's index, since this gives the same result with far less computation
*   F.log_softmax converts the activation z2 into log-probabilities over the vocabulary. F.nll_loss then computes the negative log-likelihood of the target word, which equals the cross-entropy loss against the target one-hot vector
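
The two claims above (column lookup equals one-hot multiplication, and the loss is the negative log-probability of the target) can be checked on a tiny example. This sketch uses NumPy instead of PyTorch so that every step is written out by hand. The sizes, the random seed and the chosen context/target indices are arbitrary assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, vocab_size = 3, 5          # toy sizes, not the notebook's
W1 = rng.standard_normal((embedding_dim, vocab_size))
W2 = rng.standard_normal((vocab_size, embedding_dim))
context, target = 2, 4                    # arbitrary word indices

# one-hot multiplication and column lookup give the same embedding
one_hot = np.zeros(vocab_size)
one_hot[context] = 1.0
z1_matmul = W1 @ one_hot
z1_lookup = W1[:, context]
assert np.allclose(z1_matmul, z1_lookup)

# scores for every vocabulary word, then a numerically stable log-softmax
z2 = W2 @ z1_lookup
shifted = z2 - z2.max()
log_probs = shifted - np.log(np.exp(shifted).sum())

# negative log-likelihood of the target word (cross-entropy with a one-hot target)
loss = -log_probs[target]
print(loss)
```

The lookup touches only embedding_dim numbers, whereas the matrix product reads all of W1, which is why the lookup is preferred.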


In [5]:
embedding_dim=20
num_epochs = 401
learning_rate = 0.003


vocab_size=len(vocab)
W1 = Variable(torch.randn(embedding_dim, vocab_size).float(), requires_grad=True)  #embedding matrix
W2 = Variable(torch.randn(vocab_size, embedding_dim).float(), requires_grad=True)  #weights for softmax layer

losses=[]
for epoch in range(num_epochs):
    loss_val = 0
    for context, target in all_pairs:
        y_true = Variable(torch.from_numpy(np.array([target])).long())

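        z1 = W1[:,context]            # embedding of the context word (column lookup)
        z2 = torch.matmul(W2, z1)     # one raw score per vocabulary word
    
        log_softmax = F.log_softmax(z2, dim=0)

        loss = F.nll_loss(log_softmax.view(1,-1), y_true)
        loss_val += loss.item()       # add a plain float so each step's graph can be freed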
        loss.backward()
        W1.data -= learning_rate * W1.grad.data
        W2.data -= learning_rate * W2.grad.data

        W1.grad.data.zero_()
        W2.grad.data.zero_()
    l_epoch=loss_val/len(all_pairs)
    losses.append(l_epoch)
    if epoch % 50 == 0:                 # displays loss at every 50th epoch
        print(f'Loss at epoch {epoch}: {l_epoch}')

Loss at epoch 0: 9.069747924804688
Loss at epoch 50: 2.003243923187256
Loss at epoch 100: 1.6007941961288452
Loss at epoch 150: 1.537070870399475
Loss at epoch 200: 1.5180995464324951
Loss at epoch 250: 1.509350299835205
Loss at epoch 300: 1.5043885707855225
Loss at epoch 350: 1.5012295246124268
Loss at epoch 400: 1.4990565776824951


### Displaying Loss Vs Epoch Number


In [6]:
import matplotlib.pyplot as plt
plt.plot(range(num_epochs), [float(l) for l in losses])  # float() also accepts 0-dim tensors
plt.xlabel('Epoch Number')
plt.ylabel('Loss')
plt.show()

[Figure: loss plotted against epoch number]

### Displaying the Embedding Matrix
* Uses a pandas DataFrame to display the embedding matrix, with one column per word and one row per embedding dimension

In [7]:
import pandas as pd
embedding_matrix=np.array(W1.data)

dfl=pd.DataFrame(embedding_matrix)
dfl.columns=[ind_to_word[i] for i in range(vocab_size)]
dfl.index=['dim'+str(i+1) for i in range(embedding_dim)]
print('Embedding Matrix')
dfl

Embedding Matrix


|  | orange | is | oslo | would | delhi | like | mango | nairobi | india | he | ... | juice | king | queen | man | woman | capital | a | the | i | some |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dim1 | -0.718601 | 0.177996 | 1.464059 | 0.965732 | -1.536459 | -1.226401 | -0.162112 | 1.062486 | -0.438659 | 0.427767 | ... | 0.710056 | -1.382153 | -0.400128 | 1.42053 | 0.060002 | -0.193844 | 1.360463 | 0.697182 | -1.107796 | -1.458363 |
| dim2 | 0.070449 | 0.199047 | -1.117345 | 0.055307 | -1.58551 | -1.457895 | -1.333045 | 0.112163 | -0.692107 | 0.056013 | ... | -0.878259 | 0.5906 | -0.693093 | 0.270018 | 0.794295 | -1.218075 | -0.045988 | -1.225881 | -0.091753 | 1.094044 |
| dim3 | -0.558962 | 0.894313 | 1.236595 | 0.112119 | -0.270379 | -1.341065 | -1.733485 | 0.249879 | -0.449963 | 0.206287 | ... | 0.323446 | -0.803001 | 1.68949 | -0.425242 | -0.588031 | -0.002898 | 0.702282 | -0.901426 | -0.802542 | -0.092557 |
| dim4 | 0.323133 | -1.027089 | 0.898222 | -0.428105 | -0.678932 | 2.001344 | 0.155436 | 1.355423 | 0.356824 | -0.40735 | ... | 0.45851 | -1.195301 | -0.377396 | 1.151165 | -0.780778 | 0.19155 | -0.094678 | -0.331814 | -0.339736 | 1.595707 |
| dim5 | 0.734673 | 0.593885 | -0.681481 | 0.3095 | 0.175075 | 0.093503 | 1.155967 | -1.014479 | 1.073746 | -1.158621 | ... | -0.533552 | 0.001511 | -0.229285 | 0.339894 | -0.00045 | 1.151553 | -0.324833 | 0.883462 | -0.544843 | -1.489952 |
| dim6 | -2.317082 | 0.179021 | -0.450853 | 1.126296 | -1.798695 | 0.49999 | -0.874035 | -0.917235 | -1.056295 | -0.958606 | ... | -0.916634 | -0.249868 | -0.657378 | 0.484556 | -0.979836 | 1.379127 | -1.130657 | -0.058161 | 0.89474 | -0.691502 |
| dim7 | -0.6486 | 0.712807 | -0.812688 | -1.760792 | -0.361765 | -1.276452 | -1.487523 | 1.187946 | 1.033824 | 0.166655 | ... | -1.219593 | 0.088364 | -0.947047 | -0.685236 | -0.295114 | 0.947386 | 0.20477 | 0.729436 | -0.692726 | 0.112763 |
| dim8 | 1.267705 | 0.307476 | 0.034075 | -0.882373 | -0.263934 | -0.680435 | 0.12083 | 0.51326 | 1.487563 | 0.327581 | ... | -0.082723 | -0.769587 | 1.277233 | -2.561692 | -0.170014 | -0.209685 | -0.93782 | -0.907287 | -0.34071 | -0.542404 |
| dim9 | 0.512797 | -0.554603 | 0.048291 | -0.496424 | 0.878773 | -0.24552 | -0.915685 | -0.121788 | 0.738846 | 0.161169 | ... | -2.000108 | -0.063608 | -0.666742 | -0.045698 | 0.769146 | 0.982649 | 1.220369 | 0.660188 | 1.159985 | 0.651269 |
| dim10 | -0.932376 | -0.888959 | -0.272708 | 0.39767 | -1.100998 | -0.066146 | 0.77062 | 0.216297 | 0.052948 | 1.198908 | ... | 0.312746 | -1.307688 | 1.383398 | -0.159594 | 0.056675 | 0.267677 | 0.239457 | -1.798468 | -1.260083 | 0.974359 |


### Conclusion
If you've used the default parameters, you might have noticed that the embeddings aren't very accurate. This is because the training corpus is far too small for the model to generalise: ten short sentences give it very few contexts for each word.

This is understandable: even we humans would find it hard to pick up similar words and analogies from a limited set of sentences in a completely unfamiliar language.
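
One way to judge the embeddings for yourself is to list each word's nearest neighbours by cosine similarity. The sketch below is only an illustration: `most_similar` is a hypothetical helper, and the tiny 2-dimensional matrix holds made-up values, not trained ones. It assumes the same layout as the notebook's embedding matrix, with one column per word.

```python
import numpy as np

def most_similar(word, embedding_matrix, word_to_ind, ind_to_word, top_n=3):
    """Rank the other words by cosine similarity to `word` (illustrative helper)."""
    vectors = embedding_matrix.T                      # shape: (vocab, dim)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)      # avoid division by zero
    query = word_to_ind[word]
    sims = unit @ unit[query]                         # cosine similarity to every word
    order = np.argsort(-sims)                         # most similar first
    ranked = [(ind_to_word[int(i)], float(sims[i])) for i in order if int(i) != query]
    return ranked[:top_n]

# made-up embeddings: 2 dimensions (rows) x 3 words (columns)
toy_words = ['king', 'queen', 'juice']
toy_word_to_ind = {w: i for i, w in enumerate(toy_words)}
toy_ind_to_word = {i: w for w, i in toy_word_to_ind.items()}
toy_matrix = np.array([[1.0, 0.9, -1.0],
                       [0.5, 0.6,  0.2]])

neighbours = most_similar('king', toy_matrix, toy_word_to_ind, toy_ind_to_word, top_n=2)
print(neighbours)
```

Calling the helper with the notebook's own embedding_matrix, word_to_ind and ind_to_word would show which words the trained model considers close. With a corpus this small, expect the neighbours to be noisy.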



Thank You!!