# Implementing word2vec from scratch

In this notebook we will implement word2vec (the skip-gram method) from scratch using NumPy. For detailed notes, see the lecture 1 notebook of CS224N: [Lecture-1-Introduction-and-Word-Vectors](Lecture-1-Introduction-and-Word-Vectors.ipynb). Here we jump directly to the code.

As a first step, let's generate the training data we are going to use to train word2vec.
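
For skip-gram, the training data boils down to (center word, context word) pairs taken from a window around each token. The following is a minimal sketch of that idea, not the notebook's own helper: the name `make_skipgram_pairs`, the lowercase whitespace tokenization, and the toy sentence are assumptions made for illustration.

```python
def make_skipgram_pairs(sentences, context_size=2):
    """Return (center, context) word pairs for every token in every sentence.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split, which is an assumption.
    """
    pairs = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, center in enumerate(tokens):
            # Window of up to `context_size` tokens on each side of the center.
            lo = max(0, i - context_size)
            hi = min(len(tokens), i + context_size + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


pairs = make_skipgram_pairs(["the quick brown fox"], context_size=1)
print(pairs)
```

With a window of one token on each side, every inner token contributes two pairs and the first and last tokens contribute one each.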

## Variables

We will use the following variables in our implementation of `word2vec`.

- `V` : Number of unique words in the corpus (vocabulary size)
- `x` : Input layer (one-hot encoding of the input word)
- `N` : Number of neurons in the hidden layer of the neural network, i.e. the dimension of the vector representation of a word.
- `W` : Weights between the input layer and the hidden layer.
- `W'`: Weights between the hidden layer and the output layer.
- `y` : Softmax output layer holding a probability for each word in the vocabulary.
- `alpha` : Learning rate used during training.

![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fdailygrind%2FiX7rd0bVNJ.png?alt=media&token=5ac14c23-1165-4a20-be6c-c1dac015d50d)



## Implementation details

As described in the lecture notebook, the idea is to go through the corpus and treat each token as the center word and the tokens around it as context words.

- Input Layer (`x`): The one-hot encoding of the current center word. Its size is *V* x 1, where *V* is the size of the vocabulary (the number of unique words in the corpus).
- Weight Matrix (`W`): The matrix holding the center-word representation of every word in the vocabulary. It has *V* rows (one per word in the vocabulary), and each row vector has *N* dimensions, hence its size is *V* x *N*.
- Hidden Layer (`h`): The representation of the center word as a column vector. It is calculated by
    $$
        h = W^T x
    $$
    The dimension of $h$ is *N* x 1.
- Weight Matrix 2 (`W'`): The matrix that stores the context-word representation of each word in the vocabulary. Here, however, each word is represented as a column. There are *V* columns of dimension *N*, one per word in the vocabulary, so the shape is *N* x *V*.

- The intermediate output layer (`u`): The output we get when we measure the similarity between the hidden layer `h` (the center word represented as a column vector) and the matrix representing the words as context words.
    $$
        u = W'^T h
    $$

    Note that the dimension of $W'$ is *N* x *V* and the dimension of $h$ is *N* x 1, thus $u$ will be *V* x 1.

    Let $u_j$ denote the $j^{th}$ neuron of the layer $u$ and $W'_j$ the $j^{th}$ column of $W'$, i.e. the context representation of the $j^{th}$ word in the vocabulary.
    Thus $u_j = {W'_j}^T h$ is a score of how similar the center word (represented by $h$) and the $j^{th}$ word are.

- Output Layer (`y`): We apply softmax to the vector `u` to get the output layer. Softmax converts the real-valued scores into a probability distribution, so $y_j$ is the predicted probability that the $j^{th}$ word appears in the context of the center word (represented by `h`). We compare `y` with the training target `y_train`, a vector built from the training data such that an entry is 1 if the corresponding word appears in the context of the center word and 0 otherwise. Comparing `y` and `y_train` lets us calculate the loss.

- Forward propagation: The process of computing the output layer `y` from the input `x`.
- Backward propagation: Based on the loss, we compute gradients, propagate them backward through the network, and update `W` and `W'` by gradient descent so that `y` gets as close as possible to `y_train`.

- Initialization: The parameters we want to learn, i.e. `W` and `W'`, are initialized randomly.
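
The steps above can be sketched end to end on a toy problem. This is only an illustrative sketch under stated assumptions, not the class implemented below: it uses a made-up vocabulary of 6 words, a single (center, context) pair so that `y_train` is one-hot, a cross-entropy loss, and an arbitrary learning rate. The names `forward` and `loss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 3                           # toy vocabulary size and embedding dimension

W = rng.uniform(-0.8, 0.8, (V, N))    # center-word vectors, one row per word
W1 = rng.uniform(-0.8, 0.8, (N, V))   # context-word vectors (W'), one column per word

x = np.zeros((V, 1))
x[2] = 1                              # one-hot center word (index 2, chosen arbitrarily)
y_train = np.zeros((V, 1))
y_train[4] = 1                        # one-hot context word (index 4, chosen arbitrarily)


def softmax(u):
    e = np.exp(u - np.max(u))         # subtract the max for numerical stability
    return e / e.sum()


def forward(W, W1, x):
    h = W.T @ x                       # (N, 1) hidden layer
    u = W1.T @ h                      # (V, 1) similarity scores
    y = softmax(u)                    # (V, 1) probabilities
    return h, u, y


def loss(y, y_train):
    # Cross-entropy against the one-hot target (assumed loss).
    return float(-(y_train * np.log(y)).sum())


h, u, y = forward(W, W1, x)
before = loss(y, y_train)

alpha = 0.05                          # assumed learning rate for this toy run
for _ in range(50):
    h, u, y = forward(W, W1, x)
    error = y - y_train               # gradient of the loss w.r.t. u
    dW1 = h @ error.T                 # (N, V) gradient for W'
    dW = x @ (W1 @ error).T           # (V, N); only the center word's row is non-zero
    W1 = W1 - alpha * dW1
    W = W - alpha * dW

_, _, y_new = forward(W, W1, x)
after = loss(y_new, y_train)
print("shapes:", h.shape, u.shape, y.shape)
print("loss before / after:", before, after)
```

The shapes match the dimensions listed above (`h` is N x 1, `u` and `y` are V x 1), and the loss on this single pair should go down as `W` and `W'` are updated.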



In [6]:
import numpy as np
from collections import defaultdict

class word2vec(object):
    def __init__(self):
        self.N = 20
        self.X_train = []
        self.y_train = []
        self.context_size = 5
        self.alpha = 0.001
        self.tokens = []
        self.token_index = {}
    
    def initialize(self, V, corpus):
        self.V = V
        self.W = np.random.uniform(-0.8, 0.8, (self.V, self.N))
        self.W1 = np.random.uniform(-0.8, 0.8, (self.N, self.V))
        self.tokens = corpus
        for i in range(len(corpus)):
            self.token_index[corpus[i]] = i
            
    def softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()
    
    def forward_pass(self, X):
        self.h = np.dot(self.W.T, X).reshape(self.N, 1)
        self.u = np.dot(self.W1.T, self.h)
        self.y = self.softmax(self.u)
        return self.y
    
    def backward_pass(self, x_train, y_train):
        # Difference between predicted probabilities and the target vector.
        error = self.y - np.asarray(y_train).reshape(self.V, 1)
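        # NOTE: the original cell is cut off after `np.dot(`. The lines below are an
        # assumed completion: propagate `error` back through the two linear layers
        # and take one gradient-descent step.
        dldW1 = np.dot(self.h, error.T)  # shape (N, V)
        dldW = np.dot(np.asarray(x_train).reshape(self.V, 1), np.dot(self.W1, error).T)  # shape (V, N)
        self.W1 = self.W1 - self.alpha * dldW1
        self.W = self.W - self.alpha * dldW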

Let's briefly test the training data generation.

In [5]:
corpus = ['''You could never convince a monkey to give you a banana by promising him limitless bananas after death in monkey heaven''',
'''History is something that very few people have been doing while everyone else was ploughing fields and carrying water buckets.''']
(X, y, word_index, index_word) = generate_training_data(corpus, context_size=3)
vocab_size = len(index_word)
print("Vocabulary size:", vocab_size)
print("Shape of training data: ", X.shape, y.shape)
print("Word to index: ", word_index)

38
Vocabulary size: 38
Shape of training data:  (1, 216) (1, 216)
Word to index:  {'a': 0, 'after': 1, 'and': 2, 'banana': 3, 'bananas': 4, 'been': 5, 'buckets.': 6, 'by': 7, 'carrying': 8, 'convince': 9, 'could': 10, 'death': 11, 'doing': 12, 'else': 13, 'everyone': 14, 'few': 15, 'fields': 16, 'give': 17, 'have': 18, 'heaven': 19, 'him': 20, 'history': 21, 'in': 22, 'is': 23, 'limitless': 24, 'monkey': 25, 'never': 26, 'people': 27, 'ploughing': 28, 'promising': 29, 'something': 30, 'that': 31, 'to': 32, 'very': 33, 'was': 34, 'water': 35, 'while': 36, 'you': 37}


We also need each word to be represented as a one-hot encoding, so let's do that too.

In [14]:
def word2OneHot(word, word_index):
    vocab_size = len(word_index)
    y_one_hot = np.zeros((vocab_size))
    y_one_hot[word_index[word]] = 1
    return y_one_hot

In [13]:
word2OneHot('monkey', word_index)

(38,)


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0.])
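
Multiplying `W^T` by a one-hot vector simply selects one row of `W`, which is why the hidden layer `h` is just the center word's vector. A small sketch with a toy vocabulary (the `toy_index` mapping and the matrix values are made up for illustration):

```python
import numpy as np

toy_index = {'banana': 0, 'monkey': 1, 'water': 2}   # hypothetical toy vocabulary
W = np.arange(12, dtype=float).reshape(3, 4)          # V=3 words, N=4 dimensions

one_hot = np.zeros(len(toy_index))
one_hot[toy_index['monkey']] = 1

h = W.T @ one_hot                                     # hidden layer via matrix product
print(h)                                              # [4. 5. 6. 7.]
print(np.array_equal(h, W[toy_index['monkey']]))      # True
```

In practice this means the matrix product can be replaced by a direct row lookup, which avoids building the one-hot vector at all.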