# Implementing word2vec from scratch

In this notebook we implement word2vec (the skip-gram method) from scratch using NumPy. For detailed notes, see the CS224N lecture 1 notebook, [Lecture-1-Introduction-and-Word-Vectors](Lecture-1-Introduction-and-Word-Vectors.ipynb). Here we jump directly to the code.

As a first step, let's generate the training data we will use to train word2vec.

## Generating training data

As described in the lecture notebook, the idea is to go through the corpus and treat each token in turn as the center word and the tokens around it as context words. We do not use any sophisticated tokenization here (we only lowercase and split on whitespace), although word2vec can perform much better with it. We also maintain two dictionaries, one mapping each word to its index and one mapping each index back to its word. (This is how we represent o as the index of the outside word and c as the index of the center word, and how we map them to j in the index formula.)
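To make the windowing concrete, here is a minimal sketch on a toy sentence. The helper name `window_pairs` and the example sentence are illustrative assumptions, not part of the notebook; the function only lists the (center, context) pairs that a symmetric window of a given size produces.

```python
def window_pairs(tokens, context_size):
    """Return (center, context) token pairs for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at both ends of the token list
        lo = max(0, i - context_size)
        hi = min(len(tokens), i + context_size + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, used only for illustration
toy_tokens = "the quick brown fox".split()
toy_pairs = window_pairs(toy_tokens, context_size=1)
for center, context in toy_pairs:
    print(center, "->", context)
```

Tokens at the edges get fewer pairs than tokens in the middle, because the window is clipped rather than padded.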

In [6]:
import numpy as np
from collections import defaultdict

def generate_training_data(corpus, context_size):
    word_counts = defaultdict(int)
    tokens = []
    X, y = [], []
    # Lowercase and whitespace-split every sentence into one flat token list
    for sentence in corpus:
        word_list = sentence.split()
        for word in word_list:
            word = word.lower()
            word_counts[word] += 1
            tokens.append(word)
    # Build the vocabulary in sorted order, plus the two lookup dictionaries
    words_list = sorted(word_counts.keys())
    word_index = {word: i for i, word in enumerate(words_list)}
    index_word = {i: word for i, word in enumerate(words_list)}

    # Visit every token position (the token count, not the vocabulary size).
    # Windows may span sentence boundaries because tokens is one flat list.
    num_tokens = len(tokens)
    for i in range(num_tokens):
        context_indices = list(range(max(0, i - context_size), i)) + \
            list(range(i + 1, min(i + context_size + 1, num_tokens)))
        for j in context_indices:
            X.append(word_index[tokens[i]])
            y.append(word_index[tokens[j]])

    # Shape both arrays as (1, number_of_pairs)
    X = np.array(X)
    X = np.expand_dims(X, axis=0)
    y = np.array(y)
    y = np.expand_dims(y, axis=0)

    return (X, y, word_index, index_word)

Let's briefly test this.

In [5]:
corpus = ['''You could never convince a monkey to give you a banana by promising him limitless bananas after death in monkey heaven''',
'''History is something that very few people have been doing while everyone else was ploughing fields and carrying water buckets.''']
(X, y, word_index, index_word) = generate_training_data(corpus, context_size=3)
vocab_size = len(index_word)
print("Vocabulary size:", vocab_size)
print("Shape of training data: ", X.shape, y.shape)
print("Word to index: ", word_index)

Vocabulary size: 38
Shape of training data:  (1, 234) (1, 234)
Word to index:  {'a': 0, 'after': 1, 'and': 2, 'banana': 3, 'bananas': 4, 'been': 5, 'buckets.': 6, 'by': 7, 'carrying': 8, 'convince': 9, 'could': 10, 'death': 11, 'doing': 12, 'else': 13, 'everyone': 14, 'few': 15, 'fields': 16, 'give': 17, 'have': 18, 'heaven': 19, 'him': 20, 'history': 21, 'in': 22, 'is': 23, 'limitless': 24, 'monkey': 25, 'never': 26, 'people': 27, 'ploughing': 28, 'promising': 29, 'something': 30, 'that': 31, 'to': 32, 'very': 33, 'was': 34, 'water': 35, 'while': 36, 'you': 37}
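A quick way to sanity-check the generated pairs is to map a few indices back to words. The sketch below assumes the same `(1, m)` array layout that `generate_training_data` returns; the helper `decode_pairs` and the toy arrays are hypothetical and exist only for illustration.

```python
import numpy as np


def decode_pairs(X, y, index_word, limit=5):
    """Map the first `limit` index pairs back to (center, context) words."""
    centers = X.flatten()[:limit]
    contexts = y.flatten()[:limit]
    return [(index_word[int(c)], index_word[int(o)])
            for c, o in zip(centers, contexts)]


# Hypothetical toy data in the same (1, m) layout
toy_index_word = {0: "a", 1: "banana", 2: "monkey"}
toy_X = np.array([[2, 2, 0]])
toy_y = np.array([[0, 1, 2]])
decoded = decode_pairs(toy_X, toy_y, toy_index_word)
print(decoded)
```

In the notebook, the equivalent call would be `decode_pairs(X, y, index_word)`.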


Each word also needs a one-hot representation, so let's write a helper for that too.

In [14]:
def word2OneHot(word, word_index):
    vocab_size = len(word_index)
    y_one_hot = np.zeros((vocab_size))
    y_one_hot[word_index[word]] = 1
    return y_one_hot

In [13]:
word2OneHot('monkey', word_index)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0.])
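Since the training data is stored as arrays of indices, it can be convenient to one-hot encode many indices at once instead of one word at a time. The following is a minimal sketch under that assumption; `indices_to_one_hot` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np


def indices_to_one_hot(indices, vocab_size):
    """Return an (m, vocab_size) matrix with one one-hot row per index."""
    indices = np.asarray(indices).flatten()
    one_hot = np.zeros((indices.size, vocab_size))
    # Row k gets a 1 in the column given by indices[k]
    one_hot[np.arange(indices.size), indices] = 1
    return one_hot


# Hypothetical toy input: three word indices from a vocabulary of size 4
toy_one_hot = indices_to_one_hot(np.array([[2, 0, 1]]), vocab_size=4)
print(toy_one_hot.shape)
```

Each row sums to 1, and the position of the 1 in row k is the word index of the k-th training example.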