# Word2vec from Scratch with Python and NumPy
From article:  
https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/

The goal of word2vec, and of most NLP embedding schemes, is to translate text into vectors that can be processed with operations from linear algebra.  
Once the text is vectorized, those vectors can serve as input to predictive models that perform a useful task.

#### CBOW
Given a word in a sentence, w(t) (aka _center word_ or _target word_), CBOW uses the context or surrounding words as input and tries to predict the target word.
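
As a rough illustration of how CBOW frames the task, the sketch below builds (context, target) pairs from one tokenized sentence. The function name `cbow_pairs` and the toy sentence are assumptions made for this example and do not come from the article.

```python
def cbow_pairs(sentence, window):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        # clip the window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


# hypothetical toy sentence, one context word on each side
pairs = cbow_pairs(['the', 'quick', 'brown', 'fox'], window=1)
for context, target in pairs:
    print(context, '->', target)
```

Each pair has several input words and a single word to predict, which is the direction CBOW trains in.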

#### Skip-gram
Skip-gram does the reverse: it uses the center word as input and tries to predict the surrounding context words.

__Skip-gram__ has been shown to produce better word embeddings than __CBOW__.
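
For comparison, here is a minimal sketch of the pairs skip-gram trains on: one (center, context) pair for every context word inside the window. The helper name `skipgram_pairs` and the toy sentence are hypothetical.

```python
def skipgram_pairs(sentence, window):
    """Return (center_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        # clip the window at the sentence boundaries
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# hypothetical toy sentence, one context word on each side
pairs = skipgram_pairs(['the', 'quick', 'brown', 'fox'], window=1)
print(pairs)
```

The class in this notebook groups all context words of a center word into one training sample rather than emitting separate pairs, but the information is the same.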

#### one-hot encoding
Because we can't send text data directly through a matrix, we need to employ _one-hot encoding_.  
Each word is represented by a vector of length _V_, where _V_ is the total number of unique words in the text corpus. Every word corresponds to a single position in this vector, so the one-hot vector for the n-th word in the vocabulary is zero everywhere except at position n, where it equals 1.
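
A small sketch of the encoding, assuming a made-up four-word vocabulary sorted alphabetically (the same ordering the class below uses for its lookup dictionaries):

```python
import numpy as np

# hypothetical vocabulary; sorting fixes each word's position
vocab = sorted({'the', 'quick', 'brown', 'fox'})
word_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a length-V vector with a single 1 at the word's position."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_index[word]] = 1
    return vec


fox = one_hot('fox')
print(vocab)
print(fox)
```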

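After _one-hot encoding_, we can feed the data into the network and train it.  
Network architecture (V = vocabulary size, N = embedding dimension, C = number of context words; W1' is `w2` in the code below):  

Input Layer (V x 1) -> W1 (V x N) -> Hidden Layer (N x 1) -> W1' (N x V) -> Output Layer (C x V)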
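
To make the shapes concrete, the sketch below pushes one one-hot vector through randomly initialized matrices. The sizes V = 9 and N = 5 match the settings used later in this notebook; the random values and the chosen word position are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 9, 5                          # vocabulary size, embedding dimension

W1 = rng.uniform(-0.8, 0.8, (V, N))  # input-to-hidden weights
W2 = rng.uniform(-0.8, 0.8, (N, V))  # hidden-to-output weights

x = np.zeros(V)
x[3] = 1.0                           # one-hot vector for the word at position 3

h = W1.T @ x                         # hidden layer, shape (N,)
u = W2.T @ h                         # one raw score per vocabulary word, shape (V,)
y = np.exp(u - u.max())
y /= y.sum()                         # softmax probabilities, shape (V,)

print(h.shape, u.shape, y.shape)
```

Because `x` is one-hot, `h` is simply row 3 of `W1`, which is why the rows of that matrix can be read off as word vectors after training. The same `y` is compared against each of the C context words, which is where the C x V output comes from.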

In [1]:
import numpy as np
import re
from collections import defaultdict

In [30]:
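class word2vec():
    def __init__(self):
        # hyperparameters come from the module-level `settings` dict,
        # which must be defined before the class is instantiated
        self.n = settings['n']                  # embedding dimension
        self.eta = settings['learning_rate']    # learning rate
        self.epochs = settings['epochs']        # number of training epochs
        self.window = settings['window_size']   # context words on each side of the target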
    
    # generate training data
    def generate_training_data(self, settings, corpus):
        
        # generate word counts
        word_counts = defaultdict(int)
        for row in corpus:
            for word in row:
                word_counts[word] += 1
        
        self.v_count = len(word_counts.keys())
        
        # generate lookup dictionaries
        self.words_list = sorted(list(word_counts.keys()), reverse=False)
        self.word_index = dict((word, i) for i, word in enumerate(self.words_list))
        self.index_word = dict((i, word) for i, word in enumerate(self.words_list))
        
        training_data = []
        
        # cycle through each sentence in corpus
        for sentence in corpus:
            sent_len = len(sentence)
            
            # cycle through each word in sentence
            for i, word in enumerate(sentence):
                w_target = self.word2onehot(sentence[i])
                
                # cycle through context window
                w_context = []
                for j in range(i-self.window, i+self.window+1):
                    if j != i and j <= sent_len-1 and j >= 0:
                        w_context.append(self.word2onehot(sentence[j]))
                training_data.append([w_target, w_context])
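        # return a plain list: the samples are ragged (2 to 4 context words each),
        # and recent NumPy versions refuse to build an array from ragged nested lists
        return training_data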
    
    # convert word to one-hot encoding
    def word2onehot(self, word):
        word_vec = [0] * self.v_count  # kept as a plain list because train() calls .index(1) on it
        word_index = self.word_index[word]
        word_vec[word_index] = 1
        return word_vec
    
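    # forward pass
    def forward_pass(self, x):
        h = np.dot(self.w1.T, x)   # hidden layer, shape (n,)
        u = np.dot(self.w2.T, h)   # raw score for every vocabulary word, shape (v_count,)
        y_c = self.softmax(u)      # predicted probabilities, shape (v_count,)
        return y_c, h, u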

    # softmax activation function
    def softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)

    # train w2v model
    def train(self, training_data):
        # initialize weight matrices
        self.w1 = np.random.uniform(-0.8, 0.8, (self.v_count, self.n))  # embedding matrix (rows are the word vectors)
        self.w2 = np.random.uniform(-0.8, 0.8, (self.n, self.v_count))  # context matrix

        # cycle through each epoch
        for i in range(0, self.epochs):

            self.loss = 0

            # cycle through each training sample
            for w_t, w_c in training_data:
                y_pred, h, u = self.forward_pass(w_t)                            # forward pass
                EI = np.sum([np.subtract(y_pred, word) for word in w_c], axis=0) # calculate error
                self.backprop(EI, h, w_t)                                        # backpropagation

                self.loss += -np.sum([u[word.index(1)] for word in w_c]) + len(w_c)*np.log(np.sum(np.exp(u)))
            
            if i % 1000 == 0:
                print('EPOCH: ' + str(i) + ' LOSS: ' + str(self.loss))
        pass

    # backpropagation
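    def backprop(self, e, h, x):
        # gradients of the loss with respect to each weight matrix
        dl_dw2 = np.outer(h, e)                     # shape (n, v_count)
        dl_dw1 = np.outer(x, np.dot(self.w2, e.T))  # shape (v_count, n)

        # update weights
        self.w1 = self.w1 - (self.eta * dl_dw1)
        self.w2 = self.w2 - (self.eta * dl_dw2)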
    
    # input a word, returns a vector (if available)
    def word_vec(self, word):
        w_index = self.word_index[word]
        v_w     = self.w1[w_index]
        return v_w
    
    
    # input a vector, returns nearest word(s)
    def vec_sim(self, vec, top_n):

        # CYCLE THROUGH VOCAB
        word_sim = {}
        for i in range(self.v_count):
            v_w2 = self.w1[i]
            theta_num = np.dot(vec, v_w2)
            theta_den = np.linalg.norm(vec) * np.linalg.norm(v_w2)
            theta = theta_num / theta_den

            word = self.index_word[i]
            word_sim[word] = theta

        # sort by similarity, highest first
        words_sorted = sorted(word_sim.items(), key=lambda item: item[1], reverse=True)

        for word, sim in words_sorted[:top_n]:
            print(word, sim)
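
The `softmax` method above subtracts the maximum score before exponentiating. Softmax is unchanged by adding a constant to every score, so this does not alter the result, but it keeps `np.exp` from overflowing. The sketch below illustrates this with made-up scores.

```python
import numpy as np


def softmax(x):
    """Same formulation as the class method: shift by the max, then normalize."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)


# hypothetical scores large enough that np.exp(scores) alone would overflow
scores = np.array([1000.0, 1001.0, 1002.0])

probs = softmax(scores)
shifted = softmax(scores - 1000.0)   # same scores moved down by a constant

print(probs)
print(np.allclose(probs, shifted))
```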

In [31]:
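settings = {}
settings['n'] = 5                 # dimension of the word embeddings
settings['window_size'] = 2       # context words on each side of the target
settings['min_count'] = 0         # not used by the class above
settings['epochs'] = 5000         # number of training epochs
settings['neg_samp'] = 10         # not used by the class above
settings['learning_rate'] = 0.01  # learning rate
np.random.seed(0)                 # make the weight initialization reproducible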

corpus = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]

# initialize w2v model
w2v = word2vec()

training_data = w2v.generate_training_data(settings, corpus)

w2v.train(training_data)

EPOCH: 0 LOSS: 68.37096376709991
EPOCH: 1000 LOSS: 41.24645176265884
EPOCH: 2000 LOSS: 41.13428630385451
EPOCH: 3000 LOSS: 41.10145846484696
EPOCH: 4000 LOSS: 41.080509291741805
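
The loss accumulated in `train` is, for each sample, `E = -sum_c u[j_c] + C * log(sum_j exp(u[j]))`, where `j_c` is the position of the c-th context word and `C` is the number of context words. The sketch below computes that quantity for a single sample. The function name `skipgram_loss` is hypothetical, and the max-shifted log-sum-exp is a substitution made here for numerical safety; the class computes `np.log(np.sum(np.exp(u)))` directly.

```python
import numpy as np


def skipgram_loss(u, context_indices):
    """Negative log-likelihood of the context words given scores u for one center word."""
    u = np.asarray(u, dtype=float)
    m = u.max()
    log_sum_exp = m + np.log(np.sum(np.exp(u - m)))  # stable log(sum(exp(u)))
    return -np.sum(u[context_indices]) + len(context_indices) * log_sum_exp


# sanity check with made-up inputs: all-zero scores over 9 words, 2 context words
loss_uniform = skipgram_loss(np.zeros(9), [1, 2])
print(loss_uniform, 2 * np.log(9))
```

With identical scores the model is guessing uniformly, so each context word costs log(V).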


In [33]:
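# vec_sim expects a vector, so look up the embedding for 'fox' first.
# Passing the bare string, w2v.vec_sim('fox', 10), raises:
# TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
w2v.vec_sim(w2v.word_vec('fox'), 10)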
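
The ranking that `vec_sim` performs is a cosine-similarity search over the rows of the embedding matrix. A minimal standalone sketch follows, assuming a hypothetical helper `most_similar` and a made-up three-word embedding table; it returns the ranked list instead of printing it.

```python
import numpy as np


def most_similar(vec, embeddings, index_word, top_n):
    """Rank every row of `embeddings` by cosine similarity to `vec`."""
    vec = np.asarray(vec, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vec)
    sims = embeddings @ vec / norms
    order = np.argsort(-sims)[:top_n]          # highest similarity first
    return [(index_word[int(i)], float(sims[i])) for i in order]


# made-up 2-dimensional embeddings for three placeholder words
embeddings = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0]])
index_word = {0: 'a', 1: 'b', 2: 'c'}

ranked = most_similar(embeddings[0], embeddings, index_word, top_n=2)
print(ranked)
```

Note that the query word itself always ranks first with similarity 1, so callers usually skip the first result.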