#### How word2vec works:

- Take a 3-layer neural network (1 input layer + 1 hidden layer + 1 output layer).
- Feed it a word and train it to predict that word's neighbouring words.
- Remove the last (output) layer and keep the input and hidden layers.
- Now, input a word from within the vocabulary. The output given at the hidden layer is the ‘word embedding’ of the input word.
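
The steps above can be sketched in plain NumPy, leaving training aside. This is only an illustration under assumptions: the weights are random, biases are omitted for brevity, and the names `W_in`, `W_out`, `forward` and `embed` are hypothetical, not part of the notebook. The point is that the embedding is simply what the hidden layer produces once the output layer is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 7, 5  # assumed toy sizes

# Untrained, randomly initialised weights (hypothetical names).
W_in = rng.normal(size=(vocab_size, embedding_dim))   # input -> hidden
W_out = rng.normal(size=(embedding_dim, vocab_size))  # hidden -> output

def forward(one_hot):
    """Full network: one-hot word -> hidden layer -> scores over the vocabulary."""
    hidden = one_hot @ W_in
    scores = hidden @ W_out
    return hidden, scores

def embed(word_index):
    """Network with the output layer removed: return only the hidden activation."""
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1.0
    hidden, _ = forward(one_hot)
    return hidden

embedding = embed(3)
print(embedding.shape)  # (5,)
```

In a trained network the same `embed` call would return a learned vector instead of a random one.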

In [None]:
import tensorflow as tf
import numpy as np

corpus_raw = 'He is the king . The king is royal . She is the royal  queen '

# convert to lower case
corpus_raw = corpus_raw.lower()

words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)

words = sorted(set(words)) # remove duplicates; sorting keeps the word-to-index mapping stable across runs
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
print('VOCAB SIZE:', vocab_size)

We need to convert this to **input-output pairs** such that if we input a word, the network should predict its neighbouring words: the n words before and after it, where n is the parameter `window_size`.

![](https://cdn-images-1.medium.com/max/800/1*yiH5sZI-IBxDSQMKhvbcHw.png)
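
As a minimal sketch of the windowing idea, the helper below (a hypothetical `window_pairs`, not part of the notebook) pairs each word in a single tokenised sentence with the words up to `window_size` positions on either side.

```python
def window_pairs(tokens, window_size):
    """Return [centre, neighbour] pairs for every token within window_size positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(i - window_size, 0)
        end = min(i + window_size, len(tokens) - 1)
        for j in range(start, end + 1):
            if j != i:  # skip the centre word itself
                pairs.append([centre, tokens[j]])
    return pairs

pairs = window_pairs(['he', 'is', 'the', 'king'], window_size=1)
print(pairs)
# [['he', 'is'], ['is', 'he'], ['is', 'the'], ['the', 'is'], ['the', 'king'], ['king', 'the']]
```

This sketch compares positions rather than words, so a word that occurs twice inside one window would still be paired with its other occurrence.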

Before doing this, we create two dictionaries: one that maps words to integers and one that maps integers back to words. These will come in handy later.

Now, we will generate our training data:

In [None]:
for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

# raw sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())

WINDOW_SIZE = 2

data = []
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] : 
            if nb_word != word:
                data.append([word, nb_word])

In [None]:
### This basically gives a list of word, word pairs. (we are considering a window size of 2)
print(data)

We have our training data, but it needs to be represented in a way a computer can understand, i.e. with numbers. That’s where our word2int dict comes in handy.

Let’s go one step further and convert these numbers into one hot vectors.

In [None]:
# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp

x_train = [] # input word
y_train = [] # output word

for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))

# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)

# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))

#### Build the TF model
We take our training data and convert it into the embedded representation.
![](https://cdn-images-1.medium.com/max/800/1*Os5hj9qg1t6sr0S3DF4gyA.jpeg)

Next, we take what we have in the embedded dimension and make a prediction about the neighbour. To make the prediction we use softmax.

![](https://cdn-images-1.medium.com/max/800/1*KxWiUoe-FXPpBdATP-IHOw.jpeg)
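
For intuition, here is a small NumPy sketch of what softmax does to a vector of raw scores. The scores are made up and the function is a stand-alone illustration, not the TensorFlow op used in the model.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = np.array([2.0, 1.0, 0.1])  # made-up scores for a 3-word vocabulary
probs = softmax(scores)
print(probs, probs.sum())
```

The word with the highest score gets the highest probability, and the training step pushes that probability towards the true neighbour.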

Putting it all together:

![](https://cdn-images-1.medium.com/max/800/1*cnzY08TWRxG3lMKExbslHw.jpeg)

In [None]:
EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)

W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))


sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!

# define the loss function:
# sum over the vocabulary axis for each example, then average over the batch
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), axis=1))

# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

n_iters = 10000

# train for n_iters iterations
for step in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    # report the loss occasionally rather than on every step
    if step % 1000 == 0:
        print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))

vectors = sess.run(W1 + b1)

#### Why one hot vectors?

![](https://cdn-images-1.medium.com/max/800/1*neaOXEbp6h6kgOKVsMwLhw.png)

When we multiply a one-hot vector by W1, we simply pick out one row of W1, which is the embedded representation of the word that the one-hot vector encodes. So W1 is essentially acting as a lookup table.
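
A quick NumPy check of this claim, using a small stand-in matrix `W` rather than the trained W1 (the sizes and values are assumptions chosen for readability):

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(4, 3)  # stand-in for W1: 4 words, 3 dimensions

one_hot = np.zeros(4)
one_hot[2] = 1.0  # one-hot vector for the word with index 2

via_matmul = one_hot @ W   # what the network computes
via_lookup = W[2]          # direct row lookup

print(via_matmul)  # [6. 7. 8.]
print(np.array_equal(via_matmul, via_lookup))  # True
```

The notebook's `vectors` also include the bias `b1`, which shifts every row by the same amount and so does not change the distances between words.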

Here’s a quick function to find the closest vector to a given vector. We will then query these vectors with ‘king’, ‘queen’ and ‘royal’.

In [None]:
def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1-vec2)**2))

def find_closest(word_index, vectors):
    min_dist = float('inf') # start from positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        dist = euclidean_dist(vector, query_vector) # compute the distance once per candidate
        # skip vectors identical to the query (including the query word itself)
        if dist < min_dist and not np.array_equal(vector, query_vector):
            min_dist = dist
            min_index = index
    return min_index


In [None]:
print(int2word[find_closest(word2int['king'], vectors)])
print(int2word[find_closest(word2int['queen'], vectors)])
print(int2word[find_closest(word2int['royal'], vectors)])

In [None]:
from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors) 

In [None]:
vectors

In [None]:
from sklearn import preprocessing

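# the norm is chosen in the constructor; the second argument of fit_transform is y and is ignored
normalizer = preprocessing.Normalizer(norm='l2')
vectors = normalizer.fit_transform(vectors)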

print(vectors)

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
print(words)

for word in words:
    print(word, vectors[word2int[word]][1])
    ax.annotate(word, xy=(vectors[word2int[word]][0],vectors[word2int[word]][1] ))
# after l2 normalisation every coordinate lies in [-1, 1], so keep the axes close to that range
ax.set_xlim([-1.5, 1.5])
ax.set_ylim([-1.5, 1.5])
plt.show()