# Breaking Down Embeddings

### Introduction

### The skipgram problem

With the skipgram model, we train a model that predicts, for each word, which words appear nearby.  The goal is that by learning to predict nearby words, the vector associated with a word will capture information about the words that typically surround it, and thus information about the word itself.

Before moving on to how we accomplish this task, let's make sure we understand our feature and target data.  Take a look at the last line in the diagram below.  The highlighted word is "fox", and this word is our input feature.  From the text, we get four observations where `fox` is the input, with `quick`, `brown`, `jumps` and `over` each taking a turn as the target.  Notice that we've now set up a classification problem: given an input word, we predict, for each word in our vocabulary, how likely it is to appear in the window.

<img src="./skipgram.png" width="60%">
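As a rough sketch of how these observations could be generated, the snippet below builds (input, target) pairs from a tokenized sentence.  The sentence, the helper name `skipgram_pairs`, and the window size of 2 are assumptions chosen to match the four targets listed for "fox"; they are not taken from a specific library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, target_word) pairs for every word in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the start and end of the sentence.
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own target
                pairs.append((center, tokens[j]))
    return pairs

# Assumed example sentence, tokenized by whitespace.
sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)

fox_pairs = [p for p in pairs if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```

Each pair becomes one training observation: the first word is the input feature and the second is the target class.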

Now, this model is implemented using a shallow neural network.  Let's see how.

### Implementing Skipgram

1. Use one-hot encoding to retrieve the word vector

Our first step is to represent each word in our vocabulary with a one-hot vector.  Then we retrieve the corresponding word embedding vector through matrix multiplication.  Because the one-hot vector is zero everywhere except at the word's index, the product simply selects the matching row of $W_e$.

$w_{e} = x \cdot W_e$

In [13]:
import numpy as np
vec_25 = np.eye(25000)[25]

In [14]:
We = np.random.randn(25000, 300)

In [17]:
w_e = vec_25.dot(We)
w_e.shape

(300,)

In [18]:
w_e[:10]

array([ 0.64997304,  1.29472606,  0.80842406, -1.96071706,  0.82823269,
       -0.86204791, -0.03670862, -0.59243377,  0.14133675, -0.3104924 ])

In [20]:
We[25][:10]

array([ 0.64997304,  1.29472606,  0.80842406, -1.96071706,  0.82823269,
       -0.86204791, -0.03670862, -0.59243377,  0.14133675, -0.3104924 ])
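The two matching outputs above suggest that the one-hot multiplication is just a row lookup.  Here is a minimal check of that equivalence on a much smaller matrix; the sizes, the seed, and the names `W_small` and `word_index` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4            # small, hypothetical sizes
W_small = rng.standard_normal((vocab_size, embed_dim))

word_index = 3
one_hot = np.eye(vocab_size)[word_index]  # 1 at position 3, 0 elsewhere

via_matmul = one_hot.dot(W_small)         # x . W_e
via_lookup = W_small[word_index]          # direct row indexing

print(np.allclose(via_matmul, via_lookup))  # True
```

Since the result is identical, indexing the row directly avoids multiplying by a vector that is almost entirely zeros.  PyTorch's `nn.Embedding` layer is built around this kind of index lookup.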

2. Pass to output layer

Now that we have a $w_e$ to represent every word in our vocabulary, and a mechanism to select each word, $x \cdot W_e$, the next step is to predict whether any other word is in our window.  For this we need an output layer.  This output layer will consist of a linear layer and a softmax function.

In [27]:
import torch.nn as nn
import torch.nn.functional as F

output_layer = nn.Linear(300, 25000)

In [26]:
from torch import tensor
w_e_tensor = tensor(w_e).float()
lin_outputs = output_layer(w_e_tensor)

lin_outputs

tensor([-0.4250, -0.6150, -0.0122,  ..., -0.6570, -0.7117, -0.5218],
       grad_fn=<AddBackward0>)

Then we pass these outputs through a softmax function, which turns these numbers into a probability, for each word in our vocabulary, of being within the window.

In [35]:
preds = F.softmax(lin_outputs, dim=0)
preds

tensor([2.2347e-05, 1.8481e-05, 3.3767e-05,  ..., 1.7721e-05, 1.6777e-05,
        2.0286e-05], grad_fn=<SoftmaxBackward>)

In [36]:
preds.shape

torch.Size([25000])
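To make the softmax step concrete, here is a small NumPy sketch of the same calculation.  The four scores are illustrative values borrowed from the printed linear outputs, and the helper `softmax` is written by hand here rather than taken from a library.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shifted = scores - scores.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([-0.4250, -0.6150, -0.0122, -0.6570])  # illustrative scores
probs = softmax(scores)

print(probs)
print(probs.sum())
```

The outputs are all positive and sum to 1, and the largest score receives the largest probability.  With 25,000 words, a perfectly uniform prediction would assign 1/25000 = 4e-05 to each word, which is the same order of magnitude as the untrained predictions shown above.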

So the embedding is used to predict the probability of each nearby word, and this is the entire Word2vec skipgram model.  Here are the steps for a single word:

* $\text{dog}$
* $x_i = [0, 0, 0, 0, 1, 0, 0]_{25000}$
* $e_{300} = x_{i} \cdot E_{25000 \times 300}$
* $o_{25000} = e_{300} \cdot O_{300 \times 25000}$
* $preds_{25000} = \text{softmax}(o_{25000})$
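The steps in this list can be sketched as a single PyTorch module.  This is a minimal illustration rather than a reference implementation: the class name `SkipGram`, the tiny sizes (7 words, 3 dimensions), and the random initialization are assumptions, and the output layer includes a bias term as `nn.Linear` does by default.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # E: one row of embed_dim numbers per word in the vocabulary.
        self.E = nn.Parameter(torch.randn(vocab_size, embed_dim))
        # O: maps an embedding back to one score per vocabulary word.
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, one_hot):
        e = one_hot @ self.E            # select the word's embedding
        o = self.output(e)              # one score per vocabulary word
        return F.softmax(o, dim=-1)     # scores -> probabilities

torch.manual_seed(0)
vocab_size, embed_dim = 7, 3            # hypothetical, tiny sizes
model = SkipGram(vocab_size, embed_dim)

# One-hot vector with a 1 at index 4, like x_i in the list above.
x_i = F.one_hot(torch.tensor(4), num_classes=vocab_size).float()
preds = model(x_i)
print(preds.shape)
```

The prediction has one entry per vocabulary word, and the only trainable pieces are `E` and the output layer.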

The skipgram model is kept purposely simple.  This is because the purpose isn't really to have a model that is as accurate as possible, but rather one whose embeddings are as representative as possible.  Notice that the only parameters for us to train are the embedding layer and the output layer, so the embedding layer has a strong influence on our predictions.
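To see the embedding and output layers being adjusted, here is a hedged sketch of a few training steps.  The text above does not specify a loss or optimizer, so cross-entropy loss and plain SGD are assumptions here, as are the tiny sizes and the word indices standing in for "fox" and its four targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 10, 4                       # hypothetical sizes
embedding = nn.Embedding(vocab_size, embed_dim)     # row lookup, like x . W_e
output_layer = nn.Linear(embed_dim, vocab_size)

params = list(embedding.parameters()) + list(output_layer.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)         # assumed optimizer

# Assumed indices: word 3 is the input, words 1, 2, 4, 5 are its targets.
inputs = torch.tensor([3, 3, 3, 3])
targets = torch.tensor([1, 2, 4, 5])

before = embedding.weight[3].detach().clone()
unused_before = embedding.weight[0].detach().clone()

losses = []
for step in range(50):
    optimizer.zero_grad()
    logits = output_layer(embedding(inputs))
    # cross_entropy applies the softmax internally, so it takes raw scores.
    loss = F.cross_entropy(logits, targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

after = embedding.weight[3].detach().clone()
print(losses[0], losses[-1])
```

Only the embedding row for the input word moves during these steps; rows for words that never appear as inputs stay where they were initialized.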

Notice also that because the performance of our model depends on how well the embedding predicts whether a word is within the window, we are really defining each word in terms of its surrounding words.  For example, if two different words have similar surrounding words, we would expect their embeddings to be fairly similar.
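One common way to measure whether two embeddings are "fairly similar" is cosine similarity, which is an assumption here since the text does not name a metric.  The three vectors below are made up for illustration and do not come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional embeddings, for illustration only.
cat = np.array([0.9, 0.1, 0.4])
dog = np.array([0.8, 0.2, 0.5])
car = np.array([-0.3, 0.9, -0.2])

print(cosine_similarity(cat, dog))
print(cosine_similarity(cat, car))
```

In this toy setup, `cat` and `dog` point in nearly the same direction and score close to 1, while `cat` and `car` score much lower.  With trained embeddings, words that share surrounding words would be expected to score high in the same way.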

One way of thinking about this assumption that we make in the skipgram model is that "a word is defined by the company it keeps".

### Summary