# Word Embeddings: Encoding Lexical Semantics
Word embeddings are dense vectors of real numbers, one per word in a vocabulary. In NLP, it is almost always the case that your features are words. A simple way to represent words is one-hot encoding, where each input is $|V|$-dimensional and $V$ is the vocabulary.<br>
That is, we represent word $w$ by
$\begin{bmatrix} 0, 0,\cdots, 1,\cdots,0, 0 \end{bmatrix}$
where the 1 is in a location unique to $w$. Any other word will have a 1 in some other location, and a 0 in all other locations.

There is an enormous drawback to this representation, besides just how huge it is. It basically treats words as independent entities with no relation to each other.<br>
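The independence problem can be seen directly in a small sketch. The four-word vocabulary below is a hypothetical toy example, not data from this notebook; it only illustrates that the dot product of two different one-hot vectors is always zero.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary (hypothetical, for illustration only)
vocab = ["mathematician", "physicist", "store", "problem"]
word_to_ix = {word: i for i, word in enumerate(vocab)}

# One row per word: a |V|-dimensional vector with a single 1
one_hot = F.one_hot(torch.arange(len(vocab)), num_classes=len(vocab)).float()

math_vec = one_hot[word_to_ix["mathematician"]]
phys_vec = one_hot[word_to_ix["physicist"]]

print(math_vec)                        # the 1 sits at index 0
print(torch.dot(math_vec, phys_vec))   # zero: distinct words never overlap
print(torch.dot(math_vec, math_vec))   # one: a word only matches itself
```

Every pair of distinct words gets the same score, so the representation carries no information about which words are related.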

Suppose we are building a language model. Suppose we have sentences like
- The mathematician ran to the store.
- The physicist ran to the store.
- The mathematician solved the open problem.

Now suppose we see a new sentence that is not in our training data: `The physicist solved the open problem.` Our language model might do OK on it, but it would be better if we could use the following two facts:
- We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
- We have seen mathematician in the same role in this new unseen sentence as we are now seeing physicist.<br>
and then infer that physicist is actually a good fit for the sentence. This is what we mean by a notion of similarity: we mean semantic similarity, not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven't.

## Getting Dense Word Embeddings
How could we actually encode semantic similarity in words? One option is to think up some semantic attributes and score every word on each of them. If each attribute is a dimension, then we might give each word a vector, like this:<br>
$q_\text{mathematician} = \left[\overbrace{2.3}^\text{can run}, \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in physics}, \cdots\right]$<br>
$q_\text{physicist} = \left[\overbrace{2.5}^\text{can run}, \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in physics}, \cdots\right]$

Then we can get a measure of similarity between these words by doing:
$\text{Similarity}(\text{mathematician}, \text{physicist}) = q_\text{mathematician} \cdot q_\text{physicist}$
<br>
Although it is more common to normalize by the lengths:
$\text{Similarity}(\text{mathematician}, \text{physicist}) = \frac{q_\text{mathematician} \cdot q_\text{physicist}}{\| q_\text{mathematician} \| \, \| q_\text{physicist} \|} = \cos(\phi)$
where $\phi$ is the angle between the two vectors. That way, extremely similar words will have similarity of 1, and very dissimilar words will have similarity of -1.
We can think of sparse one-hot vectors as a special case of these new vectors, where each word has similarity 0 to every other word and each word is given its own unique semantic attribute. These new vectors are dense, which is to say their entries are (typically) non-zero.<br>
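As a quick check of the two similarity formulas, here is a minimal sketch that uses only the first three hand-picked attribute values from the example vectors above (the original vectors continue past these, so the numbers are illustrative). It computes the dot product, the normalized version by hand, and the same quantity via `F.cosine_similarity`.

```python
import torch
import torch.nn.functional as F

# First three hand-picked attributes from the example above:
# (can run, likes coffee, majored in physics)
q_mathematician = torch.tensor([2.3, 9.4, -5.5])
q_physicist = torch.tensor([2.5, 9.1, 6.4])

# Unnormalized similarity: the plain dot product
dot = torch.dot(q_mathematician, q_physicist)

# Normalized by the vector lengths: the cosine of the angle between them
cosine = dot / (q_mathematician.norm() * q_physicist.norm())

# The same quantity from PyTorch's built-in helper
builtin = F.cosine_similarity(q_mathematician, q_physicist, dim=0)

# One-hot vectors of two different words always have cosine similarity 0
e0 = torch.tensor([1.0, 0.0, 0.0])
e1 = torch.tensor([0.0, 1.0, 0.0])
one_hot_cosine = F.cosine_similarity(e0, e1, dim=0)

print(dot.item(), cosine.item(), builtin.item(), one_hot_cosine.item())
```

The large disagreement on the "majored in physics" attribute pulls the cosine well below 1, even though the other two attributes are almost identical.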
But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.<br>
In summary, word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand.
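The idea of treating embeddings as trainable parameters can be sketched in a few lines. This is a minimal illustration under stated assumptions: the toy objective (shrinking one word's vector), the sizes, and the learning rate are all arbitrary choices and are not part of the models in this notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

embeds = nn.Embedding(3, 4)  # 3 words, 4 latent dimensions (arbitrary sizes)
optimizer = optim.SGD(embeds.parameters(), lr=0.1)

before = embeds.weight.detach().clone()

# Toy objective (an assumption for illustration): shrink the vector of word 0
loss = embeds(torch.tensor([0])).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embeds.weight.detach()
print(embeds.weight.requires_grad)          # the table is a learnable parameter
print(torch.equal(before[0], after[0]))     # False: row 0 was updated
print(torch.equal(before[1:], after[1:]))   # True: unused rows got zero gradient
```

Only the rows that take part in the forward pass receive a non-zero gradient, which is why each word's vector is shaped by the contexts in which that word appears.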

# Word Embeddings in PyTorch
Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These will be keys into a lookup table. That is, embeddings are stored as a $|V| \times D$ matrix, where $D$ is the dimensionality of the embeddings, such that the word assigned index $i$ has its embedding stored in the $i$'th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named `word_to_ix`.


In [199]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7fd69c1c5670>

In [200]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype = torch.long)
hello_embed = embeds(lookup_tensor)

print(hello_embed, lookup_tensor, sep='\n')

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)
tensor([0])
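The lookup-table view can be checked directly. The sketch below reuses the two-word `word_to_ix` from the cell above and compares the result of calling the embedding layer with the corresponding row of its `weight` matrix; the three-index batch is an arbitrary example.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)  # a |V| x D = 2 x 5 matrix

print(embeds.weight.shape)

# Looking up an index returns the corresponding row of the weight matrix
ix = word_to_ix["world"]
looked_up = embeds(torch.tensor([ix], dtype=torch.long))
print(torch.equal(looked_up[0], embeds.weight[ix]))

# Several indices at once give one row per index
batch = embeds(torch.tensor([0, 1, 0], dtype=torch.long))
print(batch.shape)
```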


# An Example: N-Gram Language Modeling
In an n-gram language model, given a sequence of words $w$, we want to compute
$P(w_i | w_{i-1}, w_{i-2},\cdots ,w_{i-n+1})$
where $w_i$ is the $i$th word of the sequence.
We will compute the loss function on some training examples and update the parameters with backpropagation.
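The loss used in the cells below is the negative log-likelihood of the target word under a log-softmax output. As a small sketch of what that computes, the snippet uses random scores in place of a real model's output; the vocabulary size and target index are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 6                        # hypothetical toy vocabulary size
scores = torch.randn(1, vocab_size)   # stand-in for the model's raw output scores
target = torch.tensor([2])            # index of the true next word (arbitrary)

# Turn scores into log probabilities over the vocabulary
log_probs = F.log_softmax(scores, dim=1)
loss = nn.NLLLoss()(log_probs, target)

# NLLLoss simply picks out -log P(target)
manual = -log_probs[0, target.item()]

# cross_entropy fuses log_softmax and NLLLoss into one call
fused = F.cross_entropy(scores, target)

print(loss.item(), manual.item(), fused.item())
```

All three values agree, so the model could equally return raw scores and be trained with `nn.CrossEntropyLoss`; this notebook keeps the explicit `log_softmax` plus `NLLLoss` pairing.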

In [201]:
CONTEXT_SIZE = 3
EMBEDDING_DIM = 10

test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# we should tokenize the input, but we will ignore that for now
# build a list of tuples
# Each tuple is ([ word_i-1, ..., word_i-CONTEXT_SIZE ], target word)

ngrams = [
    (
        [test_sentence[i-j-1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    ) for i in range(CONTEXT_SIZE, len(test_sentence))
]

vocab = set(test_sentence)
word_to_ix = {word : i for i, word in enumerate(vocab)}

In [202]:
class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size) -> None:
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size*embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

In [203]:
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])

[513.1639378070831, 510.34439420700073, 507.5419828891754, 504.7551739215851, 501.98272228240967, 499.2240331172943, 496.4798057079315, 493.7489833831787, 491.0306079387665, 488.3209731578827]
tensor([-0.8247, -0.4093, -2.3699, -0.0218, -0.3094, -1.0465, -0.9334,  0.2457,
         0.1783,  0.7626], grad_fn=<SelectBackward0>)


# Computing Word Embeddings: Continuous Bag-of-Words
The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict a word given a context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually this is referred to as pretraining embeddings. It almost always helps performance by a couple of percent.

The CBOW model is as follows:<br>
Given a target word $w_i$ and an $N$-word context window on each side, $w_{i-1},\cdots, w_{i-N}$ and $w_{i+1},\cdots, w_{i+N}$, referring to all context words collectively as $C$, CBOW tries to minimize
$-\log p(w_i \mid C) = -\log \text{Softmax}\left(A\left(\sum_{w \in C} q_w\right) + b\right)$
where $q_w$ is the embedding of word $w$.
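The objective can be written out with plain tensors as a sanity check. In this sketch `Q`, `A`, and `b` are randomly initialized stand-ins for the learned parameters, and the sizes and word indices are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 8, 4             # toy sizes, chosen arbitrarily
Q = torch.randn(vocab_size, embedding_dim)   # embedding matrix, row w is q_w
A = torch.randn(vocab_size, embedding_dim)   # maps the summed context to word scores
b = torch.randn(vocab_size)                  # bias, one entry per vocabulary word

context = torch.tensor([1, 3, 4, 6])         # indices of the words in C (hypothetical)
target = 2                                   # index of w_i (hypothetical)

summed = Q[context].sum(dim=0)               # sum of q_w over w in C
log_probs = F.log_softmax(A @ summed + b, dim=0)
loss = -log_probs[target]                    # -log p(w_i | C)

print(loss.item())
```

The `CBOW` class in the cell below adds a hidden ReLU layer between the summed embeddings and the output scores, so it is a slight variation on this formula rather than a literal transcription.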

In [208]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
         [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    )
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size=CONTEXT_SIZE):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, 128)
        self.out = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        x = torch.sum(self.embeddings(inputs), dim=0).view(1,-1)
        x = F.relu(self.linear(x))
        x = self.out(x)
        x = F.log_softmax(x, dim=1)
        return x
        
# Create your model and train. Here are some functions to help you make
# the data ready for use by your module.


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


tensor([36,  8, 13, 22])

In [209]:
losses = []
loss_function = nn.NLLLoss()
model = CBOW(vocab_size, 100)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(100):
    total_loss = 0
    for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = make_context_vector(context, word_to_ix)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over the
        # target (center) word
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

print(model.embeddings.weight[word_to_ix["manipulate"]])

[232.7938220500946, 225.48803400993347, 218.46842622756958, 211.68482422828674, 205.0910131931305, 198.64530396461487, 192.34060168266296, 186.14204287528992, 180.03821539878845, 174.03275001049042, 168.10800909996033, 162.25235557556152, 156.47664922475815, 150.75634771585464, 145.10827374458313, 139.5689457654953, 134.09585565328598, 128.7139936685562, 123.42626509070396, 118.23131731152534, 113.10155031085014, 108.08978912234306, 103.19601333141327, 98.42329469323158, 93.77638283371925, 89.27157053351402, 84.90581057965755, 80.69897428154945, 76.64146822690964, 72.71896083652973, 68.97243192791939, 65.37726899981499, 61.936557561159134, 58.657966271042824, 55.54424685239792, 52.57526287436485, 49.77652046829462, 47.10767520964146, 44.600095234811306, 42.23728417605162, 40.01233260333538, 37.91750793159008, 35.9555834159255, 34.110297821462154, 32.38407306373119, 30.769475392997265, 29.25427435338497, 27.843179754912853, 26.51744243502617, 25.281926542520523, 24.122259888798, 23.0455

In [212]:
with torch.no_grad():
    context = ['manipulate', 'other', 'things', 'called']# target is 'abstract'
    context_vector = make_context_vector(context, word_to_ix)
    predict = model(context_vector)
    print(ix_to_word[predict.argmax(dim=1).item()])

with torch.no_grad():
    context = ['of', 'rules', 'a', 'program.']  # target is 'called'
    context_vector = make_context_vector(context, word_to_ix)
    predict = model(context_vector)
    print(ix_to_word[predict.argmax(dim=1).item()])

abstract
called
