In [None]:
%matplotlib inline

import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Word embeddings

Up to now we've represented sentences as bag-of-words vectors.
That is fine if you only care about which words are used, but when the structure of the sentence matters, you need a different approach.

Note that a word is a linguistic object whilst in NLP we talk about **tokens** instead.
A text is broken up into a sequence of tokens, which are strings in the text that are not further decomposed.
They can be of any complexity you choose such as whole words, parts of words, characters, or even bytes.

## One-hot vectors

Let's start by looking at just one token before thinking about entire sentences.
In the context of inputs to neural networks, how can you refer to a single element out of a finite set of elements?
Remember that all inputs to a neural network must be numeric.

One way would be by using a single index number; for example 0 would refer to the first element in the set, 1 would refer to the second, and so on.
The problem with this approach is that:

1. Neural networks don't work well with large and varied inputs and you should keep your inputs close to zero.
1. It implies to the neural network that the first element is more similar to the second than it is to the third, since their numbers are closer.

A better way is to do something similar to the bag-of-words vector, but with just one word, that is, a vector of 0s with a single 1 somewhere depending on which token is used.
This is called a one-hot vector.
So the first token in a vocabulary of 3 is represented as `[1, 0, 0]` whilst the third token in a vocabulary of 5 is represented as `[0, 0, 1, 0, 0]`.
The nice thing about one-hot vectors is that you can easily generate them using an identity matrix, that is, a square matrix consisting of zeros except for the diagonal which consists of ones:

In [None]:
identity_matrix = np.eye(4)
print(identity_matrix)

For a given vocabulary size, which corresponds to the length of the identity matrix's side, you can get the one-hot vector for any index by picking the corresponding row from the identity matrix:

In [None]:
print(identity_matrix[0])

Or even a sequence of one-hot vectors:

In [None]:
one_hots = identity_matrix[[2, 2, 0]]
print(one_hots)

This is how a sentence of tokens is represented: each token in the sentence becomes a separate one-hot vector and the vectors are put together as the rows of a matrix.
The simplest way to feed such a sentence into a neural network is to **flatten** the matrix of one-hot vectors into a single vector.
This can be done by **reshaping** the 3x4-matrix into a 12-vector:

In [None]:
num_tokens = one_hots.shape[0]
vec_size = one_hots.shape[1]
print(one_hots.reshape([num_tokens*vec_size]))

As an example, let's consider a fixed vocabulary consisting of `['cat', 'dog', 'bit', 'scratched', 'the']`.
The sentence 'the dog bit the cat' would be represented as follows:

In [None]:
vocab = ['cat', 'dog', 'bit', 'scratched', 'the']
identity_matrix = np.eye(len(vocab))
tokens = 'the dog bit the cat'.split(' ')

print('tokens:', tokens)

indexes = [vocab.index(token) for token in tokens]
print('indexes:', indexes)
print()

one_hots = identity_matrix[indexes]
print('one-hots:')
print(one_hots)
print()

print('flattened one-hots:')
num_tokens = one_hots.shape[0]
print(one_hots.reshape([num_tokens*len(vocab)]))

As already shown in the previous exercises, the vocabulary is chosen from the tokens used in the training set.
Either the $v$ most frequent tokens in the training set are used, where $v$ is a chosen vocabulary size, or all the tokens that appear at least $f$ times in the training set, where $f$ is a chosen minimum frequency.
When not using bags of words, you don't ignore stop words because those are important for understanding the structure of a sentence.
We use the most frequent tokens in the training set as a vocabulary because the more times a neural network sees a token being used in the training set, the more likely it is to learn how the token is used.
A token that is only used once will not allow the neural network to understand the kinds of contexts it typically appears in, so it might as well be ignored.
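
The two selection strategies can be sketched with `collections.Counter`. The toy training set and the values of `v` and `f` below are made up for illustration only.

```python
import collections

# Hypothetical toy training set, already tokenised.
toy_train = [
    'the dog bit the cat'.split(' '),
    'the cat scratched the dog'.split(' '),
    'the dog barked'.split(' '),
]

toy_frequencies = collections.Counter(token for text in toy_train for token in text)

# Option 1: keep the v most frequent tokens.
v = 3
vocab_top_v = [token for (token, _) in toy_frequencies.most_common(v)]

# Option 2: keep every token that occurs at least f times.
f = 2
vocab_min_freq = sorted(token for (token, count) in toy_frequencies.items() if count >= f)

print('frequencies:', toy_frequencies)
print('top', v, 'tokens:', vocab_top_v)
print('tokens with frequency >=', f, ':', vocab_min_freq)
```

A fixed `v` gives direct control over the vocabulary size, whereas a minimum frequency `f` lets the size follow from the data.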

But then, what do we do with the tokens that are not in the vocabulary?

## Special tokens

Your vocabulary will not consist of just strings extracted from the text but will also include made-up tokens that exist only for our convenience.
One example of this is the **unknown token**, also known as the **out-of-vocabulary token** (**OOV**), used in place of any token that is not in the vocabulary.
Let's see an example:

In [None]:
vocab = ['<UNK>', 'cat', 'dog', 'bit', 'scratched', 'the']
identity_matrix = np.eye(len(vocab))
tokens = 'the handsome dog bit the beautiful cat'.split(' ')

print('tokens:', tokens)

cleaned_tokens = [token if token in vocab else '<UNK>' for token in tokens]

print('cleaned tokens:', cleaned_tokens)

indexes = [vocab.index(token) for token in cleaned_tokens]
print('indexes:', indexes)
print()

one_hots = identity_matrix[indexes]
print('one-hots:')
print(one_hots)
print()

print('flattened one-hots:')
num_tokens = one_hots.shape[0]
print(one_hots.reshape([num_tokens*len(vocab)]))

In order for the unknown token to be useful, it must be used in the training set as well, which means that you can't have all the tokens in the training set be in the vocabulary, or else your neural network will not learn how to use it.
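
One way to make sure of this, sketched below under the assumption that the vocabulary is chosen by minimum frequency, is to replace the rare tokens of the training set itself with `<UNK>`. The toy sentences and `toy_min_freq` are hypothetical.

```python
import collections

# Hypothetical toy training set, already tokenised.
toy_train = [
    'the dog bit the cat'.split(' '),
    'the handsome dog bit the cat'.split(' '),
]
toy_min_freq = 2  # Assumed threshold for this illustration.

counts = collections.Counter(token for text in toy_train for token in text)
kept = {token for (token, count) in counts.items() if count >= toy_min_freq}

# Rare training tokens become '<UNK>', so the model sees it during training.
toy_train_cleaned = [
    [token if token in kept else '<UNK>' for token in text]
    for text in toy_train
]
print(toy_train_cleaned[1])
```

Here 'handsome' occurs only once, so it is the only token that gets replaced.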

Another special token you'll typically need is the **pad token**, a token that means 'no token'.
The reason you'll need this is because your neural network is going to expect a certain number of tokens as input.
The flattened vector representing the text needs to be a certain size, and that size is the number of tokens times the one-hot vector size.
Changing the number of tokens will change the flattened vector size, and a neural network can only have a fixed input vector size.
This is because the input vector is going to be multiplied with the weight matrix of the first layer, which is only compatible with a certain size of vector.
So if the number of tokens in your text is smaller than the expected number, you'll need to fill in the missing tokens with something (if there are too many tokens, all you can do is drop some of them).
So we use the pad token to add extra tokens to the end of the sentence (or the beginning, just be consistent) in order to make it have the expected number of tokens, like this:

In [None]:
expected_len = 10
vocab = ['<PAD>', '<UNK>', 'cat', 'dog', 'bit', 'scratched', 'the']
identity_matrix = np.eye(len(vocab))
tokens = 'the handsome dog bit the beautiful cat'.split(' ')

print('tokens:', tokens)

cleaned_tokens = [token if token in vocab else '<UNK>' for token in tokens]

print('cleaned tokens:', cleaned_tokens)

print('expected number of tokens:', expected_len)

padded_tokens = cleaned_tokens + ['<PAD>']*(expected_len - len(cleaned_tokens))

print('padded tokens:', padded_tokens)

indexes = [vocab.index(token) for token in padded_tokens]
print('indexes:', indexes)
print()

one_hots = identity_matrix[indexes]
print('one-hots:')
print(one_hots)
print()

print('flattened one-hots:')
num_tokens = one_hots.shape[0]
print(one_hots.reshape([num_tokens*len(vocab)]))
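
The padding above assumes that the text is no longer than `expected_len`. A minimal sketch that also handles texts that are too long, by dropping tokens from the end, could look like this. `pad_or_truncate` is a hypothetical helper written for this illustration, and dropping from the end is only one possible choice.

```python
def pad_or_truncate(tokens, expected_len, pad_token='<PAD>'):
    """Return exactly expected_len tokens: cut the end off long texts and pad short ones."""
    clipped = tokens[:expected_len]
    return clipped + [pad_token]*(expected_len - len(clipped))

short_text = 'the dog bit the cat'.split(' ')
long_text = 'the dog bit the cat and the cat scratched the dog'.split(' ')

print(pad_or_truncate(short_text, 7))
print(pad_or_truncate(long_text, 7))
```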

And now we have everything we need to process a text of tokens with a neural network!
We still need to be able to accept a batch of texts rather than a single text at a time, though.
Let's see an example in PyTorch.

Let's try making a toy neural network perform sentiment analysis (recognise if a sentence is speaking positively about something).
We'll use a very simple set of sentences constructed in such a way that it is impossible to predict the answer by just looking for a single token.

In [None]:
train_x = [
    'I like it .'.split(' '),
    'I hate it .'.split(' '),
    'I don\'t hate it .'.split(' '),
    'I don\'t like it .'.split(' '),
]
train_y = torch.tensor([
    [1],
    [0],
    [1],
    [0],
], dtype=torch.float32, device=device)

max_len = max(len(text) for text in train_x)
print('max_len:', max_len)

vocab = ['<PAD>'] + sorted({token for text in train_x for token in text})
print('vocab:', vocab)
print()

identity_matrix = np.eye(len(vocab))

train_x_padded = [text + ['<PAD>']*(max_len - len(text)) for text in train_x]
print('train_x_padded:')
print(train_x_padded)
print()

train_x_indexed = np.array([[vocab.index(token) for token in text] for text in train_x_padded], np.int64)
print('train_x_indexed:')
print(train_x_indexed)
print()

train_x_one_hots = torch.tensor(identity_matrix[train_x_indexed], dtype=torch.float32, device=device)
print('train_x_one_hots:')
print(train_x_one_hots)
print()

print('train_x_flattened:')
train_x_flattened = torch.flatten(train_x_one_hots, 1, 2) # Flatten from the second dimension to the third dimension of the 3D tensor (the first dimension, the batch, is kept).
print(train_x_flattened)

Note how the use of batches of texts rather than a single text makes `train_x_one_hots` a three-dimensional tensor.
It's filled with the one-hot vectors (1D) of 5 tokens (2D) of 4 texts (3D), as shown below:

![](3d_tensor_tokens.png)

The 3D tensor was then flattened into a 2D tensor by merging all the one-hot vectors of the same text together into a single vector.
This 2D tensor is what the neural network will see as an input.
We used PyTorch's `flatten` instead of `reshape` because it is easier to use.
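
To make the comparison concrete, here is a small sketch on a dummy tensor with the same shape as `train_x_one_hots` (4 texts, 5 tokens, vocabulary of 7). The tensor contents are arbitrary.

```python
import torch

# A dummy batch: 4 texts, 5 tokens each, vectors of size 7.
example = torch.arange(4*5*7, dtype=torch.float32).reshape(4, 5, 7)

flat_a = torch.flatten(example, 1, 2)          # Merge dimensions 1 and 2; no sizes needed.
flat_b = example.reshape(example.shape[0], 5*7)  # Same result, but the new sizes must be spelled out.

print(flat_a.shape, flat_b.shape)
print(torch.equal(flat_a, flat_b))
```

Both calls give the same 4x35 tensor; `flatten` just saves you from working out the new shape yourself.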

In [None]:
class Model(torch.nn.Module):

    def __init__(self, vocab_size, expected_len, hidden_size):
        super().__init__()
        self.layer1 = torch.nn.Linear(vocab_size*expected_len, hidden_size)
        self.layer2 = torch.nn.Linear(hidden_size, 1)

    def forward(self, x):
        hidden = torch.nn.functional.leaky_relu(self.layer1(x), 0.1)
        return self.layer2(hidden)

model = Model(len(vocab), max_len, hidden_size=2)
model.to(device)

optimiser = torch.optim.SGD(model.parameters(), lr=1.0)

print('epoch', 'error')
errors = []
for epoch in range(1, 100+1):
    optimiser.zero_grad()
    logits = model(train_x_flattened)
    error = torch.nn.functional.binary_cross_entropy_with_logits(logits, train_y)
    errors.append(error.detach().cpu().tolist())
    error.backward()
    optimiser.step()

    if epoch%10 == 0:
        print(epoch, errors[-1])
print()

with torch.no_grad():
    print('text', 'prediction')
    output = torch.sigmoid(model(train_x_flattened))[:, 0].cpu().numpy()
    for (text, y) in zip(train_x, output):
        print(text, y)

(fig, ax) = plt.subplots(1, 1)
ax.set_xlabel('epoch')
ax.set_ylabel('$E$')
ax.plot(range(1, len(errors) + 1), errors, color='blue', linestyle='-', linewidth=3)
ax.grid()

## The embedding matrix

One-hot vectors, while usable, are a huge waste of space.
If your vocabulary is a thousand tokens long, you're keeping 1000 numbers in memory just to specify a single token.
Keep in mind that vocabularies with tens of thousands of tokens are normal nowadays.
Can we do better?

It turns out that we don't really need one-hot vectors at all.
All we need is a way to represent tokens as vectors.
One-hot vectors are called **sparse vectors**, because they are big but consist almost entirely of zeros, so they contain very little information.
Alternatively we can create **dense vectors**, which are small vectors that contain a lot of information.
These dense token vectors, referred to as **word embeddings**, will consist of a lot of random-looking small numbers that can be both positive and negative.
They are not directly interpretable like one-hot vectors, but they work really well.

All we need to do is replace the first layer with a single matrix called an **embedding matrix**.
The act of turning tokens into vectors is called embedding the tokens into vector space.
The embedding matrix will have a number of rows equal to the vocabulary size and a number of columns equal to the hidden layer size.
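
The reason a single matrix is enough can be seen for one token at a time: multiplying a one-hot vector by a weight matrix just selects one row of that matrix, so the row can be looked up directly by index instead. The NumPy sketch below checks this with random stand-in weights. The sizes and token indexes are arbitrary, and the bias of a linear layer is ignored.

```python
import numpy as np

demo_vocab_size = 5
demo_hidden_size = 3

# Stand-in for a weight matrix (random values, for illustration only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(demo_vocab_size, demo_hidden_size))

token_indexes = [4, 1, 1]
demo_one_hots = np.eye(demo_vocab_size)[token_indexes]

via_multiplication = demo_one_hots @ weights  # One-hot vectors times the matrix.
via_lookup = weights[token_indexes]           # Picking rows by index, as an embedding matrix does.

print(np.allclose(via_multiplication, via_lookup))
```

The lookup avoids both building the one-hot vectors and multiplying by all those zeros.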

Let's see how this is done in PyTorch:

In [None]:
vocab_size = 4
hidden_size = 2

embedding_matrix = torch.nn.Embedding(vocab_size, hidden_size)
embedding_matrix.to(device)
print('embedding_matrix:')
print(embedding_matrix.weight.data)
print()

indexed_tokens = torch.tensor([[0, 1, 2], [0, 3, 3]], dtype=torch.int64, device=device)
print('indexed_tokens (two texts of three tokens each):')
print(indexed_tokens)
print()

embedded = embedding_matrix(indexed_tokens)
print('embedded tokens:')
print(embedded)

The result is a 3D tensor, just like with one-hot vectors, but the vector size can now be much smaller (just 2 numbers here).

`torch.nn.Embedding` will now replace `torch.nn.Linear` as the first layer.
Its job is to replace token indexes with token vectors.
The numbers in this matrix are optimised in the same way as any other parameter in the neural network.
Note that what is being optimised are the rows in the matrix that correspond to tokens used in the training set.
In fact, if a token in the vocabulary is never used in the training set, the row corresponding to that token will not be changed at all during training.
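
This can be checked with a small experiment. The sketch below takes one plain SGD step (no momentum or weight decay) on an arbitrary loss, using only token indexes 0 and 1 out of a vocabulary of 4. The `demo_` names and the loss are made up for this illustration.

```python
import torch

torch.manual_seed(0)
demo_embedding = torch.nn.Embedding(4, 2)
before = demo_embedding.weight.detach().clone()

# Only token indexes 0 and 1 are used; 2 and 3 never appear.
used_tokens = torch.tensor([[0, 1, 1]], dtype=torch.int64)
demo_optimiser = torch.optim.SGD(demo_embedding.parameters(), lr=1.0)

demo_optimiser.zero_grad()
loss = demo_embedding(used_tokens).sum()  # Arbitrary loss, just to get gradients.
loss.backward()
demo_optimiser.step()

# Check which rows of the embedding matrix were modified by the step.
changed = (demo_embedding.weight.detach() != before).any(dim=1)
print('rows changed by the update:', changed.tolist())
```

Only the first two rows should be reported as changed, since the gradient of the unused rows is zero.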

We can now flatten the resulting 3D tensor into a 2D one where each text is a flat vector, with no need for any other operations in the first layer.
We don't need to use an activation function for this layer because the token vectors are all learned independently from each other and are free to move into any non-linear configuration they need to.

In [None]:
flattened = torch.flatten(embedded, 1, 2)
print('flattened:')
print(flattened)

Since the resulting flattened vectors are now much smaller than the flattened one-hot vectors, we have significantly reduced the size of our model, which will make training faster.
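
To get a feel for the difference, the sketch below counts the parameters of two small architectures: a linear layer over flattened one-hot vectors followed by a linear output layer, and an embedding matrix followed by a linear output layer over the flattened embeddings. The two helper functions and the sizes used are assumptions made for this illustration.

```python
def one_hot_model_params(vocab_size, max_len, hidden_size):
    layer1 = vocab_size*max_len*hidden_size + hidden_size  # Linear weights + biases.
    layer2 = hidden_size + 1                               # Output weights + bias.
    return layer1 + layer2

def embedding_model_params(vocab_size, max_len, embed_size):
    layer1 = vocab_size*embed_size   # Embedding matrix (no bias).
    layer2 = max_len*embed_size + 1  # Output weights + bias.
    return layer1 + layer2

# Assumed sizes: 10,000 tokens in the vocabulary, texts of 50 tokens, vectors of size 64.
print('one-hot model:', one_hot_model_params(10000, 50, 64))
print('embedding model:', embedding_model_params(10000, 50, 64))
```

With these sizes the one-hot model has about 32 million parameters whereas the embedding model has well under a million.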

Let's use this embedding matrix in the previous toy neural network.

In [None]:
train_x = [
    'I like it .'.split(' '),
    'I hate it .'.split(' '),
    'I don\'t hate it .'.split(' '),
    'I don\'t like it .'.split(' '),
]
train_y = torch.tensor([
    [1],
    [0],
    [1],
    [0],
], dtype=torch.float32, device=device)

max_len = max(len(text) for text in train_x)

vocab = ['<PAD>'] + sorted({token for text in train_x for token in text})

train_x_padded = [text + ['<PAD>']*(max_len - len(text)) for text in train_x]

train_x_indexed = torch.tensor([[vocab.index(token) for token in text] for text in train_x_padded], dtype=torch.int64, device=device)
print('train_x_indexed:')
print(train_x_indexed)

In [None]:
class Model(torch.nn.Module):

    def __init__(self, vocab_size, max_len, embed_size):
        super().__init__()
        self.layer1 = torch.nn.Embedding(vocab_size, embed_size) # Using Embedding instead of Linear.
        self.layer2 = torch.nn.Linear(max_len*embed_size, 1)

    def forward(self, x):
        hidden = self.layer1(x)
        flattened = torch.flatten(hidden, 1, 2)
        return self.layer2(flattened)

model = Model(len(vocab), max_len, embed_size=2)
model.to(device)

optimiser = torch.optim.SGD(model.parameters(), lr=1.0)

print('epoch', 'error')
train_errors = []
for epoch in range(1, 100+1):
    optimiser.zero_grad()
    logits = model(train_x_indexed)
    train_error = torch.nn.functional.binary_cross_entropy_with_logits(logits, train_y)
    train_errors.append(train_error.detach().cpu().tolist())
    train_error.backward()
    optimiser.step()

    if epoch%10 == 0:
        print(epoch, train_errors[-1])
print()

with torch.no_grad():
    print('text', 'output')
    output = torch.sigmoid(model(train_x_indexed))[:, 0].cpu().tolist()
    for (text, y) in zip(train_x, output):
        print(text, y)

(fig, ax) = plt.subplots(1, 1)
ax.set_xlabel('epoch')
ax.set_ylabel('$E$')
ax.plot(range(1, len(train_errors) + 1), train_errors, color='blue', linestyle='-', linewidth=3) # Plot this model's errors, not the previous model's.
ax.grid()

And now we're processing sentences more efficiently.
Of course this method still isn't ideal because it assumes that there is a maximum length to a sentence and the model parameters must still grow with the sentence length (the weight matrix of the output layer needs to grow with the number of tokens).
Ideally there would be a way to process a sentence by looking at a few tokens at a time instead of having the entire sentence processed at once, just like we do when reading.
We'll look at better ways to process sentences with neural networks in the following topics.

Finally, here is a more efficient way to convert tokens into indexes (which we'll refer to as **indexifying** the tokens).
It is faster because it uses a dictionary to look up the index of each token and also avoids inserting pad tokens into the texts, since the array of indexes starts out filled with the pad index.
We also avoid putting everything immediately in the PyTorch tensor in order to allow for batching.

In [None]:
train_x = [
    'I like it .'.split(' '),
    'I hate it .'.split(' '),
    'I don\'t hate it .'.split(' '),
    'I don\'t like it .'.split(' '),
]

max_len = max(len(text) for text in train_x)

vocab = ['<PAD>', '<UNK>'] + sorted({token for text in train_x for token in text})
token2index = {t: i for (i, t) in enumerate(vocab)} # Use a dictionary to find the index of tokens.
pad_index = token2index['<PAD>']
unk_index = token2index['<UNK>']

train_x_indexed_np = np.full((len(train_x), max_len), pad_index, np.int64) # Make the indexes array full of pad indexes.
for i in range(len(train_x)):
    for j in range(len(train_x[i])):
        train_x_indexed_np[i, j] = token2index.get(train_x[i][j], unk_index) # Use 'get' to get the token's index if it's in the vocabulary and a default value if it isn't (the unknown token's index).
train_x_indexed = torch.tensor(train_x_indexed_np, device=device)
print('train_x_indexed:')
print(train_x_indexed)
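
Keeping the indexes in a NumPy array means that slices of it can be turned into tensors one batch at a time. A minimal sketch of such slicing is shown below, using a dummy array in place of `train_x_indexed_np` and an assumed `batch_size` of 2.

```python
import numpy as np

# Stand-in for train_x_indexed_np: 4 texts of 5 token indexes each.
demo_indexed_np = np.arange(20, dtype=np.int64).reshape(4, 5)
batch_size = 2  # Assumed value for this illustration.

batches = [
    demo_indexed_np[start:start + batch_size]
    for start in range(0, len(demo_indexed_np), batch_size)
]
for batch in batches:
    # Each batch could be converted with torch.tensor(batch, device=device) at this point.
    print(batch.shape)
```

Only the current batch then needs to be copied to the device, rather than the whole data set.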

## Exercises

### 1) Using embeddings

Rewrite the movie reviews classification program from the previous topic using full texts.
Preprocessing has been done for you.
Don't forget to calculate the test set accuracy after training.

In [None]:
min_freq = 3 # The minimum number of times a token must occur in the training set for it to be included in the vocabulary.

train_df = pd.read_csv('../data_set/sentiment/train.csv')
test_df = pd.read_csv('../data_set/sentiment/test.csv')

train_x = train_df['text']
train_y = train_df['class']
test_x = test_df['text']
test_y = test_df['class']
categories = ['neg', 'pos'] # neg -> 0, pos -> 1
cat2idx = {cat: i for (i, cat) in enumerate(categories)}

train_y_indexed = torch.tensor(
    train_y.map(cat2idx.get).to_numpy()[:, None], # Make the binary labels be a single column matrix.
    dtype=torch.float32, device=device
)
test_y_indexed = test_y.map(cat2idx.get).to_numpy()[:, None]

nltk.download('punkt')
train_x_tokens = [nltk.word_tokenize(text) for text in train_x]
test_x_tokens = [nltk.word_tokenize(text) for text in test_x]
max_len = max(max(len(text) for text in train_x_tokens), max(len(text) for text in test_x_tokens)) # Get the maximum length from both the training set and testing set.

print('First train_x_tokens:')
print(train_x_tokens[0])
print()

frequencies = collections.Counter(token for text in train_x_tokens for token in text)
vocabulary = sorted(frequencies.keys(), key=frequencies.get, reverse=True)
while vocabulary and frequencies[vocabulary[-1]] < min_freq: # Drop the rarest tokens (stop if the vocabulary becomes empty).
    vocabulary.pop()
vocab = ['<PAD>', '<UNK>'] + vocabulary
token2index = {token: i for (i, token) in enumerate(vocab)}
pad_index = token2index['<PAD>']
unk_index = token2index['<UNK>']

print('First 10 vocabulary items:')
print(vocab[:10])
print()

train_x_indexed_np = np.full((len(train_x_tokens), max_len), pad_index, np.int64)
for i in range(len(train_x_tokens)):
    for j in range(len(train_x_tokens[i])):
        train_x_indexed_np[i, j] = token2index.get(train_x_tokens[i][j], unk_index)
train_x_indexed = torch.tensor(train_x_indexed_np, device=device)

test_x_indexed_np = np.full((len(test_x_tokens), max_len), pad_index, np.int64)
for i in range(len(test_x_tokens)):
    for j in range(len(test_x_tokens[i])):
        test_x_indexed_np[i, j] = token2index.get(test_x_tokens[i][j], unk_index)
test_x_indexed = torch.tensor(test_x_indexed_np, device=device)

print('First train_x_indexed:')
print(train_x_indexed[0])
print()

print('First train_y_indexed:')
print(train_y_indexed[0])