<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/rnn/rnn_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recurrent Neural Networks

In [56]:
import numpy as np

from numpy.random import randint
from collections import OrderedDict
from torch.utils import data

np.random.seed(42)

## Representing text as tokens

Let's define our dataset samples $x \in \mathbb{R}^d$, where $d$ is the feature space dimension.

With time sequences our data can be represented as $x \in \mathbb{R}^{t \, \times \, d}$, where $t$ is the sequence length.
This representation emphasises that the samples along a sequence depend on one another and are not independent and identically distributed (i.i.d.).
We will model functions as $\mathbb{R}^{t \, \times \, d} \rightarrow \mathbb{R}^c$, where $c$ is the number of classes in the output.

There are several ways to represent sequences. With text, the challenge is how to represent a word as a feature vector in $d$ dimensions, because neural networks operate on numerical inputs.

Initially, we will use a simple one-hot encoding, although this becomes inefficient for categorical variables that can take on many values (e.g. words in the English language).

### One-hot encoding over vocabulary

One way to represent a fixed number of words is to give each word a one-hot encoded vector, which consists of 0s in all cells except for a single 1 in the cell that uniquely identifies that word.

| vocabulary    | one-hot encoded vector   |
| ------------- |--------------------------|
| Paris         | $= [1, 0, 0, \ldots, 0]$ |
| Rome          | $= [0, 1, 0, \ldots, 0]$ |
| Copenhagen    | $= [0, 0, 1, \ldots, 0]$ |
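
As a minimal sketch of the table above, the snippet below builds one-hot vectors for a three-word vocabulary. The `cities` list and the `one_hot` helper are illustrative names, not part of the notebook's later code.

```python
# Hypothetical three-word vocabulary taken from the table above.
cities = ['Paris', 'Rome', 'Copenhagen']

def one_hot(word, vocabulary):
    """Return a list of 0s with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

for city in cities:
    print(city, one_hot(city, cities))
```

Each vector is as long as the vocabulary, which is why this representation grows expensive as the vocabulary grows.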

Representing a large vocabulary with one-hot encodings often becomes inefficient because of the size of each sparse vector.
To overcome this challenge it is common practice to truncate the vocabulary to contain the $k$ most used words and represent the rest with a special symbol, $\mathtt{UNK}$, to define unknown/unimportant words.
This often causes entities such as names to be represented with $\mathtt{UNK}$ because they are rare.

Consider the following text
> I love the corny jokes in Spielberg's new movie.

where an example result would be similar to
> I love the corny jokes in $\mathtt{UNK}$'s new movie.
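
One possible way to implement this truncation is sketched below. It assumes a made-up toy corpus, whitespace tokenisation, and a hypothetical helper named `truncate_vocabulary`; none of these come from the notebook itself.

```python
from collections import Counter

def truncate_vocabulary(tokens, k, unk='UNK'):
    """Keep the k most frequent tokens and map every other token to unk."""
    most_common = {token for token, _ in Counter(tokens).most_common(k)}
    return [token if token in most_common else unk for token in tokens]

# Toy corpus (an assumption for illustration): 'the' occurs 3 times, 'movie' twice.
tokens = 'the movie the jokes the movie Spielberg'.split()
print(truncate_vocabulary(tokens, k=2))
# ['the', 'movie', 'the', 'UNK', 'the', 'movie', 'UNK']
```

The rare name `Spielberg` falls outside the two most frequent tokens and is therefore replaced, mirroring the example sentence above.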

### Generating a dataset

We generate sequences of the form:

`a b EOS`,

`a a b b EOS`,

`a a a a a b b b b b EOS`

where `EOS` is a special character denoting the end of a sequence (encoded as `E` in the code below). The task is to predict the next token $t_n$, i.e. `a`, `b`, `EOS` or the unknown token `UNK`, given the preceding tokens $\{ t_{1}, t_{2}, \dots , t_{n-1}\}$, processing each sequence one token at a time. As such, the network needs to learn that, for example, 5 `a`s are followed by 5 `b`s and then an `EOS` token.

In [66]:
CHARS = ['a', 'b']
UNKNOWN = 'U'
EOS = 'E'
VOCAB = [UNKNOWN, EOS] + CHARS
VOCAB_SIZE = len(VOCAB)
NUM_SENTENCES = 2**8
P_TRAIN = int(NUM_SENTENCES * 0.8)
P_VAL = int(NUM_SENTENCES * 0.1)
P_TEST = int(NUM_SENTENCES * 0.1)

In [69]:
def generate_dataset(num_sequences):
  """
  Generate a number of sequences as our dataset.

  Each sequence is n 'a's followed by n 'b's and the EOS marker,
  with n drawn uniformly from 1 to 11 (randint excludes the upper bound).
  """
  def generate_sequence(num_tokens):
    return ''.join([c * num_tokens for c in CHARS]) + EOS

  return [generate_sequence(randint(1, 12)) for _ in range(num_sequences)]

In [70]:
sequences = generate_dataset(NUM_SENTENCES)
sequences[:5]

['aaaaabbbbbE',
 'aaaaabbbbbE',
 'aaaaaaaaaaabbbbbbbbbbbE',
 'aaaaaaabbbbbbbE',
 'aaaaaaaaabbbbbbbbbE']
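
A quick sanity check of the generated format could look like the sketch below. The helper `is_valid_sequence` is a hypothetical name introduced here; it assumes the `a`/`b` characters and the `E` end-of-sequence marker defined above.

```python
def is_valid_sequence(sequence, eos='E'):
    """Check that a sequence is n 'a's, then n 'b's, then the EOS marker (n >= 1)."""
    if not sequence.endswith(eos):
        return False
    body = sequence[:-len(eos)]
    n = len(body) // 2
    return n >= 1 and body == 'a' * n + 'b' * n

print(is_valid_sequence('aaaaabbbbbE'))  # True
print(is_valid_sequence('aabbbE'))       # False
```

In the notebook, `all(is_valid_sequence(s) for s in sequences)` would confirm that every generated sample follows the pattern.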

## Representing tokens as indices

To build a one-hot encoding, we need to assign each word in our vocabulary an index. We do that by creating two dictionaries: one that maps a word to its index in the vocabulary, and one for the reverse direction. Let's call them `word_to_idx` and `idx_to_word`. The variable `vocab_size` holds the size of our vocabulary. Words that do not exist in the vocabulary should be represented by the unknown token (`U` here) or its index. Note that the plain `OrderedDict` used below raises a `KeyError` for such words, so the fallback has to be applied explicitly, e.g. with `word_to_idx.get(word, word_to_idx[UNKNOWN])`.

In [71]:
word_to_idx = OrderedDict((word, index) for index, word in enumerate(VOCAB)) 
idx_to_word = OrderedDict((index, word) for index, word in enumerate(VOCAB))
vocab_size = len(VOCAB)
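
To illustrate how these indices lead to one-hot vectors, here is a minimal sketch that encodes a whole sequence. The function name `one_hot_encode_sequence` and the `float32` dtype are assumptions for illustration; the vocabulary is redefined so the snippet runs on its own.

```python
import numpy as np

# Same vocabulary layout as above: unknown marker, EOS marker, then the characters.
VOCAB = ['U', 'E', 'a', 'b']
word_to_idx = {word: index for index, word in enumerate(VOCAB)}

def one_hot_encode_sequence(sequence, word_to_idx, unknown='U'):
    """Encode a string as a (len(sequence), vocab_size) array of one-hot rows."""
    # dict.get supplies the unknown-token index for out-of-vocabulary tokens.
    indices = [word_to_idx.get(token, word_to_idx[unknown]) for token in sequence]
    encoding = np.zeros((len(indices), len(word_to_idx)), dtype=np.float32)
    encoding[np.arange(len(indices)), indices] = 1.0
    return encoding

encoded = one_hot_encode_sequence('abzE', word_to_idx)
print(encoded.shape)           # (4, 4)
print(encoded.argmax(axis=1))  # [2 3 0 1]
```

The character `z` is not in the vocabulary, so it is mapped to index 0, the position of the unknown token.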

## Partitioning the dataset

In [72]:
class Dataset(data.Dataset):
  def __init__(self, inputs, targets):
    self.inputs = inputs
    self.targets = targets

  def __len__(self):
    return len(self.targets)

  def __getitem__(self, index):
    X = self.inputs[index]
    y = self.targets[index]
    return X, y

In [100]:
# Inputs are all tokens except the last; targets are the same tokens shifted one step ahead.
inputs = [sequence[:-1] for sequence in sequences]
targets = [sequence[1:] for sequence in sequences]
train_set = Dataset(inputs[:P_TRAIN], targets[:P_TRAIN])
val_set = Dataset(inputs[P_TRAIN:P_TRAIN + P_VAL], targets[P_TRAIN:P_TRAIN + P_VAL])
test_set = Dataset(inputs[-P_TEST:], targets[-P_TEST:])

print(f'We have {len(train_set)} samples in the training set.')
print(f'We have {len(val_set)} samples in the validation set.')
print(f'We have {len(test_set)} samples in the test set.')

We have 204 samples in the training set.
We have 25 samples in the validation set.
We have 25 samples in the test set.
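
The three counts add up to 254 rather than 256 because `int()` truncates each product. The sketch below reproduces that arithmetic with the same constants and slicing scheme; the lowercase variable names are illustrative.

```python
NUM_SENTENCES = 2**8
p_train = int(NUM_SENTENCES * 0.8)  # 204.8 truncates to 204
p_val = int(NUM_SENTENCES * 0.1)    # 25.6 truncates to 25
p_test = int(NUM_SENTENCES * 0.1)   # 25.6 truncates to 25

# Index ranges implied by the slices used for the validation and test sets.
val_indices = range(p_train, p_train + p_val)
test_indices = range(NUM_SENTENCES - p_test, NUM_SENTENCES)

used = set(range(p_train)) | set(val_indices) | set(test_indices)
unused = sorted(set(range(NUM_SENTENCES)) - used)
print(unused)  # [229, 230]
```

Two samples therefore end up in none of the three sets, which is harmless here but worth knowing when the dataset is small.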


In [104]:
inputs[1]

'aaaaabbbbb'

In [105]:
train_set.targets[1]

'aaaabbbbbE'
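
To make the one-step shift between inputs and targets explicit, the sketch below pairs each input token with the token the model should predict next. The variable names are illustrative, and the sequence is the sample shown above.

```python
sequence = 'aaaaabbbbbE'
example_input = sequence[:-1]   # every token except the final EOS marker
example_target = sequence[1:]   # the same tokens shifted one step ahead

# Each pair is (token seen at step i, token to predict at step i).
pairs = list(zip(example_input, example_target))
for step, (current, following) in enumerate(pairs):
    print(step, current, '->', following)
```

At step 4 the pair is `('a', 'b')`, the switch from `a`s to `b`s, and the final pair is `('b', 'E')`, where the model must predict the end of the sequence.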