In [10]:
import torch
import torch.nn as nn

# Road to Quijote-like text: RNNs

Our goal is to generate Quijote-like text by implementing and training models on the Quijote novel. The idea is to implement the models from scratch and write down every step that I go through.

## Loading and preparing the data

The first step is to load and prepare the dataset. We start by loading the whole Quijote into a string:

In [3]:
with open("el_quijote.txt", "r") as f:
    text = f.read()
len(text)

1038397

Essentially, our goal is to predict the next character given a preceding sequence of characters, or **context**. The following example makes this concrete:

In [4]:
context_size = 8 # The number of characters that we want for context

context = text[:context_size]

print("(Context) -> (Next Character):")

for c in text[context_size:20]:
    print(f"'{context}' -> '{c}'")
    context = context[1:] + c

(Context) -> (Next Character):
'DON QUIJ' -> 'O'
'ON QUIJO' -> 'T'
'N QUIJOT' -> 'E'
' QUIJOTE' -> ' '
'QUIJOTE ' -> 'D'
'UIJOTE D' -> 'E'
'IJOTE DE' -> ' '
'JOTE DE ' -> 'L'
'OTE DE L' -> 'A'
'TE DE LA' -> ' '
'E DE LA ' -> 'M'
' DE LA M' -> 'A'


Therefore, we will train a model on a dataset composed of context sequences paired with their corresponding next characters, all extracted from the Quijote. In our case, we are working at the **character level**, but other approaches work with words or other kinds of tokens. In fact, state-of-the-art models use different tokenization schemes (for example, see the [ChatGPT tokenizer](https://platform.openai.com/tokenizer)).

Deep learning models cannot read raw characters directly, so we need to encode them as integers. We will assign a unique integer to each distinct character:

In [5]:
# obtain the distinct characters present in the text, in sorted order
characters = sorted(set(text))

ctoi = {c: i for i, c in enumerate(characters)} # maps each character to its integer index
itoc = {i: c for c, i in ctoi.items()} # inverse mapping: integer index back to its character

example = text[:11]
translated_example = [ctoi[c] for c in example]

print("STRING -> INTEGERS -> BACK TO STRING")
print(f"{example} -> {translated_example} -> {''.join(itoc[i] for i in translated_example)}")

STRING -> INTEGERS -> BACK TO STRING
DON QUIJOTE -> [27, 38, 37, 1, 40, 44, 32, 33, 38, 43, 28] -> DON QUIJOTE


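Now, we will build the training and validation datasets. We will take the full text and split it into sequences of length `context_size`, each paired with the character that follows it. The first 80% of the text is used for training and the remaining 20% for validation.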

In [9]:
context_size = 3

def build_dataset(text):
    X, Y = [], []
    context = text[:context_size]

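    for next_char in text[context_size:]:
        # store indices as integers (torch.long): nn.Embedding and cross entropy expect class indices
        X.append(torch.tensor([ctoi[c] for c in context], dtype=torch.long))
        Y.append(torch.tensor(ctoi[next_char], dtype=torch.long))
        context = context[1:] + next_char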
    
    X = torch.stack(X)
    Y = torch.stack(Y)

    return X, Y

n = int(0.8*len(text))

Xtr, Ytr = build_dataset(text[:n])
Xval, Yval = build_dataset(text[n:])
print(f"Xtr = {Xtr.shape}, Ytr = {Ytr.shape}, Xval = {Xval.shape}, Yval = {Yval.shape}")

Xtr = torch.Size([830714, 3]), Ytr = torch.Size([830714]), Xval = torch.Size([207677, 3]), Yval = torch.Size([207677])
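Building one tensor per character in a Python loop can be slow for a text of roughly a million characters. As a possible alternative, the sketch below builds the same sliding windows with `Tensor.unfold`. The helper name `build_dataset_fast` and the toy string are illustrative assumptions, not part of the original notebook, and the indices are stored as `torch.long`.

```python
import torch

def build_dataset_fast(text, ctoi, context_size):
    """Hypothetical helper: sliding-window (context, next character) pairs via unfold."""
    # Encode the whole text once as a 1-D tensor of integer indices.
    ids = torch.tensor([ctoi[c] for c in text], dtype=torch.long)
    # unfold(dim, size, step) returns every window of `size` consecutive indices.
    windows = ids.unfold(0, context_size + 1, 1)  # (N - context_size, context_size + 1)
    X = windows[:, :context_size]                 # the context
    Y = windows[:, context_size]                  # the character that follows the context
    return X, Y

# Toy example with its own vocabulary (assumed here only for illustration).
toy_text = "DON QUIJOTE"
toy_ctoi = {c: i for i, c in enumerate(sorted(set(toy_text)))}
X_toy, Y_toy = build_dataset_fast(toy_text, toy_ctoi, context_size=3)
print(X_toy.shape, Y_toy.shape)
```

With 11 characters and a context of 3, this yields 8 pairs, the same count the loop-based `build_dataset` produces (`len(text) - context_size`).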


Finally, it is time to build the RNN model! It will be composed of three key components:

1. We will create an [`Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer, whose task is to map each distinct token to a vector. Intuitively, mapping the tokens into a vector space allows the model to group tokens into different regions according to their similarity or dissimilarity. It can be thought of simply as a lookup table that stores the (learned) embedding for each character in the text.

    More formally, if we have a batch of sequences $X$ of shape $(B, T)$, where $B$ is the batch dimension and $T$ is the number of tokens in each sequence (i.e. `context_size`), we look up the embedding of every token in every sequence and obtain a tensor of shape $(B, T, C)$, where $C$ is the size of the embedding vector (i.e. `num_embedding`, which corresponds to the `embedding_dim` argument of `nn.Embedding`).

2. We will create an Elman RNN layer as shown in the [PyTorch Documentation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html). Essentially, for every element in the input sequence, the layer will compute the following function:

    $$h_t = \tanh(x_tW_{ih}^T + b_{ih} + h_{t-1}W_{hh}^T + b_{hh})$$

    where $h_t$ is the hidden state at time $t$ and $x_t$ is the input at time $t$. The idea is to iteratively compute a hidden state $h_t$ for each token $x_t$, $t \in \{1, 2, \dots, T\}$, by taking into account both the current token and the hidden state from the previous step, $h_{t-1}$ (in `nn.RNN`, $h_0$ defaults to zeros when it is not provided). Intuitively, this lets the RNN carry information about earlier tokens forward through the sequence.

3. Finally, after applying one or more layers, we project the feature vector to a tensor of shape $(B, n_{chars})$ containing the model's prediction for each sequence in the batch, where `n_chars` is the total number of unique characters present in the text. We can interpret these values as the **unnormalized log probabilities, or logits, of the next character**. For a single example, the output of the model will be:

    $$l = (l_0, l_1, \dots, l_{n_{chars}-1})$$

    where $l_i$ is the unnormalized log probability that the next character in the sequence is the one with index $i$. Exponentiating the logits, we obtain the unnormalized probabilities:

    $$\tilde{p}_i = e^{l_i}$$

    Then, we properly normalize the probabilities so that they sum to one:

    $$p_i = \dfrac{e^{l_i}}{\sum_j e^{l_j}}$$ 

    Essentially, we want to **maximize** the likelihood of the correct character being predicted by the model, i.e., if the correct next character has index $y$, we want to adjust the weights of the model so that $p_y$ is **increased**. Maximizing the likelihood of the training data is **equivalent** to minimizing the **negative log-likelihood, or [cross entropy](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)**, that is, $-\log p_y$ averaged over the training examples. To sum up, our model has to output a vector with the same length as the number of distinct characters in the text, and we will interpret the output as the logits of the possible next characters.
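The chain from logits to loss described in the third component can be checked numerically. The sketch below uses made-up logits for a hypothetical 5-character vocabulary and made-up target indices; it computes the softmax and the negative log-likelihood by hand and compares the result with `torch.nn.functional.cross_entropy`.

```python
import torch
import torch.nn.functional as F

# Made-up logits for a batch of 2 sequences over a hypothetical 5-character vocabulary.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0, 1.0],
                       [0.1, 0.2, 3.0, -0.5, 0.0]])
targets = torch.tensor([0, 2])  # made-up indices of the correct next characters

# Exponentiate the logits, then normalize so that each row sums to one (softmax).
unnormalized = logits.exp()
probs = unnormalized / unnormalized.sum(dim=1, keepdim=True)

# Negative log-likelihood of the correct characters, averaged over the batch.
nll = -probs[torch.arange(len(targets)), targets].log().mean()

print(probs.sum(dim=1))                                      # each row sums to 1
print(nll.item(), F.cross_entropy(logits, targets).item())   # the two values should agree
```

In practice the loss is usually computed directly from the logits, as `F.cross_entropy` does here, rather than from explicitly exponentiated values, since that is numerically more stable.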

We will build every component by using `PyTorch`: 

In [43]:
class Embedding(nn.Module):
    pass

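The `Embedding` class above is left as a stub. As a minimal sketch of how the three components described earlier could fit together, and not the author's final implementation, the code below writes the lookup table and the Elman recurrence by hand and checks the tensor shapes. The class names `ElmanRNN` and `CharRNN`, the use of `nn.Linear` to hold the weight matrices, the choice to project only the last hidden state, and the sizes (`n_chars=90`, `n_embed=16`, `n_hidden=32`) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Embedding(nn.Module):
    """Lookup table: one learned vector of size n_embed per character index."""
    def __init__(self, n_chars, n_embed):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_chars, n_embed))

    def forward(self, x):        # x: (B, T) integer indices
        return self.weight[x]    # (B, T, n_embed)

class ElmanRNN(nn.Module):
    """h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh), applied step by step."""
    def __init__(self, n_embed, n_hidden):
        super().__init__()
        self.n_hidden = n_hidden
        self.ih = nn.Linear(n_embed, n_hidden)   # holds W_ih and b_ih
        self.hh = nn.Linear(n_hidden, n_hidden)  # holds W_hh and b_hh

    def forward(self, x):        # x: (B, T, n_embed)
        B, T, _ = x.shape
        h = torch.zeros(B, self.n_hidden, device=x.device)  # h_0 = 0
        for t in range(T):
            h = torch.tanh(self.ih(x[:, t]) + self.hh(h))
        return h                 # hidden state after the last token: (B, n_hidden)

class CharRNN(nn.Module):
    """Embedding -> Elman RNN -> linear projection to one logit per character."""
    def __init__(self, n_chars, n_embed, n_hidden):
        super().__init__()
        self.embedding = Embedding(n_chars, n_embed)
        self.rnn = ElmanRNN(n_embed, n_hidden)
        self.head = nn.Linear(n_hidden, n_chars)

    def forward(self, x):        # x: (B, T)
        return self.head(self.rnn(self.embedding(x)))  # (B, n_chars)

torch.manual_seed(0)
model = CharRNN(n_chars=90, n_embed=16, n_hidden=32)  # sizes chosen arbitrarily
x = torch.randint(0, 90, (4, 3))                      # a fake batch: B=4, T=3
logits = model(x)
print(logits.shape)  # torch.Size([4, 90])
```

Indexing the weight matrix with an integer tensor is what makes the embedding a lookup table, which is also why the dataset indices need an integer dtype.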