In [35]:
import torch

# Road to Quijote-like text: RNNs

Our goal is to generate Quijote-like text by implementing and training models on the Quijote novel. The idea is to implement the models from scratch and write down every step I go through.

## Loading and preparing the data

The first step is to load and prepare the dataset. We begin by reading the whole Quijote into a string:

In [3]:
with open("el_quijote.txt", "r") as f:
    text = f.read()
len(text)

1038397

Essentially, our goal is to predict the next character given a preceding sequence of characters, called the **context**. The following example makes this concrete:

In [18]:
context_size = 8 # The number of characters that we want for context

context = text[:context_size]

print("(Context) -> (Next Character):")

for c in text[context_size:20]:
    print(f"'{context}' -> '{c}'")
    context = context[1:] + c

(Context) -> (Next Character):
'DON QUIJ' -> 'O'
'ON QUIJO' -> 'T'
'N QUIJOT' -> 'E'
' QUIJOTE' -> ' '
'QUIJOTE ' -> 'D'
'UIJOTE D' -> 'E'
'IJOTE DE' -> ' '
'JOTE DE ' -> 'L'
'OTE DE L' -> 'A'
'TE DE LA' -> ' '
'E DE LA ' -> 'M'
' DE LA M' -> 'A'


Therefore, we will train a model on a dataset of context sequences, each paired with its corresponding next character, extracted from the Quijote. In our case, we work at the **character level**, but other approaches work with words or other kinds of tokens. In fact, state-of-the-art models use subword tokenization schemes (for example, see the [ChatGPT tokenizer](https://platform.openai.com/tokenizer)).
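
To make the difference between tokenization levels concrete, here is a minimal sketch that splits the same phrase into characters and into words. The `sample` string and the whitespace-based word split are assumptions made for illustration only; real tokenizers such as the one linked above rely on learned subword vocabularies.

```python
# Illustrative sketch: the same phrase tokenized at two granularities.
sample = "DON QUIJOTE DE LA MANCHA"  # hypothetical sample, not read from the file

char_tokens = list(sample)    # character level: one token per character
word_tokens = sample.split()  # word level: naive whitespace split (an assumption)

print(len(char_tokens), char_tokens[:5])
print(len(word_tokens), word_tokens)
```

Working at the character level keeps the vocabulary small, at the cost of longer sequences for the same text.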

Deep learning (DL) models cannot read raw characters directly, so we need to encode them as integers. We assign a unique integer to each distinct character:

In [34]:
# Collect the distinct characters present in the text, in sorted order
characters = sorted(set(text))

ctoi = {c: i for i, c in enumerate(characters)}  # maps each character to its integer
itoc = {i: c for c, i in ctoi.items()}           # inverse mapping: integer back to character

example = text[:11]
translated_example = [ctoi[c] for c in example]

print("STRING -> INTEGERS -> BACK TO STRING")
print(f"{example} -> {translated_example} -> {''.join(itoc[i] for i in translated_example)}")

STRING -> INTEGERS -> BACK TO STRING
DON QUIJOTE -> [27, 38, 37, 1, 40, 44, 32, 33, 38, 43, 28] -> DON QUIJOTE
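
The two dictionaries can be wrapped in a pair of small helpers. The following is only a sketch: `encode`, `decode` and `sample` are hypothetical names, and the vocabulary is built from a short sample string so the block runs on its own, which means the integers differ from the ones printed above.

```python
# Minimal sketch of encode/decode helpers built on the same idea as ctoi/itoc.
# `sample`, `encode` and `decode` are hypothetical names used for illustration.
sample = "DON QUIJOTE DE LA MANCHA"

characters = sorted(set(sample))
ctoi = {c: i for i, c in enumerate(characters)}
itoc = {i: c for c, i in ctoi.items()}

def encode(s):
    """Map a string to a list of integer ids (KeyError on unseen characters)."""
    return [ctoi[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itoc[i] for i in ids)

ids = encode("QUIJOTE")
print(len(characters), ids, decode(ids))
```

Because every character gets exactly one integer, decoding an encoded string always returns the original string.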


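With the encoding in place, we build the dataset: every window of `context_size` characters becomes an input, and the character that follows it becomes the target. The first 80% of the text is used for training and the remaining 20% for validation.

In [42]: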
context_size = 3

def build_dataset(text):
    X, Y = [], []
    context = text[:context_size]

    for c in text[context_size:]:
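        # Store character ids as integers (torch.long): this is the dtype that
        # PyTorch embedding lookups and cross-entropy targets expect for class indices.
        X.append(torch.tensor([ctoi[ch] for ch in context], dtype=torch.long))
        Y.append(torch.tensor(ctoi[c], dtype=torch.long))
        context = context[1:] + c  # slide the window one character forward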
    
    X = torch.stack(X)
    Y = torch.stack(Y)

    return X, Y

n = int(0.8*len(text))

Xtr, Ytr = build_dataset(text[:n])
Xval, Yval = build_dataset(text[n:])

In [43]:
Xtr.shape, Xval.shape

(torch.Size([830714, 3]), torch.Size([207677, 3]))
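
These shapes follow directly from the sliding window: a text of N characters yields N - `context_size` (context, target) pairs. The sketch below checks that arithmetic in plain Python; `sliding_pairs` and `count_examples` are hypothetical helpers written for this illustration, not part of the notebook.

```python
# Sketch: a text of N characters yields N - context_size (context, target) pairs.
def sliding_pairs(s, context_size):
    """Return every (context, next character) pair of a string."""
    return [(s[i:i + context_size], s[i + context_size])
            for i in range(len(s) - context_size)]

def count_examples(num_chars, context_size):
    """Number of pairs the sliding window produces for a text of num_chars."""
    return max(num_chars - context_size, 0)

pairs = sliding_pairs("QUIJOTE", 3)
print(pairs)

total_chars = 1038397       # len(text) reported earlier in the notebook
context_size = 3
n = int(0.8 * total_chars)  # same split index as in the cell above

train_examples = count_examples(n, context_size)
val_examples = count_examples(total_chars - n, context_size)
print(train_examples, val_examples)
```

Each split loses `context_size` examples because its first window has no preceding characters to draw on.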