**Chapter 14 – Natural Language Processing with RNNs and Attention**

_This notebook contains all the sample code and solutions to the exercises in chapter 14._

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ageron/handson-mlp/blob/main/14_nlp_with_rnns_and_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-mlp/blob/main/14_nlp_with_rnns_and_attention.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

# Setup

This project requires Python 3.10 or above:

In [1]:
import sys

assert sys.version_info >= (3, 10)

Are we using Colab or Kaggle?

In [2]:
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

If using Colab, the TorchMetrics library is not pre-installed so we must install it manually:

In [3]:
if IS_COLAB:
    %pip install -q torchmetrics

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/983.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m983.2/983.2 kB[0m [31m55.3 MB/s[0m eta [36m0:00:00[0m
[?25h

We also need PyTorch ≥ 2.6.0:

In [4]:
from packaging.version import Version
import torch

assert Version(torch.__version__) >= Version("2.6.0")

This chapter can be very slow without a hardware accelerator, so if we can find one, let's use it:

In [5]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

device

'cuda'

Let's issue a warning if there's no hardware accelerator available:

In [6]:
if device == "cpu":
    print("Neural nets can be very slow without a hardware accelerator.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware "
              "accelerator.")
    if IS_KAGGLE:
        print("Go to Settings > Accelerator and select GPU.")

As we did in earlier chapters, let's define the default font sizes to make the figures prettier:

In [7]:
import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

Let's use the same `evaluate_tm()` and `train()` functions as in the previous chapters:

In [8]:
import torchmetrics

def evaluate_tm(model, data_loader, metric):
    model.eval()
    metric.reset()
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric.update(y_pred, y_batch)
    return metric.compute()

def train(model, optimizer, loss_fn, metric, train_loader, valid_loader,
          n_epochs, patience=2, factor=0.5, epoch_callback=None):
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", patience=patience, factor=factor)
    history = {"train_losses": [], "train_metrics": [], "valid_metrics": []}
    for epoch in range(n_epochs):
        total_loss = 0.0
        metric.reset()
        model.train()
        if epoch_callback is not None:
            epoch_callback(model, epoch)
        for index, (X_batch, y_batch) in enumerate(train_loader):
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            metric.update(y_pred, y_batch)
            train_metric = metric.compute().item()
            print(f"\rBatch {index + 1}/{len(train_loader)}", end="")
            print(f", loss={total_loss/(index+1):.4f}", end="")
            print(f", {train_metric=:.2%}", end="")
        history["train_losses"].append(total_loss / len(train_loader))
        history["train_metrics"].append(train_metric)
        val_metric = evaluate_tm(model, valid_loader, metric).item()
        history["valid_metrics"].append(val_metric)
        scheduler.step(val_metric)
        print(f"\rEpoch {epoch + 1}/{n_epochs},                      "
              f"train loss: {history['train_losses'][-1]:.4f}, "
              f"train metric: {history['train_metrics'][-1]:.2%}, "
              f"valid metric: {history['valid_metrics'][-1]:.2%}")
    return history

As we will build and download pretty big models, we will need to free the GPU RAM regularly to avoid running out of space. For this, we will delete the models and tensors as we go (using `del`), then we will call the `free_vram()` function below: it calls Python's garbage collector, and if we're using a CUDA GPU it also calls `torch.cuda.empty_cache()`:

In [9]:
import gc

def free_vram(device):
    gc.collect()
    if device == "cuda":
        torch.cuda.empty_cache()

**WARNING**: When running a Jupyter/Colab notebook, the output of each cell gets saved in the `Out` dictionary, so if the output of a cell is a large model or tensor, then it's not enough to delete the variable, you must also clear the `Out` dictionary using `Out.clear()`.

# Generating Shakespearean Text Using a Character RNN

## Creating the Training Dataset

Let's download the Shakespeare data from Andrej Karpathy's [char-rnn project](https://github.com/karpathy/char-rnn/)

In [10]:
from pathlib import Path
import urllib.request

def download_shakespeare_text():
    path = Path("datasets/shakespeare/shakespeare.txt")
    if not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        url = "https://homl.info/shakespeare"
        urllib.request.urlretrieve(url, path)
    return path.read_text()

shakespeare_text = download_shakespeare_text()

In [11]:
# extra code – shows a short text sample
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [12]:
vocab = sorted(set(shakespeare_text.lower()))
"".join(vocab)

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [13]:
char_to_id = {char: index for index, char in enumerate(vocab)}
id_to_char = {index: char for index, char in enumerate(vocab)}

In [14]:
char_to_id["a"]

13

In [15]:
id_to_char[13]

'a'

In [16]:
import torch

def encode_text(text):
    return torch.tensor([char_to_id[char] for char in text.lower()])

def decode_text(char_ids):
    return "".join([id_to_char[char_id.item()] for char_id in char_ids])

In [17]:
encoded = encode_text("Hello, world!")
encoded

tensor([20, 17, 24, 24, 27,  6,  1, 35, 27, 30, 24, 16,  2])

In [18]:
decode_text(encoded)

'hello, world!'

In [19]:
from torch.utils.data import Dataset, DataLoader

class CharDataset(Dataset):
    def __init__(self, text, window_length):
        self.encoded_text = encode_text(text)
        self.window_length = window_length

    def __len__(self):
        return len(self.encoded_text) - self.window_length

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError("dataset index out of range")
        end = idx + self.window_length
        window = self.encoded_text[idx : end]
        target = self.encoded_text[idx + 1 : end + 1]
        return window, target

In [20]:
# extra code – a simple example using CharDataset
to_be_dataset = CharDataset("To be or not to be", window_length=10)
for x, y in to_be_dataset:
    print(f"x={x}, y={y}")
    print(f"    decoded: x={decode_text(x)!r}, y={decode_text(y)!r}")

x=tensor([32, 27,  1, 14, 17,  1, 27, 30,  1, 26]), y=tensor([27,  1, 14, 17,  1, 27, 30,  1, 26, 27])
    decoded: x='to be or n', y='o be or no'
x=tensor([27,  1, 14, 17,  1, 27, 30,  1, 26, 27]), y=tensor([ 1, 14, 17,  1, 27, 30,  1, 26, 27, 32])
    decoded: x='o be or no', y=' be or not'
x=tensor([ 1, 14, 17,  1, 27, 30,  1, 26, 27, 32]), y=tensor([14, 17,  1, 27, 30,  1, 26, 27, 32,  1])
    decoded: x=' be or not', y='be or not '
x=tensor([14, 17,  1, 27, 30,  1, 26, 27, 32,  1]), y=tensor([17,  1, 27, 30,  1, 26, 27, 32,  1, 32])
    decoded: x='be or not ', y='e or not t'
x=tensor([17,  1, 27, 30,  1, 26, 27, 32,  1, 32]), y=tensor([ 1, 27, 30,  1, 26, 27, 32,  1, 32, 27])
    decoded: x='e or not t', y=' or not to'
x=tensor([ 1, 27, 30,  1, 26, 27, 32,  1, 32, 27]), y=tensor([27, 30,  1, 26, 27, 32,  1, 32, 27,  1])
    decoded: x=' or not to', y='or not to '
x=tensor([27, 30,  1, 26, 27, 32,  1, 32, 27,  1]), y=tensor([30,  1, 26, 27, 32,  1, 32, 27,  1, 14])
    decoded: x=

In [21]:
window_length = 50
batch_size = 512  # reduce if your GPU cannot handle such a large batch size
train_set = CharDataset(shakespeare_text[:1_000_000], window_length)
valid_set = CharDataset(shakespeare_text[1_000_000:1_060_000], window_length)
test_set = CharDataset(shakespeare_text[1_060_000:], window_length)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size)
test_loader = DataLoader(test_set, batch_size=batch_size)

## Embeddings

In [22]:
import torch.nn as nn

torch.manual_seed(42)
embed = nn.Embedding(5, 3)  # 5 categories × 3D embeddings
embed(torch.tensor([[3, 2], [0, 2]]))

tensor([[[ 0.2674,  0.5349,  0.8094],
         [ 2.2082, -0.6380,  0.4617]],

        [[ 0.3367,  0.1288,  0.2345],
         [ 2.2082, -0.6380,  0.4617]]], grad_fn=<EmbeddingBackward0>)

## Building and Training the Char-RNN Model

Now let's create our Shakespeare model:

In [23]:
class ShakespeareModel(nn.Module):
    def __init__(self, vocab_size, n_layers=2, embed_dim=10, hidden_dim=128,
                 dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X):
        embeddings = self.embed(X)
        outputs, _states = self.gru(embeddings)
        return self.output(outputs).permute(0, 2, 1)

torch.manual_seed(42)
model = ShakespeareModel(len(vocab)).to(device)

**Warning**: the following code may take a while to run, especially without a GPU.

In [24]:
n_epochs = 20
xentropy = nn.CrossEntropyLoss()
optimizer = torch.optim.NAdam(model.parameters())
accuracy = torchmetrics.Accuracy(task="multiclass",
                                 num_classes=len(vocab)).to(device)

history = train(model, optimizer, xentropy, accuracy, train_loader, valid_loader,
                n_epochs)

Epoch 1/20,                      train loss: 1.6044, train metric: 51.28%, valid metric: 51.62%
Epoch 2/20,                      train loss: 1.3852, train metric: 56.70%, valid metric: 53.03%
Epoch 3/20,                      train loss: 1.3554, train metric: 57.44%, valid metric: 53.56%
Epoch 4/20,                      train loss: 1.3412, train metric: 57.79%, valid metric: 53.53%
Epoch 5/20,                      train loss: 1.3328, train metric: 58.01%, valid metric: 53.49%
Epoch 6/20,                      train loss: 1.3271, train metric: 58.15%, valid metric: 53.79%
Epoch 7/20,                      train loss: 1.3230, train metric: 58.26%, valid metric: 53.74%
Epoch 8/20,                      train loss: 1.3200, train metric: 58.32%, valid metric: 54.57%
Epoch 9/20,                      train loss: 1.3172, train metric: 58.39%, valid metric: 54.55%
Epoch 10/20,                      train loss: 1.3153, train metric: 58.45%, valid metric: 53.95%
Epoch 11/20,                      train

In [25]:
torch.save(model.state_dict(), "my_shakespeare_model.pt")

In [26]:
model.eval()  # don't forget to switch the model to evaluation mode!
text = "To be or not to b"
encoded_text = encode_text(text).unsqueeze(dim=0).to(device)
with torch.no_grad():
    Y_logits = model(encoded_text)
    predicted_char_id = Y_logits[0, :, -1].argmax().item()
    predicted_char = id_to_char[predicted_char_id]  # correctly predicts "e"

In [27]:
predicted_char

'e'

## Generating Shakespearean Text

In [28]:
torch.manual_seed(42)
probs = torch.tensor([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
samples = torch.multinomial(probs, replacement=True, num_samples=8)
samples

tensor([[0, 0, 0, 0, 1, 0, 2, 2]])

In [29]:
import torch.nn.functional as F

def next_char(model, text, temperature=1):
    encoded_text = encode_text(text).unsqueeze(dim=0).to(device)
    with torch.no_grad():
        Y_logits = model(encoded_text)
        Y_probas = F.softmax(Y_logits[0, :, -1] / temperature, dim=-1)
        predicted_char_id = torch.multinomial(Y_probas, num_samples=1).item()
    return id_to_char[predicted_char_id]

In [30]:
def extend_text(model, text, n_chars=80, temperature=1):
    for _ in range(n_chars):
        text += next_char(model, text, temperature)
    return text

In [31]:
print(extend_text(model, "To be or not to b", temperature=0.01))

To be or not to be the state,
and the heavens to the stronger than the state, and the strength of


In [32]:
print(extend_text(model, "To be or not to b", temperature=0.4))

To be or not to be so deserved the grace
to the sorrow is the earth of the truth of his face the 


In [33]:
print(extend_text(model, "To be or not to b", temperature=100))

To be or not to bmhf:my:r,k;s-h cqvvnfnfsut&-oq'ryoeen?x-hp:d,y&wv f3,dzrdzj-pilv?xpzh,fborp;'?$u


## Training a stateful RNN

Until now, we have only used _stateless RNNs_: at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws the state away as it is not needed anymore. What if we instructed the RNN to preserve this final state after processing a training batch and use it as the initial state for the next training batch? This way the model could learn long-term patterns despite only backpropagating through short sequences. This is called a _stateful RNN_. Let's go over how to build one.

The model itself requires very little change: we only need to add a new `hidden_states` attribute, initialized to `None`, then save the hidden states after each batch is processed, and use them as the initial hidden states for the next batch. Note that we must call `detach()` on these states to ensure we don't backpropagate over this training iteration's computation graph at the next iteration (this would cause an error).

In [34]:
class StatefulShakespeareModel(nn.Module):
    def __init__(self, vocab_size, n_layers=2, embed_dim=10, hidden_dim=128,
                 dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout)
        self.output = nn.Linear(hidden_dim, vocab_size)
        self.hidden_states = None

    def forward(self, X):
        embeddings = self.embed(X)
        outputs, hidden_states = self.gru(embeddings, self.hidden_states)
        self.hidden_states = hidden_states.detach()
        return self.output(outputs).permute(0, 2, 1)

The main difficulty with stateful RNNs is preparing the dataset. Indeed, a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. To be more precise, the _n_<sup>th</sup> window in batch _k_ must start exactly where the _n_<sup>th</sup> window in batch _k_ – 1 stopped. For example, suppose the full encoded text is `[1, 2, 3, .., 59, 60, 61]` and you want to use a window length of 4, and a batch size of 5. The dataset could contain 3 batches like these, in this order:

```
Batch #1:
X=[[1,2,3,4], [13,14,15,16], [25,26,27,28], [37,38,39,40], [49,50,51,52]]
Y=[[2,3,4,5], [14,15,16,17], [26,27,28,29], [38,39,40,41], [50,51,52,53]]

Batch #2:
X=[[5,6,7,8], [17,18,19,20], [29,30,31,32], [41,42,43,44], [53,54,55,56]]
y=[[6,7,8,9], [18,19,20,21], [30,31,32,33], [42,43,44,45], [54,55,56,57]]

Batch #3:
X=[[9,10,11,12], [21,22,23,24], [33,34,35,36], [45,46,47,48], [57,58,59,60]]
y=[[10,11,12,13], [22,23,24,25], [34,35,36,37], [46,47,48,49], [58,59,60,61]]
```

Let's write a `StatefulCharDataset` class that organizes the data like this.

In [35]:
from torch.utils.data import Dataset, DataLoader

class StatefulCharDataset(Dataset):
    def __init__(self, text, window_length, batch_size):
        self.encoded_text = encode_text(text)
        self.window_length = window_length
        self.batch_size = batch_size
        n_consecutive_windows = (len(self.encoded_text) - 1) // window_length
        n_windows_per_slot = n_consecutive_windows // batch_size
        self.length = n_windows_per_slot * batch_size
        self.spacing = n_windows_per_slot * window_length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError("dataset index out of range")
        start = ((idx % self.batch_size) * self.spacing
                 +(idx // self.batch_size) * self.window_length)
        end = start + self.window_length
        window = self.encoded_text[start : end]
        target = self.encoded_text[start + 1 : end + 1]
        return window, target

Now let's create the data loaders. Note that we must *not* shuffle the batches, even for the training set. We must also ensure that all batches have exactly the same number of windows, even the very last batch: for this reason, we must set `drop_last=True` when creating the data loaders.

In [36]:
batch_size = 128
stateful_train_set = StatefulCharDataset(shakespeare_text[:1_000_000],
                                         window_length, batch_size)
stateful_train_loader = DataLoader(stateful_train_set, batch_size=batch_size,
                                   drop_last=True)
stateful_valid_set = StatefulCharDataset(shakespeare_text[1_000_000:1_060_000],
                                         window_length, batch_size)
stateful_valid_loader = DataLoader(stateful_valid_set, batch_size=batch_size,
                                   drop_last=True)
stateful_test_set = StatefulCharDataset(shakespeare_text[1_060_000:],
                                        window_length, batch_size)
stateful_test_loader = DataLoader(stateful_test_set, batch_size=batch_size,
                                  drop_last=True)

During training, we should reset the hidden states at the start of each epoch. We could rewrite the whole training loop just for that, but it's cleaner to just pass a callback function to the `train()` function:

**Warning**: the following cell may take a long time to run, especially without a GPU.

In [37]:
torch.manual_seed(42)

stateful_model = StatefulShakespeareModel(len(vocab)).to(device)

n_epochs = 10
xentropy = nn.CrossEntropyLoss()
optimizer = torch.optim.NAdam(stateful_model.parameters())
accuracy = torchmetrics.Accuracy(task="multiclass",
                                 num_classes=len(vocab)).to(device)

def reset_hidden_states(model, epoch):
    model.hidden_states = None

history = train(stateful_model, optimizer, xentropy, accuracy, stateful_train_loader,
                stateful_valid_loader, n_epochs, epoch_callback=reset_hidden_states)

Epoch 1/10,                      train loss: 2.4741, train metric: 29.85%, valid metric: 39.39%
Epoch 2/10,                      train loss: 1.8786, train metric: 44.47%, valid metric: 44.68%
Epoch 3/10,                      train loss: 1.6980, train metric: 49.23%, valid metric: 48.11%
Epoch 4/10,                      train loss: 1.6011, train metric: 51.71%, valid metric: 49.94%
Epoch 5/10,                      train loss: 1.5430, train metric: 53.16%, valid metric: 51.04%
Epoch 6/10,                      train loss: 1.5043, train metric: 54.13%, valid metric: 51.94%
Epoch 7/10,                      train loss: 1.4762, train metric: 54.85%, valid metric: 52.50%
Epoch 8/10,                      train loss: 1.4540, train metric: 55.37%, valid metric: 53.06%
Epoch 9/10,                      train loss: 1.4365, train metric: 55.84%, valid metric: 53.45%
Epoch 10/10,                      train loss: 1.4226, train metric: 56.16%, valid metric: 53.97%


In [38]:
torch.save(stateful_model.state_dict(), "my_stateful_shakespeare_model.pt")

Let's try generating some text with our stateful RNN. The `hidden_states` are reset once at the beginning, then preserved across character generation:

In [39]:
def extend_text_with_stateful_rnn(model, text, n_chars=80, temperature=1):
    model.hidden_states = None
    rnn_input = text
    for _ in range(n_chars):
        char = next_char(model, rnn_input, temperature)
        text += char
        rnn_input = char
    return text + "…"

In [40]:
torch.manual_seed(42)
stateful_model.eval()
print(extend_text_with_stateful_rnn(stateful_model, "To be or not to b",
                                    temperature=0.1))

To be or not to be the common
that the strange in the common the commons and the first,
and the s…


In [41]:
print(extend_text_with_stateful_rnn(stateful_model, "To be or not to b", temperature=0.4))

To be or not to be the great and their earl
and the seeming the finest thou shall have been to be


In [42]:
print(extend_text_with_stateful_rnn(stateful_model, "To be or not to b", temperature=1))

To be or not to be fout:
then he shall foul hours of night from then, were
desire? to him renerge


Let's free some GPU RAM:

In [43]:
del accuracy, embed, encoded, encoded_text, optimizer, probs, samples, x, y
del shakespeare_text, stateful_test_loader, stateful_train_loader, Y_logits
del stateful_valid_loader, test_loader, train_loader, valid_loader, xentropy

Out.clear()  # clear Jupyter's `Out` variable which saves all the cell outputs
free_vram(device)


# Sentiment Analysis

## Loading the IMDB Dataset

In [44]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
split = imdb_dataset["train"].train_test_split(train_size=0.8, seed=42)
imdb_train_set, imdb_valid_set = split["train"], split["test"]
imdb_test_set = imdb_dataset["test"]

README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

plain_text/test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

plain_text/unsupervised-00000-of-00001.p(…):   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [45]:
imdb_train_set[1]["text"]

"'The Rookie' was a wonderful movie about the second chances life holds for us and also puts an emotional thought over the audience, making them realize that your dreams can come true. If you loved 'Remember the Titans', 'The Rookie' is the movie for you!! It's the feel good movie of the year and it is the perfect movie for all ages. 'The Rookie' hits a major home run!"

In [46]:
imdb_train_set[1]["label"]

1

In [47]:
imdb_train_set[16]["text"]

"Lillian Hellman's play, adapted by Dashiell Hammett with help from Hellman, becomes a curious project to come out of gritty Warner Bros. Paul Lukas, reprising his Broadway role and winning the Best Actor Oscar, plays an anti-Nazi German underground leader fighting the Fascists, dragging his American wife and three children all over Europe before finding refuge in the States (via the Mexico border). They settle in Washington with the wife's wealthy mother and brother, though a boarder residing in the manor is immediately suspicious of the newcomers and spends an awful lot of time down at the German Embassy playing poker. It seems to take forever for this drama to find its focus, and when we realize what the heart of the material is (the wise, honest, direct refugees teaching the clueless, head-in-the-sand Americans how the world has suddenly changed), it seems a little patronizing--the viewer is quite literally put in the relatives' place, being lectured to. Lukas has several speeches 

In [48]:
imdb_train_set[16]["label"]

0

## Tokenization Using the `tokenizers` Library

### Training a BPE Tokenizer on the IMDB Dataset

In [49]:
import tokenizers

bpe_model = tokenizers.models.BPE(unk_token="<unk>")
bpe_tokenizer = tokenizers.Tokenizer(bpe_model)
bpe_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
special_tokens = ["<pad>", "<unk>"]
bpe_trainer = tokenizers.trainers.BpeTrainer(vocab_size=1000,
                                             special_tokens=special_tokens)
train_reviews = [review["text"].lower() for review in imdb_train_set]
bpe_tokenizer.train_from_iterator(train_reviews, bpe_trainer)

In [50]:
tokenizers.pre_tokenizers.Whitespace().pre_tokenize_str("Hello, world!!!")

[('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!!!', (12, 15))]

### Encoding and Decoding Text

In [51]:
some_review = "what an awesome movie! 😊"
bpe_encoding = bpe_tokenizer.encode(some_review)
bpe_encoding

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [52]:
bpe_encoding.tokens

['what', 'an', 'aw', 'es', 'ome', 'movie', '!', '<unk>']

In [53]:
bpe_token_ids = bpe_encoding.ids
bpe_token_ids

[303, 139, 373, 149, 240, 211, 4, 1]

In [54]:
bpe_tokenizer.get_vocab()["what"]

303

In [55]:
bpe_tokenizer.token_to_id("what")

303

In [56]:
bpe_tokenizer.id_to_token(305)

'ough'

In [57]:
bpe_tokenizer.decode(bpe_token_ids)

'what an aw es ome movie !'

In [58]:
bpe_encoding.offsets

[(0, 4), (5, 7), (8, 10), (10, 12), (12, 15), (16, 21), (21, 22), (23, 24)]

### Handling Batches

In [59]:
bpe_tokenizer.encode_batch(train_reviews[:3])

[Encoding(num_tokens=281, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=114, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=285, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

In [60]:
bpe_tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
bpe_tokenizer.enable_truncation(max_length=500)

In [61]:
bpe_encodings = bpe_tokenizer.encode_batch(train_reviews[:3])
bpe_batch_ids = torch.tensor([encoding.ids for encoding in bpe_encodings])
bpe_batch_ids

tensor([[159, 402, 176, 246,  61, 782, 156, 737, 252,  42, 239,  51, 154, 460,
         917,  17, 272, 156, 737, 576, 215, 976, 275,  42, 199,  44, 554,  42,
         192, 585,  57, 160, 259, 170, 157, 143, 138, 159, 402,  11, 589, 152,
           5, 819, 168, 230,   5, 521, 924, 981, 962, 250,  61,  10,  60, 426,
         526, 959,  60, 138, 199, 150, 319,  15, 363, 141, 957, 694,  47, 696,
          61, 875, 138, 960, 337, 414, 140, 157, 385, 174, 433, 161, 221, 145,
         213,  17, 549,  15, 151,  10,  60,  55, 416, 146, 407, 144, 182, 303,
         151, 141,  17, 138, 547, 538, 528, 768,  54, 335,  42, 203,  44, 270,
          46, 153, 876, 141, 919, 233, 522, 172, 141, 719, 162, 807, 279,  17,
         138,  45,  66,  55, 188, 989, 156, 378, 698, 301, 296, 689, 212, 558,
         926, 148,  17,  44, 270,  46, 141,  47, 279, 302, 171, 152, 787,  15,
         153, 522, 172, 766, 205, 156, 234, 677, 161, 139, 513, 146, 370, 251,
         219, 162, 197, 162, 166,  50, 265,  47, 266

In [62]:
attention_mask = torch.tensor([encoding.attention_mask
                               for encoding in bpe_encodings])
attention_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [63]:
lengths = attention_mask.sum(dim=-1)
lengths

tensor([281, 114, 285])

### BBPE Tokenization

The 😊 emoji is represented as 4 bytes when using the UTF-8 Unicode encoding, so the `ByteLevel` pre-tokenizer represents it as 4 characters, each representing a byte. Spaces are converted to Ġ.

In [64]:
tokenizers.pre_tokenizers.ByteLevel().pre_tokenize_str(some_review)

[('Ġwhat', (0, 4)),
 ('Ġan', (4, 7)),
 ('Ġawesome', (7, 15)),
 ('Ġmovie', (15, 21)),
 ('!', (21, 22)),
 ('ĠðŁĺĬ', (22, 24))]

In [65]:
bbpe_model = tokenizers.models.BPE(unk_token="<unk>")
bbpe_tokenizer = tokenizers.Tokenizer(bbpe_model)
bbpe_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.ByteLevel()
bbpe_trainer = tokenizers.trainers.BpeTrainer(vocab_size=1000,
                                              special_tokens=special_tokens)
bbpe_tokenizer.train_from_iterator(train_reviews, bbpe_trainer)

In [66]:
bbpe_encoding = bbpe_tokenizer.encode(some_review)
bbpe_tokens = bbpe_encoding.tokens
bbpe_tokens

['Ġwhat',
 'Ġan',
 'Ġaw',
 'es',
 'ome',
 'Ġmovie',
 '!',
 'Ġ',
 '<unk>',
 'Ł',
 'ĺ',
 '<unk>']

In [67]:
bbpe_token_ids = bbpe_encoding.ids
bbpe_token_ids

[354, 216, 561, 148, 244, 232, 2, 107, 1, 125, 119, 1]

In [68]:
bbpe_decoded = bbpe_tokenizer.decode(bbpe_token_ids)
bbpe_decoded

'Ġwhat Ġan Ġaw es ome Ġmovie ! Ġ Ł ĺ'

In [69]:
bbpe_decoded.replace(" ", "").replace("Ġ", " ").strip()

'what an awesome movie! Łĺ'

### WordPiece

In [70]:
wp_model = tokenizers.models.WordPiece(unk_token="<unk>")
wp_tokenizer = tokenizers.Tokenizer(wp_model)
wp_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
wp_trainer = tokenizers.trainers.WordPieceTrainer(vocab_size=1000,
                                                  special_tokens=special_tokens)
wp_tokenizer.train_from_iterator(train_reviews, wp_trainer)

In [71]:
wp_encoding = wp_tokenizer.encode(some_review)
wp_tokens = wp_encoding.tokens
wp_tokens

['what', 'an', 'aw', '##es', '##ome', 'movie', '!', '<unk>']

In [72]:
wp_token_ids = wp_encoding.ids
wp_token_ids

[443, 312, 635, 257, 354, 331, 4, 1]

In [73]:
wp_decoded = wp_tokenizer.decode(wp_token_ids)
wp_decoded

'what an aw ##es ##ome movie !'

In [74]:
wp_decoded.replace(" ##", "").replace(" !", "!")

'what an awesome movie!'

### Unigram LM

In [75]:
unigram_model = tokenizers.models.Unigram()
unigram_tokenizer = tokenizers.Tokenizer(unigram_model)
unigram_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
unigram_trainer = tokenizers.trainers.UnigramTrainer(
    vocab_size=1000, special_tokens=special_tokens, unk_token="<unk>")
unigram_tokenizer.train_from_iterator(train_reviews, unigram_trainer)

In [76]:
unigram_encoding = unigram_tokenizer.encode(some_review)
unigram_tokens = unigram_encoding.tokens
unigram_tokens

['what', 'an', 'a', 'w', 'e', 'some', 'movie', '!', '😊']

In [77]:
unigram_token_ids = unigram_encoding.ids[:10]
unigram_token_ids

[79, 37, 4, 40, 6, 70, 46, 74, 1]

In [78]:
unigram_tokenizer.decode(unigram_token_ids)

'what an a w e some movie !'

### Pretrained Tokenizers

BBPE is used by models like GPT-2 and RoBERTa.

In [79]:
import transformers

gpt2_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
gpt2_encoding = gpt2_tokenizer(train_reviews[:3], truncation=True,
                               max_length=500)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [80]:
gpt2_encoding.keys()

KeysView({'input_ids': [[14247, 35030, 1690, 423, 257, 1688, 8046, 13, 484, 1690, 1282, 503, 2045, 588, 257, 2646, 4676, 373, 2391, 4624, 319, 262, 3800, 357, 10508, 355, 366, 3847, 2802, 11074, 9785, 1681, 46390, 316, 338, 4571, 7622, 262, 2646, 6776, 11, 543, 318, 2592, 2408, 1201, 262, 4286, 4438, 683, 645, 1103, 4427, 13, 991, 11, 340, 338, 3621, 284, 804, 379, 329, 644, 340, 318, 13, 262, 16585, 1022, 285, 40302, 269, 5718, 290, 33826, 8803, 302, 44655, 318, 2407, 10457, 13, 262, 17262, 286, 511, 2776, 389, 6452, 13, 269, 5718, 318, 9623, 355, 1464, 11, 290, 302, 44655, 3011, 530, 286, 465, 1178, 8395, 284, 1107, 719, 29847, 1671, 1220, 6927, 1671, 11037, 72, 22127, 326, 1312, 1053, 1239, 1775, 4173, 64, 443, 7114, 338, 711, 11, 475, 1312, 3285, 326, 474, 323, 1803, 261, 477, 268, 338, 16711, 318, 17074, 13, 262, 4226, 318, 8131, 47370, 11, 290, 7622, 345, 25260, 13, 366, 22595, 46670, 1, 318, 281, 36005, 17774, 2646, 11, 290, 318, 7151, 329, 3016, 477, 3296, 286, 3800, 290, 3159,

In [81]:
gpt2_token_ids = gpt2_encoding["input_ids"][0][:10]
gpt2_token_ids

[14247, 35030, 1690, 423, 257, 1688, 8046, 13, 484, 1690]

In [82]:
gpt2_tokenizer.decode(gpt2_token_ids)

'stage adaptations often have a major fault. they often'

WordPiece is used by models like BERT.

In [83]:
bert_tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
bert_encoding = bert_tokenizer(train_reviews[:3], padding=True,
                               truncation=True, max_length=500,
                               return_tensors="pt")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [84]:
bert_encoding["input_ids"]

tensor([[  101,  2754, 17241,  2411,  2031,  1037,  2350,  6346,  1012,  2027,
          2411,  2272,  2041,  2559,  2066,  1037,  2143,  4950,  2001,  3432,
          2872,  2006,  1996,  2754,  1006,  2107,  2004,  1000,  2305,  2388,
          1000,  1007,  1012, 11430, 11320, 11368,  1005,  1055,  3257,  7906,
          1996,  2143,  4142,  1010,  2029,  2003,  2926,  3697,  2144,  1996,
          3861,  3253,  2032,  2053,  2613,  4119,  1012,  2145,  1010,  2009,
          1005,  1055,  3835,  2000,  2298,  2012,  2005,  2054,  2009,  2003,
          1012,  1996,  6370,  2090,  2745, 19881,  1998,  5696, 20726,  2003,
          3243,  8235,  1012,  1996, 10949,  1997,  2037,  3276,  2024, 11341,
          1012, 19881,  2003, 10392,  2004,  2467,  1010,  1998, 20726,  4152,
          2028,  1997,  2010,  2261,  9592,  2000,  2428,  2552,  1012,  1026,
          7987,  1013,  1028,  1026,  7987,  1013,  1028,  1045, 18766,  2008,
          1045,  1005,  2310,  2196,  2464, 11209, 2

In [85]:
bert_encoding["attention_mask"]

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [86]:
# extra code – shows how to drop the special tokens
bert_encoding = bert_tokenizer(train_reviews[:3], padding=True,
                               truncation=True, max_length=500,
                               add_special_tokens=False, return_tensors="pt")
bert_encoding["input_ids"]

tensor([[ 2754, 17241,  2411,  2031,  1037,  2350,  6346,  1012,  2027,  2411,
          2272,  2041,  2559,  2066,  1037,  2143,  4950,  2001,  3432,  2872,
          2006,  1996,  2754,  1006,  2107,  2004,  1000,  2305,  2388,  1000,
          1007,  1012, 11430, 11320, 11368,  1005,  1055,  3257,  7906,  1996,
          2143,  4142,  1010,  2029,  2003,  2926,  3697,  2144,  1996,  3861,
          3253,  2032,  2053,  2613,  4119,  1012,  2145,  1010,  2009,  1005,
          1055,  3835,  2000,  2298,  2012,  2005,  2054,  2009,  2003,  1012,
          1996,  6370,  2090,  2745, 19881,  1998,  5696, 20726,  2003,  3243,
          8235,  1012,  1996, 10949,  1997,  2037,  3276,  2024, 11341,  1012,
         19881,  2003, 10392,  2004,  2467,  1010,  1998, 20726,  4152,  2028,
          1997,  2010,  2261,  9592,  2000,  2428,  2552,  1012,  1026,  7987,
          1013,  1028,  1026,  7987,  1013,  1028,  1045, 18766,  2008,  1045,
          1005,  2310,  2196,  2464, 11209, 20206,  

Unigram LM is used by models like ALBERT, T5, and XLM-R.

In [87]:
albert_tokenizer = transformers.AutoTokenizer.from_pretrained("albert-base-v2")
albert_encoding = albert_tokenizer(
    train_reviews[:3], padding=True, truncation=True, max_length=500,
    add_special_tokens=False, return_tensors="pt")
albert_token_ids = albert_encoding["input_ids"]
albert_token_ids

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

tensor([[  876,  5004,    18,   478,    57,    21,   394,  4173,     9,    59,
           478,   340,    70,   699,   101,    21,   171,  3336,    23,  1659,
          1037,    27,    14,   876,    13,     5,  4289,    28,    13,     7,
          4893,   449,     7,     6,     9, 12508,  1612,  5909,    22,    18,
          1400,  8968,    14,   171,  2481,    15,    56,    25,  1118,  1956,
           179,    14,  2151,  1434,    61,    90,   683,  2404,     9,   174,
            15,    32,    22,    18,  2210,    20,   361,    35,    26,    98,
            32,    25,     9,    14,  5427,   128,   832, 22427,    17,  4479,
         24604,    25,  1450,  7472,     9,    14, 12289,    16,    66,  1429,
            50, 12891,     9, 22427,    25, 10356,    28,   550,    15,    17,
         24604,  3049,    53,    16,    33,   310, 11285,    20,   510,   601,
             9,     1,  5145,    13,   118,     1,  5145,    13,   118,     1,
            49, 14586,    30,    31,    22,   195,  

In [88]:
hf_tokenizer = transformers.PreTrainedTokenizerFast(
    tokenizer_object=bpe_tokenizer)
hf_encodings = hf_tokenizer(train_reviews[:3], padding=True, truncation=True,
                            max_length=500, return_tensors="pt")
hf_encodings["input_ids"]

tensor([[159, 402, 176, 246,  61, 782, 156, 737, 252,  42, 239,  51, 154, 460,
         917,  17, 272, 156, 737, 576, 215, 976, 275,  42, 199,  44, 554,  42,
         192, 585,  57, 160, 259, 170, 157, 143, 138, 159, 402,  11, 589, 152,
           5, 819, 168, 230,   5, 521, 924, 981, 962, 250,  61,  10,  60, 426,
         526, 959,  60, 138, 199, 150, 319,  15, 363, 141, 957, 694,  47, 696,
          61, 875, 138, 960, 337, 414, 140, 157, 385, 174, 433, 161, 221, 145,
         213,  17, 549,  15, 151,  10,  60,  55, 416, 146, 407, 144, 182, 303,
         151, 141,  17, 138, 547, 538, 528, 768,  54, 335,  42, 203,  44, 270,
          46, 153, 876, 141, 919, 233, 522, 172, 141, 719, 162, 807, 279,  17,
         138,  45,  66,  55, 188, 989, 156, 378, 698, 301, 296, 689, 212, 558,
         926, 148,  17,  44, 270,  46, 141,  47, 279, 302, 171, 152, 787,  15,
         153, 522, 172, 766, 205, 156, 234, 677, 161, 139, 513, 146, 370, 251,
         219, 162, 197, 162, 166,  50, 265,  47, 266

## Building and Training a Sentiment Analysis Model

In [89]:
def collate_fn(batch, tokenizer=bert_tokenizer):
    reviews = [review["text"] for review in batch]
    labels = [[review["label"]] for review in batch]
    encodings = tokenizer(reviews, padding=True, truncation=True,
                          max_length=200, return_tensors="pt")
    labels = torch.tensor(labels, dtype=torch.float32)
    return encodings, labels

batch_size = 256
imdb_train_loader = DataLoader(imdb_train_set, batch_size=batch_size,
                               collate_fn=collate_fn, shuffle=True)
imdb_valid_loader = DataLoader(imdb_valid_set, batch_size=batch_size,
                               collate_fn=collate_fn)
imdb_test_loader = DataLoader(imdb_test_set, batch_size=batch_size,
                              collate_fn=collate_fn)

In [90]:
class SentimentAnalysisModel(nn.Module):
    def __init__(self, vocab_size, n_layers=2, embed_dim=128, hidden_dim=64,
                 pad_id=0, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim,
                                  padding_idx=pad_id)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, encodings):
        embeddings = self.embed(encodings["input_ids"])
        _outputs, hidden_states = self.gru(embeddings)
        return self.output(hidden_states[-1])

In [91]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

sequences = torch.tensor([[1, 2, 0, 0], [5, 6, 7, 8]])
packed = pack_padded_sequence(sequences, lengths=(2, 4),
                              enforce_sorted=False, batch_first=True)
packed

PackedSequence(data=tensor([5, 1, 6, 2, 7, 8]), batch_sizes=tensor([2, 2, 1, 1]), sorted_indices=tensor([1, 0]), unsorted_indices=tensor([1, 0]))

In [92]:
padded, lengths = pad_packed_sequence(packed, batch_first=True)
padded, lengths

(tensor([[1, 2, 0, 0],
         [5, 6, 7, 8]]),
 tensor([2, 4]))

In [93]:
class SentimentAnalysisModelPackedSeq(nn.Module):
    def __init__(self, vocab_size, n_layers=2, embed_dim=128,
                 hidden_dim=64, pad_id=0, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim,
                                  padding_idx=pad_id)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, encodings):
        embeddings = self.embed(encodings["input_ids"])
        lengths = encodings["attention_mask"].sum(dim=1)                      # <= line added
        packed = pack_padded_sequence(embeddings, lengths=lengths.cpu(),      # <= line added
                                      batch_first=True, enforce_sorted=False) # <= line added
        _outputs, hidden_states = self.gru(packed)                            # <= line changed
        return self.output(hidden_states[-1])

In [94]:
torch.manual_seed(42)

vocab_size = bert_tokenizer.vocab_size
imdb_model_ps = SentimentAnalysisModelPackedSeq(vocab_size).to(device)

n_epochs = 10
xentropy = nn.BCEWithLogitsLoss()
optimizer = torch.optim.NAdam(imdb_model_ps.parameters())
accuracy = torchmetrics.Accuracy(task="binary").to(device)

history = train(imdb_model_ps, optimizer, xentropy, accuracy,
                imdb_train_loader, imdb_valid_loader, n_epochs)

Epoch 1/10,                      train loss: 0.6795, train metric: 57.05%, valid metric: 55.84%
Epoch 2/10,                      train loss: 0.6118, train metric: 67.74%, valid metric: 59.58%
Epoch 3/10,                      train loss: 0.4613, train metric: 78.65%, valid metric: 80.68%
Epoch 4/10,                      train loss: 0.3391, train metric: 85.62%, valid metric: 83.52%
Epoch 5/10,                      train loss: 0.2547, train metric: 89.93%, valid metric: 84.66%
Epoch 6/10,                      train loss: 0.2894, train metric: 88.05%, valid metric: 81.98%
Epoch 7/10,                      train loss: 0.1837, train metric: 93.08%, valid metric: 84.26%
Epoch 8/10,                      train loss: 0.1181, train metric: 96.04%, valid metric: 82.80%
Epoch 9/10,                      train loss: 0.0648, train metric: 98.22%, valid metric: 83.44%
Epoch 10/10,                      train loss: 0.0488, train metric: 98.80%, valid metric: 83.30%


In [95]:
class SentimentAnalysisModelBidi(nn.Module):
    def __init__(self, vocab_size, n_layers=2, embed_dim=128,
                 hidden_dim=64, pad_id=0, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim,
                                  padding_idx=pad_id)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout, bidirectional=True)  # <= line changed
        self.output = nn.Linear(2 * hidden_dim, 1)                               # <= line changed

    def forward(self, encodings):
        embeddings = self.embed(encodings["input_ids"])
        lengths = encodings["attention_mask"].sum(dim=1)
        packed = pack_padded_sequence(embeddings, lengths=lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _outputs, hidden_states = self.gru(packed)
        n_dims = self.output.in_features                                          # <= line added
        top_states = hidden_states[-2:].permute(1, 0, 2).reshape(-1, n_dims)      # <= line added
        return self.output(top_states)                                            # <= line changed

In [96]:
torch.manual_seed(42)

vocab_size = bert_tokenizer.vocab_size
imdb_model_bidi = SentimentAnalysisModelBidi(vocab_size).to(device)

n_epochs = 10
xentropy = nn.BCEWithLogitsLoss()
optimizer = torch.optim.NAdam(imdb_model_bidi.parameters())
accuracy = torchmetrics.Accuracy(task="binary").to(device)

history = train(imdb_model_bidi, optimizer, xentropy, accuracy, imdb_train_loader,
                imdb_valid_loader, n_epochs)

Epoch 1/10,                      train loss: 0.6514, train metric: 60.40%, valid metric: 56.66%
Epoch 2/10,                      train loss: 0.5051, train metric: 75.31%, valid metric: 74.18%
Epoch 3/10,                      train loss: 0.3939, train metric: 82.45%, valid metric: 82.12%
Epoch 4/10,                      train loss: 0.2917, train metric: 87.91%, valid metric: 80.72%
Epoch 5/10,                      train loss: 0.2114, train metric: 91.63%, valid metric: 83.94%
Epoch 6/10,                      train loss: 0.1501, train metric: 94.41%, valid metric: 83.98%
Epoch 7/10,                      train loss: 0.0854, train metric: 97.29%, valid metric: 81.14%
Epoch 8/10,                      train loss: 0.0834, train metric: 97.10%, valid metric: 84.30%
Epoch 9/10,                      train loss: 0.0246, train metric: 99.42%, valid metric: 83.80%
Epoch 10/10,                      train loss: 0.0109, train metric: 99.80%, valid metric: 83.56%


Before we continue, let's clean up the GPU RAM:

In [97]:
del albert_token_ids, attention_mask, bpe_batch_ids
del encoded_text, lengths, optimizer, padded, probs, samples, sequences, x
del xentropy, y, Y_logits

Out.clear()  # clear Jupyter's `Out` variable which saves all the cell outputs
free_vram(device)

## Reusing Pretrained Embeddings and Language Models

In [98]:
bert_model = transformers.AutoModel.from_pretrained("bert-base-uncased")
bert_model.embeddings.word_embeddings

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Embedding(30522, 768, padding_idx=0)

In [99]:
class SentimentAnalysisModelPreEmbeds(nn.Module):
    def __init__(self, pretrained_embeddings, n_layers=2, hidden_dim=64,
                 dropout=0.2):
        super().__init__()
        weights = pretrained_embeddings.weight.data
        self.embed = nn.Embedding.from_pretrained(weights, freeze=True)
        embed_dim = weights.shape[-1]
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, 1)

    def forward(self, encodings):
        embeddings = self.embed(encodings["input_ids"])
        lengths = encodings["attention_mask"].sum(dim=1)
        packed = pack_padded_sequence(embeddings, lengths=lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _outputs, hidden_states = self.gru(packed)
        n_dims = self.output.in_features
        top_states = hidden_states[-2:].permute(1, 0, 2).reshape(-1, n_dims)
        return self.output(top_states)

In [100]:
torch.manual_seed(42)

imdb_model_bert_embeds = SentimentAnalysisModelPreEmbeds(
    bert_model.embeddings.word_embeddings).to(device)

n_epochs = 10
xentropy = nn.BCEWithLogitsLoss()
optimizer = torch.optim.NAdam(imdb_model_bert_embeds.parameters())
accuracy = torchmetrics.Accuracy(task="binary").to(device)

history = train(imdb_model_bert_embeds, optimizer, xentropy, accuracy,
                imdb_train_loader, imdb_valid_loader, n_epochs)

Epoch 1/10,                      train loss: 0.6896, train metric: 54.61%, valid metric: 56.50%
Epoch 2/10,                      train loss: 0.6583, train metric: 63.41%, valid metric: 69.16%
Epoch 3/10,                      train loss: 0.5599, train metric: 71.92%, valid metric: 72.56%
Epoch 4/10,                      train loss: 0.4806, train metric: 77.27%, valid metric: 64.06%
Epoch 5/10,                      train loss: 0.4019, train metric: 81.57%, valid metric: 82.24%
Epoch 6/10,                      train loss: 0.3624, train metric: 83.79%, valid metric: 83.94%
Epoch 7/10,                      train loss: 0.3333, train metric: 85.51%, valid metric: 79.96%
Epoch 8/10,                      train loss: 0.3118, train metric: 86.58%, valid metric: 80.74%
Epoch 9/10,                      train loss: 0.2938, train metric: 87.45%, valid metric: 63.74%
Epoch 10/10,                      train loss: 0.2664, train metric: 88.83%, valid metric: 79.90%


In [101]:
bert_encoding = bert_tokenizer(train_reviews[:3], padding=True,
                               max_length=200, truncation=True,
                               return_tensors="pt")
bert_output = bert_model(**bert_encoding)
bert_output.last_hidden_state.shape

torch.Size([3, 200, 768])

In [102]:
bert_output.pooler_output.shape

torch.Size([3, 768])

Let's free some GPU RAM:

In [103]:
del bert_model
free_vram(device)

In [104]:
class SentimentAnalysisModelBert(nn.Module):
    def __init__(self, n_layers=2, hidden_dim=64, dropout=0.2):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained("bert-base-uncased")
        embed_dim = self.bert.config.hidden_size
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, encodings):
        contextualized_embeddings = self.bert(**encodings).last_hidden_state
        lengths = encodings["attention_mask"].sum(dim=1)
        packed = pack_padded_sequence(contextualized_embeddings, lengths=lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _outputs, hidden_states = self.gru(packed)
        return self.output(hidden_states[-1])

In [105]:
torch.manual_seed(42)

imdb_model_bert = SentimentAnalysisModelBert().to(device)
imdb_model_bert.bert.requires_grad_(False)

n_epochs = 4
xentropy = nn.BCEWithLogitsLoss()
optimizer = torch.optim.NAdam(imdb_model_bert.parameters())
accuracy = torchmetrics.Accuracy(task="binary").to(device)

history = train(imdb_model_bert, optimizer, xentropy, accuracy,
                imdb_train_loader, imdb_valid_loader, n_epochs)

Epoch 1/4,                      train loss: 0.4879, train metric: 75.29%, valid metric: 87.22%
Epoch 2/4,                      train loss: 0.3072, train metric: 86.91%, valid metric: 88.26%
Epoch 3/4,                      train loss: 0.2779, train metric: 88.44%, valid metric: 88.20%
Epoch 4/4,                      train loss: 0.2575, train metric: 89.22%, valid metric: 88.00%


Let's free some GPU RAM again:

In [106]:
del imdb_model_bert
free_vram(device)

In [107]:
class SentimentAnalysisModelBert2(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained("bert-base-uncased")
        self.output = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, encodings):
        bert_output = self.bert(**encodings)
        return self.output(bert_output.last_hidden_state[:, 0])

In [108]:
class SentimentAnalysisModelBert3(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained("bert-base-uncased")
        self.output = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, encodings):
        bert_output = self.bert(**encodings)
        return self.output(bert_output.pooler_output)

In [109]:
torch.manual_seed(42)

imdb_model_bert3 = SentimentAnalysisModelBert3().to(device)
imdb_model_bert3.bert.requires_grad_(False)

n_epochs = 5
xentropy = nn.BCEWithLogitsLoss()
optimizer = torch.optim.NAdam(imdb_model_bert3.parameters())
accuracy = torchmetrics.Accuracy(task="binary").to(device)

history = train(imdb_model_bert3, optimizer, xentropy, accuracy,
                imdb_train_loader, imdb_valid_loader, n_epochs)

Epoch 1/5,                      train loss: 0.6602, train metric: 61.26%, valid metric: 71.66%
Epoch 2/5,                      train loss: 0.6033, train metric: 69.37%, valid metric: 66.74%
Epoch 3/5,                      train loss: 0.5716, train metric: 72.45%, valid metric: 75.18%
Epoch 4/5,                      train loss: 0.5535, train metric: 73.12%, valid metric: 73.04%
Epoch 5/5,                      train loss: 0.5343, train metric: 74.74%, valid metric: 59.80%


In [110]:
imdb_model_bert3.bert.pooler.requires_grad_(True)

history = train(imdb_model_bert3, optimizer, xentropy, accuracy,
                imdb_train_loader, imdb_valid_loader, n_epochs)

Epoch 1/5,                      train loss: 0.8155, train metric: 71.27%, valid metric: 81.72%
Epoch 2/5,                      train loss: 0.4582, train metric: 78.73%, valid metric: 82.40%
Epoch 3/5,                      train loss: 0.4408, train metric: 79.41%, valid metric: 81.02%
Epoch 4/5,                      train loss: 0.4216, train metric: 80.86%, valid metric: 81.26%
Epoch 5/5,                      train loss: 0.4172, train metric: 80.87%, valid metric: 82.20%


Let's free some GPU RAM:

In [111]:
del imdb_model_bert3
free_vram(device)

# Task-Specific Classes

In [112]:
# from transformers import AutoModelForSequenceClassification
from transformers import BertForSequenceClassification

torch.manual_seed(42)
bert_for_binary_clf = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, dtype=torch.float16).to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [113]:
encoding = bert_tokenizer(["This was a great movie!"])
with torch.no_grad():
    output = bert_for_binary_clf(
        input_ids=torch.tensor(encoding["input_ids"], device=device),
        attention_mask=torch.tensor(encoding["attention_mask"], device=device))

output.logits

tensor([[-0.0120,  0.6304]], device='cuda:0', dtype=torch.float16)


In [114]:
torch.softmax(output.logits, dim=-1)

tensor([[0.3447, 0.6553]], device='cuda:0', dtype=torch.float16)


In [115]:
with torch.no_grad():
    output = bert_for_binary_clf(
        input_ids=torch.tensor(encoding["input_ids"], device=device),
        attention_mask=torch.tensor(encoding["attention_mask"], device=device),
        labels=torch.tensor([1], device=device))

output.loss

tensor(0.4226, device='cuda:0', dtype=torch.float16)


Every model from the Transformers library has a `config` attribute that contains the model's configuration:

In [116]:
bert_for_binary_clf.config

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.56.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

# The Trainer API

We could use our own training function to train this model, or we can use the Trainer API instead.

The Trainer API needs datasets that are already tokenized, so let's prepare them:

In [117]:
def tokenize_batch(batch):
    return bert_tokenizer(batch["text"], truncation=True, max_length=200)

tok_imdb_train_set = imdb_train_set.map(tokenize_batch, batched=True)
tok_imdb_valid_set = imdb_valid_set.map(tokenize_batch, batched=True)
tok_imdb_test_set = imdb_test_set.map(tokenize_batch, batched=True)

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

The Trainer API doesn't support streaming metrics so we need to write our own evaluation function:

In [118]:
def compute_accuracy(pred):
    return {"accuracy": (pred.label_ids == pred.predictions.argmax(-1)).mean()}

In [119]:
from transformers import TrainingArguments

train_args = TrainingArguments(
    output_dir="my_imdb_model", num_train_epochs=2,
    per_device_train_batch_size=128, per_device_eval_batch_size=128,
    eval_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, metric_for_best_model="accuracy",
    report_to="none")

In [120]:
from transformers import DataCollatorWithPadding, Trainer

trainer = Trainer(
    bert_for_binary_clf, train_args, train_dataset=tok_imdb_train_set,
    eval_dataset=tok_imdb_valid_set, compute_metrics=compute_accuracy,
    data_collator=DataCollatorWithPadding(bert_tokenizer))
train_output = trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.3102,0.237499,0.9034
2,0.1634,0.239066,0.9082


Let's free some GPU RAM:

In [121]:
del bert_for_binary_clf
free_vram(device)

# Pipelines

In [122]:
from transformers import pipeline

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
classifier_imdb = pipeline("sentiment-analysis", model=model_name,
                           truncation=True, max_length=512)
classifier_imdb(train_reviews[:10])

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9996108412742615},
 {'label': 'POSITIVE', 'score': 0.9998623132705688},
 {'label': 'NEGATIVE', 'score': 0.9943684935569763},
 {'label': 'POSITIVE', 'score': 0.997913658618927},
 {'label': 'POSITIVE', 'score': 0.999544084072113},
 {'label': 'NEGATIVE', 'score': 0.9845332503318787},
 {'label': 'POSITIVE', 'score': 0.9859278202056885},
 {'label': 'POSITIVE', 'score': 0.9993758797645569},
 {'label': 'POSITIVE', 'score': 0.9978922009468079},
 {'label': 'NEGATIVE', 'score': 0.9997020363807678}]

In [123]:
accuracy = torchmetrics.Accuracy(task="binary").to(device)
with torch.no_grad():
    text_imdb_valid_loader = DataLoader(imdb_valid_set, batch_size=256)
    for index, batch in enumerate(text_imdb_valid_loader):
        y_pred = classifier_imdb(batch["text"], truncation=True)
        y_pred = torch.tensor([pred["label"] == "POSITIVE" for pred in y_pred], dtype=int)
        accuracy.update(y_pred, batch["label"])
        print(f"\r{index + 1}/{len(text_imdb_valid_loader)}", end="")

accuracy.compute()

8/20

20/20

tensor(0.8820, device='cuda:0')

Models can be very biased. For example, it may like or dislike some countries depending on the data it was trained on, and how it is used, so use it with care:

In [124]:
# extra code – shows that binary classification can amplify model bias
countries = ["Iraq", "Thailand", "the USA", "Vietnam"]
texts = [f"I am from {country}" for country in countries]
list(zip(countries, classifier_imdb(texts)))

[('Iraq', {'label': 'NEGATIVE', 'score': 0.9706069231033325}),
 ('Thailand', {'label': 'POSITIVE', 'score': 0.9903932213783264}),
 ('the USA', {'label': 'POSITIVE', 'score': 0.9642282128334045}),
 ('Vietnam', {'label': 'NEGATIVE', 'score': 0.9747399091720581})]

In [125]:
# extra code – using a model with a neutral class solves this bias issue
# note: the warning is normal: this model's pooler will not be used for
# classification, so its weights are downloaded but not used.
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
classifier_imdb_with_neutral = pipeline("sentiment-analysis", model=model_name)
list(zip(countries, classifier_imdb_with_neutral(texts)))

config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


model.safetensors:   0%|          | 0.00/501M [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Device set to use cuda:0


[('Iraq', {'label': 'neutral', 'score': 0.8353288769721985}),
 ('Thailand', {'label': 'neutral', 'score': 0.8824347853660583}),
 ('the USA', {'label': 'neutral', 'score': 0.8349122405052185}),
 ('Vietnam', {'label': 'neutral', 'score': 0.8436853885650635})]

In [126]:
model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
classifier_mnli = pipeline("text-classification", model=model_name)
classifier_mnli([
    "She loves me. [SEP] She loves me not. [SEP]",
    "Alice just woke up. [SEP] Alice is awake. [SEP]",
    "I like dogs. [SEP] Everyone likes dogs. [SEP]"])

config.json:   0%|          | 0.00/729 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/58.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Device set to use cuda:0


[{'label': 'contradiction', 'score': 0.9717152714729309},
 {'label': 'entailment', 'score': 0.9119168519973755},
 {'label': 'neutral', 'score': 0.9509281516075134}]

Let's free some GPU RAM:

In [127]:
del classifier_imdb, classifier_mnli, classifier_imdb_with_neutral, trainer
Out.clear()
free_vram(device)

# An Encoder–Decoder Network for Neural Machine Translation

In [128]:
# If you want to start the notebook here, please run the cells in the Setup
# section at the start of the notebook, then come back to this cell

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader
from datasets import load_dataset
import tokenizers

In [129]:
nmt_original_valid_set, nmt_test_set = load_dataset(
    path="ageron/tatoeba_mt_train", name="eng-spa",
    split=["validation", "test"])
split = nmt_original_valid_set.train_test_split(train_size=0.8, seed=42)
nmt_train_set, nmt_valid_set = split["train"], split["test"]

README.md: 0.00B [00:00, ?B/s]

eng-spa/validation-00000-of-00001.parque(…):   0%|          | 0.00/7.85M [00:00<?, ?B/s]

eng-spa/test-00000-of-00001.parquet:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/197299 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/24514 [00:00<?, ? examples/s]

In [130]:
nmt_train_set[0]

{'source_text': 'Tom tried to break up the fight.',
 'target_text': 'Tom trató de disolver la pelea.',
 'source_lang': 'eng',
 'target_lang': 'spa'}

In [131]:
def train_eng_spa():  # a generator function to iterate over all training text
    for pair in nmt_train_set:
        yield pair["source_text"]
        yield pair["target_text"]

max_length = 256
vocab_size = 10_000

nmt_tokenizer_model = tokenizers.models.BPE(unk_token="<unk>")
nmt_tokenizer = tokenizers.Tokenizer(nmt_tokenizer_model)
nmt_tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
nmt_tokenizer.enable_truncation(max_length=max_length)
nmt_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
nmt_tokenizer_trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=vocab_size, special_tokens=["<pad>", "<unk>", "<s>", "</s>"])
nmt_tokenizer.train_from_iterator(train_eng_spa(), nmt_tokenizer_trainer)

In [132]:
nmt_tokenizer.encode("I like soccer").ids

[43, 401, 4381]

In [133]:
nmt_tokenizer.encode("<s> Me gusta el fútbol").ids

[2, 396, 582, 219, 3356]

In [134]:
from collections import namedtuple

fields = ["src_token_ids", "src_mask", "tgt_token_ids", "tgt_mask"]
class NmtPair(namedtuple("NmtPairBase", fields)):
    def to(self, device):
        return NmtPair(self.src_token_ids.to(device), self.src_mask.to(device),
                       self.tgt_token_ids.to(device), self.tgt_mask.to(device))

In [135]:
def nmt_collate_fn(batch):
    src_texts = [pair['source_text'] for pair in batch]
    tgt_texts = [f"<s> {pair['target_text']} </s>" for pair in batch]
    src_encodings = nmt_tokenizer.encode_batch(src_texts)
    tgt_encodings = nmt_tokenizer.encode_batch(tgt_texts)
    src_token_ids = torch.tensor([enc.ids for enc in src_encodings])
    tgt_token_ids = torch.tensor([enc.ids for enc in tgt_encodings])
    src_mask = torch.tensor([enc.attention_mask for enc in src_encodings])
    tgt_mask = torch.tensor([enc.attention_mask for enc in tgt_encodings])
    inputs = NmtPair(src_token_ids, src_mask,
                     tgt_token_ids[:, :-1], tgt_mask[:, :-1])
    labels = tgt_token_ids[:, 1:]
    return inputs, labels

batch_size = 32
nmt_train_loader = DataLoader(nmt_train_set, batch_size=batch_size,
                              collate_fn=nmt_collate_fn, shuffle=True)
nmt_valid_loader = DataLoader(nmt_valid_set, batch_size=batch_size,
                              collate_fn=nmt_collate_fn)
nmt_test_loader = DataLoader(nmt_test_set, batch_size=batch_size,
                             collate_fn=nmt_collate_fn)

In [136]:
class NmtModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, pad_id=0, hidden_dim=512,
                 n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.encoder = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                              batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                              batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, pair):
        src_embeddings = self.embed(pair.src_token_ids)
        tgt_embeddings = self.embed(pair.tgt_token_ids)
        src_lengths = pair.src_mask.sum(dim=1)
        src_packed = pack_padded_sequence(
            src_embeddings, lengths=src_lengths.cpu(),
            batch_first=True, enforce_sorted=False)
        _, hidden_states = self.encoder(src_packed)
        outputs, _ = self.decoder(tgt_embeddings, hidden_states)
        return self.output(outputs).permute(0, 2, 1)

torch.manual_seed(42)
vocab_size = nmt_tokenizer.get_vocab_size()
nmt_model = NmtModel(vocab_size).to(device)

In [137]:
n_epochs = 10
xentropy = nn.CrossEntropyLoss(ignore_index=0)  # ignore <pad> tokens
optimizer = torch.optim.NAdam(nmt_model.parameters(), lr=0.001)
accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=vocab_size)
accuracy = accuracy.to(device)

history = train(nmt_model, optimizer, xentropy, accuracy,
                nmt_train_loader, nmt_valid_loader, n_epochs)

Epoch 1/10,                      train loss: 3.1338, train metric: 17.29%, valid metric: 20.30%
Epoch 2/10,                      train loss: 2.0359, train metric: 21.86%, valid metric: 21.31%
Epoch 3/10,                      train loss: 1.7177, train metric: 23.36%, valid metric: 21.52%
Epoch 4/10,                      train loss: 1.5585, train metric: 24.21%, valid metric: 21.51%
Epoch 5/10,                      train loss: 1.4643, train metric: 24.59%, valid metric: 21.42%
Epoch 6/10,                      train loss: 1.4143, train metric: 24.83%, valid metric: 21.32%
Epoch 7/10,                      train loss: 1.0877, train metric: 26.90%, valid metric: 22.21%
Epoch 8/10,                      train loss: 0.8589, train metric: 28.78%, valid metric: 22.25%
Epoch 9/10,                      train loss: 0.7411, train metric: 29.56%, valid metric: 22.20%
Epoch 10/10,                      train loss: 0.6579, train metric: 30.26%, valid metric: 22.15%


In [138]:
torch.save(nmt_model.state_dict(), "my_nmt_model.pt")

In [139]:
def translate(model, src_text, max_length=20, pad_id=0, eos_id=3):
    tgt_text = ""
    token_ids = []
    for index in range(max_length):
        batch, _ = nmt_collate_fn([{"source_text": src_text,
                                    "target_text": tgt_text}])
        with torch.no_grad():
            Y_logits = model(batch.to(device))
            Y_token_ids = Y_logits.argmax(dim=1)  # find the best token IDs
            next_token_id = Y_token_ids[0, index]  # take the last token ID

        next_token = nmt_tokenizer.id_to_token(next_token_id)
        tgt_text += " " + next_token
        if next_token_id == eos_id:
            break
    return tgt_text

In [140]:
nmt_model.eval()
translate(nmt_model, "I like soccer.")

' Me gusta el fútbol . </s>'

In [141]:
longer_text = "I like to play soccer with my friends."
translate(nmt_model, longer_text)

' Me gusta jugar con mis amigos . </s>'

## Beam Search

This is a very basic implementation of beam search. I tried to make it readable and understandable, but it's definitely not optimized for speed! The function first uses the model to find the top _k_ words to start the translations (where _k_ is the beam width). For each of the top _k_ translations, it evaluates the conditional probabilities of all possible words it could add to that translation. These extended translations and their probabilities are added to the list of candidates. Once we've gone through all top _k_ translations and all words that could complete them, we keep only the top _k_ candidates with the highest probability, and we iterate over and over until they all finish with an EOS token. The top translation is then returned (after removing its EOS token).

* Note: If p(S) is the probability of sentence S, and p(W|S) is the conditional probability of the word W given that the translation starts with S, then the probability of the sentence S' = concat(S, W) is p(S') = p(S) * p(W|S). As we add more words, the probability gets smaller and smaller. To avoid the risk of it getting too small, which could cause floating point precision errors, the function keeps track of log probabilities instead of probabilities: recall that log(a\*b) = log(a) + log(b), therefore log(p(S')) = log(p(S)) + log(p(W|S)).

In [142]:
def beam_search(model, src_text, beam_width=3, max_length=20,
                verbose=False, length_penalty=0.6):
    top_translations = [(torch.tensor(0.), "")]
    for index in range(max_length):
        if verbose:
            print(f"Top {beam_width} translations so far:")
            for log_proba, tgt_text in top_translations:
                print(f"    {log_proba.item():.3f} – {tgt_text}")

        candidates = []
        for log_proba, tgt_text in top_translations:
            if tgt_text.endswith(" </s>"):
                candidates.append((log_proba, tgt_text))
                continue  # don't add tokens after EOS token
            batch, _ = nmt_collate_fn([{"source_text": src_text,
                                        "target_text": tgt_text}])
            with torch.no_grad():
                Y_logits = model(batch.to(device))
                Y_log_proba = F.log_softmax(Y_logits, dim=1)
                Y_top_log_probas = torch.topk(Y_log_proba, k=beam_width, dim=1)

            for beam_index in range(beam_width):
                next_token_log_proba = Y_top_log_probas.values[0, beam_index, index]
                next_token_id = Y_top_log_probas.indices[0, beam_index, index]
                next_token = nmt_tokenizer.id_to_token(next_token_id)
                next_tgt_text = tgt_text + " " + next_token
                candidates.append((log_proba + next_token_log_proba, next_tgt_text))

        def length_penalized_score(candidate, alpha=length_penalty):
            log_proba, text = candidate
            length = len(text.split())
            penalty = ((5 + length) ** alpha) / (6 ** alpha)
            return log_proba / penalty

        top_translations = sorted(candidates,
                                  key=length_penalized_score,
                                  reverse=True)[:beam_width]

    return top_translations[-1][1]

In [143]:
beam_search(nmt_model, longer_text, beam_width=3)

'Me gusta jugar al fútbol con mis amigos . </s>'


In [144]:
longest_text = "I like to play soccer with my friends at the beach."
beam_search(nmt_model, longest_text, beam_width=3)

' Me gusta jugar con jugar con los jug adores de la playa . </s>'


Let's free some GPU RAM:

In [145]:
del nmt_model
free_vram(device)

# Attention Mechanisms

Let's implement Luong attention (a.k.a. dot-product attention):

In [146]:
def attention(query, key, value):  # note: dq == dk and Lk == Lv
    scores = query @ key.transpose(1, 2)  # [B,Lq,dq] @ [B,dk,Lk] = [B, Lq, Lk]
    weights = torch.softmax(scores, dim=-1)  # [B, Lq, Lk]
    return weights @ value  # [B, Lq, Lk] @ [B, Lv, dv] = [B, Lq, dv]

* `B` = batch size
* `Lq` = max query length in this batch
* `Lk` = max key length in this batch = `Lv` = max value length in this batch
* `dq` = query embedding size = `dk` = key embedding size
* `dv` = value embedding size

In [147]:
class NmtModelWithAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, pad_id=0, hidden_dim=512,
                 n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.encoder = nn.GRU(
            embed_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.decoder = nn.GRU(
            embed_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.output = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, pair):
        src_embeddings = self.embed(pair.src_token_ids)  # same as earlier
        tgt_embeddings = self.embed(pair.tgt_token_ids)  # same
        src_lengths = pair.src_mask.sum(dim=1)  # same
        src_packed = pack_padded_sequence(
            src_embeddings, lengths=src_lengths.cpu(),
            batch_first=True, enforce_sorted=False)  # same
        encoder_outputs_packed, hidden_states = self.encoder(src_packed)
        decoder_outputs, _ = self.decoder(tgt_embeddings, hidden_states)  # same
        encoder_outputs, _ = pad_packed_sequence(encoder_outputs_packed,
                                                 batch_first=True)
        attn_output = attention(query=decoder_outputs,
                                key=encoder_outputs, value=encoder_outputs)
        combined_output = torch.cat((attn_output, decoder_outputs), dim=-1)
        return self.output(combined_output).permute(0, 2, 1)

In [148]:
torch.manual_seed(42)
nmt_attn_model = NmtModelWithAttention(vocab_size).to(device)

n_epochs = 10
xentropy = nn.CrossEntropyLoss(ignore_index=0)  # ignore <pad> tokens
optimizer = torch.optim.NAdam(nmt_attn_model.parameters(), lr=0.001)
accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=vocab_size)
accuracy = accuracy.to(device)

history = train(nmt_attn_model, optimizer, xentropy, accuracy,
                nmt_train_loader, nmt_valid_loader, n_epochs)

Epoch 1/10,                      train loss: 3.0072, train metric: 17.94%, valid metric: 20.44%
Epoch 2/10,                      train loss: 2.1242, train metric: 21.48%, valid metric: 21.15%
Epoch 3/10,                      train loss: 1.9239, train metric: 22.38%, valid metric: 21.29%
Epoch 4/10,                      train loss: 1.8401, train metric: 22.76%, valid metric: 21.25%
Epoch 5/10,                      train loss: 1.7882, train metric: 23.05%, valid metric: 21.25%
Epoch 6/10,                      train loss: 1.7606, train metric: 23.14%, valid metric: 21.23%
Epoch 7/10,                      train loss: 1.4143, train metric: 25.02%, valid metric: 22.39%
Epoch 8/10,                      train loss: 1.1987, train metric: 26.35%, valid metric: 22.59%
Epoch 9/10,                      train loss: 1.0960, train metric: 26.93%, valid metric: 22.71%
Epoch 10/10,                      train loss: 1.0242, train metric: 27.44%, valid metric: 22.73%


In [149]:
torch.save(nmt_attn_model.state_dict(), "my_nmt_attn_model.pt")

In [150]:
translate(nmt_attn_model, longer_text)

' Me gusta jugar al fútbol con mis amigos . </s>'

In [151]:
translate(nmt_attn_model, longest_text)

' Me gusta jugar fu tbol con mis amigos en la playa . </s>'


Let's free some GPU RAM:

In [152]:
del nmt_attn_model
free_vram(device)

# Extra Material – Exploring Pretrained Embeddings

Let's load BERT's pretrained embeddings:

In [153]:
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
embedding_matrix = model.get_input_embeddings().weight.detach()

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Let's write a little helper function to get a given token's embedding:

In [154]:
def get_embedding(token):
    token_id = tokenizer.vocab[token]
    return embedding_matrix[token_id]

get_embedding("hello").shape

torch.Size([768])

The following function takes three tokens, computes `E(token2) - E(token1) + E(token3)` (where E(token) is the token's embedding) and finds the most similar token embeddings, using the _cosine similarity_. The cosine similarity between two vectors is the cosine of the angle between the vectors, so its value ranges from –1 (completely opposite) to +1 (perfectly aligned). It returns a list of (similarity, token) pairs.

In [155]:
import torch.nn.functional as F

def find_closest_tokens(token1, token2, token3, top_n=5):
    E = get_embedding
    result = E(token2) - E(token1) + E(token3)
    similarities = F.cosine_similarity(result, embedding_matrix)
    top_k = torch.topk(similarities, k=top_n)
    return [(sim.item(), tokenizer.decode(idx.item()))
            for sim, idx in zip(top_k.values, top_k.indices)]

Let's look at a few examples:

In [156]:
examples = [
    ("king", "queen", "man"),
    ("man", "woman", "nephew"),
    ("father", "mother", "son"),
    ("man", "woman", "doctor"),
    ("germany", "hitler", "italy"),
    ("england", "london", "germany"),
]
for (token1, token2, token3) in examples:
    print(f"{token1} is to {token2} as {token3} is to: ", end="")
    for similarity, token in find_closest_tokens(token1, token2, token3):
        print(f"{token} ({similarity:.1f})", end=" ")
    print()

king is to queen as man is to: man (0.7) woman (0.6) queen (0.5) girl (0.5) lady (0.5) 
man is to woman as nephew is to: nephew (0.8) niece (0.8) granddaughter (0.7) grandson (0.7) daughters (0.6) 
father is to mother as son is to: son (0.8) daughter (0.7) mother (0.6) sons (0.5) daughters (0.5) 
man is to woman as doctor is to: doctor (0.8) doctors (0.6) physician (0.5) woman (0.5) physicians (0.5) 
germany is to hitler as italy is to: hitler (0.8) mussolini (0.6) italy (0.6) fascism (0.6) italians (0.6) 
england is to london as germany is to: germany (0.7) london (0.7) berlin (0.6) german (0.5) munich (0.5) 


As you can see, the correct answer is generally among the closest. That said, if you play around with other examples, you will find that it only works for fairly simple examples: the embeddings aren't always that simple to interpret.

# Exercise solutions

## 1. to 6.

1. Stateless RNNs can only capture patterns whose length is less than, or equal to, the size of the windows the RNN is trained on. Conversely, stateful RNNs can capture longer-term patterns. However, implementing a stateful RNN is much harder⁠—especially preparing the dataset properly. Moreover, stateful RNNs do not always work better, in part because consecutive batches are not independent and identically distributed (IID). Gradient Descent is not fond of non-IID datasets.
2. In general, if you translate a sentence one word at a time, the result will be terrible. For example, the French sentence "Je vous en prie" means "You are welcome," but if you translate it one word at a time, you get "I you in pray." Huh? It is much better to read the whole sentence first and then translate it. A plain sequence-to-sequence RNN would start translating a sentence immediately after reading the first word, while an Encoder–Decoder RNN will first read the whole sentence and then translate it. That said, one could imagine a plain sequence-to-sequence RNN that would output silence whenever it is unsure about what to say next (just like human translators do when they must translate a live broadcast).
3. Variable-length input sequences can be handled by padding the shorter sequences so that all sequences in a batch have the same length, and using masking to ensure the RNN ignores the padding token. For better performance, you may also want to create batches containing sequences of similar sizes. Ragged tensors can hold sequences of variable lengths, and Keras now supports them, which simplifies handling variable-length input sequences (at the time of this writing, it still does not handle ragged tensors as targets on the GPU, though). Regarding variable-length output sequences, if the length of the output sequence is known in advance (e.g., if you know that it is the same as the input sequence), then you just need to configure the loss function so that it ignores tokens that come after the end of the sequence. Similarly, the code that will use the model should ignore tokens beyond the end of the sequence. But generally the length of the output sequence is not known ahead of time, so the solution is to train the model so that it outputs an end-of-sequence token at the end of each sequence.
4. Beam search is a technique used to improve the performance of a trained Encoder–Decoder model, for example in a neural machine translation system. The algorithm keeps track of a short list of the _k_ most promising output sentences (say, the top three), and at each decoder step it tries to extend them by one word; then it keeps only the _k_ most likely sentences. The parameter _k_ is called the _beam width_: the larger it is, the more CPU and RAM will be used, but also the more accurate the system will be. Instead of greedily choosing the most likely next word at each step to extend a single sentence, this technique allows the system to explore several promising sentences simultaneously. Moreover, this technique lends itself well to parallelization. You can implement beam search by writing a custom memory cell. Alternatively, TensorFlow Addons's seq2seq API provides an implementation.
5. An attention mechanism is a technique initially used in Encoder–Decoder models to give the decoder more direct access to the input sequence, allowing it to deal with longer input sequences. At each decoder time step, the current decoder's state and the full output of the encoder are processed by an alignment model that outputs an alignment score for each input time step. This score indicates which part of the input is most relevant to the current decoder time step. The weighted sum of the encoder output (weighted by their alignment score) is then fed to the decoder, which produces the next decoder state and the output for this time step. The main benefit of using an attention mechanism is the fact that the Encoder–Decoder model can successfully process longer input sequences. Another benefit is that the alignment scores make the model easier to debug and interpret: for example, if the model makes a mistake, you can look at which part of the input it was paying attention to, and this can help diagnose the issue. An attention mechanism is also at the core of the Transformer architecture, in the Multi-Head Attention layers. See the next answer.
6. Sampled softmax is used when training a classification model when there are many classes (e.g., thousands). It computes an approximation of the cross-entropy loss based on the logit predicted by the model for the correct class, and the predicted logits for a sample of incorrect words. This speeds up training considerably compared to computing the softmax over all logits and then estimating the cross-entropy loss. After training, the model can be used normally, using the regular softmax function to compute all the class probabilities based on all the logits.

## Work in progress

I'm working on the exercise solutions, hoping to finish them by December 2025. Thanks for your patience!