## Encoder-Decoder Models

This tutorial walks through a basic encoder-decoder architecture and builds a German-to-English translation model using the IWSLT 2017 dataset and the Helsinki-NLP OPUS-MT tokenizer. The recommended reading for this tutorial is Chapter 2 of Large Language Models: A Deep Dive. You can find it here for under $15: [purchase](https://link.springer.com/book/10.1007/978-3-031-65647-7)

We will use PyTorch's Gated Recurrent Unit (`nn.GRU`) rather than the traditional Long Short-Term Memory (LSTM) cell. After the softmax layer, tokens are selected by greedy search; an alternative beam search component is also provided.
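One commonly cited difference between the two cells is size: a GRU has three gate blocks where an LSTM has four, so for the same dimensions it carries three quarters of the parameters. The sketch below checks this with illustrative sizes; `EMB_DIM`, `HID_DIM`, and `n_params` are names chosen for this example only.

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    """Count all parameters of a module (hypothetical helper for this example)."""
    return sum(p.numel() for p in module.parameters())

# Illustrative sizes, assumed for this comparison only
EMB_DIM, HID_DIM = 256, 512

gru = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

gru_params = n_params(gru)
lstm_params = n_params(lstm)

# GRU stacks 3 gate blocks per weight/bias, LSTM stacks 4
print(gru_params, lstm_params, gru_params / lstm_params)
```

Fewer parameters generally means less memory and faster steps, though which cell translates better is an empirical question.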

In [27]:
# Install prerequisites (torchtext 0.17.0 is built against torch 2.2.0)
!pip -q install --upgrade torch torchvision torchaudio torchtext==0.17.0 --index-url https://download.pytorch.org/whl/cu118
!pip -q install --upgrade datasets transformers sentencepiece tqdm sacrebleu

In [2]:
# Imports & global config
import math, random, itertools, time, os, gc
from pathlib import Path
from functools import partial

import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

from datasets import load_dataset
from transformers import AutoTokenizer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
SEED = 7
random.seed(SEED); torch.manual_seed(SEED); torch.backends.cudnn.deterministic = True



### The Maths

1. Hidden state: `h_t = f(h_{t-1}, x_t)`, where `f` is an RNN cell (`nn.GRU`).
2. Context vector: `c = m(h_1,...,h_T)`, built from the encoder's hidden states (here, its final hidden state) and fed into the decoder.
3. Decoder update: `s_t' = g(s_{t-1}', y_{t-1}', c)`, where `g` is another RNN cell (`nn.GRU`).
4. Output distribution: a softmax over the vocabulary, applied to the decoder output at each step.
5. Loss: the usual cross-entropy, `L = −Σ_t log P(y_t)`, computed with `criterion()`.
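To make steps 4 and 5 concrete, here is a minimal sketch that computes the loss by hand from a softmax and compares it with PyTorch's built-in cross-entropy. The logits are random, `PAD_ID` is a hypothetical padding id, and skipping padded positions with `ignore_index` is an assumption of this example, as is averaging over tokens rather than summing.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, T, V = 2, 3, 5   # batch size, target length, vocabulary size (toy values)
PAD_ID = 0          # hypothetical padding id for this example

logits = torch.randn(B, T, V)                        # stand-in for decoder outputs
targets = torch.tensor([[1, 4, 2], [3, 2, PAD_ID]])  # last position of row 2 is padding

# Step 4: softmax (in log space) over the vocabulary
log_probs = F.log_softmax(logits, dim=-1)

# Step 5: pick log P(y_t) for each gold token, mask padding, average over real tokens
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [B, T]
mask = targets != PAD_ID
manual_loss = -(picked * mask).sum() / mask.sum()

# The same quantity from the built-in loss, which applies log-softmax internally
builtin_loss = F.cross_entropy(
    logits.reshape(-1, V), targets.reshape(-1), ignore_index=PAD_ID
)

print(manual_loss.item(), builtin_loss.item())
```

Because the built-in loss applies the log-softmax itself, it is fed raw logits rather than probabilities.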

### Build the Dataset and Encoder/Padding Helpers

We'll use the IWSLT 2017 DE<->EN dataset as training data, taking the first 50% of its training split.

In [3]:
# This is our DE->EN dataset from huggingface
raw_ds = load_dataset("iwslt2017", "iwslt2017-de-en",
                      split={"train": "train[:50%]",
                             "valid": "validation" ,
                             "test" : "test"})


DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 103056
    })
    valid: Dataset({
        features: ['translation'],
        num_rows: 888
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 8079
    })
})


In [4]:
# Another one of these notebooks will go over creating a tokenizer
tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
# We're going to translate from German to English
SRC_LANG, TGT_LANG = "de", "en"




In [19]:
# Inspect the training split and its size:
print(type(raw_ds))
print(raw_ds['train'])

# Print an example:
print(raw_ds["train"][0]["translation"][SRC_LANG])
print(raw_ds["train"][0]["translation"][TGT_LANG])

<class 'datasets.dataset_dict.DatasetDict'>
Dataset({
    features: ['translation'],
    num_rows: 103056
})
Vielen Dank, Chris.
Thank you so much, Chris.


In [23]:
# Tokenize a batch of translation pairs into token ids.
# With batched=True, `map` passes many rows to this function at once.
def encode_batch(batch, tokenizer, max_len=128):
  src = [b[SRC_LANG] for b in batch["translation"]]
  tgt = [b[TGT_LANG] for b in batch["translation"]]
  model_in = tokenizer(src, truncation=True, max_length=max_len, return_tensors=None)
  # Note: the target is encoded with the same call as the source. Marian tokenizers
  # also accept `text_target=tgt`, which uses the target-side SentencePiece model.
  model_out = tokenizer(tgt, truncation=True, max_length=max_len, return_tensors=None)
  return {
      "src_ids": model_in["input_ids"],
      "tgt_ids": model_out["input_ids"]
  }

encoded = raw_ds.map(partial(encode_batch, tokenizer=tok), batched=True, remove_columns=["translation"])


In [24]:
# Inspect one encoded example. The 'de'/'en' text fields are now lists of token ids:
print(encoded["train"][0])

{'src_ids': [8567, 2461, 2, 14718, 3, 0], 'tgt_ids': [6539, 5741, 41, 88, 285, 2637, 2, 14718, 3, 0]}


In [29]:
# Pad the token ids in a batch to a common length for the RNN
def collate(batch):
  # Convert to torch tensors; nn.Embedding expects integer (long) indices
  src = [torch.tensor(x["src_ids"], dtype=torch.long) for x in batch]
  tgt = [torch.tensor(x["tgt_ids"], dtype=torch.long) for x in batch]

  # We pad the inputs so they are all the same length for the RNN.
  # The pad token is taken from the tokenizer
  src_pad = pad_sequence(src, padding_value=tok.pad_token_id, batch_first=True)
  tgt_pad = pad_sequence(tgt, padding_value=tok.pad_token_id, batch_first=True)
  return src_pad.to(DEVICE), tgt_pad.to(DEVICE)

BATCH_SIZE = 96
train_loader = DataLoader(encoded["train"], batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate)
valid_loader = DataLoader(encoded["valid"], batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate)
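
To see what `pad_sequence` does inside `collate`, here is a small sketch on toy sequences. `PAD_ID = 99` is an illustrative stand-in for `tok.pad_token_id`, and the token ids are made up.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 99  # illustrative stand-in for tok.pad_token_id

# Three "sentences" of different lengths (made-up token ids)
seqs = [
    torch.tensor([5, 6, 7, 2]),
    torch.tensor([8, 2]),
    torch.tensor([9, 4, 2]),
]

# Right-pad every sequence to the longest one in the batch -> [B, T_max]
padded = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)

# A boolean mask recovers which positions are real tokens, and the true lengths
pad_mask = padded != PAD_ID
lengths = pad_mask.sum(dim=1)

print(padded)
print(lengths)
```

Padding is per batch, so each batch is only as wide as its own longest sequence.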

### Building the Model

We'll create an encoder and a decoder, each built around an `nn.GRU`.

In [None]:
class Encoder(nn.Module):
  def __init__(self, vocab_size, emb_dim, hid_dim=512, n_layers=2, bidir=True, dropout=.2):
    super().__init__()
    # randomly initialized embedding, learned jointly with the rest of the model
    self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=tok.pad_token_id)

    # the RNN unit
    self.rnn = nn.GRU(emb_dim, hid_dim, num_layers=n_layers,
                      bidirectional=bidir, batch_first=True, dropout=dropout)
    self.dropout = nn.Dropout(dropout)
    self.bidir = bidir

    # hidden state size H
    self.hid_dim = hid_dim
    self.n_layers = n_layers

    # if bidir, 2 * H output
    self.dir_mult = 2 if bidir else 1

  def forward(self, src):
    # src = [B, T]
    # after embedding, converts to [B, T, E]
    embedded = self.dropout(self.embedding(src))

    # outputs: [B, T, H * dir_mult], the forward (and, if bidir, backward) state for each token
    # with H = 512 that is 1024 if bidir and 512 if not
    # hidden: [n_layers * dir_mult, B, H]
    outputs, hidden = self.rnn(embedded)

    # hidden = [n_layers*dir, B, H]
    if self.bidir:
      # If bidir, reshape to [n_layers, 2, B, H]
      hidden = hidden.view(self.n_layers, self.dir_mult, src.size(0), self.hid_dim)
      # then take the last layer's final forward and final backward states;
      # each is [B, H], so the concatenation is [B, 2 * H] (1024 for H = 512)
      hidden = torch.cat((hidden[-1,0,:,:], hidden[-1,1,:,:]), dim=-1)
    else:
      # otherwise take the final layer: [B, H] (512 by default)
      hidden = hidden[-1]
    return outputs, hidden
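
The reshaping in `forward` relies on how `nn.GRU` lays out its final hidden states. The sketch below checks that layout on a small bidirectional GRU: the last layer's forward state should equal the forward half of `outputs` at the last time step, and the backward state should equal the backward half at the first time step. All sizes are toy values chosen for this example, and dropout is left at its default of 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, E, H, N_LAYERS = 3, 5, 8, 16, 2  # toy sizes for this check
rnn = nn.GRU(E, H, num_layers=N_LAYERS, bidirectional=True, batch_first=True)

x = torch.randn(B, T, E)
outputs, hidden = rnn(x)  # outputs: [B, T, 2H], hidden: [N_LAYERS * 2, B, H]

# Same reshape as in the Encoder: separate the layer and direction axes
hidden = hidden.view(N_LAYERS, 2, B, H)
fwd_last, bwd_last = hidden[-1, 0], hidden[-1, 1]   # each [B, H]
context = torch.cat((fwd_last, bwd_last), dim=-1)   # [B, 2H]

# Forward direction finishes at t = T-1, backward direction finishes at t = 0
fwd_matches = torch.allclose(fwd_last, outputs[:, -1, :H], atol=1e-6)
bwd_matches = torch.allclose(bwd_last, outputs[:, 0, H:], atol=1e-6)

print(context.shape, fwd_matches, bwd_matches)
```

Note that this equivalence holds for unpadded inputs; with padded batches the forward state at the last time step has also consumed padding tokens.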

In [28]:
class Decoder(nn.Module):
  def __init__(self, vocab_size, emb_dim=256, enc_hid_dim=512, dec_hid_dim=512, n_layers=2, dropout=.2):
    super().__init__()
    # token embedding; the decoder implements s_t' = g(s_{t-1}', y_{t-1}', c),
    # so the GRU input is the previous token's embedding concatenated with the context
    self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=tok.pad_token_id)
    self.rnn = nn.GRU(emb_dim + enc_hid_dim,
                      dec_hid_dim,
                      num_layers=n_layers,
                      batch_first=True,
                      dropout=dropout)
    self.fc_out = nn.Linear(dec_hid_dim, vocab_size)
    self.dropout = nn.Dropout(dropout)
    self.enc_hid_dim = enc_hid_dim

  def forward(self, input_tok, hidden, context):
    # input tok: [B]
    # hidden: [n_layers, B, H]
    # context: [B, enc_hid_dim]

    # We want a single timestep/token, so we change via unsqueeze to [B, 1]
    input_tok = input_tok.unsqueeze(1)

    # Then it becomes [B, 1, emb_dim]
    embedded = self.dropout(self.embedding(input_tok))

    # Likewise the context becomes [B, 1, enc_hid_dim];
    # enc_hid_dim must match the size of the hidden vector returned by the encoder
    context = context.unsqueeze(1)

    # Concatenate embedding and context along the last dim: [B, 1, emb_dim + enc_hid_dim]
    rnn_in = torch.cat((embedded, context), dim=-1)

    # One GRU step: output is [B, 1, dec_hid_dim], hidden is the updated decoder state
    output, hidden = self.rnn(rnn_in, hidden)

    # Squeeze the time dim: [B, 1, dec_hid_dim] becomes [B, dec_hid_dim],
    # then the linear layer projects to vocabulary logits [B, vocab_size]
    pred = self.fc_out(output.squeeze(1))
    return pred, hidden
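
Since the decoder processes one token per call, generation is a loop around it. Below is a minimal sketch of greedy search under stated assumptions: `TinyDecoder` is a hypothetical single-layer stand-in with the same call signature as the `Decoder` above, `greedy_decode` is an illustrative helper rather than the notebook's own implementation, and the `bos_id` and `eos_id` values are made up.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Hypothetical stand-in with the same forward signature as the Decoder above."""
    def __init__(self, vocab_size=20, emb_dim=8, ctx_dim=16, hid_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + ctx_dim, hid_dim, batch_first=True)
        self.fc_out = nn.Linear(hid_dim, vocab_size)

    def forward(self, input_tok, hidden, context):
        emb = self.embedding(input_tok.unsqueeze(1))             # [B, 1, E]
        rnn_in = torch.cat((emb, context.unsqueeze(1)), dim=-1)  # [B, 1, E + C]
        output, hidden = self.rnn(rnn_in, hidden)                # [B, 1, H]
        return self.fc_out(output.squeeze(1)), hidden            # [B, V]

@torch.no_grad()
def greedy_decode(decoder, hidden, context, bos_id, eos_id, max_len=10):
    """Pick the highest-scoring token at each step and feed it back in."""
    batch_size = context.size(0)
    cur_tok = torch.full((batch_size,), bos_id, dtype=torch.long)
    finished = torch.zeros(batch_size, dtype=torch.bool)
    steps = []
    for _ in range(max_len):
        logits, hidden = decoder(cur_tok, hidden, context)
        # argmax of the logits equals argmax of the softmax probabilities
        cur_tok = logits.argmax(dim=-1)
        # sequences that already ended keep emitting EOS
        cur_tok = cur_tok.masked_fill(finished, eos_id)
        steps.append(cur_tok)
        finished |= cur_tok == eos_id
        if finished.all():
            break
    return torch.stack(steps, dim=1)  # [B, generated_len]

torch.manual_seed(0)
decoder = TinyDecoder().eval()
context = torch.randn(4, 16)     # stand-in for the encoder's context vector
hidden = torch.zeros(1, 4, 16)   # initial decoder state: [n_layers, B, H]
generated = greedy_decode(decoder, hidden, context, bos_id=1, eos_id=2)
print(generated.shape)
```

Greedy search commits to a single token per step; beam search instead keeps several candidate sequences alive at the cost of more decoder calls.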