# What is seq-to-seq modelling?

Seq-to-seq (sequence-to-sequence) models are neural networks that map an input sequence to an output sequence, where the two sequences may differ in length. They are commonly used in natural language processing (NLP) tasks such as machine translation, text summarization, and question answering.

Seq-to-seq models consist of two main components: an encoder and a decoder. The encoder processes the input sequence and produces a fixed-length context vector that captures the relevant information from the input. The decoder then uses the context vector to generate the output sequence.
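To make the idea of a fixed-length context vector concrete, here is a minimal plain-Python sketch. It is not a trained model: the update rules, `toy_encoder` and `toy_decoder` are made up for illustration. It only shows that inputs of any length are folded into a state of the same size, which the decoder then unrolls step by step.

```python
import math

def toy_encoder(sequence, hidden_size=4):
    """Fold a variable-length list of numbers into a fixed-length state."""
    hidden = [0.0] * hidden_size
    for x in sequence:
        # Each step mixes the previous state with the current input (arbitrary weights)
        hidden = [math.tanh(0.5 * h + 0.1 * (i + 1) * x) for i, h in enumerate(hidden)]
    return hidden

def toy_decoder(context, max_steps=5):
    """Unroll the state into an output sequence, one element per step."""
    hidden = list(context)
    outputs = []
    for _ in range(max_steps):
        hidden = [math.tanh(h) for h in hidden]
        outputs.append(sum(hidden))
    return outputs

short_context = toy_encoder([1.0, 2.0])
long_context = toy_encoder([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# Both contexts have hidden_size entries, regardless of input length
print(len(short_context), len(long_context))
print(len(toy_decoder(long_context)))
```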


The encoder and decoder can be implemented using any type of neural network, such as a fully-connected network, a convolutional network, or a recurrent neural network (RNN). RNNs are particularly well-suited for seq-to-seq modelling because they can handle variable-length sequences and capture temporal dependencies.

Seq-to-seq models are trained using supervised learning, where the input and output sequences are paired. During training, the model is given an input sequence and the corresponding output sequence, and it learns to predict the output sequence given the input sequence.

In this lesson, we will learn how to implement a basic seq-to-seq model using PyTorch. We will start by preprocessing the data and creating a dataset object to iterate through the input and output sequences. Then, we will define the encoder and decoder models and train the seq-to-seq model using an optimizer and a loss function. Finally, we will use the trained model to generate new output sequences given input sequences.


## Getting the data

In [1]:
from datasets import load_dataset
import random

dataset = load_dataset("tatoeba", lang1="en", lang2="hr")
dataset = dataset["train"]
# randrange excludes the upper bound, so the index is always valid
dataset[random.randrange(len(dataset))]

Using custom data configuration en-hr-lang1=en,lang2=hr
Found cached dataset tatoeba (/Users/ice/.cache/huggingface/datasets/tatoeba/en-hr-lang1=en,lang2=hr/0.0.0/b3ea9c6bb2af47699c5fc0a155643f5a0da287c7095ea14824ee0a8afd74daf6)
100%|██████████| 1/1 [00:00<00:00, 34.33it/s]


{'id': '1298',
 'translation': {'en': 'Welcome to Tatoeba!', 'hr': 'Dobrodošla na Tatoeba.'}}

## Preprocessing the data

First, let's start by preprocessing the data. We will use the English-Croatian translation pairs loaded above. We need to tokenise each sentence and map every token to an id in a vocabulary. Rather than building a vocabulary from scratch, we will use pretrained (cased) BERT tokenisers, so the text is not lowercased.

In [2]:
import torch
from transformers import AutoTokenizer

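# Load the BERT tokenisers: an English one for the source sentences and a
# multilingual one (whose vocabulary covers Croatian) for the target sentences
source_tokeniser = AutoTokenizer.from_pretrained("bert-base-cased")
target_tokeniser = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")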


Take a look at how they tokenise input text:

In [3]:
encoded_example = source_tokeniser.encode("Hello world")
print(encoded_example)

[101, 8667, 1362, 102]


Why are the tokenised sequences longer than the number of words in the sequence?

Decode the sequence back into text to take a look.

In [4]:
source_tokeniser.decode(encoded_example)

'[CLS] Hello world [SEP]'

You can see that some "special tokens" have been inserted around the sequence.

- `[CLS]` marks the start of the sequence
    - Its name comes from BERT, the model that this tokeniser was built for. During BERT's training process, the model is asked to perform some classification to better understand the text. The `[CLS]` (classification) token marks the start of the input that BERT is asked to classify. It's not important to understand any more than that for now.
    - In other tokenisers, the equivalent is often `[SOS]`, which more intuitively indicates the Start Of Sequence
- `[SEP]` marks the end of the sequence
    - Its name also comes from BERT. The `[SEP]` (separator) token marks the separation between sentences that BERT is asked to classify. It's not important to understand any more than that for now.
    - In other tokenisers, the equivalent is often `[EOS]`, which more intuitively indicates the End Of Sequence
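The wrapping behaviour can be imitated with a tiny hand-written vocabulary. This is only a sketch: `toy_vocab`, `add_special_tokens_and_encode` and `decode_ids` are hypothetical names, and a real BERT tokeniser splits text into subwords rather than on whitespace.

```python
# Hypothetical toy vocabulary; real BERT vocabularies have tens of thousands of entries
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
id_to_token = {idx: token for token, idx in toy_vocab.items()}

def add_special_tokens_and_encode(text):
    # Wrap the whitespace-split tokens in start and end markers, then look up ids
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    return [toy_vocab[token] for token in tokens]

def decode_ids(ids):
    return " ".join(id_to_token[i] for i in ids)

ids = add_special_tokens_and_encode("Hello world")
print(ids)              # [1, 3, 4, 2]
print(decode_ids(ids))  # [CLS] hello world [SEP]
```

Two words become four ids, which is why the encoded sequence is longer than the sentence.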

Later, we will get around to generating new sequences (translations). To do that, we'll need to give the model the ids of the start of sequence token and the end of sequence token. Let's store those as variables.

In [5]:
decoder_start_of_sequence_token_id = target_tokeniser.get_vocab()["[CLS]"]
decoder_end_of_sequence_token_id = target_tokeniser.get_vocab()["[SEP]"]

## Creating the dataset object

Now that we have preprocessed the data, let's create a dataset object to iterate through the input and output sequences. We will use a PyTorch Dataset object for this purpose.

In [6]:
class TranslationDataset(torch.utils.data.Dataset):
    def __init__(self, source_lang="en", target_lang="hr"):
        dataset = load_dataset("tatoeba", lang1=source_lang, lang2=target_lang)
        dataset = dataset["train"]
        self.examples = []
        for ex in dataset:
            ex = ex["translation"]
            
            source_seq = ex[source_lang]
            source_seq = source_tokeniser(source_seq)
            source_seq = source_seq["input_ids"]

            target_seq = ex[target_lang]
            target_seq = target_tokeniser(target_seq)
            target_seq = target_seq["input_ids"]
            self.examples.append((source_seq, target_seq))
    
    def __len__(self):
        return len(self.examples)
    
    def __getitem__(self, idx):
        example = self.examples[idx]
        source_seq, target_seq = example
        return source_seq, target_seq


def test_dataset():
    dataset = TranslationDataset()
    for example in dataset:
        print(example)
        source, target = example
        print(source_tokeniser.decode(source))
        print(target_tokeniser.decode(target))
        print()
        break


test_dataset()
    



([101, 146, 1138, 1106, 1301, 1106, 2946, 119, 102], [101, 46052, 10147, 177, 14477, 32650, 38573, 10116, 119, 102])
[CLS] I have to go to sleep. [SEP]
[CLS] Moram ići spavati. [SEP]



## Create the dataloader

In [8]:
from torch.utils.data import DataLoader, random_split

def get_dataloaders(batch_size=2):
    dataset = TranslationDataset()
    train_len = round(0.8*len(dataset))
    val_len = round(0.1*len(dataset))
    test_len = len(dataset) - val_len - train_len
    train_dataset, val_dataset, test_dataset = random_split(dataset, [train_len, val_len, test_len])

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    return train_loader, val_loader, test_loader

def test_dataloaders():
    train_loader, val_loader, test_loader = get_dataloaders()
    for loader_name, loader in [("Train", train_loader), ("Validation", val_loader), ("Test", test_loader)]:
        for example in loader:
            print(f"{loader_name} loader example:")
            features, labels = example
            print("Features")
            print(features)
            print("Features shape:", features.shape)
            print("Labels")
            print(labels)
            print("Label shape:", labels.shape)
            print()
            break
    
test_dataloaders()



RuntimeError: each element in list of batch should be of equal size

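The variable length of the sequences causes this error. By default, the dataloader tries to stack the examples in a batch into a single tensor, and a tensor must be rectangular: every (source or target) sequence in a batch needs to be the same length.
Tensors can't have empty values, so we can't simply leave the shorter rows incomplete.
Instead, we need to pad each shorter sequence with `[PAD]` tokens until it is as long as the longest sequence in the batch. In both BERT vocabularies used here, `[PAD]` has id 0.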

Dataloaders use their `collate_fn` to group together examples returned from your dataset. It can be set using a keyword argument upon initialisation.
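As a plain-Python illustration of what padding a batch means, the sketch below uses a hypothetical helper, `pad_batch`, on made-up token ids. The lesson itself relies on `torch.nn.utils.rnn.pad_sequence`, which pads with 0 by default.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 2627, 1132, 102], [101, 2268, 1110, 1226, 1104, 102]]
padded = pad_batch(batch)
print(padded)
# [[101, 2627, 1132, 102, 0, 0], [101, 2268, 1110, 1226, 1104, 102]]
```

Padding is done per batch, so different batches can end up with different lengths.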

In [9]:


def get_dataloaders(batch_size=2):
    dataset = TranslationDataset()
    train_len = round(0.8*len(dataset))
    val_len = round(0.1*len(dataset))
    test_len = len(dataset) - val_len - train_len
    train_dataset, val_dataset, test_dataset = random_split(dataset, [train_len, val_len, test_len])

    def collate_fn(batch):    
        source = torch.nn.utils.rnn.pad_sequence([torch.tensor(ex[0]) for ex in batch], batch_first=True)
        target = torch.nn.utils.rnn.pad_sequence([torch.tensor(ex[1]) for ex in batch], batch_first=True)
        return source, target

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, collate_fn=collate_fn)
    return train_loader, val_loader, test_loader

test_dataloaders()




Train loader example:
Features
tensor([[ 101, 2627, 1132, 1128,  136,  102,    0,    0],
        [ 101, 2268, 1110, 1226, 1104, 2731,  119,  102]])
Features shape: torch.Size([2, 8])
Labels
tensor([[  101, 30186, 10294, 14382,   136,   102,     0,     0,     0,     0],
        [  101,   294, 11024, 34300, 10144, 14283, 12377, 50173,   119,   102]])
Label shape: torch.Size([2, 10])

Validation loader example:
Features
tensor([[  101,  1135,  1110, 21321,  1106, 24530,  1283,  1103,  2377,  1297,
           119,   102],
        [  101,  1188,  1110,   170,  3415,   119,   102,     0,     0,     0,
             0,     0]])
Features shape: torch.Size([2, 12])
Labels
tensor([[  101, 14321, 62019, 42176, 10343, 10144, 17674, 15847, 76014, 10325,
         10339, 19434, 13501, 17165, 21826,   119,   102],
        [  101, 11469, 10144, 19157,   119,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])
Label shape: torch.Size([2, 17])

Test loader exampl

## Defining the Encoder

In [222]:
# Define the encoder model
class Encoder(torch.nn.Module):
    def __init__(self, source_vocab_size, hidden_size=128, num_layers=3):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = torch.nn.Embedding(source_vocab_size, hidden_size)
        self.gru = torch.nn.GRU(hidden_size, hidden_size, num_layers, batch_first=True)
    
    def forward(self, input):
        self.init_hidden(len(input))
        embedded = self.embedding(input)
        output, self.hidden = self.gru(embedded, self.hidden)
        return self.hidden # return the hidden state
      
    def init_hidden(self, batch_size):
        self.hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size)


def test_encoder():
    source_vocab_size = len(source_tokeniser.get_vocab())
    target_vocab_size = len(target_tokeniser.get_vocab())
    encoder = Encoder(source_vocab_size)
    train_loader, _, _ = get_dataloaders()
    for batch in train_loader:
        inputs, _ = batch
        encoder.init_hidden(len(inputs))
        hidden = encoder(inputs)
        break
    print("Hidden shape:", hidden.shape)

test_encoder()



Hidden shape: torch.Size([3, 2, 128])


## Defining the Decoder

### Batch Decoding

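If target sequences have different lengths, then decoding them in a batch can be tricky or inefficient. It can be tricky because each sequence in the batch reaches its end-of-sequence token at a different timestep, and inefficient because sequences that have already finished still have to be stepped along until the longest one ends.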

For simplicity, in this example, we will implement a decoder that decodes each sequence independently, instead of as part of a batch. That's going to require quite a bit more code compared to the encoder.

### Decoder training vs inference

When the model is being used to make predictions in the real world, the next prediction will have to continue from the previously predicted token - we have no labels in the wild!

However, this can make training very difficult. That's because if a model predicts the incorrect token during decoding, and then bases the next token prediction upon that incorrect token, it's going to make it very hard to predict the correct token. As the sequence length increases, this problem gets worse. The previously incorrect tokens make it highly unlikely for the model to get anywhere close.

To combat this, we can use _teacher forcing_ during training, which is where we disregard the previously predicted token, and instead pass the correct token from the labels to the model at the next timestep.
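The difference between the two modes comes down to which token is fed back in at each step. The sketch below is a toy illustration, not the decoder used in this lesson: `decoder_inputs` and `always_seven` are hypothetical, and the "model" is a stand-in that always predicts the same token.

```python
def decoder_inputs(target_ids, predict_next, start_id, teacher_forcing):
    """Return the token fed to the decoder at each step."""
    inputs = []
    current = start_id
    for target in target_ids:
        inputs.append(current)
        predicted = predict_next(current)
        # Teacher forcing feeds the label; free running feeds the model's own guess
        current = target if teacher_forcing else predicted
    return inputs

def always_seven(token_id):
    # Hypothetical stand-in for an untrained model that always predicts token 7
    return 7

targets = [11, 12, 13, 102]
print(decoder_inputs(targets, always_seven, start_id=101, teacher_forcing=True))
# [101, 11, 12, 13]
print(decoder_inputs(targets, always_seven, start_id=101, teacher_forcing=False))
# [101, 7, 7, 7]
```

With teacher forcing, the inputs are the labels shifted by one step, so one wrong prediction does not derail the steps that follow.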

### `model.eval()` and `model.train()`

These methods toggle the behaviour of any child modules of a model that behave differently during training and evaluation (dropout and batch normalisation layers, for example).

They do this by switching the `training` attribute of every `torch.nn.Module` in the model between `True` and `False`.
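The mechanism can be imitated in a few lines of plain Python. `ToyModule` below is a simplified stand-in written for illustration, not PyTorch's actual implementation, but it follows the same idea: setting the flag on a parent also sets it on every child.

```python
class ToyModule:
    """Simplified imitation of how the training flag propagates to children."""

    def __init__(self, *children):
        self.training = True
        self.children = list(children)

    def train(self, mode=True):
        # Set the flag on this module, then recurse into every child
        self.training = mode
        for child in self.children:
            child.train(mode)
        return self

    def eval(self):
        # Evaluation mode is just training mode switched off
        return self.train(False)

layer = ToyModule()
model = ToyModule(layer)

model.eval()
print(model.training, layer.training)  # False False
model.train()
print(model.training, layer.training)  # True True
```

This is why checking `self.training` inside a module's `forward` method is enough to branch between training and evaluation behaviour.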

### Where is the batch dimension needed?

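Recurrent PyTorch layers (`RNN`, `LSTM`, `GRU`) can process batched or unbatched examples. Which one they assume depends on the number of dimensions of the input.

If the input is 2D ($T$ x $D$), the layer treats it as a single unbatched sequence, and the hidden state must then be 2D as well (number of layers x $D$). If the input is 3D, the layer treats it as a batch and expects a 3D hidden state (number of layers x $N$ x $D$), where $N$ is the batch size.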

### Where is the time dimension needed?

The time dimension is always expected by recurrent layers. This is true even if you're passing in individual timesteps (like we will implement below), in which case the size of that dimension should just be 1.
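The shape rules from the last two sections can be summarised in a small helper. `gru_shapes` is a hypothetical function written for this note. It does not call PyTorch; it only encodes the shapes documented for a unidirectional `torch.nn.GRU`.

```python
def gru_shapes(seq_len, input_size, hidden_size, num_layers, batch_size=None, batch_first=False):
    """Shapes a unidirectional GRU expects and returns.

    batch_size=None means an unbatched (single-sequence) input.
    """
    if batch_size is None:
        return {
            "input": (seq_len, input_size),
            "h_0": (num_layers, hidden_size),
            "output": (seq_len, hidden_size),
        }
    if batch_first:
        input_shape = (batch_size, seq_len, input_size)
        output_shape = (batch_size, seq_len, hidden_size)
    else:
        input_shape = (seq_len, batch_size, input_size)
        output_shape = (seq_len, batch_size, hidden_size)
    # batch_first never changes the layout of the hidden state
    hidden_shape = (num_layers, batch_size, hidden_size)
    return {"input": input_shape, "h_0": hidden_shape, "output": output_shape}

# A single unbatched timestep: the time dimension is still present, with size 1
print(gru_shapes(seq_len=1, input_size=128, hidden_size=128, num_layers=3))
# A batch of 2 sequences of length 8 with batch_first=True
print(gru_shapes(seq_len=8, input_size=128, hidden_size=128, num_layers=3, batch_size=2, batch_first=True))
```

In the batched case the hidden state has shape (3, 2, 128), which matches the hidden shape printed by the encoder test above.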

In [256]:
import torch.nn.functional as F

class Decoder(torch.nn.Module):
    def __init__(self, target_vocab_size, hidden_size, num_layers, start_of_sequence_token_id, end_of_sequence_token_id):
        super().__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.start_of_sequence_token_id = start_of_sequence_token_id
        self.end_of_sequence_token_id = end_of_sequence_token_id
        
        self.embedding = torch.nn.Embedding(target_vocab_size, hidden_size)
        self.gru = torch.nn.GRU(hidden_size, hidden_size, num_layers, batch_first=False, bidirectional=False)
        # TODO should this be batch first or not?
        self.out = torch.nn.Linear(hidden_size, target_vocab_size)
    
    def forward(self, encodings, target_seqs=None):

        if self.training:
            assert target_seqs is not None, "The decoder requires targets in training mode to implement teacher forcing."
            total_loss = 0
        else:
            assert target_seqs is None, "The decoder should not receive targets in evaluation mode."

        batch_size = encodings.shape[1]
        decodings = []
        # Decode each example in the batch independently
        for example_idx in range(batch_size):

            # Select this example's hidden state: (num_layers, hidden_size)
            encoding = encodings[:, example_idx, :]

            if self.training:
                target_seq = target_seqs[example_idx]
                decoding, loss = self.forward_single_example(encoding, target_seq)
                total_loss += loss
            else:
                decoding = self.forward_single_example(encoding)
            decodings.append(decoding)

        if self.training:
            # Return the loss summed over every example, not just the last one
            return decodings, total_loss
        else:
            return decodings

    # def eval(self):
    #     super().eval() # do everything that the parent model would do
    #     self.training = False
    
    # def train(self):
    #     super().train()
    #     self.training = True

    def forward_single_example(self, encoding, target_seq=None):

        if self.training:
            assert target_seq is not None, "The decoder requires targets in training mode to implement teacher forcing."
        else:
            assert target_seq is None, "The decoder should not receive targets in evaluation mode."

        # Start from the start-of-sequence token, shaped (1,) so that its
        # embedding has the time dimension (of size 1) that the GRU expects
        current_token_id = torch.tensor(self.start_of_sequence_token_id).unsqueeze(0)

        # The encoder's final hidden state initialises the decoder
        self.hidden = encoding

        if self.training:
            loss = 0

        predicted_sequence = []
        while True:

            # Predict next token
            embedding = self.embedding(current_token_id)
            output, self.hidden = self.gru(embedding, self.hidden)
            output = self.out(output)

            # Calculate loss if in training mode
            if self.training:
                # Note: target_seq still begins with [CLS], so the first label is the start token itself
                target_token_id = target_seq[0]
                target_seq = target_seq[1:]
                target_token_id = target_token_id.unsqueeze(0)
                loss += F.cross_entropy(output, target_token_id)

            current_token_id = torch.argmax(output, dim=1)
            predicted_sequence.append(current_token_id.item())

            # Implement teacher forcing if in training mode
            if self.training:
                current_token_id = target_token_id

            # Stopping conditions: a hard cap on length, or the end-of-sequence token
            if len(predicted_sequence) > 10:
                break
            elif current_token_id == self.end_of_sequence_token_id:
                break
        if self.training:
            # loss /= len(target_seq) # Normalise by sequence length
            return predicted_sequence, loss
        else:
            return predicted_sequence
    
    # def init_hidden(self):
    #     self.hidden = torch.zeros(self.num_layers, self.hidden_size) # Does not need batch dimension as it will process unbatched examples

def test_decoder():
    source_vocab_size = len(source_tokeniser.get_vocab())
    target_vocab_size = len(target_tokeniser.get_vocab())
    hidden_size = 128
    num_layers = 3
    encoder = Encoder(source_vocab_size, hidden_size, num_layers)
    decoder = Decoder(target_vocab_size, hidden_size, num_layers, decoder_start_of_sequence_token_id, decoder_end_of_sequence_token_id)

    max_target_seq_len = 20

    train_loader, _, _ = get_dataloaders()
    for batch in train_loader:
        inputs, targets = batch
        encoder.init_hidden(len(inputs))
        hidden = encoder(inputs)

        # Test training mode
        decoder.train()
        predicted_seq, loss = decoder(hidden, targets)
        print("Training mode tests passed")

        # Test evaluation mode
        print(decoder.training)
        decoder.eval()
        print(decoder.training)
        predicted_seq = decoder(hidden)
        print("Testing mode tests passed")

        break
    print("Hidden shape:", hidden.shape)

test_decoder()



Training mode tests passed
True
False
Testing mode tests passed
Hidden shape: torch.Size([3, 2, 128])


## Seq2Seq = Encoder + Decoder

Now we need to combine the encoder and decoder into a single model that encodes the source sequence, then decodes it into a target sequence.

This should be pretty simple. The Seq2Seq model simply needs to:
- Implement an initialiser
    - Initialise an encoder
    - Initialise a decoder
- Implement the forward pass where:
    - The source sequence is passed through the encoder to produce an encoding
    - The encoding is passed through the decoder to produce a prediction of the target sequence

In [262]:
class Seq2Seq(torch.nn.Module):
    def __init__(self, source_vocab_size, target_vocab_size, num_layers, hidden_size, decoder_start_of_sequence_token_id, decoder_end_of_sequence_token_id):
        super().__init__()
        self.encoder = Encoder(source_vocab_size, hidden_size, num_layers)
        self.decoder = Decoder(target_vocab_size, hidden_size, num_layers, decoder_start_of_sequence_token_id, decoder_end_of_sequence_token_id)
        self.eval() # Default to evaluation mode

    def forward(self, source_seqs, target_seqs=None):
        # Encode the source batch into the encoder's final hidden state
        encoding = self.encoder(source_seqs)
        if self.training:
            assert target_seqs is not None, "The seq2seq model requires targets in training mode to implement teacher forcing."
            target_seq, loss = self.decoder(encoding, target_seqs)
            return target_seq, loss
        else:
            target_seq = self.decoder(encoding)
            return target_seq

def test_seq2seq_model():
    source_vocab_size = len(source_tokeniser.get_vocab())
    target_vocab_size = len(target_tokeniser.get_vocab())
    hidden_size = 128
    num_layers = 3
    seq2seq = Seq2Seq(source_vocab_size, target_vocab_size, num_layers, hidden_size, decoder_start_of_sequence_token_id, decoder_end_of_sequence_token_id)
    train_loader, _, _ = get_dataloaders()
    for batch in train_loader:
        source_seqs, target_seqs = batch
        predicted_target_seqs = seq2seq(source_seqs)
        print("Initially random sequences generated:")
        for seq in predicted_target_seqs:
            print(target_tokeniser.decode(seq))
        break
    print("Seq2Seq model tests passed")

test_seq2seq_model()



Initially random sequences generated:
jugu Reese Reese Reeseтовой 996 996 996 יחד יחד יחד
impressivenkinnkinlığınlığın ऑ ऑ踊 Reese Reese ordo
Seq2Seq model tests passed


## The Training Loop

In [267]:
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

source_vocab_size = len(source_tokeniser.get_vocab())
target_vocab_size = len(target_tokeniser.get_vocab())

def train(model, dataloader, hparam_dict, lr=0.01, epochs=1):
    model.train()    
    batch_idx = 1
    writer = SummaryWriter()

    running_avg = None

    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    
    for epoch in range(epochs):
        for batch in dataloader:
            source_seqs, target_seqs = batch
            
            # Forward pass
            prediction, loss = model(source_seqs, target_seqs)

            # Log the loss and its running mean
            writer.add_scalar("Loss/Train", loss.item(), batch_idx)
            if running_avg is None:
                running_avg = loss.item()
            else:
                running_avg += (loss.item() - running_avg) / batch_idx
            if batch_idx % 100 == 0:
                writer.add_hparams(hparam_dict, {f"hparam/{batch_idx}-step loss": running_avg})

            # Log the decoded source, prediction and label for each example
            for source_ids, predicted_ids, target_ids in zip(source_seqs, prediction, target_seqs):
                source_text = source_tokeniser.decode(source_ids)
                predicted_text = target_tokeniser.decode(predicted_ids)
                target_text = target_tokeniser.decode(target_ids)
                writer.add_text(
                    "Text",
                    f"""
                    Source:    {source_text}
                    Predicted: {predicted_text}
                    Label:     {target_text}
                    """,
                    batch_idx
                )
            batch_idx += 1
            # print("Loss:", loss.item())

            # Do optimisation
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()

def test_train():
    source_vocab_size = len(source_tokeniser.get_vocab())
    target_vocab_size = len(target_tokeniser.get_vocab())
    hidden_size = 256
    num_layers = 3
    batch_size = 8
    lr = 0.01

    hparam_dict = {
        "hidden_size": hidden_size,
        "num_layers": num_layers,
        "batch_size": batch_size,
        "lr": lr
    }

    train_loader, val_loader, test_loader = get_dataloaders(batch_size=batch_size)
    model = Seq2Seq(source_vocab_size, target_vocab_size, num_layers, hidden_size, decoder_start_of_sequence_token_id, decoder_end_of_sequence_token_id)
    
    train(model, train_loader, hparam_dict)

test_train()



KeyboardInterrupt: 
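The running average logged in the training loop above is updated incrementally, so the full history of losses never needs to be stored. The sketch below isolates that update in a hypothetical helper, `running_mean`, using made-up loss values, and compares it to the ordinary mean.

```python
def running_mean(values):
    """Incrementally update the mean: mean += (value - mean) / step."""
    mean = None
    for step, value in enumerate(values, start=1):
        if mean is None:
            mean = value
        else:
            mean += (value - mean) / step
    return mean

losses = [4.0, 2.0, 3.0]  # made-up loss values
print(running_mean(losses))       # 3.0
print(sum(losses) / len(losses))  # 3.0
```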