## TC 5033
## Deep Learning
## Transformers

#### Activity 4: Implementing a Translator

- Objective

    To understand the Transformer architecture by implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. It already implements a Transformer from scratch, as explained in one of [Week 9's videos](https://youtu.be/XefFj4rLHgU).

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully and document it using pictures, figures, and Markdown cells. You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Training a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in Canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.

In [1]:
# # Install packages
# %pip install -q torch --index-url https://download.pytorch.org/whl/cu121
# %pip install -q torcheval
# %pip install -q tatoebatools numpy pandas dask[dataframe]
# %pip install -q torchinfo

In [2]:
%load_ext autoreload
%autoreload 2

Import the required libraries

In [3]:
import re
import os
import math
import time
import torch
import torch.hub
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as utils
import torcheval.metrics.functional as metrics

import multiprocessing

import numpy as np
import pandas as pd
import dask.dataframe as dd

from collections import Counter
from tatoebatools import tatoeba
from torchinfo import summary
from tqdm import tqdm
from utils import get_device, DummyTokenizer, Truncate, AddToken, VocabTransform, ToTensor, PadTransform

In [4]:
DEVICE = get_device()

TXT_DIR = './translations'
TXT_SRC = 'eng'
TXT_TGT = 'spa'

TOK_SRC = 'en'
TOK_TGT = 'es'

MAX_LEN = 30
BATCH_SIZE = 96
MODEL_DIM = 512
NUM_HEADS = 8
HIDDEN_DIM = 2024
ENC_NUM_LAYERS = 6
DEC_NUM_LAYERS = 6
DROPOUT = 0.1
EPOCHS = 20

UNK_TOK = '[unk]'
PAD_TOK = '[pad]'
BOS_TOK = '[bos]'
EOS_TOK = '[eos]'

In [5]:
print(f"Available device: {DEVICE.type}")

Available device: cuda


Get the dataset using tatoebatools

In [6]:
tatoeba.dir = TXT_DIR

src_sentences_df = tatoeba.get('sentences_detailed', [TXT_SRC])
tgt_sentences_df = tatoeba.get('sentences_detailed', [TXT_TGT])
src_tgt_links_df = tatoeba.get('links', language_codes=[TXT_SRC, TXT_TGT])

user_df = tatoeba.get('user_languages', language_codes='*')

Pair Spanish and English sentences correctly into a new dataframe

In [7]:
def get_translations(src_df, tgt_df, links_df, user_df, level=0.0):
    df1 = pd.merge(src_df, links_df, how='left', left_on='sentence_id', right_on='sentence_id')
    df2 = pd.merge(df1, tgt_df, how='left', left_on='translation_id', right_on='sentence_id')
    df3 = pd.merge(df2, user_df, how='left', left_on='username_y', right_on='username')

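    # Keep rows whose translator meets the skill threshold and has a known username.
    # NaN never compares equal to anything, so `!= np.nan` would be True for every row;
    # notna() is the explicit null check.
    mask = (df3.skill_level >= level) & df3.username.notna()

    columns = ['sentence_id_x', 'lang_x', 'text_x', 'sentence_id_y', 'lang_y', 'text_y']
    return df3.where(mask).dropna()[columns].reset_index(drop=True)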
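One detail worth checking in a filter like the one in `get_translations`: after a left merge, unmatched usernames are NaN, and NaN never compares equal to anything, itself included. An inequality test against NaN is therefore true for every row and filters nothing, so a dedicated null check is needed. A minimal standard-library illustration follows; the sample values and the `is_missing` helper are made up for this sketch.

```python
import math

# Hypothetical usernames after a left merge; NaN marks a missing match.
usernames = ["alice", math.nan, "bob"]

# Inequality against NaN is always True, so nothing is filtered out.
kept_by_inequality = [u for u in usernames if u != math.nan]

def is_missing(value):
    # Explicit null check: only float NaN counts as missing here.
    return isinstance(value, float) and math.isnan(value)

kept_by_null_check = [u for u in usernames if not is_missing(u)]

print(len(kept_by_inequality))  # 3
print(kept_by_null_check)       # ['alice', 'bob']
```

In pandas, the corresponding null checks are `Series.notna()` and `Series.isna()`.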

In [8]:
if not os.path.exists(os.path.join(TXT_DIR, "translations.csv")):
    translations_df = get_translations(src_sentences_df, tgt_sentences_df, src_tgt_links_df, user_df)
    translations_df = translations_df.drop_duplicates(subset=['text_x', 'text_y']).reset_index(drop=True)
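    # index=False avoids writing the row index, which otherwise comes back as "Unnamed: 0" columns on reload
    translations_df.to_csv(os.path.join(TXT_DIR, "translations.csv"), index=False)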
else:
    translations_df = pd.read_csv(os.path.join(TXT_DIR, "translations.csv"))
    
len(translations_df)

96139

In [9]:
translations_df.sample(20)

Unnamed: 0.1,Unnamed: 0,sentence_id_x,lang_x,text_x,sentence_id_y,lang_y,text_y
27269,27269,2236266.0,eng,Tom felt safe.,9432153.0,spa,Tom se sintió seguro.
9474,9474,978516.0,eng,Tom doesn't want to live in Boston for more th...,978520.0,spa,Tom no quiere vivir en Boston por más de un año.
65944,65944,7782709.0,eng,The text is illegible.,7551085.0,spa,El texto es ilegible.
92994,92994,11537073.0,eng,We lost our passports on the trip.,11537116.0,spa,Perdimos nuestros pasaportes en el viaje.
33999,33999,2431559.0,eng,"I'm not very good at making pizza, but Tom is.",2486055.0,spa,"No se me da muy bien hacer pizzas, pero a Tom sí."
29096,29096,2253183.0,eng,I haven't formed an opinion on that subject yet.,2253184.0,spa,Todavía no me he formado una opinión sobre ese...
52343,52343,4771575.0,eng,Tom doesn't care what I do.,4725490.0,spa,A Tom le da igual lo que yo haga.
93972,93972,11830071.0,eng,Ivan’s dog was terrified of something.,11830090.0,spa,Al perro de Iván le aterrorizaba algo.
76180,76180,9432352.0,eng,It's on the other side of the river.,11836154.0,spa,Está al otro lado del río.
16276,16276,1553395.0,eng,Something's going to happen. I can feel it.,1613658.0,spa,Va a pasar algo. Puedo sentirlo.


Create vocabulary for both source and target languages

In [10]:
rownames = {
    TXT_SRC: 'text_x',
    TXT_TGT: 'text_y',
}

In [11]:
def preprocess_sentece(sentence: str):
    sentence = sentence.lower().strip()
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    return sentence.strip().split()

def idx2word(word_count: Counter, specials=(PAD_TOK, UNK_TOK, BOS_TOK, EOS_TOK)):
    # Special tokens get the lowest indices; the remaining words follow in alphabetical order.
    special2idx = { token: idx for idx, token in enumerate(specials) }
    word2idx = { word: idx for idx, word in enumerate(sorted(word_count), start=len(special2idx)) }
    
    word2idx = { **special2idx, **word2idx }
    # Inverse mapping, named so it does not shadow this function.
    index2word = { idx: word for word, idx in word2idx.items() }
    
    return word2idx, index2word


def build_vocab(dataframe, rownames):
    c1, c2 = Counter(), Counter()
    ddf = dd.from_pandas(dataframe, npartitions=multiprocessing.cpu_count())
    
    def process(row):
        tokens = preprocess_sentece(row[rownames[TXT_SRC]])
        c1.update(tokens)
        
        tokens = preprocess_sentece(row[rownames[TXT_TGT]])
        c2.update(tokens)
        
    ddf.apply(process, axis=1, meta=dataframe).compute()
    
    return c1, c2


src, tgt = build_vocab(translations_df, rownames)

src_word2idx, src_idx2word = idx2word(src)
tgt_word2idx, tgt_idx2word = idx2word(tgt)

src_vocab_len = len(src_word2idx)
tgt_vocab_len = len(tgt_word2idx)
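To make the vocabulary layout concrete, here is a small sketch on a toy corpus. It follows the same scheme as the cell above (special tokens first, then words in alphabetical order); `build_mappings` and `toy_counts` are hypothetical names used only for this illustration.

```python
from collections import Counter

SPECIALS = ("[pad]", "[unk]", "[bos]", "[eos]")

def build_mappings(word_count, specials=SPECIALS):
    # Special tokens take the lowest ids, in the order given.
    word2idx = {token: idx for idx, token in enumerate(specials)}
    # Remaining words follow in alphabetical order.
    for idx, word in enumerate(sorted(word_count), start=len(specials)):
        word2idx[word] = idx
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

toy_counts = Counter("tom felt safe tom felt fine".split())
word2idx, idx2word = build_mappings(toy_counts)

print(word2idx)
# Words missing from the vocabulary fall back to the [unk] id.
unknown_id = word2idx.get("boston", word2idx["[unk]"])
print(unknown_id)  # 1
```

With this ordering, `[pad]` is 0, `[bos]` is 2, and `[eos]` is 3, which is how those ids can be recognised in the batch tensors printed later in the notebook.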

In [12]:
class SentencesDataset(utils.Dataset):
    
    def __init__(self, translations_df: pd.DataFrame, src_word2idx: dict, tgt_word2idx: dict):
        self.translations_df = translations_df
        self.src_word2idx = src_word2idx
        self.tgt_word2idx = tgt_word2idx
        
    def __len__(self):
        return len(self.translations_df)

    def __getitem__(self, index):
        row = self.translations_df.iloc[index]
        return row.text_x, row.text_y

In [13]:
translations_dataset = SentencesDataset(translations_df, src_word2idx, tgt_word2idx)

In [14]:
train_dataset, valid_dataset, test_dataset = utils.random_split(translations_dataset, lengths=[0.8, 0.15, 0.05])

len(train_dataset), len(valid_dataset), len(test_dataset)

(76912, 14421, 4806)
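The split sizes above can be reproduced by hand. The sketch below assumes the documented behaviour of `random_split` with fractional lengths: each share is floored, and leftover items are handed out one at a time starting from the first split. `split_lengths` is a hypothetical helper, not part of PyTorch.

```python
import math

def split_lengths(total, fractions):
    # Floor each fractional share first.
    lengths = [math.floor(total * frac) for frac in fractions]
    # Distribute the leftover items round-robin, starting from the first split.
    remainder = total - sum(lengths)
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths

sizes = split_lengths(96139, [0.8, 0.15, 0.05])
print(sizes)  # [76912, 14421, 4806]
```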

In [15]:
src_transforms = nn.Sequential(
    DummyTokenizer(tokenizer_fn=preprocess_sentece),
    Truncate(max_seq_len=MAX_LEN-2),
    AddToken(token_id=BOS_TOK, begin=True),
    AddToken(token_id=EOS_TOK, begin=False),
    VocabTransform(word2idx=src_word2idx, unk_tok=src_word2idx.get(UNK_TOK)),
    ToTensor(padding_value=src_word2idx.get(PAD_TOK), dtype=torch.long),
    PadTransform(max_length=MAX_LEN, pad_value=src_word2idx.get(PAD_TOK))
)

tgt_transforms = nn.Sequential(
    DummyTokenizer(tokenizer_fn=preprocess_sentece),
    Truncate(max_seq_len=MAX_LEN-2),
    AddToken(token_id=BOS_TOK, begin=True),
    VocabTransform(word2idx=tgt_word2idx, unk_tok=tgt_word2idx.get(UNK_TOK)),
    ToTensor(padding_value=tgt_word2idx.get(PAD_TOK), dtype=torch.long),
    PadTransform(max_length=MAX_LEN, pad_value=tgt_word2idx.get(PAD_TOK))
)

label_transforms = nn.Sequential(
    DummyTokenizer(tokenizer_fn=preprocess_sentece),
    Truncate(max_seq_len=MAX_LEN-2),
    AddToken(token_id=EOS_TOK, begin=False),
    VocabTransform(word2idx=tgt_word2idx, unk_tok=tgt_word2idx.get(UNK_TOK)),
    ToTensor(padding_value=tgt_word2idx.get(PAD_TOK), dtype=torch.long),
    PadTransform(max_length=MAX_LEN, pad_value=tgt_word2idx.get(PAD_TOK))
)

In [16]:
def collate_batch_fn(batch):
    x, y = list(zip(*batch))
    return src_transforms(x), tgt_transforms(y), label_transforms(y)

In [17]:
train_dataloader = utils.DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn, shuffle=True)
valid_dataloader = utils.DataLoader(valid_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn, shuffle=True)
test_dataloader = utils.DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn, shuffle=False)

In [18]:
for x, y, label in train_dataloader:
    print(x.shape, x[0])
    print(y.shape, y[0])
    print(label.shape, label[0])
    break

torch.Size([96, 30]) tensor([    2,  8093, 12665, 16552, 10721, 16767, 16552, 11320,     3,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])
torch.Size([96, 30]) tensor([    2, 30198, 23049, 17452, 19672,  2768,  1250, 18676,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])
torch.Size([96, 30]) tensor([30198, 23049, 17452, 19672,  2768,  1250, 18676,     3,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])
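The printed batch shows the teacher-forcing layout: the decoder input starts with `[bos]` (id 2) and the label is the same sentence shifted one position left and closed with `[eos]` (id 3). The pure-Python sketch below mimics that layout. The `encode` helper, the word ids, and the shortened length are illustrative assumptions and do not reproduce the transform classes imported from `utils`.

```python
PAD, UNK, BOS, EOS = 0, 1, 2, 3
MAX_LEN = 8  # shortened from the notebook's 30 for readability

def encode(token_ids, add_bos, add_eos, max_len=MAX_LEN):
    # Leave room for up to two special tokens, like Truncate(MAX_LEN - 2).
    ids = list(token_ids[: max_len - 2])
    if add_bos:
        ids = [BOS] + ids
    if add_eos:
        ids = ids + [EOS]
    # Right-pad to a fixed length.
    return ids + [PAD] * (max_len - len(ids))

target_words = [30, 31, 32]  # hypothetical ids for a 3-word target sentence

decoder_input = encode(target_words, add_bos=True, add_eos=False)
label = encode(target_words, add_bos=False, add_eos=True)

print(decoder_input)  # [2, 30, 31, 32, 0, 0, 0, 0]
print(label)          # [30, 31, 32, 3, 0, 0, 0, 0]
```

At each position `t`, `label[t]` is the token the decoder should predict after seeing `decoder_input[: t + 1]`.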


### Model Implementation

In [19]:
from transformer import Seq2SeqTranslatorTransformer
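The `transformer` module itself is not shown in this notebook. As a reference point for documenting it, the sketch below implements single-head scaled dot-product attention with a causal mask in plain Python. This is the standard textbook formulation and may differ in detail from the provided implementation; all names here are illustrative.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        # Position t may only look at positions 0..t (causal mask).
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = causal_attention(x, x, x)
print(result[0])  # [1.0, 0.0]: the first position can only attend to itself
```

If the model dimension is split evenly across heads, as is usual, `MODEL_DIM = 512` and `NUM_HEADS = 8` give each head 64-dimensional queries and keys, so the scaling factor would be `sqrt(64) = 8`.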

In [20]:
translator = Seq2SeqTranslatorTransformer(
    dim_model=MODEL_DIM,
    num_heads=NUM_HEADS,
    hidden_dim=HIDDEN_DIM,
    max_len=MAX_LEN,
    enc_num_layers=ENC_NUM_LAYERS,
    dec_num_layers=DEC_NUM_LAYERS,
    src_vocab_size=src_vocab_len,
    tgt_vocab_size=tgt_vocab_len,
    dropout=DROPOUT
)

x, y, label = next(iter(train_dataloader))

summary(translator, input_data=[x, y])

Layer (type:depth-idx)                        Output Shape              Param #
Seq2SeqTranslatorTransformer                  [96, 30, 30311]           --
├─Embedding: 1-1                              [96, 30, 512]             9,500,160
├─PositionalEncoding: 1-2                     [96, 30, 512]             --
│    └─Dropout: 2-1                           [96, 30, 512]             --
├─Encoder: 1-3                                [96, 30, 512]             --
│    └─ModuleList: 2-2                        --                        --
│    │    └─EncoderBlock: 3-1                 [96, 30, 512]             3,125,736
│    │    └─EncoderBlock: 3-2                 [96, 30, 512]             3,125,736
│    │    └─EncoderBlock: 3-3                 [96, 30, 512]             3,125,736
│    │    └─EncoderBlock: 3-4                 [96, 30, 512]             3,125,736
│    │    └─EncoderBlock: 3-5                 [96, 30, 512]             3,125,736
│    │    └─EncoderBlock: 3-6                 [96, 30
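As a quick sanity check on the summary, the parameter count of an embedding layer is the vocabulary size times the model dimension. The sketch below works backwards from the reported count. Treating the first `Embedding` row as the source-language embedding is an assumption, since the summary does not label it.

```python
MODEL_DIM = 512
embedding_params = 9_500_160  # reported for the first Embedding layer

# An embedding table stores one MODEL_DIM-sized vector per vocabulary entry.
vocab_size, leftover = divmod(embedding_params, MODEL_DIM)
print(vocab_size, leftover)  # 18555 0
```

The last dimension of the model output, 30311, is the size of the target vocabulary.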

### Definition of train and test step functions

In [21]:
def train_epoch(model, optimizer, dataloader):
    total_loss = torch.zeros(len(dataloader))
    
    model.train()
    for i, (enc_input, dec_input, label) in enumerate(tqdm(dataloader)):
        enc_input, dec_input, label = enc_input.to(DEVICE), dec_input.to(DEVICE), label.to(DEVICE)
        
        # compute forward pass
        logits = model(enc_input, dec_input)
        
        logits = logits.view(-1, logits.size(-1))
        label = label.contiguous().view(-1)

        # compute loss, gradients, and update params
        loss = F.cross_entropy(
            logits, label, ignore_index=tgt_word2idx.get(PAD_TOK))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        
        # update metrics
        total_loss[i] = loss.item()
        
          
    # Compute avg
    avg_loss = total_loss.mean()
    
    return avg_loss
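The `ignore_index` argument is what keeps padding out of the loss. The plain-Python sketch below shows the idea: positions labelled with the padding id are skipped, and the mean is taken over the remaining positions only. `masked_cross_entropy` is a hypothetical helper written for this illustration, with made-up logits.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding positions contribute nothing
        # Negative log-softmax of the true class, computed stably.
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

PAD = 0
# Two real tokens followed by one padding position, over a 3-word vocabulary.
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
labels = [1, 2, PAD]

loss = masked_cross_entropy(logits, labels, ignore_index=PAD)
print(round(loss, 4))
```

The third row would score very well on class 0, but it is excluded entirely, so padding cannot make the reported loss look better than it is.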

In [22]:
@torch.no_grad()  # no gradients are needed during validation; this saves memory and time
def validate_epoch(model, dataloader):
    total_loss = torch.zeros(len(dataloader))
    
    model.eval()
    for i, (enc_input, dec_input, label) in enumerate(dataloader):
        enc_input, dec_input, label = enc_input.to(DEVICE), dec_input.to(DEVICE), label.to(DEVICE)
                
        # compute forward pass
        logits = model(enc_input, dec_input)

        logits = logits.view(-1, logits.size(-1))
        label = label.contiguous().view(-1)
       
        # compute loss
        loss = F.cross_entropy(logits, label, ignore_index=tgt_word2idx.get(PAD_TOK))
        
        # update metrics
        total_loss[i] = loss.item()

    # Compute avg
    avg_loss = total_loss.mean()

    return avg_loss

In [23]:
def train(model: nn.Module, train_dataloader, valid_dataloader, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    
    for i in range(epochs):
        print(f"Epoch={i+1}")
        train_avg_loss = train_epoch(model, optimizer, train_dataloader)
        # the second dataloader is used for validation after every epoch
        valid_avg_loss = validate_epoch(model, valid_dataloader)
        print(f"Train Loss={train_avg_loss:>7f} \t Valid Loss={valid_avg_loss:>7f}")
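The loop above always runs for a fixed number of epochs. One common extension, not part of the provided code, is to stop once the validation loss stops improving. A minimal sketch of such a check follows; `should_stop`, `patience`, and `min_delta` are hypothetical names, and the loss history is made up.

```python
def should_stop(valid_losses, patience=2, min_delta=0.01):
    """Return True when the best validation loss has not improved by at
    least min_delta within the last `patience` epochs."""
    if len(valid_losses) <= patience:
        return False
    best_before = min(valid_losses[:-patience])
    best_recent = min(valid_losses[-patience:])
    return best_before - best_recent < min_delta

history = [3.0, 2.5, 2.2, 2.195, 2.199]  # made-up validation losses
print(should_stop(history[:3]))  # False: still improving
print(should_stop(history))      # True: the last two epochs gained less than 0.01
```

Inside a training loop, the check would be called after each validation pass with the losses collected so far.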

In [24]:
translator = Seq2SeqTranslatorTransformer(
    dim_model=MODEL_DIM,
    num_heads=NUM_HEADS,
    hidden_dim=HIDDEN_DIM,
    max_len=MAX_LEN,
    enc_num_layers=ENC_NUM_LAYERS,
    dec_num_layers=DEC_NUM_LAYERS,
    src_vocab_size=src_vocab_len,
    tgt_vocab_size=tgt_vocab_len,
    dropout=DROPOUT
).to(DEVICE)

In [25]:
train(translator, train_dataloader, valid_dataloader, epochs=EPOCHS)

timestamp = int(time.time())

path = f"./transformer_{timestamp}.pkl"

torch.save(translator.state_dict(), path)

Epoch=1
100%|██████████| 802/802 [00:39<00:00, 20.16it/s]
Train Loss=6.354481 	 Valid Loss=5.925820
Epoch=2
100%|██████████| 802/802 [00:39<00:00, 20.31it/s]
Train Loss=5.534527 	 Valid Loss=5.197404
Epoch=3
100%|██████████| 802/802 [00:39<00:00, 20.37it/s]
Train Loss=4.942280 	 Valid Loss=4.773385
Epoch=4
100%|██████████| 802/802 [00:39<00:00, 20.47it/s]
Train Loss=4.499310 	 Valid Loss=4.363037
Epoch=5
100%|██████████| 802/802 [00:39<00:00, 20.30it/s]
Train Loss=4.118775 	 Valid Loss=4.063642
Epoch=6
100%|██████████| 802/802 [00:39<00:00, 20.30it/s]
Train Loss=3.789316 	 Valid Loss=3.805299
Epoch=7
100%|██████████| 802/802 [00:39<00:00, 20.29it/s]
Train Loss=3.498801 	 Valid Loss=3.567425
Epoch=8
100%|██████████| 802/802 [00:39<00:00, 20.34it/s]
Train Loss=3.248730 	 Valid Loss=3.386536
Epoch=9
100%|██████████| 802/802 [00:39<00:00, 20.34it/s]
Train Loss=3.018673 	 Valid Loss=3.231917
Epoch=10
100%|██████████| 802/802 [00:39<00:00, 20.30it/s]
Train Loss=2.809431 	 Valid Loss=3.109604
Epoch=11
100%|██████████| 802/802 [00:39<00:00, 20.37it/s]
Train Loss=2.618526 	 Valid Loss=2.985568
Epoch=12
100%|██████████| 802/802 [00:39<00:00, 20.24it/s]
Train Loss=2.439417 	 Valid Loss=2.891711
Epoch=13
100%|██████████| 802/802 [00:39<00:00, 20.27it/s]
Train Loss=2.276958 	 Valid Loss=2.827249
Epoch=14
100%|██████████| 802/802 [00:39<00:00, 20.27it/s]
Train Loss=2.124901 	 Valid Loss=2.737869
Epoch=15
100%|██████████| 802/802 [00:39<00:00, 20.30it/s]
Train Loss=1.984834 	 Valid Loss=2.679720
Epoch=16
100%|██████████| 802/802 [00:39<00:00, 20.37it/s]
Train Loss=1.852399 	 Valid Loss=2.627661
Epoch=17
100%|██████████| 802/802 [00:39<00:00, 20.35it/s]
Train Loss=1.726209 	 Valid Loss=2.591347
Epoch=18
100%|██████████| 802/802 [00:39<00:00, 20.20it/s]
Train Loss=1.607944 	 Valid Loss=2.561717
Epoch=19
100%|██████████| 802/802 [00:39<00:00, 20.36it/s]
Train Loss=1.497664 	 Valid Loss=2.537451
Epoch=20
100%|██████████| 802/802 [00:39<00:00, 20.39it/s]
Train Loss=1.394727 	 Valid Loss=2.536337
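Cross-entropy values are easier to interpret as perplexity, which is the exponential of the mean loss (PyTorch's cross-entropy uses natural logarithms). The sketch below converts the final logged losses; the variable names are chosen for this illustration.

```python
import math

final_train_loss = 1.394727
final_valid_loss = 2.536337
previous_valid_loss = 2.537451  # epoch 19

# Perplexity is the exponential of the mean cross-entropy (in nats).
train_ppl = math.exp(final_train_loss)
valid_ppl = math.exp(final_valid_loss)
print(f"train perplexity: {train_ppl:.2f}")
print(f"valid perplexity: {valid_ppl:.2f}")

# Validation improvement over the last epoch.
last_gain = previous_valid_loss - final_valid_loss
print(f"last-epoch gain: {last_gain:.6f}")
```

A validation perplexity of roughly 12.6 against a training perplexity of about 4, together with a last-epoch gain of about 0.001, suggests the model has started to overfit and that further epochs at this learning rate are unlikely to help much on held-out data.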


### Evaluate sentences using the trained model

In [26]:
def indices_to_sentence(indices, idx2word: dict):
    return " ".join([idx2word.get(idx) for idx in indices if idx2word.get(idx) != PAD_TOK])
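`indices_to_sentence` filters out padding only, so a `[bos]` marker remains in the decoded text. If cleaner output is wanted, a variant along the following lines could also drop the other markers. This is an optional sketch with a hypothetical function name and a toy mapping, not a change to the provided code.

```python
def indices_to_clean_sentence(indices, idx2word,
                              drop=frozenset({"[pad]", "[bos]", "[eos]"})):
    # Unknown ids map to "[unk]" so that join never receives None.
    words = [idx2word.get(int(idx), "[unk]") for idx in indices]
    return " ".join(word for word in words if word not in drop)

toy_idx2word = {0: "[pad]", 1: "[unk]", 2: "[bos]", 3: "[eos]", 4: "hola", 5: "mundo"}
clean = indices_to_clean_sentence([2, 4, 5, 3, 0, 0], toy_idx2word)
print(clean)  # hola mundo
```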

In [27]:
def translate_sentence(model, src, max_seq_len=MAX_LEN):
    src_idx = src_transforms([src])
    
    tgt_idx = torch.zeros(1, 1, dtype=torch.long)
    tgt_idx[0, 0] = tgt_word2idx.get(BOS_TOK)
    
    with torch.no_grad():
        for _ in range(max_seq_len):
            # crop the last max seq len indices
            tgt_idx = tgt_idx[:, -max_seq_len:]

            # evaluate the model
            logits = model(src_idx.to(DEVICE), tgt_idx.to(DEVICE))            

            # focus only on the last time step
            logits = logits[:, -1, :]
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)
            
            # get the next token                        
            _, next_token = torch.max(probs, dim=-1, keepdim=True)

            if next_token.cpu().item() == tgt_word2idx.get(EOS_TOK):
                break

            tgt_idx = torch.cat([tgt_idx, next_token.cpu()], dim=1)

    return tgt_idx
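`translate_sentence` performs greedy decoding: starting from `[bos]`, it repeatedly feeds the tokens generated so far back into the model and appends the most likely next token until `[eos]` appears or the length limit is reached. The sketch below isolates that loop, with a toy scoring function standing in for the model; `toy_logits`, `argmax`, and `greedy_decode` are hypothetical names.

```python
BOS, EOS = 2, 3

def toy_logits(prefix):
    """Stand-in for the model: scores over a 6-token vocabulary.
    It prefers token 4, then token 5, then EOS, based only on prefix length."""
    preferred = {1: 4, 2: 5}.get(len(prefix), EOS)
    return [5.0 if token == preferred else 0.0 for token in range(6)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def greedy_decode(next_logits, max_len=30):
    prefix = [BOS]
    for _ in range(max_len):
        # Pick the single highest-scoring token at each step.
        best = argmax(next_logits(prefix))
        if best == EOS:
            break
        prefix.append(best)
    return prefix

decoded = greedy_decode(toy_logits)
print(decoded)  # [2, 4, 5]
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits selects the same token as taking it over the probabilities.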

In [28]:
def evaluate_sentences(model, dataset, max_seq_len=MAX_LEN):
    model.eval()

    for i, (src, tgt) in enumerate(dataset):
        # stop before translating an extra sentence that would never be yielded
        if i >= 10: break

        sequence = translate_sentence(model, src, max_seq_len)
        translation = indices_to_sentence(sequence[0].numpy(), tgt_idx2word)
        
        yield src, tgt, translation
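Reading translations side by side is informative, but a rough numeric score makes comparisons easier. The sketch below computes a token-overlap F1 between a reference and a model output. It is a simple stand-in rather than BLEU, `unigram_f1` is a hypothetical helper, and the reference sentence has been lowercased and stripped of accents by hand to match the model's vocabulary.

```python
from collections import Counter

def unigram_f1(reference, hypothesis):
    """Token-overlap F1 between two whitespace-tokenised sentences."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "se lo dire todo a mi padre y a mi madre"
hypothesis = "se lo contare todo a mi padre y a mi madre"
score = unigram_f1(reference, hypothesis)
print(round(score, 3))  # 0.909
```

A score like this ignores word order and synonyms, so it should be read as a coarse signal alongside manual inspection.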

In [29]:
for sentence, correct, translation in evaluate_sentences(translator, test_dataset):
    print(f"sentence: {sentence}\ncorrect: {correct}\ntranslation: {translation}", end="\n\n\n")

sentence: Ask for it.
correct: Pedidlo.
translation: [bos] preguntale por ello


sentence: I knew there was nothing you could do about it.
correct: Yo sabía que no había nada que pudieras hacer al respecto.
translation: [bos] supe que no habia nada que pudieras hacer por ello


sentence: I wish I could be with you.
correct: Ojalá pudiera estar con vosotras.
translation: [bos] me gustaria poder estar contigo


sentence: Who's the girl with you?
correct: ¿Quién es la chica que está contigo?
translation: [bos] quien es la chica con usted


sentence: You can count on me to be there by 10:00.
correct: Estaré allí sobre las 10:00.
translation: [bos] puede contar conmigo


sentence: She'll know.
correct: Se va a saber.
translation: [bos] ella lo sabra


sentence: I'll tell everything to my father and mother.
correct: Se lo diré todo a mi padre y a mi madre.
translation: [bos] se lo contare todo a mi padre y a mi madre


sentence: I have a wonderful brother-in-law and sister-in-law. But why do