## TC 5033
## Deep Learning
## Transformers

#### Activity 4: Implementing a Translator

- Objective

To understand the Transformer architecture by implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. The code already implements a transformer from scratch, as explained in one of the [week 9 videos](https://youtu.be/XefFj4rLHgU).

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully and document it using pictures, figures, and Markdown cells. You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Training a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in Canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.

In [1]:
# %pip install torchtext==0.17.1
# %pip install numpy==1.26.4
# %pip install pandas==2.2.3
# %pip install -q tatoebatools
# %pip install -U -q spacy
# %pip install mlflow
# !python -m spacy download en_core_web_sm
# !python -m spacy download es_core_news_sm

Import required libraries

In [2]:
import time
import torch
import torch.hub
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tdata
import torch.utils.data.dataset as tdataset

import mlflow
import numpy as np
import pandas as pd

import torchtext
# torchtext.disable_torchtext_deprecation_warning()

import torchtext.transforms as T

from torchtext.vocab import vocab
from torchtext.data.utils import get_tokenizer

from collections import Counter
from tatoebatools import tatoeba
from torchinfo import summary
from tqdm import tqdm

In [3]:
if torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    if device.type == 'cuda':
        # Allow TensorFloat32 for matmul; keep it disabled for cuDNN convolutions
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = False
        torch.set_float32_matmul_precision("medium")

print(f"Available device: {device.type}")

Available device: cuda


Get the dataset using tatoebatools

In [4]:
tatoeba.dir = './translations'

spa_sentences_df = tatoeba.get('sentences_detailed', ['spa'])
eng_sentences_df = tatoeba.get('sentences_detailed', ['eng'])
spa_eng_links_df = tatoeba.get('links', language_codes=['spa', 'eng'])

user_languages_df = tatoeba.get('user_languages', language_codes='*')

Pair Spanish and English sentences correctly into a new dataframe

In [5]:
spa_links_df = pd.merge(spa_sentences_df, spa_eng_links_df, how='left', left_on='sentence_id', right_on='sentence_id')
spa_eng_translations_df = pd.merge(spa_links_df, eng_sentences_df, how='left', left_on='translation_id', right_on='sentence_id')
translations_with_skill_df = pd.merge(spa_eng_translations_df, user_languages_df, how='left', left_on='username_y', right_on='username')

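# Keep rows with skill_level >= 2 and a non-missing username.
# NaN never compares equal to anything, so missing usernames are detected with notna().
mask = (translations_with_skill_df.skill_level >= 2.0) & translations_with_skill_df.username.notna()

translations_df = translations_with_skill_df.where(mask).dropna()[['sentence_id_x', 'lang_x', 'text_x', 'sentence_id_y', 'lang_y', 'text_y']].reset_index(drop=True)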
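
A note on missing values in the filter above: NaN follows IEEE-754 rules, so it never compares equal to anything, including itself. The sketch below uses only the standard library to show the effect; `is_missing` is a hypothetical helper standing in for a vectorized check such as pandas `Series.notna()`.

```python
import math

nan = math.nan  # the same IEEE-754 NaN value that np.nan represents

# "!=" against NaN is True for every value, so it cannot filter anything out.
always_true = [value != nan for value in ("alice", nan, None)]

def is_missing(value):
    """Return True for None or a float NaN (hypothetical stand-in for a pandas missing-value check)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

usernames = ["alice", nan, "bob", None]
kept = [name for name in usernames if not is_missing(name)]

print(always_true)
print(kept)
```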

In [6]:
translations_df.head(10)

| | sentence_id_x | lang_x | text_x | sentence_id_y | lang_y | text_y |
|---|---|---|---|---|---|---|
| 0 | 2499.0 | spa | Todo el mundo debe aprender por sí mismo al fi... | 328580.0 | eng | Everyone must learn on their own in the end. |
| 1 | 2503.0 | spa | Eso no va a cambiar nada. | 1872366.0 | eng | That doesn't change anything. |
| 2 | 2503.0 | spa | Eso no va a cambiar nada. | 6488765.0 | eng | That'll change nothing. |
| 3 | 119877.0 | spa | Han utilizado las matemáticas para calcular có... | 398983.0 | eng | Math has been used to calculate how the Univer... |
| 4 | 119881.0 | spa | Los estudiantes reciben una beca de 15.000 eur... | 398984.0 | eng | The students receive a 15,000 euro scholarship... |
| 5 | 126566.0 | spa | No sé dónde vive. | 473587.0 | eng | I don't know where he lives. |
| 6 | 330031.0 | spa | El médico me dijo que dejara el tabaco. | 1499361.0 | eng | My doctor told me to quit smoking. |
| 7 | 330031.0 | spa | El médico me dijo que dejara el tabaco. | 1499363.0 | eng | My doctor told me to give up smoking. |
| 8 | 330082.0 | spa | Él me mintió. | 297564.0 | eng | He lied to me. |
| 9 | 330691.0 | spa | Hay algo de verdad en lo que dice. | 8864106.0 | eng | There's a bit of truth in what he's saying. |


Creation of Dataset

In [7]:
MAX_SEQ_LEN = 20
BATCH_SIZE = 96
EMBEDDING_DIM = 256
NUM_HEADS = 4
D_FF = 2048
NUM_LAYERS = 6
DROPOUT=0.2

In [8]:
class SpaEngTranslationDataset(tdata.Dataset):
    
    def __init__(self, translations_df):
        self.translations_df = translations_df
        
    def __len__(self):
        return len(self.translations_df)

    def __getitem__(self, index):
        row = self.translations_df.iloc[index]
        return row.text_x, row.text_y

In [9]:
translations_dataset = SpaEngTranslationDataset(translations_df)

Create vocabularies for both languages, English and Spanish

In [10]:
tokenizer_en_fn = get_tokenizer('spacy', language='en_core_web_sm')
tokenizer_es_fn = get_tokenizer('spacy', language='es_core_news_sm')

specials = ['<unk>', '<pad>', '<sos>', '<eos>']

def build_vocab(dataset):
    es, en = Counter(), Counter()
    for spa, eng in dataset:
        es_words = tokenizer_es_fn(spa)
        en_words = tokenizer_en_fn(eng)
        es.update(es_words)
        en.update(en_words)
    
    return vocab(es, specials=specials), vocab(en, specials=specials)


es_vocab, en_vocab = build_vocab(translations_dataset)

es_vocab.set_default_index(es_vocab['<unk>'])
en_vocab.set_default_index(en_vocab['<unk>'])
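
To make the vocabulary step concrete, here is a minimal dependency-free sketch of the same idea: count tokens, reserve the first ids for the special tokens, and fall back to `<unk>` for unseen words. The names `build_token_index` and `encode` are hypothetical, and a whitespace split stands in for the spaCy tokenizers, so this illustrates the mechanism rather than reproducing what `torchtext` does internally.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_token_index(sentences, specials=SPECIALS):
    """Map each token to an integer id, with the special tokens first."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: lowercase whitespace split instead of a real tokenizer.
        counts.update(sentence.lower().split())
    token_to_id = {tok: i for i, tok in enumerate(specials)}
    for token, _ in counts.most_common():
        token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def encode(sentence, token_to_id):
    """Convert a sentence to ids, using the <unk> id for unknown tokens."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(tok, unk) for tok in sentence.lower().split()]

toy_vocab = build_token_index(["el gato duerme", "el perro come"])
ids = encode("el gato come pizza", toy_vocab)
print(ids)
```

The word "pizza" was never seen while building the toy vocabulary, so it maps to the `<unk>` id, which is the role `set_default_index` plays above.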

In [11]:
SOURCE_VOCAB_SIZE = len(es_vocab)
TARGET_VOCAB_SIZE = len(en_vocab)

SRC_PAD_IDX = es_vocab['<pad>']
SRC_SOS_IDX = es_vocab['<sos>']
SRC_EOS_IDX = es_vocab['<eos>']

TRG_PAD_IDX = en_vocab['<pad>']
TRG_SOS_IDX = en_vocab['<sos>']
TRG_EOS_IDX = en_vocab['<eos>']

In [12]:
class Tokenizer(nn.Module):
    
    def __init__(self, tokenizer_fn):
        super(Tokenizer, self).__init__()
        self.tokenizer_fn = tokenizer_fn
    
    def forward(self, batch):
        return [ self.tokenizer_fn(line) for line in batch ]

In [13]:
text_es_transforms = T.Sequential(
        Tokenizer(tokenizer_fn=tokenizer_es_fn),
        T.Truncate(max_seq_len=MAX_SEQ_LEN),
        T.VocabTransform(vocab=es_vocab),
        T.ToTensor(padding_value=SRC_PAD_IDX, dtype=torch.long),
        T.PadTransform(max_length=MAX_SEQ_LEN, pad_value=SRC_PAD_IDX),
    )

text_en_transforms = T.Sequential(
        Tokenizer(tokenizer_fn=tokenizer_en_fn),
        T.Truncate(max_seq_len=MAX_SEQ_LEN),
        T.VocabTransform(vocab=en_vocab),
        T.ToTensor(padding_value=TRG_PAD_IDX, dtype=torch.long),
        T.PadTransform(max_length=MAX_SEQ_LEN, pad_value=TRG_PAD_IDX),
    )
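
The two pipelines above tokenize, truncate to `MAX_SEQ_LEN`, map tokens to ids, and pad to a fixed length. The sketch below mirrors the truncate-and-pad part on plain Python lists. It also shows a hypothetical variant, `wrap_with_markers`, that reserves room for `<sos>` and `<eos>`; the pipelines in this notebook do not add those markers, so treat that function as an illustration only. The ids assume the order of the `specials` list.

```python
PAD, SOS, EOS = 1, 2, 3  # assumed ids, following the order of `specials`

def truncate_and_pad(token_ids, max_len, pad_id=PAD):
    """Clip a sequence to max_len and right-pad it with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

def wrap_with_markers(token_ids, max_len, sos_id=SOS, eos_id=EOS, pad_id=PAD):
    """Hypothetical variant: keep two slots for <sos> and <eos>, then pad."""
    clipped = token_ids[: max_len - 2]
    return truncate_and_pad([sos_id] + clipped + [eos_id], max_len, pad_id)

short = truncate_and_pad([10, 11, 12], max_len=6)
long = truncate_and_pad(list(range(10, 20)), max_len=6)
wrapped = wrap_with_markers([10, 11, 12], max_len=6)
print(short, long, wrapped)
```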

In [14]:
def collate_batch_fn(batch):
    spa, eng = list(zip(*batch))
    return text_es_transforms(spa), text_en_transforms(eng)

Creation of Dataloaders

In [15]:
generator = torch.Generator().manual_seed(42)
train_dataset, valid_dataset, test_dataset = tdata.random_split(translations_dataset, lengths=[0.85, 0.10, 0.05], generator=generator)

In [16]:
# Reshuffle the training batches every epoch; validation and test order stays fixed
train_dataloader = tdata.DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn, shuffle=True)
valid_dataloader = tdata.DataLoader(valid_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn, shuffle=False)
test_dataloader = tdata.DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch_fn, shuffle=False)

In [17]:
for spa, eng in train_dataloader:
    print(spa.shape)
    print(eng.shape)
    break

torch.Size([96, 20])
torch.Size([96, 20])


### Implementation of Transformer Model

In [18]:
%load_ext autoreload
%autoreload 2

In [19]:
from transformer import PositionalEncoding, HeadAttention, MultiHeadAttention, EncoderLayer, Encoder, DecoderLayer, Decoder, Transformer

torch.manual_seed(42)

<torch._C.Generator at 0x7f7c8dc75a70>

#### Positional Encoding

In [20]:
pe = PositionalEncoding(32, 8, vocab_size=64, padding_idx=SRC_PAD_IDX).to(device)

print("encoding table", pe.encoding_table.shape)

tensor = torch.randint(0, 64, (4, 8)).to(device) # (batch_size, length or time_dim)

result = pe(tensor)

tensor, result.shape

encoding table torch.Size([8, 32])


(tensor([[22, 28, 17, 30, 29, 51, 38, 34],
         [17, 58, 41, 38, 16, 13, 30, 23],
         [34, 43, 59, 44, 33,  2, 36, 42],
         [39, 25, 54, 22, 43, 38, 14, 55]], device='cuda:0'),
 torch.Size([4, 8, 32]))
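
The `transformer` module is not shown in this notebook, so the exact contents of `pe.encoding_table` are unknown. As a reference point, this is a minimal sketch of the sinusoidal table from the original Transformer paper, built with the same shape (8 positions, 32 dimensions). The function name `sinusoidal_table` is hypothetical, and the provided `PositionalEncoding` may be implemented differently.

```python
import math

def sinusoidal_table(max_len, d_model):
    """Return a max_len x d_model table: sine on even columns, cosine on odd columns."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even column index, so i / d_model equals 2k / d_model in the paper
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table

table = sinusoidal_table(max_len=8, d_model=32)
print(len(table), len(table[0]))
```

Each row is added to the token embedding at that position, which is how the model can tell the first token from the fifth even though attention itself is order-agnostic.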

In [21]:
head = HeadAttention(32, 2, 8, dropout=0.2, mask=False).to(device)

result1 = head(result, result, result)

result1.shape

torch.Size([4, 8, 32])

In [22]:
head = HeadAttention(32, 2, 8, dropout=0.2, mask=True).to(device)

result2 = head(result, result, result)

result2.shape

torch.Size([4, 8, 32])
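
The two cells above differ only in the `mask` flag. Assuming `mask=True` applies a causal mask (the usual meaning in a decoder, though `HeadAttention` itself is not shown here), the sketch below computes scaled dot-product attention weights on plain lists, with and without hiding future positions. `attention_weights` is a hypothetical name.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal=False):
    """softmax(Q K^T / sqrt(d_k)), optionally hiding positions after the query."""
    d_k = len(keys[0])
    weights = []
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Future positions get -inf, which becomes weight 0 after softmax
            scores = [s if j <= t else float("-inf") for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = attention_weights(x, x)
masked = attention_weights(x, x, causal=True)
print(masked[0])
```

With the causal mask, the first position can only attend to itself, and every row still sums to 1.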

In [23]:
mha = MultiHeadAttention(32, 2, 8, dropout=0.2, mask=False).to(device)

result3 = mha(result, result, result)

result3.shape

torch.Size([4, 8, 32])

In [24]:
mha = MultiHeadAttention(32, 2, 8, dropout=0.2, mask=True).to(device)

result4 = mha(result, result, result)

result4.shape

torch.Size([4, 8, 32])

In [25]:
enc = EncoderLayer(32, 2, 8, 128, dropout=0.2).to(device)

result5 = enc(result)

result5.shape

torch.Size([4, 8, 32])

In [26]:
encoder = Encoder(32, 2, 8, 128, 64, padding_idx=SRC_PAD_IDX, num_layers=2, dropout=0.2).to(device)

result6 = encoder(tensor)

result6.shape

torch.Size([4, 8, 32])

In [27]:
dec = DecoderLayer(32, 2, 8, 128, dropout=0.2).to(device)

result7 = dec(result, result6)

result7.shape

torch.Size([4, 8, 32])

In [28]:
decoder = Decoder(32, 2, 8, 128, 64, padding_idx=TRG_PAD_IDX, num_layers=2, dropout=0.2).to(device)

result8 = decoder(tensor, result6)

result8.shape

torch.Size([4, 8, 64])

In [29]:
model_transformer = Transformer(
        d_model=EMBEDDING_DIM,
        num_heads=NUM_HEADS,
        max_seq_len=MAX_SEQ_LEN,
        d_ff=D_FF,
        enc_vocab_size=SOURCE_VOCAB_SIZE,
        dec_vocab_size=TARGET_VOCAB_SIZE,
        src_pad_idx=SRC_PAD_IDX,
        trg_pad_idx=TRG_PAD_IDX,
        num_layers=NUM_LAYERS,
        dropout=DROPOUT
    )

source, target = next(iter(train_dataloader))

summary(model_transformer, input_data=[source, target])

Layer (type:depth-idx)                                  Output Shape              Param #
Transformer                                             [96, 20, 21016]           --
├─Encoder: 1-1                                          [96, 20, 384]             --
│    └─PositionalEncoding: 2-1                          [96, 20, 384]             --
│    │    └─Embedding: 3-1                              [96, 20, 384]             13,873,152
│    └─Sequential: 2-2                                  [96, 20, 384]             --
│    │    └─EncoderLayer: 3-2                           [96, 20, 384]             6,000,896
│    │    └─EncoderLayer: 3-3                           [96, 20, 384]             6,000,896
│    │    └─EncoderLayer: 3-4                           [96, 20, 384]             6,000,896
│    │    └─EncoderLayer: 3-5                           [96, 20, 384]             6,000,896
│    │    └─EncoderLayer: 3-6                           [96, 20, 384]             6,000,896
│    │    └─Encod

### Definition of train and test step functions

In [30]:
def train_step(model, optimizer, dataloader):
    total_loss = torch.zeros(len(dataloader))
    
    model.train()
    for i, (spa_batch, eng_batch) in enumerate(tqdm(dataloader)):
        spa_batch, eng_batch = spa_batch.to(device), eng_batch.to(device)
        
        # compute forward pass
        logits = model(spa_batch, eng_batch)

        # reshape before passing to loss fn
        B, L, C = logits.shape
        logits = logits.view(B * L, C)
        target = eng_batch.view(B * L)

        # compute loss, gradients, and update params
        optimizer.zero_grad(set_to_none=True)
        loss = F.cross_entropy(logits, target)
        loss.backward()
        optimizer.step()
        
        # update metrics
        total_loss[i] = loss.item()
    
    # Compute avg
    avg_loss = total_loss.mean()
    
    return avg_loss
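
In `train_step`, the full `eng_batch` is passed to the model and the same `eng_batch` is used as the label. A common teacher-forcing setup instead shifts the target by one position, so the decoder sees tokens up to step t and is asked to predict the token at step t+1. The sketch below shows that split on a plain list; `shift_for_teacher_forcing` is a hypothetical helper, the ids assume the order of `specials`, and whether the shift is needed here depends on how the provided `Decoder` handles it internally, which this notebook does not show.

```python
PAD, SOS, EOS = 1, 2, 3  # assumed ids, following the order of `specials`

def shift_for_teacher_forcing(target_ids):
    """Split one target sequence into decoder input and prediction labels."""
    decoder_input = target_ids[:-1]  # what the decoder sees
    labels = target_ids[1:]          # what it should predict at each step
    return decoder_input, labels

target = [SOS, 40, 41, 42, EOS, PAD]
decoder_input, labels = shift_for_teacher_forcing(target)
print(decoder_input)
print(labels)
```

On the batch tensors used above, the same split would be `eng_batch[:, :-1]` for the decoder input and `eng_batch[:, 1:]` for the labels.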

In [31]:
@torch.no_grad()  # validation needs no gradients, which saves memory and time
def validate_step(model, dataloader):
    total_loss = torch.zeros(len(dataloader))
    
    model.eval()
    for i, (spa_batch, eng_batch) in enumerate(dataloader):
        spa_batch, eng_batch = spa_batch.to(device), eng_batch.to(device)
        
        # compute forward pass
        logits = model(spa_batch, eng_batch)

        # reshape before passing to loss fn
        B, L, C = logits.shape
        logits = logits.view(B * L, C)
        target = eng_batch.view(B * L)
       
        # compute loss
        loss = F.cross_entropy(logits, target)
        
        # update metrics
        total_loss[i] = loss.item()

    # Compute avg
    avg_loss = total_loss.mean()

    return avg_loss
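
Both step functions average the cross-entropy over all `B * L` positions, padding included. Because sentences are padded to `MAX_SEQ_LEN`, many of those positions are `<pad>`, and they can dominate the average. A minimal sketch of a loss that skips padded positions is shown below on plain lists; `masked_cross_entropy` is a hypothetical name. With `F.cross_entropy` the equivalent is its `ignore_index` argument, for example `ignore_index=TRG_PAD_IDX`.

```python
import math

def masked_cross_entropy(logits, targets, pad_id):
    """Mean negative log-likelihood over positions whose target is not pad_id."""
    losses = []
    for row, target in zip(logits, targets):
        if target == pad_id:
            continue  # padded position: contributes nothing to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [0, 2, 1]  # the last position is padding when pad_id=1
loss = masked_cross_entropy(logits, targets, pad_id=1)
print(round(loss, 4))
```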

In [32]:
def train(train_dataloader, test_dataloader, epochs=100):
    
    mlflow.set_tracking_uri(uri="http://127.0.0.1:5000")
    mlflow.set_experiment("transformer_model")
    mlflow.autolog()
    
    with mlflow.start_run():

        model = Transformer(
            d_model=EMBEDDING_DIM,
            num_heads=NUM_HEADS,
            max_seq_len=MAX_SEQ_LEN,
            d_ff=D_FF,
            enc_vocab_size=SOURCE_VOCAB_SIZE,
            dec_vocab_size=TARGET_VOCAB_SIZE,
            src_pad_idx=SRC_PAD_IDX,
            trg_pad_idx=TRG_PAD_IDX,
            num_layers=NUM_LAYERS,
            dropout=DROPOUT
        ).to(device)


        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for i in range(epochs):
            print(f"Epoch={i+1}")
            
            train_avg_loss = train_step(model, optimizer, train_dataloader)
            valid_avg_loss = validate_step(model, test_dataloader)

            mlflow.log_metric("train_avg_loss", train_avg_loss, step=i)
            mlflow.log_metric("valid_avg_loss", valid_avg_loss, step=i)
            
            print(f"Train Loss={train_avg_loss:>7f} \t Valid Loss={valid_avg_loss:>7f}")

    return model

### Definition of Model Parameters and Training

In [33]:
model = train(train_dataloader, valid_dataloader, epochs=10)

Epoch=1


100%|██████████| 2043/2043 [04:31<00:00,  7.52it/s]


Train Loss=0.731358 	 Valid Loss=0.198389
Epoch=2


100%|██████████| 2043/2043 [04:34<00:00,  7.45it/s]


Train Loss=0.136884 	 Valid Loss=0.072912
Epoch=3


100%|██████████| 2043/2043 [04:34<00:00,  7.45it/s]


Train Loss=0.049901 	 Valid Loss=0.034172
Epoch=4


100%|██████████| 2043/2043 [04:35<00:00,  7.42it/s]


Train Loss=0.020418 	 Valid Loss=0.020666
Epoch=5


100%|██████████| 2043/2043 [04:35<00:00,  7.43it/s]


Train Loss=0.009577 	 Valid Loss=0.015923
Epoch=6


100%|██████████| 2043/2043 [04:34<00:00,  7.45it/s]


Train Loss=0.004672 	 Valid Loss=0.013690
Epoch=7


100%|██████████| 2043/2043 [04:34<00:00,  7.44it/s]


Train Loss=0.002086 	 Valid Loss=0.012674
Epoch=8


100%|██████████| 2043/2043 [04:34<00:00,  7.44it/s]


Train Loss=0.000582 	 Valid Loss=0.012489
Epoch=9


100%|██████████| 2043/2043 [04:34<00:00,  7.44it/s]


Train Loss=0.000112 	 Valid Loss=0.012643
Epoch=10


100%|██████████| 2043/2043 [04:33<00:00,  7.47it/s]
2024/11/12 18:59:35 INFO mlflow.tracking._tracking_service.client: 🏃 View run angry-vole-14 at: http://127.0.0.1:5000/#/experiments/720540007848179926/runs/4c8409b3a62a4920b6357392df28c3e7.
2024/11/12 18:59:35 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/720540007848179926.


Train Loss=0.000466 	 Valid Loss=0.012068


In [34]:
timestamp = int(time.time())

path = f"./transformer_{timestamp}.pkl"

torch.save(model.state_dict(), path)

### Evaluate sentences using the trained model

In [84]:
model_transformer2 = Transformer(
        d_model=EMBEDDING_DIM,
        num_heads=NUM_HEADS,
        max_seq_len=MAX_SEQ_LEN,
        d_ff=D_FF,
        enc_vocab_size=SOURCE_VOCAB_SIZE,
        dec_vocab_size=TARGET_VOCAB_SIZE,
        src_pad_idx=SRC_PAD_IDX,
        trg_pad_idx=TRG_PAD_IDX,
        num_layers=NUM_LAYERS,
        dropout=DROPOUT
    ).to(device)



model_transformer2.load_state_dict(torch.load('transformer_1731437975.pkl', weights_only=True))

<All keys matched successfully>

In [85]:
test_sentences = translations_df['text_x'].sample(10).to_list()

test_sentences

['¡Que se borre su nombre!',
 'Me estaba helando.',
 'Aprendí inglés por internet.',
 'Necesito saber más detalles.',
 'Cuanto antes vayas, mejor.',
 '"¿Por qué está Tom enfermo?" "Podría haber comido algo en mal estado".',
 'La artista cedió su obra al dominio público.',
 'Los pueblos de las Primeras Naciones tienen historias interesantes que contar.',
 '¿Me ves?',
 'Gracias por mostrarme cómo se hace.']

In [86]:
def indices_to_sentence(indices, en_vocab):
    # Convert token ids back to words, skipping padding and unknown tokens
    tokens = (en_vocab.lookup_token(int(idx)) for idx in indices)
    return ' '.join(tok for tok in tokens if tok not in ('<pad>', '<unk>'))

In [87]:
def translate_sentence(model, sentence, en_vocab, max_seq_len=MAX_SEQ_LEN):
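    # numericalize and pad the Spanish source sentence: shape (1, MAX_SEQ_LEN)
    src = text_es_transforms([sentence])
    # start the target with a single token of id 0, which is '<unk>' in this vocabulary
    tgt = torch.zeros((1,1), dtype=torch.long).to(device)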

    for _ in range(max_seq_len):
        # crop the last max seq len indices
        tgt_cond = tgt[:, -max_seq_len:]
        # evaluate the model
        logits = model(src.to(device), tgt_cond.to(device))
        # focus only on the last time step
        logits = logits[:, -1, :]
        # apply softmax to get probabilities
        probs = F.softmax(logits, dim=-1)
        
        # sample the next token id from the predicted distribution
        idx_next = torch.multinomial(probs, num_samples=1, replacement=False)
        
        if idx_next.cpu().item() == en_vocab['<eos>']:
            break

        tgt = torch.cat((tgt, idx_next), dim=1)
    return tgt
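
`translate_sentence` draws each next token with `torch.multinomial`, so repeated runs can give different outputs. A deterministic alternative is greedy decoding, which always picks the highest-scoring token. The sketch below shows the loop on plain lists: `greedy_decode` is a hypothetical helper, and `toy_scores` is a scripted stand-in for a model, not the trained transformer. The ids assume the order of `specials`.

```python
SOS, EOS = 2, 3  # assumed ids, following the order of `specials`

def greedy_decode(next_token_scores, max_len, sos_id=SOS, eos_id=EOS):
    """Append the highest-scoring next token until <eos> or max_len steps."""
    output = [sos_id]
    for _ in range(max_len):
        scores = next_token_scores(output)
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == eos_id:
            break
        output.append(best)
    return output[1:]  # drop the <sos> marker

def toy_scores(prefix):
    """Hypothetical stand-in for a model: emits 5, 6, 7 and then <eos>."""
    script = [5, 6, 7, EOS]
    wanted = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if i == wanted else 0.0 for i in range(8)]

decoded = greedy_decode(toy_scores, max_len=20)
print(decoded)
```

In the tensor version, the sampling line would become an argmax over the last dimension, for example `logits.argmax(dim=-1, keepdim=True)`.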

In [93]:
def evaluate_sentences(model, sentences, en_vocab, max_seq_len=MAX_SEQ_LEN):
    model.eval()

    with torch.no_grad():
        for sentence in sentences:
            sequence = translate_sentence(model, sentence, en_vocab, max_seq_len)
            translation = indices_to_sentence(sequence[0], en_vocab)
            yield sentence, translation

In [95]:
for sentence, translation in evaluate_sentences(model_transformer2, test_sentences, en_vocab):
    print(f"{sentence} \n {translation}")

¡Que se borre su nombre! 
 duff duff duff duff duff duff duff duff duff duff duff duff duff duff duff duff duff duff duff duff
Me estaba helando. 
 Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays Holidays
Aprendí inglés por internet. 
 koala koala koala koala koala koala koala koala koala koala koala koala koala koala koala koala koala koala koala koala
Necesito saber más detalles. 
 crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned crowned
Cuanto antes vayas, mejor. 
 trim trim trim trim trim trim trim trim trim trim trim trim trim trim trim trim trim trim trim trim
"¿Por qué está Tom enfermo?" "Podría haber comido algo en mal estado". 
 principles principles principles principles principles principles principles principles principles principles principles princip