# **Reference Notebook**

Name: Matheus Lindino

In [1]:
! pip install sacrebleu transformers sentencepiece torchmetrics -q

## **Instructions**

In this Colab we will train a T5 model to translate from English to Portuguese, using the Paracrawl dataset.

- We will use the Paracrawl English-Portuguese dataset. The training set is truncated to only 100k pairs to speed up training. Anyone who wants to can train on more samples. If training takes too long, truncate the dataset further.

- We will use BLEU as the metric, computed with SacreBLEU, because it always applies the same preprocessing (tokenization, case handling). We will not use torchnlp.metrics.bleu, torchtext.data.metrics.bleu_score, etc. SacreBLEU is slow, so use only a few validation samples (e.g., 5k).
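To see why a fixed preprocessing matters, the sketch below scores the same sentence pair under two tokenizations. It is a simplified, unsmoothed sentence-level BLEU written only for illustration, not SacreBLEU's implementation; `simple_bleu`, `whitespace_tokens`, and `normalized_tokens` are hypothetical helpers, and the sentences are made up.

```python
import math
import re
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on pre-tokenized input (illustration only)."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if matches == 0:
            return 0.0
        log_precision_sum += math.log(matches / sum(hyp.values()))
    # The brevity penalty punishes hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


def whitespace_tokens(text):
    """Naive tokenization: split on whitespace, keep case and attached punctuation."""
    return text.split()


def normalized_tokens(text):
    """Lowercase and separate punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


hypothesis_text = "Eu gosto de pizza."
reference_text = "eu gosto de pizza ."

naive_score = simple_bleu(whitespace_tokens(hypothesis_text), whitespace_tokens(reference_text))
normalized_score = simple_bleu(normalized_tokens(hypothesis_text), normalized_tokens(reference_text))

print(f"whitespace tokens: {naive_score:.2f}")
print(f"normalized tokens: {normalized_score:.2f}")
```

The same translation goes from no credit to a perfect score depending only on preprocessing, which is why scores from different BLEU implementations are hard to compare.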


We will use the PTT5 model available on the HuggingFace model hub:

https://huggingface.co/unicamp-dl/ptt5-small-portuguese-vocab

This is a T5 model pre-trained on Portuguese text, with a Portuguese tokenizer.

It is advisable to save the model weights and the optimizer state, since training takes a long time.


## **Imports and Global Variables**

In [2]:
# Import all packages in one place to avoid duplicate imports throughout the notebook.
import gzip
import os
import random
#import sacrebleu
import torch
import copy
import torch.nn.functional as F
import matplotlib.pyplot as plt

#from google.colab import drive

from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torchmetrics import SacreBLEUScore
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau
from tqdm.notebook import tqdm

from typing import Dict
from typing import List
from typing import Tuple

In [3]:
# Important: Fix seeds so we can replicate results
seed = 123
random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)

In [4]:
model_name = "unicamp-dl/ptt5-base-portuguese-vocab"
task_prefix = 'Translate English to Portuguese: ' 
batch_size = 64
accumulate_grad_batches = 2
source_max_length = 128
target_max_length = 128
learning_rate = 1e-3
evaluate_interval = 500

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Device: {device}')

Device: cuda:0
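As a quick sanity check on these settings, the arithmetic below works out how many batches, optimizer updates, and validation passes one epoch implies. It is only a sketch: `num_train_pairs` is taken from the 990k training split defined in the Dataset section, and the derived variable names are introduced here for illustration.

```python
import math

batch_size = 64
accumulate_grad_batches = 2
evaluate_interval = 500
num_train_pairs = 990_000  # size of the training split used in this notebook

# Number of batches the training DataLoader yields per epoch (last batch may be partial).
batches_per_epoch = math.ceil(num_train_pairs / batch_size)

# Samples contributing to each optimizer update when gradients are accumulated.
effective_batch_size = batch_size * accumulate_grad_batches

# Optimizer updates per epoch, counting the final partial accumulation.
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)

# Validation passes per epoch, one every `evaluate_interval` batches.
evaluations_per_epoch = batches_per_epoch // evaluate_interval

print(batches_per_epoch)          # 15469
print(effective_batch_size)       # 128
print(optimizer_steps_per_epoch)  # 7735
print(evaluations_per_epoch)      # 30
```

The 15469 batches match the progress-bar total shown in the training cell output.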


## **Dataset**

### Download files

In [5]:
! wget -nc https://storage.googleapis.com/unicamp-dl/ia024a_2022s2/paracrawl_enpt_train.tsv.gz
! wget -nc https://storage.googleapis.com/unicamp-dl/ia024a_2022s2/paracrawl_enpt_test.tsv.gz

File ‘paracrawl_enpt_train.tsv.gz’ already there; not retrieving.

File ‘paracrawl_enpt_test.tsv.gz’ already there; not retrieving.



### Split data

 - Train: 990k pairs
 - Valid: 10k pairs
 - Test: 20k pairs

In [6]:
def load_text_pairs(path):
    text_pairs = []
    for line in gzip.open(path, mode='rt'):
        text_pairs.append(line.strip().split('\t'))
    return text_pairs

x_train = load_text_pairs('paracrawl_enpt_train.tsv.gz')
x_test = load_text_pairs('paracrawl_enpt_test.tsv.gz')

random.shuffle(x_train)

x_val = x_train[990000:]
x_train = x_train[:990000]

for set_name, x in [('training', x_train), ('validation', x_val), ('test', x_test)]:
    print(f'\n{len(x)} {set_name} samples')
    print(f'First 3 {set_name} samples:')
    for i, (source, target) in enumerate(x[:3]):
        print(f'{i}: source: {source}\n   target: {target}')


990000 training samples
First 3 training samples:
0: source: More Croatian words and phrases
   target: Mais palavras e frases em croata
1: source: Jerseys and pullovers, containing at least 50Â % by weight of wool and weighing 600Â g or more per article 6110 11 10 (PCE)
   target: Camisolas e pulôveres, com pelo menos 50 %, em peso, de lã e pesando 600g ou mais por unidade 6110 11 10 (PCE)
2: source: Atex Colombia SAS makes available its lead product, 100% natural liquid latex, excellent quality and price. ... Welding manizales caldas Colombia a DuckDuckGo
   target: Atex Colômbia SAS torna principal produto está disponível, látex líquido 100% natural, excelente qualidade e preço. ...

10000 validation samples
First 3 validation samples:
0: source: His mother, Mamie Till Bradley, insisted that the casket be left open at the funeral parlor so people could see her son’s badly disfigured face.
   target: Sua mãe, Mamie Till Bradley, insistiu em que o caixão ser deixada em ab
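The loader above assumes every line holds exactly two tab-separated fields and leaves the file handle open. A more defensive variant could look like the sketch below; `load_text_pairs_checked` is a hypothetical name, UTF-8 encoding is an assumption, and the demo file is made up to exercise the function.

```python
import gzip
import os
import tempfile


def load_text_pairs_checked(path):
    """Read a gzipped TSV and keep only lines with exactly two non-empty fields."""
    text_pairs, skipped = [], 0
    # The context manager closes the file even if reading fails.
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) == 2 and all(field.strip() for field in fields):
                text_pairs.append((fields[0].strip(), fields[1].strip()))
            else:
                skipped += 1
    return text_pairs, skipped


# Tiny hypothetical file: two valid pairs and one line without a translation.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.tsv.gz')
    with gzip.open(demo_path, mode='wt', encoding='utf-8') as f:
        f.write('I like pizza\teu gosto de pizza\n')
        f.write('line without a translation\n')
        f.write('we love pizza\tnós amamos pizza\n')
    demo_pairs, demo_skipped = load_text_pairs_checked(demo_path)

print(demo_pairs)
print(f'skipped lines: {demo_skipped}')
```

Returning the number of skipped lines makes it easy to notice when a file is noisier than expected before the pairs reach the tokenizer.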

### Dataset Class


In [7]:
tokenizer = T5Tokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/756k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456 [00:00<?, ?B/s]

In [8]:
class ParacrawlDataset(Dataset):
    def __init__(self, text_pairs: List[Tuple[str, str]], tokenizer, source_max_length: int = 32, target_max_length: int = 32):
        
        original_source = [task_prefix + sample[0] for sample in text_pairs]
        original_target = [sample[1] for sample in text_pairs]
        
        source =  tokenizer(original_source, truncation=True, padding=True, return_tensors='pt', max_length=source_max_length)
        target =  tokenizer(original_target, truncation=True, padding=True, return_tensors='pt', max_length=target_max_length)
        
        
        # Replace padding positions in the labels with -100 so the loss ignores them.
        labels = target.input_ids
        labels[labels == tokenizer.pad_token_id] = -100
        
        self.data = {'source_token_ids': source.input_ids, 'source_mask': source.attention_mask,
                     'target_token_ids': labels, 'target_mask': target.attention_mask,
                     'original_source': original_source, 'original_target': original_target}
        
    def __len__(self):
        return len(self.data['source_token_ids'])
    
    def __getitem__(self, idx):
        source_token_ids  = self.data['source_token_ids'][idx]
        source_mask       = self.data['source_mask'][idx]
        target_token_ids  = self.data['target_token_ids'][idx]
        target_mask       = self.data['target_mask'][idx]
        original_source   = self.data['original_source'][idx]
        original_target   = self.data['original_target'][idx]
        
        return (source_token_ids, source_mask, target_token_ids, target_mask, original_source, original_target)
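The label masking done in `__init__` can be illustrated without a tokenizer. The sketch below pads two token-id lists and replaces the padding with -100, mirroring what the dataset does with tensors. `pad_and_mask_labels` is a hypothetical helper, the pad id of 0 is an assumption consistent with the zeros in the padded source ids printed in the Asserts section, and the token ids are copied from that same output.

```python
PAD_TOKEN_ID = 0      # assumed pad id, matching the zeros in the padded source ids
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def pad_and_mask_labels(sequences, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Pad token-id lists to the longest length, then build labels and attention masks."""
    max_len = max(len(seq) for seq in sequences)
    labels, masks = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        padded = seq + [pad_token_id] * padding
        # Same rule as `labels[labels == pad_token_id] = -100` in the dataset class.
        labels.append([ignore_index if tok == pad_token_id else tok for tok in padded])
        masks.append([1] * len(seq) + [0] * padding)
    return labels, masks


target_ids = [
    [2077, 6618, 4, 1241, 7531, 1],
    [3247, 8060, 573, 151, 4, 1241, 7531, 1],
]
demo_labels, demo_masks = pad_and_mask_labels(target_ids)

print(demo_labels)
print(demo_masks)
```

Without this replacement, the model would also be trained to predict padding tokens, which would lower the reported loss without improving the translations.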

### Asserts

In [9]:
text_pairs = [('I like pizza', 'eu gosto de pizza'), ('we love pizza so much', 'nós amamos muito de pizza')]
dataset_debug = ParacrawlDataset(
    text_pairs=text_pairs,
    tokenizer=tokenizer,
    source_max_length=source_max_length,
    target_max_length=target_max_length)

dataloader_debug = DataLoader(dataset_debug, batch_size=10, shuffle=True, 
                              num_workers=0)

source_token_ids, source_mask, target_token_ids, target_mask, _, _ = next(iter(dataloader_debug))
print('source_token_ids:\n', source_token_ids)
print('source_mask:\n', source_mask)
print('target_token_ids:\n', target_token_ids)
print('target_mask:\n', target_mask)

print('source_token_ids.shape:', source_token_ids.shape)
print('source_mask.shape:', source_mask.shape)
print('target_token_ids.shape:', target_token_ids.shape)
print('target_mask.shape:', target_mask.shape)

source_token_ids:
 tensor([[ 2738,   104,   146, 20739,   934, 15374,  1066,    32,    46,   116,
          1079,   634,  1241,  7531,     1,     0,     0,     0,     0],
        [ 2738,   104,   146, 20739,   934, 15374,  1066,    32,    46,    31,
          1528,  2181,   327,  1241,  7531,  2469,  2032,   414,     1]])
source_mask:
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
target_token_ids:
 tensor([[2077, 6618,    4, 1241, 7531,    1, -100, -100],
        [3247, 8060,  573,  151,    4, 1241, 7531,    1]])
target_mask:
 tensor([[1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])
source_token_ids.shape: torch.Size([2, 19])
source_mask.shape: torch.Size([2, 19])
target_token_ids.shape: torch.Size([2, 8])
target_mask.shape: torch.Size([2, 8])


### DataLoaders

In [10]:
dataset_train = ParacrawlDataset(text_pairs=x_train,
                          tokenizer=tokenizer,
                          source_max_length=source_max_length,
                          target_max_length=target_max_length)

dataset_val = ParacrawlDataset(text_pairs=x_val,
                        tokenizer=tokenizer,
                        source_max_length=source_max_length,
                        target_max_length=target_max_length)

dataset_test = ParacrawlDataset(text_pairs=x_test,
                         tokenizer=tokenizer,
                         source_max_length=source_max_length,
                         target_max_length=target_max_length)

train_dataloader = DataLoader(dataset_train, batch_size=batch_size,
                              shuffle=True, num_workers=0)

val_dataloader = DataLoader(dataset_val, batch_size=batch_size, shuffle=False, 
                            num_workers=0)

test_dataloader = DataLoader(dataset_test, batch_size=batch_size,
                             shuffle=False, num_workers=0)

## **Model**

In [11]:
class EarlyStopping():
    def __init__(self, patience=5, min_delta=0.0001):
        self.patience = patience
        self.counter = 0
        self.best_score = None
        self.best_model_wts = None
        self.min_delta = min_delta

    def __call__(self, model, val_loss):
        score = -val_loss

        if self.best_score is None:
            self.best_score = score
            self.best_model_wts = copy.deepcopy(model.state_dict())
            return False

        elif score < self.best_score + self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        else:
            self.best_score = score
            self.best_model_wts = copy.deepcopy(model.state_dict())
            self.counter = 0
        return False
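To check how `patience` and `min_delta` interact, the sketch below replays the same decision rule on a list of validation losses, without a model. `first_stopping_step` is a hypothetical helper and the loss values are made up.

```python
def first_stopping_step(val_losses, patience=5, min_delta=0.0001):
    """Return the index of the evaluation that triggers early stopping, or None."""
    best_score = None
    counter = 0
    for step, val_loss in enumerate(val_losses):
        score = -val_loss
        if best_score is None:
            best_score = score
        elif score < best_score + min_delta:
            # No improvement of at least `min_delta`: spend one unit of patience.
            counter += 1
            if counter >= patience:
                return step
        else:
            # Real improvement: remember it and reset the counter.
            best_score = score
            counter = 0
    return None


# The loss improves, stalls (0.89995 is within min_delta of 0.9), improves, then degrades.
stalled_losses = [1.0, 0.9, 0.95, 0.89995, 0.85, 0.86, 0.87, 0.88]
improving_losses = [1.0, 0.9, 0.8, 0.7]

stalled_stop = first_stopping_step(stalled_losses, patience=3)
improving_stop = first_stopping_step(improving_losses, patience=3)

print(stalled_stop)    # 7
print(improving_stop)  # None
```

Note that the counter resets only on an improvement larger than `min_delta`, so the tiny gain at index 3 still counts against patience.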

In [12]:
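def train_step(model, batch, optimizer, batch_idx, dataloader_size):
    model.train()
    input_ids      = batch[0].to(device)
    attention_mask = batch[1].to(device)
    labels         = batch[2].to(device)

    # Scale the loss so that gradients accumulated over several batches average out.
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss / accumulate_grad_batches
    loss.backward()

    # Update the weights every `accumulate_grad_batches` batches and on the last batch of the epoch.
    if ((batch_idx + 1) % accumulate_grad_batches == 0) or (batch_idx + 1 == dataloader_size):
        optimizer.step()
        optimizer.zero_grad()

    # Report the unscaled loss so it is comparable to the validation loss.
    return loss.item() * accumulate_grad_batches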

def evaluate(model, dataloader, mode='loss'):
    # `mode` selects the metric: 'loss' (mean cross-entropy) or 'sacreblue' (SacreBLEU score).
    if mode not in ('loss', 'sacreblue'):
        raise ValueError(f'Invalid mode: {mode!r}')

    pred_seq = []
    targets = []
    running_loss = 0
    
    model.eval()
    for batch in tqdm(dataloader, total=len(dataloader), desc='Validation', leave=False):
        input_ids      = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels         = batch[2].to(device)

        with torch.no_grad():
            if mode == 'sacreblue':
                # Generate translations and keep one reference per prediction.
                output_sequences = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=target_max_length)
                pred_seq += tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
                targets  += [[sentence] for sentence in batch[5]]
            else:
                running_loss += model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss.item()

    if mode == 'sacreblue':
        # torchmetrics reports BLEU in the range [0, 1], not as a percentage.
        metric = SacreBLEUScore()
        return metric(pred_seq, targets)
    return running_loss / len(dataloader)
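The gradient accumulation in `train_step` divides each batch loss by `accumulate_grad_batches` before calling `backward()`. The sketch below checks, on a toy one-parameter regression rather than T5, that summing the scaled gradients of two equally sized micro-batches reproduces the gradient of the full batch, and lists the batch indices at which the optimizer would step. `mse_gradient` and the data are hypothetical.

```python
def mse_gradient(w, xs, ts):
    """Gradient of mean((w*x - t)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]
accumulate_grad_batches = 2
micro_batch_size = 2

# Gradient computed on all four samples at once.
full_batch_gradient = mse_gradient(w, xs, ts)

# Same gradient obtained by summing the scaled gradients of two micro-batches.
accumulated_gradient = 0.0
for start in range(0, len(xs), micro_batch_size):
    micro_xs = xs[start:start + micro_batch_size]
    micro_ts = ts[start:start + micro_batch_size]
    accumulated_gradient += mse_gradient(w, micro_xs, micro_ts) / accumulate_grad_batches

# Batch indices after which the optimizer would step in an epoch of 5 batches,
# using the same condition as `train_step`.
dataloader_size = 5
step_indices = [
    batch_idx for batch_idx in range(dataloader_size)
    if (batch_idx + 1) % accumulate_grad_batches == 0 or batch_idx + 1 == dataloader_size
]

print(full_batch_gradient, accumulated_gradient)  # -22.5 -22.5
print(step_indices)                               # [1, 3, 4]
```

The last index shows the extra update on the final batch, which keeps leftover gradients from leaking into the next epoch.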

In [13]:
model = T5ForConditionalGeneration.from_pretrained(model_name)
model = model.to(device)

optimizer = AdamW(model.parameters(), lr=learning_rate)
scheduler = ReduceLROnPlateau(optimizer=optimizer, mode='min', patience=3, factor=0.1)
early_stopping = EarlyStopping()

history = {'train_loss': [], 'val_loss': []}
epoch = 1

stop_training = False

while not stop_training:
    for batch_idx, data in enumerate(tqdm(train_dataloader, total=len(train_dataloader), desc=f'Training Epoch {epoch}')):
        train_loss = train_step(model=model, 
                                batch=data, 
                                optimizer=optimizer,
                                batch_idx=batch_idx,
                                dataloader_size=len(train_dataloader))
        history['train_loss'].append(train_loss)
        
        if ((batch_idx + 1) % evaluate_interval == 0):
            val_loss = evaluate(model=model,
                                dataloader=val_dataloader,
                                mode='loss')
            
            scheduler.step(val_loss)
            history['val_loss'].append(val_loss)
            # Call early stopping once per evaluation: a second call with the same
            # loss would increment its patience counter again.
            if early_stopping(model=model, val_loss=val_loss):
                stop_training = True
                break
    
    epoch += 1

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

Training Epoch 1:   0%|          | 0/15469 [00:00<?, ?it/s]

RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.75 GiB total capacity; 13.16 GiB already allocated; 39.44 MiB free; 13.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [None]:
import matplotlib.pyplot as plt


fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(30,7))
axes[0].plot(history['train_loss'], 'k', label='Train')
axes[0].set_title('Train Loss')
axes[0].set_xlabel('Steps'); axes[0].set_ylabel('Loss')
axes[0].grid()

axes[1].plot(history['val_loss'], 'ok', label='Val')
axes[1].set_title('Val Loss')
axes[1].set_xlabel(f'Evaluation (every {evaluate_interval} steps)'); axes[1].set_ylabel('Loss')
axes[1].grid()


plt.show()

In [None]:
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.load_state_dict(early_stopping.best_model_wts)
model.to(device)

blue = evaluate(model, val_dataloader, mode='sacreblue')
print(f'Validation BLEU: {blue.item()}')

In [None]:
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.load_state_dict(early_stopping.best_model_wts)
model.to(device)

blue = evaluate(model, test_dataloader, mode='sacreblue')
print(f'Test BLEU: {blue.item()}')

In [None]:
import random
randomlist = random.sample(range(0, len(dataset_test)), 5)

for i in randomlist:
    item           = dataset_test[i]
    input_ids      = item[0].to(device)
    attention_mask = item[1].to(device)
    sample_en      = item[-2]
    sample_pt      = item[-1]

    pred = model.generate(input_ids=input_ids.reshape(1, -1), attention_mask=attention_mask.reshape(1, -1), max_length=target_max_length)[0]
    pred = tokenizer.decode(pred, skip_special_tokens=True)
    
    print('-'*200)
    print(f'{sample_en}\n\tPortuguese Target: {sample_pt}\n\tPortuguese Output: {pred}\n\n')

In [None]:
sample = 'Translate English to Portuguese: There aren\'t any reasons to make this translation!'
sample_token = tokenizer(sample, return_tensors='pt', truncation=True, max_length=source_max_length)
input_ids = sample_token.input_ids.to(device)
attention_mask = sample_token.attention_mask.to(device)

pred = model.generate(input_ids=input_ids.reshape(1, -1), attention_mask=attention_mask.reshape(1, -1), max_length=target_max_length)[0]
pred = tokenizer.decode(pred, skip_special_tokens=True)

print(sample)
print(pred)