
<a href="https://colab.research.google.com/github/patrickctrf/IA024_2022S2/blob/main/ex07/patrick_ferreira/ex07_patrick_ferreira_175480.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference notebook

Name: 

## Instructions

In this Colab we will train a T5 model to translate from English to Portuguese, using the Paracrawl dataset.

- We use the English-Portuguese Paracrawl dataset. The training set is truncated to only 100k pairs to speed up training. You may train on more samples if you wish. If training still takes too long, truncate the dataset further.

- We use BLEU as the metric, computed with SacreBLEU because it always applies the same preprocessing (tokenization, case handling). We do not use torchnlp.metrics.bleu, torchtext.data.metrics.bleu_score, etc. SacreBLEU is slow, so use few validation samples (e.g. 5k).
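
To illustrate why a fixed preprocessing matters, the sketch below computes a simplified clipped n-gram precision for one sentence pair under two preprocessing choices. This is only one ingredient of BLEU, not the full metric and not SacreBLEU's implementation. The function name `ngram_precision` and the example sentences are illustrative assumptions.

```python
from collections import Counter


def ngram_precision(hyp_tokens, ref_tokens, n=1):
    """Clipped (modified) n-gram precision of a hypothesis against one reference."""
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram counts at most as often as it appears in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


hypothesis = "eu gosto de pizza."
reference = "Eu gosto de pizza."

# Case-sensitive whitespace tokenization: "eu" != "Eu", so 3 of 4 unigrams match.
cased = ngram_precision(hypothesis.split(), reference.split())
# Lowercasing both sides first: all 4 unigrams match.
lowercased = ngram_precision(hypothesis.lower().split(), reference.lower().split())

print(cased, lowercased)  # 0.75 1.0
```

The same translation scores differently depending on the preprocessing, which is why scores are only comparable when every evaluation uses the same pipeline.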


We use the PTT5 model available on the Hugging Face model hub:

https://huggingface.co/unicamp-dl/ptt5-small-portuguese-vocab

This is a T5 pre-trained on Portuguese text, with a Portuguese tokenizer.

Since training is long, it is advisable to save the model weights and the optimizer state.


In [29]:
# General settings
from queue import Queue
from threading import Thread

import numpy as np
from torch import nn
from tqdm import tqdm

model_name = "unicamp-dl/ptt5-small-portuguese-vocab"
batch_size = 64
accumulate_grad_batches = 2
source_max_length = 128
target_max_length = 128
learning_rate = 5e-4
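
The `accumulate_grad_batches` setting above is assumed here to mean the number of micro-batches whose gradients are summed before each optimizer step. The training loop later in this notebook steps the optimizer on every batch, so the setting is not applied there. The sketch below uses a tiny stand-in model (`nn.Linear`, not T5) and random data to check that accumulating two scaled micro-batch gradients reproduces the gradient of one larger batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
accumulate_grad_batches = 2  # same value as in the settings cell

model = nn.Linear(4, 1)      # stand-in for the real model
loss_fn = nn.MSELoss()       # mean reduction over the batch
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

# Reference: gradient of the mean loss over the full batch of 8 examples.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 4 examples each.
model.zero_grad()
micro_batches = zip(inputs.chunk(accumulate_grad_batches),
                    targets.chunk(accumulate_grad_batches))
for micro_inputs, micro_targets in micro_batches:
    # Divide by the number of accumulated batches so the sum is an average.
    loss = loss_fn(model(micro_inputs), micro_targets) / accumulate_grad_batches
    loss.backward()  # gradients add up in .grad across backward() calls
accumulated_grad = model.weight.grad.clone()
# An optimizer.step() followed by zero_grad() would go here, once per group.

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

With mixed precision, the same pattern would apply with `scaler.scale(loss).backward()` on each micro-batch and `scaler.step(optimizer)` once per group.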

In [30]:
! pip install sacrebleu
! pip install transformers
! pip install sentencepiece



In [31]:
# Import all packages at once to avoid duplicate imports throughout the notebook.
import gzip
import os
import random
import sacrebleu
import torch
import torch.nn.functional as F

# from google.colab import drive

from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

from typing import Dict
from typing import List
from typing import Tuple

In [32]:
# Important: Fix seeds so we can replicate results
seed = 123
random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)

if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


We save the checkpoints (model weights) to Google Drive so that training can resume from where it stopped.

In [33]:
# drive.mount('/content/drive')
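
A minimal sketch of a resumable checkpoint, assuming we want to store the optimizer state and a step counter alongside the weights. It uses a small `nn.Linear` as a stand-in for T5 and a temporary directory as a hypothetical path. On Colab, the path would point inside the mounted Drive folder instead.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)  # stand-in for the T5 model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Take one optimizer step so that Adam has internal state worth saving.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

# Hypothetical location; replace with a folder on the mounted drive.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": 1},
    checkpoint_path,
)

# Later (e.g. in a new session): rebuild the objects, then restore their state.
restored_model = nn.Linear(4, 2)
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=5e-4)
checkpoint = torch.load(checkpoint_path)
restored_model.load_state_dict(checkpoint["model"])
restored_optimizer.load_state_dict(checkpoint["optimizer"])
start_step = checkpoint["step"]
```

Saving state dicts rather than the whole model object keeps the file independent of the class definition that was in scope when it was written.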

## Preparing the data

First, we download the dataset:

In [34]:
! wget -nc https://storage.googleapis.com/unicamp-dl/ia024a_2022s2/paracrawl_enpt_train.tsv.gz
! wget -nc https://storage.googleapis.com/unicamp-dl/ia024a_2022s2/paracrawl_enpt_test.tsv.gz

File ‘paracrawl_enpt_train.tsv.gz’ already there; not retrieving.

File ‘paracrawl_enpt_test.tsv.gz’ already there; not retrieving.



## Loading the dataset

We artificially create a training split (100k pairs) and a validation split (5k pairs).

Note: Avoid looking at the test dataset as much as possible, so as not to become biased toward what will be tested. In real applications, the test data only becomes available in the future, that is, when users start testing your product.
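
The shuffle-then-slice split described above can be sketched on toy data as follows. The helper name `split_pairs` and the toy pairs are illustrative assumptions. Using a local `random.Random` instance is one way to make the split reproducible without touching the global random state.

```python
import random


def split_pairs(pairs, n_train, n_val, seed=123):
    """Shuffle a copy of `pairs` with a fixed seed, then slice train and validation."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]


toy_pairs = [(f"source {i}", f"target {i}") for i in range(10)]
train_a, val_a = split_pairs(toy_pairs, n_train=6, n_val=2)
train_b, val_b = split_pairs(toy_pairs, n_train=6, n_val=2)

print(len(train_a), len(val_a))    # 6 2
print(train_a == train_b)          # True: same seed, same split
print(set(train_a) & set(val_a))   # set(): no pair appears in both splits
```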

In [35]:
def load_text_pairs(path):
    # Each line holds one tab-separated (source, target) pair.
    text_pairs = []
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            text_pairs.append(line.strip().split('\t'))
    return text_pairs

x_train = load_text_pairs('paracrawl_enpt_train.tsv.gz')
x_test = load_text_pairs('paracrawl_enpt_test.tsv.gz')

# Shuffle the training data before making the train/val split.
random.shuffle(x_train)

# Truncate the dataset to 100k training pairs and 5k validation pairs.
x_val = x_train[100000:105000]
x_train = x_train[:100000]

for set_name, x in [('treino', x_train), ('validação', x_val), ('test', x_test)]:
    print(f'\n{len(x)} amostras de {set_name}')
    print(f'3 primeiras amostras {set_name}:')
    for i, (source, target) in enumerate(x[:3]):
        print(f'{i}: source: {source}\n   target: {target}')


100000 amostras de treino
3 primeiras amostras treino:
0: source: More Croatian words and phrases
   target: Mais palavras e frases em croata
1: source: Jerseys and pullovers, containing at least 50Â % by weight of wool and weighing 600Â g or more per article 6110 11 10 (PCE)
   target: Camisolas e pulôveres, com pelo menos 50 %, em peso, de lã e pesando 600g ou mais por unidade 6110 11 10 (PCE)
2: source: Atex Colombia SAS makes available its lead product, 100% natural liquid latex, excellent quality and price. ... Welding manizales caldas Colombia a DuckDuckGo
   target: Atex Colômbia SAS torna principal produto está disponível, látex líquido 100% natural, excelente qualidade e preço. ...

5000 amostras de validação
3 primeiras amostras validação:
0: source: «You have hidden these things from the wise and the learned you have revealed them to the childlike»
   target: «Escondeste estas coisas aos sábios e entendidos e as revelaste aos pequenos»
1: source: Repair of computers, applic
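
The loader above assumes every line has exactly two tab-separated fields. Whether the Paracrawl files contain malformed lines was not checked here, so the following is only a defensive sketch: a hypothetical variant, `load_text_pairs_strict`, that skips such lines and reports how many were dropped, exercised on a small synthetic file.

```python
import gzip
import os
import tempfile


def load_text_pairs_strict(path):
    """Read a gzipped TSV and keep only lines with exactly two non-empty fields."""
    pairs, skipped = [], 0
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 2 and all(field.strip() for field in fields):
                pairs.append((fields[0].strip(), fields[1].strip()))
            else:
                skipped += 1
    return pairs, skipped


# Tiny synthetic file with one malformed line (no tab separator).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8") as f:
    f.write("we like pizza\teu gosto de pizza\n")
    f.write("line without a tab\n")
    f.write("good morning\tbom dia\n")

pairs, skipped = load_text_pairs_strict(demo_path)
print(pairs)    # [('we like pizza', 'eu gosto de pizza'), ('good morning', 'bom dia')]
print(skipped)  # 1
```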

## Creating the Dataset


In [36]:
tokenizer = T5Tokenizer.from_pretrained(model_name)

In [37]:
class MyDataset(Dataset):
    def __init__(self, text_pairs: List[Tuple[str, str]], tokenizer,
                 source_max_length: int = 32, target_max_length: int = 32):
        self.tokenizer = tokenizer
        self.text_pairs = text_pairs
        self.source_max_length = source_max_length
        self.target_max_length = target_max_length

        sources, targets = list(zip(*text_pairs))

        # Tokenize the whole split once. padding=True pads every sequence to the
        # longest one in the split; truncation caps it at max_length.
        self.sources_tokenizadas = tokenizer(sources, padding=True, truncation=True, max_length=self.source_max_length, return_tensors="pt")
        # Targets use target_max_length (not source_max_length).
        self.targets_tokenizadas = tokenizer(targets, padding=True, truncation=True, max_length=self.target_max_length, return_tensors="pt")

    def __len__(self):
        return len(self.text_pairs)

    def __getitem__(self, idx):
        source, target = self.text_pairs[idx]

        # Tokenization already happened in __init__; just index the cached tensors.
        source_token_ids = self.sources_tokenizadas.input_ids[idx]
        source_mask = self.sources_tokenizadas.attention_mask[idx]
        target_token_ids = self.targets_tokenizadas.input_ids[idx]
        target_mask = self.targets_tokenizadas.attention_mask[idx]

        return source_token_ids, source_mask, target_token_ids, target_mask, source, target

## Testing the DataLoader

In [38]:
text_pairs = [('we like pizza', 'eu gosto de pizza')]
dataset_debug = MyDataset(
    text_pairs=text_pairs,
    tokenizer=tokenizer,
    source_max_length=source_max_length,
    target_max_length=target_max_length)

dataloader_debug = DataLoader(dataset_debug, batch_size=10, shuffle=True, 
                              num_workers=0)

source_token_ids, source_mask, target_token_ids, target_mask, _, _ = next(iter(dataloader_debug))
print('source_token_ids:\n', source_token_ids)
print('source_mask:\n', source_mask)
print('target_token_ids:\n', target_token_ids)
print('target_mask:\n', target_mask)

print('source_token_ids.shape:', source_token_ids.shape)
print('source_mask.shape:', source_mask.shape)
print('target_token_ids.shape:', target_token_ids.shape)
print('target_mask.shape:', target_mask.shape)

source_token_ids:
 tensor([[  31, 1528, 1079,  634, 1241, 7531,    1]])
source_mask:
 tensor([[1, 1, 1, 1, 1, 1, 1]])
target_token_ids:
 tensor([[2077, 6618,    4, 1241, 7531,    1]])
target_mask:
 tensor([[1, 1, 1, 1, 1, 1]])
source_token_ids.shape: torch.Size([1, 7])
source_mask.shape: torch.Size([1, 7])
target_token_ids.shape: torch.Size([1, 6])
target_mask.shape: torch.Size([1, 6])
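
The debug batch above holds a single example, so no padding is visible. The plain-Python sketch below illustrates what padding to the longest sequence and the matching attention mask look like when sequences differ in length. The helper `pad_batch` is hypothetical, and the pad id is assumed to be 0, T5's usual pad token id.

```python
PAD_TOKEN_ID = 0  # assumption: the usual T5 pad token id


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


batch = [
    [31, 1528, 1079, 634, 1241, 7531, 1],  # source ids printed above (7 tokens)
    [2077, 6618, 4, 1241, 7531, 1],        # target ids printed above, reused as a shorter sequence
]
input_ids, attention_mask = pad_batch(batch)

print(input_ids[1])       # [2077, 6618, 4, 1241, 7531, 1, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 0]
```

The mask marks real tokens with 1 and padding with 0, which is what the model receives as `attention_mask`.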


## Creating the Train/Val/Test DataLoaders

In [39]:
dataset_train = MyDataset(text_pairs=x_train,
                          tokenizer=tokenizer,
                          source_max_length=source_max_length,
                          target_max_length=target_max_length)

dataset_val = MyDataset(text_pairs=x_val,
                        tokenizer=tokenizer,
                        source_max_length=source_max_length,
                        target_max_length=target_max_length)

dataset_test = MyDataset(text_pairs=x_test,
                         tokenizer=tokenizer,
                         source_max_length=source_max_length,
                         target_max_length=target_max_length)

train_dataloader = DataLoader(dataset_train, batch_size=batch_size,
                              shuffle=True, num_workers=4)

val_dataloader = DataLoader(dataset_val, batch_size=batch_size, shuffle=False, 
                            num_workers=4)

test_dataloader = DataLoader(dataset_test, batch_size=batch_size,
                             shuffle=False, num_workers=4)

### Utility for moving data between devices in parallel


### Training

In [40]:
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)


In [41]:
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.cuda.amp import GradScaler, autocast

max_examples = 100_000
eval_every_steps = 200
lr = learning_rate
use_amp = True

# DataManager(train_dataloader, device=device, data_type=None)
# DataManager(val_dataloader, device=device, data_type=None)

train_loader = train_dataloader
validation_loader = val_dataloader


optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = ReduceLROnPlateau(optimizer, 'min', factor=0.9, min_lr=3e-5, patience=15000, threshold=1e-1, verbose=True)
scaler=GradScaler()

accumulated_grad_batches_until_now = 0

def train_step(source_tokens, source_mask, target_tokens, target_mask, original_source, original_target):
    model.train()
    model.zero_grad()
    with autocast(enabled=use_amp):
        loss = model(input_ids = source_tokens.to(device), attention_mask = source_mask.to(device), decoder_attention_mask = target_mask.to(device), labels = target_tokens.to(device)).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    return loss.item()


def validation_step(source_tokens, source_mask, target_tokens, target_mask, original_source, original_target):
    model.eval()
    with torch.no_grad():
        with autocast(enabled=use_amp):
            loss = model(input_ids = source_tokens.to(device), attention_mask = source_mask.to(device), decoder_attention_mask = target_mask.to(device), labels = target_tokens.to(device)).loss
    return loss.item()


best_validation_loss = 9999
train_losses = []
n_examples = 0
step = 0
pbar = tqdm(total=max_examples)
while n_examples < max_examples:
    for mini_batch in train_dataloader:
        loss = train_step(*mini_batch)
        train_losses.append(loss)

        # LR scheduler
        scheduler.step(loss)

        if step % eval_every_steps == 0:
            train_loss = np.average(train_losses)

            with torch.no_grad():
                valid_loss = np.average([
                    validation_step(*mini_batch)
                    for mini_batch in val_dataloader])
                # Checkpoint the best model found so far.
                if best_validation_loss > valid_loss:
                    # Record the new best validation loss.
                    best_validation_loss = valid_loss
                    model.eval()
                    torch.save(model, "best_model.pth")

            print(f'{step} steps; {n_examples} examples so far; train loss: {train_loss:.2f}, valid loss: {valid_loss:.2f}')
            train_losses = []

        n_examples += mini_batch[0].shape[0]  # Increment of batch size
        step += 1
        pbar.update(mini_batch[0].shape[0])
        if n_examples >= max_examples:
            break

pbar.close()

# Restore best model (checkpoint) found
model = torch.load("best_model.pth")

  0%|          | 0/100000 [01:15<?, ?it/s]
  0%|          | 64/100000 [00:09<4:19:08,  6.43it/s]

0 steps; 0 examples so far; train loss: 23.27, valid loss: 30.85


 64%|██████▍   | 64064/100000 [06:10<30:28, 19.65it/s] 

1000 steps; 64000 examples so far; train loss: 0.82, valid loss: 0.57


100%|██████████| 100000/100000 [09:25<00:00, 176.84it/s]
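
In the training cell above, the padded target ids are passed directly as `labels`. In the Hugging Face T5 implementation, label positions equal to -100 are ignored by the cross-entropy loss, whereas pad ids are scored like any other token. The sketch below shows one way to mask the padding before computing the loss. The helper name `mask_padding_in_labels` and the second example row are illustrative assumptions, and the effect on the results reported here was not measured.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_in_labels(target_token_ids, target_mask):
    """Replace padded label positions with IGNORE_INDEX so the loss skips them."""
    return target_token_ids.masked_fill(target_mask == 0, IGNORE_INDEX)


target_token_ids = torch.tensor([
    [2077, 6618, 4, 1241, 7531, 1],  # full-length target
    [31, 1528, 1, 0, 0, 0],          # shorter target, right-padded with 0
])
target_mask = torch.tensor([
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
])

labels = mask_padding_in_labels(target_token_ids, target_mask)
print(labels[1].tolist())  # [31, 1528, 1, -100, -100, -100]
```

In `train_step` and `validation_step`, the result would be passed as `labels=` in place of the raw `target_tokens`.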


### Evaluating the BLEU score

In [42]:
!pip install torchmetrics

# Evaluation routine inspired by Bruno da Silvia's notebook
from torchmetrics import SacreBLEUScore

def evaluate_bleu_score(model, target_dataloader):

    pred_translations, targets = [], []

    for i, batch in tqdm(enumerate(target_dataloader), total=len(target_dataloader)):
        inputs = batch[0]
        inputs_mask = batch[1]
        targets += [[i] for i in batch[-1]]

        with torch.no_grad():
            model_output = model.generate(input_ids=inputs.to(device), attention_mask=inputs_mask.to(device), max_length=target_max_length)
            pred_translations += tokenizer.batch_decode(model_output, skip_special_tokens=True)

    metric = SacreBLEUScore()
    return metric(pred_translations, targets)

Collecting torchmetrics
  Downloading torchmetrics-0.10.0-py3-none-any.whl (529 kB)
     |████████████████████████████████| 529 kB 654 kB/s eta 0:00:01
Installing collected packages: torchmetrics
Successfully installed torchmetrics-0.10.0


In [43]:
# Restore best model (checkpoint) found
model = torch.load("best_model.pth")

bleu = evaluate_bleu_score(model, test_dataloader)
print(f'\nFinal BLEU score on test: {bleu.item()*100:.2f}')



100%|██████████| 313/313 [11:03<00:00,  2.12s/it]



Final BLEU score on test: 19.21


### Translating a few examples


In [45]:
# Sentence-generation routine inspired by Mateus Lindino's notebook

import random
randomlist = random.sample(range(0, len(dataset_test)), 5)

for i in randomlist:
    item           = dataset_test[i]
    input_ids      = item[0].to(device)
    attention_mask = item[1].to(device)
    sample_en      = item[-2]
    sample_pt      = item[-1]

    pred = model.generate(input_ids=input_ids.reshape(1, -1), attention_mask=attention_mask.reshape(1, -1), max_length=target_max_length)[0]
    pred = tokenizer.decode(pred, skip_special_tokens=True)

    print('-'*200)
    print(f'{sample_en}\n\tPortuguese Target: {sample_pt}\n\tPortuguese Output: {pred}\n\n')

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
To have your own server, a company or professional, you may find different options, from buying the physical server, or hardware equipment, to hire a dedicated server on the Internet, through the possibility of hiring a VPS or even a Reseller service, depending on the actual needs that have or will have in the future. It also depends on the potential knowledge, professional or employee of the company, on the administration server.
	Portuguese Target: Para ter seu próprio servidor, uma empresa ou um profissional, você pode encontrar opções diferentes, desde a compra de servidores físicos, ou equipamento de hardware, para contratar os serviços de um servidor dedicado na Internet, através da possibilidade de contratar um VPS ou até mesmo um serviço de Revenda, dependendo das necessidades rea