<a href="https://colab.research.google.com/github/patrickctrf/IA024_2022S2/blob/main/ex06/patrick_ferreira/ex06_175480_patrick_ferreira.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook de referência 

Nome: Patrick de Carvalho Tavares Rezende Ferreira

## Instruções:


Treinar e medir a acurácia de um modelo BERT (ou variantes) para classificação binária usando o dataset do IMDB (20k/5k amostras de treino/validação).

Importante: 
- Deve-se implementar o próprio laço de treinamento.
- Implementar o acumulo de gradiente.

Dicas:
- BERT geralmente costuma aprender bem uma tarefa com poucas épocas (de 3 a 5 épocas). Se tiver demorando mais de 5 épocas para chegar em 80% de acurácia, ajuste os hiperparametros.

- Solução para erro de memória:
  - Usar bfloat16 permite quase dobrar o batch size

Opcional:
- Pode-se usar a função trainer da biblioteca Transformers/HuggingFace para verificar se seu laço de treinamento está correto. Note que ainda assim é obrigatório implementar o laço próprio.

- Usar pytorch lightning. Para entender como o pytorch lightning funciona, veja uma implementação simplificada no notebook `06_01_Treino_Validação_MNIST_Lightning_Lite.ipynb`

# Fixando a seed

In [None]:
import random
from typing import List

import torch
import torch.nn.functional as F
import numpy as np

In [None]:
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm

random.seed(123)
np.random.seed(123)
torch.manual_seed(123)


if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

## Preparando Dados

Primeiro, fazemos download do dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



In [None]:
!pip install transformers
from transformers import BertTokenizer, BertForSequenceClassification

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Carregando o dataset

Criaremos uma divisão de treino (20k exemplos) e validação (5k exemplos) artificialmente.

In [None]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in tqdm(os.listdir(folder)):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

import os
cwd = os.getcwd()

x_train_pos = load_texts(os.path.join(cwd,'aclImdb' + os.sep + 'train' + os.sep + 'pos'))
x_train_neg = load_texts(os.path.join(cwd,'aclImdb' + os.sep + 'train' + os.sep + 'neg'))
x_test_pos = load_texts(os.path.join(cwd, 'aclImdb' + os.sep + 'test' + os.sep + 'pos'))
x_test_neg = load_texts(os.path.join(cwd, 'aclImdb' + os.sep + 'test' + os.sep + 'neg'))

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Embaralhamos o treino para depois fazermos a divisão treino/valid.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'amostras de treino.')
print(len(x_valid), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

print('3 primeiras amostras treino:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('3 últimas amostras treino:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('3 primeiras amostras validação:')
for x, y in zip(x_valid[:3], y_test[:3]):
    print(y, x[:100])

print('3 últimas amostras validação:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])



  0%|          | 0/12500 [00:00<?, ?it/s][A
 31%|███       | 3867/12500 [00:00<00:00, 38664.14it/s][A
 62%|██████▏   | 7734/12500 [00:00<00:00, 37682.83it/s][A
100%|██████████| 12500/12500 [00:00<00:00, 36211.19it/s]

  0%|          | 0/12500 [00:00<?, ?it/s][A
 31%|███       | 3860/12500 [00:00<00:00, 38594.42it/s][A
 62%|██████▏   | 7720/12500 [00:00<00:00, 36209.32it/s][A
100%|██████████| 12500/12500 [00:00<00:00, 35900.20it/s]

  0%|          | 0/12500 [00:00<?, ?it/s][A
 32%|███▏      | 3960/12500 [00:00<00:00, 39595.98it/s][A
 63%|██████▎   | 7920/12500 [00:00<00:00, 39168.76it/s][A
100%|██████████| 12500/12500 [00:00<00:00, 37990.98it/s]

  0%|          | 0/12500 [00:00<?, ?it/s][A
 28%|██▊       | 3456/12500 [00:00<00:00, 34550.23it/s][A
 56%|█████▌    | 6994/12500 [00:00<00:00, 35024.69it/s][A
100%|██████████| 12500/12500 [00:00<00:00, 35518.73it/s]


20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.
3 primeiras amostras treino:
False As low budget indies go, you will usually find that you get what you pay for, and let me just say, I
False The DVD was a joke, the audio for the first few minutes was terrible with sound out of sync and Sega
True When i was told of this movie i thought it would be another chick flick. I was wrong. This movie sen
3 últimas amostras treino:
False Kubrick meets King. It sounded so promising back in the spring of 1980, I remember. Then the movie c
True Richard Willaims is an animation god. He was hampered in directing this film by the producer. The fi
True 'Panic in the Streets (1950)' owes more to British noir that its American counterparts. Like Reed's 
3 primeiras amostras validação:
True I just got back from a screening a couple of hours ago, and I was very happy with the movie when I l
True "Panic" is a captivating, blurred-genre film about a brooding and conflicted 

In [None]:
### Criando classe do dataset

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, texts: List[str], labels):
        super().__init__()

        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], torch.tensor([0., 1.]) if self.labels[idx] else torch.tensor([1., 0.])

Dados de treino, validação e teste

In [None]:
training_dataset = MyDataset(x_train, y_train)
valid_dataset = MyDataset(x_valid, y_valid)
test_dataset = MyDataset(x_test, y_test)

print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

training examples: 20000
valid examples: 5000
test examples: 25000


Testando se o modelo processa os dados corretamente

In [None]:
model = tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.train().to(device)

sample_train, _ = next(iter(DataLoader(training_dataset, batch_size=4)))

sample_train = tokenizer.batch_encode_plus(sample_train, padding=True, return_tensors="pt", truncation=True, max_length=200).to(device)

model(**sample_train).logits.shape

del sample_train


Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

TREINAMENTO

In [38]:


from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.cuda.amp import GradScaler, autocast

max_examples = 20_000
eval_every_steps = 1000
lr = 4e-5
use_amp = True

train_loader = DataLoader(training_dataset, batch_size=16, shuffle=True, num_workers=1)
validation_loader = DataLoader(valid_dataset, batch_size=16, num_workers=1, )

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = ReduceLROnPlateau(optimizer, 'min', factor=0.9, min_lr=3e-5, patience=15000, threshold=1e-1, verbose=True)
scaler=GradScaler()


def train_step(input_ids, target_ids):
    model.train()
    model.zero_grad()
    with autocast(enabled=use_amp):
        logits = model(**input_ids).logits
        logits = logits.reshape(-1, logits.shape[-1])
    loss = nn.functional.cross_entropy(logits, target_ids, )
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    return loss.item()


def validation_step(input_ids, target_ids):
    model.eval()
    with autocast(enabled=use_amp):
        logits = model(**input_ids).logits
        loss = nn.functional.cross_entropy(logits, target_ids,)
        preds = logits.argmax(dim=1)
        accuracy = (preds == target_ids.argmax(dim=1)).sum().float() / logits.shape[0]
    return loss.item(), accuracy.item()

best_validation_loss = 9999
train_losses = []
n_examples = 0
step = 0
pbar = tqdm(total=max_examples)
while n_examples < max_examples:
    for train_input_ids, train_target_ids in train_loader:
        train_input_ids = tokenizer.batch_encode_plus(train_input_ids, padding=True, return_tensors="pt", truncation=True, max_length=200).to(device)
        loss = train_step(train_input_ids, train_target_ids.to(device))
        train_losses.append(loss)

        # LR scheduler
        scheduler.step(loss)

        if step % eval_every_steps == 0:
            train_loss = np.average(train_losses)

            with torch.no_grad():
                valid_loss = np.average([
                    validation_step(tokenizer.batch_encode_plus(val_input_ids, padding=True, return_tensors="pt", truncation=True, max_length=200).to(device), val_target_ids.to(device))[0]
                    for val_input_ids, val_target_ids in validation_loader])
                # Checkpoint to best models found.
                if best_validation_loss > valid_loss:
                    # Update the new best perplexity.
                    best_validation_loss = valid_loss
                    model.eval()
                    torch.save(model, "best_model.pth")

            print(f'{step} steps; {n_examples} examples so far; train loss: {train_loss:.2f}, valid loss: {valid_loss:.2f}')
            train_losses = []

        n_examples += len(train_input_ids)  # Increment of batch size
        step += 1
        pbar.update(len(train_input_ids))
        if n_examples >= max_examples:
            break

pbar.close()

# Restore best model (checkpoint) found
model = torch.load("best_model.pth")


  0%|          | 3/20000 [00:48<89:11:23, 16.06s/it]

0 steps; 0 examples so far; train loss: 0.12, valid loss: 0.28


 15%|█▌        | 3003/20000 [06:27<20:31:38,  4.35s/it]

1000 steps; 3000 examples so far; train loss: 0.18, valid loss: 0.29


 30%|███       | 6003/20000 [12:06<16:57:48,  4.36s/it]

2000 steps; 6000 examples so far; train loss: 0.11, valid loss: 0.39


 45%|████▌     | 9003/20000 [17:46<13:44:41,  4.50s/it]

3000 steps; 9000 examples so far; train loss: 0.07, valid loss: 0.48


 60%|██████    | 12003/20000 [23:25<9:42:05,  4.37s/it]

4000 steps; 12000 examples so far; train loss: 0.06, valid loss: 0.45


 75%|███████▌  | 15003/20000 [29:05<6:07:07,  4.41s/it]

5000 steps; 15000 examples so far; train loss: 0.04, valid loss: 0.43


 90%|█████████ | 18003/20000 [34:46<2:26:16,  4.39s/it]

6000 steps; 18000 examples so far; train loss: 0.03, valid loss: 0.45


20001it [38:04,  8.75it/s]


Avaliação no dataset de Teste

In [39]:
test_loader = DataLoader(test_dataset, batch_size=1024)

with torch.no_grad():
    losses = [
        validation_step(tokenizer.batch_encode_plus(input, padding=True, return_tensors="pt", truncation=True, max_length=200).to(device), target.to(device))
        for input, target in tqdm(test_loader)
    ]




100%|██████████| 25/25 [03:36<00:00,  8.65s/it]


In [44]:
losses = list(zip(*losses))

test_loss, test_accuracy = np.average(losses[0]), torch.mean(torch.tensor(losses[1]).float())

print(f'test loss: {test_loss}', f'test acc: {test_accuracy}')

test loss: 0.25690659701824187 test acc: 91.10171508789062
