<a href="https://colab.research.google.com/github/pedrogengo/DLforNLP/blob/main/Pedro_Aula_9_Exerc%C3%ADcio_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference notebook

Name: Pedro Gabriel Gengo Lourenço

## Instructions:


Train a BERT model (or a variant) for binary classification on the IMDB dataset (20k/5k training/validation samples) and measure its accuracy.

Important:
- You must implement your own training loop.
- Implement gradient accumulation.

Tips:
- BERT usually learns a task well in a few epochs (3 to 5). If it takes more than 5 epochs to reach 80% accuracy, adjust the hyperparameters.

- Fix for out-of-memory errors:
  - Using bfloat16 allows nearly doubling the batch size
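
As a rough illustration of the bfloat16 hint, the sketch below runs a forward pass under `torch.autocast`. The tiny `nn.Linear` model and the CPU device are assumptions made to keep the example self-contained; with BERT on a GPU the same context manager would wrap the `model(...)` call, provided the hardware supports bfloat16.

```python
import torch
from torch import nn

model = nn.Linear(16, 2)          # stand-in for BERT (assumption)
x = torch.randn(4, 16)

# Inside autocast, eligible ops such as linear layers run in bfloat16,
# while the parameters themselves stay in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(x)

print(logits.dtype)               # dtype of the autocast output
print(model.weight.dtype)         # parameters are unchanged
```

The memory saving comes mostly from the activations being stored in half the number of bytes, which is why a larger batch can fit.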

Optional:
- You can use the Trainer from the Hugging Face Transformers library to check that your training loop is correct. Note that implementing your own loop is still mandatory.

- Use PyTorch Lightning. To understand how PyTorch Lightning works, see the simplified implementation in the notebook `06_01_Treino_Validação_MNIST_Lightning_Lite.ipynb`

# Fixing the seed

In [None]:
import random
import torch
import torch.nn.functional as F
import numpy as np

In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7fdf4dce5c10>
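
To check that seeding behaves as expected, the three calls above can be wrapped in a helper and invoked twice. This is only a sketch: `fix_seed` is a hypothetical name, not something the notebook or a library defines.

```python
import random
import numpy as np
import torch

def fix_seed(seed: int) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# Drawing after each call to fix_seed should give identical numbers.
fix_seed(123)
first = (random.random(), np.random.rand(), torch.rand(1).item())
fix_seed(123)
second = (random.random(), np.random.rand(), torch.rand(1).item())
print(first == second)
```

Seeding makes the shuffling and weight initialization repeatable, although some GPU kernels can still introduce small run-to-run differences.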

## Preparing the data

First, we download the dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples) from the original training set.

In [None]:
import os

max_valid = 5000

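def load_texts(folder):
    # Read every review file in the folder; the encoding is set explicitly so
    # the result does not depend on the platform default.
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts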

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training data before making the train/validation split.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
False Stan Laurel and Oliver Hardy are the most famous comedy duo in history, and deservedly so, so I am h
False Okay, I was bored and decided to see this movie. But I think the main thing that brought this movie 
True For anyone who may not know what a one-actor movie was like, this is the best example. This plot is 
Last 3 training samples:
False The movie started off strong, LL Cool J (Deed) as an undercover police officer, with partner Sgt. La
True Though the pieces are uneven this collection of 11 short films is truly a moving and human experienc
True Since my third or fourth viewing some time ago, I've abstained from La Maman et la putain while I wa
First 3 validation samples:
True I had never heard of Robert Roy MacGregor before "Rob Roy" came out, but the movie is definitely wor
True OK, here is my personal list of top Nicktoons shows as in today:<br /><br />1
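
The shuffle-then-slice split used above can be checked on toy data. The sketch below is a hypothetical helper (`split_train_valid` is not part of the notebook) that uses a local `random.Random` instance, so the split does not depend on the global seed state. It assumes `n_valid` is greater than zero.

```python
import random

def split_train_valid(xs, ys, n_valid, seed=123):
    """Shuffle (x, y) pairs together and hold out the last n_valid pairs."""
    pairs = list(zip(xs, ys))
    random.Random(seed).shuffle(pairs)
    train, valid = pairs[:-n_valid], pairs[-n_valid:]
    x_tr, y_tr = zip(*train)
    x_va, y_va = zip(*valid)
    return x_tr, y_tr, x_va, y_va

# Toy data: ten fake reviews with alternating labels.
xs = [f"review {i}" for i in range(10)]
ys = [i % 2 == 0 for i in range(10)]
x_tr, y_tr, x_va, y_va = split_train_valid(xs, ys, n_valid=3)
print(len(x_tr), len(x_va))
```

Zipping texts and labels before shuffling is what keeps each label attached to its review.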

# Trying out the Transformers library

In [None]:
!pip install transformers



In [None]:
from transformers import BertForSequenceClassification, BertTokenizer

In [None]:
checkpoint = 'bert-base-uncased'

In [None]:
# Use the same tokenizer that was used to train the checkpoint
tokenizer = BertTokenizer.from_pretrained(checkpoint)
tokenizer(x_train[0])

{'input_ids': [101, 9761, 11893, 1998, 6291, 9532, 2024, 1996, 2087, 3297, 4038, 6829, 1999, 2381, 1010, 1998, 10849, 2135, 2061, 1010, 2061, 1045, 2572, 3407, 2000, 2156, 2151, 1997, 2037, 3152, 1012, 10468, 1037, 2158, 2012, 1037, 7109, 4713, 2003, 1996, 2959, 2000, 8579, 1010, 1998, 2574, 2438, 25208, 15288, 2007, 2035, 1996, 7109, 14950, 1012, 2002, 2003, 8345, 2012, 2188, 2007, 9761, 2011, 2010, 2217, 1010, 11303, 4251, 1010, 1998, 1996, 3460, 1006, 2508, 10346, 8485, 3385, 1007, 11640, 2000, 2360, 2002, 2003, 2746, 2058, 2000, 4638, 2006, 25208, 1012, 2044, 27504, 27902, 1998, 6451, 2003, 8494, 20043, 2039, 2011, 1037, 2892, 1011, 7168, 7192, 2386, 1010, 1996, 3460, 3310, 1999, 2005, 1037, 4638, 1011, 2039, 1010, 1998, 2044, 2070, 5852, 1010, 2002, 26021, 5948, 13555, 1005, 1055, 6501, 1998, 2893, 2070, 2712, 2250, 2006, 1996, 4153, 1012, 2044, 9761, 10975, 18908, 13087, 2070, 9368, 2652, 1010, 5689, 2041, 1996, 3332, 2011, 1996, 3042, 11601, 1998, 1037, 2482, 5823, 1010, 2002, 1

In [None]:
# Build the model from the checkpoint
model = BertForSequenceClassification.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
tokens = tokenizer(x_train[0], padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
output

SequenceClassifierOutput([('logits',
                           tensor([[ 0.4554, -0.5477]], grad_fn=<AddmmBackward>))])
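
The two logits can be turned into class probabilities with a softmax, and compared with a label through cross-entropy. The sketch below reuses the logit values printed above; the label is an arbitrary choice for illustration. Since the classification head is freshly initialized at this point, the prediction itself is not meaningful yet.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.4554, -0.5477]])   # values copied from the output above
probs = F.softmax(logits, dim=1)             # each row sums to 1
pred = probs.argmax(dim=1)                   # index of the most likely class

# With integer class labels, cross-entropy on the logits is the usual
# classification loss; label 1 is an arbitrary choice for this example.
label = torch.tensor([1])
loss = F.cross_entropy(logits, label)
print(probs, pred.item(), loss.item())
```

The loss equals the negative log of the probability assigned to the labeled class, so it grows as the model becomes more confident in the wrong class.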

# Dataset and DataLoader classes

In [None]:
from torch.utils.data import Dataset, DataLoader
from transformers import DataCollatorWithPadding

In [None]:
class IMDBDataset(Dataset):
  def __init__(self, x, y):
    self.x = x
    self.y = y
  
  def __len__(self):
    return len(self.x)
  
  def __getitem__(self, idx):
    return self.x[idx], int(self.y[idx])

In [None]:
# Tokenize whole batches rather than single examples, to take advantage of the Hugging Face code optimized for batches
def create_dataloader(x, y, tokenizer, batch_size, shuffle=False, max_length=250):
  def data_collator(batch):
    x, y = zip(*batch)
    tokenized_x = tokenizer(x, padding='longest', truncation=True, max_length=max_length, return_tensors='pt')
    return tokenized_x['input_ids'], torch.LongTensor(y), tokenized_x['attention_mask']
  dataset = IMDBDataset(x, y)
  return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=data_collator)
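
What `padding='longest'` does inside the collate function can be illustrated without a tokenizer. The sketch below pads toy id sequences to the longest one in the batch and builds the matching attention mask; the token ids are made up, and using 0 as the padding id is an assumption of this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy token-id sequences of different lengths (made-up ids, 0 = padding id).
batch = [torch.tensor([101, 7592, 102]),
         torch.tensor([101, 7592, 2088, 999, 102])]

# Pad every sequence to the length of the longest one in the batch.
input_ids = pad_sequence(batch, batch_first=True, padding_value=0)
# The mask marks real tokens with 1 and padding with 0. Deriving it from
# the ids only works because no real token in this toy batch has id 0.
attention_mask = (input_ids != 0).long()

print(input_ids.shape)
print(attention_mask)
```

Padding per batch instead of to a fixed length keeps batches of short reviews small, which saves computation.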

# Training, validation, and test loop

In [None]:
if torch.cuda.is_available():
    dev = "cuda:0"
    print(torch.cuda.get_device_name(dev))
else:
    dev = "cpu"
print(dev)
device = torch.device(dev)

Tesla K80
cuda:0


In [None]:
def train(model, train, valid, optimizer, batch_count, filename_save, n_epochs=5, run=None, params=None):
  # batch_count: number of mini-batches whose gradients are accumulated before each optimizer step.
  best_valid_loss = float('inf')
  best_epoch = 0
  count = 0
  train_losses, valid_losses = [], []
  optimizer.zero_grad()
  if run:
    run['parameters'] = params
  for i in range(n_epochs):
    accumulated_loss = 0
    model.train()
    for x_train, y_train, attention_mask in train:
      x_train = x_train.to(device)
      y_train = y_train.to(device)
      attention_mask = attention_mask.to(device)
      outputs = model(input_ids=x_train, attention_mask=attention_mask, labels=y_train)
      batch_loss = outputs.loss
      # Gradients accumulate in .grad across backward() calls until zero_grad().
      batch_loss.backward()
      if count == batch_count-1:
        count = -1
        optimizer.step()
        optimizer.zero_grad()

      count += 1
      accumulated_loss += batch_loss.item()

    # outputs.loss is a per-batch mean, so this is roughly the mean loss divided by the batch size.
    train_loss = accumulated_loss / len(train.dataset)
    train_losses.append(train_loss)

    # Validation loop, run once per epoch.
    accumulated_loss = 0
    accumulated_accuracy = 0
    model.eval()
    with torch.no_grad():
      for x_valid, y_valid, attention_mask in valid:
        x_valid = x_valid.to(device)
        y_valid = y_valid.to(device)
        attention_mask = attention_mask.to(device)

        # forward pass
        outputs = model(input_ids=x_valid, attention_mask=attention_mask, labels=y_valid)
        # loss and predicted classes
        batch_loss = outputs.loss
        preds = outputs.logits.argmax(dim=1)

        # number of correct predictions in the batch
        batch_accuracy = (preds == y_valid).sum().item()
        accumulated_loss += batch_loss.item()
        accumulated_accuracy += batch_accuracy

    valid_loss = accumulated_loss / len(valid.dataset)
    valid_losses.append(valid_loss)

    valid_acc = accumulated_accuracy / len(valid.dataset)

    print(f'Epoch: {i:d}/{n_epochs - 1:d} Train Loss: {train_loss:.6f} Valid Loss: {valid_loss:.6f} Valid Acc: {valid_acc:.3f}')

    if run:
      run[f"{filename_save}_valid/loss"].log(valid_loss)
      run[f"{filename_save}_valid/acc"].log(valid_acc)
      run[f"{filename_save}_train/loss"].log(train_loss)


    # Save the best model according to the validation loss
    if valid_loss < best_valid_loss:
      model.save_pretrained(filename_save)
      best_valid_loss = valid_loss
      best_epoch = i
      print('best model')

  return model, train_losses, valid_losses
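
The loop above sums the gradients of `batch_count` mini-batches before each optimizer step. A common variant divides each mini-batch loss by the number of accumulated batches, so that the accumulated gradient matches the gradient of one large batch. The sketch below checks this on a toy linear model; the model, the random data and the name `accum_steps` are assumptions for illustration, not the notebook's code.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()          # mean reduction (the default)
x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
accum_steps = 4                          # mini-batches per optimizer step

# Reference: gradient of the mean loss over the full batch of 16.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 mini-batches of 4, each loss scaled by 1 / accum_steps.
model.zero_grad()
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(model(xb), yb) / accum_steps).backward()   # grads add up in .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Without the scaling, the accumulated gradient is `accum_steps` times larger, which acts like a higher learning rate for optimizers that are sensitive to gradient scale.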

# Checking that the model can overfit

In [None]:
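# transformers.AdamW is deprecated; torch.optim.AdamW is the maintained implementation (default weight_decay=0.01)
from torch.optim import AdamW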

In [None]:
checkpoint = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(checkpoint)
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)


In [None]:
dataloader_train = create_dataloader(x_train[:20], y_train[:20], tokenizer, 20, shuffle=True)
dataloader_valid = create_dataloader(x_valid[:20], y_valid[:20], tokenizer, 20)

In [None]:
_, train_losses_bow, valid_losses_bow = train(model, dataloader_train, dataloader_valid, optimizer, 1, "bert-imdb", n_epochs=20)

Epoch: 0/19 Train Loss: 0.035609 Valid Loss: 0.034723 Valid Acc: 0.500
best model
Epoch: 1/19 Train Loss: 0.032153 Valid Loss: 0.035063 Valid Acc: 0.450
Epoch: 2/19 Train Loss: 0.029061 Valid Loss: 0.037210 Valid Acc: 0.600
Epoch: 3/19 Train Loss: 0.026236 Valid Loss: 0.037679 Valid Acc: 0.500
Epoch: 4/19 Train Loss: 0.021259 Valid Loss: 0.033887 Valid Acc: 0.550
best model
Epoch: 5/19 Train Loss: 0.017975 Valid Loss: 0.033585 Valid Acc: 0.600
best model
Epoch: 6/19 Train Loss: 0.015056 Valid Loss: 0.034098 Valid Acc: 0.550
Epoch: 7/19 Train Loss: 0.011919 Valid Loss: 0.034701 Valid Acc: 0.600
Epoch: 8/19 Train Loss: 0.011060 Valid Loss: 0.034657 Valid Acc: 0.550
Epoch: 9/19 Train Loss: 0.008896 Valid Loss: 0.035670 Valid Acc: 0.550
Epoch: 10/19 Train Loss: 0.007037 Valid Loss: 0.037786 Valid Acc: 0.600
Epoch: 11/19 Train Loss: 0.005926 Valid Loss: 0.039901 Valid Acc: 0.600
Epoch: 12/19 Train Loss: 0.004680 Valid Loss: 0.042317 Valid Acc: 0.600
Epoch: 13/19 Train Loss: 0.004097 Valid L

In [None]:
accumulated_accuracy = 0
model.eval()
with torch.no_grad():
    for x_test_, y_test_, attention_mask in dataloader_train:
        x_test_ = x_test_.to(device)
        y_test_ = y_test_.to(device)
        attention_mask = attention_mask.to(device)

        # forward pass
        outputs = model(input_ids=x_test_, attention_mask=attention_mask, labels=y_test_)
        # predicted class for each example
        preds = outputs.logits.argmax(dim=1)

        # count the correct predictions
        batch_accuracy = (preds == y_test_).sum()
        accumulated_accuracy += batch_accuracy

# Accuracy on the 20 training examples used for the overfitting check
test_acc = accumulated_accuracy / len(dataloader_train.dataset)
test_acc *= 100
print('*' * 40)
print(f'Accuracy: {test_acc:.3f} %')
print('*' * 40)

****************************************
Accuracy: 100.000 %
****************************************


# Sentiment analysis using BERT

In [None]:
checkpoint = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(checkpoint)
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)


In [None]:
dataloader_train = create_dataloader(x_train, y_train, tokenizer, 20, shuffle=True)
dataloader_valid = create_dataloader(x_valid, y_valid, tokenizer, 20)

In [None]:
_, train_losses_bow, valid_losses_bow = train(model, dataloader_train, dataloader_valid, optimizer, 4, "bert-imdb", n_epochs=3)

Epoch: 0/2 Train Loss: 0.014313 Valid Loss: 0.014166 Valid Acc: 0.879
best model
Epoch: 1/2 Train Loss: 0.007381 Valid Loss: 0.012810 Valid Acc: 0.912
best model
Epoch: 2/2 Train Loss: 0.003596 Valid Loss: 0.014741 Valid Acc: 0.914
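
Because `train` divides the sum of per-batch mean losses by the number of samples, the printed losses are approximately the mean cross-entropy divided by the batch size (20 here), so 0.014313 corresponds to roughly 0.286 per example. The sketch below contrasts that convention with a sample-weighted average; the per-batch values are invented for illustration.

```python
# (mean loss, batch size) pairs as a training loop might record them;
# the numbers are invented, and the last batch is smaller on purpose.
batches = [(0.70, 20), (0.65, 20), (0.50, 10)]

# Weight each batch mean by its size, then divide by the number of samples.
total_loss = sum(mean_loss * size for mean_loss, size in batches)
n_samples = sum(size for _, size in batches)
per_sample_loss = total_loss / n_samples

# The notebook's convention: sum of batch means divided by the sample count.
notebook_style = sum(mean_loss for mean_loss, _ in batches) / n_samples

print(f"{per_sample_loss:.4f} {notebook_style:.4f}")
```

Both conventions rank epochs the same way when the batch size is fixed, so model selection by validation loss is unaffected; only the reported scale differs.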


In [None]:
dataloader_test = create_dataloader(x_test, y_test, tokenizer, 20)
accumulated_accuracy = 0
model.eval()
with torch.no_grad():
    for x_test_, y_test_, attention_mask in dataloader_test:
        x_test_ = x_test_.to(device)
        y_test_ = y_test_.to(device)
        attention_mask = attention_mask.to(device)

        # forward pass
        outputs = model(input_ids=x_test_, attention_mask=attention_mask, labels=y_test_)
        # predicted class for each example
        preds = outputs.logits.argmax(dim=1)

        # count the correct predictions
        batch_accuracy = (preds == y_test_).sum()
        accumulated_accuracy += batch_accuracy

In [None]:
test_acc = accumulated_accuracy / len(dataloader_test.dataset)
test_acc *= 100
print('*' * 40)
print(f'Accuracy: {test_acc:.3f} %')
print('*' * 40)

****************************************
Accuracy: 91.564 %
****************************************
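
The accuracy loop appears twice in this notebook, so it could be factored into a helper. The sketch below is one possible shape for it: `evaluate_accuracy` and `DummyModel` are hypothetical names, and the dummy model only mimics the `.logits` attribute of the BERT output so the example runs without downloading a checkpoint.

```python
from types import SimpleNamespace
import torch

def evaluate_accuracy(model, dataloader, device):
    """Return the fraction of correctly classified examples in dataloader."""
    correct, total = 0, 0
    model.eval()
    with torch.no_grad():
        for input_ids, labels, attention_mask in dataloader:
            input_ids = input_ids.to(device)
            labels = labels.to(device)
            attention_mask = attention_mask.to(device)
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

class DummyModel(torch.nn.Module):
    """Stand-in for BERT: predicts class 1 when the first token id is odd."""
    def forward(self, input_ids, attention_mask=None):
        odd = (input_ids[:, 0] % 2).float()
        return SimpleNamespace(logits=torch.stack([1 - odd, odd], dim=1))

# One toy batch in the (input_ids, labels, attention_mask) order used above.
ids = torch.tensor([[1, 5], [2, 5], [3, 5], [4, 5]])
toy_loader = [(ids, torch.tensor([1, 0, 1, 1]), torch.ones_like(ids))]
print(evaluate_accuracy(DummyModel(), toy_loader, torch.device("cpu")))
```

With the real model, a call such as `evaluate_accuracy(model, dataloader_test, device)` would return a fraction between 0 and 1 rather than a percentage.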
