<a href="https://colab.research.google.com/github/patrickctrf/IA024_2022S2/blob/main/ex08/patrick_ferreira/ex08_patrick_ferreira_175480.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [13]:


nome = "Patrick de Carvalho Tavares Rezende Ferreira"
print(f'Meu nome é {nome}')

Meu nome é Patrick de Carvalho Tavares Rezende Ferreira


# Exercise: Language Model with Self-Attention (efficient version)

This exercise is similar to the one from lesson 5, but we will now *efficiently* train a neural network with one or more self-attention layers to predict the next word of a text, given the previous words as input.

To do so, we must implement:
1. The causal attention mask. It makes it possible, during training, to obtain the losses for all input tokens with a single forward+backward pass through the network (slide 117).
2. The PAD mask, which lets us use variable-length sequences in the same batch (slide 118).
3. Multiple heads.
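
To make the first two items concrete, here is a minimal sketch of how a causal mask and a PAD mask can be combined before the softmax. It is written in plain Python (no PyTorch) so it runs anywhere; the names `causal_mask`, `masked_softmax`, and `is_pad` are illustrative assumptions and are not part of the notebook's model code.

```python
import math

NEG = -1e4  # large negative fill value, the same constant this notebook uses


def causal_mask(seq_len):
    # True marks key positions that lie in the future of the query.
    return [[k > q for k in range(seq_len)] for q in range(seq_len)]


def masked_softmax(scores, is_pad):
    """scores: seq_len x seq_len nested list; is_pad[k] is True when key k is a PAD."""
    n = len(scores)
    future = causal_mask(n)
    weights = []
    for q in range(n):
        # Hide future keys (causal mask) and PAD keys (padding mask).
        row = [NEG if (future[q][k] or is_pad[k]) else scores[q][k] for k in range(n)]
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]   # toy scores: every pair equally similar
is_pad = [False, False, False, True]     # the last position is padding
weights = masked_softmax(scores, is_pad)
for row in weights:
    print([round(w, 3) for w in row])
```

Each query row spreads its attention only over earlier, non-PAD keys, which is what lets one forward pass produce a valid prediction for every position.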

## Importing packages

In [14]:
import collections
import itertools
import functools
import math
import os
import random
import re

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm
from typing import List

In [15]:
# Check which GPU we are using
!nvidia-smi

Sat Oct 15 06:05:34 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   60C    P5    N/A /  N/A |    722MiB /  2004MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [16]:
if torch.cuda.is_available():
   dev = "cuda:0"
else:
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


# Loading the dataset

First, we download the dataset:

In [17]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



In [18]:
!pip install ptk-patrickctrf
!pip install transformers

from operator import itemgetter
from transformers import BertTokenizer
from ptk.utils import DataManager



## Loading the dataset

We will artificially create a training (80%) and validation (20%) split.

Note: Avoid looking at the test dataset as much as possible, so as not to become biased toward what will be tested. In real applications, the test dataset only becomes available in the future, that is, when the user starts testing your product.
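
As a small illustration of the 80/20 split, the sketch below shuffles with a local, seeded generator so that the split is the same on every run. The helper name `train_valid_split` and its default arguments are assumptions made for this example; the notebook's own cell uses the global `random.shuffle` instead.

```python
import random


def train_valid_split(items, train_fraction=0.8, seed=1234):
    # A local generator keeps the split reproducible without touching global random state.
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


train, valid = train_valid_split(range(100))
print(len(train), len(valid))
```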

In [19]:
def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg

# Shuffle the training set before making the train/valid split.
random.shuffle(x_train)

n_train = int(0.8 * len(x_train))

x_valid = x_train[n_train:]
x_train = x_train[:n_train]

print(len(x_train), 'amostras de treino.')
print(len(x_valid), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

print('3 primeiras amostras treino:')
for x in x_train[:3]:
    print(x[:100])

print('3 últimas amostras treino:')
for x in x_train[-3:]:
    print(x[:100])

print('3 primeiras amostras validação:')
for x in x_valid[:3]:
    print(x[:100])

print('3 últimas amostras validação:')
for x in x_valid[-3:]:
    print(x[:100])

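# Note: these seeds are set after the shuffle above, so they do not make the train/valid split itself reproducible.
random.seed(1234)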
np.random.seed(1234)
torch.manual_seed(1234)

20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.
3 primeiras amostras treino:
Here it is.. the first EVER episode of Friends, Where we get introduced to Control Freak Monica Gell
One of the last classics of the French New Wave. For direction, cineaste Jean Eustache drew from the
After seeing "Driven" on a plane flight to America 3 years ago I truly believed I had seen the worst
3 últimas amostras treino:
Not very impressed. Its difficult to offer any spoilers to this film, because there is almost no dev
I saw this film recently in a film festival. It's the romance of an ex-alcoholic unemployed man who 
Was'nt really bad for Raw's first PPV of 006. But the ending was really really shocking to everyone 
3 primeiras amostras validação:
White Fire has so much going for it. With Larry Bird look-alike Robert Ginty leading the charge blaz
There are several things wrong with this movie- Brenda Song's character being one of them. I do not 
WRITTEN ON THE WIND

<torch._C.Generator at 0x7f1cf0172e90>

In [20]:
### Creating the dataset class

class MyDataset():
    def __init__(self, texts: List[str], tokenizer, context_size: int, vocab_size=1000):
        self.context_size = context_size
        try:
            self.x = np.load("x_" + str(len(texts)) + ".npy", mmap_mode="r", allow_pickle=True)
            self.y = np.load("y_" + str(len(texts)) + ".npy", mmap_mode="r", allow_pickle=True)
            print("Carregando dataset preprocessado")
        except Exception:
            # No usable cache was found (the cache file name depends only on len(texts)),
            # so the dataset is tokenized from scratch.
            print("Montando dataset")

            self.x = list()
            self.y = list()

            for text in tqdm(texts):
                tokens_key = tokenizer(text, return_tensors=None, add_special_tokens=False).input_ids
                for i in range(0, len(tokens_key)-context_size-1, context_size):
                    self.x.append(tokens_key[i:i+context_size])
                    self.y.append(tokens_key[i+1:i+context_size+1])

            self.x = np.array(self.x)
            self.y = np.array(self.y)

            np.save("x_" + str(len(texts)) + ".npy", self.x)
            np.save("y_" + str(len(texts)) + ".npy", self.y)

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, idx):
        return torch.tensor(self.x[idx]).long(), torch.tensor(self.y[idx]).long(), torch.tensor([False] * self.context_size)
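
The windowing done in `__init__` can be looked at in isolation. The following sketch applies the same loop bounds to a plain list of token ids; the function name `make_windows` and the toy ids are assumptions made for illustration.

```python
def make_windows(token_ids, context_size):
    """Cut a token list into non-overlapping inputs and their one-step-shifted targets."""
    inputs, targets = [], []
    # Same loop bounds as MyDataset: step by context_size, stop before running out of targets.
    for i in range(0, len(token_ids) - context_size - 1, context_size):
        inputs.append(token_ids[i:i + context_size])
        targets.append(token_ids[i + 1:i + context_size + 1])
    return inputs, targets


inputs, targets = make_windows(list(range(10, 18)), context_size=3)
print(inputs)
print(targets)
```

Every target row is its input row shifted by one token, so position `t` of the target is the token the model should predict after seeing input positions `0..t`.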

Testing the dataset

In [21]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

dummy_texts = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=tokenizer, context_size=3, vocab_size=tokenizer.vocab_size)
dummy_loader = DataLoader(dummy_dataset, batch_size=2, shuffle=False)

assert len(dummy_dataset) == 3
print('Passou no assert de tamanho do dataset')

first_batch_input, first_batch_target, attention_mask = next(iter(dummy_loader))

correct_first_batch_input = torch.LongTensor(
    [[7327, 2175, 16033],
     [3449, 2050, 2175]])

correct_first_batch_target = torch.LongTensor(
    [[2175, 16033, 2139],
     [2050, 2175, 9153]])
assert torch.equal(first_batch_input, correct_first_batch_input)
print('Passou no assert de input')
assert torch.equal(first_batch_target, correct_first_batch_target)
print('Passou no assert de target')

Carregando dataset preprocessado
Passou no assert de tamanho do dataset
Passou no assert de input
Passou no assert de target


Training, validation, and test data

In [22]:
# Load datasets
context_size = 9

# tokenizer = Tokenizador(x_train, tokenize, vocab_size=3000)
# tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

training_dataset = MyDataset(texts=x_train, tokenizer=tokenizer, context_size=context_size)
valid_dataset = MyDataset(texts=x_valid, tokenizer=tokenizer, context_size=context_size)
test_dataset = MyDataset(texts=x_test, tokenizer=tokenizer, context_size=context_size)

print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

Carregando dataset preprocessado
Carregando dataset preprocessado
Carregando dataset preprocessado
training examples: 680149
valid examples: 169482
test examples: 829937


In [27]:
class MySequential(nn.Sequential):
    def forward(self, *input):
        for module in self._modules.values():
            input = module(*input)
        return input

class _AttentionLayer(torch.nn.Module):

    def __init__(self, embedding_dim: int, max_seq_length: int, ):
        """
        Implements a single decoder-only (causal) self-attention head.

        Args:
            max_seq_length (int): Size of the sequence to consider as context for prediction.
            embedding_dim (int): Dimension of the embedding layer for each word in the context.
        """
        super().__init__()
        self.context_size = max_seq_length
        self.embedding_dim = embedding_dim

        # Linear projections
        self.w_q = nn.Linear(embedding_dim, embedding_dim)
        self.w_k = nn.Linear(embedding_dim, embedding_dim)
        self.w_v = nn.Linear(embedding_dim, embedding_dim)

        # cast to probabilities
        self.softmax = nn.Softmax(dim=-1)

        # Triangular mask matrix, converted to Boolean.
        # Where it is True, the score must be replaced by a large negative value in the scores tensor.
        self.causal_mask = torch.ones((max_seq_length, max_seq_length), device=device).triu(diagonal=1) == 1.0

    def forward(self, x_embeddings, attention_mask):

        k = self.w_k(x_embeddings)
        v = self.w_v(x_embeddings)
        q = self.w_q(x_embeddings)

        scores = torch.matmul(q, k.transpose(1, 2))

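        # Where a mask is True, the score is replaced by a large negative value;
        # where it is False, the score is left untouched.
        # The padding mask is repeated once per head and broadcast over the query axis,
        # so that PAD positions are hidden as keys.
        pad_mask = attention_mask.repeat_interleave(scores.shape[0] // attention_mask.shape[0], 0).unsqueeze(1)
        probabilities = self.softmax(scores.masked_fill(self.causal_mask, -1e4).masked_fill(pad_mask, -1e4))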

        return torch.matmul(probabilities, v)

class MultiHeadAttentionLayer(torch.nn.Module):

    def __init__(self, embedding_dim: int, max_seq_length: int, n_heads: int):
        """
        Implements decoder-only multi-head self-attention with a residual connection and layer normalization.

        Args:
            max_seq_length (int): Size of the sequence to consider as context for prediction.
            embedding_dim (int): Dimension of the embedding layer for each word in the context.
            n_heads (int): Number of self attention heads.
        """
        super().__init__()
        self.context_size = max_seq_length
        self.embedding_dim = embedding_dim
        self.head_dim = embedding_dim // n_heads
        self.n_heads = n_heads

        assert embedding_dim % n_heads == 0, "MultiHeadAttentionLayer Error: Embedding_dim must be an integer multiple of n_heads. "

        self.heads = _AttentionLayer(embedding_dim=self.head_dim, max_seq_length=max_seq_length)
        self.w_0 = nn.Linear(embedding_dim, embedding_dim)

        self.norm = nn.LayerNorm(embedding_dim)

    def forward(self, x_embeddings, attention_mask):

        return self.norm(
            x_embeddings +
            self.w_0(
                self.heads(
                    x_embeddings.reshape(x_embeddings.shape[0], x_embeddings.shape[1], self.n_heads, self.head_dim).movedim(1,2).reshape(-1, x_embeddings.shape[1], self.head_dim), attention_mask
                ).reshape(x_embeddings.shape[0], self.n_heads, x_embeddings.shape[1], self.head_dim).movedim(1,2).reshape(x_embeddings.shape[0], x_embeddings.shape[1], self.embedding_dim)
            )
        )


class DecoderLayer(torch.nn.Module):

    def __init__(self, embedding_dim: int, max_seq_length: int, n_heads: int, hidden_size: int):
        """
        Implements one decoder layer: multi-head self-attention followed by a feed-forward MLP.

        Args:
            max_seq_length (int): Size of the sequence to consider as context for prediction.
            embedding_dim (int): Dimension of the embedding layer for each word in the context.
            n_heads (int): Number of self attention heads.
            hidden_size (int): Number of neurons for feed-forward MLP.
        """
        super().__init__()
        self.context_size = max_seq_length
        self.embedding_dim = embedding_dim

        # output MLP
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, hidden_size),
            nn.LeakyReLU(),
            nn.LayerNorm(hidden_size),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, embedding_dim)
        )

        self.multihead_selfattention = MultiHeadAttentionLayer(embedding_dim, max_seq_length, n_heads)

        # cast to probabilities
        self.softmax = nn.Softmax(dim=-1)

        self.norm = nn.LayerNorm(embedding_dim)

    def forward(self, x_embeddings, attention_mask):

        e = self.multihead_selfattention(x_embeddings, attention_mask)

        logits = self.mlp(e)

        return self.norm(logits + e), attention_mask

class LanguageModel(torch.nn.Module):

    def __init__(self, vocab_size: int, max_seq_length: int, embedding_dim: int, n_layers: int, n_heads: int, hidden_size: int=2048):
        """
        Implements a decoder-only language model built from self-attention layers.

        Args:
            vocab_size (int): Size of the input vocabulary.
            max_seq_length (int): Size of the sequence to consider as context for prediction.
            embedding_dim (int): Dimension of the embedding layer for each word in the context.
            n_layers (int): number of self-attention layers.
            n_heads (int): Number of self attention heads.
            hidden_size (int): Number of neurons for feed-forward MLP.
        """
        super().__init__()
        context_size = max_seq_length
        self.context_size = context_size
        self.embedding_dim = embedding_dim

        # Token (word index) embedding and positional embedding.
        # The positional table needs one row per position, not one per vocabulary entry.
        self.c_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.p_embedding = nn.Embedding(max_seq_length, embedding_dim)
        self.positional_indexes = torch.arange(self.context_size, device=device).view(1, -1)

        self.decoder_layers = MySequential(*[DecoderLayer(embedding_dim=embedding_dim, max_seq_length=max_seq_length, n_heads=n_heads, hidden_size=hidden_size) for i in range(n_layers)])

        self.linear_output = nn.Linear(embedding_dim, vocab_size)

        # cast to probabilities
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, inputs, attention_mask):
        """
        Args:
            inputs is a LongTensor of shape (batch_size, max_seq_length)

        Returns:
            logits of shape (batch_size, max_seq_length, vocab_size)
        """
        input_embeddings = self.c_embedding(inputs)

        positional_embeddings = self.p_embedding(self.positional_indexes.repeat(inputs.shape[0], 1))

        x_embeddings = positional_embeddings + input_embeddings

        logits, _ = self.decoder_layers(x_embeddings, attention_mask)

        return self.linear_output(logits)
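
The chained `reshape`/`movedim` calls in `MultiHeadAttentionLayer.forward` split the embedding dimension into heads and later merge them back. The sketch below reproduces that bookkeeping for a single sequence using nested lists, so the index mapping is easy to inspect; `split_heads` and `merge_heads` are hypothetical helper names, not functions from the notebook.

```python
def split_heads(x, n_heads):
    """x: seq_len x embedding_dim nested list -> n_heads x seq_len x head_dim."""
    head_dim = len(x[0]) // n_heads
    # Head h receives the contiguous slice of columns [h * head_dim, (h + 1) * head_dim).
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x] for h in range(n_heads)]


def merge_heads(heads):
    """n_heads x seq_len x head_dim -> seq_len x embedding_dim (inverse of split_heads)."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]


x = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # seq_len=3, embedding_dim=4
heads = split_heads(x, n_heads=2)
print(heads)
print(merge_heads(heads) == x)
```

Because each head sees only its own slice of columns, the attention layer can process all heads as extra batch entries and concatenate the results afterwards.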

Testing whether the model processes the data correctly

In [29]:
model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=context_size,
    embedding_dim=32,
    n_layers=1,
    n_heads=2,
    hidden_size=32,
).to(device)

sample_train, _, attention_mask = next(iter(DataLoader(training_dataset)))
sample_train_gpu = sample_train.to(device)
attention_mask = attention_mask.to(device)
model(sample_train_gpu, attention_mask).shape

torch.Size([1, 9, 30522])

Checking the perplexity

In [30]:
def perplexity(logits, target, ignore_token_id: int):
    """
    Computes the perplexity.

    Args:
        logits: a FloatTensor of shape (batch_size, seq_length, vocab_size)
        target: a LongTensor of shape (batch_size, seq_length)
        ignore_token_id: target id (e.g. the PAD token) excluded from the loss

    Returns:
        A float corresponding to the perplexity
    """
    logits = logits.reshape(-1, logits.shape[-1])
    target = target.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target, reduction='mean', ignore_index=ignore_token_id)
    return torch.exp(loss)


n_examples = 100

sample_train, target_token_ids, attention_mask = next(iter(DataLoader(training_dataset, batch_size=n_examples)))
sample_train_gpu = sample_train.to(device)
target_token_ids = target_token_ids.to(device)
attention_mask = attention_mask.to(device)
logits = model(sample_train_gpu, attention_mask)

print("vocab_size: ", tokenizer.vocab_size)

print("logits shape: ", logits.shape)

my_perplexity = perplexity(logits=logits, target=target_token_ids, ignore_token_id=tokenizer.pad_token_id)

print(f'my perplexity:              {int(my_perplexity)}')
print(f'correct initial perplexity: {tokenizer.vocab_size}')

assert math.isclose(my_perplexity, tokenizer.vocab_size, abs_tol=6000)
print('Passou o no assert da perplexidade')

del sample_train_gpu
del target_token_ids
del logits

vocab_size:  30522
logits shape:  torch.Size([100, 9, 30522])
my perplexity:              34155
correct initial perplexity: 30522
Passou o no assert da perplexidade
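
The reason an untrained model should score a perplexity close to `vocab_size` can be checked by hand: if every token receives probability `1 / vocab_size`, the mean negative log-likelihood is `log(vocab_size)` and its exponential is `vocab_size`. The sketch below does this in plain Python; `perplexity_from_probs` is a hypothetical helper, and the PAD id of 0 is an assumption for the example.

```python
import math


def perplexity_from_probs(target_probs, targets, ignore_token_id):
    """target_probs[i] is the probability the model assigned to targets[i]."""
    # Skip ignored targets (e.g. PADs), then average the negative log-likelihoods.
    nll = [-math.log(p) for p, t in zip(target_probs, targets) if t != ignore_token_id]
    return math.exp(sum(nll) / len(nll))


vocab_size = 30522
uniform = [1.0 / vocab_size] * 5
ppl = perplexity_from_probs(uniform, [7, 8, 0, 9, 3], ignore_token_id=0)
print(round(ppl, 2))
```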


Training

In [31]:
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.cuda.amp import GradScaler, autocast

max_examples = 30_000_000
eval_every_steps = 5000
lr = 3e-5
use_amp = True

unscaling_factor = 1

model = LanguageModel(
    vocab_size=tokenizer.vocab_size,
    max_seq_length=context_size,
    embedding_dim=512//unscaling_factor,
    n_layers=8//unscaling_factor,
    n_heads=8,
    hidden_size=2048//unscaling_factor,
).to(device)
train_loader = DataLoader(training_dataset, batch_size=1024//unscaling_factor, shuffle=True, num_workers=1)
validation_loader = DataLoader(valid_dataset, batch_size=1024//unscaling_factor, num_workers=1, )

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = ReduceLROnPlateau(optimizer, 'min', factor=0.8, min_lr=8e-6, patience=3, threshold=5e-3, verbose=True)
scaler=GradScaler()


def train_step(input_ids, target_ids, attention_mask):
    model.train()
    model.zero_grad()
    with autocast(enabled=use_amp):
        logits = model(input_ids, attention_mask)
        logits = logits.reshape(-1, logits.shape[-1])
        target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, )
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    return loss.item()


def validation_step(input_ids, target_ids, attention_mask):
    model.eval()
    with torch.no_grad():
        with autocast(enabled=use_amp):
            logits = model(input_ids, attention_mask)
            logits = logits.reshape(-1, logits.shape[-1])
            target_ids = target_ids.reshape(-1)
            loss = nn.functional.cross_entropy(logits, target_ids,)
    return loss.item()

best_validation_ppl = 9999
train_losses = []
n_examples = 0
step = 0
pbar = tqdm(total=max_examples)
while n_examples < max_examples:
    for train_input_ids, train_target_ids, attention_mask in DataManager(train_loader, device=device, buffer_size=1, data_type=None):
        loss = train_step(train_input_ids.to(device), train_target_ids.to(device), attention_mask)
        train_losses.append(loss)

        if step % eval_every_steps == 0:
            train_ppl = np.exp(np.average(train_losses))

            # LR scheduler
            scheduler.step(np.average(train_losses))

            with torch.no_grad():
                valid_ppl = np.exp(np.average([
                    validation_step(val_input_ids.to(device), val_target_ids.to(device), attention_mask)
                    for val_input_ids, val_target_ids, attention_mask in DataManager(validation_loader, device=device, buffer_size=1, data_type=None)]))
                # Checkpoint to best models found.
                if best_validation_ppl > valid_ppl:
                    # Update the new best perplexity.
                    best_validation_ppl = valid_ppl
                    model.eval()
                    torch.save(model, "best_model.pth")

            print(f'{step} steps; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}')
            train_losses = []

        n_examples += len(train_input_ids)  # Increment of batch size
        step += 1
        pbar.update(len(train_input_ids))
        if n_examples >= max_examples:
            break

pbar.close()

# Restore best model (checkpoint) found
model = torch.load("best_model.pth")
model.eval()

  0%|          | 0/60000000 [00:00<?, ?it/s]

KeyboardInterrupt: 
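
The training cell lowers the learning rate with `ReduceLROnPlateau`. As a rough mental model of what those settings do, the sketch below is a simplified plain-Python version of a reduce-on-plateau rule in "min" mode with a relative threshold and no cooldown. It is an approximation written for illustration, not PyTorch's implementation; the class name `PlateauScheduler` is an assumption.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule ("min" mode, relative threshold, no cooldown)."""

    def __init__(self, lr, factor=0.8, patience=3, threshold=5e-3, min_lr=8e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.min_lr = min_lr
        self.best = float("inf")
        self.num_bad_steps = 0

    def step(self, loss):
        # A step counts as an improvement only if it beats the best loss by the threshold.
        if loss < self.best * (1.0 - self.threshold):
            self.best = loss
            self.num_bad_steps = 0
        else:
            self.num_bad_steps += 1
        # After more than `patience` non-improving steps, shrink the learning rate.
        if self.num_bad_steps > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.num_bad_steps = 0
        return self.lr


sched = PlateauScheduler(lr=3e-5)
history = [sched.step(1.0) for _ in range(5)]  # a loss that never improves
print(history)
```

With the values used in this notebook, four consecutive evaluations without improvement are needed before the rate drops from 3e-5 to 2.4e-5, and it never goes below 8e-6.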

Evaluation on the test dataset

In [None]:
test_loader = DataLoader(test_dataset, batch_size=64, num_workers=1)

with torch.no_grad():
    test_ppl = np.exp(np.average([
        validation_step(input_ids.to(device), target_ids.to(device), attention_mask.to(device))
        for input_ids, target_ids, attention_mask in tqdm(test_loader)
    ]))

print(f'test perplexity: {test_ppl}')

Testing sentence generation

In [None]:
prompt = 'Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to'
max_output_tokens = 10

for _ in range(max_output_tokens):
    input_ids = tokenizer(prompt, return_tensors=None, add_special_tokens=False).input_ids
    # Only the last <context_size> tokens are fed to the model.
    input_ids_truncated = input_ids[-context_size:]
    # The padding mask must be 2-D (batch_size, context_size); all False means no PADs.
    attention_mask = torch.zeros((1, context_size), dtype=torch.bool, device=device)
    logits = model(torch.LongTensor([input_ids_truncated]).to(device), attention_mask)[:, -1, :]
    # With argmax, the model's output at each step is the highest-probability token.
    # This is called greedy decoding.
    predicted_id = torch.argmax(logits).item()
    predicted_word = tokenizer.decode([predicted_id])
    # Append the whole decoded token; WordPiece continuations ("##...") attach without a space.
    prompt += predicted_word[2:] if predicted_word.startswith('##') else ' ' + predicted_word
    print(prompt)
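
The greedy decoding loop can also be written against token ids directly, without re-tokenizing the prompt at every step. The sketch below shows the idea with a toy scoring function standing in for the model; `greedy_decode` and `toy_scores` are hypothetical names, and the toy "model" simply prefers the id that follows the last one.

```python
def greedy_decode(next_token_scores, prompt_ids, context_size, max_new_tokens):
    """Repeatedly append the highest-scoring next token (greedy decoding)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_size:]          # keep only the most recent tokens
        scores = next_token_scores(context)    # one score per vocabulary entry
        best_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ids.append(best_id)
    return ids


def toy_scores(context, vocab_size=4):
    # Stand-in for a language model: favours (last token + 1) modulo vocab_size.
    favourite = (context[-1] + 1) % vocab_size
    return [1.0 if i == favourite else 0.0 for i in range(vocab_size)]


generated = greedy_decode(toy_scores, [0], context_size=3, max_new_tokens=5)
print(generated)
```

Greedy decoding is deterministic: the same prompt always yields the same continuation, which makes it convenient for quick checks like the one in this notebook.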