<a href="https://colab.research.google.com/github/patrickctrf/IA024_2022S2/blob/main/ex04/patrick_ferreira/ex04_175480_patrick_ferreira.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
nome = "Patrick de Carvalho Tavares Rezende Ferreira"
print(f'Meu nome é {nome}')



Meu nome é Patrick de Carvalho Tavares Rezende Ferreira


#  Exercício: Modelo de Linguagem (Bengio 2003) - MLP + Embeddings

Neste exercício iremos treinar uma rede neural similar a do Bengio 2003 para prever a próxima palavra de um texto, data as palavras anteriores como entrada. Esta tarefa é chamada de "Modelagem da Linguagem".

Algumas dicas:
- Inclua caracteres de pontuação (ex: `.` e `,`) no vocabulário.
- Deixe tudo como caixa baixa (lower-case).
- A escolha do tamanho do vocabulario é importante: ser for muito grande, fica difícil para o modelo aprender boas representações. Se for muito pequeno, o modelo apenas conseguirá gerar textos simples.
- Remova qualquer exemplo de treino/validação/teste que tenha pelo menos um token desconhecido (ou na entrada ou na saída). 
- Este dataset já possui um tamanho razoável e é bem provável que você vai precisar rodar seus experimentos com GPU.
- Durante a depuração, faça seu dataset ficar bem pequeno, para que a depuração seja mais rápida e não precise de GPU. Somente ligue a GPU quando o seu laço de treinamento já está funcionando
- Não deixe para fazer esse exercício na véspera. Ele é trabalhoso.

## Importação dos pacotes

In [2]:
import collections
import itertools
import functools
import math
import os
import random
import re

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm_notebook, tqdm
from typing import List


from operator import itemgetter

In [3]:
# Check which GPU we are using
!nvidia-smi

Wed Sep 21 20:46:00 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   56C    P8    N/A /  N/A |    202MiB /  2004MiB |     37%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [4]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


# Carregamento do dataset 

Primeiro, fazemos download do dataset:

In [5]:
# !wget -nc http://files.fast.ai/data/aclImdb.tgz
# !tar -xzf aclImdb.tgz

## Carregando o dataset

Criaremos uma divisão de treino (80%) e validação (20%) artificialmente.

Nota: Evitar de olhar ao máximo o dataset de teste para não ficar enviseado no que será testado. Em aplicações reais, o dataset de teste só estará disponível no futuro, ou seja, é quando o usuário começa a testar o seu produto.

In [6]:
def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg

# Embaralhamos o treino para depois fazermos a divisão treino/valid.
random.shuffle(x_train)

n_train = int(0.8 * len(x_train))

x_valid = x_train[n_train:]
x_train = x_train[:n_train]

print(len(x_train), 'amostras de treino.')
print(len(x_valid), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

print('3 primeiras amostras treino:')
for x in x_train[:3]:
    print(x[:100])

print('3 últimas amostras treino:')
for x in x_train[-3:]:
    print(x[:100])

print('3 primeiras amostras validação:')
for x in x_valid[:3]:
    print(x[:100])

print('3 últimas amostras validação:')
for x in x_valid[-3:]:
    print(x[:100])



random.seed(1234)
np.random.seed(1234)
torch.manual_seed(1234)

20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.
3 primeiras amostras treino:
Gentleman Jim not really a boxing film. It is a vehicle for Errol Flynn as Jim Corbett. But having s
If there is one film which is the worst of this year- it's TASHAN The first promo gave an indication
Anyone giving this movie a good review obviously must have had something to do with its creation. Th
3 últimas amostras treino:
I saw this at a screening last night too. I was totally blown away at how much better this movie was
I commend pictures that try something different. Many films just seem like re-treads of old ideas, s
Xiao Chen Zhi Chun is a great movie, not only in the year it was shot but also now. It's an art movi
3 primeiras amostras validação:
I didn't really know what this movie was about when I went to the theater to see it (hype about the 
The jokes are obvious, the gags are corny, and the characters are walking characatures - but I could
''The 40 Year Old V

<torch._C.Generator at 0x7f8e88087e90>

### Definindo funções de manipulação de texto.

In [7]:
from typing import List

import re
import string


def tokenize(text: str):
    """
    Convert string to a list of tokens (i.e., words).
    This function lower cases everything and removes punctuation.
    """

    return re.findall(r"[\w']+|[.,!?;]", text.lower())


from collections import Counter


def create_vocab(texts: List[str], max_tokens: int):
    """
    Returns a dictionary whose keys are tokens and values are token ids (from 0 to max_tokens - 1).
    """

    tokens = []

    for t in texts:
        tokens.extend(tokenize(t))

    return dict(Counter(tokens).most_common(max_tokens))

def items_em_comum(amostra, corpus):
    c = [value for value in amostra if value in corpus]
    return c


def concatenate_list_of_str(texts: List[str]):
    return ".".join(texts)

class Tokenizador():
    def __init__(self, texts, tokenizer=tokenize, vocab_size=3000):
        super().__init__()

        self.tokenizer = tokenizer
        self.vocab_size = vocab_size

        self.every_text = concatenate_list_of_str(texts)
        self.tokens_ocurrences = create_vocab(tokenize(self.every_text), max_tokens=vocab_size)

        self.lista_do_vocabulario = list(self.tokens_ocurrences.keys())

        self.dicionario_tokens = dict(zip(self.lista_do_vocabulario, range(len(self.lista_do_vocabulario))))

    def __call__(self, text):
        tokens = self.tokenizer(text)

        # Pegamos somente os tokens que existem no vocabulario
        tokens_keys = list(items_em_comum(tokens, self.lista_do_vocabulario))

        return list(itemgetter(*tokens_keys)(self.dicionario_tokens))

    def decode(self, ids):
        reversed_dict = dict(zip(self.dicionario_tokens.values(), self.dicionario_tokens.keys()))
        return list(itemgetter(*ids)(reversed_dict))


In [8]:
### Criando classe do dataset

class MyDataset():
    def __init__(self, texts: List[str], tokenizer, context_size: int, vocab_size=1000):
        try:
            self.x = np.load("x_" + str(len(texts)) + ".npy", mmap_mode="r", allow_pickle=True)
            self.y = np.load("y_" + str(len(texts)) + ".npy", mmap_mode="r", allow_pickle=True)
            print("Carregando dataset preprocessado")
        except Exception as e:
            # print("Excecao: ", e)
            print("Montando dataset")

            self.x = list()
            self.y = list()

            for text in tqdm(texts):
                tokens_key = tokenizer(text)
                for i in range(len(tokens_key)-context_size):
                    self.x.append(tokens_key[i:i+context_size])
                    self.y.append(tokens_key[i+context_size])

            self.x = np.array(self.x)
            self.y = np.array(self.y)

            np.save("x_" + str(len(texts)) + ".npy", self.x)
            np.save("y_" + str(len(texts)) + ".npy", self.y)

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, idx):
        return torch.tensor(self.x[idx]).long(), torch.tensor(self.y[idx]).long()

Testando Dataset

In [9]:
vocab_size = 3000

dummy_texts = ['Eu gosto de correr', 'Ela gosta muito de comer pizza']

dummy_dataset = MyDataset(texts=dummy_texts, tokenizer=Tokenizador(dummy_texts, tokenize, vocab_size), context_size=3, vocab_size=vocab_size)
dummy_loader = DataLoader(dummy_dataset, batch_size=2, shuffle=False)

assert len(dummy_dataset) == 4
print('Passou no assert de tamanho do dataset')

first_batch_input, first_batch_target = next(iter(dummy_loader))

correct_first_batch_input = torch.LongTensor(
    [[1, 2, 0],
     [5, 6, 7]])

correct_first_batch_target = torch.LongTensor([3,   0, ])

assert torch.equal(first_batch_input, correct_first_batch_input)
print('Passou no assert de input')
assert torch.equal(first_batch_target, correct_first_batch_target)
print('Passou no assert de target')

Carregando dataset preprocessado
Passou no assert de tamanho do dataset
Passou no assert de input
Passou no assert de target


Dados de treino, validação e teste

In [10]:
# Load datasets
context_size = 15

tokenizer = Tokenizador(x_train, tokenize, vocab_size=3000)

training_dataset = MyDataset(texts=x_train, tokenizer=tokenizer, context_size=context_size)
valid_dataset = MyDataset(texts=x_valid, tokenizer=tokenizer, context_size=context_size)
test_dataset = MyDataset(texts=x_test, tokenizer=tokenizer, context_size=context_size)

print(f'training examples: {len(training_dataset)}')
print(f'valid examples: {len(valid_dataset)}')
print(f'test examples: {len(test_dataset)}')

Carregando dataset preprocessado
Carregando dataset preprocessado
Carregando dataset preprocessado
training examples: 4429505
valid examples: 1121530
test examples: 5420688


In [11]:
class LanguageModel(torch.nn.Module):

    def __init__(self, vocab_size, context_size, embedding_dim, hidden_size):
        """
        Implements the Neural Language Model proposed by Bengio et al."

        Args:
            vocab_size (int): Size of the input vocabulary.
            context_size (int): Size of the sequence to consider as context for prediction.
            embedding_dim (int): Dimension of the embedding layer for each word in the context.
            hidden_size (int): Size of the hidden layer.
        """
        
        super().__init__()

        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)

        self.features = torch.nn.Sequential(
            nn.Linear(context_size * embedding_dim, hidden_size),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_size, hidden_size),
            nn.LeakyReLU(0.2),
        )

        self.classifier = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        """
        Args:
            inputs is a LongTensor of shape (batch_size, context_size)
        """

        return self.classifier(
            self.features(
                self.embedding(inputs).view(inputs.shape[0], -1)
            )
        )

Testando se o modelo processa os dados corretamente

In [12]:
model = LanguageModel(
    vocab_size=vocab_size,
    context_size=context_size,
    embedding_dim=64,
    hidden_size=128,
).to(device)

sample_train, _ = next(iter(DataLoader(training_dataset)))
sample_train_gpu = sample_train.to(device)
model(sample_train_gpu).shape

torch.Size([1, 3000])

Verificando a perplexidade

In [13]:
def perplexity(logits, target):
    """
    Computes the perplexity.

    Args:
        logits: a FloatTensor of shape (batch_size, vocab_size)
        target: a LongTensor of shape (batch_size,)

    Returns:
        A float corresponding to the perplexity.
    """
    # Escreva seu código aqui.

    crossentropy =  nn.functional.cross_entropy(logits,target)

    return crossentropy.exp()


n_examples = 1000

sample_train, target_token_ids = next(iter(DataLoader(training_dataset, batch_size=n_examples)))
sample_train_gpu = sample_train.to(device)
target_token_ids = target_token_ids.to(device)
logits = model(sample_train_gpu)

my_perplexity = perplexity(logits=logits, target=target_token_ids)

print(f'my perplexity:              {int(my_perplexity)}')
print(f'correct initial perplexity: {vocab_size}')

assert math.isclose(my_perplexity, vocab_size, abs_tol=2000)
print('Passou o no assert da perplexidade')

my perplexity:              2984
correct initial perplexity: 3000
Passou o no assert da perplexidade


TREINAMENTO

In [14]:
from queue import Queue
from threading import Thread


class DataManager(Thread):
    def __init__(self, data_loader, buffer_size=3, device=torch.device("cpu"), data_type=torch.float32):
        """
This manager intends to load a PyTorch dataloader like from disk into memory,
reducing the acess time. It does not easily overflow memory, because we set a
buffer size limiting how many samples will be loaded at once. Everytime a sample
is consumed by the calling thread, another one is replaced in the
buffer (unless we reach the end of dataloader).

A manger may be called exactly like a dataloader, an it's based in an internal
thread that loads samples into memory in parallel. This is specially useful
when you are training in GPU and processor is almost idle.

        :param data_loader: Base dataloader to load in parallel.
        :param buffer_size: How many samples to keep loaded (caution to not overflow RAM). Default: 3.
        :param device: Torch device to put samples in, like torch.device("cpu") (default). It saves time by transfering in parallel.
        :param data_type: Automatically casts tensor type. Default: torch.float32.
        """
        super().__init__()
        self.buffer_queue = Queue(maxsize=buffer_size)
        self.data_loader = data_loader
        self.buffer_size = buffer_size
        self.device = device
        self.data_type = data_type

        self.dataloader_finished = False

    def run(self):
        """
Runs the internal thread that iterates over the dataloader until fulfilling the
buffer or the end of samples.
        """
        for i, (x, y) in enumerate(self.data_loader):
            # Important to set before put in queue to avoid race condition
            # would happen if trying to get() in next() method before setting this flag
            if i >= len(self) - 1:
                self.dataloader_finished = True

            self.buffer_queue.put([x.to(self.data_type).to(self.device),
                                   y.to(self.data_type).to(self.device)])

    def __iter__(self):
        """
Returns an iterable of itself.

        :return: Iterable around this class.
        """
        self.start()
        self.dataloader_finished = False
        return self

    def __next__(self):
        """
Intended to be used as iterator.

        :return: Next iteration element.
        """
        if self.dataloader_finished is True and self.buffer_queue.empty():
            raise StopIteration()

        return self.buffer_queue.get()

    def __len__(self):
        return len(self.data_loader)

In [15]:
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.cuda.amp import GradScaler, autocast

max_examples = 30_000_000
eval_every_steps = 5000
lr = 3e-5
use_amp = True

model = LanguageModel(
    vocab_size=vocab_size,
    context_size=context_size,
    embedding_dim=64,
    hidden_size=128,
).to(device)
train_loader = DataLoader(training_dataset, batch_size=1024, shuffle=True, num_workers=4)
validation_loader = DataLoader(valid_dataset, batch_size=1024, num_workers=4, )

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = ReduceLROnPlateau(optimizer, 'min', factor=0.9, min_lr=1e-5, patience=15000, threshold=1e-1, verbose=True)
scaler=GradScaler()


def train_step(input_ids, target_ids):
    model.train()
    model.zero_grad()
    with autocast(enabled=use_amp):
        logits = model(input_ids)
        logits = logits.reshape(-1, logits.shape[-1])
        target_ids = target_ids.reshape(-1)
    loss = nn.functional.cross_entropy(logits, target_ids, )
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    return loss.item()


def validation_step(input_ids, target_ids):
    model.eval()
    with autocast(enabled=use_amp):
        logits = model(input_ids)
        logits = logits.reshape(-1, logits.shape[-1])
        target_ids = target_ids.reshape(-1)
        loss = nn.functional.cross_entropy(logits, target_ids,)
    return loss.item()

best_validation_ppl = 9999
train_losses = []
n_examples = 0
step = 0
pbar = tqdm(total=max_examples)
while n_examples < max_examples:
    for train_input_ids, train_target_ids in DataManager(train_loader, device=device, buffer_size=1, data_type=None):
        loss = train_step(train_input_ids.to(device), train_target_ids.to(device))
        train_losses.append(loss)

        # LR scheduler
        scheduler.step(loss)

        if step % eval_every_steps == 0:
            train_ppl = np.exp(np.average(train_losses))

            with torch.no_grad():
                valid_ppl = np.exp(np.average([
                    validation_step(val_input_ids.to(device), val_target_ids.to(device))
                    for val_input_ids, val_target_ids in DataManager(validation_loader, device=device, buffer_size=1, data_type=None)]))
                # Checkpoint to best models found.
                if best_validation_ppl > valid_ppl:
                    # Update the new best perplexity.
                    best_validation_ppl = valid_ppl
                    model.eval()
                    torch.save(model, "best_model.pth")

            print(f'{step} steps; {n_examples} examples so far; train ppl: {train_ppl:.2f}, valid ppl: {valid_ppl:.2f}')
            train_losses = []

        n_examples += len(train_input_ids)  # Increment of batch size
        step += 1
        pbar.update(len(train_input_ids))
        if n_examples >= max_examples:
            break

pbar.close()

# Restore best model (checkpoint) found
model = torch.load("best_model.pth")

  0%|          | 4096/30000000 [00:15<88:07:01, 94.56it/s] 

0 steps; 0 examples so far; train ppl: 3027.90, valid ppl: 3029.91


 17%|█▋        | 5125825/30000000 [02:25<5:21:25, 1289.79it/s]

5000 steps; 5119681 examples so far; train ppl: 390.11, valid ppl: 279.49


 34%|███▍      | 10245506/30000000 [04:47<24:52:50, 220.55it/s]

10000 steps; 10239362 examples so far; train ppl: 247.27, valid ppl: 224.19


 51%|█████     | 15371331/30000000 [06:44<2:01:28, 2007.22it/s]

15000 steps; 15359043 examples so far; train ppl: 209.61, valid ppl: 198.89


 68%|██████▊   | 20486916/30000000 [08:37<1:30:03, 1760.52it/s]

20000 steps; 20478724 examples so far; train ppl: 190.13, valid ppl: 183.58


 85%|████████▌ | 25604549/30000000 [10:12<53:12, 1376.65it/s]  

25000 steps; 25598405 examples so far; train ppl: 176.90, valid ppl: 172.79


 86%|████████▋ | 25942469/30000000 [10:17<00:57, 70885.63it/s]

Epoch 25325: reducing learning rate of group 0 to 2.7000e-05.


30000262it [11:37, 43013.07it/s]                              


Avaliação no dataset de Teste

In [16]:
test_loader = DataLoader(test_dataset, batch_size=64)

with torch.no_grad():
    test_ppl = np.exp(np.average([
        validation_step(input.to(device), target.to(device))
        for input, target in tqdm(test_loader)
    ]))

print(f'test perplexity: {test_ppl}')

100%|██████████| 84699/84699 [04:18<00:00, 327.85it/s]

test perplexity: 171.10386737891344





Testando gerar uma sentença

In [17]:
prompt = 'Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to '
max_output_tokens = 10

for _ in range(max_output_tokens):
    input_ids = tokenizer(text=prompt)
    input_ids_truncated = input_ids[-context_size:]# Usamos apenas os últimos <context_size> tokens como entrada para o modelo.
    logits = model(torch.LongTensor([input_ids_truncated]).to(device))
    # Ao usarmos o argmax, a saída do modelo em cada passo é token de maior probabilidade.
    # Isso se chama decodificação gulosa (greedy decoding).
    predicted_id = torch.argmax(logits).item()
    predicted_word = tokenizer.decode([predicted_id,])
    prompt += predicted_word[0]
    print(prompt)

Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to t
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to tt
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to ttt
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to tttt
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to ttttt
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to tttttt
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to ttttttt
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to tttttttt
Frankenstein tells the story of gifted scientist Victor Frankenstein who succeeds in giving life to ttttttttt
Frankenstein tells the story of gifted scienti