In [1]:
nome = 'Matheus Lindino'
print(f'My name is {nome}')

My name is Matheus Lindino


# Exercise: Language Models with Billions of Parameters and Implementing Decoding Algorithms

In this exercise we will:
1. Evaluate the perplexity of GPT-Neo (1.3B) or GPT-J (6B) on the IMDB test set.
2. Still using GPT-Neo/GPT-J, implement decoding algorithms (sampling, top-k, and top-p) and compare them with the ones provided by HuggingFace's `generate` function. This analysis will be qualitative.
3. Implement a function similar to `torch.multinomial` and check that it behaves like the PyTorch one.

## Installing/Loading the packages

In [2]:
!pip install transformers accelerate bitsandbytes -q

In [20]:
import os
import copy
import random
import numpy as np
import torch
import torch.nn as nn

from collections import Counter
from tqdm.notebook import tqdm
from transformers import GPTNeoForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader, Dataset

import warnings
warnings.filterwarnings("ignore")

In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Device: {device}')

Device: cuda:0


In [5]:
MAX_LENGTH = 128
BATCH_SIZE = 32

# Loading the Model

In [6]:
# For development, we suggest the smaller GPT, with 1.3B parameters.
# Alternative kept from the original cell: device_map="auto", load_in_8bit=True
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B", device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

# On Colab Pro, it is possible to run GPT-J with 6B parameters.
# model = transformers.GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
# tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# Testing generation with a prompt

In [7]:
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(
    input_ids.cuda(),
    do_sample=True,
    temperature=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The unicorn was “the star of the show” of a nearby community, the inhabitants of this remote community were “the best in the world”, and the inhabitants of a nearby town were “the best in the world.”



# Loading the IMDB dataset (test set only)

First, we download the dataset:

In [8]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
File ‘aclImdb.tgz’ already there; not retrieving.



In [9]:
def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts

x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_test = x_test_pos + x_test_neg

print(len(x_test), 'test samples.')

print('First 3 test samples:')
for x in x_test[:3]:
    print(x[:100])

print('Last 3 test samples:')
for x in x_test[-3:]:
    print(x[:100])

25000 test samples.
First 3 test samples:
...the world may never know. (The film that did take the "best animated short" Oscar that year, "Ann
Brilliant work. Marvelous actors dissolve as brave and courageous characters .All unforgettable part
If you "been there" and "done that" you will absolutely love this film. I have and by "there" I mean
Last 3 test samples:
this movie sucks. did anyone notice that the entire movie was shot in like 2 rooms. there are NEVER 
Snake Island is one of those films that, whilst one sits and watches its amazing level of stupidity,
The film's tagline is "You think you know who you are. You have no idea." I reject both the suggeste


In [10]:
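class IMDBDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        super().__init__()
        # Long reviews are split into chunks of at most `max_length` tokens. With
        # stride=1, consecutive chunks overlap by one token, so no token is skipped
        # as a prediction target at a chunk boundary.
        self.data = tokenizer(data, max_length=max_length, truncation=True, padding=True, return_overflowing_tokens=True, stride=1, return_tensors='pt')

    def __getitem__(self, index):
        # Shift by one position: the model reads tokens [0, n-1) and predicts tokens [1, n).
        input_ids = self.data.input_ids[index][:-1]
        attention_mask = self.data.attention_mask[index][:-1]

        labels = self.data.input_ids[index][1:]
        # Ignore padded positions in the loss. The attention mask is used instead of
        # comparing against the pad id because pad and EOS share the same token id.
        labels = labels.masked_fill(self.data.attention_mask[index][1:] == 0, -100)

        return input_ids, attention_mask, labels

    def __len__(self):
        return len(self.data.input_ids)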
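The chunking and shifting done by the dataset can be hard to picture, so here is a minimal pure-Python sketch of the idea. It is a toy re-implementation under the assumption that windows overlap by exactly `stride` tokens; `chunk_with_overlap` and `shift` are hypothetical helper names, not part of the tokenizer API.

```python
def chunk_with_overlap(token_ids, max_length, stride=1):
    """Split token_ids into windows of at most max_length tokens,
    each sharing `stride` tokens with the previous window (toy version)."""
    if max_length <= stride:
        raise ValueError("max_length must be larger than stride")
    step = max_length - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return chunks


def shift(chunk):
    """Return (inputs, labels): the model reads inputs[i] and predicts labels[i]."""
    return chunk[:-1], chunk[1:]


tokens = list(range(10))            # stand-in for real token ids
chunks = chunk_with_overlap(tokens, max_length=4)
targets = [label for chunk in chunks for label in shift(chunk)[1]]

print(chunks)   # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
print(targets)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With a one-token overlap, every token except the very first appears exactly once as a label, which is what makes the per-token loss average well defined.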

In [11]:
tokenizer.pad_token = tokenizer.eos_token
dataset = IMDBDataset(x_test, tokenizer, 15)

In [12]:
test_dataset = IMDBDataset(x_test, tokenizer, max_length=MAX_LENGTH)

test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

In [13]:
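def evaluate(model, dataloader, criterion):
    """Return the mean per-token loss and the next-token accuracy over `dataloader`.

    `criterion` is expected to sum (not average) the loss, so that dividing by the
    number of non-ignored tokens gives a per-token mean.
    """
    running_loss = 0.0
    running_corrects = 0
    n_tokens = 0

    model.eval()
    for input_ids, attention_mask, labels in tqdm(dataloader, total=len(dataloader)):
        input_ids      = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels         = labels.to(device)

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask).logits

        # Count only real targets; ignored positions are labelled -100.
        n_tokens += torch.sum(labels != -100).item()

        # CrossEntropyLoss expects (batch, classes, sequence), hence the transpose.
        running_loss += criterion(outputs.transpose(-2, -1), labels)
        preds = outputs.argmax(dim = -1)
        # A label of -100 never equals a predicted token id, so ignored positions are not counted as correct.
        running_corrects += (preds == labels).sum().item()

    return running_loss/n_tokens, running_corrects/n_tokens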

In [15]:
criterion = nn.CrossEntropyLoss(reduction='sum')
loss, acc = evaluate(model, test_loader, criterion)
print(f'Loss test: {loss:.2f} - Perplexity: {np.exp(loss.cpu()):.2f} - Accuracy: {acc}')

Loss test: 3.46 - Perplexity: 31.85 - Accuracy: 0.3612083432309288
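The perplexity printed above is the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship on a toy input; `perplexity` is a hypothetical helper written for illustration and is not the notebook's `evaluate` function.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs)
    return math.exp(nll / len(token_log_probs))


# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among 4 options, so its perplexity is about 4.
uniform = [math.log(1 / 4)] * 8
print(perplexity(uniform))

# Applying exp to the rounded loss reported above gives roughly 31.8,
# in line with the reported perplexity computed from the unrounded loss.
print(math.exp(3.46))
```

Read this way, a perplexity near 32 means the model is, on average, about as uncertain as if it were choosing uniformly among 32 tokens at each position.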


In [21]:
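def multinomial(weights):
    """Sample one index with probability proportional to the non-negative `weights`.

    Note that the input is a set of weights or probabilities, not raw logits.
    """
    total = sum(weights)
    r = random.uniform(0, total)
    position = 0
    for i, w in enumerate(weights):
        position += w
        if position >= r:
            return i
    # Guard against floating-point round-off leaving r slightly above the running sum.
    return len(weights) - 1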

logits = torch.rand(4, dtype=torch.float)
ex_pytorch = []
ex_function = []

n_samples = 100
for i in range(n_samples):
    ex_pytorch.append(torch.multinomial(logits, num_samples=1)[0].item())
    ex_function.append(multinomial(logits))

ex_pytorch = Counter(ex_pytorch)
ex_function = Counter(ex_function)

print('-'*15, 'Pytorch',  '-'*15)
for i, k in ex_pytorch.items():
    print(f'{i}: {k/n_samples}')

print('-'*15, 'Function',  '-'*15)
for i, k in ex_function.items():
    print(f'{i}: {k/n_samples}')

--------------- Pytorch ---------------
3: 0.43
2: 0.4
1: 0.09
0: 0.08
--------------- Function ---------------
2: 0.56
3: 0.31
1: 0.08
0: 0.05
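With only 100 draws, the two frequency tables above are too noisy to tell whether the samplers agree. A more convincing check, sketched here in pure Python, is to draw many samples and compare the empirical frequencies with the normalized weights. `sample_index` is a hypothetical inverse-CDF sampler written for this illustration; the fixed weights and the seed are assumptions chosen for reproducibility.

```python
import bisect
import itertools
import random


def sample_index(weights, rng=random):
    """Inverse-CDF sampling: draw r in [0, total] and return the first index
    whose cumulative weight reaches r."""
    cumulative = list(itertools.accumulate(weights))
    r = rng.uniform(0, cumulative[-1])
    # min() guards against round-off pushing r past the last cumulative value.
    return min(bisect.bisect_left(cumulative, r), len(weights) - 1)


rng = random.Random(0)              # fixed seed, an assumption for repeatability
weights = [0.1, 0.2, 0.3, 0.4]
n = 20000

counts = [0] * len(weights)
for _ in range(n):
    counts[sample_index(weights, rng)] += 1

total = sum(weights)
freqs = [c / n for c in counts]
for i, (f, w) in enumerate(zip(freqs, weights)):
    print(f"{i}: empirical {f:.3f} vs expected {w / total:.3f}")
```

The same comparison could be run against `torch.multinomial` by increasing `n_samples` in the cell above; the expectation is that both sets of frequencies approach the normalized weights.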


In [29]:
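class Decoder():
    def __init__(self, temperature=1):
        self.temperature = temperature

    def sampling(self, logits):
        # Temperature-scaled softmax. Returns the full distribution; the caller draws from it.
        probs = torch.softmax(logits/self.temperature, dim=-1)
        return probs

    def top_k_filtering(self, logits, k = 1):
        # Sample only among the k most probable tokens (multinomial renormalizes the weights).
        probs = self.sampling(logits)
        top_k = probs.topk(k)
        return top_k.indices[multinomial(top_k.values)].item()

    def top_p_filtering(self, logits, p = 1):
        # Nucleus sampling: keep the smallest set of most probable tokens whose
        # cumulative probability reaches p, then sample among them.
        probs = self.sampling(logits)
        sorted_probs, sorted_idx = probs.sort(descending=True)
        # Number of prefix sums still below p, plus the token that crosses the threshold.
        n_keep = int((sorted_probs.cumsum(dim=-1) < p).sum().item()) + 1
        return sorted_idx[:n_keep][multinomial(sorted_probs[:n_keep])].item()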
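To make the difference between the two filters concrete, here is a small pure-Python sketch that applies them to a toy distribution and returns the renormalized probabilities of the surviving tokens. `top_k_filter` and `top_p_filter` are hypothetical helpers for illustration only; they mirror the idea of the `Decoder` methods but stop before the sampling step.

```python
def top_k_filter(probs, k):
    """Keep the k most probable indices and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable indices whose cumulative
    probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


probs = [0.5, 0.3, 0.15, 0.05]

# top-k always keeps a fixed number of candidates ...
print(top_k_filter(probs, k=2))
# ... while top-p keeps as many as needed to cover probability mass p.
print(top_p_filter(probs, p=0.9))
```

On this toy input, top-k with k=2 keeps indices 0 and 1, whereas top-p with p=0.9 needs index 2 as well, since the first two tokens cover only 0.8 of the mass. The candidate set of top-p therefore grows or shrinks with how peaked the distribution is.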

In [23]:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(
    input_ids.cuda(),
    do_sample=True,
    temperature=0.9,
    max_length=100 + input_ids.shape[-1], 
)
original = tokenizer.batch_decode(gen_tokens)[0]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [25]:
generator = Decoder(temperature=0.9)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

for _ in range(MAX_LENGTH):
    with torch.no_grad():
        logits = model(input_ids.cuda()).logits[-1]
    next_token = multinomial(generator.sampling(logits))
    input_ids = torch.cat([input_ids, torch.LongTensor([next_token])])

scratch = tokenizer.decode(input_ids)

In [26]:
generator = Decoder(temperature=200)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

for _ in range(MAX_LENGTH):
    with torch.no_grad():
        logits = model(input_ids.cuda()).logits[-1]
    next_token = generator.top_k_filtering(logits, k=10)
    input_ids = torch.cat([input_ids, torch.LongTensor([next_token])])

high_temp = tokenizer.decode(input_ids)
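A temperature as high as 200 divides the logits so strongly that the softmax becomes almost uniform, which means the top-k step then picks nearly at random among the k candidates. The sketch below illustrates this on made-up logits; the `softmax` helper and the logit values are assumptions for illustration, not model outputs.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [10.0, 8.0, 5.0, 1.0]       # toy values, not taken from the model

cold = softmax(logits, temperature=1.0)
hot = softmax(logits, temperature=200.0)

print([round(p, 3) for p in cold])   # strongly peaked on the first token
print([round(p, 3) for p in hot])    # close to 0.25 for every token
```

This is why the high-temperature run is expected to produce much less coherent text than the runs with temperature 0.9 or 1, even though it is restricted to the top 10 tokens.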

In [27]:
generator = Decoder(temperature=1)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

for _ in range(MAX_LENGTH):
    with torch.no_grad():
        logits = model(input_ids.cuda()).logits[-1]
    next_token = generator.top_k_filtering(logits, k=5)
    input_ids = torch.cat([input_ids, torch.LongTensor([next_token])])

top_k_ = tokenizer.decode(input_ids)

In [30]:
generator = Decoder(temperature=1)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

for _ in range(MAX_LENGTH):
    with torch.no_grad():
        logits = model(input_ids.cuda()).logits[-1]
    next_token = generator.top_p_filtering(logits, p=0.9)
    input_ids = torch.cat([input_ids, torch.LongTensor([next_token])])

top_p_ = tokenizer.decode(input_ids)

In [33]:
print(f'Original: {original}\n')
print('-'*100)
print(f'Scratch: {scratch}\n')
print('-'*100)
print(f'High Temp (temp = 200): {high_temp}\n')
print('-'*100)
print(f'Top k (k = 5): {top_k_}\n')
print('-'*100)
print(f'Top p (p=0.9): {top_p_}\n')
print('-'*100)

Original: In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The Andean mountain range is known as the "unicorn range," because unicorns are rare inhabitants of the most isolated locations in the subcontinent, and the Andes have the highest concentrations of unicorns.

"Their body language is very well-modulated, they really sound like a well-trained horse," said Fernando A. Barrios, a professor of psychology at UNAM, in an interview with BBC Mundo.

Some of the unicorns were

----------------------------------------------------------------------------------------------------
Scratch: In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect Eng