<a href="https://colab.research.google.com/github/pedrogengo/DLforNLP/blob/main/Pedro_Gengo_Aula_7_Exerc%C3%ADcio_SelfAttention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference notebook

Name: Pedro Gabriel Gengo Lourenço

In [None]:
!pip install neptune-client


## Instructions:

Train a binary classification model on the IMDB dataset (20k/5k training/validation samples) and measure its accuracy.
The model must include a simplified self-attention layer, identical to the one presented on slide 96.

Two implementations of the self-attention layer must be submitted, as shown on slide 100:
1. Using loops (inefficient, but good for learning)
2. Matrix-based

Pretrained embeddings (GloVe) must be used as input to the self-attention layer. Remember to freeze them; otherwise, the model may overfit.

When grading the exercise, we will also pay attention to the efficiency/speed of the implementations.

Hints:
- The difficult part of this exercise is implementing self-attention in matrix form using minibatches. To handle variable-length examples, truncate them and apply padding.

- Avoid any loops in the matrix implementation, as they would make it very inefficient.

## Defining the parameters

In [None]:
params = {
    'vocabulary_size': 400000,
    'padding_idx': 400001,
    'max_length': 200,
}

# Fixing the seed

In [None]:
import random
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np

import neptune.new as neptune

In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7f8f49c4cc30>

## Preparing the data

First, we download the dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

--2021-09-30 11:22:18--  http://files.fast.ai/data/aclImdb.tgz
Resolving files.fast.ai (files.fast.ai)... 104.26.2.19, 172.67.69.159, 104.26.3.19, ...
Connecting to files.fast.ai (files.fast.ai)|104.26.2.19|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://files.fast.ai/data/aclImdb.tgz [following]
--2021-09-30 11:22:18--  https://files.fast.ai/data/aclImdb.tgz
Connecting to files.fast.ai (files.fast.ai)|104.26.2.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145982645 (139M) [application/x-gtar-compressed]
Saving to: ‘aclImdb.tgz’


2021-09-30 11:22:34 (8.87 MB/s) - ‘aclImdb.tgz’ saved [145982645/145982645]



## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples).

In [None]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training set before making the train/validation split.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'development samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
# The labels must come from y_valid (not y_test) so they match x_valid.
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 development samples.
25000 test samples.
First 3 training samples:
False This, and Immoral Tales, both left a bad taste in my mouth. It seems to me that Borowczyk is disgust
False Phew--I don't what to say. This is a film that could be really good a with a bunch of stoned viewers
True I'm giving ten out of ten it's one of the best movies ever. Absolutely smashed, stunned and dazed by
Last 3 training samples:
False I'm not looking for quality; I'm just trying to get through the 74 famous video nasties that were ba
True Best animated movie ever made. This film explores not only the vast world of modern animation with a
True Young Mr.Lincoln is a poetic,beautiful film that captures the myth of one of the most revered figure
First 3 validation samples:
True I remember seeing this movie a long time ago, back then even though it didn't have any special effec
True I love Claire Danes, and Kate Beckinsale looks amazingly immature in her role
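
The shuffle-then-slice pattern used for the split keeps every text paired with its label. A minimal sketch on toy data follows; the names `texts`, `labels`, and `n_valid` are illustrative and do not appear in the notebook.

```python
import random

texts = [f"review {i}" for i in range(10)]
labels = [i % 2 == 0 for i in range(10)]   # toy labels: even-numbered reviews are positive
n_valid = 3                                # hypothetical validation size

rng = random.Random(123)
pairs = list(zip(texts, labels))
rng.shuffle(pairs)                         # shuffle the pairs so texts and labels stay aligned
texts_shuf, labels_shuf = zip(*pairs)

# The last n_valid items become the validation split, the rest stay in training.
train_texts, valid_texts = texts_shuf[:-n_valid], texts_shuf[-n_valid:]
train_labels, valid_labels = labels_shuf[:-n_valid], labels_shuf[-n_valid:]

print(len(train_texts), len(valid_texts))
```

Shuffling the zipped pairs, rather than the two lists separately, is what guarantees that a review never ends up with another review's label.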

# Loading the GloVe embeddings

In [None]:
!wget -nc http://nlp.stanford.edu/data/glove.6B.zip
!unzip -o glove.6B.zip -d glove_dir

--2021-09-30 11:22:51--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-09-30 11:22:51--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-09-30 11:22:52--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
from torchtext.vocab import GloVe
glove_vectors = GloVe(name='6B', dim=300, cache='./glove_dir')

100%|█████████▉| 399999/400000 [00:52<00:00, 7669.47it/s]


In [None]:
print(glove_vectors.vectors.shape)
print('First 20 words and their indices:', list(glove_vectors.stoi.items())[:20])

torch.Size([400000, 300])
First 20 words and their indices: [('the', 0), (',', 1), ('.', 2), ('of', 3), ('to', 4), ('and', 5), ('in', 6), ('a', 7), ('"', 8), ("'s", 9), ('for', 10), ('-', 11), ('that', 12), ('on', 13), ('is', 14), ('was', 15), ('said', 16), ('with', 17), ('he', 18), ('as', 19)]


In [None]:
vocab = glove_vectors.stoi
vocab['<UNK>'] = params['vocabulary_size'] # The last row is for the unknown token.

# We create a random vector for the unknown token
unk_vector = torch.FloatTensor(1, glove_vectors.vectors.shape[1]).uniform_(-0.5, 0.5)

# We create a vector of zeros for the pad token
pad_vector = torch.zeros(1, glove_vectors.vectors.shape[1])

# And add them to the embeddings matrix.
embeddings = torch.cat((glove_vectors.vectors, unk_vector, pad_vector), dim=0)

print(f'Total words: {len(vocab)}')
print(f'embeddings.shape: {embeddings.shape}')

Total words: 400001
embeddings.shape: torch.Size([400002, 300])
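
The reported shapes follow from appending two rows to the pretrained table: the `<UNK>` row lands at the first index after the pretrained block, and the PAD row directly after it. A small sketch with a made-up 3-word table shows the index arithmetic (plain lists are used here instead of tensors, and all values are invented for illustration):

```python
# Toy stand-in for the GloVe table: 3 words, 2 dimensions (values are made up).
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
stoi = {"the": 0, "of": 1, "to": 2}

unk_idx = len(pretrained)        # first row after the pretrained block
pad_idx = len(pretrained) + 1    # second appended row

stoi["<UNK>"] = unk_idx
# Append the UNK row (arbitrary values) and the PAD row (zeros), in that order.
table = pretrained + [[0.25, -0.25]] + [[0.0, 0.0]]

print(unk_idx, pad_idx, len(table))
```

With the real table of 400000 rows, the same arithmetic gives `<UNK>` at index 400000, PAD at index 400001, and 400002 rows in total, which matches `params` and the shape printed above.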


# Defining the tokenizer

In [None]:
import collections
import re


def tokenize(text):
    return [token.lower() for token in re.compile(r'\w+').findall(text)]


def to_token_ids(text, vocab, max_length, padding_idx):
    tokens = tokenize(text)[:max_length]  # Truncating.
    token_ids = []
    for token in tokens:
        # We use the id of the "<UNK>" token if we don't find it in the vocabulary.
        token_id = vocab.get(token, vocab['<UNK>'])
        token_ids.append(token_id)

    # Adding PAD tokens, if necessary.
    token_ids += [padding_idx] * max(0, max_length - len(token_ids))
    return token_ids
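
To make the truncation, padding, and `<UNK>` handling concrete, here is a compact, self-contained restatement of the two functions applied to a toy vocabulary. The vocabulary and the example strings are illustrative assumptions.

```python
import re

def tokenize(text):
    # Lowercase alphanumeric runs; punctuation and apostrophes act as separators.
    return [t.lower() for t in re.findall(r"\w+", text)]

def to_token_ids(text, vocab, max_length, padding_idx):
    # Truncate to max_length, map unknown words to <UNK>, then pad on the right.
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokenize(text)[:max_length]]
    return ids + [padding_idx] * (max_length - len(ids))

toy_vocab = {"a": 0, "b": 1, "c": 2, "<UNK>": 3}
for text in ["a", "a c b", "A z", "don't"]:
    print(repr(text), to_token_ids(text, toy_vocab, max_length=2, padding_idx=4))
```

Note that `\w+` splits on apostrophes, so a word such as `don't` becomes the two tokens `don` and `t`.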

# Defining the attention layer

## With loops

In [None]:
class SelfAttentionLayerLoop(torch.nn.Module):

    def __init__(self, embeddings, padding_idx):
        super(SelfAttentionLayerLoop, self).__init__()
        self.embeddings = torch.nn.Embedding.from_pretrained(embeddings, padding_idx=padding_idx)
        self.padding_idx = padding_idx
            
    def forward(self, batch_token_ids):
        batch_emb_tokens = self.embeddings(batch_token_ids)
        batch_mean_embeddings = []
        for emb_tokens, ids in zip(batch_emb_tokens, batch_token_ids):  # Iterate over each example in the batch
          Q_means = []

          for emb_token_Q in emb_tokens[ids != self.padding_idx]:  # Only non-padding tokens are used as queries
            scores = []
            for i, emb_token_K in enumerate(emb_tokens):
              if ids[i] == self.padding_idx:
                score_not_norm = -float("Inf")  # PAD keys receive zero weight after the softmax
              else:
                score_not_norm = torch.matmul(emb_token_Q, emb_token_K)
              scores.append(score_not_norm)
            scores = torch.tensor(scores)
            scores_norm = torch.nn.functional.softmax(scores, dim=0)

            emb_sum = torch.zeros_like(emb_tokens[0])  # Accumulator for the weighted sum of value vectors (the contextualized token)
            for emb_token_V, score_norm in zip(emb_tokens, scores_norm):
              emb_sum += emb_token_V * score_norm

            Q_means.append(emb_sum)
          # Average the contextualized vectors over the non-padding tokens
          batch_mean_embeddings.append(torch.stack(Q_means).sum(dim=0) / len(Q_means))
        return torch.vstack(batch_mean_embeddings)
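
The loop version masks PAD keys by giving them a score of negative infinity. The plain-Python sketch below shows why that works: `exp(-inf)` is exactly zero, so the masked position gets zero weight and the remaining weights still sum to one. The function name and scores are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]     # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Two real keys followed by one masked (PAD) key.
weights = softmax([1.0, 3.0, float("-inf")])
print(weights)
```

This only holds when at least one score is finite. If every score is negative infinity, the normalization is undefined, which is why the loop above never uses a PAD token as a query.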

## Without loops (matrix multiplication)

In [None]:
class SelfAttentionLayer(torch.nn.Module):

    def __init__(self, embeddings, padding_idx):
        super(SelfAttentionLayer, self).__init__()
        self.embeddings = torch.nn.Embedding.from_pretrained(embeddings, padding_idx=padding_idx)
        self.padding_idx = padding_idx
            
    def forward(self, batch_token_ids):
        batch_emb_tokens = self.embeddings(batch_token_ids)
        pad_ids = torch.nonzero(batch_token_ids == self.padding_idx)  # (batch, position) pairs where the PAD tokens are
        scores = torch.bmm(batch_emb_tokens, batch_emb_tokens.transpose(2, 1))
        scores[pad_ids[:, 0], :, pad_ids[:, 1]] = -float("Inf")  # Set the scores of PAD keys to -Inf
        scores[pad_ids[:, 0], pad_ids[:, 1], :] = -float("Inf")  # Set the scores of PAD queries to -Inf
        probs = torch.nn.functional.softmax(scores, dim=2)
        lengths = torch.logical_not(torch.isnan(probs)).sum(dim=1)[:, 0]  # Length of each sentence in the batch (number of non-NaN query rows)
        probs[torch.isnan(probs)] = 0.  # Zero out the NaN probabilities produced by PAD-token queries
        new_embs = torch.matmul(probs, batch_emb_tokens).sum(dim=1)
        emb_means = new_embs / lengths.reshape(batch_token_ids.shape[0], -1)  # Reshape so the division broadcasts over the embedding dimension
        return emb_means
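
The layer above masks both PAD keys and PAD queries with negative infinity, then cleans up the resulting NaN rows. An alternative sketch, not the notebook's method, masks only the keys and removes PAD queries with a boolean mask afterwards, so no NaN appears as long as every sequence has at least one real token. The function name `masked_self_attention_mean` and the `mask` argument are assumptions introduced for illustration.

```python
import torch

def masked_self_attention_mean(emb, mask):
    """emb: (B, L, D) float tensor; mask: (B, L) bool tensor, True for real tokens."""
    scores = torch.bmm(emb, emb.transpose(1, 2))                    # (B, L, L) dot products
    scores = scores.masked_fill(~mask[:, None, :], float("-inf"))   # hide PAD keys only
    probs = torch.softmax(scores, dim=-1)                           # each row sums to 1
    contextual = torch.bmm(probs, emb)                              # (B, L, D)
    contextual = contextual * mask[:, :, None]                      # zero out PAD queries
    lengths = mask.sum(dim=1, keepdim=True).clamp(min=1)            # real tokens per sequence
    return contextual.sum(dim=1) / lengths

# Toy table: rows 0-3 are words, row 4 is the all-zero PAD vector.
table = torch.tensor([[0., 1.], [2., 3.], [4., 5.], [6., 7.], [0., 0.]])
ids = torch.tensor([[0, 4], [0, 1]])
out = masked_self_attention_mean(table[ids], ids != 4)
print(out)
```

Using `masked_fill` also avoids modifying the score tensor in place through advanced indexing. A sequence made up entirely of padding would still produce NaN and would need separate handling.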

## Testing the implementation with "fake" embeddings

In [None]:
fake_vocab = {
    'a': 0,
    'b': 1,
    'c': 2,
    '<UNK>': 3 
}

fake_embeddings = torch.arange(0, 2 * len(fake_vocab)).reshape(len(fake_vocab), 2).float()
pad_vector = torch.zeros(1, 2)
fake_embeddings = torch.cat((fake_embeddings, pad_vector), dim=0)

fake_examples = [
    'a', # Testing PAD
    'a b',
    'a c b', # Testing truncation
    'a z', # Testing <UNK>
    ]

print(f'Total words: {len(fake_vocab)}')
print(f'embeddings.shape: {fake_embeddings.shape}')

Total words: 4
embeddings.shape: torch.Size([5, 2])


In [None]:
fake_embeddings

tensor([[0., 1.],
        [2., 3.],
        [4., 5.],
        [6., 7.],
        [0., 0.]])

In [None]:
self_attention_layer = SelfAttentionLayer(
    embeddings=fake_embeddings,
    padding_idx=4)
self_attention_layer_loop = SelfAttentionLayerLoop(
    embeddings=fake_embeddings,
    padding_idx=4)

batch_token_ids = []
for example in fake_examples:
    token_ids = to_token_ids(
        text=example,
        vocab=fake_vocab,
        max_length=2,
        padding_idx=4)
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)
my_output_loop = self_attention_layer_loop(batch_token_ids)

In [None]:
target_output = torch.FloatTensor([
    [0.00000000, 1.00000000],
    [1.88075161, 2.88075161],
    [3.96402740, 4.96402740],
    [5.99258232, 6.99258232]])

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)
assert torch.allclose(my_output_loop, target_output, atol=1e-6)
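
The second row of `target_output` can be reproduced by hand for the example `'a b'`, using the fake embeddings a = [0, 1] and b = [2, 3]. The following plain-Python sketch assumes the simplified layer implemented above: raw dot-product scores, a softmax over the keys, and a mean over the contextualized tokens. The helper names are illustrative.

```python
import math

a, b = [0.0, 1.0], [2.0, 3.0]
tokens = [a, b]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def attend(query, keys):
    # Softmax over the dot-product scores, then a weighted sum of the keys (used as values).
    scores = [dot(query, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(len(query))]

contextual = [attend(q, tokens) for q in tokens]     # one contextualized vector per token
mean = [sum(c[d] for c in contextual) / len(contextual) for d in range(2)]
print(mean)
```

The result is approximately [1.88075, 2.88075], in agreement with the second row of `target_output` to within float32 precision.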

## Testing the implementation with 8 examples from the IMDB dataset

In [None]:
examples = [
    "THE TEMP (1993) didn't do much theatrical business, but here's the direct-to-video rip-off you didn't want, anyway! Ellen Bradford (Mel Harris) is the new woman at Millennium Investments, a high scale brokerage firm, who starts getting helpful hints from wide-eyed secretary Deidre (Sheila Kelley). Deidre turns out to be an ambitious daddy's girl who will stop at nothing to move up the corporate ladder, including screwing a top broker she can't stand and murdering anyone who gets on her bad side. She digs up skeletons in Ellen's closet, tries to cause problems with her husband (Barry Bostwick), kills while making it look like she is responsible, kidnaps her daughter and tries to get her to embezzle money from the company.<br /><br />Harris and Kelley deliver competent performances, the supporting cast is alright and it's reasonably well put-together, but that doesn't fully compensate for a script that travels down a well-worn path and offers few surprises.",
    "Sondra Locke stinks in this film, but then she was an awful 'actress' anyway. Unfortunately, she drags everyone else (including then =real life boyfriend Clint Eastwood down the drain with her. But what was Clint Eastwood thinking when he agreed to star in this one? One read of the script should have told him that this one was going to be a real snorer. It's an exceptionally weak story, basically no story or plot at all. Add in bored, poor acting, even from the normally good Eastwood. There's absolutely no action except a couple arguments and as far as I was concerned, this film ranks up at the top of the heap of natural sleep enhancers. Wow! Could a film BE any more boring? I think watching paint dry or the grass grow might be more fun. A real stinker. Don't bother with this one.",
    "Judy Davis shows us here why she is one of Australia's most respected and loved actors - her portrayal of a lonely, directionless nomad is first-rate. A teenaged Claudia Karvan also gives us a glimpse of what would make her one of this country's most popular actors in years to come, with future roles in THE BIG STEAL, THE HEARTBREAK KID, DATING THE ENEMY, RISK and the acclaimed TV series THE SECRET LIFE OF US. (Incidentally, Karvan, as a child, was a young girl whose toy Panda was stolen outside a chemist's shop in the 1983 drama GOING DOWN with Tracey Mann.) If this films comes your way, make sure you see it!! Rating: 79/100. See also: HOTEL SORRENTO, RADIANCE, VACANT POSSESSION, LANTANA.",
    'New York playwright Michael Caine (as Sidney Bruhl) is 46-years-old and fading fast; as the film opens, Mr. Caine\'s latest play flops on Broadway. TV reviewers poke fun at Caine, and he gets drunk. Passing out on the Long Island Railroad lands Caine in Montauk, instead of his residence in East Hampton. Finally arriving home, Caine is comforted by tightly-attired wife Dyan Cannon (as Myra), an unfortunately high-strung heart patient. There, Caine and Ms. Cannon discuss a new play called "Deathtrap", written by hunky young Christopher Reeve (as Clifford "Cliff" Anderson), one of Caine\'s former students. The couple believe Mr. Reeve\'s "Deathtrap" is the hit needed to revive Caine\'s career.<br /><br />"The Trap Is Set\x85 For A Wickedly Funny Who\'ll-Do-It." <br /><br />Directed by Sidney Lumet, Ira Levin\'s long-running Broadway hit doesn\'t stray too far from its stage origin. The cast is enjoyable and the story\'s twists are still engrossing. One thing that did not work (for me) was the curtain call ending; surely, it played better on stage. "Deathtrap" is a fun film to watch again; the performances are dead on - but, in hindsight, the greeting Reeve gives Caine at the East Hampton train station should have been simplified to a smiling "Hello." The location isn\'t really East Hampton, but the windmill and pond look similar. And, the much ballyhooed love scene is shockingly tepid. But, the play was so good, "even a gifted director couldn\'t ruin it." And, Mr. Lumet doesn\'t disappoint.<br /><br />******** Deathtrap (3/19/82) Sidney Lumet ~ Michael Caine, Christopher Reeve, Dyan Cannon, Irene Worth',
    'Students often ask me why I choose this version of Othello. Shakespeare\'s text is strongly truncated and the film contains material which earned it an "R" rating.<br /><br />I have several reasons for using this production: First, I had not seen a depiction of the Moor that actually made me sympathetic to Othello until I saw Fishburne play him. I saw James Earl Jones and Christopher Plummer play Othello and Iago on Broadway, and it was wonderful. Plummer\'s energy was especially noticeable. But in spite of Jone\'s incredible presence both physically and vocally, the character he played just seemed too passive to illicit from me a complete emotional purgation in the Aristotelian sense. Jones, in fact, affirmed what I felt when in an interview he noted that he had played Othello as passive--seeing Iago as basically doing him over. Unfortunately this sapped my grief for the character destruction. Thus, I felt sympathy for Jone\'s Moor but not the horror over his corruption by an evil man. In contrast, Fishburne\'s Othello is a strong and vigorous figure familiar with taking action. Thus, Iago\'s temptation to actively deal with what is presented to Othello as his wife\'s unfaithfulness is a perversion of the general\'s positive quality to be active not passive.1 The horror of the story is that this good quality in Othello becomes perverted. Fishburne\'s depiction is therefore classically tragic.<br /><br />Second, Fishburne is the first black actor to play Othello in a film. Both Orsen Wells and Anthony Hopkins did fine film versions, but they were white men in black face.2 Why is this important? Why should a Black actor be the Black man on the stage?3 Certainly in Shakespeare\'s day they used black face just as they used boys to make girls. Perhaps then, the reason is the same. Female actors bring a special quality to female roles on the Shakespearian stage because they understand best what Shakespeare\'s genius was trying to present. '
    'A gifted black actor should play the moor because his experience in a white dominated culture is vital to understanding what Shakespeare\'s genius recognized: the pain of being marginalized because of race. An important theme in Othello is isolation caused by racism. Although it is a mistake to insert American racism into a Shakespearian play, there can be little doubt that racism is still working among the characters. Many, including Desdimona\'s father, think that a union between a Venetian white Christian woman and a North African black Christian man is UNNATURAL.<br /><br />Third, Shakespeare was never G rated. He never has been. His stage productions were always typified by violence and strong language. But Shakespeare\'s genius uses these elements not as sensationialism but for artistic honesty.',
    'Roeg has done some great movies, but this a turkey. It has a feel of a play written by an untalented high-school student for his class assignment. The set decoration is appealing in a somewhat surrealistic way, but the actual story is insufferable hokum.',
    "<br /><br />What is left of Planet Earth is populated by a few poor and starving rag-tag survivors. They must eat bugs and insects, or whatever, after a poison war, or something, has nearly wiped out all human civilization. In these dark times, one of the few people on Earth still able to live in comfort, we will call him the All Knowing Big Boss, has a great quest to prevent some secret spore seeds from being released into the air. It seems that the All Knowing Big Boss is the last person on Earth that knows that these spores even exist. The spores are located far away from any living soul, and they are highly protected by many layers of deadly defense systems. <br /><br />The All Knowing Big Boss wants the secret spores to remain in their secret protected containers. So, he makes a plan to send in a macho action team to remove the spore containers from all of the protective systems and secret location. Sending people to the location of secret spores makes them no longer a secret. Sending people to disable all of the protective systems makes it possible for the spores to be easily released into the air. How about letting sleeping dogs lie?! <br /><br />The one pleasant feature of ENCRYPT is the radiant and elegant Vivian Wu. As the unremarkable macho action team members drop off with mechanically paced predictable timing, engaging Vivian Wu's charm makes acceptable the plot idea of her old employer wanting her so much. She is an object of love, an object of desire -- a very believable concept!<br /><br />Fans of Vivian Wu may want to check out an outstanding B-movie she is in from a couple years back called DINNER RUSH. DINNER RUSH is highly recommended. ENCRYPT is not.",
    "So the other night I decided to watch Tales from the Hollywood Hills: Natica Jackson. Or Power, Passion, Murder as it is called in Holland. When I bought the film I noticed that Michelle Pfeiffer was starring in it and I thought that had to say something about the quality. Unfortunately, it didn't.<br /><br />1) The plot of the film is really confusing. There are two story lines running simultaneously during the film. Only they have nothing in common. Throughout the entire movie I was waiting for the moment these two story lines would come together so the plot would be clear to me. But it still hasn't.<br /><br />2) The title of the film says the film will be about Natica Jackson. Well it is, sometimes. Like said the film covers two different stories and the part about Natica Jackson is the shortest. So another title for this movie would not be a wrong choice.<br /><br />To conclude my story, I really recommend that you leave this movie where it belongs, on the shelf in the store on a place nobody can see it. By doing this you won't waste 90 minutes of your life, as I did."         
]

In [None]:
self_attention_layer = SelfAttentionLayer(
    embeddings=embeddings,
    padding_idx=params['padding_idx'])

self_attention_layer_loop = SelfAttentionLayerLoop(
    embeddings=embeddings,
    padding_idx=params['padding_idx'])

batch_token_ids = []
for example in examples:
    token_ids = to_token_ids(
        text=example,
        vocab=vocab,
        max_length=params['max_length'],
        padding_idx=params['padding_idx'])
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)
my_output_loop = self_attention_layer_loop(batch_token_ids)

We download the expected tensor and compare it with our output.

In [None]:
!gsutil cp gs://neuralresearcher_data/unicamp/ia376e_2021s2/aula7/target_tensor.pt target_tensor.pt

Copying gs://neuralresearcher_data/unicamp/ia376e_2021s2/aula7/target_tensor.pt...
- [1 files][ 10.1 KiB/ 10.1 KiB]                                                
Operation completed over 1 objects/10.1 KiB.                                     


In [None]:
target_output = torch.load('target_tensor.pt')

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)
assert torch.allclose(my_output_loop, target_output, atol=1e-6)

# Classifier

Unlike in the previous exercises, here we use the _self-attention_ layer implemented above to perform binary classification of movie reviews as positive or negative.

### Creating the Dataset class

In [None]:
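class Dataset(torch.utils.data.Dataset):  # Fully qualified base class, so re-running this cell does not make the class inherit from itself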

  def __init__(self, x, y, tokenizer, vocab, max_length, padding_idx):
    self.x = x
    self.y = y
    self.tokenizer = tokenizer
    self.vocab = vocab
    self.max_length = max_length
    self.padding_idx = padding_idx
  
  def __len__(self):
    return len(self.x)

  def __getitem__(self, idx):
    x = self.tokenizer(self.x[idx], self.vocab, self.max_length, self.padding_idx)
    return torch.tensor(x).long(), torch.tensor(self.y[idx]).float()

## Training, validation, and test loops

In [None]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
   print(torch.cuda.get_device_name(dev))
else: 
   dev = "cpu" 
print(dev)
device = torch.device(dev)

Tesla K80
cuda:0


In [None]:
def train(model, train, valid, criterion, optimizer, filename_save, n_epochs=10, run=None, params=None):
  
  best_valid_loss = 10e9
  best_epoch = 0
  train_losses, valid_losses = [], []
  if run:
    run['parameters'] = params
  for i in range(n_epochs):
    accumulated_loss = 0
    model.train()
    for x_train, y_train in train:
      x_train = x_train.to(device)
      y_train = y_train.to(device).reshape(-1, 1)
      outputs = model(x_train)
      batch_loss = criterion(outputs, y_train)

      optimizer.zero_grad()
      batch_loss.backward()
      optimizer.step()
      accumulated_loss += batch_loss.item()

    train_loss = accumulated_loss / len(train.dataset)
    train_losses.append(train_loss)

    # Validation loop, run once per epoch.
    accumulated_loss = 0
    accumulated_accuracy = 0
    model.eval()
    with torch.no_grad():
        for x_valid, y_valid in valid:
            x_valid = x_valid.to(device)
            y_valid = y_valid.to(device).reshape(-1, 1)

            # forward pass
            outputs = model(x_valid)

            # compute the loss
            batch_loss = criterion(outputs, y_valid)
            preds = outputs > 0.5
            # preds = outputs.argmax(dim=1)

            # count the correct predictions
            batch_accuracy = (preds == y_valid).sum()
            # .item() converts to Python numbers, so no tensors are kept in the history lists
            accumulated_loss += batch_loss.item()
            accumulated_accuracy += batch_accuracy.item()

    valid_loss = accumulated_loss / len(valid.dataset)
    valid_losses.append(valid_loss)

    valid_acc = accumulated_accuracy / len(valid.dataset)

    print(f'Época: {i:d}/{n_epochs - 1:d} Train Loss: {train_loss:.6f} Valid Loss: {valid_loss:.6f} Valid Acc: {valid_acc:.3f}')

    if run:
      run[f"{filename_save}_valid/loss"].log(valid_loss)
      run[f"{filename_save}_valid/acc"].log(valid_acc)
      run[f"{filename_save}_train/loss"].log(train_loss)


    # Save the best model according to the validation loss
    if valid_loss < best_valid_loss:
        torch.save(model.state_dict(), filename_save + '.pt')
        best_valid_loss = valid_loss
        best_epoch = i
        print('best model')

  return model, train_losses, valid_losses
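
The checkpointing rule in `train` keeps the weights from the epoch with the lowest validation loss seen so far. The sketch below replays that rule in plain Python on the first 13 validation losses logged by the test experiment later in this notebook. The variable names are illustrative.

```python
# Validation losses for epochs 0-12, copied from the test experiment log.
valid_losses = [0.015148, 0.013774, 0.013287, 0.013294, 0.013620, 0.013682,
                0.013137, 0.012699, 0.012704, 0.013087, 0.013147, 0.012909, 0.013098]

best_loss = float("inf")
best_epochs = []
for epoch, loss in enumerate(valid_losses):
    if loss < best_loss:          # same comparison as in the training loop
        best_loss = loss
        best_epochs.append(epoch) # the real loop would save a checkpoint here

print(best_epochs, best_loss)
```

The epochs collected in `best_epochs` are 0, 1, 2, 6, and 7, the same epochs after which the log prints `best model`.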

In [None]:
def predict(model, state_dict, test, run=None):
  accumulated_accuracy = 0
  model.load_state_dict(torch.load(state_dict + '.pt'))
  model.eval()
  with torch.no_grad():
      for x_test, y_test in test:
          x_test = x_test.to(device)
          y_test = y_test.to(device).reshape(-1,1)

          # forward pass
          outputs = model(x_test)

          preds = outputs > 0.5
          # preds = outputs.argmax(dim=1)

          # count the correct predictions
          batch_accuracy = (preds == y_test).sum()
          accumulated_accuracy += batch_accuracy.item()

  test_acc = accumulated_accuracy / len(test.dataset)
  test_acc *= 100
  print('*' * 40)
  print(f'Accuracy: {test_acc:.3f} %')
  print('*' * 40)

  if run:
    run['results'] = test_acc

## Network definition

I will use a network consisting of an attention layer followed by two linear layers.

In [None]:
class Classifier(torch.nn.Module):

  def __init__(self, embeddings, padding_idx, size_lin1):
    super(Classifier, self).__init__()
    self.self_attention = SelfAttentionLayer(embeddings, padding_idx)
    self.lin1 = torch.nn.Linear(embeddings.shape[1], size_lin1)
    self.lin2 = torch.nn.Linear(size_lin1, 1)
    self.output = torch.nn.Sigmoid()

  
  def forward(self, x):
    x = self.self_attention(x)
    x = self.lin1(x)
    x = torch.nn.functional.relu(x)
    x = self.lin2(x)
    x = self.output(x)
    return x

## Test experiment

Using only a small amount of data, I will run an experiment to validate the training and validation flow.

In [None]:
learning_rate = 0.01
n_epochs = 300
batch_size = 50
hidden_size = 150
filename = "self_attention"

hprams = {"learning_rate": learning_rate,
          "batch_size": batch_size,
          "hidden_size": hidden_size
          }

In [None]:
dataset_train = Dataset(x_train[:50], y_train[:50], to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_valid = Dataset(x_valid[:50], y_valid[:50], to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_test = Dataset(x_test, y_test, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])

dataloader_train = DataLoader(dataset_train, batch_size=hprams["batch_size"], shuffle=True)
dataloader_valid = DataLoader(dataset_valid, batch_size=hprams["batch_size"], shuffle=False)
dataloader_test = DataLoader(dataset_test, batch_size=hprams["batch_size"], shuffle=False)

In [None]:
cls = Classifier(embeddings, params['padding_idx'], hprams["hidden_size"])
cls.to(device)

Classifier(
  (self_attention): SelfAttentionLayer(
    (embeddings): Embedding(400002, 300, padding_idx=400001)
  )
  (lin1): Linear(in_features=300, out_features=150, bias=True)
  (lin2): Linear(in_features=150, out_features=1, bias=True)
  (output): Sigmoid()
)

In [None]:
# criterion = torch.nn.CrossEntropyLoss()
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(cls.parameters(), lr=hprams["learning_rate"])

In [None]:
_, train_losses_bow, valid_losses_bow = train(cls, dataloader_train, dataloader_valid, criterion,
          optimizer, filename, n_epochs=n_epochs)

Época: 0/299 Train Loss: 0.013848 Valid Loss: 0.015148 Valid Acc: 0.400
best model
Época: 1/299 Train Loss: 0.013766 Valid Loss: 0.013774 Valid Acc: 0.520
best model
Época: 2/299 Train Loss: 0.013324 Valid Loss: 0.013287 Valid Acc: 0.700
best model
Época: 3/299 Train Loss: 0.013136 Valid Loss: 0.013294 Valid Acc: 0.600
Época: 4/299 Train Loss: 0.012693 Valid Loss: 0.013620 Valid Acc: 0.600
Época: 5/299 Train Loss: 0.012276 Valid Loss: 0.013682 Valid Acc: 0.580
Época: 6/299 Train Loss: 0.011854 Valid Loss: 0.013137 Valid Acc: 0.560
best model
Época: 7/299 Train Loss: 0.011270 Valid Loss: 0.012699 Valid Acc: 0.620
best model
Época: 8/299 Train Loss: 0.010783 Valid Loss: 0.012704 Valid Acc: 0.640
Época: 9/299 Train Loss: 0.010191 Valid Loss: 0.013087 Valid Acc: 0.560
Época: 10/299 Train Loss: 0.009659 Valid Loss: 0.013147 Valid Acc: 0.600
Época: 11/299 Train Loss: 0.009132 Valid Loss: 0.012909 Valid Acc: 0.620
Época: 12/299 Train Loss: 0.008622 Valid Loss: 0.013098 Valid Acc: 0.620
Época:
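The logged losses start near 0.0138, far below the ln 2 ≈ 0.693 that binary cross-entropy gives when every prediction is 0.5. One possible explanation, assuming that `train` (defined outside this section) sums the per-batch mean losses and divides by the number of samples, is that the logged values are scaled down by the batch size. The check below assumes a batch size of 50; neither assumption is confirmed by this section.

```python
import math

batch_size = 50                     # assumed batch size
chance_loss = math.log(2)           # BCE when the model outputs 0.5 for every sample
scaled = chance_loss / batch_size   # expected logged value under the assumption above

print(f"{chance_loss:.4f} -> {scaled:.6f}")   # 0.6931 -> 0.013863

first_logged = 0.013848             # first train loss in the log above
rescaled = first_logged * batch_size
print(f"{rescaled:.4f}")            # 0.6924
```

Under this reading, the first logged loss corresponds to a per-sample loss of about 0.692, which is what an untrained binary classifier would be expected to produce.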

In [None]:
# Sanity check: accuracy on the 50-sample training subset (the model should be able to overfit it)
correct = 0
cls.eval()
with torch.no_grad():
    # x_batch / y_batch avoid overwriting the global x_test / y_test used to build dataset_test
    for x_batch, y_batch in dataloader_train:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device).reshape(-1, 1)

        # forward pass
        outputs = cls(x_batch)

        # threshold the sigmoid output to get the predicted class
        preds = outputs > 0.5

        # count the correct predictions in this batch
        correct += (preds == y_batch).sum().item()

train_acc = 100 * correct / len(dataloader_train.dataset)
print('*' * 40)
print(f'Acurácia de {train_acc:.3f} %')
print('*' * 40)

****************************************
Acurácia de 100.000 %
****************************************
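The accuracy loop above can be wrapped in a reusable function so the same logic serves the training, validation and test loaders. This is a minimal sketch: `evaluate_accuracy` and the toy model are hypothetical names introduced here, not part of the notebook, and the notebook's own `predict` function may work differently.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(model, dataloader, device="cpu", threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for x_batch, y_batch in dataloader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device).reshape(-1, 1)
            preds = (model(x_batch) > threshold).float()
            correct += (preds == y_batch).sum().item()
    return correct / len(dataloader.dataset)


# Toy check: a fixed "model" that predicts positive when its single feature is > 0
toy_model = torch.nn.Sequential(torch.nn.Linear(1, 1), torch.nn.Sigmoid())
with torch.no_grad():
    toy_model[0].weight.fill_(1.0)
    toy_model[0].bias.fill_(0.0)

x = torch.tensor([[-2.0], [-1.0], [1.0], [2.0]])
y = torch.tensor([0.0, 0.0, 1.0, 0.0])  # the last label deliberately disagrees with the model
loader = DataLoader(TensorDataset(x, y), batch_size=2)

acc = evaluate_accuracy(toy_model, loader)
print(acc)  # 0.75
```

Returning a plain Python float (via `.item()`) keeps the result independent of the device the model runs on.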


## Final experiment

In [None]:
import os

run = neptune.init(
    project="pedro.gengo/IA-376",
    # Read the token from the environment instead of hard-coding a secret in the notebook
    api_token=os.environ["NEPTUNE_API_TOKEN"],
)

https://app.neptune.ai/pedro.gengo/IA-376/e/IA-21
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


In [None]:
learning_rate = 0.001
n_epochs = 20
batch_size = 50
hidden_size = 150
filename = "self_attention"

hprams = {"learning_rate": learning_rate,
          "batch_size": batch_size,
          "hidden_size": hidden_size
          }

In [None]:
dataset_train = Dataset(x_train, y_train, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_valid = Dataset(x_valid, y_valid, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_test = Dataset(x_test, y_test, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])

dataloader_train = DataLoader(dataset_train, batch_size=hprams["batch_size"], shuffle=True)
dataloader_valid = DataLoader(dataset_valid, batch_size=hprams["batch_size"], shuffle=False)
dataloader_test = DataLoader(dataset_test, batch_size=hprams["batch_size"], shuffle=False)

In [None]:
cls = Classifier(embeddings, params['padding_idx'], hprams["hidden_size"])
cls.to(device)

Classifier(
  (self_attention): SelfAttentionLayer(
    (embeddings): Embedding(400002, 300, padding_idx=400001)
  )
  (lin1): Linear(in_features=300, out_features=150, bias=True)
  (lin2): Linear(in_features=150, out_features=1, bias=True)
  (output): Sigmoid()
)

In [None]:
# criterion = torch.nn.CrossEntropyLoss()
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(cls.parameters(), lr=hprams["learning_rate"])

In [None]:
_, train_losses_bow, valid_losses_bow = train(cls, dataloader_train, dataloader_valid, criterion,
          optimizer, filename, n_epochs=n_epochs, run=run, params=hprams)

Época: 0/19 Train Loss: 0.011082 Valid Loss: 0.009220 Valid Acc: 0.790
best model
Época: 1/19 Train Loss: 0.008975 Valid Loss: 0.008554 Valid Acc: 0.810
best model
Época: 2/19 Train Loss: 0.008586 Valid Loss: 0.008187 Valid Acc: 0.818
best model
Época: 3/19 Train Loss: 0.008402 Valid Loss: 0.008229 Valid Acc: 0.811
Época: 4/19 Train Loss: 0.008354 Valid Loss: 0.008257 Valid Acc: 0.813
Época: 5/19 Train Loss: 0.008304 Valid Loss: 0.008048 Valid Acc: 0.823
best model
Época: 6/19 Train Loss: 0.008229 Valid Loss: 0.008162 Valid Acc: 0.815
Época: 7/19 Train Loss: 0.008191 Valid Loss: 0.008480 Valid Acc: 0.804
Época: 8/19 Train Loss: 0.008168 Valid Loss: 0.007996 Valid Acc: 0.826
best model
Época: 9/19 Train Loss: 0.008125 Valid Loss: 0.008005 Valid Acc: 0.824
Época: 10/19 Train Loss: 0.008130 Valid Loss: 0.008352 Valid Acc: 0.807
Época: 11/19 Train Loss: 0.008063 Valid Loss: 0.008019 Valid Acc: 0.819
Época: 12/19 Train Loss: 0.008023 Valid Loss: 0.007981 Valid Acc: 0.824
best model
Época: 1

In [None]:
predict(cls, filename, dataloader_test, run=run)

****************************************
Acurácia de 81.440 %
****************************************


In [None]:
run.stop()

Shutting down background jobs, please wait a moment...
Done!


Waiting for the remaining 8 operations to synchronize with Neptune. Do not kill this process.


All 8 operations synced, thanks for waiting!
