<a href="https://colab.research.google.com/github/pedrogengo/DLforNLP/blob/main/Pedro_Gengo_Aula_8_Exerc%C3%ADcio_MultiHeadSelfAttention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference notebook

Name: Pedro Gabriel Gengo Lourenço

In [None]:
!pip install neptune-client


## Instructions:

Train a binary classification model on the IMDB dataset (20k/5k training/validation samples) and measure its accuracy.
The model must contain a complete self-attention layer, the same as in the paper "Attention Is All You Need".

Implement IMDB sentiment analysis as in last week's exercise, but now using "complete" attention:
- Position embeddings
- Linear projections (WQ, WK, WV, WO)
- Scaled dot-product
- Multi-head
- Layer normalization
- Residual connections
- Feed-forward layer (2-layer MLP)

Only the matrix implementation needs to be submitted; the loop-based version is not required.

Use pretrained GloVe embeddings as input to the self-attention layer. Remember to freeze them; otherwise the model may overfit.
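As a minimal sketch of what freezing means in PyTorch, the snippet below uses a tiny made-up matrix (`toy_vectors`, an assumption for illustration) in place of GloVe. `torch.nn.Embedding.from_pretrained` freezes the weights by default (`freeze=True`), so they receive no gradient updates.

```python
import torch

# Toy stand-in for the GloVe matrix: 4 "words", 3 dimensions (illustrative values only).
toy_vectors = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# freeze=True is the default; it is spelled out here for clarity.
frozen = torch.nn.Embedding.from_pretrained(toy_vectors, freeze=True)
trainable = torch.nn.Embedding.from_pretrained(toy_vectors.clone(), freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True

# Only parameters with requires_grad=True are updated by an optimizer.
n_frozen = sum(p.numel() for p in frozen.parameters() if p.requires_grad)
n_trainable = sum(p.numel() for p in trainable.parameters() if p.requires_grad)
print(n_frozen, n_trainable)           # 0 12
```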

When grading the exercise, we will also pay attention to the efficiency/speed of the implementations.

Tips:
- The difficulty of this exercise is implementing self-attention in matrix form with minibatches. To handle examples of variable length, truncate them and apply padding.

- Avoid any loop in the matrix implementation, since loops make it very inefficient.
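The tips above can be illustrated with a small, loop-free sketch of scaled dot-product attention over a padded minibatch. The function name `masked_attention`, the tensor sizes, and the random inputs are assumptions chosen for illustration; this is not the reference solution of the exercise.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, key_mask):
    """q, k, v: (B, H, L, D_head); key_mask: (B, L) bool, True for real tokens."""
    d_head = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / math.sqrt(d_head)        # (B, H, L, L)
    # Broadcast the key mask over heads and query positions.
    scores = scores.masked_fill(~key_mask[:, None, None, :], float('-inf'))
    probs = F.softmax(scores, dim=-1)                            # each row sums to 1
    return probs @ v, probs

torch.manual_seed(0)
B, H, L, D_head = 2, 2, 4, 3
q, k, v = (torch.randn(B, H, L, D_head) for _ in range(3))
# The second example has two PAD positions at the end.
key_mask = torch.tensor([[True, True, True, True],
                         [True, True, False, False]])
out, probs = masked_attention(q, k, v, key_mask)
print(out.shape)                  # torch.Size([2, 2, 4, 3])
print(probs[1, 0, 0, 2:])         # tensor([0., 0.]): PAD keys get zero weight
```

Because the mask is applied to the scores before the softmax, padded keys contribute exactly zero probability and the remaining weights still sum to one.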

## Defining the parameters

In [None]:
params = {
    'vocabulary_size': 400000,
    'padding_idx': 400001,
    'max_length': 200,
    'dim': 300,
    'n_heads': 6,
}

# Fixing the seed

In [None]:
import random
import torch
import torch.nn.functional as F
import numpy as np
from torch.utils.data import DataLoader, Dataset

import neptune.new as neptune

In [None]:
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x7efbdc977c10>

## Preparing the data

First, we download the dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

--2021-10-07 12:00:58--  http://files.fast.ai/data/aclImdb.tgz
Resolving files.fast.ai (files.fast.ai)... 104.26.3.19, 172.67.69.159, 104.26.2.19, ...
Connecting to files.fast.ai (files.fast.ai)|104.26.3.19|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://files.fast.ai/data/aclImdb.tgz [following]
--2021-10-07 12:00:59--  https://files.fast.ai/data/aclImdb.tgz
Connecting to files.fast.ai (files.fast.ai)|104.26.3.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145982645 (139M) [application/x-gtar-compressed]
Saving to: ‘aclImdb.tgz’


2021-10-07 12:01:11 (11.6 MB/s) - ‘aclImdb.tgz’ saved [145982645/145982645]



## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples).

In [None]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training set before splitting it into train/valid.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
False It's nothing more than a weird coincidence that I decided to watch STARLIFT on the 59th anniversary 
False I am wanting to make a "Holmes with Doors" pun but I can't quite string it all together. Suitably gr
True All the folks who sit here and say that this movie's weak link is the Ramones would probably say tha
Last 3 training samples:
False It's somewhat telling that most of the great reviews for the film on IMDb all come from people who h
True This is a bit long (2 hours, 20 minutes) but it had a a lot of the famous Pearl Buck novel in it. In
True Surprisingly good. The acting was fun, the screenplay was fun, the music was cheesie fun, the plot w
First 3 validation samples:
True Of all the kung-fu films made through the 70's and 80's this is one that has developed a real cult f
True Excellent film dealing with the life of an old man as he looks back over the 

# Loading the GloVe embeddings

In [None]:
!wget -nc http://nlp.stanford.edu/data/glove.6B.zip
!unzip -o glove.6B.zip -d glove_dir

--2021-10-07 12:01:28--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-10-07 12:01:28--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-10-07 12:01:29--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-1

In [None]:
from torchtext.vocab import GloVe
glove_vectors = GloVe(name='6B', dim=300, cache='./glove_dir')

100%|█████████▉| 399999/400000 [00:50<00:00, 7995.98it/s]


In [None]:
print(glove_vectors.vectors.shape)
print('First 20 words and their indices:', list(glove_vectors.stoi.items())[:20])

torch.Size([400000, 300])
First 20 words and their indices: [('the', 0), (',', 1), ('.', 2), ('of', 3), ('to', 4), ('and', 5), ('in', 6), ('a', 7), ('"', 8), ("'s", 9), ('for', 10), ('-', 11), ('that', 12), ('on', 13), ('is', 14), ('was', 15), ('said', 16), ('with', 17), ('he', 18), ('as', 19)]


In [None]:
vocab = glove_vectors.stoi
vocab['<UNK>'] = params['vocabulary_size']  # Index 400000, the first row after the GloVe vectors, is for the unknown token.

# We create a random vector for the unknown token
unk_vector = torch.FloatTensor(1, glove_vectors.vectors.shape[1]).uniform_(-0.5, 0.5)

# We create a vector of zeros for the pad token
pad_vector = torch.zeros(1, glove_vectors.vectors.shape[1])

# And add them to the embeddings matrix.
embeddings = torch.cat((glove_vectors.vectors, unk_vector, pad_vector), dim=0)

print(f'Total number of words: {len(vocab)}')
print(f'embeddings.shape: {embeddings.shape}')

Total number of words: 400001
embeddings.shape: torch.Size([400002, 300])


# Defining the tokenizer

In [None]:
import re

# Compiled once; the raw string avoids an invalid-escape warning for '\w'.
TOKEN_PATTERN = re.compile(r'\w+')


def tokenize(text):
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def to_token_ids(text, vocab, max_length, padding_idx):
    tokens = tokenize(text)[:max_length]  # Truncating.
    token_ids = []
    for token in tokens:
        # We use the id of the "<UNK>" token if we don't find it in the vocabulary.
        token_id = vocab.get(token, vocab['<UNK>'])
        token_ids.append(token_id)

    # Adding PAD tokens, if necessary.
    token_ids += [padding_idx] * max(0, max_length - len(token_ids))
    return token_ids
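A short usage sketch of the truncate, map and pad steps, written as a compact self-contained variant of the function above. The helper name `to_ids`, the three-word `toy_vocab`, and the `PAD` index are hypothetical and exist only for this illustration.

```python
import re

def to_ids(text, vocab, max_length, padding_idx):
    # Lowercase word tokens, truncate, map unknown words to <UNK>, then pad.
    tokens = [t.lower() for t in re.findall(r'\w+', text)][:max_length]
    ids = [vocab.get(t, vocab['<UNK>']) for t in tokens]
    return ids + [padding_idx] * (max_length - len(ids))

toy_vocab = {'good': 0, 'movie': 1, '<UNK>': 2}   # hypothetical vocabulary
PAD = 3

short_ids = to_ids('Good movie!', toy_vocab, 4, PAD)
long_ids = to_ids('A good, GOOD movie indeed', toy_vocab, 4, PAD)
print(short_ids)   # [0, 1, 3, 3]  -> padded to length 4
print(long_ids)    # [2, 0, 0, 1]  -> 'a' is unknown, 'indeed' is truncated

# Note that \w+ splits on apostrophes, so contractions become two tokens.
print(re.findall(r'\w+', "didn't"))   # ['didn', 't']
```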

## Defining the model

In [None]:
# Resetting the seeds before initializing the model is recommended, since it
# guarantees that the weights are always the same.
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

class SelfAttentionLayer(torch.nn.Module):

    def __init__(self, embeddings, padding_idx, n_heads, dim, max_length):
        super().__init__()
        # It is important that the layers are created in the order below, to
        # ensure they get the same weights used to create the target vector
        # checked in the asserts:
        embedding_dim = embeddings.shape[-1]
        self.n_heads = n_heads
        self.max_length = max_length
        self.dim = dim
        self.padding_idx = padding_idx

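        # Network
        self.embedding_layer = torch.nn.Embedding.from_pretrained(embeddings, padding_idx=padding_idx)
        # Only the weight of this Linear is used, as a (max_length, dim) table of learned position embeddings.
        self.positional_embeddings = torch.nn.Linear(dim, max_length, bias=False)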
        self.W_q = torch.nn.Linear(dim, dim, bias=False)
        self.W_k = torch.nn.Linear(dim, dim, bias=False)
        self.W_v = torch.nn.Linear(dim, dim, bias=False)
        self.W_o = torch.nn.Linear(dim, dim, bias=False)
        self.layer_norm1 = torch.nn.LayerNorm(dim, eps=1e-06)
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.ReLU(),
            torch.nn.Linear(dim, dim),
        )
        self.layer_norm2 = torch.nn.LayerNorm(dim, eps=1e-06)
    
    def attention(self, q, k, v, mask):
        scores = torch.matmul(q, k.transpose(-1, -2))  # B, H, L, L
        # B, L -> B, 1, 1, L -> B, H, L, L. Each row of scores holds one token's scores against all tokens, so the same key mask is replicated across every row.
        mask_expanded = mask[:, None, None, :].expand_as(scores)
        scores.masked_fill_(~mask_expanded, float('-inf'))  # B, H, L, L; PAD positions are filled with -inf
        scores = scores / np.sqrt(self.dim // self.n_heads)
        probs = F.softmax(scores, dim=-1)  # B, H, L, L; attention probabilities over the tokens for each row
        E = torch.matmul(probs, v)  # (B, H, L, L) x (B, H, L, D/H) = (B, H, L, D/H)
        return E

    def forward(self, batch_token_ids):
        batch_size = batch_token_ids.shape[0]

        embs = self.embedding_layer(batch_token_ids)
        x = embs + self.positional_embeddings.weight[None, :, :] # B, L, D
        dim_heads = self.dim // self.n_heads

        q = self.W_q(x).reshape(batch_size, self.max_length, self.n_heads, dim_heads)
        k = self.W_k(x).reshape(batch_size, self.max_length, self.n_heads, dim_heads)
        v = self.W_v(x).reshape(batch_size, self.max_length, self.n_heads, dim_heads)
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)  # B, H, L, D/H
        mask = batch_token_ids != self.padding_idx  # B, L; True for the tokens that are not PAD

        e = self.attention(q, k, v, mask) # B, H, L, D/H
        e = e.transpose(2, 1).contiguous() # B, L, H, D/H
        e = e.reshape(batch_size, self.max_length, -1) # B, L, D
        e = self.W_o(e) # B, L, D

        residual1 = x + e
        norm1 = self.layer_norm1(residual1)

        feed_forward_x = self.feed_forward(norm1)
        residual2 = norm1 + feed_forward_x
        norm2 = self.layer_norm2(residual2)
        e_out = norm2 * mask[:, :, None]
        mean_embeddings = e_out.sum(1) / torch.clamp(mask.sum(1)[:, None], min=1)
        return mean_embeddings
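The reshape and transpose used in `forward` to split the projections into heads can be checked in isolation. The sketch below uses small made-up sizes (`B`, `L`, `D`, `H` are assumptions for illustration) and verifies that head `h` is just a contiguous slice of the feature axis, and that the reverse transpose and reshape restore the original tensor. The `for` loop appears only in the check, not in the split itself.

```python
import torch

torch.manual_seed(0)
B, L, D, H = 2, 5, 6, 3
d_head = D // H
x = torch.randn(B, L, D)

# Split the last dimension into H heads, then move the head axis forward.
heads = x.reshape(B, L, H, d_head).transpose(1, 2)        # (B, H, L, d_head)

# Head h is simply columns [h * d_head, (h + 1) * d_head) of x.
for h in range(H):
    assert torch.equal(heads[:, h], x[:, :, h * d_head:(h + 1) * d_head])

# The inverse: bring the heads back next to the feature axis and merge them.
merged = heads.transpose(1, 2).reshape(B, L, D)
print(torch.equal(merged, x))  # True
```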

## Testing the implementation with "fake" embeddings

In [None]:
fake_vocab = {
    'a': 0,
    'b': 1,
    'c': 2,
    '<UNK>': 3 
}

fake_embeddings = torch.arange(0, 2 * len(fake_vocab)).reshape(len(fake_vocab), 2).float()
pad_vector = torch.zeros(1, 2)
fake_embeddings = torch.cat((fake_embeddings, pad_vector), dim=0)

fake_examples = [
    'a', # Testing PAD
    'a b',
    'a c b', # Testing truncation
    'a z', # Testing <UNK>
    ]

print(f'Total number of words: {len(fake_vocab)}')
print(f'embeddings.shape: {fake_embeddings.shape}')

Total number of words: 4
embeddings.shape: torch.Size([5, 2])


In [None]:
fake_embeddings

tensor([[0., 1.],
        [2., 3.],
        [4., 5.],
        [6., 7.],
        [0., 0.]])

In [None]:
self_attention_layer = SelfAttentionLayer(
    embeddings=fake_embeddings,
    padding_idx=4,
    dim=2,
    n_heads=2,
    max_length=2)

batch_token_ids = []
for example in fake_examples:
    token_ids = to_token_ids(
        text=example,
        vocab=fake_vocab,
        max_length=2,
        padding_idx=4)
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

In [None]:
torch.set_printoptions(precision=10)

In [None]:
my_output

tensor([[-0.9999975562,  0.9999975562],
        [-0.9999975562,  0.9999976158],
        [-0.9999975562,  0.9999974966],
        [-0.9999974966,  0.9999974966]], grad_fn=<DivBackward0>)

In [None]:
target_output = torch.FloatTensor([
    [-0.9999975562,  0.9999975562],
    [-0.9999975562,  0.9999976158],
    [-0.9999975562,  0.9999974966],
    [-0.9999974966,  0.9999974966]])

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-8)
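A note on the tolerance in the assert above: `torch.allclose` is documented to test `|input - other| <= atol + rtol * |other|` elementwise, with a default `rtol` of `1e-5`, so passing `atol=1e-8` alone does not demand agreement to eight decimal places. The pure-Python helper `is_close` below is a hypothetical re-statement of that rule for scalars, applied to two values printed in the output above.

```python
def is_close(a, b, rtol=1e-5, atol=1e-8):
    # Scalar version of the documented rule: |a - b| <= atol + rtol * |b|
    return abs(a - b) <= atol + rtol * abs(b)

# These two values differ by about 6e-8, more than atol=1e-8,
# but the relative term (about 1e-5) still accepts them.
with_default_rtol = is_close(0.9999975562, 0.9999976158)
with_zero_rtol = is_close(0.9999975562, 0.9999976158, rtol=0.0)
print(with_default_rtol)  # True
print(with_zero_rtol)     # False
```

Passing `rtol=0` alongside `atol` would make the comparison purely absolute, if that stricter check is what is wanted.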

## Testing the implementation with 8 examples from the IMDB dataset

In [None]:
examples = [
    "THE TEMP (1993) didn't do much theatrical business, but here's the direct-to-video rip-off you didn't want, anyway! Ellen Bradford (Mel Harris) is the new woman at Millennium Investments, a high scale brokerage firm, who starts getting helpful hints from wide-eyed secretary Deidre (Sheila Kelley). Deidre turns out to be an ambitious daddy's girl who will stop at nothing to move up the corporate ladder, including screwing a top broker she can't stand and murdering anyone who gets on her bad side. She digs up skeletons in Ellen's closet, tries to cause problems with her husband (Barry Bostwick), kills while making it look like she is responsible, kidnaps her daughter and tries to get her to embezzle money from the company.<br /><br />Harris and Kelley deliver competent performances, the supporting cast is alright and it's reasonably well put-together, but that doesn't fully compensate for a script that travels down a well-worn path and offers few surprises.",
    "Sondra Locke stinks in this film, but then she was an awful 'actress' anyway. Unfortunately, she drags everyone else (including then =real life boyfriend Clint Eastwood down the drain with her. But what was Clint Eastwood thinking when he agreed to star in this one? One read of the script should have told him that this one was going to be a real snorer. It's an exceptionally weak story, basically no story or plot at all. Add in bored, poor acting, even from the normally good Eastwood. There's absolutely no action except a couple arguments and as far as I was concerned, this film ranks up at the top of the heap of natural sleep enhancers. Wow! Could a film BE any more boring? I think watching paint dry or the grass grow might be more fun. A real stinker. Don't bother with this one.",
    "Judy Davis shows us here why she is one of Australia's most respected and loved actors - her portrayal of a lonely, directionless nomad is first-rate. A teenaged Claudia Karvan also gives us a glimpse of what would make her one of this country's most popular actors in years to come, with future roles in THE BIG STEAL, THE HEARTBREAK KID, DATING THE ENEMY, RISK and the acclaimed TV series THE SECRET LIFE OF US. (Incidentally, Karvan, as a child, was a young girl whose toy Panda was stolen outside a chemist's shop in the 1983 drama GOING DOWN with Tracey Mann.) If this films comes your way, make sure you see it!! Rating: 79/100. See also: HOTEL SORRENTO, RADIANCE, VACANT POSSESSION, LANTANA.",
    'New York playwright Michael Caine (as Sidney Bruhl) is 46-years-old and fading fast; as the film opens, Mr. Caine\'s latest play flops on Broadway. TV reviewers poke fun at Caine, and he gets drunk. Passing out on the Long Island Railroad lands Caine in Montauk, instead of his residence in East Hampton. Finally arriving home, Caine is comforted by tightly-attired wife Dyan Cannon (as Myra), an unfortunately high-strung heart patient. There, Caine and Ms. Cannon discuss a new play called "Deathtrap", written by hunky young Christopher Reeve (as Clifford "Cliff" Anderson), one of Caine\'s former students. The couple believe Mr. Reeve\'s "Deathtrap" is the hit needed to revive Caine\'s career.<br /><br />"The Trap Is Set\x85 For A Wickedly Funny Who\'ll-Do-It." <br /><br />Directed by Sidney Lumet, Ira Levin\'s long-running Broadway hit doesn\'t stray too far from its stage origin. The cast is enjoyable and the story\'s twists are still engrossing. One thing that did not work (for me) was the curtain call ending; surely, it played better on stage. "Deathtrap" is a fun film to watch again; the performances are dead on - but, in hindsight, the greeting Reeve gives Caine at the East Hampton train station should have been simplified to a smiling "Hello." The location isn\'t really East Hampton, but the windmill and pond look similar. And, the much ballyhooed love scene is shockingly tepid. But, the play was so good, "even a gifted director couldn\'t ruin it." And, Mr. Lumet doesn\'t disappoint.<br /><br />******** Deathtrap (3/19/82) Sidney Lumet ~ Michael Caine, Christopher Reeve, Dyan Cannon, Irene Worth',
    'Students often ask me why I choose this version of Othello. Shakespeare\'s text is strongly truncated and the film contains material which earned it an "R" rating.<br /><br />I have several reasons for using this production: First, I had not seen a depiction of the Moor that actually made me sympathetic to Othello until I saw Fishburne play him. I saw James Earl Jones and Christopher Plummer play Othello and Iago on Broadway, and it was wonderful. Plummer\'s energy was especially noticeable. But in spite of Jone\'s incredible presence both physically and vocally, the character he played just seemed too passive to illicit from me a complete emotional purgation in the Aristotelian sense. Jones, in fact, affirmed what I felt when in an interview he noted that he had played Othello as passive--seeing Iago as basically doing him over. Unfortunately this sapped my grief for the character destruction. Thus, I felt sympathy for Jone\'s Moor but not the horror over his corruption by an evil man. In contrast, Fishburne\'s Othello is a strong and vigorous figure familiar with taking action. Thus, Iago\'s temptation to actively deal with what is presented to Othello as his wife\'s unfaithfulness is a perversion of the general\'s positive quality to be active not passive.1 The horror of the story is that this good quality in Othello becomes perverted. Fishburne\'s depiction is therefore classically tragic.<br /><br />Second, Fishburne is the first black actor to play Othello in a film. Both Orsen Wells and Anthony Hopkins did fine film versions, but they were white men in black face.2 Why is this important? Why should a Black actor be the Black man on the stage?3 Certainly in Shakespeare\'s day they used black face just as they used boys to make girls. Perhaps then, the reason is the same. Female actors bring a special quality to female roles on the Shakespearian stage because they understand best what Shakespeare\'s genius was trying to present. 
A gifted black actor should play the moor because his experience in a white dominated culture is vital to understanding what Shakespeare\'s genius recognized: the pain of being marginalized because of race. An important theme in Othello is isolation caused by racism. Although it is a mistake to insert American racism into a Shakespearian play, there can be little doubt that racism is still working among the characters. Many, including Desdimona\'s father, think that a union between a Venetian white Christian woman and a North African black Christian man is UNNATURAL.<br /><br />Third, Shakespeare was never G rated. He never has been. His stage productions were always typified by violence and strong language. But Shakespeare\'s genius uses these elements not as sensationialism but for artistic honesty.',
    'Roeg has done some great movies, but this a turkey. It has a feel of a play written by an untalented high-school student for his class assignment. The set decoration is appealing in a somewhat surrealistic way, but the actual story is insufferable hokum.',
    "<br /><br />What is left of Planet Earth is populated by a few poor and starving rag-tag survivors. They must eat bugs and insects, or whatever, after a poison war, or something, has nearly wiped out all human civilization. In these dark times, one of the few people on Earth still able to live in comfort, we will call him the All Knowing Big Boss, has a great quest to prevent some secret spore seeds from being released into the air. It seems that the All Knowing Big Boss is the last person on Earth that knows that these spores even exist. The spores are located far away from any living soul, and they are highly protected by many layers of deadly defense systems. <br /><br />The All Knowing Big Boss wants the secret spores to remain in their secret protected containers. So, he makes a plan to send in a macho action team to remove the spore containers from all of the protective systems and secret location. Sending people to the location of secret spores makes them no longer a secret. Sending people to disable all of the protective systems makes it possible for the spores to be easily released into the air. How about letting sleeping dogs lie?! <br /><br />The one pleasant feature of ENCRYPT is the radiant and elegant Vivian Wu. As the unremarkable macho action team members drop off with mechanically paced predictable timing, engaging Vivian Wu's charm makes acceptable the plot idea of her old employer wanting her so much. She is an object of love, an object of desire -- a very believable concept!<br /><br />Fans of Vivian Wu may want to check out an outstanding B-movie she is in from a couple years back called DINNER RUSH. DINNER RUSH is highly recommended. ENCRYPT is not.",
    "So the other night I decided to watch Tales from the Hollywood Hills: Natica Jackson. Or Power, Passion, Murder as it is called in Holland. When I bought the film I noticed that Michelle Pfeiffer was starring in it and I thought that had to say something about the quality. Unfortunately, it didn't.<br /><br />1) The plot of the film is really confusing. There are two story lines running simultaneously during the film. Only they have nothing in common. Throughout the entire movie I was waiting for the moment these two story lines would come together so the plot would be clear to me. But it still hasn't.<br /><br />2) The title of the film says the film will be about Natica Jackson. Well it is, sometimes. Like said the film covers two different stories and the part about Natica Jackson is the shortest. So another title for this movie would not be a wrong choice.<br /><br />To conclude my story, I really recommend that you leave this movie where it belongs, on the shelf in the store on a place nobody can see it. By doing this you won't waste 90 minutes of your life, as I did."         
]

In [None]:
self_attention_layer = SelfAttentionLayer(
    embeddings=embeddings,
    padding_idx=params['padding_idx'],
    dim=params['dim'],
    n_heads=params['n_heads'],
    max_length=params['max_length'])

batch_token_ids = []
for example in examples:
    token_ids = to_token_ids(
        text=example,
        vocab=vocab,
        max_length=params['max_length'],
        padding_idx=params['padding_idx'])
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

We download the expected tensor and compare it with our output.

In [None]:
!wget https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2021s2/aula8/target_tensor.pt

--2021-10-07 12:05:29--  https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2021s2/aula8/target_tensor.pt
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.15.128, 173.194.76.128, 66.102.1.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.15.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10347 (10K) [application/octet-stream]
Saving to: ‘target_tensor.pt’


2021-10-07 12:05:29 (78.8 MB/s) - ‘target_tensor.pt’ saved [10347/10347]



In [None]:
target_output = torch.load('target_tensor.pt')

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)

# Classifier

Unlike the other exercises, here we use the self-attention layer implemented above to perform binary classification of movie reviews as positive or negative.

### Creating the Dataset class

In [None]:
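# Subclass the fully qualified base class so this class does not inherit from its own name.
class Dataset(torch.utils.data.Dataset):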

  def __init__(self, x, y, tokenizer, vocab, max_length, padding_idx):
    self.x = x
    self.y = y
    self.tokenizer = tokenizer
    self.vocab = vocab
    self.max_length = max_length
    self.padding_idx = padding_idx
  
  def __len__(self):
    return len(self.x)

  def __getitem__(self, idx):
    x = self.tokenizer(self.x[idx], self.vocab, self.max_length, self.padding_idx)
    return torch.tensor(x).long(), torch.tensor(self.y[idx]).float()

## Training, validation and test loops

In [None]:
if torch.cuda.is_available():
    dev = "cuda:0"
    print(torch.cuda.get_device_name(dev))
else:
    dev = "cpu"
print(dev)
device = torch.device(dev)

Tesla K80
cuda:0


In [None]:
def train(model, train, valid, criterion, optimizer, filename_save, n_epochs=10, run=None, params=None):
  
  best_valid_loss = 10e9
  best_epoch = 0
  train_losses, valid_losses = [], []
  if run:
    run['parameters'] = params
  for i in range(n_epochs):
    accumulated_loss = 0
    model.train()
    for x_train, y_train in train:
      x_train = x_train.to(device)
      y_train = y_train.to(device).reshape(-1, 1)
      outputs = model(x_train)
      batch_loss = criterion(outputs, y_train)

      optimizer.zero_grad()
      batch_loss.backward()
      optimizer.step()
      accumulated_loss += batch_loss.item()

    train_loss = accumulated_loss / len(train.dataset)
    train_losses.append(train_loss)

    # Validation loop, once per epoch.
    accumulated_loss = 0
    accumulated_accuracy = 0
    model.eval()
    with torch.no_grad():
        for x_valid, y_valid in valid:
            x_valid = x_valid.to(device)
            y_valid = y_valid.to(device).reshape(-1, 1)

            # Network prediction
            outputs = model(x_valid)

            # Compute the loss
            batch_loss = criterion(outputs, y_valid)
            preds = outputs > 0.5
            # preds = outputs.argmax(dim=1)

            # Count the correct predictions
            batch_accuracy = (preds == y_valid).sum()
            # .item() stores plain Python numbers instead of device tensors
            accumulated_loss += batch_loss.item()
            accumulated_accuracy += batch_accuracy.item()

    valid_loss = accumulated_loss / len(valid.dataset)
    valid_losses.append(valid_loss)

    valid_acc = accumulated_accuracy / len(valid.dataset)

    print(f'Época: {i:d}/{n_epochs - 1:d} Train Loss: {train_loss:.6f} Valid Loss: {valid_loss:.6f} Valid Acc: {valid_acc:.3f}')

    if run:
      run[f"{filename_save}_valid/loss"].log(valid_loss)
      run[f"{filename_save}_valid/acc"].log(valid_acc)
      run[f"{filename_save}_train/loss"].log(train_loss)


    # Save the best model according to the validation loss
    if valid_loss < best_valid_loss:
        torch.save(model.state_dict(), filename_save + '.pt')
        best_valid_loss = valid_loss
        best_epoch = i
        print('best model')

  return model, train_losses, valid_losses
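A note on the scale of the logged losses: when the criterion already returns a per-batch mean (the default `reduction='mean'`), summing those means and dividing by `len(dataset)` yields a value that is smaller than the per-example loss by a factor of the batch size. The sketch below uses made-up batch losses (`batch_means` is hypothetical) to show the two normalizations side by side.

```python
import math

# Hypothetical per-batch mean losses, as returned by a criterion with reduction='mean'.
batch_means = [0.70, 0.68, 0.69]
batch_size = 50
n_samples = batch_size * len(batch_means)      # assumes every batch is full

per_sample = sum(batch_means) / len(batch_means)   # average loss per example
scaled = sum(batch_means) / n_samples              # result of dividing by len(dataset)

print(round(per_sample, 4))   # 0.69
print(round(scaled, 4))       # 0.0138
print(math.isclose(scaled * batch_size, per_sample))  # True
```

Either normalization ranks epochs the same way when the batch size is fixed, so model selection is unaffected; only the absolute scale of the reported numbers changes.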

In [None]:
def predict(model, state_dict, test, run=None):
  accumulated_accuracy = 0
  model.load_state_dict(torch.load(state_dict + '.pt'))
  model.eval()
  with torch.no_grad():
      for x_test, y_test in test:
          x_test = x_test.to(device)
          y_test = y_test.to(device).reshape(-1,1)

          # Network prediction
          outputs = model(x_test)
          preds = outputs > 0.5

          # Count the correct predictions
          batch_accuracy = (preds == y_test).sum().item()
          accumulated_accuracy += batch_accuracy

  test_acc = accumulated_accuracy / len(test.dataset)
  test_acc *= 100
  print('*' * 40)
  print(f'Accuracy: {test_acc:.3f} %')
  print('*' * 40)

  if run:
    run['results'] = test_acc

## Network definition

I use a network with one attention layer followed by two linear layers.

In [None]:
class Classifier(torch.nn.Module):

  def __init__(self, embeddings, padding_idx, n_heads, dim, max_length, size_lin1):
    super(Classifier, self).__init__()
    self.self_attention = SelfAttentionLayer(embeddings, padding_idx, n_heads, dim, max_length)
    self.lin1 = torch.nn.Linear(dim, size_lin1)
    self.lin2 = torch.nn.Linear(size_lin1, 1)
    self.output = torch.nn.Sigmoid()

  
  def forward(self, x):
    x = self.self_attention(x)
    x = self.lin1(x)
    x = torch.nn.functional.relu(x)
    x = self.lin2(x)
    x = self.output(x)
    return x

## Test experiment

Using a small amount of data, I run an experiment to validate the training and validation flow and to check whether the model overfits when given few examples.

In [None]:
learning_rate = 0.01
n_epochs = 300
batch_size = 50
hidden_size = 150
filename = "self_attention"

hprams = {"learning_rate": learning_rate,
          "batch_size": batch_size,
          "hidden_size": hidden_size
          }

In [None]:
dataset_train = Dataset(x_train[:50], y_train[:50], to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_valid = Dataset(x_valid[:50], y_valid[:50], to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_test = Dataset(x_test, y_test, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])

dataloader_train = DataLoader(dataset_train, batch_size=hprams["batch_size"], shuffle=True)
dataloader_valid = DataLoader(dataset_valid, batch_size=hprams["batch_size"], shuffle=False)
dataloader_test = DataLoader(dataset_test, batch_size=hprams["batch_size"], shuffle=False)

In [None]:
cls = Classifier(embeddings, params['padding_idx'], params['n_heads'], params['dim'], params['max_length'], hprams["hidden_size"])
cls.to(device)

Classifier(
  (self_attention): SelfAttentionLayer(
    (embedding_layer): Embedding(400002, 300, padding_idx=400001)
    (positional_embeddings): Linear(in_features=300, out_features=200, bias=False)
    (W_q): Linear(in_features=300, out_features=300, bias=False)
    (W_k): Linear(in_features=300, out_features=300, bias=False)
    (W_v): Linear(in_features=300, out_features=300, bias=False)
    (W_o): Linear(in_features=300, out_features=300, bias=False)
    (layer_norm1): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
    (feed_forward): Sequential(
      (0): Linear(in_features=300, out_features=300, bias=True)
      (1): ReLU()
      (2): Linear(in_features=300, out_features=300, bias=True)
    )
    (layer_norm2): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
  )
  (lin1): Linear(in_features=300, out_features=150, bias=True)
  (lin2): Linear(in_features=150, out_features=1, bias=True)
  (output): Sigmoid()
)

In [None]:
# criterion = torch.nn.CrossEntropyLoss()
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(cls.parameters(), lr=hprams["learning_rate"])

In [None]:
_, train_losses_bow, valid_losses_bow = train(cls, dataloader_train, dataloader_valid, criterion,
          optimizer, filename, n_epochs=n_epochs)

Época: 0/299 Train Loss: 0.013815 Valid Loss: 0.036262 Valid Acc: 0.400
best model
Época: 1/299 Train Loss: 0.028075 Valid Loss: 0.024280 Valid Acc: 0.600
best model
Época: 2/299 Train Loss: 0.032322 Valid Loss: 0.015749 Valid Acc: 0.400
best model
Época: 3/299 Train Loss: 0.014194 Valid Loss: 0.014523 Valid Acc: 0.400
best model
Época: 4/299 Train Loss: 0.013814 Valid Loss: 0.013507 Valid Acc: 0.600
best model
Época: 5/299 Train Loss: 0.015013 Valid Loss: 0.019556 Valid Acc: 0.400
Época: 6/299 Train Loss: 0.016226 Valid Loss: 0.015194 Valid Acc: 0.400
Época: 7/299 Train Loss: 0.013928 Valid Loss: 0.013575 Valid Acc: 0.600
Época: 8/299 Train Loss: 0.015141 Valid Loss: 0.013542 Valid Acc: 0.600
Época: 9/299 Train Loss: 0.013912 Valid Loss: 0.016211 Valid Acc: 0.400
Época: 10/299 Train Loss: 0.014030 Valid Loss: 0.016556 Valid Acc: 0.400
Época: 11/299 Train Loss: 0.013891 Valid Loss: 0.014410 Valid Acc: 0.400
Época: 12/299 Train Loss: 0.012674 Valid Loss: 0.013885 Valid Acc: 0.520
Época:

In [None]:
# Sanity check: accuracy is measured on the 50 training samples, which the model should memorize
accumulated_accuracy = 0
cls.eval()
with torch.no_grad():
    for x_test_, y_test_ in dataloader_train:
        x_test_ = x_test_.to(device)
        y_test_ = y_test_.to(device).reshape(-1,1)

        # forward pass
        outputs = cls(x_test_)

        # compute the loss
        batch_loss = criterion(outputs, y_test_)
        # threshold the sigmoid output at 0.5 to get the predicted class
        preds = outputs > 0.5
        # preds = outputs.argmax(dim=1)

        # count the correct predictions in this batch
        batch_accuracy = (preds == y_test_).sum()
        accumulated_accuracy += batch_accuracy

test_acc = accumulated_accuracy / len(dataloader_train.dataset)
test_acc *= 100
print('*' * 40)
print(f'Acurácia de {test_acc:.3f} %')
print('*' * 40)

****************************************
Acurácia de 100.000 %
****************************************
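The accuracy computation in the cell above can be sketched in plain Python to make the arithmetic explicit. The probabilities and labels below are made-up values, and `batch_correct` is a hypothetical helper, not a function from the notebook.

```python
def batch_correct(probabilities, labels, threshold=0.5):
    """Count predictions whose thresholded probability matches the 0/1 label."""
    return sum(int((p > threshold) == bool(y)) for p, y in zip(probabilities, labels))


# Two fake batches of (sigmoid outputs, labels); the last batch is smaller on purpose.
batches = [
    ([0.9, 0.2, 0.7], [1.0, 0.0, 0.0]),
    ([0.4, 0.8], [0.0, 1.0]),
]

n_samples = sum(len(labels) for _, labels in batches)
correct = sum(batch_correct(probs, labels) for probs, labels in batches)
accuracy = 100 * correct / n_samples
print(f"{accuracy:.3f} %")  # 80.000 %
```

Dividing the accumulated count by the number of samples, rather than averaging per-batch accuracies, keeps the result correct when the last batch is smaller than the others.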


## Final experiment

In [None]:
import os

run = neptune.init(
    project="pedro.gengo/IA-376",
    # Read the token from the environment instead of hard-coding a secret in the notebook
    api_token=os.environ["NEPTUNE_API_TOKEN"],
)

https://app.neptune.ai/pedro.gengo/IA-376/e/IA-23
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


In [None]:
learning_rate = 0.001
n_epochs = 20
batch_size = 50
hidden_size = 150
filename = "self_attention"

hprams = {"learning_rate": learning_rate,
          "batch_size": batch_size,
          "hidden_size": hidden_size
          }

In [None]:
dataset_train = Dataset(x_train, y_train, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_valid = Dataset(x_valid, y_valid, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])
dataset_test = Dataset(x_test, y_test, to_token_ids, vocab=vocab, max_length=params['max_length'], padding_idx=params['padding_idx'])

dataloader_train = DataLoader(dataset_train, batch_size=hprams["batch_size"], shuffle=True)
dataloader_valid = DataLoader(dataset_valid, batch_size=hprams["batch_size"], shuffle=False)
dataloader_test = DataLoader(dataset_test, batch_size=hprams["batch_size"], shuffle=False)

In [None]:
cls = Classifier(embeddings, params['padding_idx'], params['n_heads'], params['dim'], params['max_length'], hprams["hidden_size"])
cls.to(device)

Classifier(
  (self_attention): SelfAttentionLayer(
    (embedding_layer): Embedding(400002, 300, padding_idx=400001)
    (positional_embeddings): Linear(in_features=300, out_features=200, bias=False)
    (W_q): Linear(in_features=300, out_features=300, bias=False)
    (W_k): Linear(in_features=300, out_features=300, bias=False)
    (W_v): Linear(in_features=300, out_features=300, bias=False)
    (W_o): Linear(in_features=300, out_features=300, bias=False)
    (layer_norm1): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
    (feed_forward): Sequential(
      (0): Linear(in_features=300, out_features=300, bias=True)
      (1): ReLU()
      (2): Linear(in_features=300, out_features=300, bias=True)
    )
    (layer_norm2): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
  )
  (lin1): Linear(in_features=300, out_features=150, bias=True)
  (lin2): Linear(in_features=150, out_features=1, bias=True)
  (output): Sigmoid()
)

In [None]:
# criterion = torch.nn.CrossEntropyLoss()
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(cls.parameters(), lr=hprams["learning_rate"])

In [None]:
_, train_losses_bow, valid_losses_bow = train(cls, dataloader_train, dataloader_valid, criterion,
          optimizer, filename, n_epochs=n_epochs, run=run, params=hprams)

Época: 0/19 Train Loss: 0.008201 Valid Loss: 0.007379 Valid Acc: 0.838
best model
Época: 1/19 Train Loss: 0.007012 Valid Loss: 0.007278 Valid Acc: 0.837
best model
Época: 2/19 Train Loss: 0.006454 Valid Loss: 0.007979 Valid Acc: 0.824
Época: 3/19 Train Loss: 0.006120 Valid Loss: 0.008190 Valid Acc: 0.821
Época: 4/19 Train Loss: 0.005526 Valid Loss: 0.007172 Valid Acc: 0.843
best model
Época: 5/19 Train Loss: 0.004933 Valid Loss: 0.007573 Valid Acc: 0.838
Época: 6/19 Train Loss: 0.004274 Valid Loss: 0.008480 Valid Acc: 0.833
Época: 7/19 Train Loss: 0.003635 Valid Loss: 0.008671 Valid Acc: 0.847
Época: 8/19 Train Loss: 0.003008 Valid Loss: 0.009078 Valid Acc: 0.843
Época: 9/19 Train Loss: 0.002343 Valid Loss: 0.009892 Valid Acc: 0.842
Época: 10/19 Train Loss: 0.001968 Valid Loss: 0.011372 Valid Acc: 0.844
Época: 11/19 Train Loss: 0.001671 Valid Loss: 0.011905 Valid Acc: 0.833
Época: 12/19 Train Loss: 0.001233 Valid Loss: 0.012317 Valid Acc: 0.835
Época: 13/19 Train Loss: 0.001123 Valid L
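The `best model` markers in the log above are consistent with a checkpoint being saved whenever the validation loss reaches a new minimum. The `train` function is defined earlier in the notebook, so the sketch below is only an assumption about that rule, and `best_checkpoint_epochs` is a hypothetical helper. The losses are the first seven validation losses copied from the log.

```python
def best_checkpoint_epochs(valid_losses):
    """Return the epochs at which the validation loss reaches a new minimum."""
    best = float("inf")
    epochs = []
    for epoch, loss in enumerate(valid_losses):
        if loss < best:
            best = loss
            epochs.append(epoch)
    return epochs


# Validation losses for epochs 0 to 6, taken from the training log.
valid_losses = [0.007379, 0.007278, 0.007979, 0.008190, 0.007172, 0.007573, 0.008480]
print(best_checkpoint_epochs(valid_losses))  # [0, 1, 4]
```

After epoch 4 the training loss keeps falling while the validation loss rises, which is the usual sign of overfitting, so the checkpoint saved at epoch 4 is the one this rule would keep.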

In [None]:
predict(cls, filename, dataloader_test, run=run)

****************************************
Acurácia de 84.760 %
****************************************
