<a href="https://colab.research.google.com/github/leolellisr/npl_natural_language_processing_projects/blob/main/07_Auto_Attention_IMDB_Binary_Classification/08_Auto_Attention_IMDB_Binary_Classification_Attention_is_All_You_need.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference notebook

Name: Leonardo de Lellis Rossi

https://app.neptune.ai/leolellisr/nlp-imbd-large/e/NIMBL-52/charts

## Instructions:

Train a binary classification model on the IMDB dataset (20k/5k training/validation samples) and measure its accuracy.
The model must have a complete self-attention layer, identical to the one in the paper "Attention is All You Need".

Implement IMDB sentiment analysis, as in last week's exercise, but now using "complete" attention:
- Position embeddings
- Linear projections (WQ, WK, WV, WO)
- Scaled dot-product
- Multi-head
- Layer normalization
- Residual connections
- Feed-forward layer (2-layer MLP)

Note that dropout must not be used, otherwise the asserts will not pass.

Only the matrix implementation needs to be submitted; the loop-based version is not required.

Use pretrained GloVe embeddings as input to the self-attention layer. Remember to freeze them, since otherwise the model may overfit.

The efficiency/speed of the implementation will also be taken into account when the exercise is graded.

Tips:
- The main difficulty of this exercise is implementing self-attention in matrix form over minibatches. To handle examples of variable length, truncate them and apply padding.

- Avoid any loop in the matrix implementation, since it would make it very inefficient.
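
For reference, the components listed above correspond to the following formulas from "Attention is All You Need". The symbols follow the paper's notation rather than names used in this notebook: $d_k$ is the per-head dimension, $h$ the number of heads, and $W_1, b_1, W_2, b_2$ the feed-forward parameters. The paper also applies dropout, which is left out here as the instructions require.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

$$
\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O
$$

$$
\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2
$$

Each sublayer (attention or feed-forward) is wrapped as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, which is where the residual connections and layer normalization enter.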


## Defining the parameters

In [None]:
params = {
    'vocabulary_size': 400000,
    'padding_idx': 400001,
    'max_length': 200,
    'dim': 300,
    'n_heads': 6,
}
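
As a quick sanity check on these values, the sketch below derives the quantities that the rest of the notebook relies on implicitly. The names `dim_head`, `unk_idx` and `n_rows` are illustrative and do not appear in the original cells.

```python
# Same values as the `params` dictionary above.
params = {
    'vocabulary_size': 400000,
    'padding_idx': 400001,
    'max_length': 200,
    'dim': 300,
    'n_heads': 6,
}

# Multi-head attention splits `dim` evenly across the heads.
assert params['dim'] % params['n_heads'] == 0
dim_head = params['dim'] // params['n_heads']

# GloVe occupies rows 0..vocabulary_size-1, <UNK> takes the next row,
# and the padding row comes right after it.
unk_idx = params['vocabulary_size']
n_rows = params['padding_idx'] + 1
print(dim_head, unk_idx, n_rows)  # 50 400000 400002
```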

# Fixing the seed

In [None]:
import random
import torch
import torch.nn.functional as F
import numpy as np

In [None]:
def set_seeds():
  random.seed(123)
  np.random.seed(123)
  torch.manual_seed(123)

set_seeds()
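
A minimal sketch of why the seeds are reset right before a model is created: two layers built after the same reseeding start from identical weights. The `seed` argument is an addition made for this illustration; the notebook's `set_seeds` hard-codes 123.

```python
import random

import numpy as np
import torch


def set_seeds(seed=123):
    # Seed every random number generator the notebook touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


set_seeds()
first = torch.nn.Linear(4, 3).weight.detach().clone()
set_seeds()
second = torch.nn.Linear(4, 3).weight.detach().clone()
print(torch.equal(first, second))  # True
```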

## Preparing the data

First, we download the dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples).

In [None]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training set before making the train/validation split.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
False It's nothing more than a weird coincidence that I decided to watch STARLIFT on the 59th anniversary 
False I am wanting to make a "Holmes with Doors" pun but I can't quite string it all together. Suitably gr
True All the folks who sit here and say that this movie's weak link is the Ramones would probably say tha
Last 3 training samples:
False It's somewhat telling that most of the great reviews for the film on IMDb all come from people who h
True This is a bit long (2 hours, 20 minutes) but it had a a lot of the famous Pearl Buck novel in it. In
True Surprisingly good. The acting was fun, the screenplay was fun, the music was cheesie fun, the plot w
First 3 validation samples:
True Of all the kung-fu films made through the 70's and 80's this is one that has developed a real cult f
True Excellent film dealing with the life of an old man as he looks back over the 

# Loading the GloVe embeddings

In [None]:
!wget -nc http://nlp.stanford.edu/data/glove.6B.zip
!unzip -o glove.6B.zip -d glove_dir

File ‘glove.6B.zip’ already there; not retrieving.

Archive:  glove.6B.zip
  inflating: glove_dir/glove.6B.50d.txt  
  inflating: glove_dir/glove.6B.100d.txt  
  inflating: glove_dir/glove.6B.200d.txt  
  inflating: glove_dir/glove.6B.300d.txt  


In [None]:
from torchtext.vocab import GloVe
glove_vectors = GloVe(name='6B', dim=300, cache='./glove_dir')

In [None]:
print(glove_vectors.vectors.shape)
print('First 20 words and their indices:', list(glove_vectors.stoi.items())[:20])

torch.Size([400000, 300])
First 20 words and their indices: [('the', 0), (',', 1), ('.', 2), ('of', 3), ('to', 4), ('and', 5), ('in', 6), ('a', 7), ('"', 8), ("'s", 9), ('for', 10), ('-', 11), ('that', 12), ('on', 13), ('is', 14), ('was', 15), ('said', 16), ('with', 17), ('he', 18), ('as', 19)]


In [None]:
vocab = glove_vectors.stoi
vocab['<UNK>'] = params['vocabulary_size'] # The last row is for the unknown token.

# We create a random vector for the unknown token
unk_vector = torch.FloatTensor(1, glove_vectors.vectors.shape[1]).uniform_(-0.5, 0.5)

# We create a vector of zeros for the pad token
pad_vector = torch.zeros(1, glove_vectors.vectors.shape[1])

# And add them to the embeddings matrix.
embeddings = torch.cat((glove_vectors.vectors, unk_vector, pad_vector), dim=0)

print(f'Total number of words: {len(vocab)}')
print(f'embeddings.shape: {embeddings.shape}')

Total number of words: 400001
embeddings.shape: torch.Size([400002, 300])
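
The same construction can be checked on a toy matrix. The sketch below assumes a hypothetical 3-word vocabulary with 2-dimensional vectors and shows two properties the exercise depends on: `torch.nn.Embedding.from_pretrained` freezes the weights by default, and the padding row stays at zero.

```python
import torch

# Toy stand-in for the GloVe matrix: 3 known words, 2 dimensions.
toy_vectors = torch.arange(6).reshape(3, 2).float()
unk_vector = torch.full((1, 2), 0.5)   # arbitrary values for <UNK>
pad_vector = torch.zeros(1, 2)         # zeros for the pad token
toy_embeddings = torch.cat((toy_vectors, unk_vector, pad_vector), dim=0)

unk_idx = toy_vectors.shape[0]         # 3: first row after the known words
pad_idx = unk_idx + 1                  # 4: last row

# from_pretrained uses freeze=True by default, so the vectors are not trained.
layer = torch.nn.Embedding.from_pretrained(toy_embeddings, padding_idx=pad_idx)
print(layer.weight.requires_grad)      # False
print(layer(torch.tensor([0, unk_idx, pad_idx])))
```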


# Defining the tokenizer

In [None]:
import re


def tokenize(text):
    # Lowercase every run of word characters; punctuation is dropped.
    return [token.lower() for token in re.findall(r'\w+', text)]


def to_token_ids(text, vocab, max_length, padding_idx):
    tokens = tokenize(text)[:max_length]  # Truncating.
    token_ids = []
    for token in tokens:
        # We use the id of the "<UNK>" token if we don't find it in the vocabulary.
        token_id = vocab.get(token, vocab['<UNK>'])
        token_ids.append(token_id)

    # Adding PAD tokens, if necessary.
    token_ids += [padding_idx] * max(0, max_length - len(token_ids))
    return token_ids
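
A short usage sketch of the two functions, restated compactly so the block runs on its own. `toy_vocab` and `pad_idx` are hypothetical values chosen for the illustration; the three calls exercise padding, the `<UNK>` fallback and truncation.

```python
import re


def tokenize(text):
    return [token.lower() for token in re.findall(r'\w+', text)]


def to_token_ids(text, vocab, max_length, padding_idx):
    tokens = tokenize(text)[:max_length]  # truncation
    token_ids = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    token_ids += [padding_idx] * max(0, max_length - len(token_ids))  # padding
    return token_ids


toy_vocab = {'good': 0, 'movie': 1, '<UNK>': 2}  # hypothetical vocabulary
pad_idx = 3

print(tokenize("Good movie, isn't it?"))  # ['good', 'movie', 'isn', 't', 'it']
print(to_token_ids('Good', toy_vocab, 3, pad_idx))                 # [0, 3, 3]
print(to_token_ids('good bad movie', toy_vocab, 3, pad_idx))       # [0, 2, 1]
print(to_token_ids('a good movie indeed', toy_vocab, 3, pad_idx))  # [2, 0, 1]
```

Note that `\w+` splits contractions such as "isn't" into two tokens, which differs from GloVe's own tokenization (the vocabulary contains entries like `'s`).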

## Defining the attention layer

In [None]:
# Resetting the seeds before initializing the model is recommended, since it
# guarantees that the weights are always the same.
set_seeds()

class SelfAttentionLayer(torch.nn.Module):

    def __init__(self, embeddings, padding_idx, n_heads, dim, max_length):
        super().__init__()
        # n_heads: H
        # dim: D
        # max_length: L
        # vocab_length: V
        self.n_heads = n_heads
        self.dim = dim
        self.max_length = max_length
        self.vocab_lenght = embeddings.shape[0]
        self.padding_idx = padding_idx
        self.dim_head = self.dim // self.n_heads # D / H
        self.embedding_layer = torch.nn.Embedding.from_pretrained(embeddings, padding_idx = padding_idx) # (V, D) 
        self.positional_embeddings = torch.nn.Linear(self.dim, self.max_length, bias=False) # weight: (L, D); only the weight matrix is used, as learned position embeddings
        self.W_q = torch.nn.Linear(self.dim, self.dim, bias=False) # (D, D)
        self.W_k = torch.nn.Linear(self.dim, self.dim, bias=False)
        self.W_v = torch.nn.Linear(self.dim, self.dim, bias=False)
        self.W_o = torch.nn.Linear(self.dim, self.dim, bias=False)
        self.layer_norm1  = torch.nn.LayerNorm(self.dim, eps=1e-6)
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(self.dim, self.dim),  
            torch.nn.ReLU(),
            torch.nn.Linear(self.dim, self.dim)
        )
        self.layer_norm2  = torch.nn.LayerNorm(self.dim, eps=1e-6)

    def multihead_self_attention(self, Q, K, V, padMask):        
        qKt = torch.matmul(Q, K.transpose(-1,-2)) # Q K^T: (B, H, L, L)
        padMaskExp = padMask[:, None, None, :].expand_as(qKt) # (B, L) -> (B, H, L, L)
        qKt.masked_fill_(~padMaskExp, float('-inf')) # padded keys get zero weight after the softmax
        qKt_dk = qKt / np.sqrt(self.dim_head) # scale by sqrt(D / H)
        soft_QKt_dk = F.softmax(qKt_dk, dim=-1)
        attention = torch.matmul(soft_QKt_dk, V) # (B, H, L, D/H)
        return attention

    def forward(self, x):
        padMask  = ~(x == self.padding_idx)  # (B, L): True for real tokens, False for padding
        x_size = x.shape[0]
        x_embeddings = self.embedding_layer(x) # (B, L, D)
        x_embeddings = x_embeddings + self.positional_embeddings.weight
        residual = x_embeddings

        fQ = self.W_q(x_embeddings).reshape(x_size, self.max_length, self.n_heads, self.dim_head) # (B, L, H, D/H)
        fK = self.W_k(x_embeddings).reshape(x_size, self.max_length, self.n_heads, self.dim_head)
        fV = self.W_v(x_embeddings).reshape(x_size, self.max_length, self.n_heads, self.dim_head)

        # (B, L, H, D/H) -> (B, H, L, D/H)
        fQ_transposed = fQ.transpose(1, 2)
        fK_transposed = fK.transpose(1, 2)
        fV_transposed = fV.transpose(1, 2)

        attention = self.multihead_self_attention(fQ_transposed, fK_transposed, fV_transposed, padMask)
        att_transposed = attention.transpose(1, 2).contiguous()                   # (B, L, H, D/H)
        att_reshaped = att_transposed.reshape(x_size, self.max_length, self.dim) # (B, L, D) 
        att_output = self.W_o(att_reshaped)                                  
        att_output += residual
        att_norm1 = self.layer_norm1(att_output)
        residual = att_norm1
        att_ff = self.feed_forward(att_norm1)
        att_ff += residual
        att_norm2 = self.layer_norm2(att_ff)
        att_mean = att_norm2 * padMask.unsqueeze(-1) # zero out the padded positions
        mean_embeddings = att_mean.sum(dim=1) / padMask.count_nonzero(-1).unsqueeze(1) # mean over the real tokens only: (B, D)
        return mean_embeddings
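
The two masking steps in the layer (masking padded keys before the softmax, and averaging over real tokens only at the end) can be checked in isolation. The sketch below uses random tensors and toy sizes rather than the notebook's parameters, and leaves out the projections, residuals and layer norms.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, d_head = 2, 2, 4, 3          # toy sizes, not the notebook's
pad_mask = torch.tensor([[True, True, False, False],
                         [True, True, True, True]])   # (B, L), False = pad

Q = torch.randn(B, H, L, d_head)
K = torch.randn(B, H, L, d_head)
V = torch.randn(B, H, L, d_head)

scores = Q @ K.transpose(-1, -2) / math.sqrt(d_head)            # (B, H, L, L)
scores = scores.masked_fill(~pad_mask[:, None, None, :], float('-inf'))
weights = F.softmax(scores, dim=-1)
context = weights @ V                                            # (B, H, L, d_head)

# Padded key positions receive exactly zero attention weight.
print(weights[0, :, :, 2:].abs().max().item())   # 0.0

# Merge the heads back and average over the real tokens only.
merged = context.transpose(1, 2).reshape(B, L, H * d_head)       # (B, L, D)
pooled = (merged * pad_mask.unsqueeze(-1)).sum(dim=1) / pad_mask.sum(dim=1, keepdim=True)
print(pooled.shape)   # torch.Size([2, 6])
```

One caveat of this scheme: a sequence made up entirely of padding would have every score set to `-inf`, and the softmax would then produce NaN, so empty inputs need to be handled separately.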

## Testing the implementation with "fake" embeddings

In [None]:
fake_vocab = {
    'a': 0,
    'b': 1,
    'c': 2,
    '<UNK>': 3 
}

fake_embeddings = torch.arange(0, 2 * len(fake_vocab)).reshape(len(fake_vocab), 2).float()
pad_vector = torch.zeros(1, 2)
fake_embeddings = torch.cat((fake_embeddings, pad_vector), dim=0)

fake_examples = [
    'a', # Testing PAD
    'a b',
    'a c b', # Testing truncation
    'a z', # Testing <UNK>
    ]

print(f'Total number of words: {len(fake_vocab)}')
print(f'embeddings.shape: {fake_embeddings.shape}')

Total number of words: 4
embeddings.shape: torch.Size([5, 2])


In [None]:
fake_embeddings

tensor([[0., 1.],
        [2., 3.],
        [4., 5.],
        [6., 7.],
        [0., 0.]])

In [None]:
self_attention_layer = SelfAttentionLayer(
    embeddings=fake_embeddings,
    padding_idx=4,
    dim=2,
    n_heads=2,
    max_length=2)

batch_token_ids = []
for example in fake_examples:
    token_ids = to_token_ids(
        text=example,
        vocab=fake_vocab,
        max_length=2,
        padding_idx=4)
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

In [None]:
torch.set_printoptions(precision=10)

In [None]:
my_output

tensor([[-0.9999975562,  0.9999975562],
        [-0.9999975562,  0.9999976158],
        [-0.9999975562,  0.9999974966],
        [-0.9999974966,  0.9999974966]], grad_fn=<DivBackward0>)

In [None]:
target_output = torch.FloatTensor([
    [-0.9999975562,  0.9999975562],
    [-0.9999975562,  0.9999976158],
    [-0.9999975562,  0.9999974966],
    [-0.9999974966,  0.9999974966]])

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-8)

## Testing the implementation with 8 examples from the IMDB dataset

In [None]:
examples = [
    "THE TEMP (1993) didn't do much theatrical business, but here's the direct-to-video rip-off you didn't want, anyway! Ellen Bradford (Mel Harris) is the new woman at Millennium Investments, a high scale brokerage firm, who starts getting helpful hints from wide-eyed secretary Deidre (Sheila Kelley). Deidre turns out to be an ambitious daddy's girl who will stop at nothing to move up the corporate ladder, including screwing a top broker she can't stand and murdering anyone who gets on her bad side. She digs up skeletons in Ellen's closet, tries to cause problems with her husband (Barry Bostwick), kills while making it look like she is responsible, kidnaps her daughter and tries to get her to embezzle money from the company.<br /><br />Harris and Kelley deliver competent performances, the supporting cast is alright and it's reasonably well put-together, but that doesn't fully compensate for a script that travels down a well-worn path and offers few surprises.",
    "Sondra Locke stinks in this film, but then she was an awful 'actress' anyway. Unfortunately, she drags everyone else (including then =real life boyfriend Clint Eastwood down the drain with her. But what was Clint Eastwood thinking when he agreed to star in this one? One read of the script should have told him that this one was going to be a real snorer. It's an exceptionally weak story, basically no story or plot at all. Add in bored, poor acting, even from the normally good Eastwood. There's absolutely no action except a couple arguments and as far as I was concerned, this film ranks up at the top of the heap of natural sleep enhancers. Wow! Could a film BE any more boring? I think watching paint dry or the grass grow might be more fun. A real stinker. Don't bother with this one.",
    "Judy Davis shows us here why she is one of Australia's most respected and loved actors - her portrayal of a lonely, directionless nomad is first-rate. A teenaged Claudia Karvan also gives us a glimpse of what would make her one of this country's most popular actors in years to come, with future roles in THE BIG STEAL, THE HEARTBREAK KID, DATING THE ENEMY, RISK and the acclaimed TV series THE SECRET LIFE OF US. (Incidentally, Karvan, as a child, was a young girl whose toy Panda was stolen outside a chemist's shop in the 1983 drama GOING DOWN with Tracey Mann.) If this films comes your way, make sure you see it!! Rating: 79/100. See also: HOTEL SORRENTO, RADIANCE, VACANT POSSESSION, LANTANA.",
    'New York playwright Michael Caine (as Sidney Bruhl) is 46-years-old and fading fast; as the film opens, Mr. Caine\'s latest play flops on Broadway. TV reviewers poke fun at Caine, and he gets drunk. Passing out on the Long Island Railroad lands Caine in Montauk, instead of his residence in East Hampton. Finally arriving home, Caine is comforted by tightly-attired wife Dyan Cannon (as Myra), an unfortunately high-strung heart patient. There, Caine and Ms. Cannon discuss a new play called "Deathtrap", written by hunky young Christopher Reeve (as Clifford "Cliff" Anderson), one of Caine\'s former students. The couple believe Mr. Reeve\'s "Deathtrap" is the hit needed to revive Caine\'s career.<br /><br />"The Trap Is Set\x85 For A Wickedly Funny Who\'ll-Do-It." <br /><br />Directed by Sidney Lumet, Ira Levin\'s long-running Broadway hit doesn\'t stray too far from its stage origin. The cast is enjoyable and the story\'s twists are still engrossing. One thing that did not work (for me) was the curtain call ending; surely, it played better on stage. "Deathtrap" is a fun film to watch again; the performances are dead on - but, in hindsight, the greeting Reeve gives Caine at the East Hampton train station should have been simplified to a smiling "Hello." The location isn\'t really East Hampton, but the windmill and pond look similar. And, the much ballyhooed love scene is shockingly tepid. But, the play was so good, "even a gifted director couldn\'t ruin it." And, Mr. Lumet doesn\'t disappoint.<br /><br />******** Deathtrap (3/19/82) Sidney Lumet ~ Michael Caine, Christopher Reeve, Dyan Cannon, Irene Worth',
    'Students often ask me why I choose this version of Othello. Shakespeare\'s text is strongly truncated and the film contains material which earned it an "R" rating.<br /><br />I have several reasons for using this production: First, I had not seen a depiction of the Moor that actually made me sympathetic to Othello until I saw Fishburne play him. I saw James Earl Jones and Christopher Plummer play Othello and Iago on Broadway, and it was wonderful. Plummer\'s energy was especially noticeable. But in spite of Jone\'s incredible presence both physically and vocally, the character he played just seemed too passive to illicit from me a complete emotional purgation in the Aristotelian sense. Jones, in fact, affirmed what I felt when in an interview he noted that he had played Othello as passive--seeing Iago as basically doing him over. Unfortunately this sapped my grief for the character destruction. Thus, I felt sympathy for Jone\'s Moor but not the horror over his corruption by an evil man. In contrast, Fishburne\'s Othello is a strong and vigorous figure familiar with taking action. Thus, Iago\'s temptation to actively deal with what is presented to Othello as his wife\'s unfaithfulness is a perversion of the general\'s positive quality to be active not passive.1 The horror of the story is that this good quality in Othello becomes perverted. Fishburne\'s depiction is therefore classically tragic.<br /><br />Second, Fishburne is the first black actor to play Othello in a film. Both Orsen Wells and Anthony Hopkins did fine film versions, but they were white men in black face.2 Why is this important? Why should a Black actor be the Black man on the stage?3 Certainly in Shakespeare\'s day they used black face just as they used boys to make girls. Perhaps then, the reason is the same. Female actors bring a special quality to female roles on the Shakespearian stage because they understand best what Shakespeare\'s genius was trying to present. A gifted black actor should play the moor because his experience in a white dominated culture is vital to understanding what Shakespeare\'s genius recognized: the pain of being marginalized because of race. An important theme in Othello is isolation caused by racism. Although it is a mistake to insert American racism into a Shakespearian play, there can be little doubt that racism is still working among the characters. Many, including Desdimona\'s father, think that a union between a Venetian white Christian woman and a North African black Christian man is UNNATURAL.<br /><br />Third, Shakespeare was never G rated. He never has been. His stage productions were always typified by violence and strong language. But Shakespeare\'s genius uses these elements not as sensationialism but for artistic honesty.',
    'Roeg has done some great movies, but this a turkey. It has a feel of a play written by an untalented high-school student for his class assignment. The set decoration is appealing in a somewhat surrealistic way, but the actual story is insufferable hokum.',
    "<br /><br />What is left of Planet Earth is populated by a few poor and starving rag-tag survivors. They must eat bugs and insects, or whatever, after a poison war, or something, has nearly wiped out all human civilization. In these dark times, one of the few people on Earth still able to live in comfort, we will call him the All Knowing Big Boss, has a great quest to prevent some secret spore seeds from being released into the air. It seems that the All Knowing Big Boss is the last person on Earth that knows that these spores even exist. The spores are located far away from any living soul, and they are highly protected by many layers of deadly defense systems. <br /><br />The All Knowing Big Boss wants the secret spores to remain in their secret protected containers. So, he makes a plan to send in a macho action team to remove the spore containers from all of the protective systems and secret location. Sending people to the location of secret spores makes them no longer a secret. Sending people to disable all of the protective systems makes it possible for the spores to be easily released into the air. How about letting sleeping dogs lie?! <br /><br />The one pleasant feature of ENCRYPT is the radiant and elegant Vivian Wu. As the unremarkable macho action team members drop off with mechanically paced predictable timing, engaging Vivian Wu's charm makes acceptable the plot idea of her old employer wanting her so much. She is an object of love, an object of desire -- a very believable concept!<br /><br />Fans of Vivian Wu may want to check out an outstanding B-movie she is in from a couple years back called DINNER RUSH. DINNER RUSH is highly recommended. ENCRYPT is not.",
    "So the other night I decided to watch Tales from the Hollywood Hills: Natica Jackson. Or Power, Passion, Murder as it is called in Holland. When I bought the film I noticed that Michelle Pfeiffer was starring in it and I thought that had to say something about the quality. Unfortunately, it didn't.<br /><br />1) The plot of the film is really confusing. There are two story lines running simultaneously during the film. Only they have nothing in common. Throughout the entire movie I was waiting for the moment these two story lines would come together so the plot would be clear to me. But it still hasn't.<br /><br />2) The title of the film says the film will be about Natica Jackson. Well it is, sometimes. Like said the film covers two different stories and the part about Natica Jackson is the shortest. So another title for this movie would not be a wrong choice.<br /><br />To conclude my story, I really recommend that you leave this movie where it belongs, on the shelf in the store on a place nobody can see it. By doing this you won't waste 90 minutes of your life, as I did."         
]

In [None]:
self_attention_layer = SelfAttentionLayer(
    embeddings=embeddings,
    padding_idx=params['padding_idx'],
    dim=params['dim'],
    n_heads=params['n_heads'],
    max_length=params['max_length'])

batch_token_ids = []
for example in examples:
    token_ids = to_token_ids(
        text=example,
        vocab=vocab,
        max_length=params['max_length'],
        padding_idx=params['padding_idx'])
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

We download the expected tensor and compare it with our output

In [None]:
!wget https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2021s2/aula8/target_tensor.pt

--2021-10-06 22:39:45--  https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2021s2/aula8/target_tensor.pt
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.159.128, 74.125.70.128, 74.125.69.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.159.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10347 (10K) [application/octet-stream]
Saving to: ‘target_tensor.pt.1’


2021-10-06 22:39:45 (84.7 MB/s) - ‘target_tensor.pt.1’ saved [10347/10347]



In [None]:
target_output = torch.load('target_tensor.pt')

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)

In [None]:
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

if torch.cuda.is_available():
    dev = "cuda:0"
    print(torch.cuda.get_device_name(dev))
else:
    dev = "cpu"
print(dev)
device = torch.device(dev)

Tesla K80
cuda:0


# Dataset

In [None]:
class Ex8_ds(torch.utils.data.Dataset):
    def __init__(self, x, y, vocabulary, max_length, transformer, padding_idx):
        self.x = x
        self.y = y
        self.vocabulary = vocabulary
        self.max_length = max_length
        self.transformer = transformer
        self.padding_idx = padding_idx

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        x = torch.tensor(self.transformer(self.x[index], self.vocabulary, self.max_length, self.padding_idx))
        y = torch.tensor(self.y[index]).float()
        return x, y
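
To see what a `DataLoader` yields from a dataset of this shape, here is a minimal sketch. `ToyDataset` is a hypothetical stand-in that receives already-tokenized inputs, so it does not depend on the vocabulary or tokenizer defined earlier.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in for Ex8_ds: pre-tokenized inputs and boolean labels."""

    def __init__(self, token_ids, labels):
        self.token_ids = token_ids
        self.labels = labels

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, index):
        x = torch.tensor(self.token_ids[index])       # int64 token ids
        y = torch.tensor(self.labels[index]).float()  # 0.0 / 1.0, as BCELoss expects
        return x, y


toy_ds = ToyDataset(
    token_ids=[[0, 1, 4], [2, 4, 4], [3, 0, 1], [1, 1, 4], [0, 4, 4]],
    labels=[True, False, True, False, True])
loader = DataLoader(toy_ds, batch_size=2, shuffle=False)

batches = list(loader)
for x, y in batches:
    print(x.shape, x.dtype, y.shape, y.dtype)
```

Because every example is padded to the same length, the default collate function can stack them directly; the last batch is simply smaller when the dataset size is not a multiple of the batch size.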

# Defining the model

In [None]:
set_seeds()

In [None]:
class Ex8_model(torch.nn.Module):
    def __init__(self, hidden, embeddings, padding_idx, dim, n_heads, max_lenght):
        super(Ex8_model, self).__init__()
        self.attentional = SelfAttentionLayer(embeddings=embeddings, padding_idx=padding_idx, dim=dim, n_heads=n_heads, max_length=max_lenght)
        self.linear = nn.Sequential(
            nn.Linear(embeddings.shape[1], hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid()
        )
          
    def forward(self, x):
        att_embeddings = self.attentional(x)
        output = self.linear(att_embeddings)
        return output.squeeze(-1)  # (B, 1) -> (B,); keeps the batch dimension even when B == 1
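
The model ends in a `Sigmoid` and is trained with `nn.BCELoss`. A commonly used alternative, not what this notebook does, is to output raw logits and use `nn.BCEWithLogitsLoss`, which fuses the two steps and is generally more stable for large-magnitude logits. The sketch below, on made-up logits and targets, only shows that the two formulations agree on ordinary inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)                      # hypothetical pre-sigmoid scores
targets = torch.randint(0, 2, (8,)).float()  # hypothetical binary labels

loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)
print(torch.allclose(loss_probs, loss_logits, atol=1e-6))  # True
```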

In [None]:
! pip install neptune-client



In [None]:
import os
import neptune.new as neptune

# Read the API token from the NEPTUNE_API_TOKEN environment variable instead of hard-coding it.
run = neptune.init(project='leolellisr/nlp-imbd-large', api_token=os.getenv('NEPTUNE_API_TOKEN'))

https://app.neptune.ai/leolellisr/nlp-imbd-large/e/NIMBL-52
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


# Train Loop

In [None]:
def train_loop(dataloader_train, dataloader_val, hyperparameters, model):
    min_val_loss = float('inf')
    best_model = 'best_model.pt'
    criterion = nn.BCELoss()
    # Adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=hyperparameters['learning_rate'])
    best_epoch = 0

    for epoch in range(hyperparameters['n_epochs']):
        train_loss = 0
        model.train()
        for x_train, y_train in dataloader_train:
            # Move the batch to the selected device.
            x_train = x_train.to(device)
            y_train = y_train.to(device)

            outputs = model(x_train)

            # Batch loss (mean over the batch).
            batch_loss = criterion(outputs, y_train)

            # Reset gradients, backpropagate, take an optimizer step and accumulate the loss.
            optimizer.zero_grad()
            batch_loss.backward()
            optimizer.step()
            train_loss += batch_loss.item()
            run[f'{hyperparameters["mode"]}_train/batch_loss'].log(batch_loss.item())

        # Sum of per-batch mean losses divided by the number of samples, so the
        # reported value is roughly the mean loss divided by the batch size.
        train_loss = train_loss / len(dataloader_train.dataset)
        run[f'{hyperparameters["mode"]}_train/train_loss'].log(train_loss)

        # Validation (end of epoch).
        total_loss = 0
        total_acc = 0
        model.eval()
        with torch.no_grad():
            for x_val, y_val in dataloader_val:
                x_val = x_val.to(device)
                y_val = y_val.to(device)

                # Predict.
                outputs = model(x_val)

                # Batch loss and thresholded predictions.
                batch_loss = criterion(outputs, y_val)
                preds = outputs > 0.5
                total_loss += batch_loss

                batch_acc = (preds == y_val).sum()
                total_acc += batch_acc
        val_loss = total_loss / len(dataloader_val.dataset)
        run[f'{hyperparameters["mode"]}_val/val_loss'].log(val_loss)

        val_acc = total_acc / len(dataloader_val.dataset)
        run[f'{hyperparameters["mode"]}_val/val_accuracy'].log(val_acc)

        print(f'Model: {hyperparameters["mode"]}, Epoch: {epoch+1}/{hyperparameters["n_epochs"]} - train_loss: {train_loss} - val_loss: {val_loss} - acc: {val_acc*100} %')

        # Save the best model so far (lowest validation loss).
        if val_loss < min_val_loss:
            torch.save(model.state_dict(), best_model)
            min_val_loss = val_loss
            best_epoch = epoch
            print(f'Model: {hyperparameters["mode"]} - best model in epoch: {best_epoch+1}')
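
A note on the scale of the reported losses: `nn.BCELoss` averages over each batch by default, and the loop sums those batch means and then divides by the number of samples. The sketch below, on made-up probabilities and labels with toy sizes, illustrates how that differs from dividing by the number of batches. This is presumably why the loss values logged during training are on the order of 1e-3 rather than near typical binary cross-entropy values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_samples, batch_size = 12, 4                # toy sizes
probs = torch.rand(n_samples).clamp(0.05, 0.95)
targets = torch.randint(0, 2, (n_samples,)).float()
criterion = nn.BCELoss()                     # reduction='mean' by default

summed = 0.0
n_batches = 0
for start in range(0, n_samples, batch_size):
    batch_probs = probs[start:start + batch_size]
    batch_targets = targets[start:start + batch_size]
    summed += criterion(batch_probs, batch_targets).item()
    n_batches += 1

per_sample_mean = criterion(probs, targets).item()
print(per_sample_mean)
print(summed / n_batches)    # matches per_sample_mean when all batches have equal size
print(summed / n_samples)    # smaller by a factor of batch_size
```

Since the divisor is constant across epochs, the choice does not affect which epoch is selected as best; it only changes the scale of the numbers.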

In [None]:
def predict(model, dataloader_test):
    criterion = nn.BCELoss()

    # Evaluate the checkpoint saved by train_loop.
    best_model = 'best_model.pt'
    model.load_state_dict(torch.load(best_model))
    model.eval()
    model.to(device)
    floss = 0
    total_acc = 0
    with torch.no_grad():
        for x_t, y_t in dataloader_test:
            x_t = x_t.to(device)
            y_t = y_t.to(device)

            outputs = model(x_t)
            loss = criterion(outputs, y_t)
            floss += loss
            preds = outputs > 0.5

            batch_acc = (preds == y_t).sum()
            total_acc += batch_acc

    test_acc = total_acc / len(dataloader_test.dataset)
    return {
        'loss': floss / len(dataloader_test.dataset),
        'acc': test_acc
    }

In [None]:
# Transform list to dict
x_train = {num: i for num, i in enumerate(x_train)}
y_train = {num: i for num, i in enumerate(y_train)}
x_valid = {num: i for num, i in enumerate(x_valid)}
y_valid = {num: i for num, i in enumerate(y_valid)}
x_test = {num: i for num, i in enumerate(x_test)}
y_test = {num: i for num, i in enumerate(y_test)}

In [None]:
set_seeds()

In [None]:
from torch.utils.data import DataLoader


In [None]:
hyperparameters = {
    "mode": "210930_Aula8",
    "learning_rate": 1e-4,
    "n_epochs": 20,
    "batch_size": 128,
    "hidden_size": 1024,
}
model = Ex8_model(hyperparameters["hidden_size"], embeddings,  params['padding_idx'], params['dim'], params['n_heads'], params['max_length'])
model.to(device) 
print(model)

Ex8_model(
  (attentional): SelfAttentionLayer(
    (embedding_layer): Embedding(400002, 300, padding_idx=400001)
    (positional_embeddings): Linear(in_features=300, out_features=200, bias=False)
    (W_q): Linear(in_features=300, out_features=300, bias=False)
    (W_k): Linear(in_features=300, out_features=300, bias=False)
    (W_v): Linear(in_features=300, out_features=300, bias=False)
    (W_o): Linear(in_features=300, out_features=300, bias=False)
    (layer_norm1): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
    (feed_forward): Sequential(
      (0): Linear(in_features=300, out_features=300, bias=True)
      (1): ReLU()
      (2): Linear(in_features=300, out_features=300, bias=True)
    )
    (layer_norm2): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
  )
  (linear): Sequential(
    (0): Linear(in_features=300, out_features=1024, bias=True)
    (1): ReLU()
    (2): Linear(in_features=1024, out_features=1, bias=True)
    (3): Sigmoid()
  )
)


In [None]:
train_ds = Ex8_ds(x_train, y_train, vocab, params['max_length'], to_token_ids, params['padding_idx'])
val_ds = Ex8_ds(x_valid, y_valid, vocab, params['max_length'], to_token_ids, params['padding_idx'])
dataloader_train = DataLoader(train_ds, batch_size=hyperparameters['batch_size'], shuffle=True)
dataloader_val = DataLoader(val_ds, batch_size=hyperparameters['batch_size'], shuffle=False)  

In [None]:
train_loop(dataloader_train, dataloader_val, hyperparameters, model)   

Model: 210930_Aula8, Epoch: 1/20 - train_loss: 0.004256534589827061 - val_loss: 0.0033306549303233624 - acc: 81.6199951171875 %
Model: 210930_Aula8 - best model in epoch: 1
Model: 210930_Aula8, Epoch: 2/20 - train_loss: 0.002972125410288572 - val_loss: 0.003020460717380047 - acc: 83.5199966430664 %
Model: 210930_Aula8 - best model in epoch: 2
Model: 210930_Aula8, Epoch: 3/20 - train_loss: 0.0028059385791420937 - val_loss: 0.0029770592227578163 - acc: 83.80000305175781 %
Model: 210930_Aula8 - best model in epoch: 3
Model: 210930_Aula8, Epoch: 4/20 - train_loss: 0.002752784412354231 - val_loss: 0.003158652689307928 - acc: 82.04000091552734 %
Model: 210930_Aula8, Epoch: 5/20 - train_loss: 0.0026589181378483774 - val_loss: 0.002874917583540082 - acc: 84.37999725341797 %
Model: 210930_Aula8 - best model in epoch: 5
Model: 210930_Aula8, Epoch: 6/20 - train_loss: 0.0026126854710280894 - val_loss: 0.002819853601977229 - acc: 84.6199951171875 %
Model: 210930_Aula8 - best model in epoch: 6
Model

In [None]:
del train_ds
del val_ds
del dataloader_train
del dataloader_val

In [None]:
test_ds = Ex8_ds(x_test, y_test, vocab, params['max_length'], to_token_ids, params['padding_idx'])
dataloader_test = DataLoader(test_ds, batch_size=hyperparameters['batch_size'], shuffle=False)  
print(predict(model,dataloader_test))

{'loss': tensor(0.0026017819, device='cuda:0'), 'acc': tensor(0.8535599709, device='cuda:0')}


In [None]:
run.stop()

Shutting down background jobs, please wait a moment...
Done!


Waiting for the remaining 3 operations to synchronize with Neptune. Do not kill this process.


All 3 operations synced, thanks for waiting!
