<a href="https://colab.research.google.com/github/leolellisr/npl_natural_language_processing_projects/blob/main/07_Auto_Attention_IMDB_Binary_Classification/07_Auto_Attention_IMDB_Binary_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference notebook

Name: Leonardo de Lellis Rossi

https://app.neptune.ai/leolellisr/nlp-imbd-large/e/NIMBL-50/charts

## Instructions:

Train a binary classification model on the IMDB dataset (20k/5k training/validation samples) and measure its accuracy.
The model must include a simplified self-attention layer identical to the one presented on slide 96.

Two implementations of the self-attention layer must be delivered, as presented on slide 100:
1. Using loops (inefficient, but good for learning)
2. Matrix-based

Pretrained embeddings (GloVe) must be used as input to the self-attention layer. Remember to freeze them; otherwise the model may overfit.

When grading the exercise, we will also pay attention to the efficiency/speed of the implementations.

Tips:
- The difficulty of this exercise is implementing self-attention in matrix form with minibatches. To handle examples of variable length, truncate them and apply padding.

- Avoid any loop in the matrix implementation, since it would make it very inefficient.
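
As a reading aid, the simplified self-attention can be written out explicitly. This is a sketch inferred from the two implementations later in this notebook, not a transcription of the slide; the symbols $e_i$, $s_{ij}$, $p_{ij}$, $\tilde{e}_i$, $\bar{e}$ and $L$ are notation assumed here. For one example whose $L$ non-padding tokens have frozen GloVe embeddings $e_1, \dots, e_L$:

```latex
\begin{aligned}
s_{ij} &= e_i^\top e_j \\
p_{ij} &= \frac{\exp(s_{ij})}{\sum_{k=1}^{L} \exp(s_{ik})} \\
\tilde{e}_i &= \sum_{j=1}^{L} p_{ij}\, e_j \\
\bar{e} &= \frac{1}{L} \sum_{i=1}^{L} \tilde{e}_i
\end{aligned}
```

Padding positions are left out of every sum, and $\bar{e}$ is the vector handed to the classifier. There are no learned query, key or value projections in this simplified form: the embeddings play all three roles.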

## Defining the parameters

In [None]:
params = {
    'vocabulary_size': 400000,
    'padding_idx': 400001,
    'max_length': 200,
}

# Fixing the seed

In [None]:
import random
import torch
import torch.nn.functional as F
import numpy as np

In [None]:
def set_seeds():
  random.seed(123)
  np.random.seed(123)
  torch.manual_seed(123)

set_seeds()
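
A small illustration of why the seed matters here: the train/validation split later in this notebook depends on `random.shuffle`, so the same seed should yield the same permutation. The sketch below uses only the standard library; `shuffled_ids` is a hypothetical helper, not part of the notebook.

```python
import random


def shuffled_ids(n, seed):
    """Return a shuffled copy of range(n) using a fixed seed."""
    random.seed(seed)
    ids = list(range(n))
    random.shuffle(ids)
    return ids


first = shuffled_ids(10, seed=123)
second = shuffled_ids(10, seed=123)

print(first == second)                   # same seed, same permutation
print(sorted(first) == list(range(10)))  # shuffling only reorders the items
```

Note that `os.listdir` returns file names in arbitrary order, so the split is reproducible only if the listing order is also the same; sorting the file names before shuffling would remove that dependency.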

## Preparing the data

First, we download the dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

--2021-09-27 14:11:55--  http://files.fast.ai/data/aclImdb.tgz
Resolving files.fast.ai (files.fast.ai)... 104.26.2.19, 104.26.3.19, 172.67.69.159, ...
Connecting to files.fast.ai (files.fast.ai)|104.26.2.19|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://files.fast.ai/data/aclImdb.tgz [following]
--2021-09-27 14:11:56--  https://files.fast.ai/data/aclImdb.tgz
Connecting to files.fast.ai (files.fast.ai)|104.26.2.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145982645 (139M) [application/x-gtar-compressed]
Saving to: ‘aclImdb.tgz’


2021-09-27 14:12:09 (10.9 MB/s) - ‘aclImdb.tgz’ saved [145982645/145982645]



## Loading the dataset

We will artificially create a training split (20k examples) and a validation split (5k examples).

In [None]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training set before making the train/validation split.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
False Wow, a movie about NYC politics seemingly written by someone who has never set foot in NYC. You know
False I wrote a review of this movie further down after buying it on DVD and being sorely disapointed.<br 
True This film is available from David Shepard and Kino on the Before Hollywood There Was Fort Lee, NJ, a
Last 3 training samples:
False Yes, it can be done. John De Bello and Costa Dillon cleaned out the garbage of their minds and come 
True This is a good movie, although people unfamiliar with the Modesty Blaise comics and books may find i
True Four teenage girlfriends drive to Fort Laurdale for spring break.Unfortunately they get a flat tire 
First 3 validation samples:
True Hold Your Man finds Jean Harlow, working class girl from Brooklyn falling for con man Clark Gable an
True Like his elder brothers, Claude Sautet and Jean-Pierre Melville, Alain Cornea

# Loading the GloVe embeddings

In [None]:
!wget -nc http://nlp.stanford.edu/data/glove.6B.zip
!unzip -o glove.6B.zip -d glove_dir

--2021-09-27 14:12:44--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-09-27 14:12:45--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-09-27 14:12:45--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
from torchtext.vocab import GloVe
glove_vectors = GloVe(name='6B', dim=300, cache='./glove_dir')

100%|█████████▉| 399999/400000 [00:51<00:00, 7824.12it/s]


In [None]:
print(glove_vectors.vectors.shape)
print('First 20 words and their indices:', list(glove_vectors.stoi.items())[:20])

torch.Size([400000, 300])
First 20 words and their indices: [('the', 0), (',', 1), ('.', 2), ('of', 3), ('to', 4), ('and', 5), ('in', 6), ('a', 7), ('"', 8), ("'s", 9), ('for', 10), ('-', 11), ('that', 12), ('on', 13), ('is', 14), ('was', 15), ('said', 16), ('with', 17), ('he', 18), ('as', 19)]


In [None]:
vocab = glove_vectors.stoi
vocab['<UNK>'] = params['vocabulary_size'] # The last row is for the unknown token.

# We create a random vector for the unknown token
unk_vector = torch.FloatTensor(1, glove_vectors.vectors.shape[1]).uniform_(-0.5, 0.5)

# We create a vector of zeros for the pad token
pad_vector = torch.zeros(1, glove_vectors.vectors.shape[1])

# And add them to the embeddings matrix.
embeddings = torch.cat((glove_vectors.vectors, unk_vector, pad_vector), dim=0)

print(f'Total number of words: {len(vocab)}')
print(f'embeddings.shape: {embeddings.shape}')

Total number of words: 400001
embeddings.shape: torch.Size([400002, 300])
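
The index layout can be checked on a toy vocabulary. The sketch below mirrors the construction above with plain Python lists instead of tensors; `toy_vocab`, `toy_vectors` and the 3-word vocabulary are made up for illustration.

```python
# Toy stand-in for GloVe: 3 words with 2-dimensional vectors.
toy_vocab = {'the': 0, 'of': 1, 'to': 2}
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

vocabulary_size = len(toy_vectors)   # plays the role of params['vocabulary_size']
unk_idx = vocabulary_size            # first extra row: unknown token
padding_idx = vocabulary_size + 1    # second extra row: padding token

toy_vocab['<UNK>'] = unk_idx
unk_vector = [0.25, -0.25]           # the notebook draws this row at random
pad_vector = [0.0, 0.0]              # zeros, so PAD rows add nothing to sums

toy_embeddings = toy_vectors + [unk_vector, pad_vector]

print(len(toy_vocab))                # 4: PAD is never looked up by name
print(len(toy_embeddings))           # 5 rows: words + UNK + PAD
print(toy_embeddings[padding_idx])   # [0.0, 0.0]
```

The same reasoning explains why `len(vocab)` is 400001 while the embedding matrix has 400002 rows: the padding row has an index but no vocabulary entry.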


# Defining the tokenizer

In [None]:
import re


def tokenize(text):
    # Raw string avoids the invalid-escape warning for '\w'.
    return [token.lower() for token in re.findall(r'\w+', text)]


def to_token_ids(text, vocab, max_length, padding_idx):
    tokens = tokenize(text)[:max_length]  # Truncating.
    token_ids = []
    for token in tokens:
        # We use the id of the "<UNK>" token if we don't find it in the vocabulary.
        token_id = vocab.get(token, vocab['<UNK>'])
        token_ids.append(token_id)

    # Adding PAD tokens, if necessary.
    token_ids += [padding_idx] * max(0, max_length - len(token_ids))
    return token_ids
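
A quick usage sketch of the two functions, re-declared so the snippet runs on its own. The toy vocabulary matches the one used in the tests further down; the example sentence is invented.

```python
import re


def tokenize(text):
    return [token.lower() for token in re.findall(r'\w+', text)]


def to_token_ids(text, vocab, max_length, padding_idx):
    tokens = tokenize(text)[:max_length]  # truncate
    token_ids = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    token_ids += [padding_idx] * (max_length - len(token_ids))  # pad
    return token_ids


toy_vocab = {'a': 0, 'b': 1, 'c': 2, '<UNK>': 3}

print(tokenize("It's GREAT, isn't it?"))
# ['it', 's', 'great', 'isn', 't', 'it']
print(to_token_ids('a', toy_vocab, max_length=2, padding_idx=4))      # [0, 4]
print(to_token_ids('a c b', toy_vocab, max_length=2, padding_idx=4))  # [0, 2]
print(to_token_ids('a z', toy_vocab, max_length=2, padding_idx=4))    # [0, 3]
```

Because `\w+` drops punctuation, contractions are split into pieces such as `s` and `t`, which are then looked up in the vocabulary like any other token.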

## Defining the model

In [None]:
class SelfAttentionLayerLoop(torch.nn.Module):

    def __init__(self, embeddings, padding_idx):
        super(SelfAttentionLayerLoop, self).__init__()
        self.C = embeddings
        self.padding_idx = padding_idx

    def forward(self, batch_token_ids):
        batchE = []
        for token_ids in batch_token_ids:
          tokenE = []

          for q in token_ids:
            if q == self.padding_idx: continue # Skip PAD values

            scores = []
            for k in token_ids:
              if k == self.padding_idx:
                scores.append(-1e9)
                continue
              score = torch.matmul(self.C[q], self.C[k])   # q and k similarity
              scores.append(score)

            scores = torch.tensor(scores)
            probs = F.softmax(scores, dim=0)  # probabilities

            new_embedding = 0
            for v, p in zip(token_ids, probs):
              new_embedding += self.C[v]*p
            tokenE.append(new_embedding)

          tokenE = torch.stack(tokenE)
          tokenE = torch.mean(tokenE, dim=0)
          batchE.append(tokenE)

        mean_embeddings = torch.stack(batchE)
        return mean_embeddings

In [None]:
class SelfAttentionLayerMatrix(torch.nn.Module):

    def __init__(self, embeddings, padding_idx):
        super(SelfAttentionLayerMatrix, self).__init__()
        self.C = torch.nn.Embedding.from_pretrained(embeddings, padding_idx=padding_idx)

    def forward(self, batch_token_ids):
        Q = K = V = self.C(batch_token_ids)

        # Thanks to colleague Talles Viana for the pad-mask code
        dim1Q = Q.shape[1]
        padMask = Q.abs().sum(dim=-1) == 0
        padValue = Q[padMask]
        padMaskScores = padMask.unsqueeze(1).repeat(1, dim1Q, 1)

        scores = torch.matmul(Q, K.transpose(1, 2))   # q and k similarity
        scores[padMaskScores] = -1e9

        probs = F.softmax(scores, dim=-1)  # probabilities

        mean_embeddings = torch.matmul(probs, V)
        mean_embeddings[padMask] = padValue
        mean_embeddings = mean_embeddings.sum(dim=1)

        non_pad = (~padMask).count_nonzero(dim=-1)

        mean_embeddings = mean_embeddings / non_pad.unsqueeze(-1)

        return mean_embeddings
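
To see why writing `-1e9` into the padded columns works, here is a plain-Python sketch of a masked softmax over one row of scores. `masked_softmax` is a hypothetical helper written for this illustration; the layers above rely on `F.softmax` instead.

```python
import math


def masked_softmax(scores, is_pad, fill=-1e9):
    """Softmax over `scores`, giving padded positions a huge negative score."""
    masked = [fill if pad else s for s, pad in zip(scores, is_pad)]
    top = max(masked)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# One query scored against three keys; the last key is padding.
probs = masked_softmax([1.0, 3.0, 0.0], is_pad=[False, False, True])
print(probs)
```

The exponential of the filled score underflows to zero, so the padded key gets no weight and the remaining weights equal a softmax over the real tokens alone.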

## Testing the implementation with "fake" embeddings

In [None]:
fake_vocab = {
    'a': 0,
    'b': 1,
    'c': 2,
    '<UNK>': 3 
}

fake_embeddings = torch.arange(0, 2 * len(fake_vocab)).reshape(len(fake_vocab), 2).float()
pad_vector = torch.zeros(1, 2)
fake_embeddings = torch.cat((fake_embeddings, pad_vector), dim=0)

fake_examples = [
    'a', # Testing PAD
    'a b',
    'a c b', # Testing truncation
    'a z', # Testing <UNK>
    ]

print(f'Total number of words: {len(fake_vocab)}')
print(f'embeddings.shape: {fake_embeddings.shape}')

Total number of words: 4
embeddings.shape: torch.Size([5, 2])


In [None]:
fake_embeddings

tensor([[0., 1.],
        [2., 3.],
        [4., 5.],
        [6., 7.],
        [0., 0.]])

In [None]:
self_attention_layer = SelfAttentionLayerLoop(
    embeddings=fake_embeddings,
    padding_idx=4)

batch_token_ids = []
for example in fake_examples:
    token_ids = to_token_ids(
        text=example,
        vocab=fake_vocab,
        max_length=2,
        padding_idx=4)
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

In [None]:
target_output = torch.FloatTensor([
    [0.00000000, 1.00000000],
    [1.88075161, 2.88075161],
    [3.96402740, 4.96402740],
    [5.99258232, 6.99258232]])

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)
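
The second and third rows of `target_output` (the examples `'a b'` and `'a c b'`, the latter truncated to `'a c'`) can be reproduced by hand. The sketch below redoes the computation with plain Python lists; the helper names `dot`, `softmax` and `simplified_self_attention` are introduced here for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def simplified_self_attention(vectors):
    """Attend every token to every token, then average the results."""
    dim = len(vectors[0])
    attended = []
    for q in vectors:
        probs = softmax([dot(q, k) for k in vectors])
        attended.append([sum(p * v[d] for p, v in zip(probs, vectors))
                         for d in range(dim)])
    return [sum(row[d] for row in attended) / len(attended) for d in range(dim)]


row_ab = simplified_self_attention([[0.0, 1.0], [2.0, 3.0]])  # 'a', 'b'
row_ac = simplified_self_attention([[0.0, 1.0], [4.0, 5.0]])  # 'a', 'c'
print(row_ab)  # compare with target_output[1]
print(row_ac)  # compare with target_output[2]
```

Both results should agree with the target rows up to float32 rounding, which is why the assertion uses `atol=1e-6`.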

In [None]:
self_attention_layer = SelfAttentionLayerMatrix(
    embeddings=fake_embeddings,
    padding_idx=4)

batch_token_ids = []
for example in fake_examples:
    token_ids = to_token_ids(
        text=example,
        vocab=fake_vocab,
        max_length=2,
        padding_idx=4)
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)

## Testing the implementation with 8 examples from the IMDB dataset

In [None]:
examples = [
    "THE TEMP (1993) didn't do much theatrical business, but here's the direct-to-video rip-off you didn't want, anyway! Ellen Bradford (Mel Harris) is the new woman at Millennium Investments, a high scale brokerage firm, who starts getting helpful hints from wide-eyed secretary Deidre (Sheila Kelley). Deidre turns out to be an ambitious daddy's girl who will stop at nothing to move up the corporate ladder, including screwing a top broker she can't stand and murdering anyone who gets on her bad side. She digs up skeletons in Ellen's closet, tries to cause problems with her husband (Barry Bostwick), kills while making it look like she is responsible, kidnaps her daughter and tries to get her to embezzle money from the company.<br /><br />Harris and Kelley deliver competent performances, the supporting cast is alright and it's reasonably well put-together, but that doesn't fully compensate for a script that travels down a well-worn path and offers few surprises.",
    "Sondra Locke stinks in this film, but then she was an awful 'actress' anyway. Unfortunately, she drags everyone else (including then =real life boyfriend Clint Eastwood down the drain with her. But what was Clint Eastwood thinking when he agreed to star in this one? One read of the script should have told him that this one was going to be a real snorer. It's an exceptionally weak story, basically no story or plot at all. Add in bored, poor acting, even from the normally good Eastwood. There's absolutely no action except a couple arguments and as far as I was concerned, this film ranks up at the top of the heap of natural sleep enhancers. Wow! Could a film BE any more boring? I think watching paint dry or the grass grow might be more fun. A real stinker. Don't bother with this one.",
    "Judy Davis shows us here why she is one of Australia's most respected and loved actors - her portrayal of a lonely, directionless nomad is first-rate. A teenaged Claudia Karvan also gives us a glimpse of what would make her one of this country's most popular actors in years to come, with future roles in THE BIG STEAL, THE HEARTBREAK KID, DATING THE ENEMY, RISK and the acclaimed TV series THE SECRET LIFE OF US. (Incidentally, Karvan, as a child, was a young girl whose toy Panda was stolen outside a chemist's shop in the 1983 drama GOING DOWN with Tracey Mann.) If this films comes your way, make sure you see it!! Rating: 79/100. See also: HOTEL SORRENTO, RADIANCE, VACANT POSSESSION, LANTANA.",
    'New York playwright Michael Caine (as Sidney Bruhl) is 46-years-old and fading fast; as the film opens, Mr. Caine\'s latest play flops on Broadway. TV reviewers poke fun at Caine, and he gets drunk. Passing out on the Long Island Railroad lands Caine in Montauk, instead of his residence in East Hampton. Finally arriving home, Caine is comforted by tightly-attired wife Dyan Cannon (as Myra), an unfortunately high-strung heart patient. There, Caine and Ms. Cannon discuss a new play called "Deathtrap", written by hunky young Christopher Reeve (as Clifford "Cliff" Anderson), one of Caine\'s former students. The couple believe Mr. Reeve\'s "Deathtrap" is the hit needed to revive Caine\'s career.<br /><br />"The Trap Is Set\x85 For A Wickedly Funny Who\'ll-Do-It." <br /><br />Directed by Sidney Lumet, Ira Levin\'s long-running Broadway hit doesn\'t stray too far from its stage origin. The cast is enjoyable and the story\'s twists are still engrossing. One thing that did not work (for me) was the curtain call ending; surely, it played better on stage. "Deathtrap" is a fun film to watch again; the performances are dead on - but, in hindsight, the greeting Reeve gives Caine at the East Hampton train station should have been simplified to a smiling "Hello." The location isn\'t really East Hampton, but the windmill and pond look similar. And, the much ballyhooed love scene is shockingly tepid. But, the play was so good, "even a gifted director couldn\'t ruin it." And, Mr. Lumet doesn\'t disappoint.<br /><br />******** Deathtrap (3/19/82) Sidney Lumet ~ Michael Caine, Christopher Reeve, Dyan Cannon, Irene Worth',
    'Students often ask me why I choose this version of Othello. Shakespeare\'s text is strongly truncated and the film contains material which earned it an "R" rating.<br /><br />I have several reasons for using this production: First, I had not seen a depiction of the Moor that actually made me sympathetic to Othello until I saw Fishburne play him. I saw James Earl Jones and Christopher Plummer play Othello and Iago on Broadway, and it was wonderful. Plummer\'s energy was especially noticeable. But in spite of Jone\'s incredible presence both physically and vocally, the character he played just seemed too passive to illicit from me a complete emotional purgation in the Aristotelian sense. Jones, in fact, affirmed what I felt when in an interview he noted that he had played Othello as passive--seeing Iago as basically doing him over. Unfortunately this sapped my grief for the character destruction. Thus, I felt sympathy for Jone\'s Moor but not the horror over his corruption by an evil man. In contrast, Fishburne\'s Othello is a strong and vigorous figure familiar with taking action. Thus, Iago\'s temptation to actively deal with what is presented to Othello as his wife\'s unfaithfulness is a perversion of the general\'s positive quality to be active not passive.1 The horror of the story is that this good quality in Othello becomes perverted. Fishburne\'s depiction is therefore classically tragic.<br /><br />Second, Fishburne is the first black actor to play Othello in a film. Both Orsen Wells and Anthony Hopkins did fine film versions, but they were white men in black face.2 Why is this important? Why should a Black actor be the Black man on the stage?3 Certainly in Shakespeare\'s day they used black face just as they used boys to make girls. Perhaps then, the reason is the same. Female actors bring a special quality to female roles on the Shakespearian stage because they understand best what Shakespeare\'s genius was trying to present. A gifted black actor should play the moor because his experience in a white dominated culture is vital to understanding what Shakespeare\'s genius recognized: the pain of being marginalized because of race. An important theme in Othello is isolation caused by racism. Although it is a mistake to insert American racism into a Shakespearian play, there can be little doubt that racism is still working among the characters. Many, including Desdimona\'s father, think that a union between a Venetian white Christian woman and a North African black Christian man is UNNATURAL.<br /><br />Third, Shakespeare was never G rated. He never has been. His stage productions were always typified by violence and strong language. But Shakespeare\'s genius uses these elements not as sensationialism but for artistic honesty.',
    'Roeg has done some great movies, but this a turkey. It has a feel of a play written by an untalented high-school student for his class assignment. The set decoration is appealing in a somewhat surrealistic way, but the actual story is insufferable hokum.',
    "<br /><br />What is left of Planet Earth is populated by a few poor and starving rag-tag survivors. They must eat bugs and insects, or whatever, after a poison war, or something, has nearly wiped out all human civilization. In these dark times, one of the few people on Earth still able to live in comfort, we will call him the All Knowing Big Boss, has a great quest to prevent some secret spore seeds from being released into the air. It seems that the All Knowing Big Boss is the last person on Earth that knows that these spores even exist. The spores are located far away from any living soul, and they are highly protected by many layers of deadly defense systems. <br /><br />The All Knowing Big Boss wants the secret spores to remain in their secret protected containers. So, he makes a plan to send in a macho action team to remove the spore containers from all of the protective systems and secret location. Sending people to the location of secret spores makes them no longer a secret. Sending people to disable all of the protective systems makes it possible for the spores to be easily released into the air. How about letting sleeping dogs lie?! <br /><br />The one pleasant feature of ENCRYPT is the radiant and elegant Vivian Wu. As the unremarkable macho action team members drop off with mechanically paced predictable timing, engaging Vivian Wu's charm makes acceptable the plot idea of her old employer wanting her so much. She is an object of love, an object of desire -- a very believable concept!<br /><br />Fans of Vivian Wu may want to check out an outstanding B-movie she is in from a couple years back called DINNER RUSH. DINNER RUSH is highly recommended. ENCRYPT is not.",
    "So the other night I decided to watch Tales from the Hollywood Hills: Natica Jackson. Or Power, Passion, Murder as it is called in Holland. When I bought the film I noticed that Michelle Pfeiffer was starring in it and I thought that had to say something about the quality. Unfortunately, it didn't.<br /><br />1) The plot of the film is really confusing. There are two story lines running simultaneously during the film. Only they have nothing in common. Throughout the entire movie I was waiting for the moment these two story lines would come together so the plot would be clear to me. But it still hasn't.<br /><br />2) The title of the film says the film will be about Natica Jackson. Well it is, sometimes. Like said the film covers two different stories and the part about Natica Jackson is the shortest. So another title for this movie would not be a wrong choice.<br /><br />To conclude my story, I really recommend that you leave this movie where it belongs, on the shelf in the store on a place nobody can see it. By doing this you won't waste 90 minutes of your life, as I did."         
]

In [None]:
self_attention_layer = SelfAttentionLayerLoop(
    embeddings=embeddings,
    padding_idx=params['padding_idx'])

batch_token_ids = []
for example in examples:
    token_ids = to_token_ids(
        text=example,
        vocab=vocab,
        max_length=params['max_length'],
        padding_idx=params['padding_idx'])
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

We download the expected tensor and compare it with our output.

In [None]:
!wget https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2021s2/aula7/target_tensor.pt

--2021-09-27 14:31:19--  https://storage.googleapis.com/neuralresearcher_data/unicamp/ia376e_2021s2/aula7/target_tensor.pt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.71.128, 74.125.133.128, 74.125.140.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.71.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10347 (10K) [application/octet-stream]
Saving to: ‘target_tensor.pt’


2021-09-27 14:31:19 (69.8 MB/s) - ‘target_tensor.pt’ saved [10347/10347]



In [None]:
target_output = torch.load('target_tensor.pt')

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)

In [None]:
self_attention_layer = SelfAttentionLayerMatrix(
    embeddings=embeddings,
    padding_idx=params['padding_idx'])

batch_token_ids = []
for example in examples:
    token_ids = to_token_ids(
        text=example,
        vocab=vocab,
        max_length=params['max_length'],
        padding_idx=params['padding_idx'])
    batch_token_ids.append(token_ids)

batch_token_ids = torch.LongTensor(batch_token_ids)
my_output = self_attention_layer(batch_token_ids)

In [None]:
assert torch.allclose(my_output, target_output, atol=1e-6)

# Env config

In [None]:
if torch.cuda.is_available():
    dev = "cuda:0"
    print(torch.cuda.get_device_name(dev))
else:
    dev = "cpu"
print(dev)
device = torch.device(dev)

Tesla K80
cuda:0


# Dataset

In [None]:
class Ex7_ds(torch.utils.data.Dataset):
    def __init__(self, x, y, vocabulary, max_length, padding_idx):
        self.x = x
        self.y = y
        self.vocabulary = vocabulary
        self.max_length = max_length
        self.padding_idx = padding_idx

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        tokens_idx = to_token_ids(
            self.x[index],
            self.vocabulary,
            self.max_length,
            self.padding_idx)
        tokens_idx = torch.tensor(tokens_idx, dtype=torch.long)
        outputs = torch.tensor(self.y[index], dtype=torch.long)
        return tokens_idx, outputs

# Model

In [None]:
class Ex7_model(torch.nn.Module):
    def __init__(self, input, hidden, output, embeddings, padding_idx):
        super(Ex7_model, self).__init__()
        self.attention_layer = SelfAttentionLayerMatrix(embeddings=embeddings, padding_idx=padding_idx)
        self.fst_linear_layer = torch.nn.Linear(input, hidden, device=device)        
        self.snd_linear_layer = torch.nn.Linear(hidden, output, device=device)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.attention_layer(x)
        x = self.fst_linear_layer(x)
        x = self.relu(x)
        x = self.snd_linear_layer(x)
        return x


In [None]:
hyperparameters = { "mode": "210923_Aula7", 
          "learning_rate": 1e-2,
          "n_epochs": 5,
          "batch_size": 50,
          "hidden_size": 128,
          "input_size": glove_vectors.vectors.shape[1],
          "output_size": 2 }
model = Ex7_model(hyperparameters["input_size"], hyperparameters["hidden_size"],
                  hyperparameters["output_size"], embeddings, params['padding_idx'])
model.to(device)
print(model)

Ex7_model(
  (attention_layer): SelfAttentionLayerMatrix(
    (C): Embedding(400002, 300, padding_idx=400001)
  )
  (fst_linear_layer): Linear(in_features=300, out_features=128, bias=True)
  (snd_linear_layer): Linear(in_features=128, out_features=2, bias=True)
  (relu): ReLU()
)
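
As a rough size check (arithmetic only, no PyTorch needed): `Embedding.from_pretrained` freezes its weights by default, so only the two linear layers are trained. The variable names below are chosen for this illustration.

```python
embedding_rows, embedding_dim = 400002, 300
hidden_size, output_size = 128, 2

frozen = embedding_rows * embedding_dim
first_linear = embedding_dim * hidden_size + hidden_size   # weights + bias
second_linear = hidden_size * output_size + output_size    # weights + bias
trainable = first_linear + second_linear

print(f'frozen embedding parameters: {frozen:,}')   # 120,000,600
print(f'trainable parameters: {trainable:,}')       # 38,786
```

The classifier on top of the attention layer is therefore tiny compared with the embedding table it reads from.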


# Install and config Neptune

In [None]:
! pip install neptune-client


In [None]:
import os

import neptune.new as neptune

# Read the API token from the environment instead of hard-coding it in the notebook.
run = neptune.init(project='leolellisr/nlp-imbd-large', api_token=os.environ['NEPTUNE_API_TOKEN'])

https://app.neptune.ai/leolellisr/nlp-imbd-large/e/NIMBL-50
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


# Train Loop

In [None]:
criterion = torch.nn.CrossEntropyLoss()

In [None]:
def train_loop(dataloader_train, dataloader_val, hyperparameters, model):
    min_val_loss = float('inf')
    best_model = 'best_model.pt'
    best_epoch = 0
    mode = hyperparameters['mode']

    # Adam optimizer over the model parameters
    optimizer = torch.optim.Adam(model.parameters(), lr=hyperparameters['learning_rate'])

    for epoch in range(hyperparameters['n_epochs']):
        train_loss = 0
        model.train()
        for x_train, y_train in dataloader_train:
            # Move the batch to the selected device
            x_train = x_train.to(device)
            y_train = y_train.to(device)

            outputs = model(x_train)

            # Batch loss
            batch_loss = criterion(outputs, y_train)

            # Reset gradients, backpropagate, update weights and accumulate the loss
            optimizer.zero_grad()
            batch_loss.backward()
            optimizer.step()
            train_loss += batch_loss.item()
            run[f'{mode}_train/batch_loss'].log(batch_loss.item())

        # Sum of per-batch mean losses divided by the number of samples
        train_loss = train_loss / len(dataloader_train.dataset)
        run[f'{mode}_train/train_loss'].log(train_loss)

        # Validation (end of epoch)
        total_loss = 0
        total_acc = 0
        model.eval()
        with torch.no_grad():
            for x_val, y_val in dataloader_val:
                x_val = x_val.to(device)
                y_val = y_val.to(device)

                # Predict
                outputs = model(x_val)

                # Batch loss and number of correct predictions
                batch_loss = criterion(outputs, y_val)
                preds = outputs.argmax(dim=1)
                total_loss += batch_loss.item()
                total_acc += (preds == y_val).sum()

        val_loss = total_loss / len(dataloader_val.dataset)
        run[f'{mode}_val/val_loss'].log(val_loss)

        val_acc = total_acc / len(dataloader_val.dataset)
        run[f'{mode}_val/val_acuracy'].log(val_acc.item())

        print(f'Model: {mode}, Epoch: {epoch+1}/{hyperparameters["n_epochs"]} - train_loss: {train_loss} - val_loss: {val_loss} - acc: {val_acc * 100} %')

        # Save the model with the lowest validation loss so far
        if val_loss < min_val_loss:
            torch.save(model.state_dict(), best_model)
            min_val_loss = val_loss
            best_epoch = epoch
            print(f'Model: {mode} - best model in epoch: {best_epoch+1}')


In [None]:
def predict(model, dataloader_test):
    best_model = 'best_model.pt'
    model.load_state_dict(torch.load(best_model))
    model.eval()
    model.to(device)
    floss = 0
    total_acc = 0
    with torch.no_grad():
      for x_t, y_t in dataloader_test:
        x_t = x_t.to(device)
        y_t = y_t.to(device)

        outputs = model(x_t)
        loss = criterion(outputs, y_t)
        floss += loss
        pred = outputs.argmax(dim=1)
        batch_acc = (pred == y_t).sum()
        total_acc += batch_acc
    
      test_acc = total_acc / len(dataloader_test.dataset)    
    return { 
        'loss':  floss / len(dataloader_test.dataset),
        'acc': test_acc
    }

# List to dict

In [None]:
# Transform list to dict
x_train = {num: i for num, i in enumerate(x_train)}
y_train = {num: i for num, i in enumerate(y_train)}
x_valid = {num: i for num, i in enumerate(x_valid)}
y_valid = {num: i for num, i in enumerate(y_valid)}
x_test = {num: i for num, i in enumerate(x_test)}
y_test = {num: i for num, i in enumerate(y_test)}

# Training

In [None]:
from torch.utils.data import DataLoader

In [None]:
train_ds = Ex7_ds(x_train, y_train, vocab, params['max_length'], params['padding_idx'])
val_ds = Ex7_ds(x_valid, y_valid, vocab, params['max_length'], params['padding_idx'])
dataloader_train = DataLoader(train_ds, batch_size=hyperparameters['batch_size'], shuffle=True)
dataloader_val = DataLoader(val_ds, batch_size=hyperparameters['batch_size'], shuffle=False)  

In [None]:
train_loop(dataloader_train, dataloader_val, hyperparameters, model)   

Model: 210923_Aula7, Epoch: 1/5 - train_loss: 0.009604181499779225 - val_loss: 0.008486265176534653 - acc: 80.86000061035156 %
Model: 210923_Aula7 - best model in epoch: 1
Model: 210923_Aula7, Epoch: 2/5 - train_loss: 0.008652994471788406 - val_loss: 0.00839058507680893 - acc: 80.87999725341797 %
Model: 210923_Aula7 - best model in epoch: 2
Model: 210923_Aula7, Epoch: 3/5 - train_loss: 0.008632353458553553 - val_loss: 0.008271433514356613 - acc: 81.54000091552734 %
Model: 210923_Aula7 - best model in epoch: 3
Model: 210923_Aula7, Epoch: 4/5 - train_loss: 0.008469433856010436 - val_loss: 0.008206213411688805 - acc: 81.55999755859375 %
Model: 210923_Aula7 - best model in epoch: 4
Model: 210923_Aula7, Epoch: 5/5 - train_loss: 0.008328245906531811 - val_loss: 0.008602176690101623 - acc: 80.1199951171875 %
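
The logged losses look small because each epoch total is the sum of per-batch mean losses divided by the number of samples rather than by the number of batches. Since 20000 and 5000 are both multiples of the batch size (50), multiplying by 50 recovers the average cross-entropy per example. A short sketch of that conversion, using the epoch 1 values printed above:

```python
batch_size = 50

# Values reported for epoch 1.
logged_train_loss = 0.009604181499779225
logged_val_loss = 0.008486265176534653

mean_train_loss = logged_train_loss * batch_size
mean_val_loss = logged_val_loss * batch_size

print(f'train: {mean_train_loss:.4f}')  # 0.4802
print(f'val:   {mean_val_loss:.4f}')    # 0.4243
```

Both values are below ln 2 ≈ 0.693, the loss of a binary classifier that always predicts 50/50.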


In [None]:
del train_ds
del val_ds
del dataloader_train
del dataloader_val

In [None]:
test_ds = Ex7_ds(x_test, y_test, vocab, params['max_length'], params['padding_idx'])
dataloader_test = DataLoader(test_ds, batch_size=hyperparameters['batch_size'], shuffle=False)  
print(predict(model,dataloader_test))

{'loss': tensor(0.0083, device='cuda:0'), 'acc': tensor(0.8068, device='cuda:0')}


In [None]:
run.stop()

Shutting down background jobs, please wait a moment...
Done!


Waiting for the remaining 1 operations to synchronize with Neptune. Do not kill this process.


All 1 operations synced, thanks for waiting!
