<a href="https://colab.research.google.com/github/leolellisr/npl_natural_language_processing_projects/blob/main/05_Embeddings/06_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference notebook

Name: Leonardo de Lellis Rossi

https://app.neptune.ai/leolellisr/nlp-imbd-large/e/NIMBL-47/charts

## Defining the parameters

In [None]:
params = {
    'vocabulary_size': 10000
}

# Fixing the seed

In [None]:
import random
import torch
import numpy as np

In [None]:
def set_seeds():
  random.seed(123)
  np.random.seed(123)
  torch.manual_seed(123)
  torch.cuda.manual_seed(123)
set_seeds()  
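
As a small illustration of why the seeds are fixed, the sketch below uses only the standard-library `random` module (the helper `draw` is hypothetical, not part of the notebook). Re-seeding with the same value reproduces the same sequence of draws, which is what makes the later shuffles repeatable.

```python
import random


def draw(seed, n=5):
    # Hypothetical helper: re-seed the generator, then draw n pseudo-random floats.
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(123)
second = draw(123)   # same seed, so the same sequence
other = draw(124)    # different seed, so a different sequence

print(first == second)
print(first == other)
```

The same idea applies to `np.random.seed` and `torch.manual_seed`, which control the NumPy and PyTorch generators respectively.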

## Preparing the data

First, we download the dataset:

In [None]:
!wget -nc http://files.fast.ai/data/aclImdb.tgz 
!tar -xzf aclImdb.tgz

File ‘aclImdb.tgz’ already there; not retrieving.



## Loading the dataset

We artificially create a training split (24k examples) and a validation split (1k examples).

Note: Avoid looking at the test set as much as possible, so that you are not biased toward what will be tested. In real applications, the test data only becomes available in the future, that is, when users start testing your product.

In this exercise, we use only 1,000 validation examples and 1,000 test examples because we need to run one model inference for each _word_ of the dataset.

Since the learning is unsupervised, we will not use the labels.
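
The shuffle-then-split step can be sketched on toy data as follows. The review strings, the labels, and the name `n_valid` are made up for illustration. Zipping texts and labels before shuffling keeps each label attached to its text.

```python
import random

random.seed(123)

# Toy data: 10 fake reviews, the first 5 labelled positive.
texts = [f'review {i}' for i in range(10)]
labels = [i < 5 for i in range(10)]

# Shuffle texts and labels together so the pairs stay aligned.
pairs = list(zip(texts, labels))
random.shuffle(pairs)
texts_shuf, labels_shuf = zip(*pairs)

# Hold out the last n_valid examples for validation.
n_valid = 2
train_texts, valid_texts = texts_shuf[:-n_valid], texts_shuf[-n_valid:]
train_labels, valid_labels = labels_shuf[:-n_valid], labels_shuf[-n_valid:]

# Each label still matches the review it was created with.
aligned = all((int(t.split()[1]) < 5) == l for t, l in zip(texts_shuf, labels_shuf))
print(len(train_texts), len(valid_texts), aligned)
```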

In [None]:
import os
import random


max_valid = 1000
max_test = 1000


def load_texts(folder):
    # Read every review file in the folder into a list of strings.
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Shuffle the training set before splitting it into train/validation.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

# Shuffle the test set (texts and labels together) to reduce the chance of bias in the 1000 sampled examples.
c = list(zip(x_test, y_test))
random.shuffle(c)
x_test, y_test = zip(*c)
x_test = x_test[:max_test]
y_test = y_test[:max_test]

print(len(x_train), 'training samples.')
print(len(x_valid), 'development samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

24000 training samples.
1000 development samples.
1000 test samples.
First 3 training samples:
False I'm not sure if the filmmakers were after a Saw-type movie or 12 Angry Men (people piecing together 
False A note to all of you budding film writers: Study this film. If your dialog reads like the dialog in 
True Kalifornia is the story of a writer and his girlfriend photographer who are looking for someone to h
Last 3 training samples:
False What's written on the poster is: "At birth he was given 6 years to live... At 34 he takes the journe
False "I'm a cartoon!" "You're an illustration!" what does that suppose to mean?! This plot could not be w
False Anyone who actually had the ability to sit through this movie and walk away feeling like it was a go
First 3 validation samples:
True With title like this you know you get pretty much lot of junk. Acting bad. Script bad. Director bad.
True And I'll tell you why: whoever decided to edit this movie to make it suitabl

In [None]:
print(f'Number of training words: {sum([len(item.split()) for item in x_train])}')
print(f'Number of validation words: {sum([len(item.split()) for item in x_valid])}')
print(f'Number of test words: {sum([len(item.split()) for item in x_test])}')

Number of training words: 5609325
Number of validation words: 235355
Number of test words: 230507


# Defining the vocabulary

In [None]:
import collections
import re


def tokenize(text):
    # Raw string avoids the invalid-escape warning for '\w'.
    return [token.lower() for token in re.findall(r'\w+', text)]


vocabulary = collections.Counter([token for text in x_train for token in tokenize(text)]).most_common(params['vocabulary_size'])
vocabulary = list(dict(vocabulary).keys())
print('Top 20 vocabulary tokens:')
print('\n'.join(vocabulary[:20]))

vocabulary = {token: i for i, token in enumerate(vocabulary)}


Top 20 vocabulary tokens:
the
and
a
of
to
is
br
it
in
i
this
that
s
was
as
for
with
movie
but
film
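
The tokens `br` and `s` in the list above most likely come from `<br />` HTML tags and from apostrophes (as in "it's"), since `\w+` splits on every non-word character. The sketch below reproduces the vocabulary construction on a single made-up sentence; the sentence and the size of 3 are illustrative assumptions.

```python
import collections
import re


def tokenize(text):
    # Same rule as the notebook: lowercase runs of word characters.
    return [token.lower() for token in re.findall(r'\w+', text)]


toy_texts = ["It's great.<br /><br />Loved it!"]
tokens = [token for text in toy_texts for token in tokenize(text)]
print(tokens)
# ['it', 's', 'great', 'br', 'br', 'loved', 'it']

# Keep the 3 most frequent tokens and map each one to an integer index.
most_common = collections.Counter(tokens).most_common(3)
toy_vocabulary = {token: i for i, (token, _) in enumerate(most_common)}
print(toy_vocabulary)
# {'it': 0, 'br': 1, 's': 2}
```

Tokens outside the kept set (here `great` and `loved`) have no index, which is why the later windowing step has to handle out-of-vocabulary words.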


# Imports

In [None]:
%matplotlib inline
import re
from collections import Counter, OrderedDict

import matplotlib.pyplot as plt
import numpy as np
import torch
from bs4 import BeautifulSoup
from torch.utils.data import DataLoader
from torchtext.vocab import vocab


# Input BoW - Vocab to Index

In [None]:
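def idx_vocab(text, vocab):
    # Map each token to its vocabulary index; out-of-vocabulary tokens become -1.
    idx_voc = [vocab.get(token, -1) for token in tokenize(text)]
    inputs = []
    # Slide a 5-token window: two context tokens on each side, then the centre (target) token last.
    for i in range(2, len(idx_voc) - 2):
        window = [idx_voc[i-2], idx_voc[i-1], idx_voc[i+1], idx_voc[i+2], idx_voc[i]]
        # Skip windows that contain any out-of-vocabulary token.
        if -1 not in window:
            inputs.append(window)
    # Always return shape (n_windows, 5) so np.vstack also works for texts with no valid window.
    return np.array(inputs, dtype=np.int64).reshape(-1, 5)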

In [None]:
from tqdm import tqdm
eg = ['This movie surely is amazing', 'Thats something This movie is amazing']
np.vstack([idx_vocab(txt, vocabulary) for txt in tqdm(eg)])

100%|██████████| 2/2 [00:00<00:00, 957.49it/s]


array([[  10,   17,    5,  478, 1338],
       [1558,  140,   17,    5,   10],
       [ 140,   10,    5,  478,   17]])
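
To make the windowing explicit, here is a dependency-free sketch with a made-up 7-word vocabulary (the names `toy_vocab` and `context_windows` are hypothetical). Each row holds the four context indices followed by the target index, and a single out-of-vocabulary word removes every window that touches it.

```python
def context_windows(tokens, vocab):
    # Map tokens to indices; unknown tokens become -1.
    idx = [vocab.get(token, -1) for token in tokens]
    windows = []
    for i in range(2, len(idx) - 2):
        # Four context indices first, target (centre) index last.
        window = [idx[i - 2], idx[i - 1], idx[i + 1], idx[i + 2], idx[i]]
        if -1 not in window:
            windows.append(window)
    return windows


toy_vocab = {'the': 0, 'movie': 1, 'is': 2, 'very': 3, 'good': 4, 'and': 5, 'fun': 6}

full = context_windows('the movie is very good and fun'.split(), toy_vocab)
print(full)
# [[0, 1, 3, 4, 2], [1, 2, 4, 5, 3], [2, 3, 5, 6, 4]]

# 'zzz' is out of vocabulary and falls inside every possible window here.
with_oov = context_windows('the movie is zzz good and fun'.split(), toy_vocab)
print(with_oov)
# []
```

This filtering is consistent with the training set yielding fewer windows than it has words.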

In [None]:
x_train_t = np.vstack([idx_vocab(txt, vocabulary) for txt in tqdm(x_train)])
y_train_t = x_train_t[:,4]
x_train_t = x_train_t[:,:4]

x_valid_t = np.vstack([idx_vocab(txt, vocabulary) for txt in tqdm(x_valid)])
y_valid_t = x_valid_t[:,4]
x_valid_t = x_valid_t[:,:4]

x_test_t = np.vstack([idx_vocab(txt, vocabulary) for txt in tqdm(x_test)])
y_test_t = x_test_t[:,4]
x_test_t = x_test_t[:,:4]

print('x_train: ', x_train_t.shape)
print('y_train: ', y_train_t.shape)

print('x_valid: ', x_valid_t.shape)
print('y_valid: ', y_valid_t.shape)

print('x_test: ', x_test_t.shape)
print('y_test: ', y_test_t.shape)

100%|██████████| 24000/24000 [00:50<00:00, 478.26it/s]
100%|██████████| 1000/1000 [00:02<00:00, 478.06it/s]
100%|██████████| 1000/1000 [00:02<00:00, 480.72it/s]

x_train:  (4480789, 4)
y_train:  (4480789,)
x_valid:  (186188, 4)
y_valid:  (186188,)
x_test:  (180997, 4)
y_test:  (180997,)





# Env config

In [None]:
if torch.cuda.is_available():
    dev = "cuda:0"
    print(torch.cuda.get_device_name(dev))
else:
    dev = "cpu"
print(dev)
device = torch.device(dev)

cpu


# Dataset

In [None]:
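class Ex6_ds(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.x = x  # context indices, shape (n_windows, 4)
        self.y = y  # target indices, shape (n_windows,)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        # Build integer tensors directly instead of converting through float32.
        x = torch.as_tensor(self.x[index], dtype=torch.long)
        y = torch.tensor([self.y[index]], dtype=torch.long)
        return x, y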

# Model

In [None]:
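class Ex6_model(torch.nn.Module):
    def __init__(self, vocab_size, hidden):
        super().__init__()
        # One embedding vector of size `hidden` per vocabulary token.
        self.fst_layer = torch.nn.Embedding(vocab_size, hidden, device=device, padding_idx=None)
        # Projects the combined context representation to one logit per vocabulary token.
        self.snd_linear_layer = torch.nn.Linear(hidden, vocab_size, device=device)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.fst_layer(x)      # (batch, 4, hidden)
        x = torch.sum(x, dim=1)    # sum the 4 context embeddings -> (batch, hidden)
        x = self.relu(x)
        x = self.snd_linear_layer(x)  # (batch, vocab_size) logits
        return x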


In [None]:
model = Ex6_model(len(vocabulary), 128)
model.to(device)
print(model)

Ex6_model(
  (fst_layer): Embedding(10000, 128)
  (snd_linear_layer): Linear(in_features=128, out_features=10000, bias=True)
  (relu): ReLU()
)
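
The forward pass can be traced by hand on a tiny example. The sketch below is plain Python with made-up weights (3-token vocabulary, hidden size 2); it mirrors the three steps of `forward` (embedding lookup and sum, ReLU, linear layer) but is not the PyTorch implementation itself.

```python
# Made-up parameters: 3 tokens, hidden size 2.
embedding = [[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]]   # one row per token
weight = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]       # one row per output token
bias = [0.0, 0.5, -1.0]


def forward(context):
    # 1) Look up and sum the context embeddings.
    summed = [sum(embedding[tok][d] for tok in context) for d in range(2)]
    # 2) ReLU: clamp negative components to zero.
    hidden = [max(0.0, v) for v in summed]
    # 3) Linear layer: one logit per vocabulary token.
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]


logits = forward([0, 0, 1, 2])
print(logits)
# [1.5, 0.5, 2.0]
```

Here the summed embedding is `[1.5, -2.5]`, the ReLU zeroes the second component, and token 2 receives the highest logit.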


# Install and config Neptune

In [None]:
! pip install neptune-client



In [None]:
import os
import neptune.new as neptune

# Read the API token from the NEPTUNE_API_TOKEN environment variable instead of hard-coding it in the notebook.
run = neptune.init(project='leolellisr/nlp-imbd-large', api_token=os.environ.get('NEPTUNE_API_TOKEN'))

Info (NVML): Driver Not Loaded. GPU usage metrics may not be reported. For more information, see https://docs-legacy.neptune.ai/logging-and-managing-experiment-results/logging-experiment-data.html#hardware-consumption 


https://app.neptune.ai/leolellisr/nlp-imbd-large/e/NIMBL-49
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api-reference/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


# Train loop

In [None]:
# CrossEntropyLoss as loss function
criterion = torch.nn.CrossEntropyLoss(reduction='sum')
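
Because the criterion uses `reduction='sum'`, each batch loss is a sum over examples, and dividing the accumulated total by the number of examples gives the mean loss. The sketch below re-implements cross-entropy in plain Python (natural logarithm, as in PyTorch) on made-up logits to show that relationship and the resulting perplexity; the helper `cross_entropy` is illustrative only.

```python
import math


def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed stably with the log-sum-exp trick.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
batch_targets = [0, 2]

# Equivalent of reduction='sum': add up the per-example losses.
summed = sum(cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets))
mean = summed / len(batch_targets)

# With natural-log losses, perplexity is exp(mean loss).
perplexity = math.exp(mean)
print(summed, mean, perplexity)

# A model that assigns equal logits to 10000 tokens has perplexity 10000.
uniform_perplexity = math.exp(cross_entropy([0.0] * 10000, 0))
print(round(uniform_perplexity))
```

The uniform case gives a useful reference point: with a 10,000-token vocabulary, any perplexity well below 10,000 means the model is doing better than guessing uniformly.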


In [None]:
def train_loop(dataloader_train, dataloader_val, hyperparameters, model):
    min_val_per = None
    best_model = 'best_model.pt'
    optimizer = torch.optim.Adam(model.parameters(), lr=hyperparameters['learning_rate'])
    best_epoch = 0
    mode = hyperparameters['mode']

    for epoch in tqdm(range(hyperparameters['n_epochs'])):
        train_loss = 0
        model.train()
        for x_batch, y_batch in tqdm(dataloader_train):
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)

            outputs = model(x_batch)

            # Summed batch loss; targets are flattened from (batch, 1) to (batch,).
            batch_loss = criterion(outputs, y_batch.reshape(-1))

            # Reset gradients, backpropagate, take an optimizer step and accumulate the loss.
            optimizer.zero_grad()
            batch_loss.backward()
            optimizer.step()
            train_loss += batch_loss.item()
            run[f'{mode}_train/batch_loss'].log(batch_loss.item())

        # The criterion sums over examples, so divide by the dataset size to get the mean loss.
        train_loss = train_loss / len(dataloader_train.dataset)
        run[f'{mode}_train/train_loss'].log(train_loss)

        # Validation (end of epoch).
        total_loss = 0
        model.eval()
        with torch.no_grad():
            for x_val, y_val in tqdm(dataloader_val):
                x_val = x_val.to(device)
                y_val = y_val.to(device)

                outputs = model(x_val)
                total_loss += criterion(outputs, y_val.reshape(-1)).item()

        val_loss = total_loss / len(dataloader_val.dataset)
        run[f'{mode}_val/val_loss'].log(val_loss)

        # CrossEntropyLoss uses the natural logarithm, so perplexity is exp(mean loss).
        val_perplexity = float(np.exp(val_loss))
        run[f'{mode}_val/val_perplexity'].log(val_perplexity)

        print(f'Model: {mode}, Epoch: {epoch+1}/{hyperparameters["n_epochs"]} - train_loss: {train_loss} - val_loss: {val_loss} - perplexity: {val_perplexity}')

        # Save the model whenever the validation perplexity improves.
        if min_val_per is None or val_perplexity < min_val_per:
            torch.save(model.state_dict(), best_model)
            min_val_per = val_perplexity
            best_epoch = epoch
            print(f'Model: {mode} - best model in epoch: {best_epoch+1}')


# Prediction with Test Data

In [None]:
def predict(model, dataloader_test):
    best_model = 'best_model.pt'
    # Load the checkpoint saved during training.
    model.load_state_dict(torch.load(best_model, map_location=device))
    model.eval()
    model.to(device)
    floss = 0
    with torch.no_grad():
        for x_t, y_t in dataloader_test:
            x_t = x_t.to(device)
            y_t = y_t.to(device)

            outputs = model(x_t)
            # Summed cross-entropy over the batch.
            floss += criterion(outputs, y_t.reshape(-1))
    # Mean loss per example; CrossEntropyLoss uses the natural log, so exp() gives the perplexity.
    mean_loss = floss / len(dataloader_test.dataset)

    return {
        'loss': mean_loss,
        'perplexity': torch.exp(mean_loss)
    }

# Train and Prediction 

In [None]:
hyperparameters = { "mode": 210920, 
          "learning_rate": 1e-2,
          "n_epochs": 5,
          "batch_size": 2048,
          "hidden_size": 128 }

train_ds = Ex6_ds(x_train_t, y_train_t)
val_ds = Ex6_ds(x_valid_t, y_valid_t)
dataloader_train = DataLoader(train_ds, batch_size=hyperparameters['batch_size'], shuffle=False)
dataloader_val = DataLoader(val_ds, batch_size=hyperparameters['batch_size'], shuffle=False)  
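
As a quick sanity check on the loader sizes, the number of batches per epoch follows from the window counts printed earlier and the batch size of 2048. The sketch below assumes the default `drop_last=False`, so the final, smaller batch is kept.

```python
import math

batch_size = 2048
n_train, n_valid, n_test = 4480789, 186188, 180997  # window counts printed earlier

# With drop_last=False, the last partial batch counts as a batch.
n_train_batches = math.ceil(n_train / batch_size)
n_valid_batches = math.ceil(n_valid / batch_size)
n_test_batches = math.ceil(n_test / batch_size)
last_train_batch = n_train - (n_train_batches - 1) * batch_size

print(n_train_batches, n_valid_batches, n_test_batches, last_train_batch)
# 2188 91 89 1813
```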

In [None]:
train_loop(dataloader_train, dataloader_val, hyperparameters, model)   

Model: 210920, Epoch: 1/5 - train_loss: 5.577219185058703 - val_loss: 5.2577080726623535 - perplexity: 192.0408477783203
Model: 210920 - best model in epoch: 1



Model: 210920, Epoch: 2/5 - train_loss: 5.037142806350623 - val_loss: 5.176286697387695 - perplexity: 177.0242462158203



Model: 210920, Epoch: 3/5 - train_loss: 4.891797519069047 - val_loss: 5.174109935760498 - perplexity: 176.6393280029297



Model: 210920, Epoch: 4/5 - train_loss: 4.820544560570833 - val_loss: 5.19140625 - perplexity: 179.7211151123047



Model: 210920, Epoch: 5/5 - train_loss: 4.778852806261144 - val_loss: 5.218485355377197 - perplexity: 184.65428161621094
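
Reading the five logged lines together: training loss falls every epoch, while validation perplexity reaches its minimum and then rises, which usually indicates the onset of overfitting. A minimal sketch for picking the epoch with the lowest validation perplexity from the values logged above (the dictionary is typed in by hand):

```python
# Validation perplexity per epoch, copied from the log lines above.
val_perplexity = {
    1: 192.0408477783203,
    2: 177.0242462158203,
    3: 176.6393280029297,
    4: 179.7211151123047,
    5: 184.65428161621094,
}

# Epoch whose validation perplexity is lowest.
best_epoch = min(val_perplexity, key=val_perplexity.get)
print(best_epoch, val_perplexity[best_epoch])
# 3 176.6393280029297
```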





In [None]:
test_ds = Ex6_ds(x_test_t, y_test_t)
dataloader_test = DataLoader(test_ds, batch_size=hyperparameters['batch_size'], shuffle=False)  
print(predict(model,dataloader_test))

{'loss': tensor(5.2296), 'perplexity': tensor(186.7154)}
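
The reported perplexities can be cross-checked from the reported losses, since perplexity here is the exponential of the mean natural-log loss. The sketch below recomputes it for three of the values printed above; small differences are expected because the notebook computes in float32 and the test loss is printed with only four decimals.

```python
import math

# (mean loss, reported perplexity) pairs copied from the outputs above.
reported = {
    'val epoch 1': (5.2577080726623535, 192.0408477783203),
    'val epoch 3': (5.174109935760498, 176.6393280029297),
    'test': (5.2296, 186.7154),
}

for name, (loss, perplexity) in reported.items():
    recomputed = math.exp(loss)
    print(f'{name}: exp({loss}) = {recomputed:.4f} (reported {perplexity:.4f})')

max_gap = max(abs(math.exp(loss) - ppl) for loss, ppl in reported.values())
```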


In [None]:
run.stop()

Shutting down background jobs, please wait a moment...
Done!


Waiting for the remaining 5 operations to synchronize with Neptune. Do not kill this process.


All 5 operations synced, thanks for waiting!
