<a href="https://colab.research.google.com/github/patrickctrf/IA024_2022S2/blob/main/ex03/patrick_ferreira/ex03_patrick_ferreira_175480.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from torch.utils.data import DataLoader
from tqdm import tqdm

nome = 'Patrick de Carvalho Tavares Rezende Ferreira'
print(f'Meu nome é {nome}')

Meu nome é Patrick de Carvalho Tavares Rezende Ferreira


## Instruções

- Treinar uma rede neural de duas camadas como classificador binário na tarefa de análise de sentimentos usando dataset IMDB usando TF-IDF como entrada.

Deve-se implementar o laço de treinamento e validação da rede neural.

Neste exercício usaremos o IMDB com 20k exemplos para treino, 5k para desenvolvimento e 25k para teste.

# Importando os pacotes necessários

In [2]:
import collections
import os
import random
import re
import torch
import numpy as np

# Verificando se a GPU está disponível

In [16]:
if torch.cuda.is_available():
    dev = "cuda:0"
    print(torch.cuda.get_device_name(dev))
else:
    dev = "cpu"
print(dev)
device = torch.device(dev)

cpu


## Preparando Dados

Primeiro, fazemos download do dataset:

In [4]:
!wget -nc http: // files.fast.ai / data / aclImdb.tgz
!tar -xzf aclImdb.tgz

--2022-09-08 03:24:36--  ftp://http/
           => ‘.listing’
Resolving http (http)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘http’
//: Scheme missing.
File ‘index.html’ already there; not retrieving.

/: Scheme missing.
File ‘index.html’ already there; not retrieving.

/: Scheme missing.
File ‘index.html’ already there; not retrieving.



## Carregando o dataset

Criaremos uma divisão de treino (80%) e validação (20%) artificialmente.

Nota: Evitar de olhar ao máximo o dataset de teste para não ficar enviseado no que será testado. Em aplicações reais, o dataset de teste só estará disponível no futuro, ou seja, é quando o usuário começa a testar o seu produto.

In [5]:
def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path)) as f:
            texts.append(f.read())
    return texts


x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)

# Embaralhamos o treino para depois fazermos a divisão treino/valid.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

n_train = int(0.8 * len(x_train))

x_valid = x_train[n_train:]
y_valid = y_train[n_train:]
x_train = x_train[:n_train]
y_train = y_train[:n_train]

print(len(x_train), 'amostras de treino.')
print(len(x_valid), 'amostras de desenvolvimento.')
print(len(x_test), 'amostras de teste.')

print('3 primeiras amostras treino:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('3 últimas amostras treino:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('3 primeiras amostras validação:')
for x, y in zip(x_valid[:3], y_test[:3]):
    print(y, x[:100])

print('3 últimas amostras validação:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 amostras de treino.
5000 amostras de desenvolvimento.
25000 amostras de teste.
3 primeiras amostras treino:
False I watched the Malayalam movie "Boeing Boeing" made in 1985 (which in turn is probably inspired by an
False As I am from Hungary I have heard many people saying better and better things about Üvegtigris so fa
False I saw this film yesterday.. I rented the DVD from Blockbuster.. In fact, I know one of the actresses
3 últimas amostras treino:
False What the heck was this. Somebody obviously read Stephen King and Sartre in the same semester. We get
False Universal Soldier: The Return is not the worst movie ever made. No, that honor would have to go to a
True NOTHING (3+ outta 5 stars) Another weird premise from the director of the movie "Cube". This time ar
3 primeiras amostras validação:
True I think you would have to be from the USA to get a lot of the jokes. But if you liked Princess Bride
True I'm so confused. I've been a huge Seagal fan for 25 years. I've seen all of

### Definindo funções de manipulação de texto.

In [6]:
from typing import List

import re
import string


def tokenize(text: str):
    """
    Convert string to a list of tokens (i.e., words).
    This function lower cases everything and removes punctuation.
    """

    # return re.sub('[' + string.punctuation + ']', '', text.lower()).split()

    pattern = r'\W+'

    return re.split(pattern, text.lower())


from collections import Counter


def create_vocab(texts: List[str], max_tokens: int):
    """
    Returns a dictionary whose keys are tokens and values are token ids (from 0 to max_tokens - 1).
    """

    tokens = []

    for t in texts:
        tokens.extend(tokenize(t))

    return dict(Counter(tokens).most_common(max_tokens))


def concatenate_list_of_str(texts: List[str]):
    return "".join(texts)

### Criando classe do dataset

In [7]:
from torch import nn
import torch


class TfIdfDataset(torch.utils.data.Dataset):
    def __init__(self, documents, targets, max_tokens=1000):
        super().__init__()

        print("Iniciando montagem dataset")

        self.targets = targets

        every_text = concatenate_list_of_str(documents)
        tokens_ocurrences = create_vocab(tokenize(every_text), max_tokens=max_tokens)

        lista_do_vocabulario = list(tokens_ocurrences.keys())

        tf = np.zeros((len(documents), len(lista_do_vocabulario)))
        for i, doc in tqdm(enumerate(documents), total=len(documents)):
            tokenized_doc = tokenize(doc)[:-1]
            array_contagem_ocorrencias = np.array([0] * len(lista_do_vocabulario))
            for j, token in enumerate(lista_do_vocabulario):
                array_contagem_ocorrencias[j] += tokenized_doc.count(token)

            tf[i] = array_contagem_ocorrencias / len(tokenized_doc)

        idf_denominator = np.zeros((len(lista_do_vocabulario),))
        for i, doc in tqdm(enumerate(documents), total=len(documents)):
            tokenized_doc = tokenize(doc)[:-1]
            for j, token in enumerate(lista_do_vocabulario):
                if token in tokenized_doc:
                    idf_denominator[j] += 1

        idf = np.log(len(documents) / idf_denominator)

        self.tfidf = tf * idf

        print("Finalizada montagem dataset")

    def __getitem__(self, i):
        return torch.tensor(self.tfidf[i]), torch.tensor([1]) if self.targets[i] == 1 else torch.tensor([0])

    def __len__(self):
        return self.tfidf.shape[0]

### Testando dataset


In [8]:
# Teste retirado do exemplo do site:
# https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3

x_assert = ["It is going to rain today.",
            "Today I am not going outside.",
            "I am going to watch the season premiere."]

y_assert = [False, True, True]

dataset = TfIdfDataset(x_assert, y_assert, 8)

tfidf_target =  [[0., 0.06757752, 0.06757752, 0., 0., 0.18310205, 0.18310205, 0.18310205],
                [0., 0., 0.06757752, 0.06757752, 0.06757752, 0., 0., 0., ],
                [0., 0.05068314, 0., 0.05068314, 0.05068314, 0., 0., 0., ]]

tfidf_target = np.array(tfidf_target)

assert np.isclose(dataset.tfidf, tfidf_target).all()
print("Passou no assert!")

100%|██████████| 3/3 [00:00<00:00, 3749.38it/s]
100%|██████████| 3/3 [00:00<00:00, 4638.01it/s]

Iniciando montagem dataset
Finalizada montagem dataset
Passou no assert!





### Define o modelo

In [18]:
class Classifier(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=128):
        super().__init__()

        self.hidden = nn.Linear(vocab_size, hidden_size)
        self.activation = nn.LeakyReLU()
        self.output_layer = nn.Linear(hidden_size, 1)

    def forward(self, x):
        return torch.sigmoid(self.output_layer(self.activation(self.hidden(x))))

### Treina a rede

In [None]:
training_dataset = TfIdfDataset(x_train, y_train)
valid_dataset = TfIdfDataset(x_valid, y_valid)
test_dataset = TfIdfDataset(x_test, y_test)

In [24]:
def evaluate(model, valid_dataloader, criterion):
    accuracy = 0
    model.eval()
    num_examples = 0
    loss = 0
    for x, y in valid_dataloader:
        num_examples += x.shape[0]
        with torch.no_grad():
            logits = model(x.to(device, torch.float))
            loss += criterion(logits, y.to(device, torch.float)).detach() * x.shape[0]
        preds = logits.argmax(dim=1)
        accuracy += (preds == y).sum()

    return accuracy / num_examples, loss / num_examples / 10

def train(model, train_dataloader, valid_dataloader, optimizer, criterion, num_epochs: int = 10):
    for _ in range(num_epochs):
        model.train()
        acumular_loss = 0
        for x, y in train_dataloader:
            optimizer.zero_grad()
            logits = model(x.to(device, torch.float))
            loss = criterion(logits, y.to(device, torch.float))
            loss.backward()
            optimizer.step()
            acumular_loss += loss.detach()

        accuracy, valid_loss= evaluate(model, valid_dataloader, criterion)
        print("train_loss: {:.2f}".format(acumular_loss.item()), ". perplexity: {:.2f}".format(torch.exp(loss).item()), ". accuracy: {:.2f}".format(accuracy.item()))

lr = 1e-3
epochs = 20

model = Classifier().to(device)

train_loader = DataLoader(training_dataset, batch_size=1024, shuffle=True, num_workers=4)
validation_loader = DataLoader(valid_dataset, batch_size=1024, num_workers=4, )

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
criterion = nn.BCELoss()

train(model, train_loader, validation_loader, optimizer, criterion, epochs)

train_loss: 13.82 . perplexity: 1.99 . accuracy: 507.48
train_loss: 13.63 . perplexity: 1.96 . accuracy: 507.48
train_loss: 13.27 . perplexity: 1.92 . accuracy: 507.48
train_loss: 12.66 . perplexity: 1.86 . accuracy: 507.48
train_loss: 11.93 . perplexity: 1.76 . accuracy: 507.48
train_loss: 11.17 . perplexity: 1.72 . accuracy: 507.48
train_loss: 10.44 . perplexity: 1.66 . accuracy: 507.48
train_loss: 9.80 . perplexity: 1.61 . accuracy: 507.48
train_loss: 9.23 . perplexity: 1.55 . accuracy: 507.48
train_loss: 8.78 . perplexity: 1.54 . accuracy: 507.48
train_loss: 8.39 . perplexity: 1.52 . accuracy: 507.48
train_loss: 8.07 . perplexity: 1.52 . accuracy: 507.48
train_loss: 7.79 . perplexity: 1.45 . accuracy: 507.48
train_loss: 7.56 . perplexity: 1.45 . accuracy: 507.48
train_loss: 7.38 . perplexity: 1.46 . accuracy: 507.48
train_loss: 7.19 . perplexity: 1.41 . accuracy: 507.48
train_loss: 7.06 . perplexity: 1.42 . accuracy: 507.48
train_loss: 6.93 . perplexity: 1.41 . accuracy: 507.48
tra

### Métricas de teste

In [36]:
acumular_loss = 0
test_loader = DataLoader(valid_dataset, batch_size=1024, num_workers=4, )
for x, y in test_loader:
    optimizer.zero_grad()
    logits = model(x.to(device, torch.float))
    loss = criterion(logits, y.to(device, torch.float))
    acumular_loss += loss.detach()

accuracy, valid_loss= evaluate(model, test_loader, criterion)
print("test_loss: {:.2f}".format(acumular_loss.item()), ". perplexity: {:.2f}".format(torch.exp(loss).item()), ". accuracy: {:.2f}".format(accuracy.item()))

test_loss: 2.76 . perplexity: 1.76 . accuracy: 50.75
