# Práctico 1 - Parte 2

[Enunciado](https://github.com/DiploDatos/AprendizajeProfundo/blob/master/Practico.md) del trabajo práctico.

**Implementación de red neuronal [Perceptrón Multicapa](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP).**

[Documentación de Pytorch](https://pytorch.org/docs/stable/index.html)

[Tutorial](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)

## Integrantes
- Mauricio Caggia
- Luciano Monforte
- Gustavo Venchiarutti
- Guillermo Robiglio

En esta segunda parte se cargan datos reducidos en la parte 1. Esto con el fin de optimizar memoria.

## Importaciones

In [1]:
import gzip
import bz2
import mlflow
import pandas as pd
import tempfile
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

from gensim import corpora
from gensim.parsing import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm, trange

## Constantes

In [2]:
ARCHIVO_SET_DE_ENTRENAMIENTO = './data/training_set.csv'
ARCHIVO_SET_DE_ENTRENAMIENTO_REDUCIDO = './data/training_set_reduced.csv'
ARCHIVO_SET_DE_PRUEBA = './data/test_set.csv'
ARCHIVO_SET_DE_VALIDACION = './data/validation_set.csv'
ARCHIVO_DE_EMBEDDINGS = './data/SBW-vectors-300-min5.txt.bz2'
# ARCHIVO_DE_EMBEDDINGS = '../data/glove.6B.50d.txt.gz'
EPOCHS = 5

## Carga de datos

In [3]:
%%time
file_paths = [ARCHIVO_SET_DE_ENTRENAMIENTO,
              ARCHIVO_SET_DE_ENTRENAMIENTO_REDUCIDO,
              ARCHIVO_SET_DE_PRUEBA,
              ARCHIVO_SET_DE_VALIDACION]
i_train = 1
i_test = 4
df_train = pd.read_csv(file_paths[i_train])
df_test = pd.read_csv(file_paths[i_test])

CPU times: user 2.16 s, sys: 478 ms, total: 2.63 s
Wall time: 2.67 s


In [4]:
df_train = df_train.sample(20000)
df_train.head()

Unnamed: 0,title,category
357743,"Tijera Para El Cabello Filo Navaja 5,5 P.car...",HAIRDRESSING_SCISSORS
442017,Cooler Cooler Master Hyper Tx3 Evo 1151 1150 A...,DESKTOP_COMPUTER_COOLERS_AND_FANS
1097459,Lapicera Cactus X 50 U Merchandising Souvenirs...,PENS
547021,Mesa De Luz/cajonera Industrial - Hierro Y Madera,NIGHTSTANDS
681903,Seidio Superficie Caso Con El Metal Pata De Ca...,CELLPHONES


In [5]:
df_test = df_test.sample(20000)
df_test.head()

Unnamed: 0,title,category
911924,"Torno Lavore Mhas 200 De 1,5mts X 400 Mmg Maq...",LATHES
1037907,Estereo Boss 628ua,CAR_STEREOS
946243,Super Sale! Mimo&co. Mono Con Short Floreado. ...,JUMPSUITS_AND_OVERALLS
519679,Souvenir Regalo Reloj 20cm Con Foto,WALL_CLOCKS
618917,Paragolpes Tras Peugeot Partner 09-15 Concesio...,AUTOMOTIVE_FRONT_BUMPERS


In [6]:
print(df_train.shape)
print(df_test.shape)

(20000, 2)
(20000, 2)


## Construcción del Dataset

El dataset se construye a partir del dataframe de Pandas que tiene dos columnas:
- **title**
- **category**

In [7]:
class MeLiChallengeDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform
    
    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, item):
        item = {
            "data": self.df.iloc[item]["title"],
            "target": self.df.iloc[item]["category"]
        }

        if self.transform:
            item = self.transform(item)
        
        return item

## Preprocesamiento de los datos

El preprocesamiento de texto tiene dos propósitos:
- Tokenizar los títulos (datos) de modo que se quiten los signos de puntuación y palabras cortas como preposiciones y conjunciones (stopwords), todas las palabras queden en minúsculas, se separen en listas de palabras, etc.
- Transformar las categorías en etiquetas numéricas.

In [8]:
class RawDataProcessor:
    def __init__(self, dataset, ignore_header=True, vocab_size=50000):
        self.filters = [lambda s: s.lower(),
                        preprocessing.strip_tags,
                        preprocessing.strip_punctuation,
                        preprocessing.strip_multiple_whitespaces,
                        preprocessing.strip_numeric,
                        preprocessing.remove_stopwords,
                        preprocessing.strip_short]
        
        # Esta clase encapsula el mapeo entre las palabras normalizadas y sus correspondientes indices 
        # https://radimrehurek.com/gensim/corpora/dictionary.html
        self.dictionary = corpora.Dictionary(dataset["title"].map(self._preprocess_string).tolist())
        
        # Filter the dictionary with extremos words
        # https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.filter_extremes.html?highlight=filter_extrem
        self.dictionary.filter_extremes(no_below=2, no_above=1, keep_n=vocab_size)
        
        # Asigna nuevos índices a todas las palabras
        # https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.compactify.html
        # https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.compactify
        self.dictionary.compactify()
        
        # Se agregan tokens especiales
        self.dictionary.patch_with_special_tokens({"[PAD]": 0,
                                                   "[UNK]": 1})
        
        # Conversión de categorías a etiquetas
        self.idx_to_target = sorted(dataset["category"].unique())
        self.target_to_idx = {t: i for i, t in enumerate(self.idx_to_target)}


    def _preprocess_string(self, string):
        # Procesamiento de los títulos mediante la aplicación de una lista de filtros
        # Parámetro: str -> El título sin procesar
        # Salida: list -> Lista de strings
        # https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.preprocess_string
        return preprocessing.preprocess_string(string, filters=self.filters)

    def _sentence_to_indices(self, sentence):
        # Convierte una lista de palabras en una lista de índices
        # Parámetro: list -> Lista de palabras
        # Salida: list -> Lista de enteros (índices) en el mismo orden que las palabras
        # https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx
        return self.dictionary.doc2idx(sentence, unknown_word_index=1)
    
    def encode_data(self, data):
        # Convierte un string en una lista de índices
        return self._sentence_to_indices(self._preprocess_string(data))
    
    def encode_target(self, target):
        # Convierte las categorías a etiquetas
        return self.target_to_idx[target]
    
    def __call__(self, item):
        if isinstance(item["data"], str):
            data = self.encode_data(item["data"])
        else:
            data = [self.encode_data(d) for d in item["data"]]
        
        if isinstance(item["target"], str):
            target = self.encode_target(item["target"])
        else:
            target = [self.encode_target(t) for t in item["target"]]
        
        return {
            "data": data,
            "target": target
        }

In [9]:
class PadSequences:
    def __init__(self, pad_value=0, max_length=None, min_length=1):
        assert max_length is None or min_length <= max_length
        self.pad_value = pad_value
        self.max_length = max_length
        self.min_length = min_length

    def __call__(self, items):
        data, target = list(zip(*[(item["data"], item["target"]) for item in items]))
        seq_lengths = [len(d) for d in data]

        if self.max_length:
            max_length = self.max_length
            seq_lengths = [min(self.max_length, l) for l in seq_lengths]
        else:
            max_length = max(self.min_length, max(seq_lengths))

        data = [d[:l] + [self.pad_value] * (max_length - l)
                for d, l in zip(data, seq_lengths)]
            
        return {
            "data": torch.LongTensor(data),
            "target": torch.LongTensor(target)
        }

In [10]:
train_processor = RawDataProcessor(df_train, vocab_size=5000)
train_dataset = MeLiChallengeDataset(df_train, transform=train_processor)

In [11]:
test_processor = RawDataProcessor(df_test, vocab_size=5000)
test_dataset = MeLiChallengeDataset(df_test, transform=test_processor)

In [12]:
i = 100
print(f"El dataset de entrenamiento tiene {len(train_dataset)} elementos.")
print(f"Elemento #{i}:\n\tData: {train_dataset[i]['data']}\n\tTarget: {train_dataset[i]['target']}")

El dataset de entrenamiento tiene 20000 elementos.
Elemento #100:
	Data: [90, 287, 396, 395, 394]
	Target: 25


## Carga del Dataset

In [13]:
batch_size = 1000
pad_sequences = PadSequences()
train_loader = DataLoader(train_dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          collate_fn=pad_sequences,
                          drop_last=False)
test_loader = DataLoader(test_dataset,
                         batch_size=batch_size,
                         shuffle=True,
                         collate_fn=pad_sequences,
                         drop_last=False)

In [14]:
i = 0
for data in tqdm(train_loader):
    i += 1
print(f'{i} iteraciones')

  0%|          | 0/20 [00:00<?, ?it/s]

20 iteraciones


## Construcción del Modelo

In [15]:
class MeLiChallengeClassifier(nn.Module):
    def __init__(self, 
                 pretrained_embeddings_path, 
                 dictionary,
                 vector_size,
                 freeze_embedings):
        super().__init__()
        embeddings_matrix = torch.randn(len(dictionary), vector_size)
        embeddings_matrix[0] = torch.zeros(vector_size)
        with gzip.open(pretrained_embeddings_path, encode='utf-8', "rt") as fh:
#       with bz2.open(pretrained_embeddings_path, "rt") as fh:
            for line in fh:
                word, vector = line.strip().split(None, 1)
                if word in dictionary.token2id:
                    embeddings_matrix[dictionary.token2id[word]] = torch.FloatTensor([float(n) for n in vector.split()])
        self.embeddings = nn.Embedding.from_pretrained(embeddings_matrix,
                                                       freeze=freeze_embedings,
                                                       padding_idx=0)
        self.hidden1 = nn.Linear(vector_size, 128)
        self.hidden2 = nn.Linear(128, 128)
        self.output = nn.Linear(128, 632)
        self.vector_size = vector_size
    
    def forward(self, x):
        x = self.embeddings(x)
        x = torch.mean(x, dim=1)
        x = F.relu(self.hidden1(x))
        x = F.relu(self.hidden2(x))
        x = torch.sigmoid(self.output(x))
        return x

## Algoritmo de Optimización

In [16]:
%%time
model = MeLiChallengeClassifier(ARCHIVO_DE_EMBEDDINGS, train_processor.dictionary, 50, True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

CPU times: user 1.69 s, sys: 53 ms, total: 1.74 s
Wall time: 1.76 s


## Entrenamiento del modelo

In [17]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Utilizando {device}')
model.to(device)

Utilizando cuda:0


MeLiChallengeClassifier(
  (embeddings): Embedding(5002, 50, padding_idx=0)
  (hidden1): Linear(in_features=50, out_features=128, bias=True)
  (hidden2): Linear(in_features=128, out_features=128, bias=True)
  (output): Linear(in_features=128, out_features=632, bias=True)
)

In [18]:
# %%time
# for epoch in range(EPOCHS):  # Recorre el dataset multiples veces
#     model.train()
#     running_loss = 0.0
#     for data in train_loader:
#         inputs = data['data'].to(device)
#         labels = data['target'].to(device)
#         optimizer.zero_grad()
#         outputs = model(inputs)
#         loss = loss_function(outputs, labels.squeeze().long())
#         loss.backward()
#         optimizer.step()

In [19]:
# model.train()
# for data in train_loader:
#     inputs = data['data'].to(device)
#     target = data['target'].to(device)
#     optimizer.zero_grad()
#     output = model(inputs)
#     loss = loss_function(output, target)
#     loss.backward()
#     optimizer.step()

In [20]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, data in enumerate(dataloader):
        X, y = data['data'].to(device), data['target'].to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

## Evaluación del Modelo

In [21]:
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for data in dataloader:
            X, y = data['data'].to(device), data['target'].to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

In [22]:
for t in range(EPOCHS):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_loader, model, loss_function, optimizer)
    test(test_loader, model, loss_function)
print("Done!")

Epoch 1
-------------------------------
loss: 6.448280  [    0/20000]
Test Error: 
 Accuracy: 0.1%, Avg loss: 6.448738 

Epoch 2
-------------------------------
loss: 6.448425  [    0/20000]
Test Error: 
 Accuracy: 0.1%, Avg loss: 6.448745 

Epoch 3
-------------------------------
loss: 6.448981  [    0/20000]
Test Error: 
 Accuracy: 0.1%, Avg loss: 6.448733 

Epoch 4
-------------------------------
loss: 6.448917  [    0/20000]
Test Error: 
 Accuracy: 0.1%, Avg loss: 6.448730 

Epoch 5
-------------------------------
loss: 6.448575  [    0/20000]
Test Error: 
 Accuracy: 0.1%, Avg loss: 6.448724 

Done!


In [23]:
# i = 0
# for batch, data in enumerate(train_loader):
#     i += 1
#     print (f'Lote {i}.')
#     if i%10 == 0:
#         print(type(data['data']))
#         print(type(data['target']))
# print(f'{i} iteraciones.')

## Guardado de los parámetros del modelo