
# Deep Learning Practical Assignment 2
## Implementation of a recurrent neural network (RNN) with an LSTM layer to capture long-range effects in sequences, applied to a multiclass classifier.

### Optional module of the Data Science Diploma at FaMAF - Universidad Nacional de Córdoba. 2022

**Group members:**
- Alonso, Guillermo
- Pfluger, Santiago
- Perez, Lucas
- Serrantes, Sebastián

**Instructors:**
- Johanna Frau
- Mauricio Mazuecos

[GitHub repository link](https://github.com/guillealonso/DiplodatosDeepLearning)

### Libraries

In [1]:
# Standard library
import bz2
import csv
import functools
import gzip
import json
import tempfile

# Third-party libraries
import matplotlib.pyplot as plt
import mlflow
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from gensim import corpora
from gensim.models import KeyedVectors
from gensim.parsing import preprocessing
from sklearn import metrics
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset, IterableDataset
from tqdm.notebook import tqdm, trange

# Register tqdm progress bars with pandas (e.g. `progress_apply`)
tqdm.pandas()


## Dataset

We use the Meli Challenge 2019 dataset, which contains data from sales listings published on Mercado Libre.

Here we keep only the Spanish-language data, which leaves 6,119,100 records in total and 632 unique categories.

The columns that had already been processed were discarded, and tokenization and preprocessing were redone from scratch using only the `title` column as `data` and `category` as `target`.


In [2]:
class MeliChallenge(Dataset):
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
    
    def __len__(self):
        return self.dataset.shape[0]

    def __getitem__(self, item):
        if torch.is_tensor(item):
            item = item.tolist()
        
        item = {
            "data": self.dataset.loc[item, "title"],
            "target": self.dataset.loc[item, "category"]
        }
        
        if self.transform:
            item = self.transform(item)
        
        return item
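
To illustrate the protocol that `MeliChallenge` relies on, here is a dependency-free sketch of a map-style dataset: an object exposing `__len__` and `__getitem__` whose items are `data`/`target` dictionaries, optionally passed through a transform. The `ToyTitles` class, the two sample rows, and the `upper_title` transform are hypothetical and only show the shape of the items; the notebook's class wraps a pandas DataFrame instead.

```python
class ToyTitles:
    """Hypothetical stand-in for MeliChallenge backed by a plain list of rows."""

    def __init__(self, rows, transform=None):
        self.rows = rows
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        title, category = self.rows[index]
        item = {"data": title, "target": category}
        if self.transform:
            item = self.transform(item)
        return item


def upper_title(item):
    # Illustrative transform: a real one would tokenize and index the title.
    return {"data": item["data"].upper(), "target": item["target"]}


rows = [
    ("Casita Muñecas Barbies Pintadas", "DOLLHOUSES"),
    ("Neceser Cromado Holográfico", "TOILETRY_BAGS"),
]
toy = ToyTitles(rows, transform=upper_title)
first_item = toy[0]
```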

## Preprocessing

In this case we use a single module to transform the Meli Challenge data. It preprocesses the text (i.e. normalizes it) and maps each word to its index in a dictionary, so that a sequence of word indices can be looked up in the embedding matrix. This allows more flexible manipulation of the embeddings than working with fixed embedding vectors.

We work with the previously imported [gensim](https://pypi.org/project/gensim/) library for natural language processing (its open-source code is available at this [link](https://github.com/RaRe-Technologies/gensim)).



In [3]:
class RawDataProcessor:
    def __init__(self, 
                 dataset, 
                 ignore_header=True, 
                 filters=None, 
                 vocab_size=50000):
        if filters:
            self.filters = filters
        else:
            self.filters = [
                lambda s: s.lower(),
                preprocessing.strip_tags,
                preprocessing.strip_punctuation,
                preprocessing.strip_multiple_whitespaces,
                preprocessing.strip_numeric,
                preprocessing.remove_stopwords,
                preprocessing.strip_short,
            ]
        
        # Create the dictionary from all the titles (after preprocessing)
        # https://radimrehurek.com/gensim/corpora/dictionary.html
        self.dictionary = corpora.Dictionary(
            dataset["title"].map(self._preprocess_string).tolist()
        )
        # Drop extreme words: keep tokens that appear in at least 2 titles
        # and, among those, at most the `vocab_size` most frequent ones
        # https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.filter_extremes.html?highlight=filter_extrem
        self.dictionary.filter_extremes(no_below=2, no_above=1, keep_n=vocab_size)
        
        # Make the indices continuous after some words have been removed
        # https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.compactify.html
        self.dictionary.compactify()
        
        # Add a couple of special tokens
        self.dictionary.patch_with_special_tokens({
            "[PAD]": 0,
            "[UNK]": 1
        })
        self.idx_to_target = sorted(dataset["category"].unique())
        self.target_to_idx = {t: i for i, t in enumerate(self.idx_to_target)}


    def _preprocess_string(self, string):
        # https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.preprocess_string:~:text=gensim.parsing.preprocessing.preprocess_string
        return preprocessing.preprocess_string(string, filters=self.filters)

    def _sentence_to_indices(self, sentence):
        # https://radimrehurek.com/gensim/corpora/dictionary.html#:~:text=doc2idx(document,via%20unknown_word_index.
        return self.dictionary.doc2idx(sentence, unknown_word_index=1)
    
    def encode_data(self, data):
        return self._sentence_to_indices(self._preprocess_string(data))
    
    def encode_target(self, target):
        return self.target_to_idx[target]
    
    def __call__(self, item):
        if isinstance(item["data"], str):
            data = self.encode_data(item["data"])
        else:
            data = [self.encode_data(d) for d in item["data"]]
        
        if isinstance(item["target"], str):
            target = self.encode_target(item["target"])
        else:
            target = [self.encode_target(t) for t in item["target"]]
        
        return {
            "data": data,
            "target": target
            #, "sentence": item["data"]
        }
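
The encoding step performed by `RawDataProcessor` can be illustrated without gensim. The following is a simplified sketch, assuming a crude regular-expression tokenizer and a frequency-based vocabulary; the helper names (`normalize`, `build_vocab`, `encode`) are hypothetical and the filters are much rougher than the gensim ones used above. It shows the two ideas that matter downstream: index 0 is reserved for `[PAD]`, index 1 for `[UNK]`, and any out-of-vocabulary word maps to 1.

```python
import re
from collections import Counter

PAD, UNK = "[PAD]", "[UNK]"


def normalize(text):
    # Lowercase, replace anything that is not a letter with a space,
    # then drop tokens shorter than 3 characters.
    cleaned = re.sub(r"[^a-záéíóúüñ]+", " ", text.lower())
    return [tok for tok in cleaned.split() if len(tok) >= 3]


def build_vocab(titles, min_count=1, max_size=50000):
    counts = Counter(tok for title in titles for tok in normalize(title))
    kept = [tok for tok, c in counts.most_common(max_size) if c >= min_count]
    # Reserve the first two indices for the special tokens.
    return {tok: idx for idx, tok in enumerate([PAD, UNK] + sorted(kept))}


def encode(title, vocab):
    # Unknown words fall back to the [UNK] index.
    return [vocab.get(tok, vocab[UNK]) for tok in normalize(title)]


titles = ["Mochila Impermeable Belvento", "Mochila San Lorenzo De Espalda 16p"]
vocab = build_vocab(titles)
known = encode("Mochila Impermeable", vocab)
unknown = encode("Mochila Desconocida", vocab)
```

Here `unknown` ends with the `[UNK]` index because "desconocida" never appears in the titles used to build the vocabulary.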

## Reading the data

In [4]:
train_dataset = pd.concat([x for x in pd.read_json(f"/users/galonso/data/meli-challenge-2019/spanish.train.jsonl.gz", lines=True, chunksize=100000)], ignore_index=True)
train_dataset = train_dataset[['title', 'category']]
train_dataset.head()

Unnamed: 0,title,category
0,Casita Muñecas Barbies Pintadas,DOLLHOUSES
1,Neceser Cromado Holográfico,TOILETRY_BAGS
2,Funda Asiento A Medida D20 Chevrolet,CAR_SEAT_COVERS
3,Embrague Ford Focus One 1.8 8v Td (90cv) Desde...,AUTOMOTIVE_CLUTCH_KITS
4,Bateria Panasonic Dmwbcf10 Lumix Dmc-fx60n Dmc...,CAMERA_BATTERIES
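
The `lines=True` argument means the file is in JSON Lines format: one JSON object per line. As a rough illustration of what is being read, the sketch below writes a tiny gzip-compressed JSON Lines file to a temporary directory and reads it back with the standard library only. The file name, the `read_jsonl_gz` helper, and the extra `language` field are assumptions made for the example.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path, columns):
    """Read a gzip-compressed JSON Lines file, keeping only `columns`."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            rows.append({col: record[col] for col in columns})
    return rows


sample = [
    {"title": "Neceser Cromado Holográfico", "category": "TOILETRY_BAGS", "language": "spanish"},
    {"title": "Casita Muñecas Barbies Pintadas", "category": "DOLLHOUSES", "language": "spanish"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.jsonl.gz")
    # Write one JSON object per line, as in the JSON Lines format.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in sample:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    rows = read_jsonl_gz(path, columns=["title", "category"])
```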


In [6]:
cat

0                 DOLLHOUSES
1              TOILETRY_BAGS
2            CAR_SEAT_COVERS
3     AUTOMOTIVE_CLUTCH_KITS
4           CAMERA_BATTERIES
5               CLASSIC_CARS
6               AV_RECEIVERS
7             POWER_GRINDERS
8               KITCHEN_POTS
9               PC_KEYBOARDS
10                ACCORDIONS
11               3D_PRINTERS
12    MOBILE_DEVICE_CHARGERS
13              BABY_WALKERS
14               LIP_GLOSSES
15                   FABRICS
16          CLOTHING_PATCHES
17                   TABLETS
18          POOL_INFLATABLES
19                   BLOUSES
Name: category, dtype: object

In [9]:
train_dataset.reset_index(drop=True, inplace=True)
train_dataset.tail()

Unnamed: 0,title,category
4895275,Kit 2 Bieletas Delanteras Monroe Vw Fox 1.6 - ...,SWAY_BAR_LINKS
4895276,Organo Teclado Casio Ct-x5000 61 Teclas Profes...,MUSICAL_KEYBOARDS
4895277,Mochila Impermeable Belvento Fausto,BACKPACKS
4895278,Mochila San Lorenzo De Espalda 16p Sl001,BACKPACKS
4895279,1 Kit Imprimible X 8 Sets Primavera Flores P/ ...,PARTY_PRINTABLE_KITS


Since the dataset already comes split into **train - validation - test**, we build a complete one, `fulldataset`, to extract the dictionary with all the words.

In [10]:
fulldataset=train_dataset.copy()

In [11]:
len(train_dataset)

4895280

## Test

In [12]:
test_dataset = pd.read_json(f"/users/galonso/data/meli-challenge-2019/spanish.test.jsonl.gz", lines=True)

In [13]:
test_dataset = test_dataset[['title', 'category']]
test_dataset.head()

Unnamed: 0,title,category
0,Mochilas Maternales Bolsos Bebe Simil Cuero Ma...,DIAPER_BAGS
1,Bolso Maternal/bebe Incluye Cambiador + Correa...,DIAPER_BAGS
2,Mochila Maternal Land + Gancho Envio Gratis-cc,DIAPER_BAGS
3,Bolso Maternal Moderno Con Cambiador Y Correa ...,DIAPER_BAGS
4,Bolso Maternal Moderno Con Cambiador Y Correa ...,DIAPER_BAGS


In [14]:
len(test_dataset)

63680

In [15]:
#test_dataset=test_dataset[test_dataset['category'].isin(list(cat))]
test_dataset.sample()

Unnamed: 0,title,category
58311,Valvula A Solenoide Jefferson 1314ba08,IRRIGATION_VALVES


In [16]:
test_dataset.reset_index(drop=True, inplace=True)
test_dataset.tail()

Unnamed: 0,title,category
63675,Gimnasio Gym Manta Bebe Tiny Love Musica Luz M...,BABY_GYMS
63676,Gimnasio Manta Con Actividades Para Bebe 846 Ath,BABY_GYMS
63677,Gimnasio Bebe Manta Didactica Tiny Love Kick A...,BABY_GYMS
63678,Gimnasio Manta Alfombra Didactica Fitchbaby Ju...,BABY_GYMS
63679,Gimnasio P/ Bebé Alfombra Zoo Animales Didácti...,BABY_GYMS


In [17]:
# DataFrame.append is deprecated (and removed in pandas 2.0); use pd.concat instead
fulldataset = pd.concat([fulldataset, test_dataset])


## Validation

In [18]:
validation_dataset = pd.read_json(f"/users/galonso/data/meli-challenge-2019/spanish.validation.jsonl.gz", lines=True)

In [19]:
validation_dataset=validation_dataset[['title', 'category']]
validation_dataset.head()

Unnamed: 0,title,category
0,Metal Biela Dw10 Hdi 2.0,ENGINE_BEARINGS
1,Repuestos Martillo Rotoprcutor Bosch Gshsce Po...,ELECTRIC_DEMOLITION_HAMMERS
2,Pesca Caña Pejerrey Colony Brava 3m Fibra De V...,FISHING_RODS
3,Porcelanato Abitare Be 20x120 Cm. Ceramica Por...,PORCELAIN_TILES
4,Reconstruction Semi Di Lino Alfaparf Shampoo 1...,HAIR_SHAMPOOS_AND_CONDITIONERS


In [20]:
validation_dataset.shape

(1223820, 2)

In [21]:
#validation_dataset=validation_dataset[validation_dataset['category'].isin(list(cat))]
validation_dataset.sample()

Unnamed: 0,title,category
80251,Motor Para Porton Levadizo Omega Motic Oferta!!!,GATE_MOTORS


In [22]:
validation_dataset.reset_index(drop=True, inplace=True)
validation_dataset.head(5)

Unnamed: 0,title,category
0,Metal Biela Dw10 Hdi 2.0,ENGINE_BEARINGS
1,Repuestos Martillo Rotoprcutor Bosch Gshsce Po...,ELECTRIC_DEMOLITION_HAMMERS
2,Pesca Caña Pejerrey Colony Brava 3m Fibra De V...,FISHING_RODS
3,Porcelanato Abitare Be 20x120 Cm. Ceramica Por...,PORCELAIN_TILES
4,Reconstruction Semi Di Lino Alfaparf Shampoo 1...,HAIR_SHAMPOOS_AND_CONDITIONERS


In [23]:
validation_dataset.shape

(1223820, 2)

In [24]:
# DataFrame.append is deprecated (and removed in pandas 2.0); use pd.concat instead
fulldataset = pd.concat([fulldataset, validation_dataset])
fulldataset.shape


(6182780, 2)

In [26]:
preprocess_full = RawDataProcessor(fulldataset)
test_dataset = MeliChallenge(test_dataset, transform=preprocess_full)
train_dataset = MeliChallenge(train_dataset, transform=preprocess_full)
validation_dataset = MeliChallenge(validation_dataset, transform=preprocess_full)

In [27]:
preprocess_full.idx_to_target[-10:]

['WHEELCHAIRS',
 'WHISKEYS',
 'WINDOWS',
 'WINDSHIELD_WIPERS',
 'WINDSHIELD_WIPER_MOTORS',
 'WINE_CELLARS',
 'WOMEN_SWIMWEAR',
 'WORKOUT_BENCHES',
 'WRENCHES',
 'WRISTWATCHES']

## Collation function

Since here we work with sequences of words (represented by their indices in a vocabulary), when we fetch a *batch* of data PyTorch's `DataLoader` expects every element of the *batch* to have the same dimensions (so they can all be stacked into a fixed-size tensor). We achieve this through the `collate_fn` parameter. We define a `PadSequences` module that takes a minimum length, optionally a maximum length, and a padding value (*pad*) and, given a list of sequences, returns a tensor with those sequences padded.

In [28]:
class PadSequences:
    def __init__(self, pad_value=0, max_length=None, min_length=1):
        assert max_length is None or min_length <= max_length
        self.pad_value = pad_value
        self.max_length = max_length
        self.min_length = min_length

    def __call__(self, items):
        data, target = list(zip(*[(item["data"], item["target"]) for item in items]))
        seq_lengths = [len(d) for d in data]

        if self.max_length:
            max_length = self.max_length
            seq_lengths = [min(self.max_length, l) for l in seq_lengths]
        else:
            max_length = max(self.min_length, max(seq_lengths))

        data = [d[:l] + [self.pad_value] * (max_length - l)
                for d, l in zip(data, seq_lengths)]
            
        return {
            "data": torch.LongTensor(data),
            # Class indices are integers, as expected by nn.CrossEntropyLoss
            "target": torch.LongTensor(target)
        }
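
The padding logic can be checked on plain Python lists before involving tensors. This is a sketch mirroring the behavior of `PadSequences.__call__` as written above; the `pad_batch` name and the sample index lists are made up for illustration.

```python
def pad_batch(sequences, pad_value=0, max_length=None, min_length=1):
    """Right-pad (and optionally truncate) lists of token indices."""
    lengths = [len(seq) for seq in sequences]
    if max_length is not None:
        # Fixed width: longer sequences are truncated to max_length.
        target_length = max_length
        lengths = [min(max_length, n) for n in lengths]
    else:
        # Dynamic width: pad up to the longest sequence in the batch.
        target_length = max(min_length, max(lengths))
    return [
        seq[:n] + [pad_value] * (target_length - n)
        for seq, n in zip(sequences, lengths)
    ]


batch = [[6, 4, 2], [6], [7, 5, 3, 6, 4]]
padded = pad_batch(batch)
truncated = pad_batch(batch, max_length=2)
```

Without `max_length`, every row is padded to the longest sequence in the batch, so the tensor width can differ from one batch to the next.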

## DataLoaders

Having defined our datasets and our `collate_fn`, we can define our `DataLoader`s: one for training and the others for evaluation (test and validation). Note that the fundamental difference lies in `shuffle`: we do not want to shuffle the evaluation data every time we evaluate, because when evaluating with *mini-batches* this can produce inconsistencies.

In [29]:
BATCHSIZE= 64

In [30]:
pad_sequences = PadSequences()

train_loader = DataLoader(train_dataset, batch_size=BATCHSIZE, shuffle=True,
                          collate_fn=pad_sequences, drop_last=False)
test_loader = DataLoader(test_dataset, batch_size=BATCHSIZE, shuffle=False,
                         collate_fn=pad_sequences, drop_last=False)
validation_loader = DataLoader(validation_dataset, batch_size=BATCHSIZE, shuffle=False,
                         collate_fn=pad_sequences, drop_last=False)
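
With `drop_last=False`, the number of batches per pass is the dataset size divided by the batch size, rounded up. A quick check using the split sizes reported earlier in this notebook (the `num_batches` helper is only for illustration):

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    # With drop_last=False the final, smaller batch is kept.
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


BATCH_SIZE = 64
train_batches = num_batches(4_895_280, BATCH_SIZE)
test_batches = num_batches(63_680, BATCH_SIZE)
validation_batches = num_batches(1_223_820, BATCH_SIZE)
```

This gives 76,489 training batches, 995 test batches and 19,123 validation batches per pass.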


## The classification model

For classification we used a recurrent neural network with an LSTM (long short-term memory) layer, which provides a basic approach to modeling long-range dependencies in sequences.

We start with an `Embedding` layer initialized with pretrained Spanish word embeddings of dimension 300.
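
As a sketch of how such pretrained vectors can be loaded, assume a plain-text file where each line holds a word followed by its vector components separated by whitespace. The snippet below parses that layout from an in-memory string and fills a matrix only for words present in a vocabulary; the toy 3-dimensional vectors, the `load_vectors` helper, and the `token2id` mapping are illustrative assumptions, not the real file.

```python
import io


def load_vectors(handle, token2id, dim):
    """Fill a len(token2id) x dim matrix with vectors for known words."""
    matrix = [[0.0] * dim for _ in range(len(token2id))]
    found = set()
    for line in handle:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, vector = parts
        values = vector.split()
        if word in token2id and len(values) == dim:
            matrix[token2id[word]] = [float(v) for v in values]
            found.add(word)
    return matrix, found


token2id = {"[PAD]": 0, "[UNK]": 1, "mochila": 2, "bolso": 3}
fake_file = io.StringIO(
    "3 3\n"                      # header line: vocabulary size and dimension
    "mochila 0.1 0.2 0.3\n"
    "zapato 0.4 0.5 0.6\n"       # not in the vocabulary, so it is ignored
    "bolso -0.1 0.0 0.1\n"
)
matrix, found = load_vectors(fake_file, token2id, dim=3)
```

In this sketch, rows for words missing from the file stay at zero; the model defined later in the notebook starts those rows from random values instead.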

In [32]:
#!curl -L https://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.txt.bz2 -O

## Type of recurrent neural network

For this task the network must have a many-to-one architecture: its input is a sequence of words, padded to the longest sequence in each batch, and its output has one value per category we want to classify.
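
To make the many-to-one idea concrete, here is a deliberately simplified sketch with a scalar, Elman-style recurrence rather than an LSTM: the whole sequence is consumed step by step, but only the final hidden state is passed to the output layer that scores the classes. All weights and the `many_to_one` helper are made-up values for illustration.

```python
import math


def many_to_one(sequence, w_in, w_rec, class_weights):
    """Consume a sequence and return one score per class."""
    hidden = 0.0
    for x in sequence:
        # The hidden state carries information from all previous steps.
        hidden = math.tanh(w_in * x + w_rec * hidden)
    # Only the last hidden state is used for classification.
    return [w * hidden for w in class_weights]


class_weights = [1.0, -0.5, 2.0]  # three hypothetical classes
short_scores = many_to_one([0.2, 0.7], 0.8, 0.5, class_weights)
long_scores = many_to_one([0.2, 0.7, 0.1, 0.9, 0.3], 0.8, 0.5, class_weights)
```

Both calls return three scores: the size of the output depends on the number of classes, not on the length of the input sequence.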

In [34]:
import torch
import torch.nn as nn

# Check if we have a GPU available
use_cuda = torch.cuda.is_available()
device = torch.device('cuda') if use_cuda else torch.device('cpu')

In [35]:
torch.cuda.is_available()

True

In [36]:
class MeliLSTM(nn.Module):
    def __init__(self,
                 pretrained_embeddings_path, dictionary, embedding_size,
                 hidden_layer=10,
                 num_layers=2, dropout=0.2, bias=True,
                 bidirectional=False,
                 freeze_embedings=True):
        
        super(MeliLSTM, self).__init__()
        output_size = 632
        # Create the Embeddings layer and add pre-trained weights
        embeddings_matrix = torch.randn(len(dictionary), embedding_size)
        embeddings_matrix[0] = torch.zeros(embedding_size)
        with bz2.open(pretrained_embeddings_path, "rt") as fh:
            for line in fh:
                word, vector = line.strip().split(None, 1)
                if word in dictionary.token2id:
                    embeddings_matrix[dictionary.token2id[word]] =\
                        torch.FloatTensor([float(n) for n in vector.split()])
        self.embedding_config = {'freeze': freeze_embedings,
                                  'padding_idx': 0}
        self.embeddings = nn.Embedding.from_pretrained(
            embeddings_matrix, **self.embedding_config)
        
        # Set our LSTM parameters
        self.lstm_config = {'input_size': embedding_size,
                            'hidden_size': hidden_layer,
                            'num_layers': num_layers,
                            'bias': bias,
                            'batch_first': True,
                            'dropout': dropout,
                            'bidirectional': bidirectional
                           }
        
        # Set our fully connected layer parameters
        self.linear_config = {'in_features': hidden_layer,
                              'out_features': output_size,
                              'bias': bias
                             }
        
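        # Instantiate the layers
        self.lstm = nn.LSTM(**self.lstm_config)
        self.classification_layer = nn.Linear(**self.linear_config)
        # Not applied in forward(): nn.CrossEntropyLoss expects raw logits
        self.activation = nn.Softmax(dim=1)
        

    def forward(self, inputs):
        emb = self.embeddings(inputs)
        lstm_out, _ = self.lstm(emb)
        # Keep the output at the last time step: shape (batch, hidden_size).
        # No squeeze() here, so a batch with a single example keeps its batch dimension.
        lstm_out = lstm_out[:, -1, :]
        predictions = self.classification_layer(lstm_out)
        return predictions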

In [37]:
model = MeliLSTM(
    "./SBW-vectors-300-min5.txt.bz2", preprocess_full.dictionary,
    embedding_size=300,
    hidden_layer=256,
    num_layers=2,
    dropout=0.2
)
print(model)

MeliLSTM(
  (embeddings): Embedding(50002, 300, padding_idx=0)
  (lstm): LSTM(300, 256, num_layers=2, batch_first=True, dropout=0.2)
  (classification_layer): Linear(in_features=256, out_features=632, bias=True)
  (activation): Softmax(dim=1)
)
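
The layer sizes printed above are enough to work out the number of parameters by hand. The sketch below assumes the standard PyTorch LSTM parameterization (four gates per layer, each with an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors); the helper names are illustrative.

```python
def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with W_ih (hidden x input), W_hh (hidden x hidden),
    # and two bias vectors of length hidden.
    return 4 * hidden_size * (input_size + hidden_size + 2)


def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit.
    return in_features * out_features + out_features


embedding_params = 50002 * 300  # frozen when freeze_embedings=True
# Layer 1 reads the 300-d embeddings; layer 2 reads layer 1's 256-d output.
lstm_params = lstm_layer_params(300, 256) + lstm_layer_params(256, 256)
classifier_params = linear_params(256, 632)
trainable_params = lstm_params + classifier_params
```

Under these assumptions the embedding matrix holds about 15 million values, while the trainable LSTM and classification layers together hold roughly 1.26 million.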


## Training the network
In this section we train our network. First we configure its hyperparameters. At this point we set the following:

- learning_rate
- epochs
- loss function
- optimizer

We also define the parameters for `torch.utils.data.DataLoader`, the class that manages the dataset and splits the data into batches (it does not by itself distribute them across devices; multi-GPU training needs additional tools such as `torch.utils.data.distributed.DistributedSampler`).

Finally, we have all the building blocks needed for our MLflow experiment. We log the model parameters and launch the experiment. Each time an epoch finishes we save some metrics. Once all the epochs have finished we compute some additional evaluation metrics.
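
As background for the loss function: `nn.CrossEntropyLoss` takes the raw scores (logits) returned by the model, applies a log-softmax, and averages the negative log-probability of the correct class over the batch. A dependency-free sketch for a single example follows; the `cross_entropy` helper is illustrative. It also gives a useful reference point: a model that assigns equal scores to all 632 categories has a loss of ln(632) ≈ 6.45.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    # Subtract the maximum for numerical stability (log-sum-exp trick).
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


# Equal scores for all 632 classes: the loss equals ln(632).
uniform_loss = cross_entropy([0.0] * 632, target_index=5)
# A large score on the correct class drives the loss towards zero.
confident_loss = cross_entropy([8.0, 0.0, 0.0], target_index=0)
```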

In [44]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)  # move the model to the GPU (if available)
import torch.optim as optim

learning_rate = 0.005
EPOCHS = 10
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9) #optim.Adam(model.parameters(), learning_rate)

mlflow.set_experiment("a_MeliChallenge_experiment_2")

with mlflow.start_run():
    mlflow.log_param("model_name", "LSTM_20Cat")
    mlflow.log_param("freeze_embedding", True)
    mlflow.log_params({
        "embedding_size": 300,
        "model": model,
        "Epochs": EPOCHS,
        "optimizer": "optim.SGD",
    })
    for epoch in trange(EPOCHS):
        model.train()
        running_loss = []
        targets = []
        predictions = []
        
        #Train
        print("Epoch", epoch)
        for idx, batch in enumerate(tqdm(train_loader)):
            optimizer.zero_grad()
            output = model(batch["data"].to(device))
            loss_value = loss_function(output, batch["target"].to(device).squeeze().long())
            loss_value.backward()
            optimizer.step()
            running_loss.append(loss_value.item())
            train_loss = sum(running_loss) / len(running_loss)
            #print("\t Final train_loss", train_loss)
            #history['train_loss'].append(train_loss)
            targets.extend(batch["target"].numpy())
            predictions.extend(torch.argmax(output.cpu(), 1).numpy())
        # Epoch metrics
        print("\t Final train_loss", sum(running_loss) / len(running_loss))
        print("\t Final train_bas", balanced_accuracy_score(targets, predictions))
        mlflow.log_metric("train_loss", sum(running_loss) / len(running_loss), epoch)
        mlflow.log_metric("train_bas", balanced_accuracy_score(targets, predictions), epoch)
        
        model.eval()
        running_loss = []
        targets = []
        predictions = []
        
        # Validation
        '''for batch in tqdm(validation_loader):
            output = model(batch["data"].to(device))
            running_loss.append(loss_function(output, batch["target"].to(device).squeeze().long()).item()) #.cuda()
            targets.extend(batch["target"].numpy())
            # Round up model output to get the predictions.
            # What would happen if you change the activation to tanh?
            predictions.extend(torch.argmax(output.cpu(), 1).numpy())
            #predictions.extend(output.squeeze().round().detach().numpy())
            validation_loss = sum(running_loss) / len(running_loss)
            bas = metrics.balanced_accuracy_score(targets, predictions) #balanced_accuracy_score(targets, predictions)
        print("\t Final validation_loss", validation_loss)
        print("\t Final validation_bas", bas)
        mlflow.log_metric("validation_loss", validation_loss, epoch)
        mlflow.log_metric("validation_bas", bas, epoch)
        '''        
    # Evaluate on the test set once training has finished
    model.eval()
    y_true = []
    y_pred = []
    with torch.no_grad():
        for batch in tqdm(test_loader):
            # Each batch is a dict with "data" and "target" keys
            outputs = model(batch["data"].to(device))
            predicted = torch.argmax(outputs, dim=1)
            y_true.extend(batch["target"].numpy())
            y_pred.extend(predicted.cpu().numpy())
    test_bas = balanced_accuracy_score(y_true, y_pred)
    print("----test_bas: ", test_bas)
    mlflow.log_metric("test_bas", test_bas, epoch)
    # The `with mlflow.start_run()` block ends the run on exit


Epoch 0
	 Final train_loss 4.097304612358016
	 Final train_bas 0.23694918350894484
Epoch 1
	 Final train_loss 1.1494053450676376
	 Final train_bas 0.7182445088042041
Epoch 2
	 Final train_loss 0.8351444907294644
	 Final train_bas 0.7948419776145902
Epoch 3
	 Final train_loss 0.7261965914388524
	 Final train_bas 0.8217213660252353
Epoch 4
	 Final train_loss 0.6671555173749055
	 Final train_bas 0.8362301906281533
Epoch 5
	 Final train_loss 0.6283178726484003
	 Final train_bas 0.8456784982868917
Epoch 6
	 Final train_loss 0.5998360307759568
	 Final train_bas 0.8524624319997328
Epoch 7
	 Final train_loss 0.5776936967013965
	 Final train_bas 0.8577491766180039
Epoch 8
	 Final train_loss 0.5599880045764876
	 Final train_bas 0.861643444295537
Epoch 9
	 Final train_loss 0.5452281882078518
	 Final train_bas 0.8651602066260116

----test_bas:  0.9263175089465766
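
The `train_bas` and `test_bas` values are balanced accuracy scores: the average of the recall obtained on each class, so frequent categories do not dominate the metric. A minimal pure-Python sketch of that definition, with toy labels chosen for illustration (the notebook itself uses `balanced_accuracy_score` from scikit-learn):

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Class 0 is frequent and always right; classes 1 and 2 are rare and mostly wrong.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0]
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 5/7 while balanced accuracy is 0.5, because the two rare classes weigh as much as the frequent one.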


In [None]:
torch.save(model.state_dict(), 'model2_weights.pth')
torch.save(model, 'model2.pth')