# Practical 1

In this practical we will work with the petfinder dataset that you used in the *Aprendizaje Supervisado* (Supervised Learning) course. The task is to predict the adoption speed of a set of pets. To do so, we will also use [this Kaggle competition](https://www.kaggle.com/t/8842af91604944a9974bd6d5a3e097c5).

In this stage we will implement basic and not-so-basic MLP models, and look at the different hyperparameters and architectures we can choose from. We will also compare two common representations for categorical data: *one-hot encodings* and *embeddings*. The first exercise consists of implementing and training a basic model; the second consists of exploring different combinations of features and hyperparameters.
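
As a rough illustration of the two representations, the NumPy sketch below encodes the same category ids both ways. The vocabulary size, the embedding width, and the random `embedding_matrix` are made-up values for illustration; in a trained model the matrix would be learned. The sketch also checks that looking up a row of the matrix gives the same result as multiplying the one-hot vector by it.

```python
import numpy as np

vocab_size = 5       # hypothetical number of distinct categories
embedding_dim = 3    # hypothetical embedding width
ids = np.array([0, 3, 3, 1])  # hypothetical category ids, already 0-based

# One-hot: each id becomes a sparse vector of length vocab_size.
one_hot = np.eye(vocab_size)[ids]

# Embedding: each id selects one dense row of a (vocab_size, embedding_dim) matrix.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
embedded = embedding_matrix[ids]

# The lookup is equivalent to multiplying the one-hot vectors by the matrix.
assert np.allclose(one_hot @ embedding_matrix, embedded)
print(one_hot.shape, embedded.shape)
```

The one-hot width grows with the number of categories, while the embedding width is a hyperparameter we choose.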

To solve the exercises, we provide a skeleton that you can complete in the file `practico_1_train_petfinder.py`. This skeleton already contains many of the functions for combining the representations of the different columns that we saw in notebook 2, although you can add more columns, including the columns with numeric values.

## Exercise 1

1. Build a classification pipeline with a Keras MLP model. You can start with a simplified version that has only an Input layer receiving the values of the *one-hot-encoded* columns.

2. Train one or more models (two or three is enough; we will see more of this in practical 2). Evaluate the models on the dev and test sets.

## Exercise 2

1. Using the same model as before, explore how the results change as we add or remove columns.

2. Run the hyperparameter exploration again, taking into account the information added by the new columns.

3. Upload the results to the Kaggle competition.
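
One possible way to organize the hyperparameter exploration is to enumerate every combination of a small search space and launch one training run per combination. The sketch below only builds the list of configurations; the search space values are hypothetical, and the key names are chosen to mirror the command-line options of the training script. A single dropout rate is repeated once per hidden layer, since the script expects both lists to have the same length.

```python
import itertools

# Hypothetical search space; adjust the values to your own experiments.
search_space = {
    'hidden_layer_sizes': [[64], [128, 64]],
    'dropout_rate': [0.3, 0.5],
    'batch_size': [32, 64],
}


def iter_configs(space):
    """Yield one configuration dict per combination of the values in `space`."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        config = dict(zip(keys, values))
        # Repeat the single rate so there is one dropout value per hidden layer.
        rate = config.pop('dropout_rate')
        config['dropout'] = [rate] * len(config['hidden_layer_sizes'])
        yield config


configs = list(iter_configs(search_space))
print(len(configs))  # 2 * 2 * 2 combinations
for config in configs[:2]:
    print(config)
```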


Finally, you must report the hyperparameters and results of all the models you trained. For this, you can use the results collected with *mlflow* and process them in a notebook. You must submit that notebook or a file (pdf|md) with the conclusions you can draw. This report must describe:
  * The hyperparameters used to process each column of the dataset. Which columns affect performance the most? Why?
  * The decisions made when building each model: regularization, batch normalization, dropout, number and size of the layers, optimizer.
  * The training process: train/dev split, batch size, number of epochs, evaluation metrics. Select the best hyperparameters based on their performance. The training process should be the same for all models.
  * An analysis of whether the classifier is overfitting. This can be determined from the result of the `fit` method.
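
For the overfitting analysis in the last point, a minimal sketch is shown below. It assumes `fit` was called with `validation_data` (for example the dev dataset), so that the returned `history.history` dictionary contains both `loss` and `val_loss`; the `fit` calls in this notebook do not pass it. The helper `overfitting_report`, its `patience` parameter, and the loss curves are made up for illustration.

```python
def overfitting_report(history, patience=3):
    """Summarize train/validation loss curves from a Keras-style history dict."""
    train_loss = history['loss']
    val_loss = history['val_loss']
    # Index of the epoch with the lowest validation loss.
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        'best_epoch': best_epoch + 1,  # reported 1-based, like the Keras logs
        'final_gap': val_loss[-1] - train_loss[-1],
        # True when validation loss has not improved for `patience` epochs.
        'val_loss_stalled': len(val_loss) - 1 - best_epoch >= patience,
    }


# Made-up curves: training loss keeps falling while validation loss turns upward.
example_history = {
    'loss': [1.50, 1.40, 1.30, 1.20, 1.10, 1.00],
    'val_loss': [1.52, 1.45, 1.44, 1.46, 1.49, 1.53],
}
report = overfitting_report(example_history)
print(report)
```

A training loss that keeps decreasing while the validation loss stalls or rises, with a growing gap between the two, is the usual sign of overfitting.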




In [1]:

import argparse

import os
import mlflow
import numpy
import pandas
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models



In [2]:
TARGET_COL = 'AdoptionSpeed'

In [3]:
def read_args():
    parser = argparse.ArgumentParser(
        description='Training a MLP on the petfinder dataset')
    # Here you have some examples of classifier parameters. You can add
    # more arguments or change these if you need to.
    parser.add_argument('--dataset_dir', default='../petfinder_dataset', type=str,
                        help='Directory with the training and test files.')
    parser.add_argument('--hidden_layer_sizes', nargs='+', default=[100], type=int,
                        help='Number of hidden units of each hidden layer.')
    parser.add_argument('--epochs', default=10, type=int,
                        help='Number of epochs to train.')
    parser.add_argument('--dropout', nargs='+', default=[0.5], type=float,
                        help='Dropout ratio for every layer.')
    parser.add_argument('--batch_size', type=int, default=32,
                        help='Number of instances in each batch.')
    parser.add_argument('--experiment_name', type=str, default='Base model',
                        help='Name of the experiment, used in mlflow.')
    args = parser.parse_args()

    assert len(args.hidden_layer_sizes) == len(args.dropout)
    return args
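
The options declared with `nargs='+'` accept several values after a single flag. The sketch below rebuilds only those two options in a hypothetical `build_parser` helper and parses an explicit argument list, which is one way to try the parser from a notebook without touching `sys.argv`.

```python
import argparse


def build_parser():
    """Reduced parser with only the two list-valued options."""
    parser = argparse.ArgumentParser(
        description='Training a MLP on the petfinder dataset')
    parser.add_argument('--hidden_layer_sizes', nargs='+', default=[100], type=int)
    parser.add_argument('--dropout', nargs='+', default=[0.5], type=float)
    return parser


# Passing a list to parse_args makes argparse ignore the real command line.
args = build_parser().parse_args(
    ['--hidden_layer_sizes', '128', '64', '--dropout', '0.5', '0.3'])
print(args.hidden_layer_sizes, args.dropout)

# Same consistency check as in read_args: one dropout value per hidden layer.
assert len(args.hidden_layer_sizes) == len(args.dropout)
```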


In [20]:
tf.keras.utils.to_categorical(dataset['Color1'] - 1, 100)

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [31]:
tf.keras.utils.normalize(dataset['Age'].values)[0]

array([0.02695538, 0.00305155, 0.00101718, ..., 0.00152578, 0.00712029,
       0.00101718])
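
With its default arguments, `tf.keras.utils.normalize` appears to apply an L2 normalization along the last axis, so a 1-D column is divided by the norm of the entire column (which would also explain why the result comes back with an extra leading axis and is indexed with `[0]`). The NumPy sketch below reproduces that behaviour on made-up ages; `l2_normalize_column` is a hypothetical helper, not part of Keras.

```python
import numpy as np


def l2_normalize_column(values):
    """Divide a 1-D column by its L2 norm (a zero norm is replaced by 1)."""
    values = np.asarray(values, dtype=float)
    norm = np.linalg.norm(values)
    return values / (norm if norm != 0 else 1.0)


ages = np.array([3.0, 4.0, 0.0, 12.0])  # made-up ages; their L2 norm is 13
scaled = l2_normalize_column(ages)
print(scaled)
print(np.linalg.norm(scaled))  # the scaled column has unit norm
```

One consequence worth keeping in mind: the scaling constant depends on the rows passed in, so the train, dev, and test splits would each be scaled by a different norm.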

In [71]:

def process_features(df, one_hot_columns, numeric_columns, embedded_columns, test=False):
    direct_features = []

    # Create one hot encodings
    for one_hot_col, max_value in one_hot_columns.items():
        direct_features.append(tf.keras.utils.to_categorical(df[one_hot_col] - 1, max_value))

    for col_name in numeric_columns:
        direct_features.append(tf.keras.utils.normalize(df[col_name].values).reshape(-1,1))

    # Concatenate all features that don't need further embedding into a single matrix.
    features = {'direct_features': numpy.hstack(direct_features)}


    
    # Create embedding columns - nothing to do here. We will use the zero embedding for OOV
    for embedded_col in embedded_columns.keys():
        features[embedded_col] = df[embedded_col].values

    if not test:
        nlabels = df[TARGET_COL].unique().shape[0]
        # Convert labels to one-hot encodings
        targets = tf.keras.utils.to_categorical(df[TARGET_COL], nlabels)
    else:
        targets = None
    
    return features, targets

In [61]:
def load_dataset(dataset_dir, batch_size):

    # Read train dataset (and maybe dev, if you need to...)
    dataset, dev_dataset = train_test_split(
        pandas.read_csv(os.path.join(dataset_dir, 'train.csv')), test_size=0.2)
    
    test_dataset = pandas.read_csv(os.path.join(dataset_dir, 'test.csv'))
    
    print('Training samples {}, test_samples {}'.format(
        dataset.shape[0], test_dataset.shape[0]))
    
    return dataset, dev_dataset, test_dataset

In [62]:
dataset_dir = "./"
batch_size = 32

dataset, dev_dataset, test_dataset = load_dataset(dataset_dir, batch_size)
nlabels = dataset[TARGET_COL].unique().shape[0]

Training samples 8465, test_samples 4411


In [63]:
# It's important to always use the same one-hot length
one_hot_columns = {
    one_hot_col: dataset[one_hot_col].max()
    for one_hot_col in ['Gender', 'Color1']
}
embedded_columns = {
    embedded_col: dataset[embedded_col].max() + 1
    for embedded_col in ['Breed1']
}
numeric_columns = ['Age', 'Fee']

In [64]:
# TODO (optional) put these three types of columns in the same dictionary with "column types"
X_train, y_train = process_features(dataset, one_hot_columns, numeric_columns, embedded_columns)
direct_features_input_shape = (X_train['direct_features'].shape[1],)
X_dev, y_dev = process_features(dev_dataset, one_hot_columns, numeric_columns, embedded_columns)


In [65]:
X_train

{'direct_features': array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         1.00000000e+00, 1.03113964e-03, 1.28849850e-03],
        [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
         0.00000000e+00, 5.15569820e-04, 0.00000000e+00],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 1.03113964e-03, 0.00000000e+00],
        ...,
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 5.15569820e-04, 0.00000000e+00],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 3.71210271e-02, 2.57699700e-02],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         1.00000000e+00, 4.64012838e-03, 0.00000000e+00]]),
 'Breed1': array([0.01227415, 0.01155449, 0.01063493, ..., 0.01223417, 0.01227415,
        0.01227415])}

In [66]:
# Create the tensorflow Dataset
batch_size = 32
# TODO shuffle the train dataset!
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
dev_ds = tf.data.Dataset.from_tensor_slices((X_dev, y_dev)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices(process_features(
    test_dataset, one_hot_columns, numeric_columns, embedded_columns, test=True)[0]).batch(batch_size)
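
For the shuffling TODO above, one option is to permute the arrays before building the dataset. The sketch below applies a single random permutation to every array in the feature dictionary and to the targets, so rows stay aligned; `shuffle_features` and the tiny demo arrays are hypothetical stand-ins for `X_train` and `y_train`.

```python
import numpy as np


def shuffle_features(features, targets, seed=0):
    """Apply one random permutation to every feature array and to the targets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(targets))
    shuffled = {name: values[order] for name, values in features.items()}
    return shuffled, targets[order]


# Tiny made-up stand-ins for X_train and y_train.
X_demo = {
    'direct_features': np.arange(8).reshape(4, 2),
    'Breed1': np.array([10, 11, 12, 13]),
}
y_demo = np.array([0, 1, 2, 3])

X_shuffled, y_shuffled = shuffle_features(X_demo, y_demo)
print(X_shuffled['Breed1'], y_shuffled)
```

This shuffles only once. Calling `.shuffle(buffer_size)` on the `tf.data.Dataset` before `.batch(batch_size)` is the other common choice, and as we understand it that variant draws a new order on each epoch by default.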

In [67]:
X_train

{'direct_features': array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         1.00000000e+00, 1.03113964e-03, 1.28849850e-03],
        [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
         0.00000000e+00, 5.15569820e-04, 0.00000000e+00],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 1.03113964e-03, 0.00000000e+00],
        ...,
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 5.15569820e-04, 0.00000000e+00],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         0.00000000e+00, 3.71210271e-02, 2.57699700e-02],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, ...,
         1.00000000e+00, 4.64012838e-03, 0.00000000e+00]]),
 'Breed1': array([0.01227415, 0.01155449, 0.01063493, ..., 0.01223417, 0.01227415,
        0.01227415])}

### Using only the one-hot encoding features

In [75]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten, Dropout, concatenate
from tensorflow.keras.models import Model

tf.keras.backend.clear_session()
inputs = []

direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
inputs.append(direct_features_input)


output_layer = layers.Dense(5, activation='softmax')(direct_features_input)

model = models.Model(inputs=inputs, outputs=output_layer)

In [76]:
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 12)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 65        
Total params: 65
Trainable params: 65
Non-trainable params: 0
_________________________________________________________________
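
The parameter count in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_param_count` below is just for illustration.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 12 direct features feeding 5 softmax units, as in the summary above.
print(dense_param_count(12, 5))
```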


In [77]:
import mlflow
mlflow.set_experiment('very_base_approach')

with mlflow.start_run(nested=True):
    # Log model hyperparameters first
    mlflow.log_param('embedded_columns', embedded_columns)
    mlflow.log_param('one_hot_columns', one_hot_columns)
    # mlflow.log_param('numerical_columns', numerical_columns)  # Not using these yet
    
    # Train
    epochs = 10
    history = model.fit(train_ds, epochs=epochs)
    
    # Evaluate
    loss, accuracy = model.evaluate(X_dev, y_dev)
    print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
    mlflow.log_metric('epochs', epochs)
    mlflow.log_metric('loss', loss)
    mlflow.log_metric('accuracy', accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4491049939301084 - accuracy: 0.3056211471557617


### Using all the features

In [78]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten, Dropout, concatenate
from tensorflow.keras.models import Model

tf.keras.backend.clear_session()

embedding_layers = []
inputs = []
for embedded_col, max_value in embedded_columns.items():
    input_layer = layers.Input(shape=(1,), name=embedded_col)
    inputs.append(input_layer)
    # Define the embedding layer
    embedding_size = int(max_value / 4)
    embedding_layers.append(
        tf.squeeze(layers.Embedding(input_dim=max_value, output_dim=embedding_size)(input_layer), axis=-2))
    print('Adding embedding of size {} for layer {}'.format(embedding_size, embedded_col))
    
    
direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
inputs.append(direct_features_input)
features = layers.concatenate(embedding_layers + [direct_features_input])
output_layer = layers.Dense(5, activation='softmax')(features)

model = models.Model(inputs=inputs, outputs=output_layer)

Adding embedding of size 77 for layer Breed1


In [79]:
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Breed1 (InputLayer)             [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1, 77)        23716       Breed1[0][0]                     
__________________________________________________________________________________________________
tf_op_layer_Squeeze (TensorFlow [(None, 77)]         0           embedding[0][0]                  
__________________________________________________________________________________________________
direct_features (InputLayer)    [(None, 12)]         0                                            
______________________________________________________________________________________________
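
The embedding row of the summary can be checked the same way: an embedding layer stores one vector per vocabulary entry and has no bias. The vocabulary size of 308 used below is not printed anywhere in the notebook; it is inferred from the reported 23716 parameters and the embedding size of 77, so treat it as an assumption.

```python
def embedding_param_count(vocab_size, embedding_dim):
    """One vector of length embedding_dim per vocabulary entry, no bias."""
    return vocab_size * embedding_dim


vocab_size = 308                     # assumed value of dataset['Breed1'].max() + 1
embedding_dim = int(vocab_size / 4)  # same rule as in the model definition
print(embedding_dim, embedding_param_count(vocab_size, embedding_dim))

# Width of the concatenated vector that feeds the final Dense layer.
print(embedding_dim + 12)
```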

In [80]:
import mlflow
mlflow.set_experiment('very_base_approach')

with mlflow.start_run(nested=True):
    # Log model hyperparameters first
    mlflow.log_param('embedded_columns', embedded_columns)
    mlflow.log_param('one_hot_columns', one_hot_columns)
    # mlflow.log_param('numerical_columns', numerical_columns)  # Not using these yet
    
    # Train
    epochs = 10
    history = model.fit(train_ds, epochs=epochs)
    
    # Evaluate
    loss, accuracy = model.evaluate(X_dev, y_dev)
    print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
    mlflow.log_metric('epochs', epochs)
    mlflow.log_metric('loss', loss)
    mlflow.log_metric('accuracy', accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4485660191690466 - accuracy: 0.3136513829231262


### Feature selection

In [81]:
dataset.columns

Index(['Type', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
       'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
       'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'Description',
       'AdoptionSpeed', 'PID'],
      dtype='object')

In [104]:
# It's important to always use the same one-hot length
one_hot_columns = {
    one_hot_col: dataset[one_hot_col].max()
    for one_hot_col in ['Gender', 'Color1', 'Color2', 'Color3','Vaccinated','Dewormed','Sterilized','FurLength']
}
embedded_columns = {
    embedded_col: dataset[embedded_col].max() + 1
    for embedded_col in ['Breed1']
}
numeric_columns = ['Age', 'Fee', 'MaturitySize']
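
A caveat when adding columns to `one_hot_columns`: `process_features` subtracts 1 from each id before encoding, which assumes ids start at 1. If a column can contain 0 (we assume here that `Color2` and `Color3` use 0 for "no color", which should be checked against the data), the shifted id becomes -1 and ends up in the last column, colliding with the largest id. The sketch below reproduces the effect with NumPy indexing on made-up ids and shows a hypothetical alternative, `one_hot_with_zero`, that keeps 0 as its own class.

```python
import numpy as np


def one_hot_with_zero(values, max_value):
    """One-hot encode ids in [0, max_value], keeping 0 as its own class."""
    values = np.asarray(values)
    if values.min() < 0 or values.max() > max_value:
        raise ValueError('ids must lie in [0, max_value]')
    return np.eye(max_value + 1)[values]


color2 = np.array([0, 2, 7])  # made-up ids; 0 is assumed to mean "no second color"

# Shifting by one sends id 0 to index -1, which NumPy reads as the last column,
# so ids 0 and 7 end up with the same encoding.
shifted = np.eye(7)[color2 - 1]
print(shifted[0].argmax(), shifted[2].argmax())

# Keeping 0 as a class gives every id its own column.
safe = one_hot_with_zero(color2, max_value=7)
print(safe.argmax(axis=1))
```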

In [105]:
# TODO (optional) put these three types of columns in the same dictionary with "column types"
X_train, y_train = process_features(dataset, one_hot_columns, numeric_columns, embedded_columns)
direct_features_input_shape = (X_train['direct_features'].shape[1],)
X_dev, y_dev = process_features(dev_dataset, one_hot_columns, numeric_columns, embedded_columns)


In [106]:
# Create the tensorflow Dataset
batch_size = 32
# TODO shuffle the train dataset!
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
dev_ds = tf.data.Dataset.from_tensor_slices((X_dev, y_dev)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices(process_features(
    test_dataset, one_hot_columns, numeric_columns, embedded_columns, test=True)[0]).batch(batch_size)

In [107]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten, Dropout, concatenate
from tensorflow.keras.models import Model

tf.keras.backend.clear_session()

embedding_layers = []
inputs = []
for embedded_col, max_value in embedded_columns.items():
    input_layer = layers.Input(shape=(1,), name=embedded_col)
    inputs.append(input_layer)
    # Define the embedding layer
    embedding_size = int(max_value / 4)
    embedding_layers.append(
        tf.squeeze(layers.Embedding(input_dim=max_value, output_dim=embedding_size)(input_layer), axis=-2))
    print('Adding embedding of size {} for layer {}'.format(embedding_size, embedded_col))
    
    
direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
inputs.append(direct_features_input)
features = layers.concatenate(embedding_layers + [direct_features_input])
output_layer = layers.Dense(5, activation='softmax')(features)

model = models.Model(inputs=inputs, outputs=output_layer)

Adding embedding of size 77 for layer Breed1


In [108]:
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Breed1 (InputLayer)             [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1, 77)        23716       Breed1[0][0]                     
__________________________________________________________________________________________________
tf_op_layer_Squeeze (TensorFlow [(None, 77)]         0           embedding[0][0]                  
__________________________________________________________________________________________________
direct_features (InputLayer)    [(None, 39)]         0                                            
______________________________________________________________________________________________

In [109]:
import mlflow
mlflow.set_experiment('very_base_approach')

with mlflow.start_run(nested=True):
    # Log model hyperparameters first
    mlflow.log_param('embedded_columns', embedded_columns)
    mlflow.log_param('one_hot_columns', one_hot_columns)
    # mlflow.log_param('numerical_columns', numerical_columns)  # Not using these yet
    
    # Train
    epochs = 30
    history = model.fit(train_ds, epochs=epochs)
    
    # Evaluate
    loss, accuracy = model.evaluate(X_dev, y_dev)
    print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
    mlflow.log_metric('epochs', epochs)
    mlflow.log_metric('loss', loss)
    mlflow.log_metric('accuracy', accuracy)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


*** Test loss: 1.4230773237779786 - accuracy: 0.3462446928024292
