# Practical Assignment 1: Problem Statement

In this assignment we will work with the PetFinder dataset that you used in the *Aprendizaje Supervisado* (Supervised Learning) course. The task is to predict the adoption speed of a set of pets. To that end, we will also use [this Kaggle competition](https://www.kaggle.com/t/8842af91604944a9974bd6d5a3e097c5).

During this stage we will implement basic and not-so-basic MLP models, and look at the different hyperparameters and architectures we can choose from. We will also compare two common representations for categorical data: *one-hot encodings* and *embeddings*. The first exercise consists of implementing and training a basic model; the second consists of exploring different combinations of features and hyperparameters.
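To make the comparison between the two representations concrete, here is a minimal sketch in plain Python. It shows that an embedding lookup yields the same vector as multiplying a one-hot vector by the embedding table; the practical difference is that the table is small, dense, and learned during training. The table values and the 4-category vocabulary are made up for illustration and are not part of the assignment skeleton.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(vector, matrix):
    """Multiply a row vector by a matrix (list of rows), written out explicitly."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]


# Hypothetical embedding table: 4 categories, each mapped to a 2-dimensional vector.
embedding_table = [
    [0.1, -0.3],
    [0.7, 0.2],
    [-0.5, 0.9],
    [0.0, 0.4],
]

category = 2
dense_via_one_hot = matvec(one_hot(category, 4), embedding_table)
dense_via_lookup = embedding_table[category]

print(dense_via_one_hot, dense_via_lookup)
```

With a one-hot input the first dense layer effectively stores one weight row per category, so the input width grows with the vocabulary; an embedding keeps the integer code as input and only looks up the corresponding row.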

To solve the exercises, we provide a skeleton that you can complete in the file `practico_1_train_petfinder.py`. This skeleton already contains many of the functions needed to combine the representations of the different columns, as seen in notebook 2; you can also add more columns, including the columns with numeric values.

## Exercise 1

1. Build a classification pipeline with a Keras MLP model. You can start with a simplified version that has only an Input layer, to which you pass the values of the *one-hot encoded* columns.

2. Train one or more models (two or three is enough; we will see more of this in assignment 2). Evaluate the models on the dev and test sets.

## Exercise 2

1. Use the same model as before and explore how the results change as columns are added or removed.

2. Rerun a hyperparameter search, taking into account the information added by the new columns.

3. Upload the results to the Kaggle competition.


Finally, you must report the hyperparameters and results of all the trained models. For this, you can use the results collected with *mlflow* and process them in a notebook. You must submit that notebook, or a file (pdf|md), with the conclusions you can draw. This report must describe:
  * The hyperparameters used to process each column of the dataset. Which columns affect performance the most? Why?
  * The decisions made when building each model: regularization, batch normalization, dropout, number and size of layers, optimizer.
  * The training process: train/dev split, batch size, number of epochs, evaluation metrics. Select the best hyperparameters based on their performance. The training process should be the same for all models.
  * An analysis of whether the classifier is overfitting. This can be determined from the result of the `fit` method.
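
The overfitting check in the last bullet can be sketched as follows. This is one possible heuristic, not a prescribed method: it assumes `fit` was called with `validation_data`, so that the returned history dictionary holds both a `loss` and a `val_loss` list. The helper name, the `patience` threshold, and the loss values are hypothetical.

```python
def overfitting_report(history, patience=3):
    """Summarize train vs. validation loss curves from a Keras-style history dict.

    Flags overfitting when the validation loss reached its minimum at least
    `patience` epochs before the end and has since gotten worse.
    """
    train = history["loss"]
    val = history["val_loss"]
    best_epoch = min(range(len(val)), key=val.__getitem__)
    epochs_since_best = len(val) - 1 - best_epoch
    return {
        "best_epoch": best_epoch + 1,          # 1-based, as in the Keras log
        "final_gap": val[-1] - train[-1],      # positive gap: worse on validation
        "overfitting": epochs_since_best >= patience and val[-1] > val[best_epoch],
    }


# Hypothetical curves: training loss keeps falling while validation loss turns up.
toy_history = {
    "loss": [1.50, 1.45, 1.40, 1.35, 1.30, 1.25],
    "val_loss": [1.52, 1.47, 1.46, 1.48, 1.50, 1.53],
}
report = overfitting_report(toy_history)
print(report)
```

In practice the dictionary would come from `history.history` after training.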




# Solution

## Exercise 1

### We start with a basic model that uses only the 'Gender' and 'Color1' features, with one-hot encoding

In [4]:
import os
import mlflow
import numpy
import pandas
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models
import time

In [5]:
TARGET_COL = 'AdoptionSpeed'

In [6]:
def process_features(df, one_hot_columns, numeric_columns, embedded_columns, test=False):
    direct_features = []

    # Create one hot encodings
    for one_hot_col, max_value in one_hot_columns.items():
        direct_features.append(tf.keras.utils.to_categorical(df[one_hot_col] - 1, max_value))

    for col_name in numeric_columns:
        direct_features.append(tf.keras.utils.normalize(df[col_name].values).reshape(-1,1))

    # Concatenate all features that don't need further embedding into a single matrix.
    features = {'direct_features': numpy.hstack(direct_features)}

    # Create embedding columns - nothing to do here. We will use the zero embedding for OOV
    for embedded_col in embedded_columns.keys():
        features[embedded_col] = df[embedded_col].values

    if not test:
        nlabels = df[TARGET_COL].unique().shape[0]
        # Convert labels to one-hot encodings
        targets = tf.keras.utils.to_categorical(df[TARGET_COL], nlabels)
    else:
        targets = None
    
    return features, targets
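
One detail of the one-hot step above deserves a note. The `df[one_hot_col] - 1` shift assumes category codes start at 1. For a column that contains 0 (for example `Fee` or `Color2` in the dataset preview), the index becomes -1, and as far as we can tell NumPy-style negative indexing then marks the last slot, so code 0 and the maximum code share a representation. The sketch below reproduces that collision with plain lists and shows one possible alternative based on an explicit vocabulary; both helper names are hypothetical and not part of the skeleton.

```python
def one_hot_shifted(value, num_classes):
    """Mimic the `value - 1` indexing; a negative index wraps to the end."""
    row = [0] * num_classes
    row[value - 1] = 1
    return row


def one_hot_vocab(value, vocabulary):
    """One-hot over an explicit vocabulary, with a trailing slot for unseen values."""
    row = [0] * (len(vocabulary) + 1)
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    row[index] = 1
    return row


# Code 0 and code 3 collide when the column is shifted by one.
collision = one_hot_shifted(0, 3) == one_hot_shifted(3, 3)
print(collision)

# With a vocabulary built from the training split, every code has its own slot.
vocabulary = [0, 1, 2, 3]
print(one_hot_vocab(0, vocabulary), one_hot_vocab(3, vocabulary), one_hot_vocab(9, vocabulary))
```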

In [7]:
def load_dataset(dataset_dir, batch_size):

    # Read train dataset (and maybe dev, if you need to...)
    dataset, dev_dataset = train_test_split(
        pandas.read_csv(os.path.join(dataset_dir, 'train.csv')), test_size=0.2)
    
    test_dataset = pandas.read_csv(os.path.join(dataset_dir, 'test.csv'))
    
    print('Training samples {}, test_samples {}'.format(
        dataset.shape[0], test_dataset.shape[0]))
    
    return dataset, dev_dataset, test_dataset

In [8]:
dataset_dir = "./"
batch_size = 32
dataset, dev_dataset, test_dataset = load_dataset(dataset_dir, batch_size)
nlabels = dataset[TARGET_COL].unique().shape[0]

Training samples 8465, test_samples 4411


In [9]:
dataset

Unnamed: 0,Type,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,FurLength,Vaccinated,Dewormed,Sterilized,Health,Quantity,Fee,State,Description,AdoptionSpeed,PID
5838,2,2,266,0,2,2,4,0,2,1,2,2,2,1,1,0,41326,Tabby Urgent for adoption foc 😺long tail 😺soft...,2,8225
4786,1,18,307,307,2,1,5,0,2,2,2,2,1,1,2,0,41326,These two lovely doggies urgently need a home ...,4,6782
691,1,2,307,0,1,2,0,0,2,1,1,1,2,1,1,90,41326,"Mixed Breed Male, 2 months Old Cute & Loving A...",4,973
695,2,4,264,0,2,1,2,0,2,3,1,1,2,1,1,0,41336,Yucca unik bagi saya because she is double fac...,0,977
4608,1,36,205,0,2,2,5,0,1,2,1,1,1,1,1,0,41326,Sahsa was a negleted doggie...she is verry lov...,1,6534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8181,1,29,179,0,1,2,5,0,1,1,1,1,1,1,1,150,41327,"My poodle name joci,is a very active and prote...",4,11558
7302,2,2,266,0,1,2,6,0,2,1,2,1,2,1,1,0,41326,He's a handsome little kitten. Already he unde...,2,10267
8236,2,3,265,0,2,1,7,0,1,2,2,2,2,1,4,0,41326,"I have 6 cats in my house However, due to my f...",1,11639
4117,1,3,307,0,2,1,2,0,2,1,2,2,2,1,1,0,41401,Gentle 2.5 month old female puppy looking for ...,3,5828


In [10]:
one_hot_columns = {
    one_hot_col: dataset[one_hot_col].max()
    for one_hot_col in ['Gender', 'Color1']
}
embedded_columns = {}
numeric_columns = []

In [11]:
X_train, y_train = process_features(dataset, one_hot_columns, numeric_columns, embedded_columns)
direct_features_input_shape = (X_train['direct_features'].shape[1],)
X_dev, y_dev = process_features(dev_dataset, one_hot_columns, numeric_columns, embedded_columns)

In [16]:
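# Create the tensorflow Dataset
# Note: Keras ignores fit(shuffle=True) for tf.data.Dataset inputs; to shuffle,
# call .shuffle(buffer_size) on the dataset before .batch(batch_size)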
batch_size = 32
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
dev_ds = tf.data.Dataset.from_tensor_slices((X_dev, y_dev)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices(process_features(
    test_dataset, one_hot_columns, numeric_columns, embedded_columns, test=True)[0]).batch(batch_size)

In [17]:
# Reset the global Keras state before building a new model
tf.keras.backend.clear_session()
inputs = []

direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
inputs.append(direct_features_input)

output_layer = layers.Dense(5, activation='softmax')(direct_features_input)

model = models.Model(inputs=inputs, outputs=output_layer)

In [18]:
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 10)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 55        
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________
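
The parameter count in the summary can be checked by hand: a Dense layer with a bias has (input width + 1) × units parameters. The sketch below reproduces the 55 parameters, assuming 'Gender' has a maximum code of 3 and 'Color1' a maximum code of 7. These values are consistent with the input width of 10 but are not printed in the notebook, and the helper name is illustrative.

```python
def dense_softmax_params(one_hot_sizes, n_numeric=0, n_classes=5):
    """Return (input width, parameter count) for a single Dense softmax layer.

    Each one-hot column contributes as many inputs as its maximum code,
    each numeric column contributes one input, and every output unit has a bias.
    """
    input_width = sum(one_hot_sizes.values()) + n_numeric
    return input_width, (input_width + 1) * n_classes


# Assumed maximum codes for the two columns used in the basic model.
width, params = dense_softmax_params({"Gender": 3, "Color1": 7})
print(width, params)
```

The same arithmetic explains why one-hot encoding a high-cardinality column inflates the model: every extra category adds one input and five weights.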


In [19]:
mlflow.set_experiment('exp1')

with mlflow.start_run(nested=True):
    # Log model hyperparameters first
    mlflow.log_param('one_hot_columns', one_hot_columns)
    
    # Train
    epochs = 10
    history = model.fit(train_ds, epochs=epochs, shuffle=True)
    
    # Evaluate on the dev split (the message below is labeled "Test")
    loss, accuracy = model.evaluate(X_dev, y_dev)
    print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
    mlflow.log_metric('epochs', epochs)
    mlflow.log_metric('loss', loss)
    mlflow.log_metric('accuracy', accuracy)
    
    predictions = model.predict(test_ds)
    labels = numpy.argmax(predictions, axis=-1)
    timestr = time.strftime("%Y%m%d-%H%M%S")
    submission = pandas.DataFrame(list(zip(test_dataset["PID"], labels)), columns=["PID", "AdoptionSpeed"])
    os.makedirs("./submissions", exist_ok=True)  # to_csv fails if the directory is missing
    filename = "./submissions/submission_" + timestr + ".csv"
    submission.to_csv(filename, header=True, index=False)
    mlflow.log_param('filename', filename)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4639827919051396 - accuracy: 0.2990080416202545


### With 10 epochs, we obtain an accuracy close to 30% on the dev split, with a loss close to 1.46
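
An accuracy of about 30% on five classes is easier to interpret next to a trivial baseline: uniform guessing gives 20%, and always predicting the most frequent class gives the share of that class. The sketch below computes the majority-class baseline; the label list is a toy example, not the real distribution of `AdoptionSpeed`, and the function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, count = counts.most_common(1)[0]
    return most_common_label, count / len(labels)


# Toy labels for illustration only.
toy_labels = [4, 2, 4, 1, 3, 2, 4, 0, 2, 4]
label, baseline_accuracy = majority_baseline_accuracy(toy_labels)
print(label, baseline_accuracy)
```

On the real data the call would be `majority_baseline_accuracy(dev_dataset[TARGET_COL].tolist())`.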

## Exercise 2

### Taking the 'Gender' and 'Color1' features as a base, we want to see which of the remaining features yields the largest increase in accuracy

In [31]:
columns = list(dataset.columns.values)
columns.remove('Gender')
columns.remove('Color1')
columns.remove('Description')
columns.remove('AdoptionSpeed')
columns.remove('PID')

for feature in columns:
    print(feature)

Type
Age
Breed1
Breed2
Color2
Color3
MaturitySize
FurLength
Vaccinated
Dewormed
Sterilized
Health
Quantity
Fee
State


In [32]:
# Maps each experiment name to [dev accuracy, dev loss]
case_dict = {}

def concatenate_list_str(lst):
    # Build a name such as '_Gender_Color1_Age' from the list elements
    return ''.join("_" + str(element) for element in lst)

for feature in columns:
    
    actual_columns = ['Gender', 'Color1', feature]

    one_hot_columns = {
        one_hot_col: dataset[one_hot_col].max()
        for one_hot_col in actual_columns
    }
    
    X_train, y_train = process_features(dataset, one_hot_columns, numeric_columns, embedded_columns)
    direct_features_input_shape = (X_train['direct_features'].shape[1],)
    X_dev, y_dev = process_features(dev_dataset, one_hot_columns, numeric_columns, embedded_columns)
    batch_size = 32
    train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
    dev_ds = tf.data.Dataset.from_tensor_slices((X_dev, y_dev)).batch(batch_size)
    test_ds = tf.data.Dataset.from_tensor_slices(process_features(
    test_dataset, one_hot_columns, numeric_columns, embedded_columns, test=True)[0]).batch(batch_size)
    
    tf.keras.backend.clear_session()
    inputs = []

    direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
    inputs.append(direct_features_input)

    output_layer = layers.Dense(5, activation='softmax')(direct_features_input)

    model = models.Model(inputs=inputs, outputs=output_layer)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
    model.summary()
    
    cstr = concatenate_list_str(actual_columns)
    mlflow.set_experiment(cstr)

    with mlflow.start_run(nested=True):
        # Log model hyperparameters first
        mlflow.log_param('one_hot_columns', one_hot_columns)

        # Train
        epochs = 10
        history = model.fit(train_ds, epochs=epochs, shuffle=True)

        # Evaluate
        loss, accuracy = model.evaluate(X_dev, y_dev)
        print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
        mlflow.log_metric('epochs', epochs)
        mlflow.log_metric('loss', loss)
        mlflow.log_metric('accuracy', accuracy)
        
        case_dict[cstr] = [accuracy, loss]

        predictions = model.predict(test_ds)
        labels = numpy.argmax(predictions, axis=-1)
        timestr = time.strftime("%Y%m%d-%H%M%S")
        submission = pandas.DataFrame(list(zip(test_dataset["PID"], labels)), columns=["PID", "AdoptionSpeed"])
        filename = "./submissions/submission_" + timestr + ".csv"
        submission.to_csv(filename, header=True, index=False)
        mlflow.log_param('filename', filename)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 12)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 65        
Total params: 65
Trainable params: 65
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.457984046886482 - accuracy: 0.29522910714149475
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 265)]             0         
_________________________________________________________________
dense (Dense)                (None, 5)                 1330      
Total params: 1,330
Trainable params: 1,330
Non-trainable params: 0
_________________________________________________________________
INFO: '_Gender_Color1_Age' does not exist. Creating a new experiment
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4233182040260939 - accuracy: 0.35663676261901855
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 317)]             0         
_________________________________________________________________
dense (Dense)                (None, 5)                 1590      
Total params: 1,590
Trainable params: 1,590
Non-trainable params: 0
_________________________________________________________________
INFO: '_Gender_Color1_Breed1' does not exist. Creating a new experiment
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4446851042683078 - accuracy: 0.3103448152542114
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 317)]             0         
_________________________________________________________________
dense (Dense)                (None, 5)                 1590      
Total params: 1,590
Trainable params: 1,590
Non-trainable params: 0
_________________________________________________________________
INFO: '_Gender_Color1_Breed2' does not exist. Creating a new experiment
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4600213317866588 - accuracy: 0.29381200671195984
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 17)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 90        
Total params: 90
Trainable params: 90
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4645676705203194 - accuracy: 0.30467644333839417
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 17)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 90        
Total params: 90
Trainable params: 90
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4647070548418393 - accuracy: 0.2942843735218048
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 14)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 75        
Total params: 75
Trainable params: 75
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4608871632721494 - accuracy: 0.29853567481040955
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 13)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 70        
Total params: 70
Trainable params: 70
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.463119297473567 - accuracy: 0.30514881014823914
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 13)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 70        
Total params: 70
Trainable params: 70
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4526280096582944 - accuracy: 0.31506848335266113
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 13)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 70        
Total params: 70
Trainable params: 70
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4587576623001244 - accuracy: 0.3008975088596344
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 13)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 70        
Total params: 70
Trainable params: 70
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4422319247354356 - accuracy: 0.3249881863594055
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 13)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 70        
Total params: 70
Trainable params: 70
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4635240321571557 - accuracy: 0.31128954887390137
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 30)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 155       
Total params: 155
Trainable params: 155
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4606375433557373 - accuracy: 0.2975909411907196
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 3010)]            0         
_________________________________________________________________
dense (Dense)                (None, 5)                 15055     
Total params: 15,055
Trainable params: 15,055
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.4622830280899495 - accuracy: 0.30325934290885925
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 41425)]           0         
_________________________________________________________________
dense (Dense)                (None, 5)                 207130    
Total params: 207,130
Trainable params: 207,130
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


*** Test loss: 1.455547559863844 - accuracy: 0.30184224247932434


In [33]:
print('\tcase \t\t\t\t accuracy \t\t\t\t loss\n')
sorted_case_dict = sorted(case_dict.items(), key=lambda kv: kv[1], reverse=True)
for v in sorted_case_dict:
    print(v[0], "\t\t", v[1][0], "\t\t", v[1][1])

	case 				 accuracy 				 loss

_Gender_Color1_Age 		 0.35663676 		 1.4233182040260939
_Gender_Color1_Sterilized 		 0.3249882 		 1.4422319247354356
_Gender_Color1_Vaccinated 		 0.31506848 		 1.4526280096582944
_Gender_Color1_Health 		 0.31128955 		 1.4635240321571557
_Gender_Color1_Breed1 		 0.31034482 		 1.4446851042683078
_Gender_Color1_FurLength 		 0.3051488 		 1.463119297473567
_Gender_Color1_Color2 		 0.30467644 		 1.4645676705203194
_Gender_Color1_Fee 		 0.30325934 		 1.4622830280899495
_Gender_Color1_State 		 0.30184224 		 1.455547559863844
_Gender_Color1_Dewormed 		 0.3008975 		 1.4587576623001244
_Gender_Color1_MaturitySize 		 0.29853567 		 1.4608871632721494
_Gender_Color1_Quantity 		 0.29759094 		 1.4606375433557373
_Gender_Color1_Type 		 0.2952291 		 1.457984046886482
_Gender_Color1_Color3 		 0.29428437 		 1.4647070548418393
_Gender_Color1_Breed2 		 0.293812 		 1.4600213317866588


### We observe that the most important features turn out to be: 'Age', 'Sterilized', 'Vaccinated', 'Health', 'Breed1' and 'FurLength'
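
When reading this ranking it helps to know how much of the spread could be sampling noise. A rough guide is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below assumes the dev split holds 2117 rows, the size implied by a 20% split that leaves 8465 rows for training; the notebook does not print that number, and the function name is illustrative.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


dev_size = 2117  # assumption: inferred from the split, not printed in the notebook
standard_error = accuracy_standard_error(0.30, dev_size)
print(round(standard_error, 4))
```

Under that assumption the standard error is about one percentage point, so differences of that size between features should be read with caution, especially since each run also uses a single random split.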

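### Finally, we use these features, adding 'Age' and 'Fee' as numeric, 'Breed1' as embedded, and 'Sterilized', 'Vaccinated', 'Health' and 'FurLength' with one-hot encoding.

Note that the model built in the next cell declares only the `direct_features` input (24 columns in its summary), so the 'Breed1' codes are prepared in `process_features` but are not consumed by the network.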

In [52]:
one_hot_columns = {
    one_hot_col: dataset[one_hot_col].max()
    for one_hot_col in ['Gender', 'Color1', 'Sterilized', 'Vaccinated', 'Health', 'FurLength']
}
embedded_columns = {
    embedded_col: dataset[embedded_col].max() + 1
    for embedded_col in ['Breed1']
}
numeric_columns = ['Age', 'Fee']

X_train, y_train = process_features(dataset, one_hot_columns, numeric_columns, embedded_columns)
direct_features_input_shape = (X_train['direct_features'].shape[1],)
X_dev, y_dev = process_features(dev_dataset, one_hot_columns, numeric_columns, embedded_columns)
batch_size = 32
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
dev_ds = tf.data.Dataset.from_tensor_slices((X_dev, y_dev)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices(process_features(
test_dataset, one_hot_columns, numeric_columns, embedded_columns, test=True)[0]).batch(batch_size)

tf.keras.backend.clear_session()
inputs = []

direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
inputs.append(direct_features_input)

output_layer = layers.Dense(5, activation='softmax')(direct_features_input)

model = models.Model(inputs=inputs, outputs=output_layer)

model.compile(loss='categorical_crossentropy', optimizer='adam',
          metrics=['accuracy'])
model.summary()

mlflow.set_experiment("selected_features")

with mlflow.start_run(nested=True):
    # Log model hyperparameters first
    mlflow.log_param('one_hot_columns', one_hot_columns)

    # Train
    epochs = 100
    history = model.fit(train_ds, epochs=epochs, shuffle=True)

    # Evaluate
    loss, accuracy = model.evaluate(X_dev, y_dev)
    print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
    mlflow.log_metric('epochs', epochs)
    mlflow.log_metric('loss', loss)
    mlflow.log_metric('accuracy', accuracy)

    predictions = model.predict(test_ds)
    labels = numpy.argmax(predictions, axis=-1)
    timestr = time.strftime("%Y%m%d-%H%M%S")
    submission = pandas.DataFrame(list(zip(test_dataset["PID"], labels)), columns=["PID", "AdoptionSpeed"])
    filename = "./submissions/submission_" + timestr + ".csv"
    submission.to_csv(filename, header=True, index=False)
    mlflow.log_param('filename', filename)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 24)]              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 125       
Total params: 125
Trainable params: 125
Non-trainable params: 0
_________________________________________________________________
INFO: 'selected_features' does not exist. Creating a new experiment
Epoch 1/100
...
Epoch 100/100


*** Test loss: 1.4374646576271903 - accuracy: 0.3268776535987854


### Now we try encoding the 'Age' and 'Fee' features with one-hot encoding

In [54]:
one_hot_columns = {
    one_hot_col: dataset[one_hot_col].max()
    for one_hot_col in ['Gender', 'Color1', 'Sterilized', 'Vaccinated', 'Health', 'FurLength', 'Age', 'Fee']
}
embedded_columns = {
    embedded_col: dataset[embedded_col].max() + 1
    for embedded_col in ['Breed1']
}
numeric_columns = []

X_train, y_train = process_features(dataset, one_hot_columns, numeric_columns, embedded_columns)
direct_features_input_shape = (X_train['direct_features'].shape[1],)
X_dev, y_dev = process_features(dev_dataset, one_hot_columns, numeric_columns, embedded_columns)
batch_size = 32
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
dev_ds = tf.data.Dataset.from_tensor_slices((X_dev, y_dev)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices(process_features(
test_dataset, one_hot_columns, numeric_columns, embedded_columns, test=True)[0]).batch(batch_size)

tf.keras.backend.clear_session()
inputs = []

direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
inputs.append(direct_features_input)

output_layer = layers.Dense(5, activation='softmax')(direct_features_input)

model = models.Model(inputs=inputs, outputs=output_layer)

model.compile(loss='categorical_crossentropy', optimizer='adam',
          metrics=['accuracy'])
model.summary()

mlflow.set_experiment("selected_features")

with mlflow.start_run(nested=True):
    # Log model hyperparameters first
    mlflow.log_param('one_hot_columns', one_hot_columns)

    # Train
    epochs = 100
    history = model.fit(train_ds, epochs=epochs, shuffle=True)

    # Evaluate
    loss, accuracy = model.evaluate(X_dev, y_dev)
    print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
    mlflow.log_metric('epochs', epochs)
    mlflow.log_metric('loss', loss)
    mlflow.log_metric('accuracy', accuracy)

    predictions = model.predict(test_ds)
    labels = numpy.argmax(predictions, axis=-1)
    timestr = time.strftime("%Y%m%d-%H%M%S")
    submission = pandas.DataFrame(list(zip(test_dataset["PID"], labels)), columns=["PID", "AdoptionSpeed"])
    filename = "./submissions/submission_" + timestr + ".csv"
    submission.to_csv(filename, header=True, index=False)
    mlflow.log_param('filename', filename)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 3277)]            0         
_________________________________________________________________
dense (Dense)                (None, 5)                 16390     
Total params: 16,390
Trainable params: 16,390
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
...
Epoch 100/100


*** Test loss: 1.434569920579048 - accuracy: 0.3623051345348358


### We observe better results using 'Age' and 'Fee' with one-hot encoding than as numeric variables (dev accuracy 0.362 vs. 0.327).
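
One possible contributor to this gap, which the notebook does not test, is the scaling of the numeric columns. `process_features` uses `tf.keras.utils.normalize`, which by default performs L2 normalization; applied to a single column, it divides every value by the norm of the whole column, so with thousands of rows each value ends up very small. The sketch below contrasts that scaling with z-score standardization in plain Python, using the ages visible in the dataset preview; both helper names are hypothetical.

```python
import math
import statistics


def l2_scale(values):
    """Divide every value by the L2 norm of the whole column."""
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]


def standardize(values):
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0
    return [(v - mean) / std for v in values]


# Ages taken from the rows shown in the dataset preview.
ages = [2, 18, 2, 4, 36, 29, 2, 3, 3]
l2_scaled = l2_scale(ages)
z_scored = standardize(ages)

print([round(v, 3) for v in l2_scaled])
print([round(v, 3) for v in z_scored])
```

The squared L2-scaled values always sum to 1, so the more rows the column has, the smaller each individual input becomes; standardized values keep a comparable spread regardless of the number of rows.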

### Next, we apply one-hot encoding to all the selected features.

In [55]:
one_hot_columns = {
    one_hot_col: dataset[one_hot_col].max()
    for one_hot_col in ['Gender', 'Color1', 'Sterilized', 'Vaccinated', 'Health', 'FurLength', 'Age', 'Fee', 'Breed1']
}
embedded_columns = {}
numeric_columns = []

X_train, y_train = process_features(dataset, one_hot_columns, numeric_columns, embedded_columns)
direct_features_input_shape = (X_train['direct_features'].shape[1],)
X_dev, y_dev = process_features(dev_dataset, one_hot_columns, numeric_columns, embedded_columns)
batch_size = 32
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
dev_ds = tf.data.Dataset.from_tensor_slices((X_dev, y_dev)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices(process_features(
test_dataset, one_hot_columns, numeric_columns, embedded_columns, test=True)[0]).batch(batch_size)

tf.keras.backend.clear_session()
inputs = []

direct_features_input = layers.Input(shape=direct_features_input_shape, name='direct_features')
inputs.append(direct_features_input)

output_layer = layers.Dense(5, activation='softmax')(direct_features_input)

model = models.Model(inputs=inputs, outputs=output_layer)

model.compile(loss='categorical_crossentropy', optimizer='adam',
          metrics=['accuracy'])
model.summary()

mlflow.set_experiment("selected_features")

with mlflow.start_run(nested=True):
    # Log model hyperparameters first
    mlflow.log_param('one_hot_columns', one_hot_columns)

    # Train
    epochs = 100
    history = model.fit(train_ds, epochs=epochs, shuffle=True)

    # Evaluate
    loss, accuracy = model.evaluate(X_dev, y_dev)
    print("*** Test loss: {} - accuracy: {}".format(loss, accuracy))
    mlflow.log_metric('epochs', epochs)
    mlflow.log_metric('loss', loss)
    mlflow.log_metric('accuracy', accuracy)

    predictions = model.predict(test_ds)
    labels = numpy.argmax(predictions, axis=-1)
    timestr = time.strftime("%Y%m%d-%H%M%S")
    submission = pandas.DataFrame(list(zip(test_dataset["PID"], labels)), columns=["PID", "AdoptionSpeed"])
    filename = "./submissions/submission_" + timestr + ".csv"
    submission.to_csv(filename, header=True, index=False)
    mlflow.log_param('filename', filename)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
direct_features (InputLayer) [(None, 3584)]            0         
_________________________________________________________________
dense (Dense)                (None, 5)                 17925     
Total params: 17,925
Trainable params: 17,925
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
...
Epoch 100/100


*** Test loss: 1.4320617371929891 - accuracy: 0.3641946017742157


### We notice better performance on training, but no large improvement on the held-out set (accuracy 0.364 vs. 0.362).

### In general, we did not notice differences when changing the batch size hyperparameter

## In conclusion, without adding intermediate layers and with an adequate selection of features, we obtained accuracy above 36%.