## Training a Denoising AutoEncoder on the Titanic Dataset

In [1]:
# imports
import numpy as np
import pandas as pd
from tabular_dae.model import DAE
from sklearn.linear_model import RidgeClassifierCV
from sklearn.metrics.pairwise import cosine_similarity

### Data 

This is a classification problem: the goal is to predict whether a passenger survived the disaster.

In [2]:
df = pd.read_csv('./titanic.csv')
print(df.head())

# separating the inputs from the output
y = df['Survived']
df.drop('Survived', axis=1, inplace=True)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


### DAE model

+ By default, the `DAE` model class uses a `Deep Stacked AutoEncoder` network.
+ It is trained on the inputs only, in a self-supervised setting: we show the model corrupted data and ask it to identify the corruption and correct it.
+ In doing so, the model learns a representation that can be passed to downstream tasks.
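
To make the corruption step concrete, here is a minimal sketch of one common corruption scheme for tabular data, swap noise, in which a random fraction of cells is replaced by values taken from the same column in other rows. Whether `tabular_dae` uses exactly this scheme is an assumption; the function name `swap_noise` and the `swap_prob` parameter are hypothetical and not part of the library.

```python
import numpy as np

def swap_noise(X, swap_prob=0.15, rng=None):
    """Corrupt a 2-D array by swapping cells with values from random rows.

    Hypothetical helper for illustration; not part of tabular_dae.
    Returns the corrupted array and a boolean mask of the cells chosen for swapping.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    # choose which cells to corrupt
    mask = rng.random((n_rows, n_cols)) < swap_prob
    # for every cell, pick a random donor row; the column stays the same
    donor_rows = rng.integers(0, n_rows, size=(n_rows, n_cols))
    donors = X[donor_rows, np.arange(n_cols)]
    # keep the original value unless the cell was chosen for swapping
    corrupted = np.where(mask, donors, X)
    return corrupted, mask

# toy numeric matrix standing in for preprocessed inputs
X = np.arange(20, dtype=float).reshape(5, 4)
corrupted, mask = swap_noise(X, swap_prob=0.3, rng=0)
print(mask.mean())  # fraction of cells chosen for swapping
```

Under this scheme, a denoising model would receive `corrupted` as input and be trained to predict `mask` (identify the corruption) and to reconstruct `X` (correct it).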

In [3]:
dae = DAE()  
dae.fit(df, verbose=1)

epoch    0 - train loss 1.6727 - valid loss 1.6714
epoch   10 - train loss 0.8122 - valid loss 0.8907
epoch   20 - train loss 0.7478 - valid loss 0.8262
epoch   30 - train loss 0.7393 - valid loss 0.7672
Epoch    34: reducing learning rate of group 0 to 3.0000e-05.
epoch   40 - train loss 0.7345 - valid loss 0.7529
Epoch    45: reducing learning rate of group 0 to 3.0000e-06.
epoch   50 - train loss 0.7490 - valid loss 0.7447
Epoch    56: reducing learning rate of group 0 to 3.0000e-07.
epoch   60 - train loss 0.7492 - valid loss 0.7623
epoch   70 - train loss 0.7295 - valid loss 0.7276
Epoch    75: reducing learning rate of group 0 to 3.0000e-08.
epoch   80 - train loss 0.7145 - valid loss 0.7364
Epoch    86: reducing learning rate of group 0 to 3.0000e-09.
epoch   90 - train loss 0.7264 - valid loss 0.7147
epoch  100 - train loss 0.7393 - valid loss 0.7724
epoch  110 - train loss 0.7419 - valid loss 0.7394
Early Stopping Triggered, best score is:  0.6458579421043396


### Extract Hidden Representations

With a trained DAE model, we can extract the hidden representation for the dataset and use that for various tasks, like building classifiers in a supervised setting or running clustering. 

In [4]:
features = dae.transform(df)
print(features.shape)
print(features[:5, :5])

(891, 384)
[[0.20682499 0.         0.35262352 0.44356048 0.        ]
 [0.6579324  0.45734215 0.         0.         0.10632943]
 [0.35681987 0.         0.         0.492716   0.        ]
 [0.55539024 0.8284649  0.31546336 0.         0.        ]
 [0.         0.         0.11778796 0.614768   0.19499007]]
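
As a sketch of the clustering use case mentioned above, the snippet below runs k-means on a matrix with the same shape as `features`. The random `stand_in_features` array is a placeholder so that the snippet runs on its own, and the choice of 3 clusters is arbitrary; in the notebook, the output of `dae.transform(df)` would be passed instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# placeholder with the same shape as the DAE output (891 rows, 384 columns);
# replace it with `features` from dae.transform(df) in the notebook
rng = np.random.default_rng(0)
stand_in_features = rng.random((891, 384))

# n_clusters=3 is an arbitrary choice for illustration
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(stand_in_features)

# number of rows assigned to each cluster
cluster_sizes = np.bincount(cluster_ids, minlength=3)
print(cluster_sizes)
```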


### Use the Hidden Representation for a Classifier

Let's try a simple linear classifier.

In [5]:
classifier = RidgeClassifierCV(alphas=[1, 5, 10, 20], cv=5).fit(features, y)
print('5 Fold Cross-Validation Accuracy: {:4.2f}%'.format(np.round(classifier.best_score_ * 100, 4)))


5 Fold Cross-Validation Accuracy: 81.14%
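
To show what the alpha search amounts to, here is a rough sketch that writes the selection loop out explicitly with `cross_val_score`. It uses synthetic data from `make_classification` as a stand-in so that it runs on its own; the internals of `RidgeClassifierCV` may differ in detail, so the numbers are not guaranteed to match `best_score_` exactly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for (features, y) so the snippet is self-contained
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

alphas = [1, 5, 10, 20]
mean_scores = {}
for alpha in alphas:
    # mean accuracy over 5 folds for this regularization strength
    scores = cross_val_score(RidgeClassifier(alpha=alpha), X, y, cv=5)
    mean_scores[alpha] = scores.mean()

# pick the alpha with the highest mean cross-validated accuracy
best_alpha = max(mean_scores, key=mean_scores.get)
print(best_alpha, mean_scores[best_alpha])
```

On the fitted `classifier` from the cell above, the selected regularization strength is available as `classifier.alpha_`.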


### Similarity Query

With the learned representations, we can calculate the similarity or distance between data points in the latent space.

In [6]:
# calculating pairwise similarity scores using the hidden representation
similarity_matrics = cosine_similarity(features)
np.fill_diagonal(similarity_matrics, 0)

In [7]:
# show row 0 side by side with its most similar row
pd.concat([df.iloc[0, :], df.iloc[similarity_matrics[0, :].argmax(), :]], axis=1)

                                   0                              12
PassengerId                        1                              13
Pclass                             3                               3
Name         Braund, Mr. Owen Harris  Saundercock, Mr. William Henry
Sex                             male                            male
Age                             22.0                            20.0
SibSp                              1                               0
Parch                              0                               0
Ticket                     A/5 21171                       A/5. 2151
Fare                            7.25                            8.05
Cabin                            NaN                             NaN


In [8]:
# show row 42 side by side with its most similar row
pd.concat([df.iloc[42, :], df.iloc[similarity_matrics[42, :].argmax(), :]], axis=1)

                              42                36
PassengerId                   43                37
Pclass                         3                 3
Name         Kraeff, Mr. Theodor  Mamee, Mr. Hanna
Sex                         male              male
Age                          NaN               NaN
SibSp                          0                 0
Parch                          0                 0
Ticket                    349253              2677
Fare                      7.8958            7.2292
Cabin                        NaN               NaN
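
The queries above return only the single most similar row. As a sketch of how they could be generalized, the hypothetical helper `top_k_similar` below (not part of `tabular_dae`) returns the `k` nearest rows by cosine similarity. The small `toy` matrix is a stand-in for the DAE features so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(features, query_idx, k=3):
    """Return indices and scores of the k rows most similar to row query_idx.

    Hypothetical helper for illustration.
    """
    # similarity of the query row against every row
    sims = cosine_similarity(features[query_idx:query_idx + 1], features).ravel()
    # exclude the query row itself from the ranking
    sims[query_idx] = -np.inf
    # indices sorted from most to least similar, truncated to k
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

# small toy matrix standing in for the DAE features
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
idx, scores = top_k_similar(toy, query_idx=0, k=2)
print(idx, scores)
```

Setting the query's own score to `-np.inf` rather than 0 keeps the exclusion valid even if some similarities are negative.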
