# Kaggle Titanic Notebook

## Import data

In [58]:
import pandas as pd
df = pd.read_csv('./data/train.csv')
df.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


## Exploratory data analysis

### Shape of the data

In [59]:
df.shape

(891, 12)

### What is the base survival rate to begin with?
The model should do better than simply learning that most passengers on the Titanic died, regardless of their features. The base survival rate gives the baseline to beat.

In [60]:
df[df['Survived'] == 1].shape[0] / df.shape[0]

0.3838383838383838
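
A survival rate of about 0.384 means that always predicting "did not survive" would already be right about 61.6% of the time. The sketch below shows how that majority-class baseline can be derived from the labels. It uses a small hypothetical series (`survived`) in place of `df['Survived']` so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['Survived'] (not the real data).
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# Fraction of positive labels: the same quantity computed in the cell above.
survival_rate = survived.mean()

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = max(survival_rate, 1 - survival_rate)

print(f"survival rate: {survival_rate:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained classifier should be compared against this number rather than against 50%.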

### Are there any columns having missing data?

In [61]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The `Age` and `Cabin` columns have a significant amount of missing data (177 and 687 of 891 rows, respectively), and `Embarked` is missing 2 values. If these columns are to be used as features, the missing values must first be filled in using a data imputation technique.
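
As a rough illustration of what such imputation could look like, the sketch below fills a numeric column with its median, a categorical column with its most frequent value, and reduces a mostly-missing column to a "known / unknown" flag. The toy frame and the `HasCabin` column name are assumptions made for this example; the notebook itself does not prescribe a strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking the three columns with missing values.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

imputed = toy.copy()

# Numeric column: fill gaps with the median of the observed values.
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())

# Categorical column with few gaps: fill with the most frequent value.
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])

# Mostly-missing column: keep only whether a value was recorded (assumed name).
imputed["HasCabin"] = imputed["Cabin"].notna().astype(int)
imputed = imputed.drop(columns=["Cabin"])

print(imputed)
```

In a real workflow the fill values would be computed on the training split only and then reused for the validation and test splits, to avoid leaking information between them.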

### What are the average metrics by `Survived` value?

In [62]:
# numeric_only=True skips text columns such as Name and Ticket, which cannot be averaged
df.groupby('Survived').mean(numeric_only=True).sort_values('Age')

Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare
1,444.368421,1.950292,28.34369,0.473684,0.464912,48.395408
0,447.016393,2.531876,30.626179,0.553734,0.32969,22.117887


## Machine learning

### Multilayer perceptron classifier

#### Training over varying `m` values
Train the model repeatedly while increasing the number of examples `m` in the training set. For each trained model, compute the training and cross-validation errors and plot them as a function of `m`. The shape of these curves helps determine whether the model suffers from high bias (underfitting) or high variance (overfitting).

_**High bias**_

The training error converges to the cross validation error, but both are high. Possible solutions include:
* Feature engineering
* Decrease the regularization parameter

_**High variance**_

The training error is small compared to the cross-validation error, leaving a large gap between the two. The error the model could ideally achieve lies somewhere between them. Possible solutions include:
* Get more training examples (not possible in this case)
* Try a smaller set of features
* Increase the regularization parameter
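
The procedure described above can be sketched as a simple loop over training set sizes. This is a minimal sketch, not the notebook's actual experiment: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the list of sizes in `train_sizes` is an arbitrary assumption. The classifier settings mirror those used in this notebook's training cell, with `max_iter` raised to reduce convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): 4 features, binary target.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]   # fixed cross-validation set

train_sizes = [50, 100, 200, 400]  # values of m to try (arbitrary choice)
train_errors, val_errors = [], []

for m in train_sizes:
    clf = MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2),
                        max_iter=500, random_state=1)
    clf.fit(X_train[:m], y_train[:m])
    # Error is 1 - accuracy; training error is measured on the m examples used.
    train_errors.append(1 - clf.score(X_train[:m], y_train[:m]))
    val_errors.append(1 - clf.score(X_val, y_val))

for m, tr, va in zip(train_sizes, train_errors, val_errors):
    print(f"m={m:4d}  train error={tr:.3f}  cv error={va:.3f}")
```

Plotting `train_errors` and `val_errors` against `train_sizes` gives the learning curves discussed above. scikit-learn also provides `sklearn.model_selection.learning_curve`, which automates a cross-validated version of this loop.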

#### Training over varying number of hidden layers
To find a reasonable number of hidden layers, train the model with several different hidden-layer configurations and compare their cross-validation errors. The configuration with the lowest cross-validation error is taken as the "optimal" one for the model.
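
A minimal sketch of this selection step follows. The candidate configurations in `candidates` and the synthetic data are assumptions made for illustration; in the notebook the same loop would run on the training and validation splits of the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): 4 features, binary target.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

# Hypothetical candidates with one, two and three hidden layers.
candidates = [(5,), (5, 2), (5, 5, 2)]
val_error = {}

for layers in candidates:
    model = MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=layers,
                          max_iter=500, random_state=1)
    model.fit(X_train, y_train)
    val_error[layers] = 1 - model.score(X_val, y_val)

# Keep the configuration with the lowest cross-validation error.
best_layers = min(val_error, key=val_error.get)
print(val_error)
print("selected hidden_layer_sizes:", best_layers)
```

Because the validation set is used to make this choice, the selected model's final performance should be reported on a separate held-out test set.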

#### Training

In [77]:
import numpy as np
from sklearn.neural_network import MLPClassifier

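# Shuffle once, then split 60% / 20% / 20% into train, validation and test sets.
shuffled = df.sample(frac=1, random_state=1)
train_end, validate_end = int(.6 * len(df)), int(.8 * len(df))
df_train = shuffled.iloc[:train_end]
df_validate = shuffled.iloc[train_end:validate_end]
df_test = shuffled.iloc[validate_end:]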
features_col = ["Fare", "Pclass", "SibSp", "Parch"]

clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(df_train[features_col],
        df_train["Survived"])

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(5, 2), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

#### Errors (training and cross-validation)

Calculating the training accuracy (the training error is 1 minus this value):

In [79]:
from sklearn.metrics import accuracy_score
predictions = clf.predict(df_train[features_col])
accuracy_score(df_train["Survived"], predictions)

0.5898876404494382

Calculating the cross-validation accuracy (the cross-validation error is 1 minus this value):

In [80]:
predictions = clf.predict(df_validate[features_col])
accuracy_score(df_validate["Survived"], predictions)

0.651685393258427
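
Note that `accuracy_score` returns the fraction of correct predictions, so the two values above correspond to errors of roughly 0.410 (training) and 0.348 (cross-validation). For context, the survival rate computed earlier implies a majority-class accuracy of about 0.616 on the full data set, so these scores are close to that baseline. A small helper for converting between the two, shown here on hypothetical toy labels, could look like this (the name `error_rate` is an assumption, not a scikit-learn function):

```python
from sklearn.metrics import accuracy_score


def error_rate(y_true, y_pred):
    """Misclassification rate: the fraction of wrong predictions."""
    return 1.0 - accuracy_score(y_true, y_pred)


# Hypothetical toy labels: 3 of 5 predictions are correct.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]

print(f"error rate: {error_rate(y_true, y_pred):.3f}")
```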