## Exercise notebook for the fourth session

This is the exercise notebook for the fourth session of the [Machine Learning workshop series at Harvey Mudd College](http://www.aashitak.com/ML-Workshops/). Please feel free to ask for help from the instructor and/or TAs.

First, we import the Python modules:

In [1]:
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier



import warnings
warnings.simplefilter('ignore')

In today's exercise, we will work with the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic). The objective of this Kaggle competition is to predict whether a passenger survived, given features such as gender, age, and ticket class. We are going to build a few classification models for this task. The `train.csv` file contains the features along with each passenger's survival outcome, so we will use it to train and validate our models. The `test.csv` file contains only the features; we will use one of our trained models to predict survival for these passengers and [submit our predictions to the competition's leaderboard](https://www.kaggle.com/c/titanic/submit).

For your convenience, the data preprocessing and feature engineering from the previous sessions are summarized below.

In [2]:
path = 'titanic/'
train = pd.read_csv(path + 'train.csv')
# Keep the labels separately and remove them from the feature frame
target = train.Survived.astype('category')
train = train.drop('Survived', axis=1)

test = pd.read_csv(path + 'test.csv')
PassengerId = test.PassengerId

def get_Titles(df):
    # Extract the title, e.g. 'Mr.' from 'Braund, Mr. Owen Harris'
    df.Name = df.Name.apply(lambda name: re.findall(r"\s\S+[.]\s", name)[0].strip())
    df = df.rename(columns = {'Name': 'Title'})
    # Merge spelling variants and group uncommon titles under 'Rare'
    df.Title = df.Title.replace({'Ms.': 'Miss.', 'Mlle.': 'Miss.', 'Dr.': 'Rare', 'Mme.': 'Mrs.', 'Major.': 'Rare', 'Lady.': 'Rare', 'Sir.': 'Rare', 'Col.': 'Rare', 'Capt.': 'Rare', 'Countess.': 'Rare', 'Jonkheer.': 'Rare', 'Dona.': 'Rare', 'Don.': 'Rare', 'Rev.': 'Rare'})
    return df

def fill_Age(df):
    df.Age = df.Age.fillna(df.groupby("Title").Age.transform("median"))
    return df

def get_Group_size(df):
    Ticket_counts = df.Ticket.value_counts()
    df['Ticket_counts'] = df.Ticket.apply(lambda x: Ticket_counts[x])
    df['Family_size'] = df['SibSp'] + df['Parch'] + 1
    df['Group_size'] = df[['Family_size', 'Ticket_counts']].max(axis=1)
    return df

def process_features(df):
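    # Encode Sex as integer codes (female -> 0, male -> 1)
    df.Sex = df.Sex.astype('category').cat.codes
    features_to_keep = ['Age', 'Fare', 'Group_size', 'Pclass', 'Sex']
    # Copy so that later column assignments do not act on a slice
    df = df[features_to_keep].copy()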
    return df

def process_data(df):
    df = df.copy()
    df = get_Titles(df)
    df = fill_Age(df)
    df = get_Group_size(df)
    df = process_features(df)
    medianFare = df['Fare'].median()
    df['Fare'] = df['Fare'].fillna(medianFare)
    return df

X_train, X_test = process_data(train), process_data(test)
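
To see what the regular expression in `get_Titles` picks out, here is a minimal sketch that applies the same pattern to a few names written in the dataset's "Surname, Title. Given names" layout. The helper name `extract_title` and the short list of names are only for illustration.

```python
import re

# A few sample names in the "Surname, Title. Given names" layout
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]

def extract_title(name):
    """Return the first whitespace-delimited token that ends in a period."""
    return re.findall(r"\s\S+[.]\s", name)[0].strip()

titles = [extract_title(n) for n in names]
print(titles)  # ['Mr.', 'Mrs.', 'Miss.']
```

The pattern relies on the title being the first token that ends with a period and is surrounded by whitespace, which is why the extracted titles keep their trailing period.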

Please feel free to refer to the classification algorithms notebook for the code below.

First, split the data into training and validation sets using `train_test_split`, naming the variables `X_train, X_valid, y_train, y_valid`.

In [3]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, target, random_state=0)
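
As a rough illustration of what this split does, the sketch below runs `train_test_split` on stand-in arrays with 891 rows, the number of rows in `train.csv`. The arrays and the variable names ending in `_tr`, `_va`, and `_s` are placeholders, and the `stratify` option is shown only as something you may want to try, not as part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as train.csv (891)
X = np.arange(891 * 2).reshape(891, 2)
y = np.arange(891) % 2

# With no test_size given, train_test_split holds out 25% of the rows
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
print(X_tr.shape, X_va.shape)  # (668, 2) (223, 2)

# Optional: stratify=y keeps the class ratio nearly equal in both parts
X_tr_s, X_va_s, y_tr_s, y_va_s = train_test_split(
    X, y, random_state=0, stratify=y)
print(y_tr_s.mean(), y_va_s.mean())
```

Fixing `random_state` makes the split reproducible, so accuracy numbers can be compared between runs.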

In [4]:
X_train.head()

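|     | Age  | Fare   | Group_size | Pclass | Sex |
|-----|------|--------|------------|--------|-----|
| 105 | 28.0 | 7.8958 | 1          | 3      | 1   |
| 68  | 17.0 | 7.925  | 7          | 3      | 0   |
| 253 | 30.0 | 16.1   | 2          | 3      | 1   |
| 320 | 22.0 | 7.25   | 1          | 3      | 1   |
| 706 | 45.0 | 13.5   | 1          | 2      | 0   |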


In [5]:
y_train.head()

105    0
68     1
253    0
320    0
706    1
Name: Survived, dtype: category
Categories (2, int64): [0, 1]

Train a logistic regression classifier on `X_train, y_train` and test its accuracy on both `X_train, y_train` and `X_valid, y_valid`.

In [9]:
LR_clf = LogisticRegression()
LR_clf.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(LR_clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on validation set: {:.2f}'
     .format(LR_clf.score(X_valid, y_valid)))

Accuracy of Logistic regression classifier on training set: 0.80
Accuracy of Logistic regression classifier on validation set: 0.78


[The evaluation metric for this competition is accuracy](https://www.kaggle.com/c/titanic/overview/evaluation).
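
Accuracy is simply the fraction of passengers whose outcome is predicted correctly, which is also what the `score` method of a scikit-learn classifier returns. A small worked example, using made-up labels and predictions:

```python
import numpy as np

# Made-up true labels and predictions for eight passengers
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Six of the eight predictions match the true labels
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.75
```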

Try training a few more classifiers and compare their accuracy. Try tuning the hyperparameters too. You can also try more feature engineering by editing the code above.
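
One possible way to organize such a comparison is to loop over a dictionary of classifiers. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the hyperparameter values are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic features (5 columns, binary target)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

classifiers = {
    'Decision tree': DecisionTreeClassifier(max_depth=4, random_state=0),
    'K-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each classifier and record (training accuracy, validation accuracy)
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    scores[name] = (clf.score(X_tr, y_tr), clf.score(X_va, y_va))
    print('{}: training {:.2f}, validation {:.2f}'.format(name, *scores[name]))
```

In the notebook, you would replace the synthetic arrays with `X_train, y_train, X_valid, y_valid`. A large gap between training and validation accuracy is a sign of overfitting.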

In [11]:
svc_clf = SVC().fit(X_train, y_train)
print('Accuracy of Support Vector classifier on training set: {:.2f}'
     .format(svc_clf.score(X_train, y_train)))
print('Accuracy of Support Vector classifier on validation set: {:.2f}'
     .format(svc_clf.score(X_valid, y_valid)))

Accuracy of Support Vector classifier on training set: 0.91
Accuracy of Support Vector classifier on validation set: 0.72
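
The gap between training and validation accuracy above suggests the support vector classifier is overfitting. Support vector machines with an RBF kernel are sensitive to the scale of the features, so one thing worth trying is to standardize the features first. The following is only a sketch on synthetic data, with one column deliberately put on a larger scale; whether scaling helps on the Titanic features is something to check for yourself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is rescaled to mimic a column
# such as Fare that has a much wider range than the others
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X[:, 0] *= 100
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# The scaler is fit on the training data only, then applied before the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC(C=1.0, gamma='scale'))
scaled_svc.fit(X_tr, y_tr)

train_acc = scaled_svc.score(X_tr, y_tr)
valid_acc = scaled_svc.score(X_va, y_va)
print('training {:.2f}, validation {:.2f}'.format(train_acc, valid_acc))
```

Wrapping the scaler and the classifier in one pipeline means the same transformation is applied automatically when you later call `predict` on new data.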


In [20]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation

In [77]:
def run_MLP(X_train, y_train, input_dim, dropout, epochs, batch_size):
    print('-----------Running Multilayer Perceptron-----------')
    # Build model 
    model = Sequential()
    model.add(Dense(128, input_dim=input_dim, activation='relu'))
    model.add(Dropout(dropout))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(dropout))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(dropout))
    model.add(Dense(1, activation='sigmoid'))
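    # Choose optimizer and loss function
    # (`learning_rate` replaces the deprecated `lr`; the tiny `decay=1e-6`
    # is omitted because current Keras optimizers no longer accept it)
    opt = tf.keras.optimizers.Adam(learning_rate=0.001)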
    loss = 'binary_crossentropy'
    # Compile 
    model.compile(optimizer=opt, 
        loss=loss,
        metrics=['accuracy'])
    # Fit on training data
    model.fit(X_train, y_train,
        epochs=epochs,
        batch_size=batch_size)
    return model

In [95]:
model = run_MLP(X_train, y_train, 5, 0.1, 70, 128)
score = model.evaluate(X_valid, y_valid, batch_size=128)

-----------Running Multilayer Perceptron-----------
Train on 668 samples
Epoch 1/70
...
Epoch 70/70


In [96]:
# Tune hyperparameters using grid search (this is why we have the validation set)

In [97]:
# https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
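
The article linked above covers grid search for Keras models. As a simpler starting point, here is a minimal sketch of grid search with scikit-learn's `GridSearchCV`, using a random forest instead of the neural network and synthetic data as a stand-in for the Titanic features. The parameter grid is an arbitrary example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features (5 columns, binary target)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# Every combination of these values is tried with 5-fold cross-validation
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print(search.best_params_)
print('Cross-validated accuracy: {:.2f}'.format(search.best_score_))
# The held-out validation set gives a check on the selected model
valid_acc = search.score(X_va, y_va)
print('Validation accuracy: {:.2f}'.format(valid_acc))
```

After fitting, `search` behaves like the best estimator refit on all of the training data, so it can be used directly for `predict` and `score`.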

Once you have explored different classifiers and decided on one trained model (or a voting classifier ensemble, as seen before), use it to make predictions from the features in `X_test` and save the results in `y_test`.
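
If you would like to go the ensemble route, a voting classifier could be put together roughly as follows. This is a sketch on synthetic data; the choice of base classifiers and their settings is an assumption for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (5 columns, binary target)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# Hard voting: each classifier casts one vote and the majority class wins
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('dt', DecisionTreeClassifier(max_depth=4, random_state=0)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ],
    voting='hard')
voting_clf.fit(X_tr, y_tr)

y_pred = voting_clf.predict(X_va)
print('Validation accuracy: {:.2f}'.format(voting_clf.score(X_va, y_va)))
```

Unlike the Keras model, a scikit-learn classifier's `predict` already returns class labels, so no thresholding at 0.5 is needed before building the submission.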

In [98]:
# The sigmoid output is a probability; threshold at 0.5 to get class labels
y_prob = model.predict(X_test).ravel()
y_test = pd.Series((y_prob >= 0.5).astype(int))

We create a dataframe for submission using the predictions in `y_test` and save it to a CSV file. The submission file must be in the correct format to be graded without errors.

In [94]:
submission = pd.DataFrame({'PassengerId': PassengerId, 'Survived': y_test})
submission.to_csv('submission.csv', index=False)
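
Before uploading, it can be worth checking the file's format. The sketch below shows one way to do that; the helper `check_submission` is hypothetical, and the three-row frame is a toy stand-in for the real predictions. It assumes the expected layout is a `PassengerId` column and a `Survived` column containing only 0 or 1, with one row per passenger in `test.csv`.

```python
import io
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of format problems (empty if none are found)."""
    problems = []
    if list(submission.columns) != ['PassengerId', 'Survived']:
        problems.append('columns must be PassengerId, Survived')
        return problems
    if len(submission) != len(expected_ids):
        problems.append('wrong number of rows')
    elif not (submission['PassengerId'].values == expected_ids.values).all():
        problems.append('PassengerId values do not match test.csv')
    if not submission['Survived'].isin([0, 1]).all():
        problems.append('Survived must contain only 0 or 1')
    return problems

# Toy stand-ins for PassengerId and y_test
ids = pd.Series([892, 893, 894])
preds = pd.Series([0, 1, 0])
toy_submission = pd.DataFrame({'PassengerId': ids, 'Survived': preds})

# Round-trip through CSV text, as to_csv / read_csv would do with a file
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(check_submission(reloaded, ids))  # []
```

With the real data, you would call `check_submission(pd.read_csv('submission.csv'), PassengerId)` and expect an empty list.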