# Kaggle Titanic Challenge

Here is a link to the dataset: https://www.kaggle.com/c/titanic.

## Goal
Apply machine learning to predict which passengers survived the Titanic sinking. 

**My goal is to submit a trained model and try to climb the leaderboard.**

## Overview
**training set: train.csv**

*Shape*: (891, 12)

**testing set: test.csv**

*Shape*: (418, 11)

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, KFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
params = {
    'axes.labelsize': 'large',
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25,7]
}
plt.rcParams.update(params)

In [None]:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
train.head(3)

In [None]:
train.describe()

In [None]:
train.dtypes

## Data Cleaning

First, I will impute the missing values. Then, I will proceed to feature engineering and pre-processing the data.

--- 

*Survived* is the label I want to predict. 0 means the passenger died, while 1 means they lived.

*PassengerId* will not be used in training, and so can be dropped.

### Imputing the Missing Values - Training

In [None]:
n_missing = train.isnull().sum().sort_values(ascending=False)
percent_missing = train.isnull().sum().sort_values(ascending=False) / len(train)
missing_train = pd.DataFrame(data=[n_missing, percent_missing])
missing_train

About 20% of the training data is missing *Age*, about 77% is missing *Cabin*, and 2 observations are missing *Embarked*.

#### Imputing Embarked

In [None]:
train[train['Embarked'].isnull()]

Impute these two observations with the mode of Embarked. 

In [None]:
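# Fill with the most frequent port (the mode) and assign the result back to the column.
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])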

#### Imputing Cabin

Since ~77% of the *Cabin* values are missing, one may consider dropping the feature. However, a missing *Cabin* may itself reflect the passenger's low socio-economic status, which may be a factor in their survival. As such, it is better to impute the missing values.

The observations missing *Cabin* will be imputed with 'N'.
All others will be replaced with the first letter of their *Cabin* value.

In [None]:
train['Cabin'] = train['Cabin'].fillna('N')

In [None]:
train.Cabin.unique()

In [None]:
train.Cabin = [i[0] for i in train.Cabin]

In [None]:
train.Cabin.value_counts()

#### Imputing Age

About 20% of the training observations are missing *Age*. To impute these, I will use Linear Regression to predict these missing values. 

However, this will be done once I engineer more features.

### Imputing the Missing Values - Testing

In [None]:
n_missing = test.isnull().sum().sort_values(ascending=False)
percent_missing = test.isnull().sum().sort_values(ascending=False) / len(test)
missing_test = pd.DataFrame(data=[n_missing, percent_missing])
missing_test

86 observations (about 21%) of the testing data are missing *Age*. About 78% are missing *Cabin*, and 1 observation is missing *Fare*.

#### Imputing Fare

In [None]:
test[test['Fare'].isnull()]

Impute this observation with the mean *Fare* of male passengers in *Pclass* 3 who embarked at 'S'.

In [None]:
miss_val = test.loc[(test.Sex == 'male') & (test.Pclass == 3) & (test.Embarked == 'S')].Fare.mean()
test['Fare'] = test['Fare'].fillna(miss_val)

#### Imputing Cabin

Perform the same imputation done to the training set to be consistent.

In [None]:
test['Cabin'] = test['Cabin'].fillna('N')

In [None]:
test.Cabin.unique()

In [None]:
test.Cabin = [i[0] for i in test.Cabin]

In [None]:
test.Cabin.value_counts()

In the training set, there is a *Cabin* with the label 'T', which does not appear in the testing set. When *Cabin* is one-hot encoded, I will drop the resulting *Cabin_T* column from the training set.

A model fitted on a feature that does not exist in the testing set cannot be applied to the testing set directly, because the columns no longer line up. A level that is absent from the testing set also cannot help predictions there, and fitting to it risks overfitting the training set.
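
One way to check for such a mismatch, and two ways to resolve it, can be sketched on toy data. The frames `toy_train` and `toy_test` are made up for illustration. Dropping the extra column is the approach taken in this notebook, while the `reindex` alternative is only shown for comparison.

```python
import pandas as pd

# Toy frames standing in for train/test; only 'Cabin' matters here.
toy_train = pd.DataFrame({"Cabin": ["A", "N", "T"]})
toy_test = pd.DataFrame({"Cabin": ["A", "N", "N"]})

train_dummies = pd.get_dummies(toy_train, columns=["Cabin"], dtype=int)
test_dummies = pd.get_dummies(toy_test, columns=["Cabin"], dtype=int)

# Columns present in the training frame but absent from the testing frame.
extra_cols = sorted(set(train_dummies.columns) - set(test_dummies.columns))
print(extra_cols)

# Option used in this notebook: drop the extra column from the training frame.
train_dropped = train_dummies.drop(columns=extra_cols)

# Alternative (not used here): add an all-zero column to the testing frame.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

Either way, the two frames end up with the same columns in the same order, which is what a fitted model expects at prediction time.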

#### Imputing Age

86 of the testing observations are missing *Age*. To impute these, I will use Linear Regression to predict these missing values. 

However, this will be done once I engineer more features.

### Feature Extraction

*Name* - Extract the title of each passenger. A passenger's title can have an effect on whether or not they survived. For instance, those with the title of *Master* may have been given priority to get on a lifeboat.

*SibSp* and *Parch* - Combine these into a new feature called *NumFamily*. For each passenger, *NumFamily* = *SibSp* + *Parch* + 1 (the 1 represents the passenger themself).

*Ticket* - More analysis needs to be done for this feature. 

#### Extracting Title from Name

In [None]:
def get_title(i):
    title = i.split(', ')[1].split('.')[0]
    return title
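
As a quick check of the parsing logic, the same split can be applied to a few strings in the "Surname, Title. Given names" format. The sample strings below are illustrative, and the function is repeated so that the snippet runs on its own.

```python
def get_title(name):
    # Take the text after the first ", " and before the first "."
    return name.split(', ')[1].split('.')[0]

samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [get_title(s) for s in samples]
print(titles)
```

Note that a multi-word title such as "the Countess" is returned whole, which is why the mapping dictionary uses it as a key.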

In [None]:
train['Title'] = train.Name.apply(get_title)
test['Title'] = test.Name.apply(get_title)

In [None]:
title_dict = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Ms': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Dr': 'Dr',
    'Rev': 'Rev',
    'Col': 'Officer',
    'Mlle': 'Miss',
    'Major': 'Officer',
    'the Countess': 'Royal',
    'Sir': 'Royal',
    'Capt': 'Officer',
    'Don': 'Royal',
    'Mme': 'Mrs',
    'Jonkheer': 'Royal',
    'Lady': 'Royal',
    'Dona': 'Royal'
}

train.Title = train.Title.map(title_dict)
train.drop('Name', axis=1, inplace=True)
test.Title = test.Title.map(title_dict)
test.drop('Name', axis=1, inplace=True)

In [None]:
train.Title.value_counts()

In [None]:
test.Title.value_counts()

#### Engineering NumFamily

In [None]:
train['NumFamily'] = train.SibSp + train.Parch + 1
test['NumFamily'] = test.SibSp + test.Parch + 1

#### Analyzing Ticket

In [None]:
train.Ticket.value_counts()

Looking at the unique values, there does not seem to be any reasonable way to group or clean up the data. *Ticket* has many distinct levels, and there is no clear way of pre-processing it.

One may argue that *Ticket* can represent the socio-economic status of a passenger, which can influence survival. However, this can already be explained by the *Fare*, which is a much more interpretable way of determining the socio-economic status of a passenger.

So, I will drop *Ticket* from both datasets.

In [None]:
train.drop('Ticket', axis=1, inplace=True)
test.drop('Ticket', axis=1, inplace=True)

### Pre-processing

*Pclass* - This is an integer, but what it really describes is the social class the passenger belongs to (1st, 2nd, 3rd). One-hot encode each social class level.

*Cabin*, *Embarked*, *Sex* - One-hot encode these.

*Age* - Use Linear Regression to determine the missing *Age* values.

*PassengerId* - Drop this, since we won't need this in training. 

---

After one-hot encoding, drop *Cabin_T* from the training set.
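
To make the one-hot encoding step concrete, here is a minimal sketch on a made-up frame. The `Fare` values are invented, and `dtype=int` is an assumption added so that the indicator columns come out as 0/1 integers.

```python
import pandas as pd

# Toy data: Pclass is stored as an integer but is really a category.
toy = pd.DataFrame({"Pclass": [1, 3, 2, 3], "Fare": [71.3, 7.9, 13.0, 8.1]})

# Each Pclass level becomes its own indicator column; other columns are kept.
encoded = pd.get_dummies(toy, columns=["Pclass"], dtype=int)
print(encoded.columns.tolist())

# Exactly one indicator is set per row.
row_sums = encoded[["Pclass_1", "Pclass_2", "Pclass_3"]].sum(axis=1)
```

The encoded columns are appended after the untouched ones, which matters later when features are selected by column position.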

In [None]:
train.drop('PassengerId', axis=1, inplace=True)

In [None]:
train = pd.get_dummies(train, columns=['Title','Pclass','Embarked', 'Sex', 'Cabin'])
test = pd.get_dummies(test, columns=['Title','Pclass','Embarked', 'Sex', 'Cabin'])

In [None]:
train.drop('Cabin_T', axis=1, inplace=True)

#### Imputing Age Via Linear Regression

In [None]:
def impute_age(df):
    # Rows with a known Age fit the regressor; rows without one receive predictions.
    missing = df.Age.isnull()

    # Features are every column from 'SibSp' onward, which excludes Age itself.
    known_X = df.loc[~missing, 'SibSp':]
    known_y = df.loc[~missing, 'Age']
    missing_X = df.loc[missing, 'SibSp':]

    linreg = LinearRegression().fit(known_X, known_y)
    df.loc[missing, 'Age'] = linreg.predict(missing_X)

In [None]:
impute_age(train)
impute_age(test)
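
The idea behind regression-based imputation can be sketched on synthetic data. This sketch uses numpy's least-squares solver in place of scikit-learn's `LinearRegression`, and the toy relationship (Age = 10 + 2 * NumFamily) is invented purely so that the result is easy to verify.

```python
import numpy as np
import pandas as pd

# Synthetic data: Age follows 10 + 2 * NumFamily exactly, with two gaps.
toy = pd.DataFrame({
    "Age": [12.0, 14.0, np.nan, 18.0, np.nan, 22.0],
    "NumFamily": [1, 2, 3, 4, 5, 6],
})
features = ["NumFamily"]
missing = toy["Age"].isna()

def design_matrix(rows):
    # Prepend a column of ones so the fit includes an intercept.
    X = toy.loc[rows, features].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# Fit on rows where Age is known, then predict the rows where it is missing.
coef, *_ = np.linalg.lstsq(design_matrix(~missing),
                           toy.loc[~missing, "Age"].to_numpy(), rcond=None)
toy.loc[missing, "Age"] = design_matrix(missing) @ coef
print(toy)
```

On real data the fit will not be exact, so the imputed ages are estimates rather than recovered values.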

## Data Analysis

**After Feature Extraction and Pre-Processing...**

train - Shape: (891,31)

test - Shape: (418, 30)

### Heat-map of the correlation of Top 10 Features on Survived

In [None]:
correlation_mat = train.corr(method='pearson')
corr_cols = correlation_mat.nlargest(10, 'Survived')['Survived'].index
correlation_mat = np.corrcoef(train[corr_cols].values.transpose())
f, ax = plt.subplots(figsize=(10,8))
sns.set(font_scale=1.25)
sns.heatmap(correlation_mat, square=True, annot=True, 
            yticklabels=corr_cols.values, xticklabels=corr_cols.values)
plt.show()

Being female was correlated relatively strongly with survival.

### Distribution of Survival on Gender

In [None]:
n_females_survived = train[(train['Sex_female'] == 1) & (train['Survived'] == 1)].Sex_female.sum()

n_males_survived = train[(train['Sex_male'] == 1) & (train['Survived'] == 1)].Sex_male.sum()

f, ax = plt.subplots(figsize=(10,8))
sns.barplot(x=['Males', 'Females'], y=[n_males_survived, n_females_survived]) 
plt.title('Gender vs. Survival')
plt.ylabel('No. Survived')
plt.show()

In [None]:
n_total_females = train[(train['Sex_female'] == 1)].Sex_female.sum()
n_total_males = train[(train['Sex_male'] == 1)].Sex_male.sum()

f, ax = plt.subplots(figsize=(10,8))
sns.barplot(x=['Males', 'Females'], y=[n_total_males, n_total_females]) 
plt.title('Gender vs. Amount that Boarded')
plt.ylabel('No. Boarded')
plt.show()

In [None]:
print(f'Percent of Males that Survived: {n_males_survived / n_total_males:.1%}')
print(f'Percent of Females that Survived: {n_females_survived / n_total_females:.1%}')
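
Because *Survived* is coded 0/1, the survival rate of a group is simply the mean of that column within the group. A compact sketch, assuming a toy frame that still has a raw `Sex` column (unlike the one-hot encoded frame used above):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones in each group.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

The same pattern works for any categorical column, for example passenger class.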

### Distribution of Survival on Pclass

In [None]:
n_firstclass_survived = train[(train['Pclass_1'] == 1) & (train['Survived'] == 1)].Pclass_1.sum()
n_secondclass_survived = train[(train['Pclass_2'] == 1) & (train['Survived'] == 1)].Pclass_2.sum()
n_thirdclass_survived = train[(train['Pclass_3'] == 1) & (train['Survived'] == 1)].Pclass_3.sum()

f, ax = plt.subplots(figsize=(10,8))
sns.barplot(x=['First Class', 'Second Class', 'Third Class'], 
            y=[n_firstclass_survived, n_secondclass_survived, n_thirdclass_survived]) 
plt.title('Passenger Class vs. Survival')
plt.ylabel('No. Survived')
plt.show()

In [None]:
n_firstclass_total = train[(train['Pclass_1'] == 1)].Pclass_1.sum()
n_secondclass_total = train[(train['Pclass_2'] == 1)].Pclass_2.sum()
n_thirdclass_total = train[(train['Pclass_3'] == 1)].Pclass_3.sum()

f, ax = plt.subplots(figsize=(10,8))
sns.barplot(x=['First Class', 'Second Class', 'Third Class'], 
            y=[n_firstclass_total, n_secondclass_total, n_thirdclass_total]) 
plt.title('Passenger Class vs. Total that Boarded')
plt.ylabel('No. Boarded')
plt.show()

In [None]:
print(f'Percent of First Class that Survived: {n_firstclass_survived / n_firstclass_total:.1%}')
print(f'Percent of Second Class that Survived: {n_secondclass_survived / n_secondclass_total:.1%}')
print(f'Percent of Third Class that Survived: {n_thirdclass_survived / n_thirdclass_total:.1%}')

## Modeling

- **Logistic Regression**

Hyper-parameters: 

C = [0.05, 0.01, 0.5, 0.1, 1, 1.5, 2]

penalty = ['l1', 'l2']                

- **Support Vector Machine**

Hyper-parameters: 

C = [0.05, 0.01, 0.5, 0.1, 1, 1.5, 2]

kernel = ['linear', 'poly', 'rbf']

degree = [2, 3]

gamma = [0.01, 0.1, 1, 2]

- **K-Nearest Neighbors**

Hyper-parameters: 

n_neighbors = [1...10]

- **Decision Tree Classifier**

Hyper-parameters:

criterion = ['gini', 'entropy']

max_depth = [2...10]

min_samples_split = [2...10]

After fitting the above, I will combine them with VotingClassifier, an ensemble method that aggregates the predictions of the individual models.

Evaluation Metric: Accuracy
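
To get a feel for the cost of a grid search, the number of candidate settings is the product of the list lengths, and each candidate is fitted once per cross-validation fold. The sketch below counts this for the Logistic Regression grid, using 5 folds as in the cells that follow.

```python
from itertools import product

C_values = [0.05, 0.01, 0.5, 0.1, 1, 1.5, 2]
penalties = ["l1", "l2"]

# Every (C, penalty) pair is one candidate model.
candidates = [{"C": c, "penalty": p} for c, p in product(C_values, penalties)]

n_folds = 5
n_cv_fits = len(candidates) * n_folds
print(len(candidates), n_cv_fits)
```

The grids with more hyper-parameters grow multiplicatively in the same way, which is why the SVM search is the slowest of the four.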

### Splitting Training Data

In [None]:
X = train.iloc[:, 1:]
y = train.Survived

In [None]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2)

### Logistic Regression

In [None]:
C_list = [0.05, 0.01, 0.5, 0.1, 1, 1.5, 2]
penalty_list = ['l1', 'l2']
logreg_params = {'C': C_list, 'penalty': penalty_list}

# liblinear supports both the 'l1' and 'l2' penalties searched below.
logreg = LogisticRegression(solver='liblinear')

logreg_grid = GridSearchCV(logreg, logreg_params, cv=5, scoring='accuracy')
logreg_grid.fit(X_train_val, y_train_val)

In [None]:
logreg_grid.best_score_

In [None]:
logreg_grid.best_params_

In [None]:
logreg_star = logreg_grid.best_estimator_.fit(X_train_val, y_train_val)
acc = accuracy_score(y_test, logreg_star.predict(X_test))

In [None]:
print('Logistic Regression Accuracy: ', acc)

### K-Nearest Neighbors

In [None]:
n_neighbors_list=np.arange(1, 11, 1)
knn_params = {'n_neighbors': n_neighbors_list}

knn_grid = KNeighborsClassifier()

knn_grid = GridSearchCV(knn_grid, knn_params, cv=5, scoring='accuracy')
knn_grid.fit(X_train_val, y_train_val)

In [None]:
knn_grid.best_score_

In [None]:
knn_grid.best_params_

In [None]:
knn_star = knn_grid.best_estimator_.fit(X_train_val, y_train_val)
acc = accuracy_score(y_test, knn_star.predict(X_test))

In [None]:
print('K-Nearest Neighbors Accuracy: ', acc)

### Decision Tree Classifier

In [None]:
criterion_list = ['gini', 'entropy']
max_depth_list = np.arange(2, 11, 1)
min_samples_split_list = np.arange(2, 11, 1)

dt_params = {'criterion': criterion_list, 'max_depth': max_depth_list, 
             'min_samples_split': min_samples_split_list}

dt_grid = DecisionTreeClassifier()
dt_grid = GridSearchCV(dt_grid, dt_params, cv=5, scoring='accuracy')

dt_grid.fit(X_train_val, y_train_val)

In [None]:
dt_grid.best_score_

In [None]:
dt_grid.best_params_

In [None]:
dt_star = dt_grid.best_estimator_.fit(X_train_val, y_train_val)
acc = accuracy_score(y_test, dt_star.predict(X_test))

In [None]:
print('Decision Tree Classifier Accuracy: ', acc)

### SVM

In [None]:
kernel_list = ['linear', 'poly', 'rbf']
degree_list = [2, 3]
gamma_list = [10e-3, 10e-2, 1, 2]
svc_params = {'C': C_list, 'kernel': kernel_list, 'degree': degree_list, 'gamma': gamma_list}

# probability=True enables predict_proba, which soft voting requires.
svc_grid = SVC(probability=True)

svc_grid = GridSearchCV(svc_grid, svc_params, cv=5, scoring='accuracy')
svc_grid.fit(X_train_val, y_train_val)

In [None]:
svc_grid.best_score_

In [None]:
svc_grid.best_params_

In [None]:
svc_star = svc_grid.best_estimator_.fit(X_train_val, y_train_val)
acc = accuracy_score(y_test, svc_star.predict(X_test))

In [None]:
print('SVM Accuracy: ', acc)

### VotingClassifier

In [None]:
voting_clf = VotingClassifier(estimators=[
    ('logreg_grid', logreg_grid),
    ('svc_grid', svc_grid),
    ('knn_grid', knn_grid),
    ('dt_grid', dt_grid)
], voting='soft').fit(X_train_val, y_train_val)

In [None]:
preds = voting_clf.predict(X_test)
acc = accuracy_score(y_test, preds)

In [None]:
print('Voting Classifier Accuracy: ', acc)
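
The difference between soft and hard voting can be illustrated with plain numpy for the binary case. The probabilities below are made up, and the sketch only mimics the aggregation step, not scikit-learn's full implementation.

```python
import numpy as np

# Hypothetical P(survived) from three models, one row per passenger.
proba_survived = np.array([
    [0.45, 0.45, 0.90],   # passenger A
    [0.20, 0.70, 0.80],   # passenger B
])

# Hard voting: each model casts a 0/1 vote and the majority wins.
votes = (proba_survived >= 0.5).astype(int)
hard = (votes.sum(axis=1) > votes.shape[1] / 2).astype(int)

# Soft voting: average the probabilities first, then threshold.
soft = (proba_survived.mean(axis=1) >= 0.5).astype(int)

print(hard, soft)
```

For passenger A, two lukewarm "no" votes outnumber one confident "yes" under hard voting, while soft voting lets the confident model tip the average above 0.5.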

## Submitting Test Predictions

In [None]:
passengerId = test.PassengerId
test.drop('PassengerId', axis=1, inplace=True)
submission = pd.DataFrame({
    'PassengerId': passengerId,
    'Survived': voting_clf.predict(test)
})
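
The competition expects a CSV with a *PassengerId* column and a *Survived* column. A minimal sketch of what that file looks like, written to an in-memory buffer with made-up IDs and predictions (writing to disk would pass a file path instead):

```python
import io
import pandas as pd

# Illustrative rows only; the real frame has one row per test passenger.
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False keeps the DataFrame index out of the file.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Leaving out `index=False` would add an unnamed first column to the output.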