# Titanic-Machine Learning from Disaster - Ishan Saksena

### Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

Aim: Predicting the survival status of the passengers in test.csv

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pdb # Python debugger
from IPython.display import Image, display
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.tree import export_graphviz
from sklearn.impute import KNNImputer
from xgboost import XGBRegressor
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, learning_curve, train_test_split, GridSearchCV

sns.set()

%config InlineBackend.figure_format = 'retina' # Increase the figures' resolution in jupyter notebook
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same combined frame
df_data = pd.concat([df_train, df_test]).reset_index(drop=True)
df_submission = pd.read_csv('../input/titanic/gender_submission.csv')

In [None]:
df_train.head()

In [None]:
print('Training null values\n')
print(df_train.isnull().sum()) 
print('-'*30)
print('Testing null values\n')
print(df_test.isnull().sum())

In [None]:
print('Training info\n')
print(df_train.info())
print('-'*30)
print('Testing info\n')
print(df_test.info())

In [None]:
df_data.describe()

## Some Observations from the Data

### Features
    PassengerId
    Survived
    Pclass (Ticket class)
    Name
    Sex
    Age
    SibSp (# of siblings / spouses aboard the Titanic)
    Parch (# of parents / children aboard the Titanic)
    Ticket
    Fare
    Cabin
    Embarked (Port of Embarkation)
    
### Missing Values
    Age and Cabin have a number of missing values, Embarked has some

### Type
    Categorical features: Survived, Sex, Embarked, and Pclass(ordinal).
    Numerical features: Age, Fare. Discrete: SibSp, Parch.
    
### Distribution
    Only Name has 100% unique values

## Assumptions based on data analysis


### Correlating.

 - Will use heatmaps and other visual methods to figure out which data correlates to survival the most.
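
As a minimal sketch of this correlation check, the snippet below ranks features by the absolute value of their correlation with `Survived`. The frame `toy` and its values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 1, 0],
    'Sex#':     [0, 1, 1, 0, 1, 0, 1, 0],
    'Pclass':   [3, 1, 2, 3, 1, 3, 2, 1],
    'Fare':     [7.25, 71.3, 13.0, 8.05, 53.1, 8.46, 26.0, 30.0],
})

# Absolute correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()['Survived']
       .drop('Survived')
       .abs()
       .sort_values(ascending=False)
)
print(corr_with_target)
```

Passing `toy.corr()` to `sns.heatmap(..., annot=True)` would show the same numbers as a heatmap.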

### Completing.

- Fill Age value - A vital component 
- Embarked value 
- Cabin value - Has a lot of missing values and might not have high correlation. Potential Drop.
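
One possible way to complete a missing Embarked value is to compare the passenger's fare with the median fare of each port for the same ticket class and pick the closest one. The sketch below assumes that rule; the frame `toy`, its values, and the helper `closest_port` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'C', 'Q', None],
    'Pclass':   [1, 1, 1, 1, 1, 1],
    'Fare':     [50.0, 54.0, 76.0, 80.0, 90.0, 80.0],
})

# Median fare per (port, class); rows with a missing port are ignored by groupby
medians = toy.groupby(['Embarked', 'Pclass'])['Fare'].median()

def closest_port(row):
    # Hypothetical helper: port whose median fare (same class) is nearest to this fare
    same_class = medians.xs(row['Pclass'], level='Pclass')
    return (same_class - row['Fare']).abs().idxmin()

missing = toy['Embarked'].isna()
toy.loc[missing, 'Embarked'] = toy[missing].apply(closest_port, axis=1)
print(toy)
```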

### Correcting.

- Name and ID have no correlation with survival and can be dropped for the model. 
- Some Ticket values are duplicated, which is an interesting discovery. We can use this later with a simple assumption: passengers with the same ticket value may be acquainted with each other.
- Cabin can be dropped since we cannot assume that people with missing cabin entry were any different. 
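
The shared-ticket idea can be sketched by counting how many passengers hold each ticket. This is an illustration under assumed data: the frame `toy` and the column name `TicketCount` are made up here.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Name':   ['A', 'B', 'C', 'D', 'E'],
    'Ticket': ['111', '111', '222', '333', '333'],
})

# Number of passengers sharing each ticket value
toy['TicketCount'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Passengers whose ticket appears more than once
shared = toy[toy['TicketCount'] > 1]
print(shared)
```

Using `transform('count')` keeps the result aligned with the original rows, so no explicit loop over unique tickets is needed.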

### Creating.

- We could build a feature known as Family using the columns SibSp and Parch.
- We could extract the title from Name
- Turning the Age and Fare features into ordinal categorical features with Age bands and Fare bins
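
The three ideas above can be sketched on a small made-up frame. The names and values in `toy` are invented, and `pd.qcut(..., labels=False)` is used here as a shortcut that returns integer bin codes directly, which is an assumption of this sketch and not necessarily how the bins are encoded elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Name':  ['Doe, Mr. John', 'Roe, Miss. Jane', 'Poe, Master. Tim', 'Moe, Mrs. Ann'],
    'SibSp': [1, 0, 3, 1],
    'Parch': [0, 0, 1, 0],
    'Fare':  [7.25, 7.925, 21.075, 53.1],
})

# Family size = self + siblings/spouses + parents/children
toy['Fsize'] = toy['SibSp'] + toy['Parch'] + 1

# Title = the word immediately before the first period in Name
toy['Title'] = toy['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

# Two quantile-based fare bins, returned as integer codes
toy['FareBin'] = pd.qcut(toy['Fare'], 2, labels=False)
print(toy)
```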

### Classifying.

- Some Assumptions: 
- Women (Sex=female) and Children (Age<16) had higher survival rate.
- The upper-class passengers (Pclass=1) had higher survival rate.

## Exploratory Data Analysis (EDA)

In [None]:
display(df_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean())
display(df_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())
display(df_data[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean())
display(df_data[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean())
display(df_data[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())

In [None]:
sns.countplot(x=df_data['Embarked'], hue=df_data['Survived'])


In [None]:
sns.countplot(x=df_data['Sex'], hue=df_data['Survived'])

## Feature Engineering and Data Cleaning

In [None]:
#mapping the sex feature
df_data['Sex#'] = df_data['Sex'].map({'male': 0, 'female': 1})

In [None]:
# Create new feature for Family
df_data['Fsize'] = df_data['Parch'] + df_data['SibSp'] + 1 #Family = Self + Sib + Parents/Children

In [None]:
# Create new feature for IsAlone
df_data['IsAlone']  = 0
df_data.loc[df_data.Fsize == 1, 'IsAlone'] = 1

# Plot
sns.countplot(x=df_data['IsAlone'], hue=df_data['Survived'])
plt.show()

#This looks like important data 

In [None]:
# Making bins
# Fill missing Fare values with 80.0 (assigning back avoids the chained in-place fillna)
df_data['Fare'] = df_data['Fare'].fillna(80.0)
df_data.isna().sum()

df_data['FareBin'] = pd.qcut(df_data['Fare'], 6)



# Mapping the bins
label_encoder = LabelEncoder()
df_data['FareBin'] = label_encoder.fit_transform(df_data['FareBin'])


In [None]:
# Split again because we just engineered new features
df_train = df_data[:len(df_train)]
df_test = df_data[len(df_train):]

# Training set and labels
x_train = df_train.drop(labels=['Survived','PassengerId'], axis=1)
y_train = df_train['Survived']

# show columns
x_train.columns

In [None]:
#Extracting Family name by Regex - might help later
# Raw string avoids the invalid-escape warning; the pattern captures the text before the comma
df_data['Fname'] = df_data['Name'].str.extract(r'([A-Za-z]+.[A-Za-z]+),', expand=True)

In [None]:
# Assuming the cause of duplicate tickets is because the passengers knew each other. 

duplicates = []

for uniq in df_data['Ticket'].unique():
    temp = df_data.loc[df_data['Ticket'] == uniq, 'Name']
    if temp.count() > 1:
        duplicates.append(df_data.loc[df_data['Ticket'] == uniq, ['Name', 'Ticket', 'Fare', 'FareBin', 'Fsize', 'Survived']])
duplicates = pd.concat(duplicates)
duplicates.head(20)

In [None]:
df_friend = duplicates.loc[(duplicates.Fsize == 1) & (duplicates.Survived.notnull())]
df_family = duplicates.loc[(duplicates.Fsize > 1) & (duplicates.Survived.notnull())]
display(df_friend.head(), df_family.head())

In [None]:
print('The Duplicates: ', duplicates['Name'].count())
print('Family: ', df_family['Name'].count())
print('Friend: ', df_friend['Name'].count())
print('Other: ', duplicates['Name'].count() - df_family['Name'].count() - df_friend['Name'].count())

In [None]:
## Making a column for just Connected Survival
df_data['Connected_Survival'] = 0.5

for ticket_num, df_grp in df_data.groupby('Ticket'):
    if len(df_grp) > 1:  # Duplicates in Ticket
        for index, row in df_grp.iterrows():
            smax = df_grp.drop(index).Survived.max()
            smin = df_grp.drop(index).Survived.min()
            pid = row.PassengerId
            if smax == 1.0:
                df_data.loc[df_data['PassengerId'] == pid, 'Connected_Survival'] = 1
            elif smin == 0.0:
                df_data.loc[df_data['PassengerId'] == pid, 'Connected_Survival'] = 0

In [None]:
# Embarked Filling by checking Fare
df_data[df_data['Embarked'].isnull()][['Embarked', 'Pclass', 'Fare']]

In [None]:
# Check their relation in groups
df_data.groupby(['Embarked', 'Pclass'])[['Fare']].median()

In [None]:
# A fare of 80 is closest to the median fare of the (Embarked=C, Pclass=1) group, so assign C

# Fill the missing Embarked values with 'C'
df_data['Embarked'] = df_data['Embarked'].fillna('C')

# Mapping
df_data['Embarked#'] = df_data['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
df_data.head()

In [None]:
## Extracting Titles from Name - might help in filling Age
df_data['Title'] = df_data['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
df_data['Title'] = df_data['Title'].replace(['Capt', 'Col', 'Rev', 'Don', 'Countess', 'Jonkheer', 'Dona', 'Sir', 'Dr', 'Major'], 'Rare')
df_data['Title'] = df_data['Title'].replace(['Mlle', 'Mme', 'Ms'], 'Miss')
df_data['Title'] = df_data['Title'].replace(['Lady'], 'Mrs')
df_data['Title'] = df_data['Title'].map({"Mr":0, "Rare" : 1, "Master" : 2,"Miss" : 3, "Mrs" : 4 })

In [None]:
df_data.head()

#### Filling Values of Age
1. Linear Regression/XGBRegressor
2. Using Title
3. Using Pclass and Sex
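
Option 3 can be sketched as filling each missing age with the median age of passengers in the same ticket class and of the same sex. The frame `toy`, its values, and the column name `Age_filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex':    ['female', 'female', 'female', 'male', 'male', 'male'],
    'Age':    [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age of each (Pclass, Sex) group, aligned row by row with the frame
group_median = toy.groupby(['Pclass', 'Sex'])['Age'].transform('median')

# Keep known ages, fill only the missing ones
toy['Age_filled'] = toy['Age'].fillna(group_median)
print(toy)
```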

In [None]:
display(df_data.Age.describe())
display(df_train['Age'].isna().sum())
display(df_test['Age'].isna().sum())

In [None]:
#By Title - "Mr":0, "Rare" : 1, "Master" : 2,"Miss" : 3, "Mrs" : 4 
df_data.groupby('Title')['Age'].median().values


In [None]:
title_age = df_data.groupby('Title')['Age'].median().values
df_data['Age_pred1'] = df_data['Age']
for i in range(5):
    df_data.loc[(df_data['Title'] == i) & (df_data['Age'].isnull()), 'Age_pred1'] = title_age[i]

In [None]:
#By Linear Regression
x_train = df_data[df_data.Age.notnull()]
y_train = df_data[df_data.Age.notnull()]['Age']
x_test = df_data[df_data.Age.isnull()]

#select_feature = ['Sex#', 'Pclass', 'Title', 'FareBin','Embarked#', 'IsAlone']
select_feature = ['Sex#', 'Pclass', 'Title', 'FareBin']

In [None]:
reg = LinearRegression()
reg.fit(x_train[select_feature], y_train)
reg.score(x_train[select_feature], y_train)

In [None]:
df_data['Age_pred2'] = df_data['Age']
df_data.loc[df_data['Age'].isnull(), 'Age_pred2'] = reg.predict(x_test[select_feature]).astype('int')

In [None]:
xgb = XGBRegressor()
xgb.fit(x_train[select_feature], y_train)
xgb.score(x_train[select_feature], y_train)

In [None]:
df_data['Age_pred3'] = df_data['Age']
df_data.loc[df_data['Age'].isnull(), 'Age_pred3'] = xgb.predict(x_test[select_feature]).astype('int')

In [None]:
#Higher survival rate for age <16, we put filter for predicting minors

df_data['Minor_pred1'] = ((df_data['Age_pred1']) < 16)*1
df_data['Minor_pred2'] = ((df_data['Age_pred2']) < 16)*1
df_data['Minor_pred3'] = ((df_data['Age_pred3']) < 16)*1


In [None]:
# Bucketing Age like Fare


df_data['AgeBin_pred1'] = pd.qcut(df_data['Age_pred1'], 5)
#df_data['AgeBin_pred2'] = pd.qcut(df_data['Age_pred2'], 5)
df_data['AgeBin_pred3'] = pd.qcut(df_data['Age_pred3'], 5)

df_data['AgeBin_pred1'] = label_encoder.fit_transform(df_data['AgeBin_pred1'])
#df_data['AgeBin_pred2'] = label_encoder.fit_transform(df_data['AgeBin_pred2'])
df_data['AgeBin_pred3'] = label_encoder.fit_transform(df_data['AgeBin_pred3'])


In [None]:
## Assumption - What if a missing cabin value means that the passenger was not assigned a premium cabin?

df_data['Cabin'] = df_data['Cabin'].fillna(0)

In [None]:
def cabin(x):
    # 1 if a cabin was recorded, 0 if it was missing (missing values were filled with 0 above)
    return 0 if x == 0 else 1

In [None]:
df_data['Cabin'] = df_data['Cabin'].apply(cabin)

## Building Models

In [None]:
df_data[['PassengerId', 'Pclass', 'Sex#']] = df_data[['PassengerId', 'Pclass', 'Sex#']].astype('int32')
df_data.head()

In [None]:
df_train = df_data[:len(df_train)]
df_test = df_data[len(df_train):]

In [None]:
train_features = ['Survived', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Sex#', 'Fsize', 'IsAlone', 'FareBin', 'Connected_Survival', 'Embarked#', 'Age_pred1', 'Age_pred3', 'Minor_pred1', 'Minor_pred3', 'AgeBin_pred1', 'AgeBin_pred3', 'Cabin']

In [None]:
corr_mat = df_train[train_features].astype(float).corr()
corr_mat_fil = corr_mat.loc[:, 'Survived'].sort_values(ascending=False)
corr_mat_fil = pd.DataFrame(data=corr_mat_fil[1:])

In [None]:
plt.figure(figsize=(20,12))
bar = sns.barplot(x=corr_mat_fil.Survived.abs(), y=corr_mat_fil.index, data=corr_mat_fil, palette='deep')

In [None]:
train_features = ['Survived', 'Sex#', 'Connected_Survival', 'FareBin', 'Minor_pred3', 'Embarked#', 'AgeBin_pred3', 'Parch', 'Age_pred3','IsAlone', 'Pclass', 'Cabin']
corr_mat = df_train[train_features].astype(float).corr()

plt.figure(figsize=(20,10))
sns.heatmap(corr_mat.abs(), annot=True)
plt.show()

## Random Forest

In [None]:
#selected_features = ['Sex#', 'Pclass', 'FareBin', 'Connected_Survival', 'Minor_pred3', 'Embarked#', 'Cabin']
#selected_features = ['Sex#', 'Pclass', 'FareBin', 'Connected_Survival', 'Cabin']
selected_features = ['Sex#', 'Pclass', 'FareBin', 'Connected_Survival', 'Minor_pred3', 'Embarked#', 'IsAlone']


df_train = df_data[:len(df_train)]
df_test = df_data[len(df_train):]

x_train = df_train[selected_features]
y_train = df_train['Survived']
x_test = df_test[selected_features]

In [None]:
model = RandomForestClassifier(random_state=2)

grid_parameters = {'n_estimators': [i for i in range(300, 601, 50)], 'min_samples_split' : [10, 20, 30, 40]}
grid = GridSearchCV(estimator=model, param_grid=grid_parameters)
grid_result = grid.fit(x_train, y_train)

# summarize results
print('Best: {} using {}'.format(grid_result.best_score_, grid_result.best_params_))

In [None]:
n_estimator = grid_result.best_params_['n_estimators']
min_samples_split = grid_result.best_params_['min_samples_split']

# Use the best parameters found by the grid search above
RFC = RandomForestClassifier(random_state=2, n_estimators=n_estimator, min_samples_split=min_samples_split)
RFC.fit(x_train, y_train)
y_pred = RFC.predict(x_test)

output = pd.DataFrame({'PassengerId': df_test['PassengerId'], 'Survived': y_pred})
output = output.astype('int')
output.to_csv('predictionnocwisalone1.csv', index=False)
#print('Your file was successfully saved!')

## XGBClassifier

In [None]:
selected_features = ['Sex#', 'Pclass', 'FareBin', 'Connected_Survival', 'Minor_pred3', 'Embarked#', 'Cabin']


df_train = df_data[:len(df_train)]
df_test = df_data[len(df_train):]

x_train = df_train[selected_features]
y_train = df_train['Survived']
x_test = df_test[selected_features]

In [None]:
#xgbc = XGBClassifier(random_state=2)

#xgbc.fit(x_train, y_train)
#y_pred = xgbc.predict(x_test)

#output = pd.DataFrame({'PassengerId': df_test['PassengerId'], 'Survived': y_pred})
#output = output.astype('int')
##print('Your file was successfully saved!')
