# Introduction
Titanic: Machine Learning from Disaster, Kaggle Competition.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this contest, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning.

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# import plotly.figure_factory as ff

pd.set_option('display.max_columns', None)

from sklearn.metrics import accuracy_score


In [2]:
input_df = pd.read_csv('../data/train.csv')
input_df.columns =input_df.columns.str.lower() 

input_df =input_df.rename(columns={'passengerid':'passenger_id'
                                  , 'pclass':'p_class'}) 

print(input_df.shape)
input_df.head()


(891, 12)


Unnamed: 0,passenger_id,survived,p_class,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [99]:
new_data = pd.read_csv('../data/test.csv')
new_data.columns =new_data.columns.str.lower() 
new_data =new_data.rename(columns={'passengerid':'passenger_id'
                                  , 'pclass':'p_class'}) 
print(new_data.shape)
new_data.head()

(418, 11)


Unnamed: 0,passenger_id,p_class,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


<h1><center>Data Definition</center></h1>

| Variable | Definition | Key |
| --- | --- | --- |
| survived | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
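
The Key column above can be applied directly when readable labels are wanted in tables or plots. The following is a minimal sketch, assuming a small hand-built frame with the `survived` and `embarked` values of the first three training rows; the names `SURVIVED_LABELS`, `PORT_NAMES`, and `sample` are illustrative and not part of the notebook.

```python
import pandas as pd

# Lookup tables copied from the Key column of the data dictionary.
SURVIVED_LABELS = {0: 'No', 1: 'Yes'}
PORT_NAMES = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}

# Hypothetical three-row frame mirroring the first rows of train.csv.
sample = pd.DataFrame({'survived': [0, 1, 1], 'embarked': ['S', 'C', 'S']})

# Series.map replaces each code with its label; unmapped codes become NaN.
sample['survived_label'] = sample['survived'].map(SURVIVED_LABELS)
sample['port'] = sample['embarked'].map(PORT_NAMES)
print(sample)
```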


In [100]:
input_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   passenger_id     891 non-null    int64  
 1   survived         891 non-null    int64  
 2   p_class          891 non-null    int64  
 3   name             891 non-null    object 
 4   sex              891 non-null    object 
 5   age              714 non-null    float64
 6   sibsp            891 non-null    int64  
 7   parch            891 non-null    int64  
 8   ticket           891 non-null    object 
 9   fare             891 non-null    float64
 10  cabin            204 non-null    object 
 11  embarked         889 non-null    object 
 12  cabin_letter     204 non-null    object 
 13  relatives        891 non-null    int64  
 14  fare_per_person  891 non-null    float64
 15  is_male          891 non-null    int64  
 16  is_alone         891 non-null    int64  
dtypes: float64(3), i

In [101]:
count_survived = input_df.groupby(['survived'], as_index=False
                                 ).agg({'passenger_id':pd.Series.nunique}
                                      ).rename(columns={'passenger_id':'count_passengers'})

count_survived.survived = np.where(count_survived.survived==1, 'Survived', 'Died')
count_survived


Unnamed: 0,survived,count_passengers
0,Died,549
1,Survived,342


In [102]:
survived_by_gender = input_df.groupby((['survived', 'sex']), as_index=False
                                 ).agg({'passenger_id':pd.Series.nunique}
                                      ).rename(columns={'passenger_id':'count_passengers'})
survived_by_gender.survived = np.where(survived_by_gender.survived==1, 'Survived', 'Died')

male_survived = survived_by_gender[survived_by_gender.sex=='male']
female_survived = survived_by_gender[survived_by_gender.sex=='female']

survived_by_gender

Unnamed: 0,survived,sex,count_passengers
0,Died,female,81
1,Died,male,468
2,Survived,female,233
3,Survived,male,109
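
The table above holds counts, while the charts that follow are titled as rates. As a rough sketch, the survival rate per sex can be derived from these four counts alone; the frame `counts` below is retyped from the output above, and `survival_rate` is an illustrative name.

```python
import pandas as pd

# Counts copied from the survived_by_gender output.
counts = pd.DataFrame({
    'survived': ['Died', 'Died', 'Survived', 'Survived'],
    'sex': ['female', 'male', 'female', 'male'],
    'count_passengers': [81, 468, 233, 109],
})

# Total passengers per sex, and survivors per sex, both indexed by sex.
totals = counts.groupby('sex')['count_passengers'].sum()
survivors = (counts[counts['survived'] == 'Survived']
             .set_index('sex')['count_passengers'])

# Rate = survivors / total; the two Series align on the sex index.
survival_rate = (survivors / totals).round(3)
print(survival_rate)
```

In this training sample roughly 74% of women survived against roughly 19% of men (233 of 314 and 109 of 577).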


In [103]:
fig = make_subplots(rows=1, cols=3
                    , subplot_titles=("Survivability Rate", "Male Survivability", "Female Survivability")
                   )

fig.add_trace(go.Bar(
    x=count_survived.survived.to_numpy()
    , y=count_survived.count_passengers.to_numpy()
), row=1, col=1)

fig.add_trace(go.Bar(
    x=male_survived.survived.to_numpy()
    , y=male_survived.count_passengers.to_numpy()
), row=1, col=2)

fig.add_trace(go.Bar(
    x=female_survived.survived.to_numpy()
    , y=female_survived.count_passengers.to_numpy()
), row=1, col=3)




fig.update_layout(title='Survival Rate by Gender'
                  , title_x =0.5
                  , showlegend=False
                  , height = 400
                  , width = 700
                 )
fig.show()

In [104]:
survived_by_age = input_df.copy()[['passenger_id', 'age']].dropna()
survived_by_age.age = (np.floor(survived_by_age.age)).astype(int)
survived_by_age.head()

Unnamed: 0,passenger_id,age
0,1,22
1,2,38
2,3,26
3,4,35
4,5,35


In [105]:
import plotly.figure_factory as ff

x1 = survived_by_age.age.to_numpy()

hist_data = [x1]

group_labels = ['Age']
colors = ['#333F44']

# Create a distplot; curve_type is not passed, so the default KDE curve is drawn
fig = ff.create_distplot(hist_data
                         , group_labels
                         , show_hist=False
                         , colors=colors)

# # Add title
fig.update_layout(title_text='Age Distribution', title_x =0.5, width=700, height=400, showlegend=False)
fig.show()

In [106]:
eda_df = input_df.copy()
eda_df['cabin_letter'] = np.where(eda_df.cabin.isna(),None, eda_df['cabin'].astype(str).str[0].str.upper())
eda_df.cabin_letter=eda_df.cabin_letter.str.replace(' ', '')
eda_df['relatives'] = eda_df.sibsp+eda_df.parch
eda_df['fare_per_person'] = eda_df.fare/(eda_df.relatives+1)
eda_df['alone'] = np.where(eda_df.relatives>0, 'Not Alone', 'Alone')
eda_df=eda_df.drop(columns={'name', 'ticket', 'cabin', 'sibsp', 'parch'})
eda_df.head()

Unnamed: 0,passenger_id,survived,p_class,sex,age,fare,embarked,cabin_letter,relatives,fare_per_person,is_male,is_alone,alone
0,1,0,3,male,22.0,7.25,S,,1,3.625,1,0,Not Alone
1,2,1,1,female,38.0,71.2833,C,C,1,35.64165,0,0,Not Alone
2,3,1,3,female,26.0,7.925,S,,0,7.925,0,1,Alone
3,4,1,1,female,35.0,53.1,S,C,1,26.55,0,0,Not Alone
4,5,0,3,male,35.0,8.05,S,,0,8.05,1,1,Alone


In [107]:
by_embarked = eda_df.groupby(['embarked', 'survived']).agg(
    {'passenger_id':pd.Series.nunique}
).reset_index().rename(columns={'passenger_id':'count_passengers'})
by_embarked.survived = np.where(by_embarked.survived==1, 'Survived', 'Died')

fig = make_subplots(rows=1, cols=4
                    , subplot_titles=("Survivability Rate", "Cherbourg", "Queenstown", 'Southampton')
                   )

fig.add_trace(go.Bar(
    x=count_survived.survived.to_numpy()
    , y=count_survived.count_passengers.to_numpy()
), row=1, col=1)

fig.add_trace(go.Bar(
    x=by_embarked[by_embarked.embarked=='C'].survived.to_numpy()
    , y=by_embarked[by_embarked.embarked=='C'].count_passengers.to_numpy()
), row=1, col=2)

fig.add_trace(go.Bar(
    x=by_embarked[by_embarked.embarked=='Q'].survived.to_numpy()
    , y=by_embarked[by_embarked.embarked=='Q'].count_passengers.to_numpy()
), row=1, col=3)


fig.add_trace(go.Bar(
    x=by_embarked[by_embarked.embarked=='S'].survived.to_numpy()
    , y=by_embarked[by_embarked.embarked=='S'].count_passengers.to_numpy()
), row=1, col=4)


fig.update_layout(title='Survival Rate by Embarkment Port'
                  , title_x =0.5
                  , showlegend=False
                  , height = 400
                  , width = 700
                 )
fig.show()

In [108]:
by_class = eda_df.groupby(['p_class', 'survived']).agg(
    {'passenger_id':pd.Series.nunique}
).reset_index().rename(columns={'passenger_id':'count_passengers'})
by_class.survived = np.where(by_class.survived==1, 'Survived', 'Died')
fig = make_subplots(rows=1, cols=4
                    , subplot_titles=("Survivability Rate", "Class 1", "Class 2", 'Class 3')
                   )

fig.add_trace(go.Bar(
    x=count_survived.survived.to_numpy()
    , y=count_survived.count_passengers.to_numpy()
), row=1, col=1)

fig.add_trace(go.Bar(
    x=by_class[by_class.p_class==1].survived.to_numpy()
    , y=by_class[by_class.p_class==1].count_passengers.to_numpy()
), row=1, col=2)

fig.add_trace(go.Bar(
    x=by_class[by_class.p_class==2].survived.to_numpy()
    , y=by_class[by_class.p_class==2].count_passengers.to_numpy()
), row=1, col=3)


fig.add_trace(go.Bar(
    x=by_class[by_class.p_class==3].survived.to_numpy()
    , y=by_class[by_class.p_class==3].count_passengers.to_numpy()
), row=1, col=4)


fig.update_layout(title='Survival Rate by Class'
                  , title_x =0.5
                  , showlegend=False
                  , height = 400
                  , width = 700
                 )
fig.show()

In [109]:
def vizSurvivalRate(eda_df, x, viz_type='Bar'):
    viz_df = eda_df.groupby(x, as_index=False).agg({'survived':'mean'}).rename(columns={'survived':'mean_survived'})
    if viz_type=='Bar':
        viz_df[x]=viz_df[x].astype('str')
        fig = px.bar(viz_df
              , x=x
              , y='mean_survived')
    elif viz_type=='Line':
        fig = px.line(viz_df
              , x=x
              , y='mean_survived')
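        fig.update_traces(mode='lines+markers')
#         fig.update_yaxes(rangemode='tozero')
    else:
        # Without this branch an unknown viz_type would fail later on an unbound `fig`.
        raise ValueError("viz_type must be 'Bar' or 'Line'")
    title_text = 'Average Survivability Rate by '+ x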
    fig.update_layout(title=title_text, title_x =0.5
                     , width=700, height=400)
    fig.show()

In [110]:
vizSurvivalRate(eda_df, 'relatives', 'Line')

In [111]:
vizSurvivalRate(eda_df, 'p_class', 'Bar')

In [112]:
vizSurvivalRate(eda_df, 'sex', 'Bar')

In [113]:
vizSurvivalRate(eda_df, 'cabin_letter', 'Bar')

In [114]:
vizSurvivalRate(eda_df, 'alone', 'Bar')

In [115]:
eda_df

Unnamed: 0,passenger_id,survived,p_class,sex,age,fare,embarked,cabin_letter,relatives,fare_per_person,is_male,is_alone,alone
0,1,0,3,male,22.0,7.2500,S,,1,3.62500,1,0,Not Alone
1,2,1,1,female,38.0,71.2833,C,C,1,35.64165,0,0,Not Alone
2,3,1,3,female,26.0,7.9250,S,,0,7.92500,0,1,Alone
3,4,1,1,female,35.0,53.1000,S,C,1,26.55000,0,0,Not Alone
4,5,0,3,male,35.0,8.0500,S,,0,8.05000,1,1,Alone
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,27.0,13.0000,S,,0,13.00000,1,1,Alone
887,888,1,1,female,19.0,30.0000,S,B,0,30.00000,0,1,Alone
888,889,0,3,female,,23.4500,S,,3,5.86250,0,0,Not Alone
889,890,1,1,male,26.0,30.0000,C,C,0,30.00000,1,1,Alone


# Preprocessing

In [116]:
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.filterwarnings("ignore")

In [117]:
def preprocessed_data(eda_df):
    # Note: the assignments below add columns to the DataFrame that is passed in,
    # which is why input_df.info() lists 17 columns once this cell has been run.
    eda_df['cabin_letter'] = np.where(eda_df.cabin.isna(),None, eda_df['cabin'].astype(str).str[0].str.upper())
    eda_df.cabin_letter=eda_df.cabin_letter.str.replace(' ', '')
    eda_df['relatives'] = eda_df.sibsp+eda_df.parch
    # Fill missing fares with the mean fare of the same file.
    eda_df['fare'] = np.where(eda_df.fare.isna(), eda_df.fare.mean()
                              ,eda_df.fare)

    eda_df['fare_per_person'] = eda_df.fare/(eda_df.relatives+1)
    eda_df['is_male']= np.where(eda_df.sex=='male', 1,0)
    eda_df['is_alone']= np.where(eda_df.relatives==0, 1,0)
    eda_df=eda_df.drop(columns=['name', 'ticket', 'cabin', 'sibsp', 'parch', 'sex'])
    # np.NaN was removed in NumPy 2.0; fillna gives the same mean imputation.
    eda_df['age'] = eda_df.age.fillna(eda_df.age.mean())
#     eda_df = eda_df.replace({np.nan: None})

    return eda_df

train = preprocessed_data(input_df).drop(columns={'passenger_id'})
test = preprocessed_data(new_data).drop(columns={'passenger_id'})
train.head()

Unnamed: 0,survived,p_class,age,fare,embarked,cabin_letter,relatives,fare_per_person,is_male,is_alone
0,0,3,22.0,7.25,S,,1,3.625,1,0
1,1,1,38.0,71.2833,C,C,1,35.64165,0,0
2,1,3,26.0,7.925,S,,0,7.925,0,1
3,1,1,35.0,53.1,S,C,1,26.55,0,0
4,0,3,35.0,8.05,S,,0,8.05,1,1
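
The cells that follow call `encodeColumn`, whose definition is not included in this export. The sketch below is a hypothetical stand-in, not the author's function: it one-hot encodes a single column with `pd.get_dummies` and puts missing values into a literal `'missing'` category. The recorded outputs name their missing-value columns `embarked_nan` and `cabin_letter_None`, so the original evidently differed in detail.

```python
import pandas as pd

def encodeColumn(df, col):
    """Hypothetical stand-in: replace `col` with one 0.0/1.0 column per category."""
    # Treat None/NaN as its own category so no row is encoded as all zeros.
    values = df[col].fillna('missing')
    dummies = pd.get_dummies(values, prefix=col, dtype=float)
    # Drop the original column and append the indicator columns.
    return pd.concat([df.drop(columns=[col]), dummies], axis=1)

# Small illustrative frame; not the Titanic data.
demo = pd.DataFrame({'age': [22.0, 38.0, 26.0], 'embarked': ['S', 'C', None]})
encoded = encodeColumn(demo, 'embarked')
print(encoded)
```

Encoding the training and test frames separately, as the next two cells do, yields only the categories present in each file, so the two results can end up with different columns.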


In [119]:
categorical_cols = list(train.select_dtypes(include=['object']).columns)
for col in categorical_cols:
    print(col)
    train = encodeColumn(train, col)
    print(train.shape)

embarked
(891, 13)
cabin_letter
(891, 21)


In [120]:
categorical_cols = list(test.select_dtypes(include=['object']).columns)
for col in categorical_cols:
    print(col)
    test = encodeColumn(test, col)
    print(test.shape)

embarked
(418, 11)
cabin_letter
(418, 18)


In [121]:
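X_train = train.drop(columns=['survived', 'fare'])
# A 1-D Series is the target shape scikit-learn expects; a one-column DataFrame triggers a conversion warning.
y_train = train['survived']
X_test = test.drop(columns=['fare'])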
print(X_train.shape, X_test.shape)

(891, 19) (418, 17)


#### Logistic Regression

In [124]:
X_train

Unnamed: 0,p_class,age,relatives,fare_per_person,is_male,is_alone,embarked_C,embarked_Q,embarked_S,embarked_nan,cabin_letter_A,cabin_letter_B,cabin_letter_C,cabin_letter_D,cabin_letter_E,cabin_letter_F,cabin_letter_G,cabin_letter_T,cabin_letter_None
0,3,22.000000,1,3.62500,1,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1,38.000000,1,35.64165,0,0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,26.000000,0,7.92500,0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1,35.000000,1,26.55000,0,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3,35.000000,0,8.05000,1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,2,27.000000,0,13.00000,1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
887,1,19.000000,0,30.00000,0,1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
888,3,29.699118,3,5.86250,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
889,1,26.000000,0,30.00000,1,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [126]:
X_train.columns

Index(['p_class', 'age', 'relatives', 'fare_per_person', 'is_male', 'is_alone',
       'embarked_C', 'embarked_Q', 'embarked_S', 'embarked_nan',
       'cabin_letter_A', 'cabin_letter_B', 'cabin_letter_C', 'cabin_letter_D',
       'cabin_letter_E', 'cabin_letter_F', 'cabin_letter_G', 'cabin_letter_T',
       'cabin_letter_None'],
      dtype='object')

In [125]:
X_test.columns

Index(['p_class', 'age', 'relatives', 'fare_per_person', 'is_male', 'is_alone',
       'embarked_C', 'embarked_Q', 'embarked_S', 'cabin_letter_A',
       'cabin_letter_B', 'cabin_letter_C', 'cabin_letter_D', 'cabin_letter_E',
       'cabin_letter_F', 'cabin_letter_G', 'cabin_letter_None'],
      dtype='object')

In [122]:
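logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Fails with the ValueError below until X_test has the same 19 columns as X_train.
y_pred = logreg.predict(X_test)

# test.csv has no survived column, so accuracy is measured on the training rows.
logreg_acc = accuracy_score(y_train, logreg.predict(X_train))
print(logreg_acc)
print(round(logreg.score(X_train, y_train) * 100, 2))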

ValueError: X has 17 features, but LogisticRegression is expecting 19 features as input.
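
The error follows from the column listings above: `X_train` contains `embarked_nan` and `cabin_letter_T`, which never occur in the test file and so are absent from `X_test`. One possible fix, sketched here on toy frames rather than the real data, is to reindex the test matrix to the training columns and fill the missing indicators with zero. The names `train_cols`, `X_test_demo`, and `aligned` are illustrative.

```python
import pandas as pd

# Column order the fitted model expects (a shortened stand-in for X_train.columns).
train_cols = ['p_class', 'embarked_S', 'embarked_nan', 'cabin_letter_T']

# Toy test matrix that lacks two of the training indicator columns.
X_test_demo = pd.DataFrame({'p_class': [3, 1], 'embarked_S': [1.0, 0.0]})

# reindex adds the absent columns (filled with 0.0), drops any extras,
# and puts the columns in the training order.
aligned = X_test_demo.reindex(columns=train_cols, fill_value=0.0)
print(aligned)
```

In the notebook this would correspond to something like `X_test.reindex(columns=X_train.columns, fill_value=0)`, applied before any call to `predict`.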

#### Random Forest

In [54]:
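random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)

# Score on the training rows: test.csv has no labels, and y_train (891 rows)
# cannot be compared with predictions for X_test (418 rows).
y_pred = random_forest.predict(X_train)

rf_acc = accuracy_score(y_train, y_pred)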

#### K Nearest Neighbors


In [55]:
# KNN
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
# Training-set predictions: y_train has no counterpart for X_test.
y_pred = knn.predict(X_train)

knn_acc = accuracy_score(y_train, y_pred)

#### SVC


In [38]:
svc = SVC()
svc.fit(X_train, y_train)
# Training-set predictions: y_train has no counterpart for X_test.
y_pred = svc.predict(X_train)
accuracy_score(y_train, y_pred)
# acc_svc = round(svc.score(X_train, Y_train) * 100, 2)


0.6891133557800224

#### XGBoost

In [40]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
# Training-set predictions: y_train has no counterpart for X_test.
y_pred = xgb.predict(X_train)
accuracy_score(y_train, y_pred)

0.9674523007856342

#### Gaussian Naive Bayes


In [43]:
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
# Training-set predictions: y_train has no counterpart for X_test.
y_pred = gaussian.predict(X_train)
accuracy_score(y_train, y_pred)

0.7025813692480359

#### SGD (stochastic gradient descent)


In [45]:
sgd = SGDClassifier()
sgd.fit(X_train, y_train)
# Training-set predictions: y_train has no counterpart for X_test.
y_pred = sgd.predict(X_train)
accuracy_score(y_train, y_pred)

0.6857463524130191

#### Decision Tree

In [59]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_train)
accuracy_score(y_train, y_pred)

0.9865319865319865

#### CatBoost

In [64]:
from catboost import CatBoostClassifier
catboost = CatBoostClassifier()
catboost.fit(X_train, y_train,verbose=False)
y_pred = catboost.predict(X_train)
accuracy_score(y_train, y_pred)

0.9023569023569024
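
The scores recorded in this section are computed on the same rows the models were fitted on, so they mostly show how closely each model can reproduce its training data. A hedged sketch of one common alternative, k-fold cross-validation, is given below. It uses synthetic data from `make_classification` instead of the Titanic features, and the names `X_demo`, `y_demo`, `train_acc`, and `cv_acc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the rows used for fitting (what the cells above report).
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 5 folds, each scored on rows held out from fitting.
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='accuracy').mean()

print(f"training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

An unpruned decision tree usually scores close to 1.0 on its own training rows, which is why the held-out figure is the more useful one for comparing models.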

In [None]:
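from sklearn.linear_model import Ridge
# Ridge is a regressor, so y_pred holds continuous scores rather than 0/1 labels.
ridgereg = Ridge(alpha=0.01)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_train)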