# Titanic - Predicting Survival Exploratory Data Analysis
*Katarzyna O Rachuta, last updated 2017-07-09*

## Background
RMS Titanic sank on 15 April 1912, killing 1502 of the 2224 people on board. Some groups (such as upper-middle-class passengers, women and children) were more likely to survive than others. This analysis explores the dataset to provide background for further modeling and prediction of survival.


## Method
Exploring factors such as:<br>
a) Passenger class (Pclass), which is also a proxy for socio-economic class<br>
b) Sex <br>
c) Age <br>
d) Point of embarkment

## References
[Kaggle](https://www.kaggle.com/c/titanic), accessed 2016-01-04

In [159]:
# Importing useful modules
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
# Visuals
from bokeh.plotting import output_notebook
output_notebook()
from bokeh.charts import BoxPlot, show, defaults, Histogram, Bar
from bokeh.layouts import row, column



In [2]:
# Reading the train data and displaying the first 5 rows.
train = pd.read_csv("train.csv")
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [10]:
# Viewing the data type of each column.
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

# Exploratory Visuals
Number of passengers who survived or died, broken down by category.
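
The charts in this section plot passenger counts. As a complement, survival proportions per category can be sketched with a pandas `groupby`. This is a minimal illustration rather than part of the original analysis: `sample` holds only the five rows displayed by `train.head()`, and the helper `survival_rate` is a hypothetical name introduced here, so the resulting rates are not representative of the full dataset.

```python
import pandas as pd

# First five rows of train.csv as displayed by train.head(); a stand-in for the full file.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})


def survival_rate(df, column):
    """Return the share of survivors within each level of `column` (hypothetical helper)."""
    # Survived is coded 0/1, so its mean within a group is the survival proportion.
    return df.groupby(column)["Survived"].mean()


rate_by_class = survival_rate(sample, "Pclass")
rate_by_sex = survival_rate(sample, "Sex")
print(rate_by_class)
print(rate_by_sex)
```

With the full data, the same call would be `survival_rate(train, 'Pclass')`.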

## Class by survival

In [69]:
# Creating a smaller DataFrame with only the columns needed for this chart
survival = train[['Survived', 'Pclass']]

# Without a `values` argument, Bar counts the rows in each group instead of summing Pclass
p = Bar(survival, label='Pclass', group='Survived', plot_width=450, plot_height=450,
        tools=False)
p.title.text = 'Most casualties were from 3rd class'
p.yaxis.axis_label = 'Number of passengers'
p.xaxis.axis_label = 'Class by survival'

# Removing grid lines
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

show(p)

## Sex by survival

In [70]:
p = Bar(train, label='Sex', group='Survived', plot_width=450, plot_height=450,
        tools=False)
p.title.text = 'Most casualties were male'
p.yaxis.axis_label = 'Number of passengers'
p.xaxis.axis_label = 'Sex by survival'

# Removing grid lines
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

show(p)

## Survival by point of embarkment

In [79]:
embarked_df = train[['Embarked', 'Survived']].dropna()

p = Bar(embarked_df, label='Embarked', group='Survived', plot_width=450, plot_height=450,
        tools=False)
p.title.text = 'Most casualties embarked in Southampton'
p.yaxis.axis_label = 'Number of passengers'
p.xaxis.axis_label = 'Embarkation point by survival'

# Removing grid lines
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

show(p)

## Age by survival

In [95]:
age_df = train[['Age', 'Survived']].dropna()

survived = age_df[age_df['Survived'] == 1]
deceased = age_df[age_df['Survived'] == 0 ]

p1 = Histogram(survived, values='Age', plot_width=450, plot_height=450, fill_alpha=0.3, outline_line_alpha=0, 
              line_color='white',  notebook=True, tools=False, color='grey')

p1.yaxis.axis_label = 'Passenger count'
p1.xaxis.axis_label = 'Age'
p1.title.text = 'Mean/median age of a survived passenger is 28'

# removing grid lines
p1.xgrid.grid_line_color = None
p1.ygrid.grid_line_color = None

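for r in p1.renderers:
    try:
        r.glyph.line_color = None
        r.glyph.line_alpha = r.glyph.fill_alpha
    except AttributeError:
        # Not every renderer has a glyph with line and fill properties
        pass

# Use the Age column only; each vertical marker needs one x value per entry in y
x1 = [survived['Age'].median()] * 2
x2 = [survived['Age'].mean()] * 2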
y = (0, 6000)

p1.line(x1, y, legend='Median', line_width=1, color='#F8A800')
p1.line(x2, y, color='red', legend='Mean', line_width=1)

p1.legend.location = 'top_right'


p2 = Histogram(deceased, values='Age', plot_width=450, plot_height=450, fill_alpha=0.3, outline_line_alpha=0, 
              line_color='white',  notebook=True, tools=False, color='grey')

p2.yaxis.axis_label = 'Passenger count'
p2.xaxis.axis_label = 'Age'
p2.title.text = 'Mean deceased passenger age is 30'

# removing grid lines
p2.xgrid.grid_line_color = None
p2.ygrid.grid_line_color = None

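for r in p2.renderers:
    try:
        r.glyph.line_color = None
        r.glyph.line_alpha = r.glyph.fill_alpha
    except AttributeError:
        # Not every renderer has a glyph with line and fill properties
        pass

# Use the Age column only; each vertical marker needs one x value per entry in y
x3 = [deceased['Age'].median()] * 2
x4 = [deceased['Age'].mean()] * 2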
y = (0, 6000)

p2.line(x3, y, legend='Median', line_width=1, color='#F8A800')
p2.line(x4, y, color='red', legend='Mean', line_width=1)

p2.legend.location = 'top_right'


show(row(p1, p2))
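
The mean and median ages drawn on the histograms can also be tabulated directly. The sketch below is an illustration only: `ages` is an assumed stand-in built from the five rows shown by `train.head()`, so its values differ from the full-data figures quoted in the chart titles.

```python
import pandas as pd

# Age and Survived for the five rows displayed by train.head(); not the full dataset.
ages = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Survived": [0, 1, 1, 1, 0],
})

# Drop rows with a missing age, then summarise Age within each survival group.
age_summary = ages.dropna().groupby("Survived")["Age"].agg(["mean", "median"])
print(age_summary)
```

On the full data, replacing `ages` with `train[['Age', 'Survived']]` gives the per-group mean and median in one table.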

## Logistic Regression - Classification

Classification will be based on each passenger's class, sex, age and embarkment point.

### Logistic Regression using sklearn

#### Step 1: Dummifying categorical variables.
Categorical variables include Pclass, sex and embarkment point.

In [107]:
# Getting dummies

Pclass_dummies = pd.get_dummies(train['Pclass'])
sex_dummies = pd.get_dummies(train['Sex'])
embarked_dummies = pd.get_dummies(train['Embarked'])

# Dropping one dummy variable from each categorical variable to avoid collinearity.

Pclass_dummies = Pclass_dummies.drop(1, axis=1)
sex_dummies = sex_dummies.drop('male', axis=1)
embarked_dummies = embarked_dummies.drop('S', axis=1)

# Joining all the dummy variables back together.

dummies = Pclass_dummies.join(sex_dummies)
dummies = dummies.join(embarked_dummies)

# Making sure I've joined the right variables.
dummies.head()

| | 2 | 3 | female | C | Q |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 |
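
The dummy encoding and the dropping of one reference level per variable can also be done in a single call. The following is a sketch under assumptions, not the notebook's own method: `sample` contains only the five rows shown by `train.head()`, and the category orderings are chosen here so that `drop_first=True` removes the same reference levels as above (first class, male, Southampton). The resulting columns carry a prefix (for example `Pclass_3`) rather than the bare names used in this notebook.

```python
import pandas as pd

# Categorical columns for the five rows displayed by train.head().
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Put the intended reference level first in each category list,
# because drop_first=True drops the first category of every variable.
sample["Pclass"] = pd.Categorical(sample["Pclass"], categories=[1, 2, 3])
sample["Sex"] = pd.Categorical(sample["Sex"], categories=["male", "female"])
sample["Embarked"] = pd.Categorical(sample["Embarked"], categories=["S", "C", "Q"])

encoded = pd.get_dummies(sample, columns=["Pclass", "Sex", "Embarked"], drop_first=True)
print(list(encoded.columns))
```

Declaring the categories explicitly also keeps a column for levels that happen to be absent from a subset of the data, such as `Q` in this five-row sample.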


In [141]:
# Adding the age variable to the dummy.

X_multi = dummies.join(train['Age'])
X_multi.head()

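| | 2 | 3 | female | C | Q | Age |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 22.0 |
| 1 | 0 | 0 | 1 | 1 | 0 | 38.0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 26.0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 35.0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 35.0 |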


In [142]:
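# Filling missing ages with 0 so the model can be fitted, then checking the column types.
# Note: 0 is a crude placeholder that treats passengers of unknown age as newborns.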

X_multi['Age'] = X_multi['Age'].fillna(0)
X_multi.dtypes

2           uint8
3           uint8
female      uint8
C           uint8
Q           uint8
Age       float64
dtype: object
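
Filling missing ages with 0 is one option among several. A common alternative, sketched here as an assumption rather than as what this notebook does, is to fill with the median age and keep a separate indicator for rows where the age was missing. The series `ages` and its two missing entries are hypothetical values made up for the illustration; applying this to the real data would change the results reported below.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the two NaN entries are invented for this example.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="Age")

# 1 where the age was originally missing, 0 otherwise.
age_missing = ages.isna().astype(int)

# Replace missing ages with the median of the observed ages.
age_filled = ages.fillna(ages.median())

print(age_filled.tolist())
print(age_missing.tolist())
```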

In [143]:
Y = train['Survived']
Y = np.ravel(Y)

In [144]:
log_reg = LogisticRegression()

In [148]:
# Fitting the logistic regression model to my data.

log_reg.fit(X_multi, Y)

# Printing the accuracy score for my model.

log_reg.score(X_multi, Y)

0.78900112233445563

The model's accuracy is ~79%. This isn't terrible, but it is measured on the same data the model was fitted to, so it is likely to be optimistic.

In [151]:
# Baseline for comparison: the accuracy of always predicting "did not survive".
1 - Y.mean()

0.61616161616161613
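
The value above is the share of passengers who did not survive, which is the accuracy of a model that always predicts the majority class. A minimal sketch of that comparison follows; `y` is an assumed stand-in holding the `Survived` values of the five rows shown by `train.head()`, and `baseline_accuracy` is a name introduced here. In this small sample the majority class happens to be "survived", unlike in the full data.

```python
import numpy as np

# Survived values for the five rows displayed by train.head(); not the full dataset.
y = np.array([0, 1, 1, 1, 0])

# Share of positive labels, and the accuracy of always predicting the majority class.
survival_rate = y.mean()
baseline_accuracy = max(survival_rate, 1 - survival_rate)
print(baseline_accuracy)
```

A classifier is only informative to the extent that its accuracy exceeds this baseline.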

In [157]:
# Viewing the coefficients for each variable.

# list() is needed on Python 3, where zip returns an iterator
coeff_df = pd.DataFrame(list(zip(X_multi.columns, np.transpose(log_reg.coef_))))
coeff_df

| | 0 | 1 |
|---|---|---|
| 0 | 2 | [-0.664318000581] |
| 1 | 3 | [-1.92427966605] |
| 2 | female | [2.50448202773] |
| 3 | C | [0.524951027223] |
| 4 | Q | [0.26239071826] |
| 5 | Age | [-0.0134319512334] |


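From the coefficients above, being female has the largest positive coefficient, so it is the strongest predictor of survival in this model, while being in third class has the largest negative coefficient and is associated with not surviving. The Age coefficient is per year of age, so its size is not directly comparable with the dummy-variable coefficients.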
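
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below uses the coefficient values printed above; the dictionary `coefficients` and its labels are assumptions introduced for the illustration.

```python
import math

# Coefficients copied from the sklearn output above (labels are illustrative).
coefficients = {
    "Pclass 2": -0.664318000581,
    "Pclass 3": -1.92427966605,
    "female": 2.50448202773,
    "Embarked C": 0.524951027223,
    "Embarked Q": 0.26239071826,
    "Age": -0.0134319512334,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# relative to the reference level (or per extra year, for Age).
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

An odds ratio above 1 means higher odds of survival than the reference level (first class, male, Southampton), and a ratio below 1 means lower odds.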

### Logistic regression using statsmodels

In [164]:
data2 = X_multi.join(train['Survived'])
data2 = sm.add_constant(data2, prepend=False)
explanatory_cols = [2, 3, 'female', 'C', 'Q', 'Age', 'const']
full_logit_model = sm.GLM(data2['Survived'], 
                          data2[explanatory_cols], 
                          family=sm.families.Binomial())
result = full_logit_model.fit()
# Printing the summary of the results.
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 891.0 |
| Model: | GLM | Df Residuals: | 884.0 |
| Model Family: | Binomial | Df Model: | 6.0 |
| Link Function: | logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -405.96 |
| Date: | Sun, 09 Jul 2017 | Deviance: | 811.91 |
| Time: | 21:19:48 | Pearson chi2: | 931.0 |
| No. Iterations: | 5 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| 2 | -0.7730 | 0.256 | -3.016 | 0.003 | -1.275 | -0.271 |
| 3 | -2.0749 | 0.243 | -8.524 | 0.000 | -2.552 | -1.598 |
| female | 2.6123 | 0.186 | 14.018 | 0.000 | 2.247 | 2.978 |
| C | 0.5320 | 0.230 | 2.313 | 0.021 | 0.081 | 0.983 |
| Q | 0.2994 | 0.323 | 0.927 | 0.354 | -0.334 | 0.932 |
| Age | -0.0143 | 0.005 | -2.643 | 0.008 | -0.025 | -0.004 |
| const | -0.0548 | 0.267 | -0.205 | 0.837 | -0.578 | 0.469 |
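
The z statistics and 95% confidence intervals in the summary can be reproduced approximately from the coefficient and standard error columns. The sketch below assumes the usual Wald construction (z = coef / std err, interval = coef ± 1.96 × std err) and uses the rounded values printed above, so small differences from the reported figures are expected. The helper `wald_summary` is a hypothetical name introduced for the illustration.

```python
# (coef, std err) pairs copied from the summary table above.
estimates = {
    "female": (2.6123, 0.186),
    "Q": (0.2994, 0.323),
}


def wald_summary(coef, std_err, z_crit=1.96):
    """Return the Wald z statistic and an approximate 95% confidence interval."""
    z = coef / std_err
    lower = coef - z_crit * std_err
    upper = coef + z_crit * std_err
    return z, lower, upper


results = {name: wald_summary(coef, se) for name, (coef, se) in estimates.items()}

for name, (z, lower, upper) in results.items():
    print(f"{name}: z={z:.2f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

The interval for `Q` spans zero, consistent with its p-value of 0.354: in this model, embarking at Queenstown is not distinguishable from embarking at Southampton.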
