# Titanic Exercise

In [29]:
import pandas as pd
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from math import exp
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Preparing the data

Read in the data and look at the first 10 rows.

In [2]:
df = pd.read_csv('data/train.csv')
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Check for missing values.

In [3]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We are going to focus on Pclass, Sex, Age, and Embarked:

- **Pclass:** leave as-is
- **Sex:** convert "male" to 0 and "female" to 1
- **Age:** fill in missing values using the mean
- **Embarked:** create dummy variables

In [4]:
df['Sx'] = df.Sex.map({'male':0, 'female':1})
embarked_dum = pd.get_dummies(df.Embarked, prefix='Emb').iloc[:,1:]
df = pd.concat([df, embarked_dum], axis=1)
df['Age_fill'] = df.Age.fillna(df.Age.mean())
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sx,Emb_Q,Emb_S,Age_fill
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0,0,1,22
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1,0,0,38
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,1,0,1,26
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1,0,1,35
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,0,0,1,35


Create X and y using the features we have chosen.

In [5]:
X = df[['Pclass', 'Sx', 'Age_fill', 'Emb_Q', 'Emb_S']]
y = df.Survived

## Train/Test Split

Split X and y into training and testing sets.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

## Logistic Regression

Fit a logistic regression model on the training data.

In [62]:
logreg = LogisticRegression()
logreg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Print the model's intercept.

In [52]:
logreg.intercept_

array([ 2.06313265])

Print the model's coefficients. How do we interpret them?

In [53]:
logreg.coef_
# each coefficient is the change in the log-odds of the respective feature

array([[-1.05442292,  2.48979671, -0.02842875,  0.02558043, -0.43817519]])

Predict the probability of survival for the first person in X_train using scikit-learn.

In [54]:
logreg.predict_proba(X_train[0])

array([[ 0.87676902,  0.12323098]])

Do this same calculation manually.

In [55]:
p = logreg.intercept_
for x, b in zip(X_train[0], logreg.coef_[0]):
    p += x*b
print exp(p)/(1+exp(p))


0.123230977241


Pretend this person was 10 years older, and calculate their probability of survival (manually).

In [61]:
temp = [2.0, 0.0, 10.0, 0.0, 1.0]
p10 = logreg.intercept_
for x, b in zip(temp, logreg.coef_[0]):
    p10 += x*b
print exp(p10)/(1+exp(p10))

0.316873841527


Pretend this person was a woman, and calculate their probability of survival (manually).

## Model Evaluation

Make predictions on the testing data and calculate the accuracy.

In [63]:
y_pred = logreg.predict(X_test)

Compare this to the null accuracy.

In [74]:
print metrics.accuracy_score(y_test, [0]*len(y_test))
print metrics.accuracy_score(y_test, y_pred)

0.67264573991
0.80269058296


Print the confusion matrix. Does this model tend towards specificity or sensitivity?

Calculate the specificity and the sensitivity.

Change the threshold to make the model more sensitive, then print the new confusion matrix.

Recalculate the specificity and the sensitivity.

Plot the ROC curve. How can we interpret the results?

Calculate the AUC.

## Cross-Validation

Use cross-validation to check the AUC for the current model.

Remove Embarked from the model and check AUC again using cross-validation.