# Exercise 06

## Data preparation and model evaluation exercise with Titanic data




We'll be working with a dataset from Kaggle's Titanic competition: [data](https://github.com/justmarkham/DAT8/blob/master/data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)

**Goal**: Predict survival based on passenger characteristics

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


Read the data into Pandas

In [1]:
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


# Exercise 6.1 

Impute the missing values of Age and Embarked

In [2]:
# assign the result back instead of calling fillna(inplace=True) on a column
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [3]:
titanic.Embarked.mode()

0    S
dtype: object

In [4]:
# fill the two missing ports with the mode found above
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      0
dtype: int64
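
The same median and mode imputation can be sketched on a small toy frame. The `toy` DataFrame and its values below are hypothetical stand-ins for the Titanic columns, not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'S', 'Q'],
})

# Numeric column: fill gaps with the median of the observed values
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Categorical column: fill gaps with the most frequent value;
# mode() returns a Series, so take its first entry
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy.isnull().sum())
```

Taking `mode()[0]` avoids hard-coding the fill value, which may be convenient if the data changes.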

# Exercise 6.2

Convert Sex and Embarked to numeric categorical (dummy) features

In [5]:
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,0


In [6]:
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
embarked_dummies.head()

PassengerId,Embarked_Q,Embarked_S
1,0,1
2,0,0
3,0,1
4,0,1
5,0,1


In [7]:
titanic = pd.concat([titanic, embarked_dummies], axis=1)
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female,Embarked_Q,Embarked_S
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1,0,0
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,1,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1,0,1
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,0,0,1
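
The encoding above follows the usual rule of k - 1 dummy columns for a feature with k levels. A minimal sketch on hypothetical toy data, using `drop_first=True` as a shorthand for dropping the first dummy column by hand:

```python
import pandas as pd

# Hypothetical toy data with one binary and one three-level feature
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Binary feature: a single 0/1 column is enough
toy['Sex_Female'] = toy['Sex'].map({'male': 0, 'female': 1})

# Three-level feature: k - 1 = 2 dummy columns; drop_first=True drops the
# first category in sorted order ('C'), which becomes the baseline
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', drop_first=True).astype(int)

toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

A row with `Embarked_Q == 0` and `Embarked_S == 0` then represents the baseline port, C.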


# Exercise 6.3 (2 points)

From the set of features ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], select the subset that maximizes the **accuracy** of the model using K-Fold cross-validation.

*Note: use the categorical features created above for Sex and Embarked.*

In [8]:
y = titanic['Survived']

In [9]:
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_Female', 'Embarked_Q', 'Embarked_S']

In [10]:
import math
import numpy as np

def comb(n, k):
    # number of ways to choose k items from n (np.math is gone from recent NumPy releases)
    return math.factorial(n) / (math.factorial(n - k) * math.factorial(k))

In [11]:
# count the non-empty subsets of the 8 features (sizes 1 through 8)
np.sum([comb(8, i) for i in range(1, 9)])

255.0
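
The count can be cross-checked against the closed form 2^n - 1, the number of non-empty subsets of n items. A minimal sketch using only the standard library (`math.comb` requires Python 3.8 or later):

```python
import itertools
import math

n = 8  # number of candidate features in this exercise

# Sum of binomial coefficients over subset sizes 1..n
by_size = sum(math.comb(n, k) for k in range(1, n + 1))

# Closed form: all 2**n subsets minus the empty one
closed_form = 2 ** n - 1

# Direct enumeration with itertools, as a cross-check
enumerated = sum(1 for k in range(1, n + 1)
                 for _ in itertools.combinations(range(n), k))

print(by_size, closed_form, enumerated)  # 255 255 255
```

Because the count doubles with every added feature, exhaustive search like this is only practical for small feature sets.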

In [12]:
import itertools

possible_models = []
for i in range(1,len(features)+1):
    possible_models.extend(list(itertools.combinations(features,i)))

possible_models

[('Pclass',),
 ('Age',),
 ('SibSp',),
 ('Parch',),
 ('Fare',),
 ('Sex_Female',),
 ('Embarked_Q',),
 ('Embarked_S',),
 ('Pclass', 'Age'),
 ('Pclass', 'SibSp'),
 ('Pclass', 'Parch'),
 ('Pclass', 'Fare'),
 ('Pclass', 'Sex_Female'),
 ('Pclass', 'Embarked_Q'),
 ('Pclass', 'Embarked_S'),
 ('Age', 'SibSp'),
 ('Age', 'Parch'),
 ('Age', 'Fare'),
 ('Age', 'Sex_Female'),
 ('Age', 'Embarked_Q'),
 ('Age', 'Embarked_S'),
 ('SibSp', 'Parch'),
 ('SibSp', 'Fare'),
 ('SibSp', 'Sex_Female'),
 ('SibSp', 'Embarked_Q'),
 ('SibSp', 'Embarked_S'),
 ('Parch', 'Fare'),
 ('Parch', 'Sex_Female'),
 ('Parch', 'Embarked_Q'),
 ('Parch', 'Embarked_S'),
 ('Fare', 'Sex_Female'),
 ('Fare', 'Embarked_Q'),
 ('Fare', 'Embarked_S'),
 ('Sex_Female', 'Embarked_Q'),
 ('Sex_Female', 'Embarked_S'),
 ('Embarked_Q', 'Embarked_S'),
 ('Pclass', 'Age', 'SibSp'),
 ('Pclass', 'Age', 'Parch'),
 ('Pclass', 'Age', 'Fare'),
 ('Pclass', 'Age', 'Sex_Female'),
 ('Pclass', 'Age', 'Embarked_Q'),
 ('Pclass', 'Age', 'Embarked_S'),
 ('Pclass', 'Sib

In [14]:
# sklearn.cross_validation was removed; cross_val_score now lives in sklearn.model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

y = titanic.Survived

# mean 10-fold cross-validated accuracy for every feature subset
res = pd.DataFrame(index=possible_models, columns=['accuracy'])
for i in range(len(possible_models)):
    X = titanic[list(possible_models[i])]
    logreg = LogisticRegression(C=1e9)
    res.iloc[i] = cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()

In [15]:
res.head()

,accuracy
"(Pclass,)",0.67927
"(Age,)",0.61617
"(SibSp,)",0.61617
"(Parch,)",0.60833
"(Fare,)",0.663487


In [89]:
res.sort_values('accuracy',ascending=False).head()

,accuracy
"(Pclass, Age, SibSp, Sex_Female, Embarked_S)",0.801369
"(Pclass, SibSp, Sex_Female)",0.800194
"(Pclass, SibSp, Sex_Female, Embarked_Q)",0.800194
"(Pclass, SibSp, Parch, Sex_Female, Embarked_Q)",0.799083
"(Pclass, SibSp, Parch, Sex_Female)",0.799083


# Bonus Exercise 6.4 (3 points)

Now, which set of features is best when selected by AUC?
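
AUC measures how well a model ranks positives above negatives, so it is normally computed from predicted probabilities rather than from hard class labels. A minimal sketch with hypothetical values shows how thresholding can discard ranking information:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Hypothetical predicted probabilities of the positive class
y_prob = [0.10, 0.40, 0.45, 0.90]

# Hard labels from a 0.5 threshold: [0, 0, 0, 1]
y_label = [int(p >= 0.5) for p in y_prob]

auc_prob = roc_auc_score(y_true, y_prob)    # every positive outranks every negative
auc_label = roc_auc_score(y_true, y_label)  # ties between classes lower the score

print(auc_prob, auc_label)  # 1.0 0.75
```

With a fitted scikit-learn classifier, the probabilities would typically come from `predict_proba(X)[:, 1]`.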

In [19]:
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

y = titanic.Survived

# 10 folds; shuffle=True is needed for random_state to take effect
kf = KFold(n_splits=10, shuffle=True, random_state=0)

res = pd.DataFrame(index=possible_models, columns=['auc'])
for i in range(len(possible_models)):
    X = titanic[list(possible_models[i])]

    results = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # train a logistic regression model
        logreg = LogisticRegression(C=1e9)
        logreg.fit(X_train, y_train)

        # predict survival probabilities for the testing set
        y_pred_prob = logreg.predict_proba(X_test)[:, 1]

        # calculate testing AUC from the probabilities, not the class labels
        results.append(metrics.roc_auc_score(y_test, y_pred_prob))

    # average the AUC over the 10 folds
    res.iloc[i] = np.mean(results)

res.sort_values('auc', ascending=False).head()