# Predicting Survival in the Titanic Disaster

## August-September 2017, by Jude Moon
Python3


# Project Overview

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. 

In this project, I will analyze what sorts of people were likely to survive. In particular, I will apply the tools of machine learning to predict which passengers survived the tragedy.

This document keeps notes as I work through the project and shows my thought process and approach to solving the problem. It consists of:

Part1. Data Exploration
- Missing Value (NaN) Investigation
- Outliers Investigation
- Summary of Data Exploration

Part2. Feature Engineering
- Creating New Features
- Converting to Numeric Variables
- Scaling Features
- Feature Exploration
- Selecting Features

Part3. Algorithm Search
- Algorithm Exploration
- Validation Methods
- Building Pipelines


***

# Part1. Data Exploration


In [1]:
%pylab inline
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import os
import re
import sys
import pprint
import operator
import scipy.stats
from time import time

Populating the interactive namespace from numpy and matplotlib


In [2]:
# load data set
titanic_df = pd.read_csv("train.csv")
#titanic_df = pd.read_csv("../input/train.csv")

In [3]:
# the first 5 rows
titanic_df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
# data type of each column
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [5]:
# check any numpy NaN by column
titanic_df.isnull().sum(axis=0) # sum by column

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# statistics of central tendency and variability
titanic_df.describe()

|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|-------|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean  | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std   | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min   | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25%   | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50%   | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75%   | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max   | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [7]:
# the numbers of survived and dead
titanic_df.groupby(titanic_df['Survived']).count()['PassengerId']

Survived
0    549
1    342
Name: PassengerId, dtype: int64

From this first look I learned the following about the passengers:
- the training data set contains 891 passengers
- the survival rate is about 38%
- Pclass is stored as an integer, but it is actually a categorical variable
- since the median of Pclass is 3, at least half of the passengers traveled in 3rd class
- the average age is 29.7, with 177 missing values
- SibSp and Parch are a little tricky because they contain a lot of zeros
- the mean fare is 32 units
- Cabin has a large number of missing values (687)

## Missing Value (NaN) Investigation

### Would NaN introduce bias to 'Cabin'?

The 'Cabin' column has 687 missing values, which is 77% of the total. This might introduce bias, so I want to compare the average survival rate of passengers with a missing 'Cabin' value against that of passengers with a known 'Cabin' value.

In [8]:
# survival rate by group with missing value on Cabin; True means missing value
titanic_df.groupby(titanic_df['Cabin'].isnull())['Survived'].mean()

Cabin
False    0.666667
True     0.299854
Name: Survived, dtype: float64

The survival rate of the group with a missing value (True, 0.30) is lower than the overall average (0.38), while that of the group with a value (False, 0.67) is higher. In other words, missingness in 'Cabin' is strongly associated with the target: passengers without a recorded cabin survived less often than those with one. If a supervised classifier used 'Cabin' as a feature, it could therefore treat "NaN" as a clue that a person did not survive. So I have to be careful when using 'Cabin' as a feature. I am not going to deal with NaN in 'Cabin' for now: the column is text rather than numeric, so I can simply convert NaN to the string 'NaN' if I need to.

### What about missing value on column 'Age'? How to deal with missing values?

In [9]:
# survival rate by group with missing value on Age; True means missing value
titanic_df.groupby(titanic_df['Age'].isnull())['Survived'].mean()

Age
False    0.406162
True     0.293785
Name: Survived, dtype: float64

About 20% of the data is missing for 'Age'. There is a possibility of NaN-driven bias, but it is not as strong as the bias for 'Cabin'. I chose to fill the missing values with the sample median.

In [10]:
# replace NaN with the median of Age and create new column called age
titanic_df['age'] = titanic_df['Age'].fillna(titanic_df["Age"].median())

titanic_df['age'].describe()

count    891.000000
mean      29.361582
std       13.019697
min        0.420000
25%       22.000000
50%       28.000000
75%       35.000000
max       80.000000
Name: age, dtype: float64
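
The output above shows that the median stays at 28 while the standard deviation drops from 14.53 ('Age') to 13.02 ('age'). A minimal sketch of why this happens, using hypothetical toy ages rather than the Titanic data: every imputed value sits exactly at the median, so the spread shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages; NaN marks a missing value
ages = pd.Series([2.0, 20.0, 28.0, 36.0, 70.0, np.nan, np.nan, np.nan])

# Fill missing entries with the median of the known values (28.0 here)
filled = ages.fillna(ages.median())

# The median is unchanged, but the standard deviation gets smaller
print("median before/after:", ages.median(), filled.median())
print("std before/after:", ages.std(), filled.std())
```

This is worth keeping in mind later: median imputation is simple, but it makes the feature look less variable than it really is.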

### What about missing value on column 'Embarked'? How to deal with missing values?
Only two observations are missing for 'Embarked'. I could ignore them or replace them with the most frequent port.

In [11]:
# survival rate by group with missing value on Embarked; True means missing value
titanic_df.groupby(titanic_df['Embarked'].isnull())['Survived'].mean()

Embarked
False    0.382452
True     1.000000
Name: Survived, dtype: float64

In [12]:
titanic_df.groupby(titanic_df['Embarked']).count()['PassengerId']

Embarked
C    168
Q     77
S    644
Name: PassengerId, dtype: int64

In [13]:
# replace NaN with 'C' and create a new column called embarked
# note: the counts above show 'S' (644), not 'C' (168), as the most frequent port
titanic_df['embarked'] = titanic_df['Embarked'].fillna('C')

titanic_df['embarked'].isnull().sum()

0
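
The idea of filling with the most frequent port can also be written so that the port is looked up rather than typed in by hand. The following is only a sketch on a hypothetical toy Series (the values are made up); `Series.mode()` ignores NaN by default and returns the most frequent value first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ports with two missing entries
ports = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

# Look up the most frequent port instead of hard-coding it
most_frequent = ports.mode()[0]
ports_filled = ports.fillna(most_frequent)

print(most_frequent)
print(ports_filled.value_counts())
```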

## Outliers Investigation

### Is there an observation who has a lot of NaN?

In [14]:
# check any numpy NaN by row
#titanic_df.isnull().sum(axis=1) # sum by row
titanic_df.isnull().sum(axis=1).max() # find the max

2

No, no observation has more than two missing values, so we can keep all the observations.

### Are there any outliers in the dataset?

In [15]:
# I define outliers here as values above the 99th percentile
# get lists of row indices above the 99th percentile for each numeric column
highest = {}
for column in titanic_df.columns:
    if titanic_df[column].dtypes != "object": # exclude string data typed columns
        highest[column]=[]
        q = titanic_df[column].quantile(0.99)
        highest[column] = titanic_df[titanic_df[column] > q].index.tolist()
    
pprint.pprint(highest)

{'Age': [33, 96, 116, 493, 630, 672, 745, 851],
 'Fare': [27, 88, 258, 311, 341, 438, 679, 737, 742],
 'Parch': [13, 25, 610, 638, 678, 885],
 'PassengerId': [882, 883, 884, 885, 886, 887, 888, 889, 890],
 'Pclass': [],
 'SibSp': [159, 180, 201, 324, 792, 846, 863],
 'Survived': [],
 'age': [33, 96, 116, 493, 630, 672, 745, 851]}


In [16]:
# delete 'PassengerId' from dictionary highest
highest.pop('PassengerId', 0)

[882, 883, 884, 885, 886, 887, 888, 889, 890]

### What are the outliers repeatedly shown among the features?

In [17]:
# summarize the previous dictionary, highest
# create a dictionary of outliers and the frequency of being outlier
highest_count = {}
for feature in highest:
    for person in highest[feature]:
        if person not in highest_count:
            highest_count[person] = 1
        else:
            highest_count[person] += 1
             
highest_count

{13: 1,
 25: 1,
 27: 1,
 33: 2,
 88: 1,
 96: 2,
 116: 2,
 159: 1,
 180: 1,
 201: 1,
 258: 1,
 311: 1,
 324: 1,
 341: 1,
 438: 1,
 493: 2,
 610: 1,
 630: 2,
 638: 1,
 672: 2,
 678: 1,
 679: 1,
 737: 1,
 742: 1,
 745: 2,
 792: 1,
 846: 1,
 851: 2,
 863: 1,
 885: 1}
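
The counting loop above can also be expressed with `collections.Counter` from the standard library. This is a sketch on a small hypothetical dictionary (`highest_demo` is made up for illustration and is not the `highest` dictionary from this notebook).

```python
from collections import Counter

# Hypothetical stand-in for the `highest` dictionary: feature -> row indices
highest_demo = {'Fare': [27, 88], 'SibSp': [159, 88], 'Parch': [13]}

# Count how often each row index appears across all features
counts = Counter(idx for rows in highest_demo.values() for idx in rows)

# Keep only the rows flagged in more than one feature
repeated = {idx: n for idx, n in counts.items() if n > 1}
print(repeated)
```

Filtering for counts greater than 1 makes the repeated outliers visible without scanning the whole dictionary by eye.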

In [18]:
# This time, I define outliers as values below the 1st percentile
# get lists of row indices below the 1st percentile for each numeric column
lowest = {}
for column in titanic_df.columns:
    if titanic_df[column].dtypes != "object": # exclude string data typed columns
        lowest[column]=[]
        q = titanic_df[column].quantile(0.01)
        lowest[column] = titanic_df[titanic_df[column] < q].index.tolist()

# delete 'PassengerId' from dictionary lowest
lowest.pop('PassengerId', 0)

pprint.pprint(lowest)

{'Age': [78, 305, 469, 644, 755, 803, 831],
 'Fare': [],
 'Parch': [],
 'Pclass': [],
 'SibSp': [],
 'Survived': [],
 'age': [78, 305, 469, 644, 755, 803, 831]}


In [19]:
for person in lowest['age']:
    if person not in highest_count:
        highest_count[person] = 1
    else:
        highest_count[person] += 1
 
highest_count

{13: 1,
 25: 1,
 27: 1,
 33: 2,
 78: 1,
 88: 1,
 96: 2,
 116: 2,
 159: 1,
 180: 1,
 201: 1,
 258: 1,
 305: 1,
 311: 1,
 324: 1,
 341: 1,
 438: 1,
 469: 1,
 493: 2,
 610: 1,
 630: 2,
 638: 1,
 644: 1,
 672: 2,
 678: 1,
 679: 1,
 737: 1,
 742: 1,
 745: 2,
 755: 1,
 792: 1,
 803: 1,
 831: 1,
 846: 1,
 851: 2,
 863: 1,
 885: 1}

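Overall, no passenger shows up as an outlier in more than one distinct feature. The only indices with a count of 2 are the ones listed under both 'Age' and 'age', which hold the same variable.

We can focus on age and Fare for continuous values and on Parch and SibSp for integer values to further investigate outliers.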

### Take a look at outliers

In [20]:
# fare above 99% percentile
titanic_df.loc[highest['Fare'],['Fare', 'Survived']]

|     | Fare     | Survived |
|-----|----------|----------|
| 27  | 263.0    | 0        |
| 88  | 263.0    | 1        |
| 258 | 512.3292 | 1        |
| 311 | 262.375  | 1        |
| 341 | 263.0    | 1        |
| 438 | 263.0    | 0        |
| 679 | 512.3292 | 1        |
| 737 | 512.3292 | 1        |
| 742 | 262.375  | 1        |


The mean fare is 32 units, but there are outliers who paid 262, 263, or 512 units, which is 8 to 16 times the mean fare. I am going to keep these outliers because such extreme cases might help to classify survival in algorithms such as decision trees.

In [21]:
# age above 99% percentile
titanic_df.loc[highest['age'],['age', 'Survived']]

|     | age  | Survived |
|-----|------|----------|
| 33  | 66.0 | 0        |
| 96  | 71.0 | 0        |
| 116 | 70.5 | 0        |
| 493 | 71.0 | 0        |
| 630 | 80.0 | 1        |
| 672 | 70.0 | 0        |
| 745 | 70.0 | 0        |
| 851 | 74.0 | 0        |


In [22]:
# age below 1% percentile
titanic_df.loc[lowest['age'],['age','Survived']]

|     | age  | Survived |
|-----|------|----------|
| 78  | 0.83 | 1        |
| 305 | 0.92 | 1        |
| 469 | 0.75 | 1        |
| 644 | 0.75 | 1        |
| 755 | 0.67 | 1        |
| 803 | 0.42 | 1        |
| 831 | 0.83 | 1        |


The average age is 29, but there are outliers who are 66 to 80 years old, which is 2.3 to 2.8 times the average age, as well as infants younger than one year old. I am going to keep these outliers for now because the extreme cases of age might be useful.

## Summary of Data Exploration

- Total number of data points: 891
- Target: ‘Survived’
- Total number of data points labeled as survived: 342 (38%)
- Total number of data points labeled as dead: 549 (62%)
- Slightly imbalanced classes
- Number of initial features: 10
- List of features with missing values or NaN: 

| Feature  | # of NaN | Survival rate of NaN | Survival rate of non-NaN | Difference (non-NaN minus NaN) |
|----------|----------|----------------------|--------------------------|--------------------------------|
| Cabin    | 687      | 0.30                 | 0.67                     | 0.37                           |
| Age      | 177      | 0.29                 | 0.41                     | 0.12                           |
| Embarked | 2        | 1.00                 | 0.38                     | -0.62                          |

- No passenger is repeatedly flagged as an outlier across distinct features.
- The mean fare is 32 units, but some passengers paid 262, 263, or 512 units, which is 8 to 16 times the mean fare.
- The average age is 29, but some passengers are 66 to 80 years old, which is 2.3 to 2.8 times the average age, and some are younger than one year old.

***


# Part2. Feature Engineering

### Brainstorming

The target is 'Survived', and the remaining columns are the candidate features:
- 'Fare', 'age', 'Sex', and 'embarked' are ready to go without engineering
- 'Name', 'Ticket', and 'Cabin' are text variables and might need their number of distinct values reduced. For example, use only the last name instead of the full name with title.
- 'SibSp' and 'Parch' have a lot of zeros, where a zero truly means absence. A square transformation can be considered. Another option is to create a new feature like 'is_family' by combining 'SibSp', 'Parch', and 'family_name' from 'Name'.

### Challenges

The features are a mix of continuous and categorical variables. Most ML algorithms work well with numerical variables, and some work with mixed data types. I can think of several approaches:
- use algorithms that can handle variables of both data types (DT, NB, KNN)
- use a dimensionality reduction method to get numerical vectors
- convert non-ordinal categorical variables to numerical or dummy variables ([pandas.get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)) and use algorithms that work with numerical variables; the limitation is that not all features can be used this way, because some have a large number of unique values
- use an ensemble method to combine algorithms for numerical variables with algorithms for categorical variables

Reference: http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
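
To illustrate the dummy-variable approach and its limitation, here is a minimal sketch on a hypothetical toy frame (the four rows are made up; only the column names echo this notebook). Counting unique values first shows which columns are cheap to one-hot encode.

```python
import pandas as pd

# Hypothetical toy frame with one low- and one high-cardinality column
toy = pd.DataFrame({
    'embarked': ['S', 'C', 'Q', 'S'],
    'family_name': ['Braund', 'Cumings', 'Heikkinen', 'Futrelle'],
})

# Number of distinct values per column: one dummy column is needed per value
cardinality = toy.nunique()
print(cardinality)

# 'embarked' has only 3 values, so it expands into 3 indicator columns
dummies = pd.get_dummies(toy['embarked'], prefix='embarked')
print(dummies)
```

A column such as 'family_name', where almost every row has its own value, would produce nearly as many dummy columns as there are rows, which is the limitation mentioned in the list above.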

## Creating New Features

### family_name

In [23]:
# a procedure to create a column with family name only
def get_familyname(name):
    full_name = name.split(',')
    return full_name[0]

# apply get_familyname procedure to the column of 'Name'
familyname = titanic_df['Name'].apply(get_familyname)

# add familyname to the DataFrame as new column
titanic_df['family_name'] = familyname.values

In [24]:
# how many unique family name?
len(titanic_df['family_name'].unique())

667

### ticket_prefix

In [25]:
# understand 'Ticket' values
titanic_df['Ticket'].head(10)

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
5              330877
6               17463
7              349909
8              347742
9              237736
Name: Ticket, dtype: object

In [26]:
# how many unique tickets?
len(titanic_df['Ticket'].unique())

681

The 'Ticket' variable is not consistent in format: some values mix letters and numbers, some contain symbols (/ or .), and others consist of numbers only. Also, not all tickets are different: there are 681 unique values among 891 passengers.

In [27]:
# a procedure to create a column with ticket prefix only
def get_prefix(ticket):
    if ' ' in ticket:
        prefix = ticket.split(' ')
        return prefix[0]
    else:
        return 'None'

# apply get_prefix procedure to the column of 'Ticket'
ticketprefix = titanic_df['Ticket'].apply(get_prefix)

# add ticket_prefix to the DataFrame as new column
titanic_df['ticket_prefix'] = ticketprefix.values

In [28]:
# count of ticket prefix; False means ticket_prefix == 'None'
titanic_df.groupby(titanic_df['ticket_prefix'] != 'None').count()['PassengerId']

ticket_prefix
False    665
True     226
Name: PassengerId, dtype: int64

In [29]:
# survival rate by group with ticket prefix; False means ticket_prefix == 'None'
titanic_df.groupby(titanic_df['ticket_prefix'] != 'None')['Survived'].mean()

ticket_prefix
False    0.383459
True     0.384956
Name: Survived, dtype: float64

I found no difference in survival rate in the group with vs. without ticket prefix, and both rates are similar to the average of the total.

In [30]:
# frequency of ticket_prefix
titanic_df.groupby(titanic_df['ticket_prefix']).count()['PassengerId']

ticket_prefix
A./5.           2
A.5.            2
A/4             3
A/4.            3
A/5            10
A/5.            7
A/S             1
A4.             1
C               5
C.A.           27
C.A./SOTON      1
CA              6
CA.             8
F.C.            1
F.C.C.          5
Fa              1
None          665
P/PP            2
PC             60
PP              3
S.C./A.4.       1
S.C./PARIS      2
S.O./P.P.       3
S.O.C.          5
S.O.P.          1
S.P.            1
S.W./PP         1
SC              1
SC/AH           3
SC/PARIS        5
SC/Paris        4
SCO/W           1
SO/C            1
SOTON/O.Q.      8
SOTON/O2        2
SOTON/OQ        7
STON/O         12
STON/O2.        6
SW/PP           1
W./C.           9
W.E.P.          1
W/C             1
WE/P            2
Name: PassengerId, dtype: int64

In [31]:
# how many unique ticket_prefix?
len(titanic_df['ticket_prefix'].unique())

43

I found inconsistency in the formatting of the prefix. For example, A./5., A.5., A/5, and A/5. could be the same prefix, and A/S could be a typo for A/5. I am not sure whether making the formatting consistent would help to classify survival better, or whether the differences in formatting themselves carry information.

In [32]:
# procedure to remove all special characters and change to upper case
def remove_special(initial):
    return (''.join(e for e in initial if e.isalnum())).upper()

titanic_df['ticket_prefix_v2'] = titanic_df['ticket_prefix'].apply(remove_special)

# frequency of ticket_prefix_v2
titanic_df.groupby(titanic_df['ticket_prefix_v2']).count()['PassengerId']

ticket_prefix_v2
A4           7
A5          21
AS           1
C            5
CA          41
CASOTON      1
FA           1
FC           1
FCC          5
NONE       665
PC          60
PP           3
PPP          2
SC           1
SCA4         1
SCAH         3
SCOW         1
SCPARIS     11
SOC          6
SOP          1
SOPP         3
SOTONO2      2
SOTONOQ     15
SP           1
STONO       12
STONO2       6
SWPP         2
WC          10
WEP          3
Name: PassengerId, dtype: int64

In [33]:
# survival rate by ticket_prefix_v2
titanic_df.groupby(titanic_df['ticket_prefix_v2'])['Survived'].mean()

ticket_prefix_v2
A4         0.000000
A5         0.095238
AS         0.000000
C          0.400000
CA         0.341463
CASOTON    0.000000
FA         0.000000
FC         0.000000
FCC        0.800000
NONE       0.383459
PC         0.650000
PP         0.666667
PPP        0.500000
SC         1.000000
SCA4       0.000000
SCAH       0.666667
SCOW       0.000000
SCPARIS    0.454545
SOC        0.166667
SOP        0.000000
SOPP       0.000000
SOTONO2    0.000000
SOTONOQ    0.133333
SP         0.000000
STONO      0.416667
STONO2     0.500000
SWPP       1.000000
WC         0.100000
WEP        0.333333
Name: Survived, dtype: float64

In [34]:
# how many unique ticket_prefix_v2?
len(titanic_df['ticket_prefix_v2'].unique())

29

The number of unique ticket prefixes was 43, and it is 29 after removing the special characters.

I think putting the cleaned prefix and the number back together might help.

In [35]:
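# a procedure to create a column with ticket number only
# caveat: for a ticket with more than one space, this returns the second
# space-separated token, which is not necessarily the trailing number
def get_number(ticket):
    if ' ' in ticket:
        number = ticket.split(' ')
        return number[1]
    else:
        return ticket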

# apply get_number procedure to the column of 'Ticket'
ticketnumber = titanic_df['Ticket'].apply(get_number)

# add ticket_number to the DataFrame as new column
titanic_df['ticket_number'] = ticketnumber.values

titanic_df['ticket_number'].head(10)

0      21171
1      17599
2    3101282
3     113803
4     373450
5     330877
6      17463
7     349909
8     347742
9     237736
Name: ticket_number, dtype: object
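
Splitting on the first space works for tickets with exactly one space. As a hedged alternative (not the approach used in this notebook), the split can be made at the last space, so that the trailing token is always treated as the number. The helper name `split_ticket` and the multi-space example string are hypothetical.

```python
def split_ticket(ticket):
    """Split a ticket string into (prefix, number) at the last space.

    Tickets without a space are treated as number-only and get the
    placeholder prefix 'None', matching the convention used above.
    """
    if ' ' in ticket:
        prefix, number = ticket.rsplit(' ', 1)
        return prefix, number
    return 'None', ticket


print(split_ticket('A/5 21171'))          # single space
print(split_ticket('113803'))             # no prefix
print(split_ticket('STON/O 2. 3101294'))  # hypothetical multi-space ticket
```

Note that with this variant the prefix keeps everything before the last space, so it can differ from the first-token prefix produced by `get_prefix`.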

In [36]:
# add ticket to the DataFrame as new column by concatenating cleaned initial and number
titanic_df['ticket'] = titanic_df['ticket_prefix_v2'] + titanic_df['ticket_number'] 

titanic_df['ticket'].head()

0          A521171
1          PC17599
2    STONO23101282
3       NONE113803
4       NONE373450
Name: ticket, dtype: object

In [37]:
# how many unique ticket?
len(titanic_df['ticket'].unique())

670

### cabin_initial

In [38]:
# understand 'Cabin' values
titanic_df['Cabin'].head(10)

0     NaN
1     C85
2     NaN
3    C123
4     NaN
5     NaN
6     E46
7     NaN
8     NaN
9     NaN
Name: Cabin, dtype: object

The 'Cabin' value consists of a capital letter followed by numbers. I am not sure what the letter and numbers stand for, but my intuition is that the letter could represent a room location or a price level, so the letter alone can be used as a feature.

In [39]:
# how many unique Cabin?
len(titanic_df['Cabin'].unique())

148

`unique()` counts NaN as a value, so the 204 known 'Cabin' entries contain 147 distinct cabins, which means some people share the same 'Cabin' value.

In [40]:
def get_initial_letter(cabin):
    cabin = str(cabin) # change data type to string because NaN is a float
    return cabin[0]

titanic_df['cabin_initial'] = titanic_df['Cabin'].apply(get_initial_letter)
titanic_df['cabin_initial'].head(10)

0    n
1    C
2    n
3    C
4    n
5    n
6    E
7    n
8    n
9    n
Name: cabin_initial, dtype: object

In [41]:
# survival rate by cabin_initial; 'n' means missing value (NaN)
titanic_df.groupby(titanic_df['cabin_initial'])['Survived'].mean()

cabin_initial
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
T    0.000000
n    0.299854
Name: Survived, dtype: float64

In [42]:
# how many unique 'cabin_initial'?
len(titanic_df['cabin_initial'].unique())

9

### w_family

In [43]:
# SibSp and Parch are combined as family by vectorized addition
sibsp = titanic_df['SibSp']
parch = titanic_df['Parch']

family = sibsp + parch

# convert to a binary category: 1 = traveling with family, 0 = alone
def has_family(family_size):
    if family_size != 0:
        return 1
    return 0

# apply has_family to each element; a separate function name avoids
# overwriting the function with the resulting Series
w_family = family.apply(has_family)

# add w_family to the DataFrame as new column
titanic_df['w_family'] = w_family.values
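
The same flag can also be computed without a helper function, by comparing the summed column against zero. This is a sketch on a hypothetical toy frame, not on `titanic_df`.

```python
import pandas as pd

# Hypothetical toy rows: siblings/spouses and parents/children aboard
toy = pd.DataFrame({'SibSp': [1, 0, 0, 3], 'Parch': [0, 0, 2, 1]})

# True where the passenger has at least one relative aboard, cast to 0/1
toy['w_family'] = ((toy['SibSp'] + toy['Parch']) > 0).astype(int)
print(toy)
```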

## Converting to Numeric Variables

### sex

In [44]:
# values of Sex
titanic_df["Sex"].unique()

array(['male', 'female'], dtype=object)

In [45]:
# if male, return True or 1 and create new column 'sex'
titanic_df['sex'] = (titanic_df['Sex'] == 'male').astype(int)

# values of sex
titanic_df["sex"].unique()

array([1, 0], dtype=int64)

### embarked to C, Q, and S

In [46]:
# values of embarked
titanic_df["embarked"].unique()

array(['S', 'C', 'Q'], dtype=object)

In [47]:
# create dummy variables for embarked
embarked_df = pd.get_dummies(titanic_df["embarked"])

In [48]:
# combine two dataframes
titanic_df = [titanic_df, embarked_df]
titanic_df = pd.concat(titanic_df, axis=1, join='inner')

titanic_df.describe()

|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | age | w_family | sex | C | Q | S |
|-------|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean  | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 29.361582 | 0.397306 | 0.647587 | 0.190797 | 0.08642 | 0.722783 |
| std   | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 13.019697 | 0.489615 | 0.47799 | 0.39315 | 0.281141 | 0.447876 |
| min   | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 | 22.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50%   | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 28.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75%   | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 | 35.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| max   | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 80.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [49]:
feature_total = np.array(titanic_df.columns)

feature_total

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'age', 'embarked',
       'family_name', 'ticket_prefix', 'ticket_prefix_v2', 'ticket_number',
       'ticket', 'cabin_initial', 'w_family', 'sex', 'C', 'Q', 'S'], dtype=object)

In [50]:
feature_numeric = []
for column in titanic_df.columns:
    if titanic_df[column].dtypes != "object" and titanic_df[column].isnull().sum() == 0:
        feature_numeric.append(column)

feature_numeric

['PassengerId',
 'Survived',
 'Pclass',
 'SibSp',
 'Parch',
 'Fare',
 'age',
 'w_family',
 'sex',
 'C',
 'Q',
 'S']

In [51]:
# remove id and target
feature_numeric = [e for e in feature_numeric if e not in ('PassengerId', 'Survived', 'survived')]

feature_numeric

['Pclass', 'SibSp', 'Parch', 'Fare', 'age', 'w_family', 'sex', 'C', 'Q', 'S']

In [52]:
len(feature_numeric)

10

In [53]:
feature_numeric[:5]

['Pclass', 'SibSp', 'Parch', 'Fare', 'age']

## Scaling Features

I will use **MinMaxScaler** to rescale each feature to the range 0-1, so that features measured in different units are weighted more equally.

In [54]:
df_numeric = titanic_df[feature_numeric[:5]]

In [55]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), \
                         index=df_numeric.index, columns=df_numeric.columns)

df_scaled = df_scaled.rename(columns={"Pclass": "pclass_scl", "SibSp": "sibsp_scl", \
                                      "Parch": "parch_scl", "Fare": "fare_scl", \
                                      "age": "age_scl"})

df_scaled.describe()

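|       | pclass_scl | sibsp_scl | parch_scl | fare_scl | age_scl  |
|-------|------------|-----------|-----------|----------|----------|
| count | 891.0      | 891.0     | 891.0     | 891.0    | 891.0    |
| mean  | 0.654321   | 0.065376  | 0.063599  | 0.062858 | 0.363679 |
| std   | 0.418036   | 0.137843  | 0.134343  | 0.096995 | 0.163605 |
| min   | 0.0        | 0.0       | 0.0       | 0.0      | 0.0      |
| 25%   | 0.5        | 0.0       | 0.0       | 0.01544  | 0.271174 |
| 50%   | 1.0        | 0.0       | 0.0       | 0.028213 | 0.346569 |
| 75%   | 1.0        | 0.125     | 0.0       | 0.060508 | 0.434531 |
| max   | 1.0        | 1.0       | 1.0       | 1.0      | 1.0      |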
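
Min-max scaling maps each value x to (x - min) / (max - min). Because this is a linear transformation, the scaled mean equals the transformed original mean, so the scaled means above can be checked by hand from the earlier `describe()` output. The helper `min_max` below is a hypothetical name used only for this check.

```python
def min_max(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)


# Means, minima, and maxima are taken from the describe() outputs above
pclass_mean_scaled = min_max(2.308642, 1.0, 3.0)
age_mean_scaled = min_max(29.361582, 0.42, 80.0)
fare_mean_scaled = min_max(32.204208, 0.0, 512.3292)

print(round(pclass_mean_scaled, 6))
print(round(age_mean_scaled, 6))
print(round(fare_mean_scaled, 6))
```

The three results agree with the means reported for pclass_scl, age_scl, and fare_scl.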


In [56]:
df = [titanic_df, df_scaled]
df = pd.concat(df, axis=1, join='inner')
# how to merge two dataframes: https://pandas.pydata.org/pandas-docs/stable/merging.html

df.describe()

|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | age | w_family | sex | C | Q | S | pclass_scl | sibsp_scl | parch_scl | fare_scl | age_scl |
|-------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean  | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 29.361582 | 0.397306 | 0.647587 | 0.190797 | 0.08642 | 0.722783 | 0.654321 | 0.065376 | 0.063599 | 0.062858 | 0.363679 |
| std   | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 13.019697 | 0.489615 | 0.47799 | 0.39315 | 0.281141 | 0.447876 | 0.418036 | 0.137843 | 0.134343 | 0.096995 | 0.163605 |
| min   | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 | 22.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.01544 | 0.271174 |
| 50%   | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 28.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.028213 | 0.346569 |
| 75%   | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 | 35.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.125 | 0.0 | 0.060508 | 0.434531 |
| max   | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 80.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [57]:
df.dtypes

PassengerId           int64
Survived              int64
Pclass                int64
Name                 object
Sex                  object
Age                 float64
SibSp                 int64
Parch                 int64
Ticket               object
Fare                float64
Cabin                object
Embarked             object
age                 float64
embarked             object
family_name          object
ticket_prefix        object
ticket_prefix_v2     object
ticket_number        object
ticket               object
cabin_initial        object
w_family              int64
sex                   int32
C                     uint8
Q                     uint8
S                     uint8
pclass_scl          float64
sibsp_scl           float64
parch_scl           float64
fare_scl            float64
age_scl             float64
dtype: object

In [58]:
# define features lists
original_numeric = ['Pclass', 'SibSp', 'Parch', 'Fare', 'age', 'sex']
original_categorical = ['Name', 'Ticket', 'Cabin', 'embarked']
original_total = original_numeric + original_categorical
                    
scaled_numeric = ['pclass_scl', 'sibsp_scl', 'parch_scl', 'fare_scl', 'age_scl', 'w_family', 'sex', 'C', 'Q', 'S']
updated_categorical = ['family_name', 'ticket', 'ticket_prefix_v2', 'cabin_initial']
updated_total = scaled_numeric + updated_categorical                    

## Feature Exploration

| List Name            | Features                                                                                          | # of Features |
|----------------------|---------------------------------------------------------------------------------------------------|---------------|
| original_numeric     | ['Pclass', 'SibSp', 'Parch', 'Fare', 'age', 'sex']                                                | 6             |
| scaled_numeric       | ['pclass_scl', 'sibsp_scl', 'parch_scl', 'fare_scl', 'age_scl', 'w_family', 'sex', 'C', 'Q', 'S'] | 10            |
| original_categorical | ['Name', 'Ticket', 'Cabin', 'embarked']                                                           | 4             |
| updated_categorical  | ['family_name', 'ticket', 'ticket_prefix_v2', 'cabin_initial']                                    | 4             |



## Selecting Features 

### What selection process to use?

- Univariate Selection such as SelectKBest: statistical tests can be used to select the features that have the strongest relationship with the output variable. 

    For the first trial, I will choose 9 or fewer features. The threshold of 9 comes from a rule of thumb based on the curse of dimensionality, where you may need exponentially more data points as you add more features, that is,

>2^(n_features) = # of data points

    I have 891 data points. 2^9 = 512 and 2^10 = 1024, so 9 is the maximum number of features. Thus, I will keep in mind to use no more than 9 features if I decide to use SelectKBest.
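
The arithmetic behind this rule of thumb can be written out as a short check. The variable names are chosen for illustration only.

```python
import math

n_samples = 891

# Largest n such that 2**n does not exceed the number of data points
max_features = int(math.floor(math.log2(n_samples)))

print(max_features, 2 ** max_features, 2 ** (max_features + 1))
```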

- Dimensionality Reduction such as PCA: PCA (Principal Component Analysis) uses linear algebra to transform the dataset into a compressed form. I think choosing 2-3 dimensions after the PCA transformation could be a good start.

### Which feature scores to compare?

I chose the **f_classif** scoring function for continuous variables and **chi2** for categorical variables.

- [Variance](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) can be useful for unsupervised learning. Since I already have labels, using them for scoring could be better than relying solely on the x-variables.

- [The mutual information (MI)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency. MI can be used for unsupervised clustering.

- The chi-square distribution arises in tests of hypotheses concerning the independence of two random variables and concerning whether a discrete random variable follows a specified distribution. The F-distribution arises in tests of hypotheses concerning whether or not two population variances are equal and concerning whether or not three or more population means are equal. In other words, chi-square is most appropriate for categorical data, whereas f-value can be used for continuous data [(read more)](https://discussions.udacity.com/t/f-classif-versus-chi2/245226).
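
As a small illustration of the two scoring functions, the sketch below applies `SelectKBest` with `f_classif` to continuous features and with `chi2` to non-negative binary features. All data here is synthetic and generated only for this example; in each pair, column 0 is built to depend on the label and column 1 is pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.RandomState(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: column 0 shifts with the label, column 1 is noise
X_cont = np.column_stack([
    rng.normal(loc=y * 2.0, scale=1.0),
    rng.normal(size=100),
])

# Binary features (chi2 needs non-negative values):
# column 0 agrees with the label about 90% of the time, column 1 is noise
X_cat = np.column_stack([
    np.where(rng.rand(100) < 0.9, y, 1 - y),
    rng.randint(0, 2, size=100),
])

f_selector = SelectKBest(f_classif, k=1).fit(X_cont, y)
chi_selector = SelectKBest(chi2, k=1).fit(X_cat, y)

print("F scores:", f_selector.scores_, "kept:", f_selector.get_support())
print("chi2 scores:", chi_selector.scores_, "kept:", chi_selector.get_support())
```

In both cases the label-dependent column receives the much higher score and is the one kept.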




***

# Part3. Algorithm Search

## Algorithm Exploration

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low according to [blog post by Cheng-Tao Chu](http://ml.posthaven.com/machine-learning-done-wrong).

- SVC
- KNeighbors 
- Gaussian Naive Bayes
- Decision Trees
- Ensemble Methods

## Validation Methods 
I think a proper validation method for a dataset with imbalanced classes is to use cross-validation iterators with stratification based on class labels, such as **StratifiedKFold** and **StratifiedShuffleSplit**. This ensures that the relative class frequencies are approximately preserved in each train and test set.
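
A minimal sketch of what stratification does, using synthetic labels with a 40/60 class split (made up for illustration, close to but not the same as the 38/62 split in this data set): the share of the positive class in every test fold stays near the overall share.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 40 positives and 60 negatives
y = np.array([1] * 40 + [0] * 60)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)

# Share of the positive class in each test fold
test_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(test_rates)
```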

In [None]:
# generate stratified train-test pairs; with test_size left unset,
# StratifiedShuffleSplit holds out 10% of the rows for testing
from sklearn.model_selection import StratifiedShuffleSplit

#sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.33, random_state=44)
sss = StratifiedShuffleSplit(n_splits=3, random_state=44)

for train_index, test_index in sss.split(df[original_numeric], df['Survived']):
    #print("TRAIN:", train_index, "TEST:", test_index)
    # split() yields positional indices, so select rows with .iloc
    X_train, X_test = df[original_numeric].iloc[train_index], df[original_numeric].iloc[test_index]
    y_train, y_test = df['Survived'].iloc[train_index], df['Survived'].iloc[test_index]

# shapes of the last train-test pair
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

## Building Pipelines


In [67]:
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier

In [73]:
svc = SVC()
gnb = GaussianNB()
neigh = KNeighborsClassifier()
dt = tree.DecisionTreeClassifier()
rdf = RandomForestClassifier()
adb = AdaBoostClassifier()
sss = StratifiedShuffleSplit(n_splits=1000, random_state=44)

In [69]:
# a procedure to print out mean scores from cv
scores = ["accuracy", "precision", "recall", "average_precision", "f1", "roc_auc"]
def print_scores(clf, data, label):
    for score in scores:
        mean_score = cross_val_score(clf, data, label, cv=sss, scoring=score).mean()
        print(score, ':', mean_score)
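
Calling `cross_val_score` once per metric refits the model for every metric. As a hedged alternative, scikit-learn's `cross_validate` accepts a list of scorers and evaluates them all on the same fits. The sketch below uses synthetic data from `make_classification` and a Gaussian Naive Bayes classifier purely as stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not the Titanic features)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=44)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# One call fits each split once and scores it with every metric
results = cross_validate(GaussianNB(), X, y, cv=cv, scoring=metrics)

# Results are stored under keys of the form 'test_<metric>'
mean_scores = {m: results["test_" + m].mean() for m in metrics}
for name, value in mean_scores.items():
    print(name, ':', value)
```

With 1000 splits and a slow classifier such as SVC, scoring all metrics in one pass can save a considerable amount of time.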

In [None]:
print_scores(svc, df[original_numeric], df['Survived'])

accuracy : 0.704744444444
precision : 0.646534422473


In [59]:
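clf = SVC()

# parameter names are prefixed with the pipeline step they belong to
parameters = {'selector__k': [6, 5, 4, 3],
              'clf__kernel': ['rbf', 'linear', 'poly'],
              'clf__C': [0.1, 1, 10, 100, 1000],
              'clf__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'clf__class_weight': ['balanced', None]}

pipeline1 = Pipeline([('selector', SelectKBest()),
                      ('clf', clf)])

# search over the pipeline, not the bare classifier, so that the
# 'selector__' and 'clf__' parameter names can be resolved
grid_search = GridSearchCV(pipeline1, parameters, scoring='f1')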
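
To show how the `step__parameter` naming connects a grid to a pipeline, here is a small end-to-end sketch on synthetic data. The data set, the reduced grid, and the 3-fold setting are assumptions made to keep the example fast; they are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data with 6 features
X, y = make_classification(n_samples=150, n_features=6,
                           n_informative=3, random_state=0)

pipe = Pipeline([("selector", SelectKBest()), ("clf", SVC())])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {"selector__k": [2, 4, 6], "clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(X, y)

# Inspect the winning settings and which columns the selector kept
best_k = search.best_params_["selector__k"]
mask = search.best_estimator_.named_steps["selector"].get_support()
print(search.best_params_)
print(mask)
```

Because the selector is part of the pipeline, feature selection is refit inside every cross-validation split, which avoids leaking information from the held-out data into the feature scores.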

In [None]:
start = time()
# reduced grid for a quicker search; selector__k is kept because the
# SelectKBest default (k=10) exceeds the 6 features in original_numeric
parameters = {'selector__k': [6, 5, 4, 3],
              'clf__kernel': ['rbf', 'linear', 'poly'],
              'clf__class_weight': ['balanced', None]}
grid_search = GridSearchCV(pipeline1, parameters, scoring='f1')
grid_result = grid_search.fit(df[original_numeric], df['Survived']).best_estimator_

selector = grid_result.named_steps['selector']
k_features = selector.get_params(deep=True)['k']
print("Number of features selected: %i" %(k_features))
print("\nThis took %.2f seconds\n" %(time() - start))

In [None]:
selected = selector.fit_transform(df[original_numeric], df['Survived'])
# named feature_scores so that the `scores` list used by print_scores is not overwritten
feature_scores = zip(original_numeric, selector.scores_, selector.pvalues_)
sorted_scores = sorted(feature_scores, key = lambda x: x[1], reverse=True)
new_list = list(map(lambda x: x[0], sorted_scores))[0:k_features]

In [None]:
time()

In [None]:
svc = SVC()
