# Titanic: Machine Learning from Disaster
By Lan Ngo, based on the tutorials https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish and https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8

### 1. Data Description

In this challenge, we are asked to predict whether a passenger on the Titanic survived, using features such as economic status (class), sex, and age.


| Variable | Definition	| Note |
|:--------:|:----------:|:----:|
| survival | Survival   |	0 = No, 1 = Yes |
| pclass   | Ticket class | A proxy for socio-economic status (SES)<br>1 = 1st (Upper)<br>2 = 2nd (Middle)<br>3 = 3rd (Lower) |
| sex	   | Sex	    | |
| Age      | Age in years | Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5 |
| sibsp	| # of siblings / spouses aboard the Titanic	| Sibling = brother, sister, stepbrother, stepsister<br>Spouse = husband, wife (mistresses and fiancés were ignored)| 
| parch	| # of parents / children aboard the Titanic	| Parent = mother, father<br>Child = daughter, son, stepdaughter, stepson<br>Some children travelled only with a nanny, therefore parch=0 for them.|
| ticket |	Ticket number	| |
| fare |	Passenger fare	| |
| cabin |	Cabin number	| |
| embarked|	Port of Embarkation |	C = Cherbourg, Q = Queenstown, S = Southampton|


In [52]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import data
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')

data_train.sample(3)

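| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 564 | 565 | 0 | 3 | Meanwell, Miss. (Marion Ogden) | female | NaN | 0 | 0 | SOTON/O.Q. 392087 | 8.05 | NaN | S |
| 209 | 210 | 1 | 1 | Blank, Mr. Henry | male | 40.0 | 0 | 0 | 112277 | 31.0 | A31 | C |
| 404 | 405 | 0 | 3 | Oreskovic, Miss. Marija | female | 20.0 | 0 | 0 | 315096 | 8.6625 | NaN | S |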


### 2. Data exploration

Let's look at some statistics and visualizations to better understand the dataset.

In [53]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


The training set has 891 examples and 11 features plus the target variable (Survived). Of the 12 columns, 2 are floats, 5 are integers (including the target), and 5 are objects. Of the 11 features, PassengerId, Ticket, and Name might not correlate much with survival rate.

The features 'Age', 'Cabin', and 'Embarked' contain missing data. Cabin is present for only 204 of the 891 training samples, so dropping it from the dataset would be reasonable; the pre-processing in Section 3 instead keeps its first letter and marks missing values with 'N'. Object features need to be converted to numeric features in order for the algorithms to process them.
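
The missing-data check described above can be sketched as follows. The `toy` frame and its values are made up for illustration and are not rows from train.csv. On the real `data_train`, the same `isna().sum()` call should report 177 missing values for Age, 687 for Cabin, and 2 for Embarked, which follows from the non-null counts in the `info()` output.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)

# Share of rows that are missing, per column
missing_share = (missing / len(toy)).round(2)

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

A column with a high missing share, such as Cabin here, is a candidate for dropping or for a dedicated "missing" category.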

In [54]:
# Descriptive statistics of the training set
data_train.describe() 

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Approximately 38% of the passengers in the training set survived the accident. The features have widely different ranges (for example, Fare runs from 0 to about 512, while Parch runs from 0 to 6), so we will need to convert them to roughly the same scale.

### 3. Data pre-processing

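1. Group ages into age groups to reduce the risk of overfitting.
2. Drop Ticket, as it is not useful in the learning model. The code below also drops Embarked and, once the last name and prefix have been extracted, Name.
3. Only the first letter of Cabin is important, so keep just that letter (missing cabins become 'N').
4. Group the continuous Fare into categories using bin edges close to the quartiles of its distribution.
5. Replace the full name with the last name and the name prefix (title).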
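
To make step 1 concrete, here is a minimal sketch of how `pd.cut` maps ages to groups, using the same bin edges and labels as `simplify_ages` in the next cell. The `ages` series is made up for illustration. Because the intervals are closed on the right by default, the placeholder -0.5 used for missing ages lands in the (-1, 0] bin, labelled 'Unknown'.

```python
import pandas as pd

# Made-up ages, including one missing value
ages = pd.Series([0.42, 4.0, 22.0, 38.0, 80.0, None], dtype="float64")

# Same bin edges and labels as simplify_ages
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Young Adult', 'Adult',
          'Middle age', 'Senior']

# Missing ages are replaced by -0.5 so that they fall into the 'Unknown' bin
age_groups = pd.cut(ages.fillna(-0.5), bins, labels=labels)

print(age_groups.tolist())
# ['Baby', 'Baby', 'Young Adult', 'Middle age', 'Senior', 'Unknown']
```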

In [42]:
def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    age_group = ['Unknown', 'Baby', 'Child', 'Teenager', 'Young Adult', 'Adult',
                'Middle age', 'Senior']
    categories = pd.cut(df.Age, bins, labels=age_group)
    df.Age = categories
    return df

def simplify_cabins(df):
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])
    return df

def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    fare_group = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=fare_group)
    df.Fare = categories
    return df

def transform_name(df):
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df

def drop_features(df):
    return df.drop(['Ticket', 'Name', 'Embarked'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabins(df)
    df = simplify_fares(df)
    df = transform_name(df)
    df = drop_features(df)
    return df
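
The fare bins in `simplify_fares` can be checked against the `describe()` output: the edges 8, 15, and 31 sit close to the quartiles 7.9104, 14.4542, and 31.0. The sketch below applies the same bins to a few made-up fares. One detail worth noticing is that a fare of exactly 0 (the minimum in the training set) falls into the right-closed (-1, 0] bin and is therefore labelled 'Unknown', just like a missing fare.

```python
import pandas as pd

# Made-up fares: a zero fare, values around the bin edges, and one missing value
fares = pd.Series([0.0, 7.25, 8.05, 31.0, 512.3292, None], dtype="float64")

# Same bin edges and labels as simplify_fares
bins = (-1, 0, 8, 15, 31, 1000)
labels = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']

fare_groups = pd.cut(fares.fillna(-0.5), bins, labels=labels)

print(fare_groups.tolist())
# ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile', 'Unknown']
```

If zero fares should be kept apart from missing ones, a different placeholder or an extra bin would be needed; that would be a change to the approach shown here.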

In [43]:
# transform features of train and test set
data_train = transform_features(data_train)
data_test = transform_features(data_test)
data_train.head()

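| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Lname | NamePrefix |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 1 | 0 | 3 | male | Young Adult | 1 | 0 | 1_quartile | N | Braund, | Mr. |
| 1 | 2 | 1 | 1 | female | Middle age | 1 | 0 | 4_quartile | C | Cumings, | Mrs. |
| 2 | 3 | 1 | 3 | female | Adult | 0 | 0 | 1_quartile | N | Heikkinen, | Miss. |
| 3 | 4 | 1 | 1 | female | Adult | 1 | 0 | 4_quartile | C | Futrelle, | Mrs. |
| 4 | 5 | 0 | 3 | male | Adult | 0 | 0 | 2_quartile | N | Allen, | Mr. |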
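
Note that `Lname` keeps the trailing comma ("Braund,") because `transform_name` splits on spaces. This does not matter once the values are label-encoded, but a space split would also cut a multi-word surname short. The following is a hedged sketch of a comma-based alternative, not the method used in this notebook; the variable names are hypothetical and the two names are taken from the sample rows shown earlier.

```python
import pandas as pd

names = pd.Series(["Blank, Mr. Henry", "Oreskovic, Miss. Marija"])

# Same split as transform_name: the first token keeps its trailing comma
raw_lname = names.apply(lambda x: x.split(' ')[0])

# Hypothetical alternative: everything before the comma is the last name
lname = names.apply(lambda x: x.split(',')[0].strip())

# Hypothetical alternative: the title sits between the comma and the first period
prefix = names.apply(lambda x: x.split(',')[1].split('.')[0].strip())

print(raw_lname.tolist())  # ['Blank,', 'Oreskovic,']
print(lname.tolist())      # ['Blank', 'Oreskovic']
print(prefix.tolist())     # ['Mr', 'Miss']
```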


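We use scikit-learn's LabelEncoder to convert each unique string value to an integer, which makes the data usable by a wider range of algorithms. The encoder is fit on the combined training and test values so that both sets share the same mapping.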

In [44]:
from sklearn import preprocessing
def encode_features(df_train, df_test):
    features = ['Fare', 'Cabin', 'Age', 'Sex', 'Lname', 'NamePrefix']
    df_combined = pd.concat([df_train[features], df_test[features]])
    
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

data_train, data_test = encode_features(data_train, data_test)
data_train.head()

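| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Lname | NamePrefix |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| 0 | 1 | 0 | 3 | 1 | 7 | 1 | 0 | 0 | 7 | 100 | 19 |
| 1 | 2 | 1 | 1 | 0 | 3 | 1 | 0 | 3 | 2 | 182 | 20 |
| 2 | 3 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 7 | 329 | 16 |
| 3 | 4 | 1 | 1 | 0 | 0 | 1 | 0 | 3 | 2 | 267 | 20 |
| 4 | 5 | 0 | 3 | 1 | 0 | 0 | 0 | 1 | 7 | 15 | 19 |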
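
A small sketch may help explain two things about this encoding. First, LabelEncoder assigns integers in sorted label order, which is why 'Adult' became 0 and 'Young Adult' became 7 in the Age column above. Second, fitting on the combined train and test values avoids an error when the test set contains a label the training set never saw. The cabin letters below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up cabin letters; "T" appears only in the test series
train = pd.Series(["N", "C", "N"])
test = pd.Series(["N", "T"])

# Fit on the combined values so both sets share one mapping
le = LabelEncoder().fit(pd.concat([train, test]))

print(list(le.classes_))    # ['C', 'N', 'T'] (sorted order)
print(le.transform(train))  # [1 0 1]
print(le.transform(test))   # [1 2]

# An encoder fit on the training values alone rejects the unseen label
try:
    LabelEncoder().fit(train).transform(test)
    unseen_error = False
except ValueError:
    unseen_error = True
print(unseen_error)         # True
```

The integers carry an arbitrary alphabetical order, so they are best read as category codes rather than quantities.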


We separate the training data into features (X_all) and the Survived label (y_all), then use train_test_split to split data_train into a training set (80% of the data) and a test set (20%).

In [45]:
from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived', 'PassengerId'], axis=1)
y_all = data_train['Survived']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, 
                                                    test_size=num_test, random_state=23)
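
As a quick check of what this split produces, the sketch below runs `train_test_split` on placeholder arrays with the same number of rows (891). With `test_size=0.20`, the test size is rounded up, giving 179 test rows and 712 training rows. Fixing `random_state` makes the split repeatable. The arrays `X` and `y` are stand-ins, not the Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the training set
X = np.arange(891).reshape(-1, 1)
y = np.zeros(891, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=23)
print(len(X_tr), len(X_te))  # 712 179

# The same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.20, random_state=23)
same_split = np.array_equal(X_te, X_te2)
print(same_split)            # True
```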

### 4. Fitting and Tuning with Random Forest Classifier

We use a random forest as the classifier and tune its hyperparameters with a grid search.


In [46]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

        

# Classifier type
random_forest = RandomForestClassifier()

# parameters for classifier
parameters = {'n_estimators': [4, 6, 9],
             'max_features': ['log2', 'sqrt'],
             'criterion': ['entropy', 'gini'],
             'max_depth': [2, 3, 5, 10],
             'min_samples_split': [2, 3, 5],
             'min_samples_leaf': [1, 5, 8]}

# Scoring type used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(random_forest, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

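# Set classifier to the best combination of parameters
random_forest = grid_obj.best_estimator_

# Fit the best algorithm to the data
# (GridSearchCV refits best_estimator_ on X_train by default, so this call trains it a second time)
random_forest.fit(X_train, y_train)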

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=6, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
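
For readers who want to inspect what the grid search chose, the following is a minimal sketch on synthetic data. The dataset from `make_classification`, the reduced `small_grid`, and the names `X_demo`, `y_demo`, and `search` are all assumptions made to keep the example fast; they are not the notebook's data or grid. It shows how many combinations a grid contains, where the winning parameters and their cross-validated score are stored, and that `best_estimator_` is ready to predict without an extra `fit` call.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid so that the search runs quickly
small_grid = {'n_estimators': [4, 9], 'max_depth': [2, 5]}
n_combinations = len(ParameterGrid(small_grid))
print(n_combinations)  # 4

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid,
                      scoring='accuracy', cv=5)
search.fit(X_demo, y_demo)

# Winning parameter combination and its mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 3))

# With refit=True (the default), best_estimator_ is already fitted
preds = search.best_estimator_.predict(X_demo[:5])
print(preds)
```

The number of model fits grows as the product of the list lengths times the number of folds, which is worth keeping in mind before widening a grid.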

In [47]:
# accuracy on test set
rf_predictions = random_forest.predict(X_test)
rf_without_kfold = accuracy_score(y_test, rf_predictions)

We then evaluate the tuned model with K-fold cross-validation: the data is split into 10 folds and, on each iteration, a different fold serves as the test set while the model is trained on the other nine.

In [48]:
from sklearn.model_selection import KFold

def run_kfold(clf):
    # 10 contiguous folds, no shuffling
    kf = KFold(n_splits=10)
    outcomes = []
    fold = 0
    
    for train_idx, test_idx in kf.split(X_all):
        fold += 1
        X_train, X_test = X_all.values[train_idx], X_all.values[test_idx]
        y_train, y_test = y_all.values[train_idx], y_all.values[test_idx]
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
        
    mean_accuracy = np.mean(outcomes)
    print("Mean accuracy: {0}".format(mean_accuracy))
    return mean_accuracy
    
rf_w_kfold = run_kfold(random_forest)

Fold 1 accuracy: 0.7888888888888889
Fold 2 accuracy: 0.8314606741573034
Fold 3 accuracy: 0.7640449438202247
Fold 4 accuracy: 0.8426966292134831
Fold 5 accuracy: 0.797752808988764
Fold 6 accuracy: 0.8202247191011236
Fold 7 accuracy: 0.7865168539325843
Fold 8 accuracy: 0.7528089887640449
Fold 9 accuracy: 0.8539325842696629
Fold 10 accuracy: 0.7640449438202247
Mean accuracy: 0.8002372034956304
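
The fold accuracies above also reveal the fold sizes: with 891 rows and 10 folds, the first fold holds 90 rows and the other nine hold 89 each (0.7888... is 71/90). The sketch below confirms this, then shows `cross_val_score` as a more compact way to run the same kind of evaluation. The synthetic data and the classifier settings are assumptions for illustration, so the scores it prints will not match the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Fold sizes for 891 rows split into 10 contiguous folds
kf = KFold(n_splits=10)
fold_sizes = [len(test_idx) for _, test_idx in kf.split(np.arange(891))]
print(fold_sizes)  # [90, 89, 89, 89, 89, 89, 89, 89, 89, 89]

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=891, n_features=9, random_state=0)
clf = RandomForestClassifier(n_estimators=6, max_depth=3, random_state=0)

# One accuracy value per fold, computed on the held-out fold
scores = cross_val_score(clf, X_demo, y_demo, cv=kf, scoring='accuracy')
print(len(scores), scores.mean())
```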


In [49]:
# Compare accuracy between model with and without KFold
print("Accuracy for model without using KFold: {0}".format(rf_without_kfold))
print("Accuracy for model using KFold: {0}".format(rf_w_kfold))

Accuracy for model without using KFold: 0.7821229050279329
Accuracy for model using KFold: 0.8002372034956304
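
When comparing the two numbers, it helps to look at how much the accuracy varies from fold to fold. The sketch below uses only the fold accuracies printed above and computes their mean, sample standard deviation, and range. The single-split accuracy of 0.782 lies inside the range of the fold accuracies (about 0.753 to 0.854), so the gap between the two figures is no larger than the fold-to-fold variation.

```python
import numpy as np

# Fold accuracies copied from the K-fold output above
fold_acc = np.array([
    0.7888888888888889, 0.8314606741573034, 0.7640449438202247,
    0.8426966292134831, 0.797752808988764, 0.8202247191011236,
    0.7865168539325843, 0.7528089887640449, 0.8539325842696629,
    0.7640449438202247,
])
single_split_acc = 0.7821229050279329

mean_acc = fold_acc.mean()
std_acc = fold_acc.std(ddof=1)  # sample standard deviation across folds

print(mean_acc)
print(std_acc)
print(fold_acc.min(), fold_acc.max())

# Does the single-split accuracy fall within the observed fold range?
within_range = bool(fold_acc.min() <= single_split_acc <= fold_acc.max())
print(within_range)  # True
```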


### 5. Predict actual test data and export to CSV


In [50]:
ids = data_test['PassengerId']
predictions = random_forest.predict(data_test.drop('PassengerId', axis=1))

output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})
output.to_csv('titanic-random-forest.csv', index=False)
# Kaggle Score: 0.72727
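
Before uploading, it can be worth checking that the exported file has the expected shape. The following sketch writes a tiny frame with the same two columns to an in-memory buffer and reads it back. The ids and predictions are placeholders, and the checks are a suggestion rather than part of the original workflow.

```python
import io
import pandas as pd

# Placeholder ids and predictions standing in for the real ones
ids = pd.Series([892, 893, 894])
predictions = [0, 1, 0]

output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
output.to_csv(buf, index=False)
print(buf.getvalue())

# Read it back and check the columns and the label values
check = pd.read_csv(io.StringIO(buf.getvalue()))
columns_ok = list(check.columns) == ['PassengerId', 'Survived']
labels_ok = bool(check['Survived'].isin([0, 1]).all())
print(columns_ok, labels_ok)  # True True
```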