Titanic: Machine Learning from Disaster
===
巨量資料分析導論 Homework 3
B036060017 資管四年級 謝威廷, 2018/4/18

## Outline
1. Data Loading
2. Dealing with Missing & Redundant Variable
3. Label Categoriacal Variable with One Hot Encoding
4. Modeling

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

seed = 17

## 1. Data Loading
- After rough observation, select the following features to predict the outcome.
- 'Age', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked'

In [7]:
train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')
data = pd.concat([train, test], axis = 0)

test_id = data.iloc[train.shape[0]:, 6]
data = data.drop(["PassengerId"], axis = 1)

selected_features = ['Survived', 'Age', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']
data = data.loc[:, selected_features]

print("Train Shape : ", train.shape, " Test Shape : ", test.shape )
print("Full Data Shape : ", data.shape)
data.head()

Train Shape :  (891, 12)  Test Shape :  (418, 11)
Full Data Shape :  (1309, 8)


Unnamed: 0,Survived,Age,Pclass,Sex,SibSp,Parch,Fare,Embarked
0,0.0,22.0,3,male,1,0,7.25,S
1,1.0,38.0,1,female,1,0,71.2833,C
2,1.0,26.0,3,female,0,0,7.925,S
3,1.0,35.0,1,female,1,0,53.1,S
4,0.0,35.0,3,male,0,0,8.05,S


## 2. Dealing with Missing & Redundant Variable

In [8]:
data.isnull().sum()

Survived    418
Age         263
Pclass        0
Sex           0
SibSp         0
Parch         0
Fare          1
Embarked      2
dtype: int64

### 2.1 Embarked & Fare
- `Embarked` : Only 2 missing values, fill with most frequent value `S`.
- `Fare`: Only 1 missing value, fill with the average of whole `Fare` value.

In [9]:
print(data['Embarked'].value_counts())
data['Embarked'] = data['Embarked'].fillna('S')
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())

S    914
C    270
Q    123
Name: Embarked, dtype: int64


### 2.2 Age
- Using RandomForest with other features to impute the missing value.

In [10]:
from sklearn.ensemble import RandomForestRegressor

age_data = data.loc[:, data.dtypes!="object"]
age_data = age_data.drop(["Survived"], axis = 1)
# Train
age_complete = age_data[(age_data["Age"].notnull())]
# Test
age_incomplete = age_data[(age_data["Age"].isnull())]

X_train_age = age_complete.iloc[:, 1:]
y_train_age = age_complete.iloc[:, 1]

model_rf_age = RandomForestRegressor(n_estimators=1000, oob_score = True, n_jobs = -1, random_state = seed)
model_rf_age.fit(X_train_age, y_train_age)
pred_age = model_rf_age.predict(age_incomplete.iloc[:, 1:])
data.loc[data["Age"].isnull(), ["Age"]] = pred_age
print("Random Forest impute Age MSE :", model_rf_age.oob_score_)

Random Forest impute Age MSE : 1.0


## 3. Label Categoriacal Variable with One Hot Encoding

In [11]:
data_ohe = pd.get_dummies(data)
print(data_ohe.shape)
data_ohe.head()

(1309, 11)


Unnamed: 0,Survived,Age,Pclass,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0.0,22.0,3,1,0,7.25,0,1,0,0,1
1,1.0,38.0,1,1,0,71.2833,1,0,1,0,0
2,1.0,26.0,3,0,0,7.925,1,0,0,0,1
3,1.0,35.0,1,1,0,53.1,1,0,0,0,1
4,0.0,35.0,3,0,0,8.05,0,1,0,0,1


In [12]:
train = data_ohe.iloc[:train.shape[0], :]
test = data_ohe.iloc[train.shape[0]:, :]

y_train = train.loc[:, "Survived"].astype("int")
X_train = train.drop(["Survived"], axis = 1)
X_test = test.drop(["Survived"], axis = 1)
print("Train Shape : ", train.shape, " Test Shape : ", test.shape )

Train Shape :  (891, 11)  Test Shape :  (418, 11)


## 4. Model Training
### Auxiliary Function 
- Observe the prediction outcome through CrossValidation, ConfusionMatrix

In [81]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

def show_clf_result(model_name, X_train, y_train):
    scores = cross_val_score(eclf, X_train, y_train, cv=5, scoring='accuracy')
    pred = cross_val_predict(eclf, X_train, y_train, cv=5)
    cm = confusion_matrix(y_train, pred)
    report = classification_report(y_train, pred)
    
    print("%s Classification Report" %model_name, "\n")
    print("3 Cross Validation Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
    print("Confusion Matrix\n", cm, "\n")
    print('Classification Metrics\n', report)

### 4.1 XGBoost

In [82]:
from xgboost import XGBClassifier

model_xgb = XGBClassifier(random_state=seed)
model_xgb.fit(X_train, y_train)
show_clf_result(model_xgb, X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=17, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1) Classification Report 

3 Cross Validation Accuracy: 0.82 (+/- 0.02)
Confusion Matrix
 [[487  62]
 [ 99 243]] 

Classification Metrics
              precision    recall  f1-score   support

          0       0.83      0.89      0.86       549
          1       0.80      0.71      0.75       342

avg / total       0.82      0.82      0.82       891



In [83]:
submission = pd.DataFrame({"PassengerId" : test_id, 
                           "Survived" : model_xgb.predict(X_test)})
submission.to_csv("XGB_subm.csv", index = False)

### 4.2 Ensemble Major Voting with XGBoost & RandomForest

In [86]:
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier

clf1 = XGBClassifier(random_state=seed)
clf2 = RandomForestClassifier(random_state=seed)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2], voting='soft')

eclf.fit(X_train, y_train)
show_clf_result(eclf, X_train, y_train)

EnsembleVoteClassifier(clfs=[XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic',
       random_state=...estimators=10, n_jobs=1,
            oob_score=False, random_state=17, verbose=0, warm_start=False)],
            refit=True, verbose=0, voting='soft', weights=None) Classification Report 

3 Cross Validation Accuracy: 0.82 (+/- 0.02)
Confusion Matrix
 [[487  62]
 [ 99 243]] 

Classification Metrics
              precision    recall  f1-score   support

          0       0.83      0.89      0.86       549
          1       0.80      0.71      0.75       342

avg / total       0.82      0.82      0.82       891



In [87]:
submission = pd.DataFrame({"PassengerId" : test_id, 
                           "Survived" : eclf.predict(X_test)})
submission.to_csv("ENS_subm.csv", index = False)

## TPOT : Tree-based Pipeline Optimization

In [None]:
#!pip install tpot
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations = 5, population_size = 100, verbosity = 2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_titanic_pipeline.py')

Optimization Progress:  33%|███▎      | 200/600 [01:21<03:03,  2.18pipeline/s]

Generation 1 - Current best internal CV score: 0.8395372771063914


Optimization Progress:  50%|█████     | 300/600 [02:32<02:49,  1.77pipeline/s]

Generation 2 - Current best internal CV score: 0.8395372771063914


Optimization Progress:  67%|██████▋   | 400/600 [03:41<01:23,  2.38pipeline/s]

Generation 3 - Current best internal CV score: 0.8395372771063914


Optimization Progress:  83%|████████▎ | 500/600 [06:02<01:46,  1.07s/pipeline]

Generation 4 - Current best internal CV score: 0.8395372771063914


Optimization Progress:  85%|████████▌ | 511/600 [06:28<01:48,  1.22s/pipeline]