# Use a Stacked Ensemble to predict Titanic passenger survival  

This notebook contains steps and code to demonstrate how to build a Stacked Ensemble with scikit-learn. The model will address the Titanic passenger survival dataset available at [Kaggle](https://www.kaggle.com/c/titanic/data).

Some familiarity with Python is helpful. This notebook uses Python 3.7 and was tested in Watson Studio.

## Learning goals

The learning goals of this notebook are:

-  Import data
-  Train and test Level 1 models
-  Train and test a simple Stacked Ensemble
-  Train and test a K Fold Stacked Ensemble

## Contents

This notebook contains the following parts:

1.	[Setup](#setup)
2.	[Train and test Level 1 models](#train)
3.	[Train and test a simple Stacked Ensemble](#simple)
4.	[Train and test a K Fold Stacked Ensemble](#kfold)
5.	[Summary and next steps](#summary)

<a id="setup"></a>
## 1. Setup 
Before you use the sample code in this notebook, you must perform the following setup tasks:

- Download the Titanic dataset. Note that the following feature engineering modifications have been made to the dataset available from [Kaggle](https://www.kaggle.com/c/titanic/data):
   - Extracted the title from `Name`
   - Imputed missing values in `Age`, `Embarked` and `Fare`
   - Mapped titles to a fixed set
   - Replaced `Age` with `AgeGroup`
   - Added column `HadCabin` to indicate whether a passenger had an assigned cabin
   - Removed columns `Cabin`, `Name`, `Age`, `PassengerId` and `Pclass`
   
- Import required packages

- Split data into test and training sets   
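
The feature engineering listed under the first bullet above could look roughly like the following. This is only a minimal sketch on a made-up three-row frame: the title mapping, the median imputation and the `AgeGroup` bin edges are all assumptions for illustration, and the cleaned file used in this notebook may have been produced differently.

```python
import pandas as pd

# Tiny stand-in for a few raw Kaggle columns (values are made up for illustration)
raw = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina', 'Smith, Dr. John'],
    'Age': [22.0, None, 54.0],
    'Cabin': [None, 'C85', None],
})

# Extract the title: the text between the comma and the first period
raw['Title'] = raw['Name'].str.extract(r',\s*([^\.]+)\.', expand=False).str.strip()

# Map titles to a fixed set (this particular set is an assumption)
common_titles = ['Mr', 'Mrs', 'Miss', 'Master']
raw['Title'] = raw['Title'].where(raw['Title'].isin(common_titles), 'Other')

# Impute missing ages with the median (assumed strategy), then bin into groups
raw['Age'] = raw['Age'].fillna(raw['Age'].median())
raw['AgeGroup'] = pd.cut(raw['Age'], bins=[0, 12, 18, 60, 120], labels=False)

# Flag whether a cabin was recorded for the passenger
raw['HadCabin'] = raw['Cabin'].notna().astype(int)

# Drop the columns that are no longer needed
cleaned = raw.drop(columns=['Cabin', 'Name', 'Age', 'PassengerId'])
```

In this sketch `Title` stays a string; the models used later need numeric input, so a real pipeline would also have to encode it.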
 
### 1.1 Required imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import  LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

### 1.2 Download data
To save time, we've already applied the modifications noted above to the original dataset.

In [None]:
!wget https://cp4d-workshop-datasets.s3.us-south.cloud-object-storage.appdomain.cloud/titanic_train_cleaned_v2.csv  --output-document=titanic_train_cleaned_v2.csv

### 1.3 Split data into test and training sets

Kaggle provides both a training set and a test set for the competition, but the test set is unlabeled, so we'll work with just the training set.

We use 2 splits:
1. Split the original training set into training and test data
2. Split the new training set in half for the implementation of a simple Stacked Ensemble

In [None]:
df_data = pd.read_csv('titanic_train_cleaned_v2.csv')

X = df_data.drop(columns=['Survived'])
y = df_data['Survived']

# First split on entire training set 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80,test_size=0.20, random_state=100)

# Second split to be used for simple Stacked Ensemble
X_train_first_half, X_train_second_half, y_train_first_half, y_train_second_half = train_test_split(X_train, y_train, train_size=0.50,test_size=0.50, random_state=100)

# Set up a ColumnTransformer to standardize selected columns, i.e. subtract the mean from each value and scale to unit variance
cols_to_scale = ['Fare']
scaler = StandardScaler()
preprocessor = ColumnTransformer(transformers=[('scaler', scaler, cols_to_scale)], remainder='passthrough')

X_train.head()
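
To make the effect of the `ColumnTransformer` more concrete, here is a small sketch on a toy frame (the data and the `toy_*` names are made up for illustration). It shows that the scaled column ends up with zero mean and unit variance, and that with `remainder='passthrough'` the untouched columns are appended after the transformed ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data: one column to scale, one column to pass through unchanged
toy = pd.DataFrame({'HadCabin': [0, 1, 0, 1], 'Fare': [10.0, 20.0, 30.0, 40.0]})

toy_preprocessor = ColumnTransformer(
    transformers=[('scaler', StandardScaler(), ['Fare'])],
    remainder='passthrough')

# The result is a NumPy array: column 0 is the scaled Fare,
# column 1 is HadCabin passed through as-is
toy_scaled = toy_preprocessor.fit_transform(toy)

print(toy_scaled[:, 0].mean(), toy_scaled[:, 0].std())
```

Note that the column order changes: `Fare` moves to the front even though it was the second column in the input.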

### 1.4 Create Dataframes for the Simple Stacked Ensemble 

We'll build the simple ensemble from scratch. The meta model needs the predictions from the Level 1 models as its input, so these DataFrames will be used to store those predictions.

In [None]:
# For training the meta model
df_simple_stacked_ensemble_train = pd.DataFrame()

# For testing the meta model 
df_simple_stacked_ensemble_test = pd.DataFrame()

<a id="train"></a>
## 2. Train and test Level 1 models

In this section, you will train and test the Level 1 models. You will use two Level 1 models:

1. Gradient Boosting

2. Random Forest

### 2.1 Gradient Boosting Level 1 model

Train the model on the first half of the training data and generate the predicted probabilities for the second half.

In [None]:
# Create model and pipeline 
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_pipeline = Pipeline([('preprocessor', preprocessor), ('gb', gb_classifier )])

# Train model on first half of training data
gb_model = gb_pipeline.fit(X_train_first_half, y_train_first_half)

# Generate probabilities for second half 
y_train_proba = gb_model.predict_proba(X_train_second_half)[:,1]
df_simple_stacked_ensemble_train['gb'] = y_train_proba.tolist()

Retrain on all training data and generate probabilities to test the simple Stacked Ensemble

In [None]:
# Retrain model on all training data
gb_model = gb_pipeline.fit(X_train, y_train)

# Generate probabilities for test data
y_test_proba = gb_model.predict_proba(X_test)[:,1]
df_simple_stacked_ensemble_test['gb'] = y_test_proba.tolist()  

Evaluate model by itself to compare later

In [None]:
y_pred = gb_model.predict(X_test)
gb_accuracy_score = accuracy_score(y_test, y_pred)
print('%s %.3f ' % ('Gradient Boosting accuracy is ', gb_accuracy_score))
print('Confusion Matrix')
confusion_matrix(y_test, y_pred)
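
As a reading aid for the confusion matrix, here is a small sketch with made-up labels (the `*_demo` names are not part of the notebook). In scikit-learn, rows correspond to the true classes and columns to the predicted classes, so for a binary problem the matrix can be unpacked into true negatives, false positives, false negatives and true positives.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the share of correct predictions: (tn + tp) / total
accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)
print(cm_demo)
print(tn, fp, fn, tp, accuracy_demo)
```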

In [None]:
# Save results
accuracy_scores, model_names = list(), list()
accuracy_scores.append(gb_accuracy_score)
model_names.append('Gradient Boosting')

### 2.2 Random Forest Level 1 model

Train the model on the first half of the training data and generate the predicted probabilities for the second half.

In [None]:
# Create model and pipeline 
rf_classifier = RandomForestClassifier(random_state=1)
rf_pipeline = Pipeline([('preprocessor', preprocessor), ('rf', rf_classifier )])

# Train model on first half of training data
rf_model = rf_pipeline.fit(X_train_first_half, y_train_first_half)

# Generate probabilities  for second half 
y_train_proba = rf_model.predict_proba(X_train_second_half)[:,1]
df_simple_stacked_ensemble_train['rf'] = y_train_proba.tolist()

Retrain on all training data and generate probabilities to test the simple Stacked Ensemble

In [None]:
# Retrain model on all training data
rf_model = rf_pipeline.fit(X_train, y_train)

# Generate probabilities for test data
y_test_proba = rf_model.predict_proba(X_test)[:,1]
df_simple_stacked_ensemble_test['rf'] = y_test_proba.tolist()  

Evaluate model by itself to compare later

In [None]:
y_pred = rf_model.predict(X_test)
rf_accuracy_score = accuracy_score(y_test, y_pred)
print('%s %.3f ' % ('Random Forest accuracy is ', rf_accuracy_score))
print('Confusion Matrix')
confusion_matrix(y_test, y_pred)

In [None]:
# Save results
accuracy_scores.append(rf_accuracy_score)
model_names.append('Random Forest')

<a id="simple"></a>
## 3. Simple Stacked Ensemble

This meta model is trained using the prediction probabilities from the Level 1 models. We will use Logistic Regression for our meta model.
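
The whole idea fits in a few lines. The following is a compact, self-contained sketch of the same holdout approach on synthetic data from `make_classification` rather than the Titanic features; the `*_demo` names and sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split into two halves
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, train_size=0.5, random_state=0)

level1_demo = {
    'gb': GradientBoostingClassifier(random_state=1),
    'rf': RandomForestClassifier(random_state=1),
}

# Fit each Level 1 model on half A and predict probabilities for the unseen half B.
# Each model contributes one column: its probability of the positive class.
meta_features_demo = np.column_stack(
    [model.fit(X_a, y_a).predict_proba(X_b)[:, 1] for model in level1_demo.values()])

# The meta model learns how to combine the Level 1 probabilities
meta_model_demo = LogisticRegression().fit(meta_features_demo, y_b)
print(meta_model_demo.coef_)
```

Because the Level 1 models never saw half B during training, the meta model is trained on probabilities that resemble what the Level 1 models will produce for new data.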

In [None]:
# Check training probabilities generated by Level 1 models. Should see a column for each Level 1 model. 
df_simple_stacked_ensemble_train.head()

In [None]:
# Check testing probabilities generated by Level 1 models. Should see a column for each Level 1 model. 
df_simple_stacked_ensemble_test.head()

In [None]:
# Create meta model
ensemble1_meta_model = LogisticRegression()

# Train model using prediction probabilities of Level 1 models
ensemble1_model = ensemble1_meta_model.fit(df_simple_stacked_ensemble_train, y_train_second_half)

Evaluate meta model using test data

In [None]:
y_pred = ensemble1_model.predict(df_simple_stacked_ensemble_test)
ensemble1_accuracy_score = accuracy_score(y_test, y_pred)
print('%s %.3f ' % ('Ensemble Simple Split accuracy is ', ensemble1_accuracy_score))
print('Confusion Matrix')
confusion_matrix(y_test, y_pred)

In [None]:
# Save results
accuracy_scores.append(ensemble1_accuracy_score)
model_names.append('Simple Stacked Ensemble')

<a id="kfold"></a>
## 4. K fold Stacked Ensemble

This meta model is trained using the prediction probabilities from the Level 1 models. We will use Logistic Regression for our meta model. We will use the scikit-learn class `StackingClassifier` to build this.
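
Conceptually, the K fold variant replaces the single half/half split with out-of-fold predictions, so every training row gets a Level 1 probability from a model that did not see that row. The sketch below imitates this by hand with `cross_val_predict` on synthetic data. It is an approximation for illustration, not a copy of the `StackingClassifier` internals, and the `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
folds_demo = StratifiedKFold(n_splits=5)
level1_demo = [GradientBoostingClassifier(random_state=1),
               RandomForestClassifier(random_state=1)]

# Out-of-fold probabilities: each row is predicted by a model trained on the other folds
oof_demo = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=folds_demo, method='predict_proba')[:, 1]
    for model in level1_demo])

# The meta model is trained on the out-of-fold probabilities for all rows
meta_model_demo = LogisticRegression().fit(oof_demo, y_demo)

# For prediction time, the Level 1 models are refit on all of the training data
for model in level1_demo:
    model.fit(X_demo, y_demo)
```

Compared with the simple split, the meta model here gets to train on predictions for every row instead of only half of them.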

Build and train the StackingClassifier

In [None]:
# Setup (untrained) Level 1 models
ensemble2_level1_models = list()
ensemble2_level1_models.append(('Gradient Boosting', GradientBoostingClassifier(random_state=1)))
ensemble2_level1_models.append(('Random Forest', RandomForestClassifier(random_state=1)))

In [None]:
# Create meta model
ensemble2_meta_model = LogisticRegression()

# Setup K fold (we'll use 10 folds)
ensemble2_cross_validation = StratifiedKFold(n_splits=10)
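
`StratifiedKFold` keeps the class proportions roughly equal across folds, which matters when one class is less common than the other. A small sketch with made-up labels (60 negatives and 40 positives; the `*_toy` names are assumptions) illustrates this:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 60 of class 0 followed by 40 of class 1
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((100, 1))  # features are irrelevant for the split itself

skf_toy = StratifiedKFold(n_splits=10)

# Share of positives in each validation fold
fold_positive_rates = [y_toy[val_idx].mean() for _, val_idx in skf_toy.split(X_toy, y_toy)]
print(fold_positive_rates)
```

Each validation fold holds 10 rows with the same 40% share of positives as the full set.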

In [None]:
# Create StackingClassifier and pipeline
ensemble2 = StackingClassifier(estimators=ensemble2_level1_models, stack_method='predict_proba', final_estimator=ensemble2_meta_model, cv=ensemble2_cross_validation)
ensemble2_pipeline = Pipeline([('preprocessor', preprocessor), ('ensemble', ensemble2)])

# Train 
ensemble2_model = ensemble2_pipeline.fit(X_train, y_train)

Evaluate meta model using test data

In [None]:
y_pred = ensemble2_model.predict(X_test)
ensemble2_accuracy_score = accuracy_score(y_test, y_pred)
print('%s %.3f ' % ('Ensemble K Fold accuracy is ', ensemble2_accuracy_score))
print('Confusion Matrix')
confusion_matrix(y_test, y_pred)

In [None]:
# Save results
accuracy_scores.append(ensemble2_accuracy_score)
model_names.append('K fold Stacked Ensemble')

<a id="summary"></a>
## 5. Summary

Let's visualize the results to compare them and wrap things up.

### 5.1 Utilities to help with drawing chart

In [None]:
# Function to sort 2 lists based on the sort order of the first list
def sort_lists(list1, list2):
    # Pair up the items, sort the pairs (ties in list1 are broken by list2), then unzip
    sorted_pairs = sorted(zip(list1, list2))
    list1, list2 = [list(items) for items in zip(*sorted_pairs)]

    return list1, list2

# Adds y value on top of bar in bar chart
# Source: http://composition.al/blog/2015/11/29/a-better-way-to-add-labels-to-bar-charts-with-matplotlib/
def autolabel(rects, ax):
    # Get y-axis height to calculate label position from.
    (y_bottom, y_top) = ax.get_ylim()
    y_height = y_top - y_bottom

    for rect in rects:
        height = rect.get_height()
        label_position = height + (y_height * 0.01)

        ax.text(rect.get_x() + rect.get_width()/2., label_position,
                '%.3f' % float(height),
                ha='center', va='bottom')



### 5.2 Draw chart

In [None]:
sorted_accuracy_scores, sorted_model_names = sort_lists(accuracy_scores, model_names)
fig, ax = plt.subplots(figsize=(12, 6))
rects = ax.bar(sorted_model_names, sorted_accuracy_scores, color='green')
ax.set_title('Accuracy listed by model')
ax.set_xlabel('Model type')
ax.set_ylabel('Accuracy on test data')
# Rotate the model names so they do not overlap
ax.tick_params(axis='x', labelrotation=45)
# Hide the y-axis; the accuracy is printed on top of each bar instead
ax.axes.yaxis.set_visible(False)
ax.set_yscale('log')

autolabel(rects, ax)



### 5.3 Wrap up

Congrats! You've successfully built two Stacked Ensemble models!

If you would like to keep working on this dataset, we suggest you try your hand at improving the model and entering it in the ongoing [Kaggle Competition](https://www.kaggle.com/c/titanic).

**Note**: To generate a Kaggle entry you will need to download the Kaggle test dataset (it's unlabeled) and generate predictions for it. Here's an example:

```

# Read in Kaggle test data set
df_test =  pd.read_csv('test.csv')

# Remove PassengerId before using test data and   modify/enhance as needed
X_test = df_test.drop(columns=['PassengerId'])
...


# Code to build your ensemble model goes here
... 

# Generate predictions
y_pred = ensemble.predict(X_test_modified)

# Create Kaggle entry file
kaggle_entry = pd.DataFrame({'PassengerId': df_test['PassengerId'].values,'Survived': y_pred})
kaggle_entry.to_csv('my_titanic_entry_v1.csv', index=False)

```

