# A Walk Through Ensemble Models
*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. Please check the pdf file for more details.*

In this exercise you will:

- get to know **pandas**, a useful package for data analysis and preprocessing
- implement a **decision tree** and apply it to the Titanic dataset
- implement several **ensemble methods**, including **random forest and AdaBoost**, and apply them to the Titanic dataset

Please note that **YOU CANNOT USE ANY MACHINE LEARNING PACKAGE SUCH AS SKLEARN** for any homework, unless you are asked to.

In [11]:
# some basic imports
from scipy import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import re

%matplotlib inline

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Let's first do some data preprocessing

Here we use [pandas](https://pandas.pydata.org/) to do data preprocessing. Pandas is a very popular and handy package for data science or machine learning. You can also refer to this official guide for pandas: [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)

In [12]:
# read titanic train and test data
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')

print("train shape: {} test shape: {}".format(train.shape, test.shape))
# Showing overview of the train dataset
train.head(3)

train shape: (1047, 11) test shape: (262, 11)


Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C
1,3,0,"Holm, Mr. John Fredrik Alexander",male,43.0,0,0,C 7075,6.45,,S
2,3,0,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,,S


## Deal with missing values and transform to discrete variables

In [13]:
# copied from: https://www.kaggle.com/dmilla/introduction-to-decision-trees-titanic-dataset
full_data = [train, test]

# Feature that tells whether a passenger had a cabin on the Titanic
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Create new feature FamilySize as a combination of SibSp and Parch
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Create new feature IsAlone from FamilySize
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# Fill missing values in the Embarked column with 'S'
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Fill missing values in the Fare column with the median fare of the train set
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())

# Fill missing values in the Age column with random integers within one std of the mean
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    # Next line has been improved to avoid warning
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)

# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    
    # Mapping titles
    title_mapping = {"Mr": 1, "Master": 2, "Mrs": 3, "Miss": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Mapping Fare
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    
    # Mapping Age
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
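
The manual `.loc` binning of `Age` above can also be written with `pd.cut`. The following is a minimal sketch on a hypothetical toy Series (`ages` is not part of the worksheet data); it assumes the same bin edges as the cell above and relies on `pd.cut`'s default right-closed intervals.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit every bin edge used in the worksheet.
ages = pd.Series([5, 16, 17, 32, 48, 49, 64, 65, 80])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_edges = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index instead of an interval label.
age_bins = pd.cut(ages, bins=age_edges, labels=False)
print(age_bins.tolist())
```

Whether this reads better than the explicit `.loc` assignments is a matter of taste; the explicit version makes each threshold visible, while `pd.cut` keeps all edges in one list.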

In [14]:
drop_elements = ['Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
test  = test.drop(drop_elements, axis = 1)

In [15]:
train.head()

Unnamed: 0,Pclass,Survived,Sex,Age,Parch,Fare,Embarked,Has_Cabin,FamilySize,IsAlone,Title
0,1,1,0,1,0,3,1,1,1,1,4
1,3,0,1,2,0,0,0,0,1,1,1
2,3,0,1,2,0,1,0,0,3,0,1
3,3,0,1,1,0,0,2,0,1,1,1
4,2,0,1,2,0,1,0,0,1,1,1


One of the good things about pd.DataFrame is that it keeps the column names along with the data, which is helpful in many cases.

Another is that a pd.DataFrame can be converted to an np.array implicitly.

pandas also provides many useful data manipulation methods, though we may not use them in this homework.
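
As a small illustration of the DataFrame-to-array conversion, here is a sketch on a hypothetical toy frame (`df` and its values are made up; only the column names mirror the worksheet's features). `np.asarray` triggers the implicit conversion, while `DataFrame.to_numpy` is the explicit equivalent.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the actual Titanic data.
df = pd.DataFrame({"Pclass": [1, 3, 3], "Sex": [0, 1, 1], "Age": [1, 2, 2]})

arr_implicit = np.asarray(df)   # implicit conversion through NumPy's array protocol
arr_explicit = df.to_numpy()    # explicit conversion
feature_names = list(df.columns)  # column names are lost in the array, so keep them separately

print(arr_implicit.shape, feature_names)
```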

In [16]:
X = train.drop(['Survived'], axis=1)
y = train["Survived"]
X_test = test.drop(['Survived'], axis=1)
y_test = test["Survived"]
print("train: {}, test: {}".format(X.shape, X_test.shape))

train: (1047, 10), test: (262, 10)


In [17]:
def accuracy(y_gt, y_pred):
    return np.sum(y_gt == y_pred) / y_gt.shape[0]

In [18]:
print("Survived: {:.4f}, Not Survived: {:.4f}".format(y.sum() / len(y), 1 - y.sum() / len(y)))

Survived: 0.3878, Not Survived: 0.6122


In [109]:
print (X[0:1])

   Pclass  Sex  Age  Parch  Fare  Embarked  Has_Cabin  FamilySize  IsAlone  \
0       1    0    1      0     3         1          1           1        1   

   Title  
0      4  


## Decision Tree
Now it's your turn to do some real coding. Please implement the decision tree model in **decision_tree.py**. The PDF file provides some hints for this part.
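
For intuition about the `'entropy'` criterion, the sketch below computes Shannon entropy and the information gain of a discrete feature with NumPy only. It is an illustration under simple assumptions (unweighted samples, one branch per distinct feature value), and the function names `entropy` and `information_gain` are hypothetical; the interface expected in **decision_tree.py** is defined by the PDF, not by this sketch.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(feature, labels):
    """Entropy reduction from splitting `labels` by each distinct value of `feature`."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted_child_entropy = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is the fraction of samples that fall into this branch
        weighted_child_entropy += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted_child_entropy

# A feature that separates the classes perfectly vs. one that carries no information
print(information_gain([0, 0, 1, 1], [0, 0, 1, 1]))
print(information_gain([0, 1, 0, 1], [0, 0, 1, 1]))
```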

In [151]:
def calc_height(d):
    now = 0
    if type(d) == dict:
        for t in list(d.values())[0].values():
            now = max(now, calc_height(t))
    return now + 1

from decision_tree import DecisionTree

dt = DecisionTree(criterion='entropy', max_depth=10, min_samples_leaf=1, sample_feature=False)
dt.fit(X, y)

# Plot the decision tree to get an intuition about how it makes decisions
#plt.figure(figsize=(10, 5))
#dt.show()

y_train_pred = dt.predict(X)
print(calc_height(dt._tree))
print("Accuracy on train set: {}".format(accuracy(y, y_train_pred)))
print("Accuracy on test set: {}".format(accuracy(y_test, dt.predict(X_test))))

10
Accuracy on train set: 0.8787010506208214
Accuracy on test set: 0.7900763358778626


In [104]:
# TODO: Train the best DecisionTree (best val accuracy) that you can. You should choose some
# hyper-parameters such as criterion, max_depth, and min_samples_leaf
# according to the cross-validation result.
# To reduce difficulty, you can use KFold here.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=2020)

best_acc, best_min_samples_leaf, best_max_depth = 0, 0, 0
best_method = ""
for now_method in ['entropy', 'infogain_ratio', 'gini']:
    for now_min_samples_leaf in range(1, 10):
        for now_max_depth in range(5, 11):
            dt = DecisionTree(criterion=now_method, max_depth=now_max_depth, min_samples_leaf=now_min_samples_leaf, sample_feature=False)
            ave_acc = 0
            for train_indice, valid_indice in kf.split(X, y):
                # kf.split yields positional indices, so index with .iloc
                X_train_fold, y_train_fold = X.iloc[train_indice], y.iloc[train_indice]
                X_val_fold, y_val_fold = X.iloc[valid_indice], y.iloc[valid_indice]
                dt.fit(X_train_fold, y_train_fold)
                y_train_pred = dt.predict(X_train_fold)
                y_valid_pred = dt.predict(X_val_fold)
                # Selection score: 70% validation accuracy + 30% training accuracy
                # (a blend, not the pure validation accuracy)
                ave_acc += accuracy(y_val_fold, y_valid_pred) * 0.7 + accuracy(y_train_fold, y_train_pred) * 0.3
            ave_acc /= kf.get_n_splits()
            print(now_method, now_min_samples_leaf, now_max_depth, ave_acc)
            if ave_acc > best_acc:
                best_acc = ave_acc
                best_method = now_method
                best_min_samples_leaf = now_min_samples_leaf
                best_max_depth = now_max_depth
    
    
# begin answer
print (best_acc, best_method, best_min_samples_leaf, best_max_depth)
# end answer

entropy 1 5 0.7992937733392289
entropy 1 6 0.799393209786271
entropy 1 7 0.7983865402979309
entropy 1 8 0.8013493436089073
entropy 1 9 0.7966667279789232
entropy 1 10 0.7993397742309168
entropy 2 5 0.8006302964652257
entropy 2 6 0.8014027791642615
entropy 2 7 0.8017326328019181
entropy 2 8 0.7993429640235803
entropy 2 9 0.7993365844382534
entropy 2 10 0.7986762973569136
entropy 3 5 0.7988859410708992
entropy 3 6 0.8039949801715425
entropy 3 7 0.7995988462654102
entropy 3 8 0.7977485094838969
entropy 3 9 0.7990754632319031
entropy 3 10 0.8004119863579
entropy 4 5 0.7996558803345193
entropy 4 6 0.8042667448254687
entropy 4 7 0.8024004590806377
entropy 4 8 0.8045565018844514
entropy 4 9 0.8052263583437815
entropy 4 10 0.8018738862544673
entropy 5 5 0.7980019528240249
entropy 5 6 0.8051488739766676
entropy 5 7 0.8045474267700199
entropy 5 8 0.802619025815366
entropy 5 9 0.8026158360227026
entropy 5 10 0.8046190258153659
entropy 6 5 0.7974416630045076
entropy 6 6 0.8045042259411502
entropy 

In [134]:
# report the accuracy on test set
dt = DecisionTree(criterion=best_method, max_depth=best_max_depth, min_samples_leaf=best_min_samples_leaf, sample_feature=False)
dt.fit(X, y)
print("Accuracy on train set: {}".format(accuracy(y, dt.predict(X))))
print("Accuracy on test set: {}".format(accuracy(y_test, dt.predict(X_test))))

Accuracy on train set: 0.833810888252149
Accuracy on test set: 0.8091603053435115


In [None]:
dt = DecisionTree(criterion='infogain_ratio', max_depth=8, min_samples_leaf=1, sample_feature=False)

## Random Forest
Please implement the random forest model in **random_forest.py**. The PDF file provides some hints for this part.
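
The two ingredients a random forest adds on top of single trees are bootstrap resampling of the training rows and majority voting over the trees' predictions. The sketch below shows both with NumPy only. The helper names `bootstrap_indices` and `majority_vote` are hypothetical, labels are assumed to be 0/1, and ties are resolved toward class 1; the PDF may specify different conventions for **random_forest.py**.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row positions with replacement."""
    return rng.integers(0, n_samples, size=n_samples)

def majority_vote(predictions):
    """Combine an (n_estimators, n_samples) array of 0/1 predictions column-wise."""
    predictions = np.asarray(predictions)
    # A sample is labelled 1 when at least half of the estimators vote 1.
    return (predictions.mean(axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(2020)
idx = bootstrap_indices(10, rng)   # positions to pass to X.iloc[idx], y.iloc[idx]

# Three hypothetical trees voting on three samples
votes = majority_vote([[1, 0, 1],
                       [1, 1, 0],
                       [0, 0, 1]])
print(idx, votes)
```

Each tree would be fitted on its own bootstrap sample; with `sample_feature=True` the base learner presumably also samples a feature subset at each split, which is what decorrelates the trees.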

In [129]:
from random_forest import RandomForest

base_learner = DecisionTree(criterion='entropy', max_depth=5, min_samples_leaf=5, sample_feature=True)
rf = RandomForest(base_learner=base_learner, n_estimator=100, seed=2020)
rf.fit(X, y)

y_train_pred = rf.predict(X)

print("Accuracy on train set: {}".format(accuracy(y, y_train_pred)))
print("Accuracy on test set: {}".format(accuracy(y_test, rf.predict(X_test))))

Accuracy on train set: 0.8404966571155683
Accuracy on test set: 0.816793893129771


In [165]:
from random_forest import RandomForest

base_learner = DecisionTree(criterion='entropy', max_depth=8, min_samples_leaf=5, sample_feature=True)
rf = RandomForest(base_learner=base_learner, n_estimator=100, seed=2020)
rf.fit(X, y)

y_train_pred = rf.predict(X)

print("Accuracy on train set: {}".format(accuracy(y, y_train_pred)))
print("Accuracy on test set: {}".format(accuracy(y_test, rf.predict(X_test))))

Accuracy on train set: 0.8691499522445081
Accuracy on test set: 0.8015267175572519


In [170]:
# TODO: Train the best RandomForest that you can. You should choose some
# hyper-parameters such as max_depth, and min_samples_leaf
# according to the cross-validation result.

from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=2020)

best_acc, best_min_samples_leaf, best_max_depth = 0, 0, 0
best_method = ""
for now_method in ['infogain_ratio', 'gini', 'entropy']:
    for now_min_samples_leaf in range(1, 10, 2):
        for now_max_depth in range(3, 6, 1):
            base_learner = DecisionTree(criterion=now_method, max_depth=now_max_depth, min_samples_leaf=now_min_samples_leaf, sample_feature=True)
            rf = RandomForest(base_learner=base_learner, n_estimator=100, seed=2020)
            ave_valid_acc, ave_train_acc, ave_com_acc = 0, 0, 0
            for train_indice, valid_indice in kf.split(X, y):
                # kf.split yields positional indices, so index with .iloc
                X_train_fold, y_train_fold = X.iloc[train_indice], y.iloc[train_indice]
                X_val_fold, y_val_fold = X.iloc[valid_indice], y.iloc[valid_indice]
                rf.fit(X_train_fold, y_train_fold)
                y_train_pred = rf.predict(X_train_fold)
                y_valid_pred = rf.predict(X_val_fold)
                train_acc = accuracy(y_train_fold, y_train_pred)
                valid_acc = accuracy(y_val_fold, y_valid_pred)
                ave_train_acc += train_acc
                ave_valid_acc += valid_acc
                # Selection score: 70% validation accuracy + 30% training accuracy
                ave_com_acc += valid_acc * 0.7 + train_acc * 0.3

            n_splits = kf.get_n_splits()
            ave_train_acc /= n_splits
            ave_valid_acc /= n_splits
            ave_com_acc /= n_splits
            print("%16s %d %d %.4f%% %.4f%% %.4f%%" % (now_method, now_min_samples_leaf, now_max_depth, ave_com_acc*100, ave_train_acc*100, ave_valid_acc*100))
            if ave_com_acc > best_acc:
                best_acc = ave_com_acc
                best_method = now_method
                best_min_samples_leaf = now_min_samples_leaf
                best_max_depth = now_max_depth
    
    
# begin answer
print (best_acc, best_method, best_min_samples_leaf, best_max_depth)
# end answer

  infogain_ratio 1 3 79.0561% 79.4415% 78.8909%
  infogain_ratio 1 4 80.3463% 81.0650% 80.0383%
  infogain_ratio 1 5 81.1680% 82.6885% 80.5163%
  infogain_ratio 3 3 79.2239% 79.7756% 78.9875%
  infogain_ratio 3 4 80.4344% 81.1365% 80.1335%
  infogain_ratio 3 5 81.3780% 82.4975% 80.8982%
  infogain_ratio 5 3 78.8731% 79.2740% 78.7013%
  infogain_ratio 5 4 80.3886% 81.2083% 80.0374%
  infogain_ratio 5 5 80.8788% 82.6168% 80.1340%
  infogain_ratio 7 3 79.1757% 79.3937% 79.0823%
  infogain_ratio 7 4 80.7734% 81.3755% 80.5154%
  infogain_ratio 7 5 80.9483% 82.4020% 80.3254%
  infogain_ratio 9 3 78.8872% 79.3221% 78.7008%
  infogain_ratio 9 4 80.2936% 81.1126% 79.9426%
  infogain_ratio 9 5 81.0966% 82.4497% 80.5167%
            gini 1 3 81.6033% 82.1395% 81.3734%
            gini 1 4 81.7639% 83.7870% 80.8968%
            gini 1 5 81.8849% 85.5301% 80.3226%
            gini 3 3 81.7229% 82.0918% 81.5648%
            gini 3 4 82.0078% 83.9302% 81.1839%
            gini 3 5 81.9693% 85.3629% 8

In [None]:
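# report the accuracy on test set
# k=100
# begin answer
# Rebuild the forest with the hyper-parameters selected by cross-validation;
# otherwise `rf` is just the last configuration tried in the grid search.
base_learner = DecisionTree(criterion=best_method, max_depth=best_max_depth, min_samples_leaf=best_min_samples_leaf, sample_feature=True)
rf = RandomForest(base_learner=base_learner, n_estimator=100, seed=2020)
# end answer
rf.fit(X, y)
print("Accuracy on train set: {}".format(accuracy(y, rf.predict(X))))
print("Accuracy on test set: {}".format(accuracy(y_test, rf.predict(X_test))))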

## Adaboost
Please implement the adaboost model in **adaboost.py**. The PDF file provides some hints for this part.
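
As a reference point for the boosting step, the sketch below performs one round of the classic discrete AdaBoost weight update. It assumes labels in {-1, +1} (the worksheet's 0/1 labels could be mapped with `2 * y - 1`), and the function name `adaboost_round` is hypothetical; the variant and interface required in **adaboost.py** are defined by the PDF.

```python
import numpy as np

def adaboost_round(y_true, y_pred, weights):
    """One weight update of discrete AdaBoost with labels in {-1, +1}.

    Returns the estimator weight alpha and the renormalised sample weights.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)   # weighted error rate
    err = np.clip(err, 1e-10, 1 - 1e-10)            # guard against log(0) and division by zero
    alpha = 0.5 * np.log((1 - err) / err)
    # y_true * y_pred is +1 for correct and -1 for wrong predictions,
    # so misclassified samples are up-weighted and correct ones down-weighted.
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_weights / new_weights.sum()

# Hypothetical round: a stump that misclassifies one of four equally weighted samples
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, -1])
alpha, w = adaboost_round(y_true, y_pred, np.full(4, 0.25))
print(alpha, w)
```

After this round the single misclassified sample carries half of the total weight, which is what pushes the next weak learner to focus on it.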

In [None]:
from adaboost import Adaboost

base_learner = DecisionTree(criterion='entropy', max_depth=1, min_samples_leaf=1, sample_feature=False)
ada = Adaboost(base_learner=base_learner, n_estimator=50, seed=2020)
ada.fit(X, y)

y_train_pred = ada.predict(X)

print("Accuracy on train set: {}".format(accuracy(y, y_train_pred)))

In [None]:
# TODO: Train the best Adaboost that you can. You should choose some 
# hyper-parameters such as max_depth, and min_samples_leaf
# according to the cross-validation result.
# begin answer
# end answer

In [None]:
# report the accuracy on test set
# begin answer
# end answer
ada.fit(X, y)
print("Accuracy on train set: {}".format(accuracy(y, ada.predict(X))))
print("Accuracy on test set: {}".format(accuracy(y_test, ada.predict(X_test))))