# Lecture 14: Tree3 Part 2
Ensemble Techniques

We will use the [Titanic: Machine learning from Disaster](https://www.kaggle.com/competitions/titanic/data?select=train.csv) dataset again.

Recall that the training data provided contains 891 records with the following attributes:
* Survived: 0 = No; 1 = Yes
* Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* Name: Passenger name 
* Gender: (female; male)
* Age: Passenger age
* SibSp: Number of Siblings/Spouses Aboard
* Parch: Number of Parents/Children Aboard
* Ticket: Ticket Number
* Fare: Passenger Fare
* Cabin: Cabin
* Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)



We will apply similar pre-processing steps as before, but this time we add more variables: the *Number of Siblings/Spouses Aboard*, 
*Number of Parents/Children Aboard*, *Passenger Fare*, and *Port of Embarkation*. 

The *Port of Embarkation* feature needs to be transformed into dummy variables that encode its three categories:

* Cherbourg: C
* Queenstown: Q
* Southampton: S
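
As a small illustration of how dummy encoding works, the sketch below applies `pd.get_dummies` to a hypothetical toy column of port codes. The five values are made up for illustration and are not rows from the Titanic file. Each category becomes its own indicator column, and exactly one indicator is set per row.

```python
import pandas as pd

# Hypothetical toy data: five made-up port codes, not rows from the Titanic file
ports = pd.Series(['S', 'C', 'Q', 'S', 'C'], name='Embarked')

# One indicator column per category; the columns come out in sorted order
dummies = pd.get_dummies(ports)
print(dummies.columns.tolist())   # ['C', 'Q', 'S']

# Cast to int so the indicators print as 0/1 regardless of the pandas version
print(dummies.astype(int))
```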

For completeness, let us detail the steps again: 

In [None]:
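import pandas as pd

# Load the training data
titanic = pd.read_csv('titanic_train.csv')
# Drop the columns we will not use, then remove rows with missing values
titanic = titanic.drop(['Ticket', 'Cabin'], axis=1)
titanic = titanic.dropna()

# Dummy-encode Gender (adds the columns 'female' and 'male')
titanic = pd.concat([titanic, pd.get_dummies(titanic['Gender'])], axis=1)

# New step: dummy-encode Embarked (adds the columns 'C', 'Q' and 'S')
titanic = pd.concat([titanic, pd.get_dummies(titanic['Embarked'])], axis=1)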

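Split the dataset using the new features.

Note that *Port of Embarkation* has three categories rather than two, so, unlike *Gender* (where the single `female` column suffices), we include all three dummy variables `S`, `C`, and `Q`.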

In [None]:
import sklearn.model_selection as ms

X = titanic[['Pclass', 'Age', 'female', 'SibSp', 'Parch', 'Fare', 'S', 'C', 'Q']]
Y = titanic['Survived']

# Hold out 20% of the records for testing; fix the seed for reproducibility
XTrain, XTest, YTrain, YTest = ms.train_test_split(X, Y, test_size=0.2, random_state=42)

We will try out the ensemble methods we have learned, as implemented in sklearn:
* **Bagging**: BaggingClassifier()
* **Boosting**: AdaBoostClassifier()
* **Random Forests**: RandomForestClassifier()
* **Extremely Randomized Trees**: ExtraTreesClassifier()

Some important parameters these ensemble techniques take: 
* **base_estimator**: sets the base classifier used to build the ensemble. By default it is a decision tree, i.e., the *DecisionTreeClassifier* we have discussed. This parameter applies to the bagging and boosting classes; scikit-learn 1.2 and later rename it to **estimator**.
* **n_estimators**: sets the total number of base classifiers in the ensemble.
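
To make these two parameters concrete, here is a minimal sketch of a bagging ensemble with an explicitly chosen base classifier. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the Titanic data), and the names `X_demo`, `y_demo`, and `bag` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration; not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# The base classifier is passed as the first positional argument, which avoids
# depending on whether the installed scikit-learn version names that parameter
# base_estimator or estimator
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                        n_estimators=25, random_state=0)
bag.fit(X_demo, y_demo)

# After fitting, estimators_ holds one fitted base classifier per ensemble member
print(len(bag.estimators_))   # 25
```

Changing `n_estimators` changes the length of `estimators_` accordingly, and swapping the first argument changes the type of model being bagged.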

Let's load the modules, and run them with 100 base models.

In [None]:
from sklearn.metrics import roc_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

n_estimators = 100

# A single shallow tree serves as the baseline, followed by the four ensembles
models = [
    DecisionTreeClassifier(max_depth=3),
    BaggingClassifier(n_estimators=n_estimators),
    RandomForestClassifier(n_estimators=n_estimators),
    ExtraTreesClassifier(n_estimators=n_estimators),
    AdaBoostClassifier(n_estimators=n_estimators),
]

# Create a list that provides an easy way to identify each of the models later.
model_titles = ['DecisionTree', 'Bagging', 'RandomForest', 'ExtraTrees', 'AdaBoost']

# Create five empty lists to store the results computed below
# Why is this a better way to save test results?
surv_preds, surv_probs, scores, fprs, tprs = ([] for _ in range(5))


for i, model in enumerate(models):
    print("Fitting {0}".format(model_titles[i]))
    model.fit(XTrain, YTrain)
    surv_preds.append(model.predict(XTest))        # predicted class labels
    surv_probs.append(model.predict_proba(XTest))  # predicted class probabilities
    scores.append(model.score(XTest, YTest))       # accuracy on the test set
    # Get false and true positive rates from the predicted probability of survival
    fpr, tpr, _ = roc_curve(YTest, surv_probs[i][:, 1])
    fprs.append(fpr)
    tprs.append(tpr)

The following for loop prints the score (accuracy) of each method:

In [None]:
for i, score in enumerate(scores):
    print("{0} with score {1:0.2f}".format(model_titles[i], score))
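
For classifiers, `score` returns the accuracy, i.e., the fraction of test records whose predicted label matches the true label. The sketch below checks this by hand on synthetic stand-in data (an assumption for illustration; the names `X_demo`, `y_demo`, and `tree` are hypothetical and not part of the lecture code).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration; not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Accuracy computed by hand: the share of correct predictions on the test set
manual_accuracy = np.mean(tree.predict(X_te) == y_te)

# The two printed numbers should agree
print(manual_accuracy, tree.score(X_te, y_te))
```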

Make a ROC plot for easy comparison among the methods.

In [None]:
import matplotlib.pyplot as plt

# Draw one ROC curve per model
for i, _ in enumerate(models):
    plt.plot(fprs[i], tprs[i])

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(model_titles)
plt.show()
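
A ROC curve is often summarized by the area under it (AUC), which gives a single number per model to compare. The sketch below is one possible way to compute it with `sklearn.metrics.auc` and `roc_auc_score`, again on synthetic stand-in data (an assumption for illustration; the names `rf`, `probs`, `fpr_demo`, and `tpr_demo` are hypothetical).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration; not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Predicted probability of the positive class (column 1 of predict_proba)
probs = rf.predict_proba(X_te)[:, 1]

# Area under the curve, computed from the ROC points with the trapezoidal rule
fpr_demo, tpr_demo, _ = roc_curve(y_te, probs)
area = auc(fpr_demo, tpr_demo)

# roc_auc_score computes the same quantity directly from labels and scores
print(area, roc_auc_score(y_te, probs))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of the test records.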