<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/4482_ensemble.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explores ensemble methods for model creation. We start with a very simple decision tree and then dive into various ensemble methods.

https://scikit-learn.org/stable/modules/ensemble.html


In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier



In [None]:
titanic= pd.read_csv('https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/titanic_cleaned.csv') 

In [None]:
titanic.head()

In [None]:
titanic.Pclass = titanic.Pclass.astype(str)

In [None]:
titanic.info()

In [None]:
titanic_enc = pd.get_dummies(titanic)

In [None]:
y_target = titanic_enc.pop('Survived')

Below we will explore a variety of decision tree models. We begin with a simple model using default parameters, then add grid search, move to a random forest, and finally try a boosting classifier. Each of these methods has strengths and weaknesses. Ensemble methods tend to perform better than their simpler counterparts, but as you will see below, it's not always that simple. Just throwing an ensemble method at a problem won't always improve the score; the algorithm must be configured correctly to take advantage of the classifier's strengths and minimize its weaknesses.

One easy-to-see tradeoff is complexity. The basic decision tree runs in a fraction of the time of these more complex methods. Especially when combined with grid search, the ensembles build hundreds to thousands more models than even a grid-searched decision tree.

Our dataset is a little over 700 rows. Imagine how long these would take on a dataset with tens of thousands or millions of rows. 
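
To make the cost concrete, here is a rough back-of-the-envelope count of how many fits the grids in this notebook trigger. It is a sketch only: it assumes `GridSearchCV` runs its default 5-fold cross-validation, and the helper names `count_fits` and `count_trees` are made up for illustration.

```python
from math import prod

# Grids copied from the decision tree and random forest cells in this notebook.
depths = list(range(1, 20, 2))  # 10 values: 1, 3, ..., 19
grids = {
    "decision_tree": {"max_depth": depths},
    "random_forest": {
        "max_depth": depths,
        "n_estimators": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                         100, 200, 300, 400, 500, 600, 700, 900, 1000],
    },
}
CV_FOLDS = 5  # assumption: GridSearchCV's default cv=5


def count_fits(grid, folds=CV_FOLDS):
    """Model fits = number of parameter combinations x number of CV folds."""
    return prod(len(values) for values in grid.values()) * folds


def count_trees(grid, folds=CV_FOLDS):
    """Individual trees grown; a plain decision tree counts as one tree per fit."""
    other_combos = prod(len(v) for k, v in grid.items() if k != "n_estimators")
    return sum(grid.get("n_estimators", [1])) * other_combos * folds


fit_counts = {name: count_fits(grid) for name, grid in grids.items()}
tree_counts = {name: count_trees(grid) for name, grid in grids.items()}
print(fit_counts)
print(tree_counts)
```

The number of fits grows with the product of the grid dimensions, and for a forest each fit grows `n_estimators` trees on top of that.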

# Decision Tree with default params
Not an ensemble

In [None]:
# cross_validate clones the estimator and refits it on each fold,
# so a separate .fit() call beforehand is not needed
clf = DecisionTreeClassifier()
pd.DataFrame(cross_validate(clf, titanic_enc, y_target, scoring=['f1'])).agg('mean')

# Decision Tree Classifier with GridSearchCV
Still not an ensemble, but clearly more complex than training a single tree. Grid search explores the parameter space far more thoroughly to find the best-performing model.

In [None]:
parameters = {'max_depth':list(range(1,20,2))}
clf = GridSearchCV(DecisionTreeClassifier(), parameters,scoring=['f1'],refit=False)
clf = clf.fit(titanic_enc, y_target)

In [None]:
results = clf.cv_results_
pd.DataFrame(results)[['param_max_depth','mean_test_f1']].sort_values('mean_test_f1',ascending=False).head(10)

# Random Forest (Bagging) Classifier with GridSearchCV

A bagging model in which predictions are averaged. Bagging often performs better with very deep trees, which in general would overfit if it weren't for the averaging of predictions.
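
The averaging idea can be sketched by hand. The block below is a simplified illustration, not how `RandomForestClassifier` is implemented: it uses synthetic data from `make_classification` as a stand-in for the Titanic set, and it leaves out the per-split feature subsampling that a random forest adds on top of bagging.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic set).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25
probas = []
for _ in range(n_trees):
    # Bootstrap: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_depth=None lets each tree grow fully, so each one overfits its sample.
    tree = DecisionTreeClassifier(max_depth=None, random_state=0)
    tree.fit(X[idx], y[idx])
    probas.append(tree.predict_proba(X)[:, 1])

# Average the per-tree probabilities for class 1, then threshold at 0.5.
avg_proba = np.mean(probas, axis=0)
bagged_pred = (avg_proba >= 0.5).astype(int)
```

Each tree sees a different bootstrap sample, so their individual errors differ, and averaging smooths them out.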

In [None]:
parameters = {'max_depth':list(range(1,20,2)),
              'n_estimators':[1,2,3,4,5,6,7,8,9,10,100,200,300,400,500,600,700,900,1000]}
clf = GridSearchCV(RandomForestClassifier(n_jobs=-1), parameters,scoring=['f1'],refit=False)
clf = clf.fit(titanic_enc, y_target)
results = clf.cv_results_


In [None]:
pd.DataFrame(results)[['param_max_depth','param_n_estimators','mean_test_f1']].sort_values('mean_test_f1',ascending=False).head(10)

In [None]:
list(range(1,20,2))

# Gradient Boosting Classifier with GridSearchCV

In [None]:


parameters = {'max_depth':list(range(1,20,2)),
              'n_estimators':[1,2,3,4,5,6,7,8,9,10,100,200,300,400,500,600]
              }
clf = GridSearchCV(GradientBoostingClassifier(), parameters,scoring=['f1'],refit=False)
clf = clf.fit(titanic_enc, y_target)
results = clf.cv_results_

In [None]:
pd.DataFrame(results)[['param_n_estimators','param_max_depth','mean_test_f1']].sort_values('mean_test_f1',ascending=False).head(10)

# Voting Ensemble

Predicts the class chosen by the majority of the classifiers. For example, suppose our three classifiers predict as follows:

Classifier 1 predicts class 0

Classifier 2 predicts class 1

Classifier 3 predicts class 0

Then the final prediction will be 0.
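
The three-classifier example above can be reproduced in a few lines of plain Python. This is only a sketch of majority voting; `hard_vote` is a hypothetical helper, and its tie-breaking (first label seen wins) is not guaranteed to match what `VotingClassifier` does.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


votes = [0, 1, 0]  # classifiers 1, 2 and 3 from the example above
final = hard_vote(votes)
print(final)  # 0
```

Using an odd number of classifiers on a binary problem avoids ties altogether.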



In [None]:
clf1 = SVC(kernel='sigmoid',random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = SVC(kernel='rbf',random_state=1)

eclf = VotingClassifier(
    estimators=[('svc_sig', clf1), ('rf', clf2), ('svc_rbf', clf3)],
    voting='hard')


eclf = eclf.fit(titanic_enc, y_target)
pd.DataFrame(cross_validate(eclf,titanic_enc,y_target,scoring=['f1'])).agg('mean')


# Voting with Grid Search

Here we show that it is possible to combine grid search, cross-validation, and ensemble methods. Although the performance is not as good in this case, the approach is robust and can outperform simpler setups in some circumstances.

In [None]:
# dropping the number of grid search params to reduce fit time

# start with a sigmoid SVC
parameters_svc_sig = {'C':list(range(1,10,2))}
clf1 = GridSearchCV(SVC(kernel='sigmoid',random_state=1), parameters_svc_sig,scoring='f1',refit=True)


# Add in a random forest
parameters_dt = {'max_depth':list(range(1,20,2)),
                'n_estimators':[1,2,3,4,5,6,7,8,9,10,100,200]
                 }
clf2 = GridSearchCV(RandomForestClassifier(), parameters_dt,scoring='f1',refit=True)

# and finally a SVC with a radial kernel
parameters_svc_rbf = {'C':list(range(1,10,2))}
clf3 = GridSearchCV(SVC(kernel='rbf',random_state=1), parameters_svc_rbf,scoring='f1',refit=True)

eclf = VotingClassifier(
    estimators=[('svc_sig', clf1), ('rf', clf2), ('svc_rbf', clf3)],
    voting='hard')


eclf = eclf.fit(titanic_enc, y_target)
pd.DataFrame(cross_validate(eclf,titanic_enc,y_target,scoring=['f1'])).agg('mean')
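
Once a voting ensemble of grid searches has been fitted, it can be useful to check which parameters each member settled on. The sketch below shows one way to do that on a small synthetic dataset with deliberately tiny grids (both are assumptions made to keep it fast; `demo_vote` and `chosen` are illustrative names).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic set).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Tiny grids and cv=3 keep this demonstration quick.
svc_sig = GridSearchCV(SVC(kernel='sigmoid', random_state=1),
                       {'C': [1, 3]}, scoring='f1', cv=3)
rf = GridSearchCV(RandomForestClassifier(random_state=1),
                  {'max_depth': [1, 3], 'n_estimators': [5, 10]},
                  scoring='f1', cv=3)
svc_rbf = GridSearchCV(SVC(kernel='rbf', random_state=1),
                       {'C': [1, 3]}, scoring='f1', cv=3)

demo_vote = VotingClassifier(
    estimators=[('svc_sig', svc_sig), ('rf', rf), ('svc_rbf', svc_rbf)],
    voting='hard')
demo_vote.fit(X, y)

# Each fitted member is itself a GridSearchCV, so best_params_ is available.
chosen = {name: est.best_params_
          for name, est in demo_vote.named_estimators_.items()}
print(chosen)
```

Note that wrapping this in `cross_validate` repeats every inner grid search once per outer fold, which is where most of the run time goes.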

# Stacked Ensemble

Stacking is one of the more complex models we will explore in this class. Several base models each generate cross-validated predictions, and those predictions become the training features for a single final model (here, a gradient boosted classifier). The stacked model as a whole is then evaluated with cross-validation again.
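
To show the mechanics, stacking can be approximated by hand with `cross_val_predict`. This is a simplified sketch under stated assumptions: synthetic data instead of the Titanic set, only two base models, and hard class labels as meta-features, which may differ from what `StackingClassifier` passes to its final estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic set).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

base_models = {
    'svc_rbf': SVC(kernel='rbf', random_state=1),
    'rf': RandomForestClassifier(n_estimators=50, random_state=1),
}

# Out-of-fold predictions: every row is predicted by a model that never saw it.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for model in base_models.values()
])

# The final model learns how to combine the base models' predictions.
meta_model = GradientBoostingClassifier(n_estimators=25, random_state=42)
meta_model.fit(meta_features, y)
```

Using out-of-fold predictions matters: if the base models predicted rows they were trained on, the final model would learn to trust them more than it should.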

In [None]:
clf1 = SVC(kernel='sigmoid',random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = SVC(kernel='rbf',random_state=1)

In [None]:
estimators = [('svc_sig', clf1),
              ('rf', clf2),
              ('svc_rbf', clf3)]

In [None]:

final_estimator = GradientBoostingClassifier(
    n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,
    random_state=42)



stacked_ens = StackingClassifier(
    estimators=estimators,
    final_estimator=final_estimator)

pd.DataFrame(cross_validate(stacked_ens,titanic_enc,y_target,scoring=['f1'])).agg('mean')

# Stacking Ensemble with GridSearchCV

Now we stack but throw in grid search to try and get the best single model for each classifier. 

In [None]:
parameters_svc_sig = {'C':list(range(1,3,1))}
clf1 = GridSearchCV(SVC(kernel='sigmoid',random_state=1), parameters_svc_sig,scoring='f1',refit=True)


# Add in a random forest
parameters_dt = {'max_depth':list(range(1,3,1)),
                'n_estimators':[1,2,3,4,5] # reduce the grid search to make the model fit faster (it won't be as robust though)
                 }
clf2 = GridSearchCV(RandomForestClassifier(), parameters_dt,scoring='f1',refit=True)

# and finally a SVC with a radial kernel
parameters_svc_rbf = {'C':list(range(1,3,1))}
clf3 = GridSearchCV(SVC(kernel='rbf',random_state=1), parameters_svc_rbf,scoring='f1',refit=True)

In [None]:
estimators = [('svc_sig', clf1),
              ('rf', clf2),
              ('svc_rbf', clf3)]

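Note that the final estimator below uses max_features=1. Its input features are the predictions from each of the base estimators, so each split in the boosted trees considers just one estimator's prediction.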
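
One way to see what the final estimator receives is to call `transform` on a fitted `StackingClassifier`. The sketch below uses synthetic binary data and ungridded base models (both assumptions, to keep it quick); with recent scikit-learn versions it should yield one column per base estimator on a binary problem, but treat that as something to verify on your own install.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic set).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

demo_stack = StackingClassifier(
    estimators=[('svc_sig', SVC(kernel='sigmoid', random_state=1)),
                ('rf', RandomForestClassifier(n_estimators=10, random_state=1)),
                ('svc_rbf', SVC(kernel='rbf', random_state=1))],
    final_estimator=GradientBoostingClassifier(
        n_estimators=25, max_features=1, random_state=42),
    cv=3)
demo_stack.fit(X, y)

# transform returns the base estimators' outputs, i.e. the meta-features.
meta_X = demo_stack.transform(X)
print(meta_X.shape)
```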

In [None]:
# Note that the performance here will be underwhelming: combining grid search with a stacking classifier is a computationally expensive
# procedure, so the grids above were cut down. Previous runs using the same grid params as the models above took over 2 hours.
# This code is therefore mainly for demonstration purposes and should not be interpreted as a
# reason not to consider stacking due to poor performance.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier


final_estimator = GradientBoostingClassifier(
    n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,
    random_state=42)


stacked_ens = StackingClassifier(
    estimators=estimators,
    final_estimator=final_estimator)

pd.DataFrame(cross_validate(stacked_ens,titanic_enc,y_target,scoring=['f1'])).agg('mean')
