# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with a Bagging classifier. We will do that on a few datasets, starting with the ones offered by scikit-learn.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from scipy.stats import ttest_ind
from sklearn.pipeline import make_pipeline, make_union

In [4]:
%matplotlib inline

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [5]:
# Load the data and create X and y
X, y = datasets.load_breast_cancer(return_X_y=True)

In [38]:
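# Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cv to 5 folds
# The targets are already the integers 0/1, so the encoder leaves them unchanged.
le = LabelEncoder()
y = le.fit_transform(y)
dt = DecisionTreeClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
dt.fit(X_train, y_train)

# cross_val_score clones and refits dt on every fold, so the fit above does not
# affect this score; the five folds are drawn from the held-out third only.
not_bagged = cross_val_score(dt, X_test, y_test, cv=5)
np.mean(not_bagged)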

0.9257176204544626

In [42]:
# Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance.
# Set crossvalidation to 5-folds.
bag = BaggingClassifier(dt)
bag.fit(X_train, y_train)
bagged = cross_val_score(bag, X_test, y_test, cv=5)
np.mean(bagged)

0.93583907794434107

In [44]:
# Which score is better? Are the scores significantly different? How can you judge that?

print("Bagging appears to produce better results.")

Bagging appears to produce better results.
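
One way to approach the significance question is to compare the per-fold scores with a two-sample t-test, using the `ttest_ind` function imported at the top. The sketch below is one possible setup, not the lab's official solution: it scores both models on the full dataset with the same shuffled folds (the `StratifiedKFold` splitter and the `random_state` values are choices made here for reproducibility), then runs Welch's t-test on the two sets of five scores. With only five folds that share most of their training data, the p-value is a rough guide at best.

```python
from scipy.stats import ttest_ind
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

# Use identical folds for both models so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bagged_tree = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

tree_scores = cross_val_score(tree, X, y, cv=cv)
bag_scores = cross_val_score(bagged_tree, X, y, cv=cv)

# Welch's t-test: does not assume equal variance in the two score sets.
t_stat, p_value = ttest_ind(bag_scores, tree_scores, equal_var=False)

print("tree mean accuracy:   %.3f" % tree_scores.mean())
print("bagged mean accuracy: %.3f" % bag_scores.mean())
print("t = %.3f, p = %.3f" % (t_stat, p_value))
```

Because both models are scored on the same folds, a paired test such as `scipy.stats.ttest_rel` is arguably a better fit; `ttest_ind` is used here only because it is the function the lab imports.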


### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, each with a scaling preprocessing step followed by either a decision tree or a bagged decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?

In [52]:
dt_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier())
ddtt = dt_pipe.fit(X_train, y_train)
# Note: ddtt is the whole scaler + tree pipeline, so every bagged member rescales
# data that the outer StandardScaler has already scaled. Passing a plain
# DecisionTreeClassifier() to BaggingClassifier would avoid the redundant step.
bag_pipe = make_pipeline(StandardScaler(), BaggingClassifier(ddtt))
bbgg = bag_pipe.fit(X_train, y_train)

print(np.mean(cross_val_score(ddtt, X_test, y_test, cv=5)))
print(np.mean(cross_val_score(bbgg, X_test, y_test, cv=5)))

print("Standardization doesn't seem to improve the scores significantly.")

0.914914104388
0.940959988328
Standardization doesn't seem to improve the scores significantly.
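
A decision tree splits on a threshold of one feature at a time, so a monotonic rescaling such as standardization should leave the partitions it finds essentially unchanged. The sketch below checks this on the full dataset by scoring a plain tree and a scaled pipeline on the same folds. The splitter, the fixed `random_state` values, and the variable names are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

plain_tree = DecisionTreeClassifier(random_state=0)
# Inside cross_val_score the scaler is refit on each training fold,
# so no statistics from the test fold leak into the preprocessing.
scaled_tree = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

plain_scores = cross_val_score(plain_tree, X, y, cv=cv)
scaled_scores = cross_val_score(scaled_tree, X, y, cv=cv)

print("unscaled:", plain_scores.round(3))
print("scaled:  ", scaled_scores.round(3))
```

If the two rows of fold scores come out nearly identical, any gap seen between the scaled and unscaled runs in this section is more likely due to the randomness of unseeded trees than to the scaling itself.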


### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?
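
The notes above about parameter names can be sketched as follows. Parameters of the wrapped tree are reached through a double-underscore prefix on the Bagging estimator. That prefix is `base_estimator` in older scikit-learn releases and `estimator` in recent ones, so this sketch looks it up at run time. The grid values are arbitrary small choices, picked only to keep the search fast.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# The wrapped tree is exposed as "estimator" in recent scikit-learn releases
# and as "base_estimator" in older ones; use whichever this install provides.
prefix = "estimator" if "estimator" in bag.get_params() else "base_estimator"

param_grid = {
    "n_estimators": [5, 10],            # parameters of the ensemble itself
    "max_samples": [0.5, 1.0],
    prefix + "__max_depth": [None, 5],  # parameter of the wrapped tree
}

# n_jobs=-1 spreads the 8 candidates x 5 folds over all available cores.
search = GridSearchCV(bag, param_grid=param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Each extra list in the grid multiplies the number of fits, which is why it helps to start with two or three values per parameter and widen only where the best value sits at the edge of its range.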

In [61]:
# Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
# Search for a few values of the parameters in order to improve the score of the classifier
# Use the whole X, y dataset for your test
param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [2, 10, 20],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20],
              }

gs = GridSearchCV(dt, param_grid=param_grid, cv=5)
gs.fit(X,y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [2, 10, 20], 'max_leaf_nodes': [None, 5, 10, 20], 'criterion': ['gini', 'entropy'], 'max_depth': [None, 2, 5, 10], 'min_samples_leaf': [1, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [62]:
# Check the best_score_ once you've trained it. Is it better than before?
print(gs.best_score_)
print('The score is better.')

0.947275922671
The score is better.


In [64]:
# How does the score of the Grid-searched DT compare with the score of the Bagging DT?
print("The score doesn't seem significantly better; however, it is consistently better.")

The score doesn't seem significantly better; however, it is consistently better.


In [67]:
# Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
param_grid = {"n_estimators": [1, 3, 5, 7, 9, 11],
              "bootstrap": [True, False],
              "bootstrap_features": [True, False]
              }

gs_ = GridSearchCV(bag, param_grid=param_grid, cv=5)
gs_.fit(X,y)

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
        ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [1, 3, 5, 7, 9, 11], 'bootstrap': [True, False], 'bootstrap_features': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [68]:
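# Note: gs is the decision-tree search, so this repeats the score from In [62].
# The Bagging search is stored in gs_; inspect it with gs_.best_score_.
gs.best_score_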

0.9472759226713533

## 2. Diabetes and Regression

Scikit-learn has a dataset of diabetic patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [83]:
X_, y_ = datasets.load_diabetes(return_X_y=True)

# Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cv to 5 folds
dtr = DecisionTreeRegressor()
X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.33, random_state=42)
dtr.fit(X_train, y_train)
# With no scoring argument, cross_val_score uses the regressor's default score, R^2.
# A negative value means the model does worse than always predicting the mean.
not_bagged = cross_val_score(dtr, X_test, y_test, cv=5)
np.mean(not_bagged)

-0.084320666297112989

In [84]:
# Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance.
# Set crossvalidation to 5-folds.
bag = BaggingRegressor(dtr)
bag.fit(X_train, y_train)
bagged = cross_val_score(bag, X_test, y_test, cv=5)
np.mean(bagged)

0.38513268316994415
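
On the question of which score to use: for regressors, `cross_val_score` falls back to the estimator's own `score` method, which is R². The sketch below names the metric explicitly and also reports mean squared error, so the two models can be compared on an error scale as well. It runs on the full diabetes dataset with shared shuffled folds. The `KFold` settings, the seeds, and the `models` and `results` names are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_, y_ = datasets.load_diabetes(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "bagged tree": BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0),
}

results = {}
for name, model in models.items():
    r2 = cross_val_score(model, X_, y_, cv=cv, scoring="r2")
    # scikit-learn negates error metrics so that larger is always better.
    neg_mse = cross_val_score(model, X_, y_, cv=cv, scoring="neg_mean_squared_error")
    results[name] = (r2.mean(), -neg_mse.mean())
    print("%-12s R^2 = %.3f, MSE = %.1f" % (name, results[name][0], results[name][1]))
```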

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?
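
A minimal sketch of the first few steps for the regressor, assuming the diabetes arrays are named `X_` and `y_` as in section 2.a. The `criterion` parameter is left out of the grid because its accepted values differ between scikit-learn versions. The remaining grid values are illustrative choices, not recommended settings.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_, y_ = datasets.load_diabetes(return_X_y=True)

param_grid = {
    "max_depth": [None, 2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="r2",  # stated explicitly; this is also the regressor default
    n_jobs=-1,
)
# Fit on the diabetes arrays, not on the X, y from section 1.
search.fit(X_, y_)

print(search.best_params_)
print(search.best_score_)
```

The same pattern carries over to a `BaggingRegressor`, with the tree's parameters addressed through the double-underscore prefix mentioned in the notes above.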


In [90]:
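# Decision tree regressor
# "mse" and "mae" are the criterion names in older scikit-learn releases;
# from version 1.0 they are called "squared_error" and "absolute_error".
param_grid = {"criterion": ["mse", "mae"],
              "min_samples_split": [2, 10, 20],
              "max_depth": [None, 2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "max_leaf_nodes": [None, 5, 10, 20],
              }

gs = GridSearchCV(dtr, param_grid=param_grid, cv=5)
# Caution: X, y are still the breast cancer arrays from section 1, so the score
# below is not a diabetes result. The diabetes data is in X_, y_: gs.fit(X_, y_)
gs.fit(X,y)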

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [2, 10, 20], 'max_leaf_nodes': [None, 5, 10, 20], 'criterion': ['mse', 'mae'], 'max_depth': [None, 2, 5, 10], 'min_samples_leaf': [1, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [91]:
print(gs.best_score_)

0.753158644136


In [92]:
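# Grid search with bagging
param_grid = {"n_estimators": [1, 3, 5, 7, 9, 11],
              "bootstrap": [True, False],
              "bootstrap_features": [True, False]
              }

gs_ = GridSearchCV(bag, param_grid=param_grid, cv=5)
# Caution: as in the previous search, X, y are the breast cancer arrays;
# the diabetes data is in X_, y_: gs_.fit(X_, y_)
gs_.fit(X,y)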

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [1, 3, 5, 7, 9, 11], 'bootstrap': [True, False], 'bootstrap_features': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [93]:
print(gs_.best_score_)

0.832999130645


## Bonus: Project 6 data

Repeat the analysis for the Project 6 dataset.