# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with a Bagging classifier. We will do that on a few datasets, starting with the ones bundled with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns = data.feature_names)
y = data.target

In [59]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

tree_model = DecisionTreeClassifier(max_depth = 5, random_state = 1)
score = cross_val_score(tree_model, X, y, cv=5)
print("DTC")
print("%s ± %s" % (score.mean(), score.std()))

bdt = BaggingClassifier(tree_model)
score = cross_val_score(bdt, X, y, cv=5)
print("With Bagging")
print("%s ± %s" % (score.mean(), score.std()))

print("Bagging is ~.03 higher, but that gap is comparable to the fold std (~.02), so 5 folds alone do not show a clear difference.")

DTC
0.922770296268 ± 0.0183853837066
With Bagging
0.954505579069 ± 0.0252545217544
Bagging is ~.03 higher, but that gap is comparable to the fold std (~.02), so 5 folds alone do not show a clear difference.
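
One way to approach the significance question is to score both models on exactly the same splits and look at the per-split differences. The sketch below is only an illustration, assuming a recent scikit-learn with `RepeatedStratifiedKFold` and SciPy's `ttest_rel`; the number of repeats and the seeds are arbitrary choices. Because cross-validation splits share training data, the p-value should be read as a rough guide rather than an exact test.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A seeded splitter, so both models are scored on exactly the same splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

tree_model = DecisionTreeClassifier(max_depth=5, random_state=1)
bagged_model = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5, random_state=1), random_state=1
)

tree_scores = cross_val_score(tree_model, X, y, cv=cv)
bagged_scores = cross_val_score(bagged_model, X, y, cv=cv)

# Per-split differences: positive values mean bagging did better on that split.
diff = bagged_scores - tree_scores
t_stat, p_value = ttest_rel(bagged_scores, tree_scores)

print("tree    %.3f ± %.3f" % (tree_scores.mean(), tree_scores.std()))
print("bagging %.3f ± %.3f" % (bagged_scores.mean(), bagged_scores.std()))
print("mean difference %.3f, paired t-test p-value %.3f" % (diff.mean(), p_value))
```

Comparing the mean difference with the spread of the per-split differences is usually more informative than comparing each model's standard deviation with a fixed threshold.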


### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?

In [99]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

dtr_model = make_pipeline(StandardScaler(), DecisionTreeClassifier())
score = cross_val_score(dtr_model, X, y, cv=5)
print("DTC")
print("%s ± %s" % (score.mean(), score.std()))

# dtr_model is already a scaling pipeline, so the data is scaled twice here;
# BaggingClassifier(DecisionTreeClassifier()) would be enough inside this pipeline.
bagging_model = make_pipeline(StandardScaler(), BaggingClassifier(dtr_model))
print(bagging_model)
score = cross_val_score(bagging_model, X, y, cv=5)
print("With Bagging")
print("%s ± %s" % (score.mean(), score.std()))

print("Scores near 0 are far below the ~.92-.95 of 1.a, so X and y probably did not hold the breast cancer data on this run; reload and re-run.")
print("Scaling should not change tree scores much: splits depend only on the ordering of feature values.")

DTC
0.0133481646274 ± 0.0163632262945
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('baggingclassifier', BaggingClassifier(base_estimator=Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('decisiontreeclassifier', DecisionTreeClassifier(class_weight=Non...estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False))])
With Bagging
0.0 ± 0.0
Scores near 0 are far below the ~.92-.95 of 1.a, so X and y probably did not hold the breast cancer data on this run; reload and re-run.
Scaling should not change tree scores much: splits depend only on the ordering of feature values.
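
A minimal sketch of the two pipelines this step asks for, assuming the breast cancer data is loaded fresh in the same cell. Here the bagging pipeline wraps a plain tree, so scaling happens once. An unscaled tree with the same seed is included for comparison, since tree splits depend only on the ordering of feature values; the dictionary keys and seeds are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Seeded folds so every model is evaluated on the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

models = {
    "tree (raw)": DecisionTreeClassifier(random_state=1),
    "tree (scaled)": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1)),
    "bagging (scaled)": make_pipeline(
        StandardScaler(),
        BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1),
    ),
}

results = {}
for name, model in models.items():
    # The scaler is refit on each training fold, so no test-fold statistics leak in.
    results[name] = cross_val_score(model, X, y, cv=cv)
    print("%-17s %.3f ± %.3f" % (name, results[name].mean(), results[name].std()))
```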


### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?

In [63]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {'max_leaf_nodes': np.arange(5,10),
              'min_samples_leaf': np.arange(2,14),
              'min_samples_split': np.arange(4,14)
             }

tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv = 5)

tree.fit(X,y)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13]), 'max_leaf_nodes': array([5, 6, 7, 8, 9]), 'min_samples_leaf': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [66]:
print("Best parameters")
print(tree.best_params_)
print("Best DTC score after gridsearch")
print(tree.best_score_)
print("Yes, it is better than the tree from 1.a (~.923) and close to bagging (~.955)")

Best parameters
{'min_samples_split': 4, 'max_leaf_nodes': 6, 'min_samples_leaf': 8}
Best DTC score after gridsearch
0.943760984183
Yes, it is better than the tree from 1.a (~.923) and close to bagging (~.955)
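
Note that `best_score_` is the mean cross-validation score of the winning candidate, measured on the same folds that were used to pick it, so it tends to be slightly optimistic next to a plain `cross_val_score`. A hedged sketch of nested cross-validation, where the whole grid search is rerun inside each outer fold, is given here; the grid values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid (assumed values, chosen only to keep the search quick).
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 4, 8]}

inner_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)

# Outer loop: each of the 5 folds refits the full grid search on its training part
# and scores the selected tree on data the search never saw.
nested_scores = cross_val_score(inner_search, X, y, cv=5)

# Inner-only estimate for comparison: this is what best_score_ reports.
inner_search.fit(X, y)

print("best_score_  %.3f" % inner_search.best_score_)
print("nested score %.3f ± %.3f" % (nested_scores.mean(), nested_scores.std()))
```

The nested score is the fairer number to set against the untuned models, because parameter selection never sees the outer test folds.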


In [74]:
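# What are good parameter ranges to search?
# Caveats about this grid:
# - integer max_samples values are absolute sample counts, so each tree here
#   is fit on only 1 or 2 rows; floats such as 0.5 or 1.0 are fractions of the training set
# - n_jobs only controls parallelism, not model behavior, so it belongs on
#   GridSearchCV rather than in the grid
param_grid = {'n_estimators': np.arange(5,10),
              'max_samples': np.arange(1,3),
              'n_jobs' : np.arange(1,10)
             }

dtc = DecisionTreeClassifier()
tree = GridSearchCV(BaggingClassifier(dtc), param_grid, cv = 5)

tree.fit(X,y)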

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
        ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': array([5, 6, 7, 8, 9]), 'max_samples': array([1, 2]), 'n_jobs': array([1, 2, 3, 4, 5, 6, 7, 8, 9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [76]:
print("Best parameters")
print(tree.best_params_)
print("Best Bagging score after gridsearch")
print(tree.best_score_)
print("The score dropped because integer max_samples of 1 or 2 fits each tree on only 1 or 2 rows")

Best parameters
{'n_estimators': 5, 'max_samples': 2, 'n_jobs': 3}
Best Bagging score after gridsearch
0.806678383128
The score dropped because integer max_samples of 1 or 2 fits each tree on only 1 or 2 rows
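
As a rough sketch of a grid that follows the notes in the exercise, the example below tunes one parameter of the wrapped tree through the double-underscore naming, uses fractional `max_samples`, and keeps `n_jobs` on the search itself. The grid values and the `N_JOBS` constant are assumptions made for illustration. The wrapped tree is called `base_estimator` in older scikit-learn releases and `estimator` in newer ones, so the name is looked up rather than hard-coded.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=1)

# Pick whichever name this scikit-learn version uses for the wrapped estimator.
inner = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

param_grid = {
    inner + "__max_depth": [3, None],  # parameter of the wrapped tree
    "n_estimators": [10, 30],          # number of bagged trees
    "max_samples": [0.5, 1.0],         # floats are fractions of the training rows
}

N_JOBS = 1  # assumption: single process; set to -1 to use all available cores
search = GridSearchCV(bagging, param_grid, cv=5, n_jobs=N_JOBS)
search.fit(X, y)

print(search.best_params_)
print("%.3f" % search.best_score_)
```

This grid has 8 candidates, so with 5 folds it runs 40 fits plus one final refit, which stays quick on a dataset of this size.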


## 2. Diabetes and Regression

scikit-learn includes a dataset of diabetic patients obtained from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

In [83]:
from sklearn.datasets import load_diabetes
data = load_diabetes()

X = pd.DataFrame(data.data)
y = data.target



### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [101]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

tree_model = DecisionTreeRegressor(max_depth = 5, random_state = 1)
# r2 is already the default score for regressors; it is spelled out so both runs match
score = cross_val_score(tree_model, X, y, cv=5, scoring='r2')
print("DTR")
print("%s ± %s" % (score.mean(), score.std()))

bdt = BaggingRegressor(tree_model)
score = cross_val_score(bdt, X, y, cv=5, scoring='r2')
print("With Bagging")
print("%s ± %s" % (score.mean(), score.std()))

print("Bagging helps a lot, its r2 almost doubles")

DTR
0.223665822631 ± 0.10598865997
With Bagging
0.402556095641 ± 0.0599559712635
Bagging helps a lot, its r2 almost doubles
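
On the question of which score to use: R² is the default for regressors, but an error in the units of the target can be easier to interpret. The sketch below, offered as one possible approach, reports both R² and a root mean squared error derived from scikit-learn's `neg_mean_squared_error` scorer; the seeds and the `rmse` dictionary are illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

models = {
    "tree": DecisionTreeRegressor(max_depth=5, random_state=1),
    "bagging": BaggingRegressor(
        DecisionTreeRegressor(max_depth=5, random_state=1), random_state=1
    ),
}

rmse = {}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    # scikit-learn maximizes scores, so MSE is reported negated; flip the sign back.
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    # Square root of the mean fold MSE, in the same units as the target.
    rmse[name] = mse.mean() ** 0.5
    print("%-8s r2 %.3f ± %.3f   rmse %.1f" % (name, r2.mean(), r2.std(), rmse[name]))
```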


### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?


In [93]:
param_grid = {'max_leaf_nodes': np.arange(5,10),
              'min_samples_leaf': np.arange(2,14),
              'min_samples_split': np.arange(4,14)
             }

tree = GridSearchCV(DecisionTreeRegressor(), param_grid, cv = 5)

tree.fit(X,y)

print("Best parameters")
print(tree.best_params_)
print("Best DTR score after gridsearch")
print(tree.best_score_)
print("Yes, it is better than the tree from 2.a (~.22) but still below bagging (~.40)")

Best parameters
{'min_samples_split': 4, 'max_leaf_nodes': 8, 'min_samples_leaf': 12}
Best DTR score after gridsearch
0.354788797624
Yes, it is better than the tree from 2.a (~.22) but still below bagging (~.40)


In [102]:
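# Caveats about this grid:
# - integer max_samples values are absolute sample counts, so each tree here
#   is fit on only 1 or 2 rows; floats such as 0.5 or 1.0 are fractions of the training set
# - n_jobs only controls parallelism, so it belongs on GridSearchCV rather than in the grid
param_grid = {'n_estimators': np.arange(5,10),
              'max_samples': np.arange(1,3),
              'n_jobs' : np.arange(1,10)
             }

dtr = DecisionTreeRegressor()
tree = GridSearchCV(BaggingRegressor(dtr), param_grid, cv = 5)

tree.fit(X,y)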

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': array([5, 6, 7, 8, 9]), 'max_samples': array([1, 2]), 'n_jobs': array([1, 2, 3, 4, 5, 6, 7, 8, 9])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [103]:
print("Best parameters")
print(tree.best_params_)
print("Best Bagging score after gridsearch")
print(tree.best_score_)
print("The score is R^2, not accuracy; it is low because max_samples of 1 or 2 fits each tree on only 1 or 2 rows")

Best parameters
{'n_estimators': 7, 'max_samples': 2, 'n_jobs': 5}
Best Bagging score after gridsearch
0.122791100507
The score is R^2, not accuracy; it is low because max_samples of 1 or 2 fits each tree on only 1 or 2 rows
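
To judge whether the grid-searched candidates really differ, it can help to look at every row of `cv_results_` rather than only `best_score_`. The following is a sketch under assumed grid values, with fractional `max_samples` and the wrapped tree's parameter addressed through the double-underscore naming; the `table` variable and the choice of columns are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

bagging = BaggingRegressor(DecisionTreeRegressor(random_state=1), random_state=1)

# The wrapped tree is `base_estimator` in older scikit-learn and `estimator` in newer releases.
inner = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

param_grid = {
    inner + "__max_depth": [3, 5],
    "n_estimators": [10, 30],
    "max_samples": [0.5, 1.0],  # fractions of the training rows, not row counts
}

search = GridSearchCV(bagging, param_grid, cv=5, scoring="r2")
search.fit(X, y)

# One row per candidate: mean and spread of the 5 fold scores, best first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
table = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(table.head())
```

If the gap between the top candidates is smaller than their `std_test_score`, the search has not clearly separated them.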


## Bonus: Project 6 data

Repeat the analysis for the Project 6 Dataset