# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do this on a few datasets, starting with the ones offered by scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and Y.
2. Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
3. Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How can you judge that?

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```
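
The steps in 1.a can be sketched roughly as follows. This is one possible approach, not a reference solution: the fixed `random_state=0` and the variable names (`tree`, `bagged`, `tree_scores`, `bagged_scores`) are assumptions made only so the sketch is reproducible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the predictors (X) and the binary target (Y) as NumPy arrays.
X, Y = load_breast_cancer(return_X_y=True)

# random_state is fixed only for reproducibility (an assumption of this sketch).
tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# For classifiers, cross_val_score reports accuracy by default.
tree_scores = cross_val_score(tree, X, Y, cv=5)
bagged_scores = cross_val_score(bagged, X, Y, cv=5)

for name, scores in [("Decision tree", tree_scores), ("Bagging", bagged_scores)]:
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the gap between the two means with the spread across folds gives a first, rough indication of whether the difference is meaningful.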

### 1.b Scale (normalize) data
As you may have noticed, the features are not normalized. Do the scores improve with normalization?

1. Normalize the predictors.
2. Build a decision tree classifier and a bagging decision tree classifier.
3. Are the scores different from those obtained on the non-scaled data?
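
A minimal sketch of the scaled-versus-unscaled comparison, assuming `StandardScaler` as the normalization method (the lab does not prescribe one) and using a pipeline so that the scaler is fit only on each training fold. The `models` and `results` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)

# Hypothetical names; random_state is fixed only for reproducibility.
models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
}

results = {}
for name, model in models.items():
    # Scores on the raw features.
    raw = cross_val_score(model, X, Y, cv=5)
    # Scores with standardization applied inside each cross-validation fold.
    scaled = cross_val_score(make_pipeline(StandardScaler(), model), X, Y, cv=5)
    results[name] = (raw, scaled)
    print(f"{name}: raw={raw.mean():.3f}, scaled={scaled.mean():.3f}")
```

Because a tree splits on one feature at a time by comparing it with a threshold, rescaling a feature usually changes the scores very little, if at all.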


### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier.
2. Search over a few parameter values to try to improve the score of the classifier.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the Grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier.
6. Repeat the search.
    - Note that you'll have to change parameter names for the base_estimator (see example).
    - Note that there are also additional parameters to change (see example).
    - Note that you may end up with a grid space too large to search in a short time. Choose smaller ranges of parameters!
    - Make use of the n_jobs parameter to speed up your grid search (-1 uses all cores).
7. Does the score improve for the Grid-searched Bagging Classifier?
8. Which score is better? Are the scores significantly different? How could/would you judge that?
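
For the plain Decision Tree Classifier, the first few steps might look like the sketch below. The grid values in `dt_params` are illustrative assumptions, not recommended settings, and `n_jobs` is left at its default here; pass `n_jobs=-1` to parallelize.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)

# Illustrative grid (assumed values): 4 * 3 * 2 = 24 candidate settings.
dt_params = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 5],
}

# random_state is fixed only for reproducibility.
gsdt = GridSearchCV(DecisionTreeClassifier(random_state=0), dt_params, cv=5)
gsdt.fit(X, Y)

# best_score_ is the mean cross-validated accuracy of the best setting.
print(gsdt.best_params_)
print(f"Best mean CV accuracy: {gsdt.best_score_:.3f}")
```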

---

**EXAMPLE**
```python
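from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# The "base_estimator__" prefix routes a parameter to the wrapped tree.
# Recent scikit-learn releases renamed this prefix to "estimator__".
params = {"base_estimator__max_depth": [3, 5, 10, 20],
          # "sqrt" is what "auto" meant for classifiers; newer releases reject "auto".
          "base_estimator__max_features": [None, "sqrt"],
          "base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "base_estimator__min_samples_split": [2, 5, 7],
          # The remaining keys are parameters of the BaggingClassifier itself.
          "bootstrap_features": [False, True],
          "max_features": [0.5, 0.7, 1.0],
          "max_samples": [0.5, 0.7, 1.0],
          "n_estimators": [2, 5, 10, 20],
         }

bagged_decision_trees = BaggingClassifier(DecisionTreeClassifier())

# n_jobs=-1 uses all available cores; cv=5 requests 5-fold cross-validation.
gsbdt = GridSearchCV(bagged_decision_trees, params, n_jobs=-1, cv=5)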
```
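
The grid above contains several thousand combinations, so a much smaller grid is a sensible starting point. The sketch below is one way to run such a reduced search; the `small_params` values are assumptions, and the `prefix` lookup is a hypothetical helper that picks whichever name the installed scikit-learn version uses for the wrapped estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)

# random_state is fixed only for reproducibility.
bagged_decision_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), random_state=0
)

# Newer releases expose the wrapped model as "estimator", older ones as
# "base_estimator"; use whichever name this installation reports.
prefix = "estimator" if "estimator" in bagged_decision_trees.get_params() else "base_estimator"

# Reduced illustrative grid (assumed values): 2 * 2 * 2 = 8 candidates.
small_params = {
    f"{prefix}__max_depth": [3, None],
    "max_samples": [0.5, 1.0],
    "n_estimators": [5, 10],
}

gsbdt = GridSearchCV(bagged_decision_trees, small_params, cv=5)
gsbdt.fit(X, Y)

print(gsbdt.best_params_)
print(f"Best mean CV accuracy: {gsbdt.best_score_:.3f}")
```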

## 2. Diabetes and Regression

scikit-learn has a dataset of diabetic patients obtained from this study:

- http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
- http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison, this time using a Decision Tree Regressor and a Bagging Regressor instead of classifiers.

### 2.a Simple comparison
1. Load the data and create X and Y.
2. Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. What does the score mean? (Look at the documentation!)
3. Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
4. Which score is better? Are the scores significantly different? How could/would you judge that?
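
A rough sketch of the regression comparison, under the same assumptions as before (fixed `random_state=0`, illustrative variable names). For regressors, `cross_val_score` uses the estimator's default `score` method, which is the coefficient of determination R^2: 1.0 is a perfect fit, 0.0 matches always predicting the mean, and negative values are worse than that.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Load the 10 baseline variables (X) and the disease-progression target (Y).
X, Y = load_diabetes(return_X_y=True)

# random_state is fixed only for reproducibility.
tree = DecisionTreeRegressor(random_state=0)
bagged = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)

# Default scoring for regressors is R^2, computed on each held-out fold.
tree_r2 = cross_val_score(tree, X, Y, cv=5)
bagged_r2 = cross_val_score(bagged, X, Y, cv=5)

for name, scores in [("Decision tree", tree_r2), ("Bagging", bagged_r2)]:
    print(f"{name}: mean R^2={scores.mean():.3f}, std={scores.std():.3f}")
```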

### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor.
2. Search over a few values of the parameters in order to improve the score of the regressor.
3. Check the best\_score\_ once you've trained it. Is it better than before?
4. How does the score of the Grid-searched DT compare with the score of the Bagging DT?
5. Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
6. Repeat the search
    - Note that you'll have to change parameter names for the base_estimator.
    - Note that there are also additional parameters to change.
    - Note that you may end up with a grid space too large to search in a short time.
    - Make use of the n_jobs parameter to speed up your grid search.
7. Does the score improve for the Grid-searched Bagging Regressor?
8. Which score is better? Are the scores significantly different? How could/would you judge that?
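
Both regression searches can be sketched together as below. The grids `dt_params` and `bag_params` are small illustrative assumptions chosen so the search finishes quickly, and `prefix` is a hypothetical helper that selects the parameter name the installed scikit-learn version uses for the wrapped estimator.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, Y = load_diabetes(return_X_y=True)

# Grid search for the single tree (assumed values).
dt_params = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}
gsdt = GridSearchCV(DecisionTreeRegressor(random_state=0), dt_params, cv=5)
gsdt.fit(X, Y)

# Grid search for the bagged trees; random_state fixed for reproducibility.
bagged = BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0)

# "estimator" in newer releases, "base_estimator" in older ones.
prefix = "estimator" if "estimator" in bagged.get_params() else "base_estimator"
bag_params = {
    f"{prefix}__max_depth": [3, None],
    "max_samples": [0.5, 1.0],
    "n_estimators": [10, 20],
}
gsbdt = GridSearchCV(bagged, bag_params, cv=5)
gsbdt.fit(X, Y)

# best_score_ is the mean cross-validated R^2 of the best setting.
print(f"Tree    best R^2: {gsdt.best_score_:.3f}  {gsdt.best_params_}")
print(f"Bagging best R^2: {gsbdt.best_score_:.3f}  {gsbdt.best_params_}")
```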


## [BONUS]: Project 5 data

Repeat the appropriate analysis (classification/regression) for the Project 5 Dataset.