# Decision Trees Exercises


# Learning Objectives

- Recognize decision trees and how to use them for classification problems
- Recognize how to identify the best split and the factors for splitting
- Explain strengths and weaknesses of decision trees
- Explain how regression trees help with predicting continuous values
- Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware


## Introduction

We will be using the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3-9, with higher being better) and a color (red or white). The name of the file is `Wine_Quality_Data.csv`.

In [None]:
from __future__ import print_function
import os
data_path = ['data']

# Once this patch has been run, any applicable imports from scikit-learn are replaced
# with the optimized versions from Intel Extension for Scikit-learn

#from sklearnex import patch_sklearn
#patch_sklearn()

from io import StringIO
from IPython.display import Image, display
import time

from sklearn.tree import export_graphviz
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
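
The commented-out patch in the cell above can be enabled along the lines sketched here. This is a minimal sketch, assuming the `sklearnex` package may or may not be installed; the `try`/`except` fallback and the `patched` flag are illustrative additions, not part of the original notebook. The patch should run before the scikit-learn estimators are imported, and which estimators are actually accelerated depends on the installed version.

```python
# Hypothetical guard: fall back to stock scikit-learn when sklearnex is absent.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # should run before importing scikit-learn estimators
    patched = True
except ImportError:
    patched = False

# Import estimators only after the (optional) patch has been applied.
from sklearn.tree import DecisionTreeClassifier

print("Intel Extension for Scikit-learn active:", patched)
```

The rest of the notebook runs unchanged either way, because the patch keeps the scikit-learn API.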

## Exercise 1

* Import the data and examine the features.
* We will be using all of them to predict __`color`__ (white or red), but the color feature will need to be integer encoded.
* Note: you are expected to solve this exercise by replacing the 'xxx' in the following cells.

In [None]:
import pandas as pd
import numpy as np

filepath = os.sep.join(data_path + ['Wine_Quality_Data.csv'])
t = time.process_time()
data = pd.read_csv(filepath, sep=',')
elapsed_time = time.process_time() - t

In [None]:
r = data.shape[0]

In [None]:
print("It took", elapsed_time, "seconds to load", r, "rows.")
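
A note on the timer used above: `time.process_time()` reports CPU time and does not include time spent sleeping, while `time.perf_counter()` reports elapsed wall-clock time. The sketch below contrasts the two, using `time.sleep` as a stand-in (an assumption for illustration) for an operation that mostly waits rather than computes.

```python
import time

t_cpu = time.process_time()
t_wall = time.perf_counter()

time.sleep(0.1)  # stand-in for waiting; uses almost no CPU

cpu_elapsed = time.process_time() - t_cpu
wall_elapsed = time.perf_counter() - t_wall

print("CPU time: ", cpu_elapsed)
print("Wall time:", wall_elapsed)
```

If the goal is to report how long the user waited for the file to load, the wall-clock figure is usually the more representative one.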

In [None]:
# Examine the data
data.head()

In [None]:
data.dtypes

Convert the color feature to an integer. This is a quick way to do it using Pandas.

In [None]:
data['xxx'] = data.color.replace('white',0).replace('red',1).astype(int)
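
As an alternative to chained `replace` calls, the same encoding can be written with an explicit mapping. The sketch below uses a small made-up frame (`toy` and the `color_code` column are hypothetical names, not part of the exercise data).

```python
import pandas as pd

# Hypothetical toy data standing in for the wine data set
toy = pd.DataFrame({'color': ['white', 'red', 'white', 'white']})

# Map each label to its integer code: 0 for white, 1 for red
toy['color_code'] = toy['color'].map({'white': 0, 'red': 1}).astype(int)

print(toy)
```

With `map`, any label missing from the dictionary becomes `NaN`, so the `astype(int)` call would raise an error instead of silently passing an unexpected value through.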

## Exercise 2

* The amount of data is small, so we will use a __`StratifiedShuffleSplit`__ to split the data into train and test sets that are stratified by wine color, the target passed to `split` in the cells below. `StratifiedShuffleSplit` is a cross-validator that provides train/test indices. It merges `StratifiedKFold` and `ShuffleSplit` and returns stratified randomized folds that preserve the percentage of samples for each class. Note: as with `ShuffleSplit`, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
* Check the percent composition of each color for both the train and test data sets.
* Note: you are expected to solve this exercise by replacing the 'xxx' in the following cells.

In [None]:
# All data columns except for color
feature_cols = [x for x in data.columns if x != 'color']

In [None]:
# Split the data into two parts with 1000 points in the test data
# This creates the splitter object; its split() method returns a generator
strat_shuff_split = xxx(n_splits=1, test_size=1000, random_state=42)

# Get the index values from the generator
train_idx, test_idx = next(strat_shuff_split.split(data[feature_cols], data['color']))

# Create the data sets
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'color']

X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'color']

Now check the percent composition of each color in the train and test data sets. The data set is mostly white wine, as can be seen below (0 for white, 1 for red).

In [None]:
y_train.value_counts(normalize=True).sort_index()

In [None]:
y_test.value_counts(normalize=True).sort_index()
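
To see what stratification guarantees, here is a minimal sketch on synthetic labels (the 75/25 class mix and the array names are assumptions chosen for illustration). The class proportions in the train and test parts match those of the full data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels: 75 samples of class 0 and 25 of class 1
y = np.array([0] * 75 + [1] * 25)
X = np.arange(100).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=20, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Fraction of class 1 in each part
train_frac = y[train_idx].mean()
test_frac = y[test_idx].mean()

print("class-1 fraction, train:", train_frac)
print("class-1 fraction, test: ", test_frac)
```

An unstratified random split would only match these proportions on average, which matters most when one class is rare.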

## Exercise 3

* Fit a __DecisionTreeClassifier__ with no set limits on maximum depth, features, or leaves.
* Determine how many nodes are present and what the depth of this (very large) tree is.
* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?
* Note: you are expected to solve this exercise by replacing the 'xxx' in the following cells.

In [None]:
dt = xxx(random_state=42)
dt = dt.fit(X_train, y_train)

The number of nodes and the maximum actual depth.

In [None]:
dt.tree_.node_count, dt.tree_.max_depth
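
The sketch below shows the same structural attributes on a small built-in data set (the iris data is an assumption used only for illustration), together with the `get_depth()` and `get_n_leaves()` helper methods. Because every internal node of a scikit-learn tree has exactly two children, the node count equals twice the number of leaves minus one.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits on depth, features, or leaves
tree = DecisionTreeClassifier(random_state=42).fit(X, y)

n_nodes = tree.tree_.node_count
depth = tree.tree_.max_depth
n_leaves = tree.get_n_leaves()

print("nodes:", n_nodes, "depth:", depth, "leaves:", n_leaves)
```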

#### A function to return error metrics.

In [None]:
def measure_error(y_true, y_pred, label):
    # Collect accuracy, precision, recall, and F1 into one labeled Series
    return pd.Series({'accuracy': accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)},
                     name=label)
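
To make the four metrics concrete, the sketch below computes them for a tiny hand-made example (the label arrays are invented for illustration) and compares them with counts taken directly from the predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# Confusion counts for the positive class (label 1)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that are found

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(acc, prec, rec, f1)
```

In the wine exercise the positive class is red (1), so precision and recall describe how well red wines are identified.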

The decision tree predicts a little better on the training data than on the test data, which is consistent with (mild) overfitting. Also notice the perfect recall score for the training data. In many instances, the gap between training and test performance is even greater than that seen here.

In [None]:
# The error on the training and test data sets
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

train_test_full_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                              measure_error(y_test, y_test_pred, 'test')],
                              axis=1)

train_test_full_error
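
The gap between training and test scores tends to be much larger when the labels are noisy. The sketch below illustrates this on synthetic data (the `make_classification` settings, including 20% label noise via `flip_y`, are assumptions chosen for illustration): an unrestricted tree memorizes the training set, while a depth-limited tree cannot.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately noisy labels
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

full_train = accuracy_score(y_tr, full.predict(X_tr))
full_test = accuracy_score(y_te, full.predict(X_te))
shallow_train = accuracy_score(y_tr, shallow.predict(X_tr))
shallow_test = accuracy_score(y_te, shallow.predict(X_te))

print("unrestricted tree: train", full_train, "test", full_test)
print("max_depth=3 tree:  train", shallow_train, "test", shallow_test)
```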

## Exercise 4

* Using __`GridSearchCV`__ with cross validation, find a decision tree that performs well on the test data set. Use a different variable name for this decision tree model than in Exercise 3 so that both can be used in Exercise 6.
* Determine the number of nodes and the depth of this tree.
* Measure the errors on the training and test sets as before and compare them to those from the tree in Exercise 3.
* Note: you are expected to solve this exercise by replacing the 'xxx' in the following cells.

In [None]:
param_grid = {'max_depth':range(1, dt.tree_.max_depth+1, 2),
              'max_features': range(1, len(dt.feature_importances_)+1)}

GR = xxx(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid,
                  scoring='accuracy',
                  n_jobs=-1)

GR = GR.fit(X_train, y_train)

The number of nodes and the maximum depth of the tree.

In [None]:
GR.best_estimator_.tree_.node_count, GR.best_estimator_.tree_.max_depth
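
Besides `best_estimator_`, a fitted `GridSearchCV` exposes the winning parameters and their cross-validated score. A minimal sketch on the built-in iris data (the data set, the grid bounds, and `cv=5` are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

grid = {'max_depth': list(range(1, 6)),
        'max_features': list(range(1, 5))}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid=grid,
                      scoring='accuracy',
                      cv=5)
search.fit(X, y)

# Every combination in the grid is evaluated with cross validation
n_candidates = len(search.cv_results_['params'])

print("candidates tried:", n_candidates)
print("best parameters: ", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

The score in `best_score_` is the mean over the validation folds, so it comes from the data passed to `fit` only and is separate from any held-out test set.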

The test errors are a little better than those of the unrestricted tree, which suggests that the earlier tree overfit the data, but only slightly.

In [None]:
y_train_pred_gr = GR.predict(X_train)
y_test_pred_gr = GR.predict(X_test)

train_test_gr_error = pd.concat([measure_error(y_train, y_train_pred_gr, 'train'),
                                 measure_error(y_test, y_test_pred_gr, 'test')],
                                axis=1)

In [None]:
train_test_gr_error

## Exercise 5

* Re-split the data into `X` and `y` parts, this time with __`residual_sugar`__ being the predicted (`y`) data. *Note:* if the indices were preserved from the `StratifiedShuffleSplit` output in Exercise 2, they can be used again to split the data.
* Using grid search with cross validation, find a decision tree **regression** model that performs well on the test data set.
* Measure the errors on the training and test sets using mean squared error.
* Make a plot of actual *vs* predicted residual sugar.
* Note: you are expected to solve this exercise by replacing the 'xxx' in the following cells.

In [None]:
feature_cols = [x for x in data.columns if x != 'residual_sugar']

# Create the data sets
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'xxx']

X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'xxx']

In [None]:
# Fit an unrestricted tree first; its depth and feature count bound the search grid
dr = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

param_grid = {'max_depth': range(1, dr.tree_.max_depth+1, 2),
              'max_features': range(1, len(dr.feature_importances_)+1)}

GR_sugar = GridSearchCV(DecisionTreeRegressor(random_state=42),
                        param_grid=param_grid,
                        scoring='neg_mean_squared_error',
                        n_jobs=-1)

GR_sugar = GR_sugar.fit(X_train, y_train)

The number of nodes and the maximum depth of the tree. This tree has lots of nodes, which is not so surprising given the continuous data.

In [None]:
GR_sugar.best_estimator_.tree_.node_count, GR_sugar.best_estimator_.tree_.max_depth
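
A regression tree predicts a constant value in each leaf, namely the mean of the training targets that fall into that leaf (with the default squared-error criterion). That is why many leaves are needed to follow a continuous target closely. A minimal sketch on synthetic one-dimensional data (the sine-shaped target and all names are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# A depth-2 tree has at most 4 leaves, so at most 4 distinct predictions
shallow = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
pred = shallow.predict(X)

n_leaves = shallow.get_n_leaves()
n_distinct = len(np.unique(pred))

# Leaf index of every training sample
leaf_ids = shallow.apply(X)

print("leaves:", n_leaves, "distinct predictions:", n_distinct)
```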

The error on train and test data sets. Since this is continuous, we will use mean squared error.

In [None]:
y_train_pred_gr_sugar = GR_sugar.predict(X_train)
y_test_pred_gr_sugar  = GR_sugar.predict(X_test)

train_test_gr_sugar_error = pd.Series({'train': mean_squared_error(y_train, y_train_pred_gr_sugar),
                                         'test':  mean_squared_error(y_test, y_test_pred_gr_sugar)},
                                          name='MSE').to_frame().T

train_test_gr_sugar_error
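
For reference, mean squared error is the average of the squared differences between actual and predicted values. The sketch below checks the scikit-learn result against a direct computation on made-up numbers and also derives the root mean squared error, which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 6.0])

mse_manual = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0 + 1 + 4) / 4
mse_sklearn = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse_sklearn)

print("MSE: ", mse_sklearn)
print("RMSE:", rmse)
```

Note that the grid search above uses `scoring='neg_mean_squared_error'`, so the scores it reports are the negative of this quantity, which lets scikit-learn treat higher as better.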

A plot of actual vs predicted residual sugar.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context('notebook')
sns.set_style('white')
sns.set_palette('dark')

%matplotlib inline

In [None]:
fig = plt.figure(figsize=(6,6))
ax = plt.axes()

# Pair each actual residual sugar value with its prediction
sugar_test_predict = pd.DataFrame({'test': y_test.values,
                                   'predict': y_test_pred_gr_sugar}).set_index('test').sort_index()

sugar_test_predict.plot(marker='o', ls='', ax=ax)
ax.set(xlabel='Test', ylabel='Predict', xlim=(0,35), ylim=(0,35));
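
An actual-vs-predicted plot is easier to read with a diagonal reference line, since points on the line are perfect predictions. The sketch below draws one on synthetic values (the `actual` and `predicted` arrays are invented stand-ins for the test targets and model output).

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for actual and predicted residual sugar
rng = np.random.RandomState(0)
actual = rng.uniform(0, 30, size=100)
predicted = actual + rng.normal(scale=2.0, size=100)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(actual, predicted, marker='o', ls='', label='predictions')

# Diagonal y = x: points on this line are predicted exactly
lims = (0, 35)
ax.plot(lims, lims, color='gray', ls='--', label='perfect prediction')

ax.set(xlabel='Actual', ylabel='Predicted', xlim=lims, ylim=lims)
ax.legend()
```

Points above the line are over-predictions and points below it are under-predictions.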