# Decision Trees Exercises

## Introduction

We will be using the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3-9, with higher being better) and a color (red or white). The name of the file is `Wine_Quality_Data.csv`.

In [6]:
from __future__ import print_function
import os

data_path = ['/Users/laragarcia/Desktop/data_ML']

## Question 1

* Import the data and examine the features.
* We will use all of them to predict `color` (white or red), but the `color` feature will need to be integer encoded.
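
As a rough illustration of the integer encoding step, here is a minimal sketch on a tiny made-up frame rather than the wine data. The `toy` frame and the `mapping` dictionary are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({'alcohol': [9.4, 9.8, 10.1, 11.2],
                    'color': ['red', 'white', 'white', 'red']})

# Map each label to an integer. Labels missing from the mapping
# would become NaN, so check for them before casting.
mapping = {'white': 0, 'red': 1}
encoded = toy['color'].map(mapping)
assert encoded.notna().all()
toy['color'] = encoded.astype(int)

print(toy.dtypes)
```

An explicit mapping keeps the meaning of each integer visible in the code, which helps later when reading precision and recall for the positive class.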

In [7]:
import pandas as pd
import numpy as np
filepath = os.sep.join(data_path + ['Wine_Quality_Data.csv'])
data = pd.read_csv(filepath, sep = ',', header = 0)

In [8]:
data.dtypes

fixed_acidity           float64
volatile_acidity        float64
citric_acid             float64
residual_sugar          float64
chlorides               float64
free_sulfur_dioxide     float64
total_sulfur_dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
color                    object
dtype: object

In [9]:
data.head()

|   | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |


In [10]:
# Integer-encode the target: white -> 0, red -> 1
data['color'] = data['color'].map({'white': 0, 'red': 1}).astype(int)

## Question 2

* Use `StratifiedShuffleSplit` to split the data into train and test sets that are stratified by wine color, the target being predicted. If possible, preserve the indices of the split for question 5 below.
* Check the percent composition of each color class in both the train and test data sets.
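
To make the idea of a stratified split concrete, the sketch below runs `StratifiedShuffleSplit` on synthetic, imbalanced labels and compares class proportions in the two parts. The data, the 75/25 class balance, and the variable names are assumptions chosen for illustration, not the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 75% class 0, 25% class 1.
rng = np.random.RandomState(0)
X = pd.DataFrame({'feature': rng.normal(size=400)})
y = pd.Series([0] * 300 + [1] * 100, name='label')

sss = StratifiedShuffleSplit(n_splits=1, test_size=100, random_state=42)
# split() yields positional index arrays, which can be stored and reused later.
train_idx, test_idx = next(sss.split(X, y))

train_share = y.iloc[train_idx].value_counts(normalize=True).sort_index()
test_share = y.iloc[test_idx].value_counts(normalize=True).sort_index()
print(pd.concat([train_share.rename('train'),
                 test_share.rename('test')], axis=1))
```

Because `split` returns positions, `.iloc` is the safe accessor. `.loc` selects the same rows only when the frame still has its default 0..n-1 index.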

In [11]:
# Every column except the target is a feature
feature_col = [x for x in data.columns if x != 'color']

In [16]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits = 1, test_size = 1000, random_state = 42)

train_idx, test_idx = next(sss.split(data[feature_col], data['color']))

X_train = data.loc[train_idx, feature_col]
Y_train = data.loc[train_idx, 'color']

X_test = data.loc[test_idx, feature_col]
Y_test = data.loc[test_idx, 'color']

In [17]:
Y_train.value_counts(normalize = True).sort_index()

0    0.753866
1    0.246134
Name: color, dtype: float64

In [18]:
Y_test.value_counts(normalize = True).sort_index()

0    0.754
1    0.246
Name: color, dtype: float64

## Question 3

* Fit a decision tree classifier with no set limits on maximum depth, features, or leaves.
* Determine how many nodes are present and what the depth of this (very large) tree is.
* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?

In [19]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state = 42)
dt.fit(X_train, Y_train)

DecisionTreeClassifier(random_state=42)

In [20]:
dt.tree_.node_count, dt.tree_.max_depth

(171, 22)

In [22]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def measure_error(y_true, y_pred, label):
    return pd.Series({'accuracy': accuracy_score(y_true, y_pred),
                     'precision': precision_score(y_true, y_pred),
                     'recall': recall_score(y_true, y_pred),
                     'f1': f1_score(y_true, y_pred)}, 
                     name = label)

In [23]:
y_train_predict = dt.predict(X_train)

y_test_predict = dt.predict(X_test)

train_test_error = pd.concat([measure_error(Y_train, y_train_predict, 'train'),
                            measure_error(Y_test, y_test_predict, 'test')], axis = 1)

train_test_error

|   | train | test |
|---|---|---|
| accuracy | 0.999818 | 0.984 |
| precision | 0.999261 | 0.96371 |
| recall | 1.0 | 0.971545 |
| f1 | 0.999631 | 0.967611 |
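
A gap between near-perfect training scores and lower test scores is the usual signature of overfitting: an unrestricted tree keeps splitting until it has memorized the training samples, noise included. The sketch below reproduces that pattern on synthetic data. The data set, the noise level `flip_y=0.1`, and the depths tried are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (flip_y randomly reassigns a fraction of the labels).
X, y = make_classification(n_samples=2000, n_features=12, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, tree.tree_.max_depth, scores[depth])
```

Comparing the train and test accuracy at each depth shows how much of the training fit carries over to unseen data.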


## Question 4

* Using grid search with cross validation, find a decision tree that performs well on the test data set. Use a different variable name for this decision tree model than in question 3 so that both can be used in question 6.
* Determine the number of nodes and the depth of this tree.
* Measure the errors on the training and test sets as before and compare them to those from the tree in question 3.

In [25]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth':range(1, dt.tree_.max_depth +1, 2),
             'max_features':range(1, len(dt.feature_importances_)+1)}

gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                 param_grid = param_grid,
                 scoring = 'accuracy',
                 n_jobs = 1)

gs = gs.fit(X_train, Y_train)

In [26]:
gs.best_estimator_.tree_.node_count, gs.best_estimator_.tree_.max_depth

(99, 7)

In [27]:
y_train_predict_gs = gs.predict(X_train)

y_test_predict_gs = gs.predict(X_test)

train_test_gs_error = pd.concat([measure_error(Y_train, y_train_predict_gs, 'train'),
                            measure_error(Y_test, y_test_predict_gs, 'test')], axis = 1)

train_test_gs_error

|   | train | test |
|---|---|---|
| accuracy | 0.995816 | 0.989 |
| precision | 0.998501 | 0.983539 |
| recall | 0.984479 | 0.971545 |
| f1 | 0.99144 | 0.977505 |
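
Beyond node count and depth, a fitted `GridSearchCV` object also records which hyperparameters won and how they scored during cross validation. A minimal sketch on synthetic data follows. The grid values, `cv=5`, and the `search` name are assumptions for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [2, 4, 6],
                                  'max_features': [2, 4, 8]},
                      scoring='accuracy', cv=5)
search.fit(X, y)

print(search.best_params_)           # winning hyperparameter combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

`best_score_` is computed only from folds of the data passed to `fit`, so a held-out test set stays untouched until the final evaluation.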


## Question 5

* Re-split the data into `X` and `y` parts, this time with `residual_sugar` being the predicted (`y`) data. *Note:* if the indices were preserved from the `StratifiedShuffleSplit` output in question 2, they can be used again to split the data.
* Using grid search with cross validation, find a decision tree **regression** model that performs well on the test data set.
* Measure the errors on the training and test sets using mean squared error.
* Make a plot of actual *vs* predicted residual sugar.

In [29]:
feature_col = [x for x in data.columns if x != 'residual_sugar']

X_train = data.loc[train_idx, feature_col]
Y_train = data.loc[train_idx, 'residual_sugar']

X_test = data.loc[test_idx, feature_col]
Y_test = data.loc[test_idx, 'residual_sugar']

In [31]:
from sklearn.tree import DecisionTreeRegressor

dr = DecisionTreeRegressor().fit(X_train, Y_train)

param_grid = {'max_depth':range(1, dr.tree_.max_depth +1, 2),
             'max_features':range(1, len(dr.feature_importances_)+1)}

dr_sugar = GridSearchCV(DecisionTreeRegressor(random_state=42),
                 param_grid = param_grid,
                 scoring = 'neg_mean_squared_error',
                 n_jobs = -1)

dr_sugar = dr_sugar.fit(X_train, Y_train)

In [32]:
dr_sugar.best_estimator_.tree_.node_count, dr_sugar.best_estimator_.tree_.max_depth

(2891, 13)

In [None]:
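from sklearn.metrics import mean_squared_error

# Predict with the grid-searched regressor fitted above (dr_sugar)
y_train_predict_gs_sugar = dr_sugar.predict(X_train)

y_test_predict_gs_sugar = dr_sugar.predict(X_test)

# Regression target, so report mean squared error instead of classification metrics
train_test_gs_sugar_error = pd.Series({'train': mean_squared_error(Y_train, y_train_predict_gs_sugar),
                                       'test': mean_squared_error(Y_test, y_test_predict_gs_sugar)},
                                      name = 'MSE')

train_test_gs_sugar_error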
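
The last item of this question asks for a plot of actual versus predicted values. One possible way to draw it is sketched below with Matplotlib on synthetic regression data. The data, the `max_depth=5` setting, and all variable names are assumptions; with the wine data, the test-set residual sugar and the model's test predictions would take the place of `y_te` and `y_pred`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the features and the regression target.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_te, y_pred, alpha=0.5)
# Points on the dashed diagonal are predicted exactly.
lims = [min(y_te.min(), y_pred.min()), max(y_te.max(), y_pred.max())]
ax.plot(lims, lims, linestyle='--', color='gray')
ax.set(xlabel='Actual', ylabel='Predicted',
       title='Actual vs. predicted (test set)')
```

Tree regressors predict one constant per leaf, so the points tend to line up in horizontal bands rather than forming a smooth cloud.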

## Question 6 *(Optional)*

This question is optional as it requires an additional command line program (GraphViz) and Python library (PyDotPlus). GraphViz can be installed with a package manager on Linux and Mac. For PyDotPlus, either `pip` or `conda` (`conda install -c conda-forge pydotplus`) can be used to install the library.

Once these programs are installed:

* Create a visualization of the decision tree from question 3, where wine color was predicted and the number of features and/or splits are not limited.
* Create a visualization of the decision tree from question 4, where wine color was predicted but a grid search was used to find the optimal depth and number of features.

The decision tree from question 5 will likely have too many nodes to visualize.
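
A possible starting point for the visualization is sketched below. It uses scikit-learn's `export_graphviz` to produce DOT source for a small tree fitted on synthetic data, then hands it to PyDotPlus if that library and the GraphViz binaries are available. The data, the feature names `f0` to `f3`, and the output file name `small_tree.png` are assumptions for illustration. For the exercise, the fitted trees from questions 3 and 4 and the wine feature names would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_data = export_graphviz(small_tree, out_file=None,
                           feature_names=['f0', 'f1', 'f2', 'f3'],
                           filled=True)

try:
    import pydotplus  # optional; also needs the GraphViz binaries on PATH
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_png('small_tree.png')
except Exception as exc:  # missing library or missing GraphViz executable
    print('Rendering skipped:', exc)
```

Keeping the DOT string separate from the rendering step means the export can still be checked on a machine where GraphViz is not installed.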