# Decision Trees Exercises

## Introduction

### Agenda
By the end of this lab you will be able to:
+ Create a stratified train/test split
+ Train a decision tree classifier
    + Evaluate it using *classification error metrics*
    + Select a decision tree model using Grid Search Cross Validation
+ Train a decision tree regressor
    + Evaluate it using regression error metrics
    + Select a decision tree model using Grid Search Cross Validation

We will be using the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3-9, with higher being better) and a color (red or white). The file is named `Wine_Quality_Data.csv` and is located in today's `Resources` directory.

In [1]:
import pathlib

import pandas as pd
import numpy as np

from sklearn.model_selection import (
    StratifiedShuffleSplit,
    GridSearchCV
)

from sklearn.tree import (
    DecisionTreeClassifier,
    DecisionTreeRegressor
)
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    mean_squared_error
)

In [2]:
# optional imports
import matplotlib.pyplot as plt
import seaborn as sns

## Question 1

* Import the data and examine the features (EDA).
* We will use all of them to predict `color` (white or red), but the `color` feature will need to be integer encoded.

In [4]:
filepath = pathlib.Path.cwd() / 'Resources' / 'Wine_Quality_Data.csv'
data = pd.read_csv(filepath, sep=',')

In [5]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 6497.0 | 7.215307 | 1.296434 | 3.8 | 6.4 | 7.0 | 7.7 | 15.9 |
| volatile_acidity | 6497.0 | 0.339666 | 0.164636 | 0.08 | 0.23 | 0.29 | 0.4 | 1.58 |
| citric_acid | 6497.0 | 0.318633 | 0.145318 | 0.0 | 0.25 | 0.31 | 0.39 | 1.66 |
| residual_sugar | 6497.0 | 5.443235 | 4.757804 | 0.6 | 1.8 | 3.0 | 8.1 | 65.8 |
| chlorides | 6497.0 | 0.056034 | 0.035034 | 0.009 | 0.038 | 0.047 | 0.065 | 0.611 |
| free_sulfur_dioxide | 6497.0 | 30.525319 | 17.7494 | 1.0 | 17.0 | 29.0 | 41.0 | 289.0 |
| total_sulfur_dioxide | 6497.0 | 115.744574 | 56.521855 | 6.0 | 77.0 | 118.0 | 156.0 | 440.0 |
| density | 6497.0 | 0.994697 | 0.002999 | 0.98711 | 0.99234 | 0.99489 | 0.99699 | 1.03898 |
| pH | 6497.0 | 3.218501 | 0.160787 | 2.72 | 3.11 | 3.21 | 3.32 | 4.01 |
| sulphates | 6497.0 | 0.531268 | 0.148806 | 0.22 | 0.43 | 0.51 | 0.6 | 2.0 |
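
One possible way to integer encode `color` is a plain dictionary mapping. The sketch below runs on a tiny hypothetical stand-in frame (`toy`) rather than the real file, and the choice of which color becomes 0 or 1 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical stand-in for the wine data; the real file has many more rows and columns.
toy = pd.DataFrame({
    "alcohol": [9.4, 10.1, 12.8, 11.0],
    "color": ["red", "white", "white", "red"],
})

# Map the two labels to integers. Which label gets 0 or 1 is an arbitrary choice.
color_codes = {"white": 0, "red": 1}
toy["color"] = toy["color"].map(color_codes).astype(int)

print(toy["color"].tolist())  # → [1, 0, 0, 1]
```

The same two lines should apply to the real `data` frame, assuming its `color` column holds exactly the strings `red` and `white`.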


## Question 2

* Use [`StratifiedShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) to split data into train and test sets that are stratified by wine color, the target for the classifier. If possible, preserve the indices of the split for question 5 below.
* Check the percent composition of each color for both the train and test data sets.
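
A minimal sketch of a stratified split, using synthetic data in place of the wine file. The column names, the 30% test size, and the `random_state` are assumptions made for illustration. Keeping `train_idx` and `test_idx` around is what allows the same split to be reused later.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the wine data (values are random, not real measurements).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, n),
    "pH": rng.normal(3.2, 0.16, n),
    "color": (rng.random(n) < 0.25).astype(int),  # roughly 25% of rows in class 1
})
feature_cols = ["alcohol", "pH"]

# One shuffle split, stratified on the target column.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(toy[feature_cols], toy["color"]))

# The returned indices are positional, so use .iloc.
X_train, y_train = toy[feature_cols].iloc[train_idx], toy["color"].iloc[train_idx]
X_test, y_test = toy[feature_cols].iloc[test_idx], toy["color"].iloc[test_idx]

# Class proportions should be nearly identical in the two sets.
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```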

## Question 3

* Fit a [decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) with no set limits on maximum depth, features, or leaves. Name it `dt`.
* Determine how many nodes are present and what the depth of this (very large) tree is.
* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?
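
The steps above could look roughly like the following sketch. It uses `make_classification` as a synthetic substitute for the wine data, and `measure_error` is a hypothetical helper name, not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the wine features and color.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# No limits on depth, features, or leaves.
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("nodes:", dt.tree_.node_count, "depth:", dt.tree_.max_depth)

def measure_error(y_true, y_pred, label):
    """Collect the classification metrics into one labeled Series (hypothetical helper)."""
    return pd.Series({
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }, name=label)

train_test_error = pd.concat(
    [measure_error(y_train, dt.predict(X_train), "train"),
     measure_error(y_test, dt.predict(X_test), "test")],
    axis=1)
print(train_test_error)
```

Comparing the `train` and `test` columns side by side makes any gap between the two easy to see.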

## Question 4

* Using [grid search with cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), find a decision tree that performs well on the test data set. Name this decision tree model `cvdt`.
* What type of cross validation is being done here?
* Determine the number of nodes and the depth of this tree.
* Measure the errors on the training and test sets as before and compare them to those from the tree in question 3.
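
A sketch of the grid search, again on synthetic data. The parameter grid (`max_depth`, `min_samples_leaf`), the accuracy scoring, and `cv=5` are illustrative assumptions; other grids are equally reasonable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and color.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Assumed search space; small on purpose so the search runs quickly.
param_grid = {
    "max_depth": range(1, 11, 2),
    "min_samples_leaf": [1, 5, 10],
}

# With an integer cv and a classifier, scikit-learn uses stratified k-fold splits.
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid, scoring="accuracy", cv=5)
gs.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters.
cvdt = gs.best_estimator_
print("best params:", gs.best_params_)
print("nodes:", cvdt.tree_.node_count, "depth:", cvdt.tree_.max_depth)

train_acc = accuracy_score(y_train, cvdt.predict(X_train))
test_acc = accuracy_score(y_test, cvdt.predict(X_test))
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```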

## Question 5

* Re-split the data into `X` and `y` parts, this time with `residual_sugar` being the predicted (`y`) data. *Note:* if the indices were preserved from the `StratifiedShuffleSplit` output in question 2, they can be used again to split the data.
* Using grid search with cross validation, find a [decision tree **regression** model](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) that performs well on the test data set. Name the model `dr`.
* Measure the errors on the training and test sets using mean squared error.
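
The regression case follows the same pattern. In this sketch, `make_regression` supplies synthetic data in place of the wine features and `residual_sugar`, and the grid and `neg_mean_squared_error` scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the wine features and residual_sugar.
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Assumed search space for the regressor.
param_grid = {
    "max_depth": range(1, 12, 2),
    "min_samples_leaf": [1, 5, 10],
}

# Scorers are maximized, so mean squared error is used in its negated form.
gs = GridSearchCV(DecisionTreeRegressor(random_state=42),
                  param_grid=param_grid, scoring="neg_mean_squared_error", cv=5)
gs.fit(X_train, y_train)

dr = gs.best_estimator_
train_mse = mean_squared_error(y_train, dr.predict(X_train))
test_mse = mean_squared_error(y_test, dr.predict(X_test))
print("best params:", gs.best_params_)
print("train MSE:", train_mse, "test MSE:", test_mse)
```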

## Question 6 *(Optional)*
* Make a plot of actual *vs.* predicted residual sugar. Either export the data to your plotting environment of choice, or plot inline here.
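
One way to draw such a plot inline with matplotlib is sketched below. The data is synthetic, the fixed `max_depth=5` is an arbitrary assumption, and the axis labels say so explicitly.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the wine features and residual_sugar.
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A fixed-depth tree keeps the sketch short; a grid-searched model works the same way.
model = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_test, y_pred, alpha=0.5)

# Diagonal reference line: points on it are predicted exactly.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, color="gray", linestyle="--")

ax.set(xlabel="Actual residual sugar (synthetic)",
       ylabel="Predicted residual sugar (synthetic)",
       title="Decision tree regressor: actual vs. predicted")
```

In a notebook the figure renders at the end of the cell; in a script, a call to `plt.show()` or `fig.savefig(...)` would be needed.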