# Decision Trees Exercises

## Introduction

We will be using the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality score (3 to 9, higher is better) and a color (red or white). The name of the file is `Wine_Quality_Data.csv`.

In [1]:
from __future__ import print_function
import os
data_path = ['../data/']

## Question 1

* Import the data and examine the features.
* We will be using all of them to predict `color` (white or red), but the `color` feature will need to be integer encoded.

In [2]:
import pandas as pd
import numpy as np

filepath = os.sep.join(data_path + ['Wine_Quality_Data.csv'])
data = pd.read_csv(filepath, sep=',')

Convert the color feature to an integer. This is a quick way to do it using Pandas.

In [4]:
# np.int was removed in NumPy 1.24; the built-in int gives the same result
colors = data['color'].replace('white',1).replace('red',0).astype(int)
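
The chained `.replace` calls work, but an explicit dictionary passed to `.map` makes the encoding easier to audit. The following is a minimal sketch on a made-up five-row frame (the `toy` frame and `color_map` names are illustrative, not part of the exercise):

```python
import pandas as pd

# Toy stand-in for the `color` column; the values are made up for illustration.
toy = pd.DataFrame({"color": ["white", "red", "white", "red", "white"]})

# Explicit mapping: any label missing from the dict becomes NaN,
# so check for that before casting to int.
color_map = {"red": 0, "white": 1}
encoded = toy["color"].map(color_map)
assert encoded.notna().all(), "unexpected color label"
encoded = encoded.astype(int)

print(encoded.tolist())
```

Unlike `.replace`, `.map` turns unexpected labels into `NaN` instead of passing them through unchanged, which is why the check comes before the cast.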

## Question 2

* Use `StratifiedShuffleSplit` to split data into train and test sets that are stratified by wine quality. If possible, preserve the indices of the split for question 5 below.
* Check the percent composition of each quality level for both the train and test data sets.

In [7]:
feature_cols = [ cols for cols in data.columns if cols != 'color']

In [11]:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits = 1, test_size = 1000, random_state=42)
train_idx , test_idx = next(sss.split(data[feature_cols],data['color']))

X_train = data.loc[train_idx,feature_cols]
Y_train = data.loc[train_idx,'color']

X_test = data.loc[test_idx,feature_cols]
Y_test = data.loc[test_idx,'color']


In [12]:
Y_train.value_counts(normalize=True)

white    0.753866
red      0.246134
Name: color, dtype: float64

In [13]:
Y_test.value_counts(normalize=True)

white    0.754
red      0.246
Name: color, dtype: float64
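
The two `value_counts` outputs can also be lined up side by side, which makes it easier to see that stratification preserved the class proportions. A small sketch on synthetic labels (750 "white" and 250 "red", chosen only to resemble the proportions above; this is not the wine data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels (an assumption, not the wine data): 750 "white", 250 "red".
labels = pd.Series(["white"] * 750 + ["red"] * 250, name="color")
features = np.zeros((len(labels), 1))  # placeholder; the splitter only needs the row count

splitter = StratifiedShuffleSplit(n_splits=1, test_size=200, random_state=42)
train_idx, test_idx = next(splitter.split(features, labels))

# One column per subset, one row per class
composition = pd.DataFrame({
    "train": labels.iloc[train_idx].value_counts(normalize=True),
    "test": labels.iloc[test_idx].value_counts(normalize=True),
})
print(composition)
```

The same `pd.DataFrame({...})` pattern should work with the notebook's `Y_train` and `Y_test` in place of the synthetic subsets.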

In [17]:
X_train.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4885 | 6.8 | 0.23 | 0.31 | 2.8 | 0.047 | 40.0 | 122.0 | 0.99126 | 3.06 | 0.64 | 10.9 | 7 |
| 5568 | 6.4 | 0.17 | 0.3 | 2.8 | 0.034 | 33.0 | 125.0 | 0.99152 | 3.03 | 0.49 | 10.4 | 6 |
| 224 | 8.4 | 0.635 | 0.36 | 2.0 | 0.089 | 15.0 | 55.0 | 0.99745 | 3.31 | 0.57 | 10.4 | 4 |
| 2655 | 7.5 | 0.21 | 0.34 | 1.2 | 0.06 | 26.0 | 111.0 | 0.9931 | 3.51 | 0.47 | 10.7 | 6 |
| 4989 | 6.9 | 0.29 | 0.3 | 8.2 | 0.026 | 35.0 | 112.0 | 0.99144 | 3.0 | 0.37 | 12.3 | 6 |


## Question 3

* Fit a decision tree classifier with no set limits on maximum depth, features, or leaves.
* Determine how many nodes are present and what the depth of this (very large) tree is.
* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?

In [18]:
from sklearn.tree import DecisionTreeClassifier

tree_cls = DecisionTreeClassifier(random_state = 42)
tree_cls = tree_cls.fit(X_train,Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
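
The repr above shows `criterion='gini'`, so splits are scored by Gini impurity, which is one minus the sum of squared class proportions in a node. A pure-Python sketch of that quantity and of the weighted impurity of a candidate split (the function names and the toy labels are illustrative, not scikit-learn's internals):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

# Toy node: 6 white and 2 red samples, split into a pure left child
# and a mixed right child.
parent = ["white"] * 6 + ["red"] * 2
left, right = ["white"] * 5, ["white"] + ["red"] * 2

print(gini_impurity(parent))        # 0.375
print(split_impurity(left, right))  # about 0.167, lower than the parent
```

A split is attractive when the weighted child impurity is well below the parent's, as it is here.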

In [19]:
tree_cls.tree_.max_depth

19

In [20]:
tree_cls.tree_.node_count

159
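
Besides the `tree_` attributes used above, a fitted tree also exposes `get_depth()` and `get_n_leaves()`. A short sketch on synthetic data (generated with `make_classification`, not the wine data) showing how the node count and leaf count relate:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the wine data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

n_nodes = clf.tree_.node_count
n_leaves = clf.get_n_leaves()
depth = clf.get_depth()  # same value as clf.tree_.max_depth

# Every internal node has exactly two children, so nodes = 2 * leaves - 1
print(n_nodes, n_leaves, depth)
```

By that relation, the 159-node tree above would have 80 leaves.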

In [33]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import accuracy_score

def measure_report(y_true, y_pred, lab):
    # Class-weighted precision, recall and F1, plus overall accuracy
    precision, recall, fscore, _ = score(y_true, y_pred, average='weighted')
    accuracy = accuracy_score(y_true, y_pred)
    return pd.Series({'precision': precision,
                      'recall': recall,
                      'accuracy': accuracy,
                      'f1': fscore}, name=lab)


In [35]:

y_train_pred = tree_cls.predict(X_train)
y_test_pred = tree_cls.predict(X_test)

train_test_full_score = pd.concat([measure_report(Y_train,y_train_pred,'train'),
                                   measure_report(Y_test,y_test_pred,'test')],axis=1)

train_test_full_score

| | train | test |
|---|---|---|
| accuracy | 1.0 | 0.985 |
| f1 | 1.0 | 0.985109 |
| precision | 1.0 | 0.985527 |
| recall | 1.0 | 0.985 |
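
A perfect training score next to a lower test score is the usual signature of an unrestricted tree memorizing its training set. The sketch below reproduces the pattern on deliberately noisy synthetic data (an assumption; the wine data is much cleaner), comparing a fully grown tree with one capped at depth 3:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomly relabels about 10% of rows)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

scores = {}
for depth in (None, 3):  # None = grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(depth, scores[depth])
```

The fully grown tree fits even the flipped labels, so its training accuracy says little about how it will generalize.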


## Question 4

* Using grid search with cross validation, find a decision tree that performs well on the test data set. Use a different variable name for this decision tree model than in question 3 so that both can be used in question 6.
* Determine the number of nodes and the depth of this tree.
* Measure the errors on the training and test sets as before and compare them to those from the tree in question 3.

In [38]:
from sklearn.model_selection import GridSearchCV

params_grid = {'max_depth':range(1,tree_cls.tree_.max_depth+1,2),
              'max_features':range(1,len(tree_cls.feature_importances_)+1)
              }

grid_cv = GridSearchCV(
                      DecisionTreeClassifier(random_state=42),
                      param_grid = params_grid,
                      scoring = 'accuracy',
                      n_jobs=-1)

grid_cv = grid_cv.fit(X_train,Y_train)

In [40]:
grid_cv.best_estimator_.tree_.node_count

129

In [41]:
grid_cv.best_estimator_.tree_.max_depth

11

In [43]:
grid_cv.best_estimator_.max_features_

9
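
The chosen hyperparameters can also be read directly from the fitted search object via `best_params_` and `best_score_`, rather than from the refit estimator. A minimal sketch on synthetic data with a smaller, hypothetical grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small grid, both chosen only for illustration
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 3, 5, 7], "max_features": [2, 4, 8]},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)            # winning combination from the grid
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
# best_estimator_ is refit on all of X, y using best_params_
print(search.best_estimator_.get_depth())
```

Note that `max_depth` is an upper bound: the refit tree can end up shallower than the selected value if its leaves become pure earlier.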

In [44]:
tree_cls.feature_importances_

array([  4.07844502e-02,   1.73914515e-02,   2.55050733e-03,
         1.33314151e-02,   2.04879656e-01,   2.23371209e-03,
         6.71137448e-01,   1.61568078e-02,   1.36748284e-02,
         1.09450126e-02,   6.48578078e-03,   4.28929831e-04])

In [46]:

y_train_pred_grd = grid_cv.predict(X_train)
y_test_pred_grd = grid_cv.predict(X_test)

train_test_full_score_grd = pd.concat([measure_report(Y_train,y_train_pred_grd,'train'),
                                   measure_report(Y_test,y_test_pred_grd,'test')],axis=1)

train_test_full_score_grd

| | train | test |
|---|---|---|
| accuracy | 0.997999 | 0.988 |
| f1 | 0.997997 | 0.988 |
| precision | 0.998002 | 0.988 |
| recall | 0.997999 | 0.988 |


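###### Test accuracy improved slightly (0.985 to 0.988) while train accuracy dropped from 1.0 to 0.998, which suggests the tuned tree overfits less.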

## Question 5

* Re-split the data into `X` and `y` parts, this time with `residual_sugar` being the predicted (`y`) data. *Note:* if the indices were preserved from the `StratifiedShuffleSplit` output in question 2, they can be used again to split the data.
* Using grid search with cross validation, find a decision tree **regression** model that performs well on the test data set.
* Measure the errors on the training and test sets using mean squared error.
* Make a plot of actual *vs* predicted residual sugar.

In [50]:
data.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |


In [55]:
feature_cols = [ cols for cols in data.columns if cols != 'residual_sugar']

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()


# .copy() so the column assignments below modify independent frames, not views of `data`
X_train = data.loc[train_idx,feature_cols].copy()
X_train['color'] = lb.fit_transform(X_train['color'])

Y_train = data.loc[train_idx,'residual_sugar']

X_test = data.loc[test_idx,feature_cols].copy()
X_test['color'] = lb.transform(X_test['color'])
Y_test = data.loc[test_idx,'residual_sugar']


In [56]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree  import DecisionTreeRegressor

dr = DecisionTreeRegressor().fit(X_train,Y_train)

params_grid = {'max_depth':range(1,dr.tree_.max_depth+1,2),
              'max_features':range(1,len(dr.feature_importances_)+1)
              }

grid_cvr = GridSearchCV(
                      DecisionTreeRegressor(random_state=42),
                      param_grid = params_grid,
                      scoring = 'neg_mean_squared_error',
                      n_jobs=-1)

grid_cvr = grid_cvr.fit(X_train,Y_train)

In [57]:
grid_cvr.best_estimator_.tree_.max_depth

23

In [58]:
grid_cvr.best_estimator_.tree_.node_count

7381

In [63]:
from sklearn.metrics import mean_squared_error, r2_score

Y_train_pred = grid_cvr.predict(X_train)
Y_test_pred = grid_cvr.predict(X_test)

tree_report = pd.DataFrame()
tree_report = pd.concat([tree_report,pd.Series({'train_error':mean_squared_error(Y_train,Y_train_pred),
                                               'test_error':mean_squared_error(Y_test,Y_test_pred)}
                                               ,name="MSE")])
tree_report = pd.concat([tree_report,pd.Series({'train_r2_score':r2_score(Y_train,Y_train_pred),
                                               'test_r2_score':r2_score(Y_test,Y_test_pred)
                                               },name="R2_Score")])

In [64]:
tree_report

| | 0 |
|---|---|
| test_error | 3.420587 |
| train_error | 0.001456 |
| test_r2_score | 0.854796 |
| train_r2_score | 0.999935 |
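
For reference, both metrics in the report are simple to compute by hand: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A pure-Python sketch with made-up numbers (the helper names are illustrative and are not the scikit-learn functions):

```python
def mse_by_hand(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values: three perfect predictions and one that is off by 2
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

print(mse_by_hand(y_true, y_pred))  # (0 + 0 + 0 + 4) / 4 = 1.0
print(r2_by_hand(y_true, y_pred))   # 1 - 4/5, approximately 0.2
```

A single large miss is enough to pull R² down sharply, which is one reason to look at both metrics together.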


##### There is a huge gap between the train and test errors (MSE of 0.0015 versus 3.42), which points to overfitting.
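
Question 5 also asks for a plot of actual versus predicted values. The sketch below shows one way to draw it with matplotlib, using synthetic regression data and a fixed `max_depth=5` as stand-ins (both are assumptions); with the notebook's variables, `Y_test` and `Y_test_pred` would take the place of `y_te` and `y_hat`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for the wine features (an assumption)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

reg = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
y_hat = reg.predict(X_te)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_te, y_hat, alpha=0.5)

# Diagonal reference line: points on it are predicted exactly
lims = [min(y_te.min(), y_hat.min()), max(y_te.max(), y_hat.max())]
ax.plot(lims, lims, color="black", linestyle="--")

ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.set_title("Actual vs. predicted (synthetic data)")
```

Points far from the dashed diagonal are the large errors that dominate the test MSE. Calling `fig.savefig(...)` writes the figure to disk when running outside a notebook.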