<h1>Decision Trees and Machine Learning</h1>

<li>Decision trees are tree structures containing rules</li>
<li>The leaf nodes of the tree are the "learned" categories (or threshold values)</li>
<li>A path from the root to a leaf node represents a rule (or a decision path)</li>
<li>Leaf nodes represent predictions (either a category or the mean (expected) value of the dependent variable)</li>
<li>Each case filters through the tree from the root to a leaf node to get a prediction</li>
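As a minimal sketch of how a case filters from the root to a leaf, the snippet below hand-writes a tiny tree as nested dictionaries and walks one case through it. The tree, its thresholds, and the helper name `predict_case` are all made up for illustration; they are not learned from any data.

```python
# A tiny hand-written tree. Internal nodes test one feature against a
# threshold; leaves hold the prediction. All values here are made up.
toy_tree = {
    "feature": "alcohol", "threshold": 10.5,
    "left": {"leaf": "bad"},                      # alcohol <= 10.5
    "right": {                                    # alcohol > 10.5
        "feature": "volatile acidity", "threshold": 0.4,
        "left": {"leaf": "good"},                 # volatile acidity <= 0.4
        "right": {"leaf": "bad"},                 # volatile acidity > 0.4
    },
}

def predict_case(node, case):
    """Filter one case from the root down to a leaf, recording the path taken."""
    path = []
    while "leaf" not in node:
        go_left = case[node["feature"]] <= node["threshold"]
        op = "<=" if go_left else ">"
        path.append(f'{node["feature"]} {op} {node["threshold"]}')
        node = node["left"] if go_left else node["right"]
    return node["leaf"], path

prediction, rule = predict_case(toy_tree, {"alcohol": 11.2, "volatile acidity": 0.3})
print(prediction, "because", " and ".join(rule))
```

The list of conditions collected along the way is exactly the "rule" that the path from the root to that leaf represents.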

<span style="color:green;font-size:xx-large">Why decision trees?</span>
<li>Easy to understand </li>
<li>Rule finding process is transparent</li>
<li>Can handle "mixed" categorical (male/female) and numerical (age, number of siblings) data</li>
<li>Can handle missing data </li>
<li>Can be used to generate partial "good" solutions</li>
<li>Can find non-linear patterns in the data</li>

<span style="color:green;font-size:xx-large">Decision trees and non-linear patterns</span>
<p></p>
<img src="linear_model.png">
<img src="linear_classifier_accuracy.png">
<img src="non_linear_classifier.png">
<img src="decision_tree.png">

<span style="color:green;font-size:xx-large">Why not decision trees?</span>
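<li>Finding an optimal tree is a computationally hard problem, so in practice trees are grown greedily, one split at a time</li>
<li>Overfitting is a problem: a tree that is allowed to keep splitting will memorize the training data</li>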

<h1>Example: Predicting wine quality</h1>
<li><span style="color:blue">Input features</span>: Chemical properties of wines</li>
<li><span style="color:blue">Wine quality</span>: A number between 0 and 10</li>

<span style="color:green;font-size:x-large">Import the data</span>

In [1]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
import pandas as pd
from pandas import DataFrame
w_df = pd.read_csv(url,header=0,sep=';')
w_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


<span style="color:green;font-size:x-large">Examine the dependent variable</span>
<li>Higher dependent variable (dv) values indicate a better quality wine</li>
<li>Lower dv values indicate a poorer quality wine</li>
<li>We'll assume that the values are continuous</li>
<li>And use the various features to predict wine quality</li>

In [None]:
w_df['quality'].unique()

<span style="color:green;font-size:x-large">Build train and test samples</span>

In [None]:
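from sklearn.model_selection import train_test_split
#Hold out 30% of the rows for testing (the split is random, so results vary from run to run)
train, test = train_test_split(w_df, test_size = 0.3)
#The first 11 columns are the features; the last column (quality) is the target
x_train_wine = train.iloc[:,0:11]
y_train_wine = train[['quality']]
x_test_wine = test.iloc[:,0:11]
y_test_wine = test[['quality']]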



<span style="color:green;font-size:xx-large">Choose a model</span>
<p></p>
<ul>
<li><b>Classification trees</b>: Use rules to classify cases into two or more categories (classify handwritten digits)</li>
<ul>
<li>Classification trees recursively split the data on a feature value</li>
<li>Each split minimizes the impurity of the resulting subsets (entropy is one measure of impurity)</li>
<li>Impurity is commonly measured using the Gini index (the probability of misclassifying a randomly chosen case, a measure of 'purity')</li>
    <li>Classifiers are used when the target variable is a set of unordered categories (handwritten digit recognition)</li>
</ul>
<li><b>Regression trees</b>: Use rules to group data into target variable ranges (Wine Quality)</li>
<ul>
<li>Also split the data on feature values</li> 
    <li>Minimize cost (impurity), usually the mean squared error</li>
    <li>Regression trees are used when the target variable is continuous and ordered (wine quality from 0 to 10)</li>
    </ul>
</ul>

<span style="color:green;font-size:x-large">Regression trees</span>
<li>For each feature X, consider candidate split points along its range</li>
<li>Each split point divides the data into two halves; the prediction for each half is the mean of the dependent variable in that half</li>
<li>Calculate the mean squared error of each of the two halves for each split</li>
<li>Pick the feature and split point that give the lowest combined mse, then repeat the process within each half</li>
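The split search for a single feature can be sketched as below. This is an illustrative simplification, not scikit-learn's actual implementation: the helper name `best_split` and the toy alcohol/quality numbers are made up, and the candidate thresholds are assumed to be the midpoints between neighboring feature values.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best single split on feature x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_cost = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y[:i], y[i:]
        # Combined cost: each half's MSE around its own mean, weighted by size
        cost = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if cost < best_cost:
            best_threshold, best_cost = (x[i] + x[i - 1]) / 2, cost
    return best_threshold, best_cost

# Toy data (made up): quality jumps once alcohol passes about 10.5
alcohol = np.array([9.0, 9.5, 10.0, 11.0, 11.5, 12.0])
quality = np.array([5.0, 5.0, 5.0, 7.0, 7.0, 7.0])
threshold, cost = best_split(alcohol, quality)
print("best threshold:", threshold, "combined mse:", cost)
```

A full tree builder would run this search over every feature, keep the best one, and then recurse into the two halves.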

<span style="color:green;font-size:xx-large">Decision trees and Entropy minimization</span>
<p></p>
<li><b>Entropy</b>: a measure of uncertainty in the data</li>
<ul>
    <li>What is the uncertainty in color when you draw a marble from a box of 100 blue marbles?</li>
    <li>What is the uncertainty when you draw a marble from a box with 50 blue and 50 red marbles?</li>
</ul>
<li>Entropy minimization: decision tree algorithms seek to partition the data on features so that total entropy is minimized</li>
<li>In the case of regression trees, mean squared error serves as a proxy for entropy</li>
<li>In the case of classification trees, Gini, a measure of how often a randomly chosen element would be misclassified, is used as a proxy for entropy (you can also measure entropy directly, but it is somewhat more expensive to compute because it requires logarithms)</li>
<p></p>
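To make the marble question concrete, here is a small sketch that computes both entropy (in bits) and Gini impurity from class counts. The helper names `entropy` and `gini` are illustrative, and the formulas used are the standard Shannon entropy and Gini impurity definitions.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits of a distribution given by class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()                 # drop empty classes, convert to probabilities
    return float(np.sum(p * np.log2(1.0 / p)))

def gini(counts):
    """Gini impurity: chance of mislabeling a random draw using the class frequencies."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(1.0 - np.sum(p ** 2))

boxes = {"100 blue": [100], "50 blue / 50 red": [50, 50], "90 blue / 10 red": [90, 10]}
for name, counts in boxes.items():
    print(f"{name:>18}: entropy={entropy(counts):.3f} bits, gini={gini(counts):.3f}")
```

The all-blue box has no uncertainty on either measure, while the 50/50 box is the most uncertain two-color box: one full bit of entropy and a Gini impurity of 0.5.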
<span style="color:red;font-size:large">The danger of entropy minimization</span>
<p></p>
<li>In the degenerate case, we can build rules that partition the data into single case subsets</li>
<li>The resulting combined entropy will be zero!</li>
<li>But the results will be useless because we will likely not be able to predict anything if we get a new case</li>
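The degenerate case is easy to reproduce. The sketch below uses synthetic data (an assumption made purely for illustration) in which the labels are coin flips with no relationship to the features, and fits a classifier with no depth or leaf-size limits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))          # features: pure noise
y = rng.integers(0, 2, size=400)       # labels: coin flips, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# No depth or leaf-size limits: the tree keeps splitting until every leaf is pure
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = unrestricted.score(X_train, y_train)
test_acc = unrestricted.score(X_test, y_test)
print("leaves:", unrestricted.get_n_leaves())
print("training accuracy:", train_acc)
print("testing accuracy:", test_acc)
```

The tree classifies its own training data perfectly even though there is nothing to learn, so the testing accuracy should hover around chance.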


<span style="color:green;font-size:x-large">Tree depth: Stopping and Pruning Rules</span>
<li>In the degenerate case, a decision tree algorithm can build a tree with exactly one training case in each leaf node</li>
<li>This would be pointlessly overfitted!</li>
<p></p>
<span style="color:green;font-size:x-large">Controlling for overfitting</span>

<li>Set a minimum count of observations in each leaf node. If the number of observations falls below the minimum, don't split the node any further</li>
<li>Set a maximum tree <b>depth</b>. Once a path reaches a certain length, stop splitting that path</li>
<li>Minimize <b>complexity cost</b>. Complexity cost in a decision tree is a function of the overall misclassification rate of the tree (we want the overall misclassification rate to be low) and the number of leaf nodes (we don't want too many categories because of the overfitting danger)</li>
<li>Various other parameters (cf. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html</a>)</li>
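The effect of these controls can be compared on a small example. The sketch below assumes synthetic data (one real step in the first feature plus noise) and uses scikit-learn's `min_samples_leaf`, `max_depth`, and `ccp_alpha` parameters; `ccp_alpha` is scikit-learn's cost-complexity pruning knob, and for a regressor the cost is based on squared error rather than misclassification rate. The specific values are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * (X[:, 0] > 5) + rng.normal(scale=0.5, size=500)   # one real step plus noise

settings = {
    "no limits": {},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_depth=3": {"max_depth": 3},
    "ccp_alpha=0.01": {"ccp_alpha": 0.01},
}
leaf_counts = {}
for label, params in settings.items():
    model = DecisionTreeRegressor(random_state=0, **params).fit(X, y)
    leaf_counts[label] = model.get_n_leaves()
    print(f"{label:>20}: depth={model.get_depth()}, leaves={leaf_counts[label]}")
```

Each control shrinks the tree in a different way: the first two stop splitting early, while cost-complexity pruning grows the tree and then removes branches that do not reduce the error enough to justify the extra leaves.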

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree  #used later for tree.plot_tree
#Limit the depth and the leaf/split sizes to guard against overfitting
wine_rgr = DecisionTreeRegressor(max_depth=3,min_samples_leaf=20,min_samples_split=50)
wine_rgr.fit(x_train_wine,y_train_wine)

<span style="color:green;font-size:x-large">Evaluating a regressor</span>
<li>Since a regressor is predicting continuous values of the dv, we can evaluate it like a linear regression model</li>
<li><span style="color:blue">R-Square</span> tells us how much of the variance in the data is explained by our model (how well the model fits the data)</li>
<li><span style="color:blue">mean squared error</span> is the average squared distance between our predictions and the actuals (this is better as a model comparison tool)</li>
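Both measures are simple to compute by hand, which may help show what they mean. The sketch below uses a handful of made-up actual and predicted quality scores; `mean_squared_error` and `r2_score` from `sklearn.metrics` are imported only so the hand calculation can be checked against them.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

actual = np.array([5.0, 6.0, 5.0, 7.0, 4.0])       # made-up quality scores
predicted = np.array([5.5, 5.5, 5.0, 6.5, 5.0])    # made-up model output

mse = np.mean((actual - predicted) ** 2)
ss_res = np.sum((actual - predicted) ** 2)          # variation the model fails to explain
ss_tot = np.sum((actual - actual.mean()) ** 2)      # total variation around the mean
r_square = 1 - ss_res / ss_tot

print("MSE:", mse, "RMSE:", np.sqrt(mse))
print("R-Square:", r_square)
print("sklearn agrees:",
      np.isclose(mse, mean_squared_error(actual, predicted)),
      np.isclose(r_square, r2_score(actual, predicted)))
```

Taking the square root of the mean squared error (RMSE) puts the error back in the units of the dependent variable, here quality points.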

In [None]:
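from sklearn.metrics import mean_squared_error

#Get the R-Square for the predicted vs actuals on the training and test samples
print("Training R-Square",wine_rgr.score(x_train_wine,y_train_wine))
print("Testing R-Square",wine_rgr.score(x_test_wine,y_test_wine))
#Get the mean squared error (score() returns R-Square, so use mean_squared_error)
print("Training mean sq error",mean_squared_error(y_train_wine,wine_rgr.predict(x_train_wine)))
print("Testing mean sq error",mean_squared_error(y_test_wine,wine_rgr.predict(x_test_wine)))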

<span style="color:green;font-size:x-large">Visualizing the tree</span>
<li>sklearn has a handy function <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html">plot_tree</a> for visualizing a tree</li>

In [None]:
import matplotlib.pyplot as plt

#associate column names with features
feature_names = w_df.columns[:-1]

#set up a figure (mainly for the size)
fig, ax = plt.subplots(figsize=(24, 28))

#plot the tree
tree.plot_tree(wine_rgr,feature_names=feature_names, max_depth=4, fontsize=12)

plt.show()
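When a picture is not convenient, the same rules can be printed as text with scikit-learn's `export_text`. The sketch below is self-contained and assumes synthetic stand-in data with two columns labeled "alcohol" and "pH"; in the notebook the same call could be applied to the fitted wine regressor and its real feature names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)
# Synthetic stand-in data: only the first column actually drives the target
X = np.column_stack([rng.uniform(8, 14, size=300), rng.uniform(2.8, 3.8, size=300)])
y = 5.0 + 1.5 * (X[:, 0] > 11) + rng.normal(scale=0.3, size=300)

small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Each indented line is one condition; each "value" line is a leaf prediction
rules = export_text(small_tree, feature_names=["alcohol", "pH"])
print(rules)
```

Reading any branch from the outermost condition to a "value" line gives one complete decision rule.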



<h1>Classification trees</h1>



<span style="color:green;font-size:xx-large">In-class problem: Classifying wine into good or bad wine</span>
<li>Assume any wine with quality less than 5.5 is a bad wine</li>
<li>And any wine with quality greater than or equal to 5.5 is a good wine</li>
<li>Build a classifier that classifies the wine data into good and bad wines</li>
<li>Use the training data to build the classifier and report the accuracy <span style="color:red">score</span> on the testing data</li>

In [None]:
#Build train and test data
x_train_wine = train.iloc[0:,0:11]
y_train_wine = train[['quality']]
x_test_wine = test.iloc[0:,0:11]
y_test_wine = test[['quality']]

#Convert y values into categorical data
y_train_wine_cat = y_train_wine >= 5.5
y_test_wine_cat = y_test_wine >= 5.5

#Set up and fit a model 
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
wine_clf = DecisionTreeClassifier(max_depth=4)
#fit the data to the classifier
wine_clf.fit(x_train_wine,y_train_wine_cat)

#Report the accuracy score
training_accuracy = wine_clf.score(x_train_wine,y_train_wine_cat)
testing_accuracy = wine_clf.score(x_test_wine,y_test_wine_cat)
print(training_accuracy)
print(testing_accuracy)

#Render the decision tree
import matplotlib.pyplot as plt
feature_names = w_df.columns[:-1]
fig, ax = plt.subplots(figsize=(24, 28))
tree.plot_tree(wine_clf,feature_names=feature_names, max_depth=4, fontsize=12)

plt.show()
