## Decision Trees

In this classification task, we'd like to distinguish three species of iris flowers (setosa, versicolor, and virginica) based on four measurements: sepal length, sepal width, petal length, and petal width.

The data set ships with scikit-learn, so loading it is straightforward. See the [Wikipedia article on the iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) for more details.

In [3]:
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In [4]:
iris_data = load_iris()
X = iris_data.data
y = iris_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_actual = y_test

In [5]:
print("Features are", iris_data.feature_names)
print("Targets are", iris_data.target_names)
print("Training set size is", len(X_train))
print("Test set size is", len(X_test))
for a, b in zip(X_train[:10], y_train[:10]):
    print("Input = {0}; output = {1} (species = {2})".format(a, b,
                                                             iris_data.target_names[b]))

Features are ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Targets are ['setosa' 'versicolor' 'virginica']
Training set size is 105
Test set size is 45
Input = [ 5.   2.   3.5  1. ]; output = 1 (species = versicolor)
Input = [ 6.5  3.   5.5  1.8]; output = 2 (species = virginica)
Input = [ 6.7  3.3  5.7  2.5]; output = 2 (species = virginica)
Input = [ 6.   2.2  5.   1.5]; output = 2 (species = virginica)
Input = [ 6.7  2.5  5.8  1.8]; output = 2 (species = virginica)
Input = [ 5.6  2.5  3.9  1.1]; output = 1 (species = versicolor)
Input = [ 7.7  3.   6.1  2.3]; output = 2 (species = virginica)
Input = [ 6.3  3.3  4.7  1.6]; output = 1 (species = versicolor)
Input = [ 5.5  2.4  3.8  1.1]; output = 1 (species = versicolor)
Input = [ 6.3  2.7  4.9  1.8]; output = 2 (species = virginica)


## Build and output the model

In the next two steps, we fit a decision tree to the training set and export it to a `.dot` file that can be viewed with Graphviz.

In [6]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [7]:
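# Write the fitted tree to a Graphviz .dot file.
# export_graphviz returns None when out_file is given, so its result is not assigned.
with open("iris.dot", 'w') as f:
    tree.export_graphviz(model,
                         out_file=f,
                         feature_names=iris_data.feature_names,
                         class_names=iris_data.target_names,
                         filled=True,
                         rounded=True,
                         special_characters=True)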
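
If Graphviz is not at hand, a plain-text view of the same kind of tree can be a quick alternative. The sketch below assumes a scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later, which is newer than the version whose output is shown in this notebook). It refits a tree with `random_state=0`, an assumption made only so the printed rules are repeatable; the model above does not fix the seed, so its splits may differ slightly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=0
)

# Assumption: random_state is fixed here only for repeatable output;
# the notebook's own model is built without it.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# One indented line per split or leaf, using the feature names as labels.
rules = export_text(model, feature_names=list(iris_data.feature_names))
print(rules)
```

To view the exported `iris.dot` file itself, the Graphviz command `dot -Tpng iris.dot -o iris.png` converts it to an image, provided Graphviz is installed.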

## Model evaluation

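Let's first print the confusion matrix. Each row corresponds to an actual class and each column to a predicted class, in the order setosa, versicolor, virginica.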

In [8]:
y_pred = model.predict(X_test)
print(confusion_matrix(y_actual, y_pred))

[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]


Now let's print the **precision**, **recall** and **$F_1$ score** for each class.

**Example**: For the "versicolor" class, the precision is equal to the proportion of irises predicted to be "versicolor" that were indeed "versicolor".

The recall is equal to the proportion of irises that are in fact "versicolor" that the classifier correctly predicted to be "versicolor".

If our classifier hypothetically labelled everything as "versicolor", this would give us a low precision and high recall (100%) for this class.

If our classifier labelled only a single iris (where it was absolutely sure of its prediction) as "versicolor", this would give us a high precision (100%) and low recall for this class.
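
The two extremes above can be worked through numerically. This is a small sketch in plain Python, using the test-set class sizes from the confusion matrix (16 setosa, 18 versicolor, 11 virginica); the helper `precision_recall` is a hypothetical name introduced only for this illustration.

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    """Return (precision, recall) for one class from raw counts."""
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    return precision, recall

# Test-set class sizes (row sums of the confusion matrix above).
n_setosa, n_versicolor, n_virginica = 16, 18, 11
n_total = n_setosa + n_versicolor + n_virginica  # 45

# Extreme 1: every iris is labelled "versicolor".
p_all, r_all = precision_recall(
    true_positives=n_versicolor,
    predicted_positives=n_total,
    actual_positives=n_versicolor,
)

# Extreme 2: exactly one iris is labelled "versicolor", and that label is correct.
p_one, r_one = precision_recall(
    true_positives=1,
    predicted_positives=1,
    actual_positives=n_versicolor,
)

print(f"label everything: precision={p_all:.2f}, recall={r_all:.2f}")  # 0.40, 1.00
print(f"label one iris:   precision={p_one:.2f}, recall={r_one:.2f}")  # 1.00, 0.06
```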

Typically, we have to trade off precision against recall based on what is most important for our problem.

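The $F_1$ score is the harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. It weights precision and recall equally, but unlike the arithmetic mean it is pulled toward the smaller of the two, so a class gets a high score only when both are high.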
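
As a check on these definitions, the per-class scores can be recomputed by hand from the confusion matrix printed earlier. The sketch below uses plain Python; `per_class_scores` is a hypothetical helper, and it assumes every class is predicted at least once (otherwise the precision would divide by zero).

```python
# Confusion matrix from the evaluation above: rows = actual, columns = predicted.
cm = [
    [16, 0, 0],
    [0, 17, 1],
    [0, 0, 11],
]
class_names = ["setosa", "versicolor", "virginica"]

def per_class_scores(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]                            # correctly predicted as class k
        predicted_k = sum(row[k] for row in cm)  # column sum: all predicted as k
        actual_k = sum(cm[k])                    # row sum: all that truly are k
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f1))
    return scores

scores = per_class_scores(cm)
for name, (p, r, f1) in zip(class_names, scores):
    print(f"{name:>10}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

For versicolor this gives a precision of 17/17 = 1.00, a recall of 17/18 ≈ 0.94, and an $F_1$ score of about 0.97, which should agree with the classification report.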

In [11]:
print(classification_report(y_actual,
                            y_pred,
                            target_names=iris_data.target_names))

             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        16
 versicolor       1.00      0.94      0.97        18
  virginica       0.92      1.00      0.96        11

avg / total       0.98      0.98      0.98        45



In [None]:
# Class-membership probabilities for each test iris; columns follow model.classes_
model.predict_proba(X_test)