# Decision Trees

CART (Classification and Regression Trees)

## What are Decision Trees?

 - Supervised Learning
 - Works for both classification and regression
 - Foundation of Random Forests
 - Attractive because of interpretability
 
***

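Decision Trees work by:
 - Recursively splitting the data on the feature and threshold that most reduce a chosen impurity criterion (for example, Gini impurity or entropy)
 - Stopping when a stopping criterion is met (for example, a maximum depth or a minimum number of samples per node)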
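
To make the splitting step concrete, here is a minimal sketch of a Gini-based split search on a single numeric feature. The helper names `gini` and `best_split` and the toy values are illustrative assumptions, not part of scikit-learn, and real implementations are considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best `value <= threshold` split."""
    n = len(labels)
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Made-up toy data: one feature, two classes
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = [0, 0, 0, 1, 1, 1]

print(gini(species))                       # impurity before splitting
print(best_split(petal_lengths, species))  # best threshold and impurity after splitting
```

A tree builder would apply this search to every feature, split on the best one, and recurse into each child until a child is pure or another stopping criterion is reached.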
 
***

Some **advantages** of decision trees are:
 - Simple to understand and to interpret. Trees can be visualized.
 - Requires little data preparation.
 - Able to handle both numerical and categorical data.
 - Possible to validate a model using statistical tests.
 - Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The **disadvantages** of decision trees include:
 - Overfitting. Mechanisms such as pruning (available in scikit-learn 0.22 and later through the `ccp_alpha` parameter), setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
 - Decision trees can be unstable: small variations in the data can produce a very different tree. Mitigation: use decision trees within an ensemble.
 - Training cannot guarantee returning the globally optimal decision tree. Mitigation: train multiple trees in an ensemble learner.
 - Decision tree learners create biased trees if some classes dominate. Recommendation: balance the dataset prior to fitting.
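
As a rough illustration of the overfitting controls mentioned above, the sketch below fits one unconstrained tree and one tree limited by `max_depth` and `min_samples_leaf`, then compares their sizes. The specific values (3 and 5) are arbitrary choices for demonstration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# No limits: the tree grows until every leaf is pure
unconstrained = DecisionTreeClassifier(random_state=42).fit(X, y)

# Limited depth and a minimum leaf size (values chosen only for illustration)
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X, y)

print("unconstrained:", unconstrained.get_depth(), "levels,",
      unconstrained.get_n_leaves(), "leaves")
print("constrained:  ", constrained.get_depth(), "levels,",
      constrained.get_n_leaves(), "leaves")
```

In practice these hyperparameters would usually be tuned with cross-validation rather than set by hand.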

## Classification



### Training a Decision Tree with Scikit-Learn Library

In [4]:
from sklearn import tree

In [5]:
X = [[0, 0], [1, 2]]
y = [0, 1]

In [6]:
clf = tree.DecisionTreeClassifier()

In [7]:
clf = clf.fit(X, y)

In [9]:
clf.predict([[2., 2.]])

array([1])

In [10]:
clf.predict_proba([[2., 2.]])

array([[0., 1.]])

This means the model assigns a probability of 0% to the first class (label 0) and 100% to the second class (label 1).

In [11]:
clf.predict([[0.4, 1.2]])

array([1])

In [12]:
clf.predict_proba([[0.4, 1.2]])

array([[0., 1.]])

`DecisionTreeClassifier` is capable of both binary and multiclass classification.
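
A small sketch of the multiclass case, using a made-up three-class toy dataset. The names `X_multi`, `y_multi`, and `clf_multi` are hypothetical and chosen so they do not overwrite the variables used in the cells above.

```python
from sklearn import tree

# Three training points, one per class (toy data for illustration only)
X_multi = [[0, 0], [1, 1], [2, 2]]
y_multi = [0, 1, 2]

clf_multi = tree.DecisionTreeClassifier(random_state=0)
clf_multi = clf_multi.fit(X_multi, y_multi)

print(clf_multi.classes_)                    # the class labels seen during fit
print(clf_multi.predict([[2., 2.]]))         # predicted label for one sample
print(clf_multi.predict_proba([[2., 2.]]))   # one probability column per class
```

With more than two classes, `predict_proba` returns one column per entry in `classes_`, in the same order.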

### Applying to Iris Dataset

In [15]:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()

In [18]:
iris.data[0:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [19]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [21]:
X = iris.data[:, 2:]  # petal length and petal width

In [22]:
y = iris.target

In [23]:
clf = tree.DecisionTreeClassifier(random_state=42)

In [24]:
clf = clf.fit(X, y)
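
Once the classifier is fitted, it can be queried the same way as in the toy example. The sketch below is self-contained and repeats the fitting step; the sample measurements (petal length 5.0 cm, petal width 1.5 cm) are an arbitrary assumption chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target

clf = tree.DecisionTreeClassifier(random_state=42).fit(X, y)

# Hypothetical flower: petal length 5.0 cm, petal width 1.5 cm
sample = [[5.0, 1.5]]
predicted_index = clf.predict(sample)[0]
probabilities = clf.predict_proba(sample)[0]

# Map the integer label back to the species name
print(iris.target_names[predicted_index])
print(dict(zip(iris.target_names, probabilities)))
```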

In [74]:
from sklearn.tree import export_graphviz
import graphviz

In [75]:
export_graphviz(clf,
               out_file="tree.dot",
               feature_names=iris.feature_names[2:],
               class_names=iris.target_names,
               rounded=True,
               filled=True)

Run the following command at your command prompt (omit the leading `$`) to convert the `.dot` file to a PNG image:

$ dot -Tpng tree.dot -o tree.png

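This only works if the Graphviz `dot` executable is on your PATH, for example because Graphviz was installed into your Anaconda environment. Otherwise, you can add the Graphviz executable directory to PATH from Python: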

import os

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

Replace the path above with the location of the Graphviz `bin` folder on your machine.

Then display the tree in a Markdown cell with an HTML `img` tag whose `src` is "tree.png". The `width` and `height` attributes scale the image:

<img src="tree.png" width=60% height=60%>
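
If Graphviz is not installed, one alternative is a plain-text rendering of the same tree with `sklearn.tree.export_text` (available in scikit-learn 0.21 and later). The sketch below is self-contained and refits the petal-only classifier from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target

clf = DecisionTreeClassifier(random_state=42).fit(X, y)

# Render the fitted tree as indented if/else style rules
rules = export_text(clf, feature_names=iris.feature_names[2:])
print(rules)
```

Each indented line is a split condition, and each `class:` line is a leaf with its predicted label.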