# Decision Tree

Decision trees are versatile machine learning algorithms that can perform
both classification and regression tasks, and even multioutput tasks. They are
powerful algorithms, capable of fitting complex datasets.

## Making Predictions

Decision trees make predictions in a way that is easy to follow. Suppose
you find an iris flower and you want to classify it based on its petals. You
start at the root node (depth 0, at the top): this node asks whether the flower’s
petal length is smaller than 2.45 cm. If it is, then you move down to the root’s
left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not
have any child nodes), so it does not ask any questions: simply look at the
predicted class for that node, and the decision tree predicts that your flower is
an Iris setosa (class=setosa).

Now suppose you find another flower, and this time the petal length is greater
than 2.45 cm. You again start at the root but now move down to its right child
node (depth 1, right). This is not a leaf node but a split node, so it asks
another question: is the petal width smaller than 1.75 cm? If it is, then your
flower is most likely an Iris versicolor (depth 2, left). If not, it is likely an Iris
virginica (depth 2, right). It’s really that simple.
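The walk-through above can be reproduced with a short sketch. It assumes Scikit-Learn's bundled iris dataset, uses only the petal length and petal width features, and limits the tree to `max_depth=2`; the three example flowers and the `random_state` value are illustrative choices, not values from the text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width, in cm
y = iris.target       # 0 = setosa, 1 = versicolor, 2 = virginica

# A shallow tree, assumed here to mirror the two-level tree described above.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Print the learned rules as plain text, one line per node.
print(export_text(tree_clf, feature_names=["petal length (cm)", "petal width (cm)"]))

# Hypothetical flowers: [petal length, petal width] in cm.
flowers = [[1.4, 0.2], [5.0, 1.5], [6.0, 2.3]]
predicted = tree_clf.predict(flowers)
for flower, label in zip(flowers, predicted):
    print(flower, "->", iris.target_names[label])
```

Each prediction follows a single path from the root to a leaf, so the printed rules are enough to classify a flower by hand.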

A node’s samples attribute counts how many training instances it applies to.
For example, 100 training instances have a petal length greater than 2.45 cm
(depth 1, right), and of those 100, 54 have a petal width smaller than 1.75 cm
(depth 2, left). A node’s value attribute tells you how many training instances
of each class this node applies to: for example, the bottom-right node applies
to 0 Iris setosa, 1 Iris versicolor, and 45 Iris virginica. Finally, a node’s gini
attribute measures its Gini impurity: a node is “pure” (gini=0) if all training
instances it applies to belong to the same class. For example, since the
depth-1 left node applies only to Iris setosa training instances, it is pure and
its Gini impurity is 0.
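These per-node attributes can also be read from a fitted model. The sketch below assumes the same iris setup (petal features only, `max_depth=2`) and reads Scikit-Learn's low-level `tree_` structure. It prints only the sample counts and impurities, because the layout of the per-class `value` array has differed between Scikit-Learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length and width only
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

fitted = tree_clf.tree_  # low-level arrays, indexed by node id
for node_id in range(fitted.node_count):
    # In this structure a leaf is marked by a left-child index of -1.
    kind = "leaf" if fitted.children_left[node_id] == -1 else "split"
    print(
        f"node {node_id} ({kind}): "
        f"samples={fitted.n_node_samples[node_id]}, "
        f"gini={fitted.impurity[node_id]:.3f}"
    )
```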

### Model Interpretation: White box versus black box

Decision trees are intuitive, and their decisions are easy to interpret. Such
models are often called white box models. In contrast, as you will see,
random forests and neural networks are generally considered black box
models. They make great predictions, and you can easily check the
calculations that they performed to make these predictions; nevertheless,
it is usually hard to explain in simple terms why the predictions were
made. For example, if a neural network says that a particular person
appears in a picture, it is hard to know what contributed to this prediction:
Did the model recognize that person’s eyes? Their mouth? Their nose?
Their shoes? Or even the couch that they were sitting on? Conversely,
decision trees provide nice, simple classification rules that can even be applied manually if need be (e.g., for flower classification). The field of
interpretable ML aims at creating ML systems that can explain their
decisions in a way humans can understand. This is important in many
domains—for example, to ensure the system does not make unfair
decisions.

## Estimating Class Probabilities

A decision tree can also estimate the probability that an instance belongs to a
particular class k. First it traverses the tree to find the leaf node for this
instance, and then it returns the ratio of training instances of class k in this
node.
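As a minimal sketch of this, `predict_proba` returns the class ratios of the leaf that an instance falls into. The iris setup and the example flower (petals 5 cm long and 1.5 cm wide) are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length and width only
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Hypothetical flower: petal length 5 cm, petal width 1.5 cm.
flower = [[5.0, 1.5]]
proba = tree_clf.predict_proba(flower)[0]  # one ratio per class
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.3f}")

# The predicted class is simply the one with the highest ratio.
print("predicted:", iris.target_names[proba.argmax()])
```

Every instance that lands in the same leaf gets the same estimated probabilities, no matter how close it is to the decision boundary.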

## The CART Training Algorithm

Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to
train decision trees. The algorithm works by
first splitting the training set into two subsets using a single feature k and a
threshold t (e.g., “petal length ≤ 2.45 cm”). How does it choose k and t? It
searches for the pair (k, t) that produces the purest subsets, weighted by their
size.
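The search for (k, t) can be sketched as a brute-force loop. This is an illustrative implementation rather than Scikit-Learn's optimized one. It assumes Gini impurity as the purity measure and midpoints between consecutive feature values as candidate thresholds; the helper names `gini` and `best_split` and the toy data are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (k, t, cost) minimizing the size-weighted impurity of both subsets."""
    m, n_features = X.shape
    best = (None, None, np.inf)
    for k in range(n_features):
        values = np.unique(X[:, k])  # sorted distinct values of feature k
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints, so no empty side
        for t in thresholds:
            left = X[:, k] <= t
            cost = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / m
            if cost < best[2]:
                best = (k, t, cost)
    return best

# Toy data: feature 0 separates the classes perfectly, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
k, t, cost = best_split(X_toy, y_toy)
print(f"best feature={k}, threshold={t}, weighted impurity={cost}")
```

Applying `best_split` recursively to the left and right subsets, until a stopping condition is met, gives the overall shape of the training procedure.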

Once the CART algorithm has successfully split the training set in two, it
splits the subsets using the same logic, then the sub-subsets, and so on,
recursively. It stops recursing once it reaches the maximum depth (defined by
the max_depth hyperparameter), or if it cannot find a split that will reduce
impurity. A few other hyperparameters (described in a moment) control
additional stopping conditions: min_samples_split, min_samples_leaf,
min_weight_fraction_leaf, and max_leaf_nodes.

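As you can see, the CART algorithm is a greedy algorithm: it searches for the
best split at the current level and then repeats the process at each subsequent
level, without checking whether that split leads to the lowest possible impurity
several levels down. A greedy algorithm often produces a solution that is
reasonably good but not guaranteed to be optimal.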

## Gini Impurity or Entropy?

By default, the DecisionTreeClassifier class uses the Gini impurity measure,
but you can select the entropy impurity measure instead by setting the
criterion hyperparameter to "entropy".



Both Gini impurity and entropy measure the impurity, or uncertainty, of the set of training instances in a node. Gini impurity measures how often a randomly chosen element from the set would be incorrectly labeled if it were labeled at random according to the distribution of labels in the set. Entropy measures the average amount of information contained in each item in the set.

The formula for Gini Impurity is:

$$I_G(p) = 1 - \sum\limits_{i=1}^J p_i^2$$

where $p_i$ is the proportion of samples in the $i$-th class and $J$ is the total number of classes.

The formula for Entropy is:

$$I_H(p) = - \sum\limits_{i=1}^J p_i \log_2(p_i)$$

where $p_i$ and $J$ are defined as for Gini impurity, and terms with $p_i = 0$ are taken to contribute 0 to the sum.

So, should you use Gini impurity or entropy? The truth is, most of the time it
does not make a big difference: they lead to similar trees. Gini impurity is
slightly faster to compute, so it is a good default. However, when they differ,
Gini impurity tends to isolate the most frequent class in its own branch of the
tree, while entropy tends to produce slightly more balanced trees.
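To get a feel for how the two measures behave, here is a small sketch that evaluates both formulas on per-class instance counts. The function names are hypothetical. The first example reuses the class counts 0, 1, and 45 quoted earlier for the bottom-right node; the other two are made up.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity, 1 - sum(p_i^2), from per-class instance counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(counts):
    """Entropy in bits, -sum(p_i * log2(p_i)), from per-class instance counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # classes with p_i = 0 contribute nothing to the sum
    return float(np.sum(p * np.log2(1.0 / p)))  # same as -sum(p * log2(p))

for counts in ([0, 1, 45], [50, 50], [50, 0, 0]):
    print(counts, f"gini={gini_impurity(counts):.4f}", f"entropy={entropy(counts):.4f}")
```

Both measures are 0 for a pure node and grow as the classes become more evenly mixed, which is consistent with the two criteria usually producing similar trees.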


## Regularization Hyperparameters

Decision trees make very few assumptions about the training data (as
opposed to linear models, which assume that the data is linear, for example).
If left unconstrained, the tree structure will adapt itself to the training data,
fitting it very closely—indeed, most likely overfitting it. Such a model is
often called a nonparametric model, not because it does not have any
parameters (it often has a lot) but because the number of parameters is not
determined prior to training, so the model structure is free to stick closely to
the data. In contrast, a parametric model, such as a linear model, has a
predetermined number of parameters, so its degree of freedom is limited,
reducing the risk of overfitting (but increasing the risk of underfitting).

To avoid overfitting the training data, you need to restrict the decision tree’s
freedom during training. As you know by now, this is called regularization.
The regularization hyperparameters depend on the algorithm used, but
generally you can at least restrict the maximum depth of the decision tree. In
Scikit-Learn, this is controlled by the max_depth hyperparameter. The default
value is None, which means unlimited. Reducing max_depth will regularize
the model and thus reduce the risk of overfitting.
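The effect of `max_depth` can be checked with a quick experiment. The sketch below assumes a synthetic two-class dataset from `make_moons`, which is not part of the text; the noise level, split, and depth limit of 3 are arbitrary illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset: two noisy, interleaving half-circles.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
restricted = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

for name, model in [("max_depth=None", unrestricted), ("max_depth=3", restricted)]:
    print(
        f"{name}: depth={model.get_depth()}, "
        f"train accuracy={model.score(X_train, y_train):.3f}, "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; how much the restricted tree helps depends on the dataset.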

## Regression

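Decision trees can also perform regression tasks. A regression tree looks very
similar to a classification tree. The
main difference is that instead of predicting a class in each node, it predicts a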
value. For example, suppose you want to make a prediction for a new
instance with x = 0.2. The root node asks whether x ≤ 0.197. Since it is not,
the algorithm goes to the right child node, which asks whether x ≤ 0.772.
Since it is, the algorithm goes to the left child node. This is a leaf node, and it
predicts value=0.111. This prediction is the average target value of the 110
training instances associated with this leaf node, and it results in a mean
squared error equal to 0.015 over these 110 instances.
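The following sketch checks that a regression tree's prediction is the mean target value of the training instances in the matching leaf. It assumes a noisy quadratic toy dataset generated on the spot, so the thresholds and values it prints will not match the numbers quoted above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a noisy quadratic curve over a single feature x.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 1))
y = 4 * (X[:, 0] - 0.5) ** 2 + rng.normal(0, 0.1, size=200)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

x_new = np.array([[0.2]])
prediction = tree_reg.predict(x_new)[0]

# apply() returns the id of the leaf that each instance ends up in.
leaf_id = tree_reg.apply(x_new)[0]
in_same_leaf = tree_reg.apply(X) == leaf_id

leaf_mean = y[in_same_leaf].mean()
leaf_mse = np.mean((y[in_same_leaf] - leaf_mean) ** 2)
print(f"prediction={prediction:.3f}, leaf mean={leaf_mean:.3f}")
print(f"samples in leaf={in_same_leaf.sum()}, leaf MSE={leaf_mse:.4f}")
```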


## Limitations

Decision trees are easy to understand and interpret, simple to use, versatile, and powerful. However, they have a few limitations.

### Sensitivity to Axis Orientations

First, as
you may have noticed, decision trees love orthogonal decision boundaries (all
splits are perpendicular to an axis), which makes them sensitive to the data’s
orientation.
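A small experiment makes this sensitivity visible. The sketch assumes a synthetic dataset whose classes are separated by a vertical line, then rotates the same points by 45 degrees; the data, the seed, and the angle are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data: points in a square, with the class given by the sign of x0.
rng = np.random.default_rng(6)
X = rng.uniform(-0.5, 0.5, size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Rotate every point by 45 degrees; the labels stay the same.
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
X_rotated = X @ rotation

tree_original = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y)

print("nodes (axis-aligned boundary):", tree_original.tree_.node_count)
print("nodes (rotated boundary):", tree_rotated.tree_.node_count)
```

On the original data a single split is enough, whereas the rotated data forces the tree to approximate a diagonal boundary with a staircase of axis-aligned splits.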

### High Variance

More generally, the main issue with decision trees is that they have quite a
high variance: small changes to the hyperparameters or to the data may
produce very different models. Luckily, by averaging predictions over many trees, it’s possible to reduce
variance significantly. Such an ensemble of trees is called a random forest.
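As a rough sketch of the idea, the code below trains a single tree and a random forest on the same data and prints their test accuracy. The `make_moons` dataset and all hyperparameter values are assumptions made for illustration, and the outcome on other data may differ.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset: two noisy, interleaving half-circles.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# The forest trains many trees on random resamples and averages their predictions.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"single tree test accuracy: {single_tree.score(X_test, y_test):.3f}")
print(f"random forest test accuracy: {forest.score(X_test, y_test):.3f}")
```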