# Decision Trees

Decision trees are popular because they are simple to understand and interpret. There are several algorithms to construct them (e.g. [ID3](https://en.wikipedia.org/wiki/ID3_algorithm), [C4.5](https://en.wikipedia.org/wiki/C4.5_algorithm)). Here we will describe [CART](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees_.28CART.29) (Classification And Regression Trees).

We will start with regression problems. Our input $x$ consists of $m$ features:

$$
x = \left(
    \begin{matrix} 
    x_1 \\
    \vdots \\
    x_m
    \end{matrix}
    \right)
$$

Note that, unlike in linear and logistic regression, we don't need the feature $x_0$.
A decision tree will split the feature space into a set of rectangles, like in the figure below.

![title](img/decision_trees_regions.png)

This same model can be represented by a binary tree:

![title](img/decision_trees_binary_tree.png)

The corresponding model then predicts $y$ with a constant $c_p$ in each region $R_p$:

$$
f(x) = \sum_{p=1}^{5}c_p \begin{cases} 
1 & \text{if } x \in R_p \\
0 & \text{if } x \notin R_p
\end{cases}
$$

Or in general, if we have already found $P$ partitions:

$$
f(x) = \sum_{p=1}^{P}c_p \begin{cases} 
1 & \text{if } x \in R_p \\
0 & \text{if } x \notin R_p
\end{cases}
$$
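The sum over indicator terms can be sketched in a few lines of Python. This is only an illustration: the three regions, their bounds, and the constants are made-up values (they are not taken from the figures above), and each region is assumed to be a half-open axis-aligned box in a two-feature space.

```python
import numpy as np

# Hypothetical regions: each is (lower, upper) describing the half-open box
# [lower, upper) in a 2-feature space. Together they tile the unit square.
regions = [
    (np.array([0.0, 0.0]), np.array([0.5, 1.0])),
    (np.array([0.5, 0.0]), np.array([1.0, 0.4])),
    (np.array([0.5, 0.4]), np.array([1.0, 1.0])),
]
# Hypothetical constants c_p, one per region.
constants = np.array([1.0, 3.0, 5.0])


def predict(x):
    """Evaluate f(x) = sum_p c_p * 1[x in R_p]."""
    indicators = np.array([
        float(np.all(lower <= x) and np.all(x < upper))
        for lower, upper in regions
    ])
    return float(indicators @ constants)
```

Because the regions do not overlap, at most one indicator is non-zero for any $x$, so the sum simply selects the constant of the region that contains $x$.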

If we use, like in linear regression, the sum of squares ($\sum(y^{(i)} - f(x^{(i)}))^2$) as our criterion for minimization, the best $c_p$ becomes the average of $y^{(i)}$ in region $R_p$:

$$
N_p = \#\{x^{(i)} \in R_p\}
$$

$$
c_p = \frac{1}{N_p} \sum_{x^{(i)} \in R_p} y^{(i)}
$$
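To see why the average is optimal, note that the regions do not overlap, so the sum of squares splits into one independent term per region. For a fixed region $R_p$, setting the derivative with respect to $c_p$ to zero gives:

$$
\frac{\partial}{\partial c_p} \sum_{x^{(i)} \in R_p} (y^{(i)} - c_p)^2 = -2 \sum_{x^{(i)} \in R_p} (y^{(i)} - c_p) = 0
\quad \Rightarrow \quad
N_p \, c_p = \sum_{x^{(i)} \in R_p} y^{(i)}
$$

The second derivative is $2 N_p > 0$, so this stationary point is indeed a minimum.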

To find the best binary partition, characterized by a splitting variable $j$ ($1 \leq j \leq m$) and a split point $s$, define a pair of half-planes:

$$
R_1(j, s) = \{x | x_j < s\}
$$

$$
R_2(j, s) = \{x | x_j \geq s\}
$$

The best splitting variable $j$ and split point $s$ will solve:

$$
\min_{j, s} \Big[\min_{c_1} \sum_{x^{(i)} \in R_1(j, s)} (y^{(i)} - c_1)^2 + \min_{c_2} \sum_{x^{(i)} \in R_2(j, s)} (y^{(i)} - c_2)^2 \Big]
$$

As we have already seen, the inner minimization is solved by taking the average of $y^{(i)}$ in the corresponding region.
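
The search over $j$ and $s$ can be sketched as a brute-force loop. This is a minimal illustration, not a reference CART implementation: the function name `best_split` is hypothetical, the column index is 0-based (so it is $j - 1$ in the notation above), and the choice of candidate split points (midpoints between consecutive distinct feature values) is an assumption that the text does not prescribe.

```python
import numpy as np


def best_split(X, y):
    """Return (j, s, loss) minimizing the summed squared error of one split.

    X has shape (n, m), y has shape (n,). j is a 0-based column index.
    """
    best_j, best_s, best_loss = None, None, np.inf
    n, m = X.shape
    for j in range(m):
        # Assumed candidate split points: midpoints between consecutive
        # distinct values, which keeps both regions non-empty.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] < s]     # targets in R_1(j, s)
            right = y[X[:, j] >= s]   # targets in R_2(j, s)
            # Inner minimization: the best constant is the region mean.
            loss = (np.sum((left - left.mean()) ** 2)
                    + np.sum((right - right.mean()) ** 2))
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss


# Toy data (made up): the target depends only on the first feature.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
j, s, loss = best_split(X, y)
```

On this toy data the first feature separates the targets perfectly, so the search picks that column with a split point between 2 and 3. Recomputing the means from scratch for every candidate keeps the sketch short at the cost of extra work per feature; sorting each feature once and updating running sums would be faster.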