# Decision Trees 

## ***Vocabulary***

none yet

# Lecture Notes #

## ***Introduction and Construction***

A decision tree represents a boolean function (its output is true or false). Each internal node of the tree is labeled with a literal, and each leaf holds a fixed value, which is the output.

The size of the decision tree will be the number of nodes in the tree. The depth (height) of the tree is equal to the length of the longest path from the root to a leaf.

*Note that for an input going into a decision tree, the x is referred to as a "challenge", and the y a "label".*

Topics:
- Heuristics for learning decision trees
- Theoretical properties

---

**Example input: X ∈ {0, 1}<sup>n</sup> (a bit string of length n)**

The decision tree is going to encode some function f(x) into {0, 1} as follows:

- At each node, the tree decides which branch to take based on the value of the literal, until it reaches the leaf.

The example decision tree's depth = 2, and size = 3.

<br>
<center>
    <img src="images/1.2.1.png" alt="Professor Notes" />
</center>
<br>
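
As a rough illustration of how such a tree is evaluated, here is a minimal Python sketch. The `Leaf` and `Node` classes, the 0-based bit indices, and the sample tree are assumptions made for this example; the sample tree is not meant to reproduce the figure above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: int  # fixed output stored at the leaf, 0 or 1


@dataclass
class Node:
    index: int                     # which bit x[index] this node queries
    if_zero: "Union[Leaf, Node]"   # subtree followed when x[index] == 0
    if_one: "Union[Leaf, Node]"    # subtree followed when x[index] == 1


def evaluate(tree, x):
    """Follow one branch per node until a leaf is reached, then output its value."""
    while isinstance(tree, Node):
        tree = tree.if_one if x[tree.index] == 1 else tree.if_zero
    return tree.value


def depth(tree):
    """Length (in edges) of the longest root-to-leaf path."""
    if isinstance(tree, Leaf):
        return 0
    return 1 + max(depth(tree.if_zero), depth(tree.if_one))


# Hypothetical tree computing (x[0] AND x[1]).
example = Node(0, Leaf(0), Node(1, Leaf(0), Leaf(1)))
outputs = [evaluate(example, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Here `outputs` lists the tree's answer on all four 2-bit inputs, and `depth(example)` follows the definition of depth given earlier.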

---

#### **The machine learning problem:**
- Given a set of labeled examples, build a tree with low error

<br>

---

<br>

**Տ** = training set, where Տ is a collection of strings and 0, 1 labels.

- So Տ is a collection of X's and y's, where X ∈ {0, 1}<sup>n</sup>, and y ∈ {0, 1}.
<br>

**Error Rate/Training Error/Empirical Error Rate** = (number of mistakes that T makes on Տ) / (size of Տ), where T is a decision tree.

<br>
<center>
    <img src="images/1.2.2.png" alt="Professor Notes" />
</center>
<br>
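
The error rate definition can be sketched in a few lines of Python. The training set `S` and the stand-in predictor `T` below are hypothetical; any function mapping a bit string to 0 or 1 could take the place of `T`.

```python
def training_error(predict, examples):
    """Fraction of labeled examples (x, y) on which predict(x) differs from y."""
    mistakes = sum(1 for x, y in examples if predict(x) != y)
    return mistakes / len(examples)


# Hypothetical training set: pairs (x, y) with x a 2-bit string and y a 0/1 label.
S = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def T(x):
    """Stand-in for a decision tree: a single node that outputs the first bit."""
    return x[0]


error = training_error(T, S)  # T is wrong only on ((0, 1), 1)
```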

---

#### **Natural Approach for Building Decision Trees:**
- Given a set Տ

<br>

- Tree 1: Very simple, trivial tree
    - Tree is a leaf (we don't query any literals, always output 0 or 1)
    - How do we decide what to output?
        - Choose 1 or 0 depending on which label is more prevalent in the dataset
 <br>
     
- Tree 2: More advanced tree
    - Tree has one node, the root
    - How do we decide which literal to put at the root?
        - You want a literal at the root that is going to discriminate between zero and one labels

<br>
<center>
    <img src="images/1.2.3.png" alt="Professor Notes" />
</center>
<br>

<br>

---

#### **So how do we decide which literal to put at the root?**

Define a potential function Φ(a):
<br>&emsp;&emsp;*[English: phi of a]*

$$\Phi(a) = \min(a, 1-a)$$

<br>

---

So, for the trivial decision tree:

Pick a literal, *x<sub>i</sub>* , then compute Φ(Pr<sub>(x, y)~Տ</sub> (y = 0))
<br>&emsp;&emsp;*[English: Compute phi of the probability that for an example we choose from Տ that y = 0]*

- Assume: 10 positive examples
- Assume: 5 negative examples
- What is Φ(Pr<sub>(x, y)~Տ</sub> (y = 0))?
    - 1/3
- *This* probability is the error rate for the trivial decision tree.

$$ \Phi\left(\Pr_{(x, y)\sim Տ}[y = 0]\right) $$
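
A small Python check of this example, assuming labels are stored as a plain list of 0s and 1s (the variable names are illustrative only). It confirms that Φ applied to the fraction of 0-labels equals the error rate of the leaf that outputs the majority label.

```python
from fractions import Fraction


def phi(a):
    """Potential function from the notes: min(a, 1 - a)."""
    return min(a, 1 - a)


# 10 positive (label 1) and 5 negative (label 0) examples, as assumed above.
labels = [1] * 10 + [0] * 5

p_zero = Fraction(labels.count(0), len(labels))   # Pr[y = 0]
majority = 1 if p_zero < Fraction(1, 2) else 0    # label the trivial tree outputs
trivial_error = phi(p_zero)

# Count mistakes directly to compare against phi(Pr[y = 0]).
mistakes = sum(1 for y in labels if y != majority)
direct_error = Fraction(mistakes, len(labels))
```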

<br>
<center>
    <img src="images/1.2.4.png" alt="Professor Notes" />
</center>
<br>

---

Looking at the tree with one node, pick a literal, *x<sub>1</sub>*, as the root node...

What label should be put on the first leaf?
- Condition on *x<sub>1</sub>* = 0 -> output the majority value

Then, for the second leaf...
- Condition on *x<sub>1</sub>* = 1 -> output the majority value

Meaning, for each option of the value of *x<sub>1</sub>*, we output the majority label for that value of *x<sub>1</sub>*.

<br> 

**What is the new error rate?**

It is a weighted average of the error of each of the new leaves. Explicitly written out, the error rate for the decision tree with one node is:

$$
\Pr_{(x, y)\sim Տ}[x_1 = 0]\cdot \Phi\left(\Pr_{(x, y)\sim Տ}[y = 0 \mid x_1 = 0]\right) +
\Pr_{(x, y)\sim Տ}[x_1 = 1]\cdot \Phi\left(\Pr_{(x, y)\sim Տ}[y = 0 \mid x_1 = 1]\right)
$$

---

**Gain(x<sub>1</sub>) = Old Rate - New Rate using x<sub>1</sub>**
<br>&emsp;&emsp;*[English: The gain of x<sub>1</sub> is the old error rate minus the new error rate using x<sub>1</sub>]*

This is the reduction in training error that we attain by moving from the trivial decision tree to the decision tree where we put x<sub>1</sub> at the root. We define it as Gain(x<sub>1</sub>).
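
A minimal sketch of the gain computation, under the same assumptions as before (hypothetical data, 0-based indices, Φ(a) = min(a, 1-a)):

```python
from fractions import Fraction


def phi(a):
    return min(a, 1 - a)


def node_error(examples):
    """phi applied to the fraction of 0-labels: the error of a majority-vote leaf."""
    zeros = sum(1 for _, y in examples if y == 0)
    return phi(Fraction(zeros, len(examples)))


def gain(examples, i):
    """Old error rate (trivial tree) minus new error rate (bit i at the root)."""
    parts = [[e for e in examples if e[0][i] == bit] for bit in (0, 1)]
    new = sum(Fraction(len(p), len(examples)) * node_error(p) for p in parts if p)
    return node_error(examples) - new


# Hypothetical training set of (x, y) pairs with x in {0, 1}^2.
S = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1), ((1, 1), 0)]

gains = [gain(S, i) for i in range(2)]
```

Here the first bit lowers the error from 2/5 to 1/5, while the second bit leaves it unchanged, so only the first has positive gain.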

<br>
<center>
    <img src="images/1.2.5.png" alt="Professor Notes" />
</center>
<br>

---

Now we can compute the Gain(x<sub>i</sub>) of each literal, from x<sub>1</sub> to x<sub>n</sub>, and find which literal maximizes the gain and place that literal at the root of our tree. 

Once we have done that, each branch will now be using a subset of the original set. In this case the left branch will use the training set Տ<sub>|x<sub>1</sub>=0</sub> *[English: Տ restricted to x<sub>1</sub>=0]*, and the right branch will be using the training set Տ<sub>|x<sub>1</sub>=1</sub> *[English: Տ restricted to x<sub>1</sub>=1]*. 

Meaning we have two different training sets now, one for the left subtree and one for the right subtree. We repeat the process of computing what literal should be at the root of the next subtrees and continue until the tree has been completed.
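
The recursive procedure might look like the sketch below. Since the stopping criterion is not specified at this point, the code assumes two simple ones: a `max_depth` parameter and stopping when a node has zero error. The tuple representation of the tree and all function names are illustrative, and ties between literals are broken by lowest index.

```python
from fractions import Fraction


def phi(a):
    return min(a, 1 - a)


def node_error(examples):
    zeros = sum(1 for _, y in examples if y == 0)
    return phi(Fraction(zeros, len(examples)))


def majority(examples):
    ones = sum(y for _, y in examples)
    return 1 if 2 * ones >= len(examples) else 0


def split(examples, i):
    """Restrict the training set to x[i] = 0 and to x[i] = 1."""
    return ([e for e in examples if e[0][i] == 0],
            [e for e in examples if e[0][i] == 1])


def gain(examples, i):
    new = sum(Fraction(len(part), len(examples)) * node_error(part)
              for part in split(examples, i) if part)
    return node_error(examples) - new


def build(examples, n, max_depth):
    """Return a leaf label (int) or a tuple (i, subtree for x[i]=0, subtree for x[i]=1)."""
    if max_depth == 0 or node_error(examples) == 0:   # assumed stopping rules
        return majority(examples)
    best = max(range(n), key=lambda i: gain(examples, i))
    left, right = split(examples, best)
    if not left or not right:   # the chosen literal separates nothing
        return majority(examples)
    return (best, build(left, n, max_depth - 1), build(right, n, max_depth - 1))


def predict(tree, x):
    while isinstance(tree, tuple):
        i, left, right = tree
        tree = right if x[i] == 1 else left
    return tree


# Hypothetical training set labeled by (x[0] AND x[1]).
S = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
tree = build(S, n=2, max_depth=2)
```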

Is this computationally feasible?

It depends on the functions involved. Here the gain function is relatively easy to compute, but the size of the tree also matters: building a tree that is extremely large, for example exponentially large in the number of features, is not computationally feasible. So we need some sort of stopping criterion, which will be covered later.

## ***Potential Functions and Random Forests***

#### **Tree Structure**

The structure of the tree is determined by the choice of potential function, $\phi$. For example, we used $\phi(a) = \min(a, 1-a)$, which corresponded to training error. Another common potential function is $\phi(a) = 2\cdot a \cdot (1-a)$, the Gini function or Gini index. Comparing the two using graphs:

<br>
<center>
    <img src="images/1.2.6.png" alt="Professor Notes" width="400"/>
</center>
<br>

We can see that $\phi_1(a) = \min(a, 1-a)$ is not differentiable at every point because of the kink at $a = 0.5$ (the function itself is continuous there). $\phi_2(a) = 2\cdot a \cdot (1-a)$ is an upper bound on $\phi_1$, and it is smooth! Because it is an upper bound:

$$ \text{small values of } \phi_2 \implies \text{small values of } \phi_1 $$

This means that driving $\phi_2$ toward zero forces $\phi_1$ toward zero as well.
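
The upper-bound claim is easy to check numerically. For $a \le 0.5$ we have $2(1-a) \ge 1$, so $2a(1-a) \ge a$, and the case $a \ge 0.5$ is symmetric. The sketch below verifies this on a grid of exact fractions; the grid resolution is an arbitrary choice.

```python
from fractions import Fraction


def phi_1(a):
    """Training-error potential: min(a, 1 - a)."""
    return min(a, 1 - a)


def phi_2(a):
    """Gini potential: 2 * a * (1 - a)."""
    return 2 * a * (1 - a)


# Exact grid 0, 1/100, ..., 1 (Fractions avoid floating-point noise).
grid = [Fraction(k, 100) for k in range(101)]

is_upper_bound = all(phi_2(a) >= phi_1(a) for a in grid)
largest_gap = max(phi_2(a) - phi_1(a) for a in grid)
```

The two functions agree at $a = 0$, $a = 0.5$, and $a = 1$, and the gap between them on this grid is largest at $a = 0.25$ and $a = 0.75$.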

#### **Example with Gini Index**

Let's look at an example using the following table, $S$, and potential function $\phi(a) = 2\cdot a \cdot (1-a)$.
$$
\begin{array}{|c|c|c|c|}
\hline
x_1 & x_2 & \text{Pos} & \text{Neg} \\
\hline
0 & 0 & 1 & 1 \\
\hline
0 & 1 & 2 & 1 \\
\hline
1 & 0 & 3 & 1 \\
\hline
1 & 1 & 4 & 2 \\
\hline
\end{array}
$$

***What is  $\;\phi_{S}(Pr(Neg))$?***

<br>
<center>
    <img src="images/1.2.7.png" alt="Professor Notes" width="400"/>
</center>
<br>

As we can see worked out, it is $\frac{4}{9}$ for the trivial tree.

Now, **should we pick $x_1$ or $x_2$ to be at the root**?

Let's look at $x_1$:

$$ Pr(x_1=0)\cdot \phi(Pr(Neg|x_1=0))+Pr(x_1=1)\cdot \phi(Pr(Neg|x_1=1)) $$
$$ = \frac{1}{3}\cdot 2\cdot \frac{2}{5}\cdot \frac{3}{5} + \frac{2}{3}\cdot 2 \cdot\frac{3}{10} \cdot \frac{7}{10} $$
$$ = \frac{11}{25} $$

$\frac{11}{25}$ is slightly smaller than $\frac{4}{9}$, so splitting on $x_1$ is a slight improvement over the trivial tree.

Let's look at $x_2$:

$$ Pr(x_2=0)\cdot \phi(Pr(Neg|x_2=0))+Pr(x_2=1)\cdot \phi(Pr(Neg|x_2=1)) $$
$$ = \frac{2}{5}\cdot 2\cdot \frac{1}{3}\cdot \frac{2}{3} + \frac{3}{5}\cdot 2 \cdot\frac{3}{9} \cdot \frac{6}{9} $$
$$ = \frac{4}{9} $$

We can now calculate the gain of each $x$ being the root. For $x_1$:

$$ \frac{4}{9} - \frac{11}{25} = \frac{1}{225} > 0 $$

And for $x_2$:

$$ \frac{4}{9} - \frac{4}{9} = 0 $$

$x_1$ is the variable with the greatest gain, so $x_1$ is the best choice of root node.
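
The arithmetic in this example can be reproduced with exact fractions. The sketch below encodes the table $S$ as tuples `(x1, x2, pos, neg)`; the helper names are illustrative.

```python
from fractions import Fraction


def gini(a):
    """Gini potential: 2 * a * (1 - a)."""
    return 2 * a * (1 - a)


# Rows of the table S: (x1, x2, number of positives, number of negatives).
table = [(0, 0, 1, 1), (0, 1, 2, 1), (1, 0, 3, 1), (1, 1, 4, 2)]


def count(rows):
    """Total number of examples in the given rows."""
    return sum(pos + neg for _, _, pos, neg in rows)


def potential(rows):
    """Gini potential of Pr(Neg) within the given rows."""
    neg = sum(r[3] for r in rows)
    return gini(Fraction(neg, count(rows)))


def split_potential(rows, col):
    """Weighted potential after splitting on column col (0 for x1, 1 for x2)."""
    result = Fraction(0)
    for bit in (0, 1):
        part = [r for r in rows if r[col] == bit]
        result += Fraction(count(part), count(rows)) * potential(part)
    return result


old = potential(table)
gain_x1 = old - split_potential(table, 0)
gain_x2 = old - split_potential(table, 1)
```

Running this gives a gain of 1/225 for $x_1$ and 0 for $x_2$, matching the hand calculation.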

#### **When to Stop**

***Question: When should we stop?***

There are many answers...
- Stop when the gain is extremely small for all literals
- Pruning: build an enormous tree, and fix a parameter indicating how many nodes are desired. Then start at the bottom of the tree and move upward, removing branches that have little effect on the rest of the tree.

#### **Random Forests**

This is the practice of building many small decision trees, and taking the majority vote of the resulting trees.

***Question: How do we build many decision trees?***

Algorithm:
- Take training set $S$
- Randomly subsample from $S$ to create $S'$ (can subsample with or without replacement)
- Randomly choose a subset of $k$ features from $\{x_1,\dots,x_n\}$
- Build a decision tree using $S'$ and the $k$ random features
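
The steps above can be sketched in Python as follows. This is only an illustration under several assumptions not stated in the notes: subsampling is done with replacement, each tree is grown greedily with the Gini potential up to a fixed `depth`, and the forest predicts by strict majority vote. All names and the sample data are hypothetical.

```python
import random
from collections import Counter


def gini(examples):
    p = sum(y for _, y in examples) / len(examples)
    return 2 * p * (1 - p)


def majority(examples):
    return Counter(y for _, y in examples).most_common(1)[0][0]


def build(examples, features, depth):
    """Greedy tree restricted to the allowed feature indices; leaves are ints."""
    if depth == 0 or gini(examples) == 0 or not features:
        return majority(examples)

    def cost(i):  # weighted Gini potential after splitting on bit i
        parts = [[e for e in examples if e[0][i] == b] for b in (0, 1)]
        return sum(len(p) / len(examples) * gini(p) for p in parts if p)

    best = min(features, key=cost)
    left = [e for e in examples if e[0][best] == 0]
    right = [e for e in examples if e[0][best] == 1]
    if not left or not right:
        return majority(examples)
    rest = [f for f in features if f != best]
    return (best, build(left, rest, depth - 1), build(right, rest, depth - 1))


def predict(tree, x):
    while isinstance(tree, tuple):
        i, left, right = tree
        tree = right if x[i] == 1 else left
    return tree


def random_forest(S, n, k, num_trees, depth, rng):
    forest = []
    for _ in range(num_trees):
        sample = [rng.choice(S) for _ in S]     # S': subsample with replacement
        features = rng.sample(range(n), k)      # k random feature indices
        forest.append(build(sample, features, depth))
    return forest


def forest_predict(forest, x):
    """Majority vote over the trees (ties go to 0)."""
    votes = sum(predict(t, x) for t in forest)
    return 1 if 2 * votes > len(forest) else 0


# Hypothetical data: 3-bit inputs labeled by their first bit.
S = [((a, b, c), a) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
forest = random_forest(S, n=3, k=2, num_trees=5, depth=2, rng=random.Random(0))
predictions = [forest_predict(forest, x) for x, _ in S]
```

Passing an explicit `random.Random` instance makes the run repeatable; the individual trees will still differ from each other because each sees a different sample and feature subset.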

# Personal Notes #

**[Understanding Machine Learning: From Theory to Algorithms, Chapter 18](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html)** 