# Decision Trees

A `decision tree` is a predictive modeling tool which uses a tree structure to represent a number of possible `decision paths` and an outcome for each path.

<img src="images/tree1.png" alt="" style="width: 600px;"/>


Decision trees have a lot to recommend them. They’re `very easy to understand and interpret`, and the process by which they reach a prediction is completely transparent. Unlike many other models, decision trees `can easily handle a mix of numeric` (e.g., number of legs) `and categorical` (e.g., delicious/not delicious) `attributes and can even classify data for which attributes are missing`.

At the same time, finding an “optimal” decision tree for a set of training data is computationally a very hard problem. (We will get around this by trying to build a good-enough tree rather than an optimal one, although for large data sets this can still be a lot of work.) More importantly, `it is very easy (and very bad) to build decision trees that are overfitted to the training data` and that don’t generalize well to unseen data. We’ll look at ways to address this.

Most people divide decision trees into `classification trees` (which produce categorical outputs) and `regression trees` (which produce numeric outputs). 

## Entropy

In order to build a `decision tree`, we will need to decide what questions to ask and in what order. At each stage of the tree there are some possibilities we’ve eliminated and some that we haven’t. Every possible question partitions the remaining possibilities according to their answers.

Ideally, `we’d like to choose questions whose answers give a lot of information about what our tree should predict`. If there’s a single yes/no question for which “yes” answers always correspond to True outputs and “no” answers to False outputs (or vice versa), this would be an awesome question to pick.

We capture this notion of “how much information” with `entropy` (think "uncertainty"). You have probably heard the term used to mean disorder. Here we use it to represent the uncertainty associated with data.

Imagine that we have a set S of data, each member of which is labeled as belonging to one of a finite number of classes C1, ..., Cn. If all the data points belong to a single class, then there is no real uncertainty, which means we’d like there to be `low entropy`. If the data points are evenly spread across the classes, there is a lot of uncertainty and we’d like there to be `high entropy`.

<img src="images/tree2.png" alt="" style="width: 600px;"/>

Each term `−p_i log2 p_i`, where `p_i` is the proportion of the data in the i-th class, is non-negative and is close to zero precisely when `p_i` is either close to zero or close to one.

<img src="images/tree3.png" alt="" style="width: 600px;"/>

This means the `entropy will be small` when every `p_i` is close to 0 or 1 (i.e., when most of the data is in a single class), and it `will be larger` when many of the `p_i`’s are not close to 0 (i.e., when the data is spread across multiple classes).
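
As a reference for the discussion above, the entropy of a set S can be written out as a sum of one such term per class. The notation here is an assumption for illustration: $p_i$ is the proportion of the data in the i-th class, and a term with $p_i = 0$ is taken to contribute nothing.

$$
H(S) = -p_1 \log_2 p_1 - \dots - p_n \log_2 p_n = -\sum_{i=1}^{n} p_i \log_2 p_i
$$

A quick worked case with two classes in proportions 0.25 and 0.75:

$$
H = -0.25 \log_2 0.25 - 0.75 \log_2 0.75 \approx 0.25 \cdot 2 + 0.75 \cdot 0.415 \approx 0.811
$$

An even 50/50 split gives exactly 1 bit, and a single class (p = 1) gives 0.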

In [11]:
from typing import List
import math

def entropy(class_probabilities: List[float]) -> float:
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2)
               for p in class_probabilities
               if p > 0)  # ignore zero probabilities

assert entropy([1.0]) == 0
assert entropy([0.5, 0.5]) == 1
assert 0.81 < entropy([0.25, 0.75]) < 0.82
assert entropy([0.25, 0.75]) == entropy([0.75, 0.25])

In [4]:
entropy([0.33, 0.33, 0.33])

1.5834674497121084

In [5]:
entropy([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

3.321928094887362

In [6]:
entropy([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

2.1609640474436813

In [8]:
entropy([0.1, 0.1, 0.2, 0.1, 0.1, 0.4])

2.3219280948873626

In [9]:
from typing import Any
from collections import Counter

def class_probabilities(labels: List[Any]) -> List[float]:
    """return the frequency of each class as a fraction of all labels"""
    total_count = len(labels)
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labels: List[Any]) -> float:
    return entropy(class_probabilities(labels))

assert data_entropy(['a']) == 0
assert data_entropy(['True', 'False']) == 1
assert data_entropy([3, 4, 4, 4]) == entropy([0.25, 0.75])

Each stage of a decision tree involves asking a question whose answer partitions data into one or (hopefully) more subsets. We’d like some notion of the `entropy` that results from partitioning a set of data in a certain way. 

We want a partition to have:
- **low entropy** if it splits the data into subsets that themselves have low entropy (i.e., are highly certain), 
- and **high entropy** if it contains subsets that (are large and) have high entropy (i.e., are highly uncertain).

In [12]:
def partition_entropy(subsets: List[List[Any]]) -> float:
    """returns the entropy from this partition of data into subsets"""
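    total_count = sum(len(subset) for subset in subsets)

    # weight each subset's entropy by the fraction of the data it contains
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)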
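
To make the weighting concrete, here is a minimal, self-contained sketch that compares a split into pure subsets with a split that leaves both subsets evenly mixed. The helper definitions are repeated in compact form so the cell runs on its own, and the label lists (`pure_split`, `mixed_split`) are made-up values for illustration only.

```python
import math
from collections import Counter
from typing import Any, List

def data_entropy(labels: List[Any]) -> float:
    """Entropy (in bits) of the class labels in one subset."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

def partition_entropy(subsets: List[List[Any]]) -> float:
    """Size-weighted average of the subset entropies."""
    total = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total
               for subset in subsets)

# Hypothetical labels, invented for this example.
pure_split = [[True, True, True], [False, False, False]]     # each subset has one class
mixed_split = [[True, False], [True, False, True, False]]    # each subset is 50/50

print(partition_entropy(pure_split))    # every subset is certain, so nothing is left to learn
print(partition_entropy(mixed_split))   # every subset is as uncertain as a coin flip
```

The pure split scores zero, while the mixed split scores about one bit, so a tree builder that minimizes partition entropy would prefer the question that produced the first split.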

One problem with this approach is that partitioning by an attribute with many different values will result in a `very low entropy due to overfitting`. For example, imagine you work for a bank and are trying to build a decision tree to predict which of your customers are likely to default on their mortgages, using some historical data as your training set. Imagine further that the data set contains each customer’s Social Security number. Partitioning on SSN will produce one-person subsets, each of which necessarily has zero entropy. But a model that relies on SSN is certain not to generalize beyond the training set. For this reason, you should probably `try to avoid (or bucket, if appropriate) attributes with large numbers of possible values when creating decision trees`.
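
The effect is easy to reproduce on toy data. The sketch below is an illustration under stated assumptions: the rows, the identifier strings, and the `partition_entropy_by` helper are all hypothetical, and the helper simply groups labels by an attribute value before computing the size-weighted entropy.

```python
import math
from collections import Counter, defaultdict
from typing import Any, Dict, List, Tuple

def data_entropy(labels: List[Any]) -> float:
    """Entropy (in bits) of the class labels in one subset."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

def partition_entropy_by(rows: List[Tuple[Any, bool]]) -> float:
    """Group (attribute_value, label) pairs by attribute value and
    return the size-weighted entropy of the resulting groups."""
    groups: Dict[Any, List[bool]] = defaultdict(list)
    for value, label in rows:
        groups[value].append(label)
    total = len(rows)
    return sum(data_entropy(labels) * len(labels) / total
               for labels in groups.values())

# Hypothetical training rows: (attribute value, defaulted?)
by_unique_id = [("id-001", True), ("id-002", False),
                ("id-003", True), ("id-004", False)]
by_income_band = [("low", True), ("low", False),
                  ("high", True), ("high", False)]

print(partition_entropy_by(by_unique_id))     # one-row groups, so each has zero entropy
print(partition_entropy_by(by_income_band))   # two mixed groups, about one bit
```

The unique identifier gets a perfect score even though, in this made-up data, it says nothing about a customer the model has not seen. That is why a low partition entropy on a high-cardinality attribute should be treated with suspicion.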

## Example 1: Identify which candidates will interview well

In [14]:
from typing import NamedTuple, Optional

class Candidate(NamedTuple):
    level: str
    lang: str
    tweets: bool
    phd: bool
    did_well: Optional[bool] = None # allow unlabeled data
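
As a small usage sketch, the class can hold both labeled and unlabeled rows because `did_well` defaults to `None`. The class is repeated so the cell runs on its own, and the attribute values below are invented for illustration rather than taken from the data set.

```python
from typing import NamedTuple, Optional

class Candidate(NamedTuple):
    level: str
    lang: str
    tweets: bool
    phd: bool
    did_well: Optional[bool] = None  # allow unlabeled data

# Hypothetical candidates, made up for this example.
labeled = Candidate("Senior", "Java", False, False, did_well=False)
unlabeled = Candidate("Junior", "Python", True, False)  # no label supplied

print(labeled.did_well)            # the label we passed in
print(unlabeled.did_well is None)  # the default marks the row as unlabeled

# NamedTuples are immutable; _replace returns a labeled copy instead.
now_labeled = unlabeled._replace(did_well=True)
```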