# Decision Trees:
We will use the dataset below to learn a decision tree that predicts whether we play tennis (yes or no) based on four attributes: Outlook (sunny, overcast, or rain), Temperature (hot, mild, or cold), Humidity (high or normal), and Wind (weak or strong).

Training Dataset:

|Outlook  |  Temperature | Humidity | Wind | PlayTennis|  
| ----------- | ----------- |----------- | ----------- | ----------- |
|sunny    |  hot  |   high  | weak    | no
|sunny    |  hot  |   high  | strong  | no
|overcast |  hot  |   high  | weak    | yes
|rain     | mild  |   high  | weak    | yes
|rain     | cold  |   normal|  weak   | yes
|rain     | cold  |   normal|  strong | no
|overcast | cold  |   normal|  strong | yes
|sunny    |  mild |   high  |  weak   | no
|sunny    |  cold |   normal|  weak   | yes
|rain     | mild  |  normal | weak    | yes
|sunny    | mild  |  normal | strong  | yes
|overcast | mild  | high    | strong  | yes
|overcast | hot   | normal  | weak    | yes
|rain     | mild  |  high   | strong  | no


New Sample:

|Outlook  |  Temperature | Humidity | Wind | PlayTennis|  
| ----------- | ----------- |----------- | ----------- | ----------- |
|rain     |  mild  |  normal  | strong | ?


Questions like the following may come up in an exam.


Careful: you will need to calculate $\log_2$, which many calculators cannot do directly. Use the change-of-base rule:

$$ \log_b x = \frac{\log_d x}{\log_d b} $$

$$ \log_2 x = \frac{\log_{10} x}{\log_{10} 2} \quad \text{or} \quad \log_2 x = \frac{\ln x}{\ln 2}$$
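
The change-of-base rule can be checked quickly in Python. This is only a small sketch; the helper name `log2_via_change_of_base` is made up for illustration, and `math.log2` is used purely as a reference value.

```python
import math

def log2_via_change_of_base(x):
    """Compute log2(x) from the natural log, as one would on a basic calculator."""
    return math.log(x) / math.log(2)

x = 9 / 14
print(log2_via_change_of_base(x))      # via ln
print(math.log10(x) / math.log10(2))   # via log10
print(math.log2(x))                    # built-in, for comparison
```

All three lines should print the same value up to floating-point rounding.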

1. What is the entropy $H(PlayTennis)$?
2. What is the conditional entropy $H(PlayTennis |Humidity)$? 
3. What is the conditional entropy $H(PlayTennis| Wind)$?
4. What is the information gain  $IG(PlayTennis, Humidity)$?
5. What is the information gain  $IG(PlayTennis, Wind)$?
6. What feature will we split on at the root? Why?
7. Calculate the whole tree with ID3.
8. How is the new sample classified?

In [None]:
import pandas as pd
import numpy as np
from collections import Counter

dataset = pd.DataFrame({'Outlook':['sunny','sunny','overcast','rain','rain','rain','overcast','sunny','sunny','rain','sunny','overcast','overcast','rain'],  
                        'Temperature':['hot','hot','hot','mild','cold','cold','cold','mild','cold','mild','mild','mild','hot','mild'], 
                        'Humidity':['high','high','high','high','normal','normal','normal','high','normal','normal','normal','high','normal','high'], 
                        'Wind' :['weak','strong','weak','weak','weak','strong','strong','weak','weak','weak','strong','strong','weak','strong'], 
                        'PlayTennis':[0,0,1,1,1,0,1,0,1,1,1,1,1,0]})  # 1 = yes, 0 = no
dataset

### What is the entropy $H(PlayTennis)$?

Entropy is a measure of uncertainty: if I sample from a distribution, do I already know beforehand what I will get?


Entropy:
$$H(Y) = - \sum_{i=1}^k P(Y=y_i) \log_2 P(Y=y_i)$$

$$P(PlayTennis = yes) = \frac{9}{14}$$
$$P(PlayTennis = no) = \frac{5}{14}$$
$$ H(PlayTennis) = - \frac{9}{14} \cdot \log_2 \left(\frac{9}{14}\right) - \frac{5}{14} \cdot \log_2 \left(\frac{5}{14}\right) \approx 0.94 $$

In [None]:
from scipy.stats import entropy
P_tennis       = 9/14
P_no_tennis    = 5/14

entropy_tennis = entropy([P_tennis, P_no_tennis], base=2)
print(entropy_tennis)

In [None]:
# implement the entropy calculation ourselves
from math import log

def entropy_(data, verbose=False):
    value,counts = np.unique(data, return_counts=True)
    if verbose:
        print("value", value)
        print("counts", counts)
    probs = counts / len(data)
    ent = 0
    for p in probs:
        if p > 0.:
            ent -= p  *  log(p, 2)
    return ent

In [None]:
print(entropy_(dataset['PlayTennis'], verbose=True))
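
To build some intuition for entropy as a measure of uncertainty, the sketch below compares three distributions: a fair coin, a certain outcome, and the PlayTennis label distribution. The helper `entropy_from_probs` is a hypothetical name introduced here; it takes probabilities directly instead of raw data.

```python
from math import log2

def entropy_from_probs(probs):
    """Shannon entropy in bits; terms with p == 0 are skipped (0 * log 0 := 0)."""
    return sum(-p * log2(p) for p in probs if p > 0)

fair_coin = entropy_from_probs([0.5, 0.5])          # maximal uncertainty for two outcomes
certain = entropy_from_probs([1.0, 0.0])            # outcome is known in advance
play_tennis = entropy_from_probs([9 / 14, 5 / 14])  # label distribution of the dataset

print(f"fair coin:   {fair_coin:.2f} bits")
print(f"certain:     {certain:.2f} bits")
print(f"play tennis: {play_tennis:.2f} bits")
```

The PlayTennis distribution is close to the fair coin, so knowing nothing about the weather leaves us almost maximally uncertain about the label.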

### What is the conditional entropy $H(PlayTennis|Humidity)$ ?
$$H(Y|X) = \sum_{i=1}^k P(X=x_i) H(Y|X=x_i)$$
$$H(Y|X=x_i) = - \sum_{j=1}^l P(Y=y_j|X=x_i) \log_2 P(Y=y_j|X=x_i)$$

$$P(Humidity =  high)   = \frac{7}{14}$$
$$P(Humidity =  normal) = \frac{7}{14}$$

In [None]:
high_humid = dataset[dataset['Humidity']=='high']
print(high_humid)

$$ H(PlayTennis|Humidity=high) =  - \frac{4}{7} \cdot \log_2 (\frac{4}{7})  - \frac{3}{7} \cdot \log_2 (\frac{3}{7})  \approx 0.99$$

In [None]:
- 4/7  *  log(4/7 ,2)  - 3/7  *  log(3/7 , 2)

In [None]:
normal_humid = dataset[dataset['Humidity']=='normal']
print(normal_humid)

$$ H(PlayTennis|Humidity=normal) =  - \frac{1}{7} \cdot \log_2 (\frac{1}{7})  - \frac{6}{7} \cdot \log_2 (\frac{6}{7}) \approx 0.59 $$

In [None]:
- 1/7  *  log(1/7 ,2)  - 6/7  *  log(6/7 , 2)

#### Putting it together
$$ H(PlayTennis|Humidity) =  P(Humidity=high) H(PlayTennis|Humidity=high) + P(Humidity=normal) H(PlayTennis|Humidity=normal)$$

$$= \frac{7}{14} \cdot 0.99 + \frac{7}{14} \cdot 0.59 \approx 0.79$$

$$H(PlayTennis|Humidity)  \approx 0.79$$

In [None]:
7/14  *  0.99 + 7/14  *   0.59  # round to 2 decimal places

In [None]:
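def centropy_(data,attribute1,attribute2, verbose=False):
    # conditional entropy H(attribute1 | attribute2)
    value,counts = np.unique(data[attribute2],return_counts=True)
    probs = counts / len(data)  # P(attribute2 = v) for every value v
    cent = 0
    for v,p in zip(value,probs):
        # entropy of attribute1 restricted to the rows where attribute2 == v
        _data =  data[attribute1][data[attribute2]==v]
        ent   = entropy_(_data)
        if p > 0.:
            cent += p  *  ent  # weight by the probability of that value
    return cent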

In [None]:
centropy_(dataset,'PlayTennis','Humidity')

### What is the conditional entropy $H(PlayTennis|Wind)$?

In [None]:
centropy_(dataset,'PlayTennis','Wind')

$$P(Wind =  strong)   = \frac{6}{14}$$
$$P(Wind =  weak)   =   \frac{8}{14}$$
$$H(PlayTennis|Wind=weak) = - \frac{6}{8} \cdot \log_2\frac{6}{8} - \frac{2}{8} \cdot \log_2\frac{2}{8} \approx 0.81$$
$$H(PlayTennis|Wind=strong) = - \frac{3}{6} \cdot \log_2\frac{3}{6} - \frac{3}{6} \cdot \log_2\frac{3}{6} = 1$$
$$H(PlayTennis|Wind) = \frac{6}{14} \cdot 1 + \frac{8}{14} \cdot 0.81 \approx 0.89 $$
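
As a cross-check of the hand calculation, here is a small standard-library sketch that groups the labels by Wind and sums the weighted entropies. The (Wind, PlayTennis) pairs are copied from the training table in row order; `entropy_of`, `rows`, and `groups` are illustrative names, not part of the notebook code.

```python
from collections import Counter, defaultdict
from math import log2

# (Wind, PlayTennis) pairs copied from the training table, in row order
rows = [("weak", "no"), ("strong", "no"), ("weak", "yes"), ("weak", "yes"),
        ("weak", "yes"), ("strong", "no"), ("strong", "yes"), ("weak", "no"),
        ("weak", "yes"), ("weak", "yes"), ("strong", "yes"), ("strong", "yes"),
        ("weak", "yes"), ("strong", "no")]

def entropy_of(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

# Collect the labels separately for each Wind value
groups = defaultdict(list)
for wind, label in rows:
    groups[wind].append(label)

# H(PlayTennis | Wind) = sum over values of P(Wind = v) * H(PlayTennis | Wind = v)
cond_entropy = 0.0
for wind, labels in groups.items():
    weight = len(labels) / len(rows)
    h = entropy_of(labels)
    print(f"Wind={wind}: P={weight:.3f}, H={h:.3f}")
    cond_entropy += weight * h

print(f"H(PlayTennis|Wind) = {cond_entropy:.3f}")
```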

### What is the information gain $IG(PlayTennis, Humidity)$?

$$ IG(PlayTennis, Humidity) = H(PlayTennis)- H(PlayTennis|Humidity)$$
$$IG(PlayTennis, Humidity) = 0.94 - 0.79= 0.15$$ 

In [None]:
def ig(data,attribute1,attribute2):
    return entropy_(data[attribute1]) - centropy_(data,attribute1,attribute2)      

In [None]:
print(ig(dataset,'PlayTennis','Humidity'))

### What is the information gain $IG(PlayTennis, Wind)$?

In [None]:
print(ig(dataset,'PlayTennis','Wind'))
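
Worked by hand, in the same way as for Humidity and using the rounded values $H(PlayTennis) \approx 0.94$ and $H(PlayTennis|Wind) \approx 0.89$ computed earlier:

$$ IG(PlayTennis, Wind) = H(PlayTennis) - H(PlayTennis|Wind)$$
$$IG(PlayTennis, Wind) \approx 0.94 - 0.89 = 0.05$$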

### If we only look at these two features, which feature will we split on at the root? Why?

We split on Humidity, into high and normal humidity.

Reason: the information gain is higher for Humidity ($\approx 0.15$) than for Wind ($\approx 0.94 - 0.89 = 0.05$).

### And if we use all features?

In [None]:
for feature in list(dataset.columns)[:-1]:
    print(f"Information Gain of splitting feature {feature} is {ig(dataset, 'PlayTennis', feature)}")

#### Split on Outlook: it has the highest information gain

In [None]:
data2 = dataset[dataset['Outlook']=='sunny']
print(data2)

$$P(Outlook = sunny)   = \frac{5}{14}$$
$$P(Outlook =  overcast)   =   \frac{4}{14}$$
$$P(Outlook =  rain)   =   \frac{5}{14}$$ 

$$H(PlayTennis|Outlook=sunny) = - \frac{2}{5} \cdot \log_2\frac{2}{5} - \frac{3}{5} \cdot \log_2\frac{3}{5} \approx 0.97$$
$$H(PlayTennis|Outlook=overcast) = - \frac{0}{4} \cdot \log_2\frac{0}{4} - \frac{4}{4} \cdot \log_2\frac{4}{4} = 0 \quad \text{(using the convention } 0 \cdot \log_2 0 = 0 \text{)}$$
$$H(PlayTennis|Outlook=rain) = - \frac{2}{5} \cdot \log_2\frac{2}{5} - \frac{3}{5} \cdot \log_2\frac{3}{5} \approx 0.97$$ 

$$IG(PlayTennis, Outlook) = 0.94 - (\frac{5}{14} \cdot 0.97 + \frac{4}{14} \cdot 0 +  \frac{5}{14} \cdot 0.97) \approx 0.25$$

               Outlook
                 /|\
                / | \
               /  |  \
              /   |   \
           sun   over  rain
           /      |     \
          /       |      \

### ID3 Algorithm

At each node, split on the attribute with the highest information gain, then repeat recursively on each subset until the entropy of a node is 0 or no attributes are left.
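
The recursion can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference ID3 implementation: the names `ROWS`, `FEATURES`, `entropy_of`, `split_by`, `info_gain`, and `id3` are made up here, labels are kept as the strings yes/no from the table, ties between attributes are broken by column order, and attribute values that do not occur in a subset simply get no branch.

```python
from collections import Counter
from math import log2

FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]
# Rows copied from the training table; the last entry is the PlayTennis label
ROWS = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cold", "normal", "weak", "yes"),
    ("rain", "cold", "normal", "strong", "no"),
    ("overcast", "cold", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cold", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def entropy_of(rows):
    """Entropy (in bits) of the label column of the given rows."""
    n = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def split_by(rows, col):
    """Group rows by their value in column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return groups

def info_gain(rows, col):
    """Entropy of the labels minus the conditional entropy given column `col`."""
    remainder = sum(len(g) / len(rows) * entropy_of(g)
                    for g in split_by(rows, col).values())
    return entropy_of(rows) - remainder

def id3(rows, cols):
    labels = Counter(row[-1] for row in rows)
    # Stop when the node is pure (entropy 0) or no attributes are left
    if len(labels) == 1 or not cols:
        return labels.most_common(1)[0][0]
    best = max(cols, key=lambda c: info_gain(rows, c))
    remaining = [c for c in cols if c != best]
    return {FEATURES[best]: {value: id3(group, remaining)
                             for value, group in split_by(rows, best).items()}}

tree = id3(ROWS, list(range(len(FEATURES))))
print(tree)
```

The result is a nested dictionary: each inner node maps an attribute name to its branches, and each leaf is a label.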

In [None]:
sunny = dataset[dataset['Outlook']=='sunny']
overcast = dataset[dataset['Outlook']=='overcast']
rain = dataset[dataset['Outlook']=='rain']

split = {"sunny": sunny, "overcast": overcast, "rain":rain}

for i in split:
    print(f' Split {i} \n {split[i]} \n has entropy {entropy_(split[i]["PlayTennis"])} \n\n')

#### Information Gain for Sunny 

In [None]:
for feature in list(sunny.columns)[:-1]:
    print(f"Information Gain of splitting feature {feature} is {ig(sunny, 'PlayTennis', feature)}")

#### Information Gain for Rain

In [None]:
for feature in list(rain.columns)[:-1]:
    print(f"Information Gain of splitting feature {feature} is {ig(rain, 'PlayTennis', feature)}")

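First we split on Outlook. Now we split the sunny branch on Humidity and the rain branch on Wind, because these have the highest information gain in their subsets. The overcast branch is already pure (all yes) and becomes a leaf.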

               Outlook
                 /|\
                / | \
               /  |  \
              /   |   \
           sun   over  rain
           /      |     \
          /       |      \
         /       yes      \
      Humid              Wind
       /\                 /\
      /  \               /  \
    high  normal       weak strong

### And that's it. Now every leaf has entropy 0

In [None]:
sh = sunny[sunny['Humidity']=='high']
sn = sunny[sunny['Humidity']=='normal']

rw = rain[rain['Wind']=='weak']
rs = rain[rain['Wind']=='strong']

second_split = {"sunny high":sh , "sunny normal": sn, "rain weak":rw, "rain strong": rs}

for s in second_split:
    print(f"split {s} has entropy {entropy_(second_split[s]['PlayTennis'])}")
    print(second_split[s])
    print("\n")

               Outlook
                 /|\
                / | \
               /  |  \
              /   |   \
           sun   over  rain
           /      |     \
          /       |      \
         /       yes      \
      Humid              Wind
       /\                 /\
      /  \               /  \
    high  normal       weak strong
     |      |            |    |
     |      |            |    |
    no     yes          yes   no

### How is the new sample classified?

|Outlook  |  Temperature | Humidity | Wind | PlayTennis|  
| ----------- | ----------- |----------- | ----------- | ----------- |
|rain     |  mild  |  normal  | strong | ?

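Following the tree: Outlook = rain leads to the Wind node, and Wind = strong leads to the leaf no. Temperature and Humidity are never tested on this path.

PlayTennis --> No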
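
The same lookup can be written as a short sketch. The nested-dictionary encoding of the tree and the helper `classify` are assumptions made for illustration; the tree itself is copied from the final diagram.

```python
# The final tree as nested dicts: an inner node maps one attribute name
# to {attribute value: subtree or leaf label}
tree = {"Outlook": {
    "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rain": {"Wind": {"weak": "yes", "strong": "no"}},
}}

def classify(node, sample):
    """Follow the branches that match the sample until a leaf label is reached."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[sample[attribute]]
    return node

new_sample = {"Outlook": "rain", "Temperature": "mild",
              "Humidity": "normal", "Wind": "strong"}
prediction = classify(tree, new_sample)
print(prediction)
```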

### Outlook:


- ID3 is the basic decision tree algorithm, but it is fairly old, and common frameworks generally do not implement it anymore.
- Decision trees are "interpretable": a prediction can be traced along a path of attribute tests.
- Basic decision trees tend to overfit.
- Ensembles of trees such as Random Forests and XGBoost usually generalize better.
- XGBoost (Extreme Gradient Boosting) is a boosting method. Like AdaBoost, it builds an ensemble sequentially, with each new hypothesis focusing on the errors of the previous ones (AdaBoost: learn one hypothesis, weight wrongly classified examples higher, add a second hypothesis, ...). 
    - First step: Create a very small decision tree (e.g. with at most 3 splits).
    - Second step: Use a loss function and calculate the gradients of the loss.
    - Third step: Create a new decision tree and use the gradient information for its splits (minimize the loss).
    - Fourth step: Repeat this iteratively until you have ~100 very small decision trees.


- Generally: XGBoost and Random Forests often show better performance on tabular data than neural networks. 
    - [From 07.2022: Why do tree-based models still outperform deep learning on tabular data?](https://arxiv.org/pdf/2207.08815.pdf)
- (But these ensembles are no longer interpretable in the way a single tree is.)
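
The boosting steps listed above can be illustrated with a toy sketch. Note the assumptions: this is plain gradient boosting for regression with squared-error loss, not XGBoost itself (which additionally uses second-order gradient information and regularization). The "very small trees" are single-split stumps on one feature, the data is made up, and all names (`fit_stump`, `predict`, `mse`) are hypothetical.

```python
def fit_stump(xs, targets):
    """Find the single threshold split that minimizes squared error on the targets."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the maximum so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict(base, stumps, learning_rate, x):
    """Constant base prediction plus the scaled outputs of all stumps."""
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total

def mse(base, stumps, learning_rate, xs, ys):
    return sum((y - predict(base, stumps, learning_rate, x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Toy 1-D regression data, invented for this illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]

learning_rate = 0.5
base = sum(ys) / len(ys)   # start from a constant prediction
stumps = []
errors = [mse(base, stumps, learning_rate, xs, ys)]
for _ in range(20):
    # For squared-error loss, the negative gradient is the residual y - prediction
    residuals = [y - predict(base, stumps, learning_rate, x) for x, y in zip(xs, ys)]
    stumps.append(fit_stump(xs, residuals))
    errors.append(mse(base, stumps, learning_rate, xs, ys))

print(f"MSE before boosting: {errors[0]:.3f}, after 20 stumps: {errors[-1]:.3f}")
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error shrinks step by step. The learning rate below 1 keeps each individual stump's contribution small.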

![](https://i.imgflip.com/3gyzsn.jpg)



