# Decision Trees:
We will use the dataset below to learn a decision tree that predicts whether we play tennis (yes or no) from four attributes: Outlook (sunny, overcast, or rain), Temperature (hot, mild, or cold), Humidity (high or normal), and Wind (weak or strong).

Training Dataset:

|Outlook  |  Temperature | Humidity | Wind | PlayTennis|
| ----------- | ----------- |----------- | ----------- | ----------- |
|sunny    |  hot  |   high  | weak    | no
|sunny    |  hot  |   high  | strong  | no
|overcast |  hot  |   high  | weak    | yes
|rain     | mild  |   high  | weak    | yes
|rain     | cold  |   normal|  weak   | yes
|rain     | cold  |   normal|  strong | no
|overcast | cold  |   normal|  strong | yes
|sunny    |  mild |   high  |  weak   | no
|sunny    |  cold |   normal|  weak   | yes
|rain     | mild  |  normal | weak    | yes
|sunny    | mild  |  normal | strong  | yes
|overcast | mild  | high    | strong  | yes
|overcast | hot   | normal  | weak    | yes
|rain     | mild  |  high   | strong  | no


New Sample:

|Outlook  |  Temperature | Humidity | Wind | PlayTennis|
| ----------- | ----------- |----------- | ----------- | ----------- |
|rain     |  mild  |  normal  | strong | ?

The right answer is PlayTennis = no

For this task, you can express your answers in terms of $\log_2$.
1. What is the entropy $H(PlayTennis?)$?
2. What is the conditional entropy $H(PlayTennis? |Humidity)$? 
3. What is the conditional entropy $H(PlayTennis?| Wind)$?
4. What is the information gain  $IG(Humidity)$?
5. What is the information gain  $IG(Wind)$?
6. What feature will we split on at the root? Why?
7. How is the new sample classified by a tree that uses only the two attributes Humidity and Wind?

What is the entropy $H(PlayTennis?)$?

Entropy:
$$H(Y) = - \sum_{i=1}^k P(Y=y_i) \log_2 P(Y=y_i)$$

In [None]:
import pandas as pd
import numpy as np
from collections import Counter

dataset = pd.DataFrame({'Outlook':['sunny','sunny','overcast','rain','rain','rain','overcast','sunny','sunny','rain','sunny','overcast','overcast','rain'],  
                        'Temperatur':['hot','hot','hot','mild','cold','cold','cold','mild','cold','mild','mild','mild','hot','mild'], 
                        'Humidity':['high','high','high','high','normal','normal','normal','high','normal','normal','normal','high','normal','high'], 
                        'Wind' :['weak','strong','weak','weak','weak','strong','strong','weak','weak','weak','strong','strong','weak','strong'], 
                        'PlayTennis':[0,0,1,1,1,0,1,0,1,1,1,1,1,0]})
dataset

In [None]:
from scipy.stats import entropy
P_tennis       = 9/14
P_no_tennis    = 5/14
entropy_tennis = entropy([P_tennis, P_no_tennis], base=2)
print(entropy_tennis)

$$P(PlayTennis = yes) = \frac{9}{14}$$
$$P(PlayTennis = no) = \frac{5}{14}$$
$$ H(PlayTennis) = - \frac{9}{14} \cdot \log_2 \left(\frac{9}{14}\right) - \frac{5}{14} \cdot \log_2 \left(\frac{5}{14}\right) $$

In [None]:
from math import log

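def entropy_(data, base):
    """Return the entropy of the label values in `data`, using logarithm base `base`."""
    # Count how often each distinct label occurs
    _, counts = np.unique(data, return_counts=True)
    # Relative frequencies serve as the probability estimates
    probs = counts / len(data)
    ent = 0
    for p in probs:
        if p > 0.:
            ent -= p * log(p, base)
    return ent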

In [None]:
print(entropy_(dataset['PlayTennis'], 2))

What is the conditional entropy $H(PlayTennis? | Humidity)$?

$$P(Humidity =  high)   = \frac{7}{14}$$
$$P(Humidity =  normal) = \frac{7}{14}$$

In [None]:
data1 = dataset['PlayTennis'][dataset['Humidity']=='high']
print(data1)

$$ H(PlayTennis|Humidity=high) = - \frac{4}{7} \cdot \log_2 \left(\frac{4}{7}\right) - \frac{3}{7} \cdot \log_2 \left(\frac{3}{7}\right) $$

In [None]:
- 4/7 * log(4/7, 2) - 3/7 * log(3/7, 2)

In [None]:
data2 = dataset['PlayTennis'][dataset['Humidity']=='normal']
print(data2)

$$ H(PlayTennis|Humidity=normal) = - \frac{1}{7} \cdot \log_2 \left(\frac{1}{7}\right) - \frac{6}{7} \cdot \log_2 \left(\frac{6}{7}\right) $$

In [None]:
- 1/7 * log(1/7 ,2)  - 6/7 * log(6/7 , 2)

In [None]:
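def centropy_(data,attribute1,attribute2, base):
    """Conditional entropy H(attribute1 | attribute2) over the rows of `data`."""
    value,counts = np.unique(data[attribute2],return_counts=True)
    probs = counts / len(data)
    cent = 0
    for v,p in zip(value,probs):
        # Labels of the rows where attribute2 takes the value v
        _data =  data[attribute1][data[attribute2]==v]
        # Use the caller's base rather than a hard-coded 2
        ent   = entropy_(_data, base)
        if p > 0.:
            cent += p * ent
    return cent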

In [None]:
centropy_(dataset,'PlayTennis','Humidity', 2)

$$ H(PlayTennis|Humidity) = P(Humidity=normal) \, H(PlayTennis|Humidity=normal) + P(Humidity=high) \, H(PlayTennis|Humidity=high) $$

In [None]:
7/14 * 0.9852281360342516 + 7/14 *  0.5916727785823275

What is the conditional entropy $H(PlayTennis? | Wind)$?

In [None]:
centropy_(dataset,'PlayTennis','Wind', 2)
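
The same quantity can be cross-checked by weighting the entropy of each Wind group by its share of the rows. The snippet below is a minimal, self-contained sketch: it re-enters the Wind and PlayTennis columns from the training table, and `binary_entropy` is a hypothetical helper that is not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Wind and PlayTennis columns, copied from the training table
wind = ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong',
        'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'strong']
play = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
df = pd.DataFrame({'Wind': wind, 'PlayTennis': play})

def binary_entropy(labels):
    """Entropy in bits of a 0/1 label sequence (hypothetical helper)."""
    p = np.bincount(np.asarray(labels), minlength=2) / len(labels)
    p = p[p > 0]  # skip empty classes so log2(0) is never evaluated
    return float(-(p * np.log2(p)).sum())

h_wind = 0.0
for value, group in df.groupby('Wind'):
    weight = len(group) / len(df)            # P(Wind = value)
    h = binary_entropy(group['PlayTennis'])  # H(PlayTennis | Wind = value)
    print(value, len(group), round(h, 4))
    h_wind += weight * h

print(round(h_wind, 4))  # about 0.8922
```

The weak group holds 8 rows (6 yes, 2 no) and the strong group 6 rows (3 yes, 3 no), so the strong branch alone is maximally uncertain at 1 bit.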

What is the information gain $IG(Humidity)$?
$$IG(Humidity) = H(PlayTennis?) - H(PlayTennis?|Humidity) = 0.9402859586706309 - 0.7884504573082896 \approx 0.1518 $$

In [None]:
def ig(data,attribute1,attribute2, base):
    return entropy_(data[attribute1], base) - centropy_(data,attribute1,attribute2, base)      

In [None]:
print(ig(dataset,'PlayTennis','Humidity', 2))

What is the information gain $IG(Wind)$?

In [None]:
print(ig(dataset,'PlayTennis','Wind', 2))

What feature will we split on at the root? Why?
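
One way to answer this is to compare the information gains directly and pick the larger one. The sketch below does so only for Humidity and Wind, the two attributes this exercise works with; it assumes the root is chosen among those two, and `entropy_bits` and `info_gain` are hypothetical helper names rather than the notebook's own functions.

```python
import numpy as np
import pandas as pd

# Humidity, Wind and PlayTennis columns, copied from the training table
df = pd.DataFrame({
    'Humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal',
                 'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high'],
    'Wind': ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong',
             'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'strong'],
    'PlayTennis': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0],
})

def entropy_bits(labels):
    """Entropy in bits of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(frame, target, attribute):
    """H(target) minus the weighted entropy of target within each attribute value."""
    conditional = sum(len(g) / len(frame) * entropy_bits(g[target])
                      for _, g in frame.groupby(attribute))
    return entropy_bits(frame[target]) - conditional

# ASSUMPTION: only the two attributes used in this exercise are candidates
gains = {a: info_gain(df, 'PlayTennis', a) for a in ['Humidity', 'Wind']}
best = max(gains, key=gains.get)
print(gains)
print('split on:', best)
```

Under that assumption the split falls on Humidity, since its gain (about 0.152) exceeds that of Wind (about 0.048): splitting on Humidity removes more uncertainty about PlayTennis.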

How is the new sample classified by a tree that uses only the two attributes Humidity and Wind?
* Humidity = normal 	
* Wind = strong

In [None]:
node_1= dataset[dataset['Humidity']=='normal']
node_1

In [None]:
# Build the mask from node_1 itself so its index matches the frame being filtered
node_2 = node_1[node_1['Wind']=='strong']
node_2

In [None]:
print("we will play tennis with probability", 2/3)
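
The 2/3 can also be read off the data instead of being typed in by hand. The following sketch selects the leaf that the new sample falls into and takes a majority vote; the majority-vote rule and the 0.5 threshold are assumptions about how the leaf is turned into a prediction, and the columns are re-entered from the training table.

```python
import pandas as pd

# Humidity, Wind and PlayTennis columns, copied from the training table
df = pd.DataFrame({
    'Humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal',
                 'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high'],
    'Wind': ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong',
             'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'strong'],
    'PlayTennis': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0],
})

# Leaf reached by the new sample: Humidity = normal, Wind = strong
leaf = df[(df['Humidity'] == 'normal') & (df['Wind'] == 'strong')]

# Fraction of "yes" labels among the training rows in that leaf
p_yes = leaf['PlayTennis'].mean()

# ASSUMPTION: the leaf predicts the majority class
prediction = 'yes' if p_yes >= 0.5 else 'no'
print(len(leaf), p_yes, prediction)
```

With only these two attributes the leaf votes yes, whereas the exercise gives no as the right answer for the new sample, so this restricted tree misclassifies it.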

## Random Forest 

Consider the construction of a Random Forest with $t = 6$ binary trees. To decorrelate the trees, each tree is trained on a random sample drawn with replacement ("bootstrap sampling"). Assume that each tree uses $n = 8$ bootstrap training samples.

* Calculate the probability that a given sample (e.g., the first sample) is never considered by the third tree.

In [None]:
n = 8
## The probability that it is considered in the third tree
p = 1/n
# The probability that it is not considered in the third tree is
p_not = 1- p 
print(p_not)

Is this answer correct, and why?
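
A quick way to check the answer is to separate a single draw from the whole bootstrap sample. The sketch below assumes that the training set also contains $n = 8$ samples and that each tree draws $n$ times with replacement; the exercise text does not state the training set size explicitly, so treat that as an assumption. Under it, $1 - 1/n$ is only the probability of missing the sample in one draw, and missing it in all $n$ independent draws has probability $(1 - 1/n)^n$.

```python
import numpy as np

# ASSUMPTION: training set size equals the number of bootstrap draws per tree
n = 8

p_miss_one_draw = 1 - 1 / n             # a single draw picks some other sample
p_never_in_tree = p_miss_one_draw ** n  # all n independent draws miss it

# Monte Carlo cross-check: simulate many bootstrap samples of size n
rng = np.random.default_rng(0)
draws = rng.integers(0, n, size=(100_000, n))  # sample indices 0..n-1
estimate = np.mean(~(draws == 0).any(axis=1))  # sample 0 never drawn

print(p_miss_one_draw)   # 0.875
print(p_never_in_tree)   # about 0.3436
print(estimate)
```

The result does not depend on which of the six trees is considered, because every tree draws its bootstrap sample independently in the same way.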