# Decision Trees


We have data on patients who have had a heart attack, with the following attributes:

* *Vitamin* - the type of vitamin the patient takes regularly.
* *The size of the family* - the size of the household in which the patient lived.
* *Exercise* - how often the patient exercises.
* *Bypass* - whether the patient has had a heart bypass.
* *Survived* - whether the patient survived at least five years after the first heart attack.

| *Vitamin* | *The size of the family* | *Exercise* | *Bypass* | *Survived* |
|-----------|--------------------------|------------|----------|------------|
| B complex | large | regularly | no | - |
| B complex | large | regularly | yes | - |
| C | large | regularly | no | + |
| D | medium | regularly | no | + |
| D | small | little | no | + |
| D | small | little | yes | - |
| C | small | little | yes | + |
| B complex | medium | regularly | not | - |
| B complex | small | little | no | + |
| D | medium | little | no | + |
| B complex | medium | little | yes | + |
| C | medium | regularly | yes | + |
| C | big | little | no | + |
| D | medium | regularly | yes | - |

Using the TDIDT (top-down induction of decision trees) algorithm, build a decision tree with the target attribute **Survived**.

Solve the task

1. manually, and then
2. using scikit-learn.

## Building a tree classifier manually

To do so, we need to compute the weighted average entropy $H(A)$ for every available attribute $A$. Because the weighted entropy builds on the entropy of a set, first implement the function `set_entropy`, which takes the counts of occurrences of the individual classes and returns the entropy of the set:
$$ H([n_1, n_2, \ldots, n_k]) = -\sum_{i=1}^k \frac{n_i}{n} \log_2 \frac{n_i}{n}\text{ , where } n = \sum_{i=1}^k n_i.$$
Additionally, when $n_i=0$, we define $\frac{n_i}{n} \log_2 \frac{n_i}{n} = 0$.
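
As a quick check of the formula, take the counts $[1, 0, 2]$, so $n = 3$. The zero count contributes nothing, and the remaining two terms give
$$ H([1, 0, 2]) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} \approx 0.5283 + 0.3900 = 0.9183, $$
which is the value the assertion in the cell below expects.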

In [None]:
import numpy as np
import scipy.stats


def set_entropy(X):
    # Return the entropy (in bits) of a set, where X is a list
    # of counts for all classes in the set.
    # > set_entropy([1,2,2]) = -1/5 * log2(1/5) - 2/5 * log2(2/5) - 2/5 * log2(2/5)
    # > 1.5219
    size = np.sum(X)
    H = 0
    for x in X:
        if x == 0:
            continue  # by definition, 0 * log2(0) = 0
        H -= (x/size)*np.log2(x/size)

    return H

# Alternative vectorized version; it replaces the loop-based definition above.
def set_entropy(X):
    # np.asarray lets the function accept plain lists as well as arrays
    X = np.asarray(X)
    p = X[X != 0] / X.sum()  # class probabilities, zero counts dropped
    return -(p * np.log2(p)).sum()

e = set_entropy(np.array([1, 0, 2]))
print(f"{e=}")
assert np.isclose(e, 0.9182958)

e=0.9182958340544896
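
The weighted average entropy itself is not spelled out above. A minimal sketch, assuming the usual definition $H(A) = \sum_v \frac{|S_v|}{|S|} H(S_v)$, where $S_v$ is the subset of rows in which attribute $A$ takes the value $v$, could look as follows. The helper names `entropy_of_counts` and `weighted_entropy` are illustrative and not part of the assignment; the counts for *Vitamin* are read off the table by hand as `[survived, died]`.

```python
import numpy as np


def entropy_of_counts(counts):
    """Entropy (in bits) of a set described by its per-class counts."""
    counts = np.asarray(counts, dtype=float)
    probs = counts[counts > 0] / counts.sum()  # zero counts contribute nothing
    return float(-(probs * np.log2(probs)).sum())


def weighted_entropy(groups):
    """Weighted average entropy of a split.

    `groups` holds one list of class counts per attribute value.
    """
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy_of_counts(g) for g in groups)


# Vitamin, counted from the table as [survived, died]:
# B complex -> [2, 3], C -> [4, 0], D -> [3, 2]
h_vitamin = weighted_entropy([[2, 3], [4, 0], [3, 2]])
print(f"{h_vitamin:.4f}")
```

The same call, with counts taken from the other columns, covers the remaining attributes; TDIDT then splits on the attribute with the lowest weighted entropy.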


In [None]:
# compute H(vitamin)



In [None]:
# compute H(size)


In [None]:
# compute H(Exercise)


In [None]:
# compute H(Bypass)


## Building a tree classifier using `scikit-learn` 

In [None]:
%%file heart.csv
Vitamin,    Size,   Excercise,   Bypass, Survived
B complex,  large,  regularly,  no,     -
B complex,  large,  regularly,  yes,    - 
C,          large,  regularly,  no,     + 
D,          medium, regularly,  no,     + 
D,          small,  little,     no,     + 
D,          small,  little,     yes,    - 
C,          small,  little,     yes,    +
B complex,  medium, regularly,  not,    - 
B complex,  small,  little,     no,     + 
D,          medium, little,     no,     +
B complex,  medium, little,     yes,    +
C,          medium, regularly,  yes,    + 
C,          big,    little,     no,     +
D,          medium, regularly,    yes,    -

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

When reading the dataset, it is important to strip the spaces around the commas: note the regular-expression separator `\s*,\s*` passed as `sep`, which requires the Python parsing engine.

In [None]:
# raw string, so that "\s" is not treated as an (invalid) string escape
df = pd.read_csv("heart.csv", sep=r"\s*,\s*", engine='python')
df
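
To see what the regular-expression separator does, here is a small self-contained sketch on an in-memory snippet. The three-column text and the name `df_demo` are made up for illustration; only the `sep` and `engine` arguments mirror the cell above.

```python
import io

import pandas as pd

# A tiny, hypothetical excerpt padded with spaces like heart.csv.
raw = (
    "Vitamin,    Size,   Bypass\n"
    "B complex,  large,  no\n"
    "C,          small,  yes\n"
)

# The regex consumes the whitespace on both sides of each comma,
# while spaces inside a value ("B complex") are left alone.
df_demo = pd.read_csv(io.StringIO(raw), sep=r"\s*,\s*", engine="python")

print(df_demo.columns.tolist())
print(repr(df_demo.loc[0, "Vitamin"]), repr(df_demo.loc[0, "Size"]))
```

Without the custom separator, the column names and values would keep their leading spaces, so a lookup such as `df['Size']` would fail.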

In [None]:
df.info()

In [None]:
df1 = df.copy()

In [None]:
dt_clf = DecisionTreeClassifier()

print(df.drop(columns=['Survived']))

# Note: fitting on the raw string columns raises a ValueError, because
# scikit-learn trees require numeric features; the cells below encode them.
dt_clf.fit(df.drop(columns=['Survived']), df['Survived'])
tree.plot_tree(dt_clf)

In [None]:
df1['Vitamin'] = df['Vitamin'].astype("category")
df1['Size'] = df['Size'].astype("category")
df1['Excercise'] = df['Excercise'].astype("category")
df1

In [None]:
df1['Survived'] = df['Survived'] == '+'  # True for survivors
df1

In [None]:
df1['Bypass'] = df['Bypass'] == 'yes'  # any other value ("no", "not") becomes False
df1['Excercise'] = df['Excercise'] == 'regularly'
df1

In [None]:
dt_clf = DecisionTreeClassifier()

print(df1.drop(columns=['Survived']))

# Note: this still raises a ValueError, because Vitamin and Size are
# categorical columns holding strings rather than numbers.
dt_clf.fit(df1.drop(columns=['Survived']), df1['Survived'])
tree.plot_tree(dt_clf)

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

dt_clf = DecisionTreeClassifier()

one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)
print(one_hot_data)
dt_clf.fit(one_hot_data, data['Class'])
tree.plot_tree(dt_clf)

In [None]:
df2 = pd.get_dummies(df1, columns=['Vitamin'])
df2

In [None]:
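from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(['small', 'medium', 'large', 'big'])
# LabelEncoder sorts the classes alphabetically (big=0, large=1, medium=2, small=3),
# so the integer codes do not follow the natural order of the sizes.
print(le.classes_)
df2['Size'] = le.transform(df['Size'])
df2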

Build and plot the corresponding decision tree.
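
One possible way to finish the exercise is sketched below; treat it as an illustration under stated assumptions rather than the reference solution. It retypes the table inline so that it runs on its own, keeps the column spelling `Excercise` from `heart.csv`, encodes *Size* with `LabelEncoder` and *Vitamin* with `pd.get_dummies` as in the cells above, and assumes `criterion="entropy"` so that splits are chosen by entropy, as in the manual part. The tree is printed with `export_text`, so no plotting backend is needed.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# The heart-attack table from this notebook, typed in without padding spaces.
CSV = """Vitamin,Size,Excercise,Bypass,Survived
B complex,large,regularly,no,-
B complex,large,regularly,yes,-
C,large,regularly,no,+
D,medium,regularly,no,+
D,small,little,no,+
D,small,little,yes,-
C,small,little,yes,+
B complex,medium,regularly,not,-
B complex,small,little,no,+
D,medium,little,no,+
B complex,medium,little,yes,+
C,medium,regularly,yes,+
C,big,little,no,+
D,medium,regularly,yes,-
"""
df = pd.read_csv(io.StringIO(CSV))

# Features: two boolean flags, integer-coded Size, one-hot encoded Vitamin.
X = pd.DataFrame({
    "Excercise": df["Excercise"] == "regularly",
    "Bypass": df["Bypass"] == "yes",  # the value "not" is treated as False
    "Size": LabelEncoder().fit_transform(df["Size"]),
})
X = X.join(pd.get_dummies(df["Vitamin"], prefix="Vitamin")).astype(float)
y = df["Survived"] == "+"

# Assumption: entropy as the split criterion; random_state fixes tie-breaking.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=list(X.columns)))
```

Replacing the last line with `tree.plot_tree(clf, feature_names=list(X.columns))` draws the same tree with matplotlib. Because scikit-learn builds binary trees over numeric thresholds, the result need not match the multiway tree obtained by hand with TDIDT.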
