# Supervised Learning Algorithms: Decision Trees for Classification and Regression Tasks (CART)

<b>Decision trees</b> (often introduced under the name CART, for Classification And Regression Trees) are machine learning algorithms that make predictions by repeatedly splitting the data into subsets based on feature values. They are used for both classification (assigning a label to an item) and regression (predicting a numerical value).
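
As a rough illustration of the two task types, the sketch below fits a scikit-learn classifier and regressor on a tiny made-up dataset. The data, the feature name `x`, and the variable names are hypothetical and chosen only for illustration; they are not part of the demo that follows.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text

# Hypothetical toy data: a single numeric feature with two clearly separated groups.
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y_class = [0, 0, 0, 1, 1, 1]              # labels for the classification task
y_value = [1.5, 1.7, 1.6, 8.0, 8.2, 8.1]  # targets for the regression task

# Classification: the tree learns a threshold on x that separates the labels.
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)

# Regression: with max_depth=1 the tree makes one split and predicts the
# mean target value of each side.
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y_value)

# Print the learned rules of the classifier as text.
print(export_text(clf, feature_names=["x"]))

pred_class = clf.predict([[2.5], [10.5]])
pred_value = reg.predict([[2.5], [10.5]])
print(pred_class, pred_value)
```

In both cases the model is a set of threshold rules on the features; only the value stored in the leaves differs (a class label versus a number).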

## Decision Trees for Classification Tasks

## Demo 1: To Play or Not To Play

In [95]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

In [56]:
# Generating a basic dataset
dataset = [
        ['sunny','hot','high',0,0],
        ['sunny','Hot','High',1,0],
        ['Overcast','Hot','High',0,1],
        ['Rainy','mild','High',0,1],
        ['Rainy','Cool','Normal',0,1],
        ['Rainy','Cool','Normal',1,0],
        ['Overcast','Cool','Normal',1,1],
        ['Sunny','mild','High',0,0],
        ['Sunny','cool','normal',0,1],
        ['rainy','mild','normal',0,1],
        ['Sunny','mild','normal',1,1],
        ['overcast','mild','High',1,1],
        ['overcast','hot','normal',0,1],
        ['rainy','mild','high',1,0],
]
df = pd.DataFrame(dataset, columns = ['Outlook', 'Temp', 'Humidity', 'Windy', 'Play'])
df.Outlook = df.Outlook.str.lower()
df.Temp = df.Temp.str.lower()
df.Humidity = df.Humidity.str.lower()

In [57]:
df

,Outlook,Temp,Humidity,Windy,Play
0,sunny,hot,high,0,0
1,sunny,hot,high,1,0
2,overcast,hot,high,0,1
3,rainy,mild,high,0,1
4,rainy,cool,normal,0,1
5,rainy,cool,normal,1,0
6,overcast,cool,normal,1,1
7,sunny,mild,high,0,0
8,sunny,cool,normal,0,1
9,rainy,mild,normal,0,1


In [59]:
## Getting only the feature column "outlook" and the output column "play", where outlook == 'sunny'
df[df['Outlook'] == "sunny"][['Outlook','Play']]

,Outlook,Play
0,sunny,0
1,sunny,0
7,sunny,0
8,sunny,1
10,sunny,1


<b>Splitting Criteria:</b> At each node, the model chooses the feature to split on according to a splitting criterion. For classification tasks, common criteria include <b>Gini Impurity, Entropy, Information Gain, and Gini Gain</b>.
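
For reference, these criteria are usually written as follows. The notation is introduced here for illustration and is an assumption, not taken from the notebook: $S$ is the set of samples at a node, $p_i$ is the fraction of samples in $S$ belonging to class $i$, $A$ is a candidate feature, and $S_v$ is the subset of $S$ where $A$ takes the value $v$.

$$
G(S) = 1 - \sum_i p_i^2
\qquad
H(S) = -\sum_i p_i \log_2 p_i
$$

$$
\mathrm{InformationGain}(S, A) = H(S) - \sum_v \frac{|S_v|}{|S|}\, H(S_v)
\qquad
\mathrm{GiniGain}(S, A) = G(S) - \sum_v \frac{|S_v|}{|S|}\, G(S_v)
$$

Both gains compare the impurity of the parent node with the weighted average impurity of the child nodes, so a larger gain means a more useful split.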

<b>Gini Impurity:</b> Gini impurity is a measure of how mixed the classes are in a given dataset or subset of data.
<ul>
    <li>It ranges from 0 to 1 - 1/k for k classes: 0 indicates that a node contains only samples of a single class, and the maximum (0.5 for two classes) is reached when the samples are evenly distributed among the classes.</li>
    <li>The decision tree algorithm aims to minimize the Gini impurity when making splits.</li>
</ul>
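
To make the range concrete, here is a small sketch that evaluates the Gini impurity on three made-up label lists. The helper `gini_impurity` and the example labels are illustrative assumptions.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# Hypothetical label lists, chosen to show the extremes.
pure = gini_impurity([1, 1, 1, 1])                # one class only
balanced_2 = gini_impurity([0, 1, 0, 1])          # two classes, evenly mixed
balanced_4 = gini_impurity(["a", "b", "c", "d"])  # four classes, evenly mixed

print(pure, balanced_2, balanced_4)
```

A pure node gives 0, while an even mix gives 1 - 1/k, which approaches 1 only as the number of classes k grows.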

In [60]:
## Gini Impurity Formula
def GiniImpurity(data_f):
    unique_classes, class_counts = np.unique(data_f, return_counts=True)
    probabilities = class_counts / len(data_f)
    
    return 1 - np.sum(probabilities ** 2)

In [61]:
## Entropy Formula
def Entropy(data_f):
    unique_classes, class_counts = np.unique(data_f, return_counts=True)
    probabilities = class_counts / len(data_f)
    
    return - np.sum(probabilities * np.log2(probabilities))

In [63]:
## Todo Task
def InformationGain(data_f):
    pass
    # Your answer goes here

In [64]:
## Todo Task
def GiniGain(data_f):
    pass
    # Your answer goes here

In [52]:
## Gini Impurity for The main dataset
print("Gini Impurity: ",GiniImpurity(df['Play']))
## Entropy for The main dataset
print("Entropy: ",Entropy(df['Play']))

Gini Impurity:  0.4591836734693877
Entropy:  0.9402859586706311
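
As a check on the printed values, the same numbers follow by hand from the class counts in the Play column (9 rows with Play = 1 and 5 rows with Play = 0, out of 14):

$$
G = 1 - \left(\tfrac{9}{14}\right)^2 - \left(\tfrac{5}{14}\right)^2 = 1 - \tfrac{106}{196} \approx 0.459
$$

$$
H = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940
$$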


In [68]:
## Gini Impurity and Entropy of "Play" for the rows where Outlook == 'sunny'
print("Gini Impurity (Outlook == Sunny): ",GiniImpurity(df[df['Outlook'] == "sunny"]['Play']))
print("Entropy (Outlook == Sunny): ",Entropy(df[df['Outlook'] == "sunny"]['Play']))

Gini Impurity (Outlook == Sunny):  0.48
Entropy (Outlook == Sunny):  0.9709505944546686


In [76]:
## Let's compute the Gini Impurity for the column "Outlook"
## Total Gini Impurity = weighted average of Gini Impurities for the leaves

gini_impurity_outlook_sunny = GiniImpurity(df.loc[(df.Outlook == "sunny"), "Play"])
gini_impurity_outlook_overcast = GiniImpurity(df.loc[(df.Outlook == "overcast"), "Play"])
gini_impurity_outlook_rainy = GiniImpurity(df.loc[(df.Outlook == "rainy"), "Play"])


print("Gini Impurity (Outlook == Sunny): ",gini_impurity_outlook_sunny)
print("Gini Impurity (Outlook == Overcast): ",gini_impurity_outlook_overcast)
print("Gini Impurity (Outlook == Rainy): ",gini_impurity_outlook_rainy)

Gini Impurity (Outlook == Sunny):  0.48
Gini Impurity (Outlook == Overcast):  0.0
Gini Impurity (Outlook == Rainy):  0.48


In [84]:
df[["Outlook"]].value_counts()

Outlook 
rainy       5
sunny       5
overcast    4
dtype: int64

In [85]:
## Total Gini Impurity = weighted average of Gini Impurities for the leaves
gini_impurity_outlook = (5/14) * gini_impurity_outlook_sunny + (4/14) * gini_impurity_outlook_overcast + (5/14) * gini_impurity_outlook_rainy

In [86]:
print("Total Gini Impurity of OUTLOOK: ", gini_impurity_outlook)

Total Gini Impurity of OUTLOOK:  0.34285714285714286
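
The hard-coded weights 5/14, 4/14, and 5/14 can also be derived from the data. The sketch below recomputes the same weighted average with `groupby`; the names `weather`, `gini_impurity`, and `weighted_gini` are hypothetical helpers, and the two columns are copied from the dataset above so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# Outlook and Play columns of the 14-row dataset (already lower-cased).
weather = pd.DataFrame({
    "Outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "Play": [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0],
})

def weighted_gini(frame, feature, target):
    """Average the Gini impurity of each group, weighted by group size."""
    total = len(frame)
    return sum(
        (len(group) / total) * gini_impurity(group[target])
        for _, group in frame.groupby(feature)
    )

gini_outlook = weighted_gini(weather, "Outlook", "Play")
print("Total Gini Impurity of OUTLOOK: ", gini_outlook)
```

Each group's weight is its share of the rows, so this reproduces the value computed by hand above.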


In [87]:
## Computing Gini Impurity for the remaining features [Temp, Humidity, Windy]
## Your code goes here!

### What if there are numeric/continuous variables in the dataset?

In [91]:
## Let's change the temp column to a numerical column
df[['Temp']]

,Temp
0,hot
1,hot
2,hot
3,mild
4,cool
5,cool
6,cool
7,mild
8,cool
9,mild


In [93]:
## Assigning numerical/continuous values to the Temp column
df['Temp'] = [25.6, 27, 29.4, 18, 10.22, 9.2, 8, 15, 8, 14, 14, 18, 31, 13.8]

In [94]:
df

,Outlook,Temp,Humidity,Windy,Play
0,sunny,25.6,high,0,0
1,sunny,27.0,high,1,0
2,overcast,29.4,high,0,1
3,rainy,18.0,high,0,1
4,rainy,10.22,normal,0,1
5,rainy,9.2,normal,1,0
6,overcast,8.0,normal,1,1
7,sunny,15.0,high,0,0
8,sunny,8.0,normal,0,1
9,rainy,14.0,normal,0,1


<b>Dealing with numerical variables:</b> As you may have noticed, categorical variables are straightforward to handle, because each category defines its own subset. But how can we deal with numerical/continuous variables?
<ul>
    <li><b>First step:</b> We sort the rows by the numerical value, from lowest to highest.</li>
    <li><b>Second step:</b> We calculate the average of each pair of adjacent values; these averages are the candidate split thresholds.</li>
    <li><b>Third step:</b> For each candidate threshold, we split the rows into those below and those above it and compute the weighted Gini Impurity of the two groups; the threshold with the lowest value is the best split for that column.</li>
</ul>
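
The three steps above can be sketched as follows, using the Temp and Play values from this demo. The helper names `gini_impurity` and `best_numeric_split` are hypothetical. One assumption beyond the text: midpoints are taken between adjacent <i>distinct</i> values, since averaging two equal values would not produce a new way of splitting the rows.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_numeric_split(values, labels):
    """Return (best (threshold, weighted_gini), all candidates) for one numeric column."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)

    # Step 1: sort the values (np.unique returns them sorted and de-duplicated).
    distinct = np.unique(values)

    # Step 2: average each pair of adjacent distinct values.
    thresholds = (distinct[:-1] + distinct[1:]) / 2

    # Step 3: weighted Gini impurity of the two groups for each threshold.
    candidates = []
    for t in thresholds:
        left = labels[values <= t]
        right = labels[values > t]
        weighted = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / len(labels)
        candidates.append((float(t), float(weighted)))

    # The best threshold is the one with the lowest weighted impurity.
    return min(candidates, key=lambda c: c[1]), candidates

# Temp and Play columns from the dataset above.
temp = [25.6, 27, 29.4, 18, 10.22, 9.2, 8, 15, 8, 14, 14, 18, 31, 13.8]
play = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

best, candidates = best_numeric_split(temp, play)
for threshold, weighted in candidates:
    print(f"Temp <= {threshold:.2f}: weighted Gini = {weighted:.4f}")
print("Best split:", best)
```

The weighted impurity of the best threshold can then be compared with the totals of the categorical columns, such as the one computed for Outlook, to decide which feature to split on.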