# TP4 - AI
---
_Author: CHRISTOFOROU Anthony_\
_Due Date: XX-XX-2023_\
_Updated: 29-11-2023_\
_Description: TP4 - AI_

---

In [23]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt

# Modules
from assignment4.utils import (calculate_entropy, calculate_information_gain, calculate_gini_index)
from assignment4.algorithms.decision_trees.id3 import ID3DecisionTree

# make figures appear inline
matplotlib.rcParams['figure.figsize'] = (15, 8)
%matplotlib inline

# notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Entropy and Information Gain


### 1.1. Dependent Variable Entropy

The first step in building a decision tree is to calculate the entropy of the dependent variable. The entropy of the dependent variable is also known as the class entropy.

<div class='alert alert-info'>
But what is entropy?
</div>

Entropy is a measure of the amount of uncertainty or randomness in data.
To calculate the entropy of the dependent variable, we need to determine the frequency of each class in the target variable and then use the entropy formula:

$$Entropy(S) = -\sum_{i=1}^{c}p_i\log_2(p_i)$$

where $c$ is the number of classes and $p_i$ is the proportion of elements of $S$ that belong to class $i$.
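As a rough illustration of the formula, here is a minimal sketch in plain Python. The function name `entropy_of_labels` is hypothetical and is not the `calculate_entropy` helper imported above, whose implementation is not shown in this notebook; the sketch only assumes that the class labels are available as a plain sequence.

```python
from collections import Counter
from math import log2


def entropy_of_labels(labels):
    """Shannon entropy, in bits, of a sequence of class labels (hypothetical helper)."""
    total = len(labels)
    # p_i = number of elements in class i / number of elements in S
    proportions = [count / total for count in Counter(labels).values()]
    return -sum(p * log2(p) for p in proportions)


print(f"{entropy_of_labels([0, 0, 1, 1]):.3f}")  # balanced classes -> 1.000
print(f"{entropy_of_labels([0, 0, 0, 1]):.3f}")  # 3:1 split -> 0.811
```

The more evenly the labels are spread over the classes, the closer the entropy gets to its maximum; a set with a single class has zero entropy.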

<div class='alert alert-info'>
So how do we do this, and how do we know which one is the dependent variable?
</div>

Let's start by loading the CSV data and finding the dependent variable. We will then use the formula to calculate the entropy.

In [24]:
file_path = 'data/data.csv'
data = pd.read_csv(file_path)

# Display the first rows of the dataset
data.head()

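|   | A | B | C | D | E | c |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 3 | 1 | 1 | 0 |
| 1 | 0 | 1 | 2 | 1 | 2 | 0 |
| 2 | 0 | 0 | 3 | 4 | 0 | 1 |
| 3 | 0 | 1 | 1 | 3 | 1 | 0 |
| 4 | 0 | 0 | 1 | 3 | 0 | 1 |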


The dataset has five independent variables (A, B, C, D, E) and one dependent variable (c). The next step is to calculate the entropy of `'c'`.

In [25]:
entropy = calculate_entropy(data, 'c')
print(f"Calculated entropy: {entropy:.3f}")

Calculated entropy: 0.974


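The entropy of the dependent variable `'c'` in the dataset is approximately `0.974` bits. This value represents the amount of uncertainty or randomness in the distribution of class labels. If `'c'` takes only the two values seen above (0 and 1), the maximum possible entropy is 1 bit, so a value of `0.974` means the two classes are close to evenly balanced.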

### 1.2 Information Gain after Random Decision Criteria Application 

Next, we will calculate the information gain after applying three random decision criteria. For this, we need to:

1. Select three random features (criteria) from the independent variables.
2. For each feature, split the dataset based on its unique values.
3. Calculate the entropy for each split.
4. Compute the information gain for each feature.
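These steps correspond to the usual definition of information gain, where $S_v$ is the subset of $S$ in which feature $A$ takes the value $v$:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)}\frac{|S_v|}{|S|}Entropy(S_v)$$

The sketch below walks through steps 2 to 4 on a toy dataset. It is only an illustration: the names `information_gain`, `entropy_of_labels` and `toy_rows` are hypothetical, the rows are plain dictionaries rather than a DataFrame, and the imported `calculate_information_gain` may be implemented differently.

```python
from collections import Counter, defaultdict
from math import log2


def entropy_of_labels(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    # Step 2: group the target labels by each unique value of the feature
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    # Step 3: entropy of each subset, weighted by the subset's share of the rows
    weighted = sum(len(g) / len(rows) * entropy_of_labels(g) for g in groups.values())
    # Step 4: gain = entropy before the split - weighted entropy after it
    return entropy_of_labels([row[target] for row in rows]) - weighted


# Toy data (made up for illustration): x determines c, y is unrelated to c
toy_rows = [
    {"x": 0, "y": 0, "c": 0},
    {"x": 0, "y": 1, "c": 0},
    {"x": 1, "y": 0, "c": 1},
    {"x": 1, "y": 1, "c": 1},
]
print(f"{information_gain(toy_rows, 'x', 'c'):.3f}")  # x separates the classes -> 1.000
print(f"{information_gain(toy_rows, 'y', 'c'):.3f}")  # y carries no information -> 0.000
```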

<div class='alert alert-info'>
We will select three features randomly from the dataset and calculate their information gain. 
</div>

In [26]:
# Select three random features from the independent variables
random_features = np.random.choice(['A', 'B', 'C', 'D', 'E'], 3, replace=False)

# Calculate the information gain for each of these features
information_gains = {feature: calculate_information_gain(data, feature, 'c') for feature in random_features}
for feature, gain in information_gains.items():
    print(f"Information gain for {feature}: {gain:.3f}")

Information gain for D: 0.167
Information gain for B: 0.152
Information gain for A: 0.029


<div class='alert alert-info'>
These values indicate how much each feature reduces the uncertainty about the class labels. A higher information gain implies a greater reduction in uncertainty.
</div> 

### 1.3 Gini Index

Next, we will calculate the Gini index for the same three features. The Gini index is calculated as:

$$Gini(S) = 1 - \sum_{i=1}^{c}p_i^2$$

where $c$ is the number of classes and $p_i$ is the proportion of elements of $S$ that belong to class $i$.
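To make the formula concrete, here is a small sketch in plain Python. It assumes that the Gini index reported for a feature is the size-weighted average of the Gini impurity of each subset produced by splitting on that feature; the names `gini_impurity`, `gini_after_split` and `toy_rows` are hypothetical and do not describe how the imported `calculate_gini_index` is written.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini(S) = 1 - sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_after_split(rows, feature, target):
    """Weighted Gini impurity of the subsets obtained by splitting on feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    return sum(len(g) / len(rows) * gini_impurity(g) for g in groups.values())


print(gini_impurity([0, 0, 1, 1]))          # two balanced classes -> 0.5
print(f"{gini_impurity([0, 1, 2]):.3f}")    # three balanced classes -> 0.667

# Toy data (made up for illustration): x determines c, y is unrelated to c
toy_rows = [
    {"x": 0, "y": 0, "c": 0},
    {"x": 0, "y": 1, "c": 0},
    {"x": 1, "y": 0, "c": 1},
    {"x": 1, "y": 1, "c": 1},
]
print(gini_after_split(toy_rows, "x", "c"))  # pure subsets -> 0.0
print(gini_after_split(toy_rows, "y", "c"))  # subsets as mixed as before -> 0.5
```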

Let's calculate the Gini index for the same random features. The Gini index of a set lies between 0 and $1 - 1/c$, where $c$ is the number of classes: 0 indicates perfect purity (all elements in a subset belong to the same class), and the upper bound, 0.5 for two classes, indicates maximal impurity (elements evenly distributed across the classes). It approaches 1 only as the number of classes grows.

In [27]:
# Calculate the Gini index for each of the randomly selected features
gini_indices = {feature: calculate_gini_index(data, feature, 'c') for feature in random_features}
for feature, gini in gini_indices.items():
    print(f"Gini index for {feature}: {gini:.3f}")

Gini index for D: 0.377
Gini index for B: 0.384
Gini index for A: 0.463


The Gini index measures the impurity of a dataset after a split. A lower Gini index indicates a better split, as it implies a higher purity of the subsets created by the split. 

### 1.4 Best Decision Criteria

Repeating the calculation over several random draws of three features shows a consistent pattern:
- Whenever `'E'` is among the sampled features, it has the highest information gain, which suggests it is the most informative feature for predicting the dependent variable `'c'`.
- In those same draws, `'E'` also has the lowest Gini index, so it appears to be the most effective at reducing impurity.

Therefore, according to both the information gain and the Gini index, feature `'E'` is the preferred decision criterion in this scenario.
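Rather than relying on repeated random draws, the same comparison can be made deterministically by scoring every feature once and keeping the best one. The sketch below is an illustration on made-up data, not the notebook's own code: `best_feature`, `impurity_after_split` and `toy_rows` are hypothetical names, and the rows are plain dictionaries.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(k / n * log2(k / n) for k in Counter(labels).values())


def gini(labels):
    n = len(labels)
    return 1 - sum((k / n) ** 2 for k in Counter(labels).values())


def impurity_after_split(rows, feature, target, impurity):
    """Size-weighted impurity of the subsets created by splitting on feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    return sum(len(g) / len(rows) * impurity(g) for g in groups.values())


def best_feature(rows, features, target, impurity):
    # The impurity before the split is the same for every feature, so the
    # lowest impurity after the split is also the highest gain.
    return min(features, key=lambda f: impurity_after_split(rows, f, target, impurity))


# Toy data (made up for illustration): x determines c, y and z do not
toy_rows = [
    {"x": 0, "y": 0, "z": 0, "c": 0},
    {"x": 0, "y": 1, "z": 0, "c": 0},
    {"x": 1, "y": 0, "z": 0, "c": 1},
    {"x": 1, "y": 1, "z": 1, "c": 1},
]
print(best_feature(toy_rows, ["x", "y", "z"], "c", entropy))  # x
print(best_feature(toy_rows, ["x", "y", "z"], "c", gini))     # x
```

On this toy data both criteria select the same feature, which mirrors the agreement between information gain and the Gini index observed above.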

## 2. ID3 Algorithm

### 2.1. ID3 Algorithm Implementation

In [28]:
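# The last column ('c') is the target; all other columns are features
target = 'c'
features = data.columns[:-1]

# Build the tree from the full dataset, then print its structure
tree = ID3DecisionTree()
tree.build_tree(data, features, target)

tree.render_tree()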

Node: E
    ├── Value: 0
        Node: C
            ├── Value: 0.0
                └── Leaf: 0.0
            ├── Value: 1.0
                └── Leaf: 1.0
            ├── Value: 2.0
                Node: A
                    ├── Value: 0.0
                        └── Leaf: 1.0
                    └── Value: 1.0
                        Node: B
                            ├── Value: 0.0
                                └── Leaf: 0.0
                            └── Value: 1.0
                                └── Leaf: 1.0
            └── Value: 3.0
                └── Leaf: 1.0
    ├── Value: 1
        Node: D
            ├── Value: 0.0
                Node: A
                    ├── Value: 1.0
                        └── Leaf: 0.0
                    └── Value: 2.0
                        └── Leaf: 1.0
            ├── Value: 1.0
                Node: B
                    ├── Value: 0.0
                        Node: A
                            ├── Value: 0.0
                            

### 2.3 ID3 Algorithm Data Generation

In [29]:
datapoints = tree.generate_data(10)
datapoints.head()

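|   | E | D | B | class | A | C |
|---|---|---|---|---|---|---|
| 0 | 2 | 1.0 | 1.0 | 0.0 |  |  |
| 1 | 2 | 0.0 |  | 1.0 |  |  |
| 2 | 1 | 2.0 |  | 0.0 | 0.0 |  |
| 3 | 2 | 0.0 |  | 1.0 |  |  |
| 4 | 1 | 1.0 | 2.0 | 1.0 |  | 0.0 |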
