# TP4 - AI
---
_Author: CHRISTOFOROU Anthony_\
_Due Date: XX-XX-2023_\
_Updated: 29-11-2023_\
_Description: TP4 - AI_

---

In [128]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt

# Modules
from assignment4.utils import (entropy, information_gain, gini_index, accuracy, precision, recall, f1, render_scores)
from assignment4.algorithms.decision_trees.id3 import ID3DecisionTree

# make figures appear inline
matplotlib.rcParams['figure.figsize'] = (15, 8)
%matplotlib inline

# notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Entropy and Information Gain


### 1.1. Dependent Variable Entropy

The first step in building a decision tree is to calculate the entropy of the dependent variable. The entropy of the dependent variable is also known as the class entropy.

<div class='alert alert-info'>
But what is entropy?
</div>

Entropy is a measure of the amount of uncertainty or randomness in data.
To calculate the entropy of the dependent variable, we need to determine the frequency of each class in the target variable and then use the entropy formula:

$$Entropy(S) = -\sum_{i=1}^{c}p_i\log_2(p_i)$$

where $p_i$ is the proportion of the number of elements in class $i$ to the number of elements in set $S$.
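
As a minimal sketch of the formula above (not the `entropy` function from `assignment4.utils`, whose implementation is not shown here), the entropy of a label column could be computed as follows. The name `entropy_sketch` and the toy labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def entropy_sketch(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of a column of class labels (illustrative helper)."""
    # Proportion p_i of each class that actually occurs in the column
    proportions = labels.value_counts(normalize=True).to_numpy()
    # Absent classes never show up in value_counts, so log2 is always defined
    return float(-(proportions * np.log2(proportions)).sum())


# Two equally likely classes carry exactly one bit of uncertainty
toy_labels = pd.Series([0, 0, 1, 1])
print(entropy_sketch(toy_labels))  # 1.0
```

A column containing a single class has entropy 0, since there is no uncertainty left about the label.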

<div class='alert alert-info'>
So how do we do this, and how do we know which one is the dependent variable?
</div>

Let's start by loading the CSV data and finding the dependent variable. We will then use the formula to calculate the entropy.

In [129]:
file_path = 'data/data.csv'
data = pd.read_csv(file_path)

# Display the first rows of the dataset
data.head()

Unnamed: 0,A,B,C,D,E,c
0,0,1,3,1,1,0
1,0,1,2,1,2,0
2,0,0,3,4,0,1
3,0,1,1,3,1,0
4,0,0,1,3,0,1


The dataset has five independent variables (`A`, `B`, `C`, `D`, `E`) and one dependent variable (`c`). The next step is to calculate the entropy of `c`.

In [130]:
# Use a distinct name so the imported entropy() function is not overwritten
class_entropy = entropy(data, 'c')
print(f"Calculated entropy: {class_entropy:.3f}")

Calculated entropy: 0.974


The entropy of the dependent variable `'c'` in the dataset is approximately `0.974`. This value represents the amount of uncertainty or randomness in the distribution of class labels in the dependent variable.

### 1.2 Information Gain after Random Decision Criteria Application 

Next, we will calculate the information gain after applying three random decision criteria. For this, we need to:

1. Select three random features (criteria) from the independent variables.
2. For each feature, split the dataset based on its unique values.
3. Calculate the entropy for each split.
4. Compute the information gain for each feature.
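
The four steps above can be sketched on a toy dataset. This is an illustrative version under the usual definition (parent entropy minus the size-weighted entropy of the subsets), not necessarily how `information_gain` in `assignment4.utils` is written. The names `entropy_of`, `information_gain_sketch`, and the toy frame are assumptions.

```python
import numpy as np
import pandas as pd


def entropy_of(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of a label column (illustrative helper)."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())


def information_gain_sketch(df: pd.DataFrame, feature: str, target: str) -> float:
    """Entropy of the target minus the weighted entropy after splitting on `feature`."""
    parent = entropy_of(df[target])
    weighted_children = 0.0
    # One subset per unique value of the feature
    for _, subset in df.groupby(feature):
        weighted_children += len(subset) / len(df) * entropy_of(subset[target])
    return parent - weighted_children


toy = pd.DataFrame({
    "x": [0, 0, 1, 1],  # separates the classes perfectly
    "y": [0, 1, 0, 1],  # tells us nothing about the class
    "c": [0, 0, 1, 1],
})
print(information_gain_sketch(toy, "x", "c"))  # 1.0
print(information_gain_sketch(toy, "y", "c"))  # 0.0
```

Splitting on `x` removes all uncertainty, so its gain equals the full class entropy, while `y` leaves each subset as mixed as the original data.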

<div class='alert alert-info'>
We will select three features randomly from the dataset and calculate their information gain. 
</div>

In [131]:
# Select three random features from the independent variables
random_features = np.random.choice(['A', 'B', 'C', 'D', 'E'], 3, replace=False)

# Calculate the information gain for each of these features
information_gains = {feature: information_gain(data, feature, 'c') for feature in random_features}
for feature, gain in information_gains.items():
    print(f"Information gain for {feature}: {gain:.3f}")

Information gain for C: 0.013
Information gain for A: 0.029
Information gain for E: 0.257


<div class='alert alert-info'>
These values indicate how much each feature reduces the uncertainty about the class labels. A higher information gain implies a greater reduction in uncertainty.
</div> 

### 1.3 Gini Index

Next, we will calculate the Gini index for the same three features. The Gini index is calculated as:

$$Gini(S) = 1 - \sum_{i=1}^{c}p_i^2$$

where $p_i$ is the proportion of the number of elements in class $i$ to the number of elements in set $S$.

Let's calculate the Gini index for the same random features. For $c$ classes, the Gini index lies between 0 and $1 - 1/c$: 0 indicates perfect purity (all elements in a subset belong to the same class), and the maximum, which is 0.5 for a binary target such as `c`, indicates maximal impurity (elements are evenly distributed across the classes).

In [132]:
gini_indices = {feature: gini_index(data, feature, 'c') for feature in random_features}
for feature, index in gini_indices.items():
    print(f"Gini index for {feature}: {index:.3f}")

Gini index for C: 0.473
Gini index for A: 0.463
Gini index for E: 0.318


The Gini index measures the impurity of a dataset after a split. A lower Gini index indicates a better split, as it implies a higher purity of the subsets created by the split. 
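
A minimal sketch of this computation, assuming the Gini index of a split is the size-weighted average of the Gini impurity of each subset (the actual `gini_index` in `assignment4.utils` may differ). The function names and toy data are hypothetical.

```python
import pandas as pd


def gini_impurity(labels: pd.Series) -> float:
    """Gini impurity 1 - sum(p_i^2) of a label column (illustrative helper)."""
    p = labels.value_counts(normalize=True)
    return float(1.0 - (p ** 2).sum())


def weighted_gini_after_split(df: pd.DataFrame, feature: str, target: str) -> float:
    """Size-weighted Gini impurity of the subsets produced by splitting on `feature`."""
    total = len(df)
    return sum(
        len(subset) / total * gini_impurity(subset[target])
        for _, subset in df.groupby(feature)
    )


toy = pd.DataFrame({
    "x": [0, 0, 1, 1],  # separates the classes perfectly
    "y": [0, 1, 0, 1],  # leaves every subset evenly mixed
    "c": [0, 0, 1, 1],
})
print(weighted_gini_after_split(toy, "x", "c"))  # 0.0
print(weighted_gini_after_split(toy, "y", "c"))  # 0.5
```

The pure split on `x` scores 0, while the uninformative split on `y` leaves two evenly mixed binary subsets, each with impurity 0.5.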

### 1.4 Best Decision Criteria

In [133]:
best_feature_info_gain = max(information_gains, key=information_gains.get)
print(f"Best feature by Information Gain: {best_feature_info_gain} with gain {information_gains[best_feature_info_gain]:.3f}")

best_feature_gini_index = min(gini_indices, key=gini_indices.get)
print(f"Best feature by Gini Index: {best_feature_gini_index} with index {gini_indices[best_feature_gini_index]:.3f}")

Best feature by Information Gain: E with gain 0.257
Best feature by Gini Index: E with index 0.318


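Both criteria select feature `E`:
- It has the highest information gain (0.257), so among the three sampled features it is the most informative for predicting the dependent variable `'c'`.
- It has the lowest Gini index (0.318), so splitting on it is the most effective at reducing impurity.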

## 2. ID3 Algorithm

### 2.1. ID3 Algorithm Implementation

In [134]:
target = 'c'
features = data.columns[:-1]

tree = ID3DecisionTree(use_gini=True)
tree.build_tree(data, features, target)

tree.render_tree()
print(tree)

E
├── Value: 0
│   └── C
│       ├── Value: 0
│       │   └── Leaf: False
│       ├── Value: 1
│       │   └── Leaf: True
│       ├── Value: 2
│       │   └── A
│       │       ├── Value: 0
│       │       │   └── Leaf: True
│       │       └── Value: 1
│       │           └── B
│       │               ├── Value: 0
│       │               │   └── Leaf: False
│       │               └── Value: 1
│       │                   └── Leaf: True
│       └── Value: 3
│           └── Leaf: True
├── Value: 1
│   └── D
│       ├── Value: 0
│       │   └── A
│       │       ├── Value: 1
│       │       │   └── Leaf: False
│       │       └── Value: 2
│       │           └── Leaf: True
│       ├── Value: 1
│       │   └── C
│       │       ├── Value: 0
│       │       │   └── Leaf: True
│       │       ├── Value: 1
│       │       │   └── Leaf: False
│       │       ├── Value: 2
│       │       │   └── B
│       │       │       ├── Value: 0
│       │       │       │   └── Leaf: True
│       │       │

### 2.3 ID3 Algorithm Data Generation Procedure

Let's generate data using the tree we created in the previous section. We will use the `predict()` function to predict the class label for each data point in the dataset.

In [135]:
new_data = tree.predict(data)

new_data

0      0
1      0
2      1
3      0
4      1
      ..
195    0
196    1
197    1
198    1
199    0
Name: c, Length: 200, dtype: int64

The first predictions shown above match the `c` column from `data.head()`. This is expected, because the decision tree was trained on this same data, so the result says nothing about how well the tree generalizes. To test the performance of the decision tree, we need a different dataset.

### 2.4. Forest of Decision Trees with Majority Vote

Using the information gain criterion, we will build 5 decision trees, each trained on a random sample of 80% of the data (drawn with replacement in the code below). The prediction of the forest is the majority vote of its trees.
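
The sampling step can be sketched on a toy frame with `DataFrame.sample`. The toy data and the `random_state` argument are assumptions added for reproducibility; the notebook's own sampling is unseeded.

```python
import pandas as pd

# Hypothetical 10-row dataset standing in for the real training data
toy = pd.DataFrame({"A": range(10), "c": [0, 1] * 5})

# Draw 80% of the rows; replace=True allows the same row to be picked more than once
sample = toy.sample(frac=0.8, replace=True, random_state=0)

print(len(sample))             # 8 rows: 80% of 10
print(sample.index.is_unique)  # False whenever some row was drawn twice
```

Because each tree sees a slightly different sample, the trees can disagree on individual rows, which is what gives the majority vote something to decide.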

In [136]:
def majority_vote(forest: list[ID3DecisionTree], test_data: pd.DataFrame) -> pd.Series:
    """Predict the majority vote for a given test dataset using a forest of decision trees.
    
    Parameters
    ----------
    forest : list
        A list of decision trees.
    test_data : pandas.DataFrame
        The test dataset.
    
    Returns
    -------
    pandas.Series
        The majority vote for each row in the test dataset.
    """
    predictions = [forest_tree.predict(test_data) for forest_tree in forest]

    # Combine predictions and decide based on majority vote
    predictions_df = pd.DataFrame(predictions).T
    majority_votes = predictions_df.mode(axis=1)[0]  # [0] to select the first mode in case of ties

    return majority_votes
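
To see what the row-wise mode does, here is a toy example with three hypothetical trees voting on three rows. The prediction values are made up for illustration.

```python
import pandas as pd

# Hypothetical predictions of three trees for the same three rows
toy_predictions = [
    pd.Series([1, 0, 1]),  # tree 1
    pd.Series([1, 1, 0]),  # tree 2
    pd.Series([0, 0, 1]),  # tree 3
]

# After transposing: one row per data point, one column per tree
votes = pd.DataFrame(toy_predictions).T

# mode(axis=1) finds the most frequent value in each row; column 0 is the first mode
majority = votes.mode(axis=1)[0]
print(majority.tolist())
```

The rows receive the votes (1, 1, 0), (0, 1, 0), and (1, 0, 1), so the majority labels are 1, 0, and 1. With an odd number of trees and two classes, ties cannot occur.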

Let's build the forest and use its majority vote to predict the class label for each data point in the test dataset.

In [137]:
forest: list[ID3DecisionTree] = []
sample_size = 0.8

for i in range(5):
    sample_data = data.sample(frac=sample_size, replace=True)
    forest_tree = ID3DecisionTree()
    forest_tree.build_tree(sample_data, features, target)
    forest.append(forest_tree)
    
test_data_path = 'data/data_test.csv'
test_data = pd.read_csv(test_data_path)

forest_predictions = majority_vote(forest, test_data)

print(f"Predictions:\n{forest_predictions}")

Predictions:
0      1
1      1
2      1
3      0
4      1
      ..
195    0
196    1
197    1
198    0
199    0
Name: 0, Length: 200, dtype: int64


### 2.5 Evaluation of Tree-Based Models

In this section, we delve into the assessment of our initial decision tree and compare its effectiveness with that of a random forest. The evaluation metrics include accuracy, precision, recall, and the F1 score. Each of these metrics offers a unique perspective on the performance of the models.

#### 2.5.1. Accuracy

Accuracy represents the proportion of all predictions that were correct. It's a straightforward measure of how often the model predicts correctly, irrespective of the prediction type.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, $TP$ and $TN$ represent true positives and true negatives, respectively, while $FP$ and $FN$ denote false positives and false negatives. The numerator is the number of correct predictions and the denominator is the total number of predictions.

#### 2.5.2. Precision

Precision, also known as Positive Predictive Value, quantifies the accuracy of positive predictions. It shows the fraction of positive predictions that were actually correct.

$$Precision = \frac{\text{True Positives}}{\text{Total Predicted Positives}}$$

#### 2.5.3. Recall

Recall, or Sensitivity, measures the model's ability to correctly identify all relevant instances. Specifically, it's the proportion of actual positives that were correctly identified.

$$Recall = \frac{\text{True Positives}}{\text{Total Actual Positives}}$$

#### 2.5.4. F1 Score

The F1 score is a balanced measure that combines precision and recall. It is particularly useful when the distribution of the classes is imbalanced.

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$

By using these metrics, we can thoroughly evaluate and compare the performance of the decision tree and the forest of decision trees, providing a comprehensive understanding of their strengths and weaknesses.
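
The four metrics can be sketched directly from the confusion counts. This is an illustrative version for a binary target with positive class 1, not the `accuracy`, `precision`, `recall`, and `f1` functions from `assignment4.utils`. The helper names and toy labels are assumptions.

```python
import pandas as pd


def confusion_counts(predicted: pd.Series, actual: pd.Series, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = int(((predicted == positive) & (actual == positive)).sum())
    tn = int(((predicted != positive) & (actual != positive)).sum())
    fp = int(((predicted == positive) & (actual != positive)).sum())
    fn = int(((predicted != positive) & (actual == positive)).sum())
    return tp, tn, fp, fn


def scores_sketch(predicted: pd.Series, actual: pd.Series) -> dict:
    """Accuracy, precision, recall and F1, guarding against division by zero."""
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": (tp + tn) / len(actual),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


# Toy labels: 2 true positives, 2 true negatives, 1 false positive, 1 false negative
predicted = pd.Series([1, 1, 0, 0, 1, 0])
actual = pd.Series([1, 0, 0, 1, 1, 0])
for metric, value in scores_sketch(predicted, actual).items():
    print(f"{metric}: {value:.3f}")
```

In this toy case all four metrics equal 2/3, because the single false positive and the single false negative balance each other.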


In [138]:
tree_predictions = tree.predict(test_data)

tree_scores = {
    'Accuracy': accuracy(tree_predictions, test_data['c']),
    'Precision': precision(tree_predictions, test_data['c']),
    'Recall': recall(tree_predictions, test_data['c']),
    'F1': f1(tree_predictions, test_data['c'])
}

forest_scores = {
    'Accuracy': accuracy(forest_predictions, test_data['c']),
    'Precision': precision(forest_predictions, test_data['c']),
    'Recall': recall(forest_predictions, test_data['c']),
    'F1': f1(forest_predictions, test_data['c'])
}

render_scores(tree_scores, forest_scores)

Metric,Decision Tree,Random Forest
Accuracy,0.785,0.76
Precision,0.935,0.931
Recall,0.699,0.659
F1,0.8,0.771


### 2.6. Decision Tree Results Analysis

1. **Accuracy:** The decision tree has an accuracy of 0.785, while the random forest has an accuracy of 0.76. This means that the decision tree predicts correctly in about 79% of cases, compared to 76% for the random forest. In this respect, the decision tree is slightly better.

2. **Precision:** Precision is very high for both models: 0.935 for the decision tree and 0.931 for the random forest. This indicates that when they predict the positive class, they are usually correct.

3. **Recall:** The recall is higher for the decision tree (0.699) than for the random forest (0.659). This means that the decision tree is better at detecting positive instances.

4. **F1 Score:** The F1 score, which is the harmonic mean of precision and recall, is also higher for the decision tree (~0.80) than for the random forest (~0.77). This score is particularly important in situations where a balance between precision and recall is crucial.
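
As a quick sanity check, the reported F1 scores can be recomputed from the precision and recall values in the table above. Since those inputs are already rounded to three decimals, a difference in the last digit is a rounding effect. The helper name `f1_from` is an assumption.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the score table
reported = {
    "Decision Tree": (0.935, 0.699, 0.800),
    "Random Forest": (0.931, 0.659, 0.771),
}

for model, (p, r, f1_reported) in reported.items():
    print(f"{model}: recomputed F1 = {f1_from(p, r):.3f}, reported F1 = {f1_reported:.3f}")
```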

#### Conclusion on Model Choice using the F1 Score

Based on the F1 score, the decision tree should be preferred. Although its accuracy is only slightly better than that of the random forest, its ability to balance precision and recall, as evidenced by its higher F1 score, makes it more suitable, especially in contexts where it's important to maintain a good balance between detecting positive instances (recall) and minimizing false positives (precision).