# Lab 4 - Information Theory in Machine Learning

Welcome to this week's lab on Information Theory as applied to Machine Learning. We will focus on two key concepts: Entropy and Information Gain. These principles are fundamental to understanding how decision trees choose their splits to organize data effectively.

### Entropy
- Entropy, in the context of information theory, measures the level of uncertainty or disorder within a set of data.
- In machine learning, particularly in decision trees, entropy helps to determine how a dataset should be split. High entropy means more disorder: the class labels in the dataset are mixed. Conversely, low entropy means that most samples share the same label.
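
For reference, the usual way to write this quantity is sketched below. The symbols are notation introduced here for illustration: $K$ is the number of classes and $p_k$ is the proportion of samples in class $k$.

$$
H(Y) = -\sum_{k=1}^{K} p_k \log_2 p_k
$$

With base-2 logarithms the result is measured in bits. A fair coin ($p_1 = p_2 = 0.5$) gives $H = 1$ bit, and $K$ equally likely classes give the maximum value $\log_2 K$.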

### Information Gain
- Information Gain measures the reduction in entropy after the dataset is split on an attribute.
- It is crucial in building decision trees as it helps to decide the order of attributes the tree will use for splitting the data. The attribute with the highest Information Gain is chosen as the splitting attribute at each node.
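
Written out, the reduction in entropy takes the form below. This is a sketch in notation assumed here: $A$ is the attribute being split on, $Y_v$ is the subset of labels whose attribute value is $v$, and $H$ is the entropy of a set of labels.

$$
IG(Y, A) = H(Y) - \sum_{v \in \mathrm{values}(A)} \frac{|Y_v|}{|Y|} \, H(Y_v)
$$

The second term is the weighted average entropy of the subsets produced by the split, so the gain is zero when the split leaves every subset as mixed as the original data.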

## Part 1: Entropy and Information Gain in Decision Trees
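Decision Trees use these concepts to create branches. By choosing splits that maximize Information Gain (equivalently, minimize the weighted entropy of the resulting subsets), a decision tree can effectively categorize data, leading to better classification models. Regression trees follow the same idea but typically use a variance-based criterion instead of entropy.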

### Step 1: Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

### Step 2: Load and Explore the Iris Dataset

In [5]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

### Step 3: Calculate Entropy
To calculate the `entropy` we need to:
- First, extract the target variable `y` from your dataset (like the 'target' column in the Iris dataset).
- Then, call `calculate_entropy(y)` to get the entropy.

The `calculate_entropy` function defined below computes the entropy of a given target variable `y`. It first determines the unique classes in `y`, then computes the proportion of samples in each class, and finally sums the entropy contribution of each proportion. The result quantifies the disorder or uncertainty in the dataset, a fundamental concept in information theory.

In [7]:
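def calculate_entropy(y):
    """Return the Shannon entropy (in bits) of the class labels in y."""
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        # Proportion of samples that belong to this class
        probability = len(y[y == label]) / len(y)
        # Each class contributes -p * log2(p) to the total
        entropy -= probability * np.log2(probability)
    return entropy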

Calculate the entropy for the target variable. What is your observation about the calculated entropy?

In [23]:
entropy_target = calculate_entropy(df['target'])
print(f"Entropy of the target variable: {entropy_target}")

Entropy of the target variable: 1.584962500721156


The calculated entropy of 1.584962500721156 bits for the Iris dataset's target variable equals the maximum possible entropy for a 3-class problem, log2(3) ≈ 1.585.

The Iris dataset has three classes of iris (setosa, versicolor, and virginica) with 50 samples each, so the classes are perfectly balanced and the entropy reaches its maximum. If one class were dominant, the entropy would be lower.
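
To make the last point concrete, the sketch below compares the balanced Iris class counts with a made-up, heavily imbalanced set of counts. The helper `entropy_from_counts` and the imbalanced counts are assumptions introduced for illustration; they are not part of the lab code.

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    counts = np.asarray(counts, dtype=float)
    probabilities = counts / counts.sum()
    # Skip empty classes: 0 * log2(0) is treated as 0
    probabilities = probabilities[probabilities > 0]
    return float(-(probabilities * np.log2(probabilities)).sum())

balanced = entropy_from_counts([50, 50, 50])     # the Iris class counts
imbalanced = entropy_from_counts([130, 10, 10])  # made-up counts with one dominant class

print(f"Balanced:   {balanced:.3f} bits")
print(f"Imbalanced: {imbalanced:.3f} bits")
```

The balanced case should match the value computed above, while the imbalanced case comes out clearly lower because most samples are predictable from the dominant class alone.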

### Step 4: Calculate Information Gain
There are three steps for calculating the Information Gain:
1. Compute Overall Entropy: Use the entropy function from Step 3 on the entire target dataset.
2. Calculate Weighted Entropy for Each Attribute: For each unique value in the attribute, partition the dataset and calculate its entropy. Then calculate the weighted sum of these entropies, where the weights are the proportions of instances in each partition.
3. Compute Information Gain: Subtract the weighted entropy of the split from the original entropy.

The attribute with the highest Information Gain is generally chosen for splitting, as it provides the most significant reduction in uncertainty. This step is critical in constructing an effective decision tree, as it directly influences the structure and depth of the tree.
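
The three steps can be traced on a tiny example before applying them to Iris. The sketch below uses a made-up four-row table in which `outlook` separates the classes perfectly and `windy` does not separate them at all. The data and the helper names `entropy` and `information_gain` are illustrative assumptions, written with `groupby` rather than the lab's provided function.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy (in bits) of a pandas Series of class labels."""
    probabilities = labels.value_counts(normalize=True).to_numpy()
    return float(-(probabilities * np.log2(probabilities)).sum())

# Made-up toy data for illustration only
toy = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rainy", "rainy"],
    "windy":   ["yes",   "no",    "yes",   "no"],
    "play":    ["no",    "no",    "yes",   "yes"],
})

def information_gain(df, attribute, target):
    # Step 1: entropy of the target before splitting
    total = entropy(df[target])
    # Step 2: entropy of each partition, weighted by its share of the rows
    weighted = sum(
        (len(part) / len(df)) * entropy(part[target])
        for _, part in df.groupby(attribute)
    )
    # Step 3: reduction in entropy achieved by the split
    return total - weighted

gain_outlook = information_gain(toy, "outlook", "play")
gain_windy = information_gain(toy, "windy", "play")
print(f"outlook: {gain_outlook:.3f}, windy: {gain_windy:.3f}")
```

Splitting on `outlook` removes all uncertainty (a gain of 1 bit, the full entropy of `play`), whereas splitting on `windy` leaves each partition as mixed as before (a gain of 0).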

**Use the provided function to calculate the information gain for each of the features in the dataset.**

In [27]:
def calculate_information_gain(df, attribute, target_name):
    # Entropy of the target before any split
    total_entropy = calculate_entropy(df[target_name])
    values, counts = np.unique(df[attribute], return_counts=True)
    # Entropy of each partition, weighted by the partition's share of the rows
    weighted_entropy = sum(
        (count / counts.sum()) * calculate_entropy(df.loc[df[attribute] == value, target_name])
        for value, count in zip(values, counts)
    )
    information_gain = total_entropy - weighted_entropy
    return information_gain


Discuss your findings here.

In [29]:
feature_names = df.columns[:-1]  # Exclude the target column

target_name = 'target'  # Define the target column name
for feature in feature_names:
    information_gain = calculate_information_gain(df, feature, target_name)
    print(f"Information Gain for {feature}: {information_gain}")


Information Gain for sepal length (cm): 0.8769376208910578
Information Gain for sepal width (cm): 0.5166428756804977
Information Gain for petal length (cm): 1.4463165236458
Information Gain for petal width (cm): 1.4358978386754417
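
One caveat when reading numbers like these: information gain computed per distinct value tends to favor attributes with many distinct values, and the Iris measurements are continuous. The sketch below illustrates the effect on made-up data, where a hypothetical `row_id` column that is unique per row (and therefore carries no generalizable signal) still achieves the maximum possible gain. All names and values here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy (in bits) of a pandas Series of class labels."""
    probabilities = labels.value_counts(normalize=True).to_numpy()
    return float(-(probabilities * np.log2(probabilities)).sum())

def information_gain(df, attribute, target):
    """Entropy of the target minus the weighted entropy of its partitions."""
    weighted = sum(
        (len(part) / len(df)) * entropy(part[target])
        for _, part in df.groupby(attribute)
    )
    return entropy(df[target]) - weighted

# Made-up data: 'row_id' is unique per row, 'coin' is a weakly related binary attribute
demo = pd.DataFrame({
    "target": [0, 0, 0, 1, 1, 1],
    "row_id": range(6),
    "coin":   [0, 1, 0, 1, 0, 1],
})

gain_row_id = information_gain(demo, "row_id", "target")
gain_coin = information_gain(demo, "coin", "target")
print(f"row_id: {gain_row_id:.3f}, coin: {gain_coin:.3f}")
```

Because every `row_id` partition contains a single row, each partition is trivially pure and the gain equals the full entropy of the target. This is one reason tree implementations commonly use threshold splits or a normalized criterion for numeric attributes.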


## Key Findings:

+ **Petal Features are More Informative**: The petal length (1.446) and petal width (1.436) have significantly higher information gain than the sepal features. This indicates that the petal measurements are much more effective at distinguishing between the iris species than the sepal measurements.  

+ **Sepal Features are Less Informative**: Sepal length (0.877) and sepal width (0.517) have considerably lower information gain.  Sepal measurements are less useful on their own for classifying the iris species. Sepal width, in particular, seems to be the least informative of the four features.

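+ **Predictive Power**:  A higher information gain suggests that a feature has more predictive power.  As a result, petal length or petal width is the natural candidate for the first splitting feature in a decision tree. Note that these values treat every distinct measurement as its own category, which tends to favor features with many distinct values.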
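
One way to check this expectation is to fit scikit-learn's `DecisionTreeClassifier` with `criterion="entropy"` and inspect the root split. This is a minimal sketch, not part of the original lab steps. scikit-learn evaluates binary threshold splits (`feature <= threshold`) rather than one branch per distinct value, so its internal gain values will differ from the ones printed above, and `random_state=0` is an arbitrary choice made here for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Fit an unrestricted tree that scores candidate splits by entropy reduction
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(iris.data, iris.target)

# Node 0 is the root; tree_.feature holds the feature index used at each node
root_feature = iris.feature_names[clf.tree_.feature[0]]
root_threshold = clf.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.2f}")
```

The root split should land on one of the two petal features, since either one can separate setosa from the other two species with a single threshold.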


## Part 2: Apply Entropy and Information Gain on a different dataset

Your task is to choose a new dataset and implement what you learned in `Part 1` on this new dataset.

## Data Set: 
Bohanec, M. (1988). Car Evaluation [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5JP48.

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

CAR         car acceptability
 + buying    buying price
 + maint     price of the maintenance
 + doors     number of doors
 + persons   capacity in terms of persons to carry
 + lug_boot  the size of luggage boot
 + safety    estimated safety of the car

### Task 1: Implement Entropy and Information Gain

In [41]:
# Load the Car Evaluation data; the raw file has no header row
df = pd.read_csv('car.data', header=None)

# Assign column names
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'target']

feature_names = df.columns[:-1]  # Exclude the target column

target_name = 'target'  # Define the target column name
for feature in feature_names:
    information_gain = calculate_information_gain(df, feature, target_name)
    print(f"Information Gain for {feature}: {information_gain}")


Information Gain for buying: 0.09644896916961376
Information Gain for maint: 0.07370394692148574
Information Gain for doors: 0.004485716626631886
Information Gain for persons: 0.21966296333990798
Information Gain for lug_boot: 0.030008141247605202
Information Gain for safety: 0.26218435655426375


### Task 2: Discuss your findings in detail
Provide detailed explanation and discussion about your findings.

The information gain values provide valuable insights into which features of the Car Evaluation dataset are most influential in determining the car's acceptability.

+ `safety` (0.262): This feature has the highest information gain.  Taken on its own, the safety rating reduces uncertainty about the car's evaluation more than any other feature, so it is the most informative single predictor of whether a car will be considered acceptable.

+ `persons` (0.220): The number of persons the car can accommodate is the second most important feature.  It means that car capacity is a significant factor in its overall evaluation.

+ `buying` (0.096): The buying price also has a noticeable impact, though less than safety and capacity.  This is expected, as affordability is always a consideration.

+ `maint` (0.074): Maintenance cost plays a role, but it's less influential than safety, capacity, or price.

+ `lug_boot` (0.030): The size of the luggage boot has a relatively small information gain.  It's not as decisive as the other features.

+ `doors` (0.004): The number of doors has the lowest information gain, close to zero.  On its own it barely distinguishes acceptable from unacceptable cars, making it the least informative feature in this dataset.

Information gain is central to how decision trees learn: it forms the foundation of feature selection and tree construction. The information gain values provide a clear picture of the relative importance of each feature in the Car Evaluation dataset and how a decision tree would likely use these features to make predictions.

Specifically, a decision tree will prioritize the `safety` and `persons` features, which exhibit the greatest information gain, leading to branches that partition cars by safety rating and passenger capacity. After these initial splits, the data is divided into subsets. Within each subset, information gain is recalculated to determine the next best feature for splitting, as the relative importance of the remaining features may vary.  Because each additional split fits the training data more closely, a deep tree risks overfitting. Techniques such as pruning or setting a maximum tree depth can be used to mitigate this risk.
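
The two mitigation techniques mentioned above map onto the `max_depth` and `ccp_alpha` (cost-complexity pruning) parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below is illustrative only. It uses the Iris data because it ships with scikit-learn, whereas the categorical car features would first need to be encoded numerically. The values `max_depth=2`, `ccp_alpha=0.02`, and `random_state=0` are arbitrary assumptions, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# Depth limit: stop after at most two levels of splits
shallow_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X, y)
# Cost-complexity pruning: remove branches whose benefit does not justify their size
pruned_tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.02, random_state=0).fit(X, y)

for name, model in [("full", full_tree), ("max_depth=2", shallow_tree), ("ccp_alpha=0.02", pruned_tree)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")
```

In practice these settings would be chosen by comparing accuracy on held-out data rather than fixed in advance.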



## Submission
Submit your completed Jupyter Notebook file through the submission link in Blackboard.