# Information Gain Demo

This notebook demonstrates how to calculate information gain for classification problems using the ITM 4150 toolkit (`cuanalytics`).

**Learning Objectives:**
- Understand what information gain measures
- Calculate information gain for a feature

## Setup

First, let's import the toolkit. The dataset is loaded in the next section.

In [1]:
# Import ITM 4150 toolkit
import cuanalytics

## Load the Mushroom Dataset

We'll use the UCI Mushroom dataset, which contains descriptions of mushrooms classified as either edible (e) or poisonous (p).


In [2]:
# Load the dataset
df = cuanalytics.load_mushroom_data()

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

Dataset shape: (8124, 23)

First few rows:


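| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |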


In [3]:
# Check the class distribution
print("Class Distribution:")
print(df['class'].value_counts())
print("\nPercentage:")
print(df['class'].value_counts(normalize=True) * 100)

Class Distribution:
class
e    4208
p    3916
Name: count, dtype: int64

Percentage:
class
e    51.797144
p    48.202856
Name: proportion, dtype: float64


## Step 1: Calculate Initial Entropy

Before we split the data, let's calculate the entropy of the entire dataset. Entropy measures the uncertainty in the target variable: it is 0 bits when every row has the same class and 1 bit when two classes are split evenly.

In [4]:
# Calculate entropy of the target variable
initial_entropy = cuanalytics.calculate_entropy(df['class'])

print(f"Initial Dataset Entropy: {initial_entropy:.4f} bits")
print("\nInterpretation:")
print("- Maximum possible entropy for binary classification: 1.0 bits")
print(f"- Our dataset has {initial_entropy:.4f} bits of uncertainty")
print("- This is very close to the maximum, meaning the classes are nearly balanced")

Initial Dataset Entropy: 0.9991 bits

Interpretation:
- Maximum possible entropy for binary classification: 1.0 bits
- Our dataset has 0.9991 bits of uncertainty
- This is very close to the maximum, meaning the classes are nearly balanced
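
The entropy value above can be checked by hand from the class counts printed earlier (4208 edible, 3916 poisonous). The sketch below uses the standard Shannon entropy formula with a base-2 logarithm. The helper name `entropy_bits` is hypothetical, and this is not necessarily how `cuanalytics.calculate_entropy` is implemented internally.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy, in bits, of an iterable of class labels (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Sum -p * log2(p) over every class that actually occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Rebuild the label column from the class counts shown in the notebook output
labels = ["e"] * 4208 + ["p"] * 3916
h = entropy_bits(labels)
print(f"Entropy from class counts: {h:.4f} bits")
```

A perfectly even two-class split gives exactly 1 bit and a single-class column gives 0 bits, so a result just under 1 is consistent with the roughly 52/48 class split.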


## Step 2: Calculate Information Gain for a Single Feature

Let's start by examining the 'odor' feature, which is known to be highly informative for mushroom classification.

In [5]:
# Calculate information gain for odor
ig_odor = cuanalytics.information_gain(df, 'odor', 'class')

print(f"Information Gain for 'odor': {ig_odor:.4f} bits")
print("\nThis means:")
print(f"- We reduce uncertainty by {ig_odor:.4f} bits when we split on odor")
print(f"- Percentage of entropy removed: {(ig_odor/initial_entropy)*100:.1f}%")

Information Gain for 'odor': 0.9061 bits

This means:
- We reduce uncertainty by 0.9061 bits when we split on odor
- Percentage of entropy removed: 90.7%
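
Information gain is conventionally defined as the entropy of the target before the split minus the size-weighted average entropy of the groups created by the split. The sketch below follows that definition in plain Python so that each step is visible. The names `entropy_bits` and `information_gain_sketch` and the toy data are hypothetical, and the toolkit's `information_gain` function may be implemented differently.

```python
import math
from collections import Counter, defaultdict

def entropy_bits(labels):
    """Shannon entropy, in bits, of an iterable of class labels (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain_sketch(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of each feature-value group."""
    feature_values = list(feature_values)
    labels = list(labels)

    # Group the class labels by the feature value they occur with
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)

    # Each group's entropy is weighted by its share of the rows
    total = len(labels)
    weighted_child_entropy = sum(
        (len(group) / total) * entropy_bits(group) for group in groups.values()
    )
    return entropy_bits(labels) - weighted_child_entropy

# Toy data for illustration only (not the mushroom dataset)
toy_odor = ["a", "a", "f", "f", "n", "n", "n", "n"]
toy_class = ["e", "e", "p", "p", "e", "e", "e", "p"]

ig_toy = information_gain_sketch(toy_odor, toy_class)
print(f"Toy information gain: {ig_toy:.4f} bits")
```

With the notebook's DataFrame, a call such as `information_gain_sketch(df['odor'], df['class'])` would be the analogous computation. Whether it matches `cuanalytics.information_gain` exactly depends on how the toolkit handles details such as missing values.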


In [6]:
# Count how many mushrooms have each odor value
print("Unique odor values:")
print(df['odor'].value_counts())

Unique odor values:
odor
n    3528
f    2160
y     576
s     576
a     400
l     400
p     256
c     192
m      36
Name: count, dtype: int64
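
The counts above show how many rows fall under each odor value, but not how the classes are distributed within each value. One way to inspect that is to tabulate class counts per feature value. A value whose rows all share one class is a pure group and contributes zero entropy after the split. The sketch below does this in plain Python on hypothetical toy data; `class_counts_by_value` is an illustrative helper, not part of `cuanalytics`.

```python
from collections import Counter, defaultdict

def class_counts_by_value(feature_values, labels):
    """Map each feature value to a Counter of the class labels seen with it."""
    table = defaultdict(Counter)
    for value, label in zip(feature_values, labels):
        table[value][label] += 1
    return dict(table)

# Toy data for illustration only (not the mushroom dataset)
toy_odor = ["a", "a", "f", "f", "n", "n", "n", "n"]
toy_class = ["e", "e", "p", "p", "e", "e", "e", "p"]

breakdown = class_counts_by_value(toy_odor, toy_class)
for value, counts in sorted(breakdown.items()):
    group_size = sum(counts.values())
    is_pure = len(counts) == 1  # only one class present in this group
    print(f"{value}: n={group_size}, counts={dict(counts)}, pure={is_pure}")
```

On the real data, the same function could be called as `class_counts_by_value(df['odor'], df['class'])` to see which odor values separate edible from poisonous mushrooms cleanly.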
