# Understanding Information Gain

**Information Gain** answers a critical question: *Which feature should I use to split my data?*

Think of it like playing "20 Questions":
- **Good question**: "Is it bigger than a breadbox?" (splits possibilities roughly in half)
- **Bad question**: "Is it exactly 3.7 inches tall?" (doesn't narrow things down much)
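
The analogy can be made concrete with a little arithmetic. The sketch below assumes a made-up game with 16 equally likely objects (that number and the helper name `expected_gain` are illustrative, not part of this notebook's library) and compares a question that splits the candidates in half with one that singles out a single object.

```python
from math import log2

def expected_gain(n_items, n_yes):
    """Expected bits gained from a yes/no question over n_items equally
    likely candidates, where n_yes of them would be answered "yes"."""
    before = log2(n_items)  # uncertainty before asking
    n_no = n_items - n_yes
    # Expected uncertainty left after hearing the answer
    after = sum((k / n_items) * log2(k) for k in (n_yes, n_no) if k > 0)
    return before - after

balanced = expected_gain(16, 8)  # "bigger than a breadbox?" style question
narrow = expected_gain(16, 1)    # "exactly 3.7 inches tall?" style question

print(f"Balanced question: {balanced:.2f} bits")
print(f"Narrow question:   {narrow:.2f} bits")
```

Under these assumptions the balanced question yields a full bit, while the narrow one yields roughly a third of a bit, because most of the time the answer is "no" and almost nothing is ruled out.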

Information Gain measures how much a feature reduces our uncertainty about the target variable.

**Formula:** Information Gain = Entropy(parent) - Weighted Average Entropy(children)
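
Written out in standard notation, the formula looks like this. Here $S$ is the parent set of rows, $A$ is the feature being tested, $S_v$ is the subset of rows where $A = v$, and $p_c$ is the fraction of rows in a set that belong to class $c$. The base-2 logarithm is an assumption, chosen because this notebook reports entropy in bits.

$$
H(S) = -\sum_{c} p_c \log_2 p_c
$$

$$
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
$$

The fraction $|S_v| / |S|$ is the weight in the "weighted average": larger child groups count for more.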

---

In this notebook, you'll:
1. See how different features reduce uncertainty
2. Compare features to find the "best" splitter
3. Understand why decision trees choose certain features first

In [1]:
import pandas as pd
import cuanalytics

## Our Mission: Identify Poisonous Mushrooms

The mushroom dataset contains 8,124 mushrooms, each described by 22 features plus a `class` label marking it as **edible (e)** or **poisonous (p)**.

**Your goal:** Find which features best distinguish poisonous from edible mushrooms!

In [2]:
# Load the dataset
df = cuanalytics.load_mushroom_data()

print(f"Dataset: {len(df)} mushrooms, {len(df.columns)} columns")
print("\nClass Distribution:")
print(df['class'].value_counts())

Dataset: 8124 mushrooms, 23 columns

Class Distribution:
class
e    4208
p    3916
Name: count, dtype: int64


In [3]:
# Quick peek at the data
df.head(3)

## Step 1: Our Starting Point (Parent Entropy)

Before we split on anything, how uncertain are we?

With a nearly 50/50 split between edible and poisonous, we expect **high entropy** (near 1.0).

In [4]:
# Calculate starting entropy
parent_entropy = cuanalytics.calculate_entropy(df['class'])

print(f"Starting Entropy: {parent_entropy:.4f} bits")
print("\n💡 Interpretation:")
print("Maximum entropy (perfect 50/50 split) = 1.0000 bits")
print(f"Our entropy = {parent_entropy:.4f} bits")
print("\nWe're VERY uncertain! If you pick a random mushroom,")
print("you have almost no idea if it's safe to eat.")
print("\nWe need to reduce this uncertainty by finding good features to split on!")

Starting Entropy: 0.9991 bits

💡 Interpretation:
Maximum entropy (perfect 50/50 split) = 1.0000 bits
Our entropy = 0.9991 bits

We're VERY uncertain! If you pick a random mushroom,
you have almost no idea if it's safe to eat.

We need to reduce this uncertainty by finding good features to split on!
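
The 0.9991 figure can be reproduced by hand from the class counts printed earlier (4,208 edible, 3,916 poisonous). This is a minimal sketch using only the standard library. It assumes base-2 logarithms and does not show how `cuanalytics.calculate_entropy` is implemented internally.

```python
from math import log2

# Class counts as printed by value_counts() earlier in this notebook
n_edible, n_poisonous = 4208, 3916
total = n_edible + n_poisonous

p_e = n_edible / total      # proportion of edible mushrooms
p_p = n_poisonous / total   # proportion of poisonous mushrooms

# Shannon entropy for two classes, in bits
manual_entropy = -(p_e * log2(p_e) + p_p * log2(p_p))

print(f"p(edible) = {p_e:.4f}, p(poisonous) = {p_p:.4f}")
print(f"Entropy by hand: {manual_entropy:.4f} bits")
```

The two proportions are close to 0.5 each, which is why the result sits just below the 1-bit maximum.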


## Step 2: Test a Feature - Does 'odor' Help?

Let's test the **odor** feature. Does knowing a mushroom's smell help us predict if it's poisonous?

**Prediction:** What do you think? Will odor be:
- Very helpful (high information gain)?
- Somewhat helpful (medium information gain)?
- Not helpful (low information gain)?

Let's find out!

In [5]:
# Calculate information gain for odor
ig_odor = cuanalytics.information_gain(df, 'odor', 'class')
remaining_entropy = parent_entropy - ig_odor
pct_reduction = (ig_odor / parent_entropy) * 100

print(f"Information Gain for 'odor': {ig_odor:.4f} bits")
print("\n🎯 Wow! This is HUGE!")
print(f"\nWe reduced entropy from {parent_entropy:.4f} to {remaining_entropy:.4f} bits")
print(f"That's a {pct_reduction:.1f}% reduction in uncertainty!")
print("\nKnowing the odor almost completely tells us if a mushroom is poisonous!")

Information Gain for 'odor': 0.9061 bits

🎯 Wow! This is HUGE!

We reduced entropy from 0.9991 to 0.0930 bits
That's a 90.7% reduction in uncertainty!

Knowing the odor almost completely tells us if a mushroom is poisonous!


### Why is odor so informative?

Let's peek under the hood to see what's happening when we split on odor.

In [6]:
# See the odor categories
print("Odor values in dataset:")
print(df['odor'].value_counts())
print("\n(n=none, f=foul, y=fishy, s=spicy, a=almond, l=anise, p=pungent, c=creosote, m=musty)")

Odor values in dataset:
odor
n    3528
f    2160
y     576
s     576
a     400
l     400
p     256
c     192
m      36
Name: count, dtype: int64

(n=none, f=foul, y=fishy, s=spicy, a=almond, l=anise, p=pungent, c=creosote, m=musty)


In [7]:
# Look at two extreme cases
print("Let's examine two extreme cases:\n")

no_smell = df[df['odor'] == 'n']
print("Mushrooms with NO smell (n):")
print(no_smell['class'].value_counts())
print("→ 96.6% are EDIBLE!\n")

foul_smell = df[df['odor'] == 'f']
print("Mushrooms with FOUL smell (f):")
print(foul_smell['class'].value_counts())
print("→ 100% are POISONOUS!")

Let's examine two extreme cases:

Mushrooms with NO smell (n):
class
e    3408
p     120
Name: count, dtype: int64
→ 96.6% are EDIBLE!

Mushrooms with FOUL smell (f):
class
p    2160
Name: count, dtype: int64
→ 100% are POISONOUS!


**Aha!** 🎓

Odor creates very "pure" groups:
- **No smell → 96.6% edible** (low entropy, about 0.21 bits)
- **Foul smell → 100% poisonous** (zero entropy!)

This is why odor has such high information gain: it creates subgroups with very low uncertainty!
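
To put numbers on "pure", the sketch below computes the entropy of the two groups examined above directly from their printed class counts. The helper `entropy_from_counts` is a hypothetical name introduced for illustration and is not part of `cuanalytics`.

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy in bits, given the number of rows in each class."""
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

h_none = entropy_from_counts([3408, 120])  # odor == 'n': 3408 edible, 120 poisonous
h_foul = entropy_from_counts([2160])       # odor == 'f': all 2160 poisonous

# Share of the whole dataset that falls in the "no smell" group
weight_none = 3528 / 8124
contribution_none = weight_none * h_none

print(f"Entropy of 'no smell' group:  {h_none:.4f} bits")
print(f"Entropy of 'foul' group:      {h_foul:.4f} bits")
print(f"Weighted share from 'n' only: {contribution_none:.4f} bits")
```

The weighted contribution of the "no smell" group alone comes to roughly 0.093 bits. That is essentially the whole remaining entropy reported after splitting on odor, which suggests the other odor groups add almost no uncertainty.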

## Step 3: Compare Different Features

Not all features are created equal! Let's test several features and compare their information gain.

**Before you run the code:** Which feature do you think will be most helpful?
- cap-shape (the shape of the mushroom cap)
- cap-color (the color of the cap)
- bruises (whether it bruises when touched)
- gill-spacing (spacing of the gills under the cap)

In [8]:
# Calculate information gain for each feature
ig_odor = cuanalytics.information_gain(df, 'odor', 'class')
ig_gill = cuanalytics.information_gain(df, 'gill-spacing', 'class')
ig_bruises = cuanalytics.information_gain(df, 'bruises', 'class')
ig_color = cuanalytics.information_gain(df, 'cap-color', 'class')
ig_shape = cuanalytics.information_gain(df, 'cap-shape', 'class')

print("Information Gain Comparison:")
print("="*40)
print()
print(f"odor:           {ig_odor:.4f} ({ig_odor/parent_entropy*100:.1f}%)")
print(f"gill-spacing:   {ig_gill:.4f} ({ig_gill/parent_entropy*100:.1f}%)")
print(f"bruises:        {ig_bruises:.4f} ({ig_bruises/parent_entropy*100:.1f}%)")
print(f"cap-color:      {ig_color:.4f} ({ig_color/parent_entropy*100:.1f}%)")
print(f"cap-shape:      {ig_shape:.4f} ({ig_shape/parent_entropy*100:.1f}%)")
print("\n🏆 Winner: odor (by a landslide!)")

Information Gain Comparison:
========================================

odor:           0.9061 (90.7%)
gill-spacing:   0.3973 (39.8%)
bruises:        0.1934 (19.4%)
cap-color:      0.0522 (5.2%)
cap-shape:      0.0070 (0.7%)

🏆 Winner: odor (by a landslide!)
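
The comparison above is the same step a decision tree performs when it picks its first split: compute the gain for every candidate feature and take the largest. Here is a minimal, library-free sketch of that ranking. The four toy rows are invented for illustration and are not taken from the mushroom dataset, and the `entropy` and `information_gain` helpers are stand-ins that may differ from the `cuanalytics` versions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Parent entropy minus the size-weighted entropy of each feature-value group."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children

# Toy rows invented for illustration (NOT real mushroom data)
toy = [
    {"odor": "n", "bruises": "t", "class": "e"},
    {"odor": "n", "bruises": "f", "class": "e"},
    {"odor": "f", "bruises": "t", "class": "p"},
    {"odor": "f", "bruises": "f", "class": "p"},
]

# Score every candidate feature, then sort from most to least informative
gains = {f: information_gain(toy, f, "class") for f in ("odor", "bruises")}
ranking = sorted(gains.items(), key=lambda item: item[1], reverse=True)

for feature, gain in ranking:
    print(f"{feature:10s} {gain:.4f}")
```

In this toy table odor separates the classes perfectly while bruises tells us nothing, so odor ranks first. The same loop over the real feature columns would produce a full ranking.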


## 🎯 Your Turn: Find Other Good Features

The dataset has many more features! Can you find one with information gain > 0.20?

**Available features:** cap-surface, gill-attachment, gill-size, gill-color, stalk-shape, stalk-root, ring-number, ring-type, spore-print-color, population, habitat

*Hint: Try features related to the spore-print or ring!*

In [9]:
# Your code here!
# Example:
# my_ig = cuanalytics.information_gain(df, 'spore-print-color', 'class')
# print(f"Information Gain: {my_ig:.4f}")

## Key Takeaways

**1. Information Gain = Entropy Reduction**
   - It measures how much uncertainty a feature removes
   - Higher values = more useful feature

**2. Not All Features Are Equal**
   - Odor reduced uncertainty by 90.7%
   - Cap-shape reduced it by only 0.7%
   - Always test multiple features!
