# Understanding Entropy

**Entropy** measures the "disorder" or "unpredictability" in a dataset. Think of it as answering: *How surprised would I be by a random observation?*

- **Low entropy** (near 0): Very predictable - most observations are the same class
- **High entropy** (near 1 for binary): Very unpredictable - classes are evenly mixed

In this exercise, you'll calculate entropy for different scenarios and see how it changes.
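To make the idea concrete before using the library, here is a minimal sketch of the standard Shannon entropy formula, $H = \sum_i p_i \log_2 (1/p_i)$, where $p_i$ is the proportion of observations in class $i$. The function name `shannon_entropy` is made up for this illustration; it is not part of `cuanalytics` and may differ from how `calculate_entropy` is actually implemented.

```python
import numpy as np
import pandas as pd


def shannon_entropy(labels: pd.Series) -> float:
    """Shannon entropy of a label column, in bits (log base 2).

    Illustrative helper only; not the cuanalytics implementation.
    """
    # Proportion of observations in each class, e.g. [0.75, 0.25]
    proportions = labels.value_counts(normalize=True).to_numpy()
    # For plain string labels, value_counts only reports classes that occur,
    # so every proportion is > 0 and the logarithm is always defined.
    return float((proportions * np.log2(1.0 / proportions)).sum())


sample = pd.Series(["edible"] * 3 + ["poisonous"], name="class")
print(f"Entropy: {shannon_entropy(sample):.4f}")  # Entropy: 0.8113
```

A 75/25 split is fairly unpredictable, so the result sits closer to 1 than to 0.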

In [1]:
import pandas as pd
import cuanalytics

## Scenario 1: Perfect Order (Minimum Entropy)

What happens when all observations are the same class?

In [2]:
# Create a perfectly ordered dataset - all the same class
all_edible = pd.Series(['edible'] * 5, name='class')
print("All edible mushrooms:")
print(all_edible)

entropy_perfect = cuanalytics.calculate_entropy(all_edible)
print(f"\nEntropy: {entropy_perfect:.4f}")

All edible mushrooms:
0    edible
1    edible
2    edible
3    edible
4    edible
Name: class, dtype: object

Entropy: 0.0000


**Question:** Why is entropy 0.0 here? 

<details>
<summary>Click to reveal answer</summary>
Because there's no uncertainty - you know every mushroom is edible. You'd never be surprised by a random pick!
</details>

## Scenario 2: Maximum Disorder (Maximum Entropy)

What if classes are perfectly balanced?

In [3]:
# Create a 50/50 split - maximum uncertainty
balanced = pd.Series(['edible']*3 + ['poisonous']*3, name='class')
print("50/50 split:")
print(balanced)

entropy_max = cuanalytics.calculate_entropy(balanced)
print(f"\nEntropy: {entropy_max:.4f}")

50/50 split:
0    edible
1    edible
2    edible
3    poisonous
4    poisonous
5    poisonous
Name: class, dtype: object

Entropy: 1.0000


**Question:** Why is entropy 1.0 here?

<details>
<summary>Click to reveal answer</summary>
Because there's maximum uncertainty - you have a 50/50 chance of getting either class. This is the most "disordered" a binary dataset can be!
</details>
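
Both extremes can be checked by hand with the Shannon entropy formula $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the proportion of class $i$. This assumes `calculate_entropy` uses base-2 logarithms, which is consistent with the outputs 0.0000 and 1.0000 in Scenarios 1 and 2.

$$
\begin{aligned}
H_{\text{all edible}} &= -\,1 \cdot \log_2 1 = 0 \\
H_{50/50} &= -\left(\tfrac{1}{2}\log_2 \tfrac{1}{2} + \tfrac{1}{2}\log_2 \tfrac{1}{2}\right) = -\log_2 \tfrac{1}{2} = 1
\end{aligned}
$$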

## Scenario 3: Somewhere in Between

Most real datasets fall between these extremes.

In [4]:
# 80% edible, 20% poisonous
mostly_edible = pd.Series(['edible']*4 + ['poisonous']*1, name='class')
print("Mostly edible (80/20 split):")
print(mostly_edible)

entropy_skewed = cuanalytics.calculate_entropy(mostly_edible)
print(f"\nEntropy: {entropy_skewed:.4f}")

Mostly edible (80/20 split):
0    edible
1    edible
2    edible
3    edible
4    poisonous
Name: class, dtype: object

Entropy: 0.7219


**Notice:** Entropy is between 0 and 1. There's some uncertainty (not all edible), but not maximum uncertainty (not 50/50).
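
If you want to see where 0.7219 comes from, the sketch below works through the 80/20 case by hand using only the standard library. It assumes entropy is measured in bits (log base 2), which matches the outputs in the earlier scenarios; the variable names are just for illustration.

```python
import math

p_edible, p_poisonous = 4 / 5, 1 / 5

# Each class contributes p * log2(1 / p) bits of "surprise".
term_edible = p_edible * math.log2(1 / p_edible)
term_poisonous = p_poisonous * math.log2(1 / p_poisonous)

entropy_by_hand = term_edible + term_poisonous
print(f"edible term:    {term_edible:.4f}")      # 0.2575
print(f"poisonous term: {term_poisonous:.4f}")   # 0.4644
print(f"entropy:        {entropy_by_hand:.4f}")  # 0.7219
```

The rare class contributes more to the total even though it appears only once: seeing a poisonous mushroom is the bigger surprise.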

## 🎯 Your Turn!

Create your own dataset and calculate its entropy. Try to create a dataset with entropy around **0.5**.

*Hint: You'll need a heavily unbalanced split. 75/25 and 70/30 are not lopsided enough (they give about 0.81 and 0.88), so try something closer to 90/10.*

In [5]:
# Your code here!
my_data = pd.Series([], dtype=object, name='class')  # Replace [] with your list of labels

# Uncomment to test:
# my_entropy = cuanalytics.calculate_entropy(my_data)
# print(f"Your entropy: {my_entropy:.4f}")

## Real-World Example: Mushroom Dataset

Now let's see entropy in a real dataset!

In [6]:
# Load the mushroom dataset
df = cuanalytics.load_mushroom_data()

# Check the class distribution
print("Class distribution:")
print(df['class'].value_counts())

# Calculate entropy
entropy = cuanalytics.calculate_entropy(df['class'])
print(f"\nDataset entropy: {entropy:.4f}")

Class distribution:
class
edible       4208
poisonous    3916
Name: count, dtype: int64

Dataset entropy: 0.9991


**Interpretation:** The entropy is very close to 1.0! This means the mushroom dataset is nearly balanced between edible and poisonous mushrooms - there's high uncertainty about whether a random mushroom is safe to eat.
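
As a sanity check, the same number can be reproduced from the two class counts alone. This is a minimal sketch that assumes base-2 logarithms; the counts are copied from the `value_counts()` output above, so it runs without loading the dataset.

```python
import math

# Class counts copied from the value_counts() output above.
counts = {"edible": 4208, "poisonous": 3916}
total = sum(counts.values())

# Sum p * log2(1 / p) over the classes, with p = n / total.
entropy_from_counts = sum(
    (n / total) * math.log2(total / n) for n in counts.values()
)
print(f"Share edible: {counts['edible'] / total:.3f}")  # Share edible: 0.518
print(f"Entropy: {entropy_from_counts:.4f}")            # Entropy: 0.9991
```

A 51.8/48.2 split is only slightly off balance, which is why the entropy falls just short of 1.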

## Bonus: Visualize Entropy

See how entropy changes with class proportions.

In [7]:
cuanalytics.visualize_entropy()
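
The relationship between class proportions and entropy can also be tabulated without a plot. The sketch below is independent of `visualize_entropy()`, whose output is not shown here; `binary_entropy` is a hypothetical helper written for this illustration and assumes two classes with entropy measured in bits.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-class split with proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure dataset has no uncertainty
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


# Step through 0%, 10%, ..., 100% edible and print the entropy of each split.
for percent in range(0, 101, 10):
    p = percent / 100
    print(f"{percent:3d}% edible -> entropy {binary_entropy(p):.4f}")
```

The values rise from 0 at the pure ends to a peak of 1 at 50%, and the table is symmetric: a 20/80 split is exactly as uncertain as an 80/20 split.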

## Key Takeaways

1. **Entropy = 0**: Perfect order, no surprise (all same class)
2. **Entropy = 1**: Maximum disorder, maximum surprise (50/50 split between two classes)
3. **0 < Entropy < 1**: Some uncertainty (unbalanced classes)