# Categorical Data in Pandas

## Overview

**Categorical Data** = Data with a limited, fixed number of possible values (categories)

### Examples of Categorical Data

```
✅ Gender: [Male, Female, Other]
✅ Education: [High School, Bachelor's, Master's, PhD]
✅ Size: [Small, Medium, Large, XL]
✅ Status: [Pending, Approved, Rejected]
✅ Day of Week: [Monday, Tuesday, ..., Sunday]
✅ Color: [Red, Green, Blue, Yellow]
```

### Why Use Categorical Type?

**1. Memory Efficiency** 🚀
```
String (object):  1,000,000 rows × ~50 bytes = ~50 MB
Categorical:      1,000,000 rows × 1 byte (int8 codes) = ~1 MB, plus the categories stored once

Savings: often well over 90% for low-cardinality columns
```

**2. Performance** ⚡
- Typically faster `groupby` on large, repetitive columns
- Typically faster `value_counts`
- Sorting works on the integer codes and follows category order

**3. Data Integrity** ✅
- Enforces valid categories
- Prevents typos
- Clear set of possible values

**4. Ordered Categories** 📊
- Natural ordering (Small < Medium < Large)
- Meaningful comparisons
- Better visualizations

### Categorical vs Object (String)

| Feature | Object (String) | Categorical |
|---------|----------------|-------------|
| **Memory** | High | Low |
| **Performance** | Slower | Faster |
| **Ordering** | Alphabetical | Custom/Natural |
| **Validation** | No | Yes |
| **Best for** | Unique values | Repeated values |

### What We'll Learn
1. ✅ Creating categorical data
2. ✅ Memory benefits and performance
3. ✅ Ordered vs unordered categories
4. ✅ Adding/removing categories
5. ✅ Renaming and reordering
6. ✅ Binning continuous data
7. ✅ One-hot encoding (get_dummies)
8. ✅ Label encoding
9. ✅ Real-world applications
10. ✅ Best practices

In [None]:
import pandas as pd
import numpy as np
import sys

# Display settings
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 2)

print("✅ Libraries imported")
print(f"Pandas version: {pd.__version__}")

## 1. Creating Categorical Data

### Three Ways to Create Categorical

**Method 1: From Series**
```python
s = pd.Series(['A', 'B', 'A', 'C', 'B'])
s_cat = s.astype('category')
```

**Method 2: pd.Categorical()**
```python
cat = pd.Categorical(['A', 'B', 'A', 'C', 'B'])
s = pd.Series(cat)
```

**Method 3: Direct in DataFrame**
```python
df['column'] = df['column'].astype('category')
```

### Categorical Structure

```
Categorical has 3 components:

1. Categories: [A, B, C]        ← Unique values
2. Codes:      [0, 1, 0, 2, 1]  ← Integer mapping
3. Ordered:    False/True       ← Has order?

Storage: Stores codes (integers) instead of repeated strings
```
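The relationship between the three components can be checked directly. The sketch below rebuilds the original values from `categories` and `codes`; the variable names are illustrative only.

```python
import pandas as pd

cat = pd.Categorical(['A', 'B', 'A', 'C', 'B'])

categories = list(cat.categories)   # unique labels, stored once
codes = cat.codes.tolist()          # one small integer per row

# Each code is a position in the categories list, so indexing rebuilds the data
rebuilt = [categories[code] for code in codes]

print(categories)
print(codes)
print(rebuilt)
```

Because only the codes repeat per row, the cost of a long label is paid once per category rather than once per row.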

### Ordered vs Unordered

**Unordered** (default):
```python
colors = pd.Categorical(['red', 'blue', 'green'])
# No meaningful order, just labels
```

**Ordered**:
```python
sizes = pd.Categorical(
    ['S', 'M', 'L', 'XL'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)
# S < M < L < XL
```
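To see what the `ordered` flag changes in practice, the following sketch compares an ordered and an unordered categorical. The data and variable names are made up for illustration.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['M', 'XL', 'S', 'L'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True,
))

larger_than_m = (sizes > 'M').tolist()        # element-wise, by category order
smallest, largest = sizes.min(), sizes.max()  # min/max also follow category order

# The same comparison on an unordered categorical is rejected
unordered = pd.Series(pd.Categorical(['M', 'XL', 'S', 'L']))
try:
    unordered > 'M'
    unordered_comparison_failed = False
except TypeError:
    unordered_comparison_failed = True

print(larger_than_m, smallest, largest, unordered_comparison_failed)
```

Equality checks (`==`, `!=`) work for both kinds; only the inequality operators need `ordered=True`.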

### Specify Categories

```python
# Define allowed categories
pd.Categorical(
    ['A', 'B', 'A'],
    categories=['A', 'B', 'C', 'D']  # C and D exist but unused
)
```
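One caveat when passing `categories=`: pandas has historically turned values outside the list into `NaN` instead of raising, so a typo in the raw data can disappear silently. A small pre-check helps; `unexpected_labels` below is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def unexpected_labels(values, allowed):
    """Return the distinct non-null labels in `values` that are not in `allowed`."""
    series = pd.Series(values)
    mask = series.notna() & ~series.isin(allowed)
    return sorted(series[mask].unique())

allowed = ['A', 'B', 'C', 'D']
raw = ['A', 'B', 'A', 'b', 'E']   # 'b' and 'E' are not valid labels

bad = unexpected_labels(raw, allowed)
if bad:
    print(f"Fix or map these labels before converting: {bad}")
else:
    clean = pd.Categorical(raw, categories=allowed)
```

Running the check before the conversion keeps the "enforces valid values" benefit without losing rows to unexpected `NaN`s.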

In [None]:
print("=== CREATING CATEGORICAL DATA ===\n")

# Example 1: Convert to categorical
print("Example 1: Convert string column to categorical\n")
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue', 'red'])
print("Original (object):")
print(f"  Type: {colors.dtype}")
print(f"  Values: {colors.tolist()}")

colors_cat = colors.astype('category')
print("\nCategorical:")
print(f"  Type: {colors_cat.dtype}")
print(f"  Categories: {colors_cat.cat.categories.tolist()}")
print(f"  Codes: {colors_cat.cat.codes.tolist()}")
print()

# Example 2: Create with pd.Categorical
print("="*70)
print("Example 2: Using pd.Categorical()\n")
sizes = pd.Categorical(['S', 'M', 'L', 'S', 'XL', 'M'])
print(f"Categories: {sizes.categories.tolist()}")
print(f"Codes: {sizes.codes.tolist()}")
print(f"Ordered: {sizes.ordered}")
print()

# Example 3: Ordered categorical
print("="*70)
print("Example 3: Ordered categorical (with natural order)\n")
education = pd.Categorical(
    ['Bachelor', 'PhD', 'High School', 'Master', 'Bachelor'],
    categories=['High School', 'Bachelor', 'Master', 'PhD'],
    ordered=True
)
print(f"Categories: {education.categories.tolist()}")
print(f"Ordered: {education.ordered}")
print(f"Values: {list(education)}")
print()

# Example 4: Comparison with ordered categorical
print("="*70)
print("Example 4: Comparisons work with ordered categoricals\n")
edu_series = pd.Series(education)
print("Can compare: Bachelor vs Master")
# Compare through the ordered Categorical; indexing with [0] first would compare plain strings
bachelor = pd.Categorical(['Bachelor'], categories=education.categories, ordered=True)
print(f"  Bachelor < Master: {(bachelor < 'Master')[0]}")
print("\nFilter: Education >= Bachelor")
print(edu_series[edu_series >= 'Bachelor'])
print()

# Example 5: Specify all categories (including unused)
print("="*70)
print("Example 5: Specify categories (including unused ones)\n")
status = pd.Categorical(
    ['pending', 'approved', 'pending'],
    categories=['pending', 'approved', 'rejected', 'cancelled']  # 4 possible values
)
print(f"Data: {list(status)}")
print(f"All categories: {status.categories.tolist()}")
print(f"Value counts:")
print(pd.Series(status).value_counts())
print("\n'rejected' and 'cancelled' exist as categories but have 0 count")
print()

# Example 6: DataFrame with categorical
print("="*70)
print("Example 6: Categorical in DataFrame\n")
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'grade': ['A', 'B', 'A', 'C'],
    'size': ['M', 'L', 'M', 'S']
})
print("Original dtypes:")
print(df.dtypes)

# Convert to categorical
df['grade'] = df['grade'].astype('category')
df['size'] = pd.Categorical(df['size'], categories=['S', 'M', 'L', 'XL'], ordered=True)

print("\nAfter conversion:")
print(df.dtypes)
print("\nDataFrame:")
print(df)

## 2. Memory Benefits

### How Categorical Saves Memory

```
OBJECT (String) Storage:
Each value stored as full string
'red' → 3 bytes + overhead
'blue' → 4 bytes + overhead
1,000,000 × 'red' = ~50 MB

CATEGORICAL Storage:
Categories: ['red', 'blue', 'green'] → Stored once
Codes: [0, 1, 0, 2, 1, 0, ...] → Integer array (int8 while the categories fit, wider types otherwise)
1,000,000 × 1-byte code = ~1 MB

Savings: often well over 90% for low-cardinality columns
```
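The size of each code depends on how many categories exist, because pandas picks the smallest signed integer type that can index them. A quick way to observe this is sketched below; `codes_dtype_for` is a hypothetical helper for this example.

```python
import pandas as pd

def codes_dtype_for(n_categories):
    """Return the integer dtype pandas picks for the codes of n distinct labels."""
    labels = [f"label_{i}" for i in range(n_categories)]
    return str(pd.Categorical(labels).codes.dtype)

for n in (10, 1_000, 40_000):
    print(f"{n:>6} categories -> {codes_dtype_for(n)} codes")
```

So a column with a handful of categories costs about 1 byte per row, while one with tens of thousands of categories costs 4 bytes per row on top of the stored categories.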

### When to Use Categorical

**✅ Use Categorical When:**
- Limited number of unique values
- Values repeat frequently
- Need to enforce valid values
- Natural ordering exists
- Large datasets with repeated strings

**❌ Don't Use When:**
- Most values are unique
- High cardinality (many unique values)
- Values constantly changing

### Rule of Thumb

```python
# Calculate unique ratio
unique_ratio = df['column'].nunique() / len(df)

if unique_ratio < 0.5:  # Less than 50% unique
    # Good candidate for categorical
    df['column'] = df['column'].astype('category')
```
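The rule of thumb can be applied to a whole DataFrame at once. The sketch below is one possible way to do it; `categorize_low_cardinality` and the 0.5 default threshold are assumptions carried over from the rule above, not a pandas feature.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def categorize_low_cardinality(df, max_unique_ratio=0.5):
    """Return a copy of df with repetitive text columns cast to 'category'."""
    result = df.copy()
    for col in result.columns:
        column = result[col]
        is_text = is_object_dtype(column) or is_string_dtype(column)
        if is_text and len(column) > 0:
            if column.nunique() / len(column) < max_unique_ratio:
                result[col] = column.astype('category')
    return result

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'LA'] * 250,     # 2 distinct values
    'user_id': [f'U{i}' for i in range(1000)],    # all distinct
    'amount': range(1000),                        # numeric, left alone
})
converted = categorize_low_cardinality(df)
print(converted.dtypes)
```

Only `city` is converted here; `user_id` fails the ratio test and `amount` is skipped because it is not text.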

### Memory Comparison

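Back-of-envelope estimates, assuming ~50 bytes per stored string and 1-, 2-, or 4-byte codes; real numbers vary with string length and pandas version.

| Rows | Unique Values | Object | Categorical | Savings |
|------|---------------|--------|-------------|----------|
| 1M | 10 | ~50 MB | ~1 MB | ~98% |
| 1M | 100 | ~50 MB | ~1 MB | ~98% |
| 1M | 1,000 | ~50 MB | ~2 MB | ~96% |
| 1M | 100,000 | ~50 MB | ~9 MB | ~82% |
| 1M | 900,000 | ~50 MB | ~49 MB | ❌ Negligible (worse once every value is unique) |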

In [None]:
print("=== MEMORY BENEFITS ===\n")

# Example 1: Memory comparison
print("Example 1: Memory usage comparison\n")

# Create large dataset with repeated values
n = 100_000
cities = ['New York', 'London', 'Tokyo', 'Paris', 'Mumbai'] * (n // 5)

# As object (string)
df_object = pd.DataFrame({'city': cities})
memory_object = df_object.memory_usage(deep=True).sum() / 1024**2  # MB

# As categorical
df_cat = pd.DataFrame({'city': pd.Categorical(cities)})
memory_cat = df_cat.memory_usage(deep=True).sum() / 1024**2  # MB

print(f"Dataset: {n:,} rows, 5 unique cities")
print(f"\nObject (string):  {memory_object:.2f} MB")
print(f"Categorical:      {memory_cat:.2f} MB")
print(f"\nSavings: {(1 - memory_cat/memory_object) * 100:.1f}%")
print(f"Reduction: {memory_object/memory_cat:.1f}x smaller")
print()

# Example 2: Check memory of columns
print("="*70)
print("Example 2: Compare multiple columns\n")

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'] * 1000,
    'status': ['active', 'inactive', 'pending'] * 1000,
    'level': ['beginner', 'intermediate', 'advanced'] * 1000
})

print("Original (all object):")
print(df.dtypes)
print(f"Total memory: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

# Convert to categorical
df_cat = df.copy()
for col in df_cat.columns:
    df_cat[col] = df_cat[col].astype('category')

print("\nConverted to categorical:")
print(df_cat.dtypes)
print(f"Total memory: {df_cat.memory_usage(deep=True).sum() / 1024:.2f} KB")

savings = (1 - df_cat.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum()) * 100
print(f"\nMemory savings: {savings:.1f}%")
print()

# Example 3: When NOT to use categorical
print("="*70)
print("Example 3: High cardinality - NOT good for categorical\n")

# Create data with many unique values
unique_ids = [f'ID_{i:06d}' for i in range(10000)]

df_unique_obj = pd.DataFrame({'id': unique_ids})
df_unique_cat = pd.DataFrame({'id': pd.Categorical(unique_ids)})

mem_obj = df_unique_obj.memory_usage(deep=True).sum() / 1024
mem_cat = df_unique_cat.memory_usage(deep=True).sum() / 1024

print(f"10,000 unique IDs:")
print(f"  Object:      {mem_obj:.2f} KB")
print(f"  Categorical: {mem_cat:.2f} KB")
print(f"\n{'❌' if mem_cat > mem_obj else '✅'} Categorical is {'worse' if mem_cat > mem_obj else 'better'}!")
print("\n💡 Don't use categorical for high cardinality data")
print()

# Example 4: Unique ratio test
print("="*70)
print("Example 4: Unique ratio test (should it be categorical?)\n")

test_data = {
    'city': ['NYC', 'LA', 'NYC'] * 1000,           # Low cardinality
    'zip_code': list(range(3000)),                 # High cardinality
    'status': ['active', 'inactive'] * 1500        # Very low
}

df_test = pd.DataFrame(test_data)

for col in df_test.columns:
    unique_ratio = df_test[col].nunique() / len(df_test)
    recommendation = '✅ Good for categorical' if unique_ratio < 0.5 else '❌ Not recommended'
    print(f"{col:12} | Unique: {df_test[col].nunique():5} | Ratio: {unique_ratio:.2%} | {recommendation}")

## 3. Categorical Operations

### Accessing Categorical Properties

```python
cat = s.cat  # Categorical accessor (s must have category dtype)

cat.categories      # Index of categories
cat.codes           # Integer codes (-1 marks a missing value)
cat.ordered         # Is it ordered?
```

### Modifying Categories

```python
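# Each method returns a new Series; assign the result to keep the change

# Add categories
s = s.cat.add_categories(['new_cat'])

# Remove categories (values in a removed category become NaN)
s = s.cat.remove_categories(['old_cat'])

# Remove unused categories
s = s.cat.remove_unused_categories()

# Rename categories
s = s.cat.rename_categories({'old': 'new'})

# Reorder categories (must list every existing category)
s = s.cat.reorder_categories(['A', 'B', 'C'])

# Switch between ordered and unordered
s = s.cat.as_ordered()
s = s.cat.as_unordered()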
```
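Two details of these methods are easy to miss: they return a new Series rather than changing the original, and removing a category that is still in use turns its rows into missing values. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')

trimmed = s.cat.remove_categories(['b'])

print(list(s.cat.categories))         # original Series is unchanged
print(list(trimmed.cat.categories))   # 'b' is gone from the new Series
print(trimmed.isna().tolist())        # the row that held 'b' is now missing
```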

### Common Operations

| Operation | Method | What it does |
|-----------|--------|----------|
| **Value counts** | `.value_counts()` | Count per category (unused categories show 0) |
| **Unique** | `.unique()` | Distinct values present in the data |
| **Sort** | `.sort_values()` | Sort by category order |
| **Filter** | Boolean indexing | `df[df['cat'] == 'A']` |
| **Group by** | `.groupby()` | Group by categories |

### Setting New Values

```python
# ✅ Value in categories
s[0] = 'A'  # OK if 'A' is a category

# ❌ Value not in categories
s[0] = 'Z'  # Raises TypeError: 'Z' is not in categories

# ✅ Add category first
s = s.cat.add_categories(['Z'])
s[0] = 'Z'  # Now OK
```
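The sequence above can be run end to end as follows. This sketch uses `.iloc` for positional assignment and catches the `TypeError` that pandas raises for an unknown category; the data is illustrative.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'A'], dtype='category')

try:
    s.iloc[0] = 'Z'              # 'Z' is not a category yet
    rejected = False
except TypeError as exc:
    rejected = True
    print(f"Rejected: {exc}")

s = s.cat.add_categories(['Z'])  # returns a new Series with the extra category
s.iloc[0] = 'Z'
print(s.tolist())
```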

In [None]:
print("=== CATEGORICAL OPERATIONS ===\n")

# Create sample data
sizes = pd.Series(['S', 'M', 'L', 'S', 'M', 'S', 'L', 'M'], dtype='category')

# Example 1: Access properties
print("Example 1: Categorical properties\n")
print(f"Categories: {sizes.cat.categories.tolist()}")
print(f"Codes: {sizes.cat.codes.tolist()}")
print(f"Ordered: {sizes.cat.ordered}")
print(f"Number of categories: {len(sizes.cat.categories)}")
print()

# Example 2: Add categories
print("="*70)
print("Example 2: Add new categories\n")
print(f"Original categories: {sizes.cat.categories.tolist()}")
sizes = sizes.cat.add_categories(['XL', 'XXL'])
print(f"After adding: {sizes.cat.categories.tolist()}")
print()

# Example 3: Remove unused categories
print("="*70)
print("Example 3: Remove unused categories\n")
print(f"Before: {sizes.cat.categories.tolist()}")
print(f"Value counts:")
print(sizes.value_counts())
sizes = sizes.cat.remove_unused_categories()
print(f"\nAfter removing unused: {sizes.cat.categories.tolist()}")
print("\nXL and XXL removed (not used in data)")
print()

# Example 4: Rename categories
print("="*70)
print("Example 4: Rename categories\n")
status = pd.Series(['pending', 'approved', 'pending', 'rejected'], dtype='category')
print(f"Original: {status.tolist()}")

status = status.cat.rename_categories({
    'pending': 'In Progress',
    'approved': 'Completed',
    'rejected': 'Cancelled'
})
print(f"Renamed: {status.tolist()}")
print()

# Example 5: Reorder categories
print("="*70)
print("Example 5: Reorder and set as ordered\n")
education = pd.Series(['Bachelor', 'PhD', 'High School', 'Master'], dtype='category')
print(f"Original order: {education.cat.categories.tolist()}")
print(f"Ordered: {education.cat.ordered}")

# Reorder in natural order
education = education.cat.reorder_categories(
    ['High School', 'Bachelor', 'Master', 'PhD']
)
education = education.cat.as_ordered()

print(f"\nNew order: {education.cat.categories.tolist()}")
print(f"Ordered: {education.cat.ordered}")
print()

# Example 6: Sorting with ordered categorical
print("="*70)
print("Example 6: Sorting (ordered vs unordered)\n")

# Unordered - sorts by category order, which astype/dtype='category' infers alphabetically
months_unordered = pd.Series(['Mar', 'Jan', 'Feb', 'Dec', 'Apr'], dtype='category')
print("Unordered (alphabetical sort):")
print(months_unordered.sort_values().tolist())

# Ordered - sorts by category order
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
months_ordered = pd.Categorical(
    ['Mar', 'Jan', 'Feb', 'Dec', 'Apr'],
    categories=month_order,
    ordered=True
)
print("\nOrdered (chronological sort):")
print(pd.Series(months_ordered).sort_values().tolist())
print()

# Example 7: Value counts maintains category order
print("="*70)
print("Example 7: value_counts with ordered categories\n")

sizes_ordered = pd.Categorical(
    ['M', 'S', 'L', 'M', 'S', 'M', 'L', 'S'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)
print("Value counts (sorted by size order, not frequency):")
print(pd.Series(sizes_ordered).value_counts(sort=False))
print("\nNote: XL shown even though count is 0")
print()

# Example 8: GroupBy with categorical
print("="*70)
print("Example 8: GroupBy with categorical (faster!)\n")

df = pd.DataFrame({
    'category': pd.Categorical(['A', 'B', 'A', 'C', 'B', 'A']),
    'value': [10, 20, 15, 30, 25, 12]
})

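# Pass observed explicitly: its default differs between pandas versions
grouped = df.groupby('category', observed=True)['value'].mean()
print(grouped)
print("\n✅ GroupBy operations are typically faster with categorical on large data!")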

## 4. Binning - Convert Continuous to Categorical

### What is Binning?

**Binning** = Convert continuous numerical data into discrete categories (bins)

```
Age (continuous):  [22, 35, 45, 28, 67, 19, 52]
                          ↓ Binning
Age Group (categorical): [Young, Adult, Adult, Young, Senior, Young, Adult]
```

### pd.cut() - Fixed Bins

Creates bins based on **value ranges**

```python
pd.cut(
    x,                    # Data to bin
    bins=4,               # Number of bins (or bin edges)
    labels=['A','B','C','D'], # Custom labels (one per bin)
    right=True            # Right edge inclusive
)
```

**Examples:**
```python
# Auto bins
pd.cut([1, 7, 5, 4, 6, 3], bins=3)
# Result: [(0.994, 3], (5, 7], (3, 5], (3, 5], (5, 7], (0.994, 3]]

# Custom edges
pd.cut([22, 35, 45], bins=[0, 25, 50, 100], labels=['Young', 'Adult', 'Senior'])
```
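With explicit edges, the left edge of the first bin is exclusive by default, so a value equal to the lowest edge gets no bin. The sketch below shows the effect and the `include_lowest=True` fix, using made-up ages.

```python
import pandas as pd

ages = [0, 18, 19]
edges = [0, 18, 35]
labels = ['Child', 'Young Adult']

default_bins = pd.cut(ages, bins=edges, labels=labels)
with_lowest = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)

print(list(default_bins))   # the age 0 sits on the open left edge and gets no bin
print(list(with_lowest))    # include_lowest=True closes the first bin on the left
```

Values outside the outermost edges are also returned as missing, so it is worth checking `isna()` on the result after binning.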

### pd.qcut() - Quantile Bins

Creates bins with **approximately equal counts** (quantiles)

```python
pd.qcut(
    x,                    # Data to bin
    q=4,                  # Number of quantiles
    labels=['Q1','Q2','Q3','Q4']
)
```

**Example:**
```python
# 4 equal-sized groups
pd.qcut([1, 2, 3, 4, 5, 6, 7, 8], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
# Each quartile has 2 values
```
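Equal counts are only possible when the quantile edges are distinct. With heavily tied data, `pd.qcut` raises a `ValueError`; `duplicates='drop'` merges the repeated edges instead. A small sketch with illustrative data:

```python
import pandas as pd

skewed = [1, 1, 1, 1, 2, 3]   # many ties at the low end

try:
    pd.qcut(skewed, q=4)
    raised = False
except ValueError as exc:
    raised = True
    print(f"qcut failed: {exc}")

# duplicates='drop' merges the repeated edges, so fewer than q bins come back
binned = pd.qcut(skewed, q=4, duplicates='drop')
print(binned.categories)
```

Because the number of bins can shrink, custom `labels` are left out here; a fixed label list would no longer match the bin count.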

### cut() vs qcut()

| Method | Bins Based On | Use Case |
|--------|---------------|----------|
| **cut()** | Value ranges | Age groups, price tiers |
| **qcut()** | Quantiles (equal counts) | Percentiles, rankings |

```
cut():   Equal WIDTH bins
(0, 25], (25, 50], (50, 75], (75, 100]

qcut():  Equal COUNT bins
(0, 15], (15, 42], (42, 68], (68, 100]
(Each bin has roughly the same number of values)
```

### Common Use Cases

```python
# Age groups
pd.cut(age, bins=[0, 18, 35, 60, 100], 
       labels=['Child', 'Young Adult', 'Adult', 'Senior'])

# Income brackets
pd.cut(income, bins=[0, 30000, 60000, 100000, np.inf],
       labels=['Low', 'Medium', 'High', 'Very High'])

# Test score grades
pd.cut(score, bins=[0, 60, 70, 80, 90, 100],
       labels=['F', 'D', 'C', 'B', 'A'])

# Quartiles
pd.qcut(sales, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
```
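For thresholds such as grade boundaries, it matters which side of each bin is inclusive. The sketch below contrasts the default `right=True` with `right=False`; the scores and the raised top edge of 101 are choices made for this example.

```python
import pandas as pd

scores = [59, 60, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']

# Default right=True: each bin includes its upper edge, so 60 is still an 'F'
upper_inclusive = pd.cut(scores, bins=[0, 60, 70, 80, 90, 100], labels=labels)

# right=False: each bin includes its lower edge; the top edge is raised to 101
# so that a perfect score of 100 still lands in the last bin
lower_inclusive = pd.cut(scores, bins=[0, 60, 70, 80, 90, 101],
                         labels=labels, right=False)

print(list(upper_inclusive))
print(list(lower_inclusive))
```

If a score of exactly 60 should earn a 'D', the `right=False` variant is the one that matches.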

In [None]:
print("=== BINNING CONTINUOUS DATA ===\n")

# Example 1: Basic cut with auto bins
print("Example 1: pd.cut() with automatic bins\n")
ages = [22, 35, 45, 28, 67, 19, 52, 31, 25]
age_bins = pd.cut(ages, bins=3)

print(f"Original ages: {ages}")
print(f"\nBinned (3 equal-width bins):")
print(age_bins)
print(f"\nCategories: {age_bins.categories.tolist()}")
print()

# Example 2: Cut with custom bins and labels
print("="*70)
print("Example 2: Custom bins and labels\n")
age_groups = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 100],
    labels=['Child', 'Young Adult', 'Adult', 'Senior']
)

df = pd.DataFrame({
    'age': ages,
    'age_group': age_groups
})
print(df)
print(f"\nValue counts:")
print(df['age_group'].value_counts())
print()

# Example 3: qcut for quantiles
print("="*70)
print("Example 3: pd.qcut() - equal-sized groups\n")
sales = [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]
quartiles = pd.qcut(sales, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

df_sales = pd.DataFrame({
    'sales': sales,
    'quartile': quartiles
})
print(df_sales)
print(f"\nEach quartile has {df_sales['quartile'].value_counts().values[0]} values")
print()

# Example 4: cut() vs qcut() comparison
print("="*70)
print("Example 4: cut() vs qcut() comparison\n")
values = [10, 20, 25, 30, 35, 50, 60, 80, 90, 100]

# Equal width bins
cut_bins = pd.cut(values, bins=3, labels=['Low', 'Med', 'High'])

# Equal count bins
qcut_bins = pd.qcut(values, q=3, labels=['Low', 'Med', 'High'])

comparison = pd.DataFrame({
    'value': values,
    'cut (width)': cut_bins,
    'qcut (count)': qcut_bins
})
print(comparison)

print("\ncut() value counts:")
print(comparison['cut (width)'].value_counts())
print("\nqcut() value counts:")
print(comparison['qcut (count)'].value_counts())
print()

# Example 5: Income brackets
print("="*70)
print("Example 5: Real-world - Income brackets\n")
incomes = [25000, 45000, 55000, 75000, 95000, 120000, 35000, 65000]

income_brackets = pd.cut(
    incomes,
    bins=[0, 30000, 60000, 100000, np.inf],
    labels=['Low', 'Medium', 'High', 'Very High']
)

df_income = pd.DataFrame({
    'income': incomes,
    'bracket': income_brackets
})
print(df_income)
print()

# Example 6: Test score grades
print("="*70)
print("Example 6: Test scores to letter grades\n")
scores = [45, 78, 92, 67, 88, 55, 95, 72, 61, 85]

grades = pd.cut(
    scores,
    bins=[0, 60, 70, 80, 90, 100],
    labels=['F', 'D', 'C', 'B', 'A'],
    right=True
)

df_grades = pd.DataFrame({
    'score': scores,
    'grade': grades
})
print(df_grades.sort_values('score'))
print(f"\nGrade distribution:")
print(df_grades['grade'].value_counts().sort_index())
print()

# Example 7: Which bin edge is inclusive (right=True vs right=False)
print("="*70)
print("Example 7: right=True vs right=False\n")
test_values = [10, 20, 30, 40]

# right=True: bins close on the right, e.g. (20, 30]; pandas widens the lowest edge slightly so 10 is included
bins_right_true = pd.cut(test_values, bins=3, right=True)

# right=False: bins close on the left, e.g. [20, 30); pandas widens the highest edge slightly so 40 is included
bins_right_false = pd.cut(test_values, bins=3, right=False)

print("right=True (default):")
print(f"  10 → {bins_right_true[0]}")
print(f"  20 → {bins_right_true[1]}")
print("\nright=False:")
print(f"  10 → {bins_right_false[0]}")
print(f"  20 → {bins_right_false[1]}")

## 5. Encoding Categorical Data

### Why Encode?

**Most machine learning models need numbers**, not strings!

### Two Main Encoding Methods

**1. One-Hot Encoding (get_dummies)**
- Creates binary columns for each category
- Use for **nominal** data (no order)

```
Original:         One-Hot Encoded:
Color             Color_Red  Color_Blue  Color_Green
Red          →        1          0           0
Blue         →        0          1           0
Green        →        0          0           1
Red          →        1          0           0
```

**2. Label Encoding (cat.codes)**
- Maps categories to integers
- Use for **ordinal** data (has order)

```
Original:         Label Encoded:
Size              Size_Code
Small        →        0
Medium       →        1
Large        →        2
Small        →        0
```

### pd.get_dummies() - One-Hot Encoding

```python
pd.get_dummies(
    data,                  # DataFrame or Series
    columns=['col'],       # Columns to encode
    prefix='cat',          # Prefix for new columns
    drop_first=False,      # Drop first category (avoid multicollinearity)
    dtype=int              # Data type for encoded columns (default is bool in pandas 2.0+)
)
```
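A short run of the call above, assuming a small made-up `colors` Series. Passing `dtype=int` yields the 0/1 columns shown in the diagrams; without it, recent pandas versions return booleans.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

colors = pd.Series(['Red', 'Blue', 'Green', 'Red'])

encoded = pd.get_dummies(colors, prefix='color', dtype=int)

print(encoded)
print(list(encoded.columns))   # one column per distinct value, in sorted order
```

Each row contains exactly one 1, which is the property that `drop_first=True` exploits to remove a redundant column.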

### cat.codes - Label Encoding

```python
# Get integer codes (missing values are encoded as -1)
df['category_code'] = df['category'].cat.codes
```
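Missing values need care with label encoding, because `cat.codes` reports them as `-1`, which a model could read as "one step below the smallest category". One way to handle this, sketched with made-up data, is to turn the `-1` back into `NaN`:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['S', None, 'L', 'M'],
    categories=['S', 'M', 'L'],
    ordered=True,
))

codes = sizes.cat.codes                                   # missing value becomes -1
codes_with_nan = codes.astype('float64').where(codes >= 0)  # restore it as NaN

print(codes.tolist())
print(codes_with_nan.tolist())
```

Whether to keep `-1`, use `NaN`, or impute a value depends on the model; the point is to make the choice deliberately.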

### When to Use Which?

| Type | Example | Method |
|------|---------|--------|
| **Nominal** (no order) | Color, Country, Gender | One-Hot |
| **Ordinal** (has order) | Size, Education, Rating | Label |

```python
# ✅ One-Hot for nominal
pd.get_dummies(df['color'])  # No inherent order

# ✅ Label encoding for ordinal
size = pd.Categorical(df['size'], categories=['S', 'M', 'L'], ordered=True)
df['size_code'] = size.codes  # Preserves order
```

### Drop First?

```
# Without drop_first (3 columns, in sorted category order)
Color   Blue  Green  Red
Red     0     0      1
Blue    1     0      0
Green   0     1      0

# With drop_first=True (2 columns; the first category, Blue, is dropped)
Color   Green  Red
Red     0      1
Blue    0      0      ← This means Blue!
Green   1      0

Use drop_first=True for linear models to avoid multicollinearity
```
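Which category becomes the all-zeros baseline depends on category order: for plain strings pandas sorts the values, so the alphabetically first one is dropped. To pick the baseline yourself, one option is to encode a categorical whose first category is the one you want dropped. A sketch with illustrative data:

```python
import pandas as pd

colors = ['Red', 'Blue', 'Green', 'Red']

# Plain strings: categories are sorted, so 'Blue' becomes the dropped baseline
default_baseline = pd.get_dummies(
    pd.Series(colors), prefix='color', drop_first=True, dtype=int
)

# Explicit category order: the first listed category ('Red') is dropped instead
red_first = pd.Series(pd.Categorical(colors, categories=['Red', 'Blue', 'Green']))
chosen_baseline = pd.get_dummies(red_first, prefix='color', drop_first=True, dtype=int)

print(list(default_baseline.columns))
print(list(chosen_baseline.columns))
```

In the second result, rows for 'Red' contain only zeros, which makes 'Red' the reference level for model coefficients.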

In [None]:
print("=== ENCODING CATEGORICAL DATA ===\n")

# Example 1: One-hot encoding with get_dummies
print("Example 1: One-hot encoding (nominal data)\n")
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'color': ['Red', 'Blue', 'Green', 'Red']
})

print("Original:")
print(df)

# One-hot encode color
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')
print("\nOne-hot encoded:")
print(df_encoded)
print()

# Example 2: get_dummies on Series
print("="*70)
print("Example 2: Encode single column\n")
status = pd.Series(['active', 'inactive', 'pending', 'active', 'inactive'])
print("Original:")
print(status)

status_encoded = pd.get_dummies(status, prefix='status')
print("\nEncoded:")
print(status_encoded)
print()

# Example 3: drop_first parameter
print("="*70)
print("Example 3: drop_first=True (avoid dummy variable trap)\n")

df_test = pd.DataFrame({'category': ['A', 'B', 'C', 'A']})

print("Without drop_first:")
print(pd.get_dummies(df_test, columns=['category']))

print("\nWith drop_first=True:")
print(pd.get_dummies(df_test, columns=['category'], drop_first=True))
print("\n💡 Use drop_first=True for linear regression to avoid multicollinearity")
print()

# Example 4: Label encoding (ordinal)
print("="*70)
print("Example 4: Label encoding for ordinal data\n")

df_ordinal = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'size': ['S', 'M', 'L', 'M']
})

# Create ordered categorical
df_ordinal['size'] = pd.Categorical(
    df_ordinal['size'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)

# Get codes
df_ordinal['size_code'] = df_ordinal['size'].cat.codes

print(df_ordinal)
print("\nMapping: S→0, M→1, L→2, XL→3")
print()

# Example 5: Multiple column encoding
print("="*70)
print("Example 5: Encode multiple columns at once\n")

df_multi = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['NYC', 'LA', 'NYC'],
    'status': ['active', 'inactive', 'active'],
    'score': [85, 90, 88]
})

print("Original:")
print(df_multi)

# Encode both city and status
df_encoded = pd.get_dummies(df_multi, columns=['city', 'status'])
print("\nEncoded:")
print(df_encoded)
print()

# Example 6: Compare nominal vs ordinal encoding
print("="*70)
print("Example 6: Nominal vs Ordinal encoding\n")

data = {
    'color': ['Red', 'Blue', 'Green', 'Red'],      # Nominal
    'education': ['High School', 'Bachelor', 'Master', 'PhD']  # Ordinal
}
df_compare = pd.DataFrame(data)

# Nominal → One-hot
color_encoded = pd.get_dummies(df_compare['color'], prefix='color')

# Ordinal → Label encoding
edu_cat = pd.Categorical(
    df_compare['education'],
    categories=['High School', 'Bachelor', 'Master', 'PhD'],
    ordered=True
)
edu_codes = pd.Series(edu_cat.codes, name='education_code')

result = pd.concat([df_compare, color_encoded, edu_codes], axis=1)
print(result)
print("\n✅ Color: One-hot (no order)")
print("✅ Education: Label encoded (preserves order)")
print()

# Example 7: Handling missing values
print("="*70)
print("Example 7: Encoding with missing values\n")

df_na = pd.DataFrame({
    'status': ['active', np.nan, 'inactive', 'active', np.nan]
})

print("Original (with NaN):")
print(df_na)

# dummy_na=True adds an indicator column for NaN; without it, NaN rows are all zeros
encoded_na = pd.get_dummies(df_na, columns=['status'], dummy_na=True)
print("\nEncoded (dummy_na=True creates column for NaN):")
print(encoded_na)
print()

# Example 8: Convert back from codes
print("="*70)
print("Example 8: Convert codes back to categories\n")

sizes = pd.Categorical(['S', 'M', 'L'], categories=['S', 'M', 'L', 'XL'], ordered=True)
codes = sizes.codes

print(f"Original: {list(sizes)}")
print(f"Codes: {list(codes)}")

# Convert back (pass ordered=True to keep the ordering)
back_to_cat = pd.Categorical.from_codes(codes, categories=['S', 'M', 'L', 'XL'], ordered=True)
print(f"Back to categories: {list(back_to_cat)}")

## 6. Real-World Applications

### Application 1: Survey Data

```python
# Survey responses
df['satisfaction'] = pd.Categorical(
    df['satisfaction'],
    categories=['Very Unsatisfied', 'Unsatisfied', 'Neutral', 
                'Satisfied', 'Very Satisfied'],
    ordered=True
)

# Analysis
df['satisfaction'].value_counts(sort=False)  # Category order instead of frequency order
df[df['satisfaction'] >= 'Satisfied']  # Filter positive responses
```

### Application 2: Customer Segmentation

```python
# Age groups
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 25, 45, 65, 100],
    labels=['Young', 'Middle-aged', 'Senior', 'Elderly']
)

# Spending tiers
df['spending_tier'] = pd.qcut(
    df['total_spent'],
    q=4,
    labels=['Low', 'Medium', 'High', 'Premium']
)

# Analyze by segment
df.groupby(['age_group', 'spending_tier'], observed=False)['revenue'].sum()  # observed=False keeps empty segments
```

### Application 3: E-commerce

```python
# Product categories
df['category'] = df['category'].astype('category')

# Order status (ordered)
df['status'] = pd.Categorical(
    df['status'],
    categories=['Pending', 'Processing', 'Shipped', 'Delivered', 'Cancelled'],
    ordered=True
)

# One-hot encode for ML
features = pd.get_dummies(df[['category', 'payment_method']])
```

### Application 4: HR Analytics

```python
# Job levels
df['level'] = pd.Categorical(
    df['level'],
    categories=['Junior', 'Mid', 'Senior', 'Lead', 'Manager'],
    ordered=True
)

# Performance rating
df['rating'] = pd.cut(
    df['score'],
    bins=[0, 60, 75, 90, 100],
    labels=['Needs Improvement', 'Meets', 'Exceeds', 'Outstanding']
)

# Salary analysis
df.groupby(['level', 'department'], observed=False)['salary'].median()  # observed=False keeps unused levels
```

### Application 5: ML Feature Engineering

```python
# Create features from categorical
def prepare_features(df):
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified

    # Ordinal encoding for education (values outside the categories get code -1)
    df['education_code'] = pd.Categorical(
        df['education'],
        categories=['High School', 'Bachelor', 'Master', 'PhD'],
        ordered=True
    ).codes

    # One-hot for city (dtype=int gives 0/1 columns)
    city_dummies = pd.get_dummies(df['city'], prefix='city', dtype=int)

    # Binning age into 5 equal-width intervals
    df['age_group'] = pd.cut(df['age'], bins=5)

    return pd.concat([df, city_dummies], axis=1)
```
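One practical issue with one-hot features is that new data may contain categories the training data did not, which changes the set of columns. A common workaround, sketched here with hypothetical `train` and `new` frames, is to reindex the new dummies to the training columns:

```python
import pandas as pd

train = pd.DataFrame({'city': ['NYC', 'LA', 'Chicago', 'NYC']})
new = pd.DataFrame({'city': ['LA', 'Boston']})   # 'Boston' never appeared in training

train_dummies = pd.get_dummies(train['city'], prefix='city', dtype=int)
expected_columns = list(train_dummies.columns)

# reindex adds missing columns as zeros and drops columns the model never saw
new_dummies = (
    pd.get_dummies(new['city'], prefix='city', dtype=int)
    .reindex(columns=expected_columns, fill_value=0)
)

print(new_dummies)
```

The unseen city ends up as an all-zeros row, so the feature matrix keeps the shape the model expects.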

In [None]:
print("=== REAL-WORLD APPLICATIONS ===\n")

# Application 1: Customer Survey Analysis
print("Application 1: Customer Satisfaction Survey\n")

survey_data = pd.DataFrame({
    'customer_id': range(1, 11),
    'satisfaction': ['Satisfied', 'Very Satisfied', 'Neutral', 'Unsatisfied',
                    'Very Satisfied', 'Satisfied', 'Very Unsatisfied', 'Neutral',
                    'Satisfied', 'Very Satisfied'],
    'age': [25, 34, 45, 28, 52, 38, 31, 47, 29, 41]
})

# Create ordered categorical
survey_data['satisfaction'] = pd.Categorical(
    survey_data['satisfaction'],
    categories=['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied'],
    ordered=True
)

print("Survey data:")
print(survey_data)

print("\nSatisfaction distribution (maintains order):")
print(survey_data['satisfaction'].value_counts(sort=False))

# Filter positive responses
positive = survey_data[survey_data['satisfaction'] >= 'Satisfied']
print(f"\nPositive responses (>= Satisfied): {len(positive)} / {len(survey_data)} ({len(positive)/len(survey_data)*100:.0f}%)")
print()

# Application 2: Customer Segmentation
print("="*70)
print("Application 2: Customer Segmentation\n")

np.random.seed(42)
customers = pd.DataFrame({
    'customer_id': range(1, 21),
    'age': np.random.randint(20, 70, 20),
    'total_spent': np.random.randint(100, 5000, 20),
    'num_orders': np.random.randint(1, 50, 20)
})

# Age groups
customers['age_group'] = pd.cut(
    customers['age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle-aged', 'Senior']
)

# Spending tiers
customers['spending_tier'] = pd.qcut(
    customers['total_spent'],
    q=3,
    labels=['Low', 'Medium', 'High']
)

print("Customer segments:")
print(customers[['customer_id', 'age', 'age_group', 'total_spent', 'spending_tier']].head(10))

print("\nSegment analysis:")
segment_summary = customers.groupby(['age_group', 'spending_tier'], observed=False).agg({
    'customer_id': 'count',
    'total_spent': 'mean'
}).round(0)
segment_summary.columns = ['count', 'avg_spent']
print(segment_summary)
print()

# Application 3: E-commerce Order Status
print("="*70)
print("Application 3: E-commerce Order Status\n")

orders = pd.DataFrame({
    'order_id': range(1, 9),
    'status': ['Pending', 'Shipped', 'Processing', 'Delivered', 
               'Shipped', 'Cancelled', 'Delivered', 'Processing'],
    'amount': [100, 250, 175, 300, 150, 200, 275, 125]
})

# Ordered status
orders['status'] = pd.Categorical(
    orders['status'],
    categories=['Pending', 'Processing', 'Shipped', 'Delivered', 'Cancelled'],
    ordered=True
)

print("Orders:")
print(orders.sort_values('status'))

print("\nStatus distribution:")
print(orders['status'].value_counts(sort=False))

# Filter active orders
active = orders[orders['status'] < 'Delivered']
print(f"\nActive orders (not delivered/cancelled): {len(active)}")
print()

# Application 4: HR Performance Review
print("="*70)
print("Application 4: HR Performance Review\n")

employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'level': ['Junior', 'Senior', 'Mid', 'Lead', 'Mid', 'Senior'],
    'score': [72, 88, 65, 95, 78, 91],
    'salary': [50000, 90000, 65000, 120000, 70000, 95000]
})

# Job level (ordered)
employees['level'] = pd.Categorical(
    employees['level'],
    categories=['Junior', 'Mid', 'Senior', 'Lead'],
    ordered=True
)

# Performance rating
employees['rating'] = pd.cut(
    employees['score'],
    bins=[0, 70, 80, 90, 100],
    labels=['Needs Improvement', 'Meets', 'Exceeds', 'Outstanding']
)

print(employees.sort_values('level'))

print("\nSalary by level:")
print(employees.groupby('level', observed=False)['salary'].agg(['mean', 'min', 'max']).round(0))
print()

# Application 5: ML Preparation
print("="*70)
print("Application 5: Prepare data for Machine Learning\n")

ml_data = pd.DataFrame({
    'age': [25, 35, 45, 28, 52],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master'],
    'income': [50000, 75000, 95000, 55000, 80000]
})

print("Original data:")
print(ml_data)

# Prepare features
ml_features = ml_data.copy()

# 1. Age groups
ml_features['age_group'] = pd.cut(
    ml_features['age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Adult', 'Senior']
)

# 2. Education (ordinal encoding)
ml_features['education'] = pd.Categorical(
    ml_features['education'],
    categories=['High School', 'Bachelor', 'Master', 'PhD'],
    ordered=True
)
ml_features['education_code'] = ml_features['education'].cat.codes

# 3. City (one-hot encoding)
city_dummies = pd.get_dummies(ml_features['city'], prefix='city', dtype=int)

# Combine
ml_final = pd.concat([
    ml_features[['age', 'education_code', 'income']],
    city_dummies
], axis=1)

print("\nML-ready features:")
print(ml_final)
print("\n✅ All categorical converted to numeric")
print("✅ Ready for machine learning models!")

## 7. Best Practices & Tips

### When to Use Categorical ✅

```
# ✅ Good candidates
- Gender: 2-3 unique values
- Country: 50-200 unique values
- Product category: 10-50 unique values
- Status: 3-10 unique values
- Day of week: 7 values

# ❌ Poor candidates
- User ID: 1,000,000 unique values
- Email: 500,000 unique values
- Transaction ID: All unique
```

### Rule of Thumb

```python
unique_ratio = df['col'].nunique() / len(df)

if unique_ratio < 0.5:  # Less than 50% unique
    df['col'] = df['col'].astype('category')
```
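
The ratio check matters because a categorical stores every distinct label once plus one integer code per row, so a column of mostly unique values gains little or nothing. A minimal sketch of that contrast, using made-up data and a hypothetical helper name (`deep_bytes`); exact byte counts depend on the pandas version and string backend:

```python
import pandas as pd

n = 10_000
low_card = pd.Series(['red', 'green', 'blue', 'yellow'] * (n // 4))  # 4 distinct values
high_card = pd.Series([f'user_{i}' for i in range(n)])               # every value distinct

def deep_bytes(s):
    """Bytes used by a Series, including the string payloads."""
    return s.memory_usage(deep=True)

# Positive numbers mean the categorical version is smaller
low_saved = deep_bytes(low_card) - deep_bytes(low_card.astype('category'))
high_saved = deep_bytes(high_card) - deep_bytes(high_card.astype('category'))

print(f"Low cardinality saved:  {low_saved:,} bytes")
print(f"High cardinality saved: {high_saved:,} bytes")
```

The low-cardinality column should show a clear saving, while the all-unique column should show far less (often a loss), which is why IDs and emails are listed as poor candidates.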

### Ordered vs Unordered

**Ordered** - Natural sequence
```
✅ Size: S < M < L < XL
✅ Education: High School < Bachelor < Master < PhD
✅ Rating: Poor < Fair < Good < Excellent
✅ Priority: Low < Medium < High
```

**Unordered** - No meaningful order
```
✅ Color: Red, Blue, Green
✅ Country: USA, UK, India
✅ Gender: Male, Female, Other
✅ Payment method: Card, Cash, PayPal
```
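
The practical difference shows up in which operations are allowed. The sketch below uses small made-up Series (`sizes`, `colors`) to illustrate it: order-based operations work on the ordered one and raise `TypeError` on the unordered one.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['M', 'S', 'XL', 'L'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True,
))
colors = pd.Series(['Red', 'Blue', 'Green'], dtype='category')  # unordered

# Ordered: min/max and </>= follow the declared sequence, not the alphabet
largest = sizes.max()
at_least_large = sizes[sizes >= 'L'].tolist()

# Unordered: equality checks work, order-based operations raise TypeError
is_red = (colors == 'Red').tolist()
try:
    colors.max()
    unordered_max_failed = False
except TypeError:
    unordered_max_failed = True

print(largest, at_least_large, is_red, unordered_max_failed)
```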

### Encoding Best Practices

**1. Nominal → One-Hot**
```python
# No order, equal importance
pd.get_dummies(df['color'])
```

**2. Ordinal → Label Encoding**
```python
# Has order, preserve it
cat = pd.Categorical(df['size'], categories=['S','M','L'], ordered=True)
df['size_code'] = cat.codes
```

**3. High Cardinality → Target/Frequency Encoding**
```python
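# Too many categories for one-hot
df['city_target_mean'] = df.groupby('city')['target'].transform('mean')  # target encoding
df['city_freq'] = df['city'].map(df['city'].value_counts(normalize=True))  # frequency encoding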
```

### Performance Tips 🚀

**1. Convert Early**
```python
# ✅ Convert right after loading
df = pd.read_csv('data.csv')
df['category'] = df['category'].astype('category')
```
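
Converting can also happen during the load itself, via the `dtype` argument of `pd.read_csv`, so the column never exists as plain strings. A minimal sketch, assuming a tiny in-memory CSV (`csv_text`) in place of a real file:

```python
import io
import pandas as pd

# Stand-in for a file on disk; the column names are illustrative
csv_text = "order_id,status\n1,Pending\n2,Shipped\n3,Pending\n"

# Ask the parser to build the categorical column directly
df = pd.read_csv(io.StringIO(csv_text), dtype={'status': 'category'})

print(df.dtypes)
print(df['status'].cat.categories.tolist())
```

Categories created this way are unordered; set an explicit order afterwards if the column is ordinal.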

**2. Bulk Convert**
```python
# Convert multiple columns
cat_cols = ['city', 'status', 'level']
for col in cat_cols:
    df[col] = df[col].astype('category')
```

**3. Memory Check**
```python
# Check memory usage
df.memory_usage(deep=True)
```

### Common Pitfalls ❌

**1. Adding Invalid Category**
```python
# ❌ Error
s = pd.Series(['A', 'B'], dtype='category')
s[0] = 'C'  # Error! 'C' not in categories

# ✅ Add first
s = s.cat.add_categories(['C'])
s[0] = 'C'
```

**2. Comparing Unordered**
```python
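# ❌ Error
colors = pd.Categorical(['red', 'blue'])
colors < 'blue'  # TypeError: unordered categoricals only support == and !=

# ✅ Make ordered (list the categories explicitly, lowest first)
colors = pd.Categorical(['red', 'blue'], categories=['red', 'blue'], ordered=True)
colors < 'blue'  # [True, False]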
```

**3. Forgetting to Remove Unused**
```python
# After filtering (.copy() avoids SettingWithCopyWarning on the assignment below)
df_filtered = df[df['status'] != 'cancelled'].copy()

# ✅ Remove unused categories
df_filtered['status'] = df_filtered['status'].cat.remove_unused_categories()
```
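
To see why the cleanup matters, the sketch below (made-up data) counts values before and after removing unused categories. The filtered-out label keeps appearing with a count of zero until it is dropped.

```python
import pandas as pd

df = pd.DataFrame({
    'status': pd.Categorical(['pending', 'cancelled', 'shipped', 'pending']),
    'amount': [100, 200, 150, 50],
})

df_filtered = df[df['status'] != 'cancelled'].copy()

# 'cancelled' has no rows left but is still a registered category
counts_before = df_filtered['status'].value_counts(sort=False).to_dict()

df_filtered['status'] = df_filtered['status'].cat.remove_unused_categories()
counts_after = df_filtered['status'].value_counts(sort=False).to_dict()

print(counts_before)
print(counts_after)
```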

### Workflow Pattern

```python
# 1. Load data
df = pd.read_csv('data.csv')

# 2. Identify categorical columns
cat_cols = df.select_dtypes(include='object').columns

# 3. Check cardinality
for col in cat_cols:
    ratio = df[col].nunique() / len(df)
    if ratio < 0.5:
        df[col] = df[col].astype('category')

# 4. Set ordering if needed
if 'size' in df.columns:
    df['size'] = pd.Categorical(
        df['size'],
        categories=['S', 'M', 'L', 'XL'],
        ordered=True
    )

# 5. Encode for ML
# One-hot for nominal
nominal_cols = ['city', 'color']
df_encoded = pd.get_dummies(df, columns=nominal_cols)

# Label encoding for ordinal
df_encoded['size_code'] = df['size'].cat.codes
```

## 8. Practice Exercises

### Beginner Level (1-5)

1. **Create categorical**
   - Convert object column to categorical
   - Check categories and codes

2. **Add categories**
   - Add new category to existing categorical
   - Set a value to new category

3. **Ordered categorical**
   - Create ordered categorical for sizes
   - Sort values

4. **Value counts**
   - Count values in categorical
   - Include unused categories

5. **Basic binning**
   - Use pd.cut() on age data
   - Create 3 age groups

### Intermediate Level (6-10)

6. **Memory comparison**
   - Create large dataset
   - Compare object vs categorical memory

7. **Rename categories**
   - Rename categories and assign the result back
   - Verify changes

8. **Quantile binning**
   - Use pd.qcut() for quartiles
   - Verify equal counts

9. **One-hot encoding**
   - Encode color column
   - Use drop_first=True

10. **Label encoding**
    - Create ordered categorical
    - Extract codes

### Advanced Level (11-15)

11. **Customer segmentation**
    - Bin age into groups
    - Bin spending into quartiles
    - Analyze by segment

12. **Survey analysis**
    - Create Likert scale categorical
    - Filter positive responses (>=4)
    - Calculate satisfaction rate

13. **Mixed encoding**
    - One-hot encode nominal features
    - Label encode ordinal features
    - Combine for ML

14. **Performance comparison**
    - Create large dataset
    - Time groupby with object vs categorical

15. **Remove unused**
    - Filter dataframe
    - Remove unused categories
    - Verify cleanup

### Challenge Problems (16-20)

16. **Complete preprocessing pipeline**
    - Identify categorical columns
    - Check cardinality
    - Convert appropriate columns
    - Set ordering
    - Encode for ML

17. **E-commerce data**
    - Product categories
    - Order status (ordered)
    - Customer tiers from spending
    - Analyze by segment

18. **HR analytics**
    - Job levels (ordered)
    - Performance ratings from scores
    - Salary bands
    - Department analysis

19. **Custom binning**
    - Non-uniform bin edges
    - Handle edge cases
    - Custom labels

20. **Optimize large dataset**
    - Load CSV with 1M rows
    - Identify categorical columns
    - Convert and measure savings
    - Compare query performance

In [None]:
print("=== PRACTICE EXERCISE SOLUTIONS ===\n")

# Exercise 1: Create categorical
print("Exercise 1: Create and inspect categorical\n")
colors = pd.Series(['red', 'blue', 'green', 'red', 'blue'])
colors_cat = colors.astype('category')
print(f"Categories: {colors_cat.cat.categories.tolist()}")
print(f"Codes: {colors_cat.cat.codes.tolist()}")
print()

# Exercise 6: Memory comparison
print("="*70)
print("Exercise 6: Memory comparison\n")
n = 10000
data = ['Category_A', 'Category_B', 'Category_C'] * (n // 3)
df_obj = pd.DataFrame({'col': data})
df_cat = pd.DataFrame({'col': pd.Categorical(data)})

mem_obj = df_obj.memory_usage(deep=True).sum() / 1024
mem_cat = df_cat.memory_usage(deep=True).sum() / 1024
print(f"Object:      {mem_obj:.2f} KB")
print(f"Categorical: {mem_cat:.2f} KB")
print(f"Savings:     {(1 - mem_cat/mem_obj)*100:.1f}%")
print()

# Exercise 8: Quantile binning
print("="*70)
print("Exercise 8: Quartiles with equal counts\n")
values = list(range(1, 21))
quartiles = pd.qcut(values, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(f"Values: {values}")
print(f"\nQuartiles: {list(quartiles)}")
print(f"\nCounts (should be equal):")
print(quartiles.value_counts())
print()

# Exercise 11: Customer segmentation
print("="*70)
print("Exercise 11: Customer segmentation\n")
customers = pd.DataFrame({
    'id': range(1, 13),
    'age': [22, 35, 45, 28, 52, 38, 61, 29, 44, 33, 56, 41],
    'spending': [500, 2500, 3500, 1500, 4500, 2000, 5000, 1000, 3000, 2200, 4000, 2800]
})

# Age groups
customers['age_group'] = pd.cut(
    customers['age'],
    bins=[0, 30, 50, 100],
    labels=['Young', 'Middle', 'Senior']
)

# Spending quartiles
customers['spend_tier'] = pd.qcut(
    customers['spending'],
    q=4,
    labels=['Low', 'Medium', 'High', 'Premium']
)

print(customers)
print("\nSegment counts:")
print(pd.crosstab(customers['age_group'], customers['spend_tier']))
print()

# Exercise 13: Mixed encoding
print("="*70)
print("Exercise 13: Mixed encoding for ML\n")
data = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago'],      # Nominal
    'size': ['S', 'L', 'M', 'L'],                # Ordinal
    'rating': [4, 5, 3, 4]                       # Numeric
})

print("Original:")
print(data)

# One-hot for city (nominal)
city_encoded = pd.get_dummies(data['city'], prefix='city', dtype=int)

# Label encoding for size (ordinal)
size_cat = pd.Categorical(data['size'], categories=['S', 'M', 'L', 'XL'], ordered=True)
size_encoded = pd.Series(size_cat.codes, name='size_code')

# Combine
ml_ready = pd.concat([city_encoded, size_encoded, data['rating']], axis=1)
print("\nML-ready (all numeric):")
print(ml_ready)
print("\n✅ Ready for machine learning!")
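
Exercise 12 (survey analysis) has no solution above. One possible sketch follows, assuming a five-point Likert scale with illustrative labels and made-up responses, and reading "positive (>=4)" as "Agree or higher":

```python
import pandas as pd

# Assumed scale labels, lowest to highest
likert = ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']

responses = pd.Series(
    ['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree', 'Strongly Agree'],
    dtype=pd.CategoricalDtype(categories=likert, ordered=True),
)

# Ordered categories allow filtering by position on the scale
positive = responses[responses >= 'Agree']
satisfaction_rate = len(positive) / len(responses)

print(f"Positive responses: {len(positive)} of {len(responses)}")
print(f"Satisfaction rate:  {satisfaction_rate:.0%}")
```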

## Quick Reference Card

### Creating Categorical

```python
# From Series
s = pd.Series(['A', 'B', 'C']).astype('category')

# With pd.Categorical
cat = pd.Categorical(['A', 'B', 'C'])

# Ordered
cat = pd.Categorical(
    ['S', 'M', 'L'],
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)

# In DataFrame
df['col'] = df['col'].astype('category')
```

### Categorical Properties

```python
s.cat.categories       # Index of category labels
s.cat.codes           # Integer codes
s.cat.ordered         # Is ordered?
len(s.cat.categories) # Number of categories
```

### Modifying Categories

```python
# Add categories
s.cat.add_categories(['D', 'E'])

# Remove categories
s.cat.remove_categories(['E'])

# Remove unused
s.cat.remove_unused_categories()

# Rename
s.cat.rename_categories({'A': 'Alpha', 'B': 'Beta'})

# Reorder
s.cat.reorder_categories(['C', 'B', 'A'])

# Set ordering
s.cat.as_ordered()
s.cat.as_unordered()
```
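
The `.cat` methods above return a new Series and do not modify `s` in place, so the result has to be assigned back. A short illustration with a throwaway Series:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C'], dtype='category')

s.cat.add_categories(['D'])             # result discarded: s is unchanged
unchanged = s.cat.categories.tolist()

s = s.cat.add_categories(['D'])         # assign back to keep the change
updated = s.cat.categories.tolist()

# Renaming changes the labels of existing values as well
renamed = s.cat.rename_categories({'A': 'Alpha'})

print(unchanged, updated, renamed.tolist())
```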

### Binning

```python
# Fixed width bins
pd.cut(
    data,
    bins=4,                          # Number of bins
    labels=['A', 'B', 'C', 'D']     # Custom labels
)

# Custom edges
pd.cut(
    ages,
    bins=[0, 18, 35, 60, 100],
    labels=['Child', 'Young', 'Adult', 'Senior']
)

# Equal count bins (quantiles)
pd.qcut(
    data,
    q=4,                             # Number of quantiles
    labels=['Q1', 'Q2', 'Q3', 'Q4']
)
```

### Encoding

```python
# One-hot encoding (nominal)
pd.get_dummies(df['color'])
pd.get_dummies(df, columns=['color', 'city'])
pd.get_dummies(df['color'], drop_first=True)  # Avoid dummy trap

# Label encoding (ordinal)
cat = pd.Categorical(df['size'], categories=['S','M','L'], ordered=True)
df['size_code'] = cat.codes

# Convert back from codes
pd.Categorical.from_codes(codes, categories=['S', 'M', 'L'])
```
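
One detail worth knowing when label encoding: a value that is not in the `categories` list becomes missing and gets the code `-1`. The sketch below uses made-up sizes to show this and the round trip through `from_codes`:

```python
import pandas as pd

sizes = pd.Series(['M', 'S', 'XXL', 'L'])   # 'XXL' is not a listed category
cat = pd.Categorical(sizes, categories=['S', 'M', 'L'], ordered=True)

codes = cat.codes.tolist()                   # -1 marks the missing value

# Rebuild the labels from the codes; -1 maps back to NaN
restored = pd.Categorical.from_codes(cat.codes, categories=['S', 'M', 'L'], ordered=True)

print(codes)
print(restored.tolist())
```

Check for `-1` before feeding codes to a model, since it would otherwise be read as a real level below the lowest category.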

### Operations

```python
# Value counts
s.value_counts()              # Sorted by frequency
s.value_counts(sort=False)    # Sorted by category order

# Unique
s.unique()
s.nunique()

# Sort
s.sort_values()               # By category order, not alphabetical

# Filter
s[s == 'A']
s[s >= 'M']                   # Only for ordered

# GroupBy
df.groupby('category')['value'].mean()
```

### Memory Check

```python
# Check memory usage
df.memory_usage(deep=True)

# Should it be categorical?
unique_ratio = df['col'].nunique() / len(df)
if unique_ratio < 0.5:
    df['col'] = df['col'].astype('category')
```

### Common Patterns

```python
# Pattern 1: Convert low-cardinality columns
for col in df.select_dtypes(include='object').columns:
    if df[col].nunique() / len(df) < 0.5:
        df[col] = df[col].astype('category')

# Pattern 2: Age groups
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 60, 100],
    labels=['Child', 'Young Adult', 'Adult', 'Senior']
)

# Pattern 3: Ordered levels
df['level'] = pd.Categorical(
    df['level'],
    categories=['Junior', 'Mid', 'Senior', 'Lead'],
    ordered=True
)

# Pattern 4: ML encoding
# One-hot for nominal
nominal_encoded = pd.get_dummies(df[['city', 'color']])

# Label for ordinal
df['education_code'] = pd.Categorical(
    df['education'],
    categories=['HS', 'Bachelor', 'Master', 'PhD'],
    ordered=True
).codes

# Combine
ml_features = pd.concat([
    df[['age', 'income', 'education_code']],
    nominal_encoded
], axis=1)
```

## Summary

### Key Concepts Mastered ✅

**1. Categorical Basics**
- **Definition**: Fixed set of possible values
- **Components**: Categories + Codes + Ordered flag
- **Benefits**: Memory, performance, data integrity
- **When to use**: Low cardinality (< 50% unique)

**2. Ordered vs Unordered**
- **Ordered**: Natural sequence (S < M < L)
- **Unordered**: No meaningful order (Red, Blue, Green)
- **Comparisons**: `<`, `>`, `<=`, `>=` only work with ordered; `==` and `!=` always work
- **Sorting**: Respects category order

**3. Memory Savings**
- Stores integer codes instead of strings
- 50-90% memory reduction typical
- Best for repeated values
- Not beneficial for high cardinality
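
The storage layout can be inspected directly. In this small sketch (made-up status values), the labels are kept once in `categories` and each row holds only a small integer code:

```python
import pandas as pd

s = pd.Series(['Pending', 'Approved', 'Pending', 'Rejected', 'Approved'], dtype='category')

labels = s.cat.categories.tolist()   # distinct labels, stored once
codes = s.cat.codes.tolist()         # one integer per row, indexing into labels
code_dtype = s.cat.codes.dtype       # a small integer type when there are few categories

print(labels)
print(codes)
print(code_dtype)
```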

**4. Binning**
- **cut()**: Equal-width bins or custom bin edges (value ranges)
- **qcut()**: Equal count bins (quantiles)
- Use cases: Age groups, income tiers, grades
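
The difference between the two is easiest to see on skewed data. In this sketch (made-up incomes with one outlier), `pd.cut` puts almost everything in the first bin, while `pd.qcut` balances the counts:

```python
import pandas as pd

incomes = pd.Series([20, 22, 25, 27, 30, 32, 35, 200])  # one large outlier

# Equal width: the range 20-200 is split into four spans of the same size
equal_width = pd.cut(incomes, bins=4, labels=['B1', 'B2', 'B3', 'B4'])

# Equal count: edges are placed at the quartiles of the data
equal_count = pd.qcut(incomes, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

width_counts = equal_width.value_counts(sort=False).tolist()
count_counts = equal_count.value_counts(sort=False).tolist()

print("cut counts: ", width_counts)
print("qcut counts:", count_counts)
```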

**5. Encoding**
- **One-hot**: Nominal data → Binary columns
- **Label**: Ordinal data → Integer codes
- **When**: ML models need numeric input

---

### Method Summary

| Operation | Method | Purpose |
|-----------|--------|----------|
| **Create** | `.astype('category')` | Convert to categorical |
| **Add** | `.cat.add_categories()` | Add new categories |
| **Remove** | `.cat.remove_categories()` | Remove categories |
| **Rename** | `.cat.rename_categories()` | Rename categories |
| **Reorder** | `.cat.reorder_categories()` | Change order |
| **Bin** | `pd.cut()` / `pd.qcut()` | Convert continuous → categorical |
| **One-hot** | `pd.get_dummies()` | Nominal → binary columns |
| **Label** | `.cat.codes` | Ordinal → integers |

---

### Decision Trees

**Should I use categorical?**
```
Is unique_ratio < 50%?
├─ Yes → ✅ Use categorical
└─ No  → ❌ Keep as object
```

**How to encode?**
```
Does it have natural order?
├─ Yes (Ordinal) → Label encoding (.cat.codes)
└─ No (Nominal)  → One-hot encoding (get_dummies)
```

**cut() or qcut()?**
```
Do you want equal...
├─ Width ranges? → pd.cut()
└─ Count per bin? → pd.qcut()
```

---

### Real-World Applications

**Business Analytics**
- Customer segmentation (age groups, spending tiers)
- Product categories and subcategories
- Order status tracking
- Performance ratings

**Machine Learning**
- Feature engineering (binning, encoding)
- One-hot encoding for linear models (no false ordering implied)
- Ordinal encoding for tree-based models (splits work on integer codes)
- Target encoding for high cardinality

**Data Quality**
- Enforce valid values (prevent typos)
- Clear set of allowed categories
- Consistent naming
- Data validation

**Performance**
- Faster groupby operations
- Efficient value_counts
- Reduced memory usage
- Faster sorting (ordered)

---

### Common Workflows

**Workflow 1: Data Loading**
```
1. Load CSV
2. Identify low-cardinality columns
3. Convert to categorical
4. Set ordering if needed
5. Check memory savings
```

**Workflow 2: Feature Engineering**
```
1. Bin continuous variables (age, income)
2. Create categorical from numeric
3. Set natural ordering
4. Encode for ML
5. Combine features
```

**Workflow 3: Analysis**
```
1. Create ordered categories
2. Sort by category order
3. Filter using comparisons
4. Group by categories
5. Visualize distributions
```

---

### Remember

- 🎯 **Use categorical** for repeated values (< 50% unique)
- 📊 **Ordered** enables comparisons and proper sorting
- 💾 **Memory savings** of 50-90% typical
- 🔢 **One-hot** for nominal, **label encoding** for ordinal
- 📦 **Binning** converts continuous to categorical
- ⚡ **Faster** groupby, value_counts, and sorting
- ✅ **Enforces** valid categories (data integrity)

---

### Next Steps

After mastering categorical data:
1. **Advanced encoding** - Target/frequency encoding
2. **Feature engineering** - Category interactions
3. **Machine learning** - Model-specific encoding
4. **High cardinality** - Handling many categories
5. **Optimization** - Memory and performance tuning

---

**Happy Categorical Data Processing! 🐼📊🎯**