# Encoding Categorical Variables

## The Problem

Most machine learning algorithms operate on **numbers**, not text, so categorical data must be converted to a numerical format first.

**Example**:
```
Color: ['Red', 'Blue', 'Green'] → ?
```

## Types of Categorical Data

| Type | Definition | Examples |
|------|------------|----------|
| **Nominal** | No order/ranking | Colors, Countries, Gender |
| **Ordinal** | Has meaningful order | Education (High School < Bachelor < Master), Ratings (Poor < Good < Excellent) |
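
The nominal/ordinal distinction can also be expressed directly in pandas. The following is a minimal sketch, assuming small made-up lists of values (they are not taken from the customer dataset); it uses `pd.Categorical` with and without `ordered=True`.

```python
import pandas as pd

# Nominal: categories carry no ranking
colors = pd.Categorical(['Red', 'Blue', 'Green'], ordered=False)

# Ordinal: the category list fixes the ranking explicitly
education = pd.Categorical(
    ['Bachelor', 'High School', 'Master'],
    categories=['High School', 'Bachelor', 'Master'],
    ordered=True,
)

print(colors.ordered)      # False
print(education.ordered)   # True
print(education.codes)     # position of each value in the category list
print(education.min(), '<', education.max())
```

Only the ordered categorical supports `min()` and `max()`, which mirrors the idea that ranking is meaningful for ordinal data and undefined for nominal data.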

## Encoding Methods

| Encoder | Use For | Output | When to Use |
|---------|---------|--------|-------------|
| **LabelEncoder** | Target variable (y) | Single column: [0, 1, 2, ...] | Classification labels only |
| **OrdinalEncoder** | Ordinal features (X) | Ordered integers | When categories have meaningful order |
| **OneHotEncoder** | Nominal features (X) | Multiple binary columns | Most common for features with no order |
| **get_dummies** | Quick encoding | Multiple binary columns | Pandas convenience (not for production) |

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

## Sample Dataset: Customer Data

Let's create a realistic dataset with different types of categorical variables:

In [2]:
# Create customer dataset
data = {
    'customer_id': range(1, 11),
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 
             'Chennai', 'Mumbai', 'Bangalore', 'Chennai', 'Delhi'],
    'education': ['High School', 'Bachelor', 'Master', 'Bachelor', 'PhD',
                  'High School', 'Master', 'Bachelor', 'PhD', 'Master'],
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone',
                'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop'],
    'satisfaction': ['Good', 'Excellent', 'Poor', 'Good', 'Excellent',
                     'Good', 'Excellent', 'Poor', 'Good', 'Excellent'],
    'purchased': [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # Target variable
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
print("\nData Types:")
print(df.dtypes)
print("\nCategorical Columns:")
print(df.select_dtypes(include='object').columns.tolist())

Original Dataset:
   customer_id       city    education product satisfaction  purchased
0            1     Mumbai  High School  Laptop         Good          1
1            2      Delhi     Bachelor   Phone    Excellent          1
2            3     Mumbai       Master  Tablet         Poor          0
3            4  Bangalore     Bachelor  Laptop         Good          1
4            5      Delhi          PhD   Phone    Excellent          1
5            6    Chennai  High School  Tablet         Good          0
6            7     Mumbai       Master  Laptop    Excellent          1
7            8  Bangalore     Bachelor   Phone         Poor          0
8            9    Chennai          PhD  Tablet         Good          1
9           10      Delhi       Master  Laptop    Excellent          1

Data Types:
customer_id      int64
city            object
education       object
product         object
satisfaction    object
purchased        int64
dtype: object

Categorical Columns:
['city', 'education', 'product', 'satisfaction']

## 1. LabelEncoder

**Definition**: Converts each unique category to an integer (0, 1, 2, ...), assigned in sorted (alphabetical) order

**Use Case**: Encoding **target variable (y)** only

**Warning**: ⚠️ Don't use for features! Creates false ordering.

**Example**:
```
['Red', 'Blue', 'Green', 'Red'] → [2, 0, 1, 2]
```

**Why not for features?**
- Algorithm might think Blue(0) < Green(1) < Red(2)
- No such relationship exists for colors!

In [3]:
# LabelEncoder for target variable
le = LabelEncoder()

# Example: encode satisfaction levels (used here only to illustrate the API)
satisfaction = df['satisfaction'].values
print("Original:", satisfaction)

# Fit and transform
satisfaction_encoded = le.fit_transform(satisfaction)
print("Encoded: ", satisfaction_encoded)

# See the mapping
print("\nMapping:")
for i, label in enumerate(le.classes_):
    print(f"  {label:12s} → {i}")

# Inverse transform (get back original)
original_back = le.inverse_transform(satisfaction_encoded)
print("\nInverse transform:", original_back)

# ⚠️ Problem with LabelEncoder for features
print("\n" + "="*60)
print("WHY NOT USE FOR FEATURES?")
print("="*60)

cities = ['Mumbai', 'Delhi', 'Bangalore', 'Chennai']
le_city = LabelEncoder()
encoded_cities = le_city.fit_transform(cities)

print("\nCities encoded with LabelEncoder:")
for city, code in zip(cities, encoded_cities):
    print(f"  {city:12s} → {code}")

print("\n⚠️ Problem: Algorithm thinks Bangalore(0) < Chennai(1) < Delhi(2) < Mumbai(3)")
print("   But cities have NO inherent order!")
print("   Solution: Use OneHotEncoder for nominal features")

Original: ['Good' 'Excellent' 'Poor' 'Good' 'Excellent' 'Good' 'Excellent' 'Poor'
 'Good' 'Excellent']
Encoded:  [1 0 2 1 0 1 0 2 1 0]

Mapping:
  Excellent    → 0
  Good         → 1
  Poor         → 2

Inverse transform: ['Good' 'Excellent' 'Poor' 'Good' 'Excellent' 'Good' 'Excellent' 'Poor'
 'Good' 'Excellent']

WHY NOT USE FOR FEATURES?

Cities encoded with LabelEncoder:
  Mumbai       → 3
  Delhi        → 2
  Bangalore    → 0
  Chennai      → 1

⚠️ Problem: Algorithm thinks Bangalore(0) < Chennai(1) < Delhi(2) < Mumbai(3)
   But cities have NO inherent order!
   Solution: Use OneHotEncoder for nominal features


## 2. OrdinalEncoder

**Definition**: Encodes categorical features with meaningful order

**Use Case**: Features with **natural ordering** (ordinal data)

**Examples**:
- Education: High School < Bachelor < Master < PhD
- Temperature: Cold < Warm < Hot
- Rating: Poor < Fair < Good < Excellent

**Advantage**: Preserves order information for the model
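
The order is only preserved when it is passed in explicitly. As a rough illustration (the array `levels` is made up for this sketch), compare the default behaviour of `OrdinalEncoder`, which sorts categories alphabetically, with an encoder that receives the ranking through `categories`:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = np.array([['High School'], ['Bachelor'], ['Master'], ['PhD']])

# Default: categories are sorted alphabetically, which is not the real ranking
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(levels).ravel()

# Explicit order: codes follow the ranking we pass in
ordered_enc = OrdinalEncoder(
    categories=[['High School', 'Bachelor', 'Master', 'PhD']]
)
ordered_codes = ordered_enc.fit_transform(levels).ravel()

print("Alphabetical:", default_codes)  # Bachelor gets 0, High School gets 1
print("Explicit:    ", ordered_codes)  # High School gets 0, Bachelor gets 1
```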

In [4]:
# Define custom order for education
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

# OrdinalEncoder with custom order
oe = OrdinalEncoder(categories=[education_order])

# Reshape for sklearn (needs 2D array)
education_data = df[['education']]
print("Original Education:")
print(education_data.values.flatten())

# Fit and transform
education_encoded = oe.fit_transform(education_data)
print("\nEncoded (with order):")
print(education_encoded.flatten())

print("\nMapping (preserves order):")
for i, level in enumerate(education_order):
    print(f"  {i} ← {level}")

# Multiple ordinal columns
print("\n" + "="*60)
print("MULTIPLE ORDINAL COLUMNS")
print("="*60)

satisfaction_order = ['Poor', 'Good', 'Excellent']

# Create ordinal encoder for multiple columns
oe_multi = OrdinalEncoder(categories=[education_order, satisfaction_order])

# Encode both columns
ordinal_features = df[['education', 'satisfaction']]
ordinal_encoded = oe_multi.fit_transform(ordinal_features)

result_df = pd.DataFrame(
    ordinal_encoded,
    columns=['education_encoded', 'satisfaction_encoded']
)
result_df = pd.concat([ordinal_features.reset_index(drop=True), result_df], axis=1)

print("\nOriginal vs Encoded:")
print(result_df)

Original Education:
['High School' 'Bachelor' 'Master' 'Bachelor' 'PhD' 'High School' 'Master'
 'Bachelor' 'PhD' 'Master']

Encoded (with order):
[0. 1. 2. 1. 3. 0. 2. 1. 3. 2.]

Mapping (preserves order):
  0 ← High School
  1 ← Bachelor
  2 ← Master
  3 ← PhD

MULTIPLE ORDINAL COLUMNS

Original vs Encoded:
     education satisfaction  education_encoded  satisfaction_encoded
0  High School         Good                0.0                   1.0
1     Bachelor    Excellent                1.0                   2.0
2       Master         Poor                2.0                   0.0
3     Bachelor         Good                1.0                   1.0
4          PhD    Excellent                3.0                   2.0
5  High School         Good                0.0                   1.0
6       Master    Excellent                2.0                   2.0
7     Bachelor         Poor                1.0                   0.0
8          PhD         Good                3.0                   1.0
9       Master    Excellent                2.0                   2.0


## 3. OneHotEncoder

**Definition**: Creates one binary column per category

**Use Case**: Nominal features with **no order**

**Example**:
```
Color: ['Red', 'Blue', 'Green']

Becomes:
  Red  Blue  Green
   1     0      0     (Red)
   0     1      0     (Blue)
   0     0      1     (Green)
```

**Advantages**:
- No false ordering
- Each category is independent

**Disadvantage**: High cardinality (many categories) creates many columns
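
One mitigation for the column count is that `OneHotEncoder` returns a sparse matrix unless told otherwise, so only the non-zero entries are stored. A small sketch, assuming a hypothetical column with up to 200 distinct labels generated at random:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Hypothetical column: 1,000 rows drawn from up to 200 distinct labels
values = rng.integers(0, 200, size=1000).astype(str).reshape(-1, 1)

encoded = OneHotEncoder().fit_transform(values)  # sparse by default

print(sparse.issparse(encoded))  # True
print(encoded.shape)             # (rows, number of distinct labels seen)
print(encoded.nnz)               # one non-zero entry per row
```

The sparse format keeps memory proportional to the number of rows rather than rows times categories, although many columns can still hurt model quality and training time.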

In [5]:
# OneHotEncoder for city (nominal feature)
ohe = OneHotEncoder(sparse_output=False)  # sparse_output=False for readable output

city_data = df[['city']]
print("Original Cities:")
print(city_data.values.flatten())

# Fit and transform
city_encoded = ohe.fit_transform(city_data)
print("\nOne-Hot Encoded Shape:", city_encoded.shape)
print("\nEncoded Matrix:")
print(city_encoded)

# Get feature names
feature_names = ohe.get_feature_names_out(['city'])
print("\nColumn Names:", feature_names)

# Create readable dataframe
city_encoded_df = pd.DataFrame(city_encoded, columns=feature_names)
city_encoded_df.insert(0, 'original_city', city_data['city'].values)  # insert() needs a 1D array

print("\nReadable Format:")
print(city_encoded_df)

Original Cities:
['Mumbai' 'Delhi' 'Mumbai' 'Bangalore' 'Delhi' 'Chennai' 'Mumbai'
 'Bangalore' 'Chennai' 'Delhi']

One-Hot Encoded Shape: (10, 4)

Encoded Matrix:
[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]]

Column Names: ['city_Bangalore' 'city_Chennai' 'city_Delhi' 'city_Mumbai']

In [None]:
# OneHotEncoder for multiple columns
print("="*60)
print("ONE-HOT ENCODING MULTIPLE COLUMNS")
print("="*60)

# Select nominal features
nominal_features = df[['city', 'product']]
print("\nOriginal Data:")
print(nominal_features)

# Encode
ohe_multi = OneHotEncoder(sparse_output=False)
encoded = ohe_multi.fit_transform(nominal_features)

# Get column names
column_names = ohe_multi.get_feature_names_out(['city', 'product'])
print("\nGenerated Columns:", column_names)

# Create dataframe
encoded_df = pd.DataFrame(encoded, columns=column_names)
print("\nEncoded Result:")
print(encoded_df.head())
print(f"\nShape: {nominal_features.shape} → {encoded_df.shape}")

## Important: drop='first' to Avoid Dummy Variable Trap

**Problem**: Keeping every one-hot column creates perfect multicollinearity, because the columns of one feature always sum to 1

**Example**: For 3 colors, we create 3 columns
```
Red  Blue  Green
 1    0     0      If Red=0 AND Blue=0, we KNOW Green=1
 0    1     0      One column is redundant!
 0    0     1
```

**Solution**: Drop first column (or any one column)

**When to use**:
- Linear models (Linear/Logistic Regression): recommended
- Tree-based models: not necessary, but it saves one column per feature
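
The redundancy can be checked numerically. The sketch below is an illustration with a hand-built 3-row matrix (one row each for Red, Blue, Green) plus an intercept column, as a linear model would add; it compares the number of columns with the matrix rank.

```python
import numpy as np

# One-hot columns for three rows: Red, Blue, Green
one_hot = np.eye(3)
intercept = np.ones((3, 1))

full = np.hstack([intercept, one_hot])            # intercept + all 3 dummies
dropped = np.hstack([intercept, one_hot[:, 1:]])  # intercept + 2 dummies

# 4 columns but rank 3: one column is a linear combination of the others
print(full.shape[1], np.linalg.matrix_rank(full))
# 3 columns and rank 3: no redundancy left
print(dropped.shape[1], np.linalg.matrix_rank(dropped))
```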

In [None]:
# Without drop='first'
ohe_full = OneHotEncoder(sparse_output=False)
city_full = ohe_full.fit_transform(df[['city']])

print("Without drop='first':")
print(f"Columns: {ohe_full.get_feature_names_out()}")
print(f"Shape: {city_full.shape}")
print(city_full[:3])

# With drop='first' (recommended for linear models)
ohe_drop = OneHotEncoder(sparse_output=False, drop='first')
city_drop = ohe_drop.fit_transform(df[['city']])

print("\nWith drop='first':")
print(f"Columns: {ohe_drop.get_feature_names_out()}")
print(f"Shape: {city_drop.shape}")
print(city_drop[:3])
print("\n✓ First category (Bangalore) dropped")
print("  If all columns are 0 → Bangalore")

## Handling Unknown Categories

**Problem**: What if test data has categories not seen in training?

**Solution**: Use `handle_unknown='ignore'`
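
For contrast, here is a short sketch of what happens without that option. `OneHotEncoder` defaults to `handle_unknown='error'`, so transforming an unseen category raises a `ValueError`. The tiny `train` and `test` arrays are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['Mumbai'], ['Delhi']])
test = np.array([['Kolkata']])  # never seen during fit

strict = OneHotEncoder()  # handle_unknown defaults to 'error'
strict.fit(train)

try:
    strict.transform(test)
    raised = False
except ValueError as exc:
    raised = True
    print("Default behaviour raises:", exc)
```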

In [None]:
# Training data
train_cities = np.array([['Mumbai'], ['Delhi'], ['Bangalore']])

# Fit encoder
ohe_unknown = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe_unknown.fit(train_cities)

print("Trained on:", train_cities.flatten())
print("Categories learned:", ohe_unknown.categories_)

# Test data with unknown category
test_cities = np.array([['Mumbai'], ['Kolkata'], ['Delhi']])  # Kolkata is new!

print("\nTest data:", test_cities.flatten())

# Transform (without error)
test_encoded = ohe_unknown.transform(test_cities)
print("\nEncoded:")
print(test_encoded)
print("\n'Kolkata' (unknown) → all zeros")

## Pandas get_dummies() - Quick Alternative

**Use Case**: Quick exploratory data analysis

**Limitations**:
- Does not remember the categories seen in training, so train and test data can end up with different columns
- Cannot be fitted once and reused inside an sklearn `Pipeline`
- Not recommended for production

**When to use**: Prototyping, notebooks, quick experiments
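
The column mismatch is easy to reproduce. The following sketch uses two made-up frames, `train` and `test`, and shows one possible workaround (`reindex` against the training columns); it is an illustration, not a replacement for a fitted encoder.

```python
import pandas as pd

train = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Bangalore']})
test = pd.DataFrame({'city': ['Mumbai', 'Kolkata']})  # Kolkata is new

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)
print(list(train_d.columns))  # columns built from the training categories
print(list(test_d.columns))   # different columns: the layouts do not match

# Workaround: force the test frame into the training layout.
# Missing columns are filled with 0, unseen categories are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0).astype(int)
print(test_aligned)
```

After alignment the unseen city becomes a row of zeros, which is the same outcome `OneHotEncoder(handle_unknown='ignore')` produces without the manual step.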

In [None]:
# Original data
print("Original DataFrame:")
print(df[['city', 'product', 'satisfaction']].head())

# get_dummies - all categorical columns
df_dummies = pd.get_dummies(df[['city', 'product', 'satisfaction']])
print("\nAfter get_dummies():")
print(df_dummies.head())
print(f"\nShape: 3 columns → {df_dummies.shape[1]} columns")

# With drop_first
df_dummies_drop = pd.get_dummies(
    df[['city', 'product', 'satisfaction']], 
    drop_first=True
)
print("\nWith drop_first=True:")
print(f"Shape: {df_dummies_drop.shape[1]} columns")
print(df_dummies_drop.head())

## Real-World Example: Predicting Purchases

Complete pipeline with proper encoding:

In [None]:
# Create larger dataset
np.random.seed(42)
n_samples = 200

cities = np.random.choice(['Mumbai', 'Delhi', 'Bangalore', 'Chennai'], n_samples)
education = np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples)
product = np.random.choice(['Laptop', 'Phone', 'Tablet'], n_samples)
age = np.random.randint(20, 60, n_samples)
salary = np.random.randint(30000, 150000, n_samples)

# Target: purchased (influenced by salary and education)
purchased = ((salary > 80000) & (education != 'High School')).astype(int)
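# Add randomness: flip about 10% of labels (a full permutation would erase the signal)
flip = np.random.rand(n_samples) < 0.10
purchased = np.where(flip, 1 - purchased, purchased)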

data_large = pd.DataFrame({
    'city': cities,
    'education': education,
    'product': product,
    'age': age,
    'salary': salary,
    'purchased': purchased
})

print("Dataset:")
print(data_large.head(10))
print(f"\nShape: {data_large.shape}")
print(f"Purchase rate: {data_large['purchased'].mean():.2%}")

In [None]:
# Separate features and target
X = data_large.drop('purchased', axis=1)
y = data_large['purchased']

# Split data FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training set:", X_train.shape)
print("Test set:    ", X_test.shape)

# Identify column types
nominal_cols = ['city', 'product']
ordinal_cols = ['education']
numeric_cols = ['age', 'salary']

print("\nColumn Types:")
print(f"  Nominal: {nominal_cols}")
print(f"  Ordinal: {ordinal_cols}")
print(f"  Numeric: {numeric_cols}")

In [None]:
# Encode nominal features (city, product)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_nominal = ohe.fit_transform(X_train[nominal_cols])
X_test_nominal = ohe.transform(X_test[nominal_cols])

nominal_feature_names = ohe.get_feature_names_out(nominal_cols)
print("Nominal features encoded:")
print(f"  Original: {nominal_cols}")
print(f"  Encoded:  {nominal_feature_names}")
print(f"  Shape: {X_train[nominal_cols].shape} → {X_train_nominal.shape}")

# Encode ordinal features (education)
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
oe = OrdinalEncoder(categories=[education_order])
X_train_ordinal = oe.fit_transform(X_train[ordinal_cols])
X_test_ordinal = oe.transform(X_test[ordinal_cols])

print("\nOrdinal features encoded:")
print(f"  {ordinal_cols} with order: {education_order}")

# Get numeric features
X_train_numeric = X_train[numeric_cols].values
X_test_numeric = X_test[numeric_cols].values

# Combine all features
X_train_processed = np.hstack([
    X_train_nominal,
    X_train_ordinal,
    X_train_numeric
])

X_test_processed = np.hstack([
    X_test_nominal,
    X_test_ordinal,
    X_test_numeric
])

print("\nFinal processed features:")
print(f"  Training: {X_train_processed.shape}")
print(f"  Test:     {X_test_processed.shape}")

In [None]:
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_processed, y_train)

# Predictions
y_pred_train = rf.predict(X_train_processed)
y_pred_test = rf.predict(X_test_processed)

# Evaluate
train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("="*60)
print("MODEL PERFORMANCE")
print("="*60)
print(f"Training Accuracy: {train_acc:.4f}")
print(f"Test Accuracy:     {test_acc:.4f}")

# Feature importance
all_feature_names = list(nominal_feature_names) + ordinal_cols + numeric_cols
feature_importance = pd.DataFrame({
    'feature': all_feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Important Features:")
print(feature_importance.head(10))

## Comparison: When to Use Each Encoder

| Scenario | Encoder | Reason |
|----------|---------|--------|
| Target variable for classification | **LabelEncoder** | Converts classes to 0, 1, 2, ... |
| Education level (High School → PhD) | **OrdinalEncoder** | Preserves natural order |
| City/Country names | **OneHotEncoder** | No inherent order |
| Product categories | **OneHotEncoder** | Independent categories |
| Temperature (Cold/Warm/Hot) | **OrdinalEncoder** | Has meaningful order |
| Size (S/M/L/XL) | **OrdinalEncoder** | Has size order |
| Color (Red/Blue/Green) | **OneHotEncoder** | No natural order |
| Rating (1-5 stars) | **OrdinalEncoder** | Ordered scale |

## Handling High Cardinality Features

**Problem**: Feature with 100+ categories → 100+ columns after OneHotEncoding

**Solutions**:

1. **Frequency Encoding**: Replace category with its frequency
2. **Target Encoding**: Replace with mean of target for that category
3. **Grouping**: Combine rare categories into 'Other'
4. **Hashing**: Use FeatureHasher (covered in text features notebook)
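
Target encoding (option 2) can be sketched with plain pandas. This is a simplified version under stated assumptions: the `train` and `test` frames are made up, no smoothing is applied, and unseen categories fall back to the global training mean. The mapping is learned from the training split only, to avoid leaking test labels.

```python
import pandas as pd

train = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B', 'B', 'C'],
    'purchased': [1, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({'city': ['A', 'B', 'D']})  # 'D' never appears in train

# Learn the mapping from the training split only
global_mean = train['purchased'].mean()
city_means = train.groupby('city')['purchased'].mean()

# Unseen categories fall back to the global mean
test['city_target_enc'] = test['city'].map(city_means).fillna(global_mean)
print(test)
```

Categories with very few rows get noisy means, so practical implementations usually add smoothing or cross-fitting on top of this idea.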

In [None]:
# Example: High cardinality feature
cities_many = ['City_' + str(i) for i in range(50)]  # 50 cities
city_sample = np.random.choice(cities_many, 100)

n_unique = len(np.unique(city_sample))
print(f"Number of unique cities: {n_unique}")
print(f"OneHotEncoding would create {n_unique} columns!\n")

# Solution 1: Frequency Encoding
city_counts = pd.Series(city_sample).value_counts()
city_freq_encoded = pd.Series(city_sample).map(city_counts)

print("Frequency Encoding:")
print(f"Original: {city_sample[:5]}")
print(f"Encoded:  {city_freq_encoded[:5].values}")
print("Each city → its frequency count")

# Solution 2: Group rare categories
top_10_cities = city_counts.head(10).index
city_grouped = pd.Series(city_sample).apply(
    lambda x: x if x in top_10_cities else 'Other'
)

print(f"\nGrouping: {city_counts.size} categories → {city_grouped.nunique()} categories")
print("Top 10 cities kept, rest grouped as 'Other'")

## Decision Flow Chart

```
START: Have categorical variable
  |
  ├─ Is it TARGET variable (y)?
  │   └─ YES → LabelEncoder
  |
  ├─ Is it FEATURE (X)?
  │   |
  │   ├─ Has natural ORDER?
  │   │   └─ YES → OrdinalEncoder
  │   │       (e.g., education, rating, temperature)
  │   |
  │   └─ NO ORDER (nominal)?
  │       └─ YES → OneHotEncoder
  │           (e.g., city, color, product)
  │       |
  │       ├─ High cardinality (>50 categories)?
  │       │   └─ Consider: Frequency/Target encoding or grouping
  │       |
  │       └─ Use drop='first' for linear models
```
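
The flow chart can be written as a small function. `choose_encoder` and its parameters are hypothetical names introduced for this sketch, and the threshold of 50 simply follows the chart above.

```python
def choose_encoder(is_target: bool, is_ordered: bool, n_categories: int,
                   high_cardinality_threshold: int = 50) -> str:
    """Return the encoder suggested by the flow chart (hypothetical helper)."""
    if is_target:
        return "LabelEncoder"
    if is_ordered:
        return "OrdinalEncoder"
    if n_categories > high_cardinality_threshold:
        return "Frequency/Target encoding or grouping"
    return "OneHotEncoder"


print(choose_encoder(is_target=True, is_ordered=False, n_categories=2))
print(choose_encoder(is_target=False, is_ordered=True, n_categories=4))
print(choose_encoder(is_target=False, is_ordered=False, n_categories=4))
print(choose_encoder(is_target=False, is_ordered=False, n_categories=500))
```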

## Key Takeaways

1. **LabelEncoder**: Only for target variable (y)
2. **OrdinalEncoder**: For features with meaningful order
3. **OneHotEncoder**: For features without order (most common)
4. **Always fit on training data**, transform on test
5. **Use `drop='first'`** for linear models to avoid dummy variable trap
6. **Use `handle_unknown='ignore'`** to handle new categories in test set
7. **High cardinality**: Consider alternatives to OneHotEncoding
8. **Pandas get_dummies**: Quick but not for production

## Common Mistakes to Avoid

❌ Using LabelEncoder for features (creates false ordering)  
❌ Fitting encoder on test data  
❌ Encoding before train-test split  
❌ Not handling unknown categories  
❌ Forgetting to combine encoded features with numeric ones  

✅ OneHotEncode nominal features  
✅ Ordinal encode ordered features  
✅ Fit on train, transform on test  
✅ Use pipelines for proper workflow  
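
As an illustration of the last point, the manual encode-and-stack steps can be wrapped in a `ColumnTransformer` inside a `Pipeline`. This is a minimal sketch on a tiny made-up frame whose column names mirror the customer example; the step names (`nominal`, `ordinal`, `preprocess`, `classifier`) are arbitrary labels chosen here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Tiny illustrative training frame
X_train = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Chennai', 'Delhi'],
    'education': ['High School', 'Bachelor', 'PhD', 'Master'],
    'age': [25, 32, 47, 51],
})
y_train = [0, 1, 1, 0]

education_order = ['High School', 'Bachelor', 'Master', 'PhD']

preprocess = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(handle_unknown='ignore'), ['city']),
        ('ordinal', OrdinalEncoder(categories=[education_order]), ['education']),
    ],
    remainder='passthrough',  # numeric columns pass through unchanged
)

model = Pipeline([
    ('preprocess', preprocess),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Encoders are fitted on the training data only
model.fit(X_train, y_train)

# New data with an unseen city is encoded with the already-fitted encoders
X_new = pd.DataFrame({'city': ['Kolkata'], 'education': ['Master'], 'age': [30]})
prediction = model.predict(X_new)
print(prediction)
```

Because fitting and transforming happen inside the pipeline, the "fit on train, transform on test" rule is applied automatically, including during cross-validation.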