# Categorical Encoding Techniques

In machine learning, categorical variables are those that contain discrete values representing categories or groups. Many machine learning algorithms require numerical input, so we need to convert categorical variables into numerical representations. This process is known as **categorical encoding**.

In this notebook, we'll explore various encoding techniques and when to use each one.

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

## Sample Dataset

Let's create a sample dataset with different types of categorical variables:

In [None]:
# Create a sample dataset
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'yellow', 'purple'],
    'size': ['small', 'medium', 'large', 'small', 'large', 'medium', 'large', 'small'],
    'brand': ['Nike', 'Adidas', 'Puma', 'Nike', 'Reebok', 'Adidas', 'Puma', 'Nike'],
    'rating': ['low', 'medium', 'high', 'medium', 'high', 'low', 'high', 'medium'],
    'price': [25.99, 45.50, 35.75, 29.99, 55.00, 42.25, 38.99, 27.50]
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
print("\nData Types:")
print(df.dtypes)

## 1. Label Encoding

**Label Encoding** assigns each unique category an integer. scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order of the categories, and it is intended for encoding target labels (`y`) rather than input features.

**When to use:**
- For encoding the target variable of a classifier
- For tree-based algorithms, which can often split effectively on integer-coded features

**Pros:**
- Simple and intuitive
- Adds only one column per feature

**Cons:**
- Does not preserve a meaningful order: `low`, `medium`, `high` become 1, 2, 0 (use Ordinal Encoding when order matters)
- May introduce unintended ordinal relationships for nominal data
- Can mislead algorithms into thinking nearby integers are similar

In [None]:
# Label Encoding for the rating column
# Note: LabelEncoder sorts categories alphabetically, so high=0, low=1, medium=2
label_encoder = LabelEncoder()
df_label = df.copy()
df_label['rating_encoded'] = label_encoder.fit_transform(df['rating'])

print("Label Encoding for 'rating' column:")
print(pd.DataFrame({'Original': df['rating'], 'Encoded': df_label['rating_encoded']}))
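
Because `LabelEncoder` orders its classes alphabetically, the codes it produces for `rating` do not follow low < medium < high. The sketch below inspects the learned mapping and contrasts it with a hand-written mapping. The names `label_mapping` and `rating_order` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ratings = pd.Series(['low', 'medium', 'high', 'medium', 'high', 'low', 'high', 'medium'])

# LabelEncoder sorts the classes, so the codes follow alphabetical order
le = LabelEncoder()
codes = le.fit_transform(ratings)
label_mapping = {str(cls): int(code) for code, cls in enumerate(le.classes_)}
print(label_mapping)  # {'high': 0, 'low': 1, 'medium': 2}

# An explicit mapping (illustrative) keeps the intended order low < medium < high
rating_order = {'low': 0, 'medium': 1, 'high': 2}
ordered_codes = ratings.map(rating_order)
print(ordered_codes.tolist())  # [0, 1, 2, 1, 2, 0, 2, 1]
```

If the order of the categories carries meaning, the explicit mapping is the safer choice of the two.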

## 2. One-Hot Encoding

**One-Hot Encoding** creates binary columns for each category. Each binary column represents the presence (1) or absence (0) of a category.

**When to use:**
- For nominal data where no order exists (e.g., colors, brands)
- For linear models which don't handle label encoded features well

**Pros:**
- No ordinal relationship assumed
- Works well with linear models

**Cons:**
- Can lead to high dimensionality (curse of dimensionality)
- Creates sparse matrices

In [None]:
# One-Hot Encoding for nominal data (color)
df_onehot = df.copy()
# dtype=int yields 0/1 columns (recent pandas versions default to booleans)
one_hot_encoded = pd.get_dummies(df['color'], prefix='color', dtype=int)
df_onehot = pd.concat([df_onehot, one_hot_encoded], axis=1)

print("One-Hot Encoding for 'color' column:")
print(df_onehot[['color', 'color_blue', 'color_green', 'color_purple', 'color_red', 'color_yellow']])

In [None]:
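# Using sklearn's OneHotEncoder
# sparse_output=False returns a dense NumPy array
# (scikit-learn >= 1.2; older versions use sparse=False instead)
onehot_encoder = OneHotEncoder(sparse_output=False)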
encoded_array = onehot_encoder.fit_transform(df[['brand']])

print("Using sklearn's OneHotEncoder for 'brand' column:")
print("Categories:", onehot_encoder.categories_)
print("Encoded array shape:", encoded_array.shape)
print("First few rows of encoded data:")
print(encoded_array[:5])
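
In practice the encoder is usually fitted on training data and then applied to new data, which may contain categories it has never seen. The following is a minimal sketch of that situation using the `handle_unknown='ignore'` option of `OneHotEncoder`. The tiny `train` and `test` frames are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'brand': ['Nike', 'Adidas', 'Puma', 'Nike']})
test = pd.DataFrame({'brand': ['Puma', 'Reebok']})  # 'Reebok' never appears in train

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The default output is a SciPy sparse matrix; convert to dense for display
test_encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])  # categories learned from train: Adidas, Nike, Puma
print(test_encoded)            # the 'Reebok' row contains only zeros
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error on the unseen brand instead.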

## 3. Ordinal Encoding

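**Ordinal Encoding** is similar to label encoding but allows you to specify the order of categories explicitly. scikit-learn's `OrdinalEncoder` is designed for input features and expects 2D input (one or more columns).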

**When to use:**
- For ordinal data where you want to specify the exact order
- When you want explicit control over the mapping

In [None]:
# Ordinal Encoding for size (small < medium < large)
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df_ordinal = df.copy()
# fit_transform returns an array of shape (n_rows, 1); ravel() flattens it to 1-D
df_ordinal['size_encoded'] = ordinal_encoder.fit_transform(df[['size']]).ravel()

print("Ordinal Encoding for 'size' column:")
print(pd.DataFrame({'Original': df['size'], 'Encoded': df_ordinal['size_encoded'].astype(int)}))
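
The same explicit ordering can also be expressed in pandas alone with an ordered categorical dtype. This is an alternative sketch, not part of the scikit-learn workflow above. The names `size_dtype` and `size_cat` are illustrative.

```python
import pandas as pd

sizes = pd.Series(['small', 'medium', 'large', 'small', 'large'])

# Declare the order explicitly; .cat.codes follows the declared order
size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
size_cat = sizes.astype(size_dtype)
size_codes = size_cat.cat.codes
print(size_codes.tolist())  # [0, 1, 2, 0, 2]

# Ordered categoricals also support comparisons against a category
print((size_cat > 'small').tolist())  # [False, True, True, False, True]
```

This keeps the original labels in the column while still giving access to the integer codes.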

## 4. Target Encoding (Mean Encoding)

**Target Encoding** replaces categories with the mean of the target variable for that category.

**When to use:**
- When dealing with high cardinality categorical features
- When you suspect a strong relationship between the category and target

**Pros:**
- Creates one numerical feature
- Captures relationship between category and target

**Cons:**
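- Risk of overfitting through target leakage: each row's encoding includes its own target value
- Should be fitted on training data only, ideally with out-of-fold means or smoothing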

In [None]:
# Target Encoding example (we'll create a dummy target variable)
np.random.seed(42)
df_target = df.copy()
df_target['sales'] = np.random.randint(10, 100, size=len(df))

# Calculate mean sales for each brand
# (computed on all rows for illustration; in practice use training rows only)
brand_mean_sales = df_target.groupby('brand')['sales'].mean()
df_target['brand_target_encoded'] = df_target['brand'].map(brand_mean_sales)

print("Target Encoding for 'brand' column based on mean sales:")
print(pd.DataFrame({
    'Brand': df_target['brand'], 
    'Sales': df_target['sales'],
    'Target_Encoded': df_target['brand_target_encoded']
}))
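
One common way to reduce the leakage described above is to compute the category means out-of-fold, so that no row is encoded using its own target value. The sketch below assumes made-up `sales` values and a 4-fold split. The column name `brand_te_oof` and the fallback to the mean of the fitting rows are choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df_oof = pd.DataFrame({
    'brand': ['Nike', 'Adidas', 'Puma', 'Nike', 'Reebok', 'Adidas', 'Puma', 'Nike'],
    'sales': [61, 24, 81, 70, 30, 92, 96, 84],  # made-up target values
})
df_oof['brand_te_oof'] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df_oof):
    fit_rows = df_oof.iloc[fit_idx]
    # Category means come only from the rows outside the current fold
    fold_means = fit_rows.groupby('brand')['sales'].mean()
    # Brands absent from the fitting rows fall back to the mean of those rows
    fallback = fit_rows['sales'].mean()
    encoded = df_oof['brand'].iloc[enc_idx].map(fold_means).fillna(fallback)
    df_oof.loc[df_oof.index[enc_idx], 'brand_te_oof'] = encoded.to_numpy()

print(df_oof)
```

`Reebok` appears only once, so its encoding always comes from the fallback rather than from its own sales figure.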

## 5. Frequency Encoding

**Frequency Encoding** replaces categories with their frequency (count) in the dataset.

**When to use:**
- When the frequency of a category is informative
- For high cardinality features

**Pros:**
- Simple to implement
- Preserves information about category prevalence

**Cons:**
- May not capture relationship with target
- Categories with same frequency get same encoding

In [None]:
# Frequency Encoding
df_freq = df.copy()
freq_map = df['color'].value_counts().to_dict()
df_freq['color_freq_encoded'] = df['color'].map(freq_map)

print("Frequency Encoding for 'color' column:")
print(pd.DataFrame({
    'Color': df['color'], 
    'Frequency': df_freq['color_freq_encoded']
}))
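
A variant of the same idea uses relative frequencies instead of raw counts, which keeps the encoded values comparable across datasets of different sizes. This is a minimal sketch. Mapping unseen categories to `0.0` is an assumption made for the example, not a fixed rule.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red', 'blue', 'green', 'yellow', 'purple'])

# normalize=True returns each category's share of the rows instead of a raw count
rel_freq = colors.value_counts(normalize=True)
colors_rel_freq = colors.map(rel_freq)
print(colors_rel_freq.tolist())
# [0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.125, 0.125]

# Categories missing from the fitted map become NaN; here they are set to 0.0
new_colors = pd.Series(['red', 'orange'])
new_encoded = new_colors.map(rel_freq).fillna(0.0)
print(new_encoded.tolist())  # [0.25, 0.0]
```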

## 6. Binary Encoding

**Binary Encoding** is a compromise between Ordinal Encoding and One-Hot Encoding. It first converts categories to ordinal values, then writes those integers in binary, and splits the binary digits into separate columns, so N categories need only about log2(N) columns.

**When to use:**
- For high cardinality categorical features
- When you want to reduce dimensionality compared to One-Hot Encoding

**Pros:**
- Lower dimensionality than One-Hot Encoding
- Every category still gets a unique bit pattern

**Cons:**
- Less interpretable
- Requires additional library (category_encoders)

In [None]:
# Note: Binary Encoding requires the category_encoders library
# Uncomment the following lines if you have category_encoders installed

# !pip install category_encoders

# import category_encoders as ce
#
# # Binary Encoding
# binary_encoder = ce.BinaryEncoder(cols=['brand'])
# df_binary = binary_encoder.fit_transform(df[['brand']])
# print("Binary Encoding for 'brand' column:")
# print(df_binary.head())
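
If `category_encoders` is not installed, the three steps described above can be reproduced by hand with pandas. This is an illustrative sketch only: starting the codes at 1 and naming the columns `brand_0`, `brand_1`, ... are assumptions, and the exact column layout produced by `category_encoders` may differ.

```python
import pandas as pd

brands = pd.Series(['Nike', 'Adidas', 'Puma', 'Nike', 'Reebok', 'Adidas', 'Puma', 'Nike'])

# Step 1: ordinal codes, starting at 1 in order of first appearance
codes = pd.factorize(brands)[0] + 1  # Nike=1, Adidas=2, Puma=3, Reebok=4

# Step 2: number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant bit first
bit_columns = {
    f'brand_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
df_binary_manual = pd.DataFrame(bit_columns)
print(df_binary_manual)
```

With only four brands the saving is small (three columns instead of four), but the column count grows logarithmically: codes up to 1000 fit in 10 columns.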

## Complete Example: Applying Multiple Encodings

Let's apply different encoding techniques to our sample dataset:

In [None]:
# Create a complete example with multiple encodings
df_final = df.copy()

# 1. One-Hot Encoding for nominal variables
df_final = pd.get_dummies(df_final, columns=['color', 'brand'], prefix=['color', 'brand'])

# 2. Ordinal Encoding for size
size_mapping = {'small': 0, 'medium': 1, 'large': 2}
df_final['size_ordinal'] = df['size'].map(size_mapping)

# 3. Explicit integer mapping for rating (keeps the order low < medium < high)
rating_mapping = {'low': 0, 'medium': 1, 'high': 2}
df_final['rating_label'] = df['rating'].map(rating_mapping)

print("Dataset after applying various encoding techniques:")
print(df_final.head())
print("\nShape of final dataset:", df_final.shape)

## Using ColumnTransformer for Pipeline Integration

ColumnTransformer allows you to apply different transformations to different columns in a clean and systematic way:

In [None]:
# Using ColumnTransformer for systematic encoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Define the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first'), ['color', 'brand']),  # One-hot encode color and brand
        ('ordinal', OrdinalEncoder(categories=[['small', 'medium', 'large']]), ['size']),  # Ordinal encode size
        ('label', OrdinalEncoder(categories=[['low', 'medium', 'high']]), ['rating'])  # Ordinal encode rating (low < medium < high)
    ],
    remainder='passthrough'  # Keep other columns (price) as is
)

# Apply transformations
transformed_data = preprocessor.fit_transform(df)

print("Transformed data shape:", transformed_data.shape)
print("First few rows of transformed data:")
print(transformed_data[:5])

## Choosing the Right Encoding Technique

| Technique | Best For | Pros | Cons |
|-----------|----------|------|------|
| Label Encoding | Target labels, Tree-based models | Simple, one column | Alphabetical codes imply a false order |
| One-Hot Encoding | Nominal data, Linear models | No assumptions about order | High dimensionality |
| Ordinal Encoding | Ordered categories | Explicit control | Need to define order |
| Target Encoding | High cardinality, predictive power | Reduces dimensions | Risk of overfitting |
| Frequency Encoding | Category prevalence | Simple, informative | Doesn't capture target relationship |
| Binary Encoding | High cardinality | Fewer columns than One-Hot | Less interpretable, needs `category_encoders` |

**General Guidelines:**
1. **Nominal data** (no order): Use One-Hot Encoding
2. **Ordinal data** (ordered): Use Ordinal Encoding with an explicit category order
3. **High cardinality**: Consider Target, Frequency, or Binary Encoding
4. **Tree-based models**: Integer codes (Ordinal or Label Encoding) often work fine
5. **Linear models**: One-Hot Encoding is usually better
6. **Small datasets**: Be cautious with Target Encoding (risk of overfitting)
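
As a rough illustration, the guidelines above can be turned into a small helper that looks at whether a column is ordered and how many distinct values it has. `suggest_encoding` is a hypothetical function written for this notebook, and the `max_onehot_levels` threshold is an arbitrary assumption rather than an established rule.

```python
import pandas as pd

def suggest_encoding(series, ordered=False, max_onehot_levels=10):
    """Return a rough encoding suggestion for one categorical column.

    Hypothetical helper: the threshold is an arbitrary illustration, not a rule.
    """
    n_levels = series.nunique()
    if ordered:
        return 'ordinal'
    if n_levels <= max_onehot_levels:
        return 'one-hot'
    return 'target or frequency'

# Made-up demo data
df_demo = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['small', 'medium', 'large', 'small'],
    'zip_code': ['10001', '94105', '60601', '73301'],
})

print(suggest_encoding(df_demo['color']))                          # one-hot
print(suggest_encoding(df_demo['size'], ordered=True))             # ordinal
print(suggest_encoding(df_demo['zip_code'], max_onehot_levels=3))  # target or frequency
```

A helper like this is only a starting point. The model type and the size of the dataset should still inform the final choice.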

## Summary

Categorical encoding is a crucial preprocessing step in machine learning. The choice of encoding technique depends on:
- The nature of your categorical data (nominal vs. ordinal)
- The machine learning algorithm you're using
- The cardinality of your categorical features
- The relationship between categories and your target variable

Choosing an encoding that matches your data and your model can noticeably improve model performance, and it helps you avoid pitfalls such as spurious orderings and target leakage.