# Encoding Techniques
Transforming Categorical Data Without Losing Meaning
## Objective

This notebook provides a systematic treatment of categorical encoding, covering:

- Nominal vs ordinal variables

- Low vs high cardinality

- Statistical and target-aware encodings

- Leakage risks

- Encoding inside pipelines

It answers:

> How do we encode categorical variables in a way that preserves information, avoids leakage, and scales to production?

## Why Encoding Is a Critical Design Choice

Incorrect encoding can:

- Destroy ordinality

- Introduce leakage

- Inflate dimensionality

- Degrade model performance

- Break deployment pipelines

Encoding is not cosmetic — it is a modeling decision.

## Imports and Dataset

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


We use the synthetic customer churn classification dataset.

In [2]:
df = pd.read_csv("../datasets/synthetic_customer_churn_classification_complete.csv")


# Step 1 – Identify Categorical Feature Types

In [3]:
categorical_cols = df.select_dtypes(include="object").columns
categorical_cols


Index(['satisfaction_level', 'customer_segment', 'region'], dtype='object')

Feature classification:


| Feature            | Type                       |
| ------------------ | -------------------------- |
| region             | Nominal (low cardinality)  |
| customer_segment   | Nominal (high cardinality) |
| satisfaction_level | Ordinal                    |
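
The nominal/ordinal split requires domain knowledge, but the low/high cardinality split can be read off the data. The sketch below counts distinct levels per column on a toy frame. The values are made up and the cutoff of 10 levels is a hypothetical choice, not a rule from this notebook.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "region": ["North", "South", "East", "West"] * 5,
    "customer_segment": [f"segment_{i}" for i in range(20)],
    "satisfaction_level": ["Low", "Medium", "High", "Medium"] * 5,
})

# Number of distinct levels per column, and that number relative to row count.
summary = pd.DataFrame({
    "n_unique": toy.nunique(),
    "unique_per_row": toy.nunique() / len(toy),
})

# Hypothetical cutoff: flag columns with more than 10 distinct levels.
HIGH_CARDINALITY_CUTOFF = 10
summary["high_cardinality"] = summary["n_unique"] > HIGH_CARDINALITY_CUTOFF

print(summary)
```

A column whose number of levels approaches the number of rows, as customer_segment does in this toy frame, is a strong signal that plain one-hot encoding will not work well.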






# Step 2 – Ordinal Encoding (Preserve Order)
__When to Use__

- Clear, meaningful order

- Treating adjacent levels as roughly equally spaced is acceptable

__Example: Satisfaction Level__




In [4]:
mode_sat_level = df[["satisfaction_level"]].mode().values[0][0]
mode_sat_level

'Medium'

In [5]:
satisfaction_level_no_missing = df[["satisfaction_level"]].fillna(value=mode_sat_level)

In [6]:
ordinal_encoder = OrdinalEncoder(
    categories=[["Very Low", "Low", "Medium", "High", "Very High"]]
)

df["satisfaction_encoded"] = ordinal_encoder.fit_transform(
    satisfaction_level_no_missing 
)

df[["satisfaction_level", "satisfaction_encoded"]].head()


|   | satisfaction_level | satisfaction_encoded |
| - | ------------------ | -------------------- |
| 0 | NaN                | 2.0                  |
| 1 | Very High          | 4.0                  |
| 2 | Medium             | 2.0                  |
| 3 | NaN                | 2.0                  |
| 4 | High               | 3.0                  |


- [pos] - Preserves order
- [neg] - Assumes equal spacing (acceptable for trees, risky for linear models)
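
A hard-coded category list fails at transform time if a level outside the list shows up. A minimal sketch of one fallback, assuming scikit-learn 0.24 or later: handle_unknown="use_encoded_value" maps unexpected levels to a sentinel. The sentinel -1 and the level "Excellent" are illustrative choices.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = ["Very Low", "Low", "Medium", "High", "Very High"]

encoder = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",  # do not raise on unexpected levels
    unknown_value=-1,                    # sentinel for anything outside `levels`
)

# Fit on a small training sample; the order comes from `levels`, not the data.
train = np.array([["Low"], ["Medium"], ["Very High"]], dtype=object)
encoder.fit(train)

# "High" is in the declared order; "Excellent" is not.
new = np.array([["High"], ["Excellent"]], dtype=object)
encoded = encoder.transform(new)
print(encoded)
```

Whether -1 is a sensible value depends on the model: a tree can split it off cleanly, while a linear model will treat it as one step below "Very Low".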

# Step 3 – One-Hot Encoding (Nominal, Low Cardinality)
__When to Use__

- No natural order

- Small number of categories

In [7]:
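ohe = OneHotEncoder(
    drop="first",             # drop one level to avoid perfectly collinear columns
    handle_unknown="ignore",  # unseen regions become all zeros, same as the dropped level
    sparse_output=False       # return a dense array (scikit-learn >= 1.2)
)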

encoded_region = ohe.fit_transform(df[["region"]])

pd.DataFrame(
    encoded_region,
    columns=ohe.get_feature_names_out(["region"])
).head()


|   | region_North | region_South | region_West |
| - | ------------ | ------------ | ----------- |
| 0 | 0.0          | 1.0          | 0.0         |
| 1 | 0.0          | 0.0          | 1.0         |
| 2 | 1.0          | 0.0          | 0.0         |
| 3 | 1.0          | 0.0          | 0.0         |
| 4 | 0.0          | 0.0          | 0.0         |


- [pos] -  Interpretable
- [neg] -  Dimensionality grows with categories



# Step 4 – High Cardinality Strategies
## Frequency Encoding (Conceptual)

Replace each category with its relative frequency in the data.

In [8]:
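# Relative frequency of each segment. Computed on the full frame for illustration;
# in a real workflow, derive the map from the training split only.
freq_map = df["customer_segment"].value_counts(normalize=True)

df["segment_freq"] = df["customer_segment"].map(freq_map)
df[["customer_segment", "segment_freq"]].head()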


|   | customer_segment | segment_freq |
| - | ---------------- | ------------ |
| 0 | segment_18       | 0.0068       |
| 1 | segment_98       | 0.0069       |
| 2 | segment_134      | 0.0079       |
| 3 | segment_72       | 0.0073       |
| 4 | segment_147      | 0.0062       |


- [pos] -  Simple
- [pos] -  Scales well
- [neg] -  Loses category identity
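
A frequency map is a learned quantity, so it should be derived from training data and then applied to new data. A minimal sketch with toy values; filling unseen levels with 0.0 is one possible convention, not the only one.

```python
import pandas as pd

train = pd.Series(["a", "a", "a", "b", "b", "c"], name="customer_segment")
test = pd.Series(["a", "c", "z"], name="customer_segment")  # "z" is unseen

# Learn relative frequencies from the training split only.
freq_map = train.value_counts(normalize=True)

train_freq = train.map(freq_map)

# Levels absent from the training split map to NaN; treat them as frequency 0.
test_freq = test.map(freq_map).fillna(0.0)

print(test_freq.tolist())
```

Two levels with the same frequency receive the same code, which is the "loses category identity" drawback in concrete form.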

## Rare Category Grouping (Recommended)




In [9]:
min_freq = 0.02

segment_counts = df["customer_segment"].value_counts(normalize=True)
rare_segments = segment_counts[segment_counts < min_freq].index

df["segment_grouped"] = df["customer_segment"].replace(
    rare_segments, "Other"
)

df["segment_grouped"].value_counts(normalize=True)


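segment_grouped
Other    1.0
Name: proportion, dtype: float64

Note what happened: every segment accounts for less than 2% of rows, so the 0.02 threshold folds all of them into Other and the grouped column carries no information. On this dataset the threshold has to sit well below 0.02 for rare grouping to be useful, and the same applies wherever min_frequency=0.02 is passed to OneHotEncoder for this column.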
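
How much survives rare grouping depends entirely on the threshold. The sketch below wraps the grouping in a hypothetical helper, group_rare, and applies two thresholds to toy data: one that keeps the dominant levels and one that collapses everything.

```python
import pandas as pd


def group_rare(values: pd.Series, min_freq: float, other: str = "Other") -> pd.Series:
    """Replace levels whose share of rows is below min_freq with `other`."""
    shares = values.value_counts(normalize=True)
    keep = shares[shares >= min_freq].index
    # Rows whose level is not kept (including missing values) become `other`.
    return values.where(values.isin(keep), other)


# Shares: a=0.50, b=0.30, c=0.15, d=0.03, e=0.02
segments = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 3 + ["e"] * 2)

grouped = group_rare(segments, min_freq=0.10)    # keeps a, b, c
collapsed = group_rare(segments, min_freq=0.60)  # no level reaches 60%

print(grouped.value_counts(normalize=True))
print(collapsed.value_counts(normalize=True))
```

Checking the number of surviving levels after grouping is a cheap guard against a threshold that silently removes the feature.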

## OneHotEncoder with min_frequency (Pipeline-Safe)

In [10]:
ohe_high = OneHotEncoder(
    min_frequency=0.02,       # pool levels below 2% of rows into one "infrequent" column
    handle_unknown="ignore"   # unseen levels encode as all zeros
)

ohe_high.fit(df[["customer_segment"]])


- [pos] - Production-safe
- [pos] -  Automatically groups rare levels

# Step 5 – Target Encoding (Leakage Risk)
## Concept

Replace each category with a statistic of the target, typically the mean target value for that category.

In [11]:
df.groupby("customer_segment")["churn"].mean().head()


customer_segment
segment_1      0.129032
segment_10     0.242857
segment_100    0.173913
segment_101    0.176471
segment_102    0.226190
Name: churn, dtype: float64


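Never compute these statistics on the full dataset: each row's encoding would then include its own label, which leaks the target into the feature.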

- [pos] -  Only inside CV folds
- [pos] -  Prefer libraries like category_encoders
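
To make "only inside CV folds" concrete, here is a minimal out-of-fold sketch. The helper oof_target_encode is hypothetical and deliberately simple (no smoothing); the toy data is invented. Each row is encoded with category means computed from the other folds, so its own label never contributes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Encode `col` with target means learned on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(col)[target].mean()
        # Levels missing from the fitting folds fall back to their overall mean.
        fallback = fit_part[target].mean()
        encoded.iloc[enc_idx] = (
            df[col].iloc[enc_idx].map(means).fillna(fallback).to_numpy()
        )
    return encoded


toy = pd.DataFrame({
    "customer_segment": ["a", "a", "a", "a", "b", "b", "b", "b", "c"],
    "churn":            [1, 0, 1, 1, 0, 0, 1, 0, 1],
})
toy["segment_te"] = oof_target_encode(toy, "customer_segment", "churn", n_splits=3)
print(toy)
```

Segment "c" appears once with churn = 1. A full-dataset mean would encode it as exactly 1.0, a perfect copy of its label; the out-of-fold version gives it the fallback mean instead. Library implementations usually add smoothing toward the global mean on top of this.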

# Step 6 – Encoding Inside Pipelines (Correct Approach)
## Feature Split




In [12]:
numeric_features = [
    "age", "income", "tenure_years",
    "avg_monthly_usage", "support_tickets_last_year"
]

ordinal_features = ["satisfaction_level"]
nominal_low = ["region"]
nominal_high = ["customer_segment"]


## Pipeline

In [13]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler())
])

ordinal_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(
        categories=[["Very Low", "Low", "Medium", "High", "Very High"]],
        handle_unknown="use_encoded_value",  # fallback for levels outside the list
        unknown_value=-1
    ))
])

nominal_low_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(
        drop="first",
        handle_unknown="ignore"  # unseen regions encode as all zeros, like the dropped level
    ))
])

nominal_high_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(
        min_frequency=0.02,
        handle_unknown="ignore"
    ))
])


## ColumnTransformer

In [14]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("ord", ordinal_pipeline, ordinal_features),
        ("nom_low", nominal_low_pipeline, nominal_low),
        ("nom_high", nominal_high_pipeline, nominal_high)
    ]
)


- [pos] -  Leakage-safe, provided the transformer is fit on the training split only
- [pos] -  Reproducible
- [pos] -  Deployment-ready
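
A ColumnTransformer only prevents leakage if it learns its statistics from training rows. The sketch below shows the intended order of operations on a small synthetic frame: split first, fit on the training split, then transform the test split. The data is randomly generated and the transformer is a reduced version of the one above (one numeric column, one nominal column).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Synthetic stand-in for the churn data (values are random, not real).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "region": rng.choice(["North", "South", "East", "West"], size=n),
    "churn": rng.integers(0, 2, size=n),
})
toy.loc[::25, "age"] = np.nan  # inject some missing values

X = toy.drop(columns="churn")
y = toy["churn"]

# 1. Split before any statistic is learned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, ["age"]),
    ("nom", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# 2. Learn medians, scales, and category lists from the training split only.
X_train_enc = preprocessor.fit_transform(X_train)

# 3. Apply the already-fitted transformer to the test split.
X_test_enc = preprocessor.transform(X_test)

learned_median = (
    preprocessor.named_transformers_["num"].named_steps["imputer"].statistics_[0]
)
print(X_train_enc.shape, X_test_enc.shape, learned_median)
```

Placing the preprocessor and an estimator in a single Pipeline and passing that to cross-validation repeats this fit/transform discipline inside every fold automatically.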


# Step 7 – Common Encoding Mistakes (Avoided)

- [pos] - Label encoding nominal variables

- [pos] - One-hot encoding high-cardinality blindly

- [pos] - Target encoding outside CV

- [pos] - Encoding before train/test split

- [pos] - Hard-coding categories without fallback

# Summary Table


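| Encoding                         | Use Case                                                |
| -------------------------------- | ------------------------------------------------------- |
| OrdinalEncoder                   | Ordered categories                                      |
| OneHotEncoder                    | Nominal, low cardinality                                |
| Frequency encoding               | High cardinality, when category identity can be dropped |
| Rare grouping                    | High cardinality with a long tail of rare levels        |
| Target encoding                  | High cardinality, computed only inside CV folds         |
| OneHotEncoder with min_frequency | Rare grouping inside a pipeline                         |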


## Key Takeaways

- Encoding is a modeling decision

- Preserve meaning first, optimize later

- High cardinality requires explicit strategy

- Pipelines prevent leakage and deployment bugs

- Default ≠ correct