# 04 – Feature Transformation and Encoding
    Converting Raw Attributes into Model-Consumable Signals



### Objective

This notebook provides a systematic treatment of **feature transformation and encoding**, covering:

- Why transformation is not optional
- Ordinal vs nominal encoding
- Cardinality-aware encoding strategies
- Target leakage risks in encoding
- Transformation inside pipelines

It answers:

    How do we transform heterogeneous features into numeric representations without destroying meaning or leaking information?


### Why Transformation and Encoding Matter

Machine learning models operate on numbers — not meaning.

Poor encoding can:
- Introduce artificial order
- Inflate dimensionality
- Leak target information
- Degrade model generalization

Encoding is not a mechanical step — it is a modeling decision.



### Imports and Dataset


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("../datasets/03_feature_engineering/customer_feature_encoding_benchmark.csv")
df.head()

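|   | customer_id | age | income  | avg_monthly_usage | support_tickets | satisfaction_level | region | customer_segment | churn |
| - | ----------- | --- | ------- | ----------------- | --------------- | ------------------ | ------ | ---------------- | ----- |
| 0 | 0           | 69  | 32133.0 | 81.606182         | 0               | Low                | North  | segment_20       | 0     |
| 1 | 1           | 32  | 17875.0 | 28.316858         | 1               | Medium             | East   | segment_12       | 0     |
| 2 | 2           | 78  | 26139.0 | 41.475782         | 1               | High               | East   | segment_11       | 0     |
| 3 | 3           | 38  | 54872.0 | 82.361651         | 1               | Low                | East   | segment_9        | 0     |
| 4 | 4           | 41  | 48679.0 | 57.360148         | 0               | Very High          | South  | segment_10       | 0     |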


## Step 1 – Feature Type Audit

Before encoding, we must understand feature semantics.




In [3]:
df.dtypes

customer_id             int64
age                     int64
income                float64
avg_monthly_usage     float64
support_tickets         int64
satisfaction_level     object
region                 object
customer_segment       object
churn                   int64
dtype: object

## Feature Categories

We distinguish:

- **Numeric continuous**: `age`, `income`, `avg_monthly_usage`
- **Numeric discrete**: `support_tickets`
- **Ordinal categorical**: `satisfaction_level`
- **Nominal categorical**: `region` (low cardinality), `customer_segment` (high cardinality)
- **Identifiers**: `customer_id` (never encoded)
- **Target**: `churn`


| Column               | Type                             | Planned treatment                                      |
| -------------------- | -------------------------------- | ------------------------------------------------------ |
| `customer_id`        | Identifier                       | Never encoded                                          |
| `age`                | Numeric continuous               | Numeric branch of the pipeline                         |
| `income`             | Numeric continuous (skewed)      | Log transform                                          |
| `avg_monthly_usage`  | Numeric continuous               | Used as-is                                             |
| `support_tickets`    | Numeric discrete                 | Count feature, used as-is                              |
| `satisfaction_level` | Ordinal categorical              | OrdinalEncoder                                         |
| `region`             | Nominal categorical (low card.)  | One-Hot                                                |
| `customer_segment`   | Nominal categorical (high card.) | Frequency / Target                                     |
| `churn`              | Target                           | Never an input; used only to compute target encodings  |
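
The split between low- and high-cardinality columns in the table can be checked rather than assumed. The sketch below counts distinct values per categorical column on a small toy frame; the toy values and the `MAX_OHE_LEVELS` cut-off are illustrative assumptions, not part of the benchmark dataset.

```python
import pandas as pd

# Toy stand-in for the benchmark data; values are illustrative only.
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "satisfaction_level": ["Low", "High", "Medium", "Low", "Very High", "High"],
    "region": ["North", "East", "East", "South", "West", "North"],
    "customer_segment": [f"segment_{i}" for i in range(6)],
})

# Categorical candidates are listed explicitly so the identifier is excluded.
cat_cols = ["satisfaction_level", "region", "customer_segment"]
cardinality = toy[cat_cols].nunique().sort_values(ascending=False)

# Hypothetical cut-off: above it, one-hot encoding is considered too wide.
MAX_OHE_LEVELS = 5
high_card = cardinality[cardinality > MAX_OHE_LEVELS].index.tolist()
low_card = cardinality[cardinality <= MAX_OHE_LEVELS].index.tolist()

print(cardinality)
print("high cardinality:", high_card)
print("low cardinality:", low_card)
```

On the real data the same check would be run as `df[cat_cols].nunique()`; the right threshold depends on the number of rows and the model, so treat the value here as a placeholder.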


## Step 2 – Ordinal Encoding

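Ordinal features have **intrinsic order**, and the encoding must preserve that rank.
Integer codes also imply equal spacing between levels, which the data does not guarantee, so distance-sensitive models (linear models, k-NN) may read more into the codes than the rank alone justifies.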

### Ordinal Mapping

In [4]:
satisfaction_mapping = {
    "Very Low": 1,
    "Low": 2,
    "Medium": 3,
    "High": 4,
    "Very High": 5
}

df["satisfaction_encoded"] = df["satisfaction_level"].map(satisfaction_mapping)
df[["satisfaction_level", "satisfaction_encoded"]].head()


|   | satisfaction_level | satisfaction_encoded |
| - | ------------------ | -------------------- |
| 0 | Low                | 2                    |
| 1 | Medium             | 3                    |
| 2 | High               | 4                    |
| 3 | Low                | 2                    |
| 4 | Very High          | 5                    |
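
A dictionary `.map` silently returns `NaN` for any label missing from the mapping (a typo, a new level). One way to make the order explicit and surface such labels is an ordered categorical. This is a minimal sketch on made-up labels, with `"Unknown"` added on purpose to show the failure case; note that the codes are 0-based, unlike the 1-based mapping above.

```python
import pandas as pd

order = ["Very Low", "Low", "Medium", "High", "Very High"]
levels = pd.Series(["Low", "Medium", "High", "Very High", "Unknown"])

# An ordered categorical keeps the rank explicit; labels outside `order` become missing.
as_cat = pd.Categorical(levels, categories=order, ordered=True)

# .codes are 0-based ranks; -1 flags a label that was not in `order`.
codes = pd.Series(as_cat.codes, index=levels.index)
unmapped = levels[codes == -1].tolist()

print(codes.tolist())
print("unmapped labels:", unmapped)
```

Checking `unmapped` before training is a cheap guard against a category list that has drifted from the data.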


## Step 3 – Nominal Encoding (Low Cardinality)

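Nominal features have **no natural order**, so each category gets its own binary column. With `drop_first=True`, the first category (`East` here) is dropped and becomes the all-`False` baseline, which avoids perfectly collinear columns in linear models.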

### One-Hot Encoding

In [5]:
low_cardinality_features = ["region"]

df_ohe = pd.get_dummies(
    df,
    columns=low_cardinality_features,
    drop_first=True
)

df_ohe.filter(like="region_").head()


|   | region_North | region_South | region_West |
| - | ------------ | ------------ | ----------- |
| 0 | True         | False        | False       |
| 1 | False        | False        | False       |
| 2 | False        | False        | False       |
| 3 | False        | False        | False       |
| 4 | False        | True         | False       |


## Step 4 – High Cardinality Encoding

High-cardinality features require special handling to avoid:
- Dimensional explosion
- Overfitting


### Frequency Encoding

In [6]:
segment_freq = df["customer_segment"].value_counts(normalize=True)

df["segment_freq_encoded"] = df["customer_segment"].map(segment_freq)
df[["customer_segment", "segment_freq_encoded"]].head()


|   | customer_segment | segment_freq_encoded |
| - | ---------------- | -------------------- |
| 0 | segment_20       | 0.0384               |
| 1 | segment_12       | 0.0414               |
| 2 | segment_11       | 0.035                |
| 3 | segment_9        | 0.0398               |
| 4 | segment_10       | 0.038                |
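
The cell above computes frequencies on the full dataset, which is fine for exploration. For modeling, the frequency map would normally be learned from the training rows only and then applied to new rows. A minimal sketch on toy segments follows; the segment names and the `0.0` fallback for unseen segments are assumptions for illustration.

```python
import pandas as pd

train = pd.DataFrame({"customer_segment": ["seg_a", "seg_a", "seg_b", "seg_c"]})
new = pd.DataFrame({"customer_segment": ["seg_a", "seg_c", "seg_z"]})

# Learn relative frequencies from training rows only.
freq_map = train["customer_segment"].value_counts(normalize=True)

train["segment_freq"] = train["customer_segment"].map(freq_map)
# Unseen segments have no entry in the map; 0.0 is one possible fallback.
new["segment_freq"] = new["customer_segment"].map(freq_map).fillna(0.0)

print(new)
```

Two segments with the same frequency receive the same code, so frequency encoding can merge categories that behave differently with respect to the target.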


## Step 5 – Target Encoding (Risky but Powerful)

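Target encoding replaces each category with a statistic of the label (here, the mean churn rate of that category).

Because the encoding is computed from the target itself, it must be fitted **inside cross-validation**, on the training folds only, to avoid leakage.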


### Naive Target Encoding (Demonstration Only)

In [7]:
target_mean = df.groupby("customer_segment")["churn"].mean()

df["segment_target_encoded"] = df["customer_segment"].map(target_mean)


## Leakage Warning

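The previous approach is NOT production-safe: each row's encoding is a mean that includes that row's own label, and it is computed over rows that may later land in a validation or test set.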

Correct target encoding requires:
- Fold-aware computation
- Regularization / smoothing
- Pipeline integration


## Step 6 – Numeric Transformations

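Transformations such as the logarithm compress long right tails, which reduces skew and the leverage of extreme values. This mainly helps scale-sensitive models (linear models, k-NN, neural networks); tree-based models are largely unaffected by monotonic transformations.

### Log Transformation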

In [8]:
df["income_log"] = np.log1p(df["income"])
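
To see what `np.log1p` does to a skewed column, the sketch below compares skewness before and after on a handful of made-up incomes (one deliberately extreme) and checks that `np.expm1` recovers the original values.

```python
import numpy as np
import pandas as pd

# Illustrative incomes; the last value is an intentional outlier.
income = pd.Series([17875.0, 26139.0, 32133.0, 48679.0, 54872.0, 250000.0])

income_log = np.log1p(income)     # log(1 + x), defined for x > -1
recovered = np.expm1(income_log)  # inverse of log1p

print("skew before:", round(income.skew(), 2))
print("skew after: ", round(income_log.skew(), 2))
```

`log1p` is preferred over a plain `log` when zeros can occur, since `log(0)` is undefined while `log1p(0)` is 0.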


## Step 7 – Binning as Transformation

Discretization can:
- Improve interpretability
- Reduce noise
- Expose monotonic relationships between ordered bands and the target

### Income Binning

In [9]:
df["income_band"] = pd.qcut(
    df["income"],
    q=4,
    labels=["Low", "Mid-Low", "Mid-High", "High"]
)

df[["income", "income_band"]].head()


|   | income  | income_band |
| - | ------- | ----------- |
| 0 | 32133.0 | Mid-Low     |
| 1 | 17875.0 | Low         |
| 2 | 26139.0 | Mid-Low     |
| 3 | 54872.0 | Mid-High    |
| 4 | 48679.0 | Mid-High    |
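
Quantile edges are learned from data, so, like any fitted statistic, they would normally come from the training rows and be reused on new rows. A minimal sketch with toy incomes, using `retbins=True` to capture the edges and `pd.cut` to apply them:

```python
import numpy as np
import pandas as pd

# Illustrative training and scoring values.
train_income = pd.Series(
    [17875.0, 26139.0, 32133.0, 48679.0, 54872.0, 61000.0, 72500.0, 90000.0]
)
new_income = pd.Series([5000.0, 40000.0, 150000.0])
labels = ["Low", "Mid-Low", "Mid-High", "High"]

# retbins=True also returns the quartile edges learned from the training rows.
_, edges = pd.qcut(train_income, q=4, labels=labels, retbins=True)

# Open the outer edges so values beyond the training range still get a band.
edges[0], edges[-1] = -np.inf, np.inf

new_band = pd.cut(new_income, bins=edges, labels=labels)
print(new_band.tolist())
```

Without the open outer edges, `pd.cut` would return `NaN` for incomes below the training minimum or above the training maximum.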


## Step 8 – Encoding Inside Pipelines (Correct Approach)


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

numeric_features = ["age", "income", "avg_monthly_usage"]
ordinal_features = ["satisfaction_level"]
nominal_features = ["region"]

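# Explicit category order, lowest to highest; one inner list per ordinal column
ordinal_encoder = OrdinalEncoder(
    categories=[["Very Low", "Low", "Medium", "High", "Very High"]]
)

# Columns not listed here (e.g. support_tickets, customer_segment) are dropped,
# because ColumnTransformer defaults to remainder="drop"
preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numeric_features),
        ("ord", ordinal_encoder, ordinal_features),
        ("nom", OneHotEncoder(drop="first"), nominal_features)
    ]
)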
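
A preprocessor only protects against leakage once it is chained with the estimator, so that every `fit` call relearns the imputation values and category lists from the rows it is given. The sketch below wires a similar preprocessor into a `Pipeline` with a logistic regression. The toy rows, the added `StandardScaler`, and the choice of classifier are assumptions for illustration, not the notebook's final model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy rows with the notebook's column names; values are illustrative only.
toy = pd.DataFrame({
    "age": [69, 32, 78, 38, 41, 55, 29, 63],
    "income": [32133.0, 17875.0, 26139.0, 54872.0, 48679.0, None, 39000.0, 61000.0],
    "avg_monthly_usage": [81.6, 28.3, 41.5, 82.4, 57.4, 60.0, 35.2, 70.1],
    "satisfaction_level": ["Low", "Medium", "High", "Low",
                           "Very High", "Very Low", "Medium", "High"],
    "region": ["North", "East", "East", "East", "South", "West", "North", "South"],
    "churn": [1, 0, 0, 1, 0, 1, 0, 0],
})
X, y = toy.drop(columns="churn"), toy["churn"]

# Numeric branch: impute first, then scale.
numeric_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
satisfaction_order = [["Very Low", "Low", "Medium", "High", "Very High"]]

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["age", "income", "avg_monthly_usage"]),
    ("ord", OrdinalEncoder(categories=satisfaction_order), ["satisfaction_level"]),
    ("nom", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# fit() learns medians, scaling, and category lists from X only.
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Passing `model` (rather than pre-transformed arrays) to a cross-validation routine such as `cross_val_score` keeps each fold's preprocessing limited to that fold's training rows.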


## Step 9 – Encoding Strategy by Model Type

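| Model family | Commonly used encodings |
|-----|--------|
| Linear / Logistic | OHE / Ordinal |
| Tree-Based | Label / Ordinal |
| Boosting | Target / Frequency |
| Neural Nets | OHE / Embeddings |

These are common defaults rather than rules. Tree-based models tolerate integer (label) codes for nominal features because they split on thresholds instead of treating the codes as magnitudes; in a linear model the same codes impose an artificial order.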


## Common Mistakes (Avoided)

- Label encoding nominal categories
- Target encoding without CV
- Encoding identifiers
- Ignoring cardinality
- Encoding outside pipelines

## Summary Table

| Feature Type | Strategy |
|------------|----------|
| Ordinal | OrdinalEncoder |
| Nominal (low card.) | One-Hot |
| Nominal (high card.) | Frequency / Target (fold-aware) |
| Skewed numeric | Log transform |
| Any fitted encoding | Fit inside a pipeline |


## Key Takeaways

- Encoding is model- and data-dependent
- Preserve semantics before transformation
- Cardinality dictates encoding choice
- Leakage-safe encoding is mandatory
- Pipelines are non-negotiable


## Next Notebook

03_Feature_Engineering/

└── [05_feature_selection_and_dimensionality_reduction.ipynb](05_feature_selection_and_dimensionality_reduction.ipynb)
