# 04 – Feature Transformation and Encoding
Converting Raw Attributes into Model-Consumable Signals



### Objective

This notebook provides a systematic treatment of **feature transformation and encoding**, covering:

- Why transformation is not optional
- Ordinal vs nominal encoding
- Cardinality-aware encoding strategies
- Target leakage risks in encoding
- Transformation inside pipelines

It answers:

How do we transform heterogeneous features into numeric representations without destroying meaning or leaking information?


### Why Transformation and Encoding Matter

Machine learning models operate on numbers — not meaning.

Poor encoding can:
- Introduce artificial order
- Inflate dimensionality
- Leak target information
- Degrade model generalization

Encoding is not a mechanical step — it is a modeling decision.
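
The first two risks can be seen on a few rows. The sketch below is illustrative only: `toy` is a hypothetical frame that reuses `region` and `customer_segment` values from this dataset, and it compares integer codes against one-hot columns.

```python
import pandas as pd

# Hypothetical toy frame reusing values that appear in this dataset's columns.
toy = pd.DataFrame({
    "region": ["North", "East", "East", "East", "South"],
    "customer_segment": ["segment_20", "segment_12", "segment_11",
                         "segment_9", "segment_10"],
})

# Integer codes on a nominal column impose an order the data does not have:
# categories are sorted alphabetically, so East=0 < North=1 < South=2.
region_codes = toy["region"].astype("category").cat.codes
print(region_codes.tolist())  # [1, 0, 0, 0, 2]

# One-hot encoding avoids the fake order but adds one column per category.
region_onehot = pd.get_dummies(toy["region"], prefix="region")
segment_onehot = pd.get_dummies(toy["customer_segment"], prefix="segment")
print(region_onehot.shape[1], segment_onehot.shape[1])  # 3 5
```

With only five rows the segment column already produces five indicator columns, which is why high-cardinality features usually call for a different strategy than low-cardinality ones.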



### Imports and Dataset


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("../datasets/03_feature_engineering/customer_feature_encoding_benchmark.csv")
df.head()

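| Unnamed: 0 | customer_id | age | income  | avg_monthly_usage | support_tickets | satisfaction_level | region | customer_segment | churn |
| ---------- | ----------- | --- | ------- | ----------------- | --------------- | ------------------ | ------ | ---------------- | ----- |
| 0          | 0           | 69  | 32133.0 | 81.606182         | 0               | Low                | North  | segment_20       | 0     |
| 1          | 1           | 32  | 17875.0 | 28.316858         | 1               | Medium             | East   | segment_12       | 0     |
| 2          | 2           | 78  | 26139.0 | 41.475782         | 1               | High               | East   | segment_11       | 0     |
| 3          | 3           | 38  | 54872.0 | 82.361651         | 1               | Low                | East   | segment_9        | 0     |
| 4          | 4           | 41  | 48679.0 | 57.360148         | 0               | Very High          | South  | segment_10       | 0     |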


## Step 1 – Feature Type Audit

Before encoding, we must understand feature semantics.
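
One way to start the audit is to tabulate dtype, distinct-value count, and missing count per column. The sketch below is a minimal version under stated assumptions: `audit_features` is a hypothetical helper, and `sample` rebuilds the five preview rows by hand so the block runs without the CSV. On the full data you would pass `df` instead.

```python
import pandas as pd

# The five preview rows, rebuilt by hand so the sketch is self-contained.
sample = pd.DataFrame({
    "customer_id": [0, 1, 2, 3, 4],
    "age": [69, 32, 78, 38, 41],
    "income": [32133.0, 17875.0, 26139.0, 54872.0, 48679.0],
    "avg_monthly_usage": [81.606182, 28.316858, 41.475782, 82.361651, 57.360148],
    "support_tickets": [0, 1, 1, 1, 0],
    "satisfaction_level": ["Low", "Medium", "High", "Low", "Very High"],
    "region": ["North", "East", "East", "East", "South"],
    "customer_segment": ["segment_20", "segment_12", "segment_11",
                         "segment_9", "segment_10"],
    "churn": [0, 0, 0, 0, 0],
})


def audit_features(frame):
    """Summarise dtype, distinct-value count, and missing count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
        "n_missing": frame.isna().sum(),
    })


audit = audit_features(sample)
print(audit)
```

The counts only flag candidates. Whether a text column is ordinal or nominal, or whether an integer column is an identifier, still has to be decided from what the column means.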




### Feature Categories

We distinguish:

- **Numeric continuous**: `age`, `income`, `avg_monthly_usage`
- **Numeric discrete**: `support_tickets`
- **Ordinal categorical**: `satisfaction_level`
- **Nominal categorical**: `region`, `customer_segment`
- **Identifiers**: `customer_id` (never encoded); `Unnamed: 0` appears to be a saved row index and is treated the same way


| Column               | Type                             | Planned treatment                   |
| -------------------- | -------------------------------- | ----------------------------------- |
| `customer_id`        | Identifier                       | Never encoded                       |
| `age`                | Numeric continuous               | Numeric pipeline input              |
| `income`             | Numeric continuous (skewed)      | Log transform                       |
| `avg_monthly_usage`  | Numeric continuous               | Model input                         |
| `support_tickets`    | Numeric discrete                 | Count feature                       |
| `satisfaction_level` | Ordinal categorical              | OrdinalEncoder                      |
| `region`             | Nominal categorical (low card.)  | One-Hot                             |
| `customer_segment`   | Nominal categorical (high card.) | Frequency / Target encoding         |
| `churn`              | Target                           | Target for the target-encoding demo |
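
Several of the treatments in the table can be combined in one scikit-learn `ColumnTransformer`. The sketch below is not the notebook's final pipeline; it is a minimal illustration on the five preview rows. The level order in `satisfaction_order` is an assumption inferred from the values visible in the preview, and the frequency encoding is done with plain pandas for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, OrdinalEncoder

# The five preview rows, rebuilt by hand so the sketch is self-contained.
sample = pd.DataFrame({
    "customer_id": [0, 1, 2, 3, 4],
    "age": [69, 32, 78, 38, 41],
    "income": [32133.0, 17875.0, 26139.0, 54872.0, 48679.0],
    "avg_monthly_usage": [81.606182, 28.316858, 41.475782, 82.361651, 57.360148],
    "support_tickets": [0, 1, 1, 1, 0],
    "satisfaction_level": ["Low", "Medium", "High", "Low", "Very High"],
    "region": ["North", "East", "East", "East", "South"],
    "customer_segment": ["segment_20", "segment_12", "segment_11",
                         "segment_9", "segment_10"],
    "churn": [0, 0, 0, 0, 0],
})

# Assumed ordering; only these four levels are visible in the preview.
satisfaction_order = ["Low", "Medium", "High", "Very High"]

preprocess = ColumnTransformer(
    transformers=[
        # log1p compresses the long right tail of a skewed income column.
        ("log_income", FunctionTransformer(np.log1p), ["income"]),
        # Explicit categories keep the intended rank instead of alphabetical order.
        ("ordinal", OrdinalEncoder(categories=[satisfaction_order]),
         ["satisfaction_level"]),
        # Low-cardinality nominal column: one indicator per region.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["region"]),
        ("numeric", "passthrough", ["age", "avg_monthly_usage", "support_tickets"]),
    ],
    remainder="drop",    # customer_id, customer_segment and churn are left out here
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(sample)
print(X.shape)  # (5, 8)

# Frequency encoding for the high-cardinality column. With a real train/test
# split, compute the frequencies on the training rows only and map them onto
# the test rows.
segment_freq = sample["customer_segment"].value_counts(normalize=True)
segment_freq_encoded = sample["customer_segment"].map(segment_freq)
```

Passing the category order explicitly matters: left to its default, `OrdinalEncoder` sorts the labels, which would place `High` before `Low`.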


The dataset supports every step covered in this notebook:

| Notebook Step        | Works |
| -------------------- | ----- |
| Feature type audit   | ✅     |
| Ordinal encoding     | ✅     |
| One-Hot (low card.)  | ✅     |
| Frequency encoding   | ✅     |
| Target encoding demo | ✅     |
| Log transform        | ✅     |
| Binning with `qcut`  | ✅     |
| Pipeline example     | ✅     |
