# Scaling and Normalization
Making Features Comparable Without Distorting Signal

## Objective

This notebook provides a systematic treatment of feature scaling, covering:

- When scaling is required (and when it is not)
- Standardization vs normalization
- Robust scaling under outliers
- Log and power transformations
- Scaling inside pipelines (leakage-safe)

It answers:

How do we scale numeric features to support model learning without corrupting business signal?

## Why Scaling Matters

Incorrect scaling can:

- Bias distance-based models
- Slow or prevent convergence
- Inflate the influence of outliers
- Break regularization assumptions

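Scaling is model-dependent, not universal: distance-based and gradient-based models (k-NN, SVMs, regularized linear models, neural networks) generally need it, while tree-based models are largely insensitive to monotonic rescaling of individual features.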
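
To make the distance-bias point concrete, here is a minimal sketch. The reference population and the three customers are hypothetical values invented for illustration, not rows from the notebook's dataset. It compares which customer is the nearest neighbor of the first one before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical reference population: columns are [annual_income_usd, tenure_years].
population = np.column_stack([
    np.linspace(20_000, 80_000, 100),   # incomes spread over tens of thousands
    np.tile(np.arange(10), 10),         # tenures spread over single digits
])

# Three hypothetical customers to compare.
customers = np.array([
    [50_000.0, 1.0],    # row 0: query customer
    [50_500.0, 10.0],   # row 1: close in income, far in tenure
    [53_000.0, 1.0],    # row 2: farther in income, identical tenure
])

def nearest_to_first(rows):
    """Return the index (1 or 2) of the row closest to row 0 in Euclidean distance."""
    dists = np.linalg.norm(rows[1:] - rows[0], axis=1)
    return int(np.argmin(dists)) + 1

scaler = StandardScaler().fit(population)
raw_neighbor = nearest_to_first(customers)
scaled_neighbor = nearest_to_first(scaler.transform(customers))
print(raw_neighbor, scaled_neighbor)
```

On raw values the nearest neighbor is row 1, because the income column's units swamp a nine-year tenure gap. After standardization it is row 2, where tenure matches exactly.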

## Imports and Dataset

This notebook demonstrates scaling and normalization techniques in a leakage-safe, production-aligned way.

## Imports

In [None]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    PowerTransformer,
    FunctionTransformer
)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


## Load Synthetic Dataset

In [None]:

df = pd.read_csv("synthetic_customer_churn_classification_complete.csv")
df.head()


## Identify Numeric Features

In [None]:

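numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()

# Exclude the target and the identifier; neither should be scaled as a feature.
# A comprehension avoids the ValueError that list.remove raises if a column is absent.
numeric_features = [c for c in numeric_features if c not in ("churn", "customer_id")]
numeric_features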


## Distribution Diagnostics

In [None]:

df[numeric_features].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()


## StandardScaler

In [None]:

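# Standardize each feature to zero mean and unit variance.
# Fitting on the full DataFrame is for inspection only; for modeling, fit on the training split.
standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(df[numeric_features])

pd.DataFrame(standard_scaled, columns=numeric_features).describe()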
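
The cell above fits the scaler on every row, which is fine for inspecting distributions but leaks test-set statistics if the result feeds a model. The following is a minimal sketch of the leakage-safe pattern. It assumes a randomly generated array `X` as a hypothetical stand-in for `df[numeric_features]`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for df[numeric_features]: 200 rows, 3 numeric features.
X = rng.normal(loc=100.0, scale=20.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics; never refit on test

print(X_train_scaled.mean(axis=0).round(6))  # essentially zero by construction
print(X_test_scaled.mean(axis=0).round(3))   # near zero, but not exactly
```

The test-set means are not exactly zero, which is expected: the test rows are scaled with the training statistics, just as unseen production data would be.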


## MinMaxScaler

In [None]:

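# Rescale each feature to [0, 1] using its observed min and max.
# A single extreme value stretches the range and compresses all other values.
minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(df[numeric_features])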


## RobustScaler

In [None]:

# Center each feature on its median and scale by its interquartile range (IQR).
robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(df[numeric_features])
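
To illustrate how the three scalers react to an outlier, here is a small sketch on a made-up feature: nine typical values plus one extreme value (hypothetical data, not from the churn dataset). It measures how much spread the typical values keep after each scaler is applied.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature: typical values 1..9 plus a single extreme outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

spreads = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    # Range occupied by the nine typical values after scaling.
    spreads[type(scaler).__name__] = scaled[:9].max() - scaled[:9].min()

for name, spread in spreads.items():
    print(f"{name:>14}: spread of typical values = {spread:.3f}")
```

With `StandardScaler` and `MinMaxScaler` the nine typical values collapse into a narrow band, because the outlier inflates the standard deviation and the range. `RobustScaler` keeps them well separated, although the outlier itself remains extreme after scaling.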


## Log Transformation

In [None]:

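# log1p(x) = log(1 + x) is defined only for x > -1, so it suits non-negative, right-skewed features.
# Columns containing values <= -1 would produce NaN or -inf here.
log_transformer = FunctionTransformer(np.log1p, feature_names_out="one-to-one")
log_transformed = log_transformer.fit_transform(df[numeric_features])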
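
Because `np.log1p` is undefined for values at or below -1, one option is to restrict the transform to columns that are non-negative. The sketch below shows that guard on a hypothetical two-column frame; the column names are invented for illustration. It also registers `np.expm1` as the inverse so the transform can be undone.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical frame: one non-negative skewed column, one column with negative values.
toy = pd.DataFrame({
    "monthly_spend": [0.0, 10.0, 100.0, 10_000.0],
    "balance_change": [-50.0, -1.0, 0.0, 25.0],
})

# Only columns whose minimum is >= 0 are treated as safe inputs for log1p.
loggable = [col for col in toy.columns if toy[col].min() >= 0]

log_transformer = FunctionTransformer(
    np.log1p, inverse_func=np.expm1, check_inverse=False
)
logged = toy.copy()
logged[loggable] = log_transformer.fit_transform(toy[loggable])

print(loggable)
print(logged.round(3))
```

Columns that fail the check are left untouched here; whether to shift them, use Yeo-Johnson, or leave them unscaled depends on what the feature means.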


## Power Transformation (Yeo-Johnson)

In [None]:

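# Yeo-Johnson accepts zero and negative values (Box-Cox requires strictly positive data).
# standardize=True is the default, so the output is also zero-mean and unit-variance.
pt = PowerTransformer(method="yeo-johnson")
pt_scaled = pt.fit_transform(df[numeric_features])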
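
A quick way to check whether the power transform helped is to compare skewness before and after. This sketch does so on a hypothetical log-normally distributed feature generated for the purpose, and prints the fitted `lambdas_` value for reference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Hypothetical right-skewed feature (log-normal), standing in for a spend-like column.
skewed = pd.DataFrame({"spend": rng.lognormal(mean=3.0, sigma=1.0, size=1_000)})

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
transformed = pd.DataFrame(pt.fit_transform(skewed), columns=skewed.columns)

skew_before = skewed["spend"].skew()
skew_after = transformed["spend"].skew()
print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
print(f"fitted lambda: {pt.lambdas_[0]:.2f}")
```

The same comparison can be run per column on `df[numeric_features]` to decide which features benefit from the transform.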


## Scaling Inside Pipelines (Best Practice)

In [None]:

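# Impute missing values, compress right skew, then scale robustly.
# Note: log1p assumes non-negative inputs; drop or replace that step for features that can be negative.
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p, feature_names_out="one-to-one")),
    ("scaler", RobustScaler())
])

numeric_pipeline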
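
The pipeline only prevents leakage once it is fitted together with the model inside cross-validation. This sketch shows one way to wire that up. The data, the feature names, and the choice of `LogisticRegression` are all assumptions made for illustration; the notebook itself does not specify a model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler

rng = np.random.default_rng(0)
n = 300
# Hypothetical stand-in for the churn data: two non-negative features and a binary target.
X = pd.DataFrame({
    "monthly_spend": rng.lognormal(3.0, 1.0, n),
    "support_calls": rng.poisson(2.0, n).astype(float),
})
X.loc[rng.choice(n, 20, replace=False), "monthly_spend"] = np.nan  # inject missing values
y = (X["support_calls"] + rng.normal(0.0, 1.0, n) > 2.5).astype(int)

# Same three steps as the notebook's numeric pipeline.
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scaler", RobustScaler()),
])

preprocess = ColumnTransformer([("num", numeric_pipeline, list(X.columns))])
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),  # assumed model, for illustration only
])

# Each fold refits imputer, log step, and scaler on that fold's training rows only.
scores = cross_val_score(model, X, y, cv=5)
print(scores.round(3))
```

Because preprocessing lives inside the estimator passed to `cross_val_score`, the medians and IQRs are recomputed per fold and the held-out rows never influence them.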


## Key Takeaways
- Scaling is model-dependent
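- RobustScaler (median and IQR) is less sensitive to outliers than StandardScaler or MinMaxScaler, but it does not remove or clip them
- Fit scalers inside pipelines so their statistics come from training data only, which avoids leakage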