In [1]:
from pathlib import Path
from IPython.display import HTML, display
css = Path("../../../css/custom.css").read_text(encoding="utf-8")
display(HTML(f"<style>{css}</style>"))


# Chapter 2 — Basics of Data and Preprocessing
## Lesson 4: Feature Scaling (Normalization & Standardization)

### What you will learn in this lesson

By the end of this notebook, you will be able to:

- Explain *why* feature scaling matters and *when* it does **not**.
- Distinguish **standardization** (z-score scaling) from **normalization** (min–max scaling) and from **vector normalization** (unit-norm per sample).
- Select an appropriate scaler for a given algorithm (kNN / SVM / Logistic Regression / PCA / k-means / regularized linear models).
- Implement scaling correctly **without data leakage** using `Pipeline` and `ColumnTransformer`.
- Diagnose scaling problems: outliers, heavy tails, sparse matrices, and mixed data types.
- Treat the scaler as a **hyperparameter** and validate it using cross-validation / grid search.
- Communicate scaling choices in a reproducible ML workflow.

---

### Why this topic is “Band 8–9” relevant (advanced framing)

At higher proficiency, “feature scaling” is not a checkbox; it is a *modeling decision* that affects:

- **Optimization geometry** (conditioning of the problem; gradient descent step sizes).
- **Regularization meaning** (L1/L2 penalties are not comparable across features unless scale is controlled).
- **Distance and similarity** (kNN, k-means, kernel methods, cosine distance).
- **Numerical stability** (floating-point range; poorly scaled features can create underflow/overflow or ill-conditioned matrices).
- **Interpretability and governance** (what does a coefficient *mean* if one feature is in dollars and another is in millimeters?).

In this lesson you will learn to treat scaling as part of the modeling pipeline, not an isolated preprocessing trick.


In [2]:

import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler, Normalizer
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

np.set_printoptions(precision=4, suppress=True)
pd.set_option("display.max_columns", 60)
pd.set_option("display.width", 160)

print("Versions:")
import sklearn, sys
print("  python:", sys.version.split()[0])
print("  sklearn:", sklearn.__version__)


Versions:
  python: 3.13.0
  sklearn: 1.5.2



## 1) Core concepts and notation

Let a dataset have $n$ samples and $p$ features. We write the feature vector for sample $i$ as:

$$
\mathbf{x}_i = [x_{i1}, x_{i2}, \dots, x_{ip}]
$$

Two scaling operations you will use constantly:

### Standardization (z-score scaling)

For feature $j$:

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$

- $\mu_j$ is the mean of feature $j$ (computed on the **training** set).
- $\sigma_j$ is the standard deviation of feature $j$ (computed on the **training** set).

### Min–max normalization

For feature $j$:

$$
x'_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j}
$$

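This maps each feature of the **training** set into $[0, 1]$. Because $\min_j$ and $\max_j$ are computed on the training set, values in validation or test data can fall slightly outside this interval.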

### Vector normalization (unit norm per sample)

This normalizes the entire vector of a sample:

$$
\tilde{\mathbf{x}}_i = \frac{\mathbf{x}_i}{\lVert \mathbf{x}_i \rVert_2}
$$

This is common in text retrieval / cosine similarity pipelines and is *different* from min–max.
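The three formulas above can be checked by hand on a tiny array. The sketch below is illustrative only: `X_toy` and its values are made up for this example, and the scikit-learn transformers are used only to confirm that the manual NumPy computations agree with them. Note the axis: standardization and min–max work per feature (column), while unit-norm scaling works per sample (row).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical toy data: 4 samples, 2 features on very different scales.
X_toy = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 900.0],
])

# Standardization: column-wise mean and population standard deviation (ddof=0).
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Min-max normalization: column-wise minimum and maximum.
mm_manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# Vector normalization: row-wise L2 norm.
unit_manual = X_toy / np.linalg.norm(X_toy, axis=1, keepdims=True)

# Compare with the scikit-learn transformers.
z_sklearn = StandardScaler().fit_transform(X_toy)
mm_sklearn = MinMaxScaler().fit_transform(X_toy)
unit_sklearn = Normalizer(norm="l2").fit_transform(X_toy)

print("z-score matches: ", np.allclose(z_manual, z_sklearn))
print("min-max matches: ", np.allclose(mm_manual, mm_sklearn))
print("unit-norm matches:", np.allclose(unit_manual, unit_sklearn))
```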

---

## 2) When scaling matters (and when it does not)

Scaling matters when your algorithm uses:

- **Distances or inner products**: kNN, k-means, SVM (especially RBF kernel), PCA, kernel ridge regression.
- **Regularization penalties** that assume comparable coefficient scales: Lasso, Ridge, Elastic Net.
- **Gradient-based or coordinate-wise optimization**, where a workable step size depends on feature scale: gradient descent, SGD, coordinate descent.

Scaling is usually not essential for:

- **Tree-based models**: decision trees, random forests, gradient boosting trees  
  (splits are based on ordering; scale does not change order).
- Some **rule-based** or **count-based** systems where features already share scale.

However, “not essential” is not identical to “never useful.” Tree models can still benefit indirectly when scaling helps upstream steps (e.g., PCA features; stability of imputation; handling extreme ranges).
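The claim that tree splits depend only on ordering can be checked empirically. The sketch below is an assumption-laden illustration, not part of the lesson's datasets: it uses scikit-learn's built-in `load_iris` and a shallow `DecisionTreeClassifier`, fits it once on raw and once on standardized features, and compares the predictions. In principle floating-point ties could cause small differences, so treat this as a sanity check rather than a proof.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Built-in iris data (assumption: used here instead of the repository CSV).
X_demo, y_demo = load_iris(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same model settings and random seed for both fits.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo_scaled, y_demo)

pred_tree_raw = tree_raw.predict(X_demo)
pred_tree_scaled = tree_scaled.predict(X_demo_scaled)

# Standardization is a positive affine map per feature, so the ordering of
# values (and therefore the set of candidate splits) is unchanged.
same_predictions = np.array_equal(pred_tree_raw, pred_tree_scaled)
print("Identical predictions with and without scaling:", same_predictions)
```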

---

### A geometric intuition (distance dominance)

Consider Euclidean distance:

$$
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
$$

If one feature is measured in the range $[0, 10^6]$ and another in $[0, 1]$, then the large-range feature *dominates* the distance, regardless of whether it is informative. Scaling is a way to define what “distance” should mean for your task.
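A small numeric sketch of distance dominance follows. The array `X_toy` is hypothetical: the first column plays the role of a feature in $[0, 10^6]$ and the second a feature in $[0, 1]$. Point `b` is nearly identical to `a` in the large feature but very different in the small one, while `c` matches `a` exactly in the small feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 lives in [0, 1e6], column 1 in [0, 1].
X_toy = np.array([
    [500_000.0, 0.10],  # a
    [500_500.0, 0.95],  # b: close to a in column 0, far in column 1
    [520_000.0, 0.10],  # c: same as a in column 1, moderately far in column 0
    [100_000.0, 0.50],
    [900_000.0, 0.60],
])

def euclidean(M, i, j):
    """Euclidean distance between rows i and j of M."""
    return float(np.linalg.norm(M[i] - M[j]))

X_toy_scaled = StandardScaler().fit_transform(X_toy)

d_ab_raw, d_ac_raw = euclidean(X_toy, 0, 1), euclidean(X_toy, 0, 2)
d_ab_scaled, d_ac_scaled = euclidean(X_toy_scaled, 0, 1), euclidean(X_toy_scaled, 0, 2)

# On raw data the large-range column decides who the nearest neighbor is;
# after standardization both columns contribute on a comparable scale.
print(f"raw:    d(a,b)={d_ab_raw:.2f}  d(a,c)={d_ac_raw:.2f}")
print(f"scaled: d(a,b)={d_ab_scaled:.3f}  d(a,c)={d_ac_scaled:.3f}")
```

On the raw data `b` is the nearer neighbor of `a`; after standardization `c` is. Neither answer is "correct" in general, which is the point: the scaler encodes what distance should mean.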



## 3) Load multiple datasets (and look at feature scales)

In real projects, feature scaling is rarely a “one dataset, one scaler” decision. You often build a reusable policy and then verify it across different data sources.

We will use several datasets from your repository:

- `diabetes.csv` (binary classification; numeric features with different ranges)
- `iris.csv` (multiclass classification; classic demonstration for distance-based methods)
- `Wine_Quality.csv` (tabular classification proxy; we will binarize quality for a clean example)
- `drug200.csv` (mixed numeric + categorical; demonstrates column-wise pipelines)
- `hw_200.csv` (clustering; demonstrates scaling + k-means / PCA)

We will start by inspecting the raw scale of numeric columns.


In [3]:

# Paths are relative to: Tutorials/English/Chapter2 or Tutorials/Persian/Chapter2
p_diabetes = Path("../../../Datasets/Classification/diabetes.csv")
p_iris     = Path("../../../Datasets/Classification/iris.csv")
p_wine     = Path("../../../Datasets/Classification/Wine_Quality.csv")
p_drug     = Path("../../../Datasets/Classification/drug200.csv")
p_hw       = Path("../../../Datasets/Clustering/hw_200.csv")

diabetes = pd.read_csv(p_diabetes)
iris = pd.read_csv(p_iris)
wine = pd.read_csv(p_wine)
drug = pd.read_csv(p_drug)
hw_raw = pd.read_csv(p_hw)

print("Loaded shapes:")
print("  diabetes:", diabetes.shape)
print("  iris:", iris.shape)
print("  wine:", wine.shape)
print("  drug:", drug.shape)
print("  hw_raw:", hw_raw.shape)

display(diabetes.head())


Loaded shapes:
  diabetes: (768, 9)
  iris: (150, 5)
  wine: (4898, 12)
  drug: (200, 6)
  hw_raw: (200, 3)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,classification
0,6,148,72,35,0,33.6,0.627,50,Diabetic
1,1,85,66,29,0,26.6,0.351,31,Non-Diabetic
2,8,183,64,0,0,23.3,0.672,32,Diabetic
3,1,89,66,23,94,28.1,0.167,21,Non-Diabetic
4,0,137,40,35,168,43.1,2.288,33,Diabetic


In [4]:

def scale_summary(df: pd.DataFrame, name: str, max_cols: int = 10):
    num = df.select_dtypes(include=[np.number])
    if num.empty:
        print(f"{name}: no numeric columns")
        return
    s = pd.DataFrame({
        "min": num.min(),
        "max": num.max(),
        "mean": num.mean(),
        "std": num.std(ddof=0),
    })
    s["range"] = s["max"] - s["min"]
    s["range_ratio_to_median"] = s["range"] / (np.median(s["range"]) + 1e-12)
    s = s.sort_values("range", ascending=False)
    print(f"\n{name}: numeric scale summary (top by range)")
    display(s.head(max_cols).round(4))

scale_summary(diabetes, "diabetes")
scale_summary(iris, "iris")
scale_summary(wine, "wine")



diabetes: numeric scale summary (top by range)


Unnamed: 0,min,max,mean,std,range,range_ratio_to_median
Insulin,0.0,846.0,79.7995,115.1689,846.0,10.1866
Glucose,0.0,199.0,120.8945,31.9518,199.0,2.3961
BloodPressure,0.0,122.0,69.1055,19.3432,122.0,1.469
SkinThickness,0.0,99.0,20.5365,15.9418,99.0,1.1921
BMI,0.0,67.1,31.9926,7.879,67.1,0.8079
Age,21.0,81.0,33.2409,11.7526,60.0,0.7225
Pregnancies,0.0,17.0,3.8451,3.3674,17.0,0.2047
DiabetesPedigreeFunction,0.078,2.42,0.4719,0.3311,2.342,0.0282



iris: numeric scale summary (top by range)


Unnamed: 0,min,max,mean,std,range,range_ratio_to_median
petal_length,1.0,6.9,3.7587,1.7585,5.9,1.9667
sepal_length,4.3,7.9,5.8433,0.8253,3.6,1.2
sepal_width,2.0,4.4,3.054,0.4321,2.4,0.8
petal_width,0.1,2.5,1.1987,0.7606,2.4,0.8



wine: numeric scale summary (top by range)


Unnamed: 0,min,max,mean,std,range,range_ratio_to_median
total sulfur dioxide,9.0,440.0,138.3607,42.4937,431.0,112.5326
free sulfur dioxide,2.0,289.0,35.3081,17.0054,287.0,74.9347
residual sugar,0.6,65.8,6.3914,5.0715,65.2,17.0235
fixed acidity,3.8,14.2,6.8548,0.8438,10.4,2.7154
alcohol,8.0,14.2,10.5143,1.2305,6.2,1.6188
quality,3.0,9.0,5.8779,0.8855,6.0,1.5666
citric acid,0.0,1.66,0.3342,0.121,1.66,0.4334
pH,2.72,3.82,3.1883,0.151,1.1,0.2872
volatile acidity,0.08,1.1,0.2782,0.1008,1.02,0.2663
sulphates,0.22,1.08,0.4898,0.1141,0.86,0.2245



### Interpretation

Even if each dataset looks “numeric”, the feature scales differ:

- In `diabetes`, **Insulin** (range 846) and **Glucose** (range 199) span ranges tens to hundreds of times wider than **DiabetesPedigreeFunction** (range 2.342).
- In `wine`, some chemistry measures have very different spreads (e.g., a standard deviation of about 0.11 for sulphates vs about 17 for free sulfur dioxide).
- In `iris`, the ranges are moderate (2.4 to 5.9), but scaling can still change the neighborhood geometry.

This is where scaling decisions begin.



## 4) Standardization in practice (z-score) — and why it helps optimization

Many algorithms effectively solve an optimization problem. For example, logistic regression (with L2 regularization) can be written as:

$$
\min_{\mathbf{w}, b} \; \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp\left(-y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right)\right) + \lambda \lVert \mathbf{w} \rVert_2^2
$$

If one feature is 1000× larger than another, then the loss landscape becomes **ill-conditioned**, and the optimizer may need tiny steps in one direction and large steps in another. Standardization makes the problem closer to “spherical” and improves convergence behavior.
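One way to make "ill-conditioned" concrete is the condition number of the centered Gram matrix $\frac{1}{n} X^\top X$, which is a rough proxy for the curvature of the loss in linear models. The sketch below uses synthetic data (an assumption for illustration, not one of the lesson's datasets) with one feature about 1000× larger than the other. The helper name `gram_condition_number` is made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two independent synthetic features; the second is ~1000x larger in scale.
X_demo = np.column_stack([
    rng.normal(0.0, 1.0, n),
    rng.normal(0.0, 1000.0, n),
])

def gram_condition_number(M):
    """Condition number of the centered Gram matrix (1/n) * M^T M."""
    Mc = M - M.mean(axis=0)
    return float(np.linalg.cond(Mc.T @ Mc / len(Mc)))

cond_raw = gram_condition_number(X_demo)
cond_scaled = gram_condition_number(StandardScaler().fit_transform(X_demo))

# A large condition number means very different curvature along different
# directions; a value near 1 means a nearly "spherical" problem.
print(f"condition number (raw):          {cond_raw:.1f}")
print(f"condition number (standardized): {cond_scaled:.3f}")
```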

### Example: logistic regression on the diabetes dataset

We will compare:

- Model A: Logistic Regression **without** scaling
- Model B: Logistic Regression **with** `StandardScaler` inside a `Pipeline`

We will evaluate on a held-out test split.


In [5]:

from sklearn.preprocessing import LabelEncoder

X = diabetes.drop(columns=["classification"])
y = diabetes["classification"].copy()

le = LabelEncoder()
y_bin = le.fit_transform(y)
print("Classes:", list(le.classes_))

X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.25, random_state=42, stratify=y_bin
)

# A) No scaling
lr_raw = LogisticRegression(max_iter=2000, solver="lbfgs")
lr_raw.fit(X_train, y_train)
pred_raw = lr_raw.predict(X_test)

# B) With standardization
lr_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, solver="lbfgs"))
])
lr_scaled.fit(X_train, y_train)
pred_scaled = lr_scaled.predict(X_test)

print("\nAccuracy (no scaling):", round(accuracy_score(y_test, pred_raw), 4))
print("Accuracy (standardized):", round(accuracy_score(y_test, pred_scaled), 4))

print("\nConfusion matrix (no scaling):\n", confusion_matrix(y_test, pred_raw))
print("\nConfusion matrix (standardized):\n", confusion_matrix(y_test, pred_scaled))


Classes: ['Diabetic', 'Non-Diabetic']

Accuracy (no scaling): 0.7812
Accuracy (standardized): 0.7865

Confusion matrix (no scaling):
 [[ 39  28]
 [ 14 111]]

Confusion matrix (standardized):
 [[ 40  27]
 [ 14 111]]






### Coefficients and interpretability: scaling changes the meaning

If you fit a linear model *without* scaling, the coefficient magnitude is influenced by the unit of measurement.

Standardization makes coefficients more directly comparable as “effect per standard deviation,” which is often closer to what practitioners mean when they talk about *relative importance* in linear models.

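Let’s inspect coefficients from the standardized model. Because `LabelEncoder` sorted the classes as `['Diabetic', 'Non-Diabetic']`, the positive class (label 1) is `Non-Diabetic`, so a negative coefficient means that a higher value of the feature pushes the prediction toward `Diabetic`.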


In [6]:

feature_names = X.columns.tolist()

coef = lr_scaled.named_steps["clf"].coef_.ravel()
coef_s = pd.Series(coef, index=feature_names).sort_values(key=lambda s: np.abs(s), ascending=False)

display(coef_s.to_frame("coef (standardized)").head(10).round(4))
print("Interpretation: coefficients after StandardScaler are roughly 'effect per 1 std' of the feature.")


Unnamed: 0,coef (standardized)
Glucose,-1.124
BMI,-0.6716
Pregnancies,-0.4667
BloodPressure,0.3129
DiabetesPedigreeFunction,-0.2806
Insulin,0.1773
Age,-0.1613
SkinThickness,-0.1089


Interpretation: coefficients after StandardScaler are roughly 'effect per 1 std' of the feature.
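Standardized coefficients can be mapped back to the original units when a per-unit reading is needed. Since $z_j = (x_j - \mu_j)/\sigma_j$, the linear score $\mathbf{w}^\top \mathbf{z} + b$ equals $\sum_j (w_j/\sigma_j)\, x_j + \big(b - \sum_j w_j \mu_j / \sigma_j\big)$. The sketch below checks this identity on synthetic data from `make_classification` (an assumption for illustration; the variable names are hypothetical and do not refer to the diabetes model above).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem; multiply columns to mimic different units.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0])

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
]).fit(X_demo, y_demo)

scaler_demo = pipe_demo.named_steps["scaler"]
clf_demo = pipe_demo.named_steps["clf"]

# Coefficients in standardized space ("per 1 std") ...
w_std = clf_demo.coef_.ravel()
# ... converted to original units ("per 1 unit of the raw feature").
w_orig = w_std / scaler_demo.scale_
b_orig = clf_demo.intercept_[0] - np.sum(w_std * scaler_demo.mean_ / scaler_demo.scale_)

# Both parameterizations must give the same decision scores.
scores_pipeline = pipe_demo.decision_function(X_demo)
scores_manual = X_demo @ w_orig + b_orig
print("Scores agree:", np.allclose(scores_pipeline, scores_manual))
```

The two sets of coefficients describe the same model; they differ only in the question they answer ("per standard deviation" vs "per raw unit").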



## 5) Min–max normalization vs standardization — practical tradeoffs

### When min–max is useful

Min–max scaling is common when:

- You want to keep features in a bounded range $[0, 1]$.
- Your model has constraints or priors that expect bounded inputs.
- You want to preserve **relative** spacing while controlling range (important in some distance-based methods).

### When standardization is more robust

Standardization often works better when:

- Features are roughly bell-shaped or you want to treat them as such.
- You use regularized linear models, SVM, or PCA.
- You want centering around zero (important for many optimizers).

### A critical warning (outliers)

Min–max scaling is sensitive to outliers: one extreme value can compress the entire feature into a tiny interval. Robust scalers are often better for heavy-tailed data.
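The compression effect is easy to see on a made-up one-dimensional feature with a single extreme value. In this sketch (hypothetical numbers, chosen only for illustration) five typical values sit between 10 and 14 and one outlier sits at 1000; we compare how much spread the typical values keep after `MinMaxScaler` and after `RobustScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Five typical values plus one extreme outlier (hypothetical data).
x = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 1000.0]).reshape(-1, 1)

x_minmax = MinMaxScaler().fit_transform(x).ravel()
x_robust = RobustScaler().fit_transform(x).ravel()

# Spread of the five typical values after each transformation.
minmax_spread = x_minmax[:5].max() - x_minmax[:5].min()  # (14 - 10) / (1000 - 10)
robust_spread = x_robust[:5].max() - x_robust[:5].min()  # (14 - 10) / IQR

print("min-max scaled:", np.round(x_minmax, 4))
print("robust scaled: ", np.round(x_robust, 4))
print(f"spread of typical values: min-max={minmax_spread:.4f}, robust={robust_spread:.4f}")
```

With min–max, the typical values are squeezed into less than 1% of the unit interval; the robust scaler keeps them well separated because the median and IQR are barely affected by the outlier.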

---

## 6) Demonstration: kNN on the Iris dataset (no scaling vs min–max vs standardization)

kNN is a distance-based method. If a feature has a larger range, it gets a larger weight implicitly.

We will compare:

- kNN on raw features
- kNN with `MinMaxScaler`
- kNN with `StandardScaler`

To avoid leakage, scaling is inside `Pipeline`.


In [7]:

X = iris.drop(columns=["classification"])
y = iris["classification"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipelines = {
    "raw": Pipeline([("knn", KNeighborsClassifier(n_neighbors=7))]),
    "minmax": Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier(n_neighbors=7))]),
    "standard": Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=7))]),
}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, pred)
    print(f"{name:>8}  accuracy = {acc:.4f}")

print("\nA short classification report for the standardized pipeline:\n")
pred_std = pipelines["standard"].predict(X_test)
print(classification_report(y_test, pred_std))


     raw  accuracy = 0.9474
  minmax  accuracy = 0.9737
standard  accuracy = 0.9474

A short classification report for the standardized pipeline:

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       0.87      1.00      0.93        13
 Iris-virginica       1.00      0.85      0.92        13

       accuracy                           0.95        38
      macro avg       0.96      0.95      0.95        38
   weighted avg       0.95      0.95      0.95        38







### Discussion (kNN)

In practice, kNN performance can change materially after scaling, but note:

- The “best scaler” can depend on $k$, distance metric, and the dataset.
- Scaling is not about making results *always better*; it is about making the method behave as intended.
- The correct workflow is to treat the scaler as a hyperparameter and validate it.

Let’s demonstrate this principle using cross-validation.


In [8]:

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    print(f"{name:>8}  mean={scores.mean():.4f}  std={scores.std():.4f}  scores={np.round(scores,4)}")


     raw  mean=0.9600  std=0.0389  scores=[1.     0.9667 0.9    1.     0.9333]
  minmax  mean=0.9533  std=0.0452  scores=[1.     0.9667 0.9    1.     0.9   ]
standard  mean=0.9600  std=0.0327  scores=[0.9667 0.9667 0.9    1.     0.9667]
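Treating the scaler as a hyperparameter can also be expressed directly as a grid search over the pipeline step itself. The sketch below is one possible setup, not the lesson's official configuration: it uses scikit-learn's built-in `load_iris` (assumed here in place of the repository CSV), the string `"passthrough"` to represent "no scaling", and an arbitrary small grid of `n_neighbors` values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

search_pipe = Pipeline([
    ("scaler", StandardScaler()),  # placeholder; replaced by the grid below
    ("knn", KNeighborsClassifier()),
])

# The whole "scaler" step is searched, jointly with k.
param_grid = {
    "scaler": ["passthrough", StandardScaler(), MinMaxScaler(), RobustScaler()],
    "knn__n_neighbors": [3, 5, 7, 9],
}

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(search_pipe, param_grid, cv=cv_demo, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))
```

Because the scaler is refit inside every training fold, the comparison between scalers is free of leakage. On a dataset this small, differences between candidates are often within the fold-to-fold noise, so the "winner" should be read with caution.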





## 7) Outliers and heavy tails: RobustScaler, PowerTransformer, QuantileTransformer

Real data is often not “nice.” Outliers can make standard deviation unstable and can break min–max scaling. You have several robust options:

### RobustScaler
Uses median and interquartile range (IQR):

$$
x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}
$$

### PowerTransformer (Yeo–Johnson / Box–Cox)
Transforms data to be more Gaussian-like, often improving linear model behavior.

### QuantileTransformer
Maps data to a target distribution (uniform or normal) using quantiles. It can be effective but can also distort distances; validate carefully.

We will demonstrate with the wine dataset, where some features can be skewed.


In [9]:

wine_y = (wine["quality"] >= 7).astype(int)
wine_X = wine.drop(columns=["quality"])

X_train, X_test, y_train, y_test = train_test_split(
    wine_X, wine_y, test_size=0.25, random_state=42, stratify=wine_y
)

scaler_pipes = {
    "raw": Pipeline([("clf", LogisticRegression(max_iter=4000))]),
    "standard": Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=4000))]),
    "robust": Pipeline([("scaler", RobustScaler()), ("clf", LogisticRegression(max_iter=4000))]),
    "power_yeojohnson": Pipeline([("scaler", PowerTransformer(method="yeo-johnson", standardize=True)), ("clf", LogisticRegression(max_iter=4000))]),
    "quantile_normal": Pipeline([("scaler", QuantileTransformer(output_distribution="normal", n_quantiles=200, random_state=42)), ("clf", LogisticRegression(max_iter=4000))]),
}

for name, pipe in scaler_pipes.items():
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, pred)
    print(f"{name:>16}  accuracy={acc:.4f}  positive_rate_train={y_train.mean():.3f}")


             raw  accuracy=0.8090  positive_rate_train=0.216
        standard  accuracy=0.8073  positive_rate_train=0.216
          robust  accuracy=0.8024  positive_rate_train=0.216
power_yeojohnson  accuracy=0.8049  positive_rate_train=0.216
 quantile_normal  accuracy=0.8049  positive_rate_train=0.216





### Takeaway (robust / power / quantile)

- RobustScaler is a strong default when outliers exist but you still want a linear-ish interpretation.
- Power transforms can be beneficial when skewness is strong and you want a model that behaves “more linearly.”
- Quantile transforms can work surprisingly well, but because they change the metric structure, you should validate them with CV and watch for overfitting on small datasets.
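To see what "more Gaussian-like" means in practice, the sketch below measures sample skewness before and after a Yeo–Johnson transform. The data are synthetic log-normal draws (an assumption for illustration, not the wine features), and `sample_skewness` is a small helper written for this example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Strongly right-skewed synthetic feature.
x_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000).reshape(-1, 1)

def sample_skewness(values):
    """Third standardized moment (0 for a symmetric distribution)."""
    v = np.asarray(values, dtype=float).ravel()
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_transformed = pt.fit_transform(x_skewed)

skew_before = sample_skewness(x_skewed)
skew_after = sample_skewness(x_transformed)

print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
print("fitted lambda:", np.round(pt.lambdas_, 3))
```

Like any scaler, the fitted `lambdas_` are learned parameters, so the transformer belongs inside the `Pipeline` and is fit on training data only.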

---

## 8) Scaling + SVM: why “C” and “gamma” depend on feature scale

For an RBF SVM, the kernel is:

$$
K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{y} \rVert_2^2)
$$

If you scale features, the distances change, and therefore the “effective” meaning of $\gamma$ changes.

This is why **SVM is almost always used with scaling**, and why hyperparameter search should be done with scaling included in the pipeline.
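The coupling between $\gamma$ and feature scale can be verified numerically: multiplying all features by a constant $c$ multiplies squared distances by $c^2$, so the kernel is unchanged only if $\gamma$ is divided by $c^2$. The sketch below uses random synthetic points and `sklearn.metrics.pairwise.rbf_kernel`; the values of `c` and `gamma` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))

c = 10.0      # uniform rescaling factor (e.g., a change of units)
gamma = 0.5   # arbitrary kernel width for this illustration

K_original = rbf_kernel(X_demo, gamma=gamma)
K_rescaled_same_gamma = rbf_kernel(c * X_demo, gamma=gamma)
K_rescaled_adjusted = rbf_kernel(c * X_demo, gamma=gamma / c ** 2)

# Same gamma on rescaled data gives a different kernel matrix;
# dividing gamma by c^2 restores the original one.
print("same gamma, rescaled data  :", np.allclose(K_original, K_rescaled_same_gamma))
print("gamma / c^2, rescaled data :", np.allclose(K_original, K_rescaled_adjusted))
```

`SVC(gamma="scale")` sets $\gamma = 1 / (p \cdot \mathrm{Var}(X))$, which compensates for a uniform rescaling of all features but not for features whose scales differ from one another; per-feature scaling addresses the latter.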

We will demonstrate this on `iris` with a modest SVM configuration.


In [10]:

X = iris.drop(columns=["classification"])
y = iris["classification"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

svm_raw = SVC(kernel="rbf", C=3.0, gamma="scale")
svm_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=3.0, gamma="scale"))
])

svm_raw.fit(X_train, y_train)
svm_scaled.fit(X_train, y_train)

pred_raw = svm_raw.predict(X_test)
pred_scaled = svm_scaled.predict(X_test)

print("SVM (no scaling) accuracy:", round(accuracy_score(y_test, pred_raw), 4))
print("SVM (standardized) accuracy:", round(accuracy_score(y_test, pred_scaled), 4))


SVM (no scaling) accuracy: 0.9737
SVM (standardized) accuracy: 0.9474



## 9) Data leakage: the most common scaling mistake

**Leakage** happens when information from the test set influences the training process.

A subtle scaling leakage looks like this:

1. Fit `StandardScaler` on *all data*
2. Transform train and test using that scaler
3. Evaluate

This is wrong because the scaler’s mean and standard deviation used information from the test set.

We will demonstrate the difference between:

- Incorrect scaling (fit scaler on full data)
- Correct scaling (fit scaler on train only) using `Pipeline`

We’ll use the diabetes dataset again.


In [11]:

X = diabetes.drop(columns=["classification"])
y = le.fit_transform(diabetes["classification"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# WRONG: fit scaler on all data, then split transformed data
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(X)

Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_all_scaled, y, test_size=0.25, random_state=42, stratify=y
)

lr_wrong = LogisticRegression(max_iter=2000)
lr_wrong.fit(Xa_train, ya_train)
pred_wrong = lr_wrong.predict(Xa_test)
acc_wrong = accuracy_score(ya_test, pred_wrong)

# RIGHT: scaler inside pipeline fitted on training only
lr_right = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000))
])
lr_right.fit(X_train, y_train)
pred_right = lr_right.predict(X_test)
acc_right = accuracy_score(y_test, pred_right)

print("Accuracy with leakage (wrong):", round(acc_wrong, 4))
print("Accuracy without leakage (right):", round(acc_right, 4))

# Additional sanity check: show that correct scaler stats come ONLY from training
sc = lr_right.named_steps["scaler"]
print("\nCorrect scaler mean (first 3 features):", np.round(sc.mean_[:3], 3))
print("Correct scaler var  (first 3 features):", np.round(sc.var_[:3], 3))


Accuracy with leakage (wrong): 0.7865
Accuracy without leakage (right): 0.7865

Correct scaler mean (first 3 features): [  3.856 121.705  69.559]
Correct scaler var  (first 3 features): [  11.991 1056.913  356.729]



### Professional rule (non-negotiable)

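If scaling exists at all, it must be fitted **inside** the training process. The two accuracies above happen to coincide on this split, but the leaky number is still not a valid held-out estimate, and on smaller or more skewed datasets the gap can be real. In practice: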

- Use `Pipeline` and validate with cross-validation.
- In production, persist the fitted scaler together with the model (one artifact).
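A minimal sketch of both rules follows, assuming scikit-learn's built-in `load_breast_cancer` data and `joblib` for persistence (both are stand-ins chosen for this example; the file name `scaled_logreg.joblib` is hypothetical). Cross-validation refits the scaler inside each training fold, and the fitted pipeline is saved and restored as a single object.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Rule 1: validate the whole pipeline; the scaler is refit on each training fold.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")
print("CV accuracy per fold:", np.round(cv_scores, 4))

# Rule 2: persist scaler and classifier together as one artifact.
model.fit(X_tr, y_tr)
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = Path(tmp_dir) / "scaled_logreg.joblib"  # hypothetical file name
    joblib.dump(model, artifact_path)
    restored = joblib.load(artifact_path)

# The restored pipeline applies the stored training-set mean/std automatically.
same_predictions = np.array_equal(model.predict(X_te), restored.predict(X_te))
print("Restored pipeline reproduces predictions:", same_predictions)
```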

---

## 10) Mixed data types: scale numeric, encode categoricals (drug200 example)

In tabular ML, it is common to have:

- Numeric features that require scaling (e.g., `Age`, `Na_to_K`)
- Categorical features that require encoding (`Sex`, `BP`, `Cholesterol`)

The correct approach is to build a **column-wise** pipeline:

- Numeric: `SimpleImputer` → `StandardScaler`
- Categorical: `SimpleImputer` → `OneHotEncoder`
- Model: a classifier (e.g., logistic regression)

We will do multiclass prediction for `Drug`.


In [12]:

X = drug.drop(columns=["Drug"])
y = drug["Drug"]

numeric_features = ["Age", "Na_to_K"]
categorical_features = ["Sex", "BP", "Cholesterol"]

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features),
        ("cat", categorical_pipe, categorical_features),
    ]
)

clf = LogisticRegression(max_iter=4000)

pipe = Pipeline([
    ("prep", preprocess),
    ("clf", clf)
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

print("Accuracy:", round(accuracy_score(y_test, pred), 4))
print("\nClass distribution (test):")
display(pd.Series(y_test).value_counts(normalize=True).to_frame("share").round(3))

print("\nClassification report:\n")
print(classification_report(y_test, pred))


Accuracy: 0.92

Class distribution (test):


Drug,share
DrugY,0.46
drugX,0.26
drugA,0.12
drugC,0.08
drugB,0.08



Classification report:

              precision    recall  f1-score   support

       DrugY       0.88      0.96      0.92        23
       drugA       1.00      1.00      1.00         6
       drugB       1.00      0.50      0.67         4
       drugC       1.00      1.00      1.00         4
       drugX       0.92      0.92      0.92        13

    accuracy                           0.92        50
   macro avg       0.96      0.88      0.90        50
weighted avg       0.92      0.92      0.92        50




## 11) Sparse matrices and scaling: StandardScaler vs MaxAbsScaler

One-hot encoding produces a sparse design matrix. Two important notes:

- `StandardScaler(with_mean=True)` cannot be applied directly to sparse matrices (centering would densify the matrix).
- If you need scaling with sparse features, prefer:
  - `StandardScaler(with_mean=False)` (keeps sparsity), or
  - `MaxAbsScaler` (scales by max absolute value; preserves sparsity).

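We will construct a quick “mostly sparse” design via one-hot encoding and compare scalers. Note that `ColumnTransformer` returns a dense array unless the overall density of its output is below `sparse_threshold` (default 0.3), so a small one-hot matrix like this one can come back dense; passing `sparse_threshold=1.0` keeps the result sparse.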


In [13]:

from scipy import sparse

toy = drug.sample(120, random_state=0).reset_index(drop=True)
X = toy.drop(columns=["Drug"])

# Request sparse output; the keyword is `sparse_output` in scikit-learn >= 1.2 and `sparse` in older versions
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
except TypeError:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=True)

pre_cat_only = ColumnTransformer(
    transformers=[("cat", ohe, ["Sex","BP","Cholesterol"])],
    remainder="drop"
)

X_sparse = pre_cat_only.fit_transform(X)

is_sp = sparse.issparse(X_sparse)
nnz = X_sparse.nnz if is_sp else int(np.count_nonzero(X_sparse))
total = int(X_sparse.shape[0] * X_sparse.shape[1])

print("Transformed shape:", X_sparse.shape)
print("Sparse output:", is_sp)
print("Sparsity (nnz / total):", nnz, "/", total)

# Demonstrate why centering is problematic for sparse matrices
try:
    _ = StandardScaler(with_mean=True).fit_transform(X_sparse)
    print("\nStandardScaler(with_mean=True) succeeded (likely dense input).")
except Exception as e:
    print("\nStandardScaler(with_mean=True) on sparse -> error type:", type(e).__name__)
    print("Message (short):", str(e).splitlines()[0][:140])

X_std_no_mean = StandardScaler(with_mean=False).fit_transform(X_sparse)
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

print("\nAfter StandardScaler(with_mean=False):")
print("  is_sparse:", sparse.issparse(X_std_no_mean))

print("\nAfter MaxAbsScaler:")
print("  is_sparse:", sparse.issparse(X_maxabs))


Transformed shape: (120, 7)
Sparse output: False
Sparsity (nnz / total): 360 / 840

StandardScaler(with_mean=True) succeeded (likely dense input).

After StandardScaler(with_mean=False):
  is_sparse: False

After MaxAbsScaler:
  is_sparse: False
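The output above reports `Sparse output: False`, so the centering error was never triggered in that run. The sketch below shows the behavior on a matrix that is explicitly sparse. The small `csr_matrix` is made up for illustration; the point is only that `StandardScaler(with_mean=True)` refuses sparse input, while `with_mean=False` and `MaxAbsScaler` keep the result sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Hypothetical 4x3 sparse matrix in CSR format.
X_sp = sparse.csr_matrix(np.array([
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 4.0],
    [3.0, 0.0, 0.0],
]))

# Centering would turn every zero into a non-zero, so scikit-learn raises.
try:
    StandardScaler(with_mean=True).fit_transform(X_sp)
    centering_error = None
except ValueError as exc:
    centering_error = str(exc)

print("with_mean=True on sparse input raised:", centering_error is not None)

# Scaling without centering leaves zeros as zeros.
X_sp_std = StandardScaler(with_mean=False).fit_transform(X_sp)
X_sp_maxabs = MaxAbsScaler().fit_transform(X_sp)

print("with_mean=False keeps sparsity:", sparse.issparse(X_sp_std))
print("MaxAbsScaler keeps sparsity:   ", sparse.issparse(X_sp_maxabs))
print("non-zeros before/after:", X_sp.nnz, X_sp_std.nnz, X_sp_maxabs.nnz)
```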



## 12) Scaling for PCA and k-means (clustering example with hw_200)

Both PCA and k-means rely on Euclidean geometry:

- PCA finds directions of maximum variance. If one feature has a much larger variance (often simply because of its units), it dominates the principal components.
- k-means minimizes within-cluster sum of squares, which depends directly on feature scale.

We will use the `hw_200.csv` dataset (height/weight). It has intentionally messy column names.

Steps:

1. Load the dataset
2. Clean column names
3. Compare PCA and k-means behavior with and without scaling


In [14]:

import re

hw = hw_raw.copy()
print("Original columns:", list(hw.columns))

hw.columns = [c.replace('"', '').strip() for c in hw.columns]
hw.columns = [re.sub(r"\s+", " ", c).strip() for c in hw.columns]
print("Cleaned columns:", list(hw.columns))

display(hw.head())

# robustly pick height/weight columns
hw_cols = [c for c in hw.columns if "Height" in c or "Weight" in c]
X_hw = hw[hw_cols].astype(float).values

print("\nX_hw shape:", X_hw.shape)
print("Feature means (raw):", np.round(X_hw.mean(axis=0), 4))
print("Feature stds  (raw):", np.round(X_hw.std(axis=0), 4))


Original columns: ['Index', ' Height(Inches)"', ' "Weight(Pounds)"']
Cleaned columns: ['Index', 'Height(Inches)', 'Weight(Pounds)']


Unnamed: 0,Index,Height(Inches),Weight(Pounds)
0,1,65.78,112.99
1,2,71.52,136.49
2,3,69.4,153.03
3,4,68.22,142.34
4,5,67.79,144.3



X_hw shape: (200, 2)
Feature means (raw): [ 67.9498 127.222 ]
Feature stds  (raw): [ 1.9355 11.931 ]


In [15]:

scaler = StandardScaler()

# PCA without scaling
pca_raw = PCA(n_components=2, random_state=0)
Z_raw = pca_raw.fit_transform(X_hw)

# PCA with scaling
X_hw_scaled = scaler.fit_transform(X_hw)
pca_scaled = PCA(n_components=2, random_state=0)
Z_scaled = pca_scaled.fit_transform(X_hw_scaled)

print("Explained variance ratio (PCA raw):", np.round(pca_raw.explained_variance_ratio_, 4))
print("Explained variance ratio (PCA scaled):", np.round(pca_scaled.explained_variance_ratio_, 4))

km_raw = KMeans(n_clusters=3, random_state=0, n_init=10)
labels_raw = km_raw.fit_predict(X_hw)

km_scaled = KMeans(n_clusters=3, random_state=0, n_init=10)
labels_scaled = km_scaled.fit_predict(X_hw_scaled)

print("\nCluster sizes (raw):", np.bincount(labels_raw))
print("Cluster sizes (scaled):", np.bincount(labels_scaled))

centers_raw = km_raw.cluster_centers_
centers_scaled_back = scaler.inverse_transform(km_scaled.cluster_centers_)

centers_df = pd.DataFrame(
    np.vstack([centers_raw, centers_scaled_back]),
    columns=[c.replace("(Inches)", "").replace("(Pounds)", "") for c in hw_cols]
)
centers_df.index = ["raw_c0","raw_c1","raw_c2","scaled_c0_back","scaled_c1_back","scaled_c2_back"]
display(centers_df.round(2))


Explained variance ratio (PCA raw): [0.9825 0.0175]
Explained variance ratio (PCA scaled): [0.7784 0.2216]

Cluster sizes (raw): [93 44 63]
Cluster sizes (scaled): [51 59 90]


Unnamed: 0,Height,Weight
raw_c0,67.59,126.05
raw_c1,66.66,110.46
raw_c2,69.37,140.66
scaled_c0_back,70.3,139.28
scaled_c1_back,65.99,114.42
scaled_c2_back,67.9,128.78





## 13) Feature scaling and regularization: why Ridge depends on units

Ridge regression:

$$
\min_{\mathbf{w}} \; \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \lambda \lVert \mathbf{w} \rVert_2^2
$$

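The penalty acts on the coefficients $\mathbf{w}$, and a coefficient's size depends on the units of its feature: if feature $j$ is multiplied by a constant $c$, the coefficient that produces the same predictions is $w_j / c$. A feature recorded in large numbers therefore gets a small coefficient and is barely penalized, while a feature recorded in small numbers is penalized heavily. Scaling puts all features on a comparable footing, so the regularization term behaves more “fairly” across features.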
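The unit dependence can be demonstrated on synthetic data (an assumption for illustration; the names and the value of `alpha` are arbitrary). The same feature is expressed once in a "metres-like" unit and once multiplied by 1000, as if converted to millimetres. Plain Ridge gives different predictions for the two encodings, while a pipeline with `StandardScaler` gives the same predictions for both.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic regression problem with two features.
X_m = rng.normal(size=(n, 2))
y_demo = 3.0 * X_m[:, 0] + 2.0 * X_m[:, 1] + rng.normal(scale=0.5, size=n)

# Same information, but feature 0 is recorded in a 1000x smaller unit.
X_mm = X_m * np.array([1000.0, 1.0])

alpha = 50.0  # arbitrary penalty strength for this illustration

# Plain Ridge: the penalty hits feature 0 very differently in the two encodings.
pred_plain_m = Ridge(alpha=alpha).fit(X_m, y_demo).predict(X_m)
pred_plain_mm = Ridge(alpha=alpha).fit(X_mm, y_demo).predict(X_mm)

def make_scaled_ridge():
    """Ridge preceded by standardization (fresh, unfitted pipeline)."""
    return Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=alpha))])

pred_scaled_m = make_scaled_ridge().fit(X_m, y_demo).predict(X_m)
pred_scaled_mm = make_scaled_ridge().fit(X_mm, y_demo).predict(X_mm)

plain_gap = float(np.max(np.abs(pred_plain_m - pred_plain_mm)))
scaled_gap = float(np.max(np.abs(pred_scaled_m - pred_scaled_mm)))

print(f"max prediction gap, plain Ridge:        {plain_gap:.4f}")
print(f"max prediction gap, standardized Ridge: {scaled_gap:.2e}")
```

With standardization, the choice of units no longer changes the fitted model, so `alpha` has a stable meaning across features and across re-encodings of the same data.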

### Example: Ridge regression on house prices

We will build a mixed-type pipeline (numeric scaling + categorical one-hot) and compare Ridge with and without numeric scaling.


In [16]:

house = pd.read_csv(Path("../../../Datasets/Regression/house-prices.csv"))
X = house.drop(columns=["Price"])
y = house["Price"].astype(float)

numeric_features = ["SqFt","Bedrooms","Bathrooms","Offers"]
categorical_features = ["Brick","Neighborhood"]

numeric_no_scale = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

numeric_scaled = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

pre_A = ColumnTransformer([
    ("num", numeric_no_scale, numeric_features),
    ("cat", cat_pipe, categorical_features)
])

pre_B = ColumnTransformer([
    ("num", numeric_scaled, numeric_features),
    ("cat", cat_pipe, categorical_features)
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# RMSE helper (compatible across scikit-learn versions)
try:
    # root_mean_squared_error was added in scikit-learn 1.4
    from sklearn.metrics import root_mean_squared_error
    def rmse(y_true, y_pred):
        return float(root_mean_squared_error(y_true, y_pred))
except ImportError:
    # Older versions: take the square root of the MSE (np is imported at the top of the notebook)
    from sklearn.metrics import mean_squared_error
    def rmse(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# (A) Same alpha for both pipelines: shows alpha depends on feature scale
alpha_fixed = 10.0
pipe_A_fixed = Pipeline([("prep", pre_A), ("model", Ridge(alpha=alpha_fixed, random_state=0))])
pipe_B_fixed = Pipeline([("prep", pre_B), ("model", Ridge(alpha=alpha_fixed, random_state=0))])

pipe_A_fixed.fit(X_train, y_train)
pipe_B_fixed.fit(X_train, y_train)

pred_A = pipe_A_fixed.predict(X_test)
pred_B = pipe_B_fixed.predict(X_test)

rmse_A = rmse(y_test, pred_A)
rmse_B = rmse(y_test, pred_B)
r2_A = r2_score(y_test, pred_A)
r2_B = r2_score(y_test, pred_B)

print(f"Fixed alpha={alpha_fixed}")
print("  Ridge (no numeric scaling):  RMSE =", round(rmse_A, 2), "  R2 =", round(r2_A, 4))
print("  Ridge (with scaling):       RMSE =", round(rmse_B, 2), "  R2 =", round(r2_B, 4))

# (B) Fair comparison: tune alpha separately
alphas = np.logspace(-3, 4, 12)

def tune_ridge(preprocessor, name):
    pipe = Pipeline([("prep", preprocessor), ("model", Ridge(random_state=0))])
    gs = GridSearchCV(pipe, {"model__alpha": alphas}, cv=5, scoring="neg_root_mean_squared_error")
    gs.fit(X_train, y_train)
    best = gs.best_estimator_
    pred = best.predict(X_test)
    return {
        "name": name,
        "best_alpha": gs.best_params_["model__alpha"],
        "test_rmse": rmse(y_test, pred),
        "test_r2": r2_score(y_test, pred),
        "cv_rmse": -gs.best_score_,
    }

res_A = tune_ridge(pre_A, "No scaling (tuned alpha)")
res_B = tune_ridge(pre_B, "Scaled numeric (tuned alpha)")

print("\nAfter tuning alpha (fair comparison):")
for r in [res_A, res_B]:
    print(f"  {r['name']}: best_alpha={r['best_alpha']:.4g}  CV_RMSE={r['cv_rmse']:.2f}  Test_RMSE={r['test_rmse']:.2f}  Test_R2={r['test_r2']:.4f}")


Fixed alpha=10.0
  Ridge (no numeric scaling):  RMSE = 11290.03   R2 = 0.7921
  Ridge (with scaling):       RMSE = 10347.16   R2 = 0.8253

After tuning alpha (fair comparison):
  No scaling (tuned alpha): best_alpha=0.3511  CV_RMSE=9992.95  Test_RMSE=10334.16  Test_R2=0.8258
  Scaled numeric (tuned alpha): best_alpha=0.3511  CV_RMSE=9992.66  Test_RMSE=10291.81  Test_R2=0.8272



## 14) Scaler as a hyperparameter: a small grid search (Iris + kNN)

At an advanced level, you should *validate the scaler choice* rather than assuming it.

Here we compare:

- No scaler
- Min–max
- Standardization
- Robust scaling

We will do a small `GridSearchCV` where the scaler is part of the pipeline.

This is a practical pattern you can reuse for many models.


In [17]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning, message=".*invalid value encountered in cast.*")


X = iris.drop(columns=["classification"])
y = iris["classification"]

pipe = Pipeline([
    ("scaler", "passthrough"),
    ("knn", KNeighborsClassifier())
])

param_grid = {
    "scaler": ["passthrough", StandardScaler(), MinMaxScaler(), RobustScaler()],
    "knn__n_neighbors": [3,5,7,9,11],
    "knn__weights": ["uniform", "distance"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(pipe, param_grid=param_grid, cv=cv, scoring="accuracy", n_jobs=None)
gs.fit(X, y)

print("Best CV accuracy:", round(gs.best_score_, 4))
print("Best params:")
for k, v in gs.best_params_.items():
    print(" ", k, "=", v)


Best CV accuracy: 0.9733
Best params:
  knn__n_neighbors = 11
  knn__weights = uniform
  scaler = passthrough
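
`best_params_` reports only the single winning configuration. One way to see how close the other scalers came is to summarize `cv_results_` per scaler. The sketch below is self-contained and makes two assumptions: it uses scikit-learn's bundled Iris data in place of the notebook's `iris` DataFrame, and it searches a reduced grid. The `demo_*` names are hypothetical, and the numbers need not match the output above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Assumption: the bundled Iris data stands in for the notebook's `iris` DataFrame.
X_demo, y_demo = load_iris(return_X_y=True)

demo_pipe = Pipeline([("scaler", "passthrough"), ("knn", KNeighborsClassifier())])
demo_grid = {
    "scaler": ["passthrough", StandardScaler(), MinMaxScaler(), RobustScaler()],
    "knn__n_neighbors": [3, 5, 7, 9, 11],
}
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_gs = GridSearchCV(demo_pipe, demo_grid, cv=demo_cv, scoring="accuracy")
demo_gs.fit(X_demo, y_demo)

results = pd.DataFrame(demo_gs.cv_results_)
# Turn each scaler object into a readable label ("passthrough" is already a string).
results["scaler_name"] = results["param_scaler"].map(
    lambda s: s if isinstance(s, str) else type(s).__name__
)

# For each scaler, keep the row with the highest mean CV accuracy,
# together with that configuration's fold-to-fold standard deviation.
best_rows = results.loc[results.groupby("scaler_name")["mean_test_score"].idxmax()]
scaler_summary = (
    best_rows[["scaler_name", "param_knn__n_neighbors", "mean_test_score", "std_test_score"]]
    .sort_values("mean_test_score", ascending=False)
    .reset_index(drop=True)
)
print(scaler_summary)
```

If the gap between scalers is smaller than `std_test_score`, the ranking is probably within cross-validation noise and should not be over-interpreted.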



## 15) Choosing a scaler: a practitioner’s decision table

There is no single best scaler. Choose based on:

### Algorithm sensitivity
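- kNN, k-means, SVM, PCA: scaling is typically essential whenever features differ in units or range. When all features already share a unit and a similar range, as in the Iris search above, unscaled data can do just as well.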
- Logistic/linear regression with regularization: scaling is strongly recommended.
- Trees: splits depend only on the ordering of each feature's values, so scaling is often not required, but it can still be part of a consistent pipeline.

### Data distribution
- Approximately symmetric, not too many outliers → `StandardScaler`
- Outliers/heavy tails → `RobustScaler`
- Strict bounds needed → `MinMaxScaler`
- Sparse features → `MaxAbsScaler` or `StandardScaler(with_mean=False)`
- Strong skewness → `PowerTransformer` or `QuantileTransformer` (validate carefully)
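
As a small illustration of the outlier rule above, the sketch below scales a toy feature (hypothetical values: nine typical points plus one extreme value) with `StandardScaler` and with `RobustScaler`, then measures how much spread the typical points keep.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy feature: nine typical values and one extreme outlier.
x = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 1000], dtype=float).reshape(-1, 1)

z_std = StandardScaler().fit_transform(x).ravel()  # uses mean and standard deviation
z_rob = RobustScaler().fit_transform(x).ravel()    # uses median and interquartile range

# Spread (max minus min) of the nine typical points after scaling.
spread_std = float(z_std[:9].max() - z_std[:9].min())
spread_rob = float(z_rob[:9].max() - z_rob[:9].min())

print("StandardScaler spread of typical points:", spread_std)
print("RobustScaler spread of typical points:  ", spread_rob)
```

The outlier inflates the standard deviation, so `StandardScaler` squeezes the typical points into a very narrow band. The median and interquartile range are barely affected, so `RobustScaler` keeps those points well separated. The outlier itself is still extreme after either transform, because neither scaler clips values.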

### Model governance and deployment
- Fit scalers only on training data.
- Persist preprocessing + model as one pipeline artifact.
- Document the scaler choice as part of the model card / experiment record.
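
One possible way to persist preprocessing and model together is to serialize the fitted `Pipeline` with `joblib`. This is a minimal sketch under stated assumptions: it uses scikit-learn's bundled Iris data, a hypothetical file name, and a temporary directory in place of a real artifact store.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42, stratify=y_all
)

model = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
model.fit(X_tr, y_tr)  # scaler statistics are learned from the training rows only

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = Path(tmp_dir) / "scaled_knn_pipeline.joblib"  # hypothetical file name
    joblib.dump(model, artifact_path)      # one file holds the scaler and the model
    restored = joblib.load(artifact_path)

# The restored pipeline applies the stored training statistics before predicting.
same_predictions = bool(np.array_equal(model.predict(X_te), restored.predict(X_te)))
print("Restored pipeline reproduces predictions:", same_predictions)
```

Because the scaler travels inside the artifact, serving code cannot accidentally refit it or skip it. Such files should be loaded only from trusted sources, ideally with the same scikit-learn version that produced them.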

---

## 16) Exercises (recommended)

1. On the `diabetes` dataset, compare `StandardScaler` vs `RobustScaler` for `LogisticRegression` using 5-fold CV.
2. On `wine`, try an SVM with and without scaling and observe sensitivity to `gamma`.
3. On `hw_200`, experiment with `MinMaxScaler` vs `StandardScaler` for k-means and compare cluster centers.
4. Implement a small hyperparameter search where the scaler itself is part of the search space.

---

### Summary

Scaling is not cosmetic. It defines the metric structure your model learns from, controls regularization fairness, and can prevent numerical instability. Use pipelines, avoid leakage, and validate scaler choice just like any other modeling hyperparameter.
