
# 📘 Chapter 5: Preparing Data

This chapter explains how data are prepared for machine learning: handling missing values, encoding categorical variables, scaling features, splitting into train/validation/test sets, and building leak-free preprocessing pipelines.



## 5.1 Importance of Data Preparation

Model performance depends strongly on data quality. Effective preparation improves learnability and generalization by ensuring features are informative, comparable in scale, and consistently transformed across training and test data.



## 5.2 Handling Missing Values

Real data often contain missing entries.

**Strategies**
- **Removal**: drop rows/columns containing many missing values (risking information loss).
- **Imputation**: replace missing values with a statistic (mean/median/mode) or model-based estimates.

**Mean imputation (per feature $j$)**, applied to each missing entry $x^{(i)}_j$:
$$
x^{(i)}_j \leftarrow \frac{1}{\left|\{k : x^{(k)}_j \text{ observed}\}\right|} \sum_{k:\, x^{(k)}_j \text{ observed}} x^{(k)}_j.
$$
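The formula above can be sketched in code. The following is a minimal illustration, assuming scikit-learn's `SimpleImputer` and small made-up arrays (`X_train` and `X_test` are hypothetical, not part of the chapter's data). The feature means are learned from the training rows and then reused for the test row.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data; np.nan marks a missing entry.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0]])

# Per-feature mean over observed entries only, as in the formula.
manual_means = np.nanmean(X_train, axis=0)  # [2.0, 15.0]

# SimpleImputer learns the same means in fit() and reuses them in transform().
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # uses the training means

print(imputer.statistics_)  # learned per-feature means
print(X_train_imputed)
print(X_test_imputed)
```

The missing test value in the first feature is filled with the training mean (2.0), not with a statistic computed from the test data.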



## 5.3 Encoding Categorical Variables

Most learning algorithms require numerical features.

- **One‑hot encoding**: creates a binary indicator per category (no ordinal assumptions).
- **Ordinal encoding**: maps ordered categories to integers (use only when order is meaningful).

**Example (one‑hot):** Color $\in \{\text{Red}, \text{Blue}, \text{Green}\}$  
$\rightarrow$ $[1,0,0],\ [0,1,0],\ [0,0,1]$.
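Both encodings can be tried with scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The sketch below fixes the category order explicitly so that the one-hot columns match the example above. The `sizes` feature and its ordering Small < Medium < Large are hypothetical, added only to illustrate ordinal encoding.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-hot: one indicator column per category, in the order Red, Blue, Green.
colors = np.array([["Red"], ["Blue"], ["Green"]])
onehot = OneHotEncoder(categories=[["Red", "Blue", "Green"]])
color_codes = onehot.fit_transform(colors).toarray()  # sparse matrix -> dense
print(color_codes)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

# Ordinal: integers that respect a meaningful order (hypothetical feature).
sizes = np.array([["Small"], ["Large"], ["Medium"]])
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal.fit_transform(sizes)
print(size_codes.ravel())  # [0. 2. 1.]
```

If `categories` is left at its default, scikit-learn sorts the categories, so the column order would be Blue, Green, Red instead.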



## 5.4 Feature Scaling

Algorithms that rely on distances or gradient-based optimization are scale-sensitive.

- **Normalization (min–max)**
$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \in [0,1].
$$

- **Standardization (z‑score)**
$$
x' = \frac{x - \mu}{\sigma},
$$
where $\mu$ and $\sigma$ are the feature mean and standard deviation computed on the **training set only**.
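A short sketch of both scalings follows, assuming scikit-learn's `MinMaxScaler` and `StandardScaler` and a hypothetical one-feature dataset. The scalers are fitted on the training values and then applied unchanged to a test value, which illustrates that test data can fall outside $[0,1]$ after min–max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])

# Min-max: x_min = 1 and x_max = 4 are learned from the training set.
minmax = MinMaxScaler().fit(X_train)
X_train_minmax = minmax.transform(X_train)  # 0, 1/3, 2/3, 1
X_test_minmax = minmax.transform(X_test)    # (5 - 1) / (4 - 1) = 4/3

# Standardization: mu = 2.5 and sigma = sqrt(1.25) come from the training set.
# StandardScaler uses the population standard deviation (ddof = 0).
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)     # (5 - 2.5) / sqrt(1.25)

print(X_test_minmax, X_test_std)
```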



## 5.5 Splitting Data

To estimate generalization performance and tune hyperparameters, data are partitioned into **training**, **validation**, and **test** subsets:

$$
D = D_{\text{train}} \cup D_{\text{val}} \cup D_{\text{test}}, \qquad
D_{\text{train}} \cap D_{\text{val}} = D_{\text{train}} \cap D_{\text{test}} = D_{\text{val}} \cap D_{\text{test}} = \varnothing.
$$

The model is fit on $D_{\text{train}}$, tuned on $D_{\text{val}}$, and finally assessed once on $D_{\text{test}}$.
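The partition property can be checked directly on row indices. The sketch below assumes a hypothetical dataset of 100 rows and a 60/20/20 split obtained by calling scikit-learn's `train_test_split` twice. The sizes and random seed are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the rows of D (hypothetical size).
indices = np.arange(100)

# First split off 40% of the rows, then halve that part into validation and test.
train_idx, temp_idx = train_test_split(indices, test_size=0.4, random_state=0)
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=0)

train_set, val_set, test_set = set(train_idx), set(val_idx), set(test_idx)

# The three subsets share no rows and together cover D.
pairwise_disjoint = (
    train_set.isdisjoint(val_set)
    and train_set.isdisjoint(test_set)
    and val_set.isdisjoint(test_set)
)
covers_all = (train_set | val_set | test_set) == set(indices)

print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
print(pairwise_disjoint, covers_all)                # True True
```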



## 5.6 Pipelines and Data Leakage

**Data leakage** occurs when information from outside the training data influences the model fit. To prevent leakage, preprocessing operations (e.g., scaling, encoding, imputation) must be **fitted on the training set only** and then applied to validation/test sets using the learned parameters.  
Scikit‑learn **`Pipeline`** and **`ColumnTransformer`** help enforce this discipline.
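As a minimal sketch of this discipline, the example below combines imputation, scaling, and one-hot encoding in a single `ColumnTransformer`. The columns `height` and `color` and all values are hypothetical. Every statistic (imputation mean, scaling parameters, category list) is learned in `fit` from the training frame and merely applied to the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test frames with one numeric and one categorical column.
train = pd.DataFrame({
    "height": [1.0, np.nan, 3.0, 4.0],
    "color": ["Red", "Blue", "Green", "Red"],
})
test = pd.DataFrame({"height": [np.nan], "color": ["Purple"]})

# Numeric branch: impute with the training mean, then standardize.
numeric = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical branch: categories unseen during fit are encoded as all zeros.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, ["height"]),
        ("cat", categorical, ["color"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

preprocessor.fit(train)                        # parameters learned from train only
test_features = preprocessor.transform(test)   # learned parameters reused
print(test_features)
```

The missing test height is replaced by the training mean and therefore standardizes to zero, and the unseen category `Purple` yields an all-zero indicator vector. No statistic is ever computed from the test frame.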



## 5.7 Hands‑On: Preprocessing the Iris Dataset

The following example demonstrates a common workflow: split into train/validation/test, scale numerical features inside a pipeline, and evaluate a classifier. If packages are missing, install them first.


```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Load dataset as a pandas DataFrame (features) and Series (target)
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

# Train/validation/test split: 60/20/20 with stratification
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# All features are numeric in Iris; demonstrate a numeric pipeline.
numeric_features = X.columns.tolist()
numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ],
    remainder='drop'
)

clf = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LogisticRegression(max_iter=500))
])

# Fit on training set only; the scaler's mean and std come from X_train
clf.fit(X_train, y_train)

print("Validation accuracy:", clf.score(X_val, y_val))
print("Test accuracy:", clf.score(X_test, y_test))
```

```
Validation accuracy: 0.9333333333333333
Test accuracy: 0.9333333333333333
```



## 5.8 Summary

- Data preparation includes handling missing values, encoding categorical variables, scaling, and careful splitting.  
- Proper use of pipelines avoids leakage by fitting preprocessing only on the training set.  
- The Iris example illustrated a reproducible preprocessing‑plus‑model pipeline evaluated on validation and test sets.

