## Lecture: A Comprehensive Machine Learning Workflow with Pipeline

**Objective:** To build a complete, automated, and robust ML workflow using Scikit-learn's `Pipeline`. 

The lecture covers the following tools and concepts:
1.  **Foundation:** `train_test_split` (with key parameters).
2.  **Basic `Pipeline`:** Understanding the workflow (Numerical-Only Example).
3.  **`ColumnTransformer`:** Handling numerical and categorical data in parallel.
4.  **Full `Pipeline`:** Chaining all steps:
    - **Data Cleaning** (`SimpleImputer`)
    - **Data Transform** (`StandardScaler`, `OneHotEncoder`, `OrdinalEncoder`)
    - **Feature Selection** (`SelectKBest`)
    - **Modeling** (e.g., `LogisticRegression`)
5.  **`KFold` and `cross_val_score`:** Reliably evaluating the model.
6.  **`joblib`:** Saving and loading the trained pipeline for reuse.

### Part 1: Initialization: Load Libraries and Sample Data

In [22]:
import pandas as pd
import numpy as np
import joblib

# 1. Splitting & Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, KFold

# 2. Pipeline Construction
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# 3. Preprocessing Steps
from sklearn.impute import SimpleImputer       # Data Cleaning
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder # Data Transform
from sklearn.feature_selection import SelectKBest, f_classif # Feature Selection

# 4. Model (Modeling)
from sklearn.linear_model import LogisticRegression

# 5. Metrics
from sklearn.metrics import accuracy_score

# --- Create Sample Data ---
# We create a small mixed-type dataset with missing values, numeric columns, and categorical columns
data = {
    'age': [25, 30, 45, 55, np.nan, 35, 60, 65, 70, 22, 48, 52],
    'income': [50000, 60000, 100000, 80000, 120000, 75000, np.nan, 200000, 180000, 45000, 90000, 110000],
    'city': ['Hanoi', 'HCMC', 'Hanoi', 'Danang', 'HCMC', 'Danang', 'Hanoi', 'HCMC', 'Hanoi', 'Danang', 'HCMC', 'Hanoi'],
    'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor', 'PhD', 'Master', 'PhD', 'Bachelor', 'Bachelor', 'Master', 'PhD'],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'target': [0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1]
}
df = pd.DataFrame(data)

print("Sample data created:")
df.head()

Sample data created:


    age    income    city education  gender  target
0  25.0   50000.0   Hanoi  Bachelor    Male       0
1  30.0   60000.0    HCMC    Master  Female       1
2  45.0  100000.0   Hanoi       PhD    Male       1
3  55.0   80000.0  Danang    Master  Female       0
4   NaN  120000.0    HCMC  Bachelor    Male       1


#### ► Review Questions (Part 1)

1.  Which module is imported for building the main workflow? (`sklearn.pipeline`)
2.  Which module provides the tool for applying different transformations to different column types? (`sklearn.compose`)
3.  What is the purpose of importing `SimpleImputer`? (For data cleaning / handling missing values)

### Part 2: `train_test_split` (The Foundation)

This is the first **MANDATORY** step. We must separate the Test set from the training process to avoid **Data Leakage**. All `fit` operations (for imputers, scalers, models, etc.) must only be performed on `X_train`.
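
The rule above can be illustrated with a minimal sketch. It uses a small hypothetical array (`values`, with one deliberately extreme value) rather than the lecture's dataset, and compares a scaler fitted on the training rows only with a "leaky" scaler fitted on all rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, used only for illustration
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

train, test = train_test_split(values, test_size=0.2, random_state=42)

# Correct: statistics are learned from the training rows only
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and std

# Leaky: statistics are computed on all rows, so the test rows influence them
leaky_scaler = StandardScaler().fit(values)

print("Mean learned from train only:", scaler.mean_[0])
print("Mean learned from all rows  :", leaky_scaler.mean_[0])
```

The two means differ, which shows that fitting on the full data lets information from the test rows shape the preprocessing.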

**Key parameters of `train_test_split`:**
* `X, y`: The features (X) and target (y) data.
* `test_size=0.2`: The proportion of data to reserve for the test set (e.g., 20%).
* `random_state=42`: A seed for the random number generator so that the split is identical on every run. This is crucial for reproducibility.
* `stratify=y`: (Very important for Classification) Keeps the proportion of classes (e.g., 0s and 1s) in the `train` and `test` sets as close as possible to that of the original dataset. Especially useful for imbalanced data.
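
The effect of `stratify` is easiest to see on imbalanced labels. The sketch below is an illustration on hypothetical data (`X_imb`, `y_imb` with 90 zeros and 10 ones), not on the lecture's dataset. It runs the same split with and without stratification and prints the share of class 1 in each test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 zeros and 10 ones
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# Same split size and seed, with stratification
_, _, _, y_test_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb)

# Same split size and seed, without stratification
_, _, _, y_test_plain = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42)

print("Share of class 1 overall          :", y_imb.mean())
print("Share in test set (stratified)    :", y_test_strat.mean())
print("Share in test set (not stratified):", y_test_plain.mean())
```

With stratification the 20-row test set contains exactly 2 rows of class 1, matching the 10% share in the full data. Without it, the share depends on the random draw.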

In [23]:
# Split X (features) and y (target)
X = df.drop('target', axis=1)
y = df['target']

# Perform the data split
# Use stratify=y to preserve the class proportions in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=y)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train proportions: \n{y_train.value_counts(normalize=True)}")
print(f"y_test proportions: \n{y_test.value_counts(normalize=True)}")

X_train shape: (9, 5)
X_test shape: (3, 5)
y_train proportions: 
target
1    0.555556
0    0.444444
Name: proportion, dtype: float64
y_test proportions: 
target
1    0.666667
0    0.333333
Name: proportion, dtype: float64


#### ► Review Questions (Part 2)

1.  Why must we `train_test_split` *before* any preprocessing (like scaling or imputation)? (To prevent Data Leakage from the test set into the training process).
2.  What is the purpose of the `stratify=y` parameter? (To keep the class proportions in `y_train` and `y_test` as close as possible to those in the original `y`).
3.  What happens if you forget to set `random_state`? (You will get a different split every time you run the code, making results not reproducible).

### Part 3: The Basic `Pipeline` (Numerical-Only Example)

Before we handle complex, mixed data, let's understand how a `Pipeline` works in a simple case.

Imagine a `Pipeline` as an assembly line. We will create one for **numerical-only data**.
Our assembly line will have 3 steps:
1.  **Impute:** Fill missing values (`SimpleImputer`).
2.  **Scale:** Standardize the data (`StandardScaler`).
3.  **Model:** Train the model (`LogisticRegression`).

When we call `.fit()` on the pipeline, it calls `fit_transform` on Step 1, passes the result to Step 2 for another `fit_transform`, and finally passes that result to Step 3 for `fit`.
When we call `.score()` or `.predict()`, Steps 1 & 2 only `transform` the data (using what they learned during `.fit()`) before Step 3 predicts, which prevents data leakage.
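
To make this concrete, here is a minimal sketch that writes out by hand what the pipeline does and compares it with an actual `Pipeline`. The arrays `X_tr`, `y_tr`, and `X_te` are hypothetical stand-ins for the numeric training and test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with missing values
X_tr = np.array([[25, 50000], [30, np.nan], [45, 100000],
                 [np.nan, 80000], [60, 120000], [35, 75000]], dtype=float)
y_tr = np.array([0, 0, 1, 0, 1, 1])
X_te = np.array([[40, np.nan], [np.nan, 60000]], dtype=float)

# --- What Pipeline.fit does, written out by hand ---
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
model = LogisticRegression(random_state=42)

step1 = imputer.fit_transform(X_tr)   # Step 1: fit + transform
step2 = scaler.fit_transform(step1)   # Step 2: fit + transform
model.fit(step2, y_tr)                # Step 3: fit only

# --- What Pipeline.predict_proba does: transform only, then predict ---
manual_proba = model.predict_proba(scaler.transform(imputer.transform(X_te)))

# --- The same workflow as a Pipeline ---
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42)),
])
pipe.fit(X_tr, y_tr)
pipe_proba = pipe.predict_proba(X_te)

print("Manual steps and Pipeline agree:", np.allclose(manual_proba, pipe_proba))
```

The manual version works, but every call site has to repeat the steps in the right order. The `Pipeline` keeps that order in one object.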

In [24]:
# 1. Define the steps for the simple pipeline
# This pipeline only works on numerical data
simple_numerical_steps = [
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42))
]

# 2. Create the simple Pipeline object
simple_num_pipeline = Pipeline(steps=simple_numerical_steps)

# 3. Select only the numerical features for this example
numeric_features = ['age', 'income'] # From Part 1

# 4. Fit the pipeline on the numerical training data
simple_num_pipeline.fit(X_train[numeric_features], y_train)

# 5. Score the pipeline on the numerical test data
score = simple_num_pipeline.score(X_test[numeric_features], y_test)

print(f"Simple numerical-only pipeline accuracy: {score:.4f}")
print("This shows the basic concept. Now, how do we add the categorical columns? -> ColumnTransformer")

Simple numerical-only pipeline accuracy: 1.0000
This shows the basic concept. Now, how do we add the categorical columns? -> ColumnTransformer


#### ► Review Questions (Part 3)

1.  What is the main benefit of using a `Pipeline` even for this simple case? (It bundles all steps into one object and automatically prevents data leakage. The `imputer` and `scaler` are `fit` on training data and only `transform` test data).
2.  What is the required format for the `steps` parameter in a `Pipeline`? (A list of tuples, where each tuple is `(name, transformer_or_estimator)`).
3.  What would happen if we tried to `.fit()` this `simple_num_pipeline` on the full `X_train`? (It would fail with an error: the first step, `SimpleImputer(strategy='median')`, cannot compute a median for string columns like 'city' or 'education', and `StandardScaler` could not scale them either).

### Part 4: `ColumnTransformer` (Handling Mixed Data Types)

The `simple_num_pipeline` from Part 3 cannot handle mixed data types. The solution is `ColumnTransformer`, which applies a different transformer to each group of columns.

Our data has 3 types: numerical (`age`, `income`), nominal categorical (`city`, `gender`), and ordinal categorical (`education`).
* **Numerical Data:** Needs imputation (Cleaning) and scaling (Transform).
* **Nominal Data:** Needs imputation (Cleaning) and One-Hot encoding (Transform).
* **Ordinal Data:** Needs imputation (Cleaning) and Ordinal encoding (Transform).

In [25]:
# 1. Define column lists
numeric_features = ['age', 'income']
nominal_features = ['city', 'gender'] # Nominal columns (no order)
ordinal_features = ['education']     # Ordinal columns (has order)

# 2. Create a sub-pipeline for NUMERICAL data
numeric_transformer = Pipeline(steps=[
    # SimpleImputer: Fills missing values (NaN)
    # strategy='median': Fill with the median value (robust to outliers)
    ('imputer', SimpleImputer(strategy='median')), 
    # StandardScaler: Scale the data (mean=0, std=1)
    ('scaler', StandardScaler())                  
])

# 3. Create a sub-pipeline for CATEGORICAL (NOMINAL) data
nominal_transformer = Pipeline(steps=[
    # strategy='most_frequent': Fill missing values with the most common value
    ('imputer', SimpleImputer(strategy='most_frequent')), 
    # OneHotEncoder: Convert 'Hanoi', 'HCMC' into 0/1 columns
    # handle_unknown='ignore': If an unknown value (e.g., 'Haiphong') is seen in the test set, ignore it (all new columns will be 0)
    ('onehot', OneHotEncoder(handle_unknown='ignore'))   
])

# 4. Create a sub-pipeline for CATEGORICAL (ORDINAL) data
# Define the order for the 'education' column
education_order = ['Bachelor', 'Master', 'PhD']

ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # OrdinalEncoder: Convert 'Bachelor' -> 0, 'Master' -> 1, 'PhD' -> 2
    # categories=[education_order]: Specify the desired order
    # handle_unknown='use_encoded_value', unknown_value=-1: If an unknown value is found, assign it -1
    ('ordinal', OrdinalEncoder(categories=[education_order], 
                                handle_unknown='use_encoded_value', 
                                unknown_value=-1)) 
])

# 5. Combine with ColumnTransformer
# ColumnTransformer takes a list of 'transformers'
# Each transformer is a tuple: (name, sub_pipeline, list_of_columns_to_apply_to)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat_nominal', nominal_transformer, nominal_features), # Pipeline for nominal columns
        ('cat_ordinal', ordinal_transformer, ordinal_features)  # Pipeline for ordinal columns
    ])

print("'preprocessor' created successfully with OrdinalEncoder for 'education'.")

'preprocessor' created successfully with OrdinalEncoder for 'education'.
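
One way to check what such a preprocessor produces is to look at the shape of its output. The sketch below rebuilds a similar `ColumnTransformer` on a small hypothetical frame (`demo`) so that it runs on its own. The name `demo_preprocessor` is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Small hypothetical frame with the same column types as the lecture's data
demo = pd.DataFrame({
    'age': [25, np.nan, 45, 55],
    'income': [50000, 60000, np.nan, 80000],
    'city': ['Hanoi', 'HCMC', 'Danang', 'Hanoi'],
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'education': ['Bachelor', 'Master', 'PhD', 'Master'],
})

demo_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['age', 'income']),
    ('cat_nominal', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                              ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     ['city', 'gender']),
    ('cat_ordinal', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                              ('ordinal', OrdinalEncoder(
                                  categories=[['Bachelor', 'Master', 'PhD']]))]),
     ['education']),
])

transformed = demo_preprocessor.fit_transform(demo)
# ColumnTransformer may return a sparse matrix; convert to dense for inspection
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

# 2 scaled numeric + 3 city + 2 gender one-hot columns + 1 ordinal column
print("Output shape:", transformed.shape)
print("Ordinal 'education' column:", transformed[:, -1])
```

The output columns follow the order of the `transformers` list, so the ordinal `education` column comes last.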


#### ► Review Questions (Part 4)

1.  Why can't we just use a single `Pipeline` for all columns in this example? (Because different columns need different transformations, e.g., `StandardScaler` for numbers vs. `OneHotEncoder` for text).
2.  What is the main difference between `OneHotEncoder` (used for `city`) and `OrdinalEncoder` (used for `education`)? (`OneHotEncoder` creates new 0/1 columns and assumes no order. `OrdinalEncoder` creates one column with integer values (0, 1, 2) that represent a specific order).
3.  What does the `handle_unknown='ignore'` parameter in `OneHotEncoder` do? (It prevents an error if the model sees a new category in the test data that it never saw in the training data).
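
The unknown-category behavior of both encoders can be checked with a short sketch. The frames below are hypothetical, and 'Haiphong' and 'Diploma' are made-up unseen values used only to trigger the `handle_unknown` logic.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# --- OneHotEncoder: an unseen category becomes a row of zeros ---
train_city = pd.DataFrame({'city': ['Hanoi', 'HCMC', 'Danang']})
new_city = pd.DataFrame({'city': ['Hanoi', 'Haiphong']})   # 'Haiphong' is unseen

onehot = OneHotEncoder(handle_unknown='ignore').fit(train_city)
city_encoded = onehot.transform(new_city).toarray()
print(onehot.categories_[0])   # learned categories, in sorted order
print(city_encoded)            # the 'Haiphong' row contains only zeros

# --- OrdinalEncoder: an unseen category becomes unknown_value ---
train_edu = pd.DataFrame({'education': ['Bachelor', 'Master', 'PhD']})
new_edu = pd.DataFrame({'education': ['PhD', 'Diploma']})   # 'Diploma' is unseen

ordinal = OrdinalEncoder(categories=[['Bachelor', 'Master', 'PhD']],
                         handle_unknown='use_encoded_value',
                         unknown_value=-1).fit(train_edu)
edu_encoded = ordinal.transform(new_edu)
print(edu_encoded)             # 'PhD' maps to 2, 'Diploma' maps to -1
```

Neither encoder raises an error on the unseen value, so a pipeline built this way keeps working when new categories show up at prediction time.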

### Part 5: The Full `Pipeline` (Combining All Steps)

Now we connect all the pieces:

1.  **`preprocessor`**: (Cleaning + Transform) as defined above.
2.  **`selector`**: (Feature Selection) `SelectKBest` to select features.
3.  **`model`**: (Modeling) The `LogisticRegression` model.

In [26]:
# Pipeline connects processing steps into a single workflow.
# 'steps' is a list of tuples: (step_name, transformer/estimator_object)
full_pipeline = Pipeline(steps=[
    # STEP 1: Cleaning + Transform (Using the ColumnTransformer)
    ('preprocessor', preprocessor),
    
    # STEP 2: Feature Selection
    # SelectKBest: Select the 'k' best features
    # score_func=f_classif: Use f_classif (ANOVA F-test) to score features
    # (After processing, we have 2 numeric + 5 OHE (city+gender) + 1 ordinal = 8 features)
    ('selector', SelectKBest(score_func=f_classif, k=6)), # Keep the 6 best of the 8 features
    
    # STEP 3: Modeling
    # LogisticRegression: The final model for prediction
    ('model', LogisticRegression(random_state=42))
])

print("Full pipeline created.")

# -- Train and evaluate (test run) --
# Fit only on X_train
full_pipeline.fit(X_train, y_train)

# Predict only on X_test
y_pred = full_pipeline.predict(X_test)

print(f"Accuracy on Test set (single split): {accuracy_score(y_test, y_pred):.4f}")

Full pipeline created.
Accuracy on Test set (single split): 1.0000


#### ► Review Questions (Part 5)

1.  What is the purpose of the `full_pipeline` object? (To chain all steps of the ML workflow—preprocessing, selection, modeling—into a single object).
2.  Does the order of steps in the `Pipeline` matter? (Yes, absolutely. Data must be cleaned/transformed *before* features can be selected, and features must be selected *before* the model is trained).
3.  What does the `('selector', SelectKBest(k=6))` step do? (It selects the top 6 features that have the strongest relationship with the target variable, based on the `f_classif` score).
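
To see what `SelectKBest` keeps, a fitted selector can be inspected directly. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the lecture's dataset) with 8 features, and reads the fitted `scores_` attribute and the `get_support()` mask.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): 8 features, of which 3 are informative
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)

selector = SelectKBest(score_func=f_classif, k=6)
X_selected = selector.fit_transform(X_demo, y_demo)

# scores_ holds one ANOVA F-score per input feature
print("F-scores per feature:", np.round(selector.scores_, 2))
# get_support() is a boolean mask: True for the features that were kept
print("Kept feature mask   :", selector.get_support())
print("Shape before/after  :", X_demo.shape, X_selected.shape)
```

The features that are kept are the six with the highest F-scores. The two lowest-scoring columns are dropped before the data reaches the model.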

### Part 6: Evaluating the Pipeline with Cross-Validation

To evaluate the model objectively, we don't rely on just one `train_test_split`. Instead, we use **Cross-Validation**.

#### Introducing `KFold`
`KFold` is a *data splitting strategy*. It splits the dataset it is given (e.g., `X_train`, `y_train`) into `k` parts of (nearly) equal size, called 'folds'.
How it works:
1.  Iteration 1: Use Fold 1 as the test set (validation), and the remaining (k-1) folds as the training set.
2.  Iteration 2: Use Fold 2 as the test set, and the remaining (k-1) folds as the training set.
3.  ... (repeat k times)



This way, every data point is used for validation exactly once and for training in the other (k-1) iterations.
* `n_splits=5`: Specify splitting into 5 folds.
* `shuffle=True`: Shuffle the data before splitting. This is very important to ensure the folds are representative, avoiding cases where data is sorted (e.g., all class 0s at the top, class 1s at the bottom).
* `random_state=42`: Ensures the shuffling (`shuffle`) is fixed, making the results reproducible.
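
The splitting itself can be inspected without any model. This sketch applies the same `KFold` settings to 12 row indices (the size of the lecture's sample data). The names `kfold_demo`, `validation_sizes`, and `seen_in_validation` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 12                      # same size as the lecture's sample data
indices = np.arange(n_samples)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

validation_sizes = []
seen_in_validation = []
for fold, (train_idx, val_idx) in enumerate(kfold_demo.split(indices), start=1):
    validation_sizes.append(len(val_idx))
    seen_in_validation.extend(val_idx.tolist())
    print(f"Fold {fold}: train={len(train_idx)} rows, validation rows={val_idx}")

# 12 rows do not divide evenly by 5, so the first folds get one extra row
print("Validation fold sizes:", validation_sizes)
```

Each row index appears in exactly one validation fold. With only 2 or 3 rows per fold, a single wrong prediction changes that fold's accuracy by a large amount.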

#### Introducing `cross_val_score`
`cross_val_score` is the function that automates the `KFold` process.
* `estimator`: This is the model or pipeline we want to evaluate (e.g., `full_pipeline`).
* `X`, `y`: The data to perform cross-validation on.
* `cv`: The splitting strategy. We will pass a `KFold` object here.
* `scoring='accuracy'`: The metric we want to use for evaluation (e.g., accuracy).

This function will return an array of `k` scores (one score for each fold).

**Important Note when using CV with Pipelines:**

A single split result (like the one above) can be due to luck. **CRITICAL:** We must pass the **entire `full_pipeline`** into `cross_val_score`. DO NOT preprocess the data (e.g., `preprocessor.fit_transform(X)`) *before* passing it to CV, as this will cause data leakage between folds.

`cross_val_score` will automatically `fit` the pipeline on (k-1) folds and `transform/predict` on the remaining fold, repeating k times.
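
The loop that `cross_val_score` automates can be written out by hand. The sketch below uses synthetic data and a small stand-in pipeline (`demo_pipeline`), both assumptions for illustration. It refits a fresh copy of the whole pipeline inside every fold and compares the scores with those from `cross_val_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small stand-in pipeline (both are assumptions)
X_cv, y_cv = make_classification(n_samples=100, n_features=5, random_state=42)
demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42)),
])
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in kfold_cv.split(X_cv):
    fold_pipeline = clone(demo_pipeline)                   # fresh, unfitted copy
    fold_pipeline.fit(X_cv[train_idx], y_cv[train_idx])    # fit on k-1 folds
    manual_scores.append(
        fold_pipeline.score(X_cv[val_idx], y_cv[val_idx])) # score held-out fold

auto_scores = cross_val_score(demo_pipeline, X_cv, y_cv,
                              cv=kfold_cv, scoring='accuracy')

print("Manual loop     :", np.round(manual_scores, 3))
print("cross_val_score :", np.round(auto_scores, 3))
```

Because the scaler sits inside the pipeline, it is refitted on the training folds in every iteration and never sees the held-out fold.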

In [27]:
# We use cross-validation on the entire original X and y
# (Or X_train, y_train if you want to tune parameters on the train set)
# Here, we use (X, y) to get the most general evaluation

# Define the K-Fold splitting strategy (e.g., 5 folds)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Call cross_val_score with the ENTIRE pipeline
cv_scores = cross_val_score(full_pipeline, X, y, 
                              cv=kfold, 
                              scoring='accuracy')

print(f"Cross-Validation Scores (5-fold): {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean():.4f}")
print(f"Standard Deviation: {cv_scores.std():.4f}")

Cross-Validation Scores (5-fold): [0.  1.  0.5 1.  1. ]
Mean Accuracy: 0.7000
Standard Deviation: 0.4000




#### ► Review Questions (Part 6)

1.  Why is using `cross_val_score` (e.g., with 5 folds) generally better than a single `train_test_split` for evaluating a model? (A single split might be 'lucky' or 'unlucky'. CV gives a more stable and reliable estimate of model performance by averaging 5 different splits).
2.  What is the correct object to pass into `cross_val_score`'s `estimator` argument: the `preprocessor`, the `model`, or the `full_pipeline`? (The `full_pipeline`. This is critical to prevent data leakage during cross-validation).
3.  What do the `Mean Accuracy` and `Standard Deviation` of the CV scores tell us? (Mean Accuracy is the average performance. Standard Deviation tells us how much the performance *varied* between folds; a low std is good and means the model is stable).

### Part 7: Train, Save, and Load the Model

Once we are satisfied with the CV results, we train the pipeline on the **entire training dataset** (`X_train`, `y_train`), or on all of `X`, `y` if we are ready to deploy.

We will save the **entire `full_pipeline` object**, not just the model. This ensures that when reloaded, all imputers, scalers, encoders, etc., are preserved.

In [28]:
print("--- STARTING TRAINING AND SAVING MODEL ---")

# 1. Train the pipeline on the ENTIRE X_train, y_train
full_pipeline.fit(X_train, y_train)
print("Pipeline has been trained on X_train, y_train.")

# 2. Final evaluation on the Test set (unseen data)
final_accuracy = full_pipeline.score(X_test, y_test)
print(f"FINAL accuracy on X_test: {final_accuracy:.4f}")

# 3. Save the pipeline
# Use joblib.dump to 'freeze' the entire pipeline (including imputer, scaler, model...)
model_filename = 'final_model_pipeline.joblib'
joblib.dump(full_pipeline, model_filename)
print(f"Pipeline saved to file: {model_filename}")

--- STARTING TRAINING AND SAVING MODEL ---
Pipeline has been trained on X_train, y_train.
FINAL accuracy on X_test: 1.0000
Pipeline saved to file: final_model_pipeline.joblib


In [29]:
print("--- LOADING AND USING THE MODEL ---")

# 1. Load the pipeline
# Use joblib.load to restore the saved pipeline
try:
    loaded_pipeline = joblib.load(model_filename)
    print("Pipeline loaded successfully.")

    # 2. Create new data (example)
    # New data must have the EXACT same column structure as the original X_train
    X_new = pd.DataFrame({
        'age': [42, np.nan, 68],
        'income': [150000, 78000, np.nan],
        'city': ['HCMC', 'Hanoi', 'Danang'],
        'education': ['PhD', 'Master', 'Bachelor'],
        'gender': ['Male', 'Female', 'Male']
    })

    print("\nNew data for prediction:")
    print(X_new)

    # 3. Predict using the loaded pipeline
    # The pipeline will AUTOMATICALLY: impute -> scale -> onehot -> ordinal -> select -> predict
    new_predictions = loaded_pipeline.predict(X_new)
    new_proba = loaded_pipeline.predict_proba(X_new)

    print(f"\nPrediction results (target): {new_predictions}")
    print(f"Prediction probabilities (proba): \n{new_proba}")

except FileNotFoundError:
    print(f"Error: File not found {model_filename}. Please run the cell above first.")

--- LOADING AND USING THE MODEL ---
Pipeline loaded successfully.

New data for prediction:
    age    income    city education  gender
0  42.0  150000.0    HCMC       PhD    Male
1   NaN   78000.0   Hanoi    Master  Female
2  68.0       NaN  Danang  Bachelor    Male

Prediction results (target): [1 0 1]
Prediction probabilities (proba): 
[[0.17645397 0.82354603]
 [0.52858782 0.47141218]
 [0.40768686 0.59231314]]


#### ► Review Questions (Part 7)

1.  Why is it better to save the `full_pipeline` object instead of just the `model` object (e.g., just the `LogisticRegression` step)? (Because the pipeline contains all the preprocessing steps. If you only save the model, you would have to manually re-apply the *exact same* imputation, scaling, and encoding steps to new data, which is error-prone. Saving the pipeline automates this).
2.  What function from `joblib` is used to save a pipeline? What function is used to load it? (`joblib.dump()` to save, `joblib.load()` to load).
3.  What must be true about the `X_new` data frame used for prediction? (It must have the *exact same* column names and structure as the original `X_train` data, even if it contains missing values).
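
A quick way to gain confidence in a saved pipeline is a round-trip check: save it, load it, and confirm that the loaded copy predicts the same thing. The sketch below does this in a temporary directory with synthetic data and a small stand-in pipeline. The names `pipeline_rt` and `demo_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small stand-in pipeline (both are assumptions)
X_rt, y_rt = make_classification(n_samples=60, n_features=4, random_state=0)
pipeline_rt = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42)),
]).fit(X_rt, y_rt)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.joblib')   # hypothetical filename
    joblib.dump(pipeline_rt, path)
    reloaded = joblib.load(path)

# The reloaded object carries the fitted scaler and the fitted model
same_predictions = np.array_equal(pipeline_rt.predict(X_rt),
                                  reloaded.predict(X_rt))
print("Steps in reloaded pipeline:", list(reloaded.named_steps))
print("Reloaded pipeline reproduces predictions:", same_predictions)
```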

### Part 8: Summary

We have successfully built a comprehensive ML workflow:

1.  **`train_test_split`** is always the first step to get a "clean" test set.
2.  **`ColumnTransformer`** is essential for handling mixed data types (numeric, categorical).
3.  **`Pipeline`** bundles all steps (Cleaning, Transform, Selection, Model) into a single object.
4.  **`cross_val_score(pipeline, ...)`** is the correct way to evaluate the model, preventing data leakage during CV.
5.  **`joblib.dump(pipeline, ...)`** saves the *entire* workflow, ensuring consistency when predicting on new data.