EM 538-001: Practical Machine Learning for Engineering Analytics (Spring 2025)  
Instructor: Fred Livingston (fjliving@ncsu.edu) 

## Scikit-Learn Pipelines

- Scikit-learn pipelines are a convenient and powerful concept, and one of the things that sets scikit-learn apart from other machine learning libraries.
- Pipelines let us chain a series of preprocessing steps together with fitting an estimator.
- Pipelines automatically avoid pitfalls such as leaking information during feature scaling: the scaling parameters are estimated from the training set only and then reused to scale new data.
- Below is a visualization of how pipelines work.

<img src="images/sklearn-pipeline.png" alt="drawing" width="400"/>
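To make the scaling pitfall concrete, the sketch below does by hand what a pipeline does internally and then compares the result with an actual pipeline. It is a minimal illustration under assumed settings (the two petal features of Iris, a single train/test split, *k* = 3), not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = X[:, 2:]  # petal length and petal width only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# Manual version: estimate mean and standard deviation on the training set only
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X_train), y_train)
# New data is scaled with the training-set parameters; the scaler is not refit
manual_pred = knn.predict(scaler.transform(X_test))

# Pipeline version: the same two steps behind a single fit/predict interface
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both approaches should produce identical predictions
print(np.array_equal(manual_pred, pipe_pred))
```

The manual version is easy to get wrong, for example by calling `fit_transform` on the test set; the pipeline removes that opportunity.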

- Below is an example pipeline that combines the feature scaling step with the *k*NN classifier.

### Load and Prepare Datasets

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np

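iris = load_iris()
# Keep only the last two features: petal length and petal width
X, y = iris.data[:, 2:], iris.target

# Hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2,
    shuffle=True, random_state=123, stratify=y)
# Split the remaining 80% again: 20% of it becomes the validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.2,
    shuffle=True, random_state=123, stratify=y_temp)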

In [None]:
y_test
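As a quick sanity check on the splits above, the following sketch prints the size and per-class counts of each subset. It repeats the same split calls so that it runs on its own; the loop and the `np.bincount` summary are additions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.2, shuffle=True, random_state=123, stratify=y_temp)

# Report how many samples of each class ended up in each subset
for name, labels in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(labels), np.bincount(labels))
```

Because both splits are stratified and Iris has 50 samples per class, the 96 training, 24 validation, and 30 test samples each contain the three classes in equal numbers.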

### Make Model Pipeline

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(n_neighbors=3))

In [None]:
pipe
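`make_pipeline` names each step automatically after its class, in lowercase. The short sketch below lists those names and reads a setting from an individual step; the variable `step_names` is introduced here for illustration only.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(n_neighbors=3))

# Each step is stored as a (name, estimator) pair, in execution order
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Individual steps can be looked up by name
knn_step = pipe.named_steps["kneighborsclassifier"]
print(knn_step.n_neighbors)
```

These generated names matter later, because they are the prefixes used to address step parameters, such as `kneighborsclassifier__n_neighbors`.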

### Make Pipeline Predictions Using the Test Set

In [None]:
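# Fit the scaler on the training data, then fit the classifier on the scaled data
pipe.fit(X_train, y_train)
# Scale the test data with the training-set parameters, then predict
pipe.predict(X_test)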

In [None]:
print('Test accuracy: %.2f%%' % (pipe.score(X_test, y_test)*100))

- As shown above, the pipeline itself follows the scikit-learn estimator API: it exposes `fit`, `predict`, and `score` just like a single estimator.
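Because the pipeline behaves like any other estimator, it also supports `set_params`, so it can be refit with different settings. The sketch below uses the validation set created earlier to compare a few values of *k*; the candidate values `(1, 3, 5, 7)` and the `valid_scores` dictionary are assumptions made for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data[:, 2:], iris.target
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.2, shuffle=True, random_state=123, stratify=y_temp)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

valid_scores = {}
for k in (1, 3, 5, 7):  # candidate values chosen for illustration only
    # Step parameters are addressed as <step name>__<parameter name>
    pipe.set_params(kneighborsclassifier__n_neighbors=k)
    pipe.fit(X_train, y_train)  # refits both the scaler and the classifier
    valid_scores[k] = pipe.score(X_valid, y_valid)

for k, acc in valid_scores.items():
    print('k=%d  validation accuracy: %.2f%%' % (k, acc * 100))
```

Comparing settings on the validation set keeps the test set untouched until a final configuration has been chosen.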