# Pipelines

Hello, welcome to this last module in your learning journey about machine learning with scikit-learn. It's time to learn some more tools that will make your models easier to develop and put into production.

## Pipelines

A pipeline is a sequence of steps for processing information.

Following this idea, a pipeline in Scikit-Learn is a way to sequentially apply a list of transformations, optionally ending with a predictor, to a dataset.

Instead of running each step and storing its result by hand, pipelines let you organize pre-processing, feature extraction, and training in one place, and then reuse that same object whenever you have to make new predictions.

This simplifies your code, provides consistency in your projects, and makes the task of sharing and reusing code very simple.
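
To see what a pipeline saves you from, here is a minimal sketch of the manual bookkeeping described above. It assumes the Iris dataset loaded with `load_iris` and a split made with `train_test_split` (both choices are only for illustration), and it uses the scaling and PCA steps that appear later in this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: Iris stands in for whatever dataset you are working with.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every step has to be fitted on the training data and kept around...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

pca = PCA(n_components=2)
X_train_reduced = pca.fit_transform(X_train_scaled)

# ...and then replayed, in the same order, on any new data.
X_test_reduced = pca.transform(scaler.transform(X_test))

print(X_train_reduced.shape, X_test_reduced.shape)
```

With only two steps this is manageable, but every extra step adds another fitted object to store and another call to keep in the right order.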

Pipelines expose the same interface that we have already seen in many Scikit-Learn objects: methods such as `fit`, `transform`, and `predict`, depending on the steps they contain.

## The `Pipeline` class

The class around which everything is centered is the `Pipeline` class:

In [None]:
from sklearn.pipeline import Pipeline


It receives a list of `(name, estimator)` tuples, where each name is a string that identifies the step. For example, let's create a pipeline with two steps: one that scales the variables and another that reduces the dimensions of the dataset. We have already seen both transformations in this book:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
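
The names given to the steps are not just labels: they can be used to look a step up again later. The following is a small sketch of that, using the `steps` and `named_steps` attributes of `Pipeline` and indexing by name.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])

# `steps` holds the (name, estimator) tuples in execution order.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# A step can be retrieved by name through `named_steps`...
pca_step = pipeline.named_steps['pca']
print(pca_step.n_components)

# ...or by indexing the pipeline directly; both return the same object.
print(pipeline['scaler'] is pipeline.named_steps['scaler'])
```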


Now let's load some data to see how it works. Note that `X_train` is a matrix with 4 columns:

In [None]:
from utils import load_split_iris

X_train, X_test, y_train, y_test = load_split_iris()
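
The `load_split_iris` helper comes from this book's `utils` module and its code is not shown here. If you do not have that module at hand, a stand-in along these lines should behave similarly. Note that the `test_size`, `random_state`, and `stratify` choices below are assumptions, not the helper's actual settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def load_split_iris(test_size=0.2, random_state=42):
    """Hypothetical stand-in for `utils.load_split_iris`; the real helper may differ."""
    X, y = load_iris(return_X_y=True)
    # Assumed split settings: 80/20, stratified by class, fixed seed.
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )


X_train, X_test, y_train, y_test = load_split_iris()
print(X_train.shape[1])  # number of feature columns
```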


With this, we can now train our pipeline:

In [None]:
pipeline.fit(X_train)


After that, we can transform our two datasets. If you look at the resulting values, you'll see that they now have only two dimensions, thanks to the dimensionality reduction we added:

In [None]:
X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)
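
A quick way to confirm the reduction is to compare the shapes before and after. The sketch below does that and also shows `fit_transform`, which fits and transforms in a single call. It loads Iris with `load_iris` and `train_test_split` instead of the book's helper so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a plain train/test split of Iris replaces `load_split_iris`.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])

# Fit and transform the training data in one call.
X_train_transformed = pipeline.fit_transform(X_train)
X_test_transformed = pipeline.transform(X_test)

print(X_train.shape[1], '->', X_train_transformed.shape[1])
print(X_test.shape[1], '->', X_test_transformed.shape[1])

# Transforming again with the fitted pipeline gives the same values.
same = np.allclose(X_train_transformed, pipeline.transform(X_train))
print(same)
```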


And now, we can use this data in a classifier, for example:

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train_transformed, y_train)
y_pred = lr.predict(X_test_transformed)
score = lr.score(X_test_transformed, y_test)
print(f'Test accuracy: {score:.2f}')


Excellent, right? Now we don't have to worry about saving the scaler and the PCA separately, and we can use the same pipeline when we put our model into production...
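
As a sketch of what "saving" could look like, a fitted pipeline can be written to disk as a single object with `joblib`. The file name below is hypothetical, and the temporary directory is used only to keep the example tidy.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a plain train/test split of Iris replaces `load_split_iris`.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
pipeline.fit(X_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessing.joblib')  # hypothetical file name
    joblib.dump(pipeline, path)    # one file holds both fitted steps
    restored = joblib.load(path)

# The restored pipeline produces the same transformation.
same = np.allclose(pipeline.transform(X_test), restored.transform(X_test))
print(same)
```

Keep in mind that a file saved this way is meant to be loaded with the same library versions that created it.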

## Pipelines as machine learning models

But what if I told you that we can include our model as part of the pipeline instead of having it separate?

Let's define exactly that:

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('lr', LogisticRegression()),
])

pipeline.fit(X_train, y_train)


As you can see, the last step of a `Pipeline` can be a machine learning model, while every step before it must be a transformer. We can then use the whole pipeline to predict new values:

In [None]:
y_pred = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)
print(f'Test accuracy: {score:.2f}')
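
To make it clearer what `predict` does on a pipeline, the sketch below compares it with doing the same work by hand: transforming with every step except the last, then predicting with the last one. It relies on slicing a `Pipeline`, and it loads Iris directly so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a plain train/test split of Iris replaces `load_split_iris`.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('lr', LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Everything except the last step forms a transformer-only sub-pipeline.
preprocessing = pipeline[:-1]
classifier = pipeline[-1]

# Transform by hand, then predict with the final estimator.
manual_pred = classifier.predict(preprocessing.transform(X_test))

same = np.array_equal(manual_pred, pipeline.predict(X_test))
print(same)
```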


## Compatibility with other Scikit-Learn tools

`Pipeline` objects are also compatible with other tools available in Scikit-Learn, for example, the cross-validation tools that we have previously seen in this book:

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('lr', LogisticRegression()),
])

cv = 5
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv)

# Show the results
print(f'Cross-validation scores ({cv} folds): {cv_scores}')
print(f'Mean score: {np.mean(cv_scores):0.2f}')
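
It may help to see roughly what happens inside `cross_val_score` when it receives a pipeline: for each fold, a fresh copy of the whole pipeline (scaler, PCA, and model) is fitted on that fold's training part only. The loop below is a simplified sketch of that idea, not the library's actual implementation. It uses the full Iris dataset and an explicit `StratifiedKFold` so that both approaches use the same splits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('lr', LogisticRegression()),
])

folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, valid_idx in folds.split(X, y):
    # An unfitted copy, so nothing learned in one fold carries over to the next.
    fold_pipeline = clone(pipeline)
    fold_pipeline.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_pipeline.score(X[valid_idx], y[valid_idx]))

cv_scores = cross_val_score(pipeline, X, y, cv=folds)
same = np.allclose(manual_scores, cv_scores)
print(same)
```

Because the scaler and the PCA are refitted inside every fold, the validation part of each fold never influences the pre-processing.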


And also with hyperparameter search:

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: scale the data
    ('pca', PCA()),                # Step 2: dimensionality reduction
    ('lr', LogisticRegression()),  # Step 3: logistic regression model
])

param_grid = {
    'pca__n_components': [1, 2, 3],
    # The default 'lbfgs' solver only supports 'l2' and None;
    # 'l1' and 'elasticnet' would make those fits fail.
    'lr__penalty': ['l2', None],
    'lr__C': np.logspace(-3, 3, 7),
}


The peculiarity lies in how we define the parameter grid. Each key is the name you gave to the step, followed by two underscores, followed by the name of the argument. For example, `pca__n_components` refers to the `n_components` argument of the step named `pca`.
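
The same double-underscore names work outside of a grid search too. As a small sketch, `get_params` lists them and `set_params` accepts them:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('lr', LogisticRegression()),
])

# Nested parameters are exposed as '<step name>__<argument name>'.
nested = [name for name in pipeline.get_params() if '__' in name]
print('pca__n_components' in nested, 'lr__C' in nested)

# The same names can be used to change a step's arguments in place.
pipeline.set_params(pca__n_components=2, lr__C=0.1)
print(pipeline['pca'].n_components, pipeline['lr'].C)
```

Calling `get_params` on your own pipeline is a handy way to check which keys are valid before building a grid.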

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)


In [None]:
# Show the results
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_:.2f}')
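
Once the search has finished, the winning configuration is available as a fitted pipeline in `best_estimator_` (with the default `refit=True`). The sketch below evaluates it on the held-out test set. To stay short and self-contained, it loads Iris directly, searches a smaller grid than the one above, and sets `max_iter=1000` as an assumption to help the solver converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a plain train/test split of Iris replaces `load_split_iris`.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('lr', LogisticRegression(max_iter=1000)),
])

# A reduced grid, for illustration only.
param_grid = {
    'pca__n_components': [1, 2, 3],
    'lr__C': np.logspace(-3, 3, 7),
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# With refit=True (the default), this is a pipeline refitted on all of X_train.
best_pipeline = grid_search.best_estimator_
test_score = best_pipeline.score(X_test, y_test)
print(f'Test accuracy of the best pipeline: {test_score:.2f}')
```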


Interesting, isn't it?

Shall we look a bit more into pipelines and how we can do more complex things with them in the next chapter?