# Pipelines (Part 2)

Hello again, and welcome to the second part of the chapter on pipelines with scikit-learn. In this part I'm going to teach you how to create more complex pipelines and how to work with DataFrames.

## Composite Pipelines

So far, we've seen the usefulness of *pipelines* and how we can use them. But we've created fairly simple pipelines, don't you think?

Let's create a slightly more complicated one, but for that, we're going to need a slightly more complicated dataset as well:

In [None]:
from utils import load_complex_data

dataset = load_complex_data()
dataset


There are 6 feature columns plus the target: `ID` is an identifier, `job` and `marital` are categorical, `balance`, `age` and `loyalty` are numerical, and `subscribed`, the target variable, is binary categorical.

Let's prepare this dataset.

## `ColumnTransformer`

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

one_hot_encode_categories = ColumnTransformer([
    (
        'one_hot_encode_categories', # Name of the transformation
        OneHotEncoder(sparse_output=False), # Transformation to apply
        ["job", "marital"] # Columns involved
    )
])


Let's see what it does with our dataset after training it with `fit`:

In [None]:
one_hot_encode_categories.fit(dataset)

transformed_dataset = one_hot_encode_categories.transform(dataset)
transformed_dataset


We can access the fitted transformers inside a `ColumnTransformer` through its `named_transformers_` attribute. From there, the encoder's `categories_` attribute gives us the categories we need to build the column headers:

In [None]:
cats = one_hot_encode_categories.named_transformers_['one_hot_encode_categories'].categories_


We can use `show_transformed_data`, a helper function I wrote, to view this matrix as a DataFrame with named columns:

In [None]:
from utils import show_transformed_data

show_transformed_data(transformed_dataset, cats)


## Nested pipelines

Let's do something with the `age` variable. The first thing to notice is that it has null values, so we need to impute them first. After that, we're going to discretize it. Let's make a pipeline for that:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

handle_age_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('discretize', KBinsDiscretizer(encode="onehot-dense"))
])


If we test it by passing the `age` column:

In [None]:
handle_age_pipeline.fit_transform(dataset[['age']])


We are going to wrap this pipeline in a column transformer so that it works directly with the dataframe:

In [None]:
handle_age_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('discretize', KBinsDiscretizer(encode="onehot-dense"))
])

handle_age_transformer = ColumnTransformer([
    (
        'handle_age_transformer', # Name of the transformation
        handle_age_pipeline, # Transformation to apply
        ["age"] # Columns involved
    )
])


And we can verify that it works:

In [None]:
handle_age_transformer.fit_transform(dataset)
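
To see what each step of such a nested pipeline learns, here is a minimal sketch on a hypothetical `ages` DataFrame. Two choices are assumptions made only for the toy sample and differ from the chapter's defaults: `n_bins=3` and `strategy="uniform"` (equal-width bins), which keep the result easy to read with ten rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages with two missing values.
ages = pd.DataFrame({"age": [22, 35, np.nan, 47, 58, 63, 29, 41, np.nan, 52]})

age_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    # n_bins=3 and strategy="uniform" are chosen only because the sample is tiny.
    ("discretize", KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")),
])
binned = age_pipeline.fit_transform(ages)

# Value the imputer learned to fill in for missing ages (the column mean).
print(age_pipeline.named_steps["impute"].statistics_)
# Bin boundaries the discretizer learned.
print(age_pipeline.named_steps["discretize"].bin_edges_)
# One row per input row, one column per bin.
print(binned.shape)
```

Each row of `binned` has a single 1 marking the bin that the (possibly imputed) age falls into.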


## Leaving variables untransformed

You can use the `passthrough` string to let variables pass through without any transformation:

In [None]:
let_loyalty_pass_transformer = ColumnTransformer([
    (
        'leave_loyalty_alone',
        'passthrough',
        ['loyalty']
    )
])

let_loyalty_pass_transformer.fit_transform(dataset)
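
A related option is the `remainder` parameter of `ColumnTransformer`, which controls what happens to the columns that are not listed at all. The sketch below uses a hypothetical `toy` DataFrame to compare the default (`"drop"`) with `remainder="passthrough"`; it is an illustration, not the approach the rest of this chapter follows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; only the column names mirror the chapter's dataset.
toy = pd.DataFrame({
    "balance": [90.0, 1200.0, -50.0, 300.0],
    "loyalty": [0.5, 0.1, 0.9, 0.3],
    "age": [34.0, 51.0, 27.0, 45.0],
})

# Default remainder="drop": only the listed column survives.
drop_rest = ColumnTransformer([("scale_balance", StandardScaler(), ["balance"])])
dropped = drop_rest.fit_transform(toy)
print(dropped.shape)

# remainder="passthrough": unlisted columns are appended, untouched,
# after the transformed ones.
keep_rest = ColumnTransformer(
    [("scale_balance", StandardScaler(), ["balance"])],
    remainder="passthrough",
)
kept = keep_rest.fit_transform(toy)
print(kept.shape)
```

Listing a column explicitly with `'passthrough'`, as in the cell above, gives finer control over which columns survive and in what order.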


## `FeatureUnion` to put it all together

Let's recreate everything we just did above:

In [None]:
# We already saw this above
one_hot_encode_categories = ColumnTransformer([
    (
        'one_hot_encode_categories', # Name of the transformation
        OneHotEncoder(sparse_output=False), # Transformation to apply
        ["job", "marital"] # Columns involved
    )
])

# We already saw this above
handle_age_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('discretize', KBinsDiscretizer(encode="onehot-dense"))
])
handle_age_transformer = ColumnTransformer([
    (
        'handle_age_transformer', # Name of the transformation
        handle_age_pipeline, # Transformation to apply
        ["age"] # Columns involved
    )
])

# We already saw this above
let_loyalty_pass_transformer = ColumnTransformer([
    (
        'leave_loyalty_alone',
        'passthrough',
        ['loyalty']
    )
])

# This one is new
from sklearn.preprocessing import StandardScaler

scale_balance = ColumnTransformer([
    ('scale_balance', StandardScaler(), ['balance'])
])


Remember that thanks to `ColumnTransformer`, each of these individual transformers acts on only a few columns of the dataset and discards the rest. But in reality, what we want is to generate a single dataset.

We can use the `FeatureUnion` class to join our features horizontally:

In [None]:
from sklearn.pipeline import FeatureUnion

all_the_features = FeatureUnion([
    ('one_hot_encode_categories', one_hot_encode_categories),
    ('handle_age_transformer', handle_age_transformer),
    ('let_loyalty_pass_transformer', let_loyalty_pass_transformer),
    ('scale_balance', scale_balance)
])


And if we call `fit_transform`, we will obtain a new transformed dataset:

In [None]:
transformed_dataset = all_the_features.fit_transform(dataset)
transformed_dataset


This dataset has 22 columns:

In [None]:
transformed_dataset.shape


15 of them come from the categorical variables `job` and `marital`, 5 come from the `age` column that we discretized into bins, and `balance` and `loyalty` are the two remaining ones. In the process we also got rid of the `ID` column, which is useless for us in this case.

## Training a model

To finish, we're going to add a machine learning model at the end to be the crown jewel and have everything in one place.

First, we're going to use `clone` to create an untrained copy of the entire feature pipeline we already built:

In [None]:
from sklearn.base import clone

feature_transformer = clone(all_the_features)
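
To make the effect of `clone` concrete, here is a minimal sketch with a single `StandardScaler` fitted on made-up numbers: the clone keeps the constructor parameters but none of the state learned during `fit`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

# Made-up values, used only to give the scaler something to learn.
scaler = StandardScaler(with_mean=True)
scaler.fit(np.array([[1.0], [2.0], [3.0]]))

scaler_copy = clone(scaler)

# The original keeps what it learned during fit...
print(hasattr(scaler, "mean_"))
# ...while the clone has no learned state yet.
print(hasattr(scaler_copy, "mean_"))
# Both share the same constructor parameters.
print(scaler_copy.get_params() == scaler.get_params())
```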


We create the final pipeline:

In [None]:
from sklearn.linear_model import LogisticRegression

inference_pipeline = Pipeline([
    ('featurize', feature_transformer),
    ('classifier', LogisticRegression()),
])


To visualize what is happening, you can display the pipeline by putting it on its own as the last line of a cell:

In [None]:
inference_pipeline


Now, let's train it like any other estimator:

In [None]:
inference_pipeline.fit(
    dataset,
    dataset['subscribed']
)


And if we create a new example, we can call `predict` without any problem:

In [None]:
import pandas as pd

nuevos_datos = pd.DataFrame([
    {
        "ID": 2432,
        "job": "technician",
        "marital": "single",
        "balance": 90,
        "age": 34,
        "loyalty": 0.5
    }
])

nuevos_datos


In [None]:
inference_pipeline.predict(nuevos_datos)


And that's it, now all you need to store and share is the `inference_pipeline` object!

## When to use them and when not to?

As you can see, pipelines are very useful in many cases and offer various advantages. However, there are situations where they are not the best option. Here are some general tips on when to use or not use pipelines:

### **When to use pipelines:**

 1. Sequential processing: If your machine learning workflow follows a sequential structure, pipelines are ideal for organizing and simplifying the process.
 1. Cross-validation and hyperparameter tuning: Pipelines facilitate cross-validation and hyperparameter tuning, ensuring that data transformations are applied consistently and avoiding problems such as data leakage.
 1. Reproducibility and maintainability: If you want to improve the reproducibility and maintainability of your code, pipelines are an excellent option, as they allow you to encapsulate the entire workflow in a single structure.
 1. Project collaboration: If you're working in a team, pipelines can facilitate collaboration by providing a clear and coherent representation of the different stages of the machine learning process.
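
The cross-validation and hyperparameter tuning point above can be sketched as follows. The data is synthetic (generated with `make_classification` only for this illustration), and the step names `scale` and `classifier` as well as the candidate values for `C` are arbitrary choices. Because the scaler lives inside the pipeline, it is refitted on the training portion of every fold, which is what prevents leakage from the validation portion.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data generated only for this illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
search = GridSearchCV(
    pipeline,
    param_grid={"classifier__C": [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```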

### **When not to use pipelines:**

 1. Complex preprocessing: If your dataset requires operations that cannot be easily represented as scikit-learn transformers, pipelines may not be suitable.
 1. Custom workflows: If you need to make transformations that don't fit into the sequential structure of a scikit-learn pipeline, you may need to handle the steps manually.
 1. Models outside of scikit-learn: If you're using machine learning models or tools from other libraries that don't follow the scikit-learn API, you may not be able to use a pipeline directly.
 1. Enormous amounts of data: When the dataset is very large, it may sometimes be faster to carry out the data transformations outside Python, for example in SQL.
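
Before ruling pipelines out for a custom step, it can be worth checking whether the step is a plain, stateless function of its input, because `FunctionTransformer` can wrap those. The sketch below is only an illustration: `clip_negative_balance` and the `toy` DataFrame are hypothetical, and steps that need to learn state or reshape the workflow will not fit this pattern.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def clip_negative_balance(frame):
    # Hypothetical custom rule: treat negative balances as zero.
    return frame.clip(lower=0)


# Hypothetical toy data.
toy = pd.DataFrame({"balance": [90.0, 1200.0, -50.0, 300.0]})

pipeline = Pipeline([
    ("clip", FunctionTransformer(clip_negative_balance)),
    ("scale", StandardScaler()),
])
scaled = pipeline.fit_transform(toy)
print(scaled.ravel())
```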

In summary, scikit-learn pipelines are a powerful tool for many machine learning workflows, but they may not be suitable for all situations. Consider the specific needs and limitations of your project before deciding if a pipeline is the best option.

See you in the next chapter where we'll discover how to save our models.