# sklearn.pipeline.Pipeline

[sklearn.pipeline.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) 

## 6.1. Pipelines and composite estimators

[6.1. Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html#pipeline)

Transformers are usually combined with classifiers, regressors, or other estimators to build a composite estimator.

The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion, which concatenates the output of transformers into a composite feature space.

TransformedTargetRegressor deals with transforming the target (e.g. log-transforming y). In contrast, Pipelines only transform the observed data (X).

### 6.1.1. Pipeline: chaining estimators

Pipeline can be used to chain multiple estimators into one.

This is useful because there is often a fixed sequence of steps in processing the data, for example feature selection, normalization, and classification.

Pipeline serves multiple purposes here:

**Convenience and encapsulation**  
You only have to call fit and predict once on your data to fit a whole sequence of estimators.

**Joint parameter selection**  
You can grid search over the parameters of all estimators in the pipeline at once.

**Safety**  
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
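As a rough illustration of the safety point, the sketch below cross-validates a scaler and a classifier as a single Pipeline, so the scaler is refit on the training part of each fold only. The iris dataset, StandardScaler, and 5-fold splitting are assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The scaler lives inside the pipeline, so each CV split fits it on the
# training fold only; test-fold statistics never reach the model.
model = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset before calling cross_val_score would instead let every fold see statistics computed from its own test samples.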

#### 6.1.1.1. Usage
##### 6.1.1.1.1. Construction

The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and the value is an estimator object:

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe

Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])

The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

In [2]:
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB()) 

Pipeline(steps=[('binarizer', Binarizer()), ('multinomialnb', MultinomialNB())])
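
To make the automatic naming concrete, this small sketch lists the step names that make_pipeline generates. Names are the lowercased class names, and a class that appears more than once gets a numeric suffix. The estimators used here are arbitrary choices for the example.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler appears twice, so both occurrences are numbered.
auto_pipe = make_pipeline(StandardScaler(), PCA(), StandardScaler())
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)
# → ['standardscaler-1', 'pca', 'standardscaler-2']
```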

##### 6.1.1.1.2. Accessing steps

The estimators of a pipeline are stored as a list in the steps attribute, but can be accessed by index or name by indexing (with [idx]) the Pipeline:

In [4]:
>>> pipe.steps[0]

('reduce_dim', PCA())

In [5]:
>>> pipe[0]

PCA()

In [6]:
>>> pipe['reduce_dim']

PCA()

Pipeline's ``named_steps`` attribute allows accessing steps by name with tab completion in interactive environments:

In [7]:
 pipe.named_steps.reduce_dim is pipe['reduce_dim']

True

A sub-pipeline can also be extracted using the slicing notation commonly used for Python sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):

In [8]:
pipe[:1]

Pipeline(steps=[('reduce_dim', PCA())])

In [9]:
pipe[-1:]

Pipeline(steps=[('clf', SVC())])
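
One possible use of slicing is to run only the preprocessing part of a fitted pipeline and then map the result back. The sketch below assumes the iris dataset and an extra StandardScaler step, neither of which is part of the pipeline shown above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scale', StandardScaler()),
                 ('reduce_dim', PCA(n_components=2)),
                 ('clf', SVC())]).fit(X, y)

# Everything except the final classifier; the slice shares the fitted steps.
preprocessing = pipe[:-1]
X_reduced = preprocessing.transform(X)
X_restored = preprocessing.inverse_transform(X_reduced)
print(X_reduced.shape, X_restored.shape)
```

Because PCA keeps only two components here, X_restored is an approximation of X rather than an exact copy.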

##### 6.1.1.1.3. Nested parameters

Parameters of the estimators in the pipeline can be accessed using the ``<estimator>__<parameter>`` syntax:

In [10]:
pipe.set_params(clf__C=10)

Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC(C=10))])
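
The same naming scheme can be read back with get_params, which is a convenient way to discover valid nested names. This is a minimal sketch; the specific values set here are arbitrary.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
pipe.set_params(clf__C=10, reduce_dim__n_components=5)

# get_params() exposes every nested parameter as <step>__<parameter>.
params = pipe.get_params()
nested_names = sorted(name for name in params if '__' in name)
print(params['clf__C'], params['reduce_dim__n_components'])
```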

This is particularly important for doing grid searches:

In [11]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                      clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

Individual steps can also be replaced as parameters, and non-final steps can be ignored by setting them to 'passthrough':

In [12]:
from sklearn.linear_model import LogisticRegression
param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
                      clf=[SVC(), LogisticRegression()],
                      clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
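
The grids above are only constructed, not fitted. A minimal end-to-end sketch might look like the following, assuming the iris dataset, 3-fold cross-validation, and component counts small enough for its four features (all choices made for this example).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Search over whole steps and over a nested parameter at the same time.
param_grid = dict(reduce_dim=['passthrough', PCA(2), PCA(3)],
                  clf=[SVC(), LogisticRegression(max_iter=1000)],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```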

The estimators of the pipeline can be retrieved by index:

In [13]:
pipe[0]

PCA()

or by name:

In [14]:
pipe['reduce_dim'] 

PCA()

**Examples:**

* Pipeline Anova SVM

* Sample pipeline for text feature extraction and evaluation

* Pipelining: chaining a PCA and a logistic regression

* Explicit feature map approximation for RBF kernels

* SVM-Anova: SVM with univariate feature selection

* Selecting dimensionality reduction with Pipeline and GridSearchCV

#### See Also:

* Composite estimators and parameter spaces

#### 6.1.1.2. Notes

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
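
The equivalence described in this note can be checked by hand. The sketch below fits the same two steps once through a Pipeline and once manually; the iris dataset and the particular estimators are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipeline version: one fit call, one predict call.
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe_pred = pipe.fit(X, y).predict(X)

# Manual version: fit and apply each step in turn.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
manual_pred = clf.predict(X_scaled)

print(np.array_equal(pipe_pred, manual_pred))
```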

#### 6.1.1.3. Caching transformers: avoid repeated computation

Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.

The parameter memory is needed in order to cache the transformers. memory can be either a string containing the directory where to cache the transformers or a joblib.Memory object:

In [15]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
pipe

Pipeline(memory='C:\\Users\\Usuario\\AppData\\Local\\Temp\\tmpk0uc5w46',
         steps=[('reduce_dim', PCA()), ('clf', SVC())])

Clear the cache directory when you no longer need it:

In [16]:
>>> rmtree(cachedir)
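
The memory parameter also accepts a joblib.Memory object, as mentioned above. The following sketch combines that with tempfile.TemporaryDirectory so the cache is removed automatically; the iris dataset and the use of a context manager are assumptions of this example rather than something the text prescribes.

```python
from tempfile import TemporaryDirectory

from joblib import Memory
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The directory, and the cache inside it, is deleted when the block exits.
with TemporaryDirectory() as cachedir:
    memory = Memory(location=cachedir, verbose=0)
    pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())],
                    memory=memory)
    pipe.fit(X, y)
    accuracy = pipe.score(X, y)

print(accuracy)
```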

**Warning: Side effect of caching transformers**  
Using a Pipeline without cache enabled, it is possible to inspect the original instance such as:

In [17]:
>>> from sklearn.datasets import load_digits
>>> X_digits, y_digits = load_digits(return_X_y=True)
>>> pca1 = PCA()
>>> svm1 = SVC()
>>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
>>> pipe.fit(X_digits, y_digits)

Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])

In [18]:
>>> # The pca instance can be inspected directly
>>> print(pca1.components_)

[[-1.77484909e-19 -1.73094651e-02 -2.23428835e-01 ... -8.94184677e-02
  -3.65977111e-02 -1.14684954e-02]
 [ 3.27805401e-18 -1.01064569e-02 -4.90849204e-02 ...  1.76697117e-01
   1.94547053e-02 -6.69693895e-03]
 [-1.68358559e-18  1.83420720e-02  1.26475543e-01 ...  2.32084163e-01
   1.67026563e-01  3.48043832e-02]
 ...
 [ 0.00000000e+00 -8.73056983e-16 -8.00882817e-17 ...  4.50992264e-17
  -6.85099394e-17  1.37105203e-16]
 [ 0.00000000e+00 -1.43163189e-16  1.69094260e-16 ...  3.09312540e-17
  -5.28224496e-17  4.51534285e-17]
 [ 1.00000000e+00 -1.68983002e-17  5.73338351e-18 ...  8.66631300e-18
  -1.57615962e-17  4.07058917e-18]]


Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. In the following example, accessing the components_ attribute of the PCA instance pca2 will raise an AttributeError since pca2 will be an unfitted transformer. Instead, use the attribute named_steps to inspect estimators within the pipeline:

In [19]:
>>> cachedir = mkdtemp()
>>> pca2 = PCA()
>>> svm2 = SVC()
>>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
                           memory=cachedir)
>>> cached_pipe.fit(X_digits, y_digits)

Pipeline(memory='C:\\Users\\Usuario\\AppData\\Local\\Temp\\tmpfk8tnkin',
         steps=[('reduce_dim', PCA()), ('clf', SVC())])

In [20]:
>>> print(cached_pipe.named_steps['reduce_dim'].components_)

[[-1.77484909e-19 -1.73094651e-02 -2.23428835e-01 ... -8.94184677e-02
  -3.65977111e-02 -1.14684954e-02]
 [ 3.27805401e-18 -1.01064569e-02 -4.90849204e-02 ...  1.76697117e-01
   1.94547053e-02 -6.69693895e-03]
 [-1.68358559e-18  1.83420720e-02  1.26475543e-01 ...  2.32084163e-01
   1.67026563e-01  3.48043832e-02]
 ...
 [ 0.00000000e+00 -8.73056983e-16 -8.00882817e-17 ...  4.50992264e-17
  -6.85099394e-17  1.37105203e-16]
 [ 0.00000000e+00 -1.43163189e-16  1.69094260e-16 ...  3.09312540e-17
  -5.28224496e-17  4.51534285e-17]
 [ 1.00000000e+00 -1.68983002e-17  5.73338351e-18 ...  8.66631300e-18
  -1.57615962e-17  4.07058917e-18]]
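
The cloning behaviour can be verified without triggering the AttributeError. This sketch checks for the fitted attribute on both the original instance and the one stored in the pipeline, and removes the cache directory exactly once in a finally block (a structure chosen for this example).

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)
cachedir = mkdtemp()
try:
    pca2 = PCA()
    cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', SVC())],
                           memory=cachedir)
    cached_pipe.fit(X_digits, y_digits)

    # The instance we created was cloned before fitting, so it stays unfitted.
    original_is_fitted = hasattr(pca2, 'components_')
    # The fitted clone is the one stored inside the pipeline.
    stored_is_fitted = hasattr(cached_pipe.named_steps['reduce_dim'],
                               'components_')
finally:
    rmtree(cachedir)  # remove the cache directory exactly once

print(original_is_fitted, stored_is_fitted)
```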


In [23]:
>>> # Remove the cache directory
>>> rmtree(cachedir)


**Examples:**

* Selecting dimensionality reduction with Pipeline and GridSearchCV

### 6.1.2. Transforming target in regression
TransformedTargetRegressor transforms the targets y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as an argument the regressor that will be used for prediction, and the transformer that will be applied to the target variable:

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = fetch_california_housing(return_X_y=True)
X, y = X[:2000, :], y[:2000]  # select a subset of data
transformer = QuantileTransformer(output_distribution='normal')
regressor = LinearRegression()
regr = TransformedTargetRegressor(regressor=regressor,
                                      transformer=transformer)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regr.fit(X_train, y_train)

TransformedTargetRegressor(regressor=LinearRegression(),
                           transformer=QuantileTransformer(output_distribution='normal'))

In [None]:
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

In [None]:
>>> raw_target_regr = LinearRegression().fit(X_train, y_train)
>>> print('R2 score: {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))

For simple transformations, instead of a Transformer object, a pair of functions can be passed, defining the transformation and its inverse mapping:

In [None]:
>>> def func(x):
...     return np.log(x)
>>> def inverse_func(x):
...     return np.exp(x)

Subsequently, the object is created as:

In [None]:
>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   func=func,
...                                   inverse_func=inverse_func)
>>> regr.fit(X_train, y_train)

In [None]:
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
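
To see what the wrapper does with a function pair, the sketch below compares it against applying the transformation by hand. It uses synthetic data and np.log1p / np.expm1 instead of the housing dataset and np.log / np.exp; these substitutions are assumptions made so the example runs without a download.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target with a log-linear relationship.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.exp(X @ np.array([1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=200))

# Wrapper: fits on log1p(y) and maps predictions back with expm1.
regr = TransformedTargetRegressor(regressor=LinearRegression(),
                                  func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
wrapped_pred = regr.predict(X)

# Manual equivalent of the same two steps.
manual = LinearRegression().fit(X, np.log1p(y))
manual_pred = np.expm1(manual.predict(X))

print(np.allclose(wrapped_pred, manual_pred))
```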

    

By default, the provided functions are checked at each fit to be the inverse of each other. However, it is possible to bypass this checking by setting check_inverse to False:

In [None]:
>>> def inverse_func(x):
...     return x
>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   func=func,
...                                   inverse_func=inverse_func,
...                                   check_inverse=False)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

R2 score: -1.57

**Note:** The transformation can be triggered by setting either transformer or the pair of functions func and inverse_func. However, setting both options will raise an error.

**Examples:**

* Effect of transforming the targets in regression model

### 6.1.3. FeatureUnion: composite feature spaces
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

When you want to apply different transformations to each field of the data, see the related class ColumnTransformer (see user guide).

FeatureUnion serves the same purposes as Pipeline - convenience and joint parameter estimation and validation.

FeatureUnion and Pipeline can be combined to create complex models.

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

#### 6.1.3.1. Usage
A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', KernelPCA())])
Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming of the components.
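
A minimal sketch of make_union follows, showing the generated names and the side-by-side concatenation of outputs. The iris dataset and the choice of PCA plus SelectKBest are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_union

X, y = load_iris(return_X_y=True)

# Names are generated from the class names, as with make_pipeline.
union = make_union(PCA(n_components=2), SelectKBest(k=1))
X_combined = union.fit_transform(X, y)

print([name for name, _ in union.transformer_list])
print(X_combined.shape)  # 2 PCA columns + 1 selected column
```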

Like Pipeline, individual steps may be replaced using set_params, and ignored by setting to 'drop':

>>> combined.set_params(kernel_pca='drop')
FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', 'drop')])

**Examples:**

* Concatenating multiple feature extraction methods

### 6.1.4. ColumnTransformer for heterogeneous data
Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

* Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.

* You may want to include the parameters of the preprocessors in a parameter search.

The ColumnTransformer helps perform different transformations on different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.
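
As a hedged sketch of the "safe from leakage and parametrizable" point, the example below places a ColumnTransformer inside a Pipeline and tunes a preprocessing parameter together with a model parameter. The toy DataFrame, the step names, and the parameter grid are all hypothetical and exist only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and one numeric feature.
X = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Paris'] * 5,
    'rating': [1.0, 4.0, 2.0, 5.0] * 5,
})
y = [0, 1, 0, 1] * 5

preprocess = ColumnTransformer([
    ('city_category', OneHotEncoder(handle_unknown='ignore'), ['city']),
    ('rating_scaled', StandardScaler(), ['rating']),
])
model = Pipeline([('preprocess', preprocess), ('clf', LogisticRegression())])

# Preprocessor and classifier parameters share one search space.
param_grid = {
    'preprocess__rating_scaled__with_mean': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid=param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Since the encoder and scaler are refit inside every cross-validation split, their statistics come from the training portion only.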

To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:

>>> import pandas as pd
>>> X = pd.DataFrame(
...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
...      'title': ["His Last Bow", "How Watson Learned the Trick",
...                "A Moveable Feast", "The Grapes of Wrath"],
...      'expert_rating': [5, 3, 4, 5],
...      'user_rating': [4, 5, 4, 3]})
For this data, we might want to encode the 'city' column as a categorical variable using OneHotEncoder but apply a CountVectorizer to the 'title' column. As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say 'city_category' and 'title_bow'. By default, the remaining rating columns are ignored (remainder='drop'):

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(dtype='int'),['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='drop')

>>> column_trans.fit(X)
ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
                                 ['city']),
                                ('title_bow', CountVectorizer(), 'title')])

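>>> # get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2
>>> # in favor of get_feature_names_out(); the output below is from an older release
>>> column_trans.get_feature_names()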
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']

>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)

In the above example, the CountVectorizer expects a 1D array as input and therefore the columns were specified as a string ('title'). However, OneHotEncoder, like most other transformers, expects 2D data, so in that case you need to specify the column as a list of strings (['city']).

Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, a boolean mask, or with a make_column_selector. The make_column_selector is used to select columns based on data type or column name:

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.compose import make_column_selector
>>> ct = ColumnTransformer([
...       ('scale', StandardScaler(),
...       make_column_selector(dtype_include=np.number)),
...       ('onehot',
...       OneHotEncoder(),
...       make_column_selector(pattern='city', dtype_include=object))])
>>> ct.fit_transform(X)
array([[ 0.904...,  0.      ,  1. ,  0. ,  0. ],
       [-1.507...,  1.414...,  1. ,  0. ,  0. ],
       [-0.301...,  0.      ,  0. ,  1. ,  0. ],
       [ 0.904..., -1.414...,  0. ,  0. ,  1. ]])

Strings can reference columns only if the input is a DataFrame; integers are always interpreted as positional column indices.

We can keep the remaining rating columns by setting remainder='passthrough'. The values are appended to the end of the transformation:

>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(dtype='int'),['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='passthrough')

>>> column_trans.fit_transform(X)
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]]...)
The remainder parameter can be set to an estimator to transform the remaining rating columns. The transformed values are appended to the end of the transformation:

>>> from sklearn.preprocessing import MinMaxScaler
>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder=MinMaxScaler())

>>> column_trans.fit_transform(X)[:, -2:]
array([[1. , 0.5],
       [0. , 1. ],
       [0.5, 0.5],
       [1. , 0. ]])
The make_column_transformer function is available to more easily create a ColumnTransformer object. Specifically, the names will be given automatically. The equivalent for the above example would be:

>>> from sklearn.compose import make_column_transformer
>>> column_trans = make_column_transformer(
...     (OneHotEncoder(), ['city']),
...     (CountVectorizer(), 'title'),
...     remainder=MinMaxScaler())
>>> column_trans
ColumnTransformer(remainder=MinMaxScaler(),
                  transformers=[('onehotencoder', OneHotEncoder(), ['city']),
                                ('countvectorizer', CountVectorizer(),
                                 'title')])

### 6.1.5. Visualizing Composite Estimators
Estimators can be displayed with an HTML representation when shown in a Jupyter notebook. This can be useful to diagnose or visualize a Pipeline with many estimators. This visualization is activated by setting the display option in set_config:

>>> from sklearn import set_config
>>> set_config(display='diagram')   
>>> # displays HTML representation in a Jupyter context
>>> column_trans  
An example of the HTML output can be seen in the HTML representation of Pipeline section of Column Transformer with Mixed Types. As an alternative, the HTML can be written to a file using estimator_html_repr:

>>> from sklearn.utils import estimator_html_repr
>>> with open('my_estimator.html', 'w') as f:  
...     f.write(estimator_html_repr(column_trans))
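
A self-contained variant of the same idea is sketched below. It builds a small pipeline, renders it to an HTML string, and writes the string to a temporary location; the pipeline contents and the output path are assumptions for this example.

```python
import os
import tempfile

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.utils import estimator_html_repr

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# estimator_html_repr returns the diagram markup as a plain string.
html = estimator_html_repr(pipe)

# Write the markup to a temporary file (hypothetical file name).
out_path = os.path.join(tempfile.mkdtemp(), 'my_estimator.html')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(html)
print(len(html) > 0)
```

The resulting file can be opened in any browser, which helps when no notebook is available.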

**Examples:**

* Column Transformer with Heterogeneous Data Sources

* Column Transformer with Mixed Types