In [3]:
# default_exp algo.ml.pipeline

%reload_ext autoreload
%autoreload 2

# algo-ml-pipeline
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. 
* Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. 
* The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. 
* A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or 
* a transformer removed by setting it to ‘passthrough’ or None.
# Pipelines and composite estimators
https://scikit-learn.org/stable/modules/compose.html#pipeline

Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. 

The most common tool is a Pipeline. Pipeline is often used in combination with `FeatureUnion`, which concatenates the output of transformers into a composite feature space. `TransformedTargetRegressor` deals with transforming the target (e.g. log-transforming y). In contrast, `Pipeline` only transforms the observed data (X).

## Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. 

This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

### Convenience and encapsulation

You only have to call fit and predict once on your data to fit a whole sequence of estimators.

### Joint parameter selection

You can grid search over parameters of all estimators in the pipeline at once.

### Safety

Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
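
A minimal sketch of the safety point, assuming synthetic data from `make_classification` and an illustrative `StandardScaler` + `SVC` pipeline (the names `X_demo`, `y_demo` and `leak_free` are made up for this example). Because the scaler lives inside the pipeline, `cross_val_score` refits it on the training part of every split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

leak_free = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# For every fold, the scaler is fit on the training part only and then
# applied to the held-out part, so no test statistics reach the model.
scores = cross_val_score(leak_free, X_demo, y_demo, cv=5)
print(scores.mean())
```

Scaling `X_demo` once up front and then cross-validating only the `SVC` would let the held-out folds influence the scaler's mean and variance.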

In [6]:
# !pip install scikit-learn -U
!pip freeze | grep scikit-learn

scikit-learn==0.23.1


In [1]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sklearn import set_config
set_config(display='diagram')

### build pipeline
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe

The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer
make_pipeline(Binarizer(), MultinomialNB())
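
To see which names were filled in, the steps can be inspected directly. A short sketch (the variable names `auto_named` and `step_names` are illustrative):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

auto_named = make_pipeline(Binarizer(), MultinomialNB())

# Each step name is the lowercased class name of the estimator.
step_names = [name for name, _ in auto_named.steps]
print(step_names)  # ['binarizer', 'multinomialnb']
```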

#### Accessing steps

The estimators of a pipeline are stored as a list in the steps attribute, but can be accessed by index or name by indexing (with [idx]) the Pipeline:

In [6]:
pipe.steps[0]

('reduce_dim', PCA())

In [7]:
pipe[0]

In [8]:
pipe['reduce_dim']

In [9]:
pipe.named_steps.reduce_dim

A sub-pipeline can also be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). 

This is convenient for performing only some of the transformations (or their inverse):

In [10]:
pipe[:1]

In [11]:
pipe[-1:]
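
As a sketch of how a slice can be used, the example below fits an illustrative PCA + SVC pipeline on synthetic data and then applies only the preprocessing part and its inverse. The data and the names `fitted` and `preprocessing` are assumptions made for this example:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, n_features=20,
                                     random_state=0)

fitted = Pipeline([('reduce_dim', PCA(n_components=5)), ('clf', SVC())])
fitted.fit(X_demo, y_demo)

# Everything except the final estimator: a pipeline holding only the PCA step.
preprocessing = fitted[:-1]
X_reduced = preprocessing.transform(X_demo)
X_restored = preprocessing.inverse_transform(X_reduced)
print(X_reduced.shape, X_restored.shape)
```

The slice reuses the already fitted step objects, so no refitting is needed before calling `transform`.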

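In [2]:

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
# Kept under its own name so that `pipe` (reduce_dim + clf) stays available
# for the nested-parameter examples below.
scaled_svc = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

In [3]:
scaled_svc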

#### Nested parameters
Parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameter>` syntax:

In [12]:
pipe.set_params(clf__C=10)

This is particularly important for doing grid searches:

In [13]:
from sklearn.model_selection import GridSearchCV
param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to 'passthrough':

In [14]:
from sklearn.linear_model import LogisticRegression
param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
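
The searches above are only constructed, not run. A self-contained sketch of actually fitting one, assuming synthetic data and a 3-fold split (both chosen for this example, as are the names `search_pipe`, `demo_grid` and `demo_search`):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     random_state=0)

search_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
demo_grid = dict(reduce_dim__n_components=[2, 5, 10],
                 clf__C=[0.1, 10, 100])

demo_search = GridSearchCV(search_pipe, param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# best_params_ uses the same <step>__<parameter> keys as the grid.
print(demo_search.best_params_)
print(len(demo_search.cv_results_['params']))  # 9 combinations
```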


### Notes

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step.

The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
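
The equivalence described in these notes can be checked by hand. A small sketch, assuming synthetic data and an illustrative scaler + SVC pipeline, compares the pipeline's predictions with those from fitting each step manually:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, random_state=0)

# Pipeline version: one call to fit.
chained = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
chained.fit(X_demo, y_demo)

# Manual version: fit the transformer, transform, then fit the final estimator.
scaler = StandardScaler().fit(X_demo)
svc = SVC().fit(scaler.transform(X_demo), y_demo)

same = np.array_equal(chained.predict(X_demo),
                      svc.predict(scaler.transform(X_demo)))
print(same)
```

Since the last step is a classifier, `chained` exposes `predict` just like `svc` does.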

### Caching transformers: avoid repeated computation

Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.

The parameter memory is needed in order to cache the transformers. memory can be either a string containing the directory where to cache the transformers or a joblib.Memory object:
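
A minimal sketch of the directory-string variant, assuming a temporary directory created with `tempfile.mkdtemp` and an illustrative PCA + SVC pipeline on synthetic data (the names `cache_dir` and `cached_pipe` are made up for this example):

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, random_state=0)

cache_dir = mkdtemp()  # temporary directory used only for this sketch
cached_pipe = Pipeline([('reduce_dim', PCA(n_components=5)), ('clf', SVC())],
                       memory=cache_dir)
cached_pipe.fit(X_demo, y_demo)

# With caching enabled, the pipeline fits a clone of each transformer, so
# inspect the fitted step through named_steps rather than the original object.
n_components_fitted = cached_pipe.named_steps['reduce_dim'].components_.shape[0]
print(n_components_fitted)

rmtree(cache_dir)  # clear the cache when it is no longer needed
```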

## Transforming target in regression
TransformedTargetRegressor transforms the targets y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as an argument the regressor that will be used for prediction, and the transformer that will be applied to the target variable:
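
A small sketch of this, assuming a hypothetical target that is exactly linear in log space and a `FunctionTransformer` wrapping `np.log` / `np.exp` as the target transformer (both are choices made for this example):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 2, size=(100, 1))
# Hypothetical target: log(y) = 1.5 * x + 0.5
y_demo = np.exp(1.5 * X_demo[:, 0] + 0.5)

log_regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=FunctionTransformer(func=np.log, inverse_func=np.exp))
log_regr.fit(X_demo, y_demo)

# The inner regressor was trained on log(y); predictions come back on the
# original scale because the inverse transform is applied automatically.
y_pred = log_regr.predict(X_demo)
print(log_regr.regressor_.coef_)
print(np.allclose(y_pred, y_demo))
```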

## FeatureUnion: composite feature spaces
`FeatureUnion` combines several transformer objects into a new transformer that combines their output. 

A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

When you want to apply different transformations to each field of the data, see the related class sklearn.compose.ColumnTransformer, covered below.

FeatureUnion serves the same purposes as Pipeline - convenience and joint parameter estimation and validation.

FeatureUnion and Pipeline can be combined to create complex models.

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
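
A minimal sketch of a FeatureUnion, assuming synthetic data and two illustrative transformers, `PCA` and `SelectKBest` (the choice of transformers and the name `union` are assumptions made for this example):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X_demo, y_demo = make_classification(n_samples=100, n_features=20,
                                     random_state=0)

union = FeatureUnion([('pca', PCA(n_components=3)),
                      ('kbest', SelectKBest(k=2))])

# Each transformer is fit independently; their outputs are stacked column-wise.
X_union = union.fit_transform(X_demo, y_demo)
print(X_union.shape)  # (100, 5)
```

The result has 3 PCA components followed by the 2 selected original features.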


## ColumnTransformer for heterogeneous data

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. 

Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

* Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
* You may want to include the parameters of the preprocessors in a parameter search.

The ColumnTransformer helps perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.

To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:

In [15]:
import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})
X

       city                         title  expert_rating  user_rating
0    London                  His Last Bow              5            4
1    London  How Watson Learned the Trick              3            5
2     Paris              A Moveable Feast              4            4
3  Sallisaw           The Grapes of Wrath              5            3


For this data, we might want to 
* encode the 'city' column as a categorical variable using preprocessing.OneHotEncoder but 
* apply a feature_extraction.text.CountVectorizer to the 'title' column. 

As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say 'city_category' and 'title_bow'. By default, the remaining rating columns are ignored (remainder='drop'):

In [16]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'),['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')

column_trans.fit(X)

In [17]:
column_trans.get_feature_names()

['city_category__x0_London',
 'city_category__x0_Paris',
 'city_category__x0_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']

In [18]:
column_trans.transform(X).toarray()

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]], dtype=int64)
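
The rating columns were dropped above because of `remainder='drop'`. As a sketch of the alternative, the example below rebuilds the same data under the illustrative name `books` and uses `remainder='passthrough'` so the untouched columns are appended to the output:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

books = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

keep_rest = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype=int), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')

# 3 city columns + 13 title tokens + 2 untouched rating columns.
n_columns = keep_rest.fit_transform(books).shape[1]
print(n_columns)  # 18
```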

## Visualizing Composite Estimators
Estimators can be displayed with an HTML representation when shown in a Jupyter notebook. This can be useful to diagnose or visualize a Pipeline with many estimators. This visualization is activated by setting the display option in sklearn.set_config:

In [None]:
from sklearn import set_config
set_config(display='diagram')
# displays HTML representation in a Jupyter context
column_trans

An example of the HTML output can be seen in the "HTML representation of Pipeline" section of "Column Transformer with Mixed Types". As an alternative, the HTML can be written to a file using estimator_html_repr:

In [None]:
from sklearn.utils import estimator_html_repr
# column_trans is the ColumnTransformer fitted earlier in this notebook
with open('my_estimator.html', 'w', encoding='utf-8') as f:
    f.write(estimator_html_repr(column_trans))

In [12]:

# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)

# nb_export

In [1]:
from nbdev.export import *
notebook2script()


In [7]:
!nbdev_build_docs

