
# Pipelines: Automating the Automatic Learning

**Pipelines** are a handy tool that can help throughout the full data science process!

Pipelines can keep our code neat and clean all the way from gathering & cleaning our data, to creating models & fine-tuning them!

But like with all things, you need to know how to make a proper and useful pipeline:

![Original Source: https://imgs.xkcd.com/comics/data_pipeline.png](images/data_pipeline_xkcd.png)

# Advantages

## Reduces Complexity

You can focus on one part of the pipeline at a time and debug or adjust parts as needed.

## Convenient

You can summarize your fine-detail steps into the pipeline. That way you can focus on the big-picture aspects.

## Flexible 

You can also apply the same pipeline to different models and run optimization techniques like grid search and random search over the hyperparameters of any step!
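As a rough sketch of what that can look like, the example below runs a grid search over a small pipeline. The step names (`std_scaler`, `rf_clf`), the parameter values in the grid, and the use of the iris dataset are illustrative assumptions, not recommended settings. The part worth noticing is the `<step name>__<parameter name>` convention used to reach into a step.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27)

# A small pipeline; the step names are arbitrary labels chosen for this sketch
pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('rf_clf', RandomForestClassifier(random_state=27)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {
    'rf_clf__n_estimators': [10, 50],
    'rf_clf__max_depth': [2, None],
}

# Every candidate refits the whole pipeline (scaler + classifier) on each fold
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Swapping in a different model would only mean replacing the last step and the matching keys in `param_grid`.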

## Prevent Mistakes!

We can focus on one section at a time.

We can also ensure that data leakage doesn't occur between our training dataset and our validation/testing datasets!

![](images/pipe_leaking_cartoon.jpg)
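One way to see the leakage protection in action is cross-validation. In the sketch below (the dataset, step names, and fold count are assumptions made for illustration), the whole pipeline is handed to `cross_val_score`, so the scaler is re-fit on the training portion of each fold and never sees that fold's validation rows.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

leak_free = Pipeline([
    ('std_scaler', StandardScaler()),
    ('rf_clf', RandomForestClassifier(random_state=27)),
])

# For each of the 5 folds, the pipeline is cloned and fit on the training
# portion only; the held-out portion is just transformed and scored
scores = cross_val_score(leak_free, X, y, cv=5)

print(scores)
print(scores.mean())
```

Scaling all of `X` first and then cross-validating only the classifier would let statistics from the validation rows influence training, which is the kind of leak the pipeline avoids.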

# Example of Using a Pipeline

We can imagine writing out every step planned for a dataset by hand. We _technically_ don't need to use the Pipeline class, but it makes the process much more manageable.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Getting some data
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=27)

## Without the Pipeline class

In [None]:
# Define transformers (will adjust/massage the data)
imputer = SimpleImputer(strategy="median") # replaces missing values
std_scaler = StandardScaler() # scales the data
pca = PCA()

# Define the classifier (predictor) to train
rf_clf = RandomForestClassifier()

# Have the classifier (and full pipeline) learn/train/fit from the data
X_train_filled = imputer.fit_transform(X_train)
X_train_scaled = std_scaler.fit_transform(X_train_filled)
X_train_reduce = pca.fit_transform(X_train_scaled)
rf_clf.fit(X_train_reduce, y_train)

# Predict using the trained classifier (still need to do the transformations)
# Only transform() here: the transformers must reuse what they learned from the training data
X_test_filled = imputer.transform(X_test)
X_test_scaled = std_scaler.transform(X_test_filled)
X_test_reduce = pca.transform(X_test_scaled)
y_pred = rf_clf.predict(X_test_reduce)

> Note that if we were to add more steps in this process, we'd have to change both the *training* and *testing* processes.

## With the Pipeline class

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), 
        ('std_scaler', StandardScaler()),
        ('pca', PCA()),
        ('rf_clf', RandomForestClassifier()),
])


# Train the pipeline (transformations & predictor)
pipeline.fit(X_train, y_train)

# Predict using the pipeline (includes the transformers & trained predictor)
predicted = pipeline.predict(X_test)

> If we need to change our process, we change it _just once_ in the Pipeline
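As a small sketch of that point, the example below changes one setting of the PCA step with `set_params` and refits. The choice of `n_components=2` is purely illustrative, and the data loading is repeated only so the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
    ('pca', PCA()),
    ('rf_clf', RandomForestClassifier(random_state=27)),
])

# Change one step's setting in a single place, then refit everything
pipeline.set_params(pca__n_components=2)
pipeline.fit(X_train, y_train)

# Fitted steps can be looked up by the name given in the Pipeline
n_kept = pipeline.named_steps['pca'].n_components_
print(n_kept)

# Both training and prediction pick up the change automatically
predicted = pipeline.predict(X_test)
print(pipeline.score(X_test, y_test))
```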

# Parts of a Pipeline

Scikit-learn has a class called [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that is very logical and versatile. We can break up the steps within a full process. But it'll help if we define what the different parts are.

## Estimator

This is any object in the pipeline that can take in data and *estimate* (or **learn**) some parameters.

This means regression and classification models are estimators, but so are objects that transform the original dataset ([Transformers](pipeline_intro.ipynb#Transformer)), such as a standard scaler.

### Usage (Methods)

#### `fit`

All estimators estimate/learn when we call their `fit()` method and pass in the dataset. Other parameters can be set to "help" the estimator learn; in scikit-learn these are usually given when the estimator is created rather than passed to `fit()`. These are called **hyperparameters**, parameters used to tweak the learning process.
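A minimal sketch of this, using the `SimpleImputer` from earlier on a tiny made-up array (the numbers are invented for illustration): the hyperparameter `strategy` is chosen by us, while the values learned by `fit()` are stored on the estimator in attributes whose names end with an underscore.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value (made up for this example)
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# Hyperparameter: chosen by us when the estimator is created
imputer = SimpleImputer(strategy="median")

# fit() learns from the data and returns the estimator itself
returned = imputer.fit(X_toy)

# Learned parameters live in attributes ending with an underscore
print(imputer.statistics_)  # per-column medians: [3. 5.]
```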

## Transformer

Some estimators can change the original data into something new, a **transformation**. Examples of these **transformers** include objects that scale, clean, or expand/reduce a dataset.

### Usage (Methods)

#### `transform`

We call a transformer's `transform()` method to apply the transformation to a dataset.

####  `fit_transform`

Remember that all estimators have a `fit()` method, so a transformer can use the `fit()` method to learn something about the given dataset. After learning with `fit()`, a transformation on the dataset can be made with the `transform()` method. 

An example of this would be a transformer that performs min-max normalization on the dataset; the `fit()` method would learn the minimum and maximum of each feature and the `transform()` method would rescale the dataset using those values.

When you call `fit` and `transform` with the same dataset, you can simply call the `fit_transform()` method. This essentially has the same results as calling `fit()` and then `transform()` on the dataset but possibly with some optimization and efficiencies baked in.
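The normalization example can be sketched with scikit-learn's `MinMaxScaler` on a tiny made-up array (the values are invented for illustration). It also shows why `fit()` and `transform()` are kept separate: new data is scaled with the minimum and maximum learned from the original data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (made up for this example)
X_seen = np.array([[1.0], [3.0], [5.0]])
X_new = np.array([[2.0], [7.0]])

# Two steps: learn the min/max, then apply the scaling
scaler = MinMaxScaler()
scaler.fit(X_seen)
two_step = scaler.transform(X_seen)

# One step: same result on the same data
one_step = MinMaxScaler().fit_transform(X_seen)

print(scaler.data_min_, scaler.data_max_)  # [1.] [5.]
print(two_step.ravel())                    # [0.  0.5 1. ]
print(np.allclose(two_step, one_step))     # True

# New data reuses the learned min/max, so values can fall outside [0, 1]
scaled_new = scaler.transform(X_new)
print(scaled_new.ravel())                  # [0.25 1.5 ]
```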

## Predictor

We've been using **predictors** whenever we've been making predictions with a classifier or regressor. We would use the `fit()` method to train our predictor object and then feed in new data to make predictions (based on what it learned in the fitting stage).

### Usage (Methods)

#### `predict`

As you can probably guess, the `predict()` method predicts results for a dataset given to it, after the predictor has been trained with the `fit()` method.

#### `score`

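Predictors also have a `score()` method that can be used to evaluate how well the predictor performed on a dataset (such as the test set). For scikit-learn classifiers the default score is the mean accuracy, and for regressors it is the $R^2$ coefficient of determination.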
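Here is a short sketch of both methods on a classifier. The dataset and split mirror the earlier example and are otherwise arbitrary. For a scikit-learn classifier, `score()` gives the same number as computing the accuracy of `predict()` by hand.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27)

# Train the predictor
rf_clf = RandomForestClassifier(random_state=27)
rf_clf.fit(X_train, y_train)

# predict() returns one predicted label per row of X_test
y_pred = rf_clf.predict(X_test)

# score() predicts internally and compares against the true labels
test_accuracy = rf_clf.score(X_test, y_test)

print(test_accuracy)
print(accuracy_score(y_test, y_pred))  # same value as score()
```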

# Using a Pipeline

Check out Aurélien Géron's notebook of an [end-to-end ML project](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb) on his GitHub repo based around his book [_Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed)_](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/).