# Pipelines

<img src= "img/pipelines.png" style="height:450px">


[Image Source](https://towardsdatascience.com/using-functiontransformer-and-pipeline-in-sklearn-to-predict-chardonnay-ratings-9b13fdd6c6fd)

## Lesson Objectives

By the end of the lesson students will be able to:
- Summarize the purpose of pipelines
- Implement a scikit-learn pipeline to modularize a modeling process


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer
from sklearn.impute import SimpleImputer


from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


## Steps to modeling we have learned so far:

1. Preprocessing our data
    - scaling
    - imputing missing values

2. Fitting our model to our training data

3. Predicting new values with our test data

4. Obtaining metrics

Let's look at how we would use a logistic regression model to predict breast cancer using the steps we have learned so far.

In [None]:
#import our data

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)
X.head()

In [None]:
y.value_counts()

In [None]:
#splitting our data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
#scale our training data
ss=StandardScaler()
X_train_scaled = ss.fit_transform(X_train)

#scale our test data
X_test_scaled = ss.transform(X_test)

In [None]:
# create our logistic regression model
log_reg = LogisticRegression()

#fit our logistic regression model
log_reg.fit(X_train_scaled, y_train)

#examine the accuracy of our model
accuracy = log_reg.score(X_test_scaled, y_test)
print(f'Accuracy Score: {accuracy}')

## Introducing the `sklearn.pipeline.Pipeline` object

![kid transformer](./img/transformer.gif)

### Definition of a pipeline

We'll store these steps in a `Pipeline` object. From the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn-pipeline-pipeline):

> The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. [The `Pipeline` object] sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’ (i.e. transformers), that is, they must implement fit and transform methods. 

### Broad generalization of a pipeline

The key here is that we need to specify a column, pass its "transformer" (e.g. `SimpleImputer`, `OneHotEncoder`, `StandardScaler`), and decide whether the transformed values belong in their own new column or should overwrite the input column.

### Benefits of the `Pipeline`

From the [User Guide](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators):
> * **Convenience and encapsulation**
>     + You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> * **Joint parameter selection**
>     + You can grid search over parameters of all estimators in the pipeline at once.
> * **Safety**
>     + Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors. 
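
The "joint parameter selection" and "safety" points can be sketched with a grid search over a whole pipeline. This is a minimal illustration, not part of the original lesson: the grid values, `max_iter=1000`, and the `_gs` variable names are assumptions chosen for the example. Each parameter is addressed as the step name, two underscores, and the parameter name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = load_breast_cancer(return_X_y=True)
X_train_gs, X_test_gs, y_train_gs, y_test_gs = train_test_split(
    X_gs, y_gs, random_state=1)

gs_pipe = Pipeline([('ss', StandardScaler()),
                    ('log_reg', LogisticRegression(max_iter=1000))])

# Illustrative grid (assumed values): one scaler parameter, one model parameter.
param_grid = {'ss__with_mean': [True, False],
              'log_reg__C': [0.1, 1.0, 10.0]}

# Within each cross-validation split the scaler is fit on the training folds
# only, so no statistics from the held-out fold reach the model.
grid = GridSearchCV(gs_pipe, param_grid, cv=5)
grid.fit(X_train_gs, y_train_gs)

print(grid.best_params_)
print(f'Test accuracy: {grid.score(X_test_gs, y_test_gs):.3f}')
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call on the full training set that could leak information across folds.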

#### See these resources on data leakage
- [Kaggle Data Leakage](https://www.kaggle.com/alexisbcook/data-leakage)
- [Data Leakage in Machine Learning](https://machinelearningmastery.com/data-leakage-machine-learning/)
- [Leakage in Data Mining: Formulation, Detection, and Avoidance](https://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf)


### BONUS: We can create custom transformers

Sometimes we need to add new features to our existing feature space. In this case, we can't rely on importing a built-in transformer (e.g. `StandardScaler`, `OneHotEncoder`, etc.).

Instead, we'll need to create our own custom transformer. We can do this by creating a new class that implements both `.fit()` and `.transform()` methods. 

*See [Sebastian Raschka](https://sebastianraschka.com/) to see how you can do this!*
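
As a rough sketch of what such a class might look like (this is not taken from the linked resource), the hypothetical `RatioAdder` below appends the ratio of two columns as a new feature. The class name, its parameters, and the toy data are all assumptions made for illustration. Inheriting from `TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params` and `set_params`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: appends column[numerator] / column[denominator]."""

    def __init__(self, numerator=0, denominator=1):
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so calls can be chained.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.numerator] / X[:, self.denominator]
        return np.column_stack([X, ratio])


# Toy data, made up for the example.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 10.0], [8.0, 2.0]])
y_demo = np.array([0, 0, 1, 1])

# The transformer on its own: two input columns become three.
X_with_ratio = RatioAdder().fit_transform(X_demo)
print(X_with_ratio)

# The same transformer as the first step of a pipeline.
custom_pipe = Pipeline([('ratio', RatioAdder()),
                        ('ss', StandardScaler()),
                        ('log_reg', LogisticRegression())])
custom_pipe.fit(X_demo, y_demo)
print(custom_pipe.predict(X_demo))
```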

## Creating a pipeline

Here we can create a simple pipeline to do the standard scaling and modeling steps that we performed above.

In [None]:
steps = [('ss', StandardScaler()), ('log_reg', LogisticRegression())]

pipe = Pipeline(steps)

pipe



__Your Turn__

- Create your own pipeline. Choose a transformer and an estimator with given hyperparameters.


In [None]:
#your code here

#### Great! Now that we have our pipeline, let's use it to make predictions!

In [None]:
#splitting our data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

pipe.fit(X_train, y_train)
pipe.predict(X_test)
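
As a sanity check, the sketch below compares the manual scale-then-fit workflow with the equivalent pipeline on the same split. It is self-contained and uses `_chk` variable names (an assumption made here) so it doesn't overwrite the notebook's own variables. Both versions should report the same accuracy, since the pipeline performs the same steps in the same order.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_chk, y_chk = load_breast_cancer(return_X_y=True)
X_train_chk, X_test_chk, y_train_chk, y_test_chk = train_test_split(
    X_chk, y_chk, random_state=1)

# Manual version: fit the scaler on the training data, then reuse it on the test data.
scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X_train_chk), y_train_chk)
manual_accuracy = model.score(scaler.transform(X_test_chk), y_test_chk)

# Pipeline version: one fit call and one score call on the unscaled data.
chk_pipe = Pipeline([('ss', StandardScaler()), ('log_reg', LogisticRegression())])
chk_pipe.fit(X_train_chk, y_train_chk)
pipe_accuracy = chk_pipe.score(X_test_chk, y_test_chk)

print(f'Manual: {manual_accuracy:.4f}  Pipeline: {pipe_accuracy:.4f}')
```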

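In [None]:
# instead of keeping each object separate:
minmax = MinMaxScaler()
knn = KNeighborsClassifier(n_neighbors=3)

# we can name the steps and chain them in one pipeline
pipe = Pipeline([('minmax', minmax), ('knn', knn)])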

Scikit-learn also provides `make_pipeline`, which builds the same kind of `Pipeline` object but names the steps for you: each step is named after its lowercased class name.
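
To see how the automatic naming works, here is a small sketch. The choice of `MinMaxScaler` and `KNeighborsClassifier` simply mirrors the objects used earlier, and `auto_pipe` is a name introduced for this example.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# No (name, estimator) tuples needed: just pass the objects in order.
auto_pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))

# Each step is named after its class, lowercased.
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['minmaxscaler', 'kneighborsclassifier']
```

The generated names are what you would use in the double-underscore parameter syntax, for example `kneighborsclassifier__n_neighbors`.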

__Your Turn__

-  [Check documentation: 6.1.1.1.1. Construction](https://scikit-learn.org/stable/modules/compose.html) and use `make_pipeline` to construct a pipeline.

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
# your code here

## Accessing steps

We have multiple ways to access an object in the pipeline:

- the `steps` attribute

- indexing with `[idx]`



In [None]:
## note that these will all give the minmax scaler object

# pipe.steps[0][1]

# pipe['minmax']

# pipe[0]

In [None]:
## We can also access a particular object by named_steps
## sklearn claims that tab completion should work here but 
## in my notebook it didn't

pipe.named_steps.minmax

In [None]:
## We can 'slice' pipelines to create sub-pipes

pipe[1:]

In [None]:
type(pipe.steps)

## Access to the parameters

Parameters of the estimators in the pipeline can be accessed using the
`<step name>__<parameter>` syntax: the name of the step, two underscores, then the parameter name.

In [None]:
pipe

In [None]:
pipe.named_steps.keys()

In [None]:
pipe['minmax'].get_params().keys()

In [None]:
pipe.set_params(knn__n_neighbors=6,
                knn__leaf_size=15,)

pipe

## Transforming target in regression

In [None]:
import numpy as np
# load_boston was removed in scikit-learn 1.2; the bundled diabetes
# regression dataset is used here as a stand-in
from sklearn.datasets import load_diabetes
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True)
# n_quantiles is kept below the number of training samples to avoid a warning
transformer = QuantileTransformer(n_quantiles=100, output_distribution='normal')
regressor = LinearRegression()
regr = TransformedTargetRegressor(regressor=regressor,
                                  transformer=transformer)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regr.fit(X_train, y_train)

# R2 when the target is transformed before fitting
print('R2 score (transformed target): {0:.2f}'.format(regr.score(X_test, y_test)))

# R2 on the raw target, for comparison
raw_target_regr = LinearRegression().fit(X_train, y_train)
print('R2 score (raw target): {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))

## Pipelines in action

In [None]:
df = pd.read_csv('data/diabetes.csv')
display(df.head(), df.shape)

In [None]:
target = df.Outcome

data = df.drop(columns='Outcome')

[On Scaling data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    test_size=0.20,
                                                    stratify=target,
                                                    random_state=1)

__Your Turn__

- Create a pipeline and use this pipeline for fitting and predicting diabetes results for the above data.


In [None]:
#your code here

## Column Transformers

What happens when you are preparing your data but it has mixed data types?

Have you been splitting your dataframe, transforming, and then merging them back together???

![](./img/yes-well-not-anymore.jpg)

Sklearn has a really nice pipeline tool for this called [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data).
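
Here is a minimal sketch of the idea on a made-up DataFrame. The column names and values are hypothetical and are not taken from any dataset used in this lesson. Each entry in `transformers` is a `(name, transformer, columns)` tuple.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with mixed types.
toy = pd.DataFrame({'age': [25, 32, 47, 51],
                    'salary': [40000, 52000, 88000, 61000],
                    'department': ['sales', 'hr', 'sales', 'it']})

col_transformer = ColumnTransformer(
    transformers=[
        # scale the continuous columns
        ('num', StandardScaler(), ['age', 'salary']),
        # one-hot encode the categorical column
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['department']),
    ],
    remainder='drop')  # columns not listed above are dropped

transformed = col_transformer.fit_transform(toy)

# 2 scaled columns + 3 one-hot columns (hr, it, sales)
print(transformed.shape)
```

No splitting and re-merging is needed: each transformer only sees the columns it was assigned, and the results are concatenated side by side.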

__Your Turn__

- Using the termination dataset, create a column transformer that standard scales all continuous variables and one-hot encodes any discrete variables.


__Bonus!  Can you use this column transformer inside a pipeline to make predictions about employee terminations?__

In [None]:
#your code here

## Further research and miscellaneous

- [FeatureUnion](https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces)

- [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)

- [sklearn, dictionary of terms](https://scikit-learn.org/stable/glossary.html#term-transformer)

- [Pydata meeting on pipelines](https://www.youtube.com/watch?v=BFaadIqWlAg)

- [Another pydata talk on pipelines with FeatureUnion](https://www.youtube.com/watch?v=URdnFlZnlaE)

- [On scalers](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

- [A nice notebook on pipelines](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb)