# CUSTOM TRANSFORMERS AND PIPELINES
---

### 1. Creating Data
Let's create a dataframe based on the equation:
$$
y = {x_1} + 2\sqrt{x_2}
$$
Because the target depends on $\sqrt{x_2}$ rather than on $x_2$ itself, a simple Linear Regression model trained on the raw features won't be able to fit it exactly.
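As a quick sanity check, the sketch below recomputes the target from the equation for every row used in this notebook. The variable names (`rows`, `y_from_equation`, `equation_holds`) are illustrative and not part of the original notebook.

```python
import numpy as np

# Rows of (x1, x2, y) copied from the dataset used in this notebook.
rows = np.array([
    [1, 16, 9], [4, 36, 16], [1, 16, 9], [2, 9, 8],
    [3, 36, 15], [2, 49, 16], [4, 25, 14], [5, 36, 17],
])
x1, x2, y = rows[:, 0], rows[:, 1], rows[:, 2]

# Recompute the target from the generating equation y = x1 + 2*sqrt(x2).
y_from_equation = x1 + 2 * np.sqrt(x2)
equation_holds = bool(np.allclose(y_from_equation, y))
print(equation_holds)
```

If the check passes, any remaining prediction error comes from the model's inability to represent the square root, not from noise in the data.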

In [1]:
import pandas as pd

df = pd.DataFrame(columns=['X1', 'X2', 'y'], data=[
                                                   [1,16,9],
                                                   [4,36,16],
                                                   [1,16,9],
                                                   [2,9,8],
                                                   [3,36,15],
                                                   [2,49,16],
                                                   [4,25,14],
                                                   [5,36,17]
])

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

In [2]:
df

   X1  X2   y
0   1  16   9
1   4  36  16
2   1  16   9
3   2   9   8
4   3  36  15
5   2  49  16
6   4  25  14
7   5  36  17


In [3]:
X_train

   X1  X2
0   1  16
1   4  36
2   1  16
3   2   9
4   3  36
5   2  49


In [4]:
X_test

   X1  X2
6   4  25
7   5  36


In [5]:
y_train

0     9
1    16
2     9
3     8
4    15
5    16
Name: y, dtype: int64

In [6]:
y_test

6    14
7    17
Name: y, dtype: int64

### 2. Linear Regression Model
Let's predict with `Linear Regression` to see what happens:

In [7]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr_fit = lr.fit(X_train, y_train)
preds = lr_fit.predict(X_test)
lr_mse = mean_squared_error(y_test, preds)
print(f"\n{preds}")
print(f"RMSE: {np.sqrt(lr_mse)}\n")


[13.72113586 16.93334467]
RMSE: 0.20274138822160784



Perfect predictions would be 14 and 17, with zero error. The predictions are not bad, but let's do some calculations on the input features to make them better!
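To make the RMSE figure concrete, here is a minimal sketch that recomputes it by hand from the two predictions printed above. It uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# Values copied from the output above: true test targets and model predictions.
y_true = np.array([14.0, 17.0])
y_pred = np.array([13.72113586, 16.93334467])

errors = y_true - y_pred      # per-sample residuals
mse = np.mean(errors ** 2)    # mean squared error
rmse = np.sqrt(mse)           # root mean squared error
print(rmse)
```

The result should agree with the value reported by `mean_squared_error` up to the rounding of the printed predictions.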

- What if we take the square root of $x_2$ and multiply it by 2?

In [8]:
X_train.X2 = 2 * np.sqrt(X_train.X2)
X_test.X2 = 2 * np.sqrt(X_test.X2)
print(X_test)
m2 = LinearRegression()
fit2 = m2.fit(X_train, y_train)
preds = fit2.predict(X_test)
print(f"\n{preds}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds))}\n")

   X1    X2
6   4  10.0
7   5  12.0

[14. 17.]
RMSE: 5.17892563931115e-15



The input manipulation caused a perfect linear fit. 

Let's see how to do it with a `Pipeline`.

### 3. Linear Regression with a Pipeline

- First **`without input transformation`** to validate that we can get the same results as before:

In [9]:
# let's go back to original data

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

# let's create a pipeline
from sklearn.pipeline import Pipeline

print("Pipeline 1")
pipe1 = Pipeline(steps=[
    ('linear_model', LinearRegression())
])
print("fit pipeline 1")
pipe1.fit(X_train, y_train)
print("predict via pipeline 1")
preds1 = pipe1.predict(X_test)
print(f"\n{preds1}")  # should be [13.72113586 16.93334467]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds1))}\n")

Pipeline 1
fit pipeline 1
predict via pipeline 1

[13.72113586 16.93334467]
RMSE: 0.20274138822160784



How we did it:
- We declared a `pipe1` variable using the *Pipeline* class with a list of *steps* inside it
    - The name of the step (in this case `linear_model`) could be anything unique of our choice
    - It is followed by an actual `Transformer` or `Estimator` (in this case, our `LinearRegression()` model)
- Like any model, `LinearRegression()` is fitted on the training data, but through the `pipe1` variable
- `pipe1` is also used to predict on the test set, as one would do with any other model
- We got the results that we expected. Awesome!!!
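The step name is more than a label: it can be used to reach the fitted estimator inside the pipeline. The sketch below assumes toy data (not the notebook's dataset) that follows y = 2x exactly, so the learned slope is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data following y = 2 * x exactly (illustrative, not the notebook's data).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline(steps=[("linear_model", LinearRegression())])
pipe.fit(X, y)

# A fitted step can be looked up by the name chosen above.
fitted_lr = pipe.named_steps["linear_model"]
slope = fitted_lr.coef_[0]
prediction = pipe.predict(np.array([[5.0]]))[0]
print(slope, prediction)
```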

### 4. Creating a Custom Input Transformer
To perform the input calculations or transformations, we will design a custom transformer:

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('\n>>>>>>>init() called.\n')

    def fit(self, X, y = None):
        print('\n>>>>>>>fit() called.\n')
        return self

    def transform(self, X, y = None):
        print('\n>>>>>>>transform() called.\n')
        X_ = X.copy() # creating a copy to avoid changes to original dataset
        X_.X2 = 2 * np.sqrt(X_.X2)
        return X_

We created a class and named it `ExperimentalTransformer`. All transformers we design will inherit from the `BaseEstimator` and `TransformerMixin` classes, as they give us pre-existing methods for free (for example `get_params()`, `set_params()` and `fit_transform()`). The 3 methods to take care of:
- `__init__`: this is the constructor. It is called when the transformer object is created, which here happens while the pipeline is being defined
- `fit()`: called when we fit the pipeline
- `transform()`: called when we fit the pipeline, and again whenever we transform or predict with it

For the moment, we just put print() messages in `__init__` & `fit()`, and wrote our calculations in `transform()`. As we see above, `fit()` returns `self` and `transform()` returns the modified copy `X_`.
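A transformer like this can also be tried on its own, outside any pipeline. The sketch below uses a hypothetical `SqrtDoubler` class (a print-free variant of the class above) and relies on `fit_transform()`, which `TransformerMixin` provides.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SqrtDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replaces column X2 with 2 * sqrt(X2)."""

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return the instance.
        return self

    def transform(self, X):
        X_ = X.copy()  # leave the caller's DataFrame untouched
        X_["X2"] = 2 * np.sqrt(X_["X2"])
        return X_

frame = pd.DataFrame({"X1": [4, 5], "X2": [25, 36]})
# fit_transform() is inherited from TransformerMixin: it calls fit(), then transform().
transformed = SqrtDoubler().fit_transform(frame)
print(transformed)
```

Because `transform()` works on a copy, `frame` itself keeps its original `X2` values.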

Let's put this into a pipeline to see the order in which these functions are called!

### 5. Custom Transformer in a Pipeline

In [11]:
print("Pipeline 2")
pipe2 = Pipeline(steps=[
    ('experimental_trans', ExperimentalTransformer()),    # this will trigger a call to __init__
    ('linear_model', LinearRegression())
])
print("fit pipeline 2")
pipe2.fit(X_train, y_train)
print("predict via pipeline 2")
preds2 = pipe2.predict(X_test)
print(f"\n{preds2}")  # should be [14. 17.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds2))}\n")

Pipeline 2

>>>>>>>init() called.

fit pipeline 2

>>>>>>>fit() called.


>>>>>>>transform() called.

predict via pipeline 2

>>>>>>>transform() called.


[14. 17.]
RMSE: 5.17892563931115e-15



In [12]:
# an alternate, shorter syntax to do the above, without naming each step, is:
from sklearn.pipeline import make_pipeline

print("Pipeline 2")
pipe2 = make_pipeline(ExperimentalTransformer(), # this will trigger a call to __init__
                      LinearRegression())
print("fit pipeline 2")
pipe2.fit(X_train, y_train)
print("predict via pipeline 2")
preds2 = pipe2.predict(X_test)
print(f"\n{preds2}")  # should be [14. 17.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds2))}\n")

Pipeline 2

>>>>>>>init() called.

fit pipeline 2

>>>>>>>fit() called.


>>>>>>>transform() called.

predict via pipeline 2

>>>>>>>transform() called.


[14. 17.]
RMSE: 5.17892563931115e-15
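Note that `make_pipeline` does still name the steps; it derives each name from the lowercased class name. A minimal sketch, using `StandardScaler` purely as a stand-in transformer:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(StandardScaler(), LinearRegression())
# Each step is a (name, estimator) pair; collect just the generated names.
step_names = [name for name, _ in auto_named.steps]
print(step_names)
```

These generated names are the ones to use later when addressing a step's parameters.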



Three important things to note:
- `__init__` was called the moment we initialized the pipe2 variable
- Both fit() and transform() of our ExperimentalTransformer were called when we fitted the pipeline on training data. This makes sense, as that is how model fitting works: the input features have to be transformed before the model can learn to predict y_train from them
- transform() is called, as expected, when we call predict(X_test): the input test features need to be square-rooted and doubled too before making predictions.

The result - perfect predictions!
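The call order can also be captured programmatically rather than read off printed messages. The sketch below assumes a hypothetical `RecordingTransformer` that appends to a module-level list, and uses a small subset of the data for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

calls = []  # module-level log of method calls, in order

class RecordingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that logs calls instead of printing them."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X):
        calls.append("transform")
        X_ = X.copy()
        X_["X2"] = 2 * np.sqrt(X_["X2"])
        return X_

X = pd.DataFrame({"X1": [1, 4, 2, 3], "X2": [16, 36, 9, 36]})
y = pd.Series([9, 16, 8, 15])

pipe = Pipeline(steps=[("recorder", RecordingTransformer()),
                       ("linear_model", LinearRegression())])
pipe.fit(X, y)
calls_after_fit = list(calls)       # snapshot after fitting
pipe.predict(X)
calls_after_predict = list(calls)   # snapshot after predicting
print(calls_after_fit, calls_after_predict)
```

A list like this is easier to assert on in a test than console output.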

### 6. Modifying and Parameterizing Transformers

But we've assumed in the transform() function of our ExperimentalTransformer that the column name is X2. Let's not do so, and instead pass the column name via the constructor `__init__()`.

Here is our `ExperimentalTransformer_2`:

In [13]:
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    # let's add two parameters for fun
    def __init__(self, feature_name, additional_param = 'Himanshu'):
        print('\n>>>>>>>init() called.\n')
        self.feature_name = feature_name
        self.additional_param = additional_param

    def fit(self, X, y = None):
        print('\n>>>>>>>fit() called.\n')
        print(f'\nadditional param ~~~~ {self.additional_param}\n')
        return self

    def transform(self, X, y = None):
        print('\n>>>>>>>transform() called.\n')
        X_ = X.copy() # creating a copy to avoid changes to original dataset
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

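- Take care to keep the parameter name exactly the same in the function argument as well as the class' variable (`feature_name`). scikit-learn's `get_params()` looks up attributes named after the `__init__` arguments, and `clone()` uses those values to build a fresh copy of the estimator.
- Changing the name will cause problems later when we also try to transform the target feature (y), because `TransformedTargetRegressor` clones the regressor it is given before fitting it.
- That cloning is also why `__init__` gets called a second time in the later outputs.
- The `additional_param` with a default value was added just to mix things up. It's not really needed for anything in our case and acts as an optional argument.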
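The naming rule is easier to see in isolation. The sketch below uses a hypothetical `ColumnScaler` transformer (not part of the notebook) to show what `get_params()` returns and that `clone()` produces a new, equally configured instance.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ColumnScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer used only to illustrate parameter handling."""

    def __init__(self, feature_name, factor=2.0):
        # Attribute names mirror the argument names exactly.
        self.feature_name = feature_name
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_[self.feature_name] = self.factor * X_[self.feature_name]
        return X_

original = ColumnScaler("X2")
params = original.get_params()   # read from attributes named after __init__ args
copy_ = clone(original)          # builds a new instance from those parameters
print(params, copy_ is original)
```

If `__init__` stored the value under a different attribute name, `get_params()` could not find it and the clone would not be a faithful copy.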

### 7. Parameterized Transformer in a Pipeline

Let's create the new pipeline now:

In [14]:
print("Pipeline 3")
pipe3 = Pipeline(steps=[
    ('experimental_trans', ExperimentalTransformer_2('X2')),   
    ('linear_model', LinearRegression())
])
print("fit pipeline 3")
pipe3.fit(X_train, y_train)
print("predict via pipeline 3")
preds3 = pipe3.predict(X_test)
print(f"\n{preds3}")  # should be [14. 17.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds3))}\n")

Pipeline 3

>>>>>>>init() called.

fit pipeline 3

>>>>>>>fit() called.


additional param ~~~~ Himanshu


>>>>>>>transform() called.

predict via pipeline 3

>>>>>>>transform() called.


[14. 17.]
RMSE: 5.17892563931115e-15



### 8. Custom Target Transformation

via TransformedTargetRegressor

- What about a situation when some pre- and post-processing needs to be done?
- Let's go further and modify the dataframe so that the target holds the squares of the current values. The new function is:
$$
\sqrt{y} = {x_1} + 2\sqrt{x_2}
$$
To make this function fit a simple linear model, we need to take the square root of y before fitting our model and, later, square any predictions made by the model.
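Done by hand, without any helper class, the pre- and post-processing could look like the sketch below. It uses plain NumPy arrays and a hypothetical `prepare_inputs` helper; the rows follow the equation above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# (x1, x2, y) rows with y = (x1 + 2*sqrt(x2))**2, as in this section's data.
train = np.array([[1, 16, 81], [4, 36, 256], [1, 16, 81],
                  [2, 9, 64], [3, 36, 225], [2, 49, 256]], dtype=float)
test = np.array([[4, 25, 196], [5, 36, 289]], dtype=float)

def prepare_inputs(rows):
    # Input transformation: keep x1, replace x2 with 2*sqrt(x2).
    return np.column_stack([rows[:, 0], 2 * np.sqrt(rows[:, 1])])

lr = LinearRegression()
lr.fit(prepare_inputs(train), np.sqrt(train[:, 2]))   # pre-process: sqrt of target
preds = lr.predict(prepare_inputs(test)) ** 2         # post-process: square predictions
print(preds)
```

The manual version works, but the two target steps have to be remembered at every fit and predict call, which is the bookkeeping a target-transforming wrapper is meant to remove.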

In [15]:
df = pd.DataFrame(columns=['X1', 'X2', 'y'], data=[
                                                   [1,16,81],
                                                   [4,36,256],
                                                   [1,16,81],
                                                   [2,9,64],
                                                   [3,36,225],
                                                   [2,49,256],
                                                   [4,25,196],
                                                   [5,36,289]
])

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

# let's see model's performance with no input & target transformations:
print("Pipeline 1")
pipe1 = Pipeline(steps=[
    ('linear_model', LinearRegression())
])
print("fit pipeline 1")
pipe1.fit(X_train, y_train)
print("predict via pipeline 1")
preds1 = pipe1.predict(X_test)
print(f"\n{preds1}")  # perfect predictions would be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds1))}\n")

Pipeline 1
fit pipeline 1
predict via pipeline 1

[200.34790002 279.04738423]
RMSE: 7.679804528409069



In [16]:
# with input transformation but no target transformation

print("Pipeline 3")
pipe3 = Pipeline(steps=[
    ('experimental_trans', ExperimentalTransformer_2('X2')),   
    ('linear_model', LinearRegression())
])
print("fit pipeline 3")
pipe3.fit(X_train, y_train)
print("predict via pipeline 3")
preds3 = pipe3.predict(X_test)
print(f"\n{preds3}")  
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds3))}\n")

Pipeline 3

>>>>>>>init() called.

fit pipeline 3

>>>>>>>fit() called.


additional param ~~~~ Himanshu


>>>>>>>transform() called.

predict via pipeline 3

>>>>>>>transform() called.


[207.42690058 280.94152047]
RMSE: 9.887192456534308



We can use scikit-learn's `TransformedTargetRegressor` to instruct our pipeline to perform some calculation and inverse-calculation on the target variable. Let's first write the two functions we need:

In [17]:
def target_transform(target):
    print('\n*****************target_transform() called.\n')
    target_ = target.copy() 
    target_ = np.sqrt(target_)
    return target_

def inverse_target_transform(target):
    print('\n*****************inverse_target_transform() called.\n')
    target_ = target.copy() 
    target_ = target_ ** 2
    return target_

In [18]:
# with input transformation & target transformation
from sklearn.compose import TransformedTargetRegressor

print("Pipeline 4")
# no change in input pipeline
pipe4 = Pipeline(steps=[
    ('experimental_trans', ExperimentalTransformer_2('X2')),   
    ('linear_model', LinearRegression())
])
# let's create target transformer
model = TransformedTargetRegressor(regressor=pipe4, 
                                   func=target_transform, 
                                   inverse_func=inverse_target_transform)
print("fit pipeline 4 [fit model]")
# we fit the model instead of pipe4
model.fit(X_train, y_train)
print("predict via pipeline 4")
preds4 = model.predict(X_test) # using 'model' to predict
print(f"\n{preds4}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds4))}\n")

Pipeline 4

>>>>>>>init() called.

fit pipeline 4 [fit model]

*****************target_transform() called.


*****************inverse_target_transform() called.


*****************target_transform() called.


*****************inverse_target_transform() called.


*****************target_transform() called.


>>>>>>>init() called.


>>>>>>>fit() called.


additional param ~~~~ Himanshu


>>>>>>>transform() called.

predict via pipeline 4

>>>>>>>transform() called.


*****************inverse_target_transform() called.


[196. 289.]
RMSE: 1.657256204579568e-13



- *Perfect predictions!*
- The `TransformedTargetRegressor` class takes `regressor`, `func` and `inverse_func` arguments, which connect our pipeline to these new functions
- Notice that our two functions were called multiple times when fit() was called! By default (`check_inverse=True`), `TransformedTargetRegressor` checks during fit() that applying `func` followed by `inverse_func` gives back the original targets, which costs extra calls. That can be a problem in big projects and complex pipelines. The change needed to handle this is simply to set the `check_inverse` param of TransformedTargetRegressor to `False`.
- We can even use built-in Transformers instead of user-defined functions. Example
    - `model = TransformedTargetRegressor(regressor=pipe3, transformer=PowerTransformer())` or
    - `model = TransformedTargetRegressor(regressor=pipe3, transformer=StandardScaler())`
- Using a built-in transformer does not require us to specify `inverse_transform()`, as that is taken care of internally.
- In case we want to have a custom transformer inside TransformedTargetRegressor, we can do that too. The only additional function we'll have to implement is `inverse_transform()`.
- Here's an example:

In [19]:
class CustomTargetTransformer(BaseEstimator, TransformerMixin):
    # no need to implement __init__ in this particular case
  
    def fit(self, target):
        return self

    def transform(self, target):
        print('\n%%%%%%%%%%%%%%%custom_target_transform() called.\n')
        target_ = target.copy() 
        target_ = np.sqrt(target_)
        return target_

    # need to implement this too
    def inverse_transform(self, target):
        print('\n%%%%%%%%%%%%%%%custom_inverse_target_transform() called.\n')
        target_ = target.copy() 
        target_ = target_ ** 2
        return target_

In [20]:
# with input transformation & target transformation
print("Pipeline 4.1")
pipe4_1 = Pipeline(steps=[
    ('experimental_trans', ExperimentalTransformer_2('X2')),   
    ('linear_model', LinearRegression())
])
# let's create target transformer
model = TransformedTargetRegressor(regressor=pipe4_1, 
                                   transformer=CustomTargetTransformer(),
                                   check_inverse=False) # avoid repeated calls
print("fit pipeline 4.1 [fit model]")
model.fit(X_train, y_train)
print("predict via pipeline 4.1 [model]")
preds4_1 = model.predict(X_test) 
print(f"\n{preds4_1}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds4_1))}\n")

Pipeline 4.1

>>>>>>>init() called.

fit pipeline 4.1 [fit model]

%%%%%%%%%%%%%%%custom_target_transform() called.


>>>>>>>init() called.


>>>>>>>fit() called.


additional param ~~~~ Himanshu


>>>>>>>transform() called.

predict via pipeline 4.1 [model]

>>>>>>>transform() called.


%%%%%%%%%%%%%%%custom_inverse_target_transform() called.


[196. 289.]
RMSE: 1.657256204579568e-13



The output now shows no repeated calls!
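The built-in-transformer option mentioned earlier can be sketched in the same way. The example below assumes illustrative data and a bare `LinearRegression` as the regressor; `StandardScaler` already provides `inverse_transform()`, so no custom code is needed for the target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data, not the notebook's dataset.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 6.5, 9.0, 11.5])

# The target is standardized before fitting; predictions are mapped back
# through StandardScaler.inverse_transform() automatically.
scaled_target_model = TransformedTargetRegressor(regressor=LinearRegression(),
                                                 transformer=StandardScaler())
scaled_target_model.fit(X, y)
plain_model = LinearRegression().fit(X, y)

same_predictions = bool(np.allclose(scaled_target_model.predict(X),
                                    plain_model.predict(X)))
print(same_predictions)
```

Standardizing the target is a linear rescaling, so for ordinary least squares it should leave the predictions unchanged; a non-linear transformer such as `PowerTransformer` would generally change them.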

### 9. Getting and Setting Parameters of a Model

One last thing to do here is to use caching to avoid repeated computations, and to see how to get or set parameters of our pipeline from outside. The latter will be needed later if we want to apply GridSearch on top of this.
- Caching the transformer helps us avoid repeated computations and makes the pipeline more efficient.

In [21]:
# get all the params of our model
model.get_params()

{'check_inverse': False,
 'func': None,
 'inverse_func': None,
 'regressor__memory': None,
 'regressor__steps': [('experimental_trans',
   ExperimentalTransformer_2(feature_name='X2')),
  ('linear_model', LinearRegression())],
 'regressor__verbose': False,
 'regressor__experimental_trans': ExperimentalTransformer_2(feature_name='X2'),
 'regressor__linear_model': LinearRegression(),
 'regressor__experimental_trans__additional_param': 'Himanshu',
 'regressor__experimental_trans__feature_name': 'X2',
 'regressor__linear_model__copy_X': True,
 'regressor__linear_model__fit_intercept': True,
 'regressor__linear_model__n_jobs': None,
 'regressor__linear_model__normalize': False,
 'regressor': Pipeline(steps=[('experimental_trans',
                  ExperimentalTransformer_2(feature_name='X2')),
                 ('linear_model', LinearRegression())]),
 'transformer': CustomTargetTransformer()}
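The keys above follow a `<step name>__<parameter name>` convention, and the same keys are what a grid search expects. The sketch below uses synthetic data and a `Ridge` model (neither is from this notebook) to show `set_params()` and a small `GridSearchCV` side by side.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=30)

pipe = Pipeline(steps=[("scale", StandardScaler()), ("ridge", Ridge())])

# Nested parameters are addressed as <step name>__<parameter name>.
pipe.set_params(ridge__alpha=10.0)
alpha_after_set = pipe.get_params()["ridge__alpha"]

# The same keys name the grid that GridSearchCV explores.
search = GridSearchCV(pipe, param_grid={"ridge__alpha": [0.01, 1.0, 100.0]}, cv=3)
search.fit(X, y)
print(alpha_after_set, search.best_params_)
```

When the pipeline is wrapped in a `TransformedTargetRegressor`, the keys gain a `regressor__` prefix, as in the listing above.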

In [22]:
from tempfile import mkdtemp
from shutil import rmtree

In [23]:
cachedir = mkdtemp()
print("Pipeline 5")
pipe5 = Pipeline(steps=[
    # incorrect column name passed
    ('experimental_trans', ExperimentalTransformer_2('X1')),   
    ('linear_model', LinearRegression())
], memory=cachedir)
# let's create target transformer
model = TransformedTargetRegressor(regressor=pipe5, 
                                   func=target_transform, 
                                   inverse_func=inverse_target_transform,
                                   check_inverse=False)
# correcting the column name using set_params()
model.set_params(regressor__experimental_trans__feature_name = 'X2') 

print("fit pipeline 5 [fit model]")
# we fit the model instead of pipe5
model.fit(X_train, y_train)
print("predict via pipeline 5")
preds5 = model.predict(X_test)
print(f"\n{preds5}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds5))}\n")

# Clear the cache directory when you don't need it anymore
rmtree(cachedir)

Pipeline 5

>>>>>>>init() called.

fit pipeline 5 [fit model]

*****************target_transform() called.


>>>>>>>init() called.


>>>>>>>init() called.


>>>>>>>fit() called.


additional param ~~~~ Himanshu


>>>>>>>transform() called.

predict via pipeline 5

>>>>>>>transform() called.


*****************inverse_target_transform() called.


[196. 289.]
RMSE: 1.657256204579568e-13
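One side effect of `memory=` is worth knowing: with caching enabled, the pipeline fits a clone of each transformer rather than the instance passed in, which is a plausible reason for the additional init() message in the output above. A minimal sketch, using `StandardScaler` as a stand-in transformer and toy data:

```python
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

cachedir = mkdtemp()
try:
    scaler = StandardScaler()
    cached_pipe = Pipeline(steps=[("scale", scaler),
                                  ("linear_model", LinearRegression())],
                           memory=cachedir)
    cached_pipe.fit(X, y)
    # With caching enabled the pipeline fits a clone, not the instance we passed in.
    original_is_fitted = hasattr(scaler, "mean_")
    step_is_fitted = hasattr(cached_pipe.named_steps["scale"], "mean_")
finally:
    rmtree(cachedir)  # remove the cache directory when it is no longer needed
print(original_is_fitted, step_is_fitted)
```

To inspect a fitted transformer in a cached pipeline, go through `named_steps` instead of the original variable.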



### 10. References
- https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156
- https://github.com/HCGrit/MachineLearning-iamJustAStudent/blob/master/PipelineFoundation/Pipeline_Experiment.ipynb

- https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
- https://machinelearningmastery.com/how-to-transform-target-variables-for-regression-with-scikit-learn/
- http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
- https://stackoverflow.com/questions/43308042/transformer-initialize-twice-in-pipeline
- Caching and side effects: https://scikit-learn.org/stable/modules/compose.html?highlight=transformedtargetregressor#pipeline-chaining-estimators

#### NEXT STEPS:

1. FeatureUnion and ColumnTransformer. Some great examples:
    - https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces
    - https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer
2. Using GridSearch with Pipelines. Example:
    - https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html?highlight=pipeline
    