# CUSTOM TRANSFORMERS AND PIPELINES
---

### 1. Creating Data
Let's create a dataframe based on the equation:
$$
y = {x_1} + 2\sqrt{x_2}
$$
Because $y$ depends on $\sqrt{x_2}$ rather than on $x_2$ itself, a simple Linear Regression model on the raw features won't be able to fit it exactly.
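As a quick sanity check, the rows used in this notebook can be compared against the equation. This is a minimal sketch; the array names `x1`, `x2`, `y` and `reconstructed` are chosen here for illustration and are not part of the notebook.

```python
import numpy as np

# The eight rows of the dataframe, split into plain arrays
x1 = np.array([1, 4, 1, 2, 3, 2, 4, 5])
x2 = np.array([16, 36, 16, 9, 36, 49, 25, 36])
y = np.array([9, 16, 9, 8, 15, 16, 14, 17])

# Rebuild the target from the equation y = x1 + 2 * sqrt(x2)
reconstructed = x1 + 2 * np.sqrt(x2)

# True when every row follows the equation
matches = np.allclose(reconstructed, y)
print(matches)
```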

In [1]:
import pandas as pd

df = pd.DataFrame(columns=['X1', 'X2', 'y'], data=[
                                                   [1,16,9],
                                                   [4,36,16],
                                                   [1,16,9],
                                                   [2,9,8],
                                                   [3,36,15],
                                                   [2,49,16],
                                                   [4,25,14],
                                                   [5,36,17]
])

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

In [2]:
df

   X1  X2   y
0   1  16   9
1   4  36  16
2   1  16   9
3   2   9   8
4   3  36  15
5   2  49  16
6   4  25  14
7   5  36  17


In [3]:
X_train

   X1  X2
0   1  16
1   4  36
2   1  16
3   2   9
4   3  36
5   2  49


In [4]:
X_test

   X1  X2
6   4  25
7   5  36


In [5]:
y_train

0     9
1    16
2     9
3     8
4    15
5    16
Name: y, dtype: int64

In [6]:
y_test

6    14
7    17
Name: y, dtype: int64

### 2. Linear Regression Model
Let's predict with `Linear Regression` to see what happens:

In [7]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr_fit = lr.fit(X_train, y_train)
preds = lr_fit.predict(X_test)
lr_mse = mean_squared_error(y_test, preds)
print(f"\n{preds}")
print(f"RMSE: {np.sqrt(lr_mse)}\n")


[13.72113586 16.93334467]
RMSE: 0.20274138822160784
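The RMSE printed above can be reproduced by hand from its definition, the square root of the mean squared error. A minimal sketch, assuming the two predictions are copied (rounded) from the output above; the names `y_true`, `y_pred` and `errors` are illustrative:

```python
import numpy as np

y_true = np.array([14.0, 17.0])                 # the two test targets
y_pred = np.array([13.72113586, 16.93334467])   # predictions copied from the output above

# RMSE = sqrt(mean((y_true - y_pred)^2))
errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)
```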



Perfect predictions would be 14 and 17, with zero error. The predictions are not bad, but let's do some calculations on the input features to make them better!

- What if we take the square root of $x_2$ and multiply it by 2?

In [8]:
X_train.X2 = 2 * np.sqrt(X_train.X2)
X_test.X2 = 2 * np.sqrt(X_test.X2)
print(X_test)
m2 = LinearRegression()
fit2 = m2.fit(X_train, y_train)
preds = fit2.predict(X_test)
print(f"\n{preds}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds))}\n")

   X1    X2
6   4  10.0
7   5  12.0

[14. 17.]
RMSE: 5.17892563931115e-15



The input manipulation caused a perfect linear fit. 
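One way to see why the fit becomes perfect is to look at the fitted coefficients: after the transformation the target is exactly `X1 + X2`, so the model should learn weights close to 1 and 1 with an intercept close to 0. A minimal sketch on the same training rows; the names `X_lin` and `model` are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Training rows from the notebook
X = pd.DataFrame({"X1": [1, 4, 1, 2, 3, 2], "X2": [16, 36, 16, 9, 36, 49]})
y = pd.Series([9, 16, 9, 8, 15, 16])

# Replace X2 with 2 * sqrt(X2), leaving the original frame untouched
X_lin = X.assign(X2=2 * np.sqrt(X["X2"]))

model = LinearRegression().fit(X_lin, y)
print(model.coef_, model.intercept_)
```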

Let's see how to do it with a `Pipeline`.

### 3. Linear Regression with a Pipeline

- First **`without input transformation`** to validate that we can get the same results as before:

In [9]:
# let's go back to original data

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

# let's create a pipeline
from sklearn.pipeline import Pipeline

print("Pipeline 1")
pipe1 = Pipeline(steps=[
    ('linear_model', LinearRegression())
])
print("fit pipeline 1")
pipe1.fit(X_train, y_train)
print("predict via pipeline 1")
preds1 = pipe1.predict(X_test)
print(f"\n{preds1}")  # should be [13.72113586 16.93334467]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds1))}\n")

Pipeline 1
fit pipeline 1
predict via pipeline 1

[13.72113586 16.93334467]
RMSE: 0.20274138822160784



How we did it:
- We declared a `pipe1` variable using the *Pipeline* class with a list of *steps* inside it
    - The name of the step (in this case `linear_model`) can be any unique name of our choice
    - It is followed by an actual `Transformer` or `Estimator` (in this case, our `LinearRegression()` model)
- Like any model, `LinearRegression()` is fitted on the training data, but through the `pipe1` variable
- `pipe1` is also used to predict on the test set, as one would do with any other model
- We got the results that we expected. Awesome!!!
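The step name is also how a fitted step can be looked up again later. The sketch below assumes the same training rows as above and uses the pipeline's `named_steps` attribute; the variable names `pipe`, `step_names` and `fitted_model` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X_train = pd.DataFrame({"X1": [1, 4, 1, 2, 3, 2], "X2": [16, 36, 16, 9, 36, 49]})
y_train = pd.Series([9, 16, 9, 8, 15, 16])

pipe = Pipeline(steps=[("linear_model", LinearRegression())])
pipe.fit(X_train, y_train)

# Each step is stored as a (name, estimator) pair
step_names = [name for name, _ in pipe.steps]

# Retrieve the fitted estimator by the name we gave it
fitted_model = pipe.named_steps["linear_model"]
print(step_names)
print(fitted_model.coef_, fitted_model.intercept_)
```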

### 4. Creating a Custom Input Transformer
To perform the input calculations or transformations, we will design a custom transformer:

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('\n>>>>>>>init() called.\n')

    def fit(self, X, y = None):
        print('\n>>>>>>>fit() called.\n')
        return self

    def transform(self, X, y = None):
        print('\n>>>>>>>transform() called.\n')
        X_ = X.copy() # creating a copy to avoid changes to original dataset
        X_.X2 = 2 * np.sqrt(X_.X2)
        return X_

We created a class and named it `ExperimentalTransformer`. All the transformers we design will inherit from the `BaseEstimator` and `TransformerMixin` classes, as they give us pre-existing methods for free (for example `get_params()`/`set_params()` and `fit_transform()`). The 3 methods to take care of:
- `__init__`: this is the constructor. It is called when the transformer is instantiated, which here happens while the pipeline is being defined
- `fit()`: called when we fit the pipeline
- `transform()`: called when we fit the pipeline or predict with it

For the moment, we just put print() messages in `__init__` and `fit()`, and wrote our calculations in `transform()`. As we see above, `fit()` returns `self` and `transform()` returns the modified copy `X_`.
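To make the "methods for free" point concrete, here is a minimal sketch of a print-free variant of the transformer. The class name `SqrtDoubler` is hypothetical; `fit_transform()` comes from `TransformerMixin` and `get_params()` from `BaseEstimator`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SqrtDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical print-free variant of ExperimentalTransformer."""

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return the instance
        return self

    def transform(self, X, y=None):
        X_ = X.copy()  # work on a copy so the caller's frame is untouched
        X_["X2"] = 2 * np.sqrt(X_["X2"])
        return X_

X = pd.DataFrame({"X1": [4, 5], "X2": [25, 36]})
transformer = SqrtDoubler()

transformed = transformer.fit_transform(X)  # inherited from TransformerMixin
params = transformer.get_params()           # inherited from BaseEstimator
print(transformed)
print(params)
```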

Let's put this into a pipeline to see the order in which these functions are called!

### 5. Custom Transformer in a Pipeline

In [11]:
print("Pipeline 2")
pipe2 = Pipeline(steps=[
    ('experimental_trans', ExperimentalTransformer()),    # this will trigger a call to __init__
    ('linear_model', LinearRegression())
])
print("fit pipeline 2")
pipe2.fit(X_train, y_train)
print("predict via pipeline 2")
preds2 = pipe2.predict(X_test)
print(f"\n{preds2}")  # should be [14. 17.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds2))}\n")

Pipeline 2

>>>>>>>init() called.

fit pipeline 2

>>>>>>>fit() called.


>>>>>>>transform() called.

predict via pipeline 2

>>>>>>>transform() called.


[14. 17.]
RMSE: 5.17892563931115e-15



In [12]:
# an alternate, shorter syntax to do the above, without naming each step, is:
from sklearn.pipeline import make_pipeline

print("Pipeline 2")
pipe2 = make_pipeline(ExperimentalTransformer(), # this will trigger a call to __init__
                      LinearRegression())
print("fit pipeline 2")
pipe2.fit(X_train, y_train)
print("predict via pipeline 2")
preds2 = pipe2.predict(X_test)
print(f"\n{preds2}")  # should be [14. 17.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds2))}\n")

Pipeline 2

>>>>>>>init() called.

fit pipeline 2

>>>>>>>fit() called.


>>>>>>>transform() called.

predict via pipeline 2

>>>>>>>transform() called.


[14. 17.]
RMSE: 5.17892563931115e-15



Three important things to note:
- `__init__` was called the moment we initialized the pipe2 variable
- Both fit() and transform() of our ExperimentalTransformer were called when we fitted the pipeline on training data. This makes sense, as that is how model fitting works: the input features must be transformed before the model can learn to predict y_train from them
- transform() is called, as expected, when we call predict(X_test): the input test features need to be square-rooted and doubled too before making predictions.

The result - perfect predictions!
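The call order described above can also be captured programmatically instead of being read from print output. This is a sketch under the assumption that the pipeline is used with its default settings (no `memory` caching); `LoggingTransformer` and the module-level `calls` list are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

calls = []  # hypothetical log of method calls, used instead of print()

class LoggingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        calls.append("init")

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X, y=None):
        calls.append("transform")
        X_ = X.copy()
        X_["X2"] = 2 * np.sqrt(X_["X2"])
        return X_

X_train = pd.DataFrame({"X1": [1, 4, 1, 2, 3, 2], "X2": [16, 36, 16, 9, 36, 49]})
y_train = pd.Series([9, 16, 9, 8, 15, 16])
X_test = pd.DataFrame({"X1": [4, 5], "X2": [25, 36]})

pipe = Pipeline(steps=[
    ("logging_trans", LoggingTransformer()),
    ("linear_model", LinearRegression()),
])
calls_after_init = list(calls)      # snapshot after defining the pipeline

pipe.fit(X_train, y_train)
calls_after_fit = list(calls)       # snapshot after fitting

preds = pipe.predict(X_test)
calls_after_predict = list(calls)   # snapshot after predicting
print(calls_after_predict)
```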

### 6. Modifying and Parameterizing Transformers

But we've assumed in the transform() function of our ExperimentalTransformer that the column name is X2. Let's not do so, and instead pass the column name via the constructor, `__init__()`.

Here is our `ExperimentalTransformer_2`:

In [13]:
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    # let's add two parameters for fun
    def __init__(self, feature_name, additional_param = 'Himanshu'):
        print('\n>>>>>>>init() called.\n')
        self.feature_name = feature_name
        self.additional_param = additional_param

    def fit(self, X, y = None):
        print('\n>>>>>>>fit() called.\n')
        print(f'\nadditional param ~~~~ {self.additional_param}\n')
        return self

    def transform(self, X, y = None):
        print('\n>>>>>>>transform() called.\n')
        X_ = X.copy() # creating a copy to avoid changes to original dataset
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

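- Take care to keep the parameter name in the `__init__` signature exactly the same as the attribute it is stored in (`feature_name` and `self.feature_name`).
- A mismatch will cause problems later, when we also try to transform the target feature (y): scikit-learn copies estimators with `clone()`, which reads each constructor parameter back through `get_params()` by looking up an attribute of the same name.
- That copying step is also the likely reason for the double call to `__init__` seen there: the copy is built by calling the constructor again with those parameters.
- The `additional_param` with a default value was added just to mix things up. It's not really needed for anything in our case and acts as an optional argument.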
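The naming rule can be illustrated with `get_params()` and `clone()` directly. This is a minimal sketch, not the notebook's own code: `NamedParamTransformer` and the `init_calls` list are hypothetical, and `transform()` is left as a pass-through to keep the focus on the constructor.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone

init_calls = []  # hypothetical log of constructor calls

class NamedParamTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name, additional_param="Himanshu"):
        init_calls.append(feature_name)
        # Attribute names match the argument names exactly
        self.feature_name = feature_name
        self.additional_param = additional_param

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X  # pass-through; the transformation itself is not the point here

original = NamedParamTransformer("X2")

# get_params() looks up one attribute per __init__ argument, by name
params = original.get_params()

# clone() builds an unfitted copy by calling the constructor again
copy = clone(original)
print(params)
print(len(init_calls))
```

With matching names the copy carries the same parameters as the original, and the constructor log shows one entry for the original and one for the clone.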

### 7. Parameterized Transformer in a Pipeline

Let's create the new pipeline now:

In [14]:
print("Pipeline 3")
pipe3 = Pipeline(steps=[
    ('experimental_trans', ExperimentalTransformer_2('X2')),   
    ('linear_model', LinearRegression())
])
print("fit pipeline 3")
pipe3.fit(X_train, y_train)
print("predict via pipeline 3")
preds3 = pipe3.predict(X_test)
print(f"\n{preds3}")  # should be [14. 17.]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds3))}\n")

Pipeline 3

>>>>>>>init() called.

fit pipeline 3

>>>>>>>fit() called.


additional param ~~~~ Himanshu


>>>>>>>transform() called.

predict via pipeline 3

>>>>>>>transform() called.


[14. 17.]
RMSE: 5.17892563931115e-15



### 8. Custom Target Transformation

via TransformedTargetRegressor

- What about a situation where both pre- and post-processing need to be done?
- Let's go further and modify the dataframe so that the target holds the squares of the current values. The new function is:
$$
\sqrt{y} = {x_1} + 2\sqrt{x_2}
$$

In [16]:
df = pd.DataFrame(columns=['X1', 'X2', 'y'], data=[
                                                   [1,16,81],
                                                   [4,36,256],
                                                   [1,16,81],
                                                   [2,9,64],
                                                   [3,36,225],
                                                   [2,49,256],
                                                   [4,25,196],
                                                   [5,36,289]
])

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

# let's see model's performance with no input & target transformations:
print("Pipeline 1")
pipe1 = Pipeline(steps=[
    ('linear_model', LinearRegression())
])
print("fit pipeline 1")
pipe1.fit(X_train, y_train)
print("predict via pipeline 1")
preds1 = pipe1.predict(X_test)
print(f"\n{preds1}")  # perfect predictions would be [196 289]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds1))}\n")

Pipeline 1
fit pipeline 1
predict via pipeline 1

[200.34790002 279.04738423]
RMSE: 7.679804528409069
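For comparison, here is a hedged sketch of how the target transformation named in this section's title could be wired up. It assumes `func=np.sqrt` and `inverse_func=np.square` as the pre- and post-processing of `y`, and uses a hypothetical print-free transformer `SqrtDoubler` in place of `ExperimentalTransformer_2`; it is an illustration under those assumptions rather than the notebook's own cell.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class SqrtDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical print-free stand-in for ExperimentalTransformer_2."""

    def __init__(self, feature_name):
        self.feature_name = feature_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

df = pd.DataFrame(columns=["X1", "X2", "y"], data=[
    [1, 16, 81], [4, 36, 256], [1, 16, 81], [2, 9, 64],
    [3, 36, 225], [2, 49, 256], [4, 25, 196], [5, 36, 289],
])
X_train, y_train = df.iloc[:6][["X1", "X2"]], df.iloc[:6]["y"]
X_test, y_test = df.iloc[6:][["X1", "X2"]], df.iloc[6:]["y"]

# Input transformation and model, as in the earlier pipelines
inner = Pipeline(steps=[
    ("sqrt_doubler", SqrtDoubler("X2")),
    ("linear_model", LinearRegression()),
])

# Fit on sqrt(y), then square the predictions to return to the original scale
model = TransformedTargetRegressor(regressor=inner, func=np.sqrt, inverse_func=np.square)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(preds)
```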

