# CUSTOM TRANSFORMERS AND PIPELINES
---

### 1. Creating Data
Let's create a dataframe based on the equation:
$$
y = {x_1} + 2\sqrt{x_2}
$$
Because $y$ depends on the square root of $x_2$, a simple Linear Regression model on the raw features won't be able to fit it exactly.

In [3]:
import pandas as pd

df = pd.DataFrame(columns=['X1', 'X2', 'y'], data=[
    [1, 16, 9],
    [4, 36, 16],
    [1, 16, 9],
    [2, 9, 8],
    [3, 36, 15],
    [2, 49, 16],
    [4, 25, 14],
    [5, 36, 17],
])

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

In [4]:
df

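|   | X1 | X2 | y  |
|---|----|----|----|
| 0 | 1  | 16 | 9  |
| 1 | 4  | 36 | 16 |
| 2 | 1  | 16 | 9  |
| 3 | 2  | 9  | 8  |
| 4 | 3  | 36 | 15 |
| 5 | 2  | 49 | 16 |
| 6 | 4  | 25 | 14 |
| 7 | 5  | 36 | 17 |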


In [5]:
X_train

|   | X1 | X2 |
|---|----|----|
| 0 | 1  | 16 |
| 1 | 4  | 36 |
| 2 | 1  | 16 |
| 3 | 2  | 9  |
| 4 | 3  | 36 |
| 5 | 2  | 49 |


In [6]:
X_test

|   | X1 | X2 |
|---|----|----|
| 6 | 4  | 25 |
| 7 | 5  | 36 |


In [7]:
y_train

0     9
1    16
2     9
3     8
4    15
5    16
Name: y, dtype: int64

In [8]:
y_test

6    14
7    17
Name: y, dtype: int64
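
As a quick sanity check, the sketch below recomputes `y` from the equation at the top of this section and compares it with the `y` column. The names `df_check`, `expected_y` and `rows_match` are made up for this illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Same rows as the dataframe created above (rebuilt here so the check is self-contained)
df_check = pd.DataFrame(
    columns=['X1', 'X2', 'y'],
    data=[[1, 16, 9], [4, 36, 16], [1, 16, 9], [2, 9, 8],
          [3, 36, 15], [2, 49, 16], [4, 25, 14], [5, 36, 17]],
)

# Recompute the target from the equation y = x1 + 2 * sqrt(x2)
expected_y = df_check.X1 + 2 * np.sqrt(df_check.X2)

# True only if every row follows the equation
rows_match = bool(np.allclose(expected_y, df_check.y))
print(rows_match)
```

If this prints `True`, every row of the dataframe follows the equation, so any error in the models below comes from the model form and not from noise in the data.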

### 2. Linear Regression Model
Let's predict with `Linear Regression` to see what happens:

In [14]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr_fit = lr.fit(X_train, y_train)
preds = lr_fit.predict(X_test)
lr_mse = mean_squared_error(y_test, preds)
print(f"\n{preds}")
print(f"RMSE: {np.sqrt(lr_mse)}\n")


[13.72113586 16.93334467]
RMSE: 0.20274138822160784



Perfect predictions would be 14 and 17, with zero error. These predictions are not bad, but let's transform the input features to make them better!

- What if we take the square root of $x_2$ and multiply it by 2?

In [17]:
X_train.X2 = 2 * np.sqrt(X_train.X2)
X_test.X2 = 2 * np.sqrt(X_test.X2)
print(X_test)
m2 = LinearRegression()
fit2 = m2.fit(X_train, y_train)
preds = fit2.predict(X_test)
print(f"\n{preds}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds))}\n")

   X1    X2
6   4  10.0
7   5  12.0

[14. 17.]
RMSE: 5.17892563931115e-15



The input manipulation caused a perfect linear fit. 
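
One way to see why the fit is now perfect is to inspect the fitted coefficients. After the transformation the target is simply the sum of the two columns, so we would expect coefficients close to 1 and an intercept close to 0. The sketch below rebuilds the six training rows under hypothetical names (`X_demo`, `y_demo`, `model_demo`) so that it does not touch the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The six training rows from the dataframe above
X_demo = pd.DataFrame({'X1': [1, 4, 1, 2, 3, 2],
                       'X2': [16, 36, 16, 9, 36, 49]})
y_demo = pd.Series([9, 16, 9, 8, 15, 16], name='y')

# Same manipulation as above: replace X2 with 2 * sqrt(X2)
X_demo = X_demo.assign(X2=2 * np.sqrt(X_demo.X2))

model_demo = LinearRegression().fit(X_demo, y_demo)

# Coefficients for X1 and the transformed X2, then the intercept
print(model_demo.coef_)
print(model_demo.intercept_)
```

The printed values may differ from 1 and 0 by tiny floating-point amounts, which is also why the RMSE above is around `1e-15` and not exactly zero.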

Let's see how to do it with a `Pipeline`.

### 3. Linear Regression with a Pipeline

- First **`without input transformation`** to validate that we can get the same results as before:

In [21]:
# let's go back to original data

train = df.iloc[:6]
test = df.iloc[6:]

X_train = train.drop('y', axis=1)
y_train = train.y

X_test = test.drop('y', axis=1)
y_test = test.y

# let's create a pipeline
from sklearn.pipeline import Pipeline

print("Pipeline 1")
pipe1 = Pipeline(steps=[
    ('linear_model', LinearRegression())
])
print("fit pipeline 1")
pipe1.fit(X_train, y_train)
print("predict via pipeline 1")
preds1 = pipe1.predict(X_test)
print(f"\n{preds1}")  # should be [13.72113586 16.93334467]
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, preds1))}\n")

Pipeline 1
fit pipeline 1
predict via pipeline 1

[13.72113586 16.93334467]
RMSE: 0.20274138822160784



How we did it:
- We declared a `pipe1` variable using the *Pipeline* class, passing it a list of *steps*
    - The name of the step (in this case `linear_model`) can be any unique name of our choice
    - It is followed by an actual `Transformer` or `Estimator` (in this case, our `LinearRegression()` model)
- As with any model, `LinearRegression()` is fitted on the training data, but through the `pipe1` variable
- `pipe1` is also used to predict on the test set, as one would do with any other model
- We got the results that we expected. Awesome!!!
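
The points above can be sketched as follows: a one-step pipeline should give the same predictions as the bare model, and the fitted estimator can be looked up by its step name through `named_steps`. The variable names (`X_tr`, `y_tr`, `X_te`, `pipe_demo`, `plain_lr`) are made up for this illustration, and the data is rebuilt by hand from the dataframe above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Untransformed train/test rows from the dataframe above
X_tr = pd.DataFrame({'X1': [1, 4, 1, 2, 3, 2], 'X2': [16, 36, 16, 9, 36, 49]})
y_tr = pd.Series([9, 16, 9, 8, 15, 16], name='y')
X_te = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})

# A pipeline with a single named step, as in pipe1
pipe_demo = Pipeline(steps=[('linear_model', LinearRegression())])
pipe_demo.fit(X_tr, y_tr)

# The fitted estimator can be retrieved by the name we gave the step
fitted_lr = pipe_demo.named_steps['linear_model']

# A plain model fitted on the same data, for comparison
plain_lr = LinearRegression().fit(X_tr, y_tr)

# True if the pipeline and the plain model predict the same values
same_preds = bool(np.allclose(pipe_demo.predict(X_te), plain_lr.predict(X_te)))
print(same_preds)
print(fitted_lr.coef_, fitted_lr.intercept_)
```

Giving each step a name is what makes this lookup possible. It matters more once a pipeline has several steps and we want to inspect one of them after fitting.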

### 4. Creating a Custom Input Transformer
To perform the inp