# Pipelines
This very powerful class has made my entire ML workflow -- from preprocessing to evaluation -- so much more tractable, more robust, and less susceptible to guesswork, especially in the hyperparameter tuning stage. As a colleague of mine said, it really ought to be part of every sklearn-based ML project! Here's a description of what it does:

# Basics 

Transformer -- in scikit-learn, a class that has fit and transform methods, or a fit_transform method.

Predictor -- a class that has fit and predict methods, or a fit_predict method.

Pipeline -- just an abstract notion; it's not an existing ML algorithm. Often in ML tasks you need to perform a sequence of different transformations (find a set of features, generate new features, select only the good features) on the raw dataset before applying the final estimator.

In scikit-learn, a Pipeline sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
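As a rough illustration of the rule above, here is a minimal sketch of a two-step pipeline. The toy data and the step names `scale` and `linear_model` are made up for this example; `StandardScaler` stands in for any intermediate transform.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for illustration: y = 3 * x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3 * X.ravel() + 1

# Intermediate step: a transformer (fit + transform).
# Final step: an estimator (fit, and here also predict).
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("linear_model", LinearRegression()),
])

# fit() fits the scaler, transforms X with it, then fits the model on the result
pipe.fit(X, y)

# predict() applies scaler.transform, then model.predict
preds = pipe.predict(np.array([[5.0]]))
step_names = [name for name, _ in pipe.steps]
print(step_names, preds)
```

Calling `fit` and `predict` on the pipeline is enough; the intermediate `transform` calls happen internally.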


### Transformer 
For data preparation:

- fit -- find parameters from the training data (if needed)
- transform -- apply to training or test data
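A small sketch of the fit/transform split, using `StandardScaler` as a stand-in transformer and made-up numbers. The point it tries to show is that the parameters come from the training data only and are then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(train)                      # learns mean_ and scale_ from the training data only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and standard deviation

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```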

### Estimator
For modeling:

- fit -- find parameters from the training data
- predict -- apply to training or test data
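One way to see the difference between the two roles is to check which methods an object exposes. The helper `supported_methods` below is hypothetical, written only for this sketch; `StandardScaler` and `LinearRegression` are used as representative examples.

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def supported_methods(obj):
    """Report which of the three core methods an object exposes."""
    return {name: hasattr(obj, name) for name in ("fit", "transform", "predict")}


scaler_methods = supported_methods(StandardScaler())    # a transformer
model_methods = supported_methods(LinearRegression())   # a predictor

print(scaler_methods)
print(model_methods)
```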

In [1]:
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
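class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    """Turn raw text into one spaCy document vector per input string."""

    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300  # nominal size; the actual vector length depends on the loaded model

    def fit(self, X, y=None):
        # Nothing to learn; y is optional so the step also works without targets
        return self

    def transform(self, X):
        # One vector per document
        return [self.nlp(text).vector for text in X]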

In [5]:
import pandas as pd

In [6]:
df = pd.DataFrame(columns=['X1', 'X2', 'y'], data=[
    [1,16,9],
    [4,36,16],
    [1,16,9],
    [2,9,8],
    [3,36,15],
    [2,49,16],
    [4,25,14],
    [5,36,17]
])

In [25]:
train = df.iloc[:6]
test = df.iloc[6:]

train_X = train.drop('y', axis=1)
train_y = train.y

test_X = test.drop('y', axis=1)
test_y = test.y

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

In [15]:
m1 = LinearRegression()
fit1 = m1.fit(train_X, train_y)
preds = fit1.predict(test_X)
print(preds)
print(np.sqrt(mean_squared_error(test_y, preds)))

[13.72113586 16.93334467]
0.20274138822160603


In [16]:
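# In this toy data y = X1 + 2 * sqrt(X2) exactly, so rescaling X2 this way
# makes the target a linear function of the features.
train_X.X2 = 2 * np.sqrt(train_X.X2)
test_X.X2 = 2 * np.sqrt(test_X.X2)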

print(test_X)
m2 = LinearRegression()
fit2 = m2.fit(train_X, train_y)
preds = fit2.predict(test_X)
print(preds)
print(np.sqrt(mean_squared_error(test_y, preds)))

   X1    X2
6   4  10.0
7   5  12.0
[14. 17.]
1.2560739669470201e-15


In [17]:
from sklearn.pipeline import Pipeline

In [26]:
print("create pipeline 1")
pipe1 = Pipeline(steps = [
    ('linear_model', LinearRegression())
])
print("fit pipeline 1")
pipe1.fit(train_X, train_y)
print("predict via pipeline 1")
preds1 = pipe1.predict(test_X)
print(preds1)
print(np.sqrt(mean_squared_error(test_y, preds1)))

create pipeline 1
fit pipeline 1
predict via pipeline 1
[13.72113586 16.93334467]
0.20274138822160603


In [27]:
class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print(' >>>>>>>init() called.\n')
        
    def fit(self, X, y = None):
        print('>>>>>fit() called .\n')
        return self
    
    def transform(self, X, y = None):
        print('>>>>>transform() called.\n')
        X_ = X.copy() # creating a copy to avoid changes to original dataset
        X_.X2 = 2 * np.sqrt(X_.X2)
        return X_

In [28]:
print("Created pipeline 2")
pip2 = Pipeline(steps = [
    ("experimental_trans", ExperimentalTransformer()),
    ('linear_model', LinearRegression())
])

print("Fit pipeline 2")
pip2.fit(train_X, train_y)
print("predict via pipeline 2")
preds2 = pip2.predict(test_X)
print(preds2)
print(np.sqrt(mean_squared_error(test_y, preds2)))

Created pipeline 2
 >>>>>>>init() called.

Fit pipeline 2
>>>>>fit() called .

>>>>>transform() called.

predict via pipeline 2
>>>>>transform() called.

[14. 17.]
1.2560739669470201e-15


In [37]:
class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name, additional_param = "Himanshu"):
        print(' >>>>>>>init() called.\n')
        self.feature_name = feature_name
        self.additional_param = additional_param
        
    def fit(self, X, y = None):
        print('>>>>>fit() called .\n')
        print('additional param ~~~~~')
        return self
    
    def transform(self, X, y = None):
        print('>>>>>transform() called.\n')
        X_ = X.copy() # creating a copy to avoid changes to original dataset
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

In [38]:
ExperimentalTransformer_2('X2')

 >>>>>>>init() called.



ExperimentalTransformer_2(additional_param='Himanshu', feature_name='X2')

In [39]:
print("create pipeline 2")
pipe2 = Pipeline(steps = [
    ('experimental_trans', ExperimentalTransformer_2('X2')),
    ('linear_model', LinearRegression())
])
print("Fit pipeline 2")
pipe2.fit(train_X, train_y)
print("predict via pipeline 2")
preds2 = pipe2.predict(test_X)
print(preds2)
print(np.sqrt(mean_squared_error(test_y, preds2)))

create pipeline 2
 >>>>>>>init() called.

Fit pipeline 2
>>>>>fit() called .

additional param ~~~~~
>>>>>transform() called.

predict via pipeline 2
>>>>>transform() called.

[14. 17.]
1.2560739669470201e-15
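Since hyperparameter tuning was the motivation, here is a hedged sketch of how constructor parameters like the ones above become tunable once a transformer sits inside a pipeline. The class `SqrtFeature`, its `factor` parameter, and the step name `sqrt` are hypothetical names for this example; the data is the same toy table used earlier. The grid here is deliberately tiny and only meant to show the `<step>__<parameter>` naming.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class SqrtFeature(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replace one column with factor * sqrt(column)."""

    def __init__(self, feature_name, factor=2.0):
        # Store constructor arguments unchanged so get_params/set_params can find them
        self.feature_name = feature_name
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()  # avoid changing the caller's DataFrame
        X_[self.feature_name] = self.factor * np.sqrt(X_[self.feature_name])
        return X_


# Same toy data as above
df = pd.DataFrame(columns=["X1", "X2", "y"], data=[
    [1, 16, 9], [4, 36, 16], [1, 16, 9], [2, 9, 8],
    [3, 36, 15], [2, 49, 16], [4, 25, 14], [5, 36, 17],
])
X, y = df.drop("y", axis=1), df["y"]

pipe = Pipeline(steps=[
    ("sqrt", SqrtFeature("X2")),
    ("linear_model", LinearRegression()),
])

# Parameters of each step are exposed as <step name>__<parameter name>
default_factor = pipe.get_params()["sqrt__factor"]
pipe.set_params(sqrt__factor=1.0)

# The same names work as keys in a grid search over the whole pipeline
search = GridSearchCV(
    pipe,
    param_grid={"linear_model__fit_intercept": [True, False]},
    scoring="neg_mean_squared_error",
    cv=2,
)
search.fit(X, y)
print(search.best_params_)
```

Because the transformer is refit inside every cross-validation split, the preprocessing never sees the held-out fold, which is a large part of why tuning through a pipeline involves less guesswork.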
