## Composite Estimators using Pipeline & FeatureUnions

<hr>

### Agenda
1. Introduction to Composite Estimators
2. Pipelines
3. TransformedTargetRegressor
4. FeatureUnions
5. ColumnTransformer
6. GridSearch on pipeline

Note: written against scikit-learn version 0.20

<hr>

### 1. Introduction to Composite Estimators
* A composite estimator is formed by chaining one or more transformers with a final estimator.
* Such compositions are implemented using Pipeline.
* FeatureUnion concatenates the outputs of several transformers to create derived features.
* Pipelines make machine learning code reusable & modular.

### 2. Pipeline
* Before data is fed to a learning algorithm, it usually needs preprocessing, such as handling missing values.
* Several different preprocessing steps may be required.
* The output of each preprocessor is passed to the next preprocessor & finally to the estimator.
* Pipeline automates this whole process.

<img src="https://github.com/awantik/machine-learning-slides/blob/master/pipeline-ml2.png?raw=true">

* Intermediate steps, i.e. the transformers, must implement fit & transform
* The same trained pipeline can be used for prediction
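
The fit & transform requirement can be illustrated with a small sketch. `ClipOutliers` is a hypothetical transformer written only for this illustration, and the data is synthetic; neither comes from the horror author example that follows.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each column to percentile bounds learned in fit."""

    def __init__(self, lower=5, upper=95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn per-column bounds from the training data only.
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the bounds learned in fit to any new data.
        return np.clip(X, self.low_, self.high_)


# Synthetic data (an assumption for this sketch).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_pipe = Pipeline([
    ('clip', ClipOutliers()),        # intermediate step: fit + transform
    ('scale', StandardScaler()),     # intermediate step: fit + transform
    ('clf', LogisticRegression()),   # final step: fit + predict
])
demo_pipe.fit(X_demo, y_demo)

# The fitted pipeline replays every transform before predicting.
demo_pred = demo_pipe.predict(X_demo[:5])
```

Calling `fit` on the pipeline fits each transformer in order and passes its output to the next step; `predict` reuses the fitted transformers without refitting them.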

#### Predicting horror author from text 

In [5]:
import pandas as pd

In [6]:
horror_train_data = pd.read_csv('Data/HorrorAuthor/train.csv')

In [7]:
horror_train_data.head()

,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [8]:
horror_test_data= pd.read_csv('Data/HorrorAuthor/test.csv')

In [9]:
horror_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
id        19579 non-null object
text      19579 non-null object
author    19579 non-null object
dtypes: object(3)
memory usage: 459.0+ KB


In [10]:
horror_train_data = horror_train_data[['text','author']]

In [11]:
from sklearn.pipeline import make_pipeline

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [14]:
pipelines = []
for model in [LogisticRegression(), MultinomialNB(), LinearSVC()]:
    pipeline = make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              model)
    pipelines.append(pipeline)

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
trainX,testX,trainY,testY = train_test_split(horror_train_data.text, horror_train_data.author)

In [17]:
for pipeline in pipelines:
    pipeline.fit(trainX, trainY)



In [18]:
for pipeline in pipelines:
    print(pipeline.score(testX, testY))

0.797752808988764
0.8126659856996936
0.8032686414708886


In [19]:
horror_test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8392 entries, 0 to 8391
Data columns (total 2 columns):
id      8392 non-null object
text    8392 non-null object
dtypes: object(2)
memory usage: 131.2+ KB


In [20]:
results = []
for pipeline in pipelines:
    result = pipeline.predict(horror_test_data.text)
    results.append(result)

In [21]:
results

[array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'EAP'], dtype=object),
 array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'HPL'], dtype='<U3'),
 array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'HPL'], dtype=object)]

In [22]:
pipelines[0].steps[0][1].transform(horror_test_data.text)

<8392x22075 sparse matrix of type '<class 'numpy.int64'>'
	with 88741 stored elements in Compressed Sparse Row format>

#### Caching transformers within a Pipeline
* A Pipeline can cache fitted transformers so that they are not recomputed
* This helps when the pipeline is run through GridSearch, where the same transformer steps would otherwise be refit for every parameter combination

In [23]:
from sklearn.model_selection import GridSearchCV

In [24]:
svc_pipe =  make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              LinearSVC())

In [25]:
svc_pipe

Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [26]:
svc_pipe.steps

[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None, stop_words='english',
          strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('tfidftransformer',
  TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
 ('linearsvc',
  LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
       intercept_scaling=1, loss='squared_hinge', max_iter=1000,
       multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
       verbose=0))]

In [27]:
import numpy as np
params = {
    'linearsvc__C': list(np.logspace(1,20,20))
}

In [28]:
gs = GridSearchCV(svc_pipe,cv=2,param_grid=params)

%timeit gs.fit(trainX,trainY)

In [None]:
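from tempfile import mkdtemp
from shutil import rmtree
# Memory comes from joblib; sklearn.utils.Memory is not available in later scikit-learn releases
from joblib import Memory

# Fitted transformers are cached in a temporary directory
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=0)
svc_pipe_cached =  make_pipeline(
              CountVectorizer(stop_words='english'),
              TfidfTransformer(),
              LinearSVC(), memory = memory)
# Call rmtree(cachedir) once the cache is no longer needed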

In [None]:
gs_cached = GridSearchCV(svc_pipe_cached,cv=2,param_grid=params, verbose=0)

In [None]:
%timeit gs_cached.fit(trainX,trainY)

### 3. Transforming target in regression
* Linear regression assumes the dependent & independent variables are linearly related
* If the dependent variable is not normally distributed, transforming it toward a normal distribution can lower the error
* Predictions must then be mapped back to the original scale
* TransformedTargetRegressor automates this entire process

In [None]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
boston = load_boston()

In [None]:
X = boston.data

In [None]:
y = boston.target

In [None]:
regressor = LinearRegression()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
regressor.fit(X_train, y_train)

In [None]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

In [None]:
pred = regressor.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

In [None]:
mean_absolute_error(y_pred=pred, y_true=y_test)

In [4]:
from sklearn.preprocessing import PowerTransformer,QuantileTransformer

In [117]:
pt = PowerTransformer()

In [126]:
qt = QuantileTransformer(output_distribution='normal')

In [142]:
#X_tf = pt.fit_transform(X)
#OR
X_tf = qt.fit_transform(X)

In [128]:
X_train, X_test, y_train, y_test = train_test_split(X_tf, y, random_state=0)

In [129]:
regressor = LinearRegression()

In [130]:
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [131]:
print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))

R2 score: 0.66


In [132]:
pred = regressor.predict(X_test)

In [133]:
mean_absolute_error(y_pred=pred, y_true=y_test)

3.6331157010096167

In [106]:
from sklearn.compose import TransformedTargetRegressor

In [134]:
regr = TransformedTargetRegressor(regressor=regressor,transformer=qt)

In [135]:
regr.fit(X_train, y_train)

TransformedTargetRegressor(check_inverse=True, func=None, inverse_func=None,
              regressor=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False),
              transformer=QuantileTransformer(copy=True, ignore_implicit_zeros=False, n_quantiles=1000,
          output_distribution='normal', random_state=None,
          subsample=100000))

In [136]:
pred = regr.predict(X_test)

In [137]:
mean_absolute_error(y_pred=pred, y_true=y_test)

3.350934118150315

In [141]:
r2_score(y_pred=pred, y_true=y_test)

0.7104191534298219

#### Hyper-parameters of TransformedTargetRegressor
* regressor - the model instance to fit on the transformed target
* transformer - an object that supports transform & inverse_transform
* func - function applied to the target before fitting (used instead of transformer)
* inverse_func - function that converts predictions back to the original data scale
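
The function-based alternative to `transformer` can be sketched as follows, using the `func` and `inverse_func` arguments of TransformedTargetRegressor. The data is synthetic and the `np.log1p` / `np.expm1` pair is an assumption chosen for illustration, not something used elsewhere in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): the target grows exponentially with the
# feature, so log(1 + y) is roughly linear in X.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 2, size=(300, 1))
y_demo = np.expm1(1.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300))

# Baseline: fit the raw target directly.
plain = LinearRegression().fit(X_demo, y_demo)

# Fit on log1p(y); predictions are mapped back with expm1 automatically.
log_target = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
).fit(X_demo, y_demo)

plain_r2 = plain.score(X_demo, y_demo)
log_r2 = log_target.score(X_demo, y_demo)
print('R2 raw target: {0:.2f}'.format(plain_r2))
print('R2 log target: {0:.2f}'.format(log_r2))
```

Both scores are computed on the original target scale, so they are directly comparable.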

### 4. FeatureUnion
* It combines several transformer objects into a single transformer
* The transformers are applied in parallel to the same input
* During fit, each transformer is fitted independently
* During transform, the outputs of all transformers are concatenated side by side
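
The concatenation can be seen in a minimal sketch before moving to the larger example. The data is synthetic, and the choice of PCA and SelectKBest as the two branches is an assumption made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

# Synthetic data (an assumption): 100 rows, 6 numeric features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)

union = FeatureUnion([
    ('pca', PCA(n_components=2)),            # contributes 2 columns
    ('kbest', SelectKBest(f_classif, k=3)),  # contributes 3 columns
])

# Each branch sees the same input; the outputs are stacked column-wise.
X_union = union.fit_transform(X_demo, y_demo)
print(X_union.shape)
```

The number of output columns is the sum of the columns produced by each branch, here 2 + 3.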

#### Predicting employee exit - The Pipeline & FeatureUnion Way

In [143]:
emp_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/HR_comma_sep.csv.txt')

In [145]:
emp_data.head()

,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


In [146]:
emp_data.rename(columns={'sales':'dept'}, inplace=True)

In [151]:
num_cols = ['number_project','average_montly_hours','time_spend_company']

In [163]:
bin_cols = ['Work_accident','promotion_last_5years']

In [225]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder,LabelEncoder, LabelBinarizer, MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest

In [182]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

In [154]:
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self,key):
        self.key = key
        
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        return X[self.key]

In [155]:
class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)

In [156]:
pipeline_dept = Pipeline([
    ('selector', ItemSelector('dept')),
    ('lb', MyLabelBinarizer()),
])

In [157]:
pipeline_dept.fit_transform(emp_data)

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0]])

In [158]:
class MultiItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self,keys):
        self.keys = keys
        
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        return X[self.keys]

In [159]:
class SalaryMapper(BaseEstimator, TransformerMixin):
    
    def fit(self,X,Y=None):
        return self
    
    def transform(self,X,Y=None):
        db = {'low':1,'medium':2,'high':3}
        print (type(X))
        r = X.str.strip().replace(db)
        return r.values.reshape(-1,1)

In [160]:
pipeline_salary = Pipeline([
    ('selector',ItemSelector('salary')),
    ('sm',SalaryMapper())
])

In [162]:
pipeline_numbers = Pipeline([
    ('selector',MultiItemSelector(num_cols)),
    ('scaling', MinMaxScaler())
])

In [165]:
pipeline_bin = Pipeline([
    ('selector',MultiItemSelector(bin_cols))
])

In [166]:
fu = FeatureUnion([
    ('dept_pipe',pipeline_dept),
    ('salary_pipe',pipeline_salary),
    ('numbers_pipe',pipeline_numbers),
    ('bin_pipe',pipeline_bin)
])

In [200]:
pipeline = Pipeline([
    ('union',fu),
    #('feature_selector',SelectKBest(k=15)),
    ('classifier',RandomForestClassifier(n_estimators=10))
])

In [201]:
from sklearn.model_selection import train_test_split

In [202]:
trainX,testX, trainY,testY = train_test_split(emp_data.drop('left',axis=1), emp_data.left)

In [203]:
pipeline.fit(trainX,trainY)

<class 'pandas.core.series.Series'>



Pipeline(memory=None,
     steps=[('union', FeatureUnion(n_jobs=None,
       transformer_list=[('dept_pipe', Pipeline(memory=None,
     steps=[('selector', ItemSelector(key='dept')), ('lb', <__main__.MyLabelBinarizer object at 0x00000200323DC748>)])), ('salary_pipe', Pipeline(memory=None,
     steps=[('selector', ItemSelector...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [204]:
pipeline.predict(testX)

<class 'pandas.core.series.Series'>


array([0, 0, 1, ..., 0, 0, 0], dtype=int64)

In [205]:
pipeline.score(testX,testY)

<class 'pandas.core.series.Series'>


0.9664

### 5. ColumnTransformer (Beta stage)
* Datasets often contain heterogeneous column types
* ColumnTransformer offers an easy way to map each group of columns to its own pipeline

In [254]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [251]:
titanic_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/titanic-train.csv.txt', index_col='PassengerId')

In [252]:
titanic_data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [256]:
num_cols = ['Age','Fare']
cat_cols = ['Embarked','Sex','Pclass']

In [318]:
pipeline_num = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaling',StandardScaler())
])

In [319]:
pipeline_cat = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoding', OneHotEncoder(handle_unknown='ignore'))
])

In [320]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', pipeline_num, num_cols),
        ('cat', pipeline_cat, cat_cols)])

In [321]:
pipeline = Pipeline(steps=[('preprocessor',preprocessor),
                ('classifier',RandomForestClassifier(n_estimators=10))])

In [322]:
X = titanic_data.drop('Survived',axis=1)

In [323]:
Y = titanic_data.Survived

In [324]:
trainX,testX,trainY,testY = train_test_split(X,Y)

In [325]:
pipeline.fit(trainX,trainY)

Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbo...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [326]:
pipeline.score(testX,testY)

0.852017937219731

### 6. GridSearch for pipelines
* Pipelines combine transformers & estimators
* Hyper-parameters of both the transformers & the estimator can be tuned together, addressing each one as `<step name>__<parameter name>`
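
The parameter names a grid search accepts can be listed with `get_params`. The sketch below uses a small stand-in pipeline (the step names `scaling` and `classifier` are assumptions for illustration), not the Titanic pipeline built above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Nested parameters appear as <step name>__<parameter name>.
tunable = sorted(name for name in demo_pipe.get_params() if '__' in name)
print(tunable)

# set_params accepts the same names that GridSearchCV expects in param_grid.
demo_pipe.set_params(classifier__C=0.1, scaling__with_mean=False)
```

For nested composites the names chain further, one `__` per level, which is how a key such as `preprocessor__num__imputer__strategy` is formed.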

In [311]:
pipeline.steps

[('preprocessor',
  ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
           transformer_weights=None,
           transformers=[('num', Pipeline(memory=None,
       steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
         verbose=0)), ('scaling', StandardScaler(copy=True, with_mean=True, with_std=True))]), ['Age', 'Fare']), ('cat', Pipeline(memory=None,
       steps=[...4'>, handle_unknown='ignore',
         n_values=None, sparse=True))]), ['Embarked', 'Sex', 'Pclass'])])),
 ('classifier',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
              max_depth=None, max_features='auto', max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
              oob_score=False, random_state=None, verbose=0,
              warm_start

In [327]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [10,15,20],
}

In [328]:
from sklearn.model_selection import GridSearchCV

In [329]:
grid_search = GridSearchCV(pipeline, param_grid, cv=5, iid=False)
grid_search.fit(trainX,trainY)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbo...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=False, n_jobs=None,
       param_grid={'preprocessor__num__imputer__strategy': ['mean', 'median'], 'classifier__n_estimators': [10, 15, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [330]:
grid_search.score(testX,testY)

0.8295964125560538

In [331]:
grid_search.best_params_

{'classifier__n_estimators': 20,
 'preprocessor__num__imputer__strategy': 'mean'}