# 21. PIPELINES AND COMPOSITE ESTIMATORS
---

- [Step by Step Guide](https://medium.com/analytics-vidhya/scikit-learn-pipelines-with-custom-transformer-a-step-by-step-guide-9b9b886fd2cc)
- [Jupyter Notebook](https://github.com/abhi-rawat1/machine_learning_projects/blob/master/Sklearn_Pipeline_Custom_transformer/Titanic_Model_With_Pipeline_CustomTransformer.ipynb)

### 1.1 Dataset Exploration

In [1]:
import numpy as np
import pandas as pd

titanic_df = pd.read_csv('data/titanic_train.csv')
titanic_df = titanic_df.drop('PassengerId', axis=1)
titanic_df.head()

|   | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [2]:
titanic_df.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

After dropping `PassengerId`, the dataset has 11 columns. We can place them into 3 categories:
- Target feature: `Survived`, numerical data type
- Numerical features: `Age` and `Fare` are continuous features
- Categorical features:
    - `Pclass`, `Sex`, `SibSp`, `Parch` and `Embarked` won't need custom transformation
    - `Name` and `Cabin` are free-text features that need custom transformation

The remaining column, `Ticket`, is not assigned to any of the feature lists below, so it is not passed to the model.
    
##### Dataset Split

In [3]:
from sklearn.model_selection import train_test_split

# Split the data into train and test.
X = titanic_df.drop('Survived', axis=1)
y = titanic_df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
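
The split above is random, so the scores reported later change from run to run. As a minimal sketch (on a made-up toy frame, not the Titanic file), passing `random_state` makes the split repeatable and `stratify` keeps the class ratio the same in both parts. The names `toy_df`, `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy_df = pd.DataFrame({
    'Age': np.arange(20, 40, dtype=float),
    'Survived': [0, 1] * 10,
})
X_toy = toy_df.drop('Survived', axis=1)
y_toy = toy_df['Survived']

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
split_a = train_test_split(X_toy, y_toy, test_size=0.2,
                           random_state=42, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.2,
                           random_state=42, stratify=y_toy)

X_tr, X_te, y_tr, y_te = split_a
print(X_te.index.equals(split_b[1].index))  # both calls select the same test rows
print(y_te.value_counts().to_dict())        # class counts in the test part
```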

In [4]:
num_feat = ['Age', 'Fare']
cat_feat = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
name_feat = ['Name']
cabin_feat = ['Cabin']

### 1.2 Standard Transformation

Here we create pipelines for the standard transformation of numeric and categorical features, using only built-in transformers.
- Numeric features:
    - `SimpleImputer(strategy='median')`: replaces missing (NaN) values with the median of the corresponding column
    - `StandardScaler`: standardizes numerical columns so that the mean is 0 and the standard deviation is 1
- Categorical features:
    - `SimpleImputer(strategy='constant', fill_value='missing')`: replaces missing values with the constant string `missing` (the `fill_value`)
        - *Notes*: The `strategy` for SimpleImputer can be `mean`, `median`, `most_frequent`, or `constant`. When it is `constant`, `fill_value` is used to replace all occurrences of missing values. `fill_value` can be a string or a numerical value. When `fill_value` is left at its default, 0 is used for numerical data and `'missing_value'` for strings or object data types.
    - `OneHotEncoder`: encodes categorical features as a one-hot numeric array

Here are our standard transformers:

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', 
                              fill_value='missing')),
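    # handle_unknown='ignore' avoids an error if the test split contains
    # a category that was not seen during fit
    ('encoder', OneHotEncoder(handle_unknown='ignore'))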
])
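
To see what these two pipelines do, here is a small sketch on a made-up three-row frame. The names `toy`, `toy_num_pipe` and `toy_cat_pipe` are hypothetical; the pipelines are rebuilt here only so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Made-up data with one missing value per column.
toy = pd.DataFrame({
    'Fare': [1.0, np.nan, 3.0],
    'Embarked': ['S', np.nan, 'C'],
})

toy_num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
toy_cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder()),
])

# The NaN in Fare becomes the median (2.0) before scaling.
num_out = toy_num_pipe.fit_transform(toy[['Fare']])

# The NaN in Embarked becomes its own 'missing' category before encoding.
cat_out = toy_cat_pipe.fit_transform(toy[['Embarked']]).toarray()
categories = list(toy_cat_pipe.named_steps['encoder'].categories_[0])

print(num_out.ravel())
print(categories)
print(cat_out)
```

Imputing with a constant means missing values end up as one more one-hot column instead of being silently merged into an existing category.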

### 1.3 Custom Transformation

To process `Cabin` and `Name` features, we will create custom transformers because these two features can't be directly transformed by our standard transformers.

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

class CabinFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('in the CabinFeatureTransformer init method: ')

    def _cabin_letters(self, x):
        # replace missing cabins with U (for Unknown), then keep only the
        # cabin letter; this works on a copy, so the caller's data is untouched
        return x['Cabin'].fillna('U').map(lambda c: c[0])

    def fit(self, x, y=None):
        cabin_dummies = pd.get_dummies(self._cabin_letters(x), prefix='Cabin')
        # remember the dummy columns seen during fit
        self.cabin_columns = cabin_dummies.columns
        return self

    def transform(self, x):
        cabin_dummies = pd.get_dummies(self._cabin_letters(x), prefix='Cabin')

        # align with the columns seen during fit: unseen letters are dropped,
        # letters absent from this data are added as all-zero columns
        cabin_dummies = cabin_dummies.reindex(columns=self.cabin_columns, fill_value=0)

        # replace the Cabin column with its dummy columns
        x = pd.concat([x, cabin_dummies], axis=1)
        return x.drop('Cabin', axis=1)

Explaining `CabinFeatureTransformer`:
- Fit method:
    - All missing (NaN) values are replaced with `U`
    - Each value is replaced with its first character (the cabin letter)
    - The unique values are determined via the `get_dummies` method. The resulting column names are saved in `self.cabin_columns` to be used in the transform method
- Transform method:
    - Applies the same steps as the fit method
    - The dummy columns are then re-indexed against the saved `self.cabin_columns`. This prevents a Cabin value that appears only in the test data from adding a new column, and it fills any column missing from the test data with 0. Otherwise model prediction would fail on the test data due to a feature count mismatch. That is why we use a custom transformer for this feature instead of LabelEncoder or a default OneHotEncoder.
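
The re-indexing step can be sketched in isolation. The two Series below are made up: the "test" data contains a cabin letter (`G`) never seen in "train" and lacks one (`E`) that was seen.

```python
import pandas as pd

# Made-up cabin letters for illustration.
train_cabin = pd.Series(['C', 'E', 'U'], name='Cabin')
test_cabin = pd.Series(['C', 'G', 'U'], name='Cabin')

# "fit": remember which dummy columns the training data produces.
fitted_columns = pd.get_dummies(train_cabin, prefix='Cabin').columns

# "transform": build dummies for new data, then align to the fitted columns.
test_dummies = pd.get_dummies(test_cabin, prefix='Cabin')
aligned = test_dummies.reindex(columns=fitted_columns, fill_value=0)

print(list(test_dummies.columns))  # contains Cabin_G, lacks Cabin_E
print(list(aligned.columns))       # same columns as the training data
```

After alignment, the unseen `Cabin_G` column is gone and `Cabin_E` is present but all zeros, so the feature count matches what the model was trained on.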

In [7]:
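class NameFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('in the NameFeatureTransformer Init method: ')

    def _title_dummies(self, x):
        Title_Dictionary = {
                "Capt": "Officer", "Col": "Officer", "Major": "Officer","Jonkheer": "Royalty",
                "Don": "Royalty","Sir" : "Royalty","Dr": "Officer","Rev": "Officer","the Countess":"Royalty",
                "Mme": "Mrs", "Mlle": "Miss", "Ms": "Mrs", "Mr" : "Mr", "Mrs" : "Mrs", "Miss" : "Miss",
                "Master" : "Master", "Lady" : "Royalty"}

        # "Braund, Mr. Owen Harris" -> "Mr", then map the raw title to its group;
        # titles missing from the dictionary become NaN and get no dummy column
        titles = x['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
        titles = titles.map(Title_Dictionary)
        return pd.get_dummies(titles, prefix='Title')

    def fit(self, x, y=None):
        # remember the dummy columns seen during fit (same idea as for Cabin)
        self.title_columns = self._title_dummies(x).columns
        return self

    def transform(self, x):
        titles_dummies = self._title_dummies(x)
        # align with the fitted columns so train and test have the same width
        titles_dummies = titles_dummies.reindex(columns=self.title_columns, fill_value=0)

        # replace the Name column with the title dummy columns
        x = pd.concat([x.drop('Name', axis=1), titles_dummies], axis=1)
        return x.values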
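
The title extraction used in `NameFeatureTransformer` can be tried on a few names directly. In this sketch, `title_groups` is a shortened, hypothetical subset of the dictionary above, and the third name is invented to show what happens to a title that is not in the dictionary.

```python
import pandas as pd

# Shortened subset of the title dictionary, for illustration only.
title_groups = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Mlle": "Miss"}

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, Sra. Jane",  # invented name with a title missing from the dictionary
])

# Take the text between the comma and the first period, then trim spaces.
raw_titles = names.map(lambda name: name.split(',')[1].split('.')[0].strip())

# Series.map with a dict returns NaN for keys that are not in the dict.
grouped = raw_titles.map(title_groups)

print(raw_titles.tolist())
print(grouped.tolist())
```

Because unmapped titles become NaN, `get_dummies` produces no column for them and the row is encoded as all zeros.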

### 1.4 Column Transformation

In [8]:
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer(
    transformers=[
        ('num_prep', num_pipe, num_feat),
        ('categorical_dat_prep', cat_pipe, cat_feat),
        ('cabin_prep', CabinFeatureTransformer(), cabin_feat),
        ('name_prep', NameFeatureTransformer(), name_feat)
    ])

in the CabinFeatureTransformer init method: 
in the NameFeatureTransformer Init method: 
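
`ColumnTransformer` only outputs the columns named in its `transformers` list; by default (`remainder='drop'`) every other column is discarded. A minimal sketch on a made-up two-column frame (the names `toy`, `drop_rest` and `keep_rest` are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up frame: one listed column (Age) and one unlisted column (Ticket).
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Ticket': ['A/5 21171', 'PC 17599', '113803'],
})

# Default remainder='drop': Ticket is silently left out of the output.
drop_rest = ColumnTransformer(transformers=[('num', StandardScaler(), ['Age'])])

# remainder='passthrough': unlisted columns are appended unchanged.
keep_rest = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['Age'])],
    remainder='passthrough',
)

dropped = drop_rest.fit_transform(toy)
kept = keep_rest.fit_transform(toy)
print(dropped.shape, kept.shape)
```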


### 1.5 Combining Transformers and Estimators

In [9]:
from sklearn.ensemble import RandomForestClassifier
final_pipeline = Pipeline(steps=[
    ('transformer', transformer),
    ('rf_estimator', RandomForestClassifier())
])
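
Once everything lives in one `Pipeline`, the parameters of nested steps can be reached with the `<step>__<sub-step>__<parameter>` naming scheme. The sketch below uses made-up data and a reduced, hypothetical pipeline (`toy_pipeline`) rather than `final_pipeline`, so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Made-up data for illustration.
toy_X = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0],
                      'Fare': [7.25, 71.28, 7.93, 53.1]})
toy_y = [0, 1, 1, 1]

toy_pipeline = Pipeline(steps=[
    ('transformer', ColumnTransformer(transformers=[
        ('num_prep', SimpleImputer(strategy='median'), ['Age', 'Fare']),
    ])),
    ('rf_estimator', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Change nested parameters without rebuilding the pipeline.
toy_pipeline.set_params(transformer__num_prep__strategy='mean',
                        rf_estimator__max_depth=3)
toy_pipeline.fit(toy_X, toy_y)

# Fitted sub-steps stay reachable by name.
fitted_imputer = toy_pipeline.named_steps['transformer'].named_transformers_['num_prep']
print(fitted_imputer.statistics_)  # per-column means learned during fit
print(toy_pipeline.predict(toy_X))
```

The same parameter names are what a tool such as `GridSearchCV` expects as keys in its `param_grid`.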

In [10]:
final_pipeline.fit(X_train, y_train)

in the CabinFeatureTransformer init method: 
in the NameFeatureTransformer Init method: 


Pipeline(steps=[('transformer',
                 ColumnTransformer(transformers=[('num_prep',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Age', 'Fare']),
                                                 ('categorical_dat_prep',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                        

In [11]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = final_pipeline.predict(X_test)

print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred, average='weighted'))
print("Precision Score: ", precision_score(y_test, y_pred, average='weighted'))
print("Recall Score: ", recall_score(y_test, y_pred, average='weighted'))

Accuracy Score:  0.7821229050279329
F1 Score:  0.7811181585248036
Precision Score:  0.7805690633976433
Recall Score:  0.7821229050279329
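
Note that the accuracy and the weighted recall printed above are identical. This is expected: with `average='weighted'`, each class's recall is weighted by its number of true samples, so the sum reduces to the fraction of correctly classified samples. A small sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration.
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)

# Weighted: per-class recall weighted by class support; matches accuracy.
weighted_recall = recall_score(y_true_toy, y_pred_toy, average='weighted')

# Macro: unweighted mean of per-class recall; generally differs from accuracy.
macro_recall = recall_score(y_true_toy, y_pred_toy, average='macro')

print(acc, weighted_recall, macro_recall)
```

If the goal is to see how well the minority class is recovered, `average='macro'` or the default binary recall may be more informative than the weighted average.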


---

### 1.6 HTML Representation of Pipeline
When the Pipeline is displayed in a Jupyter notebook, an HTML representation of the estimator is shown as follows:

In [12]:
from sklearn import set_config
set_config(display='diagram')

final_pipeline
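
Outside a notebook, the same diagram can be obtained as an HTML string. The sketch below assumes `estimator_html_repr` is importable from `sklearn.utils` in the installed scikit-learn version, and uses a small hypothetical pipeline (`toy_pipeline`) so that it runs on its own.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Small stand-in pipeline; it does not need to be fitted to be rendered.
toy_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Returns the diagram as an HTML string, which can be written to a .html file
# or embedded in a report.
html = estimator_html_repr(toy_pipeline)
print(len(html))
```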