# 21. PIPELINES AND COMPOSITE ESTIMATORS
---

## 1. Scikit-Learn Pipelines with Custom Transformers
- [Step by Step Guide](https://medium.com/analytics-vidhya/scikit-learn-pipelines-with-custom-transformer-a-step-by-step-guide-9b9b886fd2cc)
- [Jupyter Notebook](https://github.com/abhi-rawat1/machine_learning_projects/blob/master/Sklearn_Pipeline_Custom_transformer/Titanic_Model_With_Pipeline_CustomTransformer.ipynb)

### 1.1 Dataset Exploration

In [4]:
import pandas as pd

titanic_df = pd.read_csv('data/titanic_train.csv')
titanic_df = titanic_df.drop('PassengerId', axis=1)
titanic_df.head()

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |


In [5]:
titanic_df.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

There are 11 columns in this dataset. Apart from `Ticket`, which is not included in the feature lists below, we can place them into 3 categories:
- Target: `Survived`, numerical data type
- Numerical features: `Age` and `Fare` are continuous features
- Categorical features:
    - `Pclass`, `Sex`, `SibSp`, `Parch` and `Embarked` won't need custom transformation
    - `Name` and `Cabin` are free-text features that need custom transformation
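
One way to check this grouping is to look at the number of missing and distinct values per column. The sketch below rebuilds the five rows shown above as an inline frame so that it runs without the CSV file; the names `sample_df`, `missing_counts` and `unique_counts` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# The five rows displayed by head(), rebuilt inline so the example is self-contained.
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Th...',
             'Heikkinen, Miss. Laina',
             'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
             'Allen, Mr. William Henry'],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Parch': [0, 0, 0, 0, 0],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

# Missing values per column: columns with NaN need an imputation step.
missing_counts = sample_df.isna().sum()

# Distinct values per column: low counts suggest a categorical feature,
# while near-unique text columns (Name, Ticket) need custom handling.
unique_counts = sample_df.nunique()

print(pd.DataFrame({'missing': missing_counts, 'unique': unique_counts}))
```

On the full training set the same two calls can be run on `titanic_df` directly.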
    
##### Dataset Split

In [7]:
from sklearn.model_selection import train_test_split

# Split the data into train and test.
X = titanic_df.drop('Survived', axis=1)
y = titanic_df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
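
The split above draws a different random partition on every run. If reproducible results are wanted, `train_test_split` accepts a `random_state`, and `stratify` keeps the class ratio of the target roughly the same in both parts. The sketch below uses a small synthetic frame; the seed value 42, the toy data and the variable names are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with a 60/40 class balance (hypothetical, for illustration).
toy_df = pd.DataFrame({
    'Fare': [float(i) for i in range(20)],
    'Survived': [0] * 12 + [1] * 8,
})
X_toy = toy_df.drop('Survived', axis=1)
y_toy = toy_df['Survived']

# random_state fixes the shuffle; stratify keeps both classes in each part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Calling the function again with the same seed returns the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(list(X_te.index) == list(X_te2.index))
```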

In [12]:
num_feat = ['Age', 'Fare']
cat_feat = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
name_feat = ['Name']
cabin_feat = ['Cabin']

### 1.2 Standard Transformation

Here we create pipelines that apply standard transformations to the numeric and categorical features, using only built-in transformers.
- Numeric features:
    - `SimpleImputer(strategy='median')`: this transformer will replace empty (NaN) values with the median of the corresponding column
    - `StandardScaler`: this will standardize numerical columns so that the mean is zero and the standard deviation is 1
- Categorical features:
    - `SimpleImputer(strategy='constant', fill_value='missing')`: this will replace missing values with the constant string `missing` (the `fill_value`)
        - *Notes*: The `strategy` for `SimpleImputer` can be `mean`, `median`, `most_frequent`, or `constant`. With `constant`, `fill_value` replaces all occurrences of missing values and can be a string or a numerical value. When `fill_value` is left at its default, it will be 0 for numerical data and `'missing_value'` for strings or object data types.
    - `OneHotEncoder`: will encode categorical features as a one-hot numeric array
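
To make the effect of these transformers concrete, the sketch below runs each one on a tiny hand-made input. The toy arrays and variable names are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric column with one missing value; the median of [1, 3, 8] is 3.
ages = np.array([[1.0], [3.0], [np.nan], [8.0]])
ages_imputed = SimpleImputer(strategy='median').fit_transform(ages)

# After scaling, the column has mean 0 and (population) standard deviation 1.
ages_scaled = StandardScaler().fit_transform(ages_imputed)

# Categorical column with one missing value, filled with the constant 'missing'.
ports = np.array([['S'], ['C'], [np.nan], ['S']], dtype=object)
ports_imputed = SimpleImputer(strategy='constant',
                              fill_value='missing').fit_transform(ports)

# One output column per category found during fit.
encoder = OneHotEncoder()
ports_onehot = encoder.fit_transform(ports_imputed).toarray()

print(ages_imputed.ravel())
print(ages_scaled.ravel())
print(ports_imputed.ravel())
print(encoder.categories_)
print(ports_onehot)
```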
    
Here are our standard transformers:

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', 
                              fill_value='missing')),
    ('onehot', OneHotEncoder())
])
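
As a quick sanity check, the two pipelines can be fitted on a small frame to see what they produce. The sketch below redefines the pipelines so that it runs on its own; the toy rows and the names `toy`, `num_out` and `cat_out` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same pipelines as above, redefined so this example is self-contained.
num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())
])

# Invented rows with one missing Age and one missing Embarked value.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.93, 53.10],
    'Sex': ['male', 'female', 'female', 'female'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Numeric pipeline: same shape in and out, with no NaN left.
num_out = num_pipe.fit_transform(toy[['Age', 'Fare']])

# Categorical pipeline: one output column per category
# (2 for Sex, 3 for Embarked including 'missing'); the result is sparse.
cat_out = cat_pipe.fit_transform(toy[['Sex', 'Embarked']])

print(num_out.shape, cat_out.shape)
```

With default settings, `OneHotEncoder` raises an error when `transform` meets a category that was not seen during `fit`. If the test split may contain rare values, passing `handle_unknown='ignore'` is one way to avoid that.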

### 1.3 Custom Transformation

To process the `Cabin` and `Name` features, we will create custom transformers, because the built-in transformers used above can't handle these two free-text features directly.
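
A custom transformer is typically a class that derives from `BaseEstimator` and `TransformerMixin` and implements `fit` and `transform`. The sketch below is a minimal illustration of that pattern, not necessarily the transformer used in this guide: the class name `TitleExtractor` and the rule of taking the text between the comma and the first period in `Name` are assumptions made for the example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TitleExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: pull the title (Mr, Mrs, Miss, ...) out of Name."""

    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit just returns self.
        return self

    def transform(self, X):
        # X is expected to be a DataFrame with a 'Name' column.
        # Capture the text between the comma and the first period.
        titles = X['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
        return pd.DataFrame({'Title': titles}, index=X.index)


names = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
]})

# fit_transform comes from TransformerMixin and calls fit, then transform.
titles_df = TitleExtractor().fit_transform(names)
print(titles_df['Title'].tolist())
```

Because the class follows the estimator interface, an instance can be used as a step in a `Pipeline` in the same way as the built-in transformers.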