# Creating a Custom Sklearn Pipeline
> How to include custom data preprocessing steps in an sklearn pipeline. 

In this notebook we will import the [income classification dataset](https://www.kaggle.com/lodetomasi1995/income-classification/data), review common preprocessing steps, and then introduce how those steps can be included in an sklearn pipeline. 

![](https://media.giphy.com/media/Jwp4sxM0Rjk1W/giphy.gif)

In [1]:
# Standard Imports
import pandas as pd
import numpy as np

# Transformers
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Modeling Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Pipelines
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
import pickle

# Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,53,Local-gov,283602,Masters,14,Married-civ-spouse,Prof-specialty,Wife,White,Female,15024,0,40,United-States,>50K
1,28,Self-emp-not-inc,35864,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Other,Male,0,0,70,Iran,>50K
2,29,Private,146764,Some-college,10,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,35,United-States,<=50K
3,49,Private,59380,Some-college,10,Married-spouse-absent,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K
4,51,Self-emp-inc,338836,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K


## Develop a preprocessing strategy


### Task 1

Write a function called `bin_middle_age` that can be applied to the `age` column in `X_train` and returns 1 if the age is between 45 and 64 (inclusive) and 0 for every other age.

### Task 2

Write a function called `bin_capital` that can be applied to the `capital-gain` and `capital-loss` columns in `X_train` and returns a 1 if the input is more than zero and a 0 for anything else.

### Task 3

Please write code to fit a one hot encoder to all of the object datatypes. Transform the object columns in `X_train` and turn them into a dataframe. For this final step, I'll give you two clues: "sparse" and "dense". Only one of them will be needed.

### Task 4

Please write code to scale the `'hours-per-week'` column in `X_train`.

### Task 5
Merge the transformed features into a new dataframe called `modeling_df`.

In [3]:
X_train, X_val, y_train, y_val = train_test_split(df.drop('income', axis = 1), 
                                                    df.income,
                                                    random_state = 2020)  
X_train.reset_index(drop=True, inplace=True)

# Task 1
# ===========================================
def bin_middle_age(age):
    return int(age >= 45 and age <= 64)

X_train['age'] = X_train.age.apply(bin_middle_age)
# ===========================================

# Task 2
# ===========================================
def bin_capital(x):
    return int(x > 0)

X_train['capital-gain'] = X_train['capital-gain'].apply(bin_capital)
X_train['capital-loss'] = X_train['capital-loss'].apply(bin_capital)
# ===========================================

# Task 3
# ===========================================
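# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output` and replaces
# `get_feature_names()` with `get_feature_names_out()`.
hot_encoder = OneHotEncoder(sparse=False)
categoricals = hot_encoder.fit_transform(X_train.select_dtypes(object))
categoricals = pd.DataFrame(categoricals, columns = hot_encoder.get_feature_names())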
# ===========================================

# Task 4
# ===========================================
hours_scaler = StandardScaler()
hours_per_week = hours_scaler.fit_transform(X_train['hours-per-week'].values.reshape(-1,1))
hours_per_week = pd.DataFrame(hours_per_week, columns = ['hours-per-week'])
# ===========================================

# Task 5
# ===========================================
modeling_df = pd.concat([X_train.age, X_train['capital-gain'], X_train['capital-loss'], 
                         hours_per_week, categoricals], axis = 1)
# ===========================================

modeling_df.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,x0_?,x0_Federal-gov,x0_Local-gov,x0_Never-worked,x0_Private,x0_Self-emp-inc,...,x7_Portugal,x7_Puerto-Rico,x7_Scotland,x7_South,x7_Taiwan,x7_Thailand,x7_Trinadad&Tobago,x7_United-States,x7_Vietnam,x7_Yugoslavia
0,0,0,1,-2.877472,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,1,0,0,0.77531,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0,0,0,0.2071,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0,0,0,0.77531,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0,0,0,-0.036419,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
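
One thing the manual approach leaves implicit is that the validation data must go through the *same fitted* transformers, using `transform` rather than `fit_transform`. The sketch below illustrates this with `StandardScaler` on toy data; the `toy_train` and `toy_val` frames and their values are made up for illustration and are not part of the income dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for X_train and X_val
toy_train = pd.DataFrame({'hours-per-week': [20, 40, 60]})
toy_val = pd.DataFrame({'hours-per-week': [40, 80]})

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training rows
train_scaled = scaler.fit_transform(toy_train[['hours-per-week']])

# transform reuses the training statistics; nothing is re-learned from validation rows
val_scaled = scaler.transform(toy_val[['hours-per-week']])

print(scaler.mean_)   # mean learned from the training rows only
print(val_scaled)
```

Keeping track of every fitted object by hand like this is error prone, which is part of the motivation for moving the steps into a pipeline.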


## Move all of this into a Pipeline


### Writing a custom transformer

Above we used two sklearn transformers and two custom functions to format our dataframe. This means that we will need to create two custom transformers, one for each custom function. The sklearn transformers can be used as they are.

To do this, we will create a class called `BinAge` that inherits from the sklearn classes `TransformerMixin` and `BaseEstimator`. This class should have the following methods:

1. `fit`
    - This method should take three arguments:
        1. `self`
        2. `X`
        3. `y=None`
    - This method should return `self`.

1. `_bin_data`
    - This method is our function for binning the age column.

1. `transform`
    - This method should take two arguments:
        1. `self`
        2. `X`
    - This method should apply the `_bin_data` method to `X`.
    - This method should return the binned data.

In [4]:
from numpy import vectorize

class BinAge(TransformerMixin, BaseEstimator):
    
    def fit(self, X, y=None):
        return self
    
    @vectorize
    def _bin_data(x):
        return int(x >= 45 and x <= 64)
        
    def transform(self, X):
        return self._bin_data(X)
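
It may look odd that `_bin_data` takes no `self` argument yet is called as `self._bin_data(X)`. A minimal sketch of why this works, using a throwaway class named `Demo` (a hypothetical name, not part of the notebook): `numpy.vectorize` returns a callable object rather than a plain function, so Python does not bind it to the instance and no `self` is passed in.

```python
import numpy as np

class Demo:
    # np.vectorize wraps the scalar function in a callable object.
    # Because that object is not a plain function, accessing it through
    # an instance does not turn it into a bound method.
    @np.vectorize
    def _bin_data(x):
        return int(x >= 45 and x <= 64)

# The scalar function is applied elementwise to a 2D column of ages
result = Demo()._bin_data(np.array([[30], [45], [64], [65]]))
print(result.ravel())
```

The boundaries 45 and 64 are both included, matching the `bin_middle_age` function written earlier.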

**Now repeat the process for a `BinCapital` Transformer!**

In [5]:
class BinCapital(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    @vectorize
    def _bin_data(x):
        return int(x > 0)
        
    def transform(self, X):
        return self._bin_data(X)
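
For stateless steps like these, a possible alternative to writing a class is sklearn's `FunctionTransformer`, which wraps an ordinary function. This is a sketch of that option and not the approach the notebook uses; the helper name `bin_positive` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def bin_positive(X):
    # Hypothetical helper: 1 where the value is greater than zero, else 0
    return (np.asarray(X) > 0).astype(int)

# A named module-level function (rather than a lambda) keeps the transformer picklable
capital_binner = FunctionTransformer(bin_positive)

toy = pd.DataFrame({'capital-gain': [0, 15024, 0],
                    'capital-loss': [0, 0, 1902]})
flags = capital_binner.fit_transform(toy)
print(flags)
```

A custom class is still the better fit when the transformer has to learn something during `fit`, such as bin edges computed from the training data.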

## Create pipeline

To make this pipeline, we will use the following sklearn functions:

1. `make_column_transformer`
> This function receives "Tuples of the form `(transformer, [columns])` specifying the transformer objects to be applied to subsets of the data."
2. `make_column_selector`
> "Selects columns based on datatype or the columns name with a regex. When using multiple selection criteria, all criteria must match for a column to be selected."
3. `make_pipeline`
> Used to create a pipeline of imputer, transformer, and estimator objects.
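
Before applying these functions to the income data, here is a small sketch of how `make_column_transformer` and `make_column_selector` fit together. The three-row `toy` frame is invented for illustration, and `sparse_threshold=0` is used here only to force a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the income data
toy = pd.DataFrame({
    'workclass': pd.Series(['Private', 'Local-gov', 'Private'], dtype=object),
    'hours-per-week': [40, 35, 70],
    'fnlwgt': [1, 2, 3],
})

toy_preprocessing = make_column_transformer(
    # Columns chosen by dtype: every object column is one hot encoded
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    # Columns chosen by name
    (StandardScaler(), ['hours-per-week']),
    # Columns not listed above (here, fnlwgt) are discarded
    remainder='drop',
    sparse_threshold=0,
)

toy_features = toy_preprocessing.fit_transform(toy)
print(toy_features.shape)  # two one hot columns plus one scaled column
```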

In [6]:
preprocessing = make_column_transformer((BinAge(), ['age']),
                                      (BinCapital(), ['capital-gain']),
                                      (BinCapital(), ['capital-loss']),
                                      (OneHotEncoder(sparse=False, handle_unknown='ignore'),
                                       make_column_selector(dtype_include=object)),
                                      (StandardScaler(), ['hours-per-week']),
                                      remainder='drop')

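Now all of our preprocessing can be done with the `fit_transform` method! One caveat: the manual steps above overwrote `age`, `capital-gain`, and `capital-loss` in `X_train`, so re-run the `train_test_split` cell before fitting the pipeline if you want it to receive the raw values. Otherwise `BinAge` only sees 0s and 1s and returns 0 for every row.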

In [7]:
preprocessing.fit_transform(X_train)

array([[ 0.        ,  0.        ,  1.        , ...,  0.        ,
         0.        , -2.87747181],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.77531045],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.20709987],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        , -0.03641894],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.28827281],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.77531045]])

To finish up the pipeline, we can combine the preprocessing step with a machine learning model in a new pipeline!

In [8]:
dt_pipeline = make_pipeline(preprocessing, DecisionTreeClassifier())
rf_pipeline = make_pipeline(preprocessing, RandomForestClassifier(max_depth=10))
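
`make_pipeline` names each step after its lowercased class name, which is how individual steps and their parameters can be reached later. The sketch below shows this on a small stand-in pipeline (`toy_pipeline` is a hypothetical name) rather than on `dt_pipeline` itself.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

toy_pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=3))

# Step names are generated from the class names
step_names = list(toy_pipeline.named_steps)
print(step_names)

# Parameters of a step are addressed as <step name>__<parameter>
toy_pipeline.set_params(decisiontreeclassifier__max_depth=5)
print(toy_pipeline.named_steps['decisiontreeclassifier'].max_depth)
```

The same `<step name>__<parameter>` convention is what a grid search over a pipeline would use.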

## Our pipelines are built!

Now we can run them through cross validation!

In [9]:
cross_val_score(dt_pipeline, X_train, y_train)

array([0.80873521, 0.80909918, 0.80800728, 0.81070258, 0.80451402])

In [10]:
cross_val_score(rf_pipeline, X_train, y_train)

array([0.83694268, 0.833303  , 0.83348499, 0.84255552, 0.83017838])
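
Because the whole pipeline is passed to `cross_val_score`, the preprocessing is refit on the training portion of every fold, so no statistics leak from the held-out portion. A minimal sketch of summarising the fold scores, using synthetic data from `make_classification` as a stand-in for the income data and spelling out the `cv` and `scoring` arguments explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the income dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=2020)

toy_pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=2020))

# One accuracy score per fold
scores = cross_val_score(toy_pipeline, X_toy, y_toy, cv=5, scoring='accuracy')
print(f'Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})')
```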

In [11]:
rf_pipeline.fit(X_train, y_train)
train_preds = rf_pipeline.predict(X_train)
val_preds = rf_pipeline.predict(X_val)
print(f'Training Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Validation Accuracy: {accuracy_score(y_val, val_preds)}')

Training Accuracy: 0.844101481454519
Validation Accuracy: 0.839375409478052


Finally, we can fit the final pipeline on all of the training data (`df`) and test it on an additional holdout set!

In [12]:
rf_pipeline.fit(df.drop('income', axis = 1), df.income)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('binage', BinAge(), ['age']),
                                                 ('bincapital-1', BinCapital(),
                                                  ['capital-gain']),
                                                 ('bincapital-2', BinCapital(),
                                                  ['capital-loss']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f81c7748b50>),
                                                 ('standardscaler',
                                                  StandardScaler(),
                                                  ['hours-per-week'])])),
   

Load in the hold out set and make predictions!

In [13]:
# Import holdout data
test = pd.read_csv('data/test.csv')
# Separate features from the target
X_test, y_test = test.drop(columns=['income']), test.income
# Score the model
rf_pipeline.score(X_test, y_test)

0.8420276799606912

### Save the model to disk

In [14]:
# Merge training and hold out sets
full_data = pd.concat([df, test])

# Separate the features from the target
# (use full_data, not df, so the holdout rows are included)
X, y = full_data.drop(columns=['income']), full_data.income

# Fit the model to *all* observations
rf_pipeline.fit(X, y)

# Save the fit model to disk
with open('model_v1.pkl', 'wb') as file:
    pickle.dump(rf_pipeline, file)

### Check the saved model works when loaded

In [15]:
# Load the model
with open('model_v1.pkl', 'rb') as file:
    model = pickle.load(file)

# Generate predictions
model.predict(X)

array(['>50K', '<=50K', '<=50K', ..., '<=50K', '<=50K', '<=50K'],
      dtype=object)
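
A stricter check is to confirm that the reloaded model gives exactly the same predictions as the in-memory one. The sketch below does this for a small stand-in pipeline written to a temporary directory; the names `toy_model` and `toy_model.pkl` are hypothetical. Note that pickle stores custom classes such as `BinAge` and `BinCapital` by reference, so their definitions must be importable in whatever session loads `model_v1.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and model
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=2020)
toy_model = make_pipeline(StandardScaler(),
                          DecisionTreeClassifier(random_state=2020)).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    # Context managers close the file even if dump or load raises
    with open(path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(path, 'rb') as f:
        reloaded = pickle.load(f)

# The round trip should not change any prediction
predictions_match = np.array_equal(toy_model.predict(X_toy), reloaded.predict(X_toy))
print(predictions_match)
```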