# Creating a Custom Sklearn Pipeline
> How to include custom data preprocessing steps in an sklearn pipeline. 

In this notebook we will import the [income classification dataset](https://www.kaggle.com/lodetomasi1995/income-classification/data), review common preprocessing steps, and then introduce how those steps can be included in an sklearn pipeline. 

![](https://media.giphy.com/media/Jwp4sxM0Rjk1W/giphy.gif)

In [None]:
# Standard Imports
import pandas as pd
import pickle

# Transformers
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Modeling Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Pipelines
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline

# Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [1]:
# __SOLUTION__
# Standard Imports
import pandas as pd
import pickle

# Transformers
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Modeling Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Pipelines
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline

# Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
df = pd.read_csv('data/adult.csv')
df.head()

In [2]:
#__SOLUTION__
df = pd.read_csv('data/adult.csv')
df.head()

Unnamed: 0,age,workclass,education,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,1,0,40,United-States,0
1,1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,0,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,1,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,0,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


# Breakout Rooms


### Group 1

Write a function called `bin_middle_age` that can be applied to the `age` column in `X_train` and returns a 1 if the age is between 45 and 64 (inclusive) and a 0 for every other age.

### Group 2

Write a function called `bin_capital` that can be applied to the `capital_gain` and `capital_loss` columns in `X_train` and returns a 1 if the input is more than zero and a 0 for anything else.

### Group 3

Please write code to fit a one hot encoder to all of the object datatypes. Transform the object columns in `X_train` and turn them into a dataframe. For this final step, I'll give you two clues: "sparse" and "dense". Only one of them will be needed.

### Group 4

Please write code to scale the `'hours_per_week'` column in `X_train`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('income', axis = 1), 
                                                    df.income,
                                                    random_state = 2020)
X_train.reset_index(drop=True, inplace=True)

# Group 1
def bin_middle_age(age):
    pass

X_train['age'] = X_train.age.apply(bin_middle_age)

# Group 2
def bin_capital(x):
    pass

X_train['capital_gain'] = X_train.capital_gain.apply(bin_capital)
X_train['capital_loss'] = X_train.capital_loss.apply(bin_capital)

 
X_train.reset_index(drop=True, inplace=True)

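# Group 3
# YOUR CODE HERE: this step should define `categoricals`

# Group 4
# YOUR CODE HERE: this step should define `hours_per_week`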


modeling_df = pd.concat([X_train.age, X_train.capital_gain, X_train.capital_loss, 
                         hours_per_week, categoricals], axis = 1)

modeling_df.head()

In [3]:
# __SOLUTION__
X_train, X_test, y_train, y_test = train_test_split(df.drop('income', axis = 1), 
                                                    df.income,
                                                    random_state = 2020)  
X_train.reset_index(drop=True, inplace=True)

# Group 1
def bin_middle_age(age):
    if age < 45:
        return 0 
    elif age > 64:
        return 0
    else: 
        return 1

X_train['age'] = X_train.age.apply(bin_middle_age)

# Group 2
def bin_capital(x):
    if x > 0:
        return 1
    else:
        return 0

X_train['capital_gain'] = X_train.capital_gain.apply(bin_capital)
X_train['capital_loss'] = X_train.capital_loss.apply(bin_capital)



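# Group 3
# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`, and
# `get_feature_names()` is replaced by `get_feature_names_out()`.
hot_encoder = OneHotEncoder(sparse=False)
categoricals = hot_encoder.fit_transform(X_train.select_dtypes(object))
categoricals = pd.DataFrame(categoricals, columns = hot_encoder.get_feature_names())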

# Group 4
hours_scaler = StandardScaler()
hours_per_week = hours_scaler.fit_transform(X_train['hours_per_week'].values.reshape(-1,1))
hours_per_week = pd.DataFrame(hours_per_week, columns = ['hours_per_week'])


modeling_df = pd.concat([X_train.age, X_train.capital_gain, X_train.capital_loss, 
                         hours_per_week, categoricals], axis = 1)

modeling_df.head()

Unnamed: 0,age,capital_gain,capital_loss,hours_per_week,x0_ ?,x0_ Federal-gov,x0_ Local-gov,x0_ Never-worked,x0_ Private,x0_ Self-emp-inc,...,x7_ Portugal,x7_ Puerto-Rico,x7_ Scotland,x7_ South,x7_ Taiwan,x7_ Thailand,x7_ Trinadad&Tobago,x7_ United-States,x7_ Vietnam,x7_ Yugoslavia
0,0,0,0,0.767358,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0,0,0,-1.656074,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0,0,0,-1.252169,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0,0,0,-0.040453,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0,0,0,-0.040453,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
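
The binning in Groups 1 and 2 can also be expressed with vectorized pandas operations instead of `apply`. The following is a minimal sketch on a small made-up frame (the `toy` values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'age': [30, 45, 64, 70],
    'capital_gain': [0, 2174, 0, 15024],
})

# Series.between is inclusive on both ends by default, matching the 45-64 rule
middle_age = toy['age'].between(45, 64).astype(int)

# A boolean comparison cast to int gives the same 1/0 flag as bin_capital
has_gain = (toy['capital_gain'] > 0).astype(int)

print(middle_age.tolist())
print(has_gain.tolist())
```

The element-wise functions are kept in the rest of this notebook because they carry over directly into the custom transformers.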


# Move all of this into a Pipeline

Above we used two sklearn transformers and two custom functions to format our dataframe. The sklearn transformers can be used in a pipeline as they are, but the two custom functions need to be wrapped in custom transformers.

To do this, we will create a class called `BinAge` that inherits from the sklearn classes `TransformerMixin` and `BaseEstimator`. This class should have the following methods:
1. `__init__`
    - This method only needs to exist. No code needs to be added to the method.
2. `fit`
    - This method should have three arguments
        1. `self`
        2. `X`
        3. `y=None`
    - This method should return `self`.
3. `_bin_data`
    - This method is our function for binning the age column
4. `_to_df`
    - This is a helper function to transform the data to a dataframe.
    - This method should check if the input is a dataframe and return a dataframe
5. `transform`
    - This method should have two arguments
        1. `self`
        2. `X`
    - This method should turn X to a dataframe. 
    - This method should apply the `_bin_data` method
    - Return the binned data
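
As a rough illustration of why the class inherits from both sklearn base classes: `TransformerMixin` supplies `fit_transform` once `fit` and `transform` exist, and `BaseEstimator` supplies `get_params` and `set_params`. The sketch below uses a hypothetical `Doubler` transformer that is not part of this notebook:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Doubler(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer used only to show the inherited methods."""

    def fit(self, X, y=None):
        # Nothing to learn, so fit just returns the instance
        return self

    def transform(self, X):
        return pd.DataFrame(X) * 2

doubler = Doubler()

# fit_transform is inherited from TransformerMixin
doubled = doubler.fit_transform(pd.DataFrame({'a': [1, 2, 3]}))

# get_params is inherited from BaseEstimator
params = doubler.get_params()
print(doubled['a'].tolist(), params)
```

Having these inherited methods is what lets a custom class sit next to the built-in transformers inside a pipeline.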

In [None]:
class BinAge():
    pass

In [4]:
# __SOLUTION__
class BinAge(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
   
    def _to_df(self, X):
        # Convert the supported input types to a dataframe
        if isinstance(X, pd.DataFrame):
            return X.copy()
        if isinstance(X, (list, pd.Series)):
            return pd.DataFrame(X)
        if isinstance(X, dict):
            return pd.DataFrame([X])
        raise ValueError('X must be a dataframe, list, series, or dictionary object.')
    
    def _bin_data(self, x):
        if x < 45:
            return 0 
        elif x > 64:
            return 0
        else: 
            return 1
        
    def transform(self, X):
        data = self._to_df(X)
        # Note: pandas >= 2.1 deprecates `applymap` in favor of `DataFrame.map`
        data = data.applymap(self._bin_data)
        return data

**Now repeat the process for a `BinCapital` Transformer!**

In [None]:
class BinCapital():
    pass

In [5]:
# __SOLUTION__
class BinCapital(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
   
    def _to_df(self, X):
        # Convert the supported input types to a dataframe
        if isinstance(X, pd.DataFrame):
            return X.copy()
        if isinstance(X, (list, pd.Series)):
            return pd.DataFrame(X)
        if isinstance(X, dict):
            return pd.DataFrame([X])
        raise ValueError('X must be a dataframe, list, series, or dictionary object.')
    
    def _bin_data(self, x):
        if x > 0:
            return 1
        else:
            return 0
        
    def transform(self, X):
        data = self._to_df(X)
        # Note: pandas >= 2.1 deprecates `applymap` in favor of `DataFrame.map`
        data = data.applymap(self._bin_data)
        return data
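
`BinAge` and `BinCapital` differ only in `_bin_data`, so one possible refactor (an assumption on our part, not something the notebook requires) is to move the shared methods into a base class. In this sketch, `BinBase`, `BinAgeShared`, and `BinCapitalShared` are hypothetical names, and `_to_df` is condensed to `pd.DataFrame(X)` for brevity, which covers dataframes, series, and lists but not single-row dictionaries:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class BinBase(TransformerMixin, BaseEstimator):
    """Hypothetical base class holding the logic both transformers share."""

    def fit(self, X, y=None):
        return self

    def _bin_data(self, x):
        raise NotImplementedError

    def transform(self, X):
        # Condensed stand-in for _to_df
        data = pd.DataFrame(X)
        # Column-wise Series.map applies _bin_data to every element
        return data.apply(lambda col: col.map(self._bin_data))

class BinAgeShared(BinBase):
    def _bin_data(self, x):
        return 1 if 45 <= x <= 64 else 0

class BinCapitalShared(BinBase):
    def _bin_data(self, x):
        return 1 if x > 0 else 0

ages = BinAgeShared().fit_transform(pd.DataFrame({'age': [30, 45, 64, 70]}))
gains = BinCapitalShared().fit_transform(pd.DataFrame({'capital_gain': [0, 2174]}))
print(ages['age'].tolist(), gains['capital_gain'].tolist())
```

The rest of the notebook continues with the standalone `BinAge` and `BinCapital` classes defined above.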

## Create pipeline

To make this pipeline, we will use the following sklearn functions:

1. `make_column_transformer`
> This function receives "Tuples of the form `(transformer, [columns])` specifying the transformer objects to be applied to subsets of the data."
2. `make_column_selector`
> "Selects columns based on datatype or the columns name with a regex. When using multiple selection criteria, all criteria must match for a column to be selected."
3. `make_pipeline`
> Used to create a pipeline of imputer, transformer, and estimator objects.
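
To make the three functions concrete, here is a minimal sketch on a small made-up frame. The `toy` rows, the `toy_` variable names, and the labels passed to `fit` are assumptions for illustration only:

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'workclass': pd.Series(['Private', 'State-gov', 'Private'], dtype=object),
    'hours_per_week': [40, 13, 50],
})

# A selector is a callable that returns the matching column names
selector = make_column_selector(dtype_include=object)
selected = selector(toy)

# Each tuple pairs a transformer with the columns it should receive
toy_preprocessing = make_column_transformer(
    (OneHotEncoder(), selector),
    (StandardScaler(), ['hours_per_week']),
    remainder='drop',
)
transformed = toy_preprocessing.fit_transform(toy)

# make_pipeline names each step after its lowercased class name
toy_pipeline = make_pipeline(toy_preprocessing, DecisionTreeClassifier())
toy_pipeline.fit(toy, [0, 1, 0])

print(selected)
print(transformed.shape)
print(list(toy_pipeline.named_steps))
```

The automatically generated step names are the same ones that appear when a fitted pipeline is printed.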

In [None]:
preprocessing = make_column_transformer((BinAge(), ['age']),
                                      (BinCapital(), ['capital_gain']),
                                      (BinCapital(), ['capital_loss']),
                                      (OneHotEncoder(),
                                       make_column_selector(dtype_include=object)),
                                      (StandardScaler(), ['hours_per_week']),
                                      remainder='drop')

In [6]:
# __SOLUTION__
preprocessing = make_column_transformer((BinAge(), ['age']),
                                      (BinCapital(), ['capital_gain']),
                                      (BinCapital(), ['capital_loss']),
                                      (OneHotEncoder(),
                                       make_column_selector(dtype_include=object)),
                                      (StandardScaler(), ['hours_per_week']),
                                      remainder='drop')

Now all of our preprocessing can be done with the `fit_transform` method!

In [None]:
preprocessing.fit_transform(X_train)

In [7]:
#__SOLUTION__
preprocessing.fit_transform(X_train)

<21978x105 sparse matrix of type '<class 'numpy.float64'>'
	with 200636 stored elements in Compressed Sparse Row format>
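
When the fitted preprocessing is later applied to rows it has not seen (for example inside cross validation or on a holdout set), `OneHotEncoder` raises an error by default if a category did not appear during fitting. A minimal sketch of the `handle_unknown='ignore'` option, on made-up values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical values, for illustration only
train = pd.DataFrame({'native_country': ['United-States', 'Cuba']})
new_rows = pd.DataFrame({'native_country': ['Portugal']})

# With handle_unknown='ignore', an unseen category becomes an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)
encoded = encoder.transform(new_rows).toarray()
print(encoded)
```

Whether this option is appropriate depends on how rare categories should be treated; the pipeline in this notebook keeps the default settings.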

To finish up the pipeline, we can combine the preprocessing with a machine learning model in a new pipeline!

In [None]:
dt_pipeline = make_pipeline(preprocessing, DecisionTreeClassifier())
rf_pipeline = make_pipeline(preprocessing, RandomForestClassifier(max_depth=10))

In [8]:
#__SOLUTION__
dt_pipeline = make_pipeline(preprocessing, DecisionTreeClassifier())
rf_pipeline = make_pipeline(preprocessing, RandomForestClassifier(max_depth=10))

## Our pipelines are built!

Now we can run them through cross validation!

In [None]:
cross_val_score(dt_pipeline, X_train, y_train)

In [9]:
#__SOLUTION__
cross_val_score(dt_pipeline, X_train, y_train)

array([0.80959964, 0.79094631, 0.80414013, 0.80705347, 0.79590444])

In [None]:
cross_val_score(rf_pipeline, X_train, y_train)

In [10]:
#__SOLUTION__
cross_val_score(rf_pipeline, X_train, y_train)

array([0.83530482, 0.83621474, 0.83621474, 0.83208191, 0.83777019])
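
Each array above holds one accuracy score per fold. With no extra arguments, `cross_val_score` uses 5-fold cross validation (the default since scikit-learn 0.22) and, for a classifier, the estimator's own `score` method, which is accuracy. The sketch below spells those defaults out on synthetic data; `X_toy` and `y_toy` are stand-ins, not the income dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real features and target
X_toy, y_toy = make_classification(n_samples=200, random_state=2020)

# cv=5 and scoring='accuracy' make the defaults for a classifier explicit
scores = cross_val_score(DecisionTreeClassifier(random_state=2020),
                         X_toy, y_toy, cv=5, scoring='accuracy')
print(scores.mean())
```

Averaging the fold scores with `.mean()` gives a single number that is easier to compare across models.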

In [None]:
rf_pipeline.fit(X_train, y_train)
train_preds = rf_pipeline.predict(X_train)
test_preds = rf_pipeline.predict(X_test)
print(f'Training Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Testing Accuracy: {accuracy_score(y_test, test_preds)}')

In [11]:
#__SOLUTION__
rf_pipeline.fit(X_train, y_train)
train_preds = rf_pipeline.predict(X_train)
test_preds = rf_pipeline.predict(X_test)
print(f'Training Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Testing Accuracy: {accuracy_score(y_test, test_preds)}')

Training Accuracy: 0.8454818454818455
Testing Accuracy: 0.8381108381108381


Finally, we can fit the final pipeline on all of the data and test it on an additional hold out set!

In [None]:
rf_pipeline.fit(df.drop('income', axis = 1), df.income)

In [12]:
#__SOLUTION__
rf_pipeline.fit(df.drop('income', axis = 1), df.income)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('binage', BinAge(), ['age']),
                                                 ('bincapital-1', BinCapital(),
                                                  ['capital_gain']),
                                                 ('bincapital-2', BinCapital(),
                                                  ['capital_loss']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fa55de08610>),
                                                 ('standardscaler',
                                                  StandardScaler(),
                                                  ['hours_per_week'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=10))])

Load in the hold out set and make predictions!

In [None]:
validation = pd.read_csv('data/validation_features.csv')
val_preds = rf_pipeline.predict(validation)
y_val = pd.read_csv('data/validation_target.csv').iloc[:,0]
accuracy_score(y_val, val_preds)

In [13]:
#__SOLUTION__
validation = pd.read_csv('data/validation_features.csv')
val_preds = rf_pipeline.predict(validation)
y_val = pd.read_csv('data/validation_target.csv').iloc[:,0]
accuracy_score(y_val, val_preds)

0.8409090909090909
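
`pickle` is imported at the top of the notebook but never used. One plausible use, sketched here as an assumption rather than the notebook's own method, is saving a fitted pipeline. The toy pipeline and synthetic data below are hypothetical stand-ins for `rf_pipeline` and the income data:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small pipeline standing in for rf_pipeline
X_toy, y_toy = make_classification(n_samples=100, random_state=2020)
toy_pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=2020))
toy_pipeline.fit(X_toy, y_toy)

# Serialize to bytes and load back; pickle.dump / pickle.load do the same with files
payload = pickle.dumps(toy_pipeline)
restored = pickle.loads(payload)

same_predictions = (restored.predict(X_toy) == toy_pipeline.predict(X_toy)).all()
print(same_predictions)
```

A pipeline that contains custom classes such as `BinAge` can only be unpickled in an environment where those class definitions are importable.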