# PdPipeline + Sklearn Model

This notebook is a brief tutorial on combining a `PdPipeline` with a scikit-learn model (any `sklearn` estimator, in fact) into a single sklearn-compliant estimator object. This object exposes whichever `predict`, `predict_proba`, etc. methods the composing estimator has, and applies the inner pipeline to any input dataframe before passing the processed dataframe (or rather, its cast to a `numpy.ndarray`) to that estimator.

The main new functionality introduced here is a `pdpipe` class that can be extended to create such custom pipeline+estimator classes, with constructor parameters encapsulating both pipeline hyperparameters and model hyperparameters. The resulting pipeline-model object can thus be thought of as a stronger model with additional hyperparameters, and this extended model can have its hyperparameters optimized using sklearn functionalities such as the `model_selection.GridSearchCV` class and the `model_selection.cross_val_score` function.

## The Data

Let's start by creating a mock dataset that can demonstrate basic `pdpipe` functionalities.

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame(
    data=[
        [23, 'Jo', 'M', True, 0.07, 'USA', 'Living life to its fullest'],
        [52, 'Regina', 'F', False, 0.26, 'Germany', 'I hate cats'],
        [23, 'Dana', 'F', True, 0.3, 'USA', 'the pen is mightier then the sword'],
        [25, 'Bo', 'M', False, 2.3, 'Greece', 'all for one and one for all'],
        [80, 'Richy', 'M', False, 100.2, 'Finland', 'I gots the dollarz'],
        [60, 'Paul', 'M', True, 1.87, 'Denmark', 'blah'],
        [44, 'Derek', 'M', True, 1.1, 'Denmark', 'every life is precious'],
        [72, 'Regina', 'F', True, 7.1, 'Greece', 'all of you get off my porch'],
        [50, 'Jim', 'M', False, 0.2, 'Germany', 'boy do I love dogs and cats'],
        [80, 'Wealthus', 'F', False, 123.2, 'Finland', 'me likey them moniez'],
    ],
    columns=['Age', 'Name', 'Gender', 'Smoking', 'Savings', 'Country', 'Quote'],
)

In [3]:
df

Unnamed: 0,Age,Name,Gender,Smoking,Savings,Country,Quote
0,23,Jo,M,True,0.07,USA,Living life to its fullest
1,52,Regina,F,False,0.26,Germany,I hate cats
2,23,Dana,F,True,0.3,USA,the pen is mightier then the sword
3,25,Bo,M,False,2.3,Greece,all for one and one for all
4,80,Richy,M,False,100.2,Finland,I gots the dollarz
5,60,Paul,M,True,1.87,Denmark,blah
6,44,Derek,M,True,1.1,Denmark,every life is precious
7,72,Regina,F,True,7.1,Greece,all of you get off my porch
8,50,Jim,M,False,0.2,Germany,boy do I love dogs and cats
9,80,Wealthus,F,False,123.2,Finland,me likey them moniez


## Defining a combined object

In this section we will define a custom class which will capture all the pipeline logic and the model concatenated to it.

In this case we hard-code the use of the `LogisticRegression` model. With slightly more complex code, the model type/family itself can also be made a hyperparameter.
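
As a rough sketch of how the model family could become a hyperparameter, a small factory could replace the hard-coded `LogisticRegression(...)` call in the constructor. The names `MODEL_FAMILIES`, `build_estimator` and `model_family` are hypothetical and not part of `pdpipe`; only the two `sklearn` classes are real.

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier

# Hypothetical registry: maps a hyperparameter value to an estimator class.
MODEL_FAMILIES = {
    "logreg": LogisticRegression,
    "ridge": RidgeClassifier,
}


def build_estimator(model_family: str = "logreg", fit_intercept: bool = True):
    """Return an unfitted estimator of the requested family."""
    try:
        estimator_cls = MODEL_FAMILIES[model_family]
    except KeyError:
        raise ValueError(f"Unknown model_family: {model_family!r}") from None
    # Both classes accept `fit_intercept`, so it can be forwarded as-is.
    return estimator_cls(fit_intercept=fit_intercept)
```

A constructor could then accept `model_family` next to the other hyperparameters and pass `build_estimator(model_family, fit_intercept)` as the estimator. Note that `RidgeClassifier` has no `predict_proba`, so the set of methods available on the combined object would then depend on the chosen family.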

In [4]:
from typing import Optional

In [5]:
from sklearn.linear_model import LogisticRegression

In [6]:
import pdpipe as pdp
from pdpipe.skintegrate import PdPipelineAndSklearnEstimator

In [7]:
class MyPipelineAndModel(PdPipelineAndSklearnEstimator):
    
    def __init__(
        self,
        drop_gender: Optional[bool] = False,
        scale_numeric: Optional[bool] = False,
        ohencode_country: Optional[bool] = True,
        savings_bin_val: Optional[int] = None,
        fit_intercept: Optional[bool] = True,
    ):
        self.drop_gender = drop_gender
        self.scale_numeric = scale_numeric
        self.ohencode_country = ohencode_country
        self.savings_bin_val = savings_bin_val
        self.fit_intercept = fit_intercept
        cols_to_drop = []
        stages = [
            pdp.ColDrop(['Name', 'Quote'], errors='ignore'),
        ]
        if savings_bin_val:
            stages.append(pdp.Bin({'Savings': [savings_bin_val]}, drop=False))
            stages.append(pdp.Encode('Savings_bin'))
        if scale_numeric:
            stages.append(pdp.Scale('MinMaxScaler'))
        if drop_gender:
            cols_to_drop.append('Gender')
        else:
            stages.append(pdp.Encode('Gender'))
        if ohencode_country:
            stages.append(pdp.OneHotEncode('Country'))
        else:
            cols_to_drop.append('Country')
        stages.append(pdp.ColDrop(cols_to_drop, errors='ignore'))
        pline = pdp.PdPipeline(stages)
        model = LogisticRegression(fit_intercept=fit_intercept)
        super().__init__(pipeline=pline, estimator=model)

In [8]:
mp = MyPipelineAndModel(
    drop_gender=True,
    scale_numeric=True,
    ohencode_country=True,
    savings_bin_val=1,
    fit_intercept=True,
)

In [9]:
mp

<PdPipeline -> LogisticRegression>

In [10]:
mp.pipeline

A pdpipe pipeline:
[ 0]  Drop columns Name, Quote
[ 1]  Bin Savings by [1].
[ 2]  Encode Savings_bin
[ 3]  Scale columns Columns of dtypes <class 'numpy.number'>
[ 4]  One-hot encode Country
[ 5]  Drop columns Gender

In [11]:
mp.estimator

LogisticRegression()

In [12]:
mp.score

<bound method PdPipelineAndSklearnEstimator.score of <PdPipeline -> LogisticRegression>>

In [13]:
mp.score?

Signature: mp.score(X, y=None)
Docstring: <no docstring>
File:      ~/clones/pdpipe/pdpipe/skintegrate.py
Type:      method


In [14]:
mp.pipeline(df)

Unnamed: 0,Age,Smoking,Savings,Savings_bin,Country_Finland,Country_Germany,Country_Greece,Country_USA
0,0.0,True,0.0,1.0,0,0,0,1
1,0.508772,False,0.001543,1.0,0,1,0,0
2,0.0,True,0.001868,1.0,0,0,0,1
3,0.035088,False,0.018111,0.0,0,0,1,0
4,1.0,False,0.813206,0.0,1,0,0,0
5,0.649123,True,0.014619,0.0,0,0,0,0
6,0.368421,True,0.008365,0.0,0,0,0,0
7,0.859649,True,0.057094,0.0,0,0,1,0
8,0.473684,False,0.001056,1.0,0,1,0,0
9,1.0,False,1.0,0.0,1,0,0,0
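
The scaled `Age` values in the table above are consistent with plain min-max scaling, `(x - min) / (max - min)`, applied to the original column. The snippet below is a pure-Python check of that arithmetic, not the code `pdpipe` runs internally; the helper name `min_max_scale` is made up for illustration.

```python
# Original Age column, copied from the mock dataset.
ages = [23, 52, 23, 25, 80, 60, 44, 72, 50, 80]


def min_max_scale(values):
    """Map values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled = [round(v, 6) for v in min_max_scale(ages)]
print(scaled)
# [0.0, 0.508772, 0.0, 0.035088, 1.0, 0.649123, 0.368421, 0.859649, 0.473684, 1.0]
```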


In [15]:
mp.pipeline[0:4](df)

Unnamed: 0,Age,Gender,Smoking,Savings,Savings_bin,Country
0,0.0,M,True,0.0,1.0,USA
1,0.508772,F,False,0.001543,1.0,Germany
2,0.0,F,True,0.001868,1.0,USA
3,0.035088,M,False,0.018111,0.0,Greece
4,1.0,M,False,0.813206,0.0,Finland
5,0.649123,M,True,0.014619,0.0,Denmark
6,0.368421,M,True,0.008365,0.0,Denmark
7,0.859649,F,True,0.057094,0.0,Greece
8,0.473684,M,False,0.001056,1.0,Germany
9,1.0,F,False,1.0,0.0,Finland


## Test that our custom estimator checks out

In [16]:
from sklearn.utils.estimator_checks import check_estimator

Actually, it does not. :(

In [17]:
# check_estimator(mp)

But it is going to work anyway!

## X-y subsets

In [18]:
x_lbls = ['Age', 'Gender', 'Savings', 'Country']

In [19]:
all_x = df[x_lbls]
all_y = df['Smoking']

In [20]:
all_x.shape

(10, 4)

In [21]:
all_y.shape

(10,)

In [22]:
all_x

Unnamed: 0,Age,Gender,Savings,Country
0,23,M,0.07,USA
1,52,F,0.26,Germany
2,23,F,0.3,USA
3,25,M,2.3,Greece
4,80,M,100.2,Finland
5,60,M,1.87,Denmark
6,44,M,1.1,Denmark
7,72,F,7.1,Greece
8,50,M,0.2,Germany
9,80,F,123.2,Finland


## Check inheritance of predict()

This section demonstrates that the composed object inherits all the cool `sklearn` methods from the composing estimator!

In [23]:
mp.fit(all_x, all_y)

<PdPipeline -> LogisticRegression>

In [24]:
mp.predict(all_x)

array([ True, False,  True,  True, False,  True,  True,  True, False,
       False])

In [25]:
mp.predict(all_x).dtype == bool

True

In [26]:
mp.predict_proba(all_x)

array([[0.29801843, 0.70198157],
       [0.63432307, 0.36567693],
       [0.29822225, 0.70177775],
       [0.43059313, 0.56940687],
       [0.68689079, 0.31310921],
       [0.42798384, 0.57201616],
       [0.4167183 , 0.5832817 ],
       [0.46679026, 0.53320974],
       [0.63301878, 0.36698122],
       [0.70744934, 0.29255066]])

In [27]:
mp.predict_proba(all_x).dtype == float

True

In [28]:
mp.predict_log_proba(all_x)

array([[-1.21059993, -0.35384813],
       [-0.45519688, -1.00600504],
       [-1.20991627, -0.35413852],
       [-0.84259165, -0.56316003],
       [-0.37557997, -1.16120322],
       [-0.84866984, -0.55858804],
       [-0.87534482, -0.53908502],
       [-0.76187525, -0.62884042],
       [-0.45725519, -1.00244461],
       [-0.34608926, -1.22911742]])

In [29]:
mp.decision_function(all_x)

array([ 0.8567518 , -0.55080816,  0.85577775,  0.27943162, -0.78562325,
        0.29008181,  0.3362598 ,  0.13303483, -0.54518943, -0.88302815])

In [30]:
mp.score(all_x, all_y)

0.9
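
For an sklearn classifier such as `LogisticRegression`, `score` is the mean accuracy, and the value above is consistent with that: comparing the `Smoking` labels from the dataset with the predictions printed earlier gives 9 matches out of 10. A pure-Python recomputation, with both lists copied by hand from the cells above:

```python
# Smoking column of the mock dataset (true labels).
y_true = [True, False, True, False, False, True, True, True, False, False]
# Output of mp.predict(all_x) shown above.
y_pred = [True, False, True, True, False, True, True, True, False, False]

# Mean accuracy: fraction of positions where prediction equals label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9
```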

In [31]:
mp.classes_

array([False,  True])

Note that if we use `LinearRegression`, for example, which only has `predict` and `score`, our derivative pipeline+model object will only inherit these two methods:

In [53]:
from sklearn.linear_model import LinearRegression

In [54]:
pipe_and_linreg = PdPipelineAndSklearnEstimator(
    pipeline=mp.pipeline,
    estimator=LinearRegression()
)

In [57]:
pipe_and_linreg.fit(all_x, all_y)

<PdPipeline -> LinearRegression>

In [58]:
pipe_and_linreg.predict(all_x)

array([ 0.99994168,  0.01900237,  1.00005832,  0.05258477, -0.00583227,
        1.15209247,  0.84790753,  0.94741523, -0.01900237,  0.00583227])

In [60]:
assert not hasattr(pipe_and_linreg, 'predict_proba')

In [61]:
assert not hasattr(pipe_and_linreg, 'predict_log_proba')

In [62]:
assert not hasattr(pipe_and_linreg, 'decision_function')
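
One common way to get this kind of behaviour in Python is attribute delegation through `__getattr__`. The toy classes below are a minimal sketch of that general technique under the assumption that a wrapper simply forwards unknown attributes; they are not the actual `pdpipe` implementation, which this notebook does not show, and all class names are made up.

```python
class Delegating:
    """Toy wrapper: forwards unknown attribute lookups to the wrapped object."""

    def __init__(self, inner):
        self.inner = inner

    def __getattr__(self, name):
        # Called only when normal lookup fails, so the wrapper's own
        # attributes take precedence over those of the inner object.
        if name == "inner":
            # Guards against infinite recursion if `inner` is not set yet.
            raise AttributeError(name)
        return getattr(self.inner, name)


class OnlyPredict:
    """Stand-in for an estimator that only offers `predict`."""

    def predict(self, X):
        return [0 for _ in X]


class PredictAndProba(OnlyPredict):
    """Stand-in for an estimator that also offers `predict_proba`."""

    def predict_proba(self, X):
        return [[1.0, 0.0] for _ in X]


regressor_like = Delegating(OnlyPredict())
classifier_like = Delegating(PredictAndProba())

# The wrapper exposes exactly the methods its inner object has.
assert hasattr(regressor_like, "predict")
assert not hasattr(regressor_like, "predict_proba")
assert hasattr(classifier_like, "predict_proba")
```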

## GridSearchCV

Here we can see how this object works flawlessly with sklearn's `GridSearchCV`!

By default, the scoring function used to select the best model is the `score` method of the inner estimator. If the estimator has no such method, we have to pass a scorer function to the `scoring` constructor parameter of the `GridSearchCV` class. We will take a look at this later.

In [32]:
from sklearn.model_selection import GridSearchCV

In [33]:
gcv = GridSearchCV(
    estimator=mp,
    param_grid={
        'savings_bin_val': [1, 2],
        'scale_numeric': [True, False],
        'drop_gender': [True, False],
        'ohencode_country': [True, False],
    },
    cv=3,
)

In [34]:
gcv

GridSearchCV(cv=3,
             ('estimator', <PdPipeline -> LogisticRegression>),
             param_grid={'drop_gender': [True, False],
                         'ohencode_country': [True, False],
                         'savings_bin_val': [1, 2],
                         'scale_numeric': [True, False]})
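
To get a feel for the size of this search, the grid can be expanded by hand. The sketch below uses only `itertools` and mirrors the `param_grid` above; it illustrates the counting, not how `GridSearchCV` enumerates candidates internally.

```python
from itertools import product

param_grid = {
    'savings_bin_val': [1, 2],
    'scale_numeric': [True, False],
    'drop_gender': [True, False],
    'ohencode_country': [True, False],
}

# Every combination of one value per hyperparameter is one candidate.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_splits = 3  # matches cv=3 above
print(len(candidates))             # 16
print(len(candidates) * n_splits)  # 48
```

With the default `refit=True`, the 48 cross-validation fits are followed by one more fit of the best candidate on the full data.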

In [35]:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

In [36]:
try:
    check_is_fitted(gcv)
except NotFittedError:
    print("Not fitted - as expected")

Not fitted - as expected


In [37]:
gcv.fit(all_x, all_y)

GridSearchCV(cv=3,
             ('estimator', <PdPipeline -> LogisticRegression>),
             param_grid={'drop_gender': [True, False],
                         'ohencode_country': [True, False],
                         'savings_bin_val': [1, 2],
                         'scale_numeric': [True, False]})

In [38]:
assert check_is_fitted(gcv) is None

In [39]:
gcv

GridSearchCV(cv=3,
             ('estimator', <PdPipeline -> LogisticRegression>),
             param_grid={'drop_gender': [True, False],
                         'ohencode_country': [True, False],
                         'savings_bin_val': [1, 2],
                         'scale_numeric': [True, False]})

In [40]:
# gcv.cv_results_

In [41]:
gcv.best_estimator_

<PdPipeline -> LogisticRegression>

In [42]:
gcv.best_score_

0.5833333333333334

In [43]:
gcv.best_params_

{'drop_gender': True,
 'ohencode_country': True,
 'savings_bin_val': 1,
 'scale_numeric': True}

## Working with custom scoring functions

We can also use a custom scoring function with `GridSearchCV`. As usual, we first have to convert any metric / scoring function (a function with a signature of the form `score(y_true, y_pred, ...)`) into a scorer function, with a signature of the form `scorer(estimator, X, y, ...)`; this is done with the `sklearn.metrics.make_scorer` function.
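
The difference between the two signatures can be seen in plain scikit-learn, without `pdpipe`. The sketch below uses made-up toy data and a `DummyClassifier` as a stand-in estimator; it only shows that the scorer calls `predict` itself and then applies the metric.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import fbeta_score, make_scorer

# Toy data, made up for illustration only.
X = np.zeros((6, 1))
y = np.array([True, True, True, False, False, True])

# Stand-in estimator that always predicts the most frequent class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)

# Metric: compares true labels with predictions we compute ourselves.
metric_value = fbeta_score(y, clf.predict(X), beta=2)

# Scorer: takes the fitted estimator and the data, and predicts internally.
ftwo_scorer = make_scorer(fbeta_score, beta=2)
scorer_value = ftwo_scorer(clf, X, y)

assert np.isclose(metric_value, scorer_value)
```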

Here, however, we will also have to convert the resulting `sklearn` scorer into a scorer that can handle our combined pipeline+model object. This can be done by importing the aptly-named `pdpipe_scorer_from_sklearn_scorer` from `pdpipe.skintegrate`, and applying it to the scorer to get a pdpipe-compliant one.

In [44]:
from sklearn.metrics import fbeta_score, make_scorer
ftwo_scorer = make_scorer(fbeta_score, beta=2)

In [45]:
from pdpipe.skintegrate import pdpipe_scorer_from_sklearn_scorer

In [46]:
my_scorer = pdpipe_scorer_from_sklearn_scorer(ftwo_scorer)

In [47]:
my_scorer

<PdPipeScorer: make_scorer(fbeta_score, beta=2)>

In [48]:
gcv = GridSearchCV(
    estimator=mp,
    param_grid={
        'savings_bin_val': [1, 2],
        'scale_numeric': [True, False],
        'drop_gender': [True, False],
        'ohencode_country': [True, False],
    },
    cv=3,
    scoring=my_scorer,
)

In [49]:
gcv.fit(all_x, all_y)

GridSearchCV(cv=3,
             ('estimator', <PdPipeline -> LogisticRegression>),
             param_grid={'drop_gender': [True, False],
                         'ohencode_country': [True, False],
                         'savings_bin_val': [1, 2],
                         'scale_numeric': [True, False]},
             scoring=<PdPipeScorer: make_scorer(fbeta_score, beta=2)>)

In [50]:
gcv.best_score_

0.30303030303030304

In [51]:
gcv.best_params_

{'drop_gender': True,
 'ohencode_country': True,
 'savings_bin_val': 1,
 'scale_numeric': True}

In [52]:
# gcv.cv_results_