# Optimizing the bucketing process

In [1]:
import pandas as pd

from skorecard.datasets import load_credit_card

df = load_credit_card(as_frame=True)

# Show
display(df.head(4))

num_feats = ['x1','x15','x16']

X = df[num_feats]
y = df['y']

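| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | ... | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000.0 | 2.0 | 2.0 | 1.0 | 24.0 | 2.0 | 2.0 | -1.0 | -1.0 | -2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 120000.0 | 2.0 | 2.0 | 2.0 | 26.0 | -1.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 90000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 50000.0 | 2.0 | 2.0 | 1.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |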


# Finding the best bucketing

The art of building a good scorecard model lies in finding the best bucketing strategy.<br>
Good buckets improve the predictive power of the model and the stability of its predictions.<br>

This is normally a very manual, labour-intensive process (and for good reason).<br>

Good buckets follow these principles:
- maximize the Information Value, defined as 

$$IV = \sum_{i}\left(\%G_{i}-\%B_{i}\right)\cdot\log\left(\frac{\%G_{i}}{\%B_{i}}\right)$$

- avoid buckets that contain a very large or very small fraction of the population
- wherever the business sense requires it, 
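
As a rough illustration of the Information Value formula above, the sketch below computes IV from per-bucket counts of goods and bads. The helper `information_value` and the counts are made up for this example; this is not `skorecard`'s implementation. It also assumes every bucket contains at least one good and one bad, since the logarithm is undefined otherwise.

```python
import math


def information_value(goods, bads):
    """Compute IV from per-bucket counts (hypothetical helper, not skorecard's API)."""
    total_goods, total_bads = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        pct_g = g / total_goods  # %G_i: share of all goods that fall in bucket i
        pct_b = b / total_bads   # %B_i: share of all bads that fall in bucket i
        iv += (pct_g - pct_b) * math.log(pct_g / pct_b)
    return iv


# Made-up counts for three buckets, for illustration only
goods = [400, 350, 250]
bads = [20, 30, 50]

iv = information_value(goods, bads)
print(f"IV = {iv:.4f}")
```

Every term in the sum is non-negative, because the difference and the logarithm always share the same sign, so each bucket can only add to the total.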

The `skorecard` package provides some tooling to automate part of the process, namely:

- Grid search the hyper-parameters of the bucketers in order to maximise the information value
- Run the optimal bucketer within the bucketing process




## Grid search the bucketers to maximise the information value

`skorecard` implements an `IV_scorer`, which can be used as a custom scoring function for grid searching.<br>
The following code snippets show how to integrate it into the grid search.<br>
The `DecisionTreeBucketer` applied to numerical features is the best use case, as it has hyper-parameters that influence the bucketing quality.

In [2]:
from skorecard.metrics import IV_scorer
from skorecard.bucketers import DecisionTreeBucketer
from sklearn.model_selection import GridSearchCV

The `DecisionTreeBucketer` has two main hyper-parameters to grid-search:
- `max_n_bins`: the maximum number of bins allowed for the bucketing
- `min_bin_size`: the minimum fraction of data in each bucket

In [3]:
gs_params = {
   "max_n_bins": [3, 4, 5, 6],
   "min_bin_size": [0.05, 0.06, 0.07, 0.08],
}
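
To make the size of this search concrete, the sketch below expands the grid into its individual candidates using only the standard library. It only mimics how an exhaustive grid search enumerates combinations; the names `names` and `candidates` are illustrative and not part of `skorecard` or scikit-learn.

```python
from itertools import product

gs_params = {
    "max_n_bins": [3, 4, 5, 6],
    "min_bin_size": [0.05, 0.06, 0.07, 0.08],
}

# Build one dictionary per combination of hyper-parameter values
names = list(gs_params)
candidates = [dict(zip(names, values)) for values in product(*gs_params.values())]

print(len(candidates))  # 4 values x 4 values = 16 candidates
print(candidates[0])
print(candidates[-1])
```

Each candidate is fitted and scored once per cross-validation fold, so the number of fits per feature grows with both the grid size and the number of folds.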


The optimization has to be done for every feature independently, so we need a loop. The results are best stored in a data collector, such as a dictionary.

In [4]:
# Collect the grid search results per feature
best_params = dict()
max_iv = dict()
cv_results = dict()

# Add a special value for demo purposes (numeric, to match the float values in x1)
specials = {'x1': {'special 0': [50000.0]}}

for feat in num_feats:

    # Illustrates how to handle special values
    if feat in specials:
        # Remap the specials to this feature only: skorecard validates that every
        # key of the specials dictionary is present in `variables`
        special = {feat: specials[feat]}
    else:
        special = {}
    bucketer = DecisionTreeBucketer(variables=[feat], specials=special)
    gs = GridSearchCV(bucketer, gs_params, scoring=IV_scorer, cv=3, return_train_score=True)
    gs.fit(X[[feat]], y)

    best_params[feat] = gs.best_params_
    max_iv[feat] = gs.best_score_
    cv_results[feat] = gs.cv_results_

Checking the best parameters per feature:

In [6]:
best_params

{'x1': {'max_n_bins': 6, 'min_bin_size': 0.05},
 'x15': {'max_n_bins': 6, 'min_bin_size': 0.08},
 'x16': {'max_n_bins': 6, 'min_bin_size': 0.05}}

Because of its additive nature, IV is likely to be maximal for the highest `max_n_bins`.
It is therefore worth analysing the CV results!
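
The additive behaviour mentioned above can be checked on a toy example: merging two buckets never increases IV when it is computed on the same data, so a finer bucketing tends to score at least as well. The counts and the helper `information_value` below are made up for illustration and are not part of `skorecard`. Scores measured on held-out folds can behave differently, which is one more reason to inspect the CV results.

```python
import math


def information_value(goods, bads):
    """Compute IV from per-bucket counts (hypothetical helper, not skorecard's API)."""
    total_goods, total_bads = sum(goods), sum(bads)
    return sum(
        (g / total_goods - b / total_bads)
        * math.log((g / total_goods) / (b / total_bads))
        for g, b in zip(goods, bads)
    )


# Made-up counts for three buckets
goods = [400, 350, 250]
bads = [20, 30, 50]
iv_three = information_value(goods, bads)

# Merge the first two buckets into one and recompute
goods_merged = [goods[0] + goods[1], goods[2]]
bads_merged = [bads[0] + bads[1], bads[2]]
iv_two = information_value(goods_merged, bads_merged)

print(f"IV with 3 buckets: {iv_three:.4f}")
print(f"IV with 2 buckets: {iv_two:.4f}")
```

The gain from the extra bucket may be small, so comparing the scores across all `max_n_bins` values helps judge whether the additional buckets are worth the added complexity.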

In [8]:
cv_results['x1']

{'mean_fit_time': array([0.01115632, 0.00850654, 0.00830102, 0.01148208, 0.00970546,
        0.00796835, 0.00656263, 0.00784127, 0.01052507, 0.01083366,
        0.00946848, 0.00820398, 0.01150513, 0.00915805, 0.00807834,
        0.00712228]),
 'std_fit_time': array([0.00143131, 0.00119281, 0.00016714, 0.00292413, 0.00028072,
        0.00123104, 0.00013929, 0.00132944, 0.00062044, 0.00064747,
        0.00030773, 0.00044063, 0.00209388, 0.00052964, 0.00052136,
        0.00038604]),
 'mean_score_time': array([0.01940401, 0.00953499, 0.00975267, 0.01215029, 0.01113605,
        0.00872302, 0.00750152, 0.00908868, 0.01182199, 0.01137336,
        0.01147874, 0.00959229, 0.01343695, 0.00988603, 0.00882077,
        0.00784699]),
 'std_score_time': array([0.0100211 , 0.00027721, 0.00089701, 0.00231537, 0.00100316,
        0.00088643, 0.00012007, 0.00128508, 0.00014799, 0.00013479,
        0.0007528 , 0.00059069, 0.00284929, 0.0002886 , 0.00031146,
        0.00034016]),
 'param_max_n_bins': maske