# Using the BucketingProcess

The `BucketingProcess` enables a two-step bucketing approach: a feature is first pre-bucketed into, for example, 100 pre-buckets, and those pre-buckets are then merged into the final buckets.

This is common practice: it reduces the problem of finding exact bucket boundaries to the problem of deciding which of the 100 pre-buckets to merge together.
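
The two steps can be illustrated without skorecard. The sketch below is a deliberately simplified stand-in: it pre-buckets by rank (equal frequency) and then greedily merges the adjacent pre-buckets whose event rates are closest. This is not how `DecisionTreeBucketer` or `OptimalBucketer` choose their boundaries; the function names and the toy data are hypothetical and only show the "pre-bucket, then merge" idea.

```python
from typing import List, Sequence


def prebucket_equal_frequency(values: Sequence[float], n_prebuckets: int) -> List[int]:
    """Step 1: assign each value a pre-bucket index by rank (equal-frequency bins).

    Simplification: tied values can land in different pre-buckets.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    prebuckets = [0] * len(values)
    for rank, i in enumerate(order):
        prebuckets[i] = rank * n_prebuckets // len(values)
    return prebuckets


def merge_adjacent(
    prebuckets: Sequence[int], targets: Sequence[int], max_n_bins: int
) -> List[List[int]]:
    """Step 2: greedily merge the adjacent groups with the closest event rates."""
    n_prebuckets = max(prebuckets) + 1
    # Each group is [pre-bucket ids, number of events, number of rows].
    groups = [[[b], 0, 0] for b in range(n_prebuckets)]
    for b, t in zip(prebuckets, targets):
        groups[b][1] += t
        groups[b][2] += 1
    while len(groups) > max(max_n_bins, 1):
        rates = [events / rows if rows else 0.0 for _, events, rows in groups]
        # Position of the adjacent pair whose event rates differ least.
        i = min(range(len(groups) - 1), key=lambda k: abs(rates[k] - rates[k + 1]))
        (ids_l, ev_l, n_l), (ids_r, ev_r, n_r) = groups[i], groups[i + 1]
        groups[i:i + 2] = [[ids_l + ids_r, ev_l + ev_r, n_l + n_r]]
    return [ids for ids, _, _ in groups]


values = list(range(12))                        # toy feature
targets = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # toy binary target
prebuckets = prebucket_equal_frequency(values, n_prebuckets=6)
merged = merge_adjacent(prebuckets, targets, max_n_bins=3)
print(prebuckets)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(merged)      # [[0, 1], [2], [3, 4, 5]]
```

The second step only ever looks at pre-bucket indices, which is why the search space shrinks from "any boundary on the raw feature" to "which neighbouring pre-buckets belong together".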

## Define the BucketingProcess

The bucketing process combines a pre-bucketing pipeline and a bucketing pipeline. You can also pass `specials` or `variables`, and `BucketingProcess` will pass those settings on to the bucketers in both pipelines.

In the example below, we pre-bucket the numerical features into at most 100 bins and pre-bucket the categorical columns as-is: each unique value becomes its own category, and categories not seen during fitting end up in the "other" bucket.

In [1]:
%load_ext autoreload
%autoreload 2
from skorecard import datasets
from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer, AsIsCategoricalBucketer
from skorecard.pipeline import BucketingProcess

from sklearn.pipeline import make_pipeline

df = datasets.load_uci_credit_card(as_frame=True)
y = df["default"]
X = df.drop(columns=["default"])

num_cols = ["LIMIT_BAL", "BILL_AMT1"]
cat_cols = ["EDUCATION", "MARRIAGE"]
specials = {"EDUCATION": {"Is 1": [1]}}

bucketing_process = BucketingProcess(
        prebucketing_pipeline=make_pipeline(
                DecisionTreeBucketer(variables=num_cols, max_n_bins=100, min_bin_size=0.05),
                AsIsCategoricalBucketer(variables=cat_cols)
        ),
        bucketing_pipeline=make_pipeline(
                OptimalBucketer(variables=num_cols, max_n_bins=10, min_bin_size=0.05),
                OptimalBucketer(variables=cat_cols,
                                variables_type='categorical',
                                max_n_bins=10,
                                min_bin_size=0.05),
        ),
        specials=specials
)

bucketing_process.fit_transform(X, y).head()

| | EDUCATION | MARRIAGE | LIMIT_BAL | BILL_AMT1 |
|---|---|---|---|---|
| 0 | -3 | 0 | 8 | 5 |
| 1 | 1 | 0 | 3 | 4 |
| 2 | -3 | 0 | 8 | 5 |
| 3 | -3 | 1 | 4 | 0 |
| 4 | 1 | 1 | 8 | 3 |
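
The `-3` values in the `EDUCATION` column above appear to be the rows that hit the special `"Is 1"`, since specials are kept apart from the regular buckets. A minimal sketch of that idea follows, assuming reserved negative ids (`-1` for missing and `-2` for other, as labelled in the bucket tables, and `-3` for the first special). The helper name and the rule that further specials continue at `-4`, `-5`, ... are assumptions for illustration, not taken from the library.

```python
from typing import Dict, Hashable, List

MISSING_BUCKET = -1        # labelled "Missing" in the bucket tables
OTHER_BUCKET = -2          # labelled "Other" in the bucket tables
FIRST_SPECIAL_BUCKET = -3  # consistent with the -3 in the EDUCATION column


def special_bucket_ids(specials: Dict[str, List[Hashable]]) -> Dict[Hashable, int]:
    """Map every special value to a reserved negative bucket id.

    Assumption: each additional special group gets the next id downwards.
    """
    lookup: Dict[Hashable, int] = {}
    for offset, values in enumerate(specials.values()):
        for value in values:
            lookup[value] = FIRST_SPECIAL_BUCKET - offset
    return lookup


# The specials defined for EDUCATION in the example above.
lookup = special_bucket_ids({"Is 1": [1]})
print(lookup)  # {1: -3}
```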


In [9]:
bucketing_process.fit_interactive(X, y, port=7001)

Dash app running on http://127.0.0.1:7001/


In [5]:
bucketing_process.pipeline.features_bucket_mapping_.get('LIMIT_BAL')

BucketMapping(feature_name='LIMIT_BAL', type='numerical', missing_bucket=None, other_bucket=None, map=[1, 2, 6, 8, 9, 10, 11], right=False, specials={})
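
For a numerical feature, the `map` list can be read as interval boundaries, and `right=False` suggests left-closed intervals `[a, b)`, in the same spirit as `numpy.digitize`. The sketch below uses the standard-library `bisect` module to show that lookup. The helper name is hypothetical, and treating the boundaries as pre-bucket indices is an assumption based on how `BucketingProcess` chains the two steps.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def bucket_from_boundaries(x: float, boundaries: Sequence[float], right: bool = False) -> int:
    """Return the index of the interval that contains x.

    right=False: intervals are [a, b); right=True: intervals are (a, b].
    """
    if right:
        return bisect_left(boundaries, x)
    return bisect_right(boundaries, x)


limit_bal_map = [1, 2, 6, 8, 9, 10, 11]  # `map` from the BucketMapping above
for prebucket in (0, 1, 5, 6, 11):
    print(prebucket, "->", bucket_from_boundaries(prebucket, limit_bal_map))
# 0 -> 0
# 1 -> 1
# 5 -> 2
# 6 -> 3
# 11 -> 7
```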

In [6]:
bucketing_process.bucket_table('LIMIT_BAL')

| | bucket | label | Count | Count (%) | Non-event | Event | Event Rate | WoE | IV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | Missing | 0.0 | 0.0 | 0.0 | 0.0 | | 0.0 | 0.0 |
| 1 | 0 | [-inf, 1.0) | 479.0 | 7.98 | 300.0 | 179.0 | 0.373695 | -0.725 | 0.05 |
| 2 | 1 | [1.0, 2.0) | 370.0 | 6.17 | 233.0 | 137.0 | 0.37027 | -0.71 | 0.037 |
| 3 | 2 | [2.0, 6.0) | 1661.0 | 27.68 | 1235.0 | 426.0 | 0.256472 | -0.177 | 0.009 |
| 4 | 3 | [6.0, 8.0) | 1015.0 | 16.92 | 816.0 | 199.0 | 0.196059 | 0.17 | 0.005 |
| 5 | 4 | [8.0, 9.0) | 769.0 | 12.82 | 630.0 | 139.0 | 0.180754 | 0.27 | 0.009 |
| 6 | 5 | [9.0, 10.0) | 501.0 | 8.35 | 419.0 | 82.0 | 0.163673 | 0.39 | 0.011 |
| 7 | 6 | [10.0, 11.0) | 379.0 | 6.32 | 326.0 | 53.0 | 0.139842 | 0.575 | 0.018 |
| 8 | 7 | [11.0, 12.0) | 350.0 | 5.83 | 287.0 | 63.0 | 0.18 | 0.275 | 0.004 |
| 9 | 8 | [12.0, inf) | 476.0 | 7.93 | 409.0 | 67.0 | 0.140756 | 0.567 | 0.022 |
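
The `WoE` and `IV` columns above can be reproduced from the `Non-event` and `Event` counts with the usual weight-of-evidence definition, WoE = ln(share of non-events / share of events) and IV = (share of non-events - share of events) x WoE. The sketch below assumes exactly that definition with no smoothing. The library may handle empty buckets or add a small constant, so treat it as a check of the arithmetic rather than as skorecard's implementation.

```python
from math import log
from typing import List, Sequence, Tuple


def woe_iv(non_events: Sequence[float], events: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (WoE, IV) per bucket; every bucket must contain both classes."""
    total_non_events = sum(non_events)
    total_events = sum(events)
    results = []
    for non_event, event in zip(non_events, events):
        pct_non_event = non_event / total_non_events
        pct_event = event / total_events
        woe = log(pct_non_event / pct_event)
        results.append((woe, (pct_non_event - pct_event) * woe))
    return results


# Counts for LIMIT_BAL buckets 0..8 (the empty "Missing" bucket is left out).
non_events = [300, 233, 1235, 816, 630, 419, 326, 287, 409]
events = [179, 137, 426, 199, 139, 82, 53, 63, 67]

results = woe_iv(non_events, events)
woe_first, iv_first = results[0]
print(round(woe_first, 3), round(iv_first, 2))  # -0.725 0.05
```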


In [9]:
bucketing_process.bucket_tables_.keys()

dict_keys(['EDUCATION', 'MARRIAGE', 'LIMIT_BAL', 'BILL_AMT1'])

## Methods and Attributes

A `BucketingProcess` instance has the same methods and attributes as a bucketer:

- `.summary()`
- `.bucket_table(column)`
- `.plot_bucket(column)`
- `.features_bucket_mapping_`
- `.save_to_yaml()`
- `.fit_interactive()`

but also adds a few unique ones:

- `.prebucket_table(column)`
- `.plot_prebucket(column)`


In [None]:
bucketing_process.summary()

In [5]:
bucketing_process.prebucket_table('MARRIAGE')

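| | pre-bucket | label | Count | Count (%) | Non-event | Event | Event Rate | WoE | IV | bucket |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2 | Other | 0.0 | 0.0 | 0.0 | 0.0 | | 0.0 | 0.0 | -2 |
| 1 | -1 | Missing | 0.0 | 0.0 | 0.0 | 0.0 | | 0.0 | 0.0 | -1 |
| 2 | 0 | 2 | 3138.0 | 52.3 | 2493.0 | 645.0 | 0.205545 | 0.11 | 0.006 | 0 |
| 3 | 1 | 1 | 2784.0 | 46.4 | 2108.0 | 676.0 | 0.242816 | -0.104 | 0.005 | 1 |
| 4 | 2 | 3 | 64.0 | 1.07 | 42.0 | 22.0 | 0.34375 | -0.594 | 0.004 | 1 |
| 5 | 3 | 0 | 14.0 | 0.23 | 12.0 | 2.0 | 0.142857 | 0.547 | 0.001 | 0 |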
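
In the `MARRIAGE` pre-bucket table above, every category has its own pre-bucket, and the ids happen to follow the category counts (the most frequent value, `2`, is pre-bucket `0`). The sketch below mimics that "as-is" behaviour in plain Python. The function names are hypothetical, missing values are represented by `None`, and ordering by frequency is an assumption that merely agrees with this table.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional

MISSING_BUCKET = -1  # labelled "Missing" in the table
OTHER_BUCKET = -2    # labelled "Other" in the table


def fit_as_is(values: Iterable[Optional[Hashable]]) -> Dict[Hashable, int]:
    """Give every observed category its own pre-bucket, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    return {category: idx for idx, (category, _) in enumerate(counts.most_common())}


def transform_as_is(value: Optional[Hashable], mapping: Dict[Hashable, int]) -> int:
    """Look up a category; categories not seen during fitting go to 'Other'."""
    if value is None:
        return MISSING_BUCKET
    return mapping.get(value, OTHER_BUCKET)


# Category counts taken from the MARRIAGE pre-bucket table.
marriage = [2] * 3138 + [1] * 2784 + [3] * 64 + [0] * 14
mapping = fit_as_is(marriage)
print(mapping)                          # {2: 0, 1: 1, 3: 2, 0: 3}
print(transform_as_is(5, mapping))      # -2
print(transform_as_is(None, mapping))   # -1
```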


In [3]:
bucketing_process.bucket_table('MARRIAGE')

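| | bucket | label | Count | Count (%) | Non-event | Event | Event Rate | WoE | IV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -2 | Other | 0.0 | 0.0 | 0.0 | 0.0 | | 0.0 | 0.0 |
| 1 | -1 | Missing | 0.0 | 0.0 | 0.0 | 0.0 | | 0.0 | 0.0 |
| 2 | 0 | 0, 3 | 3152.0 | 52.53 | 2505.0 | 647.0 | 0.205266 | 0.112 | 0.006 |
| 3 | 1 | 1, 2 | 2848.0 | 47.47 | 2150.0 | 698.0 | 0.245084 | -0.117 | 0.007 |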


In [None]:
bucketing_process.plot_prebucket("LIMIT_BAL", format="png", scale=2, width=1050, height=525)

## The `.features_bucket_mapping_` attribute

All skorecard bucketing classes have a `.features_bucket_mapping_` attribute that stores the information needed to go from an input feature to a bucketed feature. Because `BucketingProcess` has both a prebucketing and a bucketing step, its bucket mapping reflects the net effect of the two steps merged into one. This is demonstrated below:

In [4]:
bucketing_process.pre_pipeline.features_bucket_mapping_.get('MARRIAGE').labels

{3: '0', 1: '1', 0: '2', 2: '3', -1: 'Missing', -2: 'Other'}

In [None]:
bucketing_process.pipeline.features_bucket_mapping_.get('EDUCATION')

In [None]:
bucketing_process.features_bucket_mapping_.get('EDUCATION')
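
The "net effect" of the two steps is a composition of two lookups. Using the `MARRIAGE` tables from earlier (raw value to pre-bucket from the pre-bucket table, pre-bucket to bucket from its `bucket` column), a minimal sketch of the merge looks like this. The `compose` helper is hypothetical and only illustrates the idea; it is not the library's merging code.

```python
from typing import Dict, Hashable


def compose(first: Dict[Hashable, int], second: Dict[int, int]) -> Dict[Hashable, int]:
    """Chain two lookups: raw value -> pre-bucket -> bucket."""
    return {key: second[intermediate] for key, intermediate in first.items()}


# Raw MARRIAGE value -> pre-bucket (from the pre-bucket table).
value_to_prebucket = {2: 0, 1: 1, 3: 2, 0: 3}
# Pre-bucket -> final bucket (the `bucket` column of the same table).
prebucket_to_bucket = {0: 0, 1: 1, 2: 1, 3: 0}

value_to_bucket = compose(value_to_prebucket, prebucket_to_bucket)
print(value_to_bucket)  # {2: 0, 1: 1, 3: 1, 0: 0}
```

This agrees with the `MARRIAGE` bucket table, where bucket `0` combines pre-buckets `0` and `3` (3138 + 14 = 3152 rows) and bucket `1` combines pre-buckets `1` and `2` (2784 + 64 = 2848 rows).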

## The `.fit_interactive()` method

All skorecard bucketing classes have a `.fit_interactive()` method. For `BucketingProcess`, this launches a slightly different app that shows both the pre-buckets and the buckets, and lets you edit the prebucketing as well.

In [34]:
bucketing_process.fit_interactive(X, y, port=7001) # not run

Dash app running on http://127.0.0.1:7001/


## The `Skorecard()` class

In [None]:
from skorecard import Skorecard
model = Skorecard(bucketing=bucketing_process)
model.fit(X, y)
model.get_stats()

In [38]:
input_buckets = [-3, -2, -1, 1, 0, 3, 2, 4, 0]

from skorecard.apps.app_callbacks import is_sequential
is_sequential(sorted(input_buckets))


True
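
The cell above checks that a list of bucket ids has no gaps. The source of `is_sequential` is not shown here, so the function below is a hypothetical stand-in that is merely consistent with the `True` result: it treats the ids as sequential when the distinct values form an unbroken run of consecutive integers, ignoring duplicates.

```python
from typing import Iterable


def looks_sequential(buckets: Iterable[int]) -> bool:
    """True if the distinct bucket ids form a run of consecutive integers.

    Hypothetical helper, not skorecard's implementation.
    """
    unique = sorted(set(buckets))
    return all(b - a == 1 for a, b in zip(unique, unique[1:]))


input_buckets = [-3, -2, -1, 1, 0, 3, 2, 4, 0]
print(looks_sequential(input_buckets))  # True
print(looks_sequential([0, 1, 3]))      # False (bucket 2 is missing)
```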

In [None]:
model.bucket_table('EDUCATION')

In [None]:
model.bucketing.features_bucket_mapping_