# skorecard 

##### Building (traditional) credit risk models in Python

# Traditional credit risk modelling

When building a credit risk model, the most commonly used algorithm is logistic regression, due to its simplicity and interpretability.<br>
<br>
Logistic regression is a model that assumes a linear relationship between the variables (aka risk drivers or features) and the log-odds of the target (the default flag). <br>
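
Written out, the linearity holds on the log-odds scale. As a sketch, with $p$ the probability of default, $x_1, \dots, x_n$ the features and $\beta_0, \dots, \beta_n$ the fitted coefficients (symbols introduced here for illustration only):

$$
\log\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)}}
$$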

    


- The input variables are usually bucketed (according to a statistical process or business knowledge) in order to address non-linear relationships between the variables and the target


- This generates a set of buckets:
    - each feature value gets assigned to one bucket
    - the buckets are used to train the model
    - using the model output, each bucket is then assigned a score - the final model becomes a scorecard
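
The bucket-then-score idea in the list above can be sketched in a few lines of plain Python. The boundaries and the points per bucket below are made-up values for illustration, and `assign_bucket` and `score` are hypothetical helper names, not part of any library:

```python
from bisect import bisect_right

# Hypothetical bucket boundaries for a numerical feature (illustrative values only).
boundaries = [50_000, 100_000, 250_000]

# Hypothetical points per bucket; in practice these would be derived from the fitted model.
points_per_bucket = {0: -20, 1: -5, 2: 10, 3: 25}


def assign_bucket(value, boundaries):
    """Return the index of the left-closed bucket that contains `value`."""
    return bisect_right(boundaries, value)


def score(value):
    """Look up the points of the bucket that `value` falls into."""
    return points_per_bucket[assign_bucket(value, boundaries)]


for limit in (30_000, 50_000, 400_000):
    print(limit, assign_bucket(limit, boundaries), score(limit))
```

Every value lands in exactly one bucket, and the scorecard is then just a lookup table from bucket to points.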

# Some practical examples in python

- intro to scikit learn
- skorecard examples with sklearn integration
- sneak preview for the next features in the package

### Load Data

`skorecard` ships a demo dataset with 4 features (2 categorical and 2 numerical) for demos and testing.<br>
We'll use it in the examples below.

In [1]:
import pandas as pd
import numpy as np
from skorecard import datasets

df = datasets.load_uci_credit_card(as_frame=True)

X = df.drop(columns=["default"])
y = df["default"]
num_cols = ["LIMIT_BAL", "BILL_AMT1"]
cat_cols = ["EDUCATION", "MARRIAGE"]

df.head()

Unnamed: 0,EDUCATION,MARRIAGE,LIMIT_BAL,BILL_AMT1,default
0,1,2,400000.0,201800.0,0
1,2,2,80000.0,80610.0,0
2,1,2,500000.0,499452.0,0
3,1,1,140000.0,450.0,1
4,2,1,420000.0,56107.0,0


### Quick intro to scikit-learn

- scikit-learn (sklearn) is the package that defined the machine learning workflow in Python.
- scikit-learn is a very extensive and complete package. In the upcoming slides we want to introduce the concepts of `transformer`, `model` and `pipeline`, as these are what `skorecard` builds on


### sklearn transformers

- `transformers` are classes in sklearn whose function is to perform a transformation on the data.<br>
- In general, a `transformer` preserves the number of rows in a dataset.<br>
- `transformers` are characterized by two main methods:
    - `fit(X, y=None)` performs the necessary calculations
    - `transform(X)` applies the transformation to the (new) dataset

Example: `MinMaxScaler` is a transformer that rescales the input features X to a predefined range (0 to 1 by default; other ranges such as -1 to 1 can be set, depending on the use case)

In [2]:
from sklearn.preprocessing import MinMaxScaler

In [3]:
mms = MinMaxScaler(feature_range=(0, 1)).fit(X)
X_transformed = mms.transform(X)
X_transformed

array([[0.16666667, 0.66666667, 0.52      , 0.47324305],
       [0.33333333, 0.66666667, 0.09333333, 0.31713133],
       [0.16666667, 0.66666667, 0.65333333, 0.8566655 ],
       ...,
       [0.16666667, 0.33333333, 0.2       , 0.3153717 ],
       [0.33333333, 0.33333333, 0.12      , 0.21329301],
       [0.33333333, 0.33333333, 0.53333333, 0.30543744]])

In [4]:
X_transformed[:,0].min(), X_transformed[:,0].max()

(0.0, 1.0)

If we change the range, for example to (-2, 2), the transformation changes accordingly

In [5]:
mms = MinMaxScaler(feature_range=(-2, 2)).fit(X)
X_transformed = mms.transform(X)
X_transformed[:,0].min(), X_transformed[:,0].max()

(-2.0, 2.0)
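
To make the `fit`/`transform` contract concrete, here is a minimal pure-Python sketch of min-max scaling for a single column. `TinyMinMaxScaler` is a hypothetical class written for this illustration, not sklearn's implementation, and the input values are made up:

```python
class TinyMinMaxScaler:
    """Pure-Python sketch of the fit/transform contract for a single column."""

    def __init__(self, feature_range=(0.0, 1.0)):
        self.feature_range = feature_range

    def fit(self, values, y=None):
        # Learn the statistics needed for the transformation.
        self.min_ = min(values)
        self.max_ = max(values)
        return self  # returning self allows chaining: .fit(...).transform(...)

    def transform(self, values):
        # Rescale linearly; this sketch assumes max_ > min_ (non-constant column).
        low, high = self.feature_range
        span = self.max_ - self.min_
        return [low + (v - self.min_) / span * (high - low) for v in values]


column = [0, 1, 2, 6]  # made-up values for illustration
scaled = TinyMinMaxScaler(feature_range=(-2, 2)).fit(column).transform(column)
print(scaled)
```

`fit` only stores the minimum and maximum seen during training; `transform` reuses them, which is why new data can be scaled consistently with the training data.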

## sklearn models

- models are classes that implement the (ML) algorithms, together with the methods to train and apply them.
- A model has three main methods:
    - `fit(X, y)` - runs the optimization for the specific algorithm
    - `predict(X)` - returns the predictions for a (new) dataset
    - `predict_proba(X)` - returns the predicted probability of each class (available for classifiers)
    
Example: `LogisticRegression`

In [6]:
from sklearn.linear_model import LogisticRegression

lr = (
    LogisticRegression()
    .fit(X,y)
)
X_proba = lr.predict_proba(X)
X_proba

array([[0.93184381, 0.06815619],
       [0.62888787, 0.37111213],
       [0.9642941 , 0.0357059 ],
       ...,
       [0.74000313, 0.25999687],
       [0.65663984, 0.34336016],
       [0.93499249, 0.06500751]])
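
For a binary classifier such as `LogisticRegression`, `predict` returns the class with the highest predicted probability. The short sketch below illustrates that relationship in plain Python, using the first three rows of the output printed above:

```python
# First three rows of the `predict_proba` output shown above:
# column 0 is P(default = 0), column 1 is P(default = 1).
proba = [
    [0.93184381, 0.06815619],
    [0.62888787, 0.37111213],
    [0.9642941, 0.0357059],
]

# Each row sums to 1 (up to rounding of the printed values).
assert all(abs(sum(row) - 1.0) < 1e-6 for row in proba)

# Picking the most probable class per row ...
predicted = [max(range(len(row)), key=row.__getitem__) for row in proba]

# ... is the same as thresholding the positive-class probability at 0.5.
thresholded = [int(row[1] > 0.5) for row in proba]

print(predicted)
print(predicted == thresholded)
```

In credit risk, the second column (the probability of default) is usually the quantity of interest, rather than the hard class label.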

## sklearn pipeline - putting it all together

A pipeline chains transformers and a model into a single sequential object.<br>
The pipeline can have a sequence of multiple transformers and typically finishes with a model.

In [7]:
from sklearn.pipeline import Pipeline, make_pipeline

pipe = make_pipeline(
    MinMaxScaler(),
    LogisticRegression()
)

pipe.fit(X,y)

Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                ('logisticregression', LogisticRegression())])

In [8]:
X_proba = pipe.predict_proba(X)
X_proba

array([[0.86808393, 0.13191607],
       [0.74407574, 0.25592426],
       [0.85396345, 0.14603655],
       ...,
       [0.7501432 , 0.2498568 ],
       [0.73629437, 0.26370563],
       [0.86798974, 0.13201026]])
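
Conceptually, fitting a pipeline fits each transformer on the output of the previous one and then fits the model on the final result; predicting only applies `transform` before calling the model. The toy classes below (`AddOne`, `MeanThresholdModel`, `TinyPipeline`) are hypothetical and only sketch that idea; they are a simplification, not sklearn's code:

```python
class AddOne:
    """Toy transformer: learns nothing, adds 1 to every value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class MeanThresholdModel:
    """Toy model: predicts 1 for inputs above the training mean."""

    def fit(self, X, y=None):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


class TinyPipeline:
    """Chain transformers and a final model (illustrative simplification)."""

    def __init__(self, *steps):
        self.transformers = list(steps[:-1])
        self.model = steps[-1]

    def fit(self, X, y=None):
        for step in self.transformers:
            # Fit each step on the output of the previous one.
            X = step.fit(X, y).transform(X)
        self.model.fit(X, y)
        return self

    def predict(self, X):
        for step in self.transformers:
            # At prediction time the steps are only applied, never refitted.
            X = step.transform(X)
        return self.model.predict(X)


pipe_sketch = TinyPipeline(AddOne(), MeanThresholdModel()).fit([1, 2, 3, 6])
print(pipe_sketch.predict([0, 5]))
```

The practical benefit is that the whole chain behaves like a single model, so the same preprocessing is guaranteed to be applied at training and at prediction time.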

## Skorecard - and how it fits in the sklearn API

When we consider the bucketing process, it fits in the concept of sklearn transformers.<br>
Therefore in skorecard, we implemented a set of transformers that map the input data to a set of buckets.

Example: bucket with Decision Trees

In [9]:
from skorecard.bucketers import DecisionTreeBucketer
from sklearn.preprocessing import OneHotEncoder

skorecard_pipeline = make_pipeline(
    DecisionTreeBucketer(variables=num_cols, max_n_bins=6, min_bin_size=0.1),
    OneHotEncoder(),
    LogisticRegression()
)

In [10]:
skorecard_pipeline.fit(X,y)

Pipeline(steps=[('decisiontreebucketer',
                 DecisionTreeBucketer(max_n_bins=6, min_bin_size=0.1,
                                      variables=['LIMIT_BAL', 'BILL_AMT1'])),
                ('onehotencoder', OneHotEncoder()),
                ('logisticregression', LogisticRegression())])

## Get the details of the bucketers

Generate a report of the bucketing process 

In [11]:
binner = skorecard_pipeline.steps[0][1] # get the first element of the pipeline, which is our bucketer
oh_encoder = skorecard_pipeline.steps[1][1] # get the second element of the pipeline, which is the one hot encoder
model = skorecard_pipeline.steps[2][1] # get the third element of the pipeline, which is our model

In [12]:
binner.features_bucket_mapping_['LIMIT_BAL']

BucketMapping(feature_name='LIMIT_BAL', type='numerical', map=array([   -inf,  45000.,  55000.,  85000., 145000., 255000.,     inf]), right=False)

In [13]:
from skorecard.reporting import create_report

create_report(X, y, num_cols[0], binner, verbose=True)

IV for LIMIT_BAL = 0.1693


Unnamed: 0,Bin id,Min bin,Max bin,Count,Count (%),Event,Non Event,Event Rate,% Event,% Non Event,WoE,IV
0,1.0,-inf,45000.0,849,0.1415,316,533,0.372203,0.234944,0.114501,0.718724,0.086566
1,2.0,45000.0,55000.0,676,0.112667,158,518,0.233728,0.117472,0.111278,0.054163,0.000335
2,3.0,55000.0,85000.0,655,0.109167,179,476,0.273282,0.133086,0.102256,0.263493,0.008123
3,4.0,85000.0,145000.0,896,0.149333,219,677,0.24442,0.162825,0.145435,0.112941,0.001964
4,5.0,145000.0,255000.0,1561,0.260167,265,1296,0.169763,0.197026,0.27841,-0.345745,0.028138
5,6.0,255000.0,inf,1363,0.227167,208,1155,0.152605,0.154647,0.24812,-0.472745,0.044189
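
The WoE and IV columns can be reproduced from the Event and Non Event counts. The sketch below assumes the convention WoE = ln(% Event / % Non Event), which is inferred from the numbers in the table rather than taken from the skorecard source; small differences in the last decimals may come from a smoothing constant, which is also an assumption:

```python
import math

# Event / non-event counts per bin, copied from the report above.
events = [316, 158, 179, 219, 265, 208]
non_events = [533, 518, 476, 677, 1296, 1155]

total_events = sum(events)
total_non_events = sum(non_events)

woe, iv = [], []
for e, ne in zip(events, non_events):
    pct_event = e / total_events
    pct_non_event = ne / total_non_events
    w = math.log(pct_event / pct_non_event)  # sign convention inferred from the table
    woe.append(w)
    iv.append((pct_event - pct_non_event) * w)

for i, (w, v) in enumerate(zip(woe, iv), start=1):
    print(f"bin {i}: WoE={w:.4f}  IV={v:.4f}")
print(f"total IV = {sum(iv):.4f}")
```

With this convention, a positive WoE marks a bucket with an above-average share of events (defaults), and the feature's IV is the sum of the per-bin contributions.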


### Checking the model

In [14]:
model

LogisticRegression()

In [15]:
print(f'Coefficients: {model.coef_}\n')
print(f'Intercept : {model.intercept_}\n')

Coefficients: [[-0.13253087  0.47937838  0.65463067  0.55480657 -0.26212038 -0.50791937
  -0.79486294 -0.38514784  0.16334948 -0.12803591  0.34121631  0.74774612
   0.01223482  0.19485732  0.02750463 -0.43413672 -0.55682412  0.19645489
   0.28221057 -0.28005102 -0.05808675 -0.17239751  0.02325187]]

Intercept : [-1.76410325]



# Fine and coarse classing (WIP)

Right now, we have shown an example where the binning is defined through one transformer:
- the resulting buckets might not be optimal
- ideally, one would start with a lot of bins (fine classing) and then merge those that are similar enough (coarse classing)
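
The idea of coarse classing can be illustrated with a simple greedy rule: repeatedly merge the two adjacent bins whose event rates are closest, until all neighbouring bins differ by at least a chosen gap. This is only a sketch of the concept, not the algorithm skorecard uses; `merge_similar_bins` and the `min_gap` value are hypothetical, and the counts are taken from the `LIMIT_BAL` report above:

```python
def merge_similar_bins(bins, min_gap):
    """Greedily merge adjacent (events, count) bins with similar event rates."""
    bins = list(bins)
    while len(bins) > 1:
        rates = [events / count for events, count in bins]
        gaps = [abs(rates[i + 1] - rates[i]) for i in range(len(rates) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] >= min_gap:
            break  # all neighbouring bins are sufficiently different
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins


# (events, count) per bin, from the LIMIT_BAL report above.
fine_bins = [(316, 849), (158, 676), (179, 655), (219, 896), (265, 1561), (208, 1363)]

coarse_bins = merge_similar_bins(fine_bins, min_gap=0.05)
for events, count in coarse_bins:
    print(events, count, round(events / count, 4))
```

Only adjacent bins are merged, so the ordering of the numerical feature is preserved in the coarser buckets.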

#### skorecard supports both automatic bin merging (based on statistical properties) and manual merging

In [16]:
from skorecard.bucketers import OptimalBucketer

opti_skorecard_pipeline = make_pipeline(
    OptimalBucketer(variables=num_cols, max_n_bins=6, min_bin_size=0.1),
    LogisticRegression()
)

opti_skorecard_pipeline.fit(X,y)

Pipeline(steps=[('optimalbucketer',
                 OptimalBucketer(max_n_bins=6, min_bin_size=0.1,
                                 variables=['LIMIT_BAL', 'BILL_AMT1'])),
                ('logisticregression', LogisticRegression())])

In [17]:
opti_binner = opti_skorecard_pipeline.steps[0][1] # get the first element of the pipeline, which is our bucketer

### Sneak preview into the manual bucketing
To perform the manual bucketing, the steps are the following:

- The user defines the desired fine classing
- Optionally, the user can then also run the statistical optimization
- Once this is done, the whole pipeline is passed to the `tweak_buckets` function
- This launches a web UI (that can run in a notebook as well as in the browser), where the user can merge the buckets according to the desired logic
- If the statistical optimization was performed, a suggested merging is presented
- After the buckets are adapted, the user can store the object and immediately continue with the pipeline

In [18]:
from skorecard.pipeline import BucketingPipeline, tweak_buckets

prebucket_pipeline = make_pipeline(DecisionTreeBucketer(variables=num_cols, max_n_bins=100, min_bin_size=0.05))
bucket_pipeline = BucketingPipeline(make_pipeline(
    OptimalBucketer(variables=num_cols, max_n_bins=10, min_bin_size=0.05),
    OptimalBucketer(variables=cat_cols, max_n_bins=10, min_bin_size=0.05),
))
pipe = make_pipeline(prebucket_pipeline, bucket_pipeline)
pipe.fit(X, y)

pipe.transform(X).head()


Unnamed: 0,EDUCATION,MARRIAGE,LIMIT_BAL,BILL_AMT1
0,1.0,2.0,9.0,6.0
1,2.0,2.0,4.0,5.0
2,1.0,2.0,9.0,6.0
3,1.0,1.0,5.0,1.0
4,2.0,1.0,9.0,4.0


Launch a web app where the manual tweaking can be done (this is still WIP)

In [19]:
tweak_buckets(pipe, X, y)

Dash app running on http://127.0.0.1:8050/

