# Understanding LGD scoring models

This workbook accompanies my reading of this LinkedIn [Pulse article](https://www.linkedin.com/pulse/understanding-lgd-risk-denis-burakov) by [Denis Burakov](https://linktr.ee/deburky).

- Data: [`loan_data_defaults.csv`](https://github.com/shawn-y-sun/Credit_Risk_Model_LoanDefaults/blob/main/loan_data_defaults.csv), from the repository <https://github.com/shawn-y-sun/Credit_Risk_Model_LoanDefaults>;
- GitHub repo for code reference: <https://github.com/deburky/lgd-scoring-models>.

## 1 Methodologies

Loss Given Default (LGD) models are widely used in risk management to quantify the share of an exposure that is lost when a borrower defaults.

## 2 Modeling

Required: `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `scipy`, `lightgbm`, `optbinning`

In [39]:
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

%config InlineBackend.figure_format = 'retina'

### 2.1 Dataset

Loading dataset

In [21]:
loan_data = pd.read_csv('loan_data_defaults.csv', index_col=0, low_memory=False)
loan_data = loan_data.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])
loan_data[['int_rate', 'recoveries', 'collection_recovery_fee', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'funded_amnt']].head(10)

| | int_rate | recoveries | collection_recovery_fee | total_rec_prncp | total_rec_int | total_rec_late_fee | funded_amnt |
|---|---|---|---|---|---|---|---|
| 1 | 15.27 | 117.08 | 1.11 | 456.46 | 435.17 | 0.0 | 2500 |
| 8 | 21.28 | 189.06 | 2.09 | 162.02 | 294.94 | 0.0 | 5600 |
| 9 | 12.69 | 269.29 | 2.52 | 673.48 | 533.42 | 0.0 | 5375 |
| 12 | 13.49 | 444.3 | 4.16 | 1256.14 | 570.26 | 0.0 | 9000 |
| 14 | 10.65 | 645.1 | 6.3145 | 5433.47 | 1393.42 | 0.0 | 10000 |
| 21 | 12.42 | 0.0 | 0.0 | 10694.96 | 3330.44 | 0.0 | 21000 |
| 24 | 11.71 | 269.31 | 2.57 | 1305.58 | 475.25 | 0.0 | 6000 |
| 26 | 14.27 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15000 |
| 27 | 16.77 | 260.96 | 2.3 | 629.05 | 719.11 | 0.0 | 5000 |
| 46 | 8.9 | 107.0 | 1.07 | 4217.38 | 696.99 | 0.0 | 5000 |


Data dictionary (only the variables relevant to this topic)

- `int_rate`: Interest Rate on the loan
- `MRP`: Maximum recovery period (assumed to be 36 months; a constant defined in the code, not a dataset column)
- `recoveries`: Post charge off gross recovery
- `collection_recovery_fee`: Post charge off collection fee, collected from the obligor
- `total_rec_prncp`: Principal received to date
- `total_rec_int`: Interest received to date
- `total_rec_late_fee`: Late fees received to date
- `funded_amnt`: The total amount committed to that loan at that point in time.


### 2.2 Preprocessing

We compute the realized LGD from post-default data by discounting the recovery cash flows.
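
Written as a formula, the calculation in this section amounts to the following. This is only a sketch of what the code computes: $r$ stands for `int_rate` expressed as a decimal, $\mathrm{CF}$ for the sum of the five recovery columns, and both symbols are my own shorthand rather than notation from the article.

$$
\mathrm{CF} = \text{recoveries} + \text{collection\_recovery\_fee} + \text{total\_rec\_prncp} + \text{total\_rec\_int} + \text{total\_rec\_late\_fee}
$$

$$
\mathrm{LGD} = \min\!\left(1,\ \max\!\left(0,\ 1 - \frac{\mathrm{CF}}{\text{funded\_amnt}\,\left(1 + r/12\right)^{\mathrm{MRP}}}\right)\right)
$$

Rows where the ratio cannot be computed (missing values) are assigned an LGD of 1.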

In [24]:
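# interest rate as a decimal, kept in a local variable so that re-running
# this cell does not divide `int_rate` by 100 again
int_rate = loan_data['int_rate'] / 100

# maximum recovery period in months (assuming 3 years)
MRP = 36

# discount factor: monthly compounding at the loan rate over the whole MRP
loan_data['discount_factor'] = (1 + int_rate / 12) ** MRP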

# recovery cash-flows
loan_data['recovery_cf'] = (loan_data['recoveries'] # post charge-off gross recovery
                            + loan_data['collection_recovery_fee'] # post charge-off collection fee
                            + loan_data['total_rec_prncp'] # principal received to date
                            + loan_data['total_rec_int'] # interest received to date
                            + loan_data['total_rec_late_fee'] # late fees received to date
                            )

# discounting the recovery cash flows, assuming all cash is collected at the end of the MRP
loan_data['recoveries_cf_disc'] = loan_data['recovery_cf'] / loan_data['discount_factor']

# realized LGD calculation
loan_data['LGD'] = (loan_data['funded_amnt'] - loan_data['recoveries_cf_disc']) / loan_data['funded_amnt']

# flooring & capping, null handling
loan_data['LGD'] = np.where(
    loan_data['LGD'] < 0, 0,
    (np.where((loan_data['LGD'] > 1) | (loan_data['LGD'].isnull()), 1,
              loan_data['LGD'])
        )
)
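
To make the arithmetic concrete, here is a minimal sketch that recomputes the LGD by hand for the first row of the table in section 2.1 (index 1). The helper `realized_lgd` is hypothetical and not part of the original notebook; it assumes the rate is passed in percent and divided by 100 exactly once.

```python
def realized_lgd(funded_amnt, recovery_cf, int_rate_pct, mrp_months=36):
    """Realized LGD with all recoveries discounted over the full recovery period."""
    discount_factor = (1 + int_rate_pct / 100 / 12) ** mrp_months
    lgd = (funded_amnt - recovery_cf / discount_factor) / funded_amnt
    return min(max(lgd, 0.0), 1.0)  # floor at 0, cap at 1


# values of row 1: recoveries, collection fee, principal, interest, late fee
recovery_cf = 117.08 + 1.11 + 456.46 + 435.17 + 0.0

lgd_discounted = realized_lgd(2500, recovery_cf, 15.27)
lgd_undiscounted = realized_lgd(2500, recovery_cf, 0.0)  # zero rate = no discounting

print(f"discounted: {lgd_discounted:.4f}, undiscounted: {lgd_undiscounted:.4f}")
```

With a single division by 100 the discounted value comes out near 0.744, while the undiscounted value is about 0.596. The `y.head(10)` output in section 2.3 shows 0.596091 for this row, which is what one would expect if the interest rate had been divided by 100 more than once in the same session (for example by re-running the cell). It may be worth re-running the notebook from a fresh load before comparing numbers.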

Computing the revolving-credit **utilization**

In [26]:
# current account utilization
loan_data['utilization'] = (
    loan_data['revol_bal'].astype(float)           # total revolving credit balance
    / loan_data['total_rev_hi_lim'].astype(float)  # total revolving high credit / credit limit
)
# capping
loan_data['utilization'] = np.where(
    loan_data['utilization'] > 1, 1,
    loan_data['utilization']
)
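
The capping step has a couple of edge cases that are easy to overlook. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show what the same two operations do when the credit limit is zero.

```python
import numpy as np
import pandas as pd

# hypothetical rows: normal, over the limit, zero limit, zero balance and zero limit
toy = pd.DataFrame({
    'revol_bal':        [500.0, 1500.0, 200.0, 0.0],
    'total_rev_hi_lim': [1000.0, 1000.0, 0.0, 0.0],
})

util = toy['revol_bal'] / toy['total_rev_hi_lim']  # x/0 gives inf, 0/0 gives NaN
util_capped = np.where(util > 1, 1, util)          # inf is capped to 1, NaN stays NaN

print(util_capped)
```

A positive balance with a zero limit ends up at the cap of 1, while a 0/0 row stays missing and is only filled later by whatever imputation the model pipeline applies.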

### 2.3 Train/Test split

Features

In [27]:
# features and target used by our models
lgd_cols = ['funded_amnt', # The total amount committed to that loan at that point in time.
            'total_pymnt', # Payments received to date for total amount funded
            'last_pymnt_amnt', # Last total payment amount received
            'zip_code', # The first 3 numbers of the zip code provided by the borrower in the loan application.
            'grade', # LC assigned loan grade
            'utilization', # processed above
            'annual_inc', # The self-reported annual income provided by the borrower during registration.
            'purpose', # A category provided by the borrower for the loan request. 
            'inq_last_6mths', # The number of inquiries in past 6 months (excluding auto and mortgage inquiries)
            'mths_since_last_delinq', # The number of months since the borrower's last delinquency.
            'dti', # A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, 
            # excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
            'LGD' # processed above, our ground truth
            ]

loan_data_lgd = loan_data[lgd_cols].copy() # for safety

loan_data_lgd.rename({
    'total_pymnt': 'payments_received',
    'funded_amnt': 'ead', # assumption: the loans are treated as bullet loans,
    # so the full funded amount is the exposure at default (EAD)
    'LGD': 'lgd',
}, axis=1, inplace=True)

This leaves the following list of candidate features to test:

In [28]:
features_testing = [
    'ead',
    'payments_received',
    'last_pymnt_amnt',
    'zip_code', 
    'grade',
    'utilization',
    'annual_inc',
    'purpose',
    'inq_last_6mths',
    'mths_since_last_delinq',
    'dti']

In [36]:
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr
random_state = 24 # for reproducibility

In [31]:
# features and target
X = loan_data_lgd[features_testing + ['lgd']].copy()
y = X.pop('lgd') # pop() removes `lgd` from X and returns it, so it is assigned to y in one step

# sampling (train / test)
ix_train, ix_test = train_test_split(
    X.index,
    test_size=0.3,
    random_state=random_state
)

print(f"Train: {len(ix_train):,.0f}\nTest: {len(ix_test):,.0f}")


Train: 30,265
Test: 12,971
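
Splitting the index rather than the frame itself means the same labels can later be used with `.loc` on both `X` and `y`. A minimal sketch on a made-up ten-row frame (the frame and its index values are purely illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical frame with a non-default index, like the loan data
toy = pd.DataFrame({'x': range(10)}, index=range(100, 110))

ix_tr, ix_te = train_test_split(toy.index, test_size=0.3, random_state=24)

print(len(ix_tr), len(ix_te))   # sizes of the two index sets
print(set(ix_tr) & set(ix_te))  # labels shared by both sets
print(toy.loc[ix_te])           # the index labels select rows via .loc
```

The two index sets do not overlap, and with `test_size=0.3` the test share is rounded up, which matches the 30,265 / 12,971 split reported above.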


In [33]:
y.head(10)

1     0.596091
8     0.884273
9     0.724902
12    0.747248
14    0.252193
21    0.332149
24    0.657894
26    1.000000
27    0.677732
46    0.000000
Name: lgd, dtype: float64

### 2.4 Linear regression

In [32]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

In [40]:
features_lr = [feature for feature in features_testing if feature != 'ead']

# specify categorical features
feats_cat = X[features_lr].select_dtypes(include=[object]).columns

# one-hot encoding for categorical features
transformer = ColumnTransformer( # applies transformers to columns of an array or pandas DataFrame
    transformers=[
        ('OneHotEncoder',
         OneHotEncoder( # encode categorical features as a one-hot numeric array
             drop='first', # drop the first category of each feature (reference level)
             handle_unknown='ignore'), # categories unseen during fit are encoded as all zeros
         feats_cat
         )
    ],
    remainder='passthrough', # pass the remaining (numerical) columns through unchanged
    verbose_feature_names_out=False # do not prefix output feature names with the transformer name
)

# imputation for missing values
imputer = SimpleImputer( # univariate imputer that fills missing values with a simple strategy
    missing_values=np.nan, # the placeholder treated as missing
    strategy='median' # replace missing values with the column median
)

# instantiate the model
lin_reg = LinearRegression()

# define the pipeline: encode, impute, then regress
sk_lr_model = Pipeline(
    steps=[
        ("transformer", transformer),
        ("imputer", imputer),
        ("regressor", lin_reg)
    ]
)

# training the model
sk_lr_model.fit(X.loc[ix_train][features_lr], y.loc[ix_train])

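Predict on the full dataset (train and test)

In [41]:
# select the training columns explicitly: `X` also holds `ead`, which the pipeline was not fitted on
X['lgd_pred_lr'] = sk_lr_model.predict(X[features_lr])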

Spearman rank correlation between realized and predicted LGD on the test set

In [42]:
spearmanr(y.loc[ix_test],
          X.loc[ix_test]['lgd_pred_lr'])[0]

0.7206451572037508
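
Spearman's coefficient only looks at the ordering of the predictions, not at their level. The following sketch, on made-up numbers, shows that a prediction vector rescaled to fall outside [0, 1] gets the same score as the original one.

```python
import numpy as np
from scipy.stats import spearmanr

# hypothetical realized LGDs and two prediction vectors with the same ordering
realized = np.array([0.10, 0.35, 0.60, 0.90])
pred_a = np.array([0.20, 0.40, 0.50, 0.80])
pred_b = 3 * pred_a - 1  # monotone rescaling; some values are now below 0 or above 1

rho_a = spearmanr(realized, pred_a)[0]
rho_b = spearmanr(realized, pred_b)[0]

print(rho_a, rho_b)
```

This matters for the linear regression above, whose predictions are not constrained to the unit interval: the rank correlation measures discrimination only and says nothing about calibration.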

### 2.5 Logistic regression on WOE

In [45]:
import uuid
from optbinning import BinningProcess, OptimalBinning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

### 2.6 Boosting

## 3 Testing

### 3.1 Discrimination testing

### 3.2 Visualization of discrimination (CLAR curve)