In this notebook, we discuss decision making with the PD model we built. We compute the PD score for individual accounts. Based on the features (variables) of each individual, the model computes the odds of being good versus being bad, in other words how creditworthy the borrower is. For instance:

Feature coefficients

| Variable | Coefficient |
|----------|----------|
| Age [25 - 40]  | 0.235 |
| Age [25 - 40] | 0.143 |
| Education [BS ] | -0.52 |
| Education [MS ] | 0.6 |
| Purpose [Car ] | 0.70 |

The PD score is a linear combination of all the variables, which gives the log odds. For example, with log odds of 3.54:

\begin{equation}
    \ln\left( \frac{1 - PD}{PD} \right) = 3.54 \;\Rightarrow\; \frac{1 - PD}{PD} = e^{3.54} \;\Rightarrow\; 1 - PD = \frac{e^{3.54}}{e^{3.54} + 1}
\end{equation}

Hence 1 - PD (the probability of a good borrower) is about 0.97, which is exactly what the model produces.
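As a quick check of the log-odds arithmetic, the conversion can be sketched in a few lines of plain Python. The helper name `log_odds_to_prob_good` is hypothetical, and 3.54 is only the illustrative log-odds value used in the text, not a value taken from the fitted model.

```python
import math


def log_odds_to_prob_good(log_odds: float) -> float:
    """Convert ln((1 - PD) / PD) into 1 - PD using the logistic function."""
    odds = math.exp(log_odds)      # (1 - PD) / PD
    return odds / (odds + 1.0)     # 1 - PD


log_odds = 3.54                    # illustrative value from the text
prob_good = log_odds_to_prob_good(log_odds)
pd_value = 1.0 - prob_good         # probability of default

print(f"1 - PD = {prob_good:.4f}, PD = {pd_value:.4f}")
```

Running it is a cheap way to confirm the probability quoted for a given log-odds value before relying on it.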

In the following section of this notebook, we will compute the credit score for each of the individuals.

In [1]:
import numpy as np
import pandas as pd
from joblib import load
from logistic_regression_wrapper import LogisticRegressionWrapper

In [2]:
data_path = "./data/"
model_path = "./models/"

In [3]:
# train_data = pd.read_csv(f"{data_path}processed_train.csv")
test_data = pd.read_csv(f"{data_path}processed_test.csv")

In [4]:
# data_path already ends with "/", so no extra separator is needed
woe_vars = pd.read_csv(f"{data_path}woe_cat_vars.csv").squeeze()
ref_vars = pd.read_csv(f"{data_path}woe_ref_vars.csv").squeeze()

print(f"# of woe cat vars: {len(woe_vars)}, # of woe ref vars: {len(ref_vars)}")

# of woe cat vars: 118, # of woe ref vars: 22


Loading the model

In [5]:
pd_model = load(f"{model_path}m_zero.joblib")

In [6]:
required_features = list(ref_vars.values) +  pd_model.feature_names
x_test, y_test = test_data.loc[:, required_features], test_data["good_bad"]
print(f"with refs: [xtest: {x_test.shape}, y_test: {y_test.shape}")

with refs: [xtest: (93257, 96), y_test: (93257,)


##### Compute credit score 

In [7]:
summary = pd.DataFrame(columns=["feature"], data=pd_model.feature_names)
summary["coefficients"] = np.transpose(pd_model.coef_)
summary['p_value'] = pd_model.p_values
summary.index = summary.index + 1
summary.loc[0] = ["intercept", pd_model.intercept_[0], np.nan]
summary = summary.sort_index()
summary

Unnamed: 0,feature,coefficients,p_value
0,intercept,-0.731294,
1,grade:A,1.016916,5.283330e-29
2,grade:B,0.854733,1.268703e-45
3,grade:C,0.676734,7.133004e-34
4,grade:D,0.521357,5.251552e-23
...,...,...,...
70,dti:12.97_16.79,0.048041,2.259660e-03
71,dti:20.75_23.99,-0.078927,3.511072e-06
72,dti:24.9_31.99,-0.135511,5.352269e-17
73,mths_since_last_record:>2,0.074476,1.129349e-05


In [8]:
# Let's add the reference category variables as well
ref_cat_df = pd.DataFrame(ref_vars.values, columns=["feature"])
# reference categories get a coefficient of 0; we will merge them with the summary table
ref_cat_df["coefficients"] = 0
ref_cat_df["p_value"] = np.nan
ref_cat_df

Unnamed: 0,feature,coefficients,p_value
0,grade:G,0,
1,home_ownership:RENT_OTHER_NONE_ANY,0,
2,addr_state:NE_IA_NV_AL_ID_ND_FL_HI,0,
3,verification_status:Verified,0,
4,purpose:educ_small_biz_wedd_renno_enerby_movin...,0,
5,initial_list_status:f,0,
6,term:60,0,
7,emp_length:0,0,
8,months_since_issued_date:>172,0,
9,int_rate:>20.281,0,


In [9]:
score_card_df = pd.concat([summary, ref_cat_df])
score_card_df = score_card_df.reset_index()
score_card_df

Unnamed: 0,index,feature,coefficients,p_value
0,0,intercept,-0.731294,
1,1,grade:A,1.016916,5.283330e-29
2,2,grade:B,0.854733,1.268703e-45
3,3,grade:C,0.676734,7.133004e-34
4,4,grade:D,0.521357,5.251552e-23
...,...,...,...,...
92,17,total_acc:<=28,0.000000,
93,18,acc_now_delinq:0,0.000000,
94,19,dti:>32,0.000000,
95,20,mths_since_last_record:0_2,0.000000,


In [10]:
score_card_df["original_featre_name"] = score_card_df['feature'].str.split(':').str[0]
score_card_df

Unnamed: 0,index,feature,coefficients,p_value,original_featre_name
0,0,intercept,-0.731294,,intercept
1,1,grade:A,1.016916,5.283330e-29,grade
2,2,grade:B,0.854733,1.268703e-45,grade
3,3,grade:C,0.676734,7.133004e-34,grade
4,4,grade:D,0.521357,5.251552e-23,grade
...,...,...,...,...,...
92,17,total_acc:<=28,0.000000,,total_acc
93,18,acc_now_delinq:0,0.000000,,acc_now_delinq
94,19,dti:>32,0.000000,,dti
95,20,mths_since_last_record:0_2,0.000000,,mths_since_last_record


Now we will convert the PD model into a simple score. The range (min, max) will be (300, 850), just like FICO.
The minimum credit assessment is reached when a borrower falls into the 'worst' category of every variable. Similarly, the maximum credit assessment is reached when a borrower falls into the 'best' category of every variable.

In [11]:
min_score, max_score = 300, 850

In [12]:
original_coef = score_card_df.groupby("original_featre_name")['coefficients']
original_coef.describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
original_featre_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
acc_now_delinq,1.0,0.0,,0.0,0.0,0.0,0.0,0.0
addr_state,13.0,0.177448,0.144407,0.0,0.07499,0.138128,0.215559,0.518929
annual_inc,10.0,0.373832,0.211923,0.0,0.219356,0.41762,0.531365,0.636967
delinq_2yrs,2.0,0.025249,0.035708,0.0,0.012625,0.025249,0.037874,0.050498
dti,6.0,0.026651,0.123744,-0.135511,-0.059195,0.024021,0.120472,0.181689
emp_length,6.0,0.13904,0.074437,0.0,0.13775,0.145555,0.187927,0.205931
grade,7.0,0.514795,0.363873,0.0,0.266912,0.521357,0.765733,1.016916
home_ownership,3.0,0.070425,0.062279,0.0,0.046516,0.093031,0.105637,0.118243
initial_list_status,2.0,0.03465,0.049002,0.0,0.017325,0.03465,0.051975,0.0693
inq_last_6mths,4.0,0.410669,0.308777,0.0,0.273277,0.464973,0.602365,0.71273


In [13]:
min_sum_coef= original_coef.min().sum()
max_sum_coef = original_coef.max().sum()
print(f"min sum coeff: {min_sum_coef}, max sum coeff: {max_sum_coef}")

min sum coeff: -1.690752930331029, max sum coeff: 6.260874700119958
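
The worst-case and best-case sums can be illustrated on a toy scorecard. The sketch below uses plain Python rather than pandas, and every dummy name and coefficient in `toy_coefficients` is made up for illustration. It mirrors the group-by step: take the minimum (or maximum) coefficient within each original feature, then add them up. The intercept forms its own group, so it contributes to both sums.

```python
from collections import defaultdict

# Hypothetical dummy-variable coefficients; reference categories have coefficient 0.
toy_coefficients = {
    "intercept": -0.7,
    "grade:A": 1.0, "grade:B": 0.5, "grade:G": 0.0,
    "dti:low": 0.2, "dti:high": -0.1, "dti:>32": 0.0,
}

# Group the coefficients by the original feature name (the part before ':').
by_feature = defaultdict(list)
for dummy_name, coef in toy_coefficients.items():
    by_feature[dummy_name.split(":")[0]].append(coef)

# Worst borrower: lowest coefficient in every feature; best borrower: highest.
toy_min_sum = sum(min(coefs) for coefs in by_feature.values())
toy_max_sum = sum(max(coefs) for coefs in by_feature.values())

print(f"min sum: {toy_min_sum:.2f}, max sum: {toy_max_sum:.2f}")
```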


Now, how do we scale the dummy variable coefficients to a credit score?


\begin{equation}
    variable\_score = variable\_coeff \times  \frac{\left( max\_score - min\_score \right )}{\left( max\_sumof\_coef - min\_sumof\_coef \right)}
\end{equation}
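
To make the scaling concrete, here is a small sketch that applies the formula to one coefficient. The constants are copied from the notebook output (the `grade:A` coefficient and the two coefficient sums); the function name `variable_score` is hypothetical.

```python
min_score, max_score = 300, 850
min_sum_coef = -1.690752930331029   # printed by the notebook
max_sum_coef = 6.260874700119958    # printed by the notebook


def variable_score(coef: float) -> float:
    """Scale a dummy-variable coefficient into score points."""
    return coef * (max_score - min_score) / (max_sum_coef - min_sum_coef)


grade_a_score = variable_score(1.016916)   # coefficient of grade:A
reference_score = variable_score(0.0)      # reference categories stay at 0

print(f"grade:A -> {grade_a_score:.2f} points, reference -> {reference_score:.2f} points")
```

Because the mapping is a pure rescaling, reference categories keep a score of 0 and the ordering of categories within a feature is unchanged.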


In [14]:
score_card_df['score_compute'] = score_card_df["coefficients"] * (max_score - min_score) / (max_sum_coef - min_sum_coef)
score_card_df

Unnamed: 0,index,feature,coefficients,p_value,original_featre_name,score_compute
0,0,intercept,-0.731294,,intercept,-50.582288
1,1,grade:A,1.016916,5.283330e-29,grade,70.338312
2,2,grade:B,0.854733,1.268703e-45,grade,59.120366
3,3,grade:C,0.676734,7.133004e-34,grade,46.808481
4,4,grade:D,0.521357,5.251552e-23,grade,36.061350
...,...,...,...,...,...,...
92,17,total_acc:<=28,0.000000,,total_acc,0.000000
93,18,acc_now_delinq:0,0.000000,,acc_now_delinq,0.000000
94,19,dti:>32,0.000000,,dti,0.000000
95,20,mths_since_last_record:0_2,0.000000,,mths_since_last_record,0.000000


For the intercept:
\begin{equation}
    intercept\_score = \frac{\left( intercept\_coeff - min\_sumof\_coef \right)}{\left( max\_sumof\_coef - min\_sumof\_coef \right)} \times \left( max\_score - min\_score \right) + min\_score
\end{equation}
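
The following sketch checks why the intercept is treated differently: with this offset, the worst possible borrower lands on the minimum score and the best possible borrower on the maximum score. The constants are copied from the notebook output; the variable names `worst_total` and `best_total` are hypothetical. It relies on the intercept being its own group in the coefficient sums, so the non-intercept part of each sum is the sum minus the intercept.

```python
min_score, max_score = 300, 850
min_sum_coef = -1.690752930331029   # printed by the notebook (includes the intercept)
max_sum_coef = 6.260874700119958    # printed by the notebook (includes the intercept)
intercept_coef = -0.731294          # from the summary table

points_per_unit = (max_score - min_score) / (max_sum_coef - min_sum_coef)

# Intercept: shifted so that the overall scale starts at min_score.
intercept_score = (intercept_coef - min_sum_coef) * points_per_unit + min_score

# Unrounded totals for the worst and best possible borrowers.
worst_total = intercept_score + (min_sum_coef - intercept_coef) * points_per_unit
best_total = intercept_score + (max_sum_coef - intercept_coef) * points_per_unit

print(f"intercept score: {intercept_score:.3f}")
print(f"worst total: {worst_total:.1f}, best total: {best_total:.1f}")
```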

In [15]:
# use .loc to assign in place and avoid chained-assignment warnings
score_card_df.loc[0, 'score_compute'] = ((score_card_df.loc[0, "coefficients"] - min_sum_coef) / (max_sum_coef - min_sum_coef)) * (max_score - min_score) + min_score
score_card_df

Unnamed: 0,index,feature,coefficients,p_value,original_featre_name,score_compute
0,0,intercept,-0.731294,,intercept,366.364097
1,1,grade:A,1.016916,5.283330e-29,grade,70.338312
2,2,grade:B,0.854733,1.268703e-45,grade,59.120366
3,3,grade:C,0.676734,7.133004e-34,grade,46.808481
4,4,grade:D,0.521357,5.251552e-23,grade,36.061350
...,...,...,...,...,...,...
92,17,total_acc:<=28,0.000000,,total_acc,0.000000
93,18,acc_now_delinq:0,0.000000,,acc_now_delinq,0.000000
94,19,dti:>32,0.000000,,dti,0.000000
95,20,mths_since_last_record:0_2,0.000000,,mths_since_last_record,0.000000


In [16]:
score_card_df['score_compute_rounded'] = score_card_df['score_compute'].round()
score_card_df.head()

Unnamed: 0,index,feature,coefficients,p_value,original_featre_name,score_compute,score_compute_rounded
0,0,intercept,-0.731294,,intercept,366.364097,366.0
1,1,grade:A,1.016916,5.28333e-29,grade,70.338312,70.0
2,2,grade:B,0.854733,1.268703e-45,grade,59.120366,59.0
3,3,grade:C,0.676734,7.133004e-34,grade,46.808481,47.0
4,4,grade:D,0.521357,5.2515520000000005e-23,grade,36.06135,36.0


In [17]:
score_card_df["score"] = score_card_df["score_compute_rounded"]
# manual adjustments so that the maximum total reaches 850 (assigned with .loc to avoid chained assignment)
score_card_df.loc[56, "score"] = 4   # was rounded down to 3 when the actual value was 3.49
score_card_df.loc[24, "score"] = 16  # was rounded down to 15 when the actual value was 15.47


In [18]:
# Let's check the boundaries for the minimum and maximum score.
min_score_from_model = score_card_df.groupby("original_featre_name")["score_compute_rounded"].min().sum()
max_score_from_model = score_card_df.groupby("original_featre_name")["score_compute_rounded"].max().sum()

print(f"Score + intercept, min: {min_score_from_model} max: {max_score_from_model}")

Score + intercept, min: 300.0 max: 848.0


The maximum score is 848 rather than 850 because each individual score was rounded before summing.
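
A tiny sketch shows how rounding each term can shift a total. The values 3.49 and 15.47 appear in the notebook's comments; the third value, 1.04, is made up so that the exact total is a round number.

```python
exact_points = [3.49, 15.47, 1.04]   # 1.04 is a hypothetical filler value

rounded_points = [round(p) for p in exact_points]   # each term rounds down
sum_of_rounded = sum(rounded_points)                # total after rounding each term
rounded_sum = round(sum(exact_points))              # total rounded only once

print(f"sum of rounded terms: {sum_of_rounded}, rounded exact sum: {rounded_sum}")
```

Each term loses a little less than half a point, and these losses add up to a whole point in the total. Bumping a few individual scores by one point is one simple way to restore the intended maximum.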

In [19]:
min_score_from_model = score_card_df.groupby("original_featre_name")["score"].min().sum()
max_score_from_model = score_card_df.groupby("original_featre_name")["score"].max().sum()
print(f"Score + intercept, min: {min_score_from_model} max: {max_score_from_model}")

Score + intercept, min: 300.0 max: 850.0


In [20]:
score_card_df.head()


Unnamed: 0,index,feature,coefficients,p_value,original_featre_name,score_compute,score_compute_rounded,score
0,0,intercept,-0.731294,,intercept,366.364097,366.0,366.0
1,1,grade:A,1.016916,5.28333e-29,grade,70.338312,70.0,70.0
2,2,grade:B,0.854733,1.268703e-45,grade,59.120366,59.0,59.0
3,3,grade:C,0.676734,7.133004e-34,grade,46.808481,47.0,47.0
4,4,grade:D,0.521357,5.2515520000000005e-23,grade,36.06135,36.0,36.0


In [21]:
x_test.head()

Unnamed: 0,grade:G,home_ownership:RENT_OTHER_NONE_ANY,addr_state:NE_IA_NV_AL_ID_ND_FL_HI,verification_status:Verified,purpose:educ_small_biz_wedd_renno_enerby_moving_other_house,initial_list_status:f,term:60,emp_length:0,months_since_issued_date:>172,int_rate:>20.281,...,open_acc:>=31,pub_rec:3_4,total_acc:28_50,dti:1.6_6.39,dti:6.39_10.39,dti:12.97_16.79,dti:20.75_23.99,dti:24.9_31.99,mths_since_last_record:>2,total_rev_hi_lim:>95K
0,False,0,0,True,0,False,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,False,1,0,False,0,False,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,False,1,1,False,0,True,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,False,1,0,True,0,False,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,False,0,0,False,0,False,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Now we will compute the credit score. First we have to insert an intercept column at column index 0 in order to compute the following:

\begin{equation}
    credit\_score = intercept + \beta_{grade:A} \times grade:A + \beta_{grade:B} \times grade:B + \dots + \beta_{M:N} \times M:N
\end{equation}

With the intercept column in the DataFrame, the credit score is the dot product between each individual's row and the score vector.
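
The dot product can be sketched on a deliberately tiny scorecard. The three score values (366, 70, 59) are taken from the notebook's table, but the scorecard here has only one feature, so the totals are illustrative and do not correspond to real borrowers.

```python
# Column order must match between the scores and every borrower row.
feature_names = ["intercept", "grade:A", "grade:B", "grade:G"]
scores = [366.0, 70.0, 59.0, 0.0]   # grade:G is the reference category

# One row per borrower: the intercept column is always 1, dummies are 0/1.
borrowers = [
    [1, 1, 0, 0],   # grade A
    [1, 0, 1, 0],   # grade B
    [1, 0, 0, 1],   # grade G (reference)
]

# Row-wise dot product: the intercept score plus the score of each active dummy.
credit_scores = [
    sum(x * s for x, s in zip(row, scores))
    for row in borrowers
]

print(credit_scores)
```

The constant intercept column is what lets the intercept score take part in the same product as the dummy variables.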

In [22]:
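x_test_copy = x_test.copy()  # work on a real copy so that x_test itself is not modified
x_test_copy.insert(0, 'intercept', 1)  # insert the intercept column at position 0
x_test_copy = x_test_copy[score_card_df["feature"].values]  # align the column order with the scorecard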
x_test_copy.head()

Unnamed: 0,intercept,grade:A,grade:B,grade:C,grade:D,grade:E,grade:F,home_ownership:OWN,home_ownership:MORTGAGE,addr_state:NY,...,months_since_earliest_cr_line:<171,delinq_2yrs:>7,inq_last_6mths:>6,open_acc:<4,pub_rec:0_2,total_acc:<=28,acc_now_delinq:0,dti:>32,mths_since_last_record:0_2,total_rev_hi_lim:<=5K
0,1,False,False,True,False,False,False,True,False,True,...,0,0,0,0,1,1,1,0,1,0
1,1,True,False,False,False,False,False,False,False,False,...,0,0,0,0,1,0,1,0,1,0
2,1,False,True,False,False,False,False,False,False,False,...,0,0,0,1,1,1,1,0,1,0
3,1,False,False,True,False,False,False,False,False,False,...,0,0,0,0,1,1,1,0,1,0
4,1,False,False,True,False,False,False,False,True,False,...,0,0,0,0,1,1,1,0,0,0


In [23]:
scores = score_card_df["score"]
scores = scores.values.reshape(scores.shape[0], 1)
print(f"input data: {x_test_copy.shape}, score data: {scores.shape}")

input data: (93257, 97), score data: (97, 1)


In [24]:
y_scores = x_test_copy.dot(scores)
y_scores.head()

Unnamed: 0,0
0,604.0
1,643.0
2,588.0
3,563.0
4,597.0


`y_scores` is the credit score based on our PD model. The PD model and the credit score serve the same purpose. Let's convert the credit score back to a probability. This can be achieved using the following formula.

*From credit score to PD*

\begin{equation}
    sum\_of\_coef\_from\_score = \frac{\left( total\_score - min\_score \right)}{\left( max\_score - min\_score \right)} \times \left( max\_sumof\_coef - min\_sumof\_coef \right) + min\_sumof\_coef
\end{equation}

Let's turn that sum of coefficients into a probability. Since the model's log odds are those of being good, the result is the probability of being a good borrower, 1 - PD:

\begin{equation}
    \frac{e^{sum\_of\_coef\_from\_score}}{1 + e^{sum\_of\_coef\_from\_score}}
\end{equation}

This is equivalent to the sigmoid function:

\begin{equation}
    \frac{1}{1 + e^{-sum\_of\_coef\_from\_score}}
\end{equation}
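
As a scalar sanity check, the sketch below converts a single credit score back to a probability and confirms that the two algebraic forms agree. The constants are copied from the notebook output, and 604 is the first score in `y_scores`; the function names are hypothetical.

```python
import math

min_score, max_score = 300, 850
min_sum_coef = -1.690752930331029   # printed by the notebook
max_sum_coef = 6.260874700119958    # printed by the notebook


def score_to_log_odds(total_score: float) -> float:
    """Map a credit score back onto the model's sum-of-coefficients scale."""
    fraction = (total_score - min_score) / (max_score - min_score)
    return fraction * (max_sum_coef - min_sum_coef) + min_sum_coef


def prob_exp_form(z: float) -> float:
    return math.exp(z) / (1.0 + math.exp(z))


def prob_sigmoid_form(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


z = score_to_log_odds(604.0)
prob_good = prob_sigmoid_form(z)

print(f"log odds: {z:.4f}, probability of good: {prob_good:.6f}")
```

For large positive or negative log odds the sigmoid form is usually the safer one to evaluate, since only one exponential is computed.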


In [25]:
sum_of_coef_from_score = ((y_scores - min_score) / (max_score - min_score)) * (max_sum_coef - min_sum_coef) + min_sum_coef
sum_of_coef_from_score = sum_of_coef_from_score.astype(float)  # cast to float; otherwise np.exp raises an error

In [26]:
y_hat_probab_from_score = np.exp(sum_of_coef_from_score) / (np.exp(sum_of_coef_from_score) + 1) 
y_hat_probab_from_score.head()

Unnamed: 0,0
0,0.937282
1,0.963321
2,0.922228
3,0.892023
4,0.931062


The probabilities recovered from the credit scores above match those predicted by our PD model, up to the rounding applied to the scores.