# PD - Probability of Default
This notebook trains a model to predict the likelihood that a borrower defaults (fails to pay back a debt). We start with feature selection.


In [1]:
# Data Science
import numpy as np
import pandas as pd

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# ignore all warnings
import warnings
warnings.filterwarnings('ignore')

# data preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from sklearn.feature_selection import chi2

# DataFrame display settings (full option name instead of the abbreviated 'display.float')
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

## Feature selection
First, we start from the features selected in `simple_feature_selection.ipynb`.

In [15]:
df = pd.read_csv('../data/loan_data.csv')
# %store -r FIRST_SELECTED_FEATURES
# df = df[FIRST_SELECTED_FEATURES]
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,...,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,68407277,,3600.0,3600.0,3600.0,36 months,13.99,123.03,C,C4,leadman,10+ years,MORTGAGE,55000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,190xx,PA,5.91,...,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
1,68355089,,24700.0,24700.0,24700.0,36 months,11.99,820.28,C,C1,Engineer,10+ years,MORTGAGE,65000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,small_business,Business,577xx,SD,16.06,...,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
2,68341763,,20000.0,20000.0,20000.0,60 months,10.78,432.66,B,B4,truck driver,10+ years,MORTGAGE,63000.0,Not Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,home_improvement,,605xx,IL,10.78,...,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
3,66310712,,35000.0,35000.0,35000.0,60 months,14.85,829.9,C,C5,Information Systems Officer,10+ years,MORTGAGE,110000.0,Source Verified,Dec-2015,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,076xx,NJ,17.06,...,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
4,68476807,,10400.0,10400.0,10400.0,60 months,22.45,289.91,F,F1,Contract Specialist,3 years,MORTGAGE,104433.0,Source Verified,Dec-2015,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,major_purchase,Major purchase,174xx,PA,25.37,...,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,


To comply with regulatory requirements, PD models should be **explainable and interpretable**. A common way to achieve this is to build a credit scorecard: each variable is split into categories, and each category is assigned a number of points that raises or lowers the credit score.
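
As a rough illustration of the scorecard idea, the sketch below adds hypothetical points per category to a base score. The base score, the point values, and the helper `score_applicant` are all made up for illustration; in practice the points would be derived from a fitted model rather than set by hand.

```python
# Hypothetical scorecard: all numbers below are illustrative, not fitted values.
BASE_SCORE = 600
POINTS = {
    "grade": {"A": 40, "B": 20, "C": 0, "D": -20},
    "home_ownership": {"MORTGAGE": 15, "OWN": 10, "RENT": -10},
}


def score_applicant(applicant):
    """Sum the base score and the points of each category the applicant falls into."""
    total = BASE_SCORE
    for feature, table in POINTS.items():
        # Categories missing from the table contribute 0 points.
        total += table.get(applicant[feature], 0)
    return total


example_score = score_applicant({"grade": "A", "home_ownership": "RENT"})
```

Because every category maps to a fixed number of points, the contribution of each variable to the final score can be read directly from the table, which is what makes this kind of model easy to explain.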

A scorecard therefore uses categorical variables and binned numeric variables. Some variables have a large number of unique values (high cardinality), so it helps to examine how each category relates to the target variable and, where reasonable, to aggregate some categories.
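
The binning and aggregation steps described above can be sketched with pandas on toy data. The values, the bin edges, the rare-category threshold, and the `is_default` target column are assumptions made for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "annual_inc": [25_000, 40_000, 55_000, 70_000, 90_000, 150_000, 32_000, 61_000],
    "purpose": ["car", "wedding", "debt_consolidation", "debt_consolidation",
                "credit_card", "credit_card", "vacation", "car"],
    "is_default": [1, 0, 0, 1, 0, 0, 1, 0],  # hypothetical binary target
})

# Bin a numeric variable into ordered income bands (edges are illustrative).
toy["annual_inc_bin"] = pd.cut(
    toy["annual_inc"],
    bins=[0, 35_000, 65_000, float("inf")],
    labels=["low", "mid", "high"],
)

# Count and default rate per category, to see how categories relate to the target.
purpose_stats = toy.groupby("purpose")["is_default"].agg(["count", "mean"])

# Aggregate rare categories (here: fewer than 2 observations) into "other".
rare = purpose_stats.index[purpose_stats["count"] < 2]
toy["purpose_grouped"] = toy["purpose"].where(~toy["purpose"].isin(rare), "other")
```

On real data the threshold for "rare" and the bin edges would need to be chosen by inspecting the distributions rather than fixed in advance.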

**Weight of evidence (WOE)** and **information value (IV)** are also commonly used to screen features.
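
A minimal sketch of how WOE and IV can be computed for one categorical feature is shown below. It assumes a binary target where 1 means default ("bad") and 0 means non-default ("good"), and uses the convention WOE = ln(%good / %bad); some references flip the ratio, which changes the sign of WOE but not the IV. The helper `woe_iv`, the `is_default` column, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def woe_iv(data, feature, target):
    """Return a per-category WOE table and the feature's total IV.

    Assumes `target` is binary with 1 = default (bad) and 0 = non-default (good).
    """
    grouped = data.groupby(feature)[target].agg(n="count", bad="sum")
    grouped["good"] = grouped["n"] - grouped["bad"]
    # Share of all goods / all bads that fall into each category.
    grouped["pct_good"] = grouped["good"] / grouped["good"].sum()
    grouped["pct_bad"] = grouped["bad"] / grouped["bad"].sum()
    # A category with zero goods or zero bads yields an infinite WOE;
    # this sketch does not smooth such cases.
    grouped["woe"] = np.log(grouped["pct_good"] / grouped["pct_bad"])
    grouped["iv"] = (grouped["pct_good"] - grouped["pct_bad"]) * grouped["woe"]
    return grouped, grouped["iv"].sum()


# Toy data: grade A has 8 goods and 2 bads, grade B has 4 goods and 6 bads.
toy = pd.DataFrame({
    "grade": ["A"] * 10 + ["B"] * 10,
    "is_default": [0] * 8 + [1] * 2 + [0] * 4 + [1] * 6,
})
table, iv = woe_iv(toy, "grade", "is_default")
```

With this convention, a positive WOE marks a category with relatively more non-defaulters than the overall population, and a feature with a higher IV separates the two classes more strongly.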

With limited resources, I chose a smaller set of features that I consider important, listed below (most are taken from [kaggle](https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction)).

<table class="dataframe">
  <thead>
    <tr style="text-align: left;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>delinq_2yrs</td>
      <td>Number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>7</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>8</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>9</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>10</th>
      <td>issue_d</td>
      <td>The month in which the loan was funded</td>
    </tr>
    <tr>
      <th>11</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>12</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>13</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>14</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>15</th>
      <td>inq_last_6mths</td>
      <td>Number of credit inquiries in the last 6 months.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>17</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>18</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>19</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
  </tbody>
</table>

In [34]:
FINAL_SELECTED_FEATURES = [
    "id", "loan_amnt", "term", "int_rate", "installment", "grade", "emp_length", 
    "home_ownership", "annual_inc", "verification_status", "issue_d", "loan_status", "purpose", "addr_state", 
    "dti", "open_acc", "pub_rec", "total_acc", "initial_list_status", "delinq_2yrs", 
    "inq_last_6mths",
]

In [36]:
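# Take an explicit copy so later column assignments do not act on a view of the original frame
df = df[FINAL_SELECTED_FEATURES].copy()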
df.isna().sum()

id                         0
loan_amnt                  4
term                       4
int_rate                   4
installment                4
grade                      4
emp_title              37218
emp_length             36906
home_ownership             4
annual_inc                 4
verification_status        4
issue_d                    4
loan_status                4
purpose                    4
addr_state                 5
dti                      354
open_acc                   5
pub_rec                    5
total_acc                  5
initial_list_status        5
delinq_2yrs                5
inq_last_6mths             5
dtype: int64

In [24]:
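# 'number' covers every integer and float width, independent of platform defaults
numerical_features = df.select_dtypes(include='number').columns

# String-valued columns are stored with the object dtype
categorical_features = df.select_dtypes(include=['object']).columns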

## Data Cleaning and Feature Transformation

In [33]:
df['issue_d'].isnull().sum()

4
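
One possible way to handle these few missing values is sketched below on toy data: drop the rows without an issue date, then parse the `Mon-YYYY` strings into timestamps. Whether dropping is appropriate for the real data is an assumption, as is the date format, which is inferred from values such as `Dec-2015` in the preview above. Note that `%b` expects English month abbreviations under the default locale.

```python
import pandas as pd

# Toy frame imitating the issue_d format seen in the data preview.
toy = pd.DataFrame({
    "issue_d": ["Dec-2015", "Jan-2016", None, "Mar-2014"],
    "loan_amnt": [3600.0, 24700.0, None, 10400.0],
})

# Drop rows with no issue date (assumed here to be unusable records).
cleaned = toy.dropna(subset=["issue_d"]).copy()

# Parse "Mon-YYYY" strings; each date becomes the first day of that month.
cleaned["issue_d"] = pd.to_datetime(cleaned["issue_d"], format="%b-%Y")
```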