<h1 align="center">Loan - Credit Risk & Population Stability</h1>
<h2>About Dataset</h2>
<p>Loan - Credit Risk & Population Stability is part of the LendingClub public database.
LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.</p>
<p>The data was divided into two parts.</p>
<p>The first file (loan_2014-2018.csv) contains almost 1,800,000 consumer loans issued from 2014 to 2018.
The second file (loan_2019-2020.csv) is kept separate to check whether a model still has similar characteristics and remains up to date.</p>
<p><strong>In this project, I use only the "loan_2019-2020.csv" dataset.</strong></p>
<h2>Data Source</h2>
<p>Kaggle: <a href="https://www.kaggle.com/datasets/beatafaron/loan-credit-risk-and-population-stability">Loan - Credit Risk & Population Stability</a></p>

In [9]:
import pandas as pd
from IPython.display import display

# 1. Data Collection

## 1. Understanding Data

<strong>Understand what the dataset contains</strong>

In [2]:
df_dict = pd.read_excel("./datasets/LoanDataDictionary.xlsx")

In [3]:
# Display all rows and the full text of every column
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [4]:
display(df_dict.head())

Unnamed: 0,LoanStatNew,Description
0,acc_now_delinq,The number of accounts on which the borrower is now delinquent.
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan application
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by the borrower during registration.


In [8]:
df_dict[df_dict['LoanStatNew'] == 'total_cu_tl']

Unnamed: 0,LoanStatNew,Description
103,total_cu_tl,Number of finance trades
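
Looking up one row at a time gets slow once many columns are under review. The following is a minimal sketch of a batch lookup, assuming the dictionary keeps the `LoanStatNew` and `Description` columns shown above and that each feature name appears only once. The `toy_dict` frame and the `describe_features` helper are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_dict, with descriptions copied from the outputs above.
toy_dict = pd.DataFrame({
    "LoanStatNew": ["acc_now_delinq", "annual_inc", "total_cu_tl"],
    "Description": [
        "The number of accounts on which the borrower is now delinquent.",
        "The self-reported annual income provided by the borrower during registration.",
        "Number of finance trades",
    ],
})


def describe_features(dictionary: pd.DataFrame, names: list) -> pd.Series:
    """Map each requested feature name to its description.

    Names that are missing from the dictionary map to NaN.
    Assumes the feature names in the dictionary are unique.
    """
    lookup = dictionary.set_index("LoanStatNew")["Description"]
    return lookup.reindex(names)


descriptions = describe_features(
    toy_dict, ["total_cu_tl", "acc_now_delinq", "not_a_column"]
)
print(descriptions)
```

With the real dictionary, passing `df_dict` in place of `toy_dict` should behave the same way as long as the column names match.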


## 2. Feature Selection

<strong>Conclusion</strong>
<p>Because the dataset is too large to work with in full, I select only the features that are relevant to the analysis and save them to a new CSV file. The selection follows these rules:</p>
<ul>
<li>The dataset focuses on high-risk loans.</li>
<li>Loans whose "loan_status" is still outstanding are not included.</li>
<li>More than 100 features are deselected because they are redundant, administrative, or only reflect the loan's status after it was granted.</li>
<li>The list below has 53 entries (some names appear more than once), chosen because, in my opinion, they have a direct relationship with the risk of default.</li>
</ul>
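
The rule about outstanding loans can be sketched as a simple row filter. This is only an illustration under stated assumptions: the notebook does not say which `loan_status` labels count as outstanding, so the labels in `OUTSTANDING_STATUSES` and the toy `loans` frame are hypothetical.

```python
import pandas as pd

# Toy data; the status labels are assumed, not taken from the notebook output.
loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off", "In Grace Period", "Fully Paid"],
    "dti": [12.5, 30.1, 27.8, 18.0, 9.4],
})

# Assumption: these labels mark loans whose final outcome is not yet known.
OUTSTANDING_STATUSES = {"Current", "In Grace Period"}

# Keep only loans with a settled outcome and renumber the rows.
is_outstanding = loans["loan_status"].isin(OUTSTANDING_STATUSES)
resolved = loans[~is_outstanding].reset_index(drop=True)

print(resolved["loan_status"].value_counts())
```

Before applying this to the real file, it is worth inspecting `df["loan_status"].unique()` to see which labels are actually present.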

In [7]:
selected_features = [
    "acc_now_delinq",  # 1. Number of accounts currently delinquent
    "all_util",  # 2. Balance to credit limit ratio on all trades
    "bc_util",  # 3. Percentage of bankcard credit usage
    "delinq_2yrs",  # 4. Number of delinquency incidents (30+ days past due) in the last 2 years
    "delinq_amnt",  # 5. Total past-due amount for delinquent accounts
    "dti",  # 6. Debt-to-income ratio
    "fico_range_high",  # 7. Upper bound of the borrower's FICO score range
    "fico_range_low",  # 8. Lower bound of the borrower's FICO score range
    "int_rate",  # 9. Interest rate of the loan
    "loan_status",  # 10. Current status of the loan (e.g., charged-off, fully paid, late)
    "num_accts_ever_120_pd",  # 11. Number of accounts ever 120+ days past due
    "num_tl_120dpd_2m",  # 12. Number of accounts currently 120 days past due (updated in the past 2 months)
    "num_tl_30dpd",  # 13. Number of accounts currently 30+ days past due
    "num_tl_90g_dpd_24m",  # 14. Number of accounts 90+ days past due in the last 24 months
    "mo_sin_old_il_acct",  # 15. Months since oldest installment account opened
    "mo_sin_old_rev_tl_op",  # 16. Months since oldest revolving account opened
    "mo_sin_rcnt_rev_tl_op",  # 17. Months since most recent revolving account opened
    "mo_sin_rcnt_tl",  # 18. Months since most recent account opened
    "mort_acc",  # 19. Number of mortgage accounts
    "num_actv_bc_tl",  # 20. Number of active bankcard accounts
    "num_actv_rev_tl",  # 21. Number of active revolving accounts
    "num_bc_sats",  # 22. Number of satisfactory bankcard accounts
    "num_bc_tl",  # 23. Number of bankcard accounts
    "num_il_tl",  # 24. Number of installment loan accounts
    "num_op_rev_tl",  # 25. Number of open revolving accounts
    "num_rev_accts",  # 26. Number of revolving accounts
    "num_rev_tl_bal_gt_0",  # 27. Number of revolving trades with a balance greater than 0
    "revol_bal",  # 28. Total revolving balance
    "revol_util",  # 29. Revolving credit utilization percentage
    "total_bal_ex_mort",  # 30. Total balance excluding mortgage
    "total_bc_limit",  # 31. Total credit limit for bankcards
    "total_il_high_credit_limit",  # 32. Total high credit limit for installment loans
    "num_sats",  # 33. Number of satisfactory accounts
    "num_tl_op_past_12m",  # 34. Number of accounts opened in the past 12 months
    "pct_tl_nvr_dlq",  # 35. Percentage of trades that were never delinquent
    "percent_bc_gt_75",  # 36. Percentage of bankcards with balances greater than 75% of the credit limit
    "total_bal_il",  # 37. Total installment loan balance
    "total_rev_hi_lim",  # 38. Total revolving high credit limit
    "mths_since_recent_bc_dlq",  # 39. Months since the last delinquency on a bankcard
    "mths_since_recent_inq",  # 40. Months since the last credit inquiry
    "mths_since_recent_revol_delinq",  # 41. Months since the last revolving delinquency
    "num_rev_accts", # 42. Number of revolving accounts
    "num_op_rev_tl", # 43. The number of revolving accounts that are still open
    "num_actv_rev_tl", # 44. Number of active revolving accounts
    "num_bc_sats", # 45. Number of credit card accounts that are in good standing
    "num_tl_90g_dpd_24m", # 46. Number of accounts that have experienced delays of more than 90 days in the last 24 months
    "num_tl_op_past_12m", # 47. Number of credit accounts opened in the last 12 months
    "mo_sin_old_il_acct", # 48. Length of time since the first installment account was opened
    "mo_sin_rcnt_tl", # 49. Length of time since the latest credit account was opened
    "num_il_tl", # 50. Number of installment accounts
    "num_tl_30dpd", # 51. The number of accounts experiencing delays of more than 30 days at this time
    "num_tl_120dpd_2m", # 52. Number of accounts that are more than 120 days late in the last 2 months
    "num_bc_tl", # 53. Total number of credit card accounts
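]

# Entries 42-53 repeat names already listed above; drop the repeats, keeping first-seen order
selected_features = list(dict.fromkeys(selected_features))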
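
A hand-written list of this length is easy to get wrong. The sketch below shows one way to spot repeated names and names that do not exist in the data. The short `candidate_features` list and the `known_columns` set are hypothetical stand-ins, not the notebook's real values.

```python
from collections import Counter

# Short hypothetical list that repeats two names.
candidate_features = ["dti", "int_rate", "num_bc_tl", "dti", "num_bc_tl"]

# Hypothetical set of column names known to exist in the data.
known_columns = {"dti", "int_rate", "num_bc_tl", "loan_status"}

# Names that occur more than once, in first-seen order.
repeated = [name for name, count in Counter(candidate_features).items() if count > 1]

# dict.fromkeys keeps the first occurrence of each name and preserves order.
unique_features = list(dict.fromkeys(candidate_features))

# Names that would fail when used to select columns.
unknown = [name for name in unique_features if name not in known_columns]

print("repeated:", repeated)
print("unique:", unique_features)
print("unknown:", unknown)
```

In the notebook, `known_columns` could be built from `df_dict["LoanStatNew"]` or from the CSV header, assuming those names match the file's columns.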

## 3. Feature Filtering

In [None]:
df = pd.read_csv("./datasets/loan_2019_20.csv")


Unnamed: 0.1,Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag
0,0,149203043,24000.0,24000.0,24000.0,60 months,13.90%,557.2,C,C1,...,Apr-2020,Jun-2020,Apr-2020,2.0,0.0,ACTIVE,473.24,20656.42,557.2,N
1,1,149354242,18500.0,18500.0,18500.0,60 months,14.74%,437.6,C,C2,...,,,,,,,,,,N
2,2,149355875,24000.0,24000.0,24000.0,36 months,8.19%,754.18,A,A4,...,,,,,,,,,,N
3,3,149437986,2800.0,2800.0,2775.0,36 months,8.19%,87.99,A,A4,...,,,,,,,,,,N
4,4,149511512,8800.0,8800.0,8800.0,36 months,20.00%,327.04,D,D2,...,,,,,,,,,,N
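
The filtering step itself (keep the chosen columns, then write a smaller CSV) can be sketched as follows. It is a minimal example under assumptions: the in-memory `raw_csv` excerpt stands in for the real file, the `wanted` list is a hypothetical short selection, and converting `int_rate` to a number is an optional extra suggested by the `13.90%` strings in the output above.

```python
import io

import pandas as pd

# Three-row excerpt shaped like the output above (only a few columns kept).
raw_csv = io.StringIO(
    "id,loan_amnt,term,int_rate,grade\n"
    "149203043,24000.0,60 months,13.90%,C\n"
    "149354242,18500.0,60 months,14.74%,C\n"
    "149355875,24000.0,36 months,8.19%,A\n"
)

# Hypothetical short selection; the notebook would use its own feature list.
wanted = ["loan_amnt", "int_rate", "not_in_file"]

# Read only the header first so that missing names are skipped instead of raising.
header = pd.read_csv(raw_csv, nrows=0).columns
raw_csv.seek(0)
present = [name for name in wanted if name in header]

# Load just the selected columns, which also keeps memory use down.
subset = pd.read_csv(raw_csv, usecols=present)

# int_rate arrives as text such as "13.90%"; strip the sign and convert to float.
subset["int_rate"] = subset["int_rate"].str.strip().str.rstrip("%").astype(float)

# Serialise the reduced frame; with a real path this would write the new CSV file.
out = io.StringIO()
subset.to_csv(out, index=False)
print(out.getvalue())
```

Passing `index=False` avoids writing the row index, which is the usual origin of `Unnamed: 0` columns when a saved file is read back.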
