This notebook creates meaningful, bank-interpretable features from the cleaned loan data for segmentation and risk modeling.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("/Users/starboy/Documents/Projects/credit_risk_segmentation/data/interim/cleaned_loans.csv")

In [3]:
df.head()

Unnamed: 0,emp_length,homeownership,annual_income,verified_income,debt_to_income,annual_income_joint,verification_income_joint,debt_to_income_joint,delinq_2y,months_since_last_delinq,...,term,interest_rate,installment,grade,sub_grade,issue_month,loan_status,initial_listing_status,disbursement_method,target
0,10.0,MORTGAGE,210000.0,Verified,9.53,,,,0,999.0,...,36,26.77,203.51,E,E5,Feb-2018,Fully Paid,fractional,Cash,0
1,1.0,MORTGAGE,83000.0,Source Verified,18.44,,,,3,14.0,...,60,15.05,476.33,C,C4,Jan-2018,Fully Paid,whole,Cash,0
2,10.0,MORTGAGE,140000.0,Not Verified,13.82,,,,0,999.0,...,60,9.93,318.19,B,B2,Jan-2018,Fully Paid,whole,Cash,0
3,1.0,OWN,70000.0,Source Verified,0.0,,,,0,31.0,...,36,6.08,73.1,A,A2,Jan-2018,Fully Paid,whole,Cash,0
4,2.0,MORTGAGE,44000.0,Verified,24.77,,,,0,999.0,...,36,14.08,246.36,C,C3,Feb-2018,Fully Paid,whole,Cash,0


The features are designed to capture:
1. Loan burden relative to income
2. Borrower credit experience
3. Debt pressure
4. Loan structure risk
5. Missing value indicators

##### **LOAN BURDEN**

In [4]:
df["loan_to_income"] = df["loan_amount"]/df["annual_income"]

In [5]:
df["loan_to_income"].describe()


count    454.000000
mean            inf
std             NaN
min        0.011765
25%        0.107764
50%        0.180201
75%        0.300000
max             inf
Name: loan_to_income, dtype: float64
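
The `inf` values for `mean` and `max` above indicate that at least one row has an `annual_income` of 0, so the division is undefined there. One possible way to handle this, shown as a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative, not taken from the loan data), is to treat zero income as unknown before dividing so the ratio becomes `NaN` rather than `inf`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data; values are illustrative only.
toy = pd.DataFrame({
    "loan_amount": [5000.0, 20000.0, 12000.0],
    "annual_income": [210000.0, 0.0, 60000.0],
})

# Treat zero income as unknown so the ratio becomes NaN instead of inf.
safe_income = toy["annual_income"].replace(0, np.nan)
toy["loan_to_income"] = toy["loan_amount"] / safe_income

# Count the rows that still need a decision (impute, cap, or drop).
n_undefined = int(toy["loan_to_income"].isna().sum())
print(toy["loan_to_income"].describe())
```

With the undefined rows turned into `NaN`, `describe()` skips them and reports finite statistics. Whether a zero income should be imputed, capped, or excluded is a modeling choice that this sketch does not settle.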

##### **CREDIT EXPERIENCE**

Credit experience is the number of years the borrower has been using credit, measured from the earliest credit line to 2018.

In [6]:
df["credit_history_years"] = 2018 - df["earliest_credit_line"]
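
The reference year 2018 is hard-coded in the cell above. A possible alternative, sketched here on a toy frame, is to read the year from `issue_month` for each loan. This assumes `issue_month` always follows the `Feb-2018` pattern seen in the `df.head()` output and that `earliest_credit_line` holds a four-digit year; the frame `toy` and its values are made up for illustration:

```python
import pandas as pd

# Toy frame; the real columns are assumed to follow the same formats.
toy = pd.DataFrame({
    "issue_month": ["Feb-2018", "Jan-2018", "Mar-2018"],
    "earliest_credit_line": [2003, 2005, 1993],
})

# Parse strings such as "Feb-2018" and keep only the year.
issue_year = pd.to_datetime(toy["issue_month"], format="%b-%Y").dt.year

# Years of credit history at the time the loan was issued.
toy["credit_history_years"] = issue_year - toy["earliest_credit_line"]
print(toy)
```

If every loan was issued in 2018, this gives the same result as the hard-coded version; it only matters when the data spans several issue years.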

In [7]:
df[["loan_to_income","credit_history_years"]].head()

Unnamed: 0,loan_to_income,credit_history_years
0,0.02381,15
1,0.240964,13
2,0.107143,25
3,0.034286,14
4,0.163636,7


Interpretation:
1. For index 0, the loan is only 2.4% of annual income and the borrower has 15 years of credit history. This looks low risk.
2. For index 1, the loan is 24% of annual income and the borrower has 13 years of credit history. This looks reasonable.
3. For index 4, the loan is 16% of annual income and the borrower has 7 years of credit history. The shorter history means slightly higher uncertainty.

##### **CREATE MISSING VALUE INDICATORS**

In [8]:
df["annual_income_missing"] = df["annual_income"].isna().astype(int)

##### **FILL MISSING VALUES**

In [9]:
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())

In [10]:
df["debt_to_income"] = df["debt_to_income"].fillna(df["debt_to_income"].median())

In [11]:
df["credit_history_years"] = df["credit_history_years"].fillna(df["credit_history_years"].median())
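
The flag-then-fill pattern used above can be wrapped in a small helper so that every imputed column also gets an indicator. This is only a sketch: the function name `flag_and_fill_median` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def flag_and_fill_median(frame, columns):
    """Add a <col>_missing indicator, then fill NaNs with the column median."""
    out = frame.copy()
    for col in columns:
        # Flag first: once the column is filled, missingness is no longer visible.
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy data with one missing value per column (illustrative only).
toy = pd.DataFrame({
    "annual_income": [40000.0, np.nan, 60000.0],
    "debt_to_income": [10.0, 20.0, np.nan],
})
filled = flag_and_fill_median(toy, ["annual_income", "debt_to_income"])
print(filled)
```

The order matters: the indicator has to be created before `fillna`, as in cells 8 and 9. Note also that `loan_to_income` was computed before income was filled, so rows with missing income keep a `NaN` ratio unless it is recomputed afterwards.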

In [12]:
df[["annual_income","annual_income_missing"]].head()

Unnamed: 0,annual_income,annual_income_missing
0,210000.0,0
1,83000.0,0
2,140000.0,0
3,70000.0,0
4,44000.0,0


We keep a missing-value flag so that we can later check whether missing income is itself a risk signal. In the first five rows income is present, so the flag is 0.
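
One simple way to check whether missingness carries risk is to compare the outcome rate between flagged and unflagged rows. The sketch below uses a made-up toy frame and assumes that `target` equals 1 for a bad outcome (the rows shown earlier, all `Fully Paid`, have `target` 0):

```python
import pandas as pd

# Toy data; assumes target == 1 marks a bad loan (illustrative values only).
toy = pd.DataFrame({
    "annual_income_missing": [0, 0, 0, 1, 1, 0],
    "target": [0, 0, 1, 1, 0, 0],
})

# Share of bad loans and group size for each value of the flag.
default_rate = (
    toy.groupby("annual_income_missing")["target"]
    .agg(["mean", "count"])
)
print(default_rate)
```

A clearly higher `mean` in the flagged group would suggest that missing income is informative. With small group counts the difference should be read cautiously.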

In [13]:
df.to_csv("/Users/starboy/Documents/Projects/credit_risk_segmentation/data/processed/model_ready_data.csv", index=False)