This notebook performs bank-safe data cleaning and validation.

In [1]:
import numpy as np
import pandas as pd

In [4]:
df_raw=pd.read_csv("/Users/starboy/Documents/Projects/credit_risk_segmentation/data/raw/lending_club.csv")

In [5]:
df = df_raw.copy()

**Remove columns that could not have been available when the loan officer was deciding whether to approve or reject the loan. Fields capturing repayment amounts, balances, or collections were classified as post-loan variables and excluded to prevent leakage.**

In [6]:
LEAKAGE_COLS = ["balance",
                "paid_total",
                "paid_principal",
                "paid_interest",
                "paid_late_fees"
                ]

df=df.drop(columns=LEAKAGE_COLS, errors="ignore")

These columns reflect post-loan repayment behaviour and are excluded to prevent data leakage.
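
Because `errors="ignore"` silently skips names that are not in the frame, a misspelled entry in `LEAKAGE_COLS` would go unnoticed. A minimal sketch of a check that reports which leakage columns were actually found, using a small hypothetical frame (`toy`) in place of the real data:

```python
import pandas as pd

LEAKAGE_COLS = ["balance", "paid_total", "paid_principal",
                "paid_interest", "paid_late_fees"]

# Hypothetical toy frame standing in for the raw loan data
toy = pd.DataFrame({
    "balance": [0.0, 120.5],
    "paid_total": [1000.0, 300.0],
    "interest_rate": [7.5, 12.1],
})

# Split the list into columns that exist and columns that do not
present = [c for c in LEAKAGE_COLS if c in toy.columns]
missing = [c for c in LEAKAGE_COLS if c not in toy.columns]

toy_clean = toy.drop(columns=present)
print("dropped:", present)
print("not found:", missing)
```

Printing the `missing` list makes it easy to spot a renamed or misspelled column before it leaks into the model.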

In [7]:
BAD_STATUSES=["Charged Off", "Default"]
GOOD_STATUSES= ["Fully Paid"]

In [8]:
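# Keep only loans with a final outcome; .copy() avoids SettingWithCopyWarning below
df = df[df["loan_status"].isin(BAD_STATUSES + GOOD_STATUSES)].copy()
# target: 1 = bad outcome (charged off or default), 0 = fully paid
df["target"] = df["loan_status"].isin(BAD_STATUSES).astype(int)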
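
The filtering and labelling step can be illustrated on a toy frame. This is a sketch only: the `"Current"` status and the `toy` data are hypothetical, chosen to show that loans without a final outcome are dropped before the binary target is built.

```python
import pandas as pd

BAD_STATUSES = ["Charged Off", "Default"]
GOOD_STATUSES = ["Fully Paid"]

# Hypothetical statuses; "Current" stands for a loan with no final outcome yet
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Fully Paid", "Default"]
})

# Keep resolved loans only, then encode bad outcomes as 1
labelled = toy[toy["loan_status"].isin(BAD_STATUSES + GOOD_STATUSES)].copy()
labelled["target"] = labelled["loan_status"].isin(BAD_STATUSES).astype(int)

# Share of bad loans among the resolved ones
bad_rate = labelled["target"].mean()
print(labelled)
print("bad rate:", bad_rate)
```

Checking the bad rate right after labelling is a quick way to see how imbalanced the target is.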

In [9]:
DROP_COLS=["emp_title","state"]
df=df.drop(columns=DROP_COLS,errors="ignore")

In [10]:
# Fill missing values with 999 as a sentinel
df["months_since_last_delinq"]=df["months_since_last_delinq"].fillna(999)
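
Filling with 999 makes a missing value look like a very old delinquency. A minimal sketch, assuming a missing value means no delinquency on record (the notebook does not state this), that keeps an explicit indicator alongside the sentinel. The column name `has_prior_delinq` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({"months_since_last_delinq": [12.0, np.nan, 48.0, np.nan]})

# Record missingness before overwriting it with the sentinel
toy["has_prior_delinq"] = toy["months_since_last_delinq"].notna().astype(int)
toy["months_since_last_delinq"] = toy["months_since_last_delinq"].fillna(999)

print(toy)
```

With the flag in place, a downstream model does not have to infer from the magnitude 999 alone that the value was originally missing.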

In [13]:
# Sanity checks: debt-to-income must be non-negative and interest rate strictly positive

assert (df["debt_to_income"].dropna() >= 0).all()
assert (df["interest_rate"] > 0).all()


In [15]:
assert (df["debt_to_income"] < 0).sum() == 0
assert (df["interest_rate"] <= 0).sum() == 0
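
A bare `assert` only says that a check failed, not how many rows are affected. The sketch below counts violations per rule on a hypothetical toy frame; the helper `count_violations` and its rule names are illustrative and not part of the notebook. Note that comparisons against `NaN` evaluate to `False`, so missing interest rates are counted separately.

```python
import numpy as np
import pandas as pd

def count_violations(frame):
    """Return the number of rows breaking each validation rule."""
    return {
        "negative_debt_to_income": int((frame["debt_to_income"] < 0).sum()),
        "non_positive_interest_rate": int((frame["interest_rate"] <= 0).sum()),
        "missing_interest_rate": int(frame["interest_rate"].isna().sum()),
    }

# Hypothetical toy data with one violation of each kind
toy = pd.DataFrame({
    "debt_to_income": [10.2, np.nan, -1.0],
    "interest_rate": [7.5, 0.0, np.nan],
})

violations = count_violations(toy)
print(violations)
```

The same dictionary can feed an assertion such as `assert not any(violations.values()), violations`, which shows the counts in the error message.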

In [17]:
df.to_csv("/Users/starboy/Documents/Projects/credit_risk_segmentation/data/interim/cleaned_loans.csv",index=False)