Case Study – 3
Domain – Banking/Loan
Focus – Lower NPA (Non-Performing Assets)
Business challenge/requirement
PeerLoanKart is an NBFC (Non-Banking Financial Company) that facilitates peer-to-peer loans.
It connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile indicates a high probability of paying you back.
As an ML expert, you are to build a model that predicts whether a borrower will repay the loan in full.

Key issues
Keep NPAs low: PeerLoanKart wants to be very diligent in granting
loans to borrowers
Considerations
NONE
Data volume
- Approx. 9,578 records in the file loan_borowwer_data.csv
Fields in Data
• credit.policy: 1 if the customer meets the credit underwriting criteria of
PeerLoanKart, and 0 otherwise
• purpose: The purpose of the loan (takes values "credit_card",
"debt_consolidation", "educational", "major_purchase", "small_business", and
"all_other")
• int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be
stored as 0.11). Borrowers judged by PeerLoanKart to be riskier are assigned
higher interest rates
• installment: The monthly installments owed by the borrower if the loan is
funded
• log.annual.in
• dti: The debt-to-income ratio of the borrower (amount of debt divided by
annual income)
• fico: The FICO credit score of the borrower
• days.with.cr.line: The number of days the borrower has had a credit line
• revol.bal: The borrower's revolving balance (amount unpaid at the end of the
credit card billing cycle)
• revol.util: The borrower's revolving line utilization rate (the amount of the
credit line used relative to total credit available)
• inq.last.6mths: The borrower's number of inquiries by creditors in the last 6
months
• delinq.2yrs: The number of times the borrower had been 30+ days past due on
a payment in the past 2 years
• pub.rec: The borrower's number of derogatory public records (bankruptcy
filings, tax liens, or judgments)
• not.fully.paid: The output (target) field. A value of 1 means the
borrower does not repay the loan in full
Additional information
- NA
Business benefits
Profits may increase by up to 20%, since NPAs fall when loans are disbursed only
to good borrowers

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
df = pd.read_csv("loan_borowwer_data.csv")

In [None]:
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

In [None]:
print("\nMissing Values:")
print(df.isnull().sum())

In [None]:
# 4. Encode categorical column
df = pd.get_dummies(df, columns=["purpose"], drop_first=True)
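
To see what the encoding step does to the `purpose` column, here is a minimal sketch on a toy frame. The rows and the `fico` values are made up for illustration; only the column names mirror the dataset. With `drop_first=True`, the alphabetically first category is dropped and acts as the baseline.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "purpose": ["credit_card", "all_other", "educational", "credit_card"],
    "fico": [707, 682, 712, 667],
})

encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

# "all_other" sorts first alphabetically, so it becomes the dropped baseline:
# a row with every purpose_* dummy equal to 0 is an "all_other" loan.
print(list(encoded.columns))
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for the logistic regression; tree-based models are largely indifferent to it.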

In [None]:
# 5. Split features & target
X = df.drop("not.fully.paid", axis=1)
y = df["not.fully.paid"]

In [None]:
# 6. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,
    random_state=42,
    stratify=y
)
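
The `stratify=y` argument keeps the share of defaulters the same in the train and test sets. A small sketch with synthetic labels (the 16% positive rate is an assumption chosen for illustration, not a figure taken from the file) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 160 positives out of 1000 (16%), a made-up imbalance.
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.30,
    random_state=42,
    stratify=y_toy,
)

# With stratification both splits keep (almost exactly) the original rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}")
print(f"test positive rate:  {test_rate:.3f}")
```

Without stratification, a random split of an imbalanced target can leave the test set with noticeably more or fewer defaulters than the training set, which makes the reported metrics harder to compare.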

In [None]:
# 7. Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("\n--- Logistic Regression ---")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(confusion_matrix(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
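
Since the stated focus is lower NPA, accuracy alone can be misleading: a model that misses most defaulters can still score well when defaulters are the minority. The sketch below, on hypothetical labels and predictions, reads the defaulter recall and precision off the confusion matrix (1 = not fully paid).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = not fully paid).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_defaulters = recall_score(y_true, y_hat, pos_label=1)        # tp / (tp + fn)
precision_defaulters = precision_score(y_true, y_hat, pos_label=1)  # tp / (tp + fp)

print(f"accuracy:            {accuracy:.2f}")
print(f"defaulter recall:    {recall_defaulters:.2f}")
print(f"defaulter precision: {precision_defaulters:.2f}")
```

Here the false negatives (`fn`) are the loans that would be granted and later turn into NPAs, so the recall for class 1 is the number to watch when comparing the three models.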

In [None]:
# 8. Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("\n--- Decision Tree ---")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print(confusion_matrix(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

In [None]:
# 9. Random Forest (class-weighted to counter class imbalance)
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42,
    class_weight="balanced"
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("\n--- Random Forest ---")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
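
`predict` applies a 0.5 cut-off to the predicted probability. One possible way to be more diligent about NPAs is to flag a borrower as risky at a lower probability. The sketch below uses synthetic data from `make_classification` rather than the loan file, and the 0.3 threshold is an arbitrary illustration, not a recommended value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (about 16% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=42, stratify=y_syn
)

model = RandomForestClassifier(
    n_estimators=100, max_depth=10, random_state=42, class_weight="balanced"
)
model.fit(X_tr, y_tr)

# Column 1 holds the predicted probability of class 1 (not fully paid).
proba_default = model.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    flagged = (proba_default >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, flagged, pos_label=1)
    print(f"threshold={threshold}: flagged={flagged.sum()}, "
          f"defaulter recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only keep or raise the defaulter recall, at the cost of rejecting more borrowers who would have repaid; where to set it is a business decision rather than a modelling one.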

In [None]:
# 10. Feature Importance
importance = pd.Series(rf.feature_importances_, index=X.columns)
print("\nTop 10 Important Features:")
print(importance.sort_values(ascending=False).head(10))