## Problem Statement
Loan default is one of the major financial risks faced by lending institutions.
When a borrower fails to repay a loan, it results in direct financial losses and
affects the overall stability of the organization.

Traditional loan approval systems often rely on fixed rules, which may not
accurately assess the true risk associated with a borrower. In addition, not
all classification errors have the same business impact. Treating default as
the positive class (loan_status = 1), approving a loan for a customer who later
defaults (a false negative) is typically far more costly than rejecting a
customer who would have repaid (a false positive).
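
The asymmetry between the two error types can be made concrete by attaching a unit cost to each. The following is a minimal sketch: the function name `total_error_cost`, the cost values (10 and 1), and the error counts are hypothetical and chosen only for illustration; they do not come from the dataset.

```python
def total_error_cost(false_negatives, false_positives, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost under asymmetric unit costs (hypothetical values)."""
    return false_negatives * cost_fn + false_positives * cost_fp

# Two hypothetical models that both make 100 errors, split differently.
cost_a = total_error_cost(false_negatives=80, false_positives=20)
cost_b = total_error_cost(false_negatives=20, false_positives=80)
print(cost_a, cost_b)
```

Both hypothetical models have the same error count, yet the one that misses more defaulters is much more expensive under these assumed costs, which is why accuracy alone is an incomplete yardstick here.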


## Objective
The objective of this task is to build a predictive model that estimates the
likelihood of loan default based on applicant financial and credit-related
attributes.

Beyond predictive accuracy, this task tunes the classification decision
threshold from a business-cost perspective. Moving the threshold trades false
negatives against false positives, so it can be chosen to minimize expected
financial loss and align the model's predictions with real-world business
objectives.
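
As a rough guide to where a cost-aware threshold might sit: if the predicted probability p is well calibrated and correct decisions cost nothing, flagging an applicant is worthwhile when p * cost_fn >= (1 - p) * cost_fp, which rearranges to p >= cost_fp / (cost_fp + cost_fn). The sketch below encodes that rule; the function name and the cost values are assumptions made for illustration.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Theoretical threshold for calibrated probabilities and zero-cost correct decisions."""
    return cost_fp / (cost_fp + cost_fn)

# Equal costs recover the usual 0.5 cut-off.
equal_costs = cost_optimal_threshold(cost_fp=1.0, cost_fn=1.0)

# Hypothetical case: a missed default is assumed to cost ten times a wrongly rejected loan.
skewed_costs = cost_optimal_threshold(cost_fp=1.0, cost_fn=10.0)
print(equal_costs, round(skewed_costs, 4))
```

In practice the probabilities of a fitted model are only approximately calibrated, so this value is a starting point to check empirically rather than a final answer.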


## Dataset Description
The loan dataset contains detailed information about loan applicants,
including demographic, financial, and credit history features.

Key attributes in the dataset include:
- person_age: Age of the applicant
- person_income: Annual income of the applicant
- person_home_ownership: Home ownership status
- person_emp_length: Length of employment
- loan_intent: Purpose of the loan
- loan_grade: Credit grade assigned to the applicant
- loan_amnt: Loan amount requested
- loan_int_rate: Interest rate of the loan
- loan_percent_income: Loan amount as a percentage of income
- cb_person_default_on_file: Historical default record
- cb_person_cred_hist_length: Credit history length
- loan_status: Target variable indicating loan default (0 = No, 1 = Yes)

This dataset is suitable for binary classification and risk modeling tasks.
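
Before modeling, it can help to check for missing values and class balance, since both affect later steps (imputation and threshold choice). The sketch below uses a small hypothetical frame with a few of the column names listed above; the values, including the missing entries, are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real data would come from credit_risk_dataset.csv.
sample = pd.DataFrame({
    "person_emp_length": [5.0, np.nan, 1.0, 4.0],
    "loan_int_rate": [11.14, 12.87, np.nan, np.nan],
    "loan_status": [0, 1, 0, 0],
})

# Count of missing values in each column.
missing_per_column = sample.isna().sum()

# Share of rows labelled as default (loan_status = 1).
default_rate = sample["loan_status"].mean()

print(missing_per_column)
print(default_rate)
```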


## Import Required Libraries


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report


## Dataset Loading


In [2]:
df = pd.read_csv("credit_risk_dataset.csv")
df.head()

|   | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.1 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |


## Data Cleaning and Preprocessing


In [3]:
categorical_cols = [
    'person_home_ownership',
    'loan_intent',
    'loan_grade',
    'cb_person_default_on_file'
]

le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

df.head()


|   | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | 3 | 123.0 | 4 | 3 | 35000 | 16.02 | 1 | 0.59 | 1 | 3 |
| 1 | 21 | 9600 | 2 | 5.0 | 1 | 1 | 1000 | 11.14 | 0 | 0.1 | 0 | 2 |
| 2 | 25 | 9600 | 0 | 1.0 | 3 | 2 | 5500 | 12.87 | 1 | 0.57 | 0 | 3 |
| 3 | 23 | 65500 | 3 | 4.0 | 3 | 2 | 35000 | 15.23 | 1 | 0.53 | 0 | 2 |
| 4 | 24 | 54400 | 3 | 8.0 | 3 | 2 | 35000 | 14.27 | 1 | 0.55 | 1 | 4 |
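
Reusing a single LabelEncoder across columns works, but only the last column's mapping remains available afterwards. One possible variation, sketched below on a small hypothetical frame, keeps one encoder per column so each integer code can be traced back to its original label. The `encoders` dictionary and the sample values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample using two of the categorical columns.
sample = pd.DataFrame({
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "cb_person_default_on_file": ["Y", "N", "N", "Y"],
})

# Keep a separate fitted encoder for every column.
encoders = {}
for col in sample.columns:
    encoders[col] = LabelEncoder()
    sample[col] = encoders[col].fit_transform(sample[col])

# classes_ is sorted, so the position of each label is its integer code.
mapping = {
    label: code
    for code, label in enumerate(encoders["person_home_ownership"].classes_)
}
print(mapping)
```

Note that integer codes impose an ordering on nominal categories such as home ownership; for a linear model, one-hot encoding is a common alternative worth considering.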


## Feature Selection and Train-Test Split

In [4]:
X = df.drop('loan_status', axis=1)
y = df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
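
Because defaults are the minority class, a stratified split is one option for keeping the default rate similar in the training and test sets. The sketch below demonstrates the `stratify` argument of `train_test_split` on synthetic data; the arrays and the 20% default rate are assumptions, not properties of the loan dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 20% labelled as default.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```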


## Feature Scaling


In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## Model Training using Logistic Regression


In [9]:
from sklearn.impute import SimpleImputer

# Create an imputer (mean strategy is common)
imputer = SimpleImputer(strategy='mean')

# Fit on training data and transform both train and test
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)


In [10]:
# LogisticRegression was already imported in the first cell
model = LogisticRegression(max_iter=2000)
model.fit(X_train_imputed, y_train)


LogisticRegression(max_iter=2000)
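
The imputation, scaling, and fitting steps can also be chained in a single scikit-learn `Pipeline`, which fits every preprocessing step on the training data only and reapplies it consistently at prediction time. This is a sketch on synthetic data, not the notebook's exact workflow: it imputes before scaling, and the step names and demo arrays are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, with about 5% of values set to missing.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Impute, then scale, then fit the classifier.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])
pipeline.fit(X_demo, y_demo)

# Probability of the positive class for each row.
demo_probabilities = pipeline.predict_proba(X_demo)[:, 1]
```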


In [11]:
# predict() applies the default 0.5 probability threshold
y_pred = model.predict(X_test_imputed)

# classification_report was already imported in the first cell
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.86      0.95      0.90      5072
           1       0.72      0.44      0.55      1445

    accuracy                           0.84      6517
   macro avg       0.79      0.70      0.72      6517
weighted avg       0.83      0.84      0.82      6517
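
A recall of 0.44 for class 1 means that, at the default 0.5 threshold, more than half of the actual defaulters in the test set were predicted as non-defaulters. The raw error counts behind such figures can be read from a confusion matrix. The sketch below uses small hypothetical label arrays, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = default).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall for the default class: the share of actual defaulters that were caught.
recall_default = tp / (tp + fn)
print(tn, fp, fn, tp, recall_default)
```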



## Cost-Based Threshold Optimization


In [13]:
# Predict default probabilities on the preprocessed (scaled, then imputed) test set
probabilities = model.predict_proba(X_test_imputed)[:, 1]

# Apply a custom threshold (e.g., 0.3) instead of the default 0.5, so that
# more applicants are flagged as likely defaulters
threshold = 0.3
y_custom = (probabilities >= threshold).astype(int)
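
Rather than fixing the threshold by hand, one option is to scan a grid of candidate thresholds and keep the one with the lowest total cost. The following is a minimal sketch: the helper `best_threshold_by_cost`, the cost values (10 for a false negative, 1 for a false positive), and the demo arrays are all hypothetical.

```python
import numpy as np

def best_threshold_by_cost(y_true, probabilities, cost_fn=10.0, cost_fp=1.0,
                           thresholds=None):
    """Return the (threshold, cost) pair with the lowest total cost on a grid."""
    y_true = np.asarray(y_true)
    probabilities = np.asarray(probabilities)
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    costs = []
    for t in thresholds:
        predicted = (probabilities >= t).astype(int)
        fn = np.sum((predicted == 0) & (y_true == 1))  # missed defaulters
        fp = np.sum((predicted == 1) & (y_true == 0))  # wrongly flagged applicants
        costs.append(fn * cost_fn + fp * cost_fp)
    costs = np.array(costs)
    best = int(np.argmin(costs))  # first minimum if several thresholds tie
    return float(thresholds[best]), float(costs[best])

# Hypothetical labels and predicted default probabilities.
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
p_demo = np.array([0.03, 0.12, 0.22, 0.27, 0.33, 0.42, 0.58, 0.67, 0.81, 0.93])

best_t, best_cost = best_threshold_by_cost(y_demo, p_demo)
print(round(best_t, 2), best_cost)
```

With real data, the threshold would ideally be selected on a separate validation split and only then evaluated on the test set, since tuning it on the test set tends to give an optimistic cost estimate.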