# 💳 CreditOne: Loan Default Risk Modeling

This notebook demonstrates a machine learning solution to help **CreditOne**, a financial institution, assess customer creditworthiness and mitigate loan default risk.

We follow a complete data science workflow:
1. Problem understanding
2. Data cleaning and preprocessing
3. Exploratory Data Analysis (EDA)
4. Model building and evaluation
5. Key insights and future enhancements


In [None]:
# 📦 Load and inspect data

import pandas as pd

# Load the cleaned dataset
df = pd.read_csv("CreditOne_Cleaned.csv")
df.head()
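
Before plotting, it can help to confirm that the load went as expected. The sketch below assumes the target column is named `default payment next month`, as it is later in this notebook. The helper `quick_summary` and the toy frame are illustrative stand-ins, not part of the original analysis.

```python
import pandas as pd


def quick_summary(frame: pd.DataFrame, target_col: str) -> dict:
    """Return basic sanity-check numbers for a freshly loaded frame."""
    return {
        "n_rows": len(frame),
        "n_cols": frame.shape[1],
        # Total count of missing cells across all columns
        "n_missing": int(frame.isna().sum().sum()),
        # Share of each target label among the rows
        "target_share": frame[target_col].value_counts(normalize=True).to_dict(),
    }


# Toy stand-in for the real file; values are made up for illustration
toy = pd.DataFrame({
    "AGE": [25, 40, 31, None],
    "default payment next month": [
        "default", "not default", "not default", "not default",
    ],
})
summary = quick_summary(toy, "default payment next month")
print(summary)
```

On the real data, the same call would be `quick_summary(df, "default payment next month")`. A strongly skewed `target_share` is a hint that accuracy alone will be a weak yardstick later on.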

## 🔍 Exploratory Data Analysis (EDA)

We'll explore the dataset to understand customer demographics, payment behaviors, and their relationship with default risk.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style (set_theme is the current name; sns.set is an older alias)
sns.set_theme(style="whitegrid")

# Plot distribution of AGE
plt.figure(figsize=(6, 4))
sns.histplot(df['AGE'], bins=20, kde=True)
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Plot default value counts
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='default payment next month')
plt.title("Loan Default Distribution")
plt.ylabel("Number of Clients")
plt.show()

# Correlation heatmap for numeric features
plt.figure(figsize=(12, 10))
numeric_df = df.select_dtypes(include='number')
sns.heatmap(numeric_df.corr(), cmap='coolwarm', annot=False)
plt.title("Correlation Heatmap")
plt.show()
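
The heatmap only covers numeric columns, so a target stored as text labels would not appear in it. As a rough sketch, the target can be encoded as 0/1 and correlated with each numeric feature directly. The helper name `correlations_with_target`, the `LIMIT_BAL` column, and all values in the toy frame are assumptions for illustration.

```python
import pandas as pd


def correlations_with_target(frame, target_col, positive_label):
    """Correlate each numeric feature with a 0/1 version of the target."""
    encoded = (frame[target_col] == positive_label).astype(int)
    numeric = frame.select_dtypes(include="number")
    # corrwith computes the Pearson correlation column by column
    corr = numeric.corrwith(encoded)
    # Order by absolute strength, strongest first
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Toy data; LIMIT_BAL is a hypothetical column name and the values are made up
toy = pd.DataFrame({
    "AGE": [25, 40, 31, 52, 46, 29],
    "LIMIT_BAL": [20000, 90000, 50000, 120000, 30000, 80000],
    "default payment next month": [
        "default", "not default", "not default",
        "not default", "default", "not default",
    ],
})
ranked = correlations_with_target(toy, "default payment next month", "default")
print(ranked)
```

Pearson correlation only picks up linear association, so a feature with a low value here may still matter to a tree-based model.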

## 🤖 Modeling

We'll train two models:
- **Logistic Regression**: simple and interpretable.
- **Random Forest**: an ensemble of decision trees that can capture non-linear relationships and feature interactions.
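
Logistic regression is fitted with an iterative solver, which tends to converge more reliably when features are on a similar scale, and its coefficients are easier to compare after standardization. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` in place of the CreditOne features, and the scaling pipeline is an illustration rather than this notebook's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 80% negative class)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is fitted on the training fold only, then reused at predict time
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_logreg.fit(X_tr, y_tr)

# Coefficients are on the standardized scale, so their magnitudes are comparable
coefs = scaled_logreg.named_steps["logisticregression"].coef_[0]
test_accuracy = scaled_logreg.score(X_te, y_te)
print(coefs.round(2))
print("test accuracy:", test_accuracy)
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of the fitting step.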


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

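# Encode target variable (labels not listed in the mapping become NaN)
df['target'] = df['default payment next month'].map({'default': 1, 'not default': 0})
if df['target'].isna().any():
    raise ValueError("Unexpected label in 'default payment next month'")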

# Features and target
X = df.drop(columns=['default payment next month', 'target'])
X = pd.get_dummies(X, drop_first=True)  # minimal encoding
y = df['target']

# Train-test split (stratified so both sets keep the same default rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic Regression
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
y_pred_log = logreg.predict(X_test)

# Random Forest (shallow)
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluation
print("Logistic Regression Report:")
print(classification_report(y_test, y_pred_log))

print("Random Forest Report:")
print(classification_report(y_test, y_pred_rf))

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# ROC curve and AUC for the random forest
y_prob_rf = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob_rf)
auc_rf = roc_auc_score(y_test, y_prob_rf)
plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc_rf:.3f})")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
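
The confusion matrix plotted above follows scikit-learn's layout: rows are actual classes and columns are predicted classes, so with labels 0 and 1 the cells read `[[TN, FP], [FN, TP]]`. As a sketch of how to turn those counts into the rates that matter for default risk, the helper below (hypothetical name `rates_from_confusion`, made-up counts) computes recall, precision, false positive rate, and F1. It assumes no denominator is zero.

```python
def rates_from_confusion(matrix):
    """Derive common rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    recall = tp / (tp + fn)        # share of actual defaulters that were caught
    precision = tp / (tp + fp)     # share of flagged clients that did default
    false_positive_rate = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "fpr": false_positive_rate,
        "f1": f1,
    }


# Made-up counts for illustration only, not results from this notebook
example = [[90, 10], [30, 20]]
rates = rates_from_confusion(example)
print(rates)
```

With the notebook's own objects, `rates_from_confusion(conf_matrix.tolist())` would apply the same arithmetic to the random forest's matrix. For a lender, low recall on the default class means missed defaulters, which is usually the costlier error.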
