# Final Model Comparison — Credit Scoring

## Objective
Compare three machine learning models for credit default prediction
and identify the most suitable approach in terms of both
predictive performance and business impact.

## Models Evaluated
- Logistic Regression (Baseline & Explainable)
- Random Forest (Non-linear, Balanced)
- LightGBM (High-performance Risk Ranking)

## Key Evaluation Focus
- ROC AUC
- KS Statistic
- Recall for Default (Class 1)
- Approval Rate
- Business Trade-off
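The metrics above can be sketched in plain Python to make their definitions concrete. This is a minimal illustration, not the code used in this project: the function names, the toy labels and scores, and the 0.5 cutoff are all assumptions, and "approval rate" is taken here to mean the share of applicants whose predicted default probability falls below the decision threshold.

```python
def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes.

    Quadratic in the number of rows; fine for a sketch, slow for real data.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for y, s in zip(y_true, scores) if y == 1 and s <= t) / n_pos
        cdf_neg = sum(1 for y, s in zip(y_true, scores) if y == 0 and s <= t) / n_neg
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def recall_default(y_true, scores, threshold):
    """Share of actual defaulters (class 1) scored at or above the threshold."""
    caught = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    return caught / sum(y_true)


def approval_rate(scores, threshold):
    """Share of applicants scored below the threshold (assumed to be approved)."""
    return sum(1 for s in scores if s < threshold) / len(scores)


# Toy data, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

print(f"ROC AUC:       {roc_auc(y_true, scores):.3f}")
print(f"KS statistic:  {ks_statistic(y_true, scores):.3f}")
print(f"Recall (1):    {recall_default(y_true, scores, 0.5):.3f}")
print(f"Approval rate: {approval_rate(scores, 0.5):.3f}")
```

ROC AUC and KS depend only on how the scores rank applicants, while recall and approval rate depend on the chosen threshold. That is why a single model can appear twice in a comparison with identical AUC and KS but different recall and approval figures.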


## Experimental Setup

- Dataset: Home Credit Application Data
- Target Variable: Default (1) vs Non-default (0)
- Train-test split: 80:20 (stratified)
- Feature set: Engineered numerical and encoded categorical features
- Evaluation performed on hold-out test set
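A stratified split keeps the default rate the same in the training and test sets, which matters when defaults are rare. The sketch below shows the idea in plain Python. The helper name `stratified_split`, the seed, and the toy label vector are assumptions for illustration; this report does not state which tool performed the split. In practice, scikit-learn's `train_test_split` with the `stratify` argument does the same job.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 80 defaults among 1,000 applicants.
labels = [1] * 80 + [0] * 920
train_idx, test_idx = stratified_split(labels, test_size=0.2)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```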

Accuracy is not used as a primary metric: with strong class imbalance,
a model that predicts non-default for every applicant still scores high accuracy.
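A short worked example shows why accuracy misleads under imbalance. The 8% default rate and the sample size below are assumed for illustration and are not figures reported in this document.

```python
n_total = 10_000
n_default = 800  # hypothetical 8% default rate

y_true = [1] * n_default + [0] * (n_total - n_default)
y_pred = [0] * n_total  # trivial "model": predict non-default for everyone

accuracy = sum(1 for y, p in zip(y_true, y_pred) if y == p) / n_total
recall = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1) / n_default

print(f"accuracy={accuracy:.2f}, recall_default={recall:.2f}")
# prints: accuracy=0.92, recall_default=0.00
```

The trivial model looks strong on accuracy while catching no defaulters, which is why the comparison relies on ranking metrics and recall for the default class instead.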

## Model Performance Summary

| Model | ROC AUC | KS Statistic | Recall (Default) | Approval Rate | Key Characteristics |
|------|--------|--------------|------------------|---------------|--------------------|
| Logistic Regression | ~0.69 | ~0.36 | ~0.97 | ~30–40% | Conservative, highly interpretable |
| Random Forest | 0.731 | 0.344 | 0.64 | 67.1% | Balanced risk and growth |
| LightGBM (Default Threshold) | 0.759 | 0.386 | 0.02 | 99.7% | Ranking-focused, overly permissive |
| **LightGBM (Tuned Threshold)** | **0.759** | **0.386** | **0.14** | **~91%** | Growth-oriented strategy |
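The two LightGBM rows share the same ROC AUC and KS because the model and its scores are identical; only the decision threshold differs. The sketch below shows how recall and approval rate move as the threshold changes. The function name, toy data, and the cutoffs 0.5 and 0.1 are assumptions for illustration, not the thresholds used in this project.

```python
def sweep_thresholds(y_true, scores, thresholds):
    """Report recall for defaulters and approval rate at each candidate threshold."""
    n_default = sum(y_true)
    rows = []
    for t in thresholds:
        declined = [s >= t for s in scores]  # scores at or above t are declined
        caught = sum(1 for y, d in zip(y_true, declined) if y == 1 and d)
        rows.append({
            "threshold": t,
            "recall_default": caught / n_default,
            "approval_rate": 1 - sum(declined) / len(scores),
        })
    return rows


# Toy data, invented for illustration: most predicted probabilities are small,
# as is typical when the positive class is rare.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]
scores = [0.02, 0.04, 0.05, 0.08, 0.12, 0.15, 0.30, 0.55, 0.06, 0.03]

rows = sweep_thresholds(y_true, scores, thresholds=[0.5, 0.1])
for row in rows:
    print(row)
```

Lowering the threshold raises recall and lowers the approval rate. On an imbalanced dataset, predicted default probabilities cluster near zero, so a high cutoff flags very few applicants, which is consistent with the near-100% approval rate shown for the default threshold.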

## Visual Model Comparison

This section presents visual comparisons to support
the quantitative evaluation metrics.

### ROC Curve Comparison

The ROC curves below compare the discriminative ability
of each model in separating default and non-default customers.


![ROC Curve Comparison](../../screenshot/roc/roc_comparison.png)

#### Insight
- LightGBM shows the strongest separation (ROC AUC 0.759)
- Random Forest is competitive (ROC AUC 0.731)
- Logistic Regression serves as the baseline (ROC AUC ~0.69)

### Confusion Matrix Comparison

Confusion matrices highlight how each model behaves
under different risk strategies.


![Logistic Regression CM](../../screenshot/cm/cm_logres.png)
![Random Forest CM](../../screenshot/cm/cm_rf.png)
![LightGBM Tuned CM](../../screenshot/cm/cm_lightgbm.png)


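#### Insight

- Logistic Regression is highly conservative: it catches ~97% of defaulters but approves only ~30–40% of applicants
- Random Forest balances approval and risk: 64% recall on defaulters at a 67.1% approval rate
- LightGBM (tuned) supports growth-oriented decisioning: ~91% approval rate, at the cost of 14% recall on defaulters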

## Feature Importance Analysis

Across all models, the top contributing features are consistent:
- EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3
- Employment duration (EMPLOYED_YEARS)
- Age (AGE_YEARS)
- Credit & annuity related ratios

This consistency does not prove, but supports confidence in:
- Data quality
- Model robustness
- Alignment with credit risk domain knowledge
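One way to quantify the consistency described above is to compare the top-k feature sets of each model pairwise. The sketch below uses Jaccard overlap. All importance values are invented, `CREDIT_ANNUITY_RATIO` is a hypothetical stand-in for the credit and annuity ratios, and the helper names are assumptions. Because importance scales differ between model types (for example absolute coefficients, impurity decrease, or split counts), only the rankings are compared, not the magnitudes.

```python
from itertools import combinations


def top_k(importances, k):
    """Names of the k features with the largest importance values."""
    return set(sorted(importances, key=importances.get, reverse=True)[:k])


def jaccard(a, b):
    """Overlap of two sets: size of intersection divided by size of union."""
    return len(a & b) / len(a | b)


# Hypothetical importance values, for illustration only.
importances = {
    "logreg": {"EXT_SOURCE_1": 0.9, "EXT_SOURCE_2": 1.1, "EXT_SOURCE_3": 1.0,
               "AGE_YEARS": 0.4, "EMPLOYED_YEARS": 0.3, "CREDIT_ANNUITY_RATIO": 0.2},
    "random_forest": {"EXT_SOURCE_1": 0.10, "EXT_SOURCE_2": 0.20, "EXT_SOURCE_3": 0.18,
                      "AGE_YEARS": 0.08, "EMPLOYED_YEARS": 0.09, "CREDIT_ANNUITY_RATIO": 0.07},
    "lightgbm": {"EXT_SOURCE_1": 120, "EXT_SOURCE_2": 150, "EXT_SOURCE_3": 140,
                 "AGE_YEARS": 110, "EMPLOYED_YEARS": 90, "CREDIT_ANNUITY_RATIO": 100},
}

overlap = {
    (a, b): jaccard(top_k(importances[a], 4), top_k(importances[b], 4))
    for a, b in combinations(importances, 2)
}
for pair, score in overlap.items():
    print(pair, round(score, 2))
```

A score of 1.0 means two models agree exactly on their top-k features; values well below 1.0 would suggest the models rely on different signals.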


## Business Interpretation

- Logistic Regression:
  Suitable for conservative lending strategies
  and regulatory explainability requirements.

- Random Forest:
  Provides a balanced trade-off between risk mitigation
  and customer approval growth.

- LightGBM:
  Achieves the strongest ranking performance (ROC AUC 0.759, KS 0.386),
  but at the default threshold it catches only 2% of defaulters,
  so it requires threshold tuning for operational use.

## Final Recommendation

Based on predictive performance and business considerations:

- Use **LightGBM** as the primary **risk scoring engine**
  to rank applicants by default risk.
- Apply **threshold tuning** to align approval decisions
  with business risk appetite.
- Maintain **Logistic Regression** as:
  - A benchmark model
  - An explainability and regulatory support model
- Consider **Random Forest** as a challenger model
  for periodic validation and robustness checks.
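One simple way to align a threshold with risk appetite is to fix a target approval rate and read the cutoff off the score distribution. The sketch below is an assumption about how this could be done, not the tuning procedure used in this project. The function name and the evenly spaced toy scores are hypothetical, and the 91% target only mirrors the approval rate reported for the tuned LightGBM model.

```python
def threshold_for_approval_rate(scores, target_rate):
    """Return a cutoff so that about target_rate of applicants score strictly below it.

    With tied scores at the cutoff, the realized approval rate can fall
    below the target, so it should be re-checked after choosing the cutoff.
    """
    ranked = sorted(scores)
    n_approved = round(target_rate * len(ranked))
    if n_approved >= len(ranked):
        return float("inf")  # approve everyone
    return ranked[n_approved]


# Hypothetical predicted default probabilities for 100 applicants.
scores = [i / 100 for i in range(100)]

cutoff = threshold_for_approval_rate(scores, target_rate=0.91)
realized = sum(1 for s in scores if s < cutoff) / len(scores)
print(cutoff, realized)
```

The cutoff should be chosen on a validation set rather than the hold-out test set, so that the reported test metrics remain an unbiased estimate.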


## Future Improvements

- Cost-sensitive optimization using expected loss
- Segment-based thresholding (e.g., income level, product type)
- Model monitoring for data drift and performance decay
- Integration with rule-based credit policies
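As a rough illustration of the cost-sensitive idea, the threshold can be chosen to minimize total misclassification cost on labelled data. The sketch below is deliberately simplified: the costs (an approved defaulter costing five times as much as a declined good customer), the toy data, and the function names are all assumptions. A fuller expected-loss treatment would typically weight each applicant by exposure and loss given default.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Cost of approving defaulters (false negatives) plus declining good customers (false positives)."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * cost_fn + fp * cost_fp


def best_threshold(y_true, scores, candidates, cost_fn, cost_fp):
    """Candidate threshold with the lowest total cost (first one wins on ties)."""
    return min(candidates, key=lambda t: total_cost(y_true, scores, t, cost_fn, cost_fp))


# Toy data and costs, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]
scores = [0.02, 0.04, 0.05, 0.08, 0.12, 0.15, 0.30, 0.55, 0.06, 0.03]
candidates = [0.1, 0.2, 0.5]

costs = {t: total_cost(y_true, scores, t, cost_fn=5, cost_fp=1) for t in candidates}
chosen = best_threshold(y_true, scores, candidates, cost_fn=5, cost_fp=1)
print(costs, chosen)
```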

## Conclusion

This project demonstrates an end-to-end credit scoring workflow,
from exploratory data analysis (EDA) and feature engineering
to multi-model evaluation and business-aligned decision making.

The results emphasize that strong predictive performance
must be combined with appropriate threshold selection
and business context to deliver real-world value.