# **5. Experiments and Results**

# 5.1 Random Forest Experiments

To establish a benchmark, we first trained and evaluated several variations of Random Forest models. Given the imbalanced nature of the dataset, we experimented with both threshold tuning and class weighting strategies.

| **Model**                  | **Recall (Default)** | **Precision (Default)** | **Recall (Non-Default)** | **Precision (Non-Default)** | **AUC** |
|-----------------------------|----------------------|--------------------------|---------------------------|------------------------------|---------|
| Baseline                    | 0.17                 | 0.58                     | 0.97                      | 0.82                         | 0.776   |
| Threshold-Tuned Baseline    | **0.81**             | 0.34                     | 0.59                      | 0.92                         | 0.776   |
| Class-Weighted              | 0.15                 | 0.59                     | 0.97                      | 0.81                         | 0.779   |
| Threshold-Tuned Weighted    | **0.79**             | 0.35                     | 0.62                      | 0.92                         | 0.779   |


Initially, the baseline model showed strong ability to identify non-default loans but struggled to capture defaults. Applying threshold tuning—lowering the classification threshold to 0.2—substantially improved default recall (from 0.17 to 0.81), though at the cost of reduced precision.
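The effect of lowering the threshold can be sketched as follows. This is a minimal illustration on synthetic data, not the project's pipeline: the dataset from `make_classification`, the forest size, and the helper `metrics_at` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: roughly 20% positives ("defaults").
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Probability of the positive class (column 1 of predict_proba).
proba = model.predict_proba(X_test)[:, 1]


def metrics_at(threshold):
    """Recall and precision for the positive class at a given cut-off."""
    pred = (proba >= threshold).astype(int)
    return (
        recall_score(y_test, pred),
        precision_score(y_test, pred, zero_division=0),
    )


recall_default, precision_default = metrics_at(0.5)
recall_tuned, precision_tuned = metrics_at(0.2)
print(f"threshold 0.5: recall={recall_default:.2f}, precision={precision_default:.2f}")
print(f"threshold 0.2: recall={recall_tuned:.2f}, precision={precision_tuned:.2f}")
```

Lowering the cut-off can only add predicted positives, so recall never decreases; precision usually falls because more non-defaults are flagged.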

Class weighting was also applied by assigning computed weights (approximately 2.41 for defaults and 0.63 for non-defaults), but weighting alone did not meaningfully boost recall. Threshold tuning again proved to have the greatest impact.
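Weights of this size are what the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, would produce for a default rate of about 21%. The sketch below uses a hypothetical label vector with 20.75% defaults, chosen only because it reproduces the reported weights; the actual class counts are not stated here.

```python
import numpy as np

# Hypothetical label vector (1 = default). The 20.75% default share is an
# assumption picked so that the formula matches the weights reported above.
n_samples = 10_000
n_defaults = 2_075
y = np.array([1] * n_defaults + [0] * (n_samples - n_defaults))

counts = np.bincount(y)  # [n_non_default, n_default]
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
weights = n_samples / (n_classes * counts)
weight_non_default, weight_default = weights

print(f"non-default: {weight_non_default:.2f}, default: {weight_default:.2f}")
```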

We selected the **threshold-tuned baseline Random Forest** as the final version among Random Forest models due to its slightly better recall. However, limitations remained, including an increased rate of false positives and the model's lack of transparency.

Given these constraints, we moved toward exploring **XGBoost**, a more advanced modeling approach, to further improve default prediction performance.


# 5.2 XGBoost Experiments

Building on lessons from Random Forest, we transitioned to **XGBoost**, a gradient boosting framework designed to deliver stronger predictive performance for structured data.

### Model Variants Compared:
- **Baseline XGBoost** (no class weighting, default hyperparameters)
- **Class-Weighted XGBoost** (using `scale_pos_weight`)
- **Class-Weighted and Tuned XGBoost** (hyperparameter tuning + class weighting)

The final selected model was the **Class-Weighted and Tuned XGBoost**, achieving:
- **Accuracy**: 0.6607
- **F1 Score**: 0.4592
- **ROC AUC**: 0.7317

This model best balanced precision, recall, and AUC.
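For the class-weighted variants, `scale_pos_weight` is commonly set to the ratio of negative to positive training examples. The sketch below shows that calculation on hypothetical class counts; the remaining parameter values are placeholders, not the tuned settings from this study.

```python
import numpy as np

# Hypothetical training labels (1 = default); the real counts are not reported.
y_train = np.array([0] * 7_925 + [1] * 2_075)

n_negative = int((y_train == 0).sum())
n_positive = int((y_train == 1).sum())

# Common heuristic: weight positives by the negative-to-positive ratio.
scale_pos_weight = n_negative / n_positive

# Parameter dictionary that could be passed to xgboost.XGBClassifier(**params).
# Values other than scale_pos_weight are illustrative placeholders.
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight,
}
print(f"scale_pos_weight = {scale_pos_weight:.2f}")
```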


### Overfitting Assessment

To evaluate model stability, ROC AUC scores were compared across training, test, and cross-validation sets:

| Model                     | Train AUC | Test AUC | 3-Fold CV AUC (Train) |
|----------------------------|-----------|----------|-----------------------|
| Baseline XGBoost           | 0.7285    | 0.7278   | 0.7262 ± 0.0010        |
| Tuned XGBoost (final)      | 0.7444    | 0.7317   | 0.7295 ± 0.0011        |

Both models showed minimal overfitting, with consistent performance between training and testing.
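The size of the train-test gap can be read directly from the table. The short sketch below recomputes it from the reported AUC values; the dictionary layout is only an illustration.

```python
# AUC values copied from the table above.
auc_scores = {
    "Baseline XGBoost": {"train": 0.7285, "test": 0.7278, "cv": 0.7262},
    "Tuned XGBoost (final)": {"train": 0.7444, "test": 0.7317, "cv": 0.7295},
}

gaps = {}
for name, scores in auc_scores.items():
    gaps[name] = {
        "train_minus_test": round(scores["train"] - scores["test"], 4),
        "train_minus_cv": round(scores["train"] - scores["cv"], 4),
    }
    print(name, gaps[name])
```

The baseline gap is under 0.001 and the tuned model's gap is about 0.013, so tuning bought a small test improvement at the cost of a slightly larger train-test difference.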


### Learning Curve Analysis

- **Training AUC** started high (~0.84) and gradually declined as more data was introduced, converging toward ~0.75.
- **Validation AUC** steadily increased from ~0.718 to ~0.729.
- The gap between training and validation AUC narrowed from roughly 0.12 to roughly 0.02 as the training set grew, indicating that the initial overfitting diminished with more data.

This suggested that the model generalizes well, and additional data only offers diminishing returns after a point.


### Threshold Tuning: Precision–Recall Trade-Off

Since minimizing missed defaults was prioritized, recall was emphasized during threshold tuning. Three thresholds were evaluated:

| Threshold | Precision | Recall | F1 Score | ROC AUC | KS Statistic |
|-----------|-----------|--------|----------|---------|--------------|
| 0.30      | 0.2626    | 0.9256 | 0.4091   | 0.7317  | 0.3373       |
| 0.50      | 0.3463    | 0.6815 | 0.4592   | 0.7317  | 0.3373       |
| 0.70      | 0.4886    | 0.2862 | 0.3610   | 0.7317  | 0.3373       |

At a **threshold of 0.30**, recall reached **92.56%**, making it the most appropriate threshold for catching as many risky loans as possible, even if it increased false positives.

Thus, the final XGBoost model was implemented with a **threshold of 0.30** to maximize default detection.
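The metrics in the threshold table can be computed along the following lines. This is a sketch on a toy set of eight scores, with hypothetical helper names; the KS statistic is taken here as the largest difference between true-positive and false-positive rates over all cut-offs, which is one common definition.

```python
import numpy as np


def threshold_metrics(y_true, proba, threshold):
    """Precision, recall and F1 for the positive class at one cut-off."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)


def ks_statistic(y_true, proba):
    """Largest gap between TPR and FPR across all candidate cut-offs."""
    pos, neg = proba[y_true == 1], proba[y_true == 0]
    cutoffs = np.unique(proba)
    tpr = np.array([(pos >= t).mean() for t in cutoffs])
    fpr = np.array([(neg >= t).mean() for t in cutoffs])
    return float(np.max(np.abs(tpr - fpr)))


# Toy labels and scores, made up for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])

for t in (0.30, 0.50, 0.70):
    p, r, f1 = threshold_metrics(y_true, proba, t)
    print(f"threshold {t:.2f}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

ks = ks_statistic(y_true, proba)
print(f"KS = {ks:.3f}")
```

ROC AUC and the KS statistic are computed from the full score distribution rather than from a single cut-off, which is why both are identical in every row of the table.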

### Feature Importance Analysis

Using the built-in feature importance scores from the tuned XGBoost model, the top predictors of loan default were identified. These values reflect the relative contribution of each feature to the model's predictive performance (based on split gain/importance).

**Top Predictive Features (from XGBoost):**
1. `int_rate` — Most important predictor: Higher interest rates are strongly associated with higher credit risk and default probability, as they often reflect a lender's assessment of borrower risk.
2. `grade_B` — This feature represents the borrower's credit grade, which is inherently tied to loan risk profiling, underwriting decisions, and expected repayment behavior.
3. `term_60 months` — Longer-term loans may carry higher risk due to extended exposure and borrower volatility over time.
4. `home_ownership_MORTGAGE` — Homeownership status can be a proxy for financial stability. Borrowers who rent or carry mortgages may be at higher risk compared to outright homeowners.
5. `emp_length_Unknown` — Missing employment length data could signal weaker credit profiles or incomplete applications, which may correspond with elevated risk.
6. `fico_range_high` — As expected, credit score features were important in assessing risk, with lower FICO ranges linked to higher default likelihood.
7. `dti`, `loan_amnt` — Larger loan amounts and higher DTI ratios are associated with repayment challenges and credit strain.
8. `verification_status_Verified` — Income verification status appears to influence default probability, either through improved risk assessment or by signaling borrower transparency.
9. `purpose_small_business` — Loans for small business purposes tend to carry higher risk due to the volatility and failure rates of SMEs.
10. `installment` — Monthly repayment amounts reflect loan structure and affordability dynamics, which tie directly to repayment behavior.

Note: Other `grade` features not listed (e.g., `grade_C`, `grade_D`) have the same interpretation as `grade_B`, albeit with lower importance.
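Feature names such as `grade_B`, `term_60 months`, and `emp_length_Unknown` are typical of one-hot encoded categorical columns. The sketch below shows how such names can arise with `pandas.get_dummies`; the sample rows are made up, and the use of this particular encoder and of an "Unknown" fill value is an assumption rather than a description of the project's preprocessing.

```python
import pandas as pd

# Tiny hypothetical sample; column names follow the feature list above,
# but the rows are invented for illustration.
loans = pd.DataFrame(
    {
        "int_rate": [7.5, 13.2, 18.9],
        "grade": ["A", "B", "C"],
        "term": ["36 months", "60 months", "36 months"],
        "emp_length": ["10+ years", None, "2 years"],
    }
)

# Treat missing employment length as its own category instead of dropping rows.
loans["emp_length"] = loans["emp_length"].fillna("Unknown")

# One indicator column per category, named "<column>_<value>".
encoded = pd.get_dummies(loans, columns=["grade", "term", "emp_length"])
print(sorted(encoded.columns))
```

Because each grade becomes its own indicator column, importance is spread across several `grade_*` features rather than concentrated in one.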

# 5.3 Detailed Comparative Analysis of ANN, XGBoost, and Random Forest

### Table of Model Evaluation Metric Comparisons

| Model                   | Accuracy | Precision (Class 1) | Recall (Class 1) | F1 Score (Class 1) | ROC AUC |
|--------------------------|----------|---------------------|------------------|--------------------|---------|
| ANN (Thresh 0.45)         | 0.6680   | 0.3660              | 0.7760           | 0.4980             | 0.7750  |
| XGB (Thresh 0.30)         | 0.6607   | 0.2626              | 0.9256           | 0.4091             | 0.7317  |
| XGB (Thresh 0.50)         | 0.6607   | 0.3463              | 0.6815           | 0.4592             | 0.7317  |
| XGB (Thresh 0.70)         | 0.6607   | 0.4886              | 0.2862           | 0.3610             | 0.7317  |
| RF (Baseline)             | 0.8030   | 0.5800              | 0.1700           | 0.2600             | 0.7760  |
| RF (Class Weighted)       | 0.8020   | 0.5900              | 0.1500           | 0.2400             | 0.7790  |
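| RF (Thresh 0.2)           | 0.8020   | 0.3400              | 0.8100           | 0.4800             | 0.7760  |
| RF Weighted + Thresh 0.2  | 0.8020   | 0.3500              | 0.7900           | 0.4900             | 0.7790  |

Note: Accuracy changes with the decision threshold. The accuracy shown for XGBoost (0.6607) was measured at the 0.50 threshold (Section 5.2), and the threshold-tuned RF rows repeat the accuracy of the untuned models, so these entries should not be read as threshold-specific.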


### ANN

**Strengths:**
- Strong recall (0.776), which is vital in high-risk settings (e.g., catching defaulters).
- Balanced F1 score (0.498), meaning a good trade-off between precision and recall.
- High ROC AUC (0.775), indicating strong overall discriminatory power.

**Limitations:**
- Lower precision (0.366) than the Random Forest models at the default threshold (0.58–0.59) or XGBoost at threshold 0.70 (0.4886), meaning more false positives.
- Lower accuracy (0.668) compared to Random Forest models (0.802–0.803).


### XGBoost

**Insights:**
- At threshold 0.30, XGBoost delivers the highest recall (92.56%), but sacrifices precision heavily — useful when missing a defaulter is more costly than a false alarm.
- At threshold 0.50, it achieves the best balance between recall and precision, yielding a competitive F1 score (0.4592).
- At threshold 0.70, XGBoost becomes precision-focused, flagging fewer loans but with higher precision (0.4886), which suits false-positive-sensitive applications.

**Overall:**
- Flexible model when paired with threshold tuning.
- Lower AUC (0.7317) than the ANN (0.775) and Random Forest (0.776–0.779), suggesting weaker ranking performance.
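Unlike AUC, accuracy shifts with the decision threshold. Given precision, recall, and the share of defaults in the test set, the accuracy implied at each threshold can be reconstructed, as sketched below. The prevalence of 0.2075 is an assumption inferred from the class weights in Section 5.1, not a reported figure, and `implied_accuracy` is a hypothetical helper.

```python
def implied_accuracy(precision, recall, prevalence):
    """Accuracy implied by positive-class precision, recall and prevalence."""
    tp = recall * prevalence                 # true positives (share of all loans)
    fn = prevalence - tp                     # missed defaults
    fp = tp * (1.0 - precision) / precision  # false alarms
    return 1.0 - fp - fn


ASSUMED_PREVALENCE = 0.2075  # assumption: default share implied by the class weights

# (precision, recall) pairs for XGBoost, copied from the threshold table.
xgb_rows = {0.30: (0.2626, 0.9256), 0.50: (0.3463, 0.6815), 0.70: (0.4886, 0.2862)}

implied = {
    threshold: implied_accuracy(p, r, ASSUMED_PREVALENCE)
    for threshold, (p, r) in xgb_rows.items()
}
for threshold, acc in implied.items():
    print(f"threshold {threshold:.2f}: implied accuracy = {acc:.3f}")
```

Under that assumed prevalence, the 0.50 row comes out within about 0.01 of the reported 0.6607, while the implied accuracy is lower at 0.30 and higher at 0.70. This reflects the usual pattern on imbalanced data: a recall-oriented threshold trades accuracy for fewer missed defaults.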


### Random Forest (RF)

**Key Takeaways:**
- Baseline RF is biased toward the majority class (class 0), with poor recall (0.17).
- Class weighting alone does not significantly improve recall, indicating that rebalancing class importance isn't sufficient on its own.
- Threshold tuning drastically improves recall to 0.81, and F1 to 0.48.
- The best RF configuration by F1 is class weighting + threshold tuning (0.49), which is on par with the ANN (0.498) and offers competitive recall and precision.

**AUC**: Highest among all models (0.776–0.779), essentially tied with the ANN (0.775), confirming strong overall discrimination ability.


### Recommendation

To meet the coursework objective of minimising default, XGBoost (threshold 0.30) is the most suitable primary model. However, to ensure operational balance and reduce unnecessary loan rejections, it may be advisable to pair it with a Random Forest model in a dual-stage system, or use business-defined thresholds to modulate risk appetite over time.

This approach not only optimises for technical accuracy, but also aligns with the practical and regulatory demands of credit risk management.
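One possible shape for such a dual-stage system is sketched below. It is a hypothetical design, not something evaluated in these experiments: XGBoost at 0.30 acts as a high-recall screen, and the Random Forest at 0.2 decides whether a flagged loan is rejected outright or sent for manual review. The function name, the three outcome labels, and the routing rule are all assumptions.

```python
import numpy as np


def two_stage_decision(xgb_proba, rf_proba, xgb_threshold=0.30, rf_threshold=0.20):
    """Route each loan to 'approve', 'review' or 'reject' (hypothetical rule)."""
    xgb_flag = np.asarray(xgb_proba) >= xgb_threshold  # stage 1: high-recall screen
    rf_flag = np.asarray(rf_proba) >= rf_threshold     # stage 2: second opinion
    return np.where(
        xgb_flag & rf_flag,
        "reject",                                  # both models flag the loan
        np.where(xgb_flag, "review", "approve"),   # only the screen flags it
    )


# Made-up predicted default probabilities for three loans.
decisions = two_stage_decision([0.10, 0.40, 0.60], [0.50, 0.10, 0.30])
print(list(decisions))
```

In this sketch a loan is never rejected on the strength of the first-stage screen alone, which is one way to limit the false positives that a 0.30 threshold produces.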

| Goal                          | Best Model                  | Reasoning                                            |
|------------------------------- |----------------------------- |----------------------------------------------------- |
| Maximizing Recall              | XGBoost @ Threshold 0.30     | Catching nearly all positives (recall = 0.93)        |
| Maximizing Precision           | XGBoost @ Threshold 0.70     | Precision of 0.49 at recall 0.29; untuned RF reaches 0.58–0.59 but with recall of at most 0.17 |
| Balanced F1 (Fair Trade-off)    | ANN @ 0.45 or RF Weighted + Threshold 0.2 | Comparable F1 (0.498 vs. 0.49), strong recall and decent precision |
| Overall AUC Performance        | ANN or RF (any tuned)         | AUC (0.775-0.779), best class discrimination         |


# 5.4 Low-Risk Loan Portfolio Analysis: XGBoost and ANN Models

To isolate low-risk loans, we applied several selection strategies based on predicted default probability. This produced portfolios with varying risk levels, enabling a tailored strategy for managing loan defaults.

## XGBoost Model Analysis:

**Threshold-Based Portfolio Selection (< 0.2):**
- **Total Loans Selected**: 19,277  
- **Observed Default Rate**: 3.94%  

Loans in this portfolio are the least likely to default, with a very low predicted default probability. This portfolio has a significantly lower default rate than the overall dataset, making it ideal for minimising risk.


**Top N Loan Portfolio (Top 300 Loans)**
- **Total Loans Selected**: 300  
- **Observed Default Rate**: 1.33%  

By selecting the top 300 loans with the lowest predicted default probabilities, we achieve an even lower default rate (1.33%), reflecting a high concentration of quality loans. However, the number of loans is small, which may limit portfolio size and diversification.


**Top X% Loan Portfolio (Top 10% Loans)**
- **Total Loans Selected**: 15,134  
- **Observed Default Rate**: 3.36%  

This approach selects the 10% of loans with the lowest predicted default probabilities. It yields a large pool (15,134 loans) that is somewhat smaller than the 0.2-threshold portfolio (19,277 loans) and has a slightly lower default rate (3.36% vs. 3.94%).
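The three selection rules above can be sketched as follows. The probabilities and outcomes are simulated, so the printed rates will not match the figures reported here; the variable names and the `summarize` helper are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_loans = 10_000

# Simulated predicted default probabilities and outcomes consistent with them.
proba = rng.beta(2, 6, size=n_loans)
defaulted = rng.random(n_loans) < proba


def summarize(mask):
    """Number of selected loans and their observed default rate."""
    return int(mask.sum()), float(defaulted[mask].mean())


order = np.argsort(proba)  # indices sorted from safest to riskiest

# 1) Threshold rule: keep loans with predicted probability below 0.2.
threshold_mask = proba < 0.2

# 2) Top-N rule: keep the 300 loans with the lowest predicted probability.
top_n_mask = np.zeros(n_loans, dtype=bool)
top_n_mask[order[:300]] = True

# 3) Top-X% rule: keep the safest 10% of loans.
top_pct_mask = np.zeros(n_loans, dtype=bool)
top_pct_mask[order[: n_loans // 10]] = True

for label, mask in [
    ("threshold < 0.2", threshold_mask),
    ("top 300", top_n_mask),
    ("top 10%", top_pct_mask),
]:
    size, rate = summarize(mask)
    print(f"{label}: {size} loans, default rate {rate:.2%}")
```

The threshold rule fixes the risk cut-off and lets portfolio size vary, whereas the top-N and top-X% rules fix the size and let the implied cut-off vary.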


## ANN Model Analysis (Threshold < 0.45):

**Overall Low-Risk Loan Portfolio:**
- **Total Loans Selected:** 33,325
- **Observed Default Rate:** 19.67%
 
Using a threshold of 0.45 in the ANN model results in a larger selected portfolio but a relatively higher default rate compared to XGBoost’s stricter thresholding.

**Top 300 Loans:**
- **Observed Default Rate:** 10.00%
- **Average ROI:** 12.45%

**Top 500 Loans:**
- **Observed Default Rate:** 10.87%
- **Average ROI:** 11.60%

**Top 1000 Loans:**
- **Observed Default Rate:** 14.29%
- **Average ROI:** 10.70%
 
The ANN model identifies portfolios that maintain relatively strong return on investment (ROI) alongside moderate control of default risk. However, even in the best-performing top segments, default rates are notably higher compared to XGBoost's best selections.

## Strategic Insights

| Strategy                                | Risk Level | Portfolio Size | Diversification | Potential ROI       |
|-----------------------------------------|------------|----------------|-----------------|---------------------|
| XGBoost Threshold < 0.2                 | Very Low   | Large          | Good            | Not directly measured |
| XGBoost Top 300 Loans                   | Ultra Low  | Very Small     | Very Limited    | Not directly measured |
| XGBoost Top 10%                         | Very Low   | Large          | Good            | Not directly measured |
| ANN Top 300 Loans                       | Moderate   | Small          | Limited         | High (12.45%)        |
| ANN Top 500–1000 Loans                  | Moderate   | Moderate       | Moderate        | High (10.7%–11.6%)   |

## Final Recommendations
- For clients prioritizing ultra-low risk, XGBoost portfolios (especially Threshold < 0.2 or Top 300 selections) are preferable.
- For those balancing risk and returns, ANN portfolios offer higher ROIs at the cost of a moderately higher default rate.
- Top 10% XGBoost portfolios present a strong middle ground, offering both diversification and significantly reduced default risks compared to the general loan pool.