# 3. Methodology

## 3.1 Random Forest Modeling Approach

### Objective:
Random Forest was selected as the initial model to establish a strong baseline due to its robustness, ability to handle mixed data types, and good out-of-the-box performance without heavy hyperparameter tuning.

### Data Preprocessing:

| Step                        | Details                                                                                                  |
|------------------------------|----------------------------------------------------------------------------------------------------------|
| **Definition of Default**    | Loans labeled as 1 if `loan_status` was 'Charged Off' or 'Default'; otherwise labeled as 0.               |
| **Preventing Data Leakage**  | Post-loan variables (`recoveries`, `total_pymnt`, `last_pymnt_d`, etc.) were removed to ensure only pre-loan information was used for training.                    |
| **Handling Missing Values**  | Rows with missing values were dropped to maintain dataset consistency and avoid biases from imputation.                                  |
| **Feature Engineering**      | - Converted `earliest_cr_line` to datetime format. <br> - Created a new feature `credit_history_years` to represent the borrower’s credit history length. <br> - Original `earliest_cr_line` column was dropped. |
| **Categorical Encoding**     | One-hot encoding on categorical features (`home_ownership`, `grade`, `emp_length`, `purpose`, `verification_status`, `term`) with `drop_first=True`. |
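
The preprocessing steps in the table above can be sketched in pandas as follows. This is a minimal illustration rather than the exact pipeline used: the target column name `default`, the `reference_year` against which credit history is measured, and the `"%b-%Y"` date format (for example `Jan-2005`) are assumptions not stated in this section, and only the three post-loan columns named explicitly are listed.

```python
import pandas as pd

# Post-loan columns named in the text; the "etc." in the table is not expanded here.
LEAKAGE_COLS = ["recoveries", "total_pymnt", "last_pymnt_d"]
CATEGORICAL_COLS = ["home_ownership", "grade", "emp_length",
                    "purpose", "verification_status", "term"]


def preprocess_for_random_forest(df: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Apply the Random Forest preprocessing steps (illustrative sketch)."""
    out = df.copy()

    # Binary target: 1 for 'Charged Off' or 'Default', 0 otherwise.
    # The column name "default" is hypothetical.
    out["default"] = out["loan_status"].isin(["Charged Off", "Default"]).astype(int)

    # Remove the raw status and the post-loan (leakage) columns.
    out = out.drop(columns=["loan_status"] + LEAKAGE_COLS, errors="ignore")

    # Drop every row that still contains a missing value.
    out = out.dropna()

    # Credit history length in years (date format and reference year are assumed).
    opened = pd.to_datetime(out["earliest_cr_line"], format="%b-%Y")
    out["credit_history_years"] = reference_year - opened.dt.year
    out = out.drop(columns=["earliest_cr_line"])

    # One-hot encode the categorical features, dropping the first level of each.
    present = [c for c in CATEGORICAL_COLS if c in out.columns]
    return pd.get_dummies(out, columns=present, drop_first=True)
```

Dropping rows before encoding keeps the dummy columns free of missing-value levels, at the cost of discarding any loan with an incomplete record.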

### Model Variations:

Loan default is a rare event in our dataset, so a baseline Random Forest tended to predict almost all loans as non-defaults, achieving high overall accuracy but missing most actual defaults. To correct this imbalance, we applied two techniques:

1. Class Weighting

We increased the penalty for misclassifying defaults during training by assigning a higher weight to the default class (and a lower weight to non-defaults). This is intended to make the Random Forest pay more attention to the minority class. In our results, weighting alone left default recall at the standard 0.5 threshold essentially unchanged (0.15 versus 0.17 for the baseline).

2. Threshold Tuning

Beyond altering how the model is trained, we also adjusted the probability cutoff used to declare a loan “default.” Instead of the standard 0.5 threshold, we lowered it to 0.2 for both the baseline and the class-weighted models. This accepts more false positives (safe loans flagged as risky) in exchange for a much higher true positive rate, prioritising recall over precision. That trade-off suits the investment strategy's goal of minimising undetected defaults, which carry greater financial risk than a few extra false alarms.
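
As a minimal sketch of what lowering the cutoff does, the snippet below applies two thresholds to the same predicted default probabilities. The function name and the example probabilities are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def classify_with_threshold(default_probs: Sequence[float], threshold: float = 0.5) -> list[int]:
    """Label a loan as default (1) when its predicted probability meets the cutoff."""
    return [1 if p >= threshold else 0 for p in default_probs]


# Hypothetical predicted default probabilities for six loans.
probs = [0.05, 0.18, 0.22, 0.35, 0.55, 0.80]

standard = classify_with_threshold(probs)       # [0, 0, 0, 0, 1, 1]
lowered = classify_with_threshold(probs, 0.2)   # [0, 0, 1, 1, 1, 1]
```

With a scikit-learn classifier, the probabilities would typically come from `model.predict_proba(X)[:, 1]`. The model itself is unchanged; only the rule that converts its scores into labels moves.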

| Model Variation                            | Description                                                                                                                                               |
|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Baseline Random Forest**                  | Random Forest trained with default settings, no adjustments to threshold or class balance.                                                               |
| **Threshold-Tuned Random Forest**           | Lowered decision threshold to 0.2 to prioritise identifying defaults, increasing recall at the cost of more false positives.                             |
| **Class-Weighted Random Forest**            | Applied higher weight to defaults (2.41) and lower weight to non-defaults (0.63) during model training to address class imbalance.                        |
| **Threshold-Tuned Class-Weighted Random Forest** | Combined class weighting with threshold adjustment (threshold 0.2) to maximise recall while controlling the precision-recall trade-off.                |
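
The weights of 2.41 and 0.63 are consistent with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, at a default rate of roughly 21%. The section does not state how the weights were derived, so the sketch below is an assumption, and the class counts are hypothetical values chosen to reproduce the reported numbers.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical class mix: 2,075 defaults out of 10,000 loans.
labels = [1] * 2075 + [0] * 7925
weights = balanced_class_weights(labels)

print(round(weights[1], 2))  # 2.41
print(round(weights[0], 2))  # 0.63
```

A dictionary like this could be passed as the `class_weight` argument of scikit-learn's `RandomForestClassifier`.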

### Evaluation and Decision:

| Model                     | Recall (Default) | Precision (Default) | Recall (Non-Default) | Precision (Non-Default) | AUC   |
|---------------------------|------------------|----------------------|----------------------|-------------------------|-------|
| Baseline                  | 0.17             | 0.58                 | 0.97                 | 0.82                    | 0.776 |
| Threshold-Tuned Baseline  | 0.81             | 0.34                 | 0.59                 | 0.92                    | 0.776 |
| Class-Weighted            | 0.15             | 0.59                 | 0.97                 | 0.81                    | 0.779 |
| Threshold-Tuned Weighted  | 0.79             | 0.35                 | 0.62                 | 0.92                    | 0.779 |

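Although threshold tuning substantially improved default recall (from 0.17 to 0.81 for the baseline), the overall discriminative performance of the Random Forest models plateaued: AUC stayed between 0.776 and 0.779 across all four variants. Their ability to distinguish defaults from non-defaults remained limited, motivating exploration of more advanced ensemble methods.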
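
The table shows identical AUC values for each model before and after threshold tuning. That is expected, because AUC depends only on how the model ranks loans, not on the cutoff. The sketch below illustrates this with a pairwise AUC computation on hypothetical scores; the function names and data are illustrative only.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (default, non-default) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(y_true, scores, threshold):
    """Share of actual defaults flagged at the given cutoff."""
    flagged = [s >= threshold for s in scores]
    true_pos = sum(f and y == 1 for f, y in zip(flagged, y_true))
    return true_pos / sum(y == 1 for y in y_true)


# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.25, 0.60, 0.40, 0.90]

print(round(roc_auc(y_true, scores), 3))           # 0.667, no threshold involved
print(round(recall_at(y_true, scores, 0.5), 3))    # 0.333
print(round(recall_at(y_true, scores, 0.2), 3))    # 1.0
```

Recall changes with the cutoff while AUC does not, which is why threshold tuning alone cannot lift a model's AUC.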

## 3.2 XGBoost Modeling Approach

### Objective:
XGBoost was implemented following Random Forest to achieve higher recall, better handling of class imbalance, and improved scalability.

### Data Preprocessing:

This pipeline included some additional preprocessing steps that were not explored for Random Forest, notably imputation of missing values in place of row deletion.

| Step                        | Details                                                                                                  |
|------------------------------|----------------------------------------------------------------------------------------------------------|
| **Definition of Default**    | Loans labeled as 1 if `loan_status` was 'Charged Off' or 'Default'; otherwise labeled as 0.               |
| **Preventing Data Leakage**  | Post-loan variables were excluded.                                                                      |
| **Handling Missing Values**  | - Continuous features imputed with the median. <br> - Categorical features imputed with "Unknown".        |
| **Target Label Encoding**    | Defaults mapped to 1; others mapped to 0.                                                                |
| **Feature Engineering**      | Created `cr_hist` by computing the difference in months between the loan issue date `issue_d` and the borrower's earliest credit line `earliest_cr_line`.                           |
| **Categorical Encoding**     | One-hot encoding applied, dropping the first category to avoid multicollinearity.                        |
| **Type Consistency & Cleaning** | Columns were cleaned, continuous features enforced as floats, and stratified sampling was used for train-test split. |
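
The imputation and `cr_hist` steps might look like the following in pandas. This is a sketch under stated assumptions: the `"%b-%Y"` date format (for example `Dec-2015`), the function names, and the explicit column lists passed by the caller are not specified in this section.

```python
import pandas as pd


def add_cr_hist(df: pd.DataFrame, date_format: str = "%b-%Y") -> pd.DataFrame:
    """Add `cr_hist`: months between `earliest_cr_line` and `issue_d` (date format assumed)."""
    out = df.copy()
    issue = pd.to_datetime(out["issue_d"], format=date_format)
    earliest = pd.to_datetime(out["earliest_cr_line"], format=date_format)
    months = (issue.dt.year - earliest.dt.year) * 12 + (issue.dt.month - earliest.dt.month)
    out["cr_hist"] = months.astype(float)
    return out


def impute_missing(df: pd.DataFrame, continuous_cols, categorical_cols) -> pd.DataFrame:
    """Median-impute continuous columns; fill categorical columns with "Unknown"."""
    out = df.copy()
    continuous_cols = list(continuous_cols)
    categorical_cols = list(categorical_cols)
    # fillna with a Series of medians fills each column with its own median.
    out[continuous_cols] = out[continuous_cols].fillna(out[continuous_cols].median())
    out[categorical_cols] = out[categorical_cols].fillna("Unknown")
    return out
```

To avoid leaking information from the test set, the medians would typically be computed on the training split only and then reused for the test split.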


### Model Variations:

| Model Variation                       | Description                                                                                                                                      |
|----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| **Baseline XGBoost**                   | Trained with default parameters and no class weighting; used as a starting benchmark.                                                            |
| **Weighted XGBoost**                   | Introduced `scale_pos_weight` to adjust for class imbalance by giving more importance to the minority (default) class during training.           |
| **Weighted and Tuned XGBoost**          | Applied class weighting and tuned hyperparameters (e.g., learning rate, depth, subsample) to optimise performance across AUC, precision, and recall. |
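
A common starting point for `scale_pos_weight` is the ratio of negative to positive training examples. The section does not report the value actually used, so the sketch below is only illustrative, and the class counts are hypothetical.

```python
def scale_pos_weight_from_labels(labels) -> float:
    """Ratio of non-default (0) to default (1) examples in the training labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("No positive (default) examples in the training labels.")
    return n_neg / n_pos


# Hypothetical training labels: 2,075 defaults and 7,925 non-defaults.
train_labels = [1] * 2075 + [0] * 7925
spw = scale_pos_weight_from_labels(train_labels)

print(round(spw, 2))  # 3.82
```

The result could then be supplied to the model, for example as `xgboost.XGBClassifier(scale_pos_weight=spw)`, and treated as one more hyperparameter to tune.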

### Evaluation and Final Model Selection:

#### Threshold Tuning

Similar to the approach used in Random Forest, threshold tuning was applied to the class-weighted and hyperparameter-optimised XGBoost model to prioritise the minimisation of undetected defaults. The model's performance across different threshold levels is summarised below:

| Threshold | Precision | Recall  | F1 Score | ROC AUC |
|-----------|-----------|---------|----------|---------|
| 0.30      | 0.2626    | 0.9256  | 0.4091   | 0.7317  |
| 0.50      | 0.3463    | 0.6815  | 0.4592   | 0.7317  |
| 0.70      | 0.4886    | 0.2862  | 0.3610   | 0.7317  |

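At a threshold of 0.30, the model achieves a recall of 92.56%, substantially reducing the likelihood of missing high-risk borrowers. This exceeds the recall of the threshold-tuned Random Forest models (0.79 to 0.81 at a threshold of 0.2), although it comes with lower precision (0.26 versus 0.34 to 0.35), and the reported ROC AUC (0.7317) is below that of the Random Forest models (0.776 to 0.779). For the objective of minimising undetected defaults, XGBoost at this operating point is therefore the more effective tool.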
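
As a quick consistency check on the threshold table, the F1 scores can be recomputed from the reported precision and recall using F1 = 2PR / (P + R). The helper name is hypothetical; the numbers are taken directly from the table.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# threshold: (precision, recall, reported F1), copied from the table above.
reported = {
    0.30: (0.2626, 0.9256, 0.4091),
    0.50: (0.3463, 0.6815, 0.4592),
    0.70: (0.4886, 0.2862, 0.3610),
}

for threshold, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    print(f"{threshold:.2f}: reported {f1:.4f}, recomputed {recomputed:.4f}")
```

The recomputed values agree with the table to four decimal places. F1 peaks at the 0.50 threshold, so a cutoff of 0.30 reflects a deliberate preference for recall over F1.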

#### Final Model Selection

Based on this evaluation, the **Weighted and Tuned XGBoost** model with a **threshold of 0.30** was selected as the final model. Among the thresholds evaluated, it achieved the highest recall (0.9256) while retaining an F1 score of 0.4091, aligning best with the strategic objective of minimising default risk in the investment portfolio.
