# **3. Methodology**

## **3.1 Random Forest Modeling**

We first developed a **Random Forest classifier** as an initial model for predicting loan defaults. Random Forest was selected for its strong baseline performance, robustness to overfitting, and ability to handle large feature sets without heavy tuning.

A loan was defined as a **default** if its status was `'Charged Off'` or `'Default'`. To prevent **data leakage**, variables reflecting post-loan outcomes (such as `loan_status`, `recoveries`, and pre-calculated return columns) were removed before modeling. Missing values were handled by dropping rows containing any null values, ensuring the model trained only on complete data without introducing imputation biases.

The Random Forest was trained in four variations:
- Baseline model
- Baseline model with threshold tuning
- Class-weighted model
- Class-weighted model with threshold tuning
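
A minimal sketch of how the four variations might be set up with scikit-learn. The synthetic data, the forest settings, and the helper `default_recall` are assumptions for illustration only; the 0.2 cutoff is the tuned threshold reported in Section 5.1, and `class_weight="balanced"` is one possible way to obtain class weights, not necessarily the one used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (about 20% positives); not the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

THRESHOLD = 0.2  # tuned cutoff; 0.5 is the usual default


def default_recall(class_weight, threshold):
    """Fit a forest and return recall on the positive (default) class."""
    model = RandomForestClassifier(
        n_estimators=100, class_weight=class_weight, random_state=0
    )
    model.fit(X_train, y_train)
    # Probability of the positive class, followed by a manual cutoff.
    proba = model.predict_proba(X_test)[:, 1]
    y_pred = (proba >= threshold).astype(int)
    return recall_score(y_test, y_pred)


variants = {
    "baseline": default_recall(None, 0.5),
    "baseline + threshold": default_recall(None, THRESHOLD),
    "class-weighted": default_recall("balanced", 0.5),
    "class-weighted + threshold": default_recall("balanced", THRESHOLD),
}
for name, recall in variants.items():
    print(f"{name}: recall = {recall:.2f}")
```

Because the threshold is applied to the same fitted probabilities, lowering it can only keep or raise recall for a given model, which is why threshold tuning and class weighting are treated as separate levers.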

The threshold-tuned baseline model achieved the highest default recall (**0.81**) among the Random Forest variants, at a default precision of 0.34. XGBoost later reached higher default recall (Section 5.2), so the remaining experiments focus on gradient boosting.

## **3.2 XGBoost Modeling**

Building on the experience from Random Forest, we adopted a more structured preprocessing pipeline to prepare the dataset for **XGBoost**, a model known for its strong performance on tabular, structured data and its built-in support for class weighting through `scale_pos_weight`.

The following steps were taken:
- **Prevention of Data Leakage**:
All variables related to post-loan outcomes (e.g., `total_pymnt`, `recoveries`, `last_pymnt_d`) were removed. This ensured the model only used information available at loan origination.
- **Handling Missing Values**:
Missing values were imputed rather than dropped:
    - **Continuous features** (e.g., `annual_inc`, `revol_util`) were filled using the median to preserve distribution robustness.
    - **Categorical features** (e.g., `home_ownership`, `purpose`) were imputed with a new **'Unknown'** category, allowing retention of all observations while signaling missingness.
- **Target Variable Encoding**:
The `loan_status` variable was binarized, mapping `"Charged Off"` and `"Default"` loans to 1 (default), and all other statuses to 0 (non-default).
- **Temporal Feature Engineering**:
A new feature, `cr_hist`, representing borrower credit history length, was created by calculating the difference in months between `issue_d` (loan issuance date) and `earliest_cr_line` (first credit line).
- **Categorical Encoding**:
Categorical variables were encoded using **one-hot encoding** (`drop_first=True`) to prevent multicollinearity and ensure model compatibility.
- **Type Consistency and Clean-Up**:
Continuous features were cast to float types, and column names were standardized to prevent errors during XGBoost model fitting.
- **Train-Test Splitting**:
A **stratified train-test split** was performed to maintain the default rate proportion across the training and testing datasets, which is crucial given the class imbalance.
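
The steps above can be sketched with pandas and scikit-learn. This is an illustration under stated assumptions, not the project's actual pipeline: the eight-row frame, its values, the ISO date strings, the `preprocess` helper, and the `LEAKY_COLS` list are all hypothetical, and only a few of the columns named above are included.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny hand-made frame standing in for the loan data; all values are invented.
raw = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Default",
                    "Fully Paid", "Fully Paid", "Charged Off", "Fully Paid"],
    "annual_inc": [50000, None, 72000, 38000, 91000, 64000, None, 45000],
    "home_ownership": ["RENT", "OWN", None, "RENT", "MORTGAGE", "OWN", "RENT", None],
    "issue_d": ["2015-06-01"] * 8,
    "earliest_cr_line": ["2005-06-01", "2010-01-01", "1999-03-01", "2012-06-01",
                         "2001-11-01", "2008-06-01", "2013-02-01", "2003-09-01"],
    "total_pymnt": [5200.0, 900.0, 7400.0, 300.0, 9900.0, 6100.0, 1200.0, 4700.0],
    "recoveries": [0.0, 150.0, 0.0, 80.0, 0.0, 0.0, 60.0, 0.0],
})

LEAKY_COLS = ["total_pymnt", "recoveries", "last_pymnt_d"]  # post-origination info


def preprocess(df):
    """Return (X, y) following the steps listed above."""
    df = df.drop(columns=LEAKY_COLS, errors="ignore")

    # Binary target: 1 for charged-off or defaulted loans, 0 otherwise.
    y = df["loan_status"].isin(["Charged Off", "Default"]).astype(int)
    df = df.drop(columns=["loan_status"])

    # Credit history length in whole months between first credit line and issue date.
    issue = pd.to_datetime(df["issue_d"])
    first = pd.to_datetime(df["earliest_cr_line"])
    df["cr_hist"] = (issue.dt.year - first.dt.year) * 12 + (issue.dt.month - first.dt.month)
    df = df.drop(columns=["issue_d", "earliest_cr_line"])

    # Median for numeric columns, an explicit 'Unknown' level for categoricals.
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(exclude="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median()).astype(float)
    df[cat_cols] = df[cat_cols].fillna("Unknown")

    X = pd.get_dummies(df, columns=list(cat_cols), drop_first=True, dtype=float)
    # XGBoost rejects feature names containing '[', ']' or '<'.
    X.columns = X.columns.str.replace(r"[\[\]<]", "_", regex=True)
    return X, y


X, y = preprocess(raw)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
print(X.columns.tolist())
```

The sketch imputes before splitting, mirroring the order in which the steps are listed; computing the medians on the training portion only would be a stricter alternative.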

## **3.3 Modeling Experiments**

Three versions of XGBoost were developed and compared:
- **Baseline XGBoost**: No class weighting, default parameters.
- **Weighted XGBoost**: Using `scale_pos_weight` to address class imbalance.
- **Weighted and Tuned XGBoost**: Class weighting plus hyperparameter tuning to refine model performance.

The final selected model was the **class-weighted and tuned XGBoost**, which achieved the best trade-off between **precision**, **recall**, and **AUC** among the three variants and, after threshold tuning, a higher default recall than any Random Forest variant (0.93 versus 0.81).
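
For the weighted variants, a common choice for `scale_pos_weight` is the ratio of negative to positive training examples. The sketch below assumes that convention; the report does not state which value was used, and the helper name and the label list are hypothetical.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common choice for this parameter."""
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Hypothetical training labels: 2 defaults among 10 loans.
y_train = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
spw = scale_pos_weight(y_train)
print(spw)  # 4.0

# The value would then be passed to the classifier, for example
# xgboost.XGBClassifier(scale_pos_weight=spw)
```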

# **5. Experiments and Results**

## **5.1 Random Forest Experiments**

To establish a benchmark, we first trained and evaluated several variations of Random Forest models. Given the imbalanced nature of the dataset, we experimented with both threshold tuning and class weighting strategies.

| **Model**                  | **Recall (Default)** | **Precision (Default)** | **Recall (Non-Default)** | **Precision (Non-Default)** | **AUC** |
|-----------------------------|----------------------|--------------------------|---------------------------|------------------------------|---------|
| Baseline                    | 0.17                 | 0.58                     | 0.97                      | 0.82                         | 0.776   |
| Threshold-Tuned Baseline    | **0.81**             | 0.34                     | 0.59                      | 0.92                         | 0.776   |
| Class-Weighted              | 0.15                 | 0.59                     | 0.97                      | 0.81                         | 0.779   |
| Threshold-Tuned Weighted    | **0.79**             | 0.35                     | 0.62                      | 0.92                         | 0.779   |

Initially, the baseline model identified non-default loans well (recall 0.97) but struggled to capture defaults (recall 0.17). Applying threshold tuning (lowering the classification threshold from the default 0.5 to 0.2) substantially improved default recall from 0.17 to 0.81, though default precision fell from 0.58 to 0.34 and non-default recall from 0.97 to 0.59.

Class weighting was also applied by assigning computed weights (approximately 2.41 for defaults and 0.63 for non-defaults), but weighting alone did not improve default recall (0.15 versus 0.17 for the baseline). Threshold tuning again had the greatest impact, raising default recall for the weighted model to 0.79.
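
The reported weights are consistent with the widely used "balanced" heuristic, `n_samples / (n_classes * class_count)`. The sketch below assumes that formula; the class counts are hypothetical, chosen only because they reproduce the weights quoted above, and the roughly 21% default rate they imply is an inference rather than a reported figure.

```python
def balanced_class_weights(n_negative, n_positive):
    """Weights from the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n_samples = n_negative + n_positive
    n_classes = 2
    return {
        0: n_samples / (n_classes * n_negative),
        1: n_samples / (n_classes * n_positive),
    }


# Hypothetical counts giving a 20.75% default rate.
weights = balanced_class_weights(n_negative=7925, n_positive=2075)
print({label: round(w, 2) for label, w in weights.items()})  # {0: 0.63, 1: 2.41}
```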

We selected the **threshold-tuned baseline Random Forest** as the final version among Random Forest models due to its slightly better default recall (0.81 versus 0.79 for the threshold-tuned weighted model). However, limitations remained, including a high rate of false positives (default precision of 0.34) and the model's lack of transparency.

Given these constraints, we moved toward exploring **XGBoost**, a more advanced modeling approach, to further improve default prediction performance.


## **5.2 XGBoost Experiments**

Building on lessons from Random Forest, we transitioned to **XGBoost**, a gradient boosting framework designed to deliver stronger predictive performance for structured data.

### Model Variants Compared:
- **Baseline XGBoost** (no class weighting, default hyperparameters)
- **Class-Weighted XGBoost** (using `scale_pos_weight`)
- **Class-Weighted and Tuned XGBoost** (hyperparameter tuning + class weighting)

The final selected model was the **Class-Weighted and Tuned XGBoost**, achieving the following on the test set at the default 0.50 threshold:
- **Accuracy**: 0.6607
- **F1 Score**: 0.4592
- **ROC AUC**: 0.7317

This model best balanced precision, recall, and AUC.


### Overfitting Assessment

To evaluate model stability, ROC AUC scores were compared across training, test, and cross-validation sets:

| Model                     | Train AUC | Test AUC | 3-Fold CV AUC (Train) |
|----------------------------|-----------|----------|-----------------------|
| Baseline XGBoost           | 0.7285    | 0.7278   | 0.7262 ± 0.0010        |
| Tuned XGBoost (final)      | 0.7444    | 0.7317   | 0.7295 ± 0.0011        |

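Both models showed minimal overfitting: the gap between train and test AUC was 0.0007 for the baseline and 0.0127 for the tuned model, and the cross-validated AUC stayed within about 0.002 of the test AUC in both cases.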
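
A comparison of this kind can be sketched as follows. To keep the example self-contained, scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost and the data are synthetic, so the printed numbers are not expected to match the table; the model settings are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced data; a stand-in for the loan features.
X, y = make_classification(
    n_samples=1500, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# scikit-learn's gradient boosting is used here in place of XGBoost.
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)

# 3-fold cross-validated AUC computed on the training portion only.
cv_auc = cross_val_score(model, X_train, y_train, cv=3, scoring="roc_auc")

model.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"train AUC      = {train_auc:.4f}")
print(f"test AUC       = {test_auc:.4f}")
print(f"3-fold CV AUC  = {cv_auc.mean():.4f} +/- {cv_auc.std():.4f}")
print(f"train-test gap = {train_auc - test_auc:.4f}")
```

A small train-test gap together with a cross-validated score close to the test score is the pattern read as "minimal overfitting" in the table.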


### Learning Curve Analysis

- **Training AUC** started high (~0.84) and gradually declined as more data was introduced, converging toward ~0.75.
- **Validation AUC** steadily increased from ~0.718 to ~0.729.
- The gap between the training and validation curves narrowed to roughly 0.02, indicating only mild overfitting.

This suggests that the model generalizes well and that additional data offers diminishing returns beyond a certain point.
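
Curves like these can be produced with scikit-learn's `learning_curve`. The sketch below is an assumption-laden illustration: it uses synthetic data and `GradientBoostingClassifier` in place of XGBoost, and the training-size grid and fold count are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic imbalanced data; not the loan dataset.
X, y = make_classification(
    n_samples=1500, n_features=10, weights=[0.8, 0.2], random_state=0
)
# scikit-learn's gradient boosting stands in for XGBoost.
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)

# AUC on the training subset and on the held-out fold for growing training sizes.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=3,
    scoring="roc_auc",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={n:5d}  train AUC={tr:.3f}  val AUC={va:.3f}  gap={tr - va:.3f}")
```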


### Threshold Tuning: Precision–Recall Trade-Off

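Because minimizing missed defaults was the priority, recall was emphasized during threshold tuning. Three thresholds were evaluated; ROC AUC and the KS statistic are computed from the predicted probabilities rather than from thresholded labels, so they are identical across rows: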

| Threshold | Precision | Recall | F1 Score | ROC AUC | KS Statistic |
|-----------|-----------|--------|----------|---------|--------------|
| 0.30      | 0.2626    | 0.9256 | 0.4091   | 0.7317  | 0.3373       |
| 0.50      | 0.3463    | 0.6815 | 0.4592   | 0.7317  | 0.3373       |
| 0.70      | 0.4886    | 0.2862 | 0.3610   | 0.7317  | 0.3373       |
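
The F1 column can be checked directly from the precision and recall columns, since F1 is their harmonic mean. The short script below recomputes it from the values in the table above; the function name is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall, reported F1) copied from the table above.
rows = [
    (0.30, 0.2626, 0.9256, 0.4091),
    (0.50, 0.3463, 0.6815, 0.4592),
    (0.70, 0.4886, 0.2862, 0.3610),
]
for threshold, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"threshold {threshold:.2f}: F1 = {computed:.4f} (reported {reported:.4f})")
```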

At a **threshold of 0.30**, recall reached **92.56%**, making it the most appropriate threshold for catching as many risky loans as possible, even though precision fell to 26.26% and false positives increased accordingly.

Thus, the final XGBoost model was implemented with a **threshold of 0.30** to maximize default detection.
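
The threshold sweep and the KS statistic can be illustrated in plain Python. The labels and scores below are invented, the helper names are hypothetical, and KS is assumed to follow the usual credit-scoring definition (the largest gap between true-positive and false-positive rate over all cutoffs), which the report does not spell out.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when predicting 1 for scores >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, scores, threshold):
    """Precision and recall for the positive class at a given cutoff."""
    tp, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def ks_statistic(y_true, scores):
    """Largest gap between true-positive and false-positive rate over all cutoffs."""
    best = 0.0
    for cutoff in sorted(set(scores)):
        tp, fp, fn, tn = confusion_counts(y_true, scores, cutoff)
        best = max(best, tp / (tp + fn) - fp / (fp + tn))
    return best


# Invented labels and predicted default probabilities for ten loans.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.75, 0.55, 0.60, 0.35, 0.40, 0.32, 0.20, 0.15, 0.05]

for threshold in (0.30, 0.50, 0.70):
    precision, recall = precision_recall(y_true, scores, threshold)
    print(f"threshold {threshold:.2f}: precision={precision:.2f} recall={recall:.2f}")
print(f"KS = {ks_statistic(y_true, scores):.2f}")
```

As in the table, lowering the cutoff raises recall and lowers precision, while KS depends only on the scores and stays the same whichever threshold is chosen.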
