# 3. Methodology

## 3.1 Random Forest Modeling Approach

We initially developed a **Random Forest classifier** as a baseline model for predicting loan defaults. Random Forest was selected for its ability to handle mixed data types, robustness to noise, and effectiveness without heavy hyperparameter tuning.

To ensure model validity:

- **Definition of Default**: Loans were labeled as default (`1`) if their `loan_status` was `'Charged Off'` or `'Default'`; otherwise labeled non-default (`0`).
- **Preventing Data Leakage**: Post-loan outcome variables (`recoveries`, `total_pymnt`, `last_pymnt_d`, etc.) were removed to ensure only pre-loan information was used for training.
- **Handling Missing Values**: Rows with missing values were dropped to maintain dataset consistency and avoid biases from imputation.
- **Feature Engineering**:
  - `earliest_cr_line` was converted to datetime format.
  - A new feature `credit_history_years` was created to represent the borrower’s credit history length.
  - The original `earliest_cr_line` column was dropped.
- **Categorical Encoding**: One-hot encoding was applied to the known categorical columns (`home_ownership`, `grade`, `emp_length`, `purpose`, `verification_status`, `term`) with `drop_first=True`.
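The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the toy data, the `preprocess` helper, and the `REFERENCE_DATE` used to compute `credit_history_years` are all assumptions, since the text does not state which date the credit history is measured against.

```python
import pandas as pd

# Hypothetical toy data; column names follow the text, values are made up.
raw = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Default", "Current"],
    "loan_amnt": [10000.0, 5000.0, 7500.0, None],
    "earliest_cr_line": ["Jan-2005", "Mar-2010", "Jul-1999", "Dec-2012"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "term": ["36 months", "60 months", "36 months", "36 months"],
    "recoveries": [0.0, 120.5, 300.0, 0.0],
    "total_pymnt": [11000.0, 2100.0, 900.0, 400.0],
})

LEAKAGE_COLS = ["recoveries", "total_pymnt", "last_pymnt_d"]
CATEGORICAL_COLS = ["home_ownership", "term"]  # subset of the columns in the text
REFERENCE_DATE = pd.Timestamp("2020-01-01")    # assumption: not given in the text


def preprocess(df, reference_date=REFERENCE_DATE):
    out = df.copy()
    # Label defaults: 1 for 'Charged Off' or 'Default', otherwise 0.
    out["default"] = out["loan_status"].isin(["Charged Off", "Default"]).astype(int)
    # Remove the raw status and any post-loan (leakage) columns that are present.
    present_leakage = [c for c in LEAKAGE_COLS if c in out.columns]
    out = out.drop(columns=["loan_status"] + present_leakage)
    # Drop rows with missing values.
    out = out.dropna()
    # Convert earliest_cr_line to datetime and derive the credit history length.
    cr_line = pd.to_datetime(out["earliest_cr_line"], format="%b-%Y")
    out["credit_history_years"] = (reference_date - cr_line).dt.days / 365.25
    out = out.drop(columns=["earliest_cr_line"])
    # One-hot encode the categorical columns, dropping the first level of each.
    return pd.get_dummies(out, columns=CATEGORICAL_COLS, drop_first=True)


features = preprocess(raw)
print(features.columns.tolist())
```

The date format `"%b-%Y"` is also an assumption about how `earliest_cr_line` is stored; it would need adjusting if the raw column uses a different layout.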

We tested several variations:

- **Baseline Random Forest** with default settings.
- **Threshold-Tuned Random Forest**: Threshold tuning adjusts the classifier's decision boundary. Instead of classifying an observation as "default" only if its predicted probability exceeds 0.5, the threshold is lowered to prioritize identifying defaults, even at the expense of more false positives. Here, a threshold of 0.2 was selected to raise recall for defaults.
- **Class-Weighted Random Forest**, applying a higher weight to defaults (2.41) and a lower weight to non-defaults (0.63) during model training.
- **Threshold-Tuned Class-Weighted Random Forest**, combining class weighting and threshold adjustment.
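As an illustration of how class weighting and threshold adjustment combine, the sketch below trains a weighted Random Forest on synthetic data and applies two cutoffs to the predicted default probability. The data and settings are hypothetical. The text does not say how the reported weights were derived, so scikit-learn's inverse-frequency ("balanced") weighting is used here purely as an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced stand-in for the loan data (about 20% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weight = {0: weights[0], 1: weights[1]}

model = RandomForestClassifier(
    n_estimators=100, class_weight=class_weight, random_state=0
)
model.fit(X_train, y_train)

# Predicted probability of the positive ("default") class.
proba_default = model.predict_proba(X_test)[:, 1]

# Same model, two decision thresholds.
pred_at_050 = (proba_default >= 0.5).astype(int)
pred_at_020 = (proba_default >= 0.2).astype(int)

recall_050 = recall_score(y_test, pred_at_050)
recall_020 = recall_score(y_test, pred_at_020)
print(f"recall @0.5 = {recall_050:.2f}, recall @0.2 = {recall_020:.2f}")
```

Lowering the threshold can only add predicted positives, so recall for the default class never decreases; the cost shows up as lower precision.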


### Random Forest Model Results

| Model                     | Recall (Default) | Precision (Default) | Recall (Non-Default) | Precision (Non-Default) | AUC   |
|---------------------------|------------------|----------------------|----------------------|-------------------------|-------|
| Baseline                  | 0.17             | 0.58                 | 0.97                 | 0.82                    | 0.776 |
| Threshold-Tuned Baseline  | 0.81             | 0.34                 | 0.59                 | 0.92                    | 0.776 |
| Class-Weighted            | 0.15             | 0.59                 | 0.97                 | 0.81                    | 0.779 |
| Threshold-Tuned Weighted  | 0.79             | 0.35                 | 0.62                 | 0.92                    | 0.779 |

> Performance evaluation focused particularly on **recall for the default class**, given the business objective of minimizing undetected risky loans. Threshold tuning had the greatest impact, increasing recall from **0.17** to **0.81** for the baseline model at a threshold of **0.2**.

Ultimately, the **Threshold-Tuned Baseline Random Forest** was selected as the best-performing Random Forest version. However, recognizing its limitations (e.g., increased false positives and lack of interpretability), a more advanced modeling approach was pursued.

## 3.2 XGBoost Modeling Approach

Following Random Forest modeling, we implemented **XGBoost**, a more sophisticated ensemble learning method designed for high predictive performance, scalability, and better handling of imbalanced data.

Preprocessing steps included:

- **Preventing Data Leakage**: Post-loan variables were excluded.
- **Handling Missing Values**:  
  - Continuous features: imputed with the **median**  
  - Categorical features: imputed with **"Unknown"**
- **Target Encoding**: Defaults (`'Charged Off'` or `'Default'`) mapped to `1`, others to `0`.
- **Feature Engineering**: Created `cr_hist` by computing the difference in months between the loan issue date (`issue_d`) and the borrower's earliest credit line (`earliest_cr_line`).
- **Categorical Encoding**: One-hot encoding was used, dropping the first category to avoid multicollinearity.
- **Type Consistency and Cleaning**: Columns were cleaned, continuous features enforced as floats, and stratified sampling was used for train-test split.
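A rough sketch of these preprocessing steps is shown below. The toy data, the `prepare` helper, the `"%b-%Y"` date format, and the decision to drop the raw date columns after deriving `cr_hist` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; column names follow the text, values are made up.
df = pd.DataFrame({
    "annual_inc": [50000.0, None, 72000.0, 41000.0, 65000.0, 58000.0, None, 90000.0],
    "home_ownership": ["RENT", None, "OWN", "RENT", "MORTGAGE", "OWN", "RENT", None],
    "issue_d": ["Dec-2015"] * 8,
    "earliest_cr_line": ["Dec-2005", "Jun-2010", "Jan-2000", "Dec-2014",
                         "Mar-2008", "Dec-2010", "Jul-2012", "Dec-1995"],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Default",
                    "Fully Paid", "Fully Paid", "Charged Off", "Fully Paid"],
})

NUM_COLS = ["annual_inc", "cr_hist"]
CAT_COLS = ["home_ownership"]


def prepare(frame):
    out = frame.copy()
    # Target encoding: 'Charged Off' or 'Default' -> 1, everything else -> 0.
    out["target"] = out["loan_status"].isin(["Charged Off", "Default"]).astype(int)
    # Credit history in months between issue date and earliest credit line.
    issue = pd.to_datetime(out["issue_d"], format="%b-%Y")
    first = pd.to_datetime(out["earliest_cr_line"], format="%b-%Y")
    out["cr_hist"] = (issue.dt.year - first.dt.year) * 12 + (issue.dt.month - first.dt.month)
    out = out.drop(columns=["loan_status", "issue_d", "earliest_cr_line"])
    # Median imputation for continuous features, enforced as floats.
    out[NUM_COLS] = out[NUM_COLS].fillna(out[NUM_COLS].median()).astype(float)
    # "Unknown" for missing categorical values.
    out[CAT_COLS] = out[CAT_COLS].fillna("Unknown")
    # One-hot encoding, dropping the first category of each column.
    return pd.get_dummies(out, columns=CAT_COLS, drop_first=True)


# Repeat the toy rows so a stratified split has enough samples per class.
prepared = prepare(pd.concat([df] * 5, ignore_index=True))
X = prepared.drop(columns=["target"])
y = prepared["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y.mean(), y_train.mean(), y_test.mean())
```

For simplicity the medians here are computed on the full frame; computing them on the training split only is a common alternative that keeps test-set information out of the imputation.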


We trained three XGBoost variants:

- **Baseline XGBoost**: No class weighting, default parameters.
- **Weighted XGBoost**: Using `scale_pos_weight` to address class imbalance.
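- **Weighted and Tuned XGBoost**: Further optimizing hyperparameters to refine model performance.

The final selected model was the **class-weighted and tuned XGBoost**, which achieved the best trade-off between **precision**, **recall**, and **AUC** among the three variants. At its tuned threshold it also reached a higher recall for defaults than the best Random Forest (0.93 versus 0.81), although the reported Random Forest AUC (0.776) is higher than the XGBoost test AUC (0.7317). The two models were evaluated on differently preprocessed data (rows with missing values dropped versus imputed), so these AUC values are not directly comparable.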
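For reference, a common starting value for `scale_pos_weight` is the ratio of negative to positive training examples. The sketch below computes that ratio for a hypothetical label vector; the 79/21 split, the helper name, and the other entries in `params` are illustrative assumptions, and the value actually used in this project is not stated.

```python
from collections import Counter


def negative_to_positive_ratio(labels):
    """Return count(label == 0) / count(label == 1)."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("labels contain no positive examples")
    return counts[0] / counts[1]


# Hypothetical training labels: 79 non-defaults, 21 defaults.
y_train = [0] * 79 + [1] * 21
spw = negative_to_positive_ratio(y_train)

# Parameters that could be passed on, e.g. xgboost.XGBClassifier(**params).
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": spw,
}
print(round(spw, 2))
```

With this setting, errors on the positive (default) class are weighted more heavily during training, which tends to raise recall for defaults at the cost of precision.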

## 3.3 Model Validation and Final Model Selection (XGBoost)

This section specifically discusses the validation and final selection of the XGBoost models.

#### Overfitting Assessment  
Model stability and generalizability were assessed by comparing training, test, and 3-fold cross-validation ROC AUC scores:

| Model                      | Train AUC | Test AUC | 3-Fold CV AUC (Train)    |
|----------------------------|-----------|----------|--------------------------|
| Baseline XGBoost           | 0.7285    | 0.7278   | 0.7262 ± 0.0010          |
| Tuned XGBoost (Final)      | 0.7444    | 0.7317   | 0.7295 ± 0.0011          |

> Both versions show closely matched train and test scores (train-test AUC gaps of 0.0007 for the baseline and 0.0127 for the tuned model), indicating minimal overfitting.
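A comparison of this kind can be reproduced along the following lines. This is a sketch on synthetic data, and scikit-learn's `GradientBoostingClassifier` is swapped in as a stand-in for XGBoost so the example has no extra dependency; the numbers it prints are unrelated to the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=1500, n_features=12, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stand-in gradient-boosted model (not XGBoost itself).
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=42)
model.fit(X_train, y_train)

# AUC on the data the model was fitted on versus held-out data.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# 3-fold cross-validated AUC on the training split only.
# cross_val_score clones the estimator, so each fold is fitted from scratch.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

print(f"train AUC = {train_auc:.4f}")
print(f"test AUC  = {test_auc:.4f}")
print(f"3-fold CV = {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

A large gap between the training AUC and the other two figures would point to overfitting; similar values across all three suggest the model generalizes.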

#### Learning Curve Analysis  
Learning curves provided insight into how performance scaled with data size:

- **Training AUC** started high (~0.84) and declined to ~0.75 as more data was added.  
- **Validation AUC** rose steadily from ~0.718 to ~0.729.  
- The gap between training and validation AUC narrows from roughly 0.12 to roughly 0.02, leaving only slight overfitting, which is typical for XGBoost and remains under control.

> These patterns suggest the model generalizes reliably to new data.
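Curves like these can be produced with scikit-learn's `learning_curve`. The sketch below uses synthetic data and, again as an assumption, a `GradientBoostingClassifier` standing in for the XGBoost model; the training sizes and fold count are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=1200, n_features=12, n_informative=5,
    weights=[0.8, 0.2], random_state=7,
)

# Stand-in gradient-boosted model (not XGBoost itself).
model = GradientBoostingClassifier(n_estimators=30, max_depth=2, random_state=7)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)

# Fit the model on growing fractions of each training fold and score by AUC.
train_sizes, train_scores, valid_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 4),
    cv=cv,
    scoring="roc_auc",
)

# Average over folds to get one point per training size.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n={n:4d}  train AUC={tr:.3f}  validation AUC={va:.3f}  gap={tr - va:.3f}")
```

A training score that falls and a validation score that rises as the sample grows, with the gap shrinking, is the pattern described in the bullets above.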

#### Threshold Tuning for Final XGBoost Model  
Threshold tuning adjusts the classification cutoff (default 0.5) to favor recall over precision, aligning with our goal of minimizing undetected defaults. Performance at different thresholds:

| Threshold | Precision | Recall  | F1 Score | ROC AUC |
|-----------|-----------|---------|----------|---------|
| 0.30      | 0.2626    | 0.9256  | 0.4091   | 0.7317  |
| 0.50      | 0.3463    | 0.6815  | 0.4592   | 0.7317  |
| 0.70      | 0.4886    | 0.2862  | 0.3610   | 0.7317  |

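At **threshold 0.30**, recall reaches **92.56%**, greatly reducing missed risky loans. Precision drops to 26.26% (from 34.63% at the default threshold of 0.50), but this trade-off is acceptable given our priority on catching defaults. ROC AUC is identical in every row because it is computed from the predicted probabilities and does not depend on the threshold.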
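To make the mechanics of the table concrete, the sketch below computes precision, recall, and F1 at several thresholds for a small set of made-up labels and scores. The data and helper names are hypothetical; only the last line uses figures from the table, checking that the reported F1 at threshold 0.30 follows from the reported precision and recall.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(y_true, scores, threshold):
    """Classify as positive when score >= threshold, then compute the metrics."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


# Made-up labels (1 = default) and predicted default probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.35, 0.4, 0.1, 0.55, 0.2, 0.32, 0.75, 0.05]

for threshold in (0.30, 0.50, 0.70):
    p, r, f1 = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.4f}  recall={r:.4f}  f1={f1:.4f}")

# Consistency check against the table row for threshold 0.30.
table_f1 = f1_from(0.2626, 0.9256)
print(round(table_f1, 4))
```

On the toy data, lowering the threshold from 0.70 to 0.30 moves recall from 0.5 to 1.0 while precision falls, the same direction of trade-off as in the table.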

## 3.4 Final Model Selected

The **Class-Weighted & Tuned XGBoost** with a **threshold of 0.30** was chosen as the final model. It achieved the best balance between recall, precision, and robustness while effectively minimizing the risk of missed defaults.