# Data Preprocessing: Considerations and Rationale

A number of structured steps were taken to ensure that the dataset was suitable for supervised machine learning, while minimising risks of bias, data leakage, or inconsistency. Key considerations included:

#### 1. Prevention of Data Leakage
To ensure predictive integrity, all variables that reflect post-loan outcomes (such as `total_pymnt`, `recoveries`, or `last_pymnt_d`) were removed. These values are only known after the loan’s performance is observed, and thus would result in data leakage if included during training. This step is essential to ensure that model evaluation reflects realistic, out-of-sample performance.
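The removal step can be sketched as follows. This is a minimal illustration, assuming the data sits in a pandas DataFrame; the column list contains only the three fields named above (the full list used in the project may be longer), and `drop_leakage` is a hypothetical helper name.

```python
import pandas as pd

# Post-outcome columns named in the text; the real list may be longer.
LEAKAGE_COLS = ["total_pymnt", "recoveries", "last_pymnt_d"]


def drop_leakage(df: pd.DataFrame, cols=LEAKAGE_COLS) -> pd.DataFrame:
    """Return a copy of df without post-outcome columns (absent ones are ignored)."""
    present = [c for c in cols if c in df.columns]
    return df.drop(columns=present)


# Toy frame for illustration only.
loans = pd.DataFrame({
    "loan_amnt": [10000, 5000],
    "int_rate": [13.5, 7.9],
    "total_pymnt": [11200.0, 1300.0],
    "recoveries": [0.0, 250.0],
})
clean = drop_leakage(loans)
print(list(clean.columns))
```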

#### 2. Handling Missing Values
Two different imputation strategies were used based on variable type:
- For **continuous features**, missing values were filled with the median, preserving distribution without introducing artificial skew.
- For **categorical features**, missing values were imputed using a placeholder label "Unknown" to retain observations while making missingness explicit.

This avoided unnecessary row deletion while ensuring compatibility with model training.
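A minimal sketch of the two strategies, assuming that every non-numeric column is categorical (an assumption of this example, as is the helper name `impute`). Medians here are computed on whatever frame is passed in; in a train/test workflow they would normally be learned on the training split only.

```python
import numpy as np
import pandas as pd


def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Median-fill numeric columns and label missing categoricals as 'Unknown'."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.select_dtypes(exclude="number").columns  # assumed categorical
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    out[cat_cols] = out[cat_cols].fillna("Unknown")
    return out


# Toy data for illustration only.
toy = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 60000.0],
    "emp_length": ["2 years", None, "10+ years"],
})
filled = impute(toy)
print(filled)
```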

#### 3. Target Variable Encoding
The `loan_status` field was mapped into a binary target variable (`target`) to distinguish default events:
- Loans marked as `"Charged Off"` or `"Default"` were assigned a label of 1 (default).
- All others were treated as non-default (0).

This transformation enabled the application of binary classification techniques.
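The mapping can be expressed in one line with pandas. The status values other than `"Charged Off"` and `"Default"` in this toy series are assumptions for illustration.

```python
import pandas as pd

# Statuses treated as default events, as described above.
DEFAULT_STATUSES = {"Charged Off", "Default"}

# Toy series for illustration only.
loan_status = pd.Series(
    ["Fully Paid", "Charged Off", "Current", "Default"], name="loan_status"
)

# 1 = default, 0 = everything else.
target = loan_status.isin(DEFAULT_STATUSES).astype(int).rename("target")
print(target.tolist())
```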

#### 4. Temporal Feature Engineering
To capture borrower credit history, a new feature `cr_hist` was created by computing the difference in months between the loan issue date (`issue_d`) and the borrower's earliest credit line (`earliest_cr_line`).

All date variables were converted to `datetime` format to allow reliable manipulation.
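A minimal sketch of the `cr_hist` calculation. It assumes the raw dates are strings such as `Dec-2015` (the `%b-%Y` format is an assumption about the source data) and counts whole calendar months between the two dates.

```python
import pandas as pd

# Toy data for illustration only; the date format is an assumption.
df = pd.DataFrame({
    "issue_d": ["Dec-2015", "Jan-2016"],
    "earliest_cr_line": ["Aug-2003", "Jan-2010"],
})

# Convert both date columns to datetime before doing arithmetic.
for col in ["issue_d", "earliest_cr_line"]:
    df[col] = pd.to_datetime(df[col], format="%b-%Y")

# Credit history length in calendar months.
df["cr_hist"] = (
    (df["issue_d"].dt.year - df["earliest_cr_line"].dt.year) * 12
    + (df["issue_d"].dt.month - df["earliest_cr_line"].dt.month)
)
print(df["cr_hist"].tolist())
```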

#### 5. Categorical Encoding
All categorical variables were encoded using one-hot encoding with `drop_first=True`, which drops one redundant (perfectly collinear) dummy column per variable. This produced the fully numeric input required for model fitting, while keeping each category level visible as a named feature.
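As an illustration of the encoding step, the sketch below applies `pd.get_dummies` to a toy frame; the column values are assumptions chosen to resemble the feature names that appear later in this report.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "grade": ["A", "B", "C"],
    "dti": [10.0, 22.5, 15.1],
})

# drop_first=True drops the first level of each variable,
# which then acts as the reference category.
encoded = pd.get_dummies(df, columns=["term", "grade"], drop_first=True)
print(list(encoded.columns))
```

With `drop_first=True`, the first level of each variable (here grade A and the 36-month term) has no column of its own, so every remaining dummy is read relative to that reference level.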

#### 6. Type Consistency and Clean-Up
To avoid errors during model fitting (especially with XGBoost), careful steps were taken to:
- Enforce all continuous features to be of float dtype.
- Flatten any nested arrays or DataFrames in the target variable.
- Remove illegal characters from column names (`[`, `]`, `<`, `>`), which are unsupported by certain model APIs.
- Check and verify that all columns were valid and accessible before fitting the model.

This ensured seamless integration with `scikit-learn` and `XGBoost` pipelines.
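The clean-up steps above might look like the following. This is a sketch under stated assumptions: `tidy_features` is a hypothetical helper, illegal characters are simply removed (as the text describes), and the list of continuous columns is supplied by the caller using the cleaned column names.

```python
import re

import numpy as np
import pandas as pd

ILLEGAL = re.compile(r"[\[\]<>]")  # characters named in the list above


def tidy_features(X: pd.DataFrame, continuous: list) -> pd.DataFrame:
    """Strip illegal characters from column names and cast continuous columns to float."""
    out = X.copy()
    out.columns = [ILLEGAL.sub("", str(c)) for c in out.columns]
    if not out.columns.is_unique:
        raise ValueError("column names collide after clean-up")
    return out.astype({c: "float64" for c in continuous})


# Toy inputs for illustration only.
X_raw = pd.DataFrame({"emp_length_< 1 year": [1, 0], "loan_amnt": [5000, 12000]})
y_nested = pd.DataFrame({"target": [0, 1]})  # target accidentally kept as a DataFrame

X_clean = tidy_features(X_raw, continuous=["loan_amnt"])
y_flat = np.asarray(y_nested).ravel()  # flatten to a 1-D array
print(list(X_clean.columns), y_flat.shape)
```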

#### 7. Train-Test Splitting with Stratification
The data was split into training and test sets using a stratified split on the target variable. This preserved the proportion of defaulters in both sets, ensuring consistent model evaluation — particularly important in imbalanced classification problems such as loan default prediction.
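A stratified split can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio, the `random_state`, and the synthetic data are assumptions for illustration; the report does not state the values actually used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 20% positives (defaults).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# stratify=y keeps the default rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```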


# Model Performance Comparison: Baseline vs Weighted vs Tuned XGBoost

To evaluate the impact of class weighting and hyperparameter tuning on model performance, three variants of the XGBoost classifier were developed and compared on the test set.


#### Model 1: Baseline XGBoost (No Weights, No Tuning)
This model used default XGBoost parameters without any adjustment for class imbalance.

- **Accuracy**: 0.7947  
- **F1 Score**: 0.2159  
- **ROC AUC**: 0.7301  

While the baseline model showed high accuracy, the F1 score was notably low, reflecting poor performance on the minority (default) class. This is typical in imbalanced datasets, where accuracy can be misleading due to the model's bias toward the majority class (non-default).


#### Model 2: Weighted XGBoost (Class Weights, No Tuning)
In this version, the `scale_pos_weight` parameter was set to the ratio of non-defaults to defaults to address class imbalance.

- **Accuracy**: 0.6629  
- **F1 Score**: 0.4566  
- **ROC AUC**: 0.7288  

Here, accuracy decreased (as expected) because the model no longer over-predicted the majority class. However, the F1 score more than doubled (0.2159 to 0.4566) and recall improved, indicating better detection of actual defaulters, while ROC AUC was essentially unchanged (0.7301 vs 0.7288). This suggests that class weighting is essential for model effectiveness when defaults are rare, even though it does little to change how well the model ranks borrowers.
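The weighting used in Model 2 can be sketched as below. `pos_weight` is a hypothetical helper and the toy labels are for illustration; `scale_pos_weight` is the XGBoost parameter named above, shown here only as an entry in a parameter dictionary rather than in a fitted model.

```python
import numpy as np


def pos_weight(y) -> float:
    """Ratio of non-defaults (0) to defaults (1) in the training labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    return n_neg / n_pos


# Toy training labels for illustration only: 8 non-defaults, 2 defaults.
y_train = np.array([0] * 8 + [1] * 2)
spw = pos_weight(y_train)

# These parameters would be passed to the XGBoost classifier.
params = {"scale_pos_weight": spw}
print(params)
```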


#### Model 3: Tuned XGBoost (Class Weights + Hyperparameter Tuning)
This final model combined class weighting with a tuned set of hyperparameters via grid search.

- **Accuracy**: 0.6607  
- **F1 Score**: 0.4592  
- **ROC AUC**: 0.7317  

Compared to Model 2, the tuned model achieved a slightly higher AUC and marginally improved F1 score, though overall gains were modest. This implies that while tuning helped refine model discrimination, class weighting remained the primary driver of performance improvement in terms of recall and F1.


### Model Selected

Based on the evaluation of all three models, Model 3: Tuned XGBoost (Class Weights + Hyperparameter Tuning) is the most appropriate choice for this loan default prediction task. While the performance gains over Model 2 were modest, this model achieved the highest F1 score (0.4592) and highest ROC AUC (0.7317), indicating the best overall balance between precision, recall, and discriminatory capability.


# Feature Importance Analysis

Using the built-in feature importance scores from the tuned XGBoost model, the top predictors of loan default were identified. These values reflect the relative contribution of each feature to the model's predictive performance (based on split gain/importance).

### Top Predictive Features (from XGBoost):
1. `int_rate`: The most important predictor. Higher interest rates are strongly associated with higher credit risk and default probability, as they often reflect a lender's assessment of borrower risk.
2. `grade_B`: Credit grade indicators such as this one are inherently tied to loan risk profiling, underwriting decisions, and expected repayment behavior.
3. `term_60 months`: Longer-term loans may carry higher risk due to extended exposure and borrower volatility over time.
4. `home_ownership_MORTGAGE`: Homeownership status can be a proxy for financial stability. Borrowers who rent or carry mortgages may be at higher risk compared to outright homeowners.
5. `emp_length_Unknown`: Missing employment length data could signal weaker credit profiles or incomplete applications, which may correspond with elevated risk.
6. `fico_range_high`: As expected, credit score features were important in assessing risk, with lower FICO ranges linked to higher default likelihood.
7. `dti`, `loan_amnt`: Larger loan amounts and higher DTI ratios are associated with repayment challenges and credit strain.
8. `verification_status_Verified`: Income verification status appears to influence default probability, either through improved risk assessment or by signaling borrower transparency.
9. `purpose_small_business`: Loans for small business purposes tend to carry higher risk due to the volatility and failure rates of SMEs.
10. `installment`: Monthly repayment amounts reflect loan structure and affordability dynamics, which tie directly to repayment behavior.

Note: Other `grade` features not listed, e.g. `grade_C`, `grade_D`, etc., have the same interpretation as `grade_B`, albeit with lower importance.

# Overfitting Assessment

To evaluate the generalisability and stability of the XGBoost models, an overfitting analysis was conducted by comparing training, test, and cross-validation ROC AUC scores for both the baseline and tuned models.

#### Baseline Model
- **Train ROC AUC**: 0.7285  
- **Test ROC AUC**: 0.7278  
- **3-Fold CV ROC AUC (Train)**: 0.7262 ± 0.0010  

The baseline model demonstrated consistent performance across training, test, and cross-validation sets, indicating minimal overfitting. The near-identical ROC AUC values suggest the model generalises well despite not applying class weighting or tuning.

#### Tuned Model (Class Weights + Grid Search)
- **Train ROC AUC**: 0.7444  
- **Test ROC AUC**: 0.7317  
- **3-Fold CV ROC AUC (Train)**: 0.7295 ± 0.0011  

The tuned model achieved a slightly higher training AUC than the baseline (0.7444 vs 0.7285), with a small improvement in test AUC (0.7317 vs 0.7278). The difference between training and test scores (≈ 0.013) is small, suggesting the model benefits from tuning without significantly overfitting. The cross-validation score (0.7295) closely aligns with the test score, further supporting its robustness.
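The train-test gaps quoted in this section follow directly from the reported scores. The short calculation below reproduces them; the dictionary layout is just a convenience for this example.

```python
# ROC AUC values as reported in this section.
scores = {
    "baseline": {"train": 0.7285, "test": 0.7278, "cv": 0.7262},
    "tuned": {"train": 0.7444, "test": 0.7317, "cv": 0.7295},
}

# Train minus test AUC: a larger gap points to more overfitting.
gaps = {name: round(s["train"] - s["test"], 4) for name, s in scores.items()}
print(gaps)
```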


# Learning Curve Analysis

The learning curve provides insights into how model performance evolves with increasing training data size. This is essential for diagnosing underfitting, overfitting, and assessing whether additional training data could improve generalisation.

#### Observations:
- Training AUC starts high (~0.84) but decreases steadily as more data is introduced, converging toward ~0.75. This is expected: with few samples the model can fit the training set very closely, which becomes harder as the sample grows.
- Validation AUC begins lower (~0.718) and increases gradually, plateauing around ~0.729 with more data.
- The gap between training and validation AUC narrows with more data but does not fully close, suggesting moderate overfitting, which is typical in high-capacity models like XGBoost.

#### Interpretation:
- The model benefits from more data: validation AUC rises as the training set grows.
- The improvement flattens near the maximum sample size, implying that beyond this point, adding more data may not yield significant gains without other changes (e.g. feature engineering or model regularisation).
- The final gap between training and validation curves indicates the model captures real patterns but still slightly overfits. This is not necessarily problematic, but it is something to monitor in production.
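For reference, a learning curve of this kind can be produced with scikit-learn's `learning_curve`. The sketch below is not the project's code: it uses synthetic data and a logistic regression as a lightweight stand-in for the tuned XGBoost model, and the training sizes and 3-fold setting are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic, imbalanced stand-in for the loan data (illustration only).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),  # stand-in estimator, not XGBoost
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=3,
    scoring="roc_auc",
)

# Average across folds; the gap is the quantity discussed above.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean
for n, tr, va in zip(sizes, train_mean, val_mean):
    print(f"n={n:4d}  train AUC={tr:.3f}  validation AUC={va:.3f}")
```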


# Threshold Tuning: Precision–Recall Trade-Off

While the model's optimal threshold based on F1 score was identified as 0.50, threshold adjustment can be used strategically to meet different operational goals — particularly in imbalanced classification problems like loan default prediction.

#### Contextual Objective: Prioritising Recall

In the lending context, catching as many defaulters as possible is critical for minimising portfolio risk. Therefore, recall (true positive rate) is prioritised over precision, as missing a defaulter (false negative) is generally more costly than flagging a non-defaulter (false positive).

#### Threshold Comparison Summary

| Threshold | Precision | Recall | F1 Score | ROC AUC | KS Statistic |
|-----------|-----------|--------|----------|---------|--------------|
| 0.30      | 0.2626    | 0.9256 | 0.4091   | 0.7317  | 0.3373       |
| 0.50      | 0.3463    | 0.6815 | 0.4592   | 0.7317  | 0.3373       |
| 0.70      | 0.4886    | 0.2862 | 0.3610   | 0.7317  | 0.3373       |
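ROC AUC and the KS statistic are identical in every row because both are computed from the predicted probabilities themselves, not from thresholded labels. As a sketch, the KS statistic can be computed as the maximum gap between the true positive rate and false positive rate over all candidate thresholds; `ks_statistic` and the toy data are illustrative assumptions.

```python
import numpy as np


def ks_statistic(y_true, proba) -> float:
    """Maximum of TPR - FPR over all thresholds taken from the scores."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    pos = proba[y_true == 1]
    neg = proba[y_true == 0]
    thresholds = np.unique(proba)
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return float(np.max(tpr - fpr))


# Toy labels and predicted default probabilities (illustration only).
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.60, 0.55]
ks = ks_statistic(y_true, proba)
print(round(ks, 4))
```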

#### Interpretation:
- At threshold = 0.30, the model captures over 92% of actual defaults, making it ideal when recall is critical. However, this comes at the cost of lower precision (more false positives).
- Threshold = 0.50 yields the highest F1 score, balancing precision and recall.
- At threshold = 0.70, precision improves significantly, but recall drops sharply — making it riskier in contexts where missing defaulters is unacceptable.
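The trade-off described above can be reproduced on any set of predicted probabilities by sweeping the threshold. The helper `metrics_at` and the toy data below are assumptions for illustration, not the project's evaluation code.

```python
import numpy as np


def metrics_at(y_true, proba, threshold):
    """Precision, recall and F1 when predicting default for proba >= threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(proba) >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels and predicted default probabilities (illustration only).
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.60, 0.55]

results = {t: metrics_at(y_true, proba, t) for t in (0.30, 0.50, 0.70)}
for t, (p, r, f1) in results.items():
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
```

As in the table, lowering the threshold raises recall at the expense of precision, and raising it does the opposite.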


# Loan Portfolio Selection: Risk-Based Segmentation

To identify low-risk segments of the loan portfolio, three selection strategies based on predicted default probability were applied. This produced portfolios of different sizes and risk levels, enabling a tailored strategy for managing loan defaults.

#### Objective: Minimise Default Risk by Selecting Low-Risk Loans

**Threshold-Based Portfolio Selection:**

1. **Selected Low-Risk Loan Portfolio (Threshold < 0.2)**
   - **Total Loans Selected**: 19,277  
   - **Observed Default Rate**: 3.94%  

Loans in this portfolio all have a predicted default probability below 0.2. The portfolio has a significantly lower default rate than the overall dataset, making it well suited to minimising risk.


2. **Top N Loan Portfolio (Top 300 Loans)**
   - **Total Loans Selected**: 300  
   - **Observed Default Rate**: 1.33%  

By selecting the top 300 loans with the lowest predicted default probabilities, we achieve an even lower default rate (1.33%), reflecting a high concentration of quality loans. However, the number of loans is small, which may limit portfolio size and diversification.


3. **Top X% Loan Portfolio (Top 10% Loans)**
   - **Total Loans Selected**: 15,134  
   - **Observed Default Rate**: 3.36%  

This approach selects the top 10% safest loans based on predicted probabilities. It yields a pool of 15,134 loans, somewhat smaller than the 19,277 selected with the 0.2 probability cut-off, and a slightly lower observed default rate (3.36% vs 3.94%).


#### Insights and Strategic Considerations:
- **Threshold < 0.2** yields the largest of the three portfolios (19,277 loans) with the highest, though still low, default rate (3.94%). It suits clients who want low risk without giving up much volume.
- **Top 300 loans** offer the most concentrated portfolio, achieving the lowest default rate (1.33%) but with limited volume and diversification.
- **Top 10% selection** sits between the two (15,134 loans at a 3.36% default rate), balancing scale and risk with a default rate still well below the dataset average.
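The three selection rules can be sketched with pandas as follows. The helper names and the `proba` column (predicted default probability) are hypothetical, and the toy data is for illustration only; `target` is the observed outcome used to compute the default rate.

```python
import pandas as pd


def select_below(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Loans with predicted default probability below the threshold."""
    return df[df["proba"] < threshold]


def select_top_n(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """The n loans with the lowest predicted default probability."""
    return df.nsmallest(n, "proba")


def select_top_pct(df: pd.DataFrame, pct: float) -> pd.DataFrame:
    """The safest pct (e.g. 0.10 for 10%) of loans by predicted probability."""
    return df.nsmallest(int(round(len(df) * pct)), "proba")


def default_rate(portfolio: pd.DataFrame) -> float:
    """Observed share of defaults in the selected portfolio."""
    return float(portfolio["target"].mean())


# Toy scored loans for illustration only.
scored = pd.DataFrame({
    "proba": [0.05, 0.10, 0.15, 0.18, 0.25, 0.30, 0.45, 0.60, 0.75, 0.90],
    "target": [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
})

below = select_below(scored, 0.2)
top_n = select_top_n(scored, 2)
top_pct = select_top_pct(scored, 0.30)
print(len(below), default_rate(below))
print(len(top_n), default_rate(top_n))
print(len(top_pct), default_rate(top_pct))
```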


# LLM Used For:

- Learning curve analysis  
- Confusion matrix interpretation  
- Threshold tuning explanation