# Credit Risk Prediction Project

By Jerrie M. Mataya

## Project Overview

The Credit Risk Prediction project aims to develop a predictive model to assess the likelihood of loan defaults among borrowers based on various financial and demographic factors. By analyzing these factors, the project seeks to enable financial institutions to identify potential risks in lending practices and make informed decisions regarding credit approvals.

## Table of Contents
1. Introduction
2. Data Exploration and Preprocessing
3. Model Selection and Comparison
4. Feature Interaction Analysis
5. Model Training and Validation
6. Post-Modeling Analysis
7. Conclusion
8. References

# 1. Introduction

The importance of effective credit risk assessment has grown significantly, especially in light of fluctuating economic conditions. This project aims to harness machine learning techniques to predict loan defaults, enabling lenders to mitigate risk and improve financial outcomes.

## 1.1 Objectives
- To analyze the dataset and understand key features influencing loan default.
- To develop and compare multiple predictive models for assessing credit risk.
- To interpret model predictions and explore feature interactions.

# 2. Data Exploration and Preprocessing

## Tools Used

- **Programming Language:** Python
- **Notebook Environment:** Google Colab; Jupyter Notebook through Visual Studio Code
- **Libraries:** Pandas, NumPy, scikit-learn, imbalanced-learn (SMOTE), SHAP, XGBoost, Matplotlib, Seaborn
- **Cloud Environment:** Google Colab

## 2.1 Data Loading

The dataset was imported using Python’s Pandas library. It consists of several columns, including credit amount, duration, and borrower demographics.
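A minimal sketch of this loading step. The inline CSV text and the column names (`credit_amount`, `duration_months`, `age_years`, `default`) are hypothetical stand-ins, since the project's actual file and schema are not shown here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; column names are assumptions.
SAMPLE_CSV = """credit_amount,duration_months,age_years,default
1500,12,35,0
4200,36,28,1
2300,24,52,0
"""

# With a real file, the argument would be a file path instead of a StringIO.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # number of rows and columns
print(df.dtypes)  # inferred type of each column
print(df.head())  # first rows, for a quick sanity check
```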

## 2.2 Descriptive Statistics

A comprehensive analysis of summary statistics was conducted to understand the distributions of key features:
- Mean, median, and standard deviation were calculated for numerical features.
- Categorical features were analyzed for frequency distributions.
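
The summary statistics described above could be computed along these lines. This is a sketch on a hypothetical four-row sample; the column names and values are illustrative assumptions, not project data.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "credit_amount": [1000, 2000, 3000, 6000],
    "purpose": ["car", "car", "education", "furniture"],
})

# Mean, median, and standard deviation for a numerical feature.
numeric_summary = df["credit_amount"].agg(["mean", "median", "std"])

# Frequency distribution for a categorical feature.
purpose_counts = df["purpose"].value_counts()

print(numeric_summary)
print(purpose_counts)
```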

## 2.3 Missing Value Treatment

Missing values were identified and addressed through:
- Imputation for numerical variables using the mean or median.
- Removal of rows with excessive missing data in categorical variables.
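
A small sketch of the two treatments listed above, on hypothetical data. As a simplification, it drops any row whose categorical value is missing; the project's actual threshold for "excessive" missing data is not stated, so that rule is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; column names are assumptions.
df = pd.DataFrame({
    "credit_amount": [1000.0, np.nan, 3000.0, 5000.0],
    "purpose": ["car", "education", None, "furniture"],
})

# Count missing values per column before treating them.
print(df.isna().sum())

# Impute the numerical column with its median (less sensitive to outliers
# than the mean).
df["credit_amount"] = df["credit_amount"].fillna(df["credit_amount"].median())

# Drop rows where the categorical column is missing (simplified rule).
df = df.dropna(subset=["purpose"]).reset_index(drop=True)

print(df)
```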

## 2.4 Feature Engineering

New features were created to enhance the model’s predictive capability, including:
- Interaction terms between key numerical features.
- Categorical encoding using one-hot encoding for non-numeric features.
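
The two feature-engineering steps above might look like the following. The specific interaction (credit amount times duration) and the column names are assumptions chosen for illustration; the features the project actually combined are not listed.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions.
df = pd.DataFrame({
    "credit_amount": [1200, 6000],
    "duration_months": [6, 60],
    "purpose": ["car", "education"],
})

# Interaction term: the product of two numerical features.
df["amount_x_duration"] = df["credit_amount"] * df["duration_months"]

# One-hot encode the non-numeric column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["purpose"], dtype=int)

print(encoded)
```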

## 2.5 Data Normalization

Normalization techniques (Min-Max scaling) were applied to ensure numerical stability during model training.
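
Min-Max scaling maps each feature to the range [0, 1] via `(x - min) / (max - min)`. A minimal sketch with scikit-learn's `MinMaxScaler`, using hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: column 0 is a credit amount, column 1 a duration.
X = np.array([
    [1000.0, 6.0],
    [3000.0, 24.0],
    [5000.0, 60.0],
])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # each column now spans 0 to 1

print(X_scaled)
```

In a train/test workflow, the scaler is usually fitted on the training set only and then applied to the test set with `scaler.transform`, so that test-set statistics do not influence training.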

## 2.6 Visualization

Key visualizations included histograms, box plots, and correlation heatmaps to explore data distributions and relationships.

# 3. Model Selection and Comparison

## 3.1 Selected Models

The following models were chosen and evaluated for their predictive performance:
- Logistic Regression
- Random Forest Classifier
- XGBoost Classifier

## 3.2 Performance Metrics

The models were evaluated using the following metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC
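
These metrics are all available in `sklearn.metrics`. The sketch below uses a small hypothetical set of labels and predicted probabilities (not the project's results) and a 0.5 decision threshold, which is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.9, 0.7]

# Threshold-based metrics need hard predictions; 0.5 is an assumed cutoff.
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is computed from the scores, not the thresholded predictions.
    "roc_auc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```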

## 3.3 Model Comparison Table

| Model                    | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|--------------------------|----------|-----------|--------|----------|---------|
| Logistic Regression      | 1.00     | 1.00      | 1.00   | 1.00     | 1.00    |
| Random Forest Classifier | 1.00     | 1.00      | 1.00   | 1.00     | 1.00    |
| **XGBoost Classifier**   | **1.00** | **1.00**  | **1.00** | **1.00** | **1.00** |

## 3.4 Model Performance Analysis

**Logistic Regression Metrics**
- **Accuracy**: 1.00
- **Precision**: 1.00
- **Recall**: 1.00
- **F1 Score**: 1.00

The Logistic Regression model displayed exceptional performance with perfect scores across all evaluation metrics. The confusion matrix indicated no misclassifications, demonstrating the model's accuracy in predicting loan defaults.

**Random Forest Classifier Metrics**
- **Accuracy**: 1.00
- **Precision**: 1.00
- **Recall**: 1.00
- **F1 Score**: 1.00

The Random Forest Classifier also achieved flawless performance, with no false positives or false negatives. The model's confusion matrix results were:
  - **True Negatives (TN)**: 141
  - **True Positives (TP)**: 59
  - **False Positives (FP)**: 0
  - **False Negatives (FN)**: 0

**XGBoost Classifier Metrics**
- **Accuracy**: 1.00
- **Precision**: 1.00
- **Recall**: 1.00
- **F1 Score**: 1.00

The XGBoost Classifier achieved identical perfect performance, demonstrating its strong capability in correctly classifying all instances. The confusion matrix showed:
  - **True Negatives (TN)**: 141
  - **True Positives (TP)**: 59
  - **False Positives (FP)**: 0
  - **False Negatives (FN)**: 0

**Confusion Matrix Overview**

All three models (Logistic Regression, Random Forest, and XGBoost) demonstrated perfect classification results on the dataset, with no misclassifications. The results indicate the models' reliability and suitability for predicting loan defaults accurately.
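
The reported metrics follow directly from the confusion matrix counts. The sketch below recomputes them from the counts listed above (TN = 141, TP = 59, FP = 0, FN = 0); the helper name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "n_samples": total,
    }


# Counts reported for the Random Forest and XGBoost confusion matrices.
reported = metrics_from_counts(tn=141, fp=0, fn=0, tp=59)
print(reported)
```

The counts sum to 200 test samples, of which 59 are defaults.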

## 3.5 Visualizations


![Logistic:Random Forest.png](<attachment:Logistic:Random Forest.png>)

![Confusion Matrix.png](<attachment:Confusion Matrix.png>)

![XGBOOST.png](attachment:XGBOOST.png)


# 4. Feature Interaction Analysis

## 4.1 Feature Importance

To understand which features significantly contribute to predicting loan defaults, we employed SHAP (SHapley Additive exPlanations) values. This method allows us to interpret the model predictions and determine the influence of each feature on the output.

- **Most Significant Features**:
  - **Credit Amount**: Higher credit amounts were associated with increased risk of default.
  - **Duration**: Longer loan durations correlated with higher default rates.
  - **Borrower Income**: Lower income levels indicated a higher likelihood of default.
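
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a tiny model. It is not the `shap` library's API (the library uses much faster, model-specific algorithms), and `toy_risk_score` with its three inputs and coefficients is a hypothetical stand-in, not the project's model.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which features switch from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]           # reveal feature i
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    n_orders = factorial(n)
    return [value / n_orders for value in phi]


def toy_risk_score(features):
    """Hypothetical risk score with one interaction term."""
    amount, duration, income = features
    return 0.3 * amount + 0.2 * duration - 0.4 * income + 0.1 * amount * duration


x = [1.0, 1.0, 1.0]         # instance to explain (standardized units)
baseline = [0.0, 0.0, 0.0]  # reference point, e.g. the average borrower

phi = shapley_values(toy_risk_score, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Brute force is only feasible for a handful of features, because the number of orderings grows factorially.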

## 4.2 Visualization

To visualize the interactions among features and their contributions to model predictions, we used pair plots and scatter plots. These visualizations highlighted complex relationships, revealing how certain features interacted with each other to affect loan default predictions.

- **Pair Plots**: Displayed the distributions and relationships between key numerical features.
- **Scatter Plots**: Illustrated the relationship between credit amount and default rates, showcasing how increased credit amounts generally correlated with higher default rates.

## 4.3 Feature Importance Visualization


![image.png](attachment:image.png)

# SHapley Additive exPlanations (SHAP)

## Key Features

- **Credit History**: Most significant feature (5.38); indicates creditworthiness.
- **Number of Existing Credits at This Bank**: Moderate impact (0.56); reflects potential risk.
- **Status of Existing Checking Account**: Important (0.22); positive status suggests financial health.
- **Credit Amount**: Notable (0.15); larger amounts may increase default risk.
- **Duration in Months**: Relevant (0.12); longer terms can strain finances.
- **Present Employment (Years)**: Suggests stability (0.09); longer employment is favorable.
- **Property**: Affects financial perception (0.08).
- **Age in Years**: Influences risk behavior (0.07).
- **Purpose of the Credit**: Varies in risk (0.04).
- **Installment Rate in Percentage of Disposable Income**: Higher rates may indicate overextension (0.03).


## Features with No Impact

- **Personal Status**
- **Job**
- **Telephone**



# Top 10 Feature Importances: Random Forest vs XGBoost


![image.png](attachment:image.png)

![featureforest.png](attachment:featureforest.png)


In our analysis of loan default prediction, we utilized two different machine learning models: Random Forest and XGBoost. Below is a summary of the key findings regarding feature importances for both models.

#### Random Forest Model

The Random Forest model identified the following top features that significantly influence loan default risk:

1. **Credit History Critical/Other Existing Credit (Importance: 0.485641)**: This feature has the highest importance, indicating that having a critical credit history or other existing credits is a strong predictor of loan default. Borrowers with poor credit histories are more likely to default.

2. **Credit History Existing Paid (Importance: 0.189082)**: A considerable portion of importance is attributed to borrowers who have successfully paid off previous credits. This suggests that a history of on-time payments reduces the risk of default.

3. **Number of Existing Credits at This Bank (Importance: 0.083419)**: The number of existing credits held by a borrower at the same bank also plays a role. More existing credits may indicate a stronger relationship with the bank and potentially lower default risk.

4. **Credit History Delayed Previously (Importance: 0.045139)**: Previous delays in payments are associated with a higher likelihood of default, albeit with lower importance than the top features.

5. **Credit Amount (Importance: 0.024920)**: The amount of credit applied for can influence the risk, with larger amounts potentially leading to higher default rates.
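
A ranking like the one above can be read from a fitted model's `feature_importances_` attribute. The sketch below uses synthetic data from `make_classification` and placeholder feature names, so its numbers are unrelated to the values reported here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the credit dataset; feature names are placeholders.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort to obtain a ranking.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances.head(5))
```

`XGBClassifier` exposes the same `feature_importances_` attribute, so an equivalent ranking can be produced for XGBoost in the same way.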


#### XGBoost Model

In contrast, the XGBoost model yielded the following insights:

1. **Credit History Critical/Other Existing Credit (Importance: 1.0)**: This feature was found to be the most important, with maximum importance, mirroring the Random Forest findings. It highlights that borrowers with critical credit histories are at a significantly higher risk of default.

#### Conclusion

The results demonstrate a clear distinction in how each model evaluates the importance of features in predicting loan defaults. The Random Forest model highlights several factors related to credit history and financial behavior, while the XGBoost model focuses heavily on the significance of credit history alone, disregarding many other features. This underscores the importance of interpreting model results carefully, as different algorithms can lead to varying insights, which can inform lending practices and risk assessments.

![image.png](attachment:image.png)

# Correlation Matrix Results

The correlation matrix shows how the features in the dataset relate to one another. Each entry gives the strength and direction of the linear relationship between a pair of variables, with values ranging from -1 to 1:

- **1** indicates a perfect positive correlation: as one variable increases, the other increases in exact linear proportion.
- **-1** indicates a perfect negative correlation: as one variable increases, the other decreases in exact linear proportion.
- **0** indicates no linear correlation: the two variables show no linear relationship, although they may still be related in a nonlinear way.
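
A short sketch of how such a matrix is computed with pandas, on a hypothetical toy frame. The `u_shape` column is included to show that a Pearson correlation of 0 only rules out a linear relationship: it is fully determined by `x` yet uncorrelated with it.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["up"] = 2 * df["x"] + 1        # rises linearly with x
df["down"] = -3 * df["x"]         # falls linearly with x
df["u_shape"] = (df["x"] - 3) ** 2  # depends on x, but not linearly

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

print(corr.round(2))
```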

## Key Relationships:

1. **Credit History Relationships:**
   - **Credit history critical/other existing credit (1.0):** This feature has a perfect positive correlation with itself, which is expected.
   - **Credit history existing paid (-0.68):** There is a strong negative correlation with the ‘Credit history critical/other existing credit’, suggesting that individuals with a critical credit history are less likely to have existing paid credit histories.

2. **Existing Credits:**
   - **Number of existing credits at this bank (0.45):** This feature is positively correlated with ‘Credit history critical/other existing credit’, meaning that those with a critical credit history tend to have more existing credits at the bank.

3. **Delayed Payments:**
   - **Credit history delayed previously (1.0):** This feature has a perfect correlation with itself, and it is negatively correlated with both ‘Credit history existing paid (-0.33)’ and ‘Credit history critical/other existing credit (-0.20)’, indicating that borrowers with previously delayed payments are less likely to fall into either of those credit history categories.

4. **Credit Amount:**
   - **Credit amount (1.0):** As a standalone feature, it has positive correlations with ‘Duration in months (0.63)’ and ‘Credit history delayed previously (0.13)’, indicating that higher credit amounts are associated with longer durations of credit.

5. **Age:**
   - **Age in years (1.0):** This feature shows a weak positive correlation with ‘Credit history critical/other existing credit (0.17)’ and ‘Status of existing checking account good (0.14)’, suggesting that older individuals are slightly more likely to have a critical credit history and a good checking account status.



![CreditxDuration.png](attachment:CreditxDuration.png)



## Credit Amount Data

This column lists the amounts of credit that individuals have applied for or received. Here are some key points about this data:

- The values range significantly, with amounts such as **6836**, **2319**, and **1236**. This suggests that individuals are applying for various amounts of credit, which may reflect their different financial needs or situations.
- For example:
  - The highest amount in the sample is **6836**, which indicates a substantial loan application, likely for significant expenses (like buying a car or home).
  - The lowest amount in this snippet is **1236**, which might represent a smaller loan for personal or immediate expenses.

## Duration in Months Data

This column shows the length of time (in months) for which the credit is requested. The numbers represent how long the borrowers expect to repay the credit. Here are some insights:

- The values vary, with durations like **60**, **21**, and **6** months, indicating different repayment periods.
- For example:
  - A duration of **60 months** (5 years) suggests that the borrower is seeking a longer-term loan, which could be typical for larger amounts like mortgages or auto loans.
  - A shorter duration, like **6 months**, indicates a desire for quick repayment, perhaps for a smaller personal loan.

## Key Takeaways

- **Variety in Borrowing Needs**: The credit amounts and durations show a wide range of borrowing behaviors. Different individuals have unique financial situations that require different amounts and repayment terms.
- **Implications for Risk Assessment**: Understanding the relationship between the amount borrowed and the repayment period can help in assessing risk. For instance, higher amounts with shorter durations might indicate a higher repayment pressure on the borrower, potentially leading to a higher risk of default.

Together, these two columns show how much individuals borrow and how long they plan to take to repay it.
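
One simple way to quantify the repayment pressure mentioned above is the amount owed per month of the loan term. The sketch below uses hypothetical loans and ignores interest; the column name `amount_per_month` is an assumption, not a feature from the project.

```python
import pandas as pd

# Hypothetical loans; values are illustrative, not taken from the dataset.
loans = pd.DataFrame({
    "credit_amount": [6000, 6000, 1200],
    "duration_months": [60, 12, 6],
})

# Principal owed per month (interest ignored): higher means more pressure.
loans["amount_per_month"] = loans["credit_amount"] / loans["duration_months"]

print(loans.sort_values("amount_per_month", ascending=False))
```

Here the same amount repaid over 12 months instead of 60 produces five times the monthly burden.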

# 5. Model Training and Validation

## 5.1 Data Splitting

The dataset was split into training (80%) and testing (20%) sets using stratified sampling.

## 5.2 Cross-Validation

k-fold cross-validation was implemented to ensure robust evaluation and to reduce variance in performance metrics.

## 5.3 Hyperparameter Tuning

Grid search was performed to optimize hyperparameters for the Random Forest and XGBoost models.

## 5.4 Training Process Flowchart

A flowchart was created to document the model training workflow, including data preparation, fitting, and evaluation.
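
The splitting, cross-validation, and grid search steps can be sketched together as follows. The data is synthetic, and the number of folds (5), the parameter grid, the ROC-AUC scoring, and the use of stratified folds are all assumptions, since the project's actual settings are not stated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the credit data, with roughly 30% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42
)

# 80/20 split; stratify=y keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# k-fold cross-validation on the training set only (k = 5 is assumed).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Small illustrative grid; the project's real grid is not documented.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC-AUC:", round(search.best_score_, 3))

# Final check on the held-out test set, which the search never saw.
test_auc = search.score(X_test, y_test)
print("Test ROC-AUC:", round(test_auc, 3))
```

Keeping the test set out of the search means the final score is not influenced by hyperparameter selection.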

# 6. Post-Modeling Analysis

## 6.1 Error Analysis

An in-depth analysis of misclassifications was performed to identify patterns in false positives and negatives.

## 6.2 Confusion Matrix

Confusion matrices were generated for each model, providing insights into classification performance.


![ROCxPRCpng.png](attachment:ROCxPRCpng.png)

### Analysis of ROC Curve and Precision-Recall Curve

#### ROC Curve
- **What is it?** The ROC (Receiver Operating Characteristic) curve shows how well our model can distinguish between two groups: borrowers who will default on their loans and those who won't.

- **Area Under the Curve (AUC):** The AUC measures the model's overall ability to rank borrowers who default above those who do not. An AUC of 1.0 means the model separates the two groups perfectly: every defaulter receives a higher predicted score than every non-defaulter. An AUC of 0.5 means the model is no better than random guessing.

- **Interpretation:** Since our model has an AUC of 1.0, this means it is excellent at telling apart borrowers who will default from those who won't.

#### Precision-Recall Curve
- **What is it?** The Precision-Recall curve helps us understand how well the model performs when it comes to predicting loan defaults. It shows the trade-off between two important metrics:
  - **Precision:** This tells us how many of the borrowers predicted to default actually do default. High precision means the model is good at making accurate predictions about defaults.
  - **Recall:** This tells us how many actual defaults were correctly identified by the model. High recall means the model catches most of the borrowers who are likely to default.

- **Interpretation:** A good Precision-Recall curve starts high and stays high, indicating that the model maintains high precision as recall increases. In other words, it can catch more of the borrowers who are likely to default without flagging many borrowers who would have repaid.
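
Both curves can be computed with `sklearn.metrics`. The sketch below compares a perfectly separating set of scores with a noisier one; all labels and scores are hypothetical.

```python
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score, roc_curve

# Hypothetical labels (1 = default) and two sets of predicted scores.
y_true = [0, 0, 0, 1, 1, 1]
perfect_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]  # every default ranked on top
noisy_scores = [0.1, 0.6, 0.3, 0.4, 0.8, 0.9]    # one default ranked too low

perfect_auc = roc_auc_score(y_true, perfect_scores)
noisy_auc = roc_auc_score(y_true, noisy_scores)

# Points of the ROC curve (false positive rate vs. true positive rate).
fpr, tpr, roc_thresholds = roc_curve(y_true, noisy_scores)

# Points of the Precision-Recall curve and the area under it.
precision, recall, pr_thresholds = precision_recall_curve(y_true, noisy_scores)
pr_auc = auc(recall, precision)

print("ROC-AUC (perfect):", perfect_auc)
print("ROC-AUC (noisy):", round(noisy_auc, 3))
print("PR-AUC (noisy):", round(pr_auc, 3))
```

The arrays `fpr`/`tpr` and `recall`/`precision` can be passed directly to Matplotlib's `plot` to draw the two curves.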



# 7. Conclusion

The Credit Risk Prediction project developed predictive models to assess loan default risk. All three models achieved identical scores on the test set, and XGBoost was highlighted as the preferred model; the SHAP and feature-importance analyses provided insight into which features drive the predictions. The project highlights the importance of thorough data exploration, careful model selection, and robust evaluation methodologies in achieving reliable predictions.

## 7.1 Future Work

Further research could explore:

- Incorporating additional features such as temporal data.
- Testing other machine learning algorithms and ensemble methods.
- Continuous model monitoring and updates to adapt to changing economic conditions.

# 8. References



- VanderPlas, J. (2016). *Python Data Science Handbook: Essential Tools for Working with Data*.
- Géron, A. (2019). *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems*. 
- Scikit-learn Documentation. (n.d.). Retrieved from https://scikit-learn.org/stable/user_guide.html
- XGBoost Documentation. (n.d.). Retrieved from https://xgboost.readthedocs.io/en/latest/
- Towards Data Science. (2018). A Comprehensive Guide to Hyperparameter Tuning with Grid Search in Python. Retrieved from https://towardsdatascience.com/a-comprehensive-guide-to-hyperparameter-tuning-with-grid-search-in-python-25de4c9a276b
