Predictive Modelling: German Credit Risk 
=========================================


## Summary

The goal of our analysis is to classify whether someone is a good or bad credit risk using attributes such as `Credit History`, `Duration`, and `Residence`. Our best-performing model is a Random Forest. It achieved an accuracy of 0.8 on unseen data, a decent result compared to the dummy model's accuracy of 0.7, along with a precision of 0.8, a recall of 0.95, and an F1 score of 0.87. The model performs reasonably well at identifying people who are a good credit risk. However, if it is to have a hand in real-world decision-making, precision should be improved to reduce the number of poor credit risks classified as good (false positives). In addition, more research should be done to ensure the model produces fair and equitable recommendations.

## Introduction

Understanding and predicting credit risk is crucial for banks in the finance sector. As Costa e Silva et al. (2020) note in the Journal of Applied Statistics, “credit risk modelling, namely its component Probability of Default (PD), is very helpful in the consumer credit loan grant decision”. Further, a report by McKinsey states that “at an average commercial bank, credit-related assets produce about 40 percent of total revenues” (Goraieb et al., n.d.). With this in mind, our data science project develops a predictive model aimed at discerning good from bad credit risk.

Our key question: How can we predict individuals with good or bad credit risk using relevant and representative input features?

The Statlog (German Credit Data) dataset, sourced from UCI’s Machine Learning Repository, can be used for classifying individuals as good or bad credit risks based on a variety of attributes. The dataset comes with a cost matrix for evaluation, which outlines the misclassification costs. The cost matrix indicates that it is worse to classify a customer as good when they are bad than to classify a customer as bad when they are good. The dataset contains 1000 instances with 20 features. The features differ in role, type, and the information they capture, summarized as follows:

- Target Variable (Credit Risk): Binary, classifies individuals as either good (= 1) or bad (= 2) credit risks.
- Status of existing checking account (Attribute1): Categorical, indicates the status of the existing checking account in 4 categories, such as the balance amount or absence of a checking account.
- Duration (Attribute2): Integer, represents the duration of credit in months.
- Credit history (Attribute3): Categorical, describes the credit history of individuals in 5 categories, including whether credits were paid back duly or if there were payment delays.
- Purpose (Attribute4): Categorical, specifies the purpose of the credit in 11 documented categories (10 of which appear in the data), such as a car purchase, furniture, education, or business.
- Credit amount (Attribute5): Integer, denotes the amount of credit requested.
- Savings account/bonds (Attribute6): Categorical, indicates the status of savings accounts or bonds in categorical brackets of DEM currency.
- Present employment since (Attribute7): Categorical, shows the duration of present employment.
- Installment rate (Attribute8): Integer, represents the installment rate in terms of a percentage of disposable income.
- Personal status and sex (Attribute9): Categorical, provides information about personal status and sex.
- Other debtors/guarantors (Attribute10): Categorical, indicates the presence of other debtors or guarantors.
- Present residence since (Attribute11): Integer, denotes the duration of present residence.
- Property (Attribute12): Categorical, describes the type of property owned.
- Age in years (Attribute13): Integer, represents the age of individuals.
- Other installment plans (Attribute14): Categorical, specifies other installment plans held by individuals.
- Housing (Attribute15): Categorical, indicates the housing status.
- Number of existing credits at this bank (Attribute16): Integer, denotes the number of existing credits at this bank.
- Job (Attribute17): Categorical, describes the job status of individuals.
- Number of people being liable to provide maintenance for (Attribute18): Integer, represents the number of dependents.
- Telephone (Attribute19): Categorical, indicates the presence of a telephone registered under the customer's name.
- Foreign worker (Attribute20): Categorical, specifies whether the individual is a foreign worker.
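
The cost matrix mentioned above can be turned into a single evaluation number. The sketch below is only an illustration: it assumes the weights given in the UCI documentation for this dataset (5 for predicting good when the customer is bad, 1 for predicting bad when the customer is good), which this report does not state, and the function name `total_cost` and the label arrays are made up for the example.

```python
import numpy as np

# Assumed weights from the UCI dataset documentation (not stated in this report).
COST_BAD_AS_GOOD = 5  # predicted good (1), actually bad (2)
COST_GOOD_AS_BAD = 1  # predicted bad (2), actually good (1)


def total_cost(y_true, y_pred):
    """Sum the misclassification costs for labels coded 1 = good, 2 = bad."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    bad_as_good = np.sum((y_true == 2) & (y_pred == 1))
    good_as_bad = np.sum((y_true == 1) & (y_pred == 2))
    return int(COST_BAD_AS_GOOD * bad_as_good + COST_GOOD_AS_BAD * good_as_bad)


# Made-up labels: one bad customer predicted good, one good customer predicted bad.
y_true = [1, 1, 1, 2, 2, 1]
y_pred = [1, 2, 1, 1, 2, 1]
cost = total_cost(y_true, y_pred)
print(cost)
```

Under these weights, a single bad customer predicted as good costs as much as five good customers predicted as bad, which is why precision on the "good" class matters for this problem.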

## Analysis

In [2]:
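# Load the dataset
#### TODO: READ FROM CLEAN DATA FOLDER
import pandas as pd

# `file_path` must hold the path to the cleaned CSV; it is not defined in this report.
df = pd.read_csv(file_path)
df.head()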

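| | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 | Attribute6 | Attribute7 | Attribute8 | Attribute9 | Attribute10 | ... | Attribute12 | Attribute13 | Attribute14 | Attribute15 | Attribute16 | Attribute17 | Attribute18 | Attribute19 | Attribute20 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |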


> <center><b><i>Table 1: Original Dataset</i></b></center>


The dataset contains 21 columns of categorical and numerical data: the 20 features and, in the last column, the target variable.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Status            1000 non-null   object
 1   Duration          1000 non-null   int64 
 2   Credit history    1000 non-null   object
 3   Purpose           1000 non-null   object
 4   Credit amount     1000 non-null   int64 
 5   Savings account   1000 non-null   object
 6   Employement       1000 non-null   object
 7   Rate              1000 non-null   int64 
 8   Personal status   1000 non-null   object
 9   Guarantors        1000 non-null   object
 10  Residence         1000 non-null   int64 
 11  Property          1000 non-null   object
 12  Age               1000 non-null   int64 
 13  Installment       1000 non-null   object
 14  Housing           1000 non-null   object
 15  Existing credits  1000 non-null   int64 
 16  Job               1000 non-null   object
 17  Liable people  

The data does not contain any null values; therefore, no imputation is required.

In [5]:
{column: df[column].nunique() for column in df.columns}

{'Status': 4,
 'Duration': 33,
 'Credit history': 5,
 'Purpose': 10,
 'Credit amount': 921,
 'Savings account': 5,
 'Employement': 5,
 'Rate': 4,
 'Personal status': 4,
 'Guarantors': 3,
 'Residence': 4,
 'Property': 4,
 'Age': 53,
 'Installment': 3,
 'Housing': 3,
 'Existing credits': 4,
 'Job': 4,
 'Liable people': 2,
 'Telephone': 2,
 'Foreign worker': 2,
 'Credit risk': 2}

In [6]:
df.describe()

| | Duration | Credit amount | Rate | Residence | Age | Existing credits | Liable people | Credit risk |
|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 20.903 | 3271.258 | 2.973 | 2.845 | 35.546 | 1.407 | 1.155 | 1.3 |
| std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 | 0.458487 |
| min | 4.0 | 250.0 | 1.0 | 1.0 | 19.0 | 1.0 | 1.0 | 1.0 |
| 25% | 12.0 | 1365.5 | 2.0 | 2.0 | 27.0 | 1.0 | 1.0 | 1.0 |
| 50% | 18.0 | 2319.5 | 3.0 | 3.0 | 33.0 | 1.0 | 1.0 | 1.0 |
| 75% | 24.0 | 3972.25 | 4.0 | 4.0 | 42.0 | 2.0 | 1.0 | 2.0 |
| max | 72.0 | 18424.0 | 4.0 | 4.0 | 75.0 | 4.0 | 2.0 | 2.0 |


> <center><b><i>Table 3: Descriptive statistics of the data</i></b></center>


The numerical data will then be inspected to check whether scaling is required.

In [1]:
# Histograms for numerical columns
#### TODO: READ 8 HISTOGRAMS FROM FIG FOLDER

> <center><b><i>Figure 1: Histograms of numerical columns</i></b></center>

In [2]:
# Correlation heatmap for numerical columns
#### TODO: READ HEATMAP FROM FIG FOLDER

> <center><b><i>Figure 2: Correlation heatmap of numerical columns</i></b></center>

The heatmap does not show a high correlation between any pair of variables, so no manipulation is required. Thus, the data can be scaled and prepared for the model. A column transformer will be used to scale the numerical data and to one-hot encode the categorical data.

### Modeling

The data will be split into a training set and a testing set, with Credit Risk as the target variable for the supervised model.

In [13]:
# Applying the preprocessing
### TODO: READ IN TRAINING AND TESTING DATA FROM DATA FOLDER

# Checking the shape of the splits
(X_train.shape, X_test.shape), (y_train.shape, y_test.shape)

(((800, 64), (200, 64)), ((800,), (200,)))
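
The shapes above are consistent with an 80/20 split of the 1000 rows. The sketch below reproduces those shapes on placeholder data; the `test_size` and `random_state` values are assumptions, and the random feature matrix merely stands in for the preprocessed data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))    # placeholder for the 64 preprocessed columns
y = np.repeat([1, 2], [700, 300])  # 1 = good, 2 = bad (70/30 as in the data)

# Hold out 20% of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

print((X_train.shape, X_test.shape), (y_train.shape, y_test.shape))
```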

### Dummy Classifier

Let's first try the `DummyClassifier`.

Model Result:

The `DummyClassifier` scores act as our baseline for judging model performance.
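
To illustrate why the baseline sits near 0.7, the sketch below runs a most-frequent-class dummy model on placeholder data with the same 70/30 class balance as the dataset (the mean of `Credit risk` is 1.3). The `strategy` and `cv` values are assumptions, not necessarily the settings used in this project.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Placeholder data: features are irrelevant to a dummy model.
X = np.zeros((1000, 1))
y = np.repeat([1, 2], [700, 300])  # 1 = good, 2 = bad

# Always predict the most frequent class (good credit risk).
dummy = DummyClassifier(strategy="most_frequent")
scores = cross_validate(dummy, X, y, cv=5, return_train_score=True)

mean_test = scores["test_score"].mean()
mean_train = scores["train_score"].mean()
print(mean_test, mean_train)
```

Any model worth keeping has to beat this score, because it can be reached without looking at a single feature.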

Now we will try different models.

### Logistic Regression

Logistic regression seems to do better than the dummy classifier, but note that there is a lot of variation in the scores across folds.

In [3]:
# Hyperparameter optimization

### TODO: READ IN HYPERPARAMETER OPTIMIZATION TABLE FROM DATA FOLDER

> <center><b><i>Table 8: Cross-Validation results for Logistic Regression with varying regularization strengths</i></b></center>

The optimal C value is ???

After hyperparameter optimization, the test scores increased, indicating better model performance.
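
One common way to run such a search over the regularization strength is a cross-validated grid search. The sketch below is an assumption about the approach, run on synthetic data; the grid of `C` values and `cv=5` are illustrative and do not reproduce the results in Table 8.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Smaller C means stronger regularization.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    return_train_score=True,
)
search.fit(X, y)

best_C = search.best_params_["C"]
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "mean_train_score"]
]
print(best_C)
print(results)
```

Comparing `mean_train_score` with `mean_test_score` for each `C` shows where the model starts to overfit as regularization weakens.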

#### Fitting the model

Now that hyperparameter optimization is done, let's fit the model to our training data.

#### Accessing learned parameters

In [22]:
#Accessing the coefficients of the variables. 
#### TODO: READ IN LEARNED PARAMETERS FROM DATA FOLDER

| | columns | coefs |
|---|---|---|
| 9 | Status_A14 | -0.699637 |
| 14 | Credit history_A34 | -0.527882 |
| 16 | Purpose_A41 | -0.455999 |
| 37 | Personal status_A93 | -0.329487 |
| 52 | Installment_A143 | -0.298676 |
| ... | ... | ... |
| 25 | Savings account_A61 | 0.316278 |
| 10 | Credit history_A30 | 0.342290 |
| 22 | Purpose_A46 | 0.348201 |
| 15 | Purpose_A40 | 0.444547 |


> <center><b><i>Table 10: Learned Parameters</i></b></center>

- The table displays coefficients assigned to various features by the logistic regression model.
- Negative coefficients (e.g., Status_A14, Credit history_A34) push the prediction toward a good credit risk (class 1), while positive coefficients (e.g., Purpose_A40, Status_A11) push it toward a bad credit risk (class 2).
- The magnitude of the coefficients reflects the strength of the relationship between each feature and the target variable, with larger absolute values indicating a stronger influence on the model's predictions.
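
A table like the one above can be built by pairing feature names with the fitted model's `coef_` attribute. The sketch below shows the idea on synthetic data; the feature names are hypothetical placeholders, not the project's columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

model = LogisticRegression().fit(X, y)

# For a binary problem, coef_ has shape (1, n_features); positive values
# push predictions toward model.classes_[1].
coef_table = pd.DataFrame(
    {"columns": feature_names, "coefs": model.coef_[0]}
).sort_values("coefs")
print(coef_table)
```

Sorting by coefficient puts the features that push hardest toward each class at the two ends of the table.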

#### ROC Plot

In [4]:
### TODO: READ IN ROC PLOT FROM IMG FOLDER

> <center><b><i>Figure 3: ROC Curve</i></b></center>

The graph above plots the true positive rate (recall) against the false positive rate. Here the positive class is a bad credit risk, so a false positive occurs when we predict someone to be a bad credit risk when in reality they are a good credit risk. We therefore want to minimize false positives while still maintaining a decent recall. We keep the default `predict_proba` threshold of 0.5 to balance these two goals, as it keeps the false positive rate quite low.
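
The sketch below shows how the points of an ROC curve, and the operating point at a 0.5 threshold, can be computed. It runs on synthetic data with labels 0 and 1 (1 is the positive class), so the numbers do not correspond to Figure 3.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class (model.classes_[1]).
proba = model.predict_proba(X_te)[:, 1]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)

# Operating point at the 0.5 threshold.
pred_pos = proba >= 0.5
fpr_at_half = (pred_pos & (y_te == 0)).sum() / (y_te == 0).sum()
tpr_at_half = (pred_pos & (y_te == 1)).sum() / (y_te == 1).sum()
print(auc, fpr_at_half, tpr_at_half)
```

Lowering the threshold moves the operating point up and to the right along the curve, trading a higher recall for a higher false positive rate.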

Model Performances: 


> <center><b><i>Table 11: Comparison of Model Performances: Dummy vs. Logistic Regression</i></b></center>

### Random Forest Model

Now we create the model with the best hyperparameters we found.

Let's fit the model on the training set.
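
A minimal sketch of this step, assuming scikit-learn's `RandomForestClassifier`: the data is synthetic, and the hyperparameter values shown are placeholders, since the tuned values are not listed in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Placeholder hyperparameters (the tuned values are not given in the report).
rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)
rf.fit(X_train, y_train)

train_acc = rf.score(X_train, y_train)
test_acc = rf.score(X_test, y_test)
print(train_acc, test_acc)
```

Comparing `train_acc` with `test_acc` is a quick check for overfitting, since fully grown trees tend to fit the training data almost perfectly.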

#### Visualizing the tree

We can look at different trees created by random forest.

In [5]:
#### TODO: READ IN 3 TREE FIGS FROM IMG FOLDER

> <center><b><i>Figure 4: Visualization of Random Forest Tree</i></b></center>
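
Individual trees of a fitted forest are available through its `estimators_` attribute. The sketch below prints a text rendering of each tree with `export_text` instead of producing figures; the small forest, the depth limit, and the feature names are assumptions chosen to keep the output readable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# A deliberately tiny forest so each tree fits on screen.
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0).fit(X, y)

for i, tree in enumerate(rf.estimators_):
    print(f"Tree {i}")
    print(export_text(tree, feature_names=feature_names))

first_tree_text = export_text(rf.estimators_[0], feature_names=feature_names)
```

Because each tree is trained on a different bootstrap sample and considers random feature subsets, the printed trees usually differ in their splits.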


### Model Performances

In [34]:
#Assessing model performance of different models
### TODO: READ IN CROSS VALIDATION SCORES FROM DATA FOLDER

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| Dummy | 0.001127 | 0.001925 | 0.69875 | 0.69875 |
| Logistic Regression | 0.015703 | 0.000563 | 0.7425 | 0.783125 |
| Random Forest | 0.079874 | 0.012885 | 0.75625 | 1.0 |


> <center><b><i>Table 15: Summary of each model's performance (Cross-Validation)</i></b></center>

In [6]:
#### READ IN TEST SCORES FROM DATA FOLDER

> <center><b><i>Table 16: Summary of each model's performance (Test Scores)</i></b></center>

**Logistic Regression**
- **Accuracy (0.79)**: Around 79% of instances are classified correctly by the model.
- **Precision (0.81)**: When the model predicts a positive outcome, it's correct approximately 81% of the time.
- **Recall (0.91)**: The model can correctly identify about 91% of all actual positive instances.
- **F1 Score (0.86)**: Represents a balance between precision and recall, with higher values indicating good overall performance.

The model performs well relative to the baseline: its accuracy of 0.79 exceeds the dummy model's 0.7, and its recall (0.91) is higher than its precision (0.81), so it captures most positives at the cost of some false positives.

**Random Forest**
- **Accuracy (0.8)**: The proportion of correctly classified instances out of the total instances in the test set.
- **Precision (0.802)**: The ratio of correctly predicted positive observations to the total predicted positives. It indicates the model's ability to correctly identify positive instances.
- **Recall (0.950)**: The ratio of correctly predicted positive observations to all actual positives. It shows the model's ability to capture all positive instances.
- **F1 Score (0.870)**: The harmonic mean of precision and recall. It provides a balance between precision and recall, especially useful when dealing with imbalanced classes.

Overall, these metrics suggest that the Random Forest classifier performs well on the test set, with relatively high accuracy, precision, recall, and F1 score. It correctly classifies a majority of instances, although its recall (0.950) is notably higher than its precision (0.802), so its errors are mostly false positives rather than false negatives.
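
As a quick consistency check on the Random Forest figures above, the F1 score can be recomputed from the reported precision and recall. The same two numbers also show how many false positives and false negatives occur per true positive, using FP/TP = 1/precision - 1 and FN/TP = 1/recall - 1.

```python
# Reported Random Forest test metrics.
precision = 0.802
recall = 0.950

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Errors expressed per true positive.
fp_per_tp = 1 / precision - 1  # false positives per true positive
fn_per_tp = 1 / recall - 1     # false negatives per true positive

print(round(f1, 3), round(fp_per_tp, 3), round(fn_per_tp, 3))
```

The recomputed F1 rounds to the reported 0.87, and false positives outnumber false negatives by several times, which matches the concern about precision raised in the Summary.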

**Overall**

Both models perform better than our baseline model.
- **Logistic Regression**: It has moderate training and prediction times and achieves a decent accuracy on the test set (0.79). Its cross-validation training and validation scores are close (0.78 and 0.74), indicating reasonable performance without overfitting.
- **Random Forest**: It takes roughly five times longer to train than logistic regression but achieves a slightly higher accuracy on the test set (0.8). However, there is a large gap between its training score (1.0) and its cross-validation score (0.76), suggesting potential overfitting.

## Discussion

Our aim is to classify whether someone is a good or bad credit risk. First, we used the dummy classifier, which produced an accuracy of about 0.7, as a benchmark. We then employed logistic regression to improve on the dummy model's performance, and the accuracy rose from 0.7 to 0.79. Three variables that strongly pushed the prediction in the positive direction (bad credit) are Status_A11 (< 0 Deutsche mark), Purpose_A40 (purpose is new car), and Purpose_A46 (purpose is education). Three variables that strongly pushed the prediction in the negative direction (good credit) are Status_A14 (no checking account), Credit history_A34 (other credits existing (not at this bank)), and Purpose_A41 (purpose is used car). Finally, we used a Random Forest model, which improved on accuracy, recall, and F1 score compared to the logistic regression model. Both models had a decent precision of around 0.8, which is important because businesses often want to avoid identifying people as good credit risks when that is not the case. We recommend the Random Forest model due to its higher F1 score and accuracy.

It is unsurprising that we were able to improve on the dummy classifier with logistic regression and random forest models, as data such as credit status, purpose, and history seem fairly informative and somewhat linearly related to one's credit risk. It is also unsurprising that a negative DM (German currency) balance would indicate a bad credit risk. One surprise is that our model treats having no checking account as an indicator of a good credit risk, which is not very intuitive.

This model could potentially be used by banks to determine who they should lend money to, since analyzing credit risk is important to banks (Dobby & Vossos, 2024). Even though there are potential benefits to using a model such as our logistic regression model, we should keep in mind the negative impacts it could have. Using such a model could result in unintentional discrimination. For example, `Age` is an attribute in our data. Our logistic regression model has a coefficient of -0.21 for Age, indicating that a larger age value leads to a better credit prediction, holding other variables constant. If our model were used to determine who should get a bank loan, there would be a risk of age discrimination, which is unethical and illegal (Personal Characteristics, Grounds of Discrimination Protected in the BC Human Rights Code - BC Human Rights Tribunal, 2023).

While we only explored a logistic regression model and a random forest model, it would be interesting to see whether gradient-boosted classifiers improve accuracy and precision. If we used more complex classifiers such as LightGBM, we would benefit from using SHAP plots to understand how the model makes predictions, both in general and on an individual basis. In addition, research could be done on how to prevent our model from discriminating based on age, gender, and similar attributes, as simply removing attributes does not always lessen discrimination.

## References

Costa e Silva, E., Lopes, I. C., Correia, A., & Faria, S. (2020). A logistic regression model for consumer default risk. Journal of Applied Statistics, 47(13-15), 2879–2894. https://doi.org/10.1080/02664763.2020.1759030

Dobby, C., & Vossos, T. (2024, February 22). Wall Street to Follow Canada’s Hot Risk Transfer Trade. Bloomberg.com. https://www.bloomberg.com/news/articles/2024-02-22/wall-street-to-follow-canada-s-hot-capital-relief-trade

Goraieb, E., Kumar, S., & Pepanides, T. (n.d.). Credit Risk | Risk & Resilience | McKinsey & Company. Www.mckinsey.com. https://www.mckinsey.com/capabilities/risk-and-resilience/how-we-help-clients/credit-risk

Personal characteristics, grounds of discrimination protected in the BC Human Rights Code - BC Human Rights Tribunal. (2023, May 9). BC Human Rights Tribunal. https://www.bchrt.bc.ca/human-rights-duties/personal-characteritics/
