# Workshop: Creditworthiness Assessment using Central Limit Theorem

### Brief
The Head of Lending has asked you to compare the credit scores of default and non-default borrowers to determine if there is a significant difference between the two groups.

### Hypothesis Testing Approach

We will compare the credit scores of default and non-default borrowers using a two-sample t-test to determine if there is a significant difference between the two groups.

### Python Code Walkthrough


#### User-specified parameters

In [None]:
python_material_folder_name = "python-material"

#### Import libraries

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Check if in Google Colab environment
try:
    from google.colab import drive
    # Mount drive
    drive.mount('/content/drive')
    # Set up path to Python material parent folder
    # If unsure, print the current directory path by executing `!pwd` in a new cell
    path_python_material = rf"drive/MyDrive/{python_material_folder_name}"
    IN_COLAB = True
except ImportError:
    # google.colab is not installed, so we are not running in Colab
    IN_COLAB = False
    # If working locally in Jupyter Notebook, the parent folder is one level up
    # (assuming you are using the folder structure shared at the beginning of the course)
    path_python_material = ".."

#### Import data

In [None]:
# Read data that was exported from previous session
df = pd.read_csv(f"{path_python_material}/data/2-intermediate/df_out_dsif3.csv")
df.head()

In [None]:
df.columns[df.columns.str.contains("fico")]

In [None]:
df.last_fico_range_high.describe()

In [None]:
df.loan_status.value_counts()

Useful resources:
- https://www.experian.co.uk/consumer/guides/good-credit-score.html
- https://www.experian.com/blogs/ask-experian/what-is-the-difference-between-fico-score-and-credit-score/
    

In [None]:
default_scores = df[df.loan_status == "Charged Off"]["last_fico_range_high"]
non_default_scores = df[df.loan_status != "Charged Off"]["last_fico_range_high"]

print(f"Found {len(default_scores)} default scores and {len(non_default_scores)} non default scores")


The t-statistic is used to determine whether there is a significant difference between the means of two groups. For two independent samples whose variances are not assumed to be equal (Welch's form), it is calculated using the formula:

$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

where:

- $\bar{X}_1$ and $\bar{X}_2$ are the sample means of the two groups.
- $s_1^2$ and $s_2^2$ are the sample variances of the two groups.
- $n_1$ and $n_2$ are the sample sizes of the two groups.
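
As a quick check of the formula, the sketch below computes the t-statistic by hand on a small made-up sample and compares it with `scipy.stats.ttest_ind`. Because the formula keeps the two variances separate rather than pooling them, it corresponds to the `equal_var=False` option (Welch's t-test). The toy scores and the `toy_` variable names are illustrative assumptions, not values from the course dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples with different sizes and spreads (not from the dataset)
toy_default = np.array([600, 620, 580, 570, 630, 610, 590])
toy_non_default = np.array([700, 720, 710, 690])

# Sample means, sample variances (ddof=1) and sample sizes
mean_1, mean_2 = toy_default.mean(), toy_non_default.mean()
var_1, var_2 = toy_default.var(ddof=1), toy_non_default.var(ddof=1)
n_1, n_2 = len(toy_default), len(toy_non_default)

# t-statistic from the formula above
t_manual = (mean_1 - mean_2) / np.sqrt(var_1 / n_1 + var_2 / n_2)

# scipy equivalents: Welch (unpooled) and Student (pooled) versions
t_welch, _ = stats.ttest_ind(toy_default, toy_non_default, equal_var=False)
t_pooled, _ = stats.ttest_ind(toy_default, toy_non_default, equal_var=True)

print(f"Manual t-statistic:     {t_manual:.4f}")
print(f"scipy, equal_var=False: {t_welch:.4f}")
print(f"scipy, equal_var=True:  {t_pooled:.4f}")
```

The manual value should agree with the `equal_var=False` result, while the pooled version differs whenever the group sizes and variances are both unequal.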

In [None]:
from scipy import stats

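# Perform two-sample t-test
# equal_var=False selects Welch's t-test, which matches the unpooled formula above
t_stat, p_value = stats.ttest_ind(default_scores, non_default_scores, equal_var=False)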

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.5f}")

### <span style="color:BLUE"> **>>> QUESTION:**  </span>    
> What conclusions should we share with the Head of Lending?



### Result Interpretation

-   **T-statistic:** Measures the difference between the means of the two groups relative to the variation in the sample data (the standard error of the difference). A larger absolute value indicates a difference that is larger relative to that variation.

-   **P-value:** The probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true. It helps to determine the statistical significance of the test.

    -   **Low P-value (< 0.05):** Indicates strong evidence against the null hypothesis, suggesting that the observed difference is statistically significant.
    -   **High P-value (>= 0.05):** Indicates weak evidence against the null hypothesis, suggesting that the observed difference could be due to chance.
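
The decision rule described above can be sketched as a small helper. This is a minimal illustration, not part of the course material: the function name `interpret_p_value` and the default significance level `alpha=0.05` are assumptions chosen to match the thresholds in the bullets.

```python
def interpret_p_value(p_value: float, alpha: float = 0.05) -> str:
    """Return a plain-language decision for a hypothesis test at level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return f"p = {p_value:.4f} < {alpha}: reject the null hypothesis"
    return f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis"


# Example usage with two hypothetical p-values
print(interpret_p_value(0.003))
print(interpret_p_value(0.20))
```

Note that "fail to reject" is deliberately not phrased as "accept": a high p-value means the data are compatible with the null hypothesis, not that the null hypothesis has been shown to be true.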

### <span style="color:BLUE"> **>>> EXERCISE:**  </span>    
> The Head of Lending has now asked you to test if the default rates vary significantly across loan grades.  
> What test would you be using for this case and what would be your null and alternative hypotheses?    
> Please implement in the cells below.  

> Tip: check out the following [guide](https://statsandr.com/blog/files/overview-statistical-tests-statsandr.pdf) on how to choose the right test.  
> For python implementation of tests, check out [scipy docs](https://docs.scipy.org/doc/scipy/reference/stats.html).

In [None]:
# YOUR CODE HERE

**Solution:**  

**Choosing the right statistical test**

We are dealing with two categorical variables:
- Loan Grade - Categorical: e.g., A, B, etc.
- Default Status - Categorical: Defaulted or Not Defaulted  

The most suitable test is therefore the Chi-Square Test of Independence, which evaluates whether there is a statistically significant association between these two variables.

**Formulating the Hypotheses**
- Null Hypothesis (H₀): There is no association between loan grade and default rate. In simpler terms, the default rate is independent of the loan grade.
- Alternative Hypothesis (H₁): There is an association between loan grade and default rate. The default rate is dependent on the loan grade.

In [None]:
# SOLUTION
# Null Hypothesis (H₀): Loan grade is independent of loan default 
# (i.e., default rates are the same across all loan grades).

# Alternative Hypothesis (H₁): Loan grade is not independent of loan default 
# (i.e., default rates vary across different loan grades).

df['loan_default'] = df.loan_status == "Charged Off"

# Create a contingency table (cross-tabulation) of loan grades and loan status
contingency_table = pd.crosstab(df['grade'], df['loan_default'])

# Perform Chi-Square Test of Independence
chi2_stat, p_val, dof, expected = stats.chi2_contingency(contingency_table)

# Output the results
print(f"Chi-Square Statistic: {chi2_stat:.2f}")
print(f"P-value: {p_val:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected frequencies (under H0):")
print(expected)


### **Interpretation:**

-   **Chi-Square Statistic:** This value tells us the magnitude of the difference between the observed and expected values.
-   **P-value:** The p-value tells us whether the observed association between loan grade and default status is statistically significant. If the p-value is less than a chosen significance level (commonly 0.05), we reject the null hypothesis and conclude that loan grade and default rate are not independent. A p-value that prints as 0.0000 (essentially zero) means it would be extremely unlikely to observe such a large Chi-Square statistic if loan grade and default were truly independent.

For example:

-   If the p-value is **less than 0.05**, there is evidence to suggest that default rates vary significantly across different loan grades.
-   If the p-value is **greater than or equal to 0.05**, we do not have enough evidence to reject the null hypothesis. This does not prove that loan grade and default are independent, only that the data do not show a significant association.
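
To make the Chi-Square statistic more concrete, the sketch below computes it by hand as the sum of (observed - expected)² / expected over all cells and compares the result with `stats.chi2_contingency`. The counts are hypothetical and the `toy_` names are introduced only for this illustration, so that the notebook's own variables are left untouched.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are grades A, B, C; columns are [not defaulted, defaulted]
toy_observed = np.array([
    [90, 10],
    [80, 20],
    [60, 40],
])

# Expected counts under independence: (row total * column total) / grand total
row_totals = toy_observed.sum(axis=1, keepdims=True)
col_totals = toy_observed.sum(axis=0, keepdims=True)
toy_expected = row_totals * col_totals / toy_observed.sum()

# Chi-square statistic: squared differences scaled by the expected counts
chi2_manual = ((toy_observed - toy_expected) ** 2 / toy_expected).sum()

# scipy returns the same statistic; its continuity correction only applies
# to tables with one degree of freedom (2x2), so it is not used here
chi2_scipy, toy_p, toy_dof, _ = stats.chi2_contingency(toy_observed)

print(f"Manual Chi-Square: {chi2_manual:.4f}")
print(f"scipy Chi-Square:  {chi2_scipy:.4f}")
print(f"Degrees of freedom: {toy_dof}")
```

The degrees of freedom equal (rows - 1) × (columns - 1), which is 2 for this 3×2 toy table.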

* * * * *

This test gives insight into whether loan grades (a measure of creditworthiness) are related to the likelihood of default, helping lenders fine-tune their credit risk models.

### <span style="color:BLUE"> **>>> OPTIONAL EXERCISE (advanced): Bayesian Inference Implementation**  </span>    

In this exercise, you will apply **Bayesian Inference** to estimate the probability of default based on prior beliefs and new data. Bayesian inference is a powerful tool for updating probabilities as new information becomes available, and is commonly used in finance for risk assessment and modeling uncertainties.

* * * * *

### **Objective:**

The goal is to implement a simple Bayesian calculation that gives the probability of a borrower defaulting on a loan, given prior probabilities and new evidence.

#### **Bayesian Inference Basics**:

-   **Prior:** Initial belief about a parameter before seeing any data.
-   **Likelihood:** The probability of observing the data given the hypothesis.
-   **Posterior:** The updated belief after considering the new data (evidence).

The formula for Bayesian inference is given by:

$$
P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}
$$

Where:

-   P(H) is the **prior** probability (initial belief).
-   P(H∣E) is the **posterior** probability (the updated probability after observing the evidence).
-   P(E∣H) is the **likelihood** (the probability of the evidence given the hypothesis).
-   P(E) is the **evidence** (the total probability of the evidence across all hypotheses).
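
When there are only two competing hypotheses, $H$ and its complement, the evidence term can be expanded with the law of total probability. The expression below writes this out; the symbol $\neg H$ ("not $H$") is notation introduced here for illustration.

$$
P(E) = P(E \mid H) \, P(H) + P(E \mid \neg H) \, P(\neg H), \qquad P(\neg H) = 1 - P(H)
$$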

* * * * *

### **Scenario:**

A bank wants to estimate the probability that a borrower will **default** on a loan based on two pieces of information:

1.  **Prior Information**: The bank has historical data showing that 5% of all borrowers default on their loans. This is the **prior probability** P(Default)=0.05.

2.  **New Evidence**: The bank analyzes the borrower's credit score. Based on historical data:

    -   70% of borrowers who defaulted had a credit score below 600.
    -   20% of borrowers who did not default had a credit score below 600.

The borrower in question has a **credit score below 600**. You will calculate the **posterior probability** of the borrower defaulting using this new evidence.

In [None]:
# # a) Define prior probabilities
# P_default = # YOUR CODE HERE
# P_no_default = # YOUR CODE HERE

# # b) Define likelihoods based on new evidence
# P_evidence_given_default =    # YOUR CODE HERE
# P_evidence_given_no_default = # YOUR CODE HERE

# # c) Calculate the total probability of the evidence
# P_evidence = # YOUR CODE HERE

# # d) Calculate the posterior probability of default given the evidence
# P_default_given_evidence = # YOUR CODE HERE

# # Output the results
# print(f"Posterior Probability of Default given credit score < 600: {P_default_given_evidence:.4f}")


In [None]:
# SOLUTION
# a) Define prior probabilities
P_default = 0.05              # Prior probability of default
P_no_default = 1 - P_default  # Prior probability of no default

# b) Define likelihoods based on new evidence
P_evidence_given_default = 0.70     # Likelihood of credit score < 600 given default
P_evidence_given_no_default = 0.20  # Likelihood of credit score < 600 given no default

# c) Calculate the total probability of the evidence
P_evidence = (P_evidence_given_default * P_default) + (P_evidence_given_no_default * P_no_default)

# d) Calculate the posterior probability of default given the evidence
P_default_given_evidence = (P_evidence_given_default * P_default) / P_evidence

# Output the results
print(f"Posterior Probability of Default given credit score < 600: {P_default_given_evidence:.4f}")
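
The same calculation can be wrapped in a small reusable function, which makes it easy to try other priors or likelihoods. This is only a sketch for the two-hypothesis case: the function name `bayes_posterior` and its parameter names are hypothetical and not part of the course material.

```python
def bayes_posterior(prior: float, likelihood_h: float, likelihood_not_h: float) -> float:
    """Posterior P(H | E) for a hypothesis H and its complement."""
    # Total probability of the evidence across both hypotheses
    evidence = likelihood_h * prior + likelihood_not_h * (1 - prior)
    if evidence == 0:
        raise ValueError("The evidence has zero probability under both hypotheses")
    return likelihood_h * prior / evidence


# Scenario values from this exercise: prior 5%, likelihoods 70% and 20%
posterior = bayes_posterior(prior=0.05, likelihood_h=0.70, likelihood_not_h=0.20)
print(f"Posterior probability of default: {posterior:.4f}")
```

With the scenario's numbers the posterior is 0.035 / 0.225, or about 15.6%, roughly three times the 5% prior.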


# Appendix

## Calculating t statistic manually vs. scipy library

**Steps to Calculate the T-Statistic:**

1. **Calculate the Sample Means:**

   For each group, compute the mean of the sample data.

2. **Calculate the Sample Variances:**

   Compute the variance for each group. Variance measures how much the data points differ from the mean.

3. **Compute the Standard Error:**

   Divide each group's variance by its sample size, add the two results, and take the square root of the sum. This is the standard error of the difference between the means.

4. **Calculate the T-Statistic:**

   Plug the sample means, variances, and sample sizes into the formula to get the t-statistic.

### Python Code Walkthrough

In [None]:

import numpy as np
from scipy import stats

# Sample data: credit scores of default and non-default borrowers
default_scores = np.array([600, 620, 580, 570, 630])
non_default_scores = np.array([700, 720, 710, 690, 740])

# Compute sample means
mean_default = np.mean(default_scores)
mean_non_default = np.mean(non_default_scores)

# Compute sample variances
var_default = np.var(default_scores, ddof=1)
var_non_default = np.var(non_default_scores, ddof=1)

# Sample sizes
n_default = len(default_scores)
n_non_default = len(non_default_scores)

# Calculate the t-statistic
t_stat = (mean_default - mean_non_default) / np.sqrt((var_default / n_default) + (var_non_default / n_non_default))

# Perform two-sample t-test using scipy
# equal_var=False selects Welch's t-test, which matches the manual formula above
t_stat_scipy, p_value = stats.ttest_ind(default_scores, non_default_scores, equal_var=False)

print(f"Calculated T-statistic: {t_stat}")
print(f"Scipy T-statistic: {t_stat_scipy}")
print(f"P-value: {p_value}")

# End of session

In [None]:
from IPython.display import Image
Image(filename=f"{path_python_material}/images/the-end.jpg", width=500)
