## Analysing Whether Grade is a Predictor for Default Through Chi-Square Testing

Here we test whether loan grade is a useful predictor of default, i.e., whether the probability of default depends on loan grade. The null hypothesis is that default and grade are independent.

$H_0: P(Default | Grade) = P(Default)$

$H_1: P(Default | Grade) \neq P(Default)$

We use the chi square statistic to test independence.

$\chi^2 = \sum \frac{(O-E)^2}{E}$

Here O is the **observed** _count_ and E is the **expected** _count_ in each cell of the contingency table (defaults and non-defaults for each Grade).

This is a right-tailed test: the chi-square statistic is never negative, a value near zero is consistent with independence, and a large value indicates dependence.

If the test statistic exceeds the critical value $\chi^2_{\alpha}$ for the appropriate degrees of freedom, we reject $H_0$ at significance level $\alpha$.
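
As a minimal sketch of this decision rule, the critical value can be looked up with `scipy.stats.chi2.ppf`. The significance level $\alpha = 0.05$ and the 6 degrees of freedom are assumptions made for illustration (6 corresponds to a 2x7 table of default status by Grade), and `reject_h0` is a hypothetical helper name.

```python
from scipy.stats import chi2

alpha = 0.05  # assumed significance level, for illustration only
dof = 6       # assumed degrees of freedom: (2 - 1) * (7 - 1) for a 2x7 table

# Critical value: the point that leaves probability alpha in the right tail.
critical_value = float(chi2.ppf(1 - alpha, dof))


def reject_h0(test_statistic: float) -> bool:
    """Right-tailed rule: reject H0 when the statistic exceeds the critical value."""
    return test_statistic > critical_value


print(f"critical value: {critical_value:.3f}")
```

Any statistic above this critical value falls in the rejection region at the assumed $\alpha$.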


In [1]:
import pandas as pd
import numpy as np
from main import preprocess_df
from collections import defaultdict as dd


df = pd.read_csv(
    "./datasets/lc_data_2007_to_2018.csv",
    low_memory=False,
    encoding="latin1",
    nrows=100000,  # only looking at 100k rows right now for performance
)
pd.set_option("display.max_columns", None)
cleaned_df = preprocess_df(df)

In [2]:
print(cleaned_df["did_default"].value_counts())
overall_percentage_of_default = sum(cleaned_df["did_default"]) / len(cleaned_df) * 100
print(f"Overall percentage of default: {round(overall_percentage_of_default, 2)}%")

did_default
False    70288
True     17603
Name: count, dtype: int64
Overall percentage of default: 20.03%


About 20% of the loans that remain after preprocessing the first 100,000 rows of the dataset were defaulted on, so under $H_0$ we expect each Grade to have a default rate of about 20%.


In [3]:
grade_map = "ABCDEFG"
cleaned_df.loc[:, "grade_as_letter"] = cleaned_df["grade"].map(lambda x: grade_map[x])
loans_per_grade_dict = dd(int)
for grade in grade_map:
    loans_per_grade_dict[grade] = cleaned_df["grade_as_letter"].value_counts()[grade]
print(loans_per_grade_dict)

defaultdict(<class 'int'>, {'A': np.int64(17059), 'B': np.int64(28027), 'C': np.int64(24487), 'D': np.int64(10951), 'E': np.int64(5498), 'F': np.int64(1532), 'G': np.int64(337)})


The total number of loans in each Grade is shown in this table.

```
A  |  17059
B  |  28027
C  |  24487
D  |  10951
E  |   5498
F  |   1532
G  |    337
```


In [4]:
expected_defaults_dict = dd(int)
for grade in grade_map:
    specific_grade_df = cleaned_df.loc[cleaned_df["grade_as_letter"] == grade, :]
    # 0.2 is the overall default rate (20.03%) rounded to one significant figure
    expected_num_defaults = 0.2 * len(specific_grade_df)
    expected_defaults_dict[grade] = expected_num_defaults
    print(f"Grade: {grade}")
    print("expected:", round(expected_num_defaults))

Grade: A
expected: 3412
Grade: B
expected: 5605
Grade: C
expected: 4897
Grade: D
expected: 2190
Grade: E
expected: 1100
Grade: F
expected: 306
Grade: G
expected: 67


Since $H_0$ states that Grade and default are independent, every Grade should have the same default rate under $H_0$.

Therefore, since the overall default rate is about 20%, we expect 20% of the loans in each Grade to have defaulted:

### Expected Default Counts for Each Grade

```
A  |  3412
B  |  5605
C  |  4897
D  |  2190
E  |  1100
F  |   306
G  |    67
```
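
The expected counts above use the rounded rate 0.2. As a rough check on how much that rounding matters, the sketch below recomputes the expected defaults with the exact pooled rate (17603 defaults out of 87891 loans). The counts are copied from the outputs above; the variable names are hypothetical and not part of the notebook.

```python
# Counts copied from the notebook outputs above.
loans_per_grade = {
    "A": 17059, "B": 28027, "C": 24487, "D": 10951,
    "E": 5498, "F": 1532, "G": 337,
}
total_defaults = 17603
total_loans = sum(loans_per_grade.values())  # 87891

# Exact pooled default rate, about 0.2003 rather than 0.2.
pooled_rate = total_defaults / total_loans

expected_exact = {g: pooled_rate * n for g, n in loans_per_grade.items()}
expected_rounded = {g: 0.2 * n for g, n in loans_per_grade.items()}

for g in loans_per_grade:
    print(
        f"{g} | exact rate: {expected_exact[g]:8.1f} "
        f"| rate 0.2: {expected_rounded[g]:8.1f}"
    )
```

With the exact pooled rate, the expected defaults sum to the observed total of 17603, which is the property a chi-square test of independence relies on. The differences from the rounded version are small at this sample size.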


In [5]:
# find the observed number of defaults for each grade

observed_defaults_dict = dd(int)
for grade in grade_map:
    specific_grade_df = cleaned_df.loc[cleaned_df["grade_as_letter"] == grade, :]
    observed_num_defaults = sum(specific_grade_df["did_default"])
    observed_defaults_dict[grade] = observed_num_defaults
    print(
        f"Grade: {grade} | observed count of defaults: {round(observed_num_defaults)}"
    )

Grade: A | observed count of defaults: 907
Grade: B | observed count of defaults: 3699
Grade: C | observed count of defaults: 5803
Grade: D | observed count of defaults: 3776
Grade: E | observed count of defaults: 2410
Grade: F | observed count of defaults: 809
Grade: G | observed count of defaults: 199


### Observed Default Counts for Each Grade

```
A |  907
B | 3699
C | 5803
D | 3776
E | 2410
F |  809
G |  199
```


In [6]:
for grade in grade_map:
    num_loans_with_grade = sum(cleaned_df["grade_as_letter"] == grade)
    num_defaulted_loans_in_grade = len(
        cleaned_df[
            (cleaned_df["grade_as_letter"] == grade)
            & (cleaned_df["did_default"] == True)
        ]
    )
    percentage_of_defaults = num_defaulted_loans_in_grade / num_loans_with_grade * 100
    print(
        grade,
        "| Number of loans:",
        num_loans_with_grade,
        "| Rate of default:",
        round(percentage_of_defaults, 2),
    )

A | Number of loans: 17059 | Rate of default: 5.32
B | Number of loans: 28027 | Rate of default: 13.2
C | Number of loans: 24487 | Rate of default: 23.7
D | Number of loans: 10951 | Rate of default: 34.48
E | Number of loans: 5498 | Rate of default: 43.83
F | Number of loans: 1532 | Rate of default: 52.81
G | Number of loans: 337 | Rate of default: 59.05


### Observed Percentage Rates of Default

Some Grades had a much smaller sample size than others, G being the smallest at $n = 337$. Because the sample sizes differ so much, raw counts are hard to compare across Grades. For illustrative purposes, the default rate of each Grade is shown as a percentage in this table.

```
A | Number of loans: 17059 | % of default:  5.32%
B | Number of loans: 28027 | % of default: 13.20%
C | Number of loans: 24487 | % of default: 23.70%
D | Number of loans: 10951 | % of default: 34.48%
E | Number of loans: 5498  | % of default: 43.83%
F | Number of loans: 1532  | % of default: 52.81%
G | Number of loans: 337   | % of default: 59.05%
```


In [7]:
from scipy.stats import chi2
import math

print(observed_defaults_dict)
print(expected_defaults_dict)

default_scaled_sqd_devs = [
    (observed_defaults_dict[grade] - expected_defaults_dict[grade]) ** 2
    / (expected_defaults_dict[grade])
    for grade in grade_map
]
non_default_scaled_sqd_devs = [
    (observed_defaults_dict[grade] - expected_defaults_dict[grade]) ** 2
    / (loans_per_grade_dict[grade] - expected_defaults_dict[grade])
    for grade in grade_map
]
test_stat = sum(default_scaled_sqd_devs) + sum(non_default_scaled_sqd_devs)
print(default_scaled_sqd_devs)
print(non_default_scaled_sqd_devs)

print(test_stat)

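chi2_stat = test_stat  # about 8057.41, computed above
dof = 6  # (2 - 1) * (7 - 1); named dof so the DataFrame `df` is not overwritten

log_p = chi2.logsf(chi2_stat, dof)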
print("log_p:", log_p)

p_value = math.exp(log_p)

print("p_value:", p_value)

defaultdict(<class 'int'>, {'A': 907, 'B': 3699, 'C': 5803, 'D': 3776, 'E': 2410, 'F': 809, 'G': 199})
defaultdict(<class 'int'>, {'A': 3411.8, 'B': 5605.400000000001, 'C': 4897.400000000001, 'D': 2190.2000000000003, 'E': 1099.6000000000001, 'F': 306.40000000000003, 'G': 67.4})
[1838.9187642886454, 648.3678167481361, 167.4585208477966, 1148.188128937996, 1561.6116405965802, 824.434595300261, 256.95192878338275]
[np.float64(459.72969107216136), np.float64(162.09195418703405), np.float64(41.864630211949155), np.float64(287.04703223449906), np.float64(390.4029101491451), np.float64(206.10864882506527), np.float64(64.23798219584569)]
8057.414244378499
log_p: -inf
p_value: 0.0


### The Test

For our test statistic, we must compute $\chi^2 = \sum \frac{(O-E)^2}{E}$ over all cells of the contingency table. That means summing not only over the default cells (observed vs. expected defaults) but over the non-default cells as well.
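
To make the sum over all cells concrete, write $n_g$ for the number of loans in Grade $g$, $O_g$ for its observed defaults, and $E_g$ for its expected defaults (these symbols are introduced here for illustration). The non-default cell has observed count $n_g - O_g$ and expected count $n_g - E_g$, so its deviation is $-(O_g - E_g)$ and both cells share the same squared numerator:

$\chi^2 = \sum_{g \in \{A, \dots, G\}} \left[ \frac{(O_g - E_g)^2}{E_g} + \frac{(O_g - E_g)^2}{n_g - E_g} \right]$

For Grade A, with $n_A = 17059$, $O_A = 907$ and $E_A = 3411.8$:

$\frac{(907 - 3411.8)^2}{3411.8} + \frac{(907 - 3411.8)^2}{17059 - 3411.8} \approx 1838.92 + 459.73 = 2298.65$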

Our test statistic comes out to be $8057.414$.

Since we are testing Grades A-G, we have 7 Grades, and we are testing their association with a binary variable (defaulted or did not default). The contingency table is therefore 2x7, and the degrees of freedom are $(r-1)(c-1) = (2-1)(7-1) = 6$.
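
As a cross-check on the degrees of freedom, the same 2x7 table can be passed to `scipy.stats.chi2_contingency`. This is a sketch using counts copied from the outputs above, not the notebook's own method. SciPy derives the expected counts from the table margins (the exact pooled default rate rather than the rounded 0.2), so its statistic should be close to, but not identical to, the value computed by hand.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the notebook outputs above, in Grade order A..G.
loans = np.array([17059, 28027, 24487, 10951, 5498, 1532, 337])
defaults = np.array([907, 3699, 5803, 3776, 2410, 809, 199])

# 2x7 contingency table: row 0 = defaulted, row 1 = did not default.
table = np.vstack([defaults, loans - defaults])

# Returns the statistic, p-value, degrees of freedom and expected counts.
stat, p_value, dof, expected = chi2_contingency(table)

print("degrees of freedom:", dof)
print("statistic:", round(float(stat), 2))
```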

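The $p$-value is the probability of observing a value at least as large as $8057.414$ (more extreme, since the test is right-tailed) under a chi-square distribution with 6 degrees of freedom. It is $\lt 10^{-300}$, small enough that the code above reports it as $0.0$, so we reject $H_0$ and conclude that default is not independent of Grade.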
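
Because a $p$-value this small is reported as `0.0` in the output above, it can help to work in log space. For an even number of degrees of freedom $2k$, the chi-square survival function has the closed form $e^{-x/2} \sum_{i=0}^{k-1} \frac{(x/2)^i}{i!}$, so its logarithm can be evaluated without underflow. The following is a sketch based on that identity; the function name `log_chi2_sf_even_dof` is hypothetical and not part of SciPy.

```python
import math


def log_chi2_sf_even_dof(x: float, dof: int) -> float:
    """Natural log of P(X >= x) for X ~ chi-square with a positive even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form requires a positive even dof")
    if x < 0:
        raise ValueError("x must be non-negative")
    half_x = x / 2
    # Sum of (x/2)^i / i! for i = 0 .. dof/2 - 1, built term by term.
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half_x / i
        total += term
    return -half_x + math.log(total)


log_p = log_chi2_sf_even_dof(8057.414, 6)
log10_p = log_p / math.log(10)
print(f"ln(p) = {log_p:.1f}, log10(p) = {log10_p:.1f}")
```

Under these assumptions the $p$-value is on the order of $10^{-1743}$, far below any conventional significance level and consistent with the bound stated above.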
