## Analysing Whether Grade is a Predictor for Default Through Chi-Square Testing

Here we test whether loan grade is a useful predictor of default, i.e., whether loan grade and default are statistically dependent. The null hypothesis is that they are independent.

$H_0: P(Default \cap Grade) = P(Default) \times P(Grade)$

$H_1: P(Default \cap Grade) \neq P(Default) \times P(Grade)$

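We use the chi-square statistic to test independence, summing over every cell of the Grade by Default contingency table, where $O$ is the observed count and $E$ is the count expected under $H_0$.

$\chi^2 = \sum \frac{(O-E)^2}{E}$

The statistic is non-negative and only large values count as evidence against $H_0$, so the test is upper-tailed: we reject $H_0$ at significance level $\alpha$ if the test statistic exceeds $\chi^2_{\alpha,\,df}$, where $df = (7-1)(2-1) = 6$ for seven grades and two outcomes.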
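The decision rule can equivalently be stated in terms of a p-value: reject $H_0$ when the upper-tail probability of the statistic is below $\alpha$. The following is a minimal sketch using only the standard library. The helper names `chi2_sf_even_df` and `reject_h0` are hypothetical, and the closed form used here only holds for an even number of degrees of freedom (which is the case for $df = 6$). In practice `scipy.stats.chi2.sf` would be the usual way to get this tail probability.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is only valid when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    return math.exp(-half) * sum(
        half**k / math.factorial(k) for k in range(df // 2)
    )


def reject_h0(statistic: float, df: int, alpha: float = 0.05) -> bool:
    """Upper-tail decision rule: reject H0 when the p-value is below alpha."""
    return chi2_sf_even_df(statistic, df) < alpha


# 7 grades x 2 outcomes gives (7 - 1) * (2 - 1) = 6 degrees of freedom.
df_grades = (7 - 1) * (2 - 1)

# Hypothetical statistics, for illustration only (not computed from the loans).
print(reject_h0(20.0, df_grades))
print(reject_h0(5.0, df_grades))
```

Working with the p-value avoids looking up a critical value for each choice of $\alpha$; the two formulations lead to the same decision.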


In [None]:
import pandas as pd
import numpy as np
from main import preprocess_df

df = pd.read_csv(
    "./datasets/lc_data_2007_to_2018.csv",
    low_memory=False,
    encoding="latin1",
    nrows=100000,  # only looking at 100k rows right now for performance
)
pd.set_option("display.max_columns", None)
cleaned_df = preprocess_df(df)

In [None]:
print(cleaned_df["did_default"].value_counts())
grade_map = "ABCDEFG"
cleaned_df.loc[:, "grade_as_letter"] = cleaned_df["grade"].map(lambda x: grade_map[x])
print(cleaned_df["grade_as_letter"].value_counts())

## Aggregate Counts for Grades of Loans

```
A  |  17059
B  |  28027
C  |  24487
D  |  10951
E  |   5498
F  |   1532
G  |    337
```


Since $H_0$ states that Grade and Default Rate are independent, we should expect all Grades to have equal rates of default.

Therefore, since the overall rate of default is approximately 20%:

### Expected percentage rates of Default

```
A  |  20
B  |  20
C  |  20
D  |  20
E  |  20
F  |  20
G  |  20
```
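The chi-square statistic is computed on counts rather than percentages, so the expected rate has to be converted into expected numbers of loans per grade. A small sketch of that arithmetic follows, using the grade counts listed above and assuming the rounded 20% overall rate; the variable names are illustrative.

```python
# Loan counts per grade, copied from the aggregate counts above.
grade_counts = {
    "A": 17059,
    "B": 28027,
    "C": 24487,
    "D": 10951,
    "E": 5498,
    "F": 1532,
    "G": 337,
}
overall_default_rate = 0.20  # assumption: the rounded overall rate quoted above

# Under H0 every grade defaults at the overall rate, so the expected
# counts are simply n * rate (defaults) and n * (1 - rate) (non-defaults).
expected_counts = {
    grade: (n * overall_default_rate, n * (1 - overall_default_rate))
    for grade, n in grade_counts.items()
}

for grade, (exp_default, exp_non_default) in expected_counts.items():
    print(
        f"{grade} | expected defaults: {exp_default:8.1f}"
        f" | expected non-defaults: {exp_non_default:8.1f}"
    )
```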


In [None]:
for grade in grade_map:
    # Boolean masks: loans in this grade, and those among them that defaulted
    in_grade = cleaned_df["grade_as_letter"] == grade
    defaulted_in_grade = in_grade & (cleaned_df["did_default"] == True)
    num_loans_with_grade = in_grade.sum()
    num_defaulted_loans_in_grade = defaulted_in_grade.sum()
    percentage_of_defaults = num_defaulted_loans_in_grade / num_loans_with_grade * 100
    print(
        grade,
        "| Number of loans:",
        num_loans_with_grade,
        "| % of default:",
        round(percentage_of_defaults, 2),
    )

### Observed percentage rates of Default

The observed default rates for each grade are shown below. Some grades have far fewer loans than others; G is the smallest, with $n = 337$.

```
A | Number of loans: 17059 | % of default:  5.32
B | Number of loans: 28027 | % of default: 13.20
C | Number of loans: 24487 | % of default: 23.70
D | Number of loans: 10951 | % of default: 34.48
E | Number of loans: 5498  | % of default: 43.83
F | Number of loans: 1532  | % of default: 52.81
G | Number of loans: 337   | % of default: 59.05
```
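As a rough check, the statistic can be computed directly from the table above. This is a sketch under stated assumptions: the raw default counts are not shown here, so they are approximated by rounding $n \times \text{rate}$, and the critical value 12.592 is the standard tabulated value for $\alpha = 0.05$ with 6 degrees of freedom. In a full analysis, `pd.crosstab` on `cleaned_df` together with `scipy.stats.chi2_contingency` would give the exact figures.

```python
# (number of loans, default rate in %) per grade, from the table above.
observed_rates = {
    "A": (17059, 5.32),
    "B": (28027, 13.20),
    "C": (24487, 23.70),
    "D": (10951, 34.48),
    "E": (5498, 43.83),
    "F": (1532, 52.81),
    "G": (337, 59.05),
}

# Approximate contingency table: one row per grade, columns are
# [defaulted, did not default]. Counts are reconstructed from rounded rates.
observed = []
for n, pct in observed_rates.values():
    defaults = round(n * pct / 100)
    observed.append([defaults, n - defaults])

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Sum (O - E)^2 / E over all cells, with E = row total * column total / N.
statistic = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total
        statistic += (o - e) ** 2 / e

dof = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 12.592  # tabulated chi-square value, alpha = 0.05, dof = 6

print(f"chi2 = {statistic:.1f}, dof = {dof}")
print("reject H0:", statistic > critical_value)
```

With default rates ranging from about 5% to about 59% across grades, the statistic lands well above the critical value, so $H_0$ would be rejected at the 5% level on this sample.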
