## Analysing Whether Grade is a Predictor for Default Through Chi-Square Testing

Here we test whether loan grade is associated with default, i.e. whether loan grade and default are statistically independent. If they are not independent, grade carries information about default risk.

$H_0: P(\text{Default} \cap \text{Grade}) = P(\text{Default}) \times P(\text{Grade})$

$H_1: P(\text{Default} \cap \text{Grade}) \neq P(\text{Default}) \times P(\text{Grade})$

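We use the chi-square statistic to test independence.

$\chi^2 = \sum \frac{(O-E)^2}{E}$

where $O$ is the **observed** _count_ and $E$ is the **expected** _count_ under $H_0$ in each cell of the Grade by Default table, so the sum covers both defaulted and non-defaulted loans in every Grade. Under $H_0$ this statistic approximately follows a $\chi^2$ distribution.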

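If the test statistic exceeds the critical value $\chi^2_{\alpha,\,df}$, where $df = (7-1)(2-1) = 6$ for seven Grades and two outcomes, then we reject $H_0$ at significance level $\alpha$. The statistic is never negative and the test is upper-tailed, so the whole of $\alpha$ sits in the right tail.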
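The decision rule can equivalently be phrased in terms of a p-value. The sketch below is a minimal, standard-library-only illustration: the function names `chi2_sf_even_df` and `reject_h0` are made up for this example, and the closed form it uses is valid only for an even number of degrees of freedom (which happens to hold for $df = 6$). In practice `scipy.stats.chi2.sf` would be the usual tool.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, which
    holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    terms = sum(half**k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


def reject_h0(statistic: float, df: int, alpha: float = 0.05) -> bool:
    """Upper-tailed decision rule: reject H0 when the p-value is below alpha."""
    return chi2_sf_even_df(statistic, df) < alpha


# Seven Grades and two outcomes give df = (7 - 1) * (2 - 1) = 6.
df_grades = (7 - 1) * (2 - 1)
# 12.59 is roughly the 5% critical value for 6 degrees of freedom.
p_at_critical = chi2_sf_even_df(12.59, df_grades)
print(f"p-value at statistic 12.59: {p_at_critical:.4f}")
```

Comparing the p-value with $\alpha$ gives the same decision as comparing the statistic with the critical value, and avoids looking the critical value up in a table.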


In [None]:
import pandas as pd
import numpy as np
from main import preprocess_df

df = pd.read_csv(
    "./datasets/lc_data_2007_to_2018.csv",
    low_memory=False,
    encoding="latin1",
    nrows=100000,  # only looking at 100k rows right now for performance
)
pd.set_option("display.max_columns", None)
cleaned_df = preprocess_df(df)

In [None]:
print(cleaned_df["did_default"].value_counts())
overall_percentage_of_default = sum(cleaned_df["did_default"]) / len(cleaned_df) * 100
print(f"Overall percentage of default: {round(overall_percentage_of_default, 2)}%")

Since roughly 20% of the loans in the first 100,000 rows of the dataset defaulted, under $H_0$ we expect each Grade to have a default rate of about 20%.


In [None]:
grade_map = "ABCDEFG"
cleaned_df.loc[:, "grade_as_letter"] = cleaned_df["grade"].map(lambda x: grade_map[x])
print(cleaned_df["grade_as_letter"].value_counts())

The total number of loans in each Grade is shown in this table.

```
A  |  17059
B  |  28027
C  |  24487
D  |  10951
E  |   5498
F  |   1532
G  |    337
```


In [None]:
for grade in grade_map:
    specific_grade_df = cleaned_df.loc[cleaned_df["grade_as_letter"] == grade, :]
    # expected number of defaults under H0, using the rounded overall default rate of 20%
    expected_num_defaults = 0.2 * len(specific_grade_df)
    print(f"Grade: {grade}")
    print("expected:", round(expected_num_defaults))

Since $H_0$ states that Grade and default are independent, every Grade should have the same default rate under $H_0$.

The overall default rate is about 20%, so we expect about 20% of the loans in each Grade to default:

### Expected Default Counts for Each Grade

```
A  |  3412
B  |  5605
C  |  4897
D  |  2190
E  |  1100
F  |   306
G  |    67
```


In [None]:
# find the observed number of defaults for each grade
for grade in grade_map:
    specific_grade_df = cleaned_df.loc[cleaned_df["grade_as_letter"] == grade, :]
    observed_num_defaults = sum(specific_grade_df["did_default"])
    print(
        f"Grade: {grade} | observed count of defaults: {round(observed_num_defaults)}"
    )

### Observed Default Counts for Each Grade

```
A |  907
B | 3699
C | 5803
D | 3776
E | 2410
F |  809
G |  199
```
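As a rough check, the chi-square statistic can be computed directly from the counts in the tables above. The sketch below is an illustration, not the notebook's own cell: the variable names are made up, the counts are copied by hand from the tables, and it uses the exact pooled default rate rather than the rounded 20%. It sums over both the default and non-default cell of every Grade. In practice `scipy.stats.chi2_contingency` would be the usual library route.

```python
# Counts copied from the tables in this notebook (loans and defaults per Grade).
loans_per_grade = {
    "A": 17059, "B": 28027, "C": 24487, "D": 10951,
    "E": 5498, "F": 1532, "G": 337,
}
defaults_per_grade = {
    "A": 907, "B": 3699, "C": 5803, "D": 3776,
    "E": 2410, "F": 809, "G": 199,
}

total_loans = sum(loans_per_grade.values())
total_defaults = sum(defaults_per_grade.values())
# Exact pooled default rate, rather than the rounded 20% used above.
pooled_default_rate = total_defaults / total_loans

chi_square = 0.0
for grade, n_loans in loans_per_grade.items():
    observed_default = defaults_per_grade[grade]
    observed_non_default = n_loans - observed_default
    # Expected counts under H0: every Grade shares the pooled default rate.
    expected_default = n_loans * pooled_default_rate
    expected_non_default = n_loans * (1 - pooled_default_rate)
    chi_square += (observed_default - expected_default) ** 2 / expected_default
    chi_square += (
        (observed_non_default - expected_non_default) ** 2 / expected_non_default
    )

# (number of Grades - 1) * (number of outcomes - 1)
degrees_of_freedom = (len(loans_per_grade) - 1) * (2 - 1)
print(f"chi-square statistic: {chi_square:.1f}, df: {degrees_of_freedom}")
```

With these counts the statistic comes out far above the 5% critical value of roughly 12.59 for 6 degrees of freedom, since the Grade A default cell alone contributes well over 1,000.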


In [None]:
for grade in grade_map:
    num_loans_with_grade = sum(cleaned_df["grade_as_letter"] == grade)
    num_defaulted_loans_in_grade = len(
        cleaned_df[
            (cleaned_df["grade_as_letter"] == grade)
            & (cleaned_df["did_default"] == True)
        ]
    )
    percentage_of_defaults = num_defaulted_loans_in_grade / num_loans_with_grade * 100
    print(
        grade,
        "| Number of loans:",
        num_loans_with_grade,
        "| Rate of default:",
        round(percentage_of_defaults, 2),
    )

### Observed Default Rates for Each Grade

Some Grades have a much smaller sample size than others, with G the smallest at $n = 337$. Because the sample sizes differ, raw counts make trends hard to see, so this table shows the default rate of each Grade as a percentage.

```
A | Number of loans: 17059 | % of default:  5.32%
B | Number of loans: 28027 | % of default: 13.20%
C | Number of loans: 24487 | % of default: 23.70%
D | Number of loans: 10951 | % of default: 34.48%
E | Number of loans: 5498  | % of default: 43.83%
F | Number of loans: 1532  | % of default: 52.81%
G | Number of loans: 337   | % of default: 59.05%
```
