## Robustness Test: Testing Whether FICO Score is Independent of Default Rate

In a previous test, we found strong statistical evidence that loan Grade and default rate are dependent. Here we check the robustness of that finding by replacing Grade with FICO score, a similar measure of credit quality. The procedure is otherwise the same, with FICO score bands taking the place of Grades.

$H_0: P(\text{Default} \mid \text{FICO Score}) = P(\text{Default})$

$H_1: P(\text{Default} \mid \text{FICO Score}) \neq P(\text{Default})$

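We again use the chi-square statistic to test independence, where $O$ is the observed count and $E$ is the count expected under $H_0$. If $H_0$ holds, this statistic approximately follows a chi-square distribution.

$\chi^2 = \sum \frac{(O-E)^2}{E}$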


In [1]:
import pandas as pd
import numpy as np
from main import preprocess_df
from collections import defaultdict as dd


df = pd.read_csv(
    "./datasets/lc_data_2007_to_2018.csv",
    low_memory=False,
    encoding="latin1",
    nrows=100000,  # only looking at 100k rows right now for performance
)
pd.set_option("display.max_columns", None)
cleaned_df = preprocess_df(df)

In [2]:
print(cleaned_df["did_default"].value_counts())
overall_percentage_of_default = sum(cleaned_df["did_default"]) / len(cleaned_df) * 100
print(f"Overall percentage of default: {round(overall_percentage_of_default, 2)}%")

did_default
False    70288
True     17603
Name: count, dtype: int64
Overall percentage of default: 20.03%


As before, about 20% of the loans in the first 100,000 rows of the dataset defaulted, so under $H_0$ we expect each FICO band to have a default rate of about 20%.


In [22]:
fico_bands = np.array(sorted([650, 690, 730, 770, 810, 850]))

df_for_testing = cleaned_df.copy()

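# side="right" places a score equal to an edge in the higher band (e.g. 650 -> band 1)
fico_indices = np.searchsorted(fico_bands, df_for_testing["avg_fico"], side="right")
# clip so that scores of 850 or more stay in the top band (index 5)
fico_indices = np.clip(fico_indices, 0, len(fico_bands) - 1)
df_for_testing.loc[:, "fico_band_index"] = fico_indices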


counts_series = df_for_testing["fico_band_index"].value_counts()
loans_per_grade_dict = counts_series.to_dict()
print(loans_per_grade_dict)

{1: 44545, 2: 31633, 3: 8463, 4: 2737, 5: 513}


With FICO bands allotted as follows:

```
Band 0|        score < 650
Band 1| 650 <= score < 690
Band 2| 690 <= score < 730
Band 3| 730 <= score < 770
Band 4| 770 <= score < 810
Band 5| 810 <= score <= 850
```
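To make the band boundaries concrete, here is a minimal sketch of the same binning rule using the standard library's `bisect_right`, which follows the same convention as `np.searchsorted(..., side="right")`. The helper name `fico_band_index` and the sample scores are hypothetical, chosen only to probe the band edges.

```python
from bisect import bisect_right

# Band edges copied from the notebook cell above.
FICO_BAND_EDGES = [650, 690, 730, 770, 810, 850]


def fico_band_index(score: float) -> int:
    """Return the band index for one score (hypothetical helper)."""
    # bisect_right counts how many edges are <= score.
    index = bisect_right(FICO_BAND_EDGES, score)
    # Clip so that a score of 850 or more stays in the top band (index 5).
    return min(index, len(FICO_BAND_EDGES) - 1)


# Scores on either side of several edges, to show where each one lands.
for score in [649, 650, 689.5, 690, 809, 810, 850]:
    print(score, "->", fico_band_index(score))
```

A score exactly on an edge, such as 650 or 690, falls into the higher of the two adjacent bands, so each band includes its lower edge and excludes its upper edge.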

The total number of loans in each FICO band is shown in this table.

```
0 |     0
1 | 44545
2 | 31633
3 |  8463
4 |  2737
5 |   513
```


In [None]:
expected_defaults_dict = dd(int)
for band_index in set(fico_indices):  # to get only the uniques
    specific_band_df = df_for_testing.loc[
        df_for_testing["fico_band_index"] == band_index, :
    ]
    # under H0, every band defaults at the overall rate (20.03%, rounded to 20% here)
    expected_num_defaults = 0.2 * len(specific_band_df)
    expected_defaults_dict[band_index] = expected_num_defaults
    print(f"{band_index} | {round(expected_num_defaults)}")

1 | 8909
2 | 6327
3 | 1693
4 | 547
5 | 103


Since $H_0$ states that FICO band and default rate are independent, every FICO band should have the same default rate under $H_0$.

The overall default rate is about 20%, so we expect about 20% of the loans in each FICO band to default:

### Expected Default Counts for Each FICO band

```
1 | 8909
2 | 6327
3 | 1693
4 |  547
5 |  103
```


In [27]:
# find the observed number of defaults for each fico band

observed_defaults_dict = dd(int)
for band_index in set(fico_indices):
    specific_band_df = df_for_testing.loc[
        df_for_testing["fico_band_index"] == band_index, :
    ]
    observed_num_defaults = sum(specific_band_df["did_default"])
    observed_defaults_dict[band_index] = observed_num_defaults
    print(f"{band_index} | {round(observed_num_defaults)}")

1 | 10874
2 | 5583
3 | 933
4 | 191
5 | 22


### Observed Default Counts for Each FICO band

```
1 | 10874
2 |  5583
3 |   933
4 |   191
5 |    22
```
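The chi-square statistic can be sketched directly from the counts in the tables above. This is a minimal illustration under stated assumptions, not necessarily the computation used elsewhere in this notebook: it uses the pooled default rate rather than the rounded 20%, and it includes the non-default cells as well, which is the usual form of Pearson's test of independence on a 2 x 5 table. The variable names are hypothetical. The critical value 9.488 is the standard chi-square cutoff for 4 degrees of freedom at the 5% level.

```python
# Counts copied from the tables above (bands 1 to 5; band 0 is empty).
loans_per_band = [44545, 31633, 8463, 2737, 513]
observed_defaults = [10874, 5583, 933, 191, 22]

total_loans = sum(loans_per_band)
pooled_default_rate = sum(observed_defaults) / total_loans

chi_square = 0.0
for n_loans, n_defaults in zip(loans_per_band, observed_defaults):
    # Expected counts under H0 for the "default" and "no default" cells.
    expected_defaults = n_loans * pooled_default_rate
    expected_non_defaults = n_loans - expected_defaults
    observed_non_defaults = n_loans - n_defaults
    chi_square += (n_defaults - expected_defaults) ** 2 / expected_defaults
    chi_square += (
        (observed_non_defaults - expected_non_defaults) ** 2 / expected_non_defaults
    )

# (rows - 1) * (columns - 1) for a 2 x 5 contingency table.
degrees_of_freedom = (2 - 1) * (len(loans_per_band) - 1)
CRITICAL_VALUE_5_PERCENT = 9.488  # chi-square cutoff for 4 degrees of freedom

print(f"chi-square = {chi_square:.1f}, df = {degrees_of_freedom}")
print("reject H0" if chi_square > CRITICAL_VALUE_5_PERCENT else "fail to reject H0")
```

A statistic above the critical value means the observed counts are unlikely under independence at the 5% level.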


In [None]:
# compute the observed default rate (in percent) for each FICO band
for band_index in set(fico_indices):
    in_band = df_for_testing["fico_band_index"] == band_index
    num_loans_in_band = in_band.sum()
    # did_default is boolean, so summing it counts the defaulted loans in the band
    num_defaulted_loans_in_band = df_for_testing.loc[in_band, "did_default"].sum()
    percentage_of_defaults = num_defaulted_loans_in_band / num_loans_in_band * 100
    print(f"{band_index} | {round(percentage_of_defaults, 2)}")