# Do Defaulters and Non-Defaulters Have Different Mean Incomes?

**Dataset:** Approved LendingClub loans  
**Objective:** Test whether the mean annual incomes of defaulters and non-defaulters differ to a statistically significant degree  

In [None]:
import pandas as pd
from main import preprocess_df

df = pd.read_csv(
    "./datasets/lc_data_2007_to_2018.csv",
    low_memory=False,
    encoding="latin1",
    nrows=100000,  # only looking at the first 100k rows right now for performance
)
pd.set_option("display.max_columns", None)
cleaned_df = preprocess_df(df)

## Hypothesis 
Let:

- $\mu_D=$ the mean income of defaulters
- $\mu_N=$ the mean income of non-defaulters 


$H_0: \mu_D=\mu_N$  
$H_1: \mu_D \neq \mu_N$

This is a two-tailed test.

## Statistical Test

We use a two-sample, two-tailed t-test to compare the means: the two means are estimated from independent (large) samples, and the population variances are unknown.

**Assumptions**

- Observations are independent
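- Both sample sizes are large, so by the central limit theorem (CLT) the sample means are approximately normally distributed
- The two groups' variances may differ, so we use Welch's t-test rather than a pooled-variance t-test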
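
For reference, the statistic that Welch's t-test computes can be written out as below. This is the standard textbook form, not something taken from the notebook's code; the symbols $\bar{x}$, $s$, and $n$ (sample mean, sample standard deviation, and sample size, with subscripts $D$ and $N$ matching $\mu_D$ and $\mu_N$) are notation assumed here for illustration.

$$
t = \frac{\bar{x}_D - \bar{x}_N}{\sqrt{\dfrac{s_D^2}{n_D} + \dfrac{s_N^2}{n_N}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_D^2}{n_D} + \dfrac{s_N^2}{n_N}\right)^2}{\dfrac{\left(s_D^2/n_D\right)^2}{n_D - 1} + \dfrac{\left(s_N^2/n_N\right)^2}{n_N - 1}}
$$

Under $H_0$, $t$ approximately follows a t distribution with $\nu$ degrees of freedom (the Welch-Satterthwaite approximation). Unlike the pooled-variance test, neither expression assumes $s_D^2 = s_N^2$.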

**Details**

- 100,000 rows taken from dataset
- All unresolved loans removed
- 87,891 loans remain

In [9]:
import numpy as np

income_default = cleaned_df.loc[cleaned_df["did_default"], "annual_inc"].dropna()
income_non_default = cleaned_df.loc[~cleaned_df["did_default"], "annual_inc"].dropna()

xbar_income_d = np.mean(income_default)
xbar_income_non_d = np.mean(income_non_default) 
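# Note: np.std defaults to ddof=0 (population formula). With n this large,
# the difference from the sample formula (ddof=1) is negligible.
s_income_d = np.std(income_default)
s_income_non_d = np.std(income_non_default)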
n_defaulters = len(income_default)
n_non_defaulters = len(income_non_default)

print("Sample mean of defaulter income:", xbar_income_d)
print("Sample mean of non-defaulter income:", xbar_income_non_d)
print("Sample std dev of defaulter income:", s_income_d)
print("Sample std dev of non-defaulter income:", s_income_non_d)
print("Sample size n of defaulters", n_defaulters)
print("Sample size n of non-defaulters", n_non_defaulters)


Sample mean of defaulter income: 70957.4259211498
Sample mean of non-defaulter income: 79370.1077819827
Sample std dev of defaulter income: 57995.86936751661
Sample std dev of non-defaulter income: 95912.71129640169
Sample size n of defaulters 17603
Sample size n of non-defaulters 70288
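
As a rough sketch of how these summary statistics feed into Welch's test, the snippet below computes the t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name `welch_t_test` is hypothetical (it is not part of the notebook or of `main`), the inputs are the printed values above rounded to two decimals, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable because the degrees of freedom are very large here.

```python
import math


def welch_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic, the Welch-Satterthwaite degrees of
    freedom, and an approximate two-tailed p-value.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    v1 = sd1**2 / n1  # estimated variance of the first sample mean
    v2 = sd2**2 / n2  # estimated variance of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    # Normal approximation to the t distribution (assumes df is very large)
    p_approx = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p_approx


# Summary statistics copied (rounded) from the printed output above:
# defaulters first, non-defaulters second
t_stat, df, p_approx = welch_t_test(
    70957.43, 57995.87, 17603,
    79370.11, 95912.71, 70288,
)
print(f"t = {t_stat:.2f}, df = {df:.0f}, approximate p = {p_approx:.3g}")
```

With the raw income Series still in memory, the more usual route would be `scipy.stats.ttest_ind(income_default, income_non_default, equal_var=False)`, assuming SciPy is installed; it uses the exact t distribution rather than the normal approximation.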
