# Table of Contents

* [Import data](#Import-data)
* [Two sample t-test of equal mean of numeric variables on default and non-default groups](#Two-sample-t-test-of-equal-mean-of-numeric-variables-on-default-and-non-default-groups)
* [Test of equal distributions of numeric variables on default and non-default groups](#Test-of-equal-distributions-of-numeric-variables-on-default-and-non-default-groups)
* [Test if the default rates from 2 categories are statistically the same using z-test](#Test-if-the-default-rates-from-2-categories-are-statistically-the-same-using-z-test)
* [Chi-squared test of independence between categorical variables and the target](#Chi-squared-test-of-independence-between-categorical-variables-and-the-target)

## Import data

In [1]:
import pandas as pd
import numpy as np
from numpy.random import seed
import matplotlib.pyplot as plt
%matplotlib inline  
import statistics
from scipy import stats
from scipy.stats import t
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest
import seaborn as sns
import sklearn
import sqlite3
from sqlite3 import Error
import csv

In [2]:
# use the data before one hot encoding of categorical variables to do hypothesis testing for categorical variables
# open connection to sqlite database 
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()

sql_sm = '''SELECT A.* FROM data_all AS A'''
data_all = pd.read_sql(sql_sm, coerce_float=True, con=con)
#display(data_all.head(3))
print(data_all.shape)

con.close()

(307511, 332)


In [3]:
# use the data after missing value imputations to do hypothesis testing for numeric variables
# open connection to sqlite database 
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()

sql_sm = '''SELECT A.* FROM data_all_3 AS A'''
data_all_3 = pd.read_sql(sql_sm, coerce_float=True, con=con)
#display(data_all_3.head(3))
print(data_all_3.shape)

con.close()

(307502, 351)


## Two sample t-test of equal mean of numeric variables on default and non-default groups

The null hypothesis is that the mean of each variable is the same in the default and non-default groups. The alternative hypothesis is that the means differ.

The two-sample t-test assumes that both samples are approximately normally distributed, whereas many of our variables are highly skewed. We first run the test on the original scale of each variable. We then restrict some of the variables to a middle range where the distribution is less skewed and run the test again, to check whether the results change.

For most of the variables, the mean values are statistically different between the default and non-default groups. Among the variables tested, only AMT_ANNUITY and SUM_AMT_CREDIT_MAX_OVERDUE have insignificant p-values, meaning there is not enough evidence to reject the null hypothesis that these two variables have the same mean in both groups.

In [4]:
# use scipy.stats to run a two-sample t-test (equal_var=False gives Welch's t-test)
def ttest(df, col):
    default_N = df[col][df['TARGET']== 0]
    default_Y = df[col][df['TARGET']== 1]
    t_results = stats.ttest_ind(default_N, default_Y, equal_var = False)
    return t_results
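
For reference, the statistic behind `stats.ttest_ind(..., equal_var=False)` is Welch's t, which does not pool the two group variances. The sketch below computes the statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The function name `welch_t` and the toy numbers are assumptions made for illustration; they are not part of the notebook or the loan data.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Sample variances (n - 1 in the denominator), estimated separately per group
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Squared standard error contributed by each group
    se_a, se_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Made-up toy data, not taken from the loan dataset
group_n = [2.0, 4.0, 6.0, 8.0]
group_y = [1.0, 2.0, 3.0]
t_stat, dof = welch_t(group_n, group_y)
print(round(t_stat, 4), round(dof, 4))
```

The p-value then comes from a t distribution with `dof` degrees of freedom, which is the part left to scipy in the cell above.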

In [5]:
# list of numeric variables to do t-test on 
tlist = ['AMT_INCOME_TOTAL','REGION_POPULATION_RELATIVE','YEARS_BEGINEXPLUATATION_AVG',
         'MIN_DAYS_CREDIT_UPDATE','MAX_DAYS_CREDIT_UPDATE',
         'MAX_DAYS_CREDIT_ENDDATE','MIN_DAYS_CREDIT','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3',
         'MAX_DAYS_ENDDATE_FACT','DAYS_BIRTH','DAYS_REGISTRATION','DAYS_LAST_PHONE_CHANGE',
         'DAYS_ID_PUBLISH','AMT_CREDIT','AMT_GOODS_PRICE','AMT_ANNUITY','DAYS_EMPLOYED','SUM_AMT_CREDIT_MAX_OVERDUE',
         'SUM_AMT_CREDIT_SUM_OVERDUE','SUM_AMT_CREDIT_SUM']
for i in tlist:
    # named t_res rather than t so that scipy.stats.t imported above is not overwritten
    t_res = ttest(data_all_3, i)
    #print("t-test statistics for variable {} is {:.4f}.".format(i, t_res.statistic))
    print("t-test p-value for variable {} is {:.4f}.".format(i, t_res.pvalue))

t-test p-value for variable AMT_INCOME_TOTAL is 0.0000.
t-test p-value for variable REGION_POPULATION_RELATIVE is 0.0000.
t-test p-value for variable YEARS_BEGINEXPLUATATION_AVG is 0.0167.
t-test p-value for variable MIN_DAYS_CREDIT_UPDATE is 0.0000.
t-test p-value for variable MAX_DAYS_CREDIT_UPDATE is 0.0000.
t-test p-value for variable MAX_DAYS_CREDIT_ENDDATE is 0.0000.
t-test p-value for variable MIN_DAYS_CREDIT is 0.0000.
t-test p-value for variable EXT_SOURCE_1 is 0.0000.
t-test p-value for variable EXT_SOURCE_2 is 0.0000.
t-test p-value for variable EXT_SOURCE_3 is 0.0000.
t-test p-value for variable MAX_DAYS_ENDDATE_FACT is 0.0000.
t-test p-value for variable DAYS_BIRTH is 0.0000.
t-test p-value for variable DAYS_REGISTRATION is 0.0000.
t-test p-value for variable DAYS_LAST_PHONE_CHANGE is 0.0000.
t-test p-value for variable DAYS_ID_PUBLISH is 0.0000.
t-test p-value for variable AMT_CREDIT is 0.0000.
t-test p-value for variable AMT_GOODS_PRICE is 0.0000.
t-test p-value for vari

Restrict each variable to a range where it is less skewed, then run the two-sample t-test again. For some variables the p-value changes from significant to insignificant once only the middle part of the distribution is tested, which suggests that the differences in those variables are driven mainly by the tails.
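
A toy illustration of why restricting the range can remove a difference: in the sketch below the two groups share the same core values and differ only in one tail observation, so the gap in means disappears once the tail is excluded. The helper `mean_gap` and the numbers are made up for this example. Note also that filtering on the tested variable changes the question being asked, from "are the means equal?" to "are the means equal within this range?".

```python
import statistics

def mean_gap(group_a, group_b, lower=None, upper=None):
    """Difference in group means, optionally restricted to [lower, upper]."""
    def keep(values):
        return [v for v in values
                if (lower is None or v >= lower) and (upper is None or v <= upper)]
    return statistics.fmean(keep(group_a)) - statistics.fmean(keep(group_b))

# Made-up data: both groups share the same core, one has a heavy right tail
core = [10, 11, 12, 13, 14]
group_n = core + [15]
group_y = core + [60]

full_gap = mean_gap(group_n, group_y)                      # all observations
core_gap = mean_gap(group_n, group_y, lower=10, upper=14)  # middle range only
print(full_gap, core_gap)
```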

In [6]:
# use scipy.stats to run a two-sample t-test on values of col within [lower, upper]
def ttest2(df, col, lower, upper):   
    filter_N = (df.TARGET == 0) & (df[col] <= upper) & (df[col] >= lower)
    default_N = df.loc[filter_N, col]
    filter_Y = (df.TARGET == 1) & (df[col] <= upper) & (df[col] >= lower)
    default_Y = df.loc[filter_Y, col]
    t_results = stats.ttest_ind(default_N, default_Y, equal_var = False)
    return t_results

In [7]:
# list of numeric variables to do t-test on 
tlist2 = ['AMT_INCOME_TOTAL','REGION_POPULATION_RELATIVE','YEARS_BEGINEXPLUATATION_AVG',
         'MIN_DAYS_CREDIT_UPDATE','MAX_DAYS_CREDIT_UPDATE',
         'MAX_DAYS_CREDIT_ENDDATE','MIN_DAYS_CREDIT','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3',
         'MAX_DAYS_ENDDATE_FACT','DAYS_BIRTH','DAYS_REGISTRATION','DAYS_LAST_PHONE_CHANGE',
         'DAYS_ID_PUBLISH','AMT_CREDIT','AMT_GOODS_PRICE','AMT_ANNUITY','DAYS_EMPLOYED','SUM_AMT_CREDIT_MAX_OVERDUE',
         'SUM_AMT_CREDIT_SUM_OVERDUE','SUM_AMT_CREDIT_SUM']

rangelist = [(10,14),(0,0.03),(0.96,1),
             (-3000,-1000),(-2000,-1000),
             (1000,3650),(-3000,-1000),(0,1),(0,1),(0,1),
             (-730,-365),(-25550,-7300),(-5000,-1000),(-730,-365),
             (-7300,-3650),(11,15),(12,14),(8,12),(-2700,-1000),(0,250),
             (1000,5000),(20000,80000)]

a = list(zip(tlist2, rangelist))

for i in a:
    # named t_res rather than t so that scipy.stats.t imported above is not overwritten
    t_res = ttest2(data_all_3, i[0], i[1][0], i[1][1])
    #print("t-test statistics for variable {} is {:.4f}.".format(i[0], t_res.statistic))
    print("t-test p-value for variable {} is {:.4f}.".format(i[0], t_res.pvalue))

t-test p-value for variable AMT_INCOME_TOTAL is 0.0000.
t-test p-value for variable REGION_POPULATION_RELATIVE is 0.0210.
t-test p-value for variable YEARS_BEGINEXPLUATATION_AVG is 0.0000.
t-test p-value for variable MIN_DAYS_CREDIT_UPDATE is 0.0000.
t-test p-value for variable MAX_DAYS_CREDIT_UPDATE is 0.5390.
t-test p-value for variable MAX_DAYS_CREDIT_ENDDATE is 0.4993.
t-test p-value for variable MIN_DAYS_CREDIT is 0.0000.
t-test p-value for variable EXT_SOURCE_1 is 0.0000.
t-test p-value for variable EXT_SOURCE_2 is 0.0000.
t-test p-value for variable EXT_SOURCE_3 is 0.0000.
t-test p-value for variable MAX_DAYS_ENDDATE_FACT is 0.0638.
t-test p-value for variable DAYS_BIRTH is 0.0000.
t-test p-value for variable DAYS_REGISTRATION is 0.6056.
t-test p-value for variable DAYS_LAST_PHONE_CHANGE is 0.6424.
t-test p-value for variable DAYS_ID_PUBLISH is 0.0000.
t-test p-value for variable AMT_CREDIT is 0.0000.
t-test p-value for variable AMT_GOODS_PRICE is 0.0000.
t-test p-value for vari

## Test of equal distributions of numeric variables on default and non-default groups

Use the two-sample Kolmogorov-Smirnov (KS) test to check whether a numeric variable has the same distribution in the default and non-default groups. Under the null hypothesis the two distributions are identical. The alternative hypothesis is that they differ. The KS test is only valid for continuous distributions.

The KS statistic is the largest absolute difference between the empirical distribution functions of the two samples.

Among the numeric features tested, none has the same distribution in the default and non-default groups. This is expected, because requiring identical distributions is more restrictive than requiring equal means: two identical distributions must in particular have equal means, and the previous t-test already showed that only two variables have statistically indistinguishable means in the default and non-default groups.

In [8]:
# use scipy.stats to run a two-sample KS test
def kstest(df, col):
    default_N = df[col][df['TARGET']== 0]
    default_Y = df[col][df['TARGET']== 1]
    ks_results = stats.ks_2samp(default_N, default_Y)
    return ks_results
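
To make the statistic concrete, here is a minimal standard-library sketch of the two-sample KS statistic: the largest gap between the two empirical CDFs. The function name `ks_statistic` and the samples are assumptions for illustration. The two toy samples have the same mean but different spread, which shows how distributions can differ even when a test of means would find nothing.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d_max = 0.0
    # Both ECDFs are step functions that only change at observed values,
    # so it is enough to compare them at those values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max

# Made-up samples with the same mean (5.0) but different spread
narrow = [4, 5, 5, 6]
wide = [1, 3, 7, 9]
d = ks_statistic(narrow, wide)
print(d)
```

Turning the statistic into a p-value is the part handled by `stats.ks_2samp` in the cell above.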

In [9]:
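# list of numeric variables to do ks-test on
# NOTE: MIN_DAYS_CREDIT_UPDATE is listed twice; the second entry was possibly meant to be MAX_DAYS_CREDIT_UPDATE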
kslist = ['AMT_INCOME_TOTAL','YEARS_BEGINEXPLUATATION_AVG','MIN_DAYS_CREDIT_UPDATE','MIN_DAYS_CREDIT_UPDATE',
         'MAX_DAYS_CREDIT_ENDDATE','MIN_DAYS_CREDIT','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3',
         'MAX_DAYS_ENDDATE_FACT','DAYS_BIRTH','DAYS_REGISTRATION','DAYS_LAST_PHONE_CHANGE',
         'DAYS_ID_PUBLISH','AMT_CREDIT','AMT_GOODS_PRICE','AMT_ANNUITY','DAYS_EMPLOYED','SUM_AMT_CREDIT_MAX_OVERDUE',
         'SUM_AMT_CREDIT_SUM_OVERDUE','SUM_AMT_CREDIT_SUM']
for i in kslist:
    ks = kstest(data_all_3, i)
    #print("ks-test statistics for variable {} is {:.4f}.".format(i, ks.statistic))
    print("ks-test pvalue for variable {} is {:.4f}.".format(i, ks.pvalue))

ks-test pvalue for variable AMT_INCOME_TOTAL is 0.0000.
ks-test pvalue for variable YEARS_BEGINEXPLUATATION_AVG is 0.0000.
ks-test pvalue for variable MIN_DAYS_CREDIT_UPDATE is 0.0000.
ks-test pvalue for variable MIN_DAYS_CREDIT_UPDATE is 0.0000.
ks-test pvalue for variable MAX_DAYS_CREDIT_ENDDATE is 0.0000.
ks-test pvalue for variable MIN_DAYS_CREDIT is 0.0000.
ks-test pvalue for variable EXT_SOURCE_1 is 0.0000.
ks-test pvalue for variable EXT_SOURCE_2 is 0.0000.
ks-test pvalue for variable EXT_SOURCE_3 is 0.0000.
ks-test pvalue for variable MAX_DAYS_ENDDATE_FACT is 0.0000.
ks-test pvalue for variable DAYS_BIRTH is 0.0000.
ks-test pvalue for variable DAYS_REGISTRATION is 0.0000.
ks-test pvalue for variable DAYS_LAST_PHONE_CHANGE is 0.0000.
ks-test pvalue for variable DAYS_ID_PUBLISH is 0.0000.
ks-test pvalue for variable AMT_CREDIT is 0.0000.
ks-test pvalue for variable AMT_GOODS_PRICE is 0.0000.
ks-test pvalue for variable AMT_ANNUITY is 0.0000.
ks-test pvalue for variable DAYS_EMPLO

## Test if the default rates from 2 categories are statistically the same using z-test

This tests whether the proportion of defaults is the same in two categories. We pick a few categorical variables with two levels and compare the default rate in each level. A few variables have p-values > 0.05, indicating that the default proportions are not significantly different between their two categories. Such variables may have little power to separate default loans from non-default loans.

In [10]:
def ztest(df, col):
    # rows: the 2 levels of col; columns: counts of TARGET == 0 and TARGET == 1
    tmp = pd.crosstab(df[col], df['TARGET']).reset_index()
    tmp.columns = [col,'Default_No','Default_Yes']
    # number of defaults in each level
    c1 = tmp.iloc[0,2]
    c2 = tmp.iloc[1,2]
    # total number of loans in each level
    n1 = tmp.iloc[0,1] + tmp.iloc[0,2]
    n2 = tmp.iloc[1,1] + tmp.iloc[1,2]
    count = np.array([c1, c2])
    nobs = np.array([n1, n2])
    stat, pval = proportions_ztest(count, nobs)
    return [stat, pval]
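
As a sketch of what the test computes, the pooled two-proportion z statistic can be written with the standard library alone. The function name `two_proportion_ztest` and the counts are hypothetical. To our understanding this matches what `proportions_ztest` does with its default arguments for two samples, but that is worth checking against the statsmodels documentation.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(count_1, nobs_1, count_2, nobs_2):
    """Pooled z-test of H0: the two proportions are equal (two-sided)."""
    p_1, p_2 = count_1 / nobs_1, count_2 / nobs_2
    # Under H0 both groups share one proportion, estimated from the pooled data
    p_pool = (count_1 + count_2) / (nobs_1 + nobs_2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs_1 + 1 / nobs_2))
    z = (p_1 - p_2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 30 defaults out of 100 loans vs. 20 defaults out of 100 loans
z, p_value = two_proportion_ztest(30, 100, 20, 100)
print(round(z, 4), round(p_value, 4))
```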

In [11]:
# list of 2-category categorical variables to do z-test on 
zlist = ['NAME_CONTRACT_TYPE','FLAG_OWN_CAR','FLAG_OWN_REALTY','FLAG_DOCUMENT_2','FLAG_DOCUMENT_3',
         'FLAG_DOCUMENT_5','FLAG_DOCUMENT_6','FLAG_DOCUMENT_7','FLAG_DOCUMENT_8',
         'FLAG_DOCUMENT_9','FLAG_DOCUMENT_11','FLAG_DOCUMENT_13','FLAG_DOCUMENT_14',
         'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16','FLAG_DOCUMENT_17','FLAG_DOCUMENT_18','FLAG_DOCUMENT_19',
         'FLAG_DOCUMENT_20','FLAG_DOCUMENT_21']
for i in zlist:
    z = ztest(data_all, i)
    #print("z-test statistics for variable {} is {:.4f}.".format(i, z[0]))
    print("z-test pvalue for variable {} is {:.4f}.".format(i, z[1]))

z-test pvalue for variable NAME_CONTRACT_TYPE is 0.0000.
z-test pvalue for variable FLAG_OWN_CAR is 0.0000.
z-test pvalue for variable FLAG_OWN_REALTY is 0.0007.
z-test pvalue for variable FLAG_DOCUMENT_2 is 0.0027.
z-test pvalue for variable FLAG_DOCUMENT_3 is 0.0000.
z-test pvalue for variable FLAG_DOCUMENT_5 is 0.8610.
z-test pvalue for variable FLAG_DOCUMENT_6 is 0.0000.
z-test pvalue for variable FLAG_DOCUMENT_7 is 0.3994.
z-test pvalue for variable FLAG_DOCUMENT_8 is 0.0000.
z-test pvalue for variable FLAG_DOCUMENT_9 is 0.0158.
z-test pvalue for variable FLAG_DOCUMENT_11 is 0.0190.
z-test pvalue for variable FLAG_DOCUMENT_13 is 0.0000.
z-test pvalue for variable FLAG_DOCUMENT_14 is 0.0000.
z-test pvalue for variable FLAG_DOCUMENT_15 is 0.0003.
z-test pvalue for variable FLAG_DOCUMENT_16 is 0.0000.
z-test pvalue for variable FLAG_DOCUMENT_17 is 0.0611.
z-test pvalue for variable FLAG_DOCUMENT_18 is 0.0000.
z-test pvalue for variable FLAG_DOCUMENT_19 is 0.4516.
z-test pvalue for va

## Chi-squared test of independence between categorical variables and the target

This tests whether the distribution of counts across the levels of a categorical variable is independent of the target (default or non-default). Rejecting the null hypothesis means that the distribution of the categorical variable differs between the two target groups, in which case the variable may be a useful feature for separating default from non-default loans. Otherwise we have no evidence to conclude that the distribution of counts differs between the two target groups.

We choose a few categorical variables that do not have too many levels, and we need to make sure that every level has a count > 0 in both default and non-default loans.

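For the two-level variables, the p-values obtained from this chi-squared test are exactly the same as the p-values from the z-test above. This is expected: for a 2x2 table, the Pearson chi-squared statistic without continuity correction equals the square of the pooled two-proportion z statistic.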
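
The agreement between the two tests can be checked numerically on a small 2x2 table. The sketch below uses made-up counts and standard-library code only. The helper names `pooled_z` and `pearson_chi2` are assumptions for illustration, and the rows are taken to be the two categories with columns (non-default, default).

```python
import math

def pooled_z(table):
    """Pooled z statistic for equal default rates in the two rows."""
    (a, b), (c, d) = table
    n_1, n_2 = a + b, c + d
    p_pool = (b + d) / (n_1 + n_2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    return (b / n_1 - d / n_2) / se

def pearson_chi2(table):
    """Pearson chi-squared statistic without continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up counts: [non-default, default] for each of two categories
table = [[70, 30], [80, 20]]
z = pooled_z(table)
chi2 = pearson_chi2(table)
# Two-sided normal p-value and chi-squared (1 degree of freedom) p-value
p_from_z = math.erfc(abs(z) / math.sqrt(2))
p_from_chi2 = math.erfc(math.sqrt(chi2 / 2))
print(round(z ** 2, 6), round(chi2, 6), round(p_from_z, 6), round(p_from_chi2, 6))
```

The equality holds only without the continuity correction, which is why the cell below passes `correction=False`.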

In [12]:
# Passing lambda_="log-likelihood" to stats.chi2_contingency would perform the test
# with the log-likelihood ratio (the "G-test") instead of Pearson's chi-squared statistic.
def chisqtest(df, col):
    list1 = df[col][df['TARGET']==0].value_counts().sort_index()
    list2 = df[col][df['TARGET']==1].value_counts().sort_index()
    obs = np.array([list1, list2])
    chi2, p, dof, expected = stats.chi2_contingency(obs, correction=False)
    return [chi2, p, dof, expected]
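
The requirement that every level appears in both target groups matters because `chisqtest` stacks the two `value_counts()` results by position: if a level were absent from one group, the two count vectors would have different lengths and would no longer line up. A hedged standard-library sketch of building an aligned, zero-filled table is shown below. The function name `contingency_table` and the data are made up for illustration.

```python
from collections import Counter

def contingency_table(categories, targets):
    """Counts per (target group, level), with zeros for absent combinations."""
    counts = Counter(zip(targets, categories))
    levels = sorted(set(categories))
    groups = sorted(set(targets))
    # Counter returns 0 for missing keys, so every row has one entry per level
    table = [[counts[(g, lvl)] for lvl in levels] for g in groups]
    return levels, table

# Made-up data: level "C" never occurs among the defaults (target == 1)
categories = ["A", "B", "A", "C", "B", "A"]
targets = [0, 0, 1, 0, 1, 0]
levels, table = contingency_table(categories, targets)
print(levels, table)
```

Aligning the table only fixes the bookkeeping. Levels with very small counts can still make the chi-squared approximation unreliable, so they are best checked before interpreting the p-value.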

In [13]:
chisqlist = ['NAME_CONTRACT_TYPE','FLAG_OWN_CAR','FLAG_OWN_REALTY','FLAG_DOCUMENT_2','FLAG_DOCUMENT_3',
             'FLAG_DOCUMENT_5','FLAG_DOCUMENT_6','FLAG_DOCUMENT_7','FLAG_DOCUMENT_19',
             'FLAG_DOCUMENT_20','FLAG_DOCUMENT_21','WEEKDAY_APPR_PROCESS_START']
for i in chisqlist:
    chi = chisqtest(data_all, i)
    #print("chisq-test statistics for variable {} is {:.4f}.".format(i, chi[0]))
    print("chisq-test pvalue for variable {} is {:.4f}.".format(i, chi[1]))

chisq-test pvalue for variable NAME_CONTRACT_TYPE is 0.0000.
chisq-test pvalue for variable FLAG_OWN_CAR is 0.0000.
chisq-test pvalue for variable FLAG_OWN_REALTY is 0.0007.
chisq-test pvalue for variable FLAG_DOCUMENT_2 is 0.0027.
chisq-test pvalue for variable FLAG_DOCUMENT_3 is 0.0000.
chisq-test pvalue for variable FLAG_DOCUMENT_5 is 0.8610.
chisq-test pvalue for variable FLAG_DOCUMENT_6 is 0.0000.
chisq-test pvalue for variable FLAG_DOCUMENT_7 is 0.3994.
chisq-test pvalue for variable FLAG_DOCUMENT_19 is 0.4516.
chisq-test pvalue for variable FLAG_DOCUMENT_20 is 0.9049.
chisq-test pvalue for variable FLAG_DOCUMENT_21 is 0.0397.
chisq-test pvalue for variable WEEKDAY_APPR_PROCESS_START is 0.0174.
