In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv("bank_altered.csv", usecols=np.arange(1,17))
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,y
0,58,management,married,tertiary,0,2143,1,0,unknown,5,5,261,1,-1,0,0
1,44,technician,single,secondary,0,29,1,0,unknown,5,5,151,1,-1,0,0
2,33,entrepreneur,married,secondary,0,2,1,1,unknown,5,5,76,1,-1,0,0
3,47,blue-collar,married,unknown,0,1506,1,0,unknown,5,5,92,1,-1,0,0
4,35,management,married,tertiary,0,231,1,0,unknown,5,5,139,1,-1,0,0


Before I dig into the statistics, let me outline the questions I would like answered.

1. Perform the appropriate significance tests on each of the attributes as they pertain to the target, y.  For example, the age attribute can be binned into different age categories.  The age categories can then be compared pairwise by means of two-sample hypothesis testing techniques.  Likewise, significance tests can be performed on the other categorical variables.  Attributes that show no statistically significant difference in terms of the target variable, y, may be discarded.
2. Determine if strong correlations exist between attributes in order to detect surrogacy between variables and potentially eliminate variables that may not offer additional predictive power.  I will use chi-squared tests to investigate dependence between categorical variables, the F-test to investigate dependence between categorical and numerical variables, and the Pearson correlation coefficient to investigate dependence between numerical variables.
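As a rough sketch of how the three kinds of tests in item 2 map onto `scipy.stats`, the snippet below runs each one on a small synthetic DataFrame. The data are randomly generated stand-ins (only the column names mirror the dataset), so the printed numbers carry no meaning for the bank data, and using `f_oneway` for the F-test is an assumption about which F-test is intended.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in data: column names mirror the dataset, values do not.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "marital": rng.choice(["divorced", "married", "single"], size=300),
    "housing": rng.integers(0, 2, size=300),
    "age": rng.integers(18, 90, size=300),
    "balance": rng.normal(1300, 3000, size=300),
})

# Categorical vs. categorical: chi-squared test of independence.
_, p_chi2, _, _ = stats.chi2_contingency(pd.crosstab(toy.marital, toy.housing))

# Categorical vs. numerical: one-way ANOVA F-test across the category groups
# (assumed here to be the F-test meant in the agenda).
groups = [g.to_numpy() for _, g in toy.groupby("marital")["balance"]]
_, p_f = stats.f_oneway(*groups)

# Numerical vs. numerical: Pearson correlation coefficient and its p-value.
r, p_r = stats.pearsonr(toy.age, toy.balance)

print(p_chi2, p_f, r, p_r)
```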

First, let's examine the relationship between marital status and term deposit subscription status.  I will use the chi-squared test to determine whether or not these variables are dependent.

In [4]:
df_y_marital = df[['marital', 'y']]
df_y_marital.head()

Unnamed: 0,marital,y
0,married,0
1,single,0
2,married,0
3,married,0
4,married,0


In [5]:
tbl_y_marital = pd.crosstab(df_y_marital.y, df_y_marital.marital)
tbl_y_marital

marital,divorced,married,single
y,,,
0,4569,24274,10822
1,621,2734,1900


In [6]:
_, p_val, _, _ = stats.chi2_contingency(tbl_y_marital)
print("The p-value is {}.".format(p_val))

The p-value is 6.604211765227822e-43.


This p-value is far below any conventional significance level, so we reject the null hypothesis, $H_0$, that marital status and term deposit subscription status are independent, in favor of the alternative hypothesis, $H_a$, that the two are associated.
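To make the test less of a black box, here is a minimal sketch that recomputes the $\chi^2$ statistic by hand from the marital crosstab and cross-checks it against `stats.chi2_contingency`. The counts are copied from the table above; the variable names are illustrative only.

```python
import numpy as np
import scipy.stats as stats

# Observed counts copied from the crosstab above
# (rows: y = 0, 1; columns: divorced, married, single).
observed = np.array([[4569, 24274, 10822],
                     [621, 2734, 1900]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic and its degrees of freedom.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution.
p_manual = stats.chi2.sf(chi2_stat, dof)

# Cross-check against scipy's implementation.
chi2_ref, p_ref, dof_ref, _ = stats.chi2_contingency(observed)
print(chi2_stat, p_manual)
```

The manual statistic sums $(O - E)^2 / E$ over all six cells, which is the same quantity `chi2_contingency` reports for a table of this shape.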

Now let's examine the relationship between education and term deposit subscription status, again by means of the chi-squared test.

In [7]:
df_y_education = df[['y','education']]

In [8]:
tbl_y_education = pd.crosstab(df_y_education.y, df_y_education.education)

In [9]:
_, p_val, _, _ = stats.chi2_contingency(tbl_y_education)
print("The p-value is {}.".format(p_val))

The p-value is 2.8026078374488455e-51.


Again, this p-value indicates that, if education and term deposit subscription status were independent, a $\chi^2$ statistic at least as large as the one observed would be extraordinarily unlikely.  Thus it is sensible to conclude that education and term deposit subscription status are associated.

Now let's examine the relationship between the variables balance and y.  We will split the balance values into two groups: those who purchased a term deposit and those who did not.  Then we will test whether the difference between the two sample means is statistically significant, using a two-sample z-test.

In [15]:
df_balance_0 = df.loc[df.y == 0, 'balance']
df_balance_1 = df.loc[df.y == 1, 'balance']

In [16]:
df_balance_0.mean()

1301.0548342367326

In [18]:
df_balance_1.mean()

1801.4947668886775

In [26]:
diff_mean = df_balance_1.mean() - df_balance_0.mean()

In [23]:
df_balance_0.std()

2975.4532726690936

In [24]:
df_balance_1.std()

3495.68188789597

In [25]:
S_d = np.sqrt(df_balance_0.var()/df_balance_0.count() + df_balance_1.var()/df_balance_1.count())

In [32]:
p_val = 2*(1 - stats.norm(0,1).cdf(diff_mean/S_d))
print("The p-value is {}.".format(p_val))

The p-value is 0.0.


We can see from the result above that, if there were no difference in average annual balance between those who purchased a term deposit and those who didn't, a difference in sample means as large as the one observed would be extremely unlikely.  The printed value of 0.0 is a floating-point rounding artifact rather than an exact zero: the test statistic is so large that `1 - cdf` rounds to zero.  Thus we conclude that there is a statistically significant difference between the two populations in terms of average annual balance.
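As a hedged sketch, the same z-test can be rerun from the summary statistics printed above using the survival function `stats.norm.sf`, which keeps very small tail probabilities representable, and `abs()`, which makes the two-sided p-value correct for either sign of the difference. The group sizes are an assumption: they are taken from the row sums of the marital crosstab, which presumes no missing values in either column.

```python
import numpy as np
import scipy.stats as stats

# Summary statistics copied from the cell outputs above.
mean_0, std_0 = 1301.0548342367326, 2975.4532726690936
mean_1, std_1 = 1801.4947668886775, 3495.68188789597

# ASSUMPTION: group sizes are the row sums of the marital crosstab,
# which presumes no missing balance or marital values.
n_0 = 4569 + 24274 + 10822
n_1 = 621 + 2734 + 1900

# Standard error of the difference in means (unequal variances).
std_err = np.sqrt(std_0**2 / n_0 + std_1**2 / n_1)
z = (mean_1 - mean_0) / std_err

# Two-sided p-value: abs() handles either sign, sf() avoids 1 - cdf rounding.
p_val = 2 * stats.norm.sf(abs(z))
print("z = {:.2f}, p-value = {:.3g}".format(z, p_val))
```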