# Bank Marketing Dataset

## Importing the libraries and data

In [1]:
%matplotlib notebook
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import csv
import seaborn as sns

bank_dat = pd.read_csv('bank-additional-full.csv', sep=';')
bank_dat.head()
#bank_dat.shape #(41188, 21)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


## Cleaning and removing non-determinant rows

Let us first remove all the rows with unknown values, since they will add to the noise. We also remove the calls whose duration is zero.

In [2]:
bank_dat = bank_dat[bank_dat != 'unknown'].dropna()
bank_dat = bank_dat[bank_dat['duration'] != 0]
#bank_dat.shape #(30484, 21)
print(bank_dat.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30484 entries, 0 to 41187
Data columns (total 21 columns):
age               30484 non-null int64
job               30484 non-null object
marital           30484 non-null object
education         30484 non-null object
default           30484 non-null object
housing           30484 non-null object
loan              30484 non-null object
contact           30484 non-null object
month             30484 non-null object
day_of_week       30484 non-null object
duration          30484 non-null int64
campaign          30484 non-null int64
pdays             30484 non-null int64
previous          30484 non-null int64
poutcome          30484 non-null object
emp.var.rate      30484 non-null float64
cons.price.idx    30484 non-null float64
cons.conf.idx     30484 non-null float64
euribor3m         30484 non-null float64
nr.employed       30484 non-null float64
y                 30484 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa
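To make the cleaning idiom above easier to follow, here is a minimal sketch on a toy frame. The rows and values are hypothetical and only mimic the `'unknown'` placeholder; they are not taken from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the 'unknown' placeholder.
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'services', 'retired'],
    'loan': ['no', 'yes', 'unknown', 'no'],
    'duration': [120, 300, 45, 0],
})

# Comparing against 'unknown' yields a boolean mask; indexing with it
# turns every masked cell into NaN, and dropna() then drops those rows.
cleaned = toy[toy != 'unknown'].dropna()

# Drop calls that never took place (zero duration).
cleaned = cleaned[cleaned['duration'] != 0]

print(cleaned)
```

Only the first row survives here: rows 1 and 2 contain an `'unknown'`, and row 3 has a zero duration.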

## Analysis and Inferences

In [3]:
sns.set()
df_dummy = pd.get_dummies(bank_dat['y'])
bank_dat = pd.concat([bank_dat, df_dummy['yes']], axis=1)
init_corr = bank_dat.corr()
# This doesn't provide correlation for categorical variables, and hence can't paint the complete picture
plt.figure()
ax = sns.heatmap(init_corr, linewidths=0.5, cmap='coolwarm', center = 0)
for item in ax.xaxis.get_ticklabels():
    item.set_rotation(45)
    item.set_fontsize(8)
plt.subplots_adjust(bottom=0.25, left=0.25)
plt.title('Correlation of non-categorical variables to the decision')

<IPython.core.display.Javascript object>

Text(0.5,1,'Correlation of non-categorical variables to the decision')

This is getting interesting. The decision correlates to some degree with duration (the duration of the call), pdays (days passed since the client was last contacted), and previous (the number of contacts before the current campaign). The consumer confidence index and the employment variation rate also play a part in the decision.

Let us start listing the attributes in terms of their impact on each other as well as on the decision.
+ The **Age** of the receiver is related neither to the decision nor to any other factor.
+ The **Duration** of the call has a fairly strong direct relation to the client saying 'yes'.
+ The **pdays** attribute (days passed since the last contact) has an inverse relation to a 'yes'.
+ The **pdays** attribute is inversely related to **previous**, the number of previous contacts made to the client. This is expected, since clients who were never contacted before carry the placeholder value 999 in pdays.
+ The **Employment variation rate** (an index of employment stability) is inversely related to an individual saying 'yes'. This indicates that clients generally respond favorably when employment conditions are stable.
+ Surprisingly, the **Number of employees** is directly related to the **Employment variation rate**.

Let us analyze the duration of call first.

In [4]:
bank_dat['duration'] = bank_dat['duration']/60

In [5]:
#Let us divide the call duration to less than 5 minutes, 5-15 minutes, 15-30 minutes and beyond 30 minutes
duration_list = pd.cut(bank_dat['duration'], [0,5,15,30, 100]).rename('dur_interval')
duration_list = pd.concat([duration_list, bank_dat['yes']], axis=1)

In [6]:
sns.factorplot(data=duration_list, x='dur_interval', hue='yes', kind='count')
plt.xlabel('Duration of call(in minutes)')
plt.ylabel('No of calls made')
plt.title('Call effectiveness w.r.t. duration')
plt.subplots_adjust(top=0.9)

<IPython.core.display.Javascript object>

We see that longer calls resulted in an affirmative response more often than shorter ones.
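Count plots show absolute numbers, so it can help to look at the share of 'yes' answers per interval as well. The following is a minimal sketch of that idea; the durations and outcomes are made-up values, and only the bin edges are taken from the cell above.

```python
import pandas as pd

# Hypothetical call durations (in minutes) and outcomes (1 = 'yes').
calls = pd.DataFrame({
    'duration': [1.5, 3.0, 4.0, 7.5, 12.0, 20.0, 45.0, 2.0],
    'yes':      [0,   0,   0,   1,   0,    1,    1,    0],
})

# Same bin edges as above; intervals are right-closed, e.g. (0, 5].
calls['dur_interval'] = pd.cut(calls['duration'], [0, 5, 15, 30, 100])

# Share of 'yes' answers within each interval, rather than raw counts.
yes_rate = calls.groupby('dur_interval', observed=False)['yes'].mean()
print(yes_rate)
```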

In [7]:
#job_data = pd.get_dummies(bank_dat['job'])
sns.factorplot(data=bank_dat, x='job', hue='yes', kind='count')
for item in plt.gca().xaxis.get_ticklabels():
    item.set_rotation(90)
plt.subplots_adjust(bottom=0.25, top=0.9)
plt.title('Job titles and their responses')

<IPython.core.display.Javascript object>

Text(0.5,1,'Job titles and their responses')

In [8]:
job_data = pd.get_dummies(bank_dat['job'])
job_data = pd.concat([job_data, bank_dat['yes']], axis=1)

In [9]:
plt.figure()
corr_data = job_data.corr()
sns.heatmap(corr_data, linewidths=0.5, cmap='coolwarm', center = 0)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x10da6290>

Job classes such as administrators, technicians, and especially retired people have higher chances of subscribing to the term deposit. The heatmap of the dummy-encoded job columns does not even hint at this.
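One plausible reason the heatmap looks flat is that the correlation of a one-hot column with the outcome shrinks when the category is small, even if its 'yes' rate is much higher. A small sketch with a hypothetical population (the group sizes and rates are assumptions chosen for illustration, not values from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical population: 50 retired clients (40% say yes)
# and 950 other clients (10% say yes).
retired = np.r_[np.ones(50), np.zeros(950)]
yes = np.r_[np.ones(20), np.zeros(30), np.ones(95), np.zeros(855)]
toy = pd.DataFrame({'retired': retired, 'yes': yes})

# 'yes' rate inside and outside the category.
rate_retired = toy.loc[toy['retired'] == 1, 'yes'].mean()
rate_other = toy.loc[toy['retired'] == 0, 'yes'].mean()

# Pearson correlation between the dummy column and the outcome.
corr = toy['retired'].corr(toy['yes'])
print(rate_retired, rate_other, round(corr, 3))
```

In this toy setup the retired group says 'yes' four times as often, yet the correlation stays near 0.2, which would appear as a pale cell in a heatmap.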

In [10]:
sns.factorplot(data=bank_dat, x='marital', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.25, top=0.9)
plt.title('Marital status and their responses')

<IPython.core.display.Javascript object>

Text(0.5,1,'Marital status and their responses')

Marital status appears to be inconsequential to the response.

In [11]:
sns.factorplot(data=bank_dat, x='education', hue='yes', kind='count')
for item in plt.gca().xaxis.get_ticklabels():
    item.set_rotation(90)
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Level of education w.r.t. their response')

<IPython.core.display.Javascript object>

Text(0.5,1,'Level of education w.r.t. their response')

We see a direct relationship between a higher education level and a greater interest in fixed deposits. However, this is again not definitive, since the increase in 'yes' responses is accompanied by an increase in the number of calls made.
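To separate the two effects, the counts can be normalized within each category. Below is a minimal sketch using `pd.crosstab` with `normalize='index'`; the toy rows are hypothetical and deliberately constructed so that the larger group has more 'yes' answers but the same rate.

```python
import pandas as pd

# Hypothetical data: the university group is twice as large as the others.
toy = pd.DataFrame({
    'education': ['basic.4y'] * 4 + ['high.school'] * 4
                 + ['university.degree'] * 8,
    'yes': [0, 0, 0, 1,  0, 0, 0, 1,  0, 0, 0, 0, 0, 0, 1, 1],
})

# Raw counts per education level and outcome (what a count plot shows).
counts = pd.crosstab(toy['education'], toy['yes'])

# Row-normalized shares: each row sums to 1, so group size cancels out.
shares = pd.crosstab(toy['education'], toy['yes'], normalize='index')

print(counts)
print(shares)
```

Here the university group has twice as many 'yes' answers, but the share of 'yes' is identical across all three levels.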

In [12]:
sns.factorplot(data=bank_dat, x='default', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('How being a defaulter affects your decision')
len(bank_dat[bank_dat['default'] == 'yes'])

<IPython.core.display.Javascript object>

3

Since we have just 3 defaulters in the cleaned data, this attribute carries too little information to be useful.

In [13]:
sns.factorplot(data=bank_dat, x='housing', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('How having a housing loan affects your decision')

<IPython.core.display.Javascript object>

Text(0.5,1,'How having a housing loan affects your decision')

Again, this is a non-determinant parameter.

In [14]:
sns.factorplot(data=bank_dat, x='loan', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('How having a personal loan affects your decision')

<IPython.core.display.Javascript object>

Text(0.5,1,'How having a personal loan affects your decision')

This, again, does not play a significant part in the decision.

In [15]:
sns.factorplot(data=bank_dat, x='contact', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Contact medium vs Decision')

<IPython.core.display.Javascript object>

Text(0.5,1,'Contact medium vs Decision')

We see a curious connection between the contact medium and the term deposits: approvals roughly quadruple for contacts via mobile phone, while the number of customers contacted only roughly doubles.
One possible reason is that people often take mobile calls while commuting, when they are less attentive and want to end the call as soon as possible. This can be leveraged.

In [16]:
#Let us divide the months to Quarters
quarter_dict = {'jan': 'Q1',
               'feb': 'Q1',
               'mar': 'Q1',
               'apr':'Q2',
               'may':'Q2',
               'jun':'Q2',
               'jul':'Q3',
               'aug':'Q3',
               'sep':'Q3',
               'oct':'Q4',
               'nov':'Q4',
               'dec':'Q4'}
bank_dat = bank_dat.replace(to_replace={'month':quarter_dict})

In [17]:
sns.factorplot(data=bank_dat, x='month', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Contact quarter vs decision')

<IPython.core.display.Javascript object>

Text(0.5,1,'Contact quarter vs decision')

We notice that Q3 and Q4 show increased interest in comparison to Q2.
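As an aside, the month-to-quarter mapping can also be expressed with `Series.map`. This is only a sketch of an alternative, not the method used above; the `'xyz'` entry is a hypothetical bad value added to show how the two calls differ on unmapped input.

```python
import pandas as pd

month_names = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
               'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# Build the same month -> quarter dictionary programmatically.
quarter_of = {m: f'Q{i // 3 + 1}' for i, m in enumerate(month_names)}

# 'xyz' is a hypothetical invalid month.
months = pd.Series(['may', 'jul', 'nov', 'mar', 'xyz'])

# map() turns values missing from the dictionary into NaN ...
quarters = months.map(quarter_of)

# ... whereas replace() leaves them untouched.
replaced = months.replace(quarter_of)

print(quarters.tolist())
print(replaced.tolist())
```

Using `map` makes unexpected month labels visible as missing values, while `replace` silently keeps them.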

In [18]:
sns.factorplot(data=bank_dat, x='day_of_week', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Contact day vs Decision')

<IPython.core.display.Javascript object>

Text(0.5,1,'Contact day vs Decision')

This, again, is an inconsequential factor.

Let us add a new column determining whether the client was contacted before this campaign or not.

In [19]:
bank_dat['campaign_before'] = np.where(bank_dat['pdays'] == 999, 0, 1)

Let us now use this to compare our decision outcome.

In [20]:
sns.factorplot(data=bank_dat, x='campaign_before', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Effect of previous contact on decision')
labels = ['No', 'Yes']
plt.gca().set_xticklabels(labels)
plt.xlabel('Contacted in previous campaign')

<IPython.core.display.Javascript object>

Text(0.5,32.1406,'Contacted in previous campaign')

We see that customers who were already contacted in a previous campaign have a high chance of saying 'yes' in the current one.
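The flag above relies on the value 999 in `pdays` marking clients who were not contacted before. A small sketch of the same `np.where` idiom on hypothetical rows, followed by the 'yes' rate per group:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 999 marks clients with no previous contact.
toy = pd.DataFrame({
    'pdays': [999, 3, 999, 6, 999, 999],
    'yes':   [0,   1, 0,   1, 1,   0],
})

# 0 where pdays holds the placeholder, 1 for everyone else.
toy['campaign_before'] = np.where(toy['pdays'] == 999, 0, 1)

# Share of 'yes' answers for fresh (0) and previously contacted (1) clients.
rate = toy.groupby('campaign_before')['yes'].mean()
print(rate)
```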

In [21]:
#print(bank_dat['campaign'].max()) #43
campaign_data = pd.cut(bank_dat['campaign'], [0, 1, 2, 3, 10, 15, 20, 50])
campaign_data = pd.concat([campaign_data, bank_dat['yes']], axis=1)


In [22]:
sns.factorplot(data=campaign_data, x='campaign', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Contacts in current campaign')

<IPython.core.display.Javascript object>

Text(0.5,1,'Contacts in current campaign')

Clearly, if the client doesn't get attracted in the first couple of calls in the current campaign, it probably is a waste of resources to pursue.

In [23]:
pdata = bank_dat[bank_dat['pdays'] != 999]
sns.factorplot(data=pdata, x='pdays', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Results after a client is contacted multiple times')
plt.xlabel('Days since last contact in previous campaign')

<IPython.core.display.Javascript object>

Text(0.5,32,'Days since last contact in previous campaign')

Super encouraging! If a client was contacted in a previous campaign, then a quick follow-up can seal the deal!

In [24]:
sns.factorplot(data=bank_dat, x='previous', hue='yes', kind='count')
plt.subplots_adjust(bottom=0.3, top=0.9)
plt.title('Contacts in previous campaign')

<IPython.core.display.Javascript object>

Text(0.5,1,'Contacts in previous campaign')

We can hence conclude that a solid previous campaign, along with quick follow-ups in the current campaign, is strongly associated with good results.

We omit a discussion of the socio-economic environment at the time of making the decision.

## Hypothesis validation

**Hypothesis:** A client who has been contacted in a previous campaign is more likely to agree to a deposit in the current campaign.

In [25]:
# We first divide the data into those individuals who have been contacted in a previous campaign, vs those, who have not been.
from scipy.stats import ttest_ind, ttest_rel
freshClients = bank_dat[bank_dat['campaign_before'] == 0]
freshClients = freshClients['yes']
oldClients = bank_dat[bank_dat['campaign_before'] == 1]
oldClients = oldClients['yes']

In [26]:
freshClients.shape

(29174,)

In [27]:
sizeReq = oldClients.shape[0]
sizeReq

1310

In [28]:
# freshClients = freshClients[0:sizeReq]
# freshClients.shape

In [29]:
freshClients = freshClients.sample(n=sizeReq)

In [30]:
freshClients = freshClients.reset_index()['yes']

In [31]:
oldClients = oldClients.reset_index()['yes']

We split freshClients and oldClients into groups of 50 and sum the 'yes' responses within each group. Then we run the t-tests on these group sums.

In [32]:
group1 = freshClients.groupby(freshClients.index//50).sum()
group2 = oldClients.groupby(oldClients.index//50).sum()
result1 = ttest_rel(list(group1), list(group2))
result2 = ttest_ind(list(group1), list(group2))

Treating the two sets of group sums as related (paired) samples gives us:

In [33]:
result1

Ttest_relResult(statistic=-13.772811358689008, pvalue=1.8660506218596392e-13)

While if we consider them independent:

In [34]:
result2

Ttest_indResult(statistic=-13.4576218017344, pvalue=1.4256088544568218e-18)

In either case, we get a large-magnitude t-statistic and a very small p-value, so we reject the null hypothesis of equal means. The data *support* our hypothesis.
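The grouping-and-testing procedure can be reproduced end to end on synthetic data. The sketch below assumes hypothetical 'yes' rates of 10% and 60% and a fixed seed; these numbers are illustrative only and are not estimates from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)

# Synthetic binary outcomes with hypothetical 'yes' rates.
fresh = pd.Series(rng.binomial(1, 0.10, size=1000))
old = pd.Series(rng.binomial(1, 0.60, size=1000))

# Sum the 'yes' answers in consecutive groups of 50 (20 groups each).
fresh_sums = fresh.groupby(fresh.index // 50).sum()
old_sums = old.groupby(old.index // 50).sum()

# Independent-samples test: the two sets of sums are unrelated draws.
independent = ttest_ind(fresh_sums, old_sums)

# Paired test: treats the i-th group of each set as a matched pair,
# which requires both sets to have the same length.
paired = ttest_rel(fresh_sums, old_sums)

print(independent.statistic, independent.pvalue)
print(paired.statistic, paired.pvalue)
```

Since the groups here are formed from arbitrary row order, there is no natural pairing between them, so the independent-samples test is arguably the more appropriate of the two.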

**Hypothesis:** A person currently on a loan is more likely to subscribe to a term deposit.

In [74]:
onLoan = bank_dat[bank_dat['loan'] == 'yes']
offLoan = bank_dat[bank_dat['loan'] == 'no']
onLoan = onLoan['yes']
offLoan = offLoan['yes']

In [75]:
onLoan.shape

(4768,)

In [76]:
offLoan.shape

(25716,)

In [77]:
onLoanSamples = []
offLoanSamples = []
for i in range(20):
    onLoanSample = onLoan.sample(n=500)
    onLoanSamples.append(onLoanSample.mean())
    offLoanSample = offLoan.sample(n=500)
    offLoanSamples.append(offLoanSample.mean())
ttest_ind(onLoanSamples, offLoanSamples)

Ttest_indResult(statistic=-1.4688760085626638, pvalue=0.15009685944709217)

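The p-value (0.15 in the run above) is not significant, and it varies wildly between runs because the samples are drawn at random. Hence, this attribute cannot be a reliable factor for determining the outcome of the call.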
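The run-to-run variation can be illustrated with a small simulation. This is a sketch under stated assumptions: both groups share the same hypothetical 'yes' rate of 12%, each sample mean is modeled as a binomial draw (sampling with replacement, unlike `Series.sample` above), and the helper `loan_pvalue` is a made-up name for this example.

```python
import numpy as np
from scipy.stats import ttest_ind

def loan_pvalue(seed, rate_on=0.12, rate_off=0.12, n_means=20, n=500):
    """p-value of one simulated run: 20 sample means of size 500 per group."""
    rng = np.random.default_rng(seed)
    # Each entry is the mean of n binary outcomes with the given rate.
    on_means = rng.binomial(n, rate_on, size=n_means) / n
    off_means = rng.binomial(n, rate_off, size=n_means) / n
    return ttest_ind(on_means, off_means).pvalue

# Repeat the experiment with 30 different seeds.
pvalues = [loan_pvalue(seed) for seed in range(30)]
print(min(pvalues), max(pvalues))
```

When there is no real difference between the groups, the p-values scatter across the whole range from 0 to 1, so a single run says little. Fixing the seed (or passing `random_state` to `Series.sample`) at least makes a given run reproducible.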

## Conclusion

We did a preliminary analysis of the banking data and drew some key inferences:
* The **duration of the call** plays a significant role in convincing a client, as does the client's job; **retired** people in particular are more likely to agree to a fixed deposit.
* Contacting people on a **cellphone** in the latter half of the year gives better results.
* A quick follow-up to a call from a previous campaign often leads to a good result. Perseverance is key!

We also tested the last point as a hypothesis and got fairly satisfying results.

*P.S.: I left out a discussion of the socio-economic factors, since most of them seemed beyond my background. I'll add it once I have studied them.*