The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).


Attribute Information:

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related to the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed to a term deposit? (binary: 'yes','no')



In [1]:
#load the relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn 
import seaborn as sns
sns.set()

from scipy import stats
# restore stats.chisqprob (removed from SciPy) for code that still calls it
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

In [2]:
bank_data = pd.read_csv('bank_marketing_training')
bank_data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,days_since_previous,previous,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,response
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
3,25,services,single,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
4,29,blue-collar,single,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26869,36,admin.,married,university.degree,no,no,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,no
26870,37,admin.,married,university.degree,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,yes
26871,29,unemployed,single,basic.4y,no,yes,no,cellular,nov,fri,...,1,9,1,success,-1.1,94.767,-50.8,1.028,4963,no
26872,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,yes


In [3]:
bnk_data = bank_data.copy()

# Preprocessing
### Age


In [4]:
bnk_data['age'].min()

17

In [5]:
bnk_data['age'].max()

91

In [6]:
len(bnk_data['age'].unique())

73

In [7]:
bnk_data['age'].unique()

array([56, 57, 41, 25, 29, 35, 39, 30, 55, 37, 46, 49, 54, 34, 52, 58, 32,
       38, 44, 40, 53, 47, 42, 48, 33, 50, 51, 31, 45, 43, 60, 36, 59, 28,
       27, 26, 23, 24, 20, 22, 21, 61, 19, 66, 67, 73, 18, 88, 70, 77, 75,
       63, 68, 80, 62, 72, 64, 71, 69, 78, 65, 85, 79, 81, 74, 17, 76, 82,
       83, 91, 86, 84, 89], dtype=int64)

In [8]:
bnk_data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,days_since_previous,previous,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,response
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
2,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
3,25,services,single,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
4,29,blue-collar,single,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26869,36,admin.,married,university.degree,no,no,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,no
26870,37,admin.,married,university.degree,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,yes
26871,29,unemployed,single,basic.4y,no,yes,no,cellular,nov,fri,...,1,9,1,success,-1.1,94.767,-50.8,1.028,4963,no
26872,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,yes


### Job

In [10]:
#job is categorical variable

bnk_data['job'].unique()

array(['housemaid', 'services', 'blue-collar', 'management', 'unemployed',
       'retired', 'technician', 'admin.', 'unknown', 'entrepreneur',
       'student', 'self-employed'], dtype=object)

In [11]:
bnk_data['job'].value_counts()

admin.           6757
blue-collar      6051
technician       4437
services         2581
management       1889
retired          1143
self-employed     918
entrepreneur      914
housemaid         709
unemployed        667
student           598
unknown           210
Name: job, dtype: int64

In [12]:
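#mapping
#note: Series.map returns NaN for any label that is not a key of the dict;
#the key 'admin' does not match the label 'admin.' and 'unemployed' has no key,
#so both become NaN together with 'unknown'
bnk_data['job'] = bnk_data['job'].map({'admin':0, 'blue-collar':1, 'technician':1, 'services':1, 'management':0, 'retired':1,
                                      'self-employed':1, 'entrepreneur':0, 'housemaid':1, 'student':1, 'unknown': np.NaN})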
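
A quick way to see which labels a mapping misses is to compare its keys with the labels present in the column. The sketch below uses only the counts printed by `value_counts()` above and the mapping as written; the names `job_counts`, `job_map`, `unmapped`, `unused_keys` and `missing_rows` are illustrative and not part of the notebook, and `None` stands in for `np.NaN`.

```python
# Label counts copied from the value_counts() output above.
job_counts = {
    "admin.": 6757, "blue-collar": 6051, "technician": 4437, "services": 2581,
    "management": 1889, "retired": 1143, "self-employed": 918, "entrepreneur": 914,
    "housemaid": 709, "unemployed": 667, "student": 598, "unknown": 210,
}

# Mapping as written in the cell above (None stands in for np.NaN).
job_map = {
    "admin": 0, "blue-collar": 1, "technician": 1, "services": 1, "management": 0,
    "retired": 1, "self-employed": 1, "entrepreneur": 0, "housemaid": 1,
    "student": 1, "unknown": None,
}

# Labels present in the data but absent from the mapping keys.
unmapped = sorted(set(job_counts) - set(job_map))

# Keys of the mapping that never occur in the data.
unused_keys = sorted(set(job_map) - set(job_counts))

# Rows that end up missing: unmapped labels plus labels explicitly mapped to missing.
missing_rows = sum(
    count for label, count in job_counts.items() if job_map.get(label) is None
)

print(unmapped, unused_keys, missing_rows)
```

The total of rows without a usable code agrees with the 7634 missing values that the missing-value check later in the notebook reports for job.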

### marital

In [13]:
#marital is categorical variable

bnk_data['marital'].unique()

array(['married', 'single', 'divorced', 'unknown'], dtype=object)

In [14]:
bnk_data['marital'].value_counts()

married     16187
single       7575
divorced     3055
unknown        57
Name: marital, dtype: int64

In [15]:
#mapping
bnk_data['marital'] = bnk_data['marital'].map({'married':0, 'single':1, 'divorced':1,'unknown': np.NaN})

### education

In [16]:
#education is categorical variable

bnk_data['education'].unique()

array(['basic.4y', 'high.school', 'unknown', 'basic.6y', 'basic.9y',
       'university.degree', 'professional.course', 'illiterate'],
      dtype=object)

In [17]:
bnk_data['education'].value_counts()

university.degree      7946
high.school            6130
basic.9y               4050
professional.course    3423
basic.4y               2688
basic.6y               1498
unknown                1127
illiterate               12
Name: education, dtype: int64

In [18]:
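#mapping
#note: the key 'university degree' does not match the label 'university.degree',
#so those rows become NaN together with 'unknown'
bnk_data['education'] = bnk_data['education'].map({'university degree':0, 'high.school':1, 'basic.9y':1,
                                                   'professional.course':1, 'basic.4y':1, 'basic.6y':1, 'illiterate':1,
                                                  'unknown':np.NaN})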

### Loan default

In [19]:

bnk_data['default'].unique()

array(['no', 'unknown', 'yes'], dtype=object)

In [20]:
bnk_data['default'].value_counts()

no         21219
unknown     5652
yes            3
Name: default, dtype: int64

In [21]:
bnk_data['default'] = bnk_data['default'].map({'no':0, 'unknown':np.NaN, 'yes':1})

### housing

In [22]:

bnk_data['housing'].unique()

array(['no', 'yes', 'unknown'], dtype=object)

In [23]:
bnk_data['housing'].value_counts()

yes        13967
no         12234
unknown      673
Name: housing, dtype: int64

In [24]:
bnk_data['housing'] = bnk_data['housing'].map({'yes':0, 'no':1, 'unknown':np.NaN})

### Loan(personal)

In [25]:
bnk_data['loan'].unique()

array(['no', 'yes', 'unknown'], dtype=object)

In [26]:
bnk_data['loan'].value_counts()

no         22061
yes         4140
unknown      673
Name: loan, dtype: int64

In [27]:
bnk_data['loan'] = bnk_data['loan'].map({'no':0, 'yes':1, 'unknown':np.NaN})

In [28]:
bnk_data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,days_since_previous,previous,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
1,57,1.0,0.0,1.0,,1.0,0.0,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
2,41,1.0,0.0,,,1.0,0.0,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
3,25,1.0,1.0,1.0,0.0,0.0,0.0,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
4,29,1.0,1.0,1.0,0.0,1.0,1.0,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26869,36,,0.0,,0.0,1.0,0.0,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,no
26870,37,,0.0,,0.0,0.0,0.0,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,yes
26871,29,,1.0,1.0,0.0,0.0,0.0,cellular,nov,fri,...,1,9,1,success,-1.1,94.767,-50.8,1.028,4963,no
26872,73,1.0,0.0,1.0,0.0,0.0,0.0,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963,yes


### previous_outcome

In [29]:
bnk_data['previous_outcome'].unique()

array(['nonexistent', 'failure', 'success'], dtype=object)

In [30]:
bnk_data['previous_outcome'].value_counts()

nonexistent    23210
failure         2775
success          889
Name: previous_outcome, dtype: int64

In [31]:
bnk_data['previous_outcome'] = bnk_data['previous_outcome'].map({'nonexistent':0, 'failure':1, 'success':1})

In [32]:
bnk_data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,days_since_previous,previous,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,telephone,may,mon,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191,no
1,57,1.0,0.0,1.0,,1.0,0.0,telephone,may,mon,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191,no
2,41,1.0,0.0,,,1.0,0.0,telephone,may,mon,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191,no
3,25,1.0,1.0,1.0,0.0,0.0,0.0,telephone,may,mon,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191,no
4,29,1.0,1.0,1.0,0.0,1.0,1.0,telephone,may,mon,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191,no


In [33]:
#drop predictors that have little or no predicting 'power'
bnk_data = bnk_data.drop(['contact','month','day_of_week','duration', 'days_since_previous','previous'], axis=1)


In [34]:
bnk_data

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
1,57,1.0,0.0,1.0,,1.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
2,41,1.0,0.0,,,1.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
3,25,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
4,29,1.0,1.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,4.857,5191,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26869,36,,0.0,,0.0,1.0,0.0,2,0,-1.1,94.767,-50.8,1.028,4963,no
26870,37,,0.0,,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.028,4963,yes
26871,29,,1.0,1.0,0.0,0.0,0.0,1,1,-1.1,94.767,-50.8,1.028,4963,no
26872,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.028,4963,yes


### Reorder Columns


In [35]:
bnk_data.columns.values

array(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'campaign', 'previous_outcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'response'],
      dtype=object)

In [36]:
columns_reorder = [ 'age','job', 'marital', 'education', 'default', 'housing', 'loan',
       'campaign', 'previous_outcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'response']

In [37]:
bnk_data = bnk_data[columns_reorder]

In [38]:
bnk_data

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
1,57,1.0,0.0,1.0,,1.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
2,41,1.0,0.0,,,1.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
3,25,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
4,29,1.0,1.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,4.857,5191,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26869,36,,0.0,,0.0,1.0,0.0,2,0,-1.1,94.767,-50.8,1.028,4963,no
26870,37,,0.0,,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.028,4963,yes
26871,29,,1.0,1.0,0.0,0.0,0.0,1,1,-1.1,94.767,-50.8,1.028,4963,no
26872,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.028,4963,yes


### checking for missing values

In [39]:
bnk_data.isnull().sum()

age                    0
job                 7634
marital               57
education           9073
default             5652
housing              673
loan                 673
campaign               0
previous_outcome       0
emp.var.rate           0
cons.price.idx         0
cons.conf.idx          0
euribor3m              0
nr.employed            0
response               0
dtype: int64

In [40]:
bnk_data = bnk_data.dropna(axis=0)

In [41]:
bnk_data.isnull().sum()

age                 0
job                 0
marital             0
education           0
default             0
housing             0
loan                0
campaign            0
previous_outcome    0
emp.var.rate        0
cons.price.idx      0
cons.conf.idx       0
euribor3m           0
nr.employed         0
response            0
dtype: int64

In [42]:
bnk_data

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
3,25,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
4,29,1.0,1.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,4.857,5191,no
5,57,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
6,35,1.0,0.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,4.857,5191,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26863,62,1.0,0.0,1.0,0.0,0.0,0.0,5,0,-1.1,94.767,-50.8,1.030,4963,no
26865,33,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.031,4963,yes
26868,57,1.0,0.0,1.0,0.0,0.0,0.0,6,0,-1.1,94.767,-50.8,1.031,4963,no
26872,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.028,4963,yes


In [43]:
##drop some further predictors(didn't see them initially)

bnk_data = bnk_data.drop(['nr.employed','euribor3m'], axis=1)

In [44]:
bnk_data

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,no
3,25,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,no
4,29,1.0,1.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,no
5,57,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,no
6,35,1.0,0.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...
26863,62,1.0,0.0,1.0,0.0,0.0,0.0,5,0,-1.1,94.767,-50.8,no
26865,33,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,yes
26868,57,1.0,0.0,1.0,0.0,0.0,0.0,6,0,-1.1,94.767,-50.8,no
26872,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,yes


In [45]:
bank_dta = bnk_data.copy()

In [46]:
bank_dta = bank_dta.reset_index(drop=True)

In [47]:
bank_dta

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,no
1,25,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,no
2,29,1.0,1.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,no
3,57,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,no
4,35,1.0,0.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10361,62,1.0,0.0,1.0,0.0,0.0,0.0,5,0,-1.1,94.767,-50.8,no
10362,33,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,yes
10363,57,1.0,0.0,1.0,0.0,0.0,0.0,6,0,-1.1,94.767,-50.8,no
10364,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,yes


### Target Variable-Response

In [48]:
bank_dta['response'].unique()

array(['no', 'yes'], dtype=object)

In [49]:
bank_dta['response'].value_counts()

no     9198
yes    1168
Name: response, dtype: int64

In [50]:
#transform into binary classification
bank_dta['response'] = bank_dta['response'].map({'no':0, 'yes':1})

In [51]:
##Balancing the datasets
from sklearn.utils import resample

In [52]:
bank_dta_major =bank_dta[bank_dta.response==0]
bank_dta_minor =bank_dta[bank_dta.response==1]

In [53]:
bank_dta_downsampling = resample(bank_dta_major,replace=False, n_samples=1168, random_state=42)
bank_dta_downsampled = pd.concat([bank_dta_downsampling,bank_dta_minor])

In [54]:
bank_dta_downsampled.response.value_counts()

1    1168
0    1168
Name: response, dtype: int64

In [55]:
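#the assignment aligns on the index: rows kept by the downsampling get their target back,
#every other row gets NaN in response and is dropped further down
bank_dta['response'] = bank_dta_downsampled.response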

In [56]:
bank_dta['response'].value_counts()

1.0    1168
0.0    1168
Name: response, dtype: int64
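
The balancing steps above write the resampled target back by index and rely on a later `dropna` to remove the unsampled rows. A more direct route, sketched here under stated assumptions, is to keep the concatenated frame itself. The frame `toy` and its values are made up for illustration, and `DataFrame.sample` is swapped in for `sklearn.utils.resample`; neither is taken from the notebook.

```python
import pandas as pd

# Toy frame standing in for bank_dta: 8 'no' (0) rows and 3 'yes' (1) rows.
toy = pd.DataFrame({
    "age": [30, 41, 25, 57, 36, 48, 52, 29, 33, 61, 45],
    "response": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

minority = toy[toy["response"] == 1]
majority = toy[toy["response"] == 0]

# Draw as many majority rows as there are minority rows, without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)

# Keep whole rows (features and target together), restore the original row order,
# then renumber the index.
balanced = (
    pd.concat([majority_down, minority])
    .sort_index()
    .reset_index(drop=True)
)

print(balanced["response"].value_counts().to_dict())
```

Because no NaN values are introduced along the way, the target column keeps its integer dtype instead of being converted to float.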

In [57]:
bank_dta

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,response
0,56,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,
1,25,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,
2,29,1.0,1.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,
3,57,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,
4,35,1.0,0.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10361,62,1.0,0.0,1.0,0.0,0.0,0.0,5,0,-1.1,94.767,-50.8,
10362,33,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.0
10363,57,1.0,0.0,1.0,0.0,0.0,0.0,6,0,-1.1,94.767,-50.8,0.0
10364,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.0


In [58]:
bank_dta.isnull().sum()

age                    0
job                    0
marital                0
education              0
default                0
housing                0
loan                   0
campaign               0
previous_outcome       0
emp.var.rate           0
cons.price.idx         0
cons.conf.idx          0
response            8030
dtype: int64

In [59]:
bank_dta = bank_dta.dropna(axis=0)

In [60]:
bank_dta

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,response
8,34,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,0.0
12,44,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,0.0
14,49,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,1.0
15,42,1.0,0.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,0.0
18,49,1.0,0.0,1.0,0.0,1.0,0.0,3,0,1.1,93.994,-36.4,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10360,35,1.0,1.0,1.0,0.0,0.0,0.0,3,1,-1.1,94.767,-50.8,1.0
10362,33,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.0
10363,57,1.0,0.0,1.0,0.0,0.0,0.0,6,0,-1.1,94.767,-50.8,0.0
10364,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.0


In [61]:
bank_dta = bank_dta.reset_index(drop=True)

In [62]:
bank_dta

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx,response
0,34,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,0.0
1,44,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4,0.0
2,49,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4,1.0
3,42,1.0,0.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4,0.0
4,49,1.0,0.0,1.0,0.0,1.0,0.0,3,0,1.1,93.994,-36.4,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2331,35,1.0,1.0,1.0,0.0,0.0,0.0,3,1,-1.1,94.767,-50.8,1.0
2332,33,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.0
2333,57,1.0,0.0,1.0,0.0,0.0,0.0,6,0,-1.1,94.767,-50.8,0.0
2334,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8,1.0


### Standardization

In [63]:
#from sklearn import preprocessing
# import the libraries needed to create the Custom Scaler
# note that all of them are part of the sklearn package
# moreover, one of them is the StandardScaler class itself,
# so you can imagine that the Custom Scaler is built on it

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create the Custom Scaler class

class CustomScaler(BaseEstimator,TransformerMixin): 
    
    # init or what information we need to declare a CustomScaler object
    # and what is calculated/declared as we do
    
    def __init__(self,columns):
        
        # scaler is nothing but a Standard Scaler object
        self.scaler = StandardScaler()
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    
    # the fit method, which is again based on StandardScaler
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self
    
    # the transform method which does the actual scaling

    def transform(self, X, y=None, copy=None):
        
        # record the initial order of the columns
        init_col_order = X.columns
        
        # scale all features that you chose when creating the instance of the class
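        # keep X's index so that the concat below aligns rows even when the index is not 0..n-1
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)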
        
        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
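
For reference, the scaling applied to each selected column is z = (x - mean) / std, where std is the population standard deviation (the convention `StandardScaler` follows). The sketch below is a minimal plain-Python illustration of that formula; the helper `standardize` is hypothetical, and the five ages are just the first rows shown in the tables, so the resulting values will not match the notebook's output, which is computed over all rows.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


ages = [34, 44, 49, 42, 49]
z = standardize(ages)

# After standardization the values have mean 0 and standard deviation 1.
print([round(v, 3) for v in z])
print(round(fmean(z), 6), round(pstdev(z), 6))
```

The binary columns listed in `columns_to_omit` are left out of this step so that their 0/1 coding stays interpretable.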


In [64]:
inputs_unscaled = bank_dta.iloc[:,:-1]

In [65]:
inputs_unscaled


Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx
0,34,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4
1,44,1.0,1.0,1.0,0.0,0.0,0.0,1,0,1.1,93.994,-36.4
2,49,1.0,0.0,1.0,0.0,1.0,0.0,1,0,1.1,93.994,-36.4
3,42,1.0,0.0,1.0,0.0,1.0,1.0,1,0,1.1,93.994,-36.4
4,49,1.0,0.0,1.0,0.0,1.0,0.0,3,0,1.1,93.994,-36.4
...,...,...,...,...,...,...,...,...,...,...,...,...
2331,35,1.0,1.0,1.0,0.0,0.0,0.0,3,1,-1.1,94.767,-50.8
2332,33,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8
2333,57,1.0,0.0,1.0,0.0,0.0,0.0,6,0,-1.1,94.767,-50.8
2334,73,1.0,0.0,1.0,0.0,0.0,0.0,1,0,-1.1,94.767,-50.8


In [66]:
inputs_unscaled.columns.values

array(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'campaign', 'previous_outcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx'], dtype=object)

In [67]:
columns_to_omit =['job', 'marital', 'education', 'default', 'housing', 'loan','previous_outcome']

In [68]:
columns_to_scale = [x for x in inputs_unscaled.columns.values if x not in columns_to_omit]

In [69]:
bank_scaler = CustomScaler(columns_to_scale)

In [70]:
bank_scaler.fit(inputs_unscaled)

CustomScaler(columns=['age', 'campaign', 'emp.var.rate', 'cons.price.idx',
                      'cons.conf.idx'])

In [71]:
inputs_scaled = bank_scaler.transform(inputs_unscaled)

In [72]:
inputs_scaled

Unnamed: 0,age,job,marital,education,default,housing,loan,campaign,previous_outcome,emp.var.rate,cons.price.idx,cons.conf.idx
0,-0.504552,1.0,0.0,1.0,0.0,1.0,0.0,-0.602558,0,0.981327,0.837804,0.747312
1,0.245931,1.0,1.0,1.0,0.0,0.0,0.0,-0.602558,0,0.981327,0.837804,0.747312
2,0.621173,1.0,0.0,1.0,0.0,1.0,0.0,-0.602558,0,0.981327,0.837804,0.747312
3,0.095834,1.0,0.0,1.0,0.0,1.0,1.0,-0.602558,0,0.981327,0.837804,0.747312
4,0.621173,1.0,0.0,1.0,0.0,1.0,0.0,0.372555,0,0.981327,0.837804,0.747312
...,...,...,...,...,...,...,...,...,...,...,...,...
2331,-0.429504,1.0,1.0,1.0,0.0,0.0,0.0,0.372555,1,-0.310320,2.046738,-1.886108
2332,-0.579601,1.0,0.0,1.0,0.0,0.0,0.0,-0.602558,0,-0.310320,2.046738,-1.886108
2333,1.221560,1.0,0.0,1.0,0.0,0.0,0.0,1.835225,0,-0.310320,2.046738,-1.886108
2334,2.422334,1.0,0.0,1.0,0.0,0.0,0.0,-0.602558,0,-0.310320,2.046738,-1.886108


In [73]:
targets = bank_dta['response']

In [74]:
targets

0       0.0
1       0.0
2       1.0
3       0.0
4       0.0
       ... 
2331    1.0
2332    1.0
2333    0.0
2334    1.0
2335    0.0
Name: response, Length: 2336, dtype: float64

In [75]:
from sklearn.model_selection import train_test_split

In [76]:
train_test_split(inputs_scaled,targets)

[           age  job  marital  education  default  housing  loan  campaign  \
 618  -0.429504  1.0      0.0        1.0      0.0      1.0   1.0 -0.115002   
 1679 -0.204359  1.0      1.0        1.0      0.0      0.0   0.0  1.347669   
 1887  0.846318  1.0      0.0        1.0      0.0      0.0   1.0 -0.602558   
 1023 -0.204359  1.0      0.0        1.0      0.0      1.0   0.0 -0.115002   
 198  -0.654649  1.0      0.0        1.0      0.0      1.0   0.0  0.372555   
 ...        ...  ...      ...        ...      ...      ...   ...       ...   
 453   1.221560  1.0      0.0        1.0      0.0      0.0   1.0 -0.115002   
 139  -0.429504  1.0      1.0        1.0      0.0      1.0   1.0 -0.602558   
 413  -0.729698  1.0      0.0        1.0      0.0      1.0   0.0  2.810339   
 1843  2.272237  1.0      0.0        1.0      0.0      0.0   0.0 -0.602558   
 2251  1.446705  1.0      0.0        1.0      0.0      0.0   0.0 -0.115002   
 
       previous_outcome  emp.var.rate  cons.price.idx  cons.co

In [77]:
#this splits the inputs and targets into train and test sets
#train_size=0.8 puts 80% of the data in the train set and 20% in the test set
#random_state fixes the shuffle so that the split is the same on every run
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets, train_size = 0.8, random_state = 20)

In [78]:
#train shape with its inputs and targets
#x is inputs, y is targets
#the output shows 1868 observations and 12 input variables for the inputs
#and a vector of length 1868 (targets),
#which corresponds to the response column, i.e. one target value per observation
print(x_train.shape, y_train.shape)

(1868, 12) (1868,)
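
The split sizes follow from simple arithmetic. The sketch below assumes the train count is the floor of `train_size * n_samples` and that the test set takes the remainder, which is consistent with the shapes printed in this notebook; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, train_size):
    """Return (n_train, n_test) for a fractional train_size."""
    n_train = math.floor(train_size * n_samples)
    n_test = n_samples - n_train
    return n_train, n_test


# 2336 balanced observations, 80% of them used for training.
n_train, n_test = split_sizes(2336, 0.8)
print(n_train, n_test)
```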


In [79]:
#test shape with its inputs and targets
#x is inputs, y is targets
#the output shows 468 observations and 12 input variables for the inputs
#and a vector of length 468 (targets),
#which corresponds to the response column, i.e. one target value per observation
print(x_test.shape, y_test.shape)

(468, 12) (468,)


In [80]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [81]:
reg = LogisticRegression()

In [82]:
reg.fit(x_train, y_train)

LogisticRegression()

In [83]:
#this evaluates the model accuracy on the training data
reg.score(x_train, y_train)
#about 70% of the model's predictions on the training set match the targets

0.6986081370449678
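
The score above is measured on the data the model was fitted on. To estimate how well a model generalizes, the same `score` call is usually repeated on the held-out test set. The sketch below shows the pattern on synthetic data from `make_classification`; the data, the variable names and the resulting numbers are illustrative only and say nothing about the bank model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the bank data.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=20
)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy on the data used for fitting versus accuracy on unseen data.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

In this notebook the equivalent call would be `reg.score(x_test, y_test)`.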

In [84]:
reg.intercept_

array([-0.67443124])

In [85]:
reg.coef_

array([[ 1.26869416e-01,  7.65685373e-01,  1.12688748e-01,
        -1.14755704e-04,  0.00000000e+00, -1.03950590e-01,
        -2.42385871e-01, -8.76396507e-02,  1.31914650e-01,
        -1.13992331e+00,  4.69036004e-01,  2.21691260e-01]])

In [86]:
# however, these numbers alone don't show which feature each coefficient belongs to
# we need to know what they refer to
inputs_unscaled.columns.values

array(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'campaign', 'previous_outcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx'], dtype=object)

In [87]:
#therefore
Feature_name = inputs_unscaled.columns.values

In [88]:
summary_table = pd.DataFrame(columns=['Feature_name'], data=Feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_)
summary_table
#note: reg.coef_ has shape (1, n_features), so we transpose it to get one coefficient per row

Unnamed: 0,Feature_name,Coefficient
0,age,0.126869
1,job,0.765685
2,marital,0.112689
3,education,-0.000115
4,default,0.0
5,housing,-0.103951
6,loan,-0.242386
7,campaign,-0.08764
8,previous_outcome,0.131915
9,emp.var.rate,-1.139923


In [89]:
#let's add the intercept
summary_table.index = summary_table.index+1
summary_table.loc[0] = ['Intercept',reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature_name,Coefficient
0,Intercept,-0.674431
1,age,0.126869
2,job,0.765685
3,marital,0.112689
4,education,-0.000115
5,default,0.0
6,housing,-0.103951
7,loan,-0.242386
8,campaign,-0.08764
9,previous_outcome,0.131915


In [90]:
# create a new column called 'Odds_ratio', which holds the odds ratio of each feature
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)
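
As a small worked example of this step, the odds ratio is simply the exponential of the coefficient. The sketch below uses three coefficients copied from the `reg.coef_` output above; the dictionary names are illustrative.

```python
import math

# Coefficients copied from the reg.coef_ output above.
coefficients = {
    "job": 0.765685,
    "cons.price.idx": 0.469036,
    "emp.var.rate": -1.139923,
}

# odds ratio = exp(coefficient)
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of a 'yes')")
```

A positive coefficient gives an odds ratio above 1 and a negative coefficient gives one below 1. For the standardized columns, the ratio refers to a change of one standard deviation in the feature.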

In [91]:
# display the df
summary_table

Unnamed: 0,Feature_name,Coefficient,Odds_ratio
0,Intercept,-0.674431,0.509446
1,age,0.126869,1.135269
2,job,0.765685,2.150468
3,marital,0.112689,1.119283
4,education,-0.000115,0.999885
5,default,0.0,1.0
6,housing,-0.103951,0.90127
7,loan,-0.242386,0.784753
8,campaign,-0.08764,0.916091
9,previous_outcome,0.131915,1.141011


In [92]:
# Sort the table by odds ratio, largest first
# (sort_values sorts in ascending order by default, so we pass ascending=False)
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Feature_name,Coefficient,Odds_ratio
2,job,0.765685,2.150468
11,cons.price.idx,0.469036,1.598453
12,cons.conf.idx,0.221691,1.248186
9,previous_outcome,0.131915,1.141011
1,age,0.126869,1.135269
3,marital,0.112689,1.119283
5,default,0.0,1.0
4,education,-0.000115,0.999885
8,campaign,-0.08764,0.916091
6,housing,-0.103951,0.90127


### Interpretation

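When interpreting the coefficients of a logistic regression, the further a coefficient is from zero
(in either the positive or the negative direction), the greater the importance of its feature. This comparison is only meaningful when the features are on a common scale.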
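
The "distance from zero" rule can be applied directly by ranking the coefficients by absolute value. The sketch below does this with the twelve values printed by `reg.coef_` and the names from `inputs_unscaled.columns.values`; `coef_series` and `ranked` are hypothetical names.

```python
import pandas as pd

# Coefficients and feature names as printed earlier in the notebook
coef_series = pd.Series(
    [1.26869416e-01, 7.65685373e-01, 1.12688748e-01, -1.14755704e-04,
     0.00000000e+00, -1.03950590e-01, -2.42385871e-01, -8.76396507e-02,
     1.31914650e-01, -1.13992331e+00, 4.69036004e-01, 2.21691260e-01],
    index=['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
           'campaign', 'previous_outcome', 'emp.var.rate', 'cons.price.idx',
           'cons.conf.idx'],
)

# Order the features by |coefficient|, largest first, keeping the signed values
order = coef_series.abs().sort_values(ascending=False).index
ranked = coef_series.reindex(order)
print(ranked)
```

Ranking by absolute value treats strong negative effects the same as strong positive ones, whereas sorting by odds ratio pushes every negative coefficient to the bottom of the table.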

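Job, consumer price index, consumer confidence index, previous outcome, age, housing, loan and marital status have the larger weights and therefore the stronger effect on the target (the response). The employment variation rate (emp.var.rate) has the largest weight in absolute value (-1.14), although its odds ratio, being below 1, places it at the bottom of the sorted table. By contrast, the weights of default and education are essentially zero and the weight of campaign is small (-0.09), so these features barely affect the model.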

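Among the features that raise the odds of subscribing, the most important one is 'job', i.e. the kind of job the client does. Its odds ratio of 2.15 means that, with the other features held fixed, a one-unit increase in the job feature as it enters the model multiplies the odds of subscribing to a term (fixed) deposit by about 2.15. Because job is a category encoded as a number, this should not be read as a comparison between clients who have a job and clients who do not.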

Another important variable is the consumer price index (CPI), which plays a role in many key financial decisions. The CPI measures the average change over time in the prices that consumers pay for goods and services, and it is commonly used as a measure of inflation. When the prices of goods and services rise in general, interest rates tend to rise as well, which can hurt investments in fixed-income assets.
The CPI can therefore affect a client's decision to opt for a term deposit: its odds ratio shows that a one-unit increase in the cons.price.idx feature multiplies the odds of subscribing by about 1.6.

A brief note on the intercept: it has no specific meaning attached to it and serves to calibrate the model so that its predictions are accurate, which is why in machine learning it is called the bias. Without an intercept, the log-odds of every prediction would be off by precisely that value.
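
One way to see what the intercept does: when every input feature equals zero, the model's log-odds are just the intercept. The sketch below converts that value to odds and to a probability. The `sigmoid` helper is a hypothetical name, and reading "all features zero" as "an average client" assumes the inputs were standardized, which the name `inputs_unscaled` suggests but does not confirm.

```python
import math

intercept = -0.674431  # reg.intercept_[0], copied from the summary table


def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


baseline_odds = math.exp(intercept)  # odds when all features are zero
baseline_prob = sigmoid(intercept)   # the same quantity as a probability

print(f"baseline odds {baseline_odds:.4f}, baseline probability {baseline_prob:.4f}")
```

The baseline odds match the `Odds_ratio` entry for the intercept row (about 0.51), which corresponds to a probability of roughly one in three.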


In [93]:
reg.score(x_test, y_test)

0.7158119658119658

In [94]:
# Get the probability the model assigns to each test observation being 0 or 1
predicted_proba = reg.predict_proba(x_test)
predicted_proba
# The first column is the probability of the observation being 0, the second the probability of it being 1

array([[0.78423355, 0.21576645],
       [0.21432999, 0.78567001],
       [0.72190457, 0.27809543],
       [0.67678588, 0.32321412],
       [0.70169977, 0.29830023],
       [0.2527706 , 0.7472294 ],
       [0.71252023, 0.28747977],
       [0.21162117, 0.78837883],
       [0.71256805, 0.28743195],
       [0.49910766, 0.50089234],
       [0.21059725, 0.78940275],
       [0.21959828, 0.78040172],
       [0.63205848, 0.36794152],
       [0.6262245 , 0.3737755 ],
       [0.46526589, 0.53473411],
       [0.3669632 , 0.6330368 ],
       [0.12188914, 0.87811086],
       [0.59192729, 0.40807271],
       [0.10784015, 0.89215985],
       [0.66678027, 0.33321973],
       [0.58091864, 0.41908136],
       [0.72248168, 0.27751832],
       [0.46325423, 0.53674577],
       [0.24731134, 0.75268866],
       [0.39812128, 0.60187872],
       [0.68025369, 0.31974631],
       [0.09135422, 0.90864578],
       [0.2415496 , 0.7584504 ],
       [0.59676353, 0.40323647],
       [0.79509536, 0.20490464],
       [0.
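
The two columns of `predicted_proba` sum to 1 in every row, and for a binary problem the predicted class is the one with the higher probability, which amounts to a 0.5 cut-off on the second column. A minimal sketch using the first ten rows printed above (`proba_demo`, `p_yes` and `labels` are hypothetical names):

```python
import numpy as np

# First ten rows of predicted_proba, copied from the output above
proba_demo = np.array([
    [0.78423355, 0.21576645],
    [0.21432999, 0.78567001],
    [0.72190457, 0.27809543],
    [0.67678588, 0.32321412],
    [0.70169977, 0.29830023],
    [0.2527706, 0.7472294],
    [0.71252023, 0.28747977],
    [0.21162117, 0.78837883],
    [0.71256805, 0.28743195],
    [0.49910766, 0.50089234],
])

# Each row is a probability distribution over the two classes
print(np.allclose(proba_demo.sum(axis=1), 1.0))

# Probability of class 1 ("yes"), thresholded at 0.5 to obtain class labels
p_yes = proba_demo[:, 1]
labels = (p_yes > 0.5).astype(int)
print(labels)
```

The last row shows how close a decision can be: a probability of 0.5009 is enough for the observation to be labelled 1.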

In [95]:
# Select only the probabilities of class 1, since we are only interested in how many clients would say YES to a term deposit
predicted_proba[:,1]
# Behind the scenes, the logistic regression model computes these probabilities when predicting:
# if the probability is below 0.5 it assigns a 0, otherwise a 1

array([0.21576645, 0.78567001, 0.27809543, 0.32321412, 0.29830023,
       0.7472294 , 0.28747977, 0.78837883, 0.28743195, 0.50089234,
       0.78940275, 0.78040172, 0.36794152, 0.3737755 , 0.53473411,
       0.6330368 , 0.87811086, 0.40807271, 0.89215985, 0.33321973,
       0.41908136, 0.27751832, 0.53674577, 0.75268866, 0.60187872,
       0.31974631, 0.90864578, 0.7584504 , 0.40323647, 0.20490464,
       0.37346712, 0.37486765, 0.76443344, 0.35146982, 0.1932813 ,
       0.37693602, 0.10362127, 0.33451786, 0.81479716, 0.39642616,
       0.29122838, 0.27889922, 0.52604784, 0.22612793, 0.39619471,
       0.27539829, 0.21494229, 0.29167162, 0.77699009, 0.37486765,
       0.39697996, 0.2470254 , 0.49573938, 0.7633527 , 0.26294007,
       0.27510438, 0.7941102 , 0.16621382, 0.79925829, 0.64430303,
       0.37312076, 0.66479223, 0.89157852, 0.7317103 , 0.58210335,
       0.21385678, 0.2461193 , 0.73065816, 0.72432451, 0.39809219,
       0.76295982, 0.67788291, 0.574756  , 0.17768827, 0.22875