# Bank Marketing Data Set

The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

<h3>Goal: Predict whether the client will subscribe to a term deposit (variable y).</h3>

In [59]:
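# Update scikit-learn to prevent version mismatches.
# Note: `sklearn` on PyPI is only a placeholder (hence version 0.0 below);
# the real package is installed with: pip install scikit-learn --upgrade
!pip install sklearn --upgrade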

Requirement already up-to-date: sklearn in c:\users\maritza\anaconda3\lib\site-packages (0.0)


In [60]:
# install joblib. This will be used to save your model. 
# Restart your kernel after installing 
!pip install joblib



In [61]:
import pandas as pd

# Read the CSV and Perform Basic Data Cleaning

In [4]:
df = pd.read_csv("bankingdata.csv")
# Inspect column types and non-null counts (no column has missing values)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

In [5]:
df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


# Data Description

Input variables:
Bank client data:

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

Related to the last contact of the current campaign:

8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

Social and economic context attributes:

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

21 - y: has the client subscribed to a term deposit? (binary: 'yes','no')
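The notes on `duration` (known only after the call) and on the `pdays` value 999 (a sentinel, not a real day count) suggest a small preprocessing step before modelling. The following is a minimal sketch under stated assumptions: the helper `prepare_features` and the column `previously_contacted` are hypothetical names, and the three rows are made up for illustration rather than taken from the data set.

```python
import pandas as pd


def prepare_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the leaky `duration` column and flag the `pdays == 999` sentinel."""
    out = frame.drop(columns=["duration"])
    # 999 means "not previously contacted", not 999 days; make that explicit.
    out["previously_contacted"] = (out["pdays"] != 999).astype(int)
    return out


# Tiny illustrative frame (made-up rows, same column names as the data set).
sample = pd.DataFrame({
    "age": [56, 37, 41],
    "duration": [261, 226, 0],
    "pdays": [999, 999, 6],
})
features = prepare_features(sample)
print(features)
```

Applied to the real `df`, the same function would remove `duration` from the feature set, which is the setup the attribute notes recommend for a realistic predictive model.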

# Data Exploration

In [6]:
df.y.value_counts()

no     36548
yes     4640
Name: y, dtype: int64
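These counts show a strong class imbalance, which matters when reading the accuracy scores further down. As a rough check, the sketch below computes the accuracy of a model that always predicts the majority class, using only the counts printed above. It is computed over the full data set, so the figure for a particular test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {"no": 36548, "yes": 4640}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
positive_rate = counts["yes"] / total

print(f"positive rate:     {positive_rate:.4f}")
print(f"majority baseline: {baseline_accuracy:.4f} (always predict '{majority_label}')")
```

Only about 11% of clients subscribed, so always answering 'no' already scores about 0.887. Accuracy values close to that number say little about how well subscribers are identified.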

In [7]:
df.marital.value_counts()

married     24928
single      11568
divorced     4612
unknown        80
Name: marital, dtype: int64

In [8]:
#import matplotlib.pyplot as plt
#import seaborn as sns

In [9]:
#sns.countplot(x='y', data= df, palette = 'bwr')
#plt.xlabel("Subscribes a Term Deposit")
#plt.show()

In [10]:
#from matplotlib.ticker import FuncFormatter
#pd.crosstab(df.marital,df.y).plot(kind="bar",figsize=(15,6))
#plt.title('Marital Status Subscriptions')
#plt.xlabel("Subscribes a Term Deposit = (0 = NO, 1 = YES)")
#plt.xticks(rotation = 0)
#plt.legend(["No", "Yes"])
#plt.ylabel('Frequency')
#plt.show()

In [11]:
#edudf = df.groupby(['y'])\
#        .count()\
#        .reset_index()
#edudf

In [12]:
#import matplotlib.pyplot as plt
#df.education.value_counts().plot.bar()

In [13]:
#plt.scatter(x=df.age, y=(df.y == 'yes'))
#plt.show()

In [14]:
#import matplotlib.pyplot as plt
#pd.crosstab(df.age,df.y).plot(kind="bar",figsize=(20,6))
#plt.title('Term Deposit Subscriptions by Age')
#plt.xlabel('Age')
#plt.ylabel('Frequency')
#plt.show()

In [15]:
#import seaborn as sns
#sns.set(style="ticks")

# df is already loaded above; sns.load_dataset() only fetches seaborn's bundled example data sets.
#sns.pairplot(df, hue="y")
#plt.show()

In [16]:
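# Alternative (not used): encode only selected categorical columns.
#selected_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y']
#encoders = {name: LabelEncoder().fit(df[name]) for name in selected_columns}
from sklearn.preprocessing import LabelEncoder

# Note: apply() label-encodes every column, numeric ones included, so the
# float columns become integer rank codes (see df.info() below).
le = LabelEncoder()
df = df.apply(le.fit_transform)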
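Applying one `LabelEncoder` to every column also recodes the numeric columns and keeps only the mapping of the last column fitted. A possible alternative, following the idea in the commented-out dictionary comprehension, is to encode only the non-numeric columns and keep one fitted encoder per column. This is a sketch on made-up rows, not the notebook's actual method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows with the same kinds of columns as the data set.
sample = pd.DataFrame({
    "age": [56, 37, 41],
    "job": ["housemaid", "services", "admin."],
    "euribor3m": [4.857, 4.857, 1.25],
    "y": ["no", "no", "yes"],
})

encoded = sample.copy()
encoders = {}
# Only non-numeric columns are encoded; numeric columns keep their values.
for name in encoded.select_dtypes(exclude="number").columns:
    encoders[name] = LabelEncoder().fit(encoded[name])
    encoded[name] = encoders[name].transform(encoded[name])

print(encoded)
print(list(encoders["y"].classes_))
```

Keeping the encoders makes it possible to map codes back to labels later with `encoders[name].inverse_transform(...)`.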

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null int32
marital           41188 non-null int32
education         41188 non-null int32
default           41188 non-null int32
housing           41188 non-null int32
loan              41188 non-null int32
contact           41188 non-null int32
month             41188 non-null int32
day_of_week       41188 non-null int32
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null int32
emp.var.rate      41188 non-null int64
cons.price.idx    41188 non-null int64
cons.conf.idx     41188 non-null int64
euribor3m         41188 non-null int64
nr.employed       41188 non-null int64
y                 41188 non-null int32
dtypes: int32(11), int64(10)
memory usage: 4.9 MB


# Decision Tree

In [18]:
from sklearn import tree

In [19]:
target = df.y

In [20]:
df=df.drop(['y'], axis=1)

In [21]:
feature_names = df.columns
feature_names

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed'],
      dtype='object')

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, target, random_state=42)
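Because the target is imbalanced, it can help to keep the class ratio identical in the train and test parts. The sketch below shows the `stratify` argument of `train_test_split` on made-up data (90 negatives, 10 positives); the variable names are illustrative and the notebook's own split above does not use stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 90 negatives, 10 positives.
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([0] * 90 + [1] * 10, name="y")

# stratify=y keeps the share of positives the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```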

In [23]:
y_test

32884    0
3169     0
32206    0
9403     0
14020    0
17201    0
879      0
23757    0
10821    0
14355    0
32311    0
31850    0
13595    1
21871    1
16735    0
8040     0
14064    0
17688    0
15507    0
37480    1
17267    1
1670     0
8528     1
15755    0
20770    0
36915    0
33728    0
22969    0
8925     0
144      0
        ..
34875    0
2967     0
16425    0
20831    0
21679    0
12890    0
27203    0
21478    0
34551    0
23309    0
12157    0
1672     0
28213    0
2546     0
11034    0
6956     0
9149     1
18210    0
12637    0
12851    0
7825     0
25497    0
27184    0
14177    1
40238    1
35087    0
12883    0
3588     0
31192    0
1937     0
Name: y, Length: 10297, dtype: int32

In [27]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8894823735068467
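An accuracy of about 0.889 is easier to interpret next to a trivial baseline. The sketch below compares a decision tree with scikit-learn's `DummyClassifier` on synthetic data from `make_classification`, which stands in for the bank data here, so the printed numbers will not match the notebook's. Fixing `random_state` on the tree also makes the score reproducible between runs.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank data (about 89% negatives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.89], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree_clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
tree_acc = tree_clf.score(X_te, y_te)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"tree accuracy:     {tree_acc:.3f}")
```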

In [28]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30891 entries, 549 to 15795
Data columns (total 20 columns):
age               30891 non-null int64
job               30891 non-null int32
marital           30891 non-null int32
education         30891 non-null int32
default           30891 non-null int32
housing           30891 non-null int32
loan              30891 non-null int32
contact           30891 non-null int32
month             30891 non-null int32
day_of_week       30891 non-null int32
duration          30891 non-null int64
campaign          30891 non-null int64
pdays             30891 non-null int64
previous          30891 non-null int64
poutcome          30891 non-null int32
emp.var.rate      30891 non-null int64
cons.price.idx    30891 non-null int64
cons.conf.idx     30891 non-null int64
euribor3m         30891 non-null int64
nr.employed       30891 non-null int64
dtypes: int32(10), int64(10)
memory usage: 3.8 MB


# Random Forest Classifier

In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score

rf = RandomForestClassifier(max_depth=5, n_estimators=200)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9053122268621929

In [30]:
rf = RandomForestClassifier(n_estimators=100)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)*100

91.56064873264057

In [31]:
RandomForestClassifier?

Init signature:
RandomForestClassifier(
    n_estimators='warn',
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features='auto',
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    min_impurity_split=None,
    bootstrap=True,
    oob_score=False,

# GridSearch CV

In [32]:
parameters = {'n_estimators':[100, 200, 300, 400], 
              'max_depth':[2, 4, 5, 6],
             'min_samples_split': [2,3,4]
             }

In [33]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score

In [34]:
rf = RandomForestClassifier()
gs = GridSearchCV(rf, parameters)
gs.fit(X_train, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
           

In [35]:
gs.best_score_

0.9082256968048946

In [36]:
gs.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=6, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=4,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [51]:
f1_score(y_test, gs.predict(X_test))

0.39133205863607395
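With no `scoring` argument, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers. That may explain a best score near 0.908 alongside an F1 near 0.39. The sketch below searches on F1 instead and weights classes inversely to their frequency. The data is synthetic and the grid is deliberately small so it runs quickly, so treat it as an illustration rather than a tuned setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for X_train / y_train.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.85], random_state=0
)

small_grid = {"n_estimators": [20, 40], "max_depth": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    small_grid,
    scoring="f1",  # rank candidates by F1 on the positive class
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```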

In [52]:
rf = RandomForestClassifier(n_estimators=300, max_depth=6, min_samples_split=4)
rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9078372341458677

In [53]:
f1_score(y_test, rf.predict(X_test))

0.3966942148760331

In [54]:
gs.score(X_test, y_test)

0.9072545401573274

In [55]:
sorted(gs.cv_results_)

['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_max_depth',
 'param_min_samples_split',
 'param_n_estimators',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']

In [56]:
sorted(zip(gs.best_estimator_.feature_importances_, feature_names), reverse=True)

[(0.3473289422314986, 'duration'),
 (0.1536382998971682, 'nr.employed'),
 (0.14172631397249, 'euribor3m'),
 (0.07577922341141041, 'pdays'),
 (0.07278263550195989, 'poutcome'),
 (0.05277391639038204, 'emp.var.rate'),
 (0.03874655955173225, 'cons.price.idx'),
 (0.038388318604875064, 'cons.conf.idx'),
 (0.024384912460732825, 'month'),
 (0.01310348906089507, 'age'),
 (0.010256712581956842, 'contact'),
 (0.008332315544964239, 'previous'),
 (0.006350922365726006, 'day_of_week'),
 (0.00438809508063376, 'campaign'),
 (0.0035663949140207216, 'education'),
 (0.002886831773091611, 'job'),
 (0.0019146565863894511, 'default'),
 (0.001850128179084238, 'marital'),
 (0.0009267870333890334, 'loan'),
 (0.0008745448575996838, 'housing')]

In [57]:
precision_score(y_test, rf.predict(X_test))

0.7428571428571429
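Precision and F1 together determine recall, which the notebook does not print. As a worked check, the sketch below solves the F1 definition for recall, using the two scores reported for the same fitted `rf` model.

```python
# Scores copied from the outputs above (cells In [53] and In [57]).
f1 = 0.3966942148760331
precision = 0.7428571428571429

# F1 = 2PR / (P + R)  =>  R = F1 * P / (2P - F1)
recall = f1 * precision / (2 * precision - f1)
print(f"implied recall: {recall:.3f}")
```

The implied recall is roughly 0.27: most predicted subscribers are correct, but a large share of actual subscribers is missed.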

In [58]:
sorted(zip(rf.feature_importances_, feature_names), reverse=True)

[(0.3536166777760609, 'duration'),
 (0.17887844278061021, 'nr.employed'),
 (0.10605381508375163, 'euribor3m'),
 (0.08212807376299996, 'pdays'),
 (0.056331915098153475, 'emp.var.rate'),
 (0.05389442304767124, 'poutcome'),
 (0.05070465876809792, 'cons.conf.idx'),
 (0.038530756345298636, 'cons.price.idx'),
 (0.02365583873247805, 'month'),
 (0.011814341504564903, 'age'),
 (0.011394546911461089, 'previous'),
 (0.010051155352145496, 'contact'),
 (0.00675676347712125, 'day_of_week'),
 (0.004102025077638925, 'campaign'),
 (0.003585322017776072, 'education'),
 (0.003168317199386643, 'job'),
 (0.0021768615114114926, 'default'),
 (0.0016233563412485137, 'marital'),
 (0.0007748814546630593, 'housing'),
 (0.0007578277574604339, 'loan')]
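The `joblib` install at the top of the notebook is described as being for saving the model, but no cell does so. A minimal sketch of that step follows. It trains a small forest on synthetic data and writes it to a temporary directory; the file name `bank_rf.joblib` is a hypothetical choice.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the fitted `rf`.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "bank_rf.joblib")  # hypothetical file name
    joblib.dump(model, model_path)      # serialize the fitted model
    restored = joblib.load(model_path)  # load it back

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

In the notebook itself, the equivalent call would be `joblib.dump(rf, ...)` with a path of your choice. A saved model should be loaded with the same scikit-learn version that produced it.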

# Source: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

# Software: Tableau / Jupyter Lab