# Bank Marketing Problem

#### *Justyna Komorowska*<BR>
*Oct 25 2020*

The dataset contains information about a direct marketing campaign (phone calls) of a Portuguese banking institution.<BR>
Disclaimer: The dataset is based on the publicly available Bank Marketing dataset (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#)


#### Attribute information

Input variables:
# bank client data:
1 - age (numeric) <BR>
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown') <BR>
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)<BR>
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')<BR>
5 - default: has credit in default? (categorical: 'no','yes','unknown')<BR>
6 - housing: has housing loan? (categorical: 'no','yes','unknown')<BR>
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')<BR>
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')<BR>
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')<BR>
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.<BR>
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)<BR>
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)<BR>
14 - previous: number of contacts performed before this campaign and for this client (numeric)<BR>
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')<BR>
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)<BR>
17 - cons.price.idx: consumer price index - monthly indicator (numeric)<BR>
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)<BR>
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)<BR>
20 - nr.employed: number of employees - quarterly indicator (numeric)<BR>

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')<BR>

======================================================================================================================

How can we help the marketing team increase the performance of their marketing campaign? The team wants to leverage Advanced Analytics to improve campaign targeting. In other words, they want to identify the customers for whom the gain from being contacted is the highest. To establish a proof of concept, they provided data from a previous campaign for both the control and targeted groups, which were selected at random from non-users before the campaign started. The aim of the campaign was to persuade customers to subscribe to the term deposit. The product (term deposit) was also available to the control group, but was not marketed to it.

### Task: Assuming the total campaign budget is fixed, calculate the expected lift from using model predictions vs. random selection (as done before).
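One way to make "lift" concrete under a fixed budget: if the budget covers only a fraction of customers, compare the conversion rate among the top-scored customers with the overall conversion rate, which is what random selection would achieve on average. The sketch below uses made-up scores and outcomes; the function name `lift_at_fraction` and the toy data are illustrative assumptions, and it measures simple response lift without using the control group.

```python
def lift_at_fraction(scores, outcomes, fraction):
    """Ratio of the conversion rate among the top-scored `fraction` of
    customers to the overall conversion rate (random-selection baseline)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_contacted = max(1, int(round(len(scores) * fraction)))
    # Rank customers from highest to lowest predicted score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(outcome for _, outcome in ranked[:n_contacted]) / n_contacted
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate


# Toy data (hypothetical): 10 customers, 1 = subscribed, 0 = did not.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Assume the budget covers 30% of customers.
lift = lift_at_fraction(toy_scores, toy_outcomes, 0.3)
print(f"lift at 30%: {lift:.2f}")
```

A lift of 1.0 means the model does no better than random selection; contacting everyone (fraction 1.0) always gives exactly 1.0.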

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Predictive model for binary classification

### 1.1. Data Input

#### 1.1.1 In

In [2]:
data = pd.read_csv("bank_data_prediction_task.csv")



In [6]:
data.head()

Unnamed: 0.1,Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,...,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,test_control_flag
0,1,37,services,married,high.school,no,yes,no,telephone,may,...,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,campaign group
1,2,45,services,married,basic.9y,unknown,no,no,telephone,may,...,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,campaign group
2,3,59,admin.,married,professional.course,no,no,no,telephone,may,...,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,campaign group
3,4,25,services,single,high.school,no,yes,no,telephone,may,...,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,campaign group
4,5,35,blue-collar,married,basic.6y,no,yes,no,telephone,may,...,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,campaign group


**Normalization** should technically be fitted on the train dataset and then re-applied to all other data subsets (test, control and prod). For simplicity, it is applied here to all data at once.

In [3]:
data_norm = data.copy()

In [5]:
# Min-max scaling: (x - min) / (max - min) maps each column to [0, 1]
cols_to_norm = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'age']
for col in cols_to_norm:
    col_min, col_max = data_norm[col].min(), data_norm[col].max()
    data_norm[col + '.norm'] = (data_norm[col] - col_min) / (col_max - col_min)
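The note above about fitting the normalization on the training sample only could be sketched as follows. This is a minimal illustration on a toy frame, assuming min-max scaling; the helper names `fit_min_max` and `apply_min_max` are hypothetical and not part of the notebook.

```python
import pandas as pd


def fit_min_max(train, columns):
    """Learn per-column (min, max) from the training sample only."""
    return {col: (train[col].min(), train[col].max()) for col in columns}


def apply_min_max(frame, params):
    """Add '<col>.norm' columns using previously learned (min, max) pairs."""
    out = frame.copy()
    for col, (lo, hi) in params.items():
        out[col + ".norm"] = (out[col] - lo) / (hi - lo)
    return out


# Toy data (hypothetical values), standing in for the train and test samples.
toy_train = pd.DataFrame({"age": [20, 40, 60]})
toy_test = pd.DataFrame({"age": [30, 70]})

params = fit_min_max(toy_train, ["age"])
toy_train_norm = apply_min_max(toy_train, params)
toy_test_norm = apply_min_max(toy_test, params)  # values may fall outside [0, 1]
print(toy_test_norm)
```

Because the parameters come from the training sample alone, values in other subsets can land outside [0, 1]; that is expected and avoids leaking test information into the scaling.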

#### 1.1.2 Data split

As we want to analyze only campaign data, the control group is going to be filtered out.

In [8]:
control = data_norm[data_norm['test_control_flag'] == 'control group']
campaign = data_norm[data_norm['test_control_flag'] != 'control group']
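Although the control group is set aside for modelling, comparing subscription rates between the two groups gives a quick sense of the campaign's average effect. The sketch below runs on a toy frame; the column names and group labels follow the dataset, while the values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): four contacted and four non-contacted customers.
toy = pd.DataFrame({
    "test_control_flag": ["campaign group"] * 4 + ["control group"] * 4,
    "y": ["yes", "yes", "no", "no", "yes", "no", "no", "no"],
})

# Share of 'yes' outcomes within each group.
subscribed = toy["y"].eq("yes")
rates = subscribed.groupby(toy["test_control_flag"]).mean()

# Average gain from being contacted: campaign rate minus control rate.
uplift = rates["campaign group"] - rates["control group"]
print(rates)
print(f"average uplift: {uplift:.2f}")
```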

- campaign_train - main exploratory and model-building sample
- campaign_test - "blind" sample, used ONLY for final model performance evaluation

In [9]:
#Campaign
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(campaign.drop(['y'], axis=1), campaign['y'],test_size=0.30, random_state=1206, 
                                                 stratify=campaign['y'])

In [10]:
#Checking the shapes
print("X_train shape :",X_train.shape)
print("Y_train shape :",y_train.shape)
print("X_test shape :",X_test.shape)
print("Y_test shape :",y_test.shape)

X_train shape : (11533, 28)
Y_train shape : (11533,)
X_test shape : (4943, 28)
Y_test shape : (4943,)


### Feature Selection

In [11]:
from sklearn.feature_selection import SelectKBest #Feature Selector
from sklearn.feature_selection import f_classif #ANOVA F-statistic between each numeric feature and the class label

In [17]:
#Feature Selection
X=X_train
Y=y_train
bestfeatures = SelectKBest(score_func=f_classif, k='all')
fit = bestfeatures.fit(X,Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Feature','Score']  #naming the dataframe columns

#Visualize the feature scores
fig, ax=plt.subplots(figsize=(7,7))
plot=sns.barplot(data=featureScores, x='Score', y='Feature', palette='viridis',linewidth=0.5, saturation=2, orient='h')
Plotter(plot, 'Score', 'Feature', legend=False, save=True, save_name='Feature Importance.png')#Plotter function for aesthetics
plot

ValueError: could not convert string to float: 'services'
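The `ValueError` above comes from `f_classif` receiving string-valued columns such as `job`. One possible workaround, sketched here on a toy frame, is to score only the numeric columns; encoding the categorical columns first (for example with `pd.get_dummies`) would be an alternative. The toy data and variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data (hypothetical): one informative numeric column, one noisy numeric
# column, and one string column that f_classif cannot handle directly.
toy_X = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "campaign": [1, 3, 2, 1, 2, 3, 1, 2],
    "job": ["services", "admin.", "services", "admin.",
            "services", "admin.", "services", "admin."],
})
toy_y = pd.Series(["no", "no", "no", "no", "yes", "yes", "yes", "yes"])

# Keep only numeric columns before computing the ANOVA F-statistic.
numeric_X = toy_X.select_dtypes(include="number")

selector = SelectKBest(score_func=f_classif, k="all").fit(numeric_X, toy_y)
feature_scores = (
    pd.DataFrame({"Feature": numeric_X.columns, "Score": selector.scores_})
    .sort_values("Score", ascending=False)
    .reset_index(drop=True)
)
print(feature_scores)
```

Dropping the string columns loses whatever signal they carry, so this is only a first pass for ranking the numeric features.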

In [15]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11533 entries, 785 to 3187
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           11533 non-null  int64  
 1   age                  11533 non-null  int64  
 2   job                  11533 non-null  object 
 3   marital              11533 non-null  object 
 4   education            11533 non-null  object 
 5   default              11533 non-null  object 
 6   housing              11533 non-null  object 
 7   loan                 11533 non-null  object 
 8   contact              11533 non-null  object 
 9   month                11533 non-null  object 
 10  day_of_week          11533 non-null  object 
 11  duration             11533 non-null  float64
 12  campaign             11533 non-null  float64
 13  pdays                11533 non-null  int64  
 14  previous             11533 non-null  int64  
 15  poutcome             11533 non-null