# Bank marketing use case | Introduction

## 0. Setup

In [None]:
!pip install scikit-learn=

In [1]:
import pandas as pd
import numpy as np
import pickle

In [2]:
import sys
sys.path.append("..")

In [3]:
import warnings
warnings.filterwarnings("ignore")

## 1. Use case

### Introduction and business goal

Throughout the Monitoring Machine Learning Models in Python class, we will use a freely adapted version of the `Bank Marketing` dataset (the original version is available [here](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)).

This exercise will familiarize you with the use case and the data we are using.

The `Bank Marketing` dataset relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit.

As a data team, you were asked to identify the customers who are most likely to subscribe to a new term deposit. A well-targeted customer is expected to earn the company a gross revenue of `$70`. Each phone call costs the bank `$5`. The business team wants to minimise the cost of phone call marketing, and therefore wants about `70%` of the calls to result in a subscription. Moreover, the bank has a capacity of `300` calls per month.
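
To make these figures concrete, the sketch below computes the expected profit of a monthly campaign from the numbers above. The function `campaign_profit` and the constant names are illustrative and not part of the course material.

```python
REVENUE_PER_SUBSCRIPTION = 70  # gross revenue per well-targeted customer ($)
COST_PER_CALL = 5              # cost of one phone call ($)
MONTHLY_CAPACITY = 300         # maximum number of calls per month

def campaign_profit(n_calls, precision):
    """Expected profit when a share `precision` of the calls ends in a subscription."""
    n_calls = min(n_calls, MONTHLY_CAPACITY)  # the bank cannot exceed its capacity
    subscriptions = n_calls * precision
    return REVENUE_PER_SUBSCRIPTION * subscriptions - COST_PER_CALL * n_calls

# Precision at which revenue exactly covers the calling costs
break_even_precision = COST_PER_CALL / REVENUE_PER_SUBSCRIPTION

print(campaign_profit(300, 0.70))
print(round(break_even_precision, 3))
```

At full capacity and the 70% target, this corresponds to 70 x 210 - 5 x 300 = $13,200 per month, while any precision above roughly 7% already covers the cost of the calls.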

### The data

The datasets are available in the working directory, in the `data` folder. For each month, we have collected a list of potential customers we'd like to reach out to. Let's have a look at the dataset for January.

In [4]:
jan = pd.read_csv('../data/predict/jan-data.csv')
jan.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,id,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
0,45,services,divorced,high.school,no,no,no,14286,telephone,may,fri,8,999,0,nonexistent,1.1,93.994,-36.4,4.864,5191.0
1,41,blue-collar,married,basic.9y,unknown,no,no,22110,telephone,jun,fri,6,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1
2,37,housemaid,married,university.degree,no,yes,no,31123,telephone,nov,wed,1,999,0,nonexistent,-0.1,93.2,-42.0,4.286,5195.8
3,58,management,single,university.degree,no,no,yes,21491,cellular,may,fri,1,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1
4,38,technician,married,professional.course,no,no,no,30064,cellular,aug,mon,1,999,1,failure,-2.9,92.201,-31.4,0.884,5076.2


For your information, here is the description of the fields of the dataset:

1. `age` (numeric)
2. `job` : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
3. `marital` : marital status (categorical: “divorced”, “married”, “single”, “unknown”)
4. `education` : (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
5. `default`: has credit in default? (categorical: “no”, “yes”, “unknown”)
6. `housing`: has housing loan? (categorical: “no”, “yes”, “unknown”)
7. `loan`: has personal loan? (categorical: “no”, “yes”, “unknown”)
8. `contact`: contact communication type (categorical: “cellular”, “telephone”)
9. `month`: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10. `day_of_week`: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
11. `campaign`: number of contacts performed during this campaign and for this client (numeric, includes last contact)
12. `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
13. `previous`: number of contacts performed before this campaign and for this client (numeric)
14. `poutcome`: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
15. `emp_var_rate`: employment variation rate (numeric)
16. `cons_price_idx`: consumer price index (numeric)
17. `cons_conf_idx`: consumer confidence index (numeric)
18. `euribor3m`: euribor 3 month rate (numeric)
19. `nr_employed`: number of employees (numeric)
20. `id`: identifier of the potential customer

### Data preparation

To meet the model's requirements, the data has to be prepared.

We will:
- Create a new `Basic` category in `education`, grouping the `basic.4y`, `basic.6y` and `basic.9y` levels
- Dummify (one-hot encode) the categorical fields
- Keep only the features the model needs


In [5]:

# Group the three "basic" education levels into a single 'Basic' category
jan['education'] = jan['education'].replace(
    {'basic.9y': 'Basic', 'basic.6y': 'Basic', 'basic.4y': 'Basic'})

# One-hot encode the categorical columns
cat = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
jan_dummified = pd.get_dummies(jan, columns=cat)

# Features expected by the model
features = ['euribor3m', 'job_blue-collar', 'job_housemaid', 'marital_unknown',
            'month_apr', 'month_aug', 'month_jul', 'month_jun', 'month_mar',
            'month_may', 'month_nov', 'month_oct', 'poutcome_success']

# Copy so that columns can be added later without modifying a view of jan_dummified
jan_final = jan_dummified[features].copy()
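
One caveat with `pd.get_dummies`: it only creates columns for categories that are present in the data. A month without any `marital == 'unknown'` rows, for instance, would lack `marital_unknown`, and selecting the features would raise a `KeyError`. A minimal sketch of one way to guard against this, using `DataFrame.reindex` on a toy frame (the toy data and the shortened `features` list are illustrative):

```python
import pandas as pd

# Toy data: no row has marital == 'unknown'
toy = pd.DataFrame({
    'euribor3m': [4.864, 1.313],
    'marital': ['married', 'single'],
    'poutcome': ['nonexistent', 'success'],
})
features = ['euribor3m', 'marital_unknown', 'poutcome_success']

toy_dummified = pd.get_dummies(toy, columns=['marital', 'poutcome'])

# reindex keeps the requested columns in order and fills absent ones with 0
toy_final = toy_dummified.reindex(columns=features, fill_value=0)
print(toy_final)
```

Whether such a guard is needed here depends on how the monthly files were built, so treat it as an optional safety net.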



In [6]:
jan_dummified

Unnamed: 0,age,id,campaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,45,14286,8,999,0,1.1,93.994,-36.4,4.864,5191.0,...,0,0,1,0,0,0,0,0,1,0
1,41,22110,6,999,0,1.4,94.465,-41.8,4.967,5228.1,...,0,0,1,0,0,0,0,0,1,0
2,37,31123,1,999,0,-0.1,93.200,-42.0,4.286,5195.8,...,0,0,0,0,0,0,1,0,1,0
3,58,21491,1,999,0,-1.8,92.893,-46.2,1.313,5099.1,...,0,0,1,0,0,0,0,0,1,0
4,38,30064,1,999,1,-2.9,92.201,-31.4,0.884,5076.2,...,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6389,25,13774,2,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,0,0,1,0,0,1,0
6390,46,23834,2,999,0,-1.8,93.075,-47.1,1.410,5099.1,...,0,0,0,0,1,0,0,0,1,0
6391,56,29486,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,1,0,0,0,0,1,0
6392,26,38744,1,999,0,1.4,93.918,-42.7,4.962,5228.1,...,0,0,0,0,0,0,1,0,1,0


In [7]:
jan_final.head()

Unnamed: 0,euribor3m,job_blue-collar,job_housemaid,marital_unknown,month_apr,month_aug,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,poutcome_success
0,4.864,0,0,0,0,0,0,0,0,1,0,0,0
1,4.967,1,0,0,0,0,0,1,0,0,0,0,0
2,4.286,0,1,0,0,0,0,0,0,0,1,0,0
3,1.313,0,0,0,0,0,0,0,0,1,0,0,0
4,0.884,0,0,0,0,1,0,0,0,0,0,0,0


### The model

Your team evaluated several models and settled on a classification algorithm, a logistic regression implemented with scikit-learn. The model is available in the workspace as a pickle file.

We will run the model on the January dataset.

In [8]:
# Load the pickled model, closing the file once it has been read
with open('../models/model_log.cav', 'rb') as f:
    model = pickle.load(f)
predictions = model.predict(jan_final)
jan_final['id'] = jan['id']
jan_final['prediction'] = predictions
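
For context, a pickle file like the one loaded above is typically produced by serialising a fitted estimator. The sketch below trains a toy logistic regression on synthetic data and round-trips it through `pickle` in memory. The data and variable names are illustrative and unrelated to the course model. Pickles should be loaded with the same scikit-learn version that wrote them, and only from trusted sources.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable-ish data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(X, y)

# Serialise the fitted model to bytes, then restore it
payload = pickle.dumps(toy_model)
restored = pickle.loads(payload)

# The restored model should give the same predictions as the original
same = np.array_equal(toy_model.predict(X), restored.predict(X))
print(same)
```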

In [9]:
jan_final.head()

Unnamed: 0,euribor3m,job_blue-collar,job_housemaid,marital_unknown,month_apr,month_aug,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,poutcome_success,id,prediction
0,4.864,0,0,0,0,0,0,0,0,1,0,0,0,14286,0
1,4.967,1,0,0,0,0,0,1,0,0,0,0,0,22110,0
2,4.286,0,1,0,0,0,0,0,0,0,1,0,0,31123,0
3,1.313,0,0,0,0,0,0,0,0,1,0,0,0,21491,0
4,0.884,0,0,0,0,1,0,0,0,0,0,0,0,30064,0


### Performance of the model

Now it's your turn to manipulate the data. 

Only the potential customers for whom the model returned 1 were called, so we want to evaluate the performance of the model: was it able to correctly identify the people who subscribed to the offer?

The goal of this exercise is to evaluate the model in two ways, by comparing its predictions with what actually happened. Here is some useful information:

- The real outcomes are available in the `data/real/jan-data.csv` file, in the `y` column
- We want the precision of the predictions; use the corresponding scikit-learn function
- We also need to evaluate the business result of the model: how much did the model earn for the bank?
- If more than 300 calls are foreseen, we will randomly select 300 prospects, in line with the bank's monthly capacity.

*Hint:* A proposed solution is available in `solution path`
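
As a reminder, precision is the share of predicted positives that are actually positive: TP / (TP + FP). The toy example below, with made-up labels, computes it by hand and with scikit-learn's `precision_score`.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels: 1 = subscribed (y_true) or predicted to subscribe (y_pred)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# Calls that were predicted positive and did end in a subscription
true_positives = int(np.sum((y_pred == 1) & (y_true == 1)))
# All calls the model would have triggered
predicted_positives = int(np.sum(y_pred == 1))
manual_precision = true_positives / predicted_positives

# pos_label tells scikit-learn which class counts as "positive"
sklearn_precision = precision_score(y_true, y_pred, pos_label=1)
print(manual_precision, sklearn_precision)
```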


In [10]:
##Solution
from sklearn.metrics import precision_score

def model_performance(predicted_dataset, month):
    # Load the observed outcomes and align them with the predictions on `id`
    real_data = pd.read_csv('../data/real/' + month + '-data.csv')[['id', 'y']]
    data_revenues = pd.merge(predicted_dataset, real_data, on='id')[['id', 'prediction', 'y']]

    # Precision: share of predicted subscribers who actually subscribed
    precision = precision_score(data_revenues['y'], data_revenues['prediction'], pos_label=1)
    print('The precision of the model in {} was {}'.format(month, round(precision, 2)))

    P = int(data_revenues['prediction'].sum())  # number of calls foreseen
    TP = int(((data_revenues['prediction'] == 1) & (data_revenues['y'] == 1)).sum())  # successful calls

    # Above the 300-call capacity, a random sample of 300 prospects is called,
    # so expected revenues and costs are scaled by 300 / P
    factor = 300 / P if P > 300 else 1
    revenues = 70 * TP * factor
    costs = 5 * P * factor
    profit = revenues - costs
    print('This results in a profit of ${}'.format(round(profit)))


In [11]:
model_performance(jan_final,'jan')

The precision of the model in jan was 0.68
This results in a profit of $8075


This is close to the `70%` target set by management.
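
The capacity rule can also be applied literally, by drawing a random sample of prospects instead of scaling expected values. Below is a minimal sketch with `DataFrame.sample` on synthetic data; `cap_calls` is a hypothetical helper, and the synthetic frame stands in for the prospects the model flagged with a prediction of 1.

```python
import numpy as np
import pandas as pd

def cap_calls(prospects, capacity=300, seed=0):
    """Return at most `capacity` rows, sampled at random when there are more."""
    if len(prospects) <= capacity:
        return prospects
    return prospects.sample(n=capacity, random_state=seed)

# Synthetic prospects, all assumed to have been flagged by the model
rng = np.random.default_rng(42)
prospects = pd.DataFrame({
    'id': np.arange(500),
    'y': rng.integers(0, 2, size=500),  # made-up outcomes: 1 = subscribed
})

called = cap_calls(prospects)
# $70 per subscription, $5 per call actually made
profit = 70 * int(called['y'].sum()) - 5 * len(called)
print(len(called), profit)
```

With sampling, the profit varies with the random draw, whereas the scaling approach reports its expected value.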

### Well done!

You have just finished the first exercise. We've created a function to analyse the performance of the model running in production. In the next exercise, we will discover what may affect and decrease this performance. 