# Bank marketing use case | Introduction

## 0. Setup

In [None]:
import pandas as pd
import numpy as np
import pickle

In [None]:
import sys
sys.path.append("..")

In [None]:
import warnings
warnings.filterwarnings("ignore")

## 1. Use case

### Introduction and business goal

Throughout the Monitoring Machine Learning Models in Python class, we will use a freely adapted version of the `Bank Marketing` dataset (you can find the original version [here](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)). The dataset has been modified to fit our example.

This exercise will familiarize you with the use case and the data we are using.

The `Bank Marketing` dataset relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether or not a client will subscribe to a term deposit.

As a data team, you were asked to identify the customers who are most likely to subscribe to a new term deposit. A well-targeted customer is expected to bring the company a gross revenue of `$70`. Each phone call costs the bank `$5`. The business team wants to minimise the phone call marketing costs, and therefore requires that:
- about `70%` of the calls result in a subscription,
- the profit (gross revenue - costs) resulting from the campaign is at least `$6000`/month.

Moreover, the bank has a capacity of `300` calls per month.  
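
To make these figures concrete, here is a minimal sketch of the profit arithmetic. The function and constant names are illustrative assumptions, not part of the course material, and the scenario (full capacity used, exactly 70% of the calls succeeding) is hypothetical.

```python
import math

REVENUE_PER_SUBSCRIPTION = 70  # gross revenue of a well-targeted customer, in $
COST_PER_CALL = 5              # cost of one phone call, in $
MAX_CALLS = 300                # monthly call capacity
PROFIT_TARGET = 6000           # minimum monthly profit, in $


def campaign_profit(n_calls, n_subscriptions):
    """Profit = gross revenue - call costs."""
    return n_subscriptions * REVENUE_PER_SUBSCRIPTION - n_calls * COST_PER_CALL


# Hypothetical month: all 300 calls are made and 70% of them succeed
n_subscriptions = MAX_CALLS * 70 // 100
profit_at_capacity = campaign_profit(MAX_CALLS, n_subscriptions)
print(profit_at_capacity)  # 13200

# Fewest subscriptions that reach the profit target when all 300 calls are made
min_subscriptions = math.ceil(
    (PROFIT_TARGET + MAX_CALLS * COST_PER_CALL) / REVENUE_PER_SUBSCRIPTION
)
print(min_subscriptions)  # 108
```

Under these assumptions the two targets are not equivalent: at full capacity the profit target alone already holds with 108 subscriptions out of 300 calls, so both have to be checked separately.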

### The data

The datasets are available in the working directory, in the `data` folder. For each month, we have collected a list of potential customers we would like to reach out to. Let's have a look at the dataset for January:

In [None]:
jan = pd.read_csv('../data/predict/jan-data.csv')
jan.head()

For your information, here is the description of the fields of the dataset:

1. `age` (numeric)
2. `job` : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
3. `marital` : marital status (categorical: “divorced”, “married”, “single”, “unknown”)
4. `education` : (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
5. `default`: has credit in default? (categorical: “no”, “yes”, “unknown”)
6. `housing`: has housing loan? (categorical: “no”, “yes”, “unknown”)
7. `loan`: has personal loan? (categorical: “no”, “yes”, “unknown”)
8. `contact`: contact communication type (categorical: “cellular”, “telephone”)
9. `month`: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10. `day_of_week`: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
11. `campaign`: number of contacts performed during this campaign and for this client (numeric, includes last contact)
12. `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
13. `previous`: number of contacts performed before this campaign and for this client (numeric)
14. `poutcome`: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
15. `emp.var.rate`: employment variation rate (numeric)
16. `cons.price.idx`: consumer price index (numeric)
17. `cons.conf.idx`: consumer confidence index (numeric)
18. `euribor3m`: euribor 3 month rate (numeric)
19. `nr.employed`: number of employees (numeric)
20. `id`: the identifier of the potential customer

### Data preparation

In order to meet the model's requirements, the data has to be prepared. 

We will:
- Group the three `basic.*` levels of `education` into a single `Basic` category
- Dummify (one-hot encode) the categorical data fields
- Keep only the features the model needs.


In [None]:

# Group the three "basic.*" education levels into a single "Basic" category
jan['education'] = jan['education'].replace(['basic.4y', 'basic.6y', 'basic.9y'], 'Basic')

# Categorical columns to one-hot encode
cat = ['job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'poutcome']

jan_dummified = pd.get_dummies(jan, columns=cat)

# Features expected by the model, in the order used here
features = ['euribor3m', 'job_blue-collar', 'job_housemaid', 'marital_unknown',
            'month_apr', 'month_aug', 'month_jul', 'month_jun', 'month_mar',
            'month_may', 'month_nov', 'month_oct', 'poutcome_success']

# Copy so that columns can be added later without writing to a slice of jan_dummified
jan_final = jan_dummified[features].copy()

In [None]:
jan_final.head()
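
One thing to keep in mind with this preparation: `pd.get_dummies` only creates columns for the categories that actually occur in the data, so selecting a fixed feature list raises a `KeyError` if a category is absent from a given file. The sketch below shows one possible way to guard against this with `DataFrame.reindex`, on made-up rows and a shortened feature list (both are illustrative assumptions, not the real January data).

```python
import pandas as pd

# Made-up rows standing in for one month of data (not the real file)
toy = pd.DataFrame({
    "euribor3m": [4.86, 1.31, 0.88],
    "job": ["blue-collar", "admin", "housemaid"],
    "month": ["may", "may", "nov"],
    "poutcome": ["nonexistent", "success", "failure"],
})

# Shortened, illustrative feature list; the notebook's list is longer
toy_features = ["euribor3m", "job_blue-collar", "job_housemaid",
                "month_apr", "month_may", "month_nov", "poutcome_success"]

toy_dummified = pd.get_dummies(toy, columns=["job", "month", "poutcome"])

# "month_apr" never occurs in toy, so toy_dummified[toy_features] would raise a KeyError.
# reindex adds the missing column filled with 0 and drops the columns not listed.
toy_final = toy_dummified.reindex(columns=toy_features, fill_value=0)

print(toy_final.columns.tolist())
print(int(toy_final["month_apr"].sum()))  # 0
```

Whether this guard is needed depends on the contents of each monthly file; with the January data used here, the direct selection may well be sufficient.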

### The model

Your team has compared several models and decided to go for a classification algorithm, a logistic regression implemented with the scikit-learn library. The model is available in the workspace as a pickle file.

We will run the model on the January dataset.

In [None]:
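# Load the pickled scikit-learn model; the with-block closes the file afterwards
with open('../models/model_log.cav', 'rb') as f:
    model = pickle.load(f)
predictions = model.predict(jan_final)
jan_final['id'] = jan['id']
# Assign the array directly so the predictions line up row by row with jan_final
jan_final['prediction'] = predictions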

In [None]:
jan_final.head()

In [None]:
jan_final[jan_final.prediction==1]
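
The load-and-predict pattern above can be tried without the workspace file. The following sketch trains a toy logistic regression on made-up data, sends it through an in-memory pickle round trip, and compares `predict` with `predict_proba`. All names and data here are illustrative assumptions; the real model was trained elsewhere on the bank data.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: one feature, label 1 when the feature is large
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Round trip through pickle, in memory instead of a file
restored = pickle.loads(pickle.dumps(toy_model))

X_new = np.array([[0.5], [4.5]])
labels = restored.predict(X_new)             # hard 0/1 labels
proba = restored.predict_proba(X_new)[:, 1]  # estimated probability of class 1

# For a binary logistic regression, predict returns 1 when this probability exceeds 0.5
print(labels)
print(proba)
```

The `prediction` column used in this notebook holds the hard labels, which is why only the rows with `prediction == 1` are treated as customers to call.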

### Performance of the model

Now it's your turn to manipulate the data. 

As only the potential customers for whom the model has returned 1 were called, we want to evaluate the performance of the model: Was the model able to correctly identify people who subscribed to the offer?

The goal of this exercise is to evaluate the performance of the model in two ways, by comparing the model's predictions with what actually happened.

As a reminder:
- A well-targeted customer = gross revenue of `$70`
- Each phone call costs `$5`

The business team wants to minimise the phone call marketing costs, and therefore requires that:

- about `70%` of the calls result in a subscription,
- the profit (gross revenue - costs) resulting from the campaign is at least `$6000`/month.

Moreover, the bank has a capacity of `300` calls per month.

Here is some valuable information:

- The actual outcomes are available in the `data/real/jan-data.csv` file, in the `y` column.
- We want to obtain the precision of the predictions; use the corresponding scikit-learn function to do so.
- We also need to evaluate the business result of the model: how much did the model earn for the bank?
- If the model selects more than 300 prospects, we will randomly select 300 of them.

*Hint:* A proposed solution is available in the Solutions folder of the repo.

In short, write a function that returns the precision and the resulting profit of the model for January.
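
To show the shape such a function could take, here is a minimal sketch on made-up data. It is not the proposed solution from the Solutions folder: the function name, its parameters, and the assumption that `y` is coded as 0/1 are all hypothetical. It uses `precision_score` from `sklearn.metrics` and `DataFrame.sample` for the random selection.

```python
import pandas as pd
from sklearn.metrics import precision_score


def evaluate_campaign(predicted, actual, max_calls=300, revenue=70,
                      call_cost=5, seed=0):
    """Return (precision, profit) for the customers that were called.

    predicted: DataFrame with columns id, prediction
    actual:    DataFrame with columns id, y (assumed here to be coded 0/1)
    """
    # Only the customers predicted as 1 are called
    called = predicted[predicted["prediction"] == 1]
    # More prospects than the call capacity: draw a random sample
    if len(called) > max_calls:
        called = called.sample(n=max_calls, random_state=seed)
    # Attach the actual outcome of each call
    called = called.merge(actual[["id", "y"]], on="id", how="inner")
    precision = precision_score(called["y"], called["prediction"])
    profit = called["y"].sum() * revenue - len(called) * call_cost
    return precision, profit


# Made-up example: four customers are called, three of them subscribe
toy_predicted = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                              "prediction": [1, 1, 0, 1, 1]})
toy_actual = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                           "y": [1, 0, 1, 1, 1]})

precision, profit = evaluate_campaign(toy_predicted, toy_actual)
print(precision, profit)  # 0.75 190
```

On the real files, `predicted` would correspond to the `id` and `prediction` columns of `jan_final`, and `actual` to the contents of the file holding the `y` column.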


In [None]:
# Write your code here



Is this in line with the `70%` and `$6000` targets that management has set?

### Well done!

You have just finished the first exercise. We've created a function to analyse the performance of the model running in production. In the next exercise, we will discover what can degrade this performance.