# Bank marketing use case | What can go wrong?

## 0. Setup

In [None]:
import sys
sys.path.append("..")

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
from utils import *

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load the January reference data and the trained model
jan = pd.read_csv('../data/predict/jan-data.csv')
jan_final = data_prep(jan)
with open('../models/model_log.cav', 'rb') as f:
    model = pickle.load(f)

## 1. Introduction

In the previous exercise, we created a function to evaluate the performance of the model. This function is now available in your working directory as `model_performance`.

In this exercise, we will apply the model to three datasets (February, March and April), evaluate its performance on each, and analyse the cause of any failure we find.

## 2. Case 1: February

The first failure that we can explore is a change of data definition. To do so, we will apply the model to the data from February.

In [None]:
feb = pd.read_csv('../data/predict/feb-data.csv')
feb.head()

In [None]:
feb_final = data_prep(feb)

In [None]:
feb_final

In [None]:
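# Score the February data, then attach ids and predictions for evaluation
predictions = model.predict(feb_final)
feb_final['id'] = feb['id']
# Assign the array directly so values are matched by position, not by index label
feb_final['prediction'] = predictions
model_performance(feb_final, 'feb')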

As you can see, the precision of the model was degraded, which in turn affects the maximum profit.

We will now investigate the causes of that failure. 

**Exercise:** Describe the data in the `jan` and `feb` datasets and compare them.

In [None]:
#Solution
jan.describe()

In [None]:
feb.describe()

In [None]:
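# Put both months in one frame with explicit labels so each gets its own box
sns.boxplot(data=pd.DataFrame({'jan': jan['euribor3m'], 'feb': feb['euribor3m']}))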

Congratulations! You've just discovered a first type of failure. 

It seems that the values of the `euribor3m` feature have been divided by 100. The model now receives this feature on a different scale from the one it was trained on, which results in a loss of precision.
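
A change of scale like this one can also be detected automatically, without reading `describe()` by eye. The sketch below is one possible approach, not part of the course material: the helper `scale_shift_report`, its `threshold` parameter and the `jan_demo` / `feb_demo` frames are hypothetical names, and the data is synthetic. It compares the median of each numeric column between a reference month and a new month and flags columns whose ratio is far from 1.

```python
import numpy as np
import pandas as pd

def scale_shift_report(reference, current, threshold=10.0):
    """Flag numeric columns whose median changed by more than `threshold` times."""
    numeric_cols = reference.select_dtypes("number").columns
    shared = numeric_cols.intersection(current.columns)
    rows = []
    for col in shared:
        ref_med = reference[col].abs().median()
        cur_med = current[col].abs().median()
        if ref_med == 0 or cur_med == 0:
            continue  # the ratio is undefined, skip this column
        ratio = cur_med / ref_med
        if ratio > threshold or ratio < 1 / threshold:
            rows.append({"column": col, "ratio": ratio})
    return pd.DataFrame(rows, columns=["column", "ratio"])

# Synthetic stand-ins for `jan` and `feb` (illustrative values only)
rng = np.random.default_rng(0)
jan_demo = pd.DataFrame({
    "euribor3m": rng.uniform(1.0, 5.0, 500),
    "age": rng.integers(18, 90, 500),
})
feb_demo = jan_demo.copy()
feb_demo["euribor3m"] = feb_demo["euribor3m"] / 100  # simulate the unit change

report = scale_shift_report(jan_demo, feb_demo)
print(report)
```

With the real data, you would call `scale_shift_report(jan, feb)`. The threshold of 10 is an arbitrary choice for this sketch and should be tuned to how much natural month-to-month variation you expect.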

## 3. Case 2: March



In [None]:
mar = pd.read_csv('../data/predict/mar-data.csv')
mar.head()

In [None]:
mar_final = data_prep(mar)

In [None]:
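# Score the March data, then attach ids and predictions for evaluation
predictions = model.predict(mar_final)
mar_final['id'] = mar['id']
# Assign the array directly so values are matched by position, not by index label
mar_final['prediction'] = predictions
model_performance(mar_final, 'mar')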

This month, the model was not able to produce results. Why is that the case?

**Exercise:** Compare the schema of the `mar` dataset with that of `jan`.

In [None]:
#Solution

# Column names present in January but missing from March
set_jan = set(jan.columns)
set_mar = set(mar.columns)

diff = set_jan - set_mar
print(diff)

This is another common cause of model failure: a feature is no longer available, because it was renamed, because its format changed, or simply because it was removed from the database.
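
This kind of failure can be caught before the model is called at all. The following is a minimal sketch, assuming that the January columns define the expected schema; `check_schema` is a hypothetical helper and the demo frames use made-up columns, so the renamed column here is only an illustration and not necessarily the one that changed in March.

```python
import pandas as pd

def check_schema(reference, current):
    """Return the columns missing from, and unexpected in, `current`."""
    expected = set(reference.columns)
    received = set(current.columns)
    return {
        "missing": sorted(expected - received),
        "unexpected": sorted(received - expected),
    }

# Synthetic stand-ins for `jan` and `mar` (column names chosen for illustration)
jan_demo = pd.DataFrame({"id": [1, 2], "age": [30, 41], "euribor3m": [4.8, 1.3]})
mar_demo = jan_demo.rename(columns={"euribor3m": "euribor_3m"})

schema_diff = check_schema(jan_demo, mar_demo)
print(schema_diff)
if schema_diff["missing"]:
    print("Do not score this batch, missing columns:", schema_diff["missing"])
```

Reporting the unexpected columns next to the missing ones makes a rename easy to spot, since the old name and the new name show up side by side.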

## 4. Case 3: April

In [None]:
apr = pd.read_csv('../data/predict/apr-data.csv')
apr.head()

In [None]:
apr_final = data_prep(apr)

In [None]:
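# Score the April data, then attach ids and predictions for evaluation
predictions = model.predict(apr_final)
apr_final['id'] = apr['id']
# Assign the array directly so values are matched by position, not by index label
apr_final['prediction'] = predictions
model_performance(apr_final, 'apr')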

**Exercise:** Compare the distribution of the column `poutcome` in `jan`, `feb` and `apr`. Is anything different?

In [None]:
#Solution
apr_poutcome = apr[apr['poutcome'] != 'nonexistent']
jan_poutcome = jan[jan['poutcome'] != 'nonexistent']
feb_poutcome = feb[feb['poutcome'] != 'nonexistent']

In [None]:
import matplotlib.pyplot as plt

# One bar chart per month, in chronological order
fig, axes = plt.subplots(1, 3, figsize=(18, 3), dpi=80)
months = [("January", jan_poutcome), ("February", feb_poutcome), ("April", apr_poutcome)]
for ax, (title, df) in zip(axes, months):
    # Count of each remaining `poutcome` category for that month
    df['poutcome'].value_counts().sort_index().plot.bar(ax=ax)
    ax.set_title(title)
plt.show()


As you can see, the data is skewed and the `success` category is underrepresented in the `poutcome` column. The model gives `poutcome_success` too much weight compared to the other variables. In April, however, fewer individuals with this attribute were in the dataset, so we lost a key feature of the model.

In [None]:
print(jan_final['poutcome_success'].skew())
print(feb_final['poutcome_success'].skew())
print(apr_final['poutcome_success'].skew())
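
Skewness summarises the imbalance in a single number, but the share of the `success` category itself is often easier to interpret. The sketch below is one way to compare that share between two months. The helper `category_share` and the demo frames are hypothetical, and the counts are invented for illustration; they are not the real January or April figures.

```python
import pandas as pd

def category_share(df, column, category):
    """Fraction of rows in which `column` equals `category`."""
    return (df[column] == category).mean()

# Synthetic stand-ins for `jan` and `apr` (invented counts)
jan_demo = pd.DataFrame(
    {"poutcome": ["nonexistent"] * 80 + ["failure"] * 10 + ["success"] * 10}
)
apr_demo = pd.DataFrame(
    {"poutcome": ["nonexistent"] * 88 + ["failure"] * 10 + ["success"] * 2}
)

jan_share = category_share(jan_demo, "poutcome", "success")
apr_share = category_share(apr_demo, "poutcome", "success")
# Relative change of the share with respect to the reference month
relative_change = (apr_share - jan_share) / jan_share
print(f"January: {jan_share:.2%}, April: {apr_share:.2%}, change: {relative_change:+.0%}")
```

With the real data, the same helper could be applied to `jan`, `feb` and `apr` directly, and a large relative drop could be used as a signal to review the model before trusting its predictions.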

### Well done!

You have discovered some of the many so-called `datastrophes` that may happen in your data pipelines. Now, let's see how we could prevent them.
