# The modelling process
This notebook documents our work on the bank marketing problem: predicting whether a contacted client ends up making a deposit (`y`).

👨‍💻: I've compiled a complete definition of our dataset. Here's everything I could gather from it:

|Variable|Also called|Description|Type|
|--------|-----------|-----------|----|
|age|||(numeric)|
|job||type of job| (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')|
|marital||marital status| (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)|
|education||| (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')|
|default||has credit in default?| (categorical: 'no','yes','unknown')|
|housing||has housing loan?| (categorical: 'no','yes','unknown')|
|loan||has personal loan?| (categorical: 'no','yes','unknown')|
|**RELATED WITH THE LAST CONTACT OF THE CURRENT CAMPAIGN:**|
|contact|comm_type|contact communication type| (categorical: 'cellular','telephone') |
|month|comm_month|last contact month of year| (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')|
|day_of_week|comm_day|last contact day of the week| (categorical: 'mon','tue','wed','thu','fri')|
|duration|comm_duration|last contact duration, in seconds| (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.|
|**OTHER ATTRIBUTES**|
|campaign|curr_n_contact|number of contacts performed during this campaign and for this client| (numeric, includes last contact)|
|pdays|days_since_last_campaign|number of days that passed by after the client was last contacted from a previous campaign| (numeric; 999 means client was not previously contacted)|
|previous|last_n_contact|number of contacts performed before this campaign and for this client| (numeric)|
|poutcome|last_outcome|outcome of the previous marketing campaign| (categorical: 'failure','nonexistent','success')|
|**SOCIAL AND ECONOMIC CONTEXT ATTRIBUTES**|
|emp.var.rate||employment variation rate - quarterly indicator| (numeric)|
|cons.price.idx||consumer price index - monthly indicator| (numeric)|
|cons.conf.idx||consumer confidence index - monthly indicator| (numeric)|
|euribor3m||euribor 3 month rate - daily indicator| (numeric)|
|nr.employed||number of employees - quarterly indicator| (numeric)|
|y|curr_outcome|Whether deposited|(categorical: 'yes', 'no')|


<hr/>
<p style="font-size: 1.3em; color: red;">START: Dependencies</p>

👨‍💻: All good notebooks start with the imports at the top...

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

Install the libraries below if they are missing, so the imports above can succeed.

_Note: rerun the previous cell after installing!_

In [None]:
! pip install pandas numpy matplotlib seaborn scikit-learn

<p style="font-size: 1.3em; color: red;">End of section: Dependencies. STOP HERE AND GO BACK TO INSTRUCTIONS</p>
<hr/>

<hr/>
<p style="font-size: 1.3em; color: red;">START: EDA</p>

## Let's start with EDA

<span style="background-color: yellow;">TODO: FIND ERROR(S) AND IDENTIFY IMPROVEMENT(S)</span>

In [None]:
# NOTE: this is an absolute path on the author's machine; adjust it to wherever the CSV lives on yours
df = pd.read_csv('/home/emilio/Repos/work/ivado/ml-powered-bank-marketing-solution/before_everything_there_were_notebooks/data/csv/bank_marketing_2008-05-01_to_2010-07-31.csv', sep=';')

# This is necessary because we had to rename the columns at some point...
df = df.rename(columns={
    "comm_type": "contact",
    "comm_month": "month",
    "comm_day": "day_of_week",
    "comm_duration": "duration",
    "curr_n_contact": "campaign",
    "days_since_last_campaign": "pdays",
    "last_n_contact": "previous",
    "last_outcome": "poutcome",
    "curr_outcome": "y",
})

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
# Let's check how pandas imported this dataframe
print(df.dtypes)

Numerical columns:

In [None]:
print([col for col in df.columns if df[col].dtype != 'object'])

Categorical columns:

In [None]:
print([col for col in df.columns if df[col].dtype == 'object'])

### General column information

In [None]:
for col_name in df.columns:
    col = df[col_name]
    if col.dtype == 'object':
        print('Column', col_name, 'has {} unique values. Values are:'.format(col.unique().shape[0]))
        print('  ', col.unique())
    else:
        print(col.describe())
    print('---')

### Some other checks

In [None]:
df['pdays'].describe()

### Checking for categorical values

In [None]:
df['job'].unique()

In [None]:
(df['y'] == 'yes').astype(int).sum()

In [None]:
df.y.value_counts()

In [None]:
sns.histplot(df.campaign)

### Examining duration

In [None]:
df[df.duration < 1]

In [None]:
df[df.duration == 0]

In [None]:
df.duration[df.duration <= 900].hist(bins=100)

### Examining previous contacts

In [None]:
df[df.previous > 0]

### Examining previous campaigns

TODO: Explain this: how can a client be flagged as **not contacted** (`pdays == 999`) and still have a **failure** as the previous outcome?

In [None]:
df[(df.pdays == 999) & (df.poutcome != 'nonexistent')]

In [None]:
df[(df.pdays == 999) & (df.poutcome == 'failure')]
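
A possible way to dig into the TODO above is to cross-tabulate the `pdays == 999` flag against `poutcome` and then look at `previous` on the contradictory rows. The sketch below runs on a made-up toy dataframe (the values are illustrative assumptions, not taken from the real data), so it only shows the shape of the check.

```python
import pandas as pd

# Made-up rows standing in for the real dataframe (illustration only).
toy = pd.DataFrame({
    "pdays": [999, 999, 999, 3, 6, 999],
    "previous": [0, 1, 0, 2, 1, 1],
    "poutcome": ["nonexistent", "failure", "nonexistent",
                 "success", "failure", "failure"],
})

# Label each row by whether pdays carries the 999 sentinel.
contacted_flag = (toy["pdays"] == 999).map({True: "pdays=999", False: "pdays<999"})

# Count previous-campaign outcomes within each group.
outcome_table = pd.crosstab(contacted_flag, toy["poutcome"])
print(outcome_table)

# Rows that claim "never contacted" yet have a recorded outcome.
suspicious = toy[(toy["pdays"] == 999) & (toy["poutcome"] != "nonexistent")]
print(suspicious["previous"].value_counts())
```

If `previous` turns out to be positive on every suspicious row of the real data, 999 would more plausibly mean "number of days not recorded" than "never contacted". That is a hypothesis to verify, not a finding.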

In [None]:
df[df.poutcome == 'nonexistent']

In [None]:
df[df.pdays == 999].poutcome.hist()

In [None]:
df[df.campaign == 0]

In [None]:
df[df.pdays == 999].campaign.describe()

In [None]:
df[df.pdays == 999]['y'].hist()

In [None]:
# Check whether the same person appears multiple times
person_cols = ["age", "job", "marital", "education", "default", "housing", "loan", "contact"]
df[df[person_cols].duplicated(keep=False)].sort_values(person_cols).head(20)

In [None]:
df.euribor3m.plot()

In [None]:
df['emp.var.rate'].plot()

In [None]:
df['nr.employed'].plot()

In [None]:
df['cons.conf.idx'].plot()

In [None]:
df['cons.price.idx'].plot()

<p style="font-size: 1.3em; color: red;">End of section: EDA. STOP HERE AND GO BACK TO INSTRUCTIONS</p>
<hr/>

<hr/>
<p style="font-size: 1.3em; color: red;">START: First Model, in EDA 🤔? </p>

# Let's try to build a model

👨‍💻: I think I understand the data enough to make a model at this point.

In [None]:
# Encode the target as 0/1
y = df.y.map({'yes': 1, 'no': 0})

In [None]:
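# A fixed seed and stratification keep the splits reproducible and the yes/no ratio stable
train_x, test_x, train_y, test_y = train_test_split(df.drop(columns='y'), y, train_size=0.8, random_state=42, stratify=y)
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, train_size=0.8, random_state=42, stratify=train_y)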

In [None]:
train_x.shape, train_y.shape, val_x.shape, val_y.shape

👨‍💻: Make sure we remove `duration` because it constitutes leakage: that information is only known after the call is made.

In [None]:
train_x = train_x.drop(columns='duration')
val_x = val_x.drop(columns='duration')
test_x = test_x.drop(columns='duration')

In [None]:
train_x

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler
scaler = StandardScaler()
scalerReal = RobustScaler()

In [None]:
age_scaler = scaler.fit(train_x.age.values.reshape(-1, 1))
age_scaler_real = scalerReal.fit(train_x.age.values.reshape(-1,1))

In [None]:
# TODO: Remember to comment this out
# train_x['has_been_contacted_in_previous'] = (train_x.pdays != 999).astype(int)

In [None]:
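# Replace the 999 sentinel in every split so train, validation, and test stay consistent
for split in (train_x, val_x, test_x):
    split['pdays'] = split['pdays'].replace({999: -1})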

In [None]:
train_x.describe()

👨‍💻: These variable names are pretty self-explanatory methinks. I also just _had_ to try `camelCase`, it's pretty.

In [None]:
age_scaler = StandardScaler()
campaign_scaler = StandardScaler()
pdays_scaler = StandardScaler()
previous_scaler = StandardScaler()
empVarRate_scaler = StandardScaler()
consPriceidx_scaler = StandardScaler()
consConfIdx_scaler = StandardScaler()
euribor3m_scaler = StandardScaler()
nrEmployed_scaler = StandardScaler()
has_been_contacted_in_previous_scaler = StandardScaler()

In [None]:
age_scaler.fit(train_x.age.values.reshape(-1,1))
campaign_scaler.fit(train_x.campaign.values.reshape(-1,1))
pdays_scaler.fit(train_x.pdays.values.reshape(-1,1))
previous_scaler.fit(train_x.previous.values.reshape(-1,1))
empVarRate_scaler.fit(train_x['emp.var.rate'].values.reshape(-1,1))
consPriceidx_scaler.fit(train_x['cons.price.idx'].values.reshape(-1,1))
consConfIdx_scaler.fit(train_x['cons.conf.idx'].values.reshape(-1,1))
euribor3m_scaler.fit(train_x.euribor3m.values.reshape(-1,1))
nrEmployed_scaler.fit(train_x['nr.employed'].values.reshape(-1,1))
# has_been_contacted_in_previous_scaler.fit(train_x.has_been_contacted_in_previous.values.reshape(-1,1))

👨‍💻: ~~All good notebooks start with the imports at the top...~~ If you think about it, importing later in the file is _actually_ an optimization

👨‍💻: `Pipeline` and `Transformers` are underrated 😍

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
# Added after a first pass through the notebook:
from sklearn.base import BaseEstimator, TransformerMixin
class HasBeenCalledBeforeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X=None, y=None):
        return self
    def transform(self, X, y=None):
        # Work on a copy so the caller's dataframe is not mutated
        X = X.copy()
        # Assumes pdays still carries the 999 "not contacted" sentinel
        X['has_been_called_before'] = (X.pdays != 999).astype(int)
        return X

In [None]:
transformerCategorical = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='error')),
])
transformerNumerical = Pipeline([
    # ('add_has_been_called_before', HasBeenCalledBeforeTransformer()),
    ('scale', StandardScaler()),
])
transformer = ColumnTransformer([
    ('num', transformerNumerical, train_x.select_dtypes(exclude='object').columns.values.tolist()),
    ('cat',  transformerCategorical, train_x.select_dtypes('object').columns.values.tolist()),
])

In [None]:
transformer
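
One way to get more out of `Pipeline` is to put the preprocessing and the classifier into a single object, so raw rows go in and predictions come out without calling `transform` by hand. The sketch below is an illustration on made-up toy data; the names `toy_x`, `toy_y`, `toy_transformer`, and `toy_model` are assumptions for this example and are not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up rows that mimic two columns of the dataset.
toy_x = pd.DataFrame({
    "age": [25, 40, 58, 33, 47, 61],
    "job": ["student", "admin.", "retired", "admin.", "services", "retired"],
})
toy_y = pd.Series([0, 0, 1, 0, 1, 1])

toy_transformer = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
])

# One object that preprocesses and predicts; `fit` learns the scaling
# and the encoding from the rows it is given, and nothing else.
toy_model = Pipeline([
    ("prep", toy_transformer),
    ("clf", DecisionTreeClassifier(random_state=0)),
])
toy_model.fit(toy_x, toy_y)

# Raw rows go straight in; no manual transform call is needed.
toy_pred = toy_model.predict(toy_x)
print(toy_pred)
```

Keeping everything in one pipeline also makes it harder to accidentally preprocess the training and evaluation splits differently.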

# First model for boss

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree_clf = DecisionTreeClassifier()

In [None]:
transformer.fit(train_x)

In [None]:
# transformer.transform(train_x).toarray()[0]

In [None]:
tree_clf.fit(transformer.transform(train_x), train_y)

In [None]:
from sklearn.metrics import RocCurveDisplay, accuracy_score

In [None]:
train_x.columns

In [None]:
# train_x.drop(columns='has_been_contacted_in_previous', inplace=True)

In [None]:
# First transformed test row (numpy is already imported at the top)
a = transformer.transform(test_x)[0]

In [None]:
test_y.values

In [None]:
tree_clf.predict_proba(transformer.transform(test_x))

**Metrics**

In [None]:
accuracy_score(test_y.values, tree_clf.predict(transformer.transform(test_x)))

👨‍💻: We should probably use AUC rather than accuracy...

In [None]:
RocCurveDisplay.from_predictions(test_y.values,tree_clf.predict_proba(transformer.transform(test_x))[:,1])
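
To see why accuracy alone can mislead when one class dominates (the `value_counts()` cell above shows how imbalanced `y` actually is), here is a small sketch with made-up labels. It compares `accuracy_score` and `roc_auc_score` for a hypothetical model that always answers "no"; the 9-to-1 ratio is an assumption for illustration, not the real class ratio.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: nine "no" (0) for every "yes" (1).
y_true = np.array([0] * 9 + [1])

# A useless model: always predicts "no", with the same score for everyone.
always_no_label = np.zeros_like(y_true)
always_no_score = np.full(len(y_true), 0.1)

acc = accuracy_score(y_true, always_no_label)
auc = roc_auc_score(y_true, always_no_score)
print(acc, auc)
```

The accuracy looks flattering because it mostly rewards guessing the majority class, while the AUC stays at chance level since the scores cannot rank a "yes" above a "no".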

👨‍💻: 😭

In [None]:
(train_y.values, tree_clf.predict(transformer.transform(train_x)))

In [None]:
(val_y.values, tree_clf.predict(transformer.transform(val_x)))

In [None]:
(test_y.values, tree_clf.predict(transformer.transform(test_x)))

<p style="font-size: 1.3em; color: red;">End of section: First Model, in EDA 🤔? STOP HERE AND GO BACK TO INSTRUCTIONS.</p>
<hr/>

<hr/>
<p style="font-size: 1.3em; color: red;">START: Iterative improvements™️</p>

# Iteration 2

👨‍💻: The previous model didn't work too well. Let's make a new one.

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()

In [None]:
mlp.fit(transformer.transform(train_x), train_y)

In [None]:
(val_y.values, mlp.predict(transformer.transform(val_x)))

In [None]:
# Compare iterations on the validation split; keep the test split for the final check
RocCurveDisplay.from_predictions(val_y.values, mlp.predict_proba(transformer.transform(val_x))[:,1])

# Iteration 3

In [None]:

numerical_pipeline = Pipeline([
    ('scaler', StandardScaler()),
])

categorical_pipeline = Pipeline(steps=[
    # TODO: raise when there are unknown values
    ('one_hot_encoding', OneHotEncoder(handle_unknown='ignore')),
])

# Alternatively, we could select the columns automatically using
# sklearn.compose.make_column_selector, but that assumes the data
# was loaded with the correct dtypes (which might not be the case).
# Note: The columns were copied from the output above
#       Don't forget to add it back
data_processor = ColumnTransformer([
    ('numerical_simple_scaler', numerical_pipeline, ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']),
    ('categorical_handler', categorical_pipeline, ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome'])
])
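
For reference, the `make_column_selector` alternative mentioned in the comment above could look roughly like this. It is a sketch on a made-up toy dataframe, and it relies on the assumption that every non-numeric column should be one-hot encoded; `auto_processor` and `picked` are names invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic three columns of the dataset.
toy = pd.DataFrame({
    "age": [25, 40, 58],
    "euribor3m": [4.8, 1.3, 0.9],
    "job": ["student", "admin.", "retired"],
})

# Columns are chosen by dtype at fit time instead of being listed by hand.
auto_processor = ColumnTransformer([
    ("numerical", StandardScaler(),
     make_column_selector(dtype_include=np.number)),
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
])
out = auto_processor.fit_transform(toy)

# After fitting, transformers_ records which columns each selector picked.
picked = {name: list(cols) for name, _, cols in auto_processor.transformers_}
print(picked)
```

Printing `picked` is a cheap way to confirm that the dtypes were read as expected before trusting the automatic selection.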

In [None]:
data_processor.fit(train_x)

In [None]:
new_2 = DecisionTreeClassifier()

In [None]:
new_2.fit(data_processor.transform(train_x), train_y)

In [None]:
(val_y.values, new_2.predict(data_processor.transform(val_x)))

In [None]:
RocCurveDisplay.from_predictions(val_y.values,new_2.predict_proba(data_processor.transform(val_x))[:,1])

# Iteration 4

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf_4 = RandomForestClassifier(n_estimators=20)

In [None]:
clf_4.fit(transformer.transform(train_x), train_y)

In [None]:
(val_y.values, clf_4.predict(transformer.transform(val_x)))

In [None]:
RocCurveDisplay.from_predictions(val_y.values,clf_4.predict_proba(transformer.transform(val_x))[:,1])

# Iteration 5

👨‍💻: Copy the code from the first iteration to try out a new idea. After all, we wouldn't want the variables from both iterations to interact, so it's actually safer to copy-paste.

👨‍💻: For this experiment, we'll cluster the users and then feed them through a boosted tree.

In [None]:
from sklearn.cluster import KMeans, Birch, MiniBatchKMeans, DBSCAN

In [None]:
train_x.head()

In [None]:
DBSCAN()

👨‍💻: 😶

In [None]:
display(train_x.columns)
display(train_x.select_dtypes('object').columns)

In [None]:
train_x.dtypes

In [None]:
person_info_cols_cat = ['job','marital','education','default', 'housing','loan', 'contact']
person_info_cols_num = ['age']

In [None]:
kmeans_transformer = ColumnTransformer([
    ('scaleAge', StandardScaler(), person_info_cols_num),
    ('onehot', OneHotEncoder(sparse_output=True), person_info_cols_cat),
], remainder='drop')
kmeans_pipeline = Pipeline([
    ('c', kmeans_transformer),
    # ('cluster', Birch()),
    ('cluster', KMeans(n_clusters=12)),
])

In [None]:
kmeans_pipeline.fit(train_x)

In [None]:
# kmeans_pipeline.transform(train_x).shape

In [None]:
train_x.columns.drop(person_info_cols_cat + person_info_cols_num)

In [None]:
num_cols_wo_customer = ['campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
cat_cols_wo_customer = ['month','day_of_week', 'poutcome']

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier,VotingClassifier
# classifier5 = VotingClassifier(estimators=[('a', DecisionTreeClassifier()), ('b', GradientBoostingClassifier())])
classifier5 = GradientBoostingClassifier()

In [None]:
transformerCategorical5 = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='error')),
])
transformerNumerical5 = Pipeline([
    # ('add_has_been_called_before', HasBeenCalledBeforeTransformer()),
    ('scale', StandardScaler()),
])
transformer5 = ColumnTransformer([
    ('cluster_customer', kmeans_pipeline, person_info_cols_cat + person_info_cols_num),
    # Use the pipelines defined for this iteration, not the ones copied from the first one
    ('num', transformerNumerical5, num_cols_wo_customer),
    ('cat',  transformerCategorical5, cat_cols_wo_customer),
])

pipeline_5 = Pipeline([
    ('data', transformer5),
    ('clf', classifier5),
])

In [None]:
pipeline_5.fit(train_x, train_y)

In [None]:
(val_y.values, pipeline_5.predict(val_x))

In [None]:
RocCurveDisplay.from_predictions(val_y.values,pipeline_5.decision_function(val_x))

👨‍💻: This looks promising!

<p style="font-size: 1.3em; color: red;">End of section: Iterative improvements™️. STOP HERE AND GO BACK TO INSTRUCTIONS</p>
<hr/>

# What now

We feel OK with the above result, so we decide to save the transformed dataframe for this model.

In [None]:
pipeline_5.named_steps.data.transform(train_x).shape