# Seminar - Bank Marketing Dataset

During this seminar we will analyse the [Bank Marketing dataset](https://archive.ics.uci.edu/dataset/222/bank+marketing) from the UCI Machine Learning Repository.

The goal is to fit two models, a very simple decision tree and a random forest, compare them via cross-validation, and then evaluate the best model on a holdout set.


In [1]:
import pandas as pd
import numpy as np
import zipfile
import urllib.request  # the request submodule must be imported explicitly
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree
import warnings

warnings.filterwarnings("ignore")

In [2]:
def plot_tree(model, features):
    plt.figure(figsize=(15, 10))
    tree.plot_tree(
        model,
        feature_names=features,
        proportion=True,
        precision=2,
        filled=True,
    )
    plt.show()

In [3]:
url = "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip"
urllib.request.urlretrieve(url, "bank_marketing.zip")

with zipfile.ZipFile("bank_marketing.zip", "r") as zip_ref:
    zip_ref.extractall("bank_marketing")

with zipfile.ZipFile("bank_marketing/bank.zip", "r") as zip_ref:
    zip_ref.extractall("bank_marketing")

In [4]:
df = pd.read_csv("bank_marketing/bank-full.csv", sep=";")

In [5]:
# .copy() so that later column assignments modify an independent frame
df = df[["balance", "housing", "loan", "job", "age", "campaign", "previous", "y"]].copy()

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   balance   45211 non-null  int64 
 1   housing   45211 non-null  object
 2   loan      45211 non-null  object
 3   job       45211 non-null  object
 4   age       45211 non-null  int64 
 5   campaign  45211 non-null  int64 
 6   previous  45211 non-null  int64 
 7   y         45211 non-null  object
dtypes: int64(4), object(4)
memory usage: 2.8+ MB


In [7]:
df["y"] = df["y"].map({"yes": 1, "no": 0})

In [8]:
df.value_counts("y")

y
0    39922
1     5289
Name: count, dtype: int64

In [9]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

### Task 1: Plotting

- Create plots showing the distribution of `balance` and `age` by `y`.
- Show average of `y` (share of cases where y=1) by `job`.
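
One possible starting point for the second bullet, sketched on a small made-up frame rather than the real `df_train` (the `toy` frame and its values are assumptions for illustration only). Because `y` is coded as 0/1, the group mean is the share of positive cases.

```python
import pandas as pd

# Toy stand-in for df_train; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "job": ["student", "student", "retired", "retired", "services", "services"],
        "age": [22, 25, 64, 70, 35, 41],
        "y": [1, 0, 1, 1, 0, 0],
    }
)

# y is 0/1, so its mean within each job is the share of cases where y=1.
share_by_job = toy.groupby("job")["y"].mean().sort_values(ascending=False)

# Per-class summary of a numeric feature, a numeric companion to a distribution plot.
age_by_y = toy.groupby("y")["age"].describe()

print(share_by_job)
print(age_by_y[["mean", "min", "max"]])
```

On the real data, something like `share_by_job.plot.barh()` for the shares and `sns.boxplot(data=df_train, x="y", y="age")` for the distributions would probably be a reasonable first attempt.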



### Task 2: Decision tree

- Fit a decision tree model using `balance` and `age` features. Set `max_depth` to 2.
- Evaluate the model using cross-validation (`cross_val_score`), once using "accuracy" and once using "roc_auc" as the metric.
- Visualize the decision tree using the `plot_tree` helper defined above.
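
A minimal sketch of the steps above, assuming a synthetic `demo` frame that only mimics the column names and rough class balance of the real data (the frame, `n`, and the 5-fold setting are assumptions, not part of the seminar material).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for df_train (assumption: same column names, ~12% positives).
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame(
    {
        "balance": rng.normal(1300, 3000, n).round(),
        "age": rng.integers(18, 90, n),
    }
)
demo["y"] = (rng.random(n) < 0.12).astype(int)

features = ["balance", "age"]
model = DecisionTreeClassifier(max_depth=2, random_state=42)

# Same model, two metrics: accuracy scores hard labels, roc_auc scores the ranking.
acc = cross_val_score(model, demo[features], demo["y"], cv=5, scoring="accuracy")
auc = cross_val_score(model, demo[features], demo["y"], cv=5, scoring="roc_auc")
print(f"accuracy: {acc.mean():.3f}, roc_auc: {auc.mean():.3f}")

# cross_val_score fits clones, so fit the model itself before plotting it.
model.fit(demo[features], demo["y"])
```

With the real data, `plot_tree(model, features)` would then draw the fitted tree. Keep in mind that about 88% of the rows have `y = 0` (39922 of 45211), so a model that always predicts "no" already reaches roughly that accuracy; this is why the task asks for `roc_auc` as well.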


### Task 3: Random forest

- Fit a random forest instead of a single decision tree, using the two features from above.
- Find optimal values for `max_depth` and `class_weight`. Show the performance of the model using the best hyperparameters. For `max_depth` use a range of 1 to 20 (inclusive).
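
One way to approach the search is `GridSearchCV`. The sketch below runs on a small synthetic `demo` frame, and the choices of `n_estimators=25`, `cv=3`, `roc_auc` as the selection metric, and `[None, "balanced"]` as the `class_weight` candidates are assumptions made to keep the example quick, not requirements of the task.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for df_train (assumption), with a weak age effect on y.
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame(
    {
        "balance": rng.normal(1300, 3000, n),
        "age": rng.integers(18, 90, n),
    }
)
p = np.where(demo["age"] > 60, 0.35, 0.08)
demo["y"] = (rng.random(n) < p).astype(int)

param_grid = {
    "max_depth": list(range(1, 21)),      # 1 to 20 inclusive
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(demo[["balance", "age"]], demo["y"])

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_` holds the score of every combination, which can be loaded into a `DataFrame` to see how performance varies with depth.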

### Task 4: Random forest, all features

- Add all features to the model.
- Tune `max_depth`, `class_weight` and `min_samples_split` hyperparameters. For `max_depth` use a range of 10 to 20 (inclusive), for `min_samples_split` use a range of 2 to 10 (inclusive).

- Compare the first model (decision tree) to this model.
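
Since `housing`, `loan` and `job` are string columns, they need encoding before a forest can use them. The sketch below is one possible setup on synthetic data: it one-hot encodes the categorical columns inside a `Pipeline` and, as a swapped-in technique, samples the grid with `RandomizedSearchCV` instead of searching all 198 combinations. The `demo` frame, `n_iter=10`, `cv=3` and `n_estimators=25` are assumptions chosen for speed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for df_train (assumption: same column names as the notebook).
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame(
    {
        "balance": rng.normal(1300, 3000, n),
        "age": rng.integers(18, 90, n),
        "campaign": rng.integers(1, 10, n),
        "previous": rng.integers(0, 5, n),
        "housing": rng.choice(["yes", "no"], n),
        "loan": rng.choice(["yes", "no"], n),
        "job": rng.choice(["student", "retired", "services", "management"], n),
    }
)
p = np.where(demo["housing"] == "no", 0.25, 0.08)
demo["y"] = (rng.random(n) < p).astype(int)

categorical = ["housing", "loan", "job"]
numeric = ["balance", "age", "campaign", "previous"]

# One-hot encode the string columns; numeric columns pass through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipe = Pipeline(
    [
        ("prep", preprocess),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=42)),
    ]
)

# The rf__ prefix routes each parameter to the forest step of the pipeline.
param_distributions = {
    "rf__max_depth": list(range(10, 21)),         # 10 to 20 inclusive
    "rf__min_samples_split": list(range(2, 11)),  # 2 to 10 inclusive
    "rf__class_weight": [None, "balanced"],
}
search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(demo[categorical + numeric], demo["y"])

print(search.best_params_, round(search.best_score_, 3))
```

For the comparison with the decision tree, scoring both models with the same metric and the same cross-validation setting keeps the numbers comparable.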



### Task 5: Report the performance of the best model

- Show how precision, recall and F1 score change as you change the threshold.
- On a separate plot, show how the number of positive predictions (the number of cases where the model's predicted probability exceeds the threshold) changes as a function of the threshold.

Interpret the plots. Try to pretend that you're working in a bank and are creating this model for the marketing team. What would you communicate to the stakeholders in the team?
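
The threshold sweep could be tabulated roughly as follows. The labels and scores here are hypothetical; with a fitted classifier, `y_score` would instead come from something like `model.predict_proba(X_test)[:, 1]` on the holdout set.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical holdout labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

rows = []
for t in np.linspace(0.05, 0.95, 10):
    # A case is predicted positive when its score reaches the threshold.
    y_pred = (y_score >= t).astype(int)
    rows.append(
        {
            "threshold": t,
            # zero_division=0 avoids warnings when nothing is predicted positive
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
            "n_positive": int(y_pred.sum()),
        }
    )
report = pd.DataFrame(rows)

print(report.round(3))
```

From here, `report.plot(x="threshold", y=["precision", "recall", "f1"])` and `report.plot(x="threshold", y="n_positive")` would give the two requested plots. For the marketing team, `n_positive` translates directly into how many customers would be contacted at a given threshold, and precision into the share of those contacts expected to subscribe.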
