# Module 4 homework

**This homework has 7 questions.**

In [109]:
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    ConfusionMatrixDisplay,
    make_scorer,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_curve,
)
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    train_test_split,
)
from sklearn.tree import DecisionTreeClassifier

In this homework, we take another look at the Bank Marketing dataset that we examined in Module 1.

## Question 1 (1 point)

Load the `bank_data.csv` dataset into a variable named `bank_data`.

## Answer 1

In [110]:
bank_data = pd.read_csv('data/bank_data.csv', sep=';')

As in Module 1, we remove some variables that would be hard to handle when using our models for prediction.

In [111]:
try:
    bank_data.drop(['day', 'month', 'duration', 'pdays', 'poutcome'], axis=1, inplace=True)
except NameError:
    print('The object `bank_data` does not exist! Did you forget to create it?')

Here, we encode the response variable (whether or not a customer subscribed to a term deposit with the bank following a marketing campaign) as a binary indicator.

In [112]:
try:
    bank_data["y"] = bank_data['y'].apply(lambda y: 1 if y == 'yes' else 0)
    bank_data.head()
except NameError:
    print('The object `bank_data` does not exist! Did you forget to create it?')

## Question 2 (1 point)

Create "dummy" variables for the categorical predictors in the dataset (also known as "one-hot encoding").

Reassign the new dataframe to the variable `bank_data`.

## Answer 2

In [113]:
bank_data = pd.get_dummies(bank_data, dtype=int)
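
To see what one-hot encoding does, here is a minimal sketch on a hypothetical toy frame (the `toy` data and its column names are made up for illustration and are not part of the bank dataset). Categorical columns are expanded into one 0/1 column per category, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy frame (not the bank data): one numeric, one categorical column.
toy = pd.DataFrame({"age": [30, 45, 52], "marital": ["single", "married", "single"]})

# Each category of `marital` becomes its own 0/1 indicator column;
# the numeric `age` column is left as it is.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The new columns are named `<original column>_<category>`, which is also how the dummy columns of `bank_data` are named.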

Let's now separate the predictors from the response variable.

In [114]:
try:
    X = bank_data.drop('y', axis=1)
    Y = bank_data['y']
except NameError:
    print('The object `bank_data` does not exist! Did you forget to create it?')

## Question 3 (1 point)

Split the data into a training and test set (`X_train, X_test, Y_train, Y_test`).

Use `random_state=42` and `test_size=0.25`.

## Answer 3

In [115]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

We want to use the available data to build a predictive model that can assist us in making our next marketing campaign more efficient.


Here are some quantities to keep in mind. Our department estimates that:

- a marketing contact with a potential customer costs around 10 Euros on average

- a successful contact (i.e., the customer subscribes to a term deposit) generates on average 100 Euros of profit for the bank (say, present value net of the cost of marketing).

Accordingly, we estimate that:

- the value associated with a true negative prediction from our model is 10 Euros (it saves us the waste of 10 Euros associated with the marketing contact)

- the value associated with a false positive prediction from our model is -10 Euros

- the value associated with a false negative prediction from our model is -100 Euros

- the value associated with a true positive prediction from our model is +100 Euros.

Let's encode this information in a "value function" that we will use later.

In [116]:
def value_function(y_true, y_pred, tn_value=10, fp_value=-10, fn_value=-100, tp_value=100):
    sum_ = y_pred + y_true
    diff_ = y_pred - y_true
    tn_contrib = tn_value * np.mean((sum_ == 0) & (diff_ == 0))
    fp_contrib = fp_value * np.mean((sum_ == 1) & (diff_ == 1))
    fn_contrib = fn_value * np.mean((sum_ == 1) & (diff_ == -1))
    tp_contrib = tp_value * np.mean((sum_ == 2) & (diff_ == 0))
    return tn_contrib + fp_contrib + fn_contrib + tp_contrib
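
As a quick sanity check of the payoffs, here is a small worked example on hypothetical labels and predictions for five customers (the arrays are made up for illustration). It counts each confusion-matrix cell directly and averages the payoffs over all customers, which should match what `value_function` returns with its default arguments on the same inputs.

```python
import numpy as np

# Hypothetical labels and predictions for five customers (illustration only).
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 1, 0])

# Count each confusion-matrix cell directly.
tn = np.sum((y_true == 0) & (y_pred == 0))  # 2 true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # 1 false positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # 1 false negative
tp = np.sum((y_true == 1) & (y_pred == 1))  # 1 true positive

# Same payoffs as the defaults of value_function, averaged over all customers.
average_value = (10 * tn - 10 * fp - 100 * fn + 100 * tp) / len(y_true)
print(average_value)  # 2.0
```

Because the result is an average, it reads as the expected value in Euros per customer scored by the model.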

## Question 4 (1 point)

In this exercise, assume that you have already performed cross-validation for an `AdaBoostClassifier` and found that good values for its parameters are as follows:

- `base_estimator=DecisionTreeClassifier(random_state=42, max_depth=5)`

- `n_estimators=2000`

- `learning_rate=0.80`

Fit an `AdaBoostClassifier` on the training data using these parameter values (and `random_state=42`).

*Warning: it may take a few minutes to fit this beefy model! Feel free to take a coffee break ;)*

## Answer 4

In [117]:
ada_boost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=5, random_state=42),
                               random_state=42, n_estimators=2000, learning_rate=0.80).fit(X_train, Y_train)
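
A note on versions: in newer scikit-learn releases the `base_estimator` argument of `AdaBoostClassifier` was renamed to `estimator` (the old name was deprecated and later removed). The sketch below is one possible way to stay compatible with either name. It uses synthetic stand-in data (`X_demo`, `y_demo`) and a deliberately small ensemble so that it runs in seconds; these names and sizes are assumptions for illustration, not the homework settings.

```python
import inspect

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Pick whichever keyword the installed scikit-learn accepts for the weak learner.
params = inspect.signature(AdaBoostClassifier.__init__).parameters
keyword = "estimator" if "estimator" in params else "base_estimator"

# Synthetic stand-in data and a small ensemble, so this runs quickly.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
weak_learner = DecisionTreeClassifier(max_depth=5, random_state=42)
model = AdaBoostClassifier(
    **{keyword: weak_learner},
    n_estimators=10,
    learning_rate=0.80,
    random_state=42,
).fit(X_demo, y_demo)

print(keyword)
```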



We will also fit a conventional decision tree for reference.

In [118]:
try:
    tree = DecisionTreeClassifier(random_state=42).fit(X_train, Y_train)
except NameError:
    print('The objects `X_train, Y_train` do not exist!')

## Question 5 (1 point)

Compute the problem-specific "value function" for the `AdaBoostClassifier` and for the `DecisionTreeClassifier`.

Which model performs best with respect to this metric?

## Answer 5

In [119]:
# Average value per test customer under each model's predictions
print(f"value using adaboost: {value_function(Y_test, ada_boost.predict(X_test)):.2f}")
print(f"value using tree: {value_function(Y_test, tree.predict(X_test)):.2f}")


value using adaboost: 1.42
value using tree: 1.08


My diagnostic: on the test set, the `AdaBoostClassifier` achieves an average value of 1.42 per customer, versus 1.08 for the `DecisionTreeClassifier`. Since its value is higher, AdaBoost performs best with respect to this metric.

Let's now try to quantify the monetary impact on our marketing campaign of using the `AdaBoostClassifier` as opposed to the `DecisionTreeClassifier`.

First off, for the evaluation of a given model, we will assume that in our marketing campaign we will only contact customers that are predicted as subscribers by our model.

With this in mind, let's create a "marketing campaign profit function".

In [120]:
def marketing_profits(model, X, Y, fp_value=-10, tp_value=100):
    tp_contrib = np.sum((model.predict(X) > 0) & (Y > 0)) * tp_value
    fp_contrib = np.sum((model.predict(X) > 0) & (Y < 1)) * fp_value
    return tp_contrib + fp_contrib
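
Note that `marketing_profits` calls `model.predict(X)` twice. As a minimal sketch of the same calculation, the hypothetical helper `campaign_profit` below takes the predictions directly, so a slow model only needs to predict once. The input arrays are made up for illustration.

```python
import numpy as np

def campaign_profit(y_pred, y_true, fp_value=-10, tp_value=100):
    """Profit when only predicted subscribers are contacted (hypothetical helper)."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    contacted = y_pred > 0
    tp = np.sum(contacted & (y_true > 0))  # contacted and subscribed
    fp = np.sum(contacted & (y_true < 1))  # contacted but did not subscribe
    return tp * tp_value + fp * fp_value

# Hypothetical example: two customers are contacted, one of whom subscribes.
print(campaign_profit([0, 1, 0, 1, 0], [0, 0, 0, 1, 1]))  # 90
```

Customers who are not contacted contribute nothing here, which is why false negatives do not appear in the profit (unlike in `value_function`).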

## Question 6 (2 points)

Based on the test data, by what percentage will the profits of our future marketing campaign increase (or decrease) if we use the `AdaBoostClassifier` ensemble model instead of the conventional `DecisionTreeClassifier`?

## Answer 6

In [124]:
adaboost_profit = marketing_profits(ada_boost, X_test, Y_test) 
print(adaboost_profit)
tree_profit = marketing_profits(tree, X_test, Y_test) 
print(tree_profit)

25900
24000


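My diagnostic: profits on the test set are 25900 Euros with AdaBoost and 24000 Euros with the tree, so AdaBoost performs best with respect to this metric.
The question asks for the change relative to the tree, so the tree profit is the baseline:
(25900 - 24000) / 24000 = 7.92%
Profits will increase by about 7.92% if we use AdaBoost instead of the tree.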
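
The percentage depends on which profit is taken as the baseline. A small sketch of the arithmetic, using the two test-set profits reported above (the helper name `percent_change` is an assumption introduced here for illustration):

```python
def percent_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100

adaboost_profit = 25900  # test-set profit reported above
tree_profit = 24000      # test-set profit reported above

# Baseline = tree: how much more AdaBoost earns compared with the tree.
print(round(percent_change(adaboost_profit, tree_profit), 2))  # 7.92

# Dividing by the AdaBoost profit instead gives the gain as a share of
# AdaBoost's own profit, which is a different quantity.
print(round((adaboost_profit - tree_profit) / adaboost_profit * 100, 2))  # 7.34
```

Since the question compares AdaBoost "as opposed to" the tree, the first figure (tree as baseline) is the natural reading.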

## Question 7 (3 points)

1. Build an additional classifier of your choice for this problem.
   Make sure to follow best practices with cross-validation and evaluation on
   the test set!

2. Evaluate it against the two models above with respect to

   - the `value_function` that we defined
   
   - and increase/decrease in marketing campaign profits.

A few notes:

- The dataset is imbalanced, with only about 10% of the observations in the training
  data representing a positive marketing contact (i.e., $Y=1$).
  Is there any way to address this issue when fitting the model?
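  See e.g., the `class_weight` parameter of `DecisionTreeClassifier`,
  `RandomForestClassifier`, or other classification algorithms
  (`AdaBoostClassifier` itself has no `class_weight` parameter, but its base
  estimator may). Chances are that setting `class_weight="balanced"`
  will improve the results.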

- You can also try and use AUC (the area under the ROC curve) as a target metric
  for optimization. If you would like to experiment with that, try to set
  `scoring="roc_auc"` in `GridSearchCV`.

- We could try and optimize our models for our `value_function` directly.
  This may further improve our results. To do so, simply set
  `scoring=value_function_wrapper` in `GridSearchCV`.
  Note that `value_function_wrapper` is defined in the next cell and it is
  simply a version of our `value_function` that can be used as a scoring
  function by `sklearn`.

In [122]:
def _value_function(y, y_pred, **kwargs):
    return value_function(y, y_pred, **kwargs)


value_function_wrapper = make_scorer(_value_function)
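
To illustrate how such a scorer plugs into `GridSearchCV`, here is a minimal sketch on synthetic imbalanced data. Everything in it is an assumption for illustration: `average_value` re-implements the payoffs of `value_function` so the block is self-contained, `X_demo`/`y_demo` stand in for the bank data, and the grid values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def average_value(y_true, y_pred):
    """Same payoffs as value_function, written with confusion-cell means."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (
        10 * np.mean((y_true == 0) & (y_pred == 0))
        - 10 * np.mean((y_true == 0) & (y_pred == 1))
        - 100 * np.mean((y_true == 1) & (y_pred == 0))
        + 100 * np.mean((y_true == 1) & (y_pred == 1))
    )

# Synthetic imbalanced data (roughly 10% positives) standing in for the bank data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

# A grid with several candidates, so the scorer actually drives the selection.
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid={"max_depth": [2, 4, 8]},
    scoring=make_scorer(average_value),  # higher scores are treated as better
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 2))
```

The grid needs more than one candidate for the choice of `scoring` to matter: with a single candidate, every scoring function selects the same model.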

## Answer 7

In [125]:
random_forest = RandomForestClassifier(
    class_weight="balanced", random_state=42, max_depth=5, n_estimators=2000
).fit(X_train, Y_train)
# Evaluate on the test set with both metrics
print(f"value using random forest: {value_function(Y_test, random_forest.predict(X_test)):.2f}")
print(f"profit using random forest: {marketing_profits(random_forest, X_test, Y_test)}")

value using random forest: 7.13
profit using random forest: 58180


My diagnostic:

value using adaboost: 1.42
value using tree: 1.08
value using random forest: 7.13
Since the value for the random forest is the highest, it performs best with respect to this metric.

profit using adaboost: 25900
profit using tree: 24000
profit using random forest: 58180
The random forest also yields the highest profit, so it performs best with respect to this metric as well.

In [126]:
# Note: each parameter has a single value, so this "grid" holds one candidate
# and cross-validation only scores that candidate.
param_grid = {
    'n_estimators': [2000],
    'max_depth': [5],
    'random_state': [42],
}
random_forest_new = RandomForestClassifier(class_weight="balanced")

# Cross-validate with AUC as the selection metric
grid_search1 = GridSearchCV(random_forest_new, param_grid, scoring="roc_auc", cv=5)

# Fit on the training data; the best candidate is refit on the full training set
random_forest_scoring1 = grid_search1.fit(X_train, Y_train)

print(f"value using random forest scoring roc auc: {value_function(Y_test, random_forest_scoring1.predict(X_test)):.2f}")
print(f"profit using random forest scoring roc auc: {marketing_profits(random_forest_scoring1, X_test, Y_test)}")

value using random forest scoring roc auc: 7.13
profit using random forest scoring roc auc: 58180


In [129]:
# Cross-validate with the problem-specific value function as the selection metric
grid_search2 = GridSearchCV(random_forest_new, param_grid, scoring=value_function_wrapper, cv=5)

# Fit on the training data; the best candidate is refit on the full training set
random_forest_scoring2 = grid_search2.fit(X_train, Y_train)

print(f"value using random forest scoring value_function_wrapper: {value_function(Y_test, random_forest_scoring2.predict(X_test)):.2f}")
print(f"profit using random forest scoring value_function_wrapper: {marketing_profits(random_forest_scoring2, X_test, Y_test)}")

profit using random forest scoring value_function_wrapper: 58180


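My diagnostic:

value using adaboost: 1.42
value using tree: 1.08
value using random forest: 7.13
value using random forest scoring roc auc: 7.13
value using random forest scoring value_function_wrapper: 7.13
Since the value for the random forest is the highest, it performs best with respect to this metric. The two grid searches give the same result as the direct fit because `param_grid` contains a single candidate, so the choice of `scoring` cannot change the selected model.

profit using adaboost: 25900
profit using tree: 24000
profit using random forest: 58180
profit using random forest scoring roc auc: 58180
profit using random forest scoring value_function_wrapper: 58180
The random forest also yields the highest profit.

Overall, the additional model, a random forest with `class_weight="balanced"`, outperforms both the `AdaBoostClassifier` and the single `DecisionTreeClassifier` on the test set.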
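
To express the comparison as an increase or decrease in campaign profits, here is a short sketch that uses the test-set profits reported above and takes the decision tree as the baseline (the choice of baseline is an assumption; any of the models could serve as the reference).

```python
# Test-set profits reported above (Euros).
profits = {"tree": 24000, "adaboost": 25900, "random forest": 58180}

# Percentage change of each model's profit relative to the tree baseline.
baseline = profits["tree"]
changes = {name: (p - baseline) / baseline * 100 for name, p in profits.items()}

for name, change in changes.items():
    print(f"{name:>13}: {profits[name]:>6} Euros ({change:+.2f}% vs. tree)")
```

With these figures, the random forest's test-set profit is roughly 142% above the tree's, compared with about 8% for AdaBoost.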