## Instructions

One simplifying assumption in this project is that every customer who answers the phone purchases a product. (The data bear this assumption out.) Modeling `answered` is therefore equivalent to modeling a purchase.

There are costs and benefits in this case. We will assume that customers purchase a product for 100 dollars and that the company invests 25 dollars in making each call. Profit is therefore 75 dollars for an answered call, which, we assume, results in a purchase. In sum:

- Benefit: True positive. The customer is predicted to answer, does answer, and purchases a product for 100 dollars, for a profit of 100 - 25 = 75 dollars.
- Cost: False positive. The customer is predicted to answer but does not, so the 25 dollars invested in the call is lost. (We assume the agent cannot schedule another call at the last minute, or spends the entire time slot trying to make the call.)

For this exercise, customers who are not predicted to answer will not be called, so they generate no benefits and no costs.
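
The payoff rules above can be written as a small function. This is only a minimal sketch, not part of the assignment's starter code: the name `payoff` and its arguments are hypothetical, and the 100 and 25 dollar figures are the assumptions stated above.

```python
PRICE = 100     # assumed revenue from a purchase, in dollars
CALL_COST = 25  # assumed company investment per call, in dollars


def payoff(predicted_answer, answered):
    """Return the profit for one customer (hypothetical helper)."""
    if not predicted_answer:
        return 0  # not called: no benefit and no cost
    if answered:
        return PRICE - CALL_COST  # true positive
    return -CALL_COST  # false positive


print(payoff(1, 1), payoff(1, 0), payoff(0, 1))  # prints: 75 -25 0
```

A false negative (a customer who would have answered but is not called) pays 0 under these rules, which is why only true positives and false positives enter the profit calculations.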

Split the data into a train set with 80% of the data and a test set with the remaining 20%. Use `random_state = 200` in the `sample()` function when splitting the data. This ensures that everyone works with the same train and test sets.

## Data Prep

- Make an explicit copy
- Remove NAs (there are a couple)
- Leave the categorical features as is
- Remove negative `income`
- Remove outlier in `num_accts`
- Drop `product`. (Since a purchase takes place only after someone has answered the phone, it can't be used to predict answering.)

In [None]:
# Load packages

import pandas as pd
from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn import tree
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay


In [None]:
# Download data
ai = pd.read_csv("https://raw.githubusercontent.com/jefftwebb/is_4487_base/main/Assignments/DataSets/adviseinvest.csv")

# Create a copy of the original
ai_clean = ai.copy()

# remove NAs from the copy
ai_clean = ai_clean.dropna()

# remove outlier
ai_clean = ai_clean[ai_clean.num_accts < 100]

# remove negative income
ai_clean = ai_clean[ai_clean.income >= 0]

# drop product
ai_clean = ai_clean.drop(['product'], axis=1)

In [None]:
# check that the cleaning worked
ai_clean.describe()

In [None]:
# check no NAs
ai_clean.isna().sum()

## Cross validation

Set up train and test sets.

In [None]:
# divide ai_clean into train and test
train = ai_clean.sample(frac=0.8, random_state=200) # 80% of data for training
test = ai_clean.drop(train.index) # the remaining 20%


# Define X and y in train and test
X_train = train.drop(['answered'], axis = 1)
y_train = train.answered

X_test = test.drop(['answered'], axis = 1)
y_test = test.answered



## Q1

Fit a tree model of `answered` using all the available predictors in the train set. This is the model from the previous project; you are welcome to copy and paste your code.

Create a confusion matrix for this model using predictions from the test set. (`predict()` uses a default probability threshold of 0.5.)

In [None]:
# Initialize model, specifying max_depth = 5
tree_model = DecisionTreeClassifier(criterion = "entropy", max_depth = 5)

# Fit model to train data
tree_model = tree_model.fit(X_train, y_train)

In [None]:
# predict class labels with predict(); thresholding predict_proba() at 0.5 should give the same labels
# make sure to predict on the test set
pred = tree_model.predict(X_test)
pred[:5]


In [None]:
# should get exactly the same output using predicted probabilities
pred_prob = tree_model.predict_proba(X_test)
y_prob_labels = np.where(pred_prob[:, 1] > 0.5, 1, 0)
y_prob_labels[:5]

In [None]:
# create confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, pred)

- TP: 2306
- FP: 384
- FN: 973
- TN: 2237
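
To avoid misreading the plot, the four cells can also be counted directly from the labels. Keep in mind that `ConfusionMatrixDisplay` puts the true labels on the rows and the predicted labels on the columns. The sketch below uses short made-up label lists that stand in for `y_test` and `pred`; the variable names are illustrative.

```python
from collections import Counter

# Made-up labels for illustration only (1 = answered, 0 = did not answer)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Count each (true label, predicted label) pair
cells = Counter(zip(y_true, y_pred))

tp = cells[(1, 1)]  # predicted to answer and did
fp = cells[(0, 1)]  # predicted to answer but did not
fn = cells[(1, 0)]  # predicted not to answer but did
tn = cells[(0, 0)]  # predicted not to answer and did not

print(tp, fp, fn, tn)  # prints: 3 1 1 3
```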


## Q2

Using the confusion matrix from the previous question, how much profit (revenue - costs) could be expected with these costs and benefits?

- Benefit: 100
- Cost: 25

Hint: multiply the counts in the confusion matrix cells by the corresponding cost-benefit matrix cells. Note: profit should not be negative! Make sure that you have correctly identified the true positives and the false positives in your confusion matrix.
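
One way to follow the hint is an element-wise product of two arrays. The sketch below assumes the counts reported in Q1 and lays both matrices out with true labels on the rows and predicted labels on the columns; the array names are illustrative.

```python
import numpy as np

# Rows: actual (0 = no answer, 1 = answered); columns: predicted (0, 1)
confusion = np.array([[2237, 384],
                      [973, 2306]])

# Payoff per customer in each cell; customers predicted 0 are never called
cost_benefit = np.array([[0, -25],
                         [0, 100 - 25]])

# Multiply cell by cell, then add up the four products
profit = int((confusion * cost_benefit).sum())
print(profit)  # prints: 163350
```

Writing the payoffs as a matrix makes it easy to see that the predicted-0 column contributes nothing to profit.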

In [None]:
# TP * (price - cost) - FP * cost
2306 * (100 - 25) - 384 * 25

## Q3

How much profit (revenue - costs) could be expected if all customers in the test set are called? We can consider this a baseline case for profit since it does not require a model.

In other words, to calculate profit in this baseline scenario treat the customers who answered as true positives and treat the customers who did not answer as false positives.
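
As a cross-check, the baseline can also be derived from the Q1 counts: calling everyone turns every false negative into a true positive and every true negative into a false positive. A sketch with illustrative variable names:

```python
# Counts from the Q1 confusion matrix
tp, fp, fn, tn = 2306, 384, 973, 2237

answered = tp + fn      # everyone who answers becomes a true positive
not_answered = fp + tn  # everyone else becomes a false positive

baseline_profit = answered * (100 - 25) - not_answered * 25
print(answered, not_answered, baseline_profit)  # prints: 3279 2621 180400
```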

In [None]:
y_test.value_counts()

In [None]:
# answered * (price - cost) - not answered * cost
3279 * (100 - 25) - 2621 * 25

## Q4
How much profit can be expected if only customers with a predicted probability of answering greater than 0.2 are called? Again, use the test set to calculate profit.
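
Lowering the threshold means that more customers are called. The sketch below wraps the thresholding and the profit arithmetic in one function. `profit_at_threshold` is a hypothetical helper, and the probabilities and labels are made up for illustration; in the notebook the inputs would be `pred_prob[:, 1]` and `y_test`.

```python
import numpy as np


def profit_at_threshold(prob_answer, y_true, threshold, price=100, cost=25):
    """Profit from calling only customers with prob_answer > threshold."""
    prob_answer = np.asarray(prob_answer)
    y_true = np.asarray(y_true)
    called = prob_answer > threshold          # who gets a call
    tp = int(np.sum(called & (y_true == 1)))  # called and answered
    fp = int(np.sum(called & (y_true == 0)))  # called, no answer
    return tp * (price - cost) - fp * cost


# Made-up data for illustration only
probs = [0.9, 0.15, 0.3, 0.6, 0.25, 0.05]
labels = [1, 0, 1, 1, 0, 1]

print(profit_at_threshold(probs, labels, 0.2))  # prints: 200
print(profit_at_threshold(probs, labels, 0.5))  # prints: 150
```

In this toy data the lower threshold adds one true positive and one false positive, for a net gain of 75 - 25 = 50 dollars.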

In [None]:
y_prob_labels = np.where(pred_prob[:,1] > 0.2, 1, 0)
ConfusionMatrixDisplay.from_predictions(y_test, y_prob_labels)

In [None]:
# TP * (price - cost) - FP * cost, with counts read from the matrix above
3279 * (100 - 25) - 2237 * 25