In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn import metrics
from sklearn.tree import export_graphviz
import graphviz as gp

Throughout this tutorial, we are considering the *Retention modeling at Scholastic Travel Company* case.

# Part 1: Pre-processing the data

The goal of this Break Out is to brainstorm some ideas for data preprocessing/feature engineering.

In [None]:
stca_raw = pd.read_csv("stca_raw_data.csv")
stca_raw.head()

Use this space to investigate the data if need be.

# Part 2: Predicting returning customers using logistic regression

Here we use a preprocessed and feature-engineered dataset. You are of course welcome to use your own cleaned dataset instead. The goal now is to proceed with classification and get some practice using the methods seen in class.

In [None]:
stca = pd.read_csv("stca_clean.csv")
stca.head()

Create the labels `y` and the feature matrix `X` as discussed in class. Recall that we are trying to predict the outcome `"Retained.in.2012."`
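As a minimal sketch of this step, the cell below separates a target column from the features on a toy DataFrame. Only the column name `"Retained.in.2012."` comes from the case; the frame `stca_demo` and its feature columns are hypothetical stand-ins for `stca`.

```python
import pandas as pd

# Toy stand-in for stca_clean.csv; the feature columns are hypothetical.
stca_demo = pd.DataFrame({
    "Retained.in.2012.": [1, 0, 1, 1, 0],
    "feature_a": [3, 1, 4, 1, 5],
    "feature_b": [0.2, 0.7, 0.1, 0.9, 0.5],
})

target = "Retained.in.2012."
y_demo = stca_demo[target]                 # labels: 1 = retained, 0 = not retained
X_demo = stca_demo.drop(columns=[target])  # every remaining column is a feature
print(X_demo.shape, y_demo.shape)
```

The same two lines should carry over to `stca`, provided every remaining column is meant to be used as a feature.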

Separate the data into training/validation/testing with percentages 60/20/20, using `train_test_split`. Why are we creating a validation set here?
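One way to get a 60/20/20 split is to call `train_test_split` twice: first hold out 20% for testing, then take a quarter of what remains for validation. The sketch below uses synthetic data and hypothetical `_demo` names; the `random_state` values are arbitrary and only there for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (100 rows, 3 features); replace with your own X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# First hold out 20% of the rows for testing...
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# ...then take 25% of the remaining 80% for validation (0.25 * 0.8 = 0.2 overall).
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```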

Using `scikit-learn`, run a logistic regression on `X_train, y_train` with the parameter `max_iter` set to 2000 (that is, use `LogisticRegression(max_iter=2000)`). What are the 5 largest coefficients and the 5 smallest? Do they make sense intuitively?

The code below prints out the five largest coefficients (assuming your training set is called `X_train` and your logistic regression model is called `classifier_LR`):

In [None]:
summary = pd.DataFrame([X_train.columns,classifier_LR.coef_[0]]).T.sort_values(by = 1, ascending = False)
summary.columns = ['Variable','Coefficient']
summary.head(n=5)

The next piece of code prints out the five smallest coefficients:

In [None]:
summary.tail(n=5)

Obtain the predicted probabilities for `X_validation`, using `[model-name].predict_proba(X_validation)[:,1]`. What are different ways of measuring how good the method is? We'll take a look at the ROC curve (using `metrics.roc_curve`), then use the area under the curve here given by `metrics.roc_auc_score` to evaluate the quality of our model.
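The sketch below shows how these calls fit together on synthetic data. All `_demo` names and the data-generating rule are assumptions made for illustration; in the notebook you would use your fitted model, `X_validation` and `y_validation` instead.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(200, 2))
y_train_demo = (X_train_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_val_demo = rng.normal(size=(100, 2))
y_val_demo = (X_val_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

clf_demo = LogisticRegression(max_iter=2000).fit(X_train_demo, y_train_demo)

# Column 1 of predict_proba holds the predicted probability of class 1.
prob_demo = clf_demo.predict_proba(X_val_demo)[:, 1]

# roc_curve returns one (fpr, tpr) point per candidate threshold.
fpr_demo, tpr_demo, thresholds_demo = metrics.roc_curve(y_val_demo, prob_demo)
auc_demo = metrics.roc_auc_score(y_val_demo, prob_demo)
print("Validation AUC:", round(auc_demo, 3))
```

To look at the curve itself, something like `plt.plot(fpr_demo, tpr_demo)` with the diagonal `plt.plot([0, 1], [0, 1], "--")` as a reference line should be enough.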

Recall that the ROC curve is generated by sweeping over different values of the threshold, and that the area under it summarizes the curve in a single number. Our goal is now to think about a threshold. What is a false positive here? What is a false negative? Which one do you think we should focus on, assuming that we adapt our marketing policy based on the output of our algorithm?

Set the threshold to 0.7 using `np.where`. Then, obtain the `metrics.confusion_matrix`, as well as the `metrics.accuracy_score`.
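Here is a small worked example of thresholding, using hand-made labels and probabilities (purely illustrative, not STC data). It predicts 1 whenever the probability is at least the threshold.

```python
import numpy as np
from sklearn import metrics

# Hand-made labels and probabilities, for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0])
prob_demo = np.array([0.9, 0.8, 0.6, 0.75, 0.3, 0.7])

threshold_demo = 0.7
# Predict 1 (retained) when the probability is at least the threshold.
y_hat_demo = np.where(prob_demo < threshold_demo, 0, 1)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm_demo = metrics.confusion_matrix(y_true_demo, y_hat_demo)
acc_demo = metrics.accuracy_score(y_true_demo, y_hat_demo)
print(cm_demo)   # [[1 2]
                 #  [1 2]]
print(acc_demo)  # 0.5
```

In scikit-learn's layout, `cm[0][0]` counts true negatives, `cm[0][1]` false positives, `cm[1][0]` false negatives and `cm[1][1]` true positives.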

# Part 3: Setting a threshold

A good starting point for setting a threshold is the population average (0.613 in our case). Other choices of threshold try to balance the true positive and false positive rates. One example is [Youden’s J statistic](https://en.wikipedia.org/wiki/Youden%27s_J_statistic), which is simply calculated as:

$J = \text{Sensitivity} + \text{Specificity} - 1 = \text{True Positive Rate} - \text{False Positive Rate}$

Both rates are available directly from the ROC curve computed earlier (the arrays `fpr`, `tpr` and `thresholds` returned by `metrics.roc_curve`):

In [None]:
J = tpr - fpr
J

The approach involves maximizing $J$. Hence, we simply pick the threshold with the highest $J$:

In [None]:
print("The best threshold according to the J statistic is " + str(thresholds[np.argmax(J)]))

Let's see the confusion matrix at this threshold:

In [None]:
threshold = thresholds[np.argmax(J)]
y_validation_pred = np.where(y_pred_prob < threshold, 0, 1)
metrics.confusion_matrix(y_validation, y_validation_pred)

There are many other similar metrics. However, none of them takes into account the specific costs of false positives and false negatives (recall that in some applications FP are more expensive, and in others FN are more expensive).

If we know the cost of any of the outcomes, we can directly compute the cost of our prediction mistakes. To go back to the example above, let's make a few assumptions:

- STC only markets to groups it thinks will not be retained, at a cost of £100 per group
- A non-retained group that receives marketing will be convinced to go on a trip after all
- Any group going on a trip (whether retained, or because it receives marketing), brings in a benefit of £1,000

What does this mean for STC's profits?

1. True negative: we market to this group, and it would in fact not have been retained otherwise. The net profit is then £900
2. False negative: we market to this group, even though it would have been retained anyway. The net profit for such a group is £900
3. False positive: we assume the group is retained, so we don't market to it, and we lose it. The net profit here is £0
4. True positive: we correctly assume that the group is retained, and we don't market to it. The net profit here is £1,000

Given this, we can now calculate the profits we would get from the validation customers, using the confusion matrix (we start again with a threshold of 0.7):

In [None]:
threshold = 0.7
y_validation_pred = np.where(y_pred_prob < threshold, 0, 1)
cm = metrics.confusion_matrix(y_validation, y_validation_pred)
profit = cm[0][0] * 900 + cm[1][0] * 900 + cm[0][1] * 0 + cm[1][1] * 1000
print("The profit at threshold " + str(threshold) + " is " + str(profit))

Let's try for the threshold given by the J-statistic:

In [None]:
threshold = thresholds[np.argmax(J)]
y_validation_pred = np.where(y_pred_prob < threshold, 0, 1)
cm = metrics.confusion_matrix(y_validation, y_validation_pred)
profit = cm[0][0] * 900 + cm[1][0] * 900 + cm[0][1] * 0 + cm[1][1] * 1000
print("The profit at threshold " + str(threshold) + " is " + str(profit))

We can see how to optimize this, right?
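One possible way to do this is to sweep over a grid of candidate thresholds and keep the most profitable one. The sketch below uses the profit figures assumed above (£900 for a true or false negative, £0 for a false positive, £1,000 for a true positive) on hand-made data. The helper `profit_at` and the `_demo` arrays are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn import metrics

def profit_at(y_true, prob, threshold):
    """Profit under the assumptions above: TN and FN earn 900, FP 0, TP 1000."""
    y_hat = np.where(prob < threshold, 0, 1)
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted.
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return tn * 900 + fn * 900 + fp * 0 + tp * 1000

# Hand-made labels and probabilities, for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0])
prob_demo = np.array([0.9, 0.8, 0.6, 0.75, 0.3, 0.7])

# Evaluate the profit on a grid of thresholds and pick the best one.
candidates = np.linspace(0.0, 1.0, 101)
profits = np.array([profit_at(y_true_demo, prob_demo, t) for t in candidates])
best_threshold = candidates[np.argmax(profits)]
print("Best threshold:", best_threshold, "with profit", profits.max())
```

On the notebook's data you would pass `y_validation` and `y_pred_prob` instead. Because the threshold is tuned on the validation set, the resulting profit is likely to be somewhat optimistic, which is one reason to keep a separate test set.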

# Part 4: CART for classification

We now move on to using CART for classification. We will again use the Area Under the Curve (AUC) to measure how good our model is.

We start by fitting a Classification Tree to the data with `max_leaf_nodes=8`:

In [None]:
classifier_DT = DecisionTreeClassifier(max_leaf_nodes = 8)
classifier_DT.fit(X_train, y_train)

We can use the code below to plot the tree. Which variables does the tree split on? Are they similar to the ones with the largest coefficients in the logistic regression?

In [None]:
from sklearn.tree import export_graphviz
dot_data = export_graphviz(classifier_DT, feature_names = X_train.columns, filled = True, rounded = True, class_names=["Not Retained","Retained"])
graph = gp.Source(dot_data)
graph

Then we use it to obtain the predicted probabilities of retention (i.e., of class 1) on `X_validation` using `.predict_proba`.

In [None]:
y_pred_prob = classifier_DT.predict_proba(X_validation)[:,1] # probabilities

How good is the model? Compute the AUC for this model and the accuracy using the same threshold as above.

In [None]:
metrics.roc_auc_score(y_validation, y_pred_prob)

In [None]:
# Same convention as in Part 3: predict 1 when the probability is at least the threshold
y_validation_pred = np.where(y_pred_prob < threshold, 0, 1)

In [None]:
metrics.accuracy_score(y_validation,y_validation_pred)

In [None]:
metrics.confusion_matrix(y_validation, y_validation_pred)

# Part 5: Wrapping up

If you were advising this company, what model would you recommend they use? Retrain the model you have selected on the training+validation set, then test it on the test set and report the AUC and profit on the test set. (These could be useful for the company to have.) Comment on the robustness of the approach.
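The retraining step could look roughly like the sketch below, which uses synthetic data and hypothetical `_demo` names. The choice of a decision tree here is arbitrary and not a recommendation; swap in whichever model you selected.

```python
import numpy as np
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_demo_split(n):
    """Synthetic stand-in data: the label depends on the first feature plus noise."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_train_demo, y_train_demo = make_demo_split(120)
X_val_demo, y_val_demo = make_demo_split(40)
X_test_demo, y_test_demo = make_demo_split(40)

# Merge training and validation data before the final fit.
X_full_demo = np.vstack([X_train_demo, X_val_demo])
y_full_demo = np.concatenate([y_train_demo, y_val_demo])

final_model = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)
final_model.fit(X_full_demo, y_full_demo)

# The test set is used once, at the very end, to report performance.
test_prob_demo = final_model.predict_proba(X_test_demo)[:, 1]
test_auc_demo = metrics.roc_auc_score(y_test_demo, test_prob_demo)
print("Test AUC:", round(test_auc_demo, 3))
```

If `X_train` and `X_validation` are pandas DataFrames, `pd.concat([X_train, X_validation])` plays the role of `np.vstack` and keeps the column names.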

Would there be any other recommendations except for the model and the consequent predictions that you would give to the company based on your analyses?