# Setting a Threshold for Predictive Models

By [*Andrew Wheeler, PhD*](mailto:andrew.wheeler@hms.com)

With most predictive models we get either a predicted probability, $\hat{p}(x)$, or a continuous-valued prediction, e.g. $\mathbb{E}[y \mid x]$ (ignoring the variance of either prediction for now). This prediction, though, does not directly translate into an action you should take.

There are generally two things you need to take into account when determining *what to do* with the prediction from the model. One is the costs & benefits associated with taking a particular action (the costs when you are wrong, and the benefits when you are right). The second is constraints on what you can do with the information (e.g. human auditors can only review so many cases).

This example focuses only on costs and benefits. I will create a notebook with examples of dealing with constraints in the future.

## Simple Cost Example

Say the predicted probability of an event occurring is 5%. Does that mean you should do nothing, or take some action? It depends. Imagine a scenario where the 5% is the probability that it will rain today, and the decision is whether to take your umbrella with you to work.

Now say that you are buying a used car, and 5% is the probability it will not run when you buy it. 

For the former decision, you probably would not bother carrying the umbrella. For the latter car-buying decision, 5% is a little too high for my tastes for such a large investment.

The difference between these two examples is the *costs* associated with each. If you forget to carry your umbrella, it is annoying, but not that big of a deal. If you sink several thousand dollars into a car that does not work, that is much worse.

In general, we need to figure out the costs of making a bad decision and the benefits of making a good decision to be able to reason about what action we should take given a 5% probability of an outcome occurring. This is as true for individual predictions in our daily lives as it is for evaluating millions of transactions and deciding which ones should be audited.
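
One way to make this comparison concrete is to multiply the probability by the size of the loss. The sketch below does that for the two scenarios. The dollar amounts are made-up placeholders for illustration only, and `expected_loss` is a hypothetical helper name.

```python
def expected_loss(prob, loss):
    """Expected loss from an event with probability `prob` and cost `loss`."""
    return prob * loss

p = 0.05  # the same 5% probability in both scenarios

# Hypothetical loss sizes, chosen only to illustrate the contrast
umbrella_loss = 20    # getting rained on without an umbrella
car_loss = 5000       # buying a car that does not run

print(round(expected_loss(p, umbrella_loss), 2))
print(round(expected_loss(p, car_loss), 2))
```

The probability is identical, but under these assumed losses the expected cost of the car scenario is 250 times larger, which is why the same 5% can justify doing nothing in one case and acting in the other.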

# Mathy Part 1: Weighing True Positives and False Positives

Many of our business models predict a binary outcome, such as fraud or soft-denial. In these cases, crossing the action we take with whether the prediction was right gives four outcomes:

 - Take Action, Prediction Right (True Positive)
 - Take Action, Prediction Wrong (False Positive)
 - Do Not Take Action, Prediction Right (True Negative)
 - Do Not Take Action, Prediction Wrong (False Negative)

We can then assign costs and benefits to each of these cases. Costs are often only associated with the "Take Action" events; doing nothing often has neither costs nor benefits. (Although sometimes you can say the false negatives are potentially leaving money on the table.)

For a concrete example, we can flag a claim to be reviewed by an auditor. It costs the auditors time no matter what, but if we identify a true positive, we gain additional value. 

Besides identifying those costs and benefits, we need one additional piece of information to determine where to set the threshold: the overall prevalence of the outcome in the population under study. I will give an example below of why that is the case, but for now note that for very rare events you will have many more false positives than true positives, even if the model is very good.
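
To make the rare-event point concrete, here is a minimal sketch that holds the quality of a test fixed and only varies prevalence. The 90% sensitivity, 5% false positive rate, and the three prevalence values are illustrative assumptions, and `flagged_counts` is a hypothetical helper.

```python
def flagged_counts(n, prev, sens, fpr):
    """Return expected (true positives, false positives) among n cases."""
    positives = n * prev
    negatives = n - positives
    return positives * sens, negatives * fpr

# Illustrative values: 90% sensitivity and a 5% false positive rate
for prev in (0.10, 0.01, 0.001):
    tp, fp = flagged_counts(100_000, prev, sens=0.90, fpr=0.05)
    share_true = tp / (tp + fp)  # proportion of flagged cases that are real
    print(f"prevalence={prev:.3f}  TP={tp:.0f}  FP={fp:.0f}  share true={share_true:.3f}")
```

With these inputs, at 10% prevalence about two thirds of flagged cases are real, while at 0.1% prevalence false positives outnumber true positives by roughly 55 to 1, even though the test itself has not changed.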

Let's work through a specific example. Say we generated predictions for 1,000 claims, and are attempting to identify fraud. We then have three pieces of information:

 1) We think the overall prevalence of fraud in our sample is around 10%
 2) The Sensitivity of our predictive instrument (the proportion of true cases we capture), is 90%
 3) The False Positive Rate of our predictive instrument is 5%

Given our 1,000 cases, we can break them down into the following categories:

 - 1,000 Total Cases
   - 100 Positive Cases (10% of Total Cases)
     - 90 True Positives (90% of Positive Cases)
     - 10 False Negatives
   - 900 Negative Cases
     - 855 True Negatives
     -  45 False Positives (5% of Negative Cases)
     
Continuing the example, say the *cost* of assigning a case to an auditor is \\$2, and the benefit of identifying a positive case is \\$10 (so a net of \\$8). Using our predictive instrument, we then have the following, where `TP` is true positives and `FP` is false positives:

`Utility = (10-2)*TP + -2*FP = 8*90 + -2*45 = 720 - 90 = 630`
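
The same arithmetic as a short script, starting from the three inputs (prevalence, sensitivity, false positive rate) rather than from the counts. The variable names are my own choices for this sketch.

```python
n = 1000
prev, sens, fpr = 0.10, 0.90, 0.05   # prevalence, sensitivity, false positive rate
cost = -2                            # cost of sending a case to an auditor
benefit = 10 + cost                  # net benefit of a true positive

positives = n * prev                 # positive cases
negatives = n - positives            # negative cases
tp = positives * sens                # true positives
fp = negatives * fpr                 # false positives

utility = benefit * tp + cost * fp
print(round(tp), round(fp), round(utility, 2))
```

Note that the cost is stored as a negative number, so the two terms are added rather than subtracted.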

So we have netted \\$630 in this example. We can almost always change our internal thresholds. Lowering the bar for what is flagged will increase the sensitivity, but also the false positive rate. You can then graph utility across different thresholds using the general formula:

`Utility = N*Prev*Sen*B + N*(1-Prev)*FPR*C`

Where the variables are:

  - `Utility` is the total utility of using a particular threshold
  - `N` is the total number of cases we are evaluating
  - `Prev` is the prevalence of the outcome
  - `Sen` is the sensitivity of the test (e.g. captures 90/100 true positives)
  - `FPR` is the false positive rate of the test (e.g. if you do the test on 100 true negatives, you get 5 false positives)
  - `B` is the net benefit of identifying a true positive (benefit minus review cost, \\$8 above), and `C` is the cost of a false positive, entered as a negative number (-\\$2 above)
  
We can divide both sides by `N`, which makes this a per-capita utility measure.

`Utility/N = Prev*Sen*B + (1 - Prev)*FPR*C`

Prevalence is a fixed property of the sample. Sensitivity and the false positive rate both change with the threshold, and usually not as simple functions of it (although they can be estimated with a hold-out sample). So you typically need a graph (or grid search) to find the threshold with the highest utility.
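
A minimal sketch of such a grid search follows. It is a stand-in written for illustration, not the `util_cut_point` function used later in this notebook, and the operating points (thresholds with their sensitivity and false positive rate) are hypothetical values of the kind you might read off a hold-out ROC curve.

```python
import numpy as np

def utility_curve(sens, fpr, prev, b, c):
    """Per-capita utility at each operating point: Prev*Sen*B + (1-Prev)*FPR*C."""
    sens = np.asarray(sens, dtype=float)
    fpr = np.asarray(fpr, dtype=float)
    return prev * sens * b + (1 - prev) * fpr * c

# Hypothetical operating points, one per candidate threshold
thresholds = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
sens = np.array([0.30, 0.60, 0.80, 0.90, 0.97])
fpr = np.array([0.00, 0.01, 0.03, 0.05, 0.25])

curve = utility_curve(sens, fpr, prev=0.10, b=8, c=-2)
best = int(np.argmax(curve))   # grid search: index of the highest utility
print(np.round(curve, 3))
print(thresholds[best])
```

With these assumed points, the best threshold is the one with 90% sensitivity and a 5% false positive rate, and its per-capita utility of 0.63 matches the \\$630 per 1,000 cases from the worked example.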

Different cases can also have different benefits. Say a case's benefit is related to the total amount of the claim. Then you may want different thresholds for different cases.
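
As a rough illustration of how the best threshold moves with the benefit, the sketch below repeats a grid search for three assumed net benefits. The operating points and the small/medium/large benefit values are hypothetical, and `best_threshold` is a helper name invented for this example.

```python
# Hypothetical operating points: (threshold, sensitivity, false positive rate)
points = [(0.9, 0.30, 0.00), (0.7, 0.60, 0.01), (0.5, 0.80, 0.03),
          (0.3, 0.90, 0.05), (0.1, 0.97, 0.25)]

def best_threshold(net_benefit, cost=-2, prev=0.10):
    """Threshold with the highest per-capita utility for a given net benefit."""
    def utility(point):
        _, sens, fpr = point
        return prev * sens * net_benefit + (1 - prev) * fpr * cost
    return max(points, key=utility)[0]

# Assumed net benefits for small, medium, and large claims
for label, b in [("small", 1), ("medium", 8), ("large", 100)]:
    print(label, best_threshold(b))
```

Under these assumptions, the larger the payoff from a true positive, the lower the threshold that maximizes utility, since more false positives become worth tolerating.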

## Example Use Below


```python
#Libraries used and my defined functions
import pandas as pd
import numpy as np
from sklearn import metrics


#Now importing my own functions I made
import sys
import os
locDir = r'C:\Users\e009156\Documents\DataScience_Notes\WhereToSet_Threshold' #Is there a better way to make this relative?
sys.path.append(locDir)
from cut_point_functions import *

#I don't save anything in this example, so no need to change the directory
print('finished')
```

```
finished
```


# Very Simple Example

```python
###########################################################
#This first part is a simple example 
#of calculating the cost/benefit
#of varying cut-points
###########################################################

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.05, 0.4, 0.9, 0.3, 0.35, 0.8])

#ROC curve metrics: FPR, TPR, and thresholds
#(in practice compute these on testing data; this toy example has only one sample)
#Note: newer scikit-learn versions report the first threshold as inf
#instead of max(scores) + 1
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=1)

##################
#Assign benefits/costs & prevalence and graph that function
co = -2       #cost of a false positive, entered as a negative number
be = 10 + co  #net benefit of a true positive
prev = 0.1    #could also be np.mean(y) to match data here
#################

#Estimating cut point using my function 
my_cut, my_curve = util_cut_point(tp=tpr,fp=fpr,th=thresholds,pr=prev,be=be,co=co,curve=True)   
print("Thresholds, utility curve, and optimal cut point")
print(thresholds)
print(my_curve)
print(my_cut)
    
#Note the cut point will change if different cases have different utility
#Eg maybe a case with a higher claim has more upside, even if prevalence of
#fraud is the same

print("")
print("Cut Point with higher weight for capturing true positives")
print(util_cut_point(tp=tpr,fp=fpr,th=thresholds,pr=prev,be=be+1000,co=co))    

#Estimating utility at the chosen cut point (in practice on a hold out sample,
#here on the same toy data); "my_cut" can be a scalar or a numpy array
tn, fp, fn, tp = metrics.confusion_matrix(y == 1, scores >= my_cut).ravel()
ut = tp*be + fp*co

print("")
print("Confusion Matrix and estimated utility")
print(tn, fp, fn, tp) #confusion matrix
print(ut) #total utility
```

```
Thresholds, utility curve, and optimal cut point
[1.9  0.9  0.8  0.4  0.3  0.05]
[ 0.    0.2   0.4  -0.05  0.35 -1.  ]
0.8

Cut Point with higher weight for capturing true positives
0.3

Confusion Matrix and estimated utility
4 0 2 2
16
```


# Real Data Example

```python
###########################################################
#A LARGER DATA EXAMPLE 
#should do a dataset with a rare outcome to illustrate
###########################################################

###################################
#Adapted from the example here
#https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#Train and test
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=44)
clf = LogisticRegression(penalty='l2', C=0.1)
clf.fit(X_train, y_train)
y_pred_prob = clf.predict_proba(X_test)[:, 1]  #predicted probability of class 1
```


```python
#Getting ROC curve stats
fp_canc, tp_canc, thresh_canc = metrics.roc_curve(y_test, y_pred_prob, pos_label=1)

#Setting weights and estimating prevalence from the data
#Note: in load_breast_cancer the target is coded 0 = malignant, 1 = benign,
#so the "positive" class here is benign
cost = -1
bene = 3 + cost
prev = np.mean(y_train)
print("Prevalence of the positive class (benign) in the Training Data")
print(prev)

bc_cut, bc_curve = util_cut_point(tp=tp_canc,fp=fp_canc,th=thresh_canc,pr=prev,be=bene,co=cost,curve=True)  
print("")
print("Cut Point based on orig data")
print(bc_cut)

tn, fp, fn, tp = metrics.confusion_matrix(y_test == 1, y_pred_prob >= bc_cut).ravel()
ut = tp*bene + fp*cost
print("")
print("tn, fp, fn, tp")
print(tn, fp, fn, tp) #confusion matrix
print(ut) #total utility
####################################
```

```
Prevalence of the positive class (benign) in the Training Data
0.6220472440944882

Cut Point based on orig data
0.09384444559429438

tn, fp, fn, tp
62 6 0 120
234
```


## ToDo 

 - example sensitivity analysis if costs/benefits and/or prevalence have distributions instead of point estimates
   - basically draw the utility curve over many hypotheticals, and you get a polygon of best cost/benefits
   
 - using k-fold cross-validation on training data to estimate error in threshold estimate
   - will still need to pipe in relevant costs/benefits/prevalence estimates

## Other references for cutpoints

 - The `OptimalCutpoints` [R package](https://cran.r-project.org/web/packages/OptimalCutpoints/OptimalCutpoints.pdf) has many metrics
 - Hand article with adjusted AUC measure to incorporate weights and prevalence in population, *Measuring classifier performance: A coherent alternative to the area under the ROC curve*, ([Hand, 2009](http://www.cs.iastate.edu/~cs573x/Notes/hand-article.pdf))
 - My [blog example using relative weights](https://andrewpwheeler.com/2015/05/27/how-wide-to-make-the-net-in-actuarial-tools-false-positives-versus-false-negatives/) instead of absolute