# Entry 25 - Setting thresholds

I don't always want the default threshold for determining the classification (negative or positive) the way I did in <font color='red'>Entry 24</font>. As discussed in the precision / recall tradeoff section of <font color='red'>Entry 23</font>, sometimes there will be a better threshold.

## The Problem

*[Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)* points out that not all models provide a realistic representation of uncertainty. If overfitted, a random forest model will be 100% certain for every prediction, even if it's almost never right.

I need a way to determine the best threshold for my purposes for the specific model I'm looking at and a way to compare various models to determine which one is most appropriate for the use case.

## The Options

### Precision and recall vs thresholds

Precision and recall are plotted as their own lines on the y-axis against threshold on the x-axis in [Hands-On Machine Learning with Scikit-Learn & TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291).

This is the view I used in <font color='red'>Entry 25</font> to illustrate the precision / recall tradeoff. It's very good for visualizing the tradeoff between precision and recall and where the intersection of that tradeoff lies.

The best illustration of it on the smaller real world datasets I ran is in <font color='red'>Entry 25c notebook - Titanic</font>.

I also created a notebook that used a lot of code from *Hands-On Machine Learning with Scikit-Learn & TensorFlow* in <font color='red'>Entry 25e notebook - MNIST</font>. I did this to illustrate the difference between the PR AUC and ROC AUC, but it is also a better example of precision and recall vs thresholds. It has significantly more data, so the lines are smoother.
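To make the mechanics behind that chart concrete, here is a minimal hand-rolled sketch of what gets plotted: precision and recall recomputed at a few thresholds. The labels and scores are made up for illustration, and `precision_recall_at` is a hypothetical helper name. In practice `sklearn.metrics.precision_recall_curve` does this sweep over every candidate threshold.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention used here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.65, 0.80, 0.90]

for threshold in (0.2, 0.5, 0.75):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold pushes precision up and recall down, which is the tradeoff the two lines on the chart show.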

### Precision-recall (PR) curve

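Precision is plotted on the y-axis with recall on the x-axis. The pretty, theoretical, lots-of-data-behind-it line starts with precision at 1 in the upper left corner, stays high as recall grows, and then drops off toward the lower right as recall approaches 1. At a recall of 1 everything is predicted positive, so precision bottoms out at the share of positives in the data rather than exactly at 0. An example is in <font color='red'>Entry 25e notebook - MNIST</font>.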

In the smaller datasets the lines are messy, bumpy, and don't start or stop exactly at 1 or 0.

### Precision-recall area under the curve (PR AUC)

The area under the curve (AUC) is exactly what it sounds like. The PR curve divides the chart into two sides. The closer the curve is to the upper right, the more space will be under that curve. AUC calculates the area that lies underneath the curve to provide a general metric for how well the model performs.

PR AUC is the area under the precision / recall curve.

### Receiver operating characteristic (ROC) curve

This plots the true positive rate (also known as recall and sensitivity) on the y-axis and the false positive rate on the x-axis (false positive rate can be represented mathematically as $1 - specificity$). This provides the sensitivity vs specificity comparison that [Applied Predictive Modeling](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn-ebook/dp/B00K15TZU0) recommends.
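As a quick sanity check on how the two ROC axes relate to the confusion matrix, here is a small sketch with made-up counts. The function name `rates` and the numbers are assumptions for illustration only.

```python
def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, false positive rate) from raw counts."""
    sensitivity = tp / (tp + fn)  # true positive rate, also called recall
    specificity = tn / (tn + fp)  # true negative rate
    fpr = fp / (fp + tn)          # false positive rate
    return sensitivity, specificity, fpr


# Hypothetical confusion matrix counts
sensitivity, specificity, fpr = rates(tp=40, fp=10, tn=90, fn=10)
print(sensitivity, specificity, fpr)

# The x-axis of the ROC curve is the complement of specificity
assert abs(fpr - (1 - specificity)) < 1e-12
```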

As long as the model predicts better than random guessing, the ROC plot bows above the diagonal and looks roughly like a logarithmic growth curve. Per *Introduction to Machine Learning with Python*:

> Predicting randomly always produces an AUC of 0.5, no matter how imbalanced the classes in a dataset are. This makes AUC a much better metric for imbalanced classification problems than accuracy.

The better the model, the closer the curve will be to the upper left hand corner of the plot. Random guessing results in a straight diagonal line that runs from the lower left to the upper right. I included the random guessing line in the ROC charts of all five of the notebooks.

### ROC AUC

The AUC portion of ROC AUC works the same as in PR AUC - area under the curve literally measures what fraction of the chart falls under the curve. The difference between PR AUC and ROC AUC is that the curve used as the dividing line is the ROC curve instead of the PR curve. It's really that easy.
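To show that "area under the curve" really is just an area, here is a rough sketch that builds ROC points by hand and sums trapezoids under them. The data is invented, the helper names are hypothetical, and the sweep assumes no tied scores. In practice `sklearn.metrics.roc_auc_score` (or `sklearn.metrics.auc` on curve points) is the tool I'd reach for.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points as the threshold sweeps from high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Walk down the ranking; each observation moves the curve up or right.
    # Assumes no tied scores, for simplicity.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def trapezoid_area(points):
    """Sum the trapezoids under consecutive (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.65, 0.80, 0.90]

roc_auc = trapezoid_area(roc_points(y_true, scores))
print(roc_auc)
```

The same `trapezoid_area` idea applies to PR AUC if the points are (recall, precision) pairs instead.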

## The Proposed Solution

### ROC AUC vs PR AUC

*Hands-On Machine Learning with Scikit-Learn & TensorFlow* advises:

> As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise.

*Applied Predictive Modeling* states:

> One advantage of using ROC curves to characterize models is that, since it is a function of sensitivity and specificity, the curve is insensitive to disparities in the class proportions (Provost et al. 1998; Fawcett 2006).

Then of course there's also the quote above from *Introduction to Machine Learning with Python*.

I discussed this conflicting information with my coworker [Sabber](https://medium.com/@sabber). He agreed that both metrics are good for imbalanced classes, but that the best choice for imbalanced classes comes down to the definitions of the underlying metrics.

The PR curve is based on precision and recall. As a reminder, *precision* is the rate of correct positive predictions out of all positive predictions (based on prediction population) and *recall* is the rate of correct positive predictions out of all positive observations (based on observed population). In the formulas below, *PP* is the count of predicted positives, *AP* is the count of actual positives, and *AN* is the count of actual negatives.

$precision = \frac{TP}{TP+FP} = \frac{TP}{PP}$

$recall = \frac{TP}{TP + FN} = \frac{TP}{AP}$


The ROC curve is based on sensitivity and specificity. As a reminder, *sensitivity* is another name for recall and *specificity* is the rate of correct negative predictions out of all negative observations (based on observed population).

$specificity = \frac{TN}{TN + FP} = \frac{TN}{AN}$

Based on the recap above, it's easy to see that the PR curve considers true positive (TP), false positive (FP), and false negative (FN), whereas the ROC curve considers all four: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).

Because the ROC curve includes true negatives, this means that a majority class can skew the results if the model is good at identifying the majority class.

*Hands-On Machine Learning with Scikit-Learn & TensorFlow* had a nice example of this using the [MNIST digits dataset](https://www.openml.org/d/554). The author, Aurelien Geron, predicted whether a handwritten digit was a `5`, which turned the dataset into a binary classification problem.

I replicated his example using the pipeline I've been developing and the same dataset. The difference is subtle but noticeable when looking at the charts. The ROC AUC returns a higher evaluation of the model's performance than the PR AUC does.
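The gap is easier to see on a tiny, deliberately imbalanced toy example than on MNIST. The sketch below is not the MNIST experiment. It uses invented scores with 2 positives among 20 observations, hypothetical helper names, and average precision as a step-wise stand-in for PR AUC (the summary that `sklearn.metrics.average_precision_score` computes).

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Hypothetical imbalanced data: 2 positives among 20 observations,
# ranked 2nd and 5th by the model
scores = [1.0 - 0.05 * i for i in range(20)]  # strictly decreasing scores
y_true = [0] * 20
y_true[1] = y_true[4] = 1

roc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"ROC AUC={roc:.3f} average precision={ap:.3f}")
```

Here the many correctly ranked negatives lift ROC AUC to about 0.89, while average precision sits at 0.45 because only the handful of observations ranked near the positives matter to it.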

### Decision

I ran all of the metrics and made graphs for all the options listed above. Ultimately, to compare models I believe I'll settle on PR AUC for imbalanced datasets and ROC AUC for all other datasets.

## The Fail

### Validation

*Applied Predictive Modeling* and *Introduction to Machine Learning with Python* both make a point of stating that choosing a threshold should be done on a separate validation set - not the training set or the test set used to evaluate performance.

I'm not exactly sure what this means or how to go about it. When using cross-validation, the test set within the cross-validation splits would be the test set used to evaluate performance. But I don't really want to use my hold-out test set to set the threshold. If I did, I'd be out of test sets that the model hadn't seen and thus wouldn't have a way to verify the results.
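One reading of that advice, which I haven't verified against either book's code, is a three-way split: fit on a training split, choose the threshold on a separate validation split, then report once on the untouched test split. The sketch below only covers the threshold step. The scores are invented stand-ins for model output, the helper names are hypothetical, and F1 is an assumed selection criterion.

```python
def f1_at(y_true, scores, threshold):
    """F1 score when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def pick_threshold(y_val, val_scores):
    """Choose the candidate threshold with the best F1 on validation data."""
    candidates = sorted(set(val_scores))
    return max(candidates, key=lambda t: f1_at(y_val, val_scores, t))


# Hypothetical model outputs; in practice these would come from a model
# fit on the training split only.
y_val = [0, 0, 1, 0, 1, 1]
val_scores = [0.2, 0.3, 0.45, 0.5, 0.7, 0.9]
y_test = [0, 1, 0, 1, 1, 0]
test_scores = [0.1, 0.4, 0.55, 0.6, 0.8, 0.3]

threshold = pick_threshold(y_val, val_scores)       # chosen on validation only
test_f1 = f1_at(y_test, test_scores, threshold)     # reported once on test
print(threshold, round(test_f1, 3))
```

The test split never influences the threshold, so its score is still an honest check.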

### Pulling data

Figuring out how to get the data from openml.org was surprisingly difficult. All the datasets are saved as arff files, an ASCII text file type developed for use with Weka. My tool of choice, `pandas`, doesn't have a native way to load arff files. Enter the `openml` package and the `fetch_openml` function in the `sklearn.datasets` module.

`fetch_openml` needs the name or id of the dataset on openml.org. Unfortunately, I couldn't find either of those things listed on the dataset's page on the site. So I moved to the `openml` package. That package was throwing errors, which went away when I separated the data pull from the rest of my code to troubleshoot and copied and pasted the same code into a different Jupyter notebook. So there was that.

When trying to figure out exactly what was being returned and my options for working with it, I discovered the documentation was lacking in a list of available functions and explanations of what they do. Jupyter to the rescue. I used the `tab` complete functionality to see a list of available methods and managed to wind my way to the solution.

By using `openml` I can get the names and ids, then I can use either package to pull the actual data. I feel like I'd found the name and version on the openml.org page for a specific dataset, but couldn't find it for the life of me this time. Oh well, using the packages is more programmatic and as an added bonus I can now look for specific types of datasets, like imbalanced binary classification problems.
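For what it's worth, the MNIST link earlier in this entry ends in `/d/554`, and that trailing number looks like the id that `fetch_openml` accepts as `data_id`. Below is a small sketch built on that assumption. Both helper names are hypothetical, the URL pattern is only what I see in that one link, and the download itself is left commented out because it needs network access.

```python
from urllib.parse import urlparse


def openml_id_from_url(url):
    """Extract the numeric id from a URL shaped like https://www.openml.org/d/554."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "d" and parts[1].isdigit():
        return int(parts[1])
    raise ValueError(f"no dataset id found in {url!r}")


def load_openml(data_id):
    """Download a dataset by id (needs scikit-learn, pandas, and network access)."""
    # Imported lazily so the URL helper works without scikit-learn installed
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(data_id=data_id, as_frame=True)
    return bunch.data, bunch.target


mnist_id = openml_id_from_url("https://www.openml.org/d/554")
print(mnist_id)
# X, y = load_openml(mnist_id)  # uncomment to actually download the data
```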

### Resources

- [Applied Predictive Modeling](https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn-ebook/dp/B00K15TZU0)
- [Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)
- [Hands-On Machine Learning with Scikit-Learn & TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291)

#### Profit

This was thrown into *Applied Predictive Modeling* as an example of a non-accuracy based criterion. It allocates a gain from TP and costs to FP and FN to assign a dollar amount. The example in the book was for a direct mail campaign. `x` amount was expected to be gained by customers that responded to the mailer, `y` was spent on each mailer, and `z` was the amount lost for mailers not sent to customers that would have responded.

The same basic equation can be used from a savings perspective in use cases such as fraud. `x` would be the expected savings for each case of fraud successfully identified and stopped, `y` would be costs like customers lost due to increased inconvenience, and `z` would be the amount lost for each fraudulent case gone undetected.

$profit = xTP - yFP - zFN$

An equation like this has potential for helping to set a threshold for the prediction.
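As a rough sketch of how that could work, the code below scores each candidate threshold with the profit equation and keeps the most profitable one. The dollar amounts, labels, and scores are all invented, and the helper names are hypothetical. `gain_tp`, `cost_fp`, and `cost_fn` play the roles of `x`, `y`, and `z`.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, and FN when scores >= threshold are called positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
    return tp, fp, fn


def profit(y_true, scores, threshold, gain_tp, cost_fp, cost_fn):
    """profit = x*TP - y*FP - z*FN at the given threshold."""
    tp, fp, fn = confusion_counts(y_true, scores, threshold)
    return gain_tp * tp - cost_fp * fp - cost_fn * fn


# Hypothetical data and dollar amounts, for illustration only
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.65, 0.80, 0.90]
amounts = dict(gain_tp=100, cost_fp=20, cost_fn=50)

best_threshold = max(
    sorted(set(scores)),
    key=lambda t: profit(y_true, scores, t, **amounts),
)
best_profit = profit(y_true, scores, best_threshold, **amounts)
print(best_threshold, best_profit)
```

With a missed positive costing more than a false alarm, the most profitable threshold on this toy data ends up fairly low.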

#### Probability cost function (PCF)

$PCF = \frac{P \times C(fn)}{P \times C(fn) + (1 - P) \times C(fp)}$

Where:

- *P* is the (prior) probability of the event
- *C(fn)* is the cost of a false negative (positive observation predicted as a negative)
- *C(fp)* is the cost of a false positive (negative observation predicted as a positive)

#### Normalized expected cost (NEC)

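$NEC = PCF \times (1-TPR) + (1-PCF) \times FPR$

Here *TPR* and *FPR* are the true positive rate and false positive rate, not the raw counts used in the profit equation.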
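Here is a small worked sketch of the two formulas together. It assumes the Drummond and Holte style normalization, where the denominator of PCF is the total expected cost $P \times C(fn) + (1 - P) \times C(fp)$, and it treats the TP and FP terms in NEC as rates. The function names and all the numbers are made up for illustration.

```python
def pcf(p, cost_fn, cost_fp):
    """Probability cost function: share of expected cost due to false negatives.

    Assumes the denominator is p * cost_fn + (1 - p) * cost_fp.
    """
    return p * cost_fn / (p * cost_fn + (1 - p) * cost_fp)


def nec(pcf_value, tpr, fpr):
    """Normalized expected cost from the true and false positive rates."""
    return pcf_value * (1 - tpr) + (1 - pcf_value) * fpr


# Hypothetical inputs: 20% event rate, false negatives cost 4x false positives
pcf_value = pcf(p=0.2, cost_fn=4, cost_fp=1)
nec_value = nec(pcf_value, tpr=0.75, fpr=0.25)
print(pcf_value, nec_value)
```

Lower NEC is better, so computing it at several thresholds would be another way to pick one.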