In [1]:
from __future__ import division
import numpy as np
import itertools
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate



In [1]:
# Use the functions from another notebook in this notebook
%run SharedFunctions.ipynb



# Diagnostic Tests

In the *From_Probability_to_Statistical_Inference* notebook we learned about the classical statistics recipe for accepting or rejecting a hypothesis. For example, we assumed a coin had a bias of, say, 0.3 towards heads (the hypothesis), and calculated the p-value of the hypothesis based on evidence that consisted of the number of heads observed when the coin was tossed a certain number of times. We also looked at how we could actually get what we needed -- the probability of the *hypothesis* given the evidence. Recall that the classical p-value recipe skirts this calculation: it calculates the probability of the *evidence* given the hypothesis and then tries to make the most of it. Bayes' rule gives us the first probability, and it's the one we really need.

In the notebook titled *Why_Significance_Tests_Are_Useless* we saw all of the reasons for *not* doing things the p-value way. However, this technique continues to be widely used for prediction, and prediction has a number of associated concepts that form the core of statistical literacy. In this notebook we'll cover these concepts.

The central concept we'll talk about is the *diagnostic test*. Think of this as a prediction; to make it concrete it can be the prediction that a coin has bias 0.6 towards heads.

## The Diagnostic Test Table

In [3]:
# Create a prediction table
def createPredTable(colNames, cellVals):
    # Every row in cellVals must have exactly one entry per column in colNames
    for cellVal in cellVals:
        if len(colNames) != len(cellVal):
            raise ValueError("Every element of cellVals must be the same length as colNames")
        
    predTable = tabulate(cellVals, colNames, tablefmt = "grid")
    
    return predTable

In [4]:
# Create a 2x2 prediction table
colNames = ["", "Is Really X", "Is Really NOT X"]
cellNames = [["Predict X", "True Positive (TP)", "False Positive (FP)"], 
             ["Predict NOT X", "False Negative (FN)", "True Negative (TN)"]]
predTable = tabulate(cellNames, colNames, tablefmt = "grid")
print(predTable)

+---------------+---------------------+---------------------+
|               | Is Really X         | Is Really NOT X     |
+===============+=====================+=====================+
| Predict X     | True Positive (TP)  | False Positive (FP) |
+---------------+---------------------+---------------------+
| Predict NOT X | False Negative (FN) | True Negative (TN)  |
+---------------+---------------------+---------------------+


In general, we may not know if our prediction is true or false. Sometimes we can get this information. For example, the p-value recipe gives us a diagnostic test for whether a coin has a certain bias, let's say 0.6 towards heads. Suppose I manufacture a device for testing if the coin has a bias of 0.6. As a manufacturer of this device, I owe it to you to have tested several coins of known bias, some with a bias of 0.6 and some without, and to give you a table with numbers that look like the following. That's what you should expect from the manufacturer of any diagnostic test, whether it be a pregnancy test or a test to sift out terrorists from non-terrorists in an airport check-in line.

The prediction table for our coin-bias device could look like the following:

In [5]:
print(createPredTable(["", "Bias Really Is 0.6", "Bias Really Is NOT 0.6"],
                [["Predict Bias is 0.6", 493, 7], 
                 ["Predict Bias is NOT 0.6", 307, 193]
                ]
               )
     )

+-------------------------+----------------------+--------------------------+
|                         |   Bias Really Is 0.6 |   Bias Really Is NOT 0.6 |
+=========================+======================+==========================+
| Predict Bias is 0.6     |                  493 |                        7 |
+-------------------------+----------------------+--------------------------+
| Predict Bias is NOT 0.6 |                  307 |                      193 |
+-------------------------+----------------------+--------------------------+


## Power and Significance

These are the numbers that the manufacturer of the diagnostic test *owes* us. If you manufacture a device that classifies things into two or more types, you have the responsibility to produce such a table so your customers can make an informed decision about how good your device is. 

In this case the test's manufacturer has put 800 coins (493 + 307) *with known bias of 0.6* through their device and found that the device correctly identified 493 of the coins as having a bias of 0.6. However, the device *misidentified* 307 coins by predicting that their bias is not 0.6 when in fact it was. So the manufacturer has given us the True Positive Rate: 493/800. The remaining 200 coins (7 + 193) were known *not* to have a bias of 0.6, and they give us the rates for that group. To write this out more fully,

- $P(predict\ X\ |\ X) = TP/(TP + FN) = 493/800$ a.k.a. True Positive Rate, $1-\beta$, Power, Sensitivity
- $P(predict\ NOT\ X\ |\ X) = FN/(TP + FN) = 307/800$ a.k.a. False Negative Rate, $\beta$
- $P(predict\ X\ |\ NOT\ X) = FP/(FP + TN) = 7/200$ a.k.a. False Positive Rate, $\alpha$, Significance Level
- $P(predict\ NOT\ X\ |\ NOT\ X) = TN/(FP + TN) = 193/200$ a.k.a. True Negative Rate, $1-\alpha$, Selectivity, Specificity
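
The four rates in the list above can be recomputed directly from the counts in the coin-bias table. This is only a small sketch: the helper name `rates` and the dictionary keys are made up for illustration and are not part of *SharedFunctions.ipynb*.

```python
# Counts copied from the coin-bias prediction table
tp, fp = 493, 7      # device predicted "bias is 0.6"
fn, tn = 307, 193    # device predicted "bias is NOT 0.6"

def rates(tp, fp, fn, tn):
    """Return the four conditional rates as a dict (hypothetical helper)."""
    pos = float(tp + fn)   # coins whose bias really is 0.6
    neg = float(fp + tn)   # coins whose bias really is NOT 0.6
    return {
        "TPR": tp / pos,   # power, sensitivity, 1 - beta
        "FNR": fn / pos,   # beta
        "FPR": fp / neg,   # alpha, significance level
        "TNR": tn / neg,   # 1 - alpha
    }

r = rates(tp, fp, fn, tn)
for name in ["TPR", "FNR", "FPR", "TNR"]:
    print("{0}: {1:.3f}".format(name, r[name]))
```

Each rate divides by a column total, never by a row total; that is what "given X" and "given NOT X" mean in the formulas.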

Put aside how the manufacturer knows the biases of the coins - but keep in mind that they would *have to* have such a known set of coins in order to calibrate their device. It seems like this device we have is very good at saying when the bias is *not* 0.6 but does quite badly at identifying the coins that *do* have a bias of 0.6.

Right away, we can see that there's a tight relationship between the True Positive Rate and the False Negative Rate. If we start with a fixed number of coins of known bias of 0.6, then as the True Positive Rate goes up the False Negative Rate *must* come down (and vice versa). Similarly, when the False Positive Rate goes up, the True Negative rate *must* come down (and vice versa). Mathematical consistency demands it.
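
One way to see why consistency demands this: two rates that condition on the same truth share a denominator, so they must sum to one. Written out with the counts from the coin-bias table:

```latex
$$P(predict\ X\ |\ X) + P(predict\ NOT\ X\ |\ X) = \frac{TP}{TP + FN} + \frac{FN}{TP + FN} = \frac{493}{800} + \frac{307}{800} = 1$$

$$P(predict\ X\ |\ NOT\ X) + P(predict\ NOT\ X\ |\ NOT\ X) = \frac{FP}{FP + TN} + \frac{TN}{FP + TN} = \frac{7}{200} + \frac{193}{200} = 1$$
```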

We can also see that there is no mathematical relationship between the True Positive Rate and the True Negative Rate -- knowing something about one tells us nothing about the other. If I told you that my device had high Sensitivity, it's anyone's guess what its Selectivity is going to be -- that depends on the details of how the device goes about doing what it does. It would be great to build the device to have high rates for both Sensitivity and Selectivity.

Similarly, there is no mathematical relationship between the False Positive Rate and the False Negative Rate -- a device could be great at one and terrible at the other. (Can it be terrible at both? Great at both?)

We can think of the p-value recipe for determining whether a coin has a certain bias as a kind of detector. If you remember, the significance level determines the threshold p-value for rejecting the hypothesis. A threshold of 0.05 means that a hypothesis that is actually true will be falsely rejected 5% of the time. If we count a rejection as the detector's "positive" call, this is the false positive rate, and the true negative rate would then be 95%.

"A big difference between power and $\alpha$ is that $\alpha$ is something you just decide on. Power ... depends on variation and sample size. Variation is a natural part of the world and you can't change it, but you can choose a sample size for your study." (Vickers, p.73) 
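
The quote can be checked numerically for the coin example. The sketch below assumes one particular version of the p-value recipe, an exact two-sided binomial test, which may differ from the recipe used in the earlier notebook. The function names `pmf_table` and `rejection_rate`, the alternative bias of 0.5, and the sample sizes are all choices made here for illustration.

```python
from math import factorial

def pmf_table(n, p):
    """Probabilities of 0..n heads in n tosses of a coin with bias p."""
    return [factorial(n) // (factorial(k) * factorial(n - k))
            * (p ** k) * ((1 - p) ** (n - k)) for k in range(n + 1)]

def rejection_rate(n, p0, p_true, alpha=0.05):
    """Chance that the test rejects 'bias is p0' when the bias is really p_true."""
    null = pmf_table(n, p0)       # what the hypothesis says should happen
    truth = pmf_table(n, p_true)  # what actually happens
    rate = 0.0
    for k in range(n + 1):
        # Two-sided p-value: total probability, under the hypothesis, of all
        # outcomes that are no more likely than the observed count k
        p_val = sum(q for q in null if q <= null[k] * (1 + 1e-9))
        if p_val < alpha:
            rate += truth[k]
    return rate

for n in [20, 50, 100, 200]:
    false_rejects = rejection_rate(n, 0.6, 0.6)  # hypothesis is true
    power = rejection_rate(n, 0.6, 0.5)          # hypothesis is false
    print("n={0:3d}  rejects a true hypothesis: {1:.3f}  power: {2:.3f}".format(
        n, false_rejects, power))
```

The rate of rejecting a true hypothesis stays at or below the chosen $\alpha$ for every sample size, while the chance of catching a coin whose bias is really 0.5 is much higher at 200 tosses than at 20.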



When we build a detector, a device that conducts a diagnostic test, we are acutely aware of the detector's intended purpose.

## Interpreting the Results of Diagnostic Tests

What we really want is not $P(predict\ X\ |\ X)$ but $P(X\ |\ predict\ X)$.
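
Bayes' rule turns the manufacturer's rates into the probability we want, but only once we supply a prior $P(X)$, the share of coins that really have a bias of 0.6. The sketch below uses the rates from the coin-bias table. The function name and the list of priors are hypothetical; only 0.8 corresponds to the manufacturer's own test set (800 of 1000 coins).

```python
def prob_x_given_predict_x(tpr, fpr, prior):
    """Bayes' rule: P(X | predict X) from the test's rates and the prior P(X)."""
    hit = tpr * prior                   # P(predict X and X)
    false_alarm = fpr * (1.0 - prior)   # P(predict X and NOT X)
    return hit / (hit + false_alarm)

tpr = 493 / 800.0   # P(predict X | X) from the table
fpr = 7 / 200.0     # P(predict X | NOT X) from the table

# Hypothetical priors: how common coins with bias 0.6 are among the coins we test
for prior in [0.8, 0.5, 0.1, 0.01]:
    posterior = prob_x_given_predict_x(tpr, fpr, prior)
    print("P(X) = {0:.2f}  ->  P(X | predict X) = {1:.3f}".format(prior, posterior))
```

With a prior of 0.8 the answer is 493/500, the same as reading across the top row of the table. When coins with a bias of 0.6 are rare (a prior of 0.01), the same device's positive call is right only about 15% of the time.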

accuracy = (tp + tn)/(tp + tn + fp + fn) (Of all the things we identified, both positive and negative, what fraction were correctly identified? It's not a completely vacuous metric -- but it doesn't tell the whole story. I can game this metric by ... )

precision = tp/(tp + fp)

recall = tp / (tp + fn)

F1 score = (2 * precision * recall)/(precision + recall)
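
As a quick check, these four summary metrics can be computed for the coin-bias table. The function name `summary_metrics` is made up for this sketch, and accuracy is computed with the whole total (tp + tn + fp + fn) in the denominator.

```python
def summary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for a 2x2 prediction table."""
    total = float(tp + fp + fn + tn)
    accuracy = (tp + tn) / total
    precision = tp / float(tp + fp)   # of the "predict X" calls, how many were right
    recall = tp / float(tp + fn)      # of the real X cases, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts copied from the coin-bias prediction table
accuracy, precision, recall, f1 = summary_metrics(tp=493, fp=7, fn=307, tn=193)
print("accuracy  = {0:.3f}".format(accuracy))
print("precision = {0:.3f}".format(precision))
print("recall    = {0:.3f}".format(recall))
print("F1 score  = {0:.3f}".format(f1))
```

Recall is the same quantity as the True Positive Rate, and precision is $P(X\ |\ predict\ X)$ for the particular mix of coins in the table.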

to continue...