# Activity: Understanding the ROC and PR Curves

In this activity, we'll draw on material from the *Performance Measures* lecture to build our own receiver operating characteristic (ROC) curve and precision-recall (PR) curve.

We'll begin by repeating the steps we took in our last computational exercise. For details on these steps, please refer to [the exercise notebook](https://github.com/mengelhard/bsrt_ml4h/blob/master/notebooks/ce3.ipynb). In that exercise, we quantified performance using accuracy and the area under the ROC curve (AUC). Here, we'll plot the ROC and PR curves on the test set for the predictions made by our logistic regression model.

To get started, run the next 6 code blocks. If you're working in Colab, make sure to uncomment the `!pip install` line in the second block before you run it.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
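# IMPROVES FIGURE AESTHETICS (the style was renamed 'seaborn-v0_8-darkgrid' in Matplotlib 3.6)
style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style)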

In [2]:
### UNCOMMENT AND RUN THIS LINE IF WORKING IN GOOGLE COLAB ###
#!pip install --upgrade scikit-learn

In [3]:
from sklearn.datasets import load_breast_cancer
df, y_true = load_breast_cancer(return_X_y=True, as_frame=True)
y_true = 1 - y_true # let's set benign to 0 and malignant to 1, in keeping with usual conventions

In [4]:
def flip_some_labels(labels, flip_rate=.1, random_seed=0):
    return (labels + (np.random.RandomState(random_seed).rand(len(labels)) < flip_rate)) % 2

y = flip_some_labels(y_true)

In [5]:
X_train = df[:400]
X_test = df[400:]

train_mean = X_train.mean()
train_std = X_train.std()

X_train = (X_train - train_mean) / train_std
X_test = (X_test - train_mean) / train_std

y_train = y[:400]
y_test = y[400:]

In [6]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=0, max_iter=10000)
lr_model.fit(X_train, y_train)

y_test_pred_proba = lr_model.predict_proba(X_test)[:, 1]

## Building the confusion matrix

In this section, we'll fill in the values of the 2x2 confusion matrix corresponding to a specific threshold of your choice.
- In the first block, you'll choose a threshold and make binary (i.e. $\{0, 1\}$-valued) predictions about `y_test` by applying that threshold to `y_test_pred_proba`.
- The second block will calculate the number of true positives for the threshold you've chosen. It is important to think carefully about what is happening in this calculation. In short, we are counting all the cases where both `y_test` and `y_test_pred_label` are 1.
- In the third block, you should calculate the number of false positives, true negatives, and false negatives. Each can be calculated by modifying the line used to calculate the number of true positives.

Once you've completed these steps, you'll have your confusion matrix! Try a few different values of `DECISION_THRESHOLD` to see how each of the four counts changes.

Note: an alternative, useful method to calculate these values is to cross-tabulate `y_test` and `y_test_pred_label` with `pd.crosstab()`.
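
As a minimal sketch of the cross-tabulation approach, the block below applies `pd.crosstab()` to small made-up arrays. The names `toy_true` and `toy_pred` are hypothetical stand-ins for `y_test` and `y_test_pred_label`. Rows index the true label and columns index the predicted label, so each of the four cells holds one confusion-matrix count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels, standing in for y_test and y_test_pred_label
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Rows are true labels, columns are predicted labels
table = pd.crosstab(toy_true, toy_pred, rownames=['true'], colnames=['predicted'])
print(table)

# The cell at (true=1, predicted=1) holds the number of true positives
toy_tp = table.loc[1, 1]
print('True positives:', toy_tp)
```

Comparing the cells of this table against your own counts is a quick way to catch a swapped condition.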

In [7]:
DECISION_THRESHOLD = 0.5

y_test_pred_label = (y_test_pred_proba > DECISION_THRESHOLD).astype(int)

In [8]:
tp = np.sum((y_test == 1) & (y_test_pred_label == 1))

print('There are', tp, 'true positives')

There are 32 true positives


In [9]:
# count false positives
# fp = 

# count true negatives
# tn = 

# count false negatives
# fn = 

# print all of the counts

## Calculate TPR, FPR, and PPV
- The first block below calculates the true positive rate (i.e. sensitivity) based on the counts of true positives and false negatives from the previous block.
- In the second, you should calculate the false positive rate (i.e. 1 - specificity) and positive predictive value (i.e. precision) in a similar manner based on counts from your confusion matrix.
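
One way to check your answers, sketched here on made-up data, is to compare them with scikit-learn's `recall_score` and `precision_score`. The arrays `toy_true` and `toy_pred` are hypothetical; with the notebook's variables you would pass `y_test` and `y_test_pred_label` instead.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels, standing in for y_test and y_test_pred_label
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Recall of the positive class is the true positive rate (sensitivity)
check_tpr = recall_score(toy_true, toy_pred)

# Recall of the negative class is the specificity, so FPR = 1 - specificity
check_fpr = 1 - recall_score(toy_true, toy_pred, pos_label=0)

# Precision of the positive class is the positive predictive value
check_ppv = precision_score(toy_true, toy_pred)

print(check_tpr, check_fpr, check_ppv)
```

If your hand-calculated values disagree with these, the most likely culprit is a mixed-up count in the confusion matrix.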

In [None]:
tpr = tp / (tp + fn)
print('The true positive rate is', tpr)

In [None]:
# calculate the false positive rate
# fpr = 

# calculate the positive predictive value
# ppv = 

# print both values

## Calculate the TPR, FPR, and PPV across a range of thresholds from 0 to 1

Now we get to the fun part. We'd like to calculate the TPR, FPR, and PPV as before, but while moving the threshold between 0 and 1 in very small increments. We can do this with a simple `for` loop. If you haven't used one before, the next block shows how a `for` loop works:

In [None]:
for item in ['keys', 'phone', 'mask']:
    print('Don\'t leave home without your', item, '!')

First, let's define the thresholds we'd like to use. Let's use `np.linspace` to create an array of 1000 evenly spaced values between 0 and 1. To visualize these values, it may help to print them. We'll then be able to use a `for` loop to see what happens when we use each of these values as a threshold.

In [None]:
thresholds = np.linspace(0, 1, 1000)
print(thresholds)

Before starting our `for` loop, let's create empty lists that we'll use to store the TPR, FPR, and PPV values we'll be calculating for each threshold as we move through the loop. We'll then loop over the thresholds; note that each of the indented lines is executed within each loop iteration.

We'll do a few things in our loop:
1. Predict the labels by applying the current threshold value to `y_test_pred_proba`. This line does not need to be modified.
2. Count true positives, false positives, true negatives, and false negatives based on the true labels and predicted labels. You'll need to add your code from previous blocks to do this.
3. Based on these counts, calculate the current TPR, FPR, and PPV. Again, you'll need to add your code from previous blocks to do this.
4. Append the TPR, FPR, and PPV values you just calculated to our growing list of TPRs, FPRs, and PPVs (for all thresholds). These lines do not need to be modified.
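
The threshold-then-append pattern in the steps above can be sketched on a simpler quantity. The example below is illustrative only: `toy_scores`, `toy_thresholds`, and `positive_fractions` are hypothetical names, and the quantity tracked is just the fraction of cases predicted positive, not one of the metrics you'll compute.

```python
import numpy as np

# Hypothetical predicted probabilities, standing in for y_test_pred_proba
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Five evenly spaced thresholds: 0, 0.25, 0.5, 0.75, 1
toy_thresholds = np.linspace(0, 1, 5)

# Empty list that grows by one value per loop iteration
positive_fractions = []

for threshold in toy_thresholds:
    # Step 1: apply the current threshold to get binary predictions
    toy_labels = (toy_scores > threshold).astype(int)
    # Steps 2-4 (simplified): compute one summary value and store it
    positive_fractions.append(toy_labels.mean())

print(positive_fractions)
```

As the threshold rises, fewer cases are predicted positive, so the stored fractions fall from 1 to 0.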

**Important Note:** NumPy will issue a `RuntimeWarning`, because the PPV will be NaN for at least one of your thresholds (*Challenge*: which one?). You can safely ignore this warning.
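
To see where the NaN comes from, here is a minimal sketch with made-up counts that are both zero. The names `toy_tp` and `toy_fp` are hypothetical. Wrapping the division in `np.errstate` is one optional way to silence the warning; the activity does not require it.

```python
import numpy as np

# Hypothetical counts that are both zero
toy_tp = np.int64(0)
toy_fp = np.int64(0)

# 0 / 0 is undefined, so NumPy returns NaN; errstate suppresses the warning here
with np.errstate(invalid='ignore'):
    toy_ppv = toy_tp / (toy_tp + toy_fp)

print(toy_ppv)
```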

In [None]:
# Create empty list for the true positive rates we'll be calculating in our loop
tprs = []

# Create empty list for the false positive rates we'll be calculating in our loop
fprs = []

# Create empty list for the positive predictive values we'll be calculating in our loop
ppvs = []

for threshold in thresholds:
    
    y_test_pred_label = (y_test_pred_proba > threshold).astype(int)
    
    # add your code to calculate true positives, false positives, true negatives, and false negatives
    tp = np.sum((y_test == 1) & (y_test_pred_label == 1))
    #fp = 
    #tn = 
    #fn = 
    
    tpr = tp / (tp + fn)
    #fpr = 
    #ppv = 
    
    tprs.append(tpr)
    fprs.append(fpr)
    ppvs.append(ppv)

## Plot the ROC Curve!

The code below is ready to go! `plt.plot()` is doing all the work here; the remaining lines are just adding axis labels and pulling the axes in a bit.

In [None]:
plt.plot(fprs, tprs)
plt.xlabel('False Positive Rate (i.e. 1 - Specificity)', fontsize=16)
plt.ylabel('True Positive Rate (i.e. Sensitivity)', fontsize=16)
plt.xlim([-.01, 1.01])
plt.ylim([-.01, 1.01])
plt.show()
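
As a sanity check on a hand-built curve, scikit-learn's `roc_curve` and `auc` compute the same quantities directly. The sketch below uses a small made-up example; `toy_true` and `toy_scores` are hypothetical stand-ins for `y_test` and `y_test_pred_proba`.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted probabilities
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve chooses its own thresholds from the distinct score values
toy_fprs, toy_tprs, toy_roc_thresholds = roc_curve(toy_true, toy_scores)

# Area under the curve by the trapezoidal rule
toy_auc = auc(toy_fprs, toy_tprs)
print('AUC:', toy_auc)
```

With the notebook's variables, the analogous call would be `roc_curve(y_test, y_test_pred_proba)`, and plotting its output should closely trace the curve you built by hand.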

## Plot the PR Curve!

In [None]:
plt.plot(tprs, ppvs)
plt.xlabel('Recall (i.e. Sensitivity, TPR)', fontsize=16)
plt.ylabel('Precision (i.e. PPV)', fontsize=16)
plt.xlim([-.01, 1.01])
plt.ylim([-.01, 1.01])
plt.show()
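
The PR curve can be cross-checked in a similar way with scikit-learn's `precision_recall_curve`. This is a sketch on made-up data; `toy_true` and `toy_scores` are hypothetical stand-ins for `y_test` and `y_test_pred_proba`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical true labels and predicted probabilities
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# The returned arrays end with an extra point at precision 1, recall 0
toy_precisions, toy_recalls, toy_pr_thresholds = precision_recall_curve(toy_true, toy_scores)

# Average precision summarizes the PR curve in a single number
toy_ap = average_precision_score(toy_true, toy_scores)
print('Average precision:', toy_ap)
```

Note that scikit-learn defines the final point of the curve as precision 1 at recall 0, whereas the hand-built loop produces NaN where the PPV is undefined, so the two curves may differ at that end.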