## Overview

In the previous objective, we examined the probability threshold a classifier uses when determining to which class an observation belongs. We can extend this concept and look at something called the *receiver operating characteristic*, which is usually plotted as a curve and called the ROC curve.

First, we'll return to the counts of true and false positives and negatives and look at two related measurements: the true positive rate (TPR) and the false positive rate (FPR).

$$ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

$$ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives}+\text{True Negatives}}$$

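Each rate normalizes a count by the size of the corresponding actual class. The TPR is the fraction of actual positives that the model correctly predicts as positive, and the FPR is the fraction of actual negatives that the model incorrectly predicts as positive.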
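As a quick worked example of the two formulas, the sketch below counts the four outcomes by hand for a small set of labels and predictions. The arrays are made up for illustration, and the names `tpr_example` and `fpr_example` are hypothetical.

```python
import numpy as np

# Hypothetical labels and class predictions, made up for illustration
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Count each of the four outcomes
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

tpr_example = tp / (tp + fn)  # 3 of the 4 actual positives were caught
fpr_example = fp / (fp + tn)  # 1 of the 6 actual negatives was flagged

print(f"TPR = {tpr_example:.3f}, FPR = {fpr_example:.3f}")
# TPR = 0.750, FPR = 0.167
```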

When we create a ROC curve, we are plotting the TPR against the FPR for a range of threshold values. In the next section, we'll use the scikit-learn `roc_curve()` function to do the calculations for us. From the resulting data, we'll create a plot.
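To see what "for a range of threshold values" means in practice, here is a minimal sketch that computes one (TPR, FPR) pair per threshold by hand. The data and the helper `rates_at_threshold` are hypothetical and only illustrate the idea; this is not how scikit-learn implements the calculation internally.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, made up for illustration
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

def rates_at_threshold(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Each threshold contributes one point to the ROC curve
for threshold in [0.8, 0.5, 0.2]:
    tpr_t, fpr_t = rates_at_threshold(y_true, y_score, threshold)
    print(f"threshold={threshold}: TPR={tpr_t:.2f}, FPR={fpr_t:.2f}")
```

Lowering the threshold labels more observations as positive, so both rates can only stay the same or increase. This is why the curve runs from the bottom left to the top right.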


## Follow Along

In order to plot a ROC curve, we need some data and a classifier model fit to that data. Let's use the same data from the previous objective and then create the ROC curve.

In [1]:
# Load modules
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Create the data (features, target)
X, y = make_classification(n_samples=10000, n_features=5,
                          n_classes=2, n_informative=3,
                          random_state=42)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create and fit the model
logreg_classifier = LogisticRegression().fit(X_train, y_train)

# Predicted probabilities for the positive class (column 1)
y_pred_prob = logreg_classifier.predict_proba(X_test)[:,1]

In [2]:
# Create the data for the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# See the results in a table
roccurve_df = pd.DataFrame({
    'False Positive Rate': fpr, 
    'True Positive Rate': tpr, 
    'Threshold': thresholds
})

roccurve_df.head()

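| | False Positive Rate | True Positive Rate | Threshold |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.999969 |
| 1 | 0.0 | 0.000786 | 0.999969 |
| 2 | 0.0 | 0.291438 | 0.983222 |
| 3 | 0.000815 | 0.291438 | 0.983049 |
| 4 | 0.000815 | 0.360566 | 0.970583 |

The first row is a placeholder that `roc_curve` adds so the curve starts at (0, 0). Its threshold lies above every predicted probability (here, the maximum score plus one), so no observation is predicted positive.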


In [3]:
# Plot the ROC curve
import matplotlib.pyplot as plt

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], ls='--')  # diagonal reference line
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.show()

![mod4_obj4_ROC.png](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_2/sprint_2/mod4_obj4_ROC.png)

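The curve for this model rises steeply toward the top-left corner, which indicates good performance. In general, the better a model is, the higher the curve and the greater the area under the curve (called the AUC). The maximum value for the AUC is one, while the dashed diagonal, which corresponds to random guessing, has an AUC of 0.5. While we can "eyeball" the area under our curve, scikit-learn also provides a function to calculate the AUC.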

In [4]:
# Calculate the area under the curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred_prob)

0.9419681927513379
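As a cross-check that this score really is the area under the plotted curve, the sketch below integrates the ROC points directly with `sklearn.metrics.auc`, which applies the trapezoidal rule, and compares the result with `roc_auc_score`. The toy labels and scores are made up for illustration, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities, made up for illustration
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

# Points on the ROC curve for the toy data
fpr_toy, tpr_toy, _ = roc_curve(y_true, y_score)

# Area under those points via the trapezoidal rule
area_trapezoid = auc(fpr_toy, tpr_toy)

# The same quantity computed directly from labels and scores
area_direct = roc_auc_score(y_true, y_score)

# A scorer that ranks every positive above every negative reaches the maximum
area_perfect = roc_auc_score(y_true, y_true)

print(area_trapezoid, area_direct, area_perfect)
```

The first two values agree, and the third equals 1.0, the maximum mentioned above.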

## Challenge

Using a different data set for classification, see if you can construct the ROC curve. Or with the same data set generated above, try using a different classifier, such as a decision tree, and plot the ROC curve and calculate the AUC. Which model performs better?

## Additional Resources

* [Scikit-learn Guide: ROC](https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characterist)
* [Scikit-learn: ROC Curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)
* [Scikit-learn: ROC AUC Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)