# DATASCI 503, Group Work 3: ROC Curves and Logistic Regression

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. **During lab, feel free to flag down your GSI to ask questions at any point!**

### Introduction to Logistic Regression

In this lab, we will experiment with logistic regression models. The data is synthetic: we generate it by sampling from two multivariate Gaussian distributions, one per class.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

%matplotlib inline

np.random.seed(12)
num_observations = 500

x1 = np.random.multivariate_normal([0, 0], [[1, 0.75], [0.75, 1]], num_observations)
x2 = np.random.multivariate_normal([0, 2], [[1, 0.75], [0.75, 1]], num_observations)

simulated_separableish_features = np.vstack((x1, x2)).astype(np.float32)
simulated_labels = np.hstack(
    (np.zeros(num_observations, dtype=int), np.ones(num_observations, dtype=int))
)

Let us visualize what the generated data looks like:

In [None]:
plt.figure(figsize=(4, 4))
plt.scatter(
    simulated_separableish_features[:, 0],
    simulated_separableish_features[:, 1],
    c=simulated_labels,
    alpha=0.4,
    s=5,
)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Simulated Data")
plt.show()

We now split the data into training and testing sets. We use a 70%/30% split.

In [None]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    simulated_separableish_features, simulated_labels, test_size=0.3, random_state=0
)

Now we create a logistic regression model from sklearn and fit it to the training data.

In [None]:
model_log = sklearn.linear_model.LogisticRegression()
model_log.fit(X_train, y_train)

The logistic regression model is now fitted. We use it to predict the probability of each binary outcome for every point in the test dataset.

In [None]:
y_prob = model_log.predict_proba(X_test)

# Show logistic regression's estimated probability mass function along with
# the true value of y_i for each test point
viz_table = pd.DataFrame(
    data=np.c_[y_prob, y_test],
    columns=[r"$\hat p(0|x_i)$", r"$\hat p(1|x_i)$", r"$y_i$"],
)
viz_table

In [None]:
# Compute the estimated p(y_i | x_i) for each sample: in each row, pick the
# column that matches the true label
pygx = y_prob[np.arange(len(y_test)), y_test]
viz_table[r"$\hat p(y_i|x_i)$"] = pygx
viz_table

In [None]:
# Compute average log likelihood
np.mean(np.log(pygx))

In [None]:
# sklearn provides a convenience function for computing the estimated cross-entropy,
# i.e., the average negative log likelihood (NLL), which is just the negative of what
# we computed above
sklearn.metrics.log_loss(y_test, y_prob)
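
The relationship between the average log likelihood and `log_loss` can be checked on a tiny example. The sketch below uses made-up probabilities and labels (the `toy_` names and values are hypothetical, not taken from the lab data) and assumes the columns of the probability array are ordered as class 0, then class 1.

```python
import numpy as np
import sklearn.metrics

# Hypothetical predicted probabilities for four test points; columns are
# p(0|x_i) and p(1|x_i), so each row sums to 1
toy_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
toy_y = np.array([0, 1, 1, 1])

# Probability the model assigned to the label that actually occurred
toy_pygx = toy_prob[np.arange(len(toy_y)), toy_y]

# Average log likelihood, computed by hand
avg_log_lik = np.mean(np.log(toy_pygx))

# sklearn's log loss should be the negative of the value above
nll = sklearn.metrics.log_loss(toy_y, toy_prob)

print(f"average log likelihood: {avg_log_lik:.4f}")
print(f"log loss:               {nll:.4f}")
```

The two printed numbers should differ only in sign.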

### Hard Classifiers and Thresholding

Now let's make a *hard classifier*, something that makes a single guess about what the response might be, given any input. This contrasts with the soft classifier we already have, which estimates the conditional probability of each response given any input.

We'll start by making a hard classifier by thresholding our estimate of the conditional pmf.

$$\hat y(x) = \begin{cases} 1 & \mathrm{if}\ \hat p(1|x)>t \\ 0 & \mathrm{otherwise}\end{cases}$$

We'll start with $t=0.6$, which is pretty arbitrary. You can also play around with different values and see how things change.

In [None]:
thresh = 0.6
y_pred = np.where(y_prob[:, 1] > thresh, 1, 0)
pd.DataFrame({"phat(1|x_i)": y_prob[:, 1], "yhat(x_i)": y_pred})
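
To see how the choice of $t$ changes the hard classifier, here is a minimal sketch that sweeps a few thresholds over made-up probabilities. The array `toy_p1` and the thresholds are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical estimates of p(1|x_i) for six points
toy_p1 = np.array([0.05, 0.30, 0.55, 0.62, 0.80, 0.95])

# Number of points predicted as class 1 at each threshold
positives_by_threshold = {}
for t in (0.3, 0.6, 0.9):
    toy_pred = np.where(toy_p1 > t, 1, 0)
    positives_by_threshold[t] = int(toy_pred.sum())
    print(f"t={t}: predictions={toy_pred}, predicted positives={toy_pred.sum()}")
```

On these six values the sweep yields 4, 3, and 1 predicted positives: raising the threshold can only remove positive predictions, never add them.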

In [None]:
cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
# You can also create a confusion matrix using Pandas
cm_pd = pd.crosstab(y_test, y_pred, rownames=["Actual"], colnames=["Predicted"])
cm_pd

In [None]:
# Pandas can also create row summaries and column summaries
cm_pd = pd.crosstab(y_test, y_pred, rownames=["Actual"], colnames=["Predicted"], margins=True)
cm_pd

In [None]:
# Look at some metrics that are special for binary classification, where we have
# a "negative" and "positive" class
# Here "positive" is class 1 and "negative" is class 0

FP = cnf_matrix[0, 1]  # truly "negative" (class 0) but we predict class 1
FN = cnf_matrix[1, 0]
TP = cnf_matrix[1, 1]
TN = cnf_matrix[0, 0]

FP = float(FP)
FN = float(FN)
TP = float(TP)
TN = float(TN)

metrics_series = pd.Series(
    {
        # Sensitivity, hit rate, recall, or true positive rate
        "TPR": TP / (TP + FN),
        # Specificity or true negative rate
        "TNR": TN / (TN + FP),
        # Precision or positive predictive value
        "PPV": TP / (TP + FP),
        # Negative predictive value
        "NPV": TN / (TN + FN),
        # Fall out or false positive rate
        "FPR": FP / (FP + TN),
        # False negative rate
        "FNR": FN / (TP + FN),
        # False discovery rate
        "FDR": FP / (TP + FP),
    }
)
metrics_series
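
Two of these metrics also have ready-made sklearn functions, which gives a quick cross-check of the hand-computed formulas. The sketch below uses made-up labels (the `toy_` arrays are hypothetical) and compares `recall_score` and `precision_score` with the values derived from the confusion matrix.

```python
import numpy as np
import sklearn.metrics

# Hypothetical true labels and hard predictions for eight points
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = sklearn.metrics.confusion_matrix(toy_true, toy_pred).ravel()

tpr_by_hand = tp / (tp + fn)  # recall
ppv_by_hand = tp / (tp + fp)  # precision

tpr_sklearn = sklearn.metrics.recall_score(toy_true, toy_pred)
ppv_sklearn = sklearn.metrics.precision_score(toy_true, toy_pred)

print(f"TPR: by hand {tpr_by_hand:.3f}, sklearn {tpr_sklearn:.3f}")
print(f"PPV: by hand {ppv_by_hand:.3f}, sklearn {ppv_sklearn:.3f}")
```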

Let's make a curve, evaluating the precision and recall for many different hard classifiers based on many different thresholds. sklearn makes this easy for you.

In [None]:
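# Store the returned thresholds under a new name so the scalar `thresh`
# chosen above is not overwritten
prec, rec, pr_thresholds = sklearn.metrics.precision_recall_curve(y_test, y_prob[:, 1])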

plt.plot(rec, prec, label="Precision and recall for many hard classifiers based on thresholds")

plt.plot(
    [metrics_series.TPR],
    [metrics_series.PPV],
    "o",
    label="Precision and recall for the hard classifier we chose above",
)

plt.legend(bbox_to_anchor=[1, 1])
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()

---

## Group Work: Implementing ROC from Scratch

In the lab, we demonstrated how to use sklearn's library to plot the precision-recall curve. For group work, we will implement ROC from scratch. The motivation is two-fold:

1. You will better understand how the libraries implement the functions you end up using.
2. You will build intuition for what ROC and AUC actually measure, rather than treating AUC as just a number.

We will use the NHANES dataset for this exercise. The data files are located in `data/NHANES/`.

In [None]:
# Preview the NHANES datasets
bmx_df = pd.read_sas("data/NHANES/BMX_L.xpt")
demo_df = pd.read_sas("data/NHANES/DEMO_L.xpt")
hdl_df = pd.read_sas("data/NHANES/HDL_L.xpt")

print("First few rows of BMX_L.xpt (Body Measures):")
print(bmx_df.head())
print("\nFirst few rows of DEMO_L.xpt (Demographics):")
print(demo_df.head())
print("\nFirst few rows of HDL_L.xpt (HDL Cholesterol):")
print(hdl_df.head())

---

**Problem 1:** Load the NHANES Datasets

Load the three NHANES datasets (HDL, BMX, and DEMO) into dataframes. The data files are located in `data/NHANES/`.

Store the loaded dataframes in variables named `hdl`, `bmx`, and `demo`.

**Hint:** Use `pd.read_sas()` to load SAS .xpt files. See the [Pandas read_sas documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_sas.html).

In [None]:
# BEGIN SOLUTION
hdl = pd.read_sas("data/NHANES/HDL_L.xpt")
bmx = pd.read_sas("data/NHANES/BMX_L.xpt")
demo = pd.read_sas("data/NHANES/DEMO_L.xpt")
# END SOLUTION

In [None]:
# Test assertions
assert "hdl" in dir(), "Variable 'hdl' not defined"
assert "bmx" in dir(), "Variable 'bmx' not defined"
assert "demo" in dir(), "Variable 'demo' not defined"
assert isinstance(hdl, pd.DataFrame), "hdl should be a DataFrame"
assert isinstance(bmx, pd.DataFrame), "bmx should be a DataFrame"
assert isinstance(demo, pd.DataFrame), "demo should be a DataFrame"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "SEQN" in hdl.columns, "hdl should have SEQN column"
assert "SEQN" in bmx.columns, "bmx should have SEQN column"
assert "SEQN" in demo.columns, "demo should have SEQN column"
assert "LBDHDD" in hdl.columns, "hdl should have LBDHDD column"
assert len(hdl) > 0, "hdl should not be empty"
# END HIDDEN TESTS

---

**Problem 2:** Join the Datasets

Join the three datasets together using their primary key (`SEQN`) with an inner join. Store the result in a variable named `df`.

**Hint:** Use `pd.merge()` to join dataframes. You may need to call it twice to join all three datasets. See the [Pandas merge documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).
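
To illustrate what an inner join on `SEQN` does, here is a small sketch on made-up frames. The `toy_` dataframes and their values are hypothetical; only the column names mirror the NHANES files.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
toy_hdl = pd.DataFrame({"SEQN": [1, 2, 3], "LBDHDD": [55.0, 62.0, 48.0]})
toy_bmx = pd.DataFrame({"SEQN": [2, 3, 4], "BMXBMI": [24.1, 31.5, 27.0]})
toy_demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [34.0, 51.0, 19.0, 70.0]})

# An inner join keeps only the SEQN values present in every table
toy_df = pd.merge(toy_hdl, toy_bmx, on="SEQN", how="inner")
toy_df = pd.merge(toy_df, toy_demo, on="SEQN", how="inner")

print(toy_df)
```

Only participants 2 and 3 appear in all three toy tables, so the result has two rows.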

In [None]:
# BEGIN SOLUTION
# Inner join all three datasets on SEQN
df = pd.merge(hdl, bmx, on="SEQN", how="inner")
df = pd.merge(df, demo, on="SEQN", how="inner")
# END SOLUTION
df.head()

In [None]:
# Test assertions
assert "df" in dir(), "Variable 'df' not defined"
assert isinstance(df, pd.DataFrame), "df should be a DataFrame"
assert "SEQN" in df.columns, "df should have SEQN column"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "LBDHDD" in df.columns, "df should have LBDHDD column from hdl"
assert "BMXBMI" in df.columns, "df should have BMXBMI column from bmx"
assert "RIDAGEYR" in df.columns, "df should have RIDAGEYR column from demo"
assert len(df) > 5000, "df should have a substantial number of rows after inner join"
# END HIDDEN TESTS

---

**Problem 3:** Select Predictor Variables

Pick a set of predictors for a logistic regression that predicts whether a participant's HDL cholesterol level is healthy (the binary indicator is constructed in Problem 5).

(a) In the markdown cell below, write the column name of each variable you select along with a short English description of what the variable represents.

(b) In the code cell, filter your dataset to only contain the columns `SEQN`, your predictor variables, and the response variable `LBDHDD`. Rename `LBDHDD` to `HDL` for clarity. Store the result in a variable named `my_df`. You may also rename the other columns to more readable names.

**Required predictors:** You must include at least:
- `BMXBMI` (Body Mass Index)
- `RIDAGEYR` (Age in years at screening)
- `INDFMPIR` (Ratio of family income to poverty)

> BEGIN SOLUTION

**Selected Variables:**
- `LBDHDD`: HDL cholesterol level (mg/dL) - the response variable
- `BMXBMI`: Body Mass Index - a measure of body fat based on height and weight
- `RIDAGEYR`: Age in years at screening - participant's age when examined
- `INDFMPIR`: Ratio of family income to poverty threshold - a socioeconomic indicator
> END SOLUTION


In [None]:
# BEGIN SOLUTION
# Select the required columns and rename them for clarity
my_df = df[["SEQN", "LBDHDD", "BMXBMI", "RIDAGEYR", "INDFMPIR"]].copy()
my_df = my_df.rename(
    columns={
        "LBDHDD": "HDL",
        "BMXBMI": "BMI",
        "RIDAGEYR": "ScreeningAge",
        "INDFMPIR": "RatioToPoverty",
    }
)
# END SOLUTION
my_df.head()

In [None]:
# Test assertions
assert "my_df" in dir(), "Variable 'my_df' not defined"
assert isinstance(my_df, pd.DataFrame), "my_df should be a DataFrame"
assert len(my_df.columns) >= 4, "my_df should have at least 4 columns (SEQN + HDL + predictors)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "SEQN" in my_df.columns, "my_df should have SEQN column"
assert len(my_df) > 0, "my_df should not be empty"
assert "HDL" in my_df.columns, "my_df should have HDL column (renamed from LBDHDD)"
# END HIDDEN TESTS

---

**Problem 4:** Handle Missing Values

Drop all rows with missing values from `my_df`, since sklearn's logistic regression cannot be fit on rows that contain missing values.

**Important:** If you remove more than 30% of all rows in the dataset, your set of variables is likely removing too much of the data. Try to find a different combination of variables if this happens.
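
If too many rows disappear, it helps to see which column is responsible. The sketch below, on a made-up frame (the `toy` data is hypothetical), prints the fraction of missing values per column and the fraction of rows retained after dropping.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries
toy = pd.DataFrame(
    {
        "HDL": [55.0, np.nan, 48.0, 61.0, 70.0],
        "BMI": [24.1, 31.5, np.nan, 27.0, 22.3],
    }
)

# Fraction of missing values in each column
print(toy.isna().mean())

# Drop every row that has at least one missing value
toy_clean = toy.dropna()
retained = len(toy_clean) / len(toy)
print(f"Retained {retained:.1%} of the rows")
```

Here two of the five rows contain a missing value, so 60% of the rows are retained, which would fall below the 70% guideline above.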

In [None]:
original_size = len(my_df)
# BEGIN SOLUTION
my_df = my_df.dropna()
# END SOLUTION
print(f"Retained {len(my_df) / original_size:.1%} of the original data")

In [None]:
# Test assertions
assert my_df.isna().sum().sum() == 0, "my_df should have no missing values after dropna"
assert len(my_df) / original_size > 0.7, "Should retain at least 70% of the data"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(my_df) > 5000, "my_df should have a substantial number of rows"
assert my_df.shape[1] >= 4, "my_df should have at least 4 columns"
# END HIDDEN TESTS

---

**Problem 5:** Create Binary HDL Indicator

High-density lipoprotein (HDL) cholesterol is often called the "good" cholesterol because higher levels are generally associated with a lower risk of heart disease.

An HDL of 60 mg/dL or higher is often viewed as protective against heart disease. This is typically the level you'd like to aim for, if possible.

Convert your HDL variable to a binary indicator that is `True` when HDL is **at least** 60 mg/dL and `False` otherwise. Store this in a column named `HDL_healthy`.
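
The phrase "at least" matters at the boundary. A quick sketch on made-up values (the `toy_` series is hypothetical) shows that `>=` marks exactly 60 mg/dL as healthy.

```python
import pandas as pd

# Hypothetical HDL values around the 60 mg/dL cutoff
toy_hdl = pd.Series([59.9, 60.0, 60.1, 45.0], name="HDL")

# Comparison yields a boolean Series; 60.0 itself counts as healthy
toy_healthy = toy_hdl >= 60.0

print(pd.DataFrame({"HDL": toy_hdl, "HDL_healthy": toy_healthy}))
```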

In [None]:
# BEGIN SOLUTION
# Create binary indicator for healthy HDL (>= 60 mg/dL)
my_df["HDL_healthy"] = my_df["HDL"] >= 60.0
# END SOLUTION
my_df.head()

In [None]:
# Test assertions
assert "HDL_healthy" in my_df.columns, "my_df should have 'HDL_healthy' column"
assert my_df["HDL_healthy"].dtype == bool, "HDL_healthy should be boolean type"
assert my_df["HDL_healthy"].any(), "Some samples should have HDL_healthy=True"
assert not my_df["HDL_healthy"].all(), "Some samples should have HDL_healthy=False"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check that the binary variable is correctly computed
expected_healthy = my_df["HDL"] >= 60.0
assert (my_df["HDL_healthy"] == expected_healthy).all(), "HDL_healthy should be True when HDL >= 60"
# END HIDDEN TESTS

---

**Problem 6:** Split Data into Training and Test Sets

Separate the data into training and testing splits. Use an 80%/20% split with `random_state=8`.

Store the results in variables named `X_train`, `X_test`, `y_train`, and `y_test`.

**Hint:** Use `sklearn.model_selection.train_test_split()`. See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
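
As a small illustration of how `test_size` and `random_state` behave, the sketch below splits made-up arrays (the `toy_` names are hypothetical) and repeats the split to show that a fixed seed reproduces it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# test_size=0.2 sends 2 of the 10 samples to the test set
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=8
)

# Repeating the call with the same random_state gives the same split
repeat_X_train, repeat_X_test, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=8
)

print(f"train: {len(toy_X_train)} rows, test: {len(toy_X_test)} rows")
print("same split on repeat:", np.array_equal(toy_X_test, repeat_X_test))
```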

In [None]:
from sklearn.model_selection import train_test_split

# BEGIN SOLUTION
# Identify predictor columns (exclude SEQN, HDL, and HDL_healthy)
predictor_cols = [col for col in my_df.columns if col not in ["SEQN", "HDL", "HDL_healthy"]]

X_train, X_test, y_train, y_test = train_test_split(
    my_df[predictor_cols],
    my_df["HDL_healthy"],
    test_size=0.2,
    random_state=8,
)
# END SOLUTION
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

In [None]:
# Test assertions
assert "X_train" in dir(), "Variable 'X_train' not defined"
assert "X_test" in dir(), "Variable 'X_test' not defined"
assert "y_train" in dir(), "Variable 'y_train' not defined"
assert "y_test" in dir(), "Variable 'y_test' not defined"
assert len(X_train) > len(X_test), "Training set should be larger than test set"
print("All tests passed!")

# BEGIN HIDDEN TESTS
total_size = len(X_train) + len(X_test)
test_ratio = len(X_test) / total_size
assert 0.15 < test_ratio < 0.25, f"Test set should be ~20% of total, got {test_ratio:.1%}"
assert len(X_train) == len(y_train), "X_train and y_train should have same length"
assert len(X_test) == len(y_test), "X_test and y_test should have same length"
# END HIDDEN TESTS

---

**Problem 7:** Train a Logistic Regression Model

Train a logistic regression model on the training data. Store the trained model in a variable named `model`.

**Hint:** Use `sklearn.linear_model.LogisticRegression()`. See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
from sklearn.linear_model import LogisticRegression

# BEGIN SOLUTION
model = LogisticRegression()
model.fit(X_train, y_train)
# END SOLUTION

In [None]:
# Test assertions
assert "model" in dir(), "Variable 'model' not defined"
assert hasattr(model, "predict"), "model should have a predict method"
assert hasattr(model, "predict_proba"), "model should have a predict_proba method"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(model, "coef_"), "model should be fitted (have coef_ attribute)"
assert model.coef_.shape[1] == X_train.shape[1], "model should have correct number of coefficients"
# END HIDDEN TESTS

---

**Problem 8:** Generate Predictions

Generate both the predicted probabilities and the default binary predictions for the test set.

Store the predicted probabilities in a variable named `predicted_probs` and the binary predictions in a variable named `predicted_hdl_healthy`.

**Hint:** Use `model.predict_proba()` for probabilities and `model.predict()` for binary predictions.
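
The two methods are closely related: for a binary logistic regression, `predict()` should agree with thresholding the positive-class probability at 0.5, barring exact ties. The sketch below checks this on synthetic data from `make_classification`; the data and the `toy_` names are illustrative assumptions, not part of the NHANES exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
toy_X, toy_y = make_classification(n_samples=200, n_features=4, random_state=0)

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Column j of predict_proba corresponds to toy_model.classes_[j]
toy_probs = toy_model.predict_proba(toy_X)
toy_hard = toy_model.predict(toy_X)

# Reproduce the default predictions by thresholding p(1|x) at 0.5
manual_hard = toy_model.classes_[(toy_probs[:, 1] > 0.5).astype(int)]

print("classes:", toy_model.classes_)
print("predict() matches 0.5 threshold:", np.array_equal(manual_hard, toy_hard))
```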

In [None]:
# BEGIN SOLUTION
predicted_probs = model.predict_proba(X_test)
predicted_hdl_healthy = model.predict(X_test)
# END SOLUTION
print(f"Average predicted probability: {predicted_probs.mean(axis=0)}")

In [None]:
# Test assertions
assert "predicted_probs" in dir(), "Variable 'predicted_probs' not defined"
assert "predicted_hdl_healthy" in dir(), "Variable 'predicted_hdl_healthy' not defined"
assert predicted_probs.shape[0] == len(X_test), "predicted_probs should have same length as X_test"
assert predicted_probs.shape[1] == 2, "predicted_probs should have 2 columns (one per class)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(predicted_hdl_healthy) == len(
    X_test
), "predicted_hdl_healthy should have same length as X_test"
assert np.allclose(
    predicted_probs.sum(axis=1), 1.0
), "Probabilities should sum to 1 for each sample"
assert predicted_hdl_healthy.dtype in (
    bool,
    np.bool_,
), "Binary predictions should be boolean type"
# END HIDDEN TESTS

---

**Problem 9:** Evaluate with sklearn's ROC

Use sklearn's `roc_curve` and `auc` functions to evaluate the logistic regression model. This will serve as a reference for checking that our own implementation (in the next problem) produces nearly the same curve and AUC.

Store the results in variables named `fpr`, `tpr`, `thresholds`, and `roc_auc`.

**Hint:** See the [roc_curve documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) and [auc documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html).
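
Before applying these functions to the model, it can help to run them on a four-point example small enough to verify by hand. The labels and scores below are made up (the `toy_` names are hypothetical).

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and scores for four points
toy_y = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns one (FPR, TPR) pair per candidate threshold
toy_fpr, toy_tpr, toy_thr = roc_curve(toy_y, toy_scores)

# Area under the curve by the trapezoidal rule
toy_auc = auc(toy_fpr, toy_tpr)

# roc_auc_score computes the same quantity directly from labels and scores
toy_auc_direct = roc_auc_score(toy_y, toy_scores)

print("FPR:", toy_fpr)
print("TPR:", toy_tpr)
print(f"AUC via auc(): {toy_auc:.2f}, via roc_auc_score(): {toy_auc_direct:.2f}")
```

Of the four (positive, negative) pairs in this example, three are ranked correctly by the scores, so the AUC is 0.75.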

In [None]:
from sklearn.metrics import auc, roc_curve

# BEGIN SOLUTION
# Compute ROC curve using sklearn
fpr, tpr, thresholds = roc_curve(y_test, predicted_probs[:, 1])
roc_auc = auc(fpr, tpr)
# END SOLUTION

print(f"AUC: {roc_auc:.4f}")

plt.figure()
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (area = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC)")
plt.legend(loc="lower right")
plt.show()

In [None]:
# Test assertions
assert "fpr" in dir(), "Variable 'fpr' not defined"
assert "tpr" in dir(), "Variable 'tpr' not defined"
assert "roc_auc" in dir(), "Variable 'roc_auc' not defined"
assert 0 <= roc_auc <= 1, "AUC should be between 0 and 1"
assert roc_auc > 0.5, "AUC should be better than random (> 0.5)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(fpr) == len(tpr), "fpr and tpr should have same length"
assert fpr[0] == 0, "FPR should start at 0"
assert tpr[0] == 0, "TPR should start at 0"
assert fpr[-1] == 1, "FPR should end at 1"
assert tpr[-1] == 1, "TPR should end at 1"
# END HIDDEN TESTS

---

**Problem 10:** Implement ROC from Scratch

Implement your own ROC function and evaluate your logistic regression on it. You may **not** wrap functions that already compute ROC (e.g., `sklearn.metrics.roc_curve`). You may use `sklearn.metrics.confusion_matrix` to compute the confusion matrix for each threshold.

Your function should:
1. Take `y_true` (true labels) and `predicted_probs_positive` (predicted probabilities for the positive class) as inputs
2. Use at least 1000 thresholds evenly spaced between 0 and 1
3. For each threshold, compute the FPR and TPR from the confusion matrix
4. Return lists of FPR values, TPR values, and thresholds

**Recall:**
- TPR (True Positive Rate) = TP / (TP + FN)
- FPR (False Positive Rate) = FP / (FP + TN)

Store your results in `my_fprs`, `my_tprs`, and `my_thresholds`.
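
The TPR and FPR formulas recalled above can be worked through by hand for a single threshold. The sketch below uses plain NumPy on made-up labels and probabilities (the `toy_` arrays and the threshold are hypothetical); it covers one point of the curve only, not the full function.

```python
import numpy as np

# Hypothetical labels and positive-class probabilities for six points
toy_y = np.array([0, 0, 1, 1, 1, 0])
toy_p = np.array([0.2, 0.7, 0.4, 0.9, 0.6, 0.1])

threshold = 0.5
toy_pred = toy_p > threshold  # predicted positive where p(1|x) exceeds t

# Count the four confusion-matrix cells with boolean masks
tp = np.sum(toy_pred & (toy_y == 1))   # predicted 1, truly 1
fp = np.sum(toy_pred & (toy_y == 0))   # predicted 1, truly 0
fn = np.sum(~toy_pred & (toy_y == 1))  # predicted 0, truly 1
tn = np.sum(~toy_pred & (toy_y == 0))  # predicted 0, truly 0

toy_tpr = tp / (tp + fn)
toy_fpr = fp / (fp + tn)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"TPR={toy_tpr:.3f}, FPR={toy_fpr:.3f}")
```

At $t=0.5$ this gives TPR = 2/3 and FPR = 1/3; repeating the computation over many thresholds traces out the ROC curve.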

In [None]:
from sklearn.metrics import confusion_matrix


def compute_roc(y_true, predicted_probs_positive):
    """
    Compute the ROC curve from scratch.

    Parameters
    ----------
    y_true : array-like
        The true labels of the test set.
    predicted_probs_positive : array-like
        The predicted probabilities of the positive class for the test set.

    Returns
    -------
    fpr_list : list
        A list of false positive rates for each threshold.
    tpr_list : list
        A list of true positive rates for each threshold.
    thresholds : ndarray
        The thresholds used to calculate the rates.
    """
    # BEGIN SOLUTION
    # Create 1000 evenly spaced thresholds from 0 to 1
    thresholds = np.linspace(0, 1, 1000)
    fpr_list = []
    tpr_list = []

    for threshold in thresholds:
        # Create binary predictions based on threshold
        y_pred = predicted_probs_positive > threshold

        # Compute confusion matrix
        cm = confusion_matrix(y_true, y_pred)
        tn, fp, fn, tp = cm.ravel()

        # Calculate FPR and TPR
        current_fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        current_tpr = tp / (tp + fn) if (tp + fn) > 0 else 0

        fpr_list.append(current_fpr)
        tpr_list.append(current_tpr)

    return fpr_list, tpr_list, thresholds
    # END SOLUTION


my_fprs, my_tprs, my_thresholds = compute_roc(y_test, predicted_probs[:, 1])

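# Compute the AUC of the custom curve. The FPR values decrease as the threshold
# increases; sklearn's auc accepts x values that are monotonically decreasing.
my_roc_auc = auc(my_fprs, my_tprs)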
print(f"AUC (custom implementation): {my_roc_auc:.4f}")

plt.figure()
plt.plot(my_fprs, my_tprs, color="darkorange", lw=2, label=f"ROC curve (area = {my_roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) - Custom Implementation")
plt.legend(loc="lower right")
plt.show()

In [None]:
# Test assertions
assert "my_fprs" in dir(), "Variable 'my_fprs' not defined"
assert "my_tprs" in dir(), "Variable 'my_tprs' not defined"
assert "my_thresholds" in dir(), "Variable 'my_thresholds' not defined"
assert len(my_thresholds) >= 1000, "Should use at least 1000 thresholds"
assert (
    len(my_fprs) == len(my_tprs) == len(my_thresholds)
), "FPR, TPR, and thresholds should have same length"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check that the custom AUC is close to sklearn's AUC
custom_auc = auc(my_fprs, my_tprs)
assert (
    abs(custom_auc - roc_auc) < 0.05
), f"Custom AUC ({custom_auc:.3f}) should be close to sklearn AUC ({roc_auc:.3f})"
# Check that FPR and TPR are in valid range
assert all(0 <= f <= 1 for f in my_fprs), "All FPR values should be between 0 and 1"
assert all(0 <= t <= 1 for t in my_tprs), "All TPR values should be between 0 and 1"
# END HIDDEN TESTS

---

**Problem 11:** Find the Optimal Threshold

A common approach for selecting a final threshold is to find the point on the ROC curve that is closest to the ideal point (0, 1), which represents perfect classification (FPR=0, TPR=1).

Write code that finds this optimal threshold and the corresponding FPR and TPR values. Store them in variables named `optimal_threshold`, `optimal_fpr`, and `optimal_tpr`.

Then, plot the ROC curve with a marker at the optimal point.

**Hint:** The Euclidean distance from a point (fpr, tpr) to (0, 1) is: $\sqrt{fpr^2 + (1 - tpr)^2}$
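
The distance formula in the hint can be tried on a curve with only four points. The sketch below uses made-up FPR, TPR, and threshold values (all `toy_` arrays are hypothetical) and picks the point nearest to (0, 1) with `np.argmin`.

```python
import numpy as np

# Hypothetical ROC points and the thresholds that produced them
toy_fprs = np.array([0.0, 0.1, 0.5, 1.0])
toy_tprs = np.array([0.0, 0.7, 0.9, 1.0])
toy_thresholds = np.array([1.0, 0.6, 0.3, 0.0])

# Euclidean distance from each (fpr, tpr) point to the ideal corner (0, 1)
toy_distances = np.sqrt(toy_fprs**2 + (1 - toy_tprs) ** 2)

# Index of the closest point; ties resolve to the first minimum
best = int(np.argmin(toy_distances))

print("distances:", np.round(toy_distances, 3))
print(f"closest point: threshold={toy_thresholds[best]}, "
      f"FPR={toy_fprs[best]}, TPR={toy_tprs[best]}")
```

In this toy curve the point (0.1, 0.7) is closest to the corner, so the selected threshold is 0.6.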

In [None]:
# BEGIN SOLUTION
# Convert lists to arrays for easier computation
fpr_array = np.array(my_fprs)
tpr_array = np.array(my_tprs)

# Compute distance from each point to (0, 1)
distances = np.sqrt(fpr_array**2 + (1 - tpr_array) ** 2)

# Find the index of the minimum distance
min_index = np.argmin(distances)

# Extract optimal values
optimal_threshold = my_thresholds[min_index]
optimal_fpr = my_fprs[min_index]
optimal_tpr = my_tprs[min_index]
# END SOLUTION

print(f"Optimal threshold: {optimal_threshold:.4f}")
print(f"Optimal FPR: {optimal_fpr:.4f}")
print(f"Optimal TPR: {optimal_tpr:.4f}")

# Plot the ROC curve with the optimal point
plt.figure()
plt.plot(my_fprs, my_tprs, color="darkorange", lw=2, label=f"ROC curve (area = {my_roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.plot(
    optimal_fpr,
    optimal_tpr,
    "ro",
    markersize=10,
    label=f"Optimal point (t={optimal_threshold:.3f})",
)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve with Optimal Threshold")
plt.legend(loc="lower right")
plt.show()

In [None]:
# Test assertions
assert "optimal_threshold" in dir(), "Variable 'optimal_threshold' not defined"
assert "optimal_fpr" in dir(), "Variable 'optimal_fpr' not defined"
assert "optimal_tpr" in dir(), "Variable 'optimal_tpr' not defined"
assert 0 <= optimal_threshold <= 1, "Optimal threshold should be between 0 and 1"
assert 0 <= optimal_fpr <= 1, "Optimal FPR should be between 0 and 1"
assert 0 <= optimal_tpr <= 1, "Optimal TPR should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check that the optimal point is actually the closest to (0, 1)
optimal_distance = np.sqrt(optimal_fpr**2 + (1 - optimal_tpr) ** 2)
for i in range(len(my_fprs)):
    dist = np.sqrt(my_fprs[i] ** 2 + (1 - my_tprs[i]) ** 2)
    assert (
        dist >= optimal_distance - 1e-10
    ), "Found a point closer to (0,1) than the 'optimal' point"
# END HIDDEN TESTS