# DATASCI 503, Homework 3: Logistic Regression

Logistic regression is one of the most widely used methods for binary classification. It models the probability that an observation belongs to a particular class. In this assignment, you'll explore odds, log-odds, and the softmax generalization to multiple classes, while training classifiers and evaluating performance using ROC curves and AUC. But first, a problem about the curse of dimensionality.

---

**Problem 1 (ISLP Ch 4, Exercise 4):** The Curse of Dimensionality

When the number of features $p$ is large, there tends to be a deterioration in the performance of KNN and other *local* approaches that perform prediction using only observations that are *near* the test observation for which a prediction must be made. This phenomenon is known as the *curse of dimensionality*, and it ties into the fact that non-parametric approaches often perform poorly when $p$ is large. We will now investigate this curse.

**(a)** Suppose that we have a set of observations, each with measurements on $p = 1$ feature, $X$. We assume that $X$ is uniformly (evenly) distributed on $[0, 1]$. Associated with each observation is a response value. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of $X$ closest to that test observation. For instance, in order to predict the response for a test observation with $X = 0.6$, we will use observations in the range $[0.55, 0.65]$. On average, what fraction of the available observations will we use to make the prediction?

> BEGIN SOLUTION

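For $p = 1$, $X \sim \text{Uniform}[0,1]$. The neighborhood is an interval of length 0.1, so on average we use about **10%** of the available observations (ignoring test points within 0.05 of either endpoint, where the interval is cut off).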

> END SOLUTION

**(b)** Now suppose that we have a set of observations, each with measurements on $p = 2$ features, $X_1$ and $X_2$. We assume that $(X_1, X_2)$ are uniformly distributed on $[0, 1] \times [0, 1]$. We wish to predict a test observation's response using only observations that are within 10% of the range of $X_1$ *and* within 10% of the range of $X_2$ closest to that test observation. For instance, in order to predict the response for a test observation with $X_1 = 0.6$ and $X_2 = 0.35$, we will use observations in the range $[0.55, 0.65]$ for $X_1$ and in the range $[0.3, 0.4]$ for $X_2$. On average, what fraction of the available observations will we use to make the prediction?


> BEGIN SOLUTION

For $p = 2$, the fraction is $0.1 \times 0.1 = 0.01$, or **1%**.

> END SOLUTION

**(c)** Now suppose that we have a set of observations on $p = 100$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

> BEGIN SOLUTION

For $p = 100$, the fraction is $0.1^{100}$, which is essentially **0%**.

> END SOLUTION

**(d)** Using your answers to parts (a)–(c), argue that a drawback of KNN when $p$ is large is that there are very few training observations "near" any given test observation.

> BEGIN SOLUTION

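The fraction of training observations that fall inside the 10% neighborhood is $0.1^p$, which shrinks exponentially in $p$: 10% at $p = 1$, 1% at $p = 2$, and $10^{-100}$ at $p = 100$. For any realistic sample size there are essentially no training observations near a given test observation when $p$ is large, so the $K$ "nearest" neighbors that KNN averages over are in fact far away, and the assumption that nearby inputs have similar outputs no longer helps.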
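
The $0.1^p$ rate can be checked by simulation. The sketch below is only an illustration under stated assumptions: the function name `fraction_in_neighborhood`, the sample size, and the choice of the test point $(0.5, \ldots, 0.5)$ (picked to avoid boundary effects) are not part of the assignment.

```python
import random


def fraction_in_neighborhood(p, n=100_000, half_width=0.05, seed=0):
    """Estimate the fraction of uniform points on [0, 1]^p that lie within
    half_width of the test point (0.5, ..., 0.5) in every coordinate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # A point counts only if all p coordinates fall inside the window.
        if all(abs(rng.random() - 0.5) <= half_width for _ in range(p)):
            hits += 1
    return hits / n


for p in (1, 2, 3):
    estimate = fraction_in_neighborhood(p)
    print(f"p={p}: simulated {estimate:.5f}, theoretical {0.1 ** p:.5f}")
```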

> END SOLUTION

**(e)** Now suppose that we wish to make a prediction for a test observation by creating a $p$-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For $p = 1, 2,$ and $100$, what is the length of each side of the hypercube? Comment on your answer.

> *Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When $p = 1$, a hypercube is simply a line segment, when $p = 2$ it is a square, and when $p = 100$ it is a 100-dimensional cube.*

> BEGIN SOLUTION

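Since Volume $= s^p = 0.1$, we have $s = 0.1^{1/p}$.
- For $p = 1$: $s = 0.1$
- For $p = 2$: $s \approx 0.316$
- For $p = 100$: $s \approx 0.977$ (nearly the full range)

Comment: when $p = 100$, the hypercube must span about 97.7% of the range of every feature to capture 10% of the observations, so the "neighborhood" is no longer local.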
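
The side lengths follow directly from $s = 0.1^{1/p}$. A short sketch that reproduces the arithmetic (the helper name `hypercube_side` is hypothetical):

```python
def hypercube_side(p, fraction=0.1):
    """Side length s of a p-dimensional cube with volume s**p == fraction."""
    return fraction ** (1.0 / p)


for p in (1, 2, 100):
    print(f"p={p:>3}: side length = {hypercube_side(p):.3f}")
```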

> END SOLUTION

---

**Problem 2 (ISLP Ch 4, Exercise 1):** Deriving the Odds Formula

Using a little bit of algebra, prove that

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

is equivalent to

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}.$$

In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.

> BEGIN SOLUTION

From the logistic function representation:

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1+ e^{\beta_0 + \beta_1 X}}
$$

So

$$
1-p(X) = 1-\frac{e^{\beta_0 + \beta_1 X}}{1+ e^{\beta_0 + \beta_1 X}} = \frac{1}{1+ e^{\beta_0 + \beta_1 X}}
$$

Therefore

$$
\frac{p(X)}{1-p(X)} = \frac{e^{\beta_0 + \beta_1 X} / (1+ e^{\beta_0 + \beta_1 X})}{1 / (1+ e^{\beta_0 + \beta_1 X})} = e^{\beta_0 + \beta_1 X}
$$

Each step is reversible: solving $\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1 X}$ for $p(X)$ recovers the logistic function, so the two representations are equivalent.
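
As a numerical sanity check of the identity (not a substitute for the algebra), the sketch below evaluates both sides at a few points. The coefficients and inputs are arbitrary illustrative values, and `logistic` is a hypothetical helper name.

```python
import math


def logistic(beta0, beta1, x):
    """p(X) from the logistic function representation."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))


# Arbitrary illustrative coefficients and inputs (not from the assignment).
beta0, beta1 = -1.5, 0.8
for x in (-2.0, 0.0, 0.5, 3.0):
    p = logistic(beta0, beta1, x)
    odds = p / (1.0 - p)
    # The odds should match exp(beta0 + beta1 * x) up to rounding error.
    assert math.isclose(odds, math.exp(beta0 + beta1 * x), rel_tol=1e-9)
    print(f"x={x:+.1f}: p={p:.4f}, odds={odds:.4f}")
```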

> END SOLUTION

---

**Problem 3 (ISLP Ch 4, Exercise 9):** Odds and Probability Conversion

This problem has to do with odds.

**(a)** On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

> BEGIN SOLUTION

Given: Odds $= \frac{p}{1-p} = 0.37$. Solving: $p = \frac{0.37}{1.37} \approx 0.27$

> END SOLUTION

**(b)** Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

> BEGIN SOLUTION

Given: $p = 0.16$. Odds $= \frac{0.16}{0.84} \approx 0.19$
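
Both conversions in this problem are one-line formulas. A minimal sketch, with hypothetical helper names, that reproduces the two answers:

```python
def odds_to_probability(odds):
    """Invert odds = p / (1 - p) to get p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def probability_to_odds(p):
    """Odds in favor of an event with probability p (requires p < 1)."""
    return p / (1.0 - p)


print(f"odds 0.37 -> p = {odds_to_probability(0.37):.4f}")
print(f"p = 0.16  -> odds = {probability_to_odds(0.16):.4f}")
```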

> END SOLUTION

---

**Problem 4 (ISLP Ch 4, Exercise 12, parts a, b, and c):** Softmax vs Binary Logistic Regression

Suppose that you wish to classify an observation $X \in \mathbb{R}$ into apples and oranges. You fit a logistic regression model and find that

$$\widehat{\Pr}(Y = \text{orange} | X = x) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x)}.$$

Your friend fits a logistic regression model to the same data using the *softmax* formulation, and finds that

$$\widehat{\Pr}(Y = \text{orange} | X = x) = \frac{\exp(\hat{\alpha}_{\text{orange}0} + \hat{\alpha}_{\text{orange}1} x)}{\exp(\hat{\alpha}_{\text{orange}0} + \hat{\alpha}_{\text{orange}1} x) + \exp(\hat{\alpha}_{\text{apple}0} + \hat{\alpha}_{\text{apple}1} x)}.$$

**(a)** What is the log odds of orange versus apple in your model?

> BEGIN SOLUTION

In this model the odds are $\frac{p(\text{orange}|x)}{p(\text{apple}|x)} = e^{\hat{\beta}_0 + \hat{\beta}_1 x}$, so the log odds of orange versus apple are $\hat{\beta}_0 + \hat{\beta}_1 x$.

> END SOLUTION

**(b)** What is the log odds of orange versus apple in your friend's model?

> BEGIN SOLUTION

Taking the ratio of probabilities (the common denominator cancels) and then taking logs:
$$\log \left( \frac{p(\text{orange} | x)}{p(\text{apple}|x)} \right) = (\hat{\alpha}_{\text{orange}0} - \hat{\alpha}_{\text{apple}0}) + (\hat{\alpha}_{\text{orange}1} - \hat{\alpha}_{\text{apple}1}) x$$

> END SOLUTION

**(c)** Suppose that in your model, $\hat{\beta}_0 = 2$ and $\hat{\beta}_1 = -1$. What are the coefficient estimates in your friend's model? Be as specific as possible.

> BEGIN SOLUTION

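Matching the two expressions for the log odds gives $\hat{\alpha}_{\text{orange}0} - \hat{\alpha}_{\text{apple}0} = \hat{\beta}_0 = 2$ and $\hat{\alpha}_{\text{orange}1} - \hat{\alpha}_{\text{apple}1} = \hat{\beta}_1 = -1$. Only these differences are determined: the softmax formulation is overparameterized, so the individual coefficients cannot be recovered. Any estimates satisfying the two equations, for example $\hat{\alpha}_{\text{orange}0} = 2$, $\hat{\alpha}_{\text{orange}1} = -1$, $\hat{\alpha}_{\text{apple}0} = \hat{\alpha}_{\text{apple}1} = 0$, give the same fitted probabilities.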
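
To illustrate that only the coefficient differences matter, the sketch below compares the binary model with $\hat{\beta}_0 = 2$, $\hat{\beta}_1 = -1$ against two softmax parameterizations. Both candidate sets are made up for illustration; they share the differences $2$ and $-1$ and are not estimates from any data. The function names are hypothetical.

```python
import math


def logistic_orange(beta0, beta1, x):
    """P(orange | x) under the binary logistic model."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def softmax_orange(a_orange0, a_orange1, a_apple0, a_apple1, x):
    """P(orange | x) under the two-class softmax model."""
    e_orange = math.exp(a_orange0 + a_orange1 * x)
    e_apple = math.exp(a_apple0 + a_apple1 * x)
    return e_orange / (e_orange + e_apple)


beta0, beta1 = 2.0, -1.0

# Two hypothetical softmax parameterizations with the same differences:
# alpha_orange0 - alpha_apple0 = 2 and alpha_orange1 - alpha_apple1 = -1.
candidates = [
    (2.0, -1.0, 0.0, 0.0),
    (5.0, 0.5, 3.0, 1.5),
]

for x in (-1.0, 0.0, 2.5):
    target = logistic_orange(beta0, beta1, x)
    for alphas in candidates:
        # Every candidate reproduces the binary model's probability.
        assert math.isclose(softmax_orange(*alphas, x), target, rel_tol=1e-9)
    print(f"x={x:+.1f}: P(orange) = {target:.4f} under all parameterizations")
```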

> END SOLUTION

---

**Problem 5:** Predicting Academic Success

Suppose we fit logistic regression to predict the probability that a STATS 503 student gets an A in the class, from two variables. The variables are average hours of study per week ($X_1$) and GPA in other statistics courses taken ($X_2$). The fitted coefficients are $\beta_0 = -4$, $\beta_1 = 0.05$, and $\beta_2 = 1$.

**(a)** Predict the probability of getting an A for a student who studies 5 hours a week and has a GPA of 3.5 in other statistics courses.

> BEGIN SOLUTION

Log-odds $= -4 + 0.05(5) + 1(3.5) = -0.25$, so $p = \frac{e^{-0.25}}{1+e^{-0.25}} \approx 0.438$ or **43.8%**.

> END SOLUTION

**(b)** What are the odds this student gets an A?

> BEGIN SOLUTION

Odds $= e^{-0.25} \approx 0.779$

> END SOLUTION

**(c)** How many hours would this student need to study to have a 50% chance of getting an A?

> BEGIN SOLUTION

For $p = 0.5$, log-odds $= 0$. Solving $0 = -4 + 0.05 X_1 + 3.5$ gives $X_1 = 10$ hours.
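
The three answers in this problem can be reproduced in a few lines. This is a sketch using the coefficients given in the problem statement; the names `prob_of_a` and `hours_for_half` are hypothetical.

```python
import math

# Coefficients from the problem statement.
beta0, beta1, beta2 = -4.0, 0.05, 1.0


def prob_of_a(hours, gpa):
    """Predicted probability of an A from study hours and GPA."""
    log_odds = beta0 + beta1 * hours + beta2 * gpa
    return 1.0 / (1.0 + math.exp(-log_odds))


p = prob_of_a(5, 3.5)
odds = p / (1.0 - p)
# Hours at which the log-odds are zero (p = 0.5) for a GPA of 3.5.
hours_for_half = -(beta0 + beta2 * 3.5) / beta1

print(f"(a) probability = {p:.3f}")
print(f"(b) odds = {odds:.3f}")
print(f"(c) hours for a 50% chance = {hours_for_half:.1f}")
```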

> END SOLUTION

## Applied Problems: College Classification

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.metrics

%matplotlib inline

In [None]:
college_train = pd.read_csv("./data/college_train.csv")
college_test = pd.read_csv("./data/college_test.csv")

In [None]:
X_train = college_train.drop(["Private", "Name"], axis=1)
Y_train = np.where(college_train["Private"] == "Yes", 1, 0)
X_test = college_test.drop(["Private", "Name"], axis=1)
Y_test = np.where(college_test["Private"] == "Yes", 1, 0)

---

**Problem 6:** Training a Classifier

Train a logistic regression model to classify colleges as Private or Public using the provided training data. Store the trained model in a variable called `model_log`.

**Hint:** Use [`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). You may need to increase `max_iter` for convergence.

In [None]:
# BEGIN SOLUTION
model_log = sklearn.linear_model.LogisticRegression(max_iter=30000)
model_log.fit(X_train, Y_train)
# END SOLUTION

In [None]:
# Test assertions
assert model_log is not None, "model_log should be defined"
assert hasattr(model_log, "predict"), "model_log should have a predict method"
assert hasattr(model_log, "predict_proba"), "model_log should have a predict_proba method"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(model_log, "coef_"), "Model should be fitted (have coef_ attribute)"
num_features = X_train.shape[1]
assert model_log.coef_.shape[1] == num_features, "Model should have correct coefficients"
# END HIDDEN TESTS

---

**Problem 7:** Negative Log-Likelihood

For each sample in the `college_test` dataset, use the fitted model to predict the probability that `Private` is `Yes` and the probability that `Private` is `No`. Use these probabilities to calculate the negative log-likelihood of the testing dataset for the model:

$$-\sum_{i=1}^{n} \log \hat{p}(y_i \mid x_i),$$

where $n$ is the number of samples in `college_test` and each $(x_i, y_i)$ pair is a sample from that dataset.
Store the predicted probabilities in a variable called `y_pred_prob` and the mean NLL (the sum above divided by $n$) in a variable called `mean_nll`.
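
To make the formula concrete, here is a sketch that evaluates the NLL by hand on toy numbers. The labels and probabilities are invented stand-ins for `Y_test` and `y_pred_prob[:, 1]`, not values from the college data, and the function name is hypothetical. Dividing the sum by $n$ gives the mean NLL, which is what `sklearn.metrics.log_loss` returns with its default `normalize=True`.

```python
import math

# Toy labels and predicted P(Y = 1 | x); illustrative values only.
toy_labels = [1, 0, 1, 1, 0]
toy_p_yes = [0.9, 0.2, 0.6, 0.8, 0.4]


def negative_log_likelihood(y_true, p_yes):
    """Sum of -log p_hat(y_i | x_i) over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_yes):
        # Use P(Yes) for positive labels and P(No) = 1 - P(Yes) otherwise.
        total -= math.log(p if y == 1 else 1.0 - p)
    return total


toy_nll = negative_log_likelihood(toy_labels, toy_p_yes)
toy_mean_nll = toy_nll / len(toy_labels)
print(f"NLL = {toy_nll:.4f}, mean NLL = {toy_mean_nll:.4f}")
```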

In [None]:
# BEGIN SOLUTION
y_pred_prob = model_log.predict_proba(X_test)
mean_nll = sklearn.metrics.log_loss(Y_test, y_pred_prob)
print(f"Mean NLL: {mean_nll:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert y_pred_prob is not None, "y_pred_prob should be defined"
assert y_pred_prob.shape == (len(Y_test), 2), "y_pred_prob should have shape (n_samples, 2)"
assert mean_nll is not None, "mean_nll should be defined"
assert 0 < mean_nll < 1, "mean_nll should be between 0 and 1 for a reasonable model"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert np.allclose(y_pred_prob.sum(axis=1), 1.0), "Probabilities should sum to 1"
assert mean_nll < 0.5, "Model should achieve NLL less than 0.5"
# END HIDDEN TESTS

---

**Problem 8:** Classification Metrics

Now use $\hat{p}$ to create a hard classifier, also known as a decision rule. Specifically, consider the rule

$$\hat{y}(x) = \begin{cases} \text{Yes} & \text{if } \hat{p}(\text{Yes} \mid x) > 0.5 \\ \text{No} & \text{otherwise.} \end{cases}$$

Using the test data, compute the false positive rate (FPR), true positive rate (TPR), false negative rate (FNR), and true negative rate (TNR) made by this decision rule.
Store these in variables `fpr`, `tpr`,  `fnr`, and `tnr`, respectively.
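
For reference, the four rates are ratios of confusion-matrix counts: TPR $= \frac{TP}{TP + FN}$, FNR $= \frac{FN}{TP + FN}$, FPR $= \frac{FP}{FP + TN}$, and TNR $= \frac{TN}{FP + TN}$. The sketch below computes them on toy data; the labels, probabilities, and function name are illustrative assumptions and are unrelated to the college data.

```python
def classification_rates(y_true, p_yes, threshold=0.5):
    """TPR, FNR, FPR, and TNR for the rule 'predict 1 if p_yes > threshold'."""
    y_pred = [1 if p > threshold else 0 for p in p_yes]
    pairs = list(zip(y_true, y_pred))
    tp = pairs.count((1, 1))  # true label 1, predicted 1
    fn = pairs.count((1, 0))  # true label 1, predicted 0
    fp = pairs.count((0, 1))  # true label 0, predicted 1
    tn = pairs.count((0, 0))  # true label 0, predicted 0
    return {
        "tpr": tp / (tp + fn),
        "fnr": fn / (tp + fn),
        "fpr": fp / (fp + tn),
        "tnr": tn / (fp + tn),
    }


# Toy labels and predicted probabilities (illustrative values only).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_p_yes = [0.95, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
rates = classification_rates(toy_labels, toy_p_yes)
print(rates)
```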

In [None]:
# BEGIN SOLUTION
thresh = 0.5
y_pred = np.where(y_pred_prob[:, 1] > thresh, 1, 0)

# Unpack the 2x2 confusion matrix (rows = true class, columns = predicted class)
cnf_matrix = sklearn.metrics.confusion_matrix(Y_test, y_pred)
tn_count, fp_count, fn_count, tp_count = cnf_matrix.ravel()

tpr = tp_count / (tp_count + fn_count)
fpr = fp_count / (fp_count + tn_count)
tnr = tn_count / (tn_count + fp_count)
fnr = fn_count / (tp_count + fn_count)

print(f"TPR: {tpr:.4f}")
print(f"FPR: {fpr:.4f}")
print(f"TNR: {tnr:.4f}")
print(f"FNR: {fnr:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert 0 <= tpr <= 1, "TPR should be between 0 and 1"
assert 0 <= fpr <= 1, "FPR should be between 0 and 1"
assert 0 <= tnr <= 1, "TNR should be between 0 and 1"
assert 0 <= fnr <= 1, "FNR should be between 0 and 1"
assert np.isclose(tpr + fnr, 1.0), "TPR + FNR should equal 1"
assert np.isclose(fpr + tnr, 1.0), "FPR + TNR should equal 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert tpr > 0.9, "TPR should be greater than 0.9 for this model"
assert fpr < 0.15, "FPR should be less than 0.15 for this model"
# END HIDDEN TESTS

---

**Problem 9:** Threshold Tradeoffs

We will now use $\hat{p}$ to create a different hard classifier.

$$\hat{y}(x) = \begin{cases} \text{Yes} & \text{if } \hat{p}(\text{Yes} \mid x) > 0.9 \\ \text{No} & \text{otherwise.} \end{cases}$$

Using the test data, compute the FPR, TPR, FNR, and TNR for this new decision rule.
Store the results in variables called `fpr_high`, `tpr_high`, `fnr_high`, and `tnr_high`.

In [None]:
# BEGIN SOLUTION
thresh_high = 0.9
y_pred_high = np.where(y_pred_prob[:, 1] > thresh_high, 1, 0)

tpr_high = sklearn.metrics.recall_score(Y_test, y_pred_high)
fpr_high = 1 - sklearn.metrics.recall_score(1 - Y_test, 1 - y_pred_high)
tnr_high = sklearn.metrics.recall_score(1 - Y_test, 1 - y_pred_high)
fnr_high = 1 - sklearn.metrics.recall_score(Y_test, y_pred_high)

print(f"TPR at 0.9: {tpr_high:.4f}")
print(f"FPR at 0.9: {fpr_high:.4f}")
print(f"TNR at 0.9: {tnr_high:.4f}")
print(f"FNR at 0.9: {fnr_high:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert 0 <= tpr_high <= 1, "TPR should be between 0 and 1"
assert 0 <= fpr_high <= 1, "FPR should be between 0 and 1"
assert np.isclose(tpr_high + fnr_high, 1.0), "TPR + FNR should equal 1"
assert np.isclose(fpr_high + tnr_high, 1.0), "FPR + TNR should equal 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert tpr_high <= tpr, "TPR should decrease when threshold increases"
assert fpr_high <= fpr, "FPR should decrease when threshold increases"
# END HIDDEN TESTS

---

**Problem 10:** ROC Curve

We will now use $\hat{p}$ to create a family of different hard classifiers.

$$\hat{y}_t(x) = \begin{cases} \text{Yes} & \text{if } \hat{p}(\text{Yes} \mid x) > t \\ \text{No} & \text{otherwise.} \end{cases}$$

Using the test data, for each value of $t \in \{0.0, 1/100, 2/100, \ldots, 99/100, 1.0\}$, compute the FPR and TPR of the decision rule $\hat{y}_t$, saving them in variables named `fpr_list` and `tpr_list`. Plot your results as an ROC curve.

*Note: It is possible to plot the ROC curve with scikit-learn's `roc_curve` function, but for this problem, we want you to do it "by hand."*
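
The threshold sweep itself needs nothing beyond counting. The sketch below runs it on toy data with plain Python lists; the data and the name `roc_points` are illustrative assumptions, and plotting is left out.

```python
def roc_points(y_true, p_yes, thresholds):
    """Return (fpr, tpr) lists for the rule 'predict 1 if p_yes > t'."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fprs, tprs = [], []
    for t in thresholds:
        # Count positives and negatives whose probability exceeds t.
        tp = sum(1 for y, p in zip(y_true, p_yes) if y == 1 and p > t)
        fp = sum(1 for y, p in zip(y_true, p_yes) if y == 0 and p > t)
        tprs.append(tp / n_pos)
        fprs.append(fp / n_neg)
    return fprs, tprs


# Toy labels and predicted probabilities (illustrative values only).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_p_yes = [0.95, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
toy_thresholds = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00

toy_fprs, toy_tprs = roc_points(toy_labels, toy_p_yes, toy_thresholds)
print(f"t=0.00: FPR={toy_fprs[0]:.2f}, TPR={toy_tprs[0]:.2f}")
print(f"t=1.00: FPR={toy_fprs[-1]:.2f}, TPR={toy_tprs[-1]:.2f}")
```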

In [None]:
# BEGIN SOLUTION
thresholds = np.linspace(0, 1, 101)  # 0.00, 0.01, ..., 1.00
fpr_list = []
tpr_list = []

for thresh in thresholds:
    y_pred_thresh = np.where(y_pred_prob[:, 1] > thresh, 1, 0)
    current_tpr = sklearn.metrics.recall_score(Y_test, y_pred_thresh, zero_division=0)
    current_fpr = 1 - sklearn.metrics.recall_score(1 - Y_test, 1 - y_pred_thresh, zero_division=0)
    tpr_list.append(current_tpr)
    fpr_list.append(current_fpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr_list, tpr_list, label="Logistic regression")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")
plt.legend()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert len(fpr_list) > 0, "fpr_list should not be empty"
assert len(tpr_list) > 0, "tpr_list should not be empty"
assert len(fpr_list) == len(tpr_list), "fpr_list and tpr_list should have same length"
assert all(0 <= x <= 1 for x in fpr_list), "All FPR values should be between 0 and 1"
assert all(0 <= x <= 1 for x in tpr_list), "All TPR values should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(fpr_list) == 101, "Should have 101 threshold values"
assert fpr_list[0] == 1.0 and tpr_list[0] == 1.0, "At threshold 0, should predict all positive"
# END HIDDEN TESTS

---

**Problem 11:** AUC Score

Use $\hat{p}$, the test data, and `sklearn.metrics.roc_auc_score` to compute the area under your ROC curve. The first argument should be the true labels from the testing dataset. The second argument should be the probability that `Private` is `Yes`, as predicted by your model. Store the result in a variable called `auc_score`.
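
A useful way to interpret the AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that pairwise quantity on toy data. The data and the name `pairwise_auc` are illustrative assumptions, and the double loop is meant for intuition on small inputs rather than as a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive example
    receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for s_pos in pos:
        for s_neg in neg:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (illustrative values only).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_p_yes = [0.95, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_p_yes)
print(f"Pairwise AUC on toy data: {toy_auc:.4f}")
```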

In [None]:
# BEGIN SOLUTION
auc_score = sklearn.metrics.roc_auc_score(Y_test, y_pred_prob[:, 1])
print(f"AUC Score: {auc_score:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert auc_score is not None, "auc_score should be defined"
assert 0 <= auc_score <= 1, "AUC should be between 0 and 1"
assert auc_score > 0.5, "AUC should be greater than 0.5 (better than random)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert auc_score > 0.95, "AUC should be greater than 0.95 for this model"
# END HIDDEN TESTS