# DATASCI 503, Homework 3: Logistic Regression

Logistic regression is a standard method for binary classification. Unlike linear regression, which predicts continuous values, it models the probability that an observation belongs to a particular class using the logistic function. This assignment covers the mathematical foundations of logistic regression (odds, log-odds, and the softmax generalization to multiple classes), along with practical skills in training classifiers and evaluating them with metrics such as ROC curves and AUC.

In [None]:
import numpy as np

---

**Problem 1:** Deriving the Odds from Logistic Regression

Logistic regression models the probability as:

$$
p(x) = \frac{e^{\beta_0 + \beta_1 X}}{1+ e^{\beta_0 + \beta_1 X}}
$$

Show that the odds $\frac{p(x)}{1-p(x)} = e^{\beta_0 + \beta_1 X}$.

> BEGIN SOLUTION

$$\require{cancel}$$  

From the logistic regression formula:

$$
p(x)  = \frac{e^{\beta_0 + \beta_1 X}}{1+ e^{\beta_0 + \beta_1 X}}
$$

So

$$
1-p(x) = 1-\frac{e^{\beta_0 + \beta_1 X}}{1+ e^{\beta_0 + \beta_1 X}} = \frac{(1+ e^{\beta_0 + \beta_1 X}) - e^{\beta_0 + \beta_1 X}}{1+ e^{\beta_0 + \beta_1 X}} = \frac{1}{1+ e^{\beta_0 + \beta_1 X}}
$$

Therefore
$$
\frac{p(x)}{1-p(x)} = \frac{e^{\beta_0 + \beta_1 X}/\cancel{(1+ e^{\beta_0 + \beta_1 X})}}{1/\cancel{(1+ e^{\beta_0 + \beta_1 X})}} = e^{\beta_0 + \beta_1 X}
$$
> END SOLUTION
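
The identity can also be checked numerically. The following is a minimal sketch; the coefficient values and the helper name `p` are assumptions chosen only for illustration and are not part of the problem.

```python
import math

# Hypothetical coefficients, chosen only to illustrate the identity.
beta0, beta1 = -1.5, 0.8

def p(x):
    """Logistic probability p(x) = e^(b0 + b1 x) / (1 + e^(b0 + b1 x))."""
    z = math.exp(beta0 + beta1 * x)
    return z / (1 + z)

for x in [-2.0, 0.0, 1.0, 3.5]:
    odds = p(x) / (1 - p(x))
    # The odds should match e^(b0 + b1 x) up to floating-point error.
    assert math.isclose(odds, math.exp(beta0 + beta1 * x), rel_tol=1e-9)
```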


---

**Problem 2:** The Curse of Dimensionality in KNN

Consider the K-nearest neighbors (KNN) algorithm where each prediction uses only the observations that lie within 10% of the range of every feature closest to the test point. The features $X_1, X_2, \ldots, X_p$ are independently and uniformly distributed on $[0,1]$. Ignore edge effects near 0 and 1.

(a) For $p = 1$, what fraction of available observations will be used to make each prediction?

> BEGIN SOLUTION

For $p = 1$, $X \sim \text{Uniform}[0,1]$. The fraction of the available observations used to make each prediction is always **10%**.

> END SOLUTION

(b) For $p = 2$, what fraction of available observations will be used?

> BEGIN SOLUTION

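For $p = 2$, an observation must fall within the 10% interval in both features. Because the features are independent, the fraction is $0.1 \times 0.1 = 0.01$, or **1%**.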

> END SOLUTION

(c) For $p = 100$, what fraction will be used?

> BEGIN SOLUTION

For $p = 100$, the fraction is $0.1^{100} = 10^{-100}$, which is essentially **0%**.

> END SOLUTION

(d) What is the drawback of KNN when $p$ is large?

> BEGIN SOLUTION

When $p$ is large, the fraction of observations that are close to a test point in every feature shrinks exponentially (as $0.1^p$), so almost no training observations are truly near any given test point. The "nearest" neighbors are then far away, and KNN's premise that nearby inputs have similar outputs no longer yields reliable local averages.

> END SOLUTION

(e) If we want to use 10% of the observations (by volume), what side length is needed for a hypercube in $p$ dimensions? Calculate for $p = 1, 2, 100$.

> BEGIN SOLUTION

Since Volume $= s^p = 0.1$, we have $s = 0.1^{1/p}$.
- For $p = 1$: $s = 0.1$
- For $p = 2$: $s \approx 0.316$
- For $p = 100$: $s \approx 0.977$ (nearly the full range)

> END SOLUTION
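
The numbers in parts (a) through (e) can be reproduced with a few lines of Python. This is only a sketch; the function names `fraction_used` and `side_length` are hypothetical helpers, not part of the assignment.

```python
def fraction_used(p, width=0.1):
    """Fraction of observations within `width` of the range in all p features."""
    return width ** p

def side_length(p, volume=0.1):
    """Side of a p-dimensional hypercube whose volume is `volume`."""
    return volume ** (1 / p)

for p in [1, 2, 100]:
    print(f"p = {p:>3}: fraction = {fraction_used(p):.3g}, "
          f"side length = {side_length(p):.3f}")
```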

---

**Problem 3:** Odds and Probability Conversion

(a) If the odds of an event are 0.37, what is the probability of the event?

> BEGIN SOLUTION

Given: Odds $= \frac{p}{1-p} = 0.37$. Solving for $p$ gives $p = \frac{\text{odds}}{1+\text{odds}} = \frac{0.37}{1.37} \approx 0.27$

> END SOLUTION

(b) If the probability of an event is 0.16, what are the odds?

> BEGIN SOLUTION

Given: $p = 0.16$. Odds $= \frac{p}{1-p} = \frac{0.16}{0.84} \approx 0.19$

> END SOLUTION
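
Both conversions are one-line functions. A minimal sketch follows; the names `odds_to_prob` and `prob_to_odds` are hypothetical.

```python
def odds_to_prob(odds):
    """Invert odds = p / (1 - p) to get p = odds / (1 + odds)."""
    return odds / (1 + odds)

def prob_to_odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

print(round(odds_to_prob(0.37), 2))  # 0.27
print(round(prob_to_odds(0.16), 2))  # 0.19
```

The two functions are inverses of each other, so `prob_to_odds(odds_to_prob(0.37))` should return 0.37 up to floating-point error.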

---

**Problem 4:** Softmax vs Binary Logistic Regression

Consider classifying fruit as either orange or apple based on a single predictor $X$.

I fit a binary logistic regression model:
$$\log\left(\frac{p(\text{orange}|x)}{p(\text{apple}|x)}\right) = \hat{\beta}_0 + \hat{\beta}_1 X$$

with $\hat{\beta}_0 = 2$ and $\hat{\beta}_1 = -1$.

My friend fits a softmax regression model:
$$p(\text{orange}|x) = \frac{\exp(\hat{\alpha}_{\text{orange},0} + \hat{\alpha}_{\text{orange},1} x)}{\exp(\hat{\alpha}_{\text{orange},0} + \hat{\alpha}_{\text{orange},1} x) + \exp(\hat{\alpha}_{\text{apple},0} + \hat{\alpha}_{\text{apple},1} x)}$$

with $\hat{\alpha}_{\text{orange},0} = 1.2$, $\hat{\alpha}_{\text{orange},1} = -2$, $\hat{\alpha}_{\text{apple},0} = 3$, $\hat{\alpha}_{\text{apple},1} = 0.6$.

(a) In my model, what is $\frac{p(\text{orange}|x)}{p(\text{apple}|x)}$?

> BEGIN SOLUTION

From the log-odds formulation: $\frac{p(\text{orange}|x)}{p(\text{apple}|x)} = e^{\hat{\beta}_0 + \hat{\beta}_1 X} = e^{2 - X}$

> END SOLUTION

(b) In my friend's model, derive an expression for $\log\left(\frac{p(\text{orange}|x)}{p(\text{apple}|x)}\right)$.

> BEGIN SOLUTION

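Both softmax probabilities share the same denominator, so it cancels in the ratio, leaving $\exp(\hat{\alpha}_{\text{orange},0} + \hat{\alpha}_{\text{orange},1} X) / \exp(\hat{\alpha}_{\text{apple},0} + \hat{\alpha}_{\text{apple},1} X)$. Taking the log:
$$\log \left( \frac{p(\text{orange} | x)}{p(\text{apple}|x)} \right) = (\hat{\alpha}_{\text{orange},0} - \hat{\alpha}_{\text{apple},0}) + (\hat{\alpha}_{\text{orange},1} - \hat{\alpha}_{\text{apple},1}) X = -1.8 - 2.6 X$$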

> END SOLUTION

(c) Show the relationship between $\hat{\beta}_0, \hat{\beta}_1$ and my friend's $\hat{\alpha}$ coefficients.

> BEGIN SOLUTION

The two models describe the same log-odds exactly when $\hat{\beta}_0 = \hat{\alpha}_{\text{orange},0} - \hat{\alpha}_{\text{apple},0}$ and $\hat{\beta}_1 = \hat{\alpha}_{\text{orange},1} - \hat{\alpha}_{\text{apple},1}$. With the given values, the friend's model implies an intercept of $1.2 - 3 = -1.8 \neq 2$ and a slope of $-2 - 0.6 = -2.6 \neq -1$, so the two fitted models give different predictions.

> END SOLUTION
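
As a sanity check, the log-odds implied by the softmax model can be computed directly from its probabilities and compared with the difference-of-coefficients formula. This is a sketch using the coefficient values from the problem statement; the function names are hypothetical.

```python
import math

# Fitted coefficients as given in the problem statement.
beta0, beta1 = 2.0, -1.0
alpha = {"orange": (1.2, -2.0), "apple": (3.0, 0.6)}

def softmax_probs(x):
    """Class probabilities under the friend's softmax model."""
    scores = {k: math.exp(a0 + a1 * x) for k, (a0, a1) in alpha.items()}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

def softmax_log_odds(x):
    probs = softmax_probs(x)
    return math.log(probs["orange"] / probs["apple"])

def binary_log_odds(x):
    return beta0 + beta1 * x

for x in [0.0, 1.0, 2.0]:
    # Difference-of-coefficients form derived in part (b).
    diff_form = (1.2 - 3.0) + (-2.0 - 0.6) * x
    assert math.isclose(softmax_log_odds(x), diff_form, rel_tol=1e-9)
    print(x, round(softmax_log_odds(x), 3), binary_log_odds(x))
```

The printed columns differ at every $x$, which matches the conclusion in part (c) that the two fitted models disagree.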

---

**Problem 5:** Predicting Academic Success

Suppose we are predicting whether a student gets an A in a class based on hours studied per week ($X_1$) and GPA ($X_2$). The logistic regression model is:

$$\log\left(\frac{p}{1-p}\right) = -4 + 0.05 X_1 + X_2$$

(a) A student with a GPA of 3.5 studies 5 hours per week. What is the probability they get an A?

> BEGIN SOLUTION

Log-odds $= -4 + 0.05(5) + 3.5 = -0.25$, so $p = \frac{e^{-0.25}}{1+e^{-0.25}} \approx 0.438$ or **43.8%**.

> END SOLUTION

(b) What are the odds this student gets an A?

> BEGIN SOLUTION

Odds $= e^{-0.25} \approx 0.779$

> END SOLUTION

(c) How many hours would this student need to study to have a 50% chance of getting an A?

> BEGIN SOLUTION

For $p = 0.5$, log-odds $= 0$. Solving $0 = -4 + 0.05 X_1 + 3.5$ gives $X_1 = 10$ hours.

> END SOLUTION
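
The three answers above can be reproduced in a few lines. This is a minimal sketch using the coefficients from the problem; the helper names `log_odds` and `sigmoid` are hypothetical.

```python
import math

def log_odds(hours, gpa):
    """Linear predictor from the problem: -4 + 0.05 * hours + 1 * gpa."""
    return -4 + 0.05 * hours + 1.0 * gpa

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

z = log_odds(hours=5, gpa=3.5)   # part (a): log-odds
prob = sigmoid(z)                # part (a): probability of an A
odds = math.exp(z)               # part (b): odds of an A

# Part (c): solve 0 = -4 + 0.05 * h + 3.5 for h.
hours_for_half = (4 - 3.5) / 0.05

print(f"log-odds = {z:.2f}, p = {prob:.3f}, odds = {odds:.3f}, "
      f"hours for 50% = {hours_for_half:.1f}")
```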

## Applied Problems: College Classification

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.linear_model
import sklearn.metrics

%matplotlib inline

In [None]:
college_train = pd.read_csv("./data/college_train.csv")
college_test = pd.read_csv("./data/college_test.csv")

In [None]:
X_train = college_train.drop(["Private", "Name"], axis=1)
Y_train = np.where(college_train["Private"] == "Yes", 1, 0)
X_test = college_test.drop(["Private", "Name"], axis=1)
Y_test = np.where(college_test["Private"] == "Yes", 1, 0)

---

**Problem 6:** Training a Logistic Regression Classifier

Train a logistic regression model to classify colleges as Private or Public using the provided training data.

Store the trained model in a variable called `model_log`.

In [None]:
# BEGIN SOLUTION
model_log = sklearn.linear_model.LogisticRegression(max_iter=1000)
model_log.fit(X_train, Y_train)
# END SOLUTION

In [None]:
# Test assertions
assert model_log is not None, "model_log should be defined"
assert hasattr(model_log, "predict"), "model_log should have a predict method"
assert hasattr(model_log, "predict_proba"), "model_log should have a predict_proba method"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(model_log, "coef_"), "Model should be fitted (have coef_ attribute)"
num_features = X_train.shape[1]
assert model_log.coef_.shape[1] == num_features, "Model should have correct coefficients"
# END HIDDEN TESTS

---

**Problem 7:** Computing Negative Log-Likelihood

Compute the mean negative log-likelihood (NLL) on the test set. Store the predicted probabilities in a variable called `y_pred_prob` and the mean NLL in a variable called `mean_nll`.

**Hint:** You can use `model_log.predict_proba()` to get probabilities and `sklearn.metrics.log_loss()` to compute the mean NLL.

In [None]:
# BEGIN SOLUTION
y_pred_prob = model_log.predict_proba(X_test)
mean_nll = sklearn.metrics.log_loss(Y_test, y_pred_prob)
print(f"Mean NLL: {mean_nll:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert y_pred_prob is not None, "y_pred_prob should be defined"
assert y_pred_prob.shape == (len(Y_test), 2), "y_pred_prob should have shape (n_samples, 2)"
assert mean_nll is not None, "mean_nll should be defined"
assert 0 < mean_nll < 1, "mean_nll should be between 0 and 1 for a reasonable model"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert np.allclose(y_pred_prob.sum(axis=1), 1.0), "Probabilities should sum to 1"
assert mean_nll < 0.5, "Model should achieve NLL less than 0.5"
# END HIDDEN TESTS
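
To make the quantity concrete, the mean NLL can also be computed by hand as the average of $-[y \log \hat{p} + (1-y)\log(1-\hat{p})]$. The sketch below uses made-up toy labels and probabilities (assumptions for illustration, unrelated to the college data), with `toy_` names chosen so that no notebook variable is overwritten.

```python
import numpy as np

# Toy labels and predicted P(y = 1); these values are made up for illustration.
toy_y_true = np.array([1, 0, 1, 1, 0])
toy_p_hat = np.array([0.9, 0.2, 0.6, 0.8, 0.4])

# Per-observation negative log-likelihood under a Bernoulli model.
toy_nll_each = -(toy_y_true * np.log(toy_p_hat)
                 + (1 - toy_y_true) * np.log(1 - toy_p_hat))
toy_mean_nll = toy_nll_each.mean()
print(f"Mean NLL (by hand): {toy_mean_nll:.4f}")
```

With its default settings, `sklearn.metrics.log_loss` averages the same per-observation terms, so it should agree with this value on the same inputs.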

---

**Problem 8:** Classification Metrics at Threshold 0.5

Using a threshold of 0.5, compute the following classification metrics:
- True Positive Rate (TPR, also called Recall or Sensitivity)
- False Positive Rate (FPR)
- True Negative Rate (TNR, also called Specificity)
- False Negative Rate (FNR)

Store these in variables `tpr`, `fpr`, `tnr`, and `fnr` respectively.

In [None]:
# BEGIN SOLUTION
thresh = 0.5
y_pred = np.where(y_pred_prob[:, 1] > thresh, 1, 0)

# Calculate using confusion matrix
cnf_matrix = sklearn.metrics.confusion_matrix(Y_test, y_pred)
fp_count = cnf_matrix[0, 1].astype(float)
fn_count = cnf_matrix[1, 0].astype(float)
tp_count = cnf_matrix[1, 1].astype(float)
tn_count = cnf_matrix[0, 0].astype(float)

tpr = tp_count / (tp_count + fn_count)
fpr = fp_count / (fp_count + tn_count)
tnr = tn_count / (tn_count + fp_count)
fnr = fn_count / (tp_count + fn_count)

print(f"TPR: {tpr:.4f}")
print(f"FPR: {fpr:.4f}")
print(f"TNR: {tnr:.4f}")
print(f"FNR: {fnr:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert 0 <= tpr <= 1, "TPR should be between 0 and 1"
assert 0 <= fpr <= 1, "FPR should be between 0 and 1"
assert 0 <= tnr <= 1, "TNR should be between 0 and 1"
assert 0 <= fnr <= 1, "FNR should be between 0 and 1"
assert np.isclose(tpr + fnr, 1.0), "TPR + FNR should equal 1"
assert np.isclose(fpr + tnr, 1.0), "FPR + TNR should equal 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert tpr > 0.9, "TPR should be greater than 0.9 for this model"
assert fpr < 0.15, "FPR should be less than 0.15 for this model"
# END HIDDEN TESTS

---

**Problem 9:** Classification Metrics at Threshold 0.9

Repeat the classification metrics computation using a threshold of 0.9. Store the results in variables called `tpr_high`, `fpr_high`, `tnr_high`, and `fnr_high`.

Compare these to the metrics at threshold 0.5. What happens to TPR and FPR when we increase the threshold?

In [None]:
# BEGIN SOLUTION
thresh_high = 0.9
y_pred_high = np.where(y_pred_prob[:, 1] > thresh_high, 1, 0)

tpr_high = sklearn.metrics.recall_score(Y_test, y_pred_high)
fpr_high = 1 - sklearn.metrics.recall_score(1 - Y_test, 1 - y_pred_high)
tnr_high = sklearn.metrics.recall_score(1 - Y_test, 1 - y_pred_high)
fnr_high = 1 - sklearn.metrics.recall_score(Y_test, y_pred_high)

print(f"TPR at 0.9: {tpr_high:.4f}")
print(f"FPR at 0.9: {fpr_high:.4f}")
print(f"TNR at 0.9: {tnr_high:.4f}")
print(f"FNR at 0.9: {fnr_high:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert 0 <= tpr_high <= 1, "TPR should be between 0 and 1"
assert 0 <= fpr_high <= 1, "FPR should be between 0 and 1"
assert np.isclose(tpr_high + fnr_high, 1.0), "TPR + FNR should equal 1"
assert np.isclose(fpr_high + tnr_high, 1.0), "FPR + TNR should equal 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert tpr_high < tpr, "TPR should decrease when threshold increases"
assert fpr_high < fpr, "FPR should decrease when threshold increases"
# END HIDDEN TESTS

Based on your results above, explain what happens to TPR and FPR when we increase the threshold from 0.5 to 0.9.

> BEGIN SOLUTION

When we increase the threshold from 0.5 to 0.9, **both TPR and FPR decrease**. The model becomes more conservative about predicting the positive class:

- **TPR decreases**: With a higher threshold, fewer instances are predicted as positive, so we miss more actual positives (more false negatives), reducing the true positive rate.
- **FPR decreases**: Fewer actual negatives have a predicted probability above the higher threshold, so fewer of them are wrongly predicted as positive.

In general, raising the threshold can never increase either rate. This is the basic tradeoff in classification: a higher threshold reduces false positives at the cost of missing true positives. The right threshold depends on the relative costs of false positives and false negatives in the application.

> END SOLUTION
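
The same effect shows up on a small hand-checkable example. The sketch below uses made-up labels and scores (assumptions for illustration only) and a hypothetical helper `rates` that computes the four rates directly from their definitions.

```python
import numpy as np

# Made-up labels and scores, for illustration only.
toy_y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_scores = np.array([0.95, 0.85, 0.70, 0.40, 0.92, 0.60, 0.30, 0.10])

def rates(y, scores, threshold):
    """Return (TPR, FPR, TNR, FNR) when predicting 1 for scores above threshold."""
    pred = scores > threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    tn = np.sum(~pred & (y == 0))
    return tp / (tp + fn), fp / (fp + tn), tn / (tn + fp), fn / (fn + tp)

for t in (0.5, 0.9):
    tpr_t, fpr_t, tnr_t, fnr_t = rates(toy_y, toy_scores, t)
    print(f"threshold {t}: TPR={tpr_t:.2f} FPR={fpr_t:.2f} "
          f"TNR={tnr_t:.2f} FNR={fnr_t:.2f}")
```

On this toy data, moving the threshold from 0.5 to 0.9 lowers TPR from 0.75 to 0.25 and FPR from 0.50 to 0.25.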

---

**Problem 10:** ROC Curve

Plot the ROC (Receiver Operating Characteristic) curve by varying the threshold from 0 to 1. Store the lists of FPR and TPR values across thresholds in variables called `fpr_list` and `tpr_list`.

**Hint:** Create thresholds using `np.arange(0, 1.001, 0.01)` and compute TPR/FPR at each threshold.

In [None]:
# BEGIN SOLUTION
thresholds = np.arange(0, 1.001, 0.01)
fpr_list = []
tpr_list = []

for t in thresholds:  # `t` avoids overwriting `thresh` from Problem 8
    y_pred_thresh = np.where(y_pred_prob[:, 1] > t, 1, 0)
    current_tpr = sklearn.metrics.recall_score(Y_test, y_pred_thresh, zero_division=0)
    current_fpr = 1 - sklearn.metrics.recall_score(1 - Y_test, 1 - y_pred_thresh, zero_division=0)
    tpr_list.append(current_tpr)
    fpr_list.append(current_fpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr_list, tpr_list, label="Logistic regression")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")
plt.legend()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert len(fpr_list) > 0, "fpr_list should not be empty"
assert len(tpr_list) > 0, "tpr_list should not be empty"
assert len(fpr_list) == len(tpr_list), "fpr_list and tpr_list should have same length"
assert all(0 <= x <= 1 for x in fpr_list), "All FPR values should be between 0 and 1"
assert all(0 <= x <= 1 for x in tpr_list), "All TPR values should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(fpr_list) == 101, "Should have 101 threshold values"
assert fpr_list[0] == 1.0 and tpr_list[0] == 1.0, "At threshold 0, should predict all positive"
# END HIDDEN TESTS

---

**Problem 11:** AUC Score

Compute the Area Under the ROC Curve (AUC) score. Store the result in a variable called `auc_score`.

**Hint:** Use `sklearn.metrics.roc_auc_score()`.

In [None]:
# BEGIN SOLUTION
auc_score = sklearn.metrics.roc_auc_score(Y_test, y_pred_prob[:, 1])
print(f"AUC Score: {auc_score:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert auc_score is not None, "auc_score should be defined"
assert 0 <= auc_score <= 1, "AUC should be between 0 and 1"
assert auc_score > 0.5, "AUC should be greater than 0.5 (better than random)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert auc_score > 0.95, "AUC should be greater than 0.95 for this model"
# END HIDDEN TESTS
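
One way to interpret the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example (ties counted as one half). The sketch below checks this pairwise definition on made-up toy data; the labels, scores, and `toy_` names are assumptions for illustration and do not touch the notebook's variables.

```python
import numpy as np

# Made-up labels and scores, for illustration only.
toy_y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_scores = np.array([0.95, 0.85, 0.70, 0.40, 0.92, 0.60, 0.30, 0.10])

pos = toy_scores[toy_y == 1]
neg = toy_scores[toy_y == 0]

# Compare every positive score with every negative score; ties count as half.
diff = pos[:, None] - neg[None, :]
toy_auc = np.mean((diff > 0) + 0.5 * (diff == 0))
print(f"Pairwise AUC: {toy_auc:.2f}")
```

Here 12 of the 16 positive-negative pairs are ranked correctly, giving 0.75. On the same inputs, `sklearn.metrics.roc_auc_score(toy_y, toy_scores)` should return the same value.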