# DATASCI 503, Homework 5: Resampling Methods

Resampling methods are techniques that repeatedly draw samples from a training set and refit a model to each sample to obtain additional information about the fitted model. In this assignment, we explore two key resampling methods: **bootstrap** (for estimating the variability of an estimator) and **cross-validation** (for estimating test error and selecting models). These methods connect directly to the bias-variance tradeoff—cross-validation helps us choose models that balance underfitting and overfitting, while bootstrap helps us quantify uncertainty in our estimates.

In [None]:
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import KFold

warnings.filterwarnings("ignore")

---

**Problem 1 (ISLP Ch 5, Exercise 2):** Bootstrap Sampling Probability

We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of $n$ observations.

**(a)** What is the probability that the first bootstrap observation is not the $j$-th observation from the original sample? Justify your answer.

> BEGIN SOLUTION

*Solutions in this assignment are adapted from work by John Elmer Loretizo.*

Each bootstrap draw selects one of the $n$ original observations uniformly at random, so the first draw is the $j$-th observation with probability $1/n$. Taking the complement, $P(\text{first draw is not the }j\text{th observation})=1-1/n$.

> END SOLUTION

**(b)** What is the probability that the second bootstrap observation is not the $j$-th observation from the original sample?

> BEGIN SOLUTION

Because bootstrap sampling is performed with replacement, the second draw is again uniform over all $n$ observations, regardless of what the first draw was. Hence $P(\text{second draw is the }j\text{th observation})=1/n$ and $P(\text{second draw is not the }j\text{th observation})=1-1/n$.

> END SOLUTION

**(c)** Argue that the probability that the $j$-th observation is not in the bootstrap sample is $(1 - \frac{1}{n})^n$.

> BEGIN SOLUTION

Parts (a) and (b) reflect the independence of bootstrap draws: because we sample with replacement, every observation has a $1/n$ chance of being picked on each draw, regardless of earlier draws. The $j$-th observation is absent from the bootstrap sample only if it is missed on all $n$ draws. Each draw misses it with probability $(1 - \frac{1}{n})$, so by independence we multiply these $n$ probabilities to get $(1 - \frac{1}{n})^n$.
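
As a sanity check on this argument, the following is a minimal Monte Carlo sketch, not part of the original solution. The helper name `prob_j_missing` and the number of trials are illustrative assumptions. It draws many bootstrap index samples and counts how often a fixed observation is absent.

```python
import numpy as np


def prob_j_missing(n, num_trials=20_000, seed=0):
    """Monte Carlo estimate of P(observation j is absent from a bootstrap sample)."""
    rng = np.random.default_rng(seed)
    # Each row is one bootstrap sample of n indices drawn with replacement.
    draws = rng.integers(0, n, size=(num_trials, n))
    # By symmetry, we can track index 0 as "the j-th observation".
    missing = ~(draws == 0).any(axis=1)
    return missing.mean()


for n in (5, 100):
    simulated = prob_j_missing(n)
    exact = (1 - 1 / n) ** n
    print(f"n={n}: simulated={simulated:.4f}, exact={exact:.4f}")
```

The simulated frequencies should agree with $(1 - \frac{1}{n})^n$ up to Monte Carlo noise.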

> END SOLUTION

**(d)** When $n = 5$, what is the probability that the $j$-th observation is in the bootstrap sample? Store your answer in a variable called `prob_in_sample_n5`. Then explain your calculation.

In [None]:
# BEGIN SOLUTION
# Using the probability P(observing the jth observation) = 1 - ((1 - 1/n)^n)
n = 5
prob_in_sample_n5 = 1 - (1 - (1 / n)) ** n
print(f"Probability: {prob_in_sample_n5:.4f} (about 67.23%)")
# END SOLUTION

> BEGIN SOLUTION

Using the result from (c), the probability that the $j$-th observation is in the bootstrap sample is $1 - (1 - 1/n)^n = 1 - (4/5)^5 \approx 0.6723$.

> END SOLUTION

In [None]:
# Test assertions
assert isinstance(prob_in_sample_n5, float), "Result should be a float"
assert 0 < prob_in_sample_n5 < 1, "Probability must be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(prob_in_sample_n5 - 0.6723) < 0.001, f"Expected ~0.6723, got {prob_in_sample_n5}"
assert abs(prob_in_sample_n5 - (1 - (1 - 1 / 5) ** 5)) < 1e-10, "Formula not applied correctly"
# END HIDDEN TESTS

**(e)** When $n = 100$, what is the probability that the $j$-th observation is in the bootstrap sample? Store your answer in a variable called `prob_in_sample_n100`.

In [None]:
# BEGIN SOLUTION
# Following the same formula, we get about 63.4%
n = 100
prob_in_sample_n100 = 1 - (1 - (1 / n)) ** n
print(f"Probability: {prob_in_sample_n100:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(prob_in_sample_n100, float), "Result should be a float"
assert 0 < prob_in_sample_n100 < 1, "Probability must be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(prob_in_sample_n100 - 0.634) < 0.01, f"Expected ~0.634, got {prob_in_sample_n100}"
# END HIDDEN TESTS

**(f)** When $n = 10{,}000$, what is the probability that the $j$-th observation is in the bootstrap sample? Store your answer in a variable called `prob_in_sample_n10000`.

In [None]:
# BEGIN SOLUTION
# Following the same formula, we get about 63.2%
# Note: as n -> infinity, this probability converges to 1 - 1/e ≈ 0.632
n = 10000
prob_in_sample_n10000 = 1 - (1 - (1 / n)) ** n
print(f"Probability: {prob_in_sample_n10000:.4f}")
print(f"Limit as n->inf: {1 - 1 / np.e:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(prob_in_sample_n10000, float), "Result should be a float"
assert 0 < prob_in_sample_n10000 < 1, "Probability must be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(prob_in_sample_n10000 - 0.632) < 0.01, f"Expected ~0.632, got {prob_in_sample_n10000}"
# END HIDDEN TESTS

---

**Problem 2 (ISLP Ch 5, Exercise 4):** Estimating Standard Deviation of Predictions

Suppose that we use some statistical learning method to make a prediction for the response $Y$ for a particular value of the predictor $X$. Carefully describe how we might estimate the standard deviation of our prediction.

> BEGIN SOLUTION

We can use the bootstrap. Repeatedly draw $n$ observations with replacement from the original data set, refit the model on each bootstrap sample, and predict the response $Y$ at the particular value of the predictor $X$. Repeating this $B$ times gives $B$ predicted values of $Y$ at that value of $X$, which approximate the sampling distribution of the prediction. The sample standard deviation of these $B$ predictions is our estimate of the standard deviation of the prediction.
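
The procedure can be sketched as follows. This is an illustrative example only: the synthetic data, the choice of `LinearRegression` as the learning method, and the helper name `bootstrap_prediction_sd` are assumptions, since the problem does not fix a particular method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def bootstrap_prediction_sd(X, y, x0, num_bootstrap=500, seed=0):
    """Bootstrap estimate of the standard deviation of the prediction at x0."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty(num_bootstrap)
    for b in range(num_bootstrap):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        model = LinearRegression().fit(X[idx], y[idx])
        preds[b] = model.predict(x0)[0]
    return preds.std(ddof=1)


# Hypothetical synthetic data, used only to make the sketch runnable.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(200, 1))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng_demo.normal(size=200)
x0_demo = np.array([[0.5]])

sd_hat = bootstrap_prediction_sd(X_demo, y_demo, x0_demo)
print(f"Bootstrap SD of the prediction at x0 = 0.5: {sd_hat:.4f}")
```

Any other learning method could be substituted for `LinearRegression`, as long as it is refit from scratch on every bootstrap sample.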

> END SOLUTION

---

**Problem 3:** Cross-Validation on Simulated Data

We will now perform cross-validation on a simulated data set.

**(a)** Generate a simulated data set as follows:

```python
rng = np.random.default_rng(2024)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)
```
    
In this data set, what is $n$ and what is $p$? Write out the model used to generate the data in equation form.

In [None]:
# BEGIN SOLUTION
# Generate the data
# n = 100, p = 1, equation: y = x - 2x^2 + epsilon
rng = np.random.default_rng(2024)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)
print(f"n = {len(x)}, p = 1")
print("Model: y = x - 2x² + ε")
# END SOLUTION

> BEGIN SOLUTION

In this data set, $n = 100$ and $p = 1$ (the single predictor $X$). The model is $Y = X - 2X^2 + \epsilon$, where $\epsilon \sim N(0, 1)$.

> END SOLUTION

**(b)** Create a scatterplot of $X$ against $Y$. Comment on what you find.

In [None]:
# BEGIN SOLUTION
# The scatterplot shows a nonlinear (quadratic) relationship between X and Y,
# which is consistent with the data generating process that includes a quadratic term.
sns.scatterplot(x=x, y=y)
# END SOLUTION

> BEGIN SOLUTION

The scatterplot shows a clear nonlinear (quadratic) relationship between $X$ and $Y$, consistent with the data generating process that includes a quadratic term. The curve opens downward due to the negative coefficient on $X^2$.

> END SOLUTION

**(c)** Construct 5 folds using `sklearn.model_selection.KFold` with `shuffle=True` and `random_state=3` (`KFold` only accepts a `random_state` when shuffling is enabled). Using these folds, compute the cross-validation errors that result from fitting the following four models using least squares. Use the same five folds for all four models.

i. $Y = \beta_0 + \beta_1 X + \epsilon$

ii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$

iii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$

iv. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \epsilon$

For each fold and each model, report the mean-squared error. Which of the models had the smallest average error? Is this what you expected? Explain your answer.

In [None]:
# BEGIN SOLUTION
# Prepare polynomial features
X = x.reshape(-1, 1)
X_poly = np.hstack([X**i for i in range(1, 5)])

# Perform 5-fold cross-validation for each model
kf = KFold(n_splits=5, shuffle=True, random_state=3)
errors = np.zeros((4, 5))

for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    for i in range(4):
        X_train_poly = X_poly[train_index, : i + 1]
        X_test_poly = X_poly[test_index, : i + 1]

        model = LinearRegression().fit(X_train_poly, y_train)
        y_pred = model.predict(X_test_poly)
        errors[i, fold] = mean_squared_error(y_test, y_pred)

# Create a DataFrame to display the results
errors_df = pd.DataFrame(errors)
errors_df.columns = ["Fold " + str(col + 1) for col in errors_df.columns]
errors_df["Mean"] = errors_df.mean(axis=1)
errors_df.index = errors_df.index + 1
print(errors_df)
print(f"\nModel {errors_df['Mean'].idxmin()} has the smallest average error.")
print("This is expected since the data was generated with a quadratic relationship.")
# END SOLUTION

> BEGIN SOLUTION

Model 2 (quadratic) has the smallest average cross-validation error. This is expected since the true data generating process is quadratic ($Y = X - 2X^2 + \epsilon$). Adding higher-order polynomial terms (cubic, quartic) does not meaningfully improve the fit.
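
For comparison, here is a hedged sketch of the same comparison written with scikit-learn's `PolynomialFeatures`, `make_pipeline`, and `cross_val_score`. It is an alternative formulation, not the approach used in the solution cell, and the `_demo` variable names are chosen here to avoid overwriting the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the simulated data from part (a) so the sketch is self-contained.
rng_demo = np.random.default_rng(2024)
x_demo = rng_demo.normal(size=100)
y_demo = x_demo - 2 * x_demo**2 + rng_demo.normal(size=100)
X_demo = x_demo.reshape(-1, 1)

# A fixed integer random_state makes every call to split() yield the same folds.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=3)

cv_mse = {}
for degree in range(1, 5):
    pipeline = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(
        pipeline, X_demo, y_demo, cv=kf_demo, scoring="neg_mean_squared_error"
    )
    cv_mse[degree] = -scores.mean()  # scores are negated MSEs

for degree, mse in cv_mse.items():
    print(f"Degree {degree}: mean CV MSE = {mse:.4f}")
```

Wrapping the feature expansion in a pipeline keeps the polynomial terms and the regression fit together, so each fold's model is built only from that fold's training rows.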

> END SOLUTION

In [None]:
# Test assertions
assert errors_df.shape == (4, 6), f"Expected shape (4, 6), got {errors_df.shape}"
assert errors_df["Mean"].idxmin() == 2, "Model 2 should have the lowest mean error"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert errors_df.loc[1, "Mean"] > 4, "Linear model should have high error for quadratic data"
assert errors_df.loc[2, "Mean"] < 2, "Quadratic model should have low error"
# END HIDDEN TESTS

---

**Problem 4:** LOOCV and Random Seeds

Consider estimating log odds using logistic regression with ridge penalties, setting the regularization strength via LOOCV. True or false: the result will depend on a random seed that we use as part of the LOOCV process. Explain your answer.

> BEGIN SOLUTION

False. LOOCV fits the model $n$ times; the $i$-th fit trains on the other $n-1$ observations and uses the $i$-th observation alone as the validation set. There is only one way to form these $n$ splits, so the splitting step is deterministic and involves no randomness. Therefore, the selected regularization strength does not depend on a random seed used in the LOOCV process.
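
A small sketch illustrating this point with scikit-learn's `LeaveOneOut` splitter, which takes no seed argument. The six-row array is a hypothetical stand-in for a data set.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_demo = np.arange(12).reshape(6, 2)  # six hypothetical observations
loo = LeaveOneOut()

# Enumerate the splits twice and compare them.
first_pass = [(train.tolist(), test.tolist()) for train, test in loo.split(X_demo)]
second_pass = [(train.tolist(), test.tolist()) for train, test in loo.split(X_demo)]

print(f"Number of splits: {len(first_pass)}")
print(f"Held-out indices: {[test[0] for _, test in first_pass]}")
print(f"Identical across passes: {first_pass == second_pass}")
```

One caveat: some optimizers are themselves stochastic (for example, `LogisticRegression` uses its `random_state` with the `sag`, `saga`, and `liblinear` solvers), but that randomness belongs to the fitting procedure rather than to LOOCV.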

> END SOLUTION

---

**Problem 5 (ISLP Ch 5, Exercise 9):** Boston Housing Data Analysis

We will now consider the Boston housing data set, which contains information about housing in the Boston area collected in the 1970s. The target variable `medv` represents the median value of owner-occupied homes in \$1000s for each census tract.

**(a)** Based on this data set, provide an estimate for the population mean of `medv`. Call this estimate $\hat{\mu}$ and store it in a variable called `mu_hat`.

In [None]:
boston_df = pd.read_csv("./data/boston.csv")
# BEGIN SOLUTION
mu_hat = boston_df["medv"].mean()
print(f"Estimated mean (mu_hat): {mu_hat:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(mu_hat, float), "mu_hat should be a float"
assert 0 < mu_hat < 100, "Mean should be a reasonable housing value"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(mu_hat - 22.533) < 0.01, f"Expected ~22.533, got {mu_hat}"
# END HIDDEN TESTS

**(b)** Provide an estimate of the standard error of $\hat{\mu}$ and store it in a variable called `se_mu_hat`. Interpret this result.

**Hint:** We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

In [None]:
# BEGIN SOLUTION
se_mu_hat = boston_df["medv"].std() / np.sqrt(boston_df.shape[0])
print(f"Standard error of mean: {se_mu_hat:.4f}")
# END SOLUTION

> BEGIN SOLUTION

The standard error of approximately 0.41 (in thousands of dollars) measures how much the sample mean would typically vary from one sample to another. It is small relative to the estimated mean of about 22.5, suggesting our estimate is reasonably precise.

> END SOLUTION

In [None]:
# Test assertions
assert isinstance(se_mu_hat, float), "se_mu_hat should be a float"
assert se_mu_hat > 0, "Standard error must be positive"
assert se_mu_hat < mu_hat, "Standard error should be smaller than the mean"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(se_mu_hat - 0.409) < 0.01, f"Expected ~0.409, got {se_mu_hat}"
# END HIDDEN TESTS

**(c)** Now estimate the standard error of $\hat{\mu}$ using the bootstrap. Store the result in a variable called `bootstrap_sem`. How does this compare to your answer from (b)?

In [None]:
# BEGIN SOLUTION
def get_mean(data):
    return data.mean()


def boot_se(func, df, colname, n=None, num_bootstrap=1000, seed=0):
    rng = np.random.default_rng(seed)
    first_moment, second_moment = 0, 0
    n = n or df.shape[0]
    for _ in range(num_bootstrap):
        bootstrap_sample = rng.choice(df[colname], size=n, replace=True)
        value = func(bootstrap_sample)
        first_moment += value
        second_moment += value**2
    return np.sqrt(second_moment / num_bootstrap - (first_moment / num_bootstrap) ** 2)


bootstrap_sem = boot_se(get_mean, boston_df, "medv", num_bootstrap=10000, seed=2024)
print(f"Bootstrap SE: {bootstrap_sem:.4f}")
print(f"Formula SE:   {se_mu_hat:.4f}")
print("The bootstrap estimate is very close to the formula-based estimate.")
# END SOLUTION

> BEGIN SOLUTION

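The bootstrap estimate of the standard error is very close to the formula-based estimate from (b). This is expected: with infinitely many resamples, the bootstrap standard error of the sample mean equals $\hat{\sigma}/\sqrt{n}$, where $\hat{\sigma}$ is the standard deviation of the data computed with denominator $n$ instead of $n-1$. The two estimates therefore differ only by the factor $\sqrt{(n-1)/n}$, which is close to 1 for a sample of this size, plus Monte Carlo noise from using a finite number of resamples.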
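
The relationship between the two estimates can be checked numerically. The sketch below uses a hypothetical exponential sample instead of the Boston data, and a vectorized resampling step that differs from the `boot_se` helper in the solution cell.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
sample = rng_demo.exponential(scale=10.0, size=500)  # hypothetical skewed data
n_demo = sample.size

formula_se = sample.std(ddof=1) / np.sqrt(n_demo)  # s / sqrt(n), as in part (b)
plug_in_se = sample.std(ddof=0) / np.sqrt(n_demo)  # limit of the bootstrap SE

# Vectorized bootstrap: each row is one resample of size n drawn with replacement.
boot_means = rng_demo.choice(sample, size=(5000, n_demo), replace=True).mean(axis=1)
boot_se_demo = boot_means.std(ddof=0)

print(f"Formula SE (ddof=1): {formula_se:.4f}")
print(f"Plug-in SE (ddof=0): {plug_in_se:.4f}")
print(f"Bootstrap SE:        {boot_se_demo:.4f}")
```

All three numbers should be close, with the bootstrap value fluctuating slightly around the plug-in value from run to run.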

> END SOLUTION

In [None]:
# Test assertions
assert isinstance(bootstrap_sem, float), "bootstrap_sem should be a float"
assert bootstrap_sem > 0, "Standard error must be positive"
assert bootstrap_sem < 1, "Bootstrap SE should be reasonable for this dataset"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(bootstrap_sem - 0.408) < 0.02, f"Expected ~0.408, got {bootstrap_sem}"
# END HIDDEN TESTS

**(d)** Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of `medv`. Compare it to the results obtained by using `boston_df['medv'].std()` and the two standard error rule.

**Hint:** You can approximate a 95% confidence interval using the formula $[\hat{\mu} - 2\text{SE}(\hat{\mu}), \hat{\mu} + 2\text{SE}(\hat{\mu})]$.

In [None]:
# BEGIN SOLUTION
confidence_interval_bootstrap = (mu_hat - 2 * bootstrap_sem, mu_hat + 2 * bootstrap_sem)
confidence_interval_standard_error = (mu_hat - 2 * se_mu_hat, mu_hat + 2 * se_mu_hat)
print(f"Bootstrap CI: {confidence_interval_bootstrap}")
print(f"Formula CI:   {confidence_interval_standard_error}")
print("The confidence intervals are very similar because the two SE estimates nearly agree.")
# END SOLUTION

> BEGIN SOLUTION

The bootstrap-based and formula-based 95% confidence intervals are very similar, both suggesting the population mean of `medv` lies approximately between 21.7 and 23.4 thousand dollars.
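
The two-standard-error rule is one way to turn bootstrap output into an interval. Another common option, not requested by the problem, is the percentile interval, which reads the 2.5th and 97.5th percentiles directly off the bootstrap distribution. A minimal sketch on a hypothetical gamma-distributed sample (not the Boston data):

```python
import numpy as np

rng_demo = np.random.default_rng(0)
sample = rng_demo.gamma(shape=4.0, scale=5.0, size=500)  # hypothetical stand-in for medv

# Bootstrap distribution of the sample mean.
boot_means = rng_demo.choice(sample, size=(5000, sample.size), replace=True).mean(axis=1)

mean_hat = sample.mean()
se_hat = boot_means.std(ddof=1)

two_se_ci = (mean_hat - 2 * se_hat, mean_hat + 2 * se_hat)
percentile_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print(f"Two-SE interval:     ({two_se_ci[0]:.3f}, {two_se_ci[1]:.3f})")
print(f"Percentile interval: ({percentile_ci[0]:.3f}, {percentile_ci[1]:.3f})")
```

For a statistic like the mean, whose bootstrap distribution is close to symmetric, the two intervals usually agree closely; they can differ more for skewed statistics.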

> END SOLUTION

In [None]:
# Test assertions
ci_low, ci_high = confidence_interval_bootstrap
assert ci_low < mu_hat < ci_high, "Mean should be within CI"
ci_se_low, ci_se_high = confidence_interval_standard_error
assert abs(ci_low - ci_se_low) < 0.1, "Bootstrap and standard CI should be similar"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert ci_high - ci_low < 2, "CI width should be less than 2"
# END HIDDEN TESTS

**(e)** Based on this data set, provide an estimate, $\hat{m}$, for the median value of `medv` in the population. Store it in a variable called `median_hat`.

In [None]:
# BEGIN SOLUTION
median_hat = boston_df["medv"].median()
print(f"Median: {median_hat}")
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(median_hat, float), "median_hat should be a float"
assert 0 < median_hat < 100, "Median should be a reasonable housing value"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(median_hat - 21.2) < 0.01, f"Expected 21.2, got {median_hat}"
# END HIDDEN TESTS

**(f)** We now would like to estimate the standard error of $\hat{m}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap and store it in a variable called `bootstrap_se_median`. Comment on your findings.

In [None]:
# BEGIN SOLUTION
def get_median(data):
    return np.median(data)


bootstrap_se_median = boot_se(get_median, boston_df, "medv", num_bootstrap=10000, seed=2024)
print(f"Bootstrap SE of median: {bootstrap_se_median:.4f}")
print(f"Relative to median ({median_hat}), this SE is small, suggesting high precision.")
# END SOLUTION

> BEGIN SOLUTION

The bootstrap standard error of the median (~0.38) is small relative to the median itself (21.2), suggesting the median is estimated with good precision. The SE of the median is slightly smaller than the SE of the mean, which is consistent with the median being more robust to the right-skewed distribution of housing values.

> END SOLUTION

In [None]:
# Test assertions
assert isinstance(bootstrap_se_median, float), "bootstrap_se_median should be a float"
assert bootstrap_se_median > 0, "Standard error must be positive"
assert bootstrap_se_median < 1, "Bootstrap SE should be reasonable for this dataset"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(bootstrap_se_median - 0.379) < 0.05, f"Expected ~0.379, got {bootstrap_se_median}"
# END HIDDEN TESTS

**(g)** Based on this data set, provide an estimate for the tenth percentile of `medv` in Boston census tracts. Call this quantity $\hat{p}_{0.1}$ and store it in a variable called `percentile_10_hat`.

**Hint:** You can use `np.percentile()`.

In [None]:
# BEGIN SOLUTION
percentile_10_hat = np.percentile(boston_df["medv"], q=10)
print(f"10th percentile: {percentile_10_hat}")
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(percentile_10_hat, int | float), "percentile_10_hat should be numeric"
assert 0 < percentile_10_hat < median_hat, "10th percentile should be less than the median"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(percentile_10_hat - 12.75) < 0.1, f"Expected 12.75, got {percentile_10_hat}"
# END HIDDEN TESTS

**(h)** Use the bootstrap to estimate the standard error of $\hat{p}_{0.1}$ and store it in a variable called `bootstrap_se_percentile`. Comment on your findings.

In [None]:
# BEGIN SOLUTION
def get_tenth_percentile(data):
    return np.percentile(data, q=10)


bootstrap_se_percentile = boot_se(
    get_tenth_percentile, boston_df, "medv", num_bootstrap=10000, seed=2024
)
print(f"Bootstrap SE of 10th percentile: {bootstrap_se_percentile:.4f}")
print("This suggests strong precision in estimating the tenth percentile.")
# END SOLUTION

> BEGIN SOLUTION

The bootstrap standard error of the 10th percentile (~0.50) is somewhat larger than the SE of the median (~0.38). This is plausible: the precision of a sample quantile depends on how densely the data are packed around it, and observations are typically sparser in the tails than near the center of the distribution.

> END SOLUTION

In [None]:
# Test assertions
assert isinstance(bootstrap_se_percentile, float), "bootstrap_se_percentile should be a float"
assert bootstrap_se_percentile > 0, "Standard error must be positive"
assert bootstrap_se_percentile < 2, "Bootstrap SE should be reasonable for this dataset"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(bootstrap_se_percentile - 0.50) < 0.1, f"Expected ~0.50, got {bootstrap_se_percentile}"
# END HIDDEN TESTS

---

**Problem 6:** LOOCV for Logistic Regression

In this problem, you will practice computing the LOOCV error for a logistic regression model on the Weekly data set.

**(a)** Fit a logistic regression model that predicts `Direction` using `Lag1` and `Lag2`. Report the misclassification rate on the training data and comment on the result.

In [None]:
weekly_df = pd.read_csv("./data/weekly.csv")
# BEGIN SOLUTION
weekly_df["Coded_Direction"] = pd.get_dummies(weekly_df["Direction"], dtype=int, drop_first=True)

X = weekly_df[["Lag1", "Lag2"]]
y = weekly_df["Coded_Direction"]
model = LogisticRegression().fit(X, y)

misclassification_rate_full = 1 - accuracy_score(y, model.predict(X))
print(f"Misclassification rate: {misclassification_rate_full:.4f}")
print("This is relatively poor performance - almost half are incorrectly classified.")
# END SOLUTION

> BEGIN SOLUTION

The misclassification rate on the training data is approximately 44.5%, which is quite poor—barely better than random guessing. This suggests that `Lag1` and `Lag2` alone are not strong predictors of market direction.
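
An error rate like this is easier to judge against a baseline that ignores the predictors entirely. The sketch below computes the error of always predicting the most frequent class. The helper name `majority_class_error` and the toy labels are illustrative assumptions, not part of the assignment.

```python
import numpy as np


def majority_class_error(labels):
    """Error rate of always predicting the most frequent label."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    return 1 - counts.max() / counts.sum()


demo_labels = np.array([1, 1, 1, 0, 0])  # hypothetical labels, not the Weekly data
print(f"Baseline error: {majority_class_error(demo_labels):.2f}")
```

In the notebook, calling `majority_class_error(y)` on the coded direction would give the figure to compare against the training error above.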

> END SOLUTION

In [None]:
# Test assertions
assert isinstance(
    misclassification_rate_full, float
), "misclassification_rate_full should be a float"
assert 0 < misclassification_rate_full < 1, "Misclassification rate must be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(misclassification_rate_full - 0.445) < 0.01
), f"Expected ~0.445, got {misclassification_rate_full}"
# END HIDDEN TESTS

**(b)** Fit a logistic regression model that predicts `Direction` using `Lag1` and `Lag2` using all but the first observation. Report the misclassification rate on the full data and comment on the result.

In [None]:
# BEGIN SOLUTION
X_without_first = X[1:]
y_without_first = y[1:]

model_without_first = LogisticRegression().fit(X_without_first, y_without_first)
misclassification_rate_without_first = 1 - accuracy_score(y, model_without_first.predict(X))
print(f"Misclassification rate: {misclassification_rate_without_first:.4f}")
print("Very similar to (a) - removing one observation has minimal effect.")
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(
    misclassification_rate_without_first, float
), "misclassification_rate_without_first should be a float"
assert (
    0 < misclassification_rate_without_first < 1
), "Misclassification rate must be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(misclassification_rate_without_first - 0.444) < 0.01
), f"Expected ~0.444, got {misclassification_rate_without_first}"
diff = abs(misclassification_rate_without_first - misclassification_rate_full)
assert diff < 0.01, "Rates should be similar"
# END HIDDEN TESTS

**(c)** Use the model from (b) to predict the direction of the first observation. Was this observation correctly classified?

In [None]:
# BEGIN SOLUTION
prediction = model_without_first.predict(X.iloc[[0]])[0]
first_observation_correct = prediction == y.iloc[0]
print(f"Prediction: {prediction}, Actual: {y.iloc[0]}")
print(f"Correctly classified: {first_observation_correct}")
# END SOLUTION

In [None]:
# Test assertions
assert not first_observation_correct, "First observation should be incorrectly classified"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert isinstance(first_observation_correct, bool | np.bool_), "Result should be a boolean"
assert prediction == 1, "Model should predict 'Up' (1) for the first observation"
# END HIDDEN TESTS

**(d)** Write a for loop from `i = 0` to `i = n-1`, where `n` is the number of observations in the data set, that performs each of the following steps:

i. Fit a logistic regression model using all but the `i`th observation to predict `Direction` using `Lag1` and `Lag2`.

ii. Use this model to predict the direction for the `i`th observation.

iii. Determine whether or not an error was made in predicting the direction for the `i`th observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

In [None]:
# BEGIN SOLUTION
loocv_errors = []
for i in range(X.shape[0]):
    X_train = X.drop(i)
    y_train = y.drop(i)

    model_loocv = LogisticRegression().fit(X_train, y_train)

    pred = model_loocv.predict(X.iloc[[i]])[0]
    prediction_error = 1 if pred != y.iloc[i] else 0
    loocv_errors.append(prediction_error)

print(f"First 5 errors: {loocv_errors[:5]}")
# END SOLUTION

In [None]:
# Test assertions
assert len(loocv_errors) == len(y), f"Expected {len(y)} errors, got {len(loocv_errors)}"
assert all(e in [0, 1] for e in loocv_errors), "Errors should be 0 or 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 400 < sum(loocv_errors) < 600, f"Expected ~490 total errors, got {sum(loocv_errors)}"
assert loocv_errors[0] == 1, "First observation should be misclassified (consistent with part c)"
# END HIDDEN TESTS

**(e)** Take the average of the `n` numbers obtained in part (d)(iii) of this problem in order to obtain the LOOCV estimate for the test error. Comment on the results.

In [None]:
# BEGIN SOLUTION
loocv_error_rate = sum(loocv_errors) / len(loocv_errors)
print(f"LOOCV error rate: {loocv_error_rate:.4f}")
print("About 45% error rate - Lag1 and Lag2 are not effective predictors.")
# END SOLUTION

> BEGIN SOLUTION

The LOOCV error rate is approximately 45%, very similar to the training error rate from (a). This confirms that logistic regression with `Lag1` and `Lag2` has limited predictive ability for market direction.
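
The explicit loop in part (d) can also be expressed with scikit-learn's built-in splitter. The following sketch, on hypothetical synthetic data and not the Weekly data, checks that `cross_val_score` with `cv=LeaveOneOut()` gives the same error rate as a manual leave-one-out loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical synthetic data: two features, binary response.
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(80, 2))
y_demo = (X_demo[:, 0] + rng_demo.normal(size=80) > 0).astype(int)

# Built-in LOOCV: each "fold" holds out one observation, so each accuracy is 0 or 1.
accuracies = cross_val_score(
    LogisticRegression(), X_demo, y_demo, cv=LeaveOneOut(), scoring="accuracy"
)
loocv_error_demo = 1 - accuracies.mean()

# Manual LOOCV, mirroring the loop in part (d).
manual_errors = []
for i in range(len(y_demo)):
    keep = np.arange(len(y_demo)) != i
    fitted = LogisticRegression().fit(X_demo[keep], y_demo[keep])
    manual_errors.append(int(fitted.predict(X_demo[i : i + 1])[0] != y_demo[i]))

print(f"cross_val_score LOOCV error: {loocv_error_demo:.4f}")
print(f"Manual loop LOOCV error:     {np.mean(manual_errors):.4f}")
```

Writing the loop by hand, as the problem asks, makes the mechanics explicit; the built-in version is shorter and less error-prone once the idea is clear.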

> END SOLUTION

In [None]:
# Test assertions
assert isinstance(loocv_error_rate, float), "loocv_error_rate should be a float"
assert 0 < loocv_error_rate < 1, "Error rate must be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(loocv_error_rate - 0.45) < 0.01, f"Expected ~0.45, got {loocv_error_rate}"
assert (
    abs(loocv_error_rate - sum(loocv_errors) / len(loocv_errors)) < 1e-10
), "Error rate should be derived from loocv_errors"
# END HIDDEN TESTS