# DATASCI 503, Homework 5: Resampling Methods

Resampling methods are techniques that repeatedly draw samples from a training set and refit a model to each sample to obtain additional information about the fitted model. In this assignment, we explore two key resampling methods: the **bootstrap** (for estimating the variability of an estimator) and **cross-validation** (for estimating test error and selecting models). These methods connect directly to the bias-variance tradeoff: cross-validation helps us choose models that balance underfitting and overfitting, while the bootstrap helps us quantify uncertainty in our estimates.

In [None]:
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import KFold

warnings.filterwarnings("ignore")

### Problem 1: Bootstrap Sampling Probability

We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of $n$ observations.

**(a)** What is the probability that the first bootstrap observation is *not* the $j$th observation from the original sample? Justify your answer.

> BEGIN SOLUTION

*Solutions in this assignment are adapted from work by John Elmer Loretizo.*

Each bootstrap draw selects one of the $n$ original observations uniformly at random, so the probability that the first draw is the $j$th observation is $1/n$. Taking the complement, $P(\text{first draw is not the }j\text{th observation})=1-1/n$.

> END SOLUTION

**(b)** What is the probability that the second bootstrap observation is *not* the $j$th observation from the original sample?

> BEGIN SOLUTION

Since bootstrap sampling is performed with replacement, the second draw is again uniform over all $n$ observations, regardless of the outcome of the first draw. Hence $P(\text{second draw is the }j\text{th observation})=1/n$ and $P(\text{second draw is not the }j\text{th observation})=1-1/n$.
> END SOLUTION


**(c)** Argue that the probability that the $j$th observation is *not* in the bootstrap sample is $(1 - \frac{1}{n})^n$.

> BEGIN SOLUTION

Parts (a) and (b) show that each draw misses the $j$th observation with probability $1 - \frac{1}{n}$, whatever the outcomes of the other draws. Because the bootstrap samples with replacement, the $n$ draws are independent, and the $j$th observation is absent from the bootstrap sample only if all $n$ draws miss it. Multiplying the $n$ per-draw probabilities gives $(1 - \frac{1}{n})^n$.
> END SOLUTION
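
As a sanity check on the argument in (c), the formula can be compared against a small simulation. The sketch below is only illustrative: the helper name `prob_excluded_mc`, the choice $n = 10$, and the number of trials are assumptions made for this example, not part of the assignment.

```python
import numpy as np


def prob_excluded_mc(n, j=0, num_trials=20_000, seed=0):
    """Monte Carlo estimate of P(observation j is absent from a bootstrap sample)."""
    rng = np.random.default_rng(seed)
    # Each row holds the n indices of one bootstrap sample, drawn with replacement.
    draws = rng.integers(0, n, size=(num_trials, n))
    # A trial counts as "excluded" when index j never appears in its row.
    return np.mean(~(draws == j).any(axis=1))


n_demo = 10  # illustrative sample size
exact = (1 - 1 / n_demo) ** n_demo
estimate = prob_excluded_mc(n_demo)
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

The simulated frequency should agree with $(1 - \frac{1}{n})^n$ up to Monte Carlo noise, which shrinks as `num_trials` grows.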


**(d)** When $n = 5$, what is the probability that the $j$th observation is in the bootstrap sample?

In [None]:
# BEGIN SOLUTION
n = 5
prob_in_sample_n5 = 1 - (1 - (1 / n)) ** n
prob_in_sample_n5
# END SOLUTION

In [None]:
# Test assertions
assert abs(prob_in_sample_n5 - 0.6723) < 0.001, f"Expected ~0.6723, got {prob_in_sample_n5}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.67 < prob_in_sample_n5 < 0.68, "Probability should be between 0.67 and 0.68"
assert abs(prob_in_sample_n5 - (1 - (1 - 1 / 5) ** 5)) < 1e-12, "Formula not applied correctly"
# END HIDDEN TESTS

> BEGIN SOLUTION

We use the probability $P(\text{observing the }j\text{th observation in the bootstrap sample})=1-((1 - \frac{1}{n})^n)$, which gives us a 67.232% chance that the $j$th observation is in the bootstrap sample.
> END SOLUTION


**(e)** When $n = 100$, what is the probability that the $j$th observation is in the bootstrap sample?

In [None]:
# BEGIN SOLUTION
n = 100
prob_in_sample_n100 = 1 - (1 - (1 / n)) ** n
prob_in_sample_n100
# END SOLUTION

In [None]:
# Test assertions
assert abs(prob_in_sample_n100 - 0.634) < 0.01, f"Expected ~0.634, got {prob_in_sample_n100}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.63 < prob_in_sample_n100 < 0.64, "Probability should be between 0.63 and 0.64"
# END HIDDEN TESTS

> BEGIN SOLUTION

Following the same formula above, we have a 63.397% chance that the $j$th observation is in the bootstrap sample.
> END SOLUTION


**(f)** When $n = 10{,}000$, what is the probability that the $j$th observation is in the bootstrap sample?

In [None]:
# BEGIN SOLUTION
n = 10000
prob_in_sample_n10000 = 1 - (1 - (1 / n)) ** n
prob_in_sample_n10000
# END SOLUTION

In [None]:
# Test assertions
assert abs(prob_in_sample_n10000 - 0.632) < 0.01, f"Expected ~0.632, got {prob_in_sample_n10000}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.63 < prob_in_sample_n10000 < 0.64, "Probability should converge to 1 - 1/e"
# END HIDDEN TESTS

> BEGIN SOLUTION

Following the same formula above, we have a 63.214% chance that the $j$th observation is in the bootstrap sample. Note that as $n \to \infty$, this probability converges to $1 - 1/e \approx 0.632$.
> END SOLUTION
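
The convergence noted in (f) can also be checked numerically. The following is a minimal sketch; the particular values of $n$ and the names `limit` and `probs` are chosen only for illustration.

```python
import math

limit = 1 - 1 / math.e  # limiting inclusion probability as n grows

# Inclusion probability 1 - (1 - 1/n)^n for a few sample sizes.
probs = {n: 1 - (1 - 1 / n) ** n for n in (5, 100, 10_000, 1_000_000)}

for n, prob in probs.items():
    print(f"n = {n:>9,}: P(in sample) = {prob:.5f}, gap to limit = {prob - limit:.2e}")
```

The gap to $1 - 1/e$ is positive for every $n$ and shrinks as $n$ increases, so the inclusion probability approaches the limit from above.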


### Problem 2: Estimating Standard Deviation of Predictions

Suppose that we use some statistical learning method to make a prediction for the response $Y$ for a particular value of the predictor $X$. Carefully describe how we might estimate the standard deviation of our prediction.

> BEGIN SOLUTION

We can use the bootstrap. Repeatedly draw $n$ observations with replacement from the original data set, refit the model on each bootstrap sample, and predict the response $Y$ at the particular value of the predictor $X$. Repeating this for $B$ bootstrap samples yields $B$ predictions whose spread approximates the sampling distribution of the prediction. The sample standard deviation of these $B$ predictions is our estimate of the standard deviation of the prediction.
> END SOLUTION
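
The procedure described above might be sketched as follows. This is one possible implementation under stated assumptions: the learning method is taken to be `LinearRegression`, the data are simulated, and the helper name `bootstrap_prediction_sd` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def bootstrap_prediction_sd(X, y, x0, num_bootstrap=500, seed=0):
    """Bootstrap estimate of the standard deviation of the prediction at x0."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.empty(num_bootstrap)
    for b in range(num_bootstrap):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        model = LinearRegression().fit(X[idx], y[idx])
        preds[b] = model.predict(x0.reshape(1, -1))[0]
    # Sample standard deviation of the bootstrap predictions.
    return preds.std(ddof=1)


# Simulated data (an assumption for this example): y = 2x + noise.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(100, 1))
y_demo = 2 * X_demo[:, 0] + demo_rng.normal(size=100)

sd_at_zero = bootstrap_prediction_sd(X_demo, y_demo, x0=np.array([0.0]))
print(f"Estimated SD of the prediction at x = 0: {sd_at_zero:.3f}")
```

Any other estimator with `fit` and `predict` methods could be substituted for `LinearRegression` without changing the resampling loop.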


### Problem 3: Cross-Validation on Simulated Data

We will now perform cross-validation on a simulated data set.

**(a)** Generate a simulated data set as follows:

```python
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)
```
    
In this data set, what is $n$ and what is $p$? Write out the model used to generate the data in equation form.

In [None]:
# BEGIN SOLUTION
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)
# END SOLUTION

> BEGIN SOLUTION

We have $n=100$ and $p=1$. The data are generated from the model $y = x - 2x^2 + \epsilon$, where $\epsilon \sim N(0, 1)$.
> END SOLUTION


**(b)** Create a scatterplot of $X$ against $Y$. Comment on what you find.

In [None]:
# BEGIN SOLUTION
sns.scatterplot(x=x, y=y)
# END SOLUTION

> BEGIN SOLUTION

We note a nonlinear relationship between $X$ and $Y$: the points trace out an inverted parabola, suggesting a quadratic relationship. This is consistent with the data-generating process, which includes a quadratic term.
> END SOLUTION


**(c)** Construct 5 folds using `sklearn.model_selection.KFold` with `shuffle=True` and `random_state=3`. Using these folds, compute the cross-validation errors that result from fitting the following four models using least squares. Use the same five folds for all four models.

i. $Y = \beta_0 + \beta_1 X + \epsilon$

ii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$

iii. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$

iv. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \epsilon$

For each fold and each model, report the mean-squared error. Which of the models had the smallest average error? Is this what you expected? Explain your answer.

In [None]:
# BEGIN SOLUTION
# Prepare polynomial features
X = x.reshape(-1, 1)
X_poly = np.hstack([X**i for i in range(1, 5)])
# END SOLUTION

In [None]:
# BEGIN SOLUTION
# Perform 5-fold cross-validation for each model
kf = KFold(n_splits=5, shuffle=True, random_state=3)
errors = np.zeros((4, 5))

for fold, (train_index, test_index) in enumerate(kf.split(X)):
    # Only the response is split here; the features are sliced per model below.
    y_train, y_test = y[train_index], y[test_index]

    for i in range(4):
        X_train_poly = X_poly[train_index, : i + 1]
        X_test_poly = X_poly[test_index, : i + 1]

        model = LinearRegression().fit(X_train_poly, y_train)
        y_pred = model.predict(X_test_poly)
        errors[i, fold] = mean_squared_error(y_test, y_pred)
# END SOLUTION

In [None]:
# BEGIN SOLUTION
# Create a DataFrame to display the results
errors_df = pd.DataFrame(errors)
errors_df.columns = ["Fold " + str(col + 1) for col in errors_df.columns]
errors_df["Mean"] = errors_df.mean(axis=1)
errors_df.index = errors_df.index + 1
errors_df
# END SOLUTION

In [None]:
# Test assertions
assert errors_df.shape == (4, 6), f"Expected shape (4, 6), got {errors_df.shape}"
assert errors_df["Mean"].idxmin() == 2, "Model 2 should have the lowest mean error"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert errors_df.loc[1, "Mean"] > 5, "Linear model should have high error for quadratic data"
assert errors_df.loc[2, "Mean"] < 2, "Quadratic model should have low error"
# END HIDDEN TESTS

> BEGIN SOLUTION

Model 2 has the smallest average error. This is expected, since the data-generating process has exactly the form of model 2. Model 1 underfits because it imposes a linear relationship on data that are clearly nonlinear, as seen in the scatterplot above. Models 3 and 4 are more flexible, but the extra terms can fit noise rather than signal, so they may overfit and predict worse out of sample.
> END SOLUTION
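
For comparison, the same kind of experiment can be written more compactly with scikit-learn's `cross_val_score` and a polynomial pipeline. This is an alternative sketch rather than the solution's approach; the `_demo` variable names are assumptions, and the data are regenerated inside the block so that it is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the simulated data from part (a).
demo_rng = np.random.default_rng(1)
x_demo = demo_rng.normal(size=100)
y_demo = x_demo - 2 * x_demo**2 + demo_rng.normal(size=100)

# A fixed random_state means every model is scored on the same five folds.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=3)

mean_mse = {}
for degree in range(1, 5):
    pipeline = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    scores = cross_val_score(
        pipeline, x_demo.reshape(-1, 1), y_demo,
        cv=kf_demo, scoring="neg_mean_squared_error",
    )
    mean_mse[degree] = -scores.mean()  # flip the sign to recover the MSE

for degree, mse in mean_mse.items():
    print(f"degree {degree}: mean CV MSE = {mse:.3f}")
```

scikit-learn reports negated MSE so that larger scores are always better, which is why the sign is flipped before averaging.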


### Problem 4: LOOCV and Random Seeds

Consider estimating log odds using logistic regression with ridge penalties, setting the regularization strength via LOOCV. True or false: the result will depend on a random seed that we use as part of the LOOCV process. Explain your answer.

> BEGIN SOLUTION

False. LOOCV trains on $n-1$ observations and validates on the single held-out observation, repeating this once for each of the $n$ observations. The splits are therefore fully determined by the data: iteration $i$ always holds out the $i$th observation, and no random partitioning is involved. Consequently, the result does not depend on a random seed.
> END SOLUTION
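
The determinism of the LOOCV splits can be seen directly in scikit-learn, where `LeaveOneOut` takes no `random_state` argument at all. The sketch below uses a toy array of six observations (an assumption for illustration) and contrasts it with shuffled `KFold`, whose folds do depend on the seed.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(6).reshape(-1, 1)  # six toy observations


def held_out_indices(splitter, X):
    """List the validation indices of each split produced by splitter."""
    return [test.tolist() for _, test in splitter.split(X)]


# LeaveOneOut produces the same splits every time it is run.
loo_first = held_out_indices(LeaveOneOut(), X_toy)
loo_second = held_out_indices(LeaveOneOut(), X_toy)
print("LOOCV:", loo_first)

# Shuffled K-fold splits are governed by random_state.
for seed in (0, 1):
    splitter = KFold(n_splits=3, shuffle=True, random_state=seed)
    print(f"KFold, seed {seed}:", held_out_indices(splitter, X_toy))
```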


### Problem 5: Boston Housing Data Analysis

We will now consider the Boston housing data set.

**(a)** Based on this data set, provide an estimate for the population mean of `medv`. Call this estimate $\hat{\mu}$.

In [None]:
# BEGIN SOLUTION
boston_df = pd.read_csv("./data/boston.csv")
mu_hat = boston_df["medv"].mean()
mu_hat
# END SOLUTION

In [None]:
# Test assertions
assert abs(mu_hat - 22.533) < 0.01, f"Expected ~22.533, got {mu_hat}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 22 < mu_hat < 23, "Mean should be around 22.5"
# END HIDDEN TESTS

> BEGIN SOLUTION

We have the mean of `medv` at $\hat{\mu} = 22.5328$.
> END SOLUTION


**(b)** Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result.

**Hint:** We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

In [None]:
# BEGIN SOLUTION
se_mu_hat = boston_df["medv"].std() / np.sqrt(boston_df.shape[0])
se_mu_hat
# END SOLUTION

In [None]:
# Test assertions
assert abs(se_mu_hat - 0.409) < 0.01, f"Expected ~0.409, got {se_mu_hat}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.4 < se_mu_hat < 0.42, "Standard error should be around 0.41"
# END HIDDEN TESTS

> BEGIN SOLUTION

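The estimated standard error of $\hat{\mu}$ is approximately 0.4089. This means that if we repeatedly drew samples of this size from the population, the sample mean of `medv` would typically deviate from the population mean by about 0.41.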
> END SOLUTION


**(c)** Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from (b)?

In [None]:
# BEGIN SOLUTION
def get_mean(data):
    return data.mean()


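def boot_se(func, df, colname, n=None, num_bootstrap=1000, seed=0):
    """Bootstrap estimate of the standard error of func applied to df[colname]."""
    rng = np.random.default_rng(seed)
    # Accumulate running sums of the statistic and its square across resamples.
    first_moment, second_moment = 0, 0
    n = n or df.shape[0]  # default: resample as many rows as the original data
    for _ in range(num_bootstrap):
        bootstrap_sample = rng.choice(df[colname], size=n, replace=True)
        value = func(bootstrap_sample)
        first_moment += value
        second_moment += value**2
    # Standard deviation of the bootstrap statistics via E[v^2] - (E[v])^2.
    return np.sqrt(second_moment / num_bootstrap - (first_moment / num_bootstrap) ** 2)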


# END SOLUTION

In [None]:
# BEGIN SOLUTION
bootstrap_sem = boot_se(get_mean, boston_df, "medv", num_bootstrap=10000, seed=2024)
bootstrap_sem
# END SOLUTION

In [None]:
# Test assertions
assert abs(bootstrap_sem - 0.408) < 0.02, f"Expected ~0.408, got {bootstrap_sem}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.38 < bootstrap_sem < 0.43, "Bootstrap SE should be close to analytic SE"
# END HIDDEN TESTS

> BEGIN SOLUTION

The bootstrap estimate of the standard error of $\hat{\mu}$ is approximately 0.408, which is very close to the answer from (b). This suggests that the bootstrap gives a reliable estimate of the standard error for this data set.
> END SOLUTION


**(d)** Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of `medv`. Compare it to the results obtained by using `boston_df['medv'].std()` and the two standard error rule.

**Hint:** You can approximate a 95% confidence interval using the formula $[\hat{\mu} - 2\text{SE}(\hat{\mu}), \hat{\mu} + 2\text{SE}(\hat{\mu})]$.

In [None]:
# BEGIN SOLUTION
confidence_interval_bootstrap = (mu_hat - 2 * bootstrap_sem, mu_hat + 2 * bootstrap_sem)
confidence_interval_bootstrap
# END SOLUTION

In [None]:
# BEGIN SOLUTION
confidence_interval_standard_error = (mu_hat - 2 * se_mu_hat, mu_hat + 2 * se_mu_hat)
confidence_interval_standard_error
# END SOLUTION

In [None]:
# Test assertions
ci_low, ci_high = confidence_interval_bootstrap
assert ci_low < mu_hat < ci_high, "Mean should be within CI"
ci_se_low, ci_se_high = confidence_interval_standard_error
assert abs(ci_low - ci_se_low) < 0.1, "Bootstrap and standard CI should be similar"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert ci_high - ci_low < 2, "CI width should be less than 2"
# END HIDDEN TESTS

> BEGIN SOLUTION

The two confidence intervals are very similar, because the bootstrap standard error from (c) and the formula-based standard error from (b) nearly coincide. This agreement suggests that the formula in (b) is appropriate for this data set.
> END SOLUTION
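
Another common way to build a bootstrap interval, different from the two-standard-error rule used in (d), is to read the interval endpoints directly from the quantiles of the bootstrap statistics (the percentile interval). The sketch below illustrates this on simulated right-skewed data; the helper name `bootstrap_percentile_ci`, its parameters, and the gamma-distributed sample are all assumptions made for this example and are not the Boston data.

```python
import numpy as np


def bootstrap_percentile_ci(values, stat=np.mean, num_bootstrap=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = values.shape[0]
    # Recompute the statistic on each resample drawn with replacement.
    stats = np.array(
        [stat(rng.choice(values, size=n, replace=True)) for _ in range(num_bootstrap)]
    )
    alpha = 1 - level
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return low, high


# Simulated stand-in data (not the Boston data set).
demo_values = np.random.default_rng(0).gamma(shape=5.0, scale=4.0, size=500)
ci_low_demo, ci_high_demo = bootstrap_percentile_ci(demo_values)
print(f"95% percentile interval for the mean: ({ci_low_demo:.2f}, {ci_high_demo:.2f})")
```

Passing `stat=np.median` would give an interval for the median, for which no simple standard error formula is available.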


**(e)** Based on this data set, provide an estimate, $\hat{m}$, for the median value of `medv` in the population.

In [None]:
# BEGIN SOLUTION
median_hat = boston_df["medv"].median()
median_hat
# END SOLUTION

In [None]:
# Test assertions
assert abs(median_hat - 21.2) < 0.01, f"Expected 21.2, got {median_hat}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 21 < median_hat < 22, "Median should be around 21.2"
# END HIDDEN TESTS

> BEGIN SOLUTION

The median of `medv` is 21.2.
> END SOLUTION


**(f)** We now would like to estimate the standard error of $\hat{m}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

In [None]:
# BEGIN SOLUTION
def get_median(data):
    return np.median(data)


bootstrap_se_median = boot_se(get_median, boston_df, "medv", num_bootstrap=10000, seed=2024)
bootstrap_se_median
# END SOLUTION

In [None]:
# Test assertions
assert abs(bootstrap_se_median - 0.379) < 0.05, f"Expected ~0.379, got {bootstrap_se_median}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.3 < bootstrap_se_median < 0.5, "Bootstrap SE of median should be reasonable"
# END HIDDEN TESTS

> BEGIN SOLUTION

The standard error of $\hat{m}$ is approximately 0.379, which relative to the median of 21.2, is small. This suggests a high degree of precision when estimating the median, and repeated sampling from the same population is likely to result in very similar estimates of the median.
> END SOLUTION


**(g)** Based on this data set, provide an estimate for the tenth percentile of `medv` in Boston census tracts. Call this quantity $\hat{p}_{0.1}$.

**Hint:** You can use `np.percentile()`.

In [None]:
# BEGIN SOLUTION
percentile_10_hat = np.percentile(boston_df["medv"], q=10)
percentile_10_hat
# END SOLUTION

In [None]:
# Test assertions
assert abs(percentile_10_hat - 12.75) < 0.1, f"Expected 12.75, got {percentile_10_hat}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 12 < percentile_10_hat < 14, "10th percentile should be around 12.75"
# END HIDDEN TESTS

> BEGIN SOLUTION

The tenth percentile is $\hat{p}_{0.1} = 12.75$.
> END SOLUTION


**(h)** Use the bootstrap to estimate the standard error of $\hat{p}_{0.1}$. Comment on your findings.

In [None]:
# BEGIN SOLUTION
def get_tenth_percentile(data):
    return np.percentile(data, q=10)


bootstrap_se_percentile = boot_se(
    get_tenth_percentile, boston_df, "medv", num_bootstrap=10000, seed=2024
)
bootstrap_se_percentile
# END SOLUTION

In [None]:
# Test assertions
assert abs(bootstrap_se_percentile - 0.50) < 0.1, f"Expected ~0.50, got {bootstrap_se_percentile}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.4 < bootstrap_se_percentile < 0.6, "Bootstrap SE of 10th percentile should be reasonable"
# END HIDDEN TESTS

> BEGIN SOLUTION

The standard error for $\hat{p}_{0.1}$ is approximately 0.50, suggesting strong precision in the estimation of the tenth percentile from the given dataset.
> END SOLUTION


### Problem 6: LOOCV for Logistic Regression

In this problem, you will practice computing the LOOCV error for a logistic regression model on the Weekly data set.

**(a)** Fit a logistic regression model that predicts `Direction` using `Lag1` and `Lag2`. Report and comment on the result.

In [None]:
# BEGIN SOLUTION
weekly_df = pd.read_csv("./data/weekly.csv")
weekly_df["Coded_Direction"] = pd.get_dummies(weekly_df["Direction"], dtype=int, drop_first=True)

X = weekly_df[["Lag1", "Lag2"]]
y = weekly_df["Coded_Direction"]
model = LogisticRegression().fit(X, y)

misclassification_rate_full = 1 - accuracy_score(y, model.predict(X))
misclassification_rate_full
# END SOLUTION

In [None]:
# Test assertions
expected_rate = 0.445
actual_rate = misclassification_rate_full
assert abs(actual_rate - expected_rate) < 0.01, f"Expected ~0.445, got {actual_rate}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.4 < misclassification_rate_full < 0.5, "Misclassification rate should be around 44%"
# END HIDDEN TESTS

> BEGIN SOLUTION

The training misclassification rate is 44.54%. This suggests relatively poor performance, since almost half of the observations are incorrectly classified even though the model was fit to these same observations.

> END SOLUTION

**(b)** Fit a logistic regression model that predicts `Direction` using `Lag1` and `Lag2` using all but the first observation. Report and comment on the result.

In [None]:
# BEGIN SOLUTION
X_without_first = X[1:]
y_without_first = y[1:]

model_without_first = LogisticRegression().fit(X_without_first, y_without_first)
misclassification_rate_without_first = 1 - accuracy_score(y, model_without_first.predict(X))
misclassification_rate_without_first
# END SOLUTION

In [None]:
# Test assertions
expected_rate = 0.444
actual_rate = misclassification_rate_without_first
assert abs(actual_rate - expected_rate) < 0.01, f"Expected ~0.444, got {actual_rate}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
diff = abs(misclassification_rate_without_first - misclassification_rate_full)
assert diff < 0.01, "Rates should be similar"
# END HIDDEN TESTS

> BEGIN SOLUTION

Similarly, the misclassification rate is at 44.36%, suggesting a very small improvement (most likely by chance) by removing the first observation.
> END SOLUTION


**(c)** Use the model from (b) to predict the direction of the first observation. Was this observation correctly classified?

In [None]:
# BEGIN SOLUTION
prediction = model_without_first.predict(X.iloc[0, :].values.reshape(1, -1))[-1]
first_observation_correct = prediction == y.iloc[0]
first_observation_correct
# END SOLUTION

In [None]:
# Test assertions
assert not first_observation_correct, "First observation should be incorrectly classified"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert not first_observation_correct, "First observation should be misclassified"
# END HIDDEN TESTS

> BEGIN SOLUTION

The observation was incorrectly classified.
> END SOLUTION


**(d)** Write a for loop from `i = 0` to `i = n-1`, where `n` is the number of observations in the data set, that performs each of the following steps:

i. Fit a logistic regression model using all but the `i`th observation to predict `Direction` using `Lag1` and `Lag2`.

ii. Use this model to predict the direction for the `i`th observation.

iii. Determine whether or not an error was made in predicting the direction for the `i`th observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

In [None]:
# BEGIN SOLUTION
loocv_errors = []
for i in range(X.shape[0]):
    X_train = X.drop(i)
    y_train = y.drop(i)

    model_loocv = LogisticRegression().fit(X_train, y_train)

    pred = model_loocv.predict(X.iloc[i, :].values.reshape(1, -1))[-1]
    prediction_error = 1 if pred != y.iloc[i] else 0
    loocv_errors.append(prediction_error)

loocv_errors[:5]
# END SOLUTION

In [None]:
# Test assertions
assert len(loocv_errors) == len(y), f"Expected {len(y)} errors, got {len(loocv_errors)}"
assert all(e in [0, 1] for e in loocv_errors), "Errors should be 0 or 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert sum(loocv_errors) > 400, "There should be many misclassifications"
# END HIDDEN TESTS

**(e)** Take the average of the `n` numbers obtained in part (d)(iii) of this problem in order to obtain the LOOCV estimate for the test error. Comment on the results.

In [None]:
# BEGIN SOLUTION
loocv_error_rate = sum(loocv_errors) / len(loocv_errors)
loocv_error_rate
# END SOLUTION

In [None]:
# Test assertions
assert abs(loocv_error_rate - 0.45) < 0.01, f"Expected ~0.45, got {loocv_error_rate}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.44 < loocv_error_rate < 0.46, "LOOCV error rate should be around 45%"
# END HIDDEN TESTS

> BEGIN SOLUTION

The LOOCV estimate of the test error rate is approximately 45%, suggesting that `Lag1` and `Lag2` are not effective predictors of the direction of the current week's returns.
> END SOLUTION
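
The explicit loop in (d) and the average in (e) can also be expressed with scikit-learn's `LeaveOneOut` splitter and `cross_val_score`. The following is a minimal sketch on simulated data, since it does not assume access to the Weekly file; the `_demo` names and the data-generating rule are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Simulated two-feature binary classification data (not the Weekly data set).
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(60, 2))
y_demo = (X_demo[:, 0] + demo_rng.normal(size=60) > 0).astype(int)

# Each split holds out one observation, so each accuracy is either 0 or 1.
accuracies = cross_val_score(
    LogisticRegression(), X_demo, y_demo, cv=LeaveOneOut(), scoring="accuracy"
)

# The LOOCV error rate is the fraction of held-out observations misclassified.
loocv_error_demo = 1 - accuracies.mean()
print(f"LOOCV error rate on the simulated data: {loocv_error_demo:.3f}")
```

Applied to the `Lag1` and `Lag2` features, the same call should reproduce the loop-based estimate, because both approaches fit the same $n$ models on the same splits.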
