# DATASCI 503, Group Work 5: Cross Validation

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. During lab, feel free to flag down your GSI to ask questions at any point!

## Getting Started

In this assignment, we will explore logistic regression and $k$-fold cross-validation. In $k$-fold cross-validation, we partition a dataset $S$ into $k$ non-overlapping subsets (folds) $S_1, \ldots, S_k$ of equal, or nearly equal, size. For each fold $S_i$, a model is trained on $S \setminus S_i$ and evaluated on $S_i$. The cross-validation estimate of the prediction error is the average of the prediction errors obtained on the $k$ folds.
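
Written out, the estimator is a plain average over folds. The symbol $\widehat{\text{err}}(S_i)$ is notation introduced here for illustration only: it stands for the prediction error measured on fold $S_i$ by the model trained on $S \setminus S_i$.

$$
\widehat{\text{Err}}_{\text{CV}} = \frac{1}{k} \sum_{i=1}^{k} \widehat{\text{err}}(S_i)
$$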

Let's start with a simple example that will help us understand how CV works. We'll first import the relevant packages.

In [None]:
import numpy as np
import pandas as pd
import sklearn.metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

For demonstration purposes, let's create a synthetic binary classification dataset using scikit-learn's `make_classification` function.

In [None]:
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

We'll look at two ways to do cross-validation. First we'll do it "by hand." Then we'll show how sklearn makes it easy.

## K-Fold CV by Hand

First, we divide the data into 10 parts (also known as folds).

In [None]:
n_splits = 10  # 10-fold
rng = np.random.default_rng(0)  # make a random number generator with fixed seed
permutation = rng.permutation(len(X))  # create a shuffling of the indices of our data
splits = np.split(permutation, n_splits)  # make folds

In [None]:
# e.g., these are the *indices* of the datapoints in "fold 3"
splits[3]

In [None]:
# and these are the y values for the corresponding samples
y[splits[3]]
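
One caveat about the fold construction above: `np.split` only works when the number of samples is divisible by the number of folds (here 1000 and 10). A minimal sketch of one way to handle the other case, using `np.array_split`; the sample size 1003 and the `_demo` names are made up for illustration.

```python
import numpy as np

n_samples_demo = 1003  # hypothetical sample size, not divisible by 10
n_splits_demo = 10

rng_demo = np.random.default_rng(0)
permutation_demo = rng_demo.permutation(n_samples_demo)

# np.split insists on equal-sized pieces and raises ValueError otherwise
try:
    np.split(permutation_demo, n_splits_demo)
    split_failed = False
except ValueError:
    split_failed = True

# np.array_split allows fold sizes that differ by at most one
folds_demo = np.array_split(permutation_demo, n_splits_demo)
fold_sizes = [len(fold) for fold in folds_demo]
print(split_failed, fold_sizes)
```

The resulting folds can be used in the same hold-one-out loop as `splits`.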

Now, we'll assess predictive performance 10 times. Each time we'll hold out a different fold and use the rest of the data for training.

In [None]:
# we'll look at two metrics of predictive performance,
# and store the results in these arrays
missclass = np.zeros(len(splits))
aurocs = np.zeros(len(splits))  # since this is binary classification, we can assess via auroc

# iterate through folds
for i in range(len(splits)):
    # create test/train split for this held-out fold
    folds_list = list(splits)  # copy list
    test = folds_list.pop(i)  # pop out the ith fold
    training = np.concatenate(folds_list)  # combine all remaining data for training

    # fit model
    model = LogisticRegression()  # make estimator
    model.fit(X[training], y[training])  # fit estimator using the training data

    # assess metric for predictive performance using test data
    missclass[i] = np.mean(model.predict(X[test]) != y[test])  # get misclassification on test data
    aurocs[i] = sklearn.metrics.roc_auc_score(
        y[test], model.predict_proba(X[test])[:, 1]
    )  # get auroc on test data

In [None]:
missclass  # misclassification rate for each of the 10 test/train splits

In [None]:
aurocs  # auroc for each of the 10 test/train splits

In [None]:
# report conclusions: mean(scores) +/- sqrt(tau^2/K)
print(
    f"Mean misclassification rate: {missclass.mean():.4f}, "
    f"plus minus {np.sqrt(missclass.var() / n_splits):.4f}"
)
print(f"Auroc: {aurocs.mean():.4f}, plus minus {np.sqrt(aurocs.var() / n_splits):.4f}")
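
The "plus minus" value above is a standard error of the mean across folds, $\sqrt{\tau^2 / K}$. It is worth knowing that NumPy's `.var()` defaults to `ddof=0`, while pandas' `.var()` defaults to `ddof=1`, so the NumPy-based and DataFrame-based cells in this notebook use slightly different variance estimates. The sketch below illustrates the difference on made-up fold scores; since fold scores are not independent, either number is best read as a rough indication of spread.

```python
import numpy as np
import pandas as pd

# Made-up per-fold scores, purely for illustration
fold_scores = np.array([0.80, 0.85, 0.90, 0.75, 0.95])
n_folds = len(fold_scores)

# NumPy default: population variance (ddof=0)
se_numpy_default = np.sqrt(fold_scores.var() / n_folds)
# Sample variance (ddof=1), requested explicitly
se_sample = np.sqrt(fold_scores.var(ddof=1) / n_folds)
# pandas default: sample variance (ddof=1)
se_pandas_default = np.sqrt(pd.Series(fold_scores).var() / n_folds)

print(f"{fold_scores.mean():.4f} +/- {se_numpy_default:.4f} (NumPy default, ddof=0)")
print(f"{fold_scores.mean():.4f} +/- {se_sample:.4f} (ddof=1)")
print(f"{fold_scores.mean():.4f} +/- {se_pandas_default:.4f} (pandas default)")
```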

## K-Fold CV Using sklearn.model_selection

In [None]:
model = LogisticRegression()

Set up the $k$-fold cross-validation configuration. For example, using 10 folds:

* `n_splits=10` means 10-fold
* `shuffle=True` means the data are shuffled before being split, giving random folds
* `random_state=42` means we'll get the same random folds each time

In [None]:
cv = KFold(n_splits=10, random_state=42, shuffle=True)

In [None]:
# look at folds generated by this cross-validation splitter
for i, (train, test) in enumerate(cv.split(X)):
    # this automatically gives you the train indices and test indices
    # without having to construct them yourselves by combining folds
    print(f"\nsplit {i}")
    print("   test:   [", " ".join(test[:10].astype("U")), "...]")  # test indices
    print("   train:  [", " ".join(train[:10].astype("U")), "...]")  # train indices
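
As a sanity check on what the splitter produces, the sketch below verifies on a small hypothetical dataset (25 dummy rows, 5 folds) that train and test indices never overlap within a split, and that the test folds together cover every index exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

n_demo = 25  # small hypothetical dataset size
cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_folds = []
for train_idx, test_idx in cv_demo.split(np.zeros((n_demo, 1))):
    # within a split, train and test indices never overlap
    assert np.intersect1d(train_idx, test_idx).size == 0
    test_folds.append(test_idx)

# across splits, the test folds cover every index exactly once
all_test = np.sort(np.concatenate(test_folds))
covers_all = np.array_equal(all_test, np.arange(n_demo))
print(covers_all)
```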

In fact, you don't even have to run the for loop yourself!

With a single call to `cross_val_score`, we can evaluate the model using cross-validation. Here, we'll use accuracy as the performance metric, but you can choose other metrics like precision, recall, etc.

In [None]:
# Calculate accuracies across folds
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
scores
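
To illustrate swapping the metric, the sketch below loops over a few of scikit-learn's built-in scoring names. It builds its own small synthetic dataset so that it runs independently of the cells above; the `_demo` names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_classes=2, random_state=0
)
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

metric_means = {}
for metric in ("accuracy", "precision", "recall", "roc_auc"):
    # the same folds are reused for every metric because cv_demo has a fixed seed
    fold_scores = cross_val_score(
        LogisticRegression(), X_demo, y_demo, scoring=metric, cv=cv_demo
    )
    metric_means[metric] = fold_scores.mean()
    print(f"{metric:>9}: {fold_scores.mean():.4f}")
```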

In [None]:
print(f"Mean accuracy: {scores.mean():.4f}, plus minus {np.sqrt(scores.var() / len(scores)):.4f}")
print(
    f"Mean misclassification rate: {1 - scores.mean():.4f}, "
    f"plus minus {np.sqrt(scores.var() / len(scores)):.4f}"
)

Note that the conclusions are not exactly the same as in the by-hand version, because this run of CV used different splits.

If you want to use multiple metrics, use `sklearn.model_selection.cross_validate` instead.

In [None]:
# info for each split
results = pd.DataFrame(
    sklearn.model_selection.cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=("accuracy", "roc_auc"),
        return_train_score=True,
    )
)
results

In [None]:
print(
    f"Auroc: {results.test_roc_auc.mean():.4f}, "
    f"plus minus {np.sqrt(results.test_roc_auc.var() / len(results)):.4f}"
)
print(
    f"Accuracy: {results.test_accuracy.mean():.4f}, "
    f"plus minus {np.sqrt(results.test_accuracy.var() / len(results)):.4f}"
)

Note that the results also include train performance (because we set `return_train_score=True`). It can sometimes be interesting to see the discrepancy between train and test performance. Usually train performance is a bit better. If train performance is *a lot* better, your estimator may have "too much flexibility" (though sometimes you may also be experiencing so-called "benign overfitting," in which case your estimator is actually just fine...).

In [None]:
print(
    f"Train auroc: {results.train_roc_auc.mean():.4f}, "
    f"plus minus {np.sqrt(results.train_roc_auc.var() / len(results)):.4f}"
)
print(
    f"Train accuracy: {results.train_accuracy.mean():.4f}, "
    f"plus minus {np.sqrt(results.train_accuracy.var() / len(results)):.4f}"
)
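
For a deliberately extreme example of a train/test gap, consider 1-nearest-neighbor: each training point is its own nearest neighbor, so train accuracy is perfect while test accuracy is not. The sketch below uses its own synthetic data and placeholder `_demo` names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_classes=2, random_state=0
)
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

gap_results = pd.DataFrame(
    cross_validate(
        KNeighborsClassifier(n_neighbors=1),
        X_demo,
        y_demo,
        cv=cv_demo,
        scoring="accuracy",
        return_train_score=True,
    )
)
# with a single scoring string, the columns are named train_score / test_score
train_mean = gap_results["train_score"].mean()
test_mean = gap_results["test_score"].mean()
print(f"train accuracy: {train_mean:.4f}, test accuracy: {test_mean:.4f}")
```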

## Compare with Another Estimator (Using the Same Splits)

In [None]:
model2 = KNeighborsClassifier(5)
scores2 = cross_val_score(
    model2, X, y, scoring="accuracy", cv=cv, n_jobs=-1
)  # note, using same folds, cv

print(
    f"Mean Accuracy: {scores2.mean():.4f}, plus minus {np.sqrt(scores2.var() / len(scores2)):.4f}"
)

It seems that 5-NN is worse than logistic regression, by a margin that is well in excess of the plus-minus spread reported above.
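
Because both estimators were scored on the same folds, another way to compare them is fold by fold. The sketch below is one possible approach, not the only one: it regenerates the same synthetic data, scores both models with a shared splitter, and summarizes the per-fold differences.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_classes=2, random_state=42
)
# a fixed random_state makes the splitter yield identical folds on every call
cv_demo = KFold(n_splits=10, shuffle=True, random_state=42)

scores_logreg = cross_val_score(
    LogisticRegression(), X_demo, y_demo, scoring="accuracy", cv=cv_demo
)
scores_knn = cross_val_score(
    KNeighborsClassifier(5), X_demo, y_demo, scoring="accuracy", cv=cv_demo
)

# paired differences: fold i of one model minus fold i of the other
diffs = scores_logreg - scores_knn
diff_se = np.sqrt(diffs.var() / len(diffs))
print(f"Mean accuracy difference: {diffs.mean():.4f}, plus minus {diff_se:.4f}")
```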

## Bias-Variance Tradeoff

It is very helpful to think about the bias-variance tradeoff in cross-validation. In CV, the number of folds to use (the value of $k$) is an important decision. Imagine repeating the learning procedure on multiple datasets. The lower the value for $k$, the higher the bias in the error estimates and the less variance **across datasets**. Conversely, when $k$ is set equal to the training+val sample size, the error estimate is then very low in bias but has the possibility of high variance **across datasets**.

Why? Here is some intuition (not a mathematically rigorous proof):

While there is no overlap between the test sets on which the models are evaluated, there is overlap between the training sets for all $k>2$. The overlap is largest for leave-one-out cross-validation. This means that the learned models are correlated, i.e., dependent, and the variance of the sum of correlated variables increases with the amount of covariance:

$$
\text{Var}\left(\sum_i X_i\right) = \sum_i \sum_j \text{Cov}(X_i, X_j)
$$

Therefore, leave-one-out cross-validation has large variance in comparison to CV with smaller $k$. To summarize, larger $k$ means less bias towards overestimating the true expected error (as training folds will be closer to the total dataset) but higher variance and higher running time (as you are getting closer to the limit case: Leave-One-Out CV).
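
The variance identity above can be checked numerically. The sketch below uses a hypothetical setup (three variables that share a common component, so they are positively correlated) and compares the variance of their sum with the sum of all entries of the sample covariance matrix.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
# three correlated variables: a shared component plus independent noise
shared = rng_demo.normal(size=(10_000, 1))
samples = shared + 0.5 * rng_demo.normal(size=(10_000, 3))

cov_matrix = np.cov(samples, rowvar=False)    # 3x3 sample covariance (ddof=1)
var_of_sum = samples.sum(axis=1).var(ddof=1)  # sample variance of X_1 + X_2 + X_3
sum_of_variances = np.trace(cov_matrix)       # value if the X_i were uncorrelated

print(np.isclose(var_of_sum, cov_matrix.sum()))  # the identity holds for sample moments
print(var_of_sum > sum_of_variances)             # positive covariance inflates the variance
```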

For more fun facts and simulations about the bias-variance tradeoff in cross-validation, please see [this post](https://stats.stackexchange.com/questions/61783/bias-and-variance-in-leave-one-out-vs-k-fold-cross-validation/357749#357749).

---

## Group Work Problems

But WHY do we even bother with cross-validation? What is the point?

In this group work assignment, you'll perform K-fold cross validation on several classification models on the NHANES dataset. Additionally, you will be asked to write answers to explain WHY we do the things we do. Please feel free to ask instructors for guidance if cross-validation is still new to you.

---

**Problem 1: Dataset Setup**

Our favorite (and only) dataset we have used in the group work assignments is back! Please take the appropriate time to load in the datasets.

In [None]:
# BEGIN SOLUTION
# Load the three NHANES datasets and merge them on SEQN
bmx_df = pd.read_sas("data/NHANES/BMX_L.xpt")
demo_df = pd.read_sas("data/NHANES/DEMO_L.xpt")
hdl_df = pd.read_sas("data/NHANES/HDL_L.xpt")

# inner join on SEQN
df = pd.merge(hdl_df, bmx_df, on="SEQN", how="inner")
df = pd.merge(df, demo_df, on="SEQN", how="inner")
df.head()
# END SOLUTION

In [None]:
# Test assertions
assert "df" in dir(), "The variable 'df' should be defined"
assert hasattr(df, "shape"), "df should be a DataFrame"
assert df.shape[0] > 5000, "The merged dataframe should have more than 5000 rows"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "SEQN" in df.columns, "SEQN column should be in the merged dataframe"
assert "LBDHDD" in df.columns, "LBDHDD (HDL) should be in the merged dataframe"
assert "BMXWT" in df.columns, "BMXWT (weight) should be in the merged dataframe"
assert "RIDAGEYR" in df.columns, "RIDAGEYR (age) should be in the merged dataframe"
# END HIDDEN TESTS

---

**Problem 2: Variable Setup and Selection**

For this problem, we will again try to predict which individuals have high-density lipoprotein (HDL) cholesterol greater than 60. An HDL of 60 **mg/dL** or higher is often viewed as protective against heart disease; this is typically the level you'd like to aim for, if possible.

For this task please do the following:

1. Create the binary indicator variable called `HDL>60`. Also, use the following features for predictive purposes: Gender, Age, Weight, Height, BMI, WaistSize, Household Size, and Ethnicity. You may need to refer to the docs to figure out their variable names:
   - [HDL_L](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/HDL_L.htm)
   - [DEMO_L](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.htm)
   - [BMX_L](https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/BMX_L.htm)

2. Rename the variable names to be English-legible but still in Python variable style (e.g., `BMXWT` becomes `Weight`).

3. Drop all missing values.

Store the resulting DataFrame in a variable called `my_df`.

In [None]:
# BEGIN SOLUTION
# Select relevant columns and rename them to English-legible names
selected_columns = [
    "LBDHDD",
    "RIAGENDR",
    "RIDAGEYR",
    "BMXWT",
    "BMXHT",
    "DMDHHSIZ",
    "BMXBMI",
    "BMXWAIST",
    "RIDRETH1",
]
filtered_data = df[selected_columns].copy()
my_df = filtered_data.rename(
    columns={
        "LBDHDD": "HDL",
        "RIAGENDR": "Gender",
        "RIDAGEYR": "Age",
        "BMXWT": "Weight",
        "BMXHT": "Height",
        "BMXBMI": "BMI",
        "BMXWAIST": "WaistSize",
        "DMDHHSIZ": "HouseholdSize",
        "RIDRETH1": "Ethnicity",
    }
)
my_df = my_df.dropna()
my_df["HDL"] = my_df["HDL"] > 60.0
my_df = my_df.rename(columns={"HDL": "HDL>60"})
my_df.head()
# END SOLUTION

In [None]:
# Test assertions
assert "my_df" in dir(), "The variable 'my_df' should be defined"
assert my_df.shape[0] / df.shape[0] > 0.8, "At least 80% of rows should remain after dropping NaN"
assert "HDL>60" in my_df.columns, "my_df should have an 'HDL>60' column"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert my_df["HDL>60"].dtype == bool, "HDL>60 should be a boolean column"
assert "Gender" in my_df.columns, "my_df should have a 'Gender' column"
assert "Age" in my_df.columns, "my_df should have an 'Age' column"
assert "Weight" in my_df.columns, "my_df should have a 'Weight' column"
assert "Height" in my_df.columns, "my_df should have a 'Height' column"
assert "BMI" in my_df.columns, "my_df should have a 'BMI' column"
assert "WaistSize" in my_df.columns, "my_df should have a 'WaistSize' column"
assert "HouseholdSize" in my_df.columns, "my_df should have a 'HouseholdSize' column"
assert "Ethnicity" in my_df.columns, "my_df should have an 'Ethnicity' column"
assert not my_df.isna().any().any(), "my_df should have no missing values"
# END HIDDEN TESTS

---

**Problem 3: Training and Testing Split**

Please split your data into a train and test set, with 70% of observations in the train set and 30% in the test set.

* Use the `train_test_split` function from sklearn.
* Use `random_state=42`.
* Stratify the sampling to include roughly the same distribution of response values in each set. [Why should we stratify?](https://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold)

Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.
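
To see what stratification buys you, the sketch below compares a plain split with a stratified one on a hypothetical imbalanced dataset (roughly 10% positives). The data and the `_demo`, `_plain`, and `_strat` names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# hypothetical imbalanced data: about 90% class 0, 10% class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)

_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# difference in positive rate between the train and test sets
gap_plain = abs(y_tr_plain.mean() - y_te_plain.mean())
gap_strat = abs(y_tr_strat.mean() - y_te_strat.mean())
print(f"positive-rate gap without stratify: {gap_plain:.4f}")
print(f"positive-rate gap with stratify:    {gap_strat:.4f}")
```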

In [None]:
# BEGIN SOLUTION
# Split data into features (X) and target (y), then into train/test sets
X = my_df.drop(columns=["HDL>60"])
y = my_df["HDL>60"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# END SOLUTION

In [None]:
# Test assertions
assert "X_train" in dir(), "X_train should be defined"
assert "X_test" in dir(), "X_test should be defined"
assert "y_train" in dir(), "y_train should be defined"
assert "y_test" in dir(), "y_test should be defined"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check approximate split ratio (70/30)
total_samples = len(X_train) + len(X_test)
train_ratio = len(X_train) / total_samples
assert 0.69 < train_ratio < 0.71, f"Train ratio should be ~0.7, got {train_ratio:.3f}"
# Check stratification worked (similar proportions)
train_pos_rate = y_train.mean()
test_pos_rate = y_test.mean()
assert abs(train_pos_rate - test_pos_rate) < 0.02, "Stratification should preserve class balance"
# END HIDDEN TESTS

---

**Problem 4: Implementing K-Fold CV**

Write a function called `KFoldCV` that takes in 4 arguments:

1. `X`: The predictors array.
2. `y`: The response variable array.
3. `model`: An sklearn model object on which we can call `fit` and `predict`.
4. `K`: An integer representing the number of folds (default: 10).

The function should return an array of classification accuracies for each fold.

**Hint:** You can use sklearn's `KFold` and `cross_val_score` to simplify your implementation. Use `random_state=42` and `shuffle=True` in your `KFold` object.

In [None]:
def KFoldCV(X, y, model, K=10):  # noqa: N802, N803
    # BEGIN SOLUTION
    # Use sklearn's KFold and cross_val_score for clean implementation
    cv = KFold(n_splits=K, random_state=42, shuffle=True)
    return cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
    # END SOLUTION

In [None]:
# Test assertions
test_model = LogisticRegression(max_iter=500)
test_scores = KFoldCV(X_train, y_train, test_model, K=5)
assert len(test_scores) == 5, "Should return 5 scores for K=5"
assert all(0 <= score <= 1 for score in test_scores), "Scores should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test with K=10
test_scores_10 = KFoldCV(X_train, y_train, LogisticRegression(max_iter=500), K=10)
assert len(test_scores_10) == 10, "Should return 10 scores for K=10"
# Mean accuracy should be reasonable (better than random)
assert test_scores_10.mean() > 0.6, "Mean accuracy should be better than random"
# Test with different model
knn_scores = KFoldCV(X_train, y_train, KNeighborsClassifier(n_neighbors=5), K=5)
assert len(knn_scores) == 5, "Should work with KNN model too"
# END HIDDEN TESTS

---

**Problem 5a: Execution**

Run your `KFoldCV` function using Logistic Regression with $K=10$.

Store the result in a variable called `cv_scores` and make sure the output is visible in the notebook.

**Note:** If you get a warning that the model has not reached convergence, add the argument `max_iter=1000` to the `LogisticRegression` instance.

In [None]:
# BEGIN SOLUTION
# Run 10-fold CV with logistic regression
model = LogisticRegression(max_iter=1000)
cv_scores = KFoldCV(X_train, y_train, model)
cv_scores
# END SOLUTION

In [None]:
# Test assertions
assert "cv_scores" in dir(), "cv_scores should be defined"
assert len(cv_scores) == 10, "Should have 10 fold scores"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert cv_scores.mean() > 0.7, "Mean CV accuracy should be above 0.7"
assert cv_scores.std() < 0.1, "CV scores should not have too much variance"
# END HIDDEN TESTS

---

**Problem 5b: Generalization**

One motivation for cross-validation is to assess the generalizability of your model on unseen data.

Explain in at most two sentences whether you believe your model generalizes well on unseen data. Reference the output of your function to make your case.

> BEGIN SOLUTION

The model shows reasonably consistent performance across all 10 folds, with accuracies ranging from approximately 71% to 80% and a mean around 76%. This consistency suggests the model generalizes fairly well to unseen data, as there are no individual folds with dramatically lower performance that would indicate overfitting to particular subsets of the training data.
> END SOLUTION


---

**Problem 6a: Working with Regularization**

It turns out that if you look at the `LogisticRegression` class in sklearn, it uses an L2 penalty by default! You can see for yourself on the [documentation page for sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

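One problem: we just accepted the default regularization parameter `C=1.0` (in sklearn, `C` is the *inverse* of the regularization strength, so smaller values mean stronger regularization). How do we know this was the right choice? Let's find out.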
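
To get a feel for the direction of the effect before tuning, the sketch below fits logistic regression on a small synthetic dataset for three values of `C` and prints the norm of the fitted coefficients. The three values and the `_demo` names are chosen for illustration only; a smaller `C` should shrink the coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_classes=2, random_state=0
)

coef_norms = {}
for C in (1e-3, 1.0, 1e3):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector (the intercept is not included)
    coef_norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C:g}: ||coef|| = {coef_norms[C]:.4f}")
```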

Write a function called `KFoldCV_L2` that performs K-Fold validation with different values of the regularization parameter $C$. The values to consider should be $10^{-5}, 10^{-4}, \ldots, 10^{4}$ (i.e., 10 values total).

The function should take as input `X`, `y`, and `K` (default 10) and return a dictionary with C values as keys and mean validation accuracy as values.

In [None]:
def KFoldCV_L2(X, y, K=10):  # noqa: N802, N803
    """
    Perform K-fold CV for logistic regression with different L2 regularization strengths.

    Parameters:
        X: pandas DataFrame of features
        y: pandas Series of labels
        K: number of folds (default 10)

    Returns:
        dict: C values as keys, mean validation accuracy as values
    """
    # BEGIN SOLUTION
    # Perform K-fold CV for each regularization strength C
    cv = KFold(n_splits=K, random_state=42, shuffle=True)

    results = {}
    C_values = [10**i for i in range(-5, 5)]
    for C in C_values:
        validation_acc = []
        for train_idx, val_idx in cv.split(X):
            model = LogisticRegression(C=C, max_iter=500)
            model.fit(X.iloc[train_idx], y.iloc[train_idx])
            y_pred = model.predict(X.iloc[val_idx])
            validation_acc.append(accuracy_score(y.iloc[val_idx], y_pred))

        results[C] = np.mean(validation_acc)
    return results
    # END SOLUTION

In [None]:
# Test assertions
l2_results = KFoldCV_L2(X_train, y_train)
assert isinstance(l2_results, dict), "Should return a dictionary"
assert len(l2_results) == 10, "Should have results for 10 C values"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 1e-05 in l2_results, "Should include C=1e-05"
assert 10000 in l2_results, "Should include C=10000"
assert all(0 <= v <= 1 for v in l2_results.values()), "All accuracies should be between 0 and 1"
# Higher C (less regularization) should generally perform at least as well for this data
assert l2_results[1] >= l2_results[1e-05] - 0.05, "Weak regularization should not hurt much"
# END HIDDEN TESTS

In [None]:
# Display the results
l2_results
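
For comparison, scikit-learn's `GridSearchCV` automates this kind of loop over a parameter grid. The sketch below is an alternative shown on synthetic data, not a replacement for the function this problem asks you to write; `grid_results` and the `_demo` names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_classes=2, random_state=0
)

param_grid = {"C": [10.0**i for i in range(-5, 5)]}
search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid,
    scoring="accuracy",
    cv=KFold(n_splits=10, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

# mean validation accuracy for each C, in the order given in param_grid
grid_results = dict(zip(param_grid["C"], search.cv_results_["mean_test_score"]))
print(search.best_params_, f"{search.best_score_:.4f}")
```

By default (`refit=True`), `GridSearchCV` also refits the best configuration on all the data passed to `fit`, which parallels the retraining step in Problem 7a.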

---

**Problem 6b: Evaluation**

Based on your cross-validation, do you believe that regularization has an effect on the predictive accuracy? What level of regularization should we choose?

> BEGIN SOLUTION

Yes, regularization does affect predictive accuracy, but mainly for very strong regularization (small C values like 1e-05 to 0.01). The accuracy improves as C increases (weaker regularization) up to around C=1, after which it plateaus. We should choose C=1 (or any value from 1 to 10000) since they all achieve similar peak performance, and C=1 is the simplest default choice that provides adequate regularization without sacrificing accuracy.
> END SOLUTION


---

**Problem 7a: Train and Test Evaluation**

Please now retrain your final model (with the best regularization value you found in the last part) on all training data. Then, evaluate your model's performance on the test set.

Store the trained model in `final_model` and the test accuracy in `test_accuracy`.

In [None]:
# BEGIN SOLUTION
# Train the final model with best C on all training data and evaluate on test set
final_model = LogisticRegression(C=1, max_iter=500)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
test_accuracy
# END SOLUTION

In [None]:
# Test assertions
assert "final_model" in dir(), "final_model should be defined"
assert "test_accuracy" in dir(), "test_accuracy should be defined"
assert 0 <= test_accuracy <= 1, "test_accuracy should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(final_model, "predict"), "final_model should be a trained model"
assert test_accuracy > 0.7, "Test accuracy should be above 0.7"
# END HIDDEN TESTS

---

**Problem 7b: Retraining Justification**

Explain in 1-2 sentences **maximum** why we retrain the model on the full training set (all data except the test set) before final evaluation.

> BEGIN SOLUTION

We retrain on the full training data because more training data generally leads to better model performance. Cross-validation was only used to select the best hyperparameters (regularization strength), and once that decision is made, we want to use all available training data to fit the final model.
> END SOLUTION


---

**Problem 7c: Performance Evaluation**

Is logistic regression performing well on this data? Explain in 1-2 sentences **maximum**. You may need some supporting code for your argument.

**Hint:** Think about what a trivial baseline would be.

In [None]:
# BEGIN SOLUTION
# Calculate baseline accuracy (always predicting the majority class,
# which is assumed here to be False, i.e., HDL <= 60)
baseline_accuracy = 1 - y_test.mean()
print(f"Baseline accuracy (always predict False): {baseline_accuracy:.4f}")
print(f"Logistic regression test accuracy: {test_accuracy:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert "baseline_accuracy" in dir(), "baseline_accuracy should be defined"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.6 < baseline_accuracy < 0.8, "Baseline accuracy should be between 0.6 and 0.8"
# END HIDDEN TESTS
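
A majority-class baseline can also be obtained from scikit-learn's `DummyClassifier`, which learns the majority class from the training labels instead of assuming which class it is. The sketch below uses hypothetical synthetic data with roughly 75% negatives; the `_demo` and shortened split names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# hypothetical data: about 75% class 0, 25% class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```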

> BEGIN SOLUTION

Logistic regression is performing modestly well but not impressively. It achieves around 76% accuracy compared to a trivial baseline of about 73% (always predicting the majority class), so it is only marginally better than simply predicting that nobody has high HDL.
> END SOLUTION


---

**Problem 8: K-Folds on K-Nearest Neighbors**

Using the `KFoldCV` function you created in Problem 4, please run CV on 2-NN classification (K-Nearest Neighbors with `n_neighbors=2`).

Store the result in a variable called `knn_cv_scores`.

In [None]:
# BEGIN SOLUTION
# Run 10-fold CV with 2-NN classifier
knn_cv_scores = KFoldCV(
    X_train, y_train, KNeighborsClassifier(n_neighbors=2, weights="distance"), K=10
)
knn_cv_scores
# END SOLUTION

In [None]:
# Test assertions
assert "knn_cv_scores" in dir(), "knn_cv_scores should be defined"
assert len(knn_cv_scores) == 10, "Should have 10 fold scores"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert all(0 <= s <= 1 for s in knn_cv_scores), "All scores should be between 0 and 1"
assert knn_cv_scores.mean() > 0.5, "Mean accuracy should be better than random"
# END HIDDEN TESTS

---

**Problem 9: Finding the Right K for KNN**

Similar to what you did for regularized logistic regression, please run a 10-fold CV that assesses model performance for 10 different values of `n_neighbors`: [1, 3, 5, 10, 15, 20, 25, 35, 50, 100]. You should evaluate each fold with each `n_neighbors` value.

Write a function called `KFoldCV_NN` that takes as input `X`, `y`, and `K` (default 10) and returns a dictionary. The dictionary should have the number of neighbors considered as keys and the mean validation accuracy as values.

In [None]:
def KFoldCV_NN(X, y, K=10):  # noqa: N802, N803
    """
    Perform K-fold CV for KNN with different numbers of neighbors.

    Parameters:
        X: pandas DataFrame of features
        y: pandas Series of labels
        K: number of folds (default 10)

    Returns:
        dict: n_neighbors values as keys, mean validation accuracy as values
    """
    # BEGIN SOLUTION
    # Perform K-fold CV for each number of neighbors
    cv = KFold(n_splits=K, random_state=42, shuffle=True)

    results = {}
    num_neighbors = [1, 3, 5, 10, 15, 20, 25, 35, 50, 100]
    for k in num_neighbors:
        validation_acc = []
        for train_idx, val_idx in cv.split(X):
            model = KNeighborsClassifier(n_neighbors=k, weights="distance")
            model.fit(X.iloc[train_idx], y.iloc[train_idx])
            y_pred = model.predict(X.iloc[val_idx])
            validation_acc.append(accuracy_score(y.iloc[val_idx], y_pred))

        results[k] = np.mean(validation_acc)
    return results
    # END SOLUTION

In [None]:
# Test assertions
nn_results = KFoldCV_NN(X_train, y_train)
assert isinstance(nn_results, dict), "Should return a dictionary"
assert len(nn_results) == 10, "Should have results for 10 n_neighbors values"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 1 in nn_results, "Should include n_neighbors=1"
assert 100 in nn_results, "Should include n_neighbors=100"
assert all(0 <= v <= 1 for v in nn_results.values()), "All accuracies should be between 0 and 1"
# 1-NN typically has lower accuracy due to overfitting
assert nn_results[5] >= nn_results[1] - 0.05, "5-NN should be comparable or better than 1-NN"
# END HIDDEN TESTS

In [None]:
# Display the results
nn_results

---

**Problem 10: Retraining KNN on Training Data**

Retrain your KNN model on the full training set. Make sure you use the best number of neighbors as determined from the previous part.

Store the best number of neighbors in `BEST_K` and the trained model in `knn_final_model`.

In [None]:
# BEGIN SOLUTION
# Find best K and train final KNN model
BEST_K = max(nn_results, key=nn_results.get)  # SOLUTION

knn_final_model = KNeighborsClassifier(n_neighbors=BEST_K, weights="distance")
knn_final_model.fit(X_train, y_train)
# END SOLUTION

In [None]:
# Test assertions
assert "BEST_K" in dir(), "BEST_K should be defined"
assert "knn_final_model" in dir(), "knn_final_model should be defined"
assert BEST_K in [1, 3, 5, 10, 15, 20, 25, 35, 50, 100], "BEST_K should be one of the tested values"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert hasattr(knn_final_model, "predict"), "knn_final_model should be a trained model"
assert knn_final_model.n_neighbors == BEST_K, "Model should use BEST_K neighbors"
# END HIDDEN TESTS

---

**Problem 11: Test Set Evaluation for KNN**

Evaluate your final KNN model's performance on the test set.

Store the test accuracy in `knn_test_accuracy`.

In [None]:
# BEGIN SOLUTION
# Evaluate KNN on test set
knn_test_accuracy = knn_final_model.score(X_test, y_test)
knn_test_accuracy
# END SOLUTION

In [None]:
# Test assertions
assert "knn_test_accuracy" in dir(), "knn_test_accuracy should be defined"
assert 0 <= knn_test_accuracy <= 1, "knn_test_accuracy should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert knn_test_accuracy > 0.65, "KNN test accuracy should be reasonable"
# END HIDDEN TESTS