from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Discussion 10

# Cross-Validation and Regularization

### EECS 398: Practical Data Science, Winter 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/wn25">github.com/practicaldsc/wn25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/69737/discussion/5943734) </small>
    
</div>

### Agenda 📆

- The bias-variance tradeoff.
- Cross-validation.
- Regularization.

### The bias-variance tradeoff

- In the real world, we're concerned with our model's ability to **generalize** on **different datasets** drawn from the same population.

<center><img src='imgs/tt-errors.png' width=800></center>

- In lecture, we trained three different polynomial regression models – degree 1, 3, and 25 – each on two different datasets, <span style="color:orange"><b>Sample 1</b></span> and <span style="color:purple"><b>Sample 2</b></span>.<br><small>The points in <span style="color:blue"><b>blue</b></span> come from Sample 1.</small>

<center><img src='imgs/bias-variance.png' width=900></img></center>

- The degree 1 polynomials have the highest bias – on average, they are **consistently wrong** – while the degree 25 polynomials have the lowest bias – on average, they are **consistently good**.

$$\text{low complexity} \rightarrow \text{underfits the training data} \rightarrow \text{high bias and low variance}$$


- The degree 25 polynomials have the highest variance – from training set to training set, they vary more than the degree 1 and 3 polynomials.

$$\text{high complexity} \rightarrow \text{overfits the training data} \rightarrow \text{low bias and high variance}$$
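
- To see both chains of reasoning numerically, here is a minimal simulation sketch. It is not the lecture's code: the population ($y = \sin(2\pi x)$ plus noise), the sample size, and the names `draw_sample` and `results` are all assumptions made for illustration. It repeatedly draws training sets, fits polynomials of degree 1, 3, and 25, and estimates each degree's squared bias and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (an assumption, not the lecture's data):
# fixed x values, with y = sin(2*pi*x) + fresh noise in every sample.
x = np.linspace(0, 1, 50)
true_y = np.sin(2 * np.pi * x)

def draw_sample():
    """Draw one training set's y values from the assumed population."""
    return true_y + rng.normal(0, 0.3, len(x))

results = {}
for degree in [1, 3, 25]:
    preds = []
    for _ in range(200):  # 200 different training sets
        y = draw_sample()
        # Polynomial.fit rescales x internally, which keeps
        # high-degree fits numerically stable.
        model = np.polynomial.Polynomial.fit(x, y, degree)
        preds.append(model(x))
    preds = np.array(preds)  # shape: (200 training sets, 50 x values)

    # Squared bias: how far the *average* prediction is from the truth.
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)
    # Variance: how much predictions change from training set to training set.
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)

for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree:>2}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

  Under these assumptions, squared bias should fall and variance should rise as the degree increases, matching the two chains above.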

### Cross-validation

- Cross-validation, as we talked about in lecture, is one way we can split our data into training and validation sets. We can create $k$ <span style='color: green'><b>validation</b></span> sets, where $k$ is some positive integer (5 in the example below).

<center><img src='imgs/k-fold.png' width=500></center> 

- Suppose we're choosing between **10** different hyperparameter values for our model and decide to use **5**-fold cross-validation to determine which value performs best.

- First, we divide the entire dataset into 5 equally sized "slices" (folds).

- For each of the **10** hyperparameter values, we perform **5** training rounds, for a total of **5 × 10 = 50** trainings.<br><small>**In each training**, we'll use 4 folds to train the model and the remaining 1 fold to validate it. This gives us **5** validation error measurements per hyperparameter value.</small>

- Finally, we calculate the average validation error for each of our 10 hyperparameter values, and choose the one with the lowest average validation error.

- **Aside**: Some of the worksheet questions use the term "accuracy". Although we haven't covered it yet, accuracy is one of the ways to evaluate a classification model, where **higher accuracy is better**.
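
- The 5-fold procedure described above can be sketched by hand as follows. This is an illustrative sketch, not the course's implementation: the synthetic dataset, the choice of polynomial degree as the hyperparameter, and the names `folds`, `num_trainings`, and `avg_val_mse` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset (an assumption for illustration).
n = 100
x = rng.uniform(-1, 1, n)
y = x ** 3 - x + rng.normal(0, 0.1, n)

k = 5
degrees = list(range(1, 11))  # 10 candidate hyperparameter values

# Shuffle the row indices once, then cut them into 5 equally sized folds.
folds = np.array_split(rng.permutation(n), k)

num_trainings = 0
avg_val_mse = {}
for degree in degrees:
    fold_errors = []
    for i in range(k):
        # Fold i is the validation set; the other 4 folds are for training.
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

        model = np.polynomial.Polynomial.fit(
            x[train_idx], y[train_idx], degree, domain=[-1, 1]
        )
        num_trainings += 1

        val_mse = np.mean((y[val_idx] - model(x[val_idx])) ** 2)
        fold_errors.append(val_mse)

    # Average the 5 validation errors for this hyperparameter value.
    avg_val_mse[degree] = np.mean(fold_errors)

# Choose the hyperparameter value with the lowest average validation error.
best_degree = min(avg_val_mse, key=avg_val_mse.get)
print("trainings performed:", num_trainings)
print("best degree:", best_degree)
```

  The counter `num_trainings` ends at 5 × 10 = 50, one training per (hyperparameter value, fold) pair.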

### Regularization

- In general, the larger in magnitude the optimal parameters $w_0^*, w_1^*, ..., w_d^*$ are, the more likely it is that our model is overfit.<br>We can discourage large parameter values by minimizing mean squared error with **regularization**.

$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \mathbf{+} \underbrace{\lambda \sum_{j = 1}^d w_j^2}_{\text{regularization penalty!}}$$

- Linear regression with $L_2$ regularization is called **ridge regression**.<br><small>Linear regression with $L_1$ regularization is called LASSO.</small>

- Intuition: Instead of just minimizing mean squared error, we balance minimizing mean squared error and a penalty on the size of the fit coefficients, $w_1^*$, $w_2^*$, ..., $w_d^*$.<br><small>We don't regularize the intercept term!</small>

- $\lambda$ is a **hyperparameter**, which we choose through cross-validation.
  - Higher $\lambda$ → stronger penalty, coefficients shrink more → higher bias, lower variance (underfitting).  
  - Lower $\lambda$ → weaker penalty, coefficients can grow → lower bias, higher variance (overfitting).
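
- To see the effect of $\lambda$ on the coefficients, here is a minimal sketch that minimizes $R_\text{ridge}$ directly. Setting the gradient of the objective above to zero gives the linear system $(X^TX + n\lambda D)\vec{w} = X^T\vec{y}$, where $D$ is the identity matrix with a 0 in the intercept's position, so the intercept is not penalized. The function name `fit_ridge` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/n) * ||y - Xw||^2 + lam * sum_{j>=1} w_j^2.

    X holds the features only; a column of 1s is added for the intercept.
    """
    n, d = X.shape
    X_design = np.column_stack([np.ones(n), X])

    # Penalize every coefficient except the intercept w_0.
    penalty = np.eye(d + 1)
    penalty[0, 0] = 0

    # Solve (X^T X + n * lam * D) w = X^T y.
    return np.linalg.solve(
        X_design.T @ X_design + n * lam * penalty, X_design.T @ y
    )

# Hypothetical data (an assumption for illustration).
rng = np.random.default_rng(1)
n, d = 40, 5
X = rng.normal(size=(n, d))
y = 3 + X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.5, n)

lambdas = [0, 0.1, 1, 10, 100]
coef_norms = []
for lam in lambdas:
    w = fit_ridge(X, y, lam)
    # Size of the non-intercept coefficients w_1, ..., w_d.
    coef_norms.append(np.linalg.norm(w[1:]))
    print(f"lambda = {lam:>5}: ||w_1..w_d|| = {coef_norms[-1]:.3f}")
```

  With $\lambda = 0$ this reduces to ordinary least squares; as $\lambda$ grows, the norm of the non-intercept coefficients shrinks toward 0.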

## Attendance 🙋

<center><img src='imgs/disc10.png' width="500"></img></center> 

---

## <a href='https://study.practicaldsc.org/disc10/index.html'>Worksheet</a> 📝

---