In [None]:
from lec_utils import *
import lec19_util as util
from IPython.display import Markdown
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, FunctionTransformer, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

<div class="alert alert-info" markdown="1">

#### Lecture 19

# Regularization

### EECS 398: Practical Data Science, Winter 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/wn25">github.com/practicaldsc/wn25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/69737/discussion/5943734) </small>
    
</div>

### Agenda 📆

- Recap: Model selection.
- Ridge regression 🏔️.
- LASSO 📿.

Remember to look at the [Machine Learning section of the Resources tab](https://practicaldsc.org/resources#machine-learning) of the course website.

In addition, this lecture has an associated [Guide](https://practicaldsc.org/guides/machine-learning/ridge-regression/), which we'll refer to in lecture and you'll need to read for Homework 9.

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Remember that you can always ask questions anonymously at the link above!

## Recap: Model selection

---

### Example: Commute times

- Last class, we used $k$-fold cross-validation to choose between the following five models that predict commute time in `'minutes'`.

<center><img src="imgs/five-pipelines.png" width=900></center>

- Which model has the highest **model bias**?<br><small>See the annotated slides for the answer.</small>

- Which model has the highest **model variance**?

- Which model is most likely to perform best in practice? Why?

## Ridge regression 🏔️

---

### Motivation

- So far, to us, "model complexity" has essentially meant "number of features."<br><small>The main hyperparameter we've tuned is polynomial degree. For instance, a polynomial of degree 5 has 5 features – an $x$, $x^2$, $x^3$, $x^4$, and $x^5$ feature.<br>In the more recent example, we <b>manually</b> created several different pipelines, each of which used different combinations of features from the commute times dataset.</small>

- Once we've created several different candidate models, we've used cross-validation to choose the one that best generalizes to unseen data.

- Another approach: **instead of manually choosing which features to include, put some constraint on the optimal parameters, $w_0^*, w_1^*, ..., w_d^*$**.<br><small>This would save us time from having to think of combinations of features that might be relevant.</small>

- Intuition: **The bigger the optimal parameters $w_0^*, w_1^*, ..., w_d^*$ are, the more _overfit_ the model is to the training data.**<br><small>Why?</small>
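
To make the "a polynomial of degree 5 has 5 features" point above concrete, here is a minimal sketch using `PolynomialFeatures`. The single value $x = 2$ is made up for illustration and is not from the lecture's data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical observation with a single feature, x = 2.
x = np.array([[2.0]])

# include_bias=False drops the constant column of 1s, leaving x, x^2, ..., x^5.
poly = PolynomialFeatures(degree=5, include_bias=False)
features = poly.fit_transform(x)

print(features.tolist())   # [[2.0, 4.0, 8.0, 16.0, 32.0]]
print(features.shape[1])   # 5
```

Each added power of $x$ is one more column in the design matrix, which is why raising the degree raises model complexity.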

### Polynomial regression returns

- Last class, we fit various polynomial regression models to Sample 1, shown below.

In [None]:
sample_1 = util.sample_from_pop()
X_train, X_test, y_train, y_test = train_test_split(sample_1[['x']], sample_1['y'], random_state=23)
px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data", width=800, height=600)

- As we increase the degree of the polynomial, the resulting <b><span style="color:#ff7f0f">fit polynomial</span></b> overfits the **training data** more and more.

In [None]:
interact(lambda d: util.fit_and_show_fit(X_train, y_train, d)[1], d=(1, 25));

### Inspecting the fit degree 25 polynomial

- Let's consider the degree 25 polynomial.

In [None]:
model, fig = util.fit_and_show_fit(X_train, y_train, d=25)
fig

- What does the <b><span style="color:#ff7f0f">fit polynomial</span></b> actually look like, as an equation?

In [None]:
# These coefficients are rounded to two decimal places.
# The coefficient on x^25 is not 0.00, but is something very small.
util.display_features(model.named_steps['linearregression'])

In [None]:
util.plot_coefficient_magnitudes(model.named_steps['linearregression'])

- `sklearn` assigned **really large** coefficients to many features.<br><small>This means that if $x$ changes a little, the output is going to change **a lot**. It seems like some of the terms are trying to "cancel" each other out – some have large negative coefficients, some have large positive coefficients.</small>

- Intuition: In general, **the larger the optimal parameters $w_0^*, w_1^*, ..., w_d^*$ are, the more _overfit_ the model is to the training data.**
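
The sketch below illustrates this intuition on synthetic stand-in data. The dataset, the seed, and the helper `largest_coefficient` are assumptions made for this example, not part of `lec19_util`. It fits a low-degree and a high-degree polynomial without regularization and compares the largest coefficient magnitude in each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption; not the lecture's Sample 1).
rng = np.random.default_rng(23)
x = np.sort(rng.uniform(0, 1, size=12)).reshape(-1, 1)
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.3, size=12)

def largest_coefficient(degree):
    """Fit an unregularized polynomial and return its largest |coefficient|."""
    pipe = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    pipe.fit(x, y)
    return np.abs(pipe.named_steps['linearregression'].coef_).max()

small = largest_coefficient(2)
large = largest_coefficient(9)
print(f'degree 2: {small:.2f}, degree 9: {large:.2f}')
```

With only 12 noisy points, the degree 9 fit has enough freedom to chase the noise, and it does so with coefficients that are much larger in magnitude than those of the degree 2 fit.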

### Ridge regression

- **Idea**: In addition to just minimizing mean squared error, what if we could **also** try and prevent large parameter values?<br><small>Maybe this would lead to less overfitting!</small>

- **Regularization** is the act of adding a penalty on the norm of the parameter vector, $\vec w$, to the objective function.

$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \:\: +  \underbrace{\lambda \sum_{j = 1}^d w_j^2}_{{\substack{\text{regularization} \\ \text{penalty}}}}$$

- Linear regression with $L_2$ regularization – as shown above – is called **ridge regression**.<br><small>You'll explore the reason why in Homework 9!</small>


- Intuition: Instead of just minimizing mean squared error, we balance minimizing mean squared error and a penalty on the size of the fit coefficients, $w_1^*$, $w_2^*$, ..., $w_d^*$.<br><small>We don't regularize the intercept term!</small>

- $\lambda$ is a **hyperparameter**, which we choose through cross-validation.

- The $\vec{w}_\text{ridge}^*$ that minimizes $R_\text{ridge}(\vec{w})$ is not necessarily the same as $\vec{w}_\text{OLS}^*$, which minimizes $R_\text{sq}(\vec{w})$!
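
As a rough sketch, the ridge objective can be written as a short function. The name `ridge_objective` and the tiny dataset are hypothetical, and the function assumes the first column of `X` is all 1s, so that `w[0]` is the intercept and is left out of the penalty.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Mean squared error plus lam times the sum of squared non-intercept weights.

    Assumes the first column of X is all 1s, so w[0] is the intercept
    and is not penalized.
    """
    mse = np.mean((y - X @ w) ** 2)
    penalty = lam * np.sum(w[1:] ** 2)
    return mse + penalty

# Tiny made-up dataset: an intercept column plus one feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([0.0, 2.0])   # fits these three points exactly, so the MSE is 0

print(ridge_objective(w, X, y, lam=0))    # 0.0
print(ridge_objective(w, X, y, lam=0.5))  # 2.0, all of it from the penalty: 0.5 * 2^2
```

Even a parameter vector with zero mean squared error pays a price once $\lambda > 0$, which is why the minimizer of $R_\text{ridge}$ can differ from the minimizer of $R_\text{sq}$.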

<div class="alert alert-success">

### Activity
    
The objective function we minimize to find $\vec{w}_\text{ridge}^*$ in **ridge regression** is:
    
$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$
    
$\lambda$ is a **hyperparameter**, which we choose through cross-validation. Discuss the following points with those near you:
    
- What if we pick $\lambda = 0$ – what is $\vec{w}_\text{ridge}^*$ then?
- What happens to $\vec{w}_\text{ridge}^*$ as $\lambda \rightarrow \infty$?
- Can $\lambda$ be negative?
    
</div>

### Another interpretation of ridge regression

- As $\lambda$ increases, the penalty on the size of $\vec{w}_\text{ridge}^*$ increases, meaning that each $w_j^*$ inches closer to 0.

- An equivalent way of formulating the ridge regression objective function,

    $$\text{minimize} \:\:\:\: \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$

    is as a **constrained** optimization problem:
    
    $$\text{minimize} \:\:\:\:\frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \text{  such that   } \sum_{j = 1}^d w_j^2 \leq Q$$

- $Q$ and $\lambda$ are **inversely related**: the larger $Q$ is, the less of a penalty we're putting on the size of $\vec{w}_\text{ridge}^*$, so the smaller $\lambda$ is.

    $$\lambda \approx \frac{1}{Q}$$

    <br><small>The exact relationship between $Q$ and $\lambda$ is outside of the scope of this course, as is the proof of this fact.</small>
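
One way to see the inverse relationship numerically is to fit ridge models with increasing penalties and record the sum of squared coefficients each one ends up with. This is a sketch on synthetic data (an assumption), and it uses `sklearn`'s `alpha`, which is proportional to $\lambda$ rather than equal to it.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(size=50)

alphas = [0.01, 1, 100, 10_000]
# For each penalty strength, record the sum of squared fit coefficients,
# i.e. the smallest Q whose constraint the ridge solution would satisfy.
implied_Q = [np.sum(Ridge(alpha=a).fit(X, y).coef_ ** 2) for a in alphas]

for a, q in zip(alphas, implied_Q):
    print(f'alpha = {a:>8}: sum of squared coefficients = {q:.4f}')
```

The printed sums shrink as the penalty grows: a larger penalty corresponds to a smaller $Q$.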

### Aside: Contour plots

- First, let's look at the loss surface for mean squared error **without** regularization, for some two feature regression model.

In [None]:
util.show_ols_surface()

- We can equivalently visualize this 3D surface as a **contour plot**, which can be thought of as a projection of the surface above into two dimensions, with the colors preserved.<br><small>Learn more about contour plots in [this video](https://youtu.be/WsZj5Rb6do8?si=SmJCSAAJOT2O5m7J).</small>

In [None]:
util.show_ols_contour()

### Visualizing ridge regression as a constrained optimization problem

$$\text{minimize} \:\:\:\:\frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \text{  such that   } \sum_{j = 1}^d w_j^2 \leq Q; \qquad \lambda \approx \frac{1}{Q}$$

- Intuitively:

    - The **contour plot of the loss surface** for just the mean squared error component is in <span style="color:#440154"><b>v</b></span><span style="color:#482778"><b>i</b></span><span style="color:#3e4a89"><b>r</b></span><span style="color:#31688e"><b>i</b></span><span style="color:#26828e"><b>d</b></span><span style="color:#1f9a8c"><b>i</b></span><span style="color:#a3d8cf"><b>s</b></span>.
    - The constraint, $\sum_{j = 1}^d w_j^2 \leq Q$, is in <span style="color:red"><b>red</b></span>. Ridge regression says, minimize mean squared error, <span style="color:red"><b>while staying in the red circle</b></span>.<br><small>The larger $Q$ is, the larger the radius of the circle is.</small>

In [None]:
util.show_ridge_contour()

- The smaller $Q$ is – so, the larger $\lambda$ is – the smaller the <span style="color:red"><b>red circle</b></span> is!

- As $\lambda$ increases, the constrained solution $\vec w_\text{ridge}^*$ shrinks closer to $\vec 0$.

### Finding $\vec{w}_\text{ridge}^*$

- We know that the $\vec{w}_\text{OLS}^*$ that minimizes mean squared error,
    $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2$$
  is the one that satisfies the normal equations, $X^TX \vec{w} = X^T \vec{y}$.<br><small>Recall, linear regression that minimizes mean squared error, without any other constraints, is called **ordinary least squares (OLS)**.</small>

- Sometimes, $\vec{w}^*_\text{OLS}$ is unique, and sometimes there are infinitely many possible $\vec{w}^*_\text{OLS}$.<br><small>There are infinitely many possible $\vec{w}^*_\text{OLS}$ when the design matrix, $X$, is not full rank! All of these infinitely many solutions minimize mean squared error.</small>

- Which vector $\vec{w}_\text{ridge}^*$ minimizes the ridge regression objective function?

$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$

- It turns out that when $\lambda > 0$, there is **always** a unique solution for $\vec{w}_\text{ridge}^*$, even if $X$ is not full rank. It is:
    $$\vec{w}_\text{ridge}^* = (X^TX + n \lambda I)^{-1} X^T \vec{y}$$
    <br><small>You'll prove this in Homework 9!</small>

- Since there is **always** a unique solution, ridge regression is often used in the presence of multicollinearity!
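
Here is a minimal numerical check of the closed-form solution, using a design matrix that is deliberately **not** full rank. The data is synthetic (an assumption). Two details make the comparison with `sklearn` exact: `sklearn`'s `Ridge` minimizes $\lVert \vec{y} - X \vec{w} \rVert^2 + \alpha \lVert \vec w \rVert^2$ with no $\frac{1}{n}$, so $\alpha = n \lambda$ here, and `fit_intercept=False` is used so that there is no separate, unpenalized intercept.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
# The second column is an exact multiple of the first, so X is NOT full rank.
X = np.column_stack([x1, 2 * x1, rng.normal(size=n)])
y = 3 * x1 + rng.normal(size=n)

lam = 0.1
# Closed form from above: (X^T X + n * lambda * I)^{-1} X^T y.
w_closed_form = np.linalg.solve(X.T @ X + n * lam * np.eye(X.shape[1]), X.T @ y)

# sklearn's alpha corresponds to n * lambda under this lecture's objective.
w_sklearn = Ridge(alpha=n * lam, fit_intercept=False).fit(X, y).coef_

print(np.linalg.matrix_rank(X))               # 2, fewer than the 3 columns
print(np.allclose(w_closed_form, w_sklearn))  # True
```

Even though $X^TX$ alone is not invertible here, adding $n \lambda I$ makes the matrix invertible, so the system has exactly one solution.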

### Visualizing ridge regression as an unconstrained optimization problem

- In the new [**Ridge Regression guide**](https://practicaldsc.org/guides/machine-learning/ridge-regression/), we dive deeper into some of the theory of ridge regression.

- Scroll to the section titled [**Loss surfaces**](https://practicaldsc.org/guides/machine-learning/ridge-regression/#loss-surfaces) to see an interactive visualization on the effect of $\lambda$ on the shape of the objective function's loss surface:

$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$

<center><img src="imgs/drag.png" width=600></center>

### Taking a step back

- $\vec{w}_\text{ridge}^*$ **doesn't** minimize mean squared error – it minimizes a slightly different objective function.

- So, why would we ever use ridge regression?

### Ridge regression in `sklearn`

- Fortunately, `sklearn` can perform ridge regression for us.

In [None]:
from sklearn.linear_model import Ridge

- Just to experiment, let's set $\lambda$ to something extremely large and look at the resulting predictions.

In [None]:
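# In sklearn, the regularization hyperparameter is named alpha; it plays the role of lambda.
# (sklearn's Ridge objective has no 1/n factor, so alpha corresponds to n * lambda.)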
model_large_lambda = make_pipeline(PolynomialFeatures(25, include_bias=False), 
                                   Ridge(alpha=1000000000000000000000000000))
model_large_lambda.fit(X_train, y_train)

### Visualizing the extremely regularized model

- What do the <b><span style="color:purple">resulting predictions</span></b> look like?

In [None]:
util.plot_given_model_dict(X_train, y_train, {'Extremely Regularized Polynomial of Degree 25': (model_large_lambda, 'purple')})

- What do you notice?

In [None]:
model_large_lambda.named_steps['ridge'].intercept_

In [None]:
# All 0!
model_large_lambda.named_steps['ridge'].coef_

In [None]:
y_train.mean()

### Using `GridSearchCV` to choose $\lambda$

- In general, we won't just arbitrarily choose a value of $\lambda$.

- Instead, we'll perform $k$-fold cross-validation to choose the $\lambda$ that leads to predictions that work best on unseen test data.<br><small>The value of $\lambda$ depends on the specific dataset and model you've chosen; there's no universally "best" $\lambda$.</small>

In [None]:
hyperparams = {
    'ridge__alpha': 10.0 ** np.arange(-2, 15) # Try 0.01, 0.1, 1, 10, 100, 1000, ... 
}
model_regularized = GridSearchCV(
    estimator=make_pipeline(PolynomialFeatures(25, include_bias=False), Ridge()),
    param_grid=hyperparams,
    scoring='neg_mean_squared_error'
)
model_regularized.fit(X_train, y_train)

- Let's check the optimal $\lambda$ it found!

In [None]:
model_regularized.best_params_

- While we used `GridSearchCV` here, note that `RidgeCV` also exists, which performs automatic cross-validation.
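
A minimal sketch of the `RidgeCV` alternative is below. The data is synthetic (an assumption; in the lecture notebook you would pass `X_train` and `y_train` instead), and the degree and grid of penalties are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(23)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=3, size=100)

alphas = 10.0 ** np.arange(-2, 4)
model_ridge_cv = make_pipeline(
    PolynomialFeatures(5, include_bias=False),
    # cv=5 mirrors GridSearchCV's default of 5 folds.
    RidgeCV(alphas=alphas, cv=5, scoring='neg_mean_squared_error'),
)
model_ridge_cv.fit(X, y)

# The chosen penalty is stored in the alpha_ attribute of the fitted RidgeCV step.
best_alpha = model_ridge_cv.named_steps['ridgecv'].alpha_
print(best_alpha)
```

`RidgeCV` only searches over the penalty. To search over other hyperparameters at the same time, such as polynomial degree, `GridSearchCV` is still the tool to reach for.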

### Visualizing the regularized degree 25 model

- What do the <b><span style="color:green">resulting predictions</span></b> look like?

In [None]:
util.plot_given_model_dict(X_train, y_train,
                           {'Unregularized Polynomial of Degree 25': (model, '#ff7f0f'),
                            'Regularized Polynomial of Degree 25': (model_regularized, 'green')})

- It seems that the <b><span style="color:green">regularized polynomial</span></b> is _less_ overfit to the specific noise in the training data than the <b><span style="color:#ff7f0f">unregularized polynomial</span></b>!

- The largest coefficients are all much smaller now, too.
<br><small>The coefficient on $x^{20}$ is -0.000136.</small>

In [None]:
util.display_features(model_regularized.best_estimator_.named_steps['ridge'], precision=8)

In [None]:
util.plot_coefficient_magnitudes(model_regularized.best_estimator_.named_steps['ridge'])

- Note that none of them are exactly 0, but many of them are close!<br><small>This will be important later.</small>

### Tuning multiple hyperparameters at once

- What if we don't want to fix a polynomial degree in advance, and instead want to choose **both** the degree and value of $\lambda$ using cross-validation?

- No problem – we can still grid search.<br><small>Note that the next cell takes much longer than the previous call to `fit` took, since it needs to try every combination of $\lambda$ and polynomial degree.</small>

In [None]:
hyperparams = {
    'ridge__alpha': 10.0 ** np.arange(-2, 15),
    'polynomialfeatures__degree': range(1, 26)
}
model_regularized_degree = GridSearchCV(
    estimator=make_pipeline(PolynomialFeatures(include_bias=False), Ridge()),
    param_grid=hyperparams,
    scoring='neg_mean_squared_error'
)
model_regularized_degree.fit(X_train, y_train)

- Now, let's check the optimal $\lambda$ **and** polynomial degree it found!

In [None]:
model_regularized_degree.best_params_

### Visualizing the regularized degree 3 model

- What do the <b><span style="color:skyblue">resulting predictions</span></b> look like?

In [None]:
polyfig = util.plot_given_model_dict(X_train, y_train,
                                     {'Unregularized Polynomial of Degree 25': (model, '#ff7f0f'),
                                      'Regularized Polynomial of Degree 25': (model_regularized, 'green'),
                                      'Regularized Polynomial of Degree 3': (model_regularized_degree, 'skyblue')})
polyfig

In [None]:
util.display_features(model_regularized_degree.best_estimator_.named_steps['ridge'])

Run the cell below to set up the next slide.

In [None]:
from sklearn.metrics import mean_squared_error
unregularized_train = mean_squared_error(y_train, model.predict(X_train))
unregularized_test = mean_squared_error(y_test, model.predict(X_test))
regularized_lambda_train = mean_squared_error(y_train, model_regularized.predict(X_train))
regularized_lambda_validation = (-model_regularized.cv_results_['mean_test_score']).min()
regularized_lambda_test = mean_squared_error(y_test, model_regularized.predict(X_test))
regularized_lambda_degree_train = mean_squared_error(y_train, model_regularized_degree.predict(X_train))
regularized_lambda_degree_validation = (-model_regularized_degree.cv_results_['mean_test_score']).min()
regularized_lambda_degree_test = mean_squared_error(y_test, model_regularized_degree.predict(X_test))
results_df = pd.DataFrame(index=['training MSE', 'average validation MSE (across all folds)', 'test MSE']).assign(
    unregularized=[unregularized_train, np.nan, unregularized_test],
    regularized_lambda_only=[regularized_lambda_train, regularized_lambda_validation, regularized_lambda_test],
    regularized_lambda_and_degree=[regularized_lambda_degree_train, regularized_lambda_degree_validation, regularized_lambda_degree_test]
)

In [None]:
# Raw strings (r'...') keep the backslash in \lambda from being read as an escape sequence.
reprs = {'unregularized': '<b><span style="color:#ff7f0f">Unregularized (Degree 25)</span></b>',
         'regularized_lambda_only': r'<b><span style="color:green">Regularized (Degree 25)<br><small>Used cross-validation to choose $\lambda$</small></span></b>',
         'regularized_lambda_and_degree': r'<b><span style="color:skyblue">Regularized (Degree 3)<br><small>Used cross-validation to choose $\lambda$ and degree</small></span></b>'}

In [None]:
results_df_str = results_df.to_html()
for rep in reprs:
    results_df_str = results_df_str.replace(rep, reprs[rep])

### Comparing training, validation, and test errors

- Let's compare the training and testing error of the three polynomials below.

In [None]:
polyfig

In [None]:
display(HTML(results_df_str))

- It seems that the <b><span style="color:skyblue">regularized polynomial, in which we used cross-validation to choose both the regularization penalty $\lambda$ **and** degree</span></b>, generalizes best to unseen data.

### What's next?

- Could we have chosen a different method of penalizing each $w_j$ other than $w_j^2$?<br><small>We're about to see another option!</small>

- Ridge regression's objective function happened to have a closed-form solution.<br>What if we want to minimize a function that **can't** be minimized by hand?<br><small>We'll talk about how in the next lecture!</small>

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What questions do you have about ridge regression?

## LASSO 📿

---

### Penalizing large parameters

- The ridge regression objective function,
    $$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$
    minimizes mean squared error, **plus** a **squared** penalty on the size of the fit coefficients, $w_1^*, w_2^*, ..., w_d^*$.

- Could we have **regularized**, or penalized the coefficients, in some other way?

- The **LASSO** objective function penalizes the **absolute value** of each coefficient:

    $$R_\text{LASSO}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d |w_j| $$

- LASSO stands for "least absolute shrinkage and selection operator."<br><small>We'll make sense of this name shortly.</small>

### Aside: Vector norms

- The $L_2$ norm, or Euclidean norm, of a vector $\vec v \in \mathbb{R}^n$ is defined as:

$$\lVert \vec v \rVert = \lVert \vec v \rVert_2 = \sqrt{v_1^2 + v_2^2 + ... + v_n^2} = \big(v_1^2 + v_2^2 + ... + v_n^2 \big)^\frac{1}{2} $$

<center><small>The $L_2$ norm is the default norm, which is why the subscript 2 is often omitted.</small></center>

- The $L_p$ norm of $\vec v$, for $p \geq 1$, is:

$$\lVert \vec v \rVert_p = \big(|v_1|^p + |v_2|^p + ... + |v_n|^p \big)^\frac{1}{p}$$

- Ridge regression is said to use $L_2$ regularization because it penalizes the (squared) $L_2$ norm of $\vec w$, ignoring the intercept term:

    $$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$


- LASSO is said to use $L_1$ regularization because it penalizes the $L_1$ norm of $\vec w$, ignoring the intercept term:

    $$R_\text{LASSO}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d |w_j| $$
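
As a quick worked example of these norms, here is a sketch with a made-up vector, $\vec v = \begin{bmatrix} 3 & -4 \end{bmatrix}^T$, using `np.linalg.norm`.

```python
import numpy as np

v = np.array([3.0, -4.0])

l2_norm = np.linalg.norm(v)          # default is the L2 norm: sqrt(3^2 + (-4)^2) = 5.0
l1_norm = np.linalg.norm(v, ord=1)   # L1 norm: |3| + |-4| = 7.0

ridge_penalty = np.sum(v ** 2)       # squared L2 norm: 25.0
lasso_penalty = np.sum(np.abs(v))    # L1 norm: 7.0

print(l2_norm, l1_norm, ridge_penalty, lasso_penalty)  # 5.0 7.0 25.0 7.0
```

If $\vec v$ held the non-intercept coefficients of a model, the ridge penalty would be $25 \lambda$ and the LASSO penalty would be $7 \lambda$.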

### LASSO in `sklearn`

- Unlike with ridge regression or ordinary least squares, there is no general closed-form solution for $\vec{w}_\text{LASSO}^*$.

- But, it can be estimated using numerical methods, which `sklearn` uses under-the-hood. Let's test it out.<br><small>More on numerical methods soon!</small>
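
To give a flavor of what such a numerical method can look like, below is a simplified sketch of **coordinate descent**, the kind of method `sklearn`'s `Lasso` uses. It is not `sklearn`'s actual implementation: the helper names are hypothetical, the data is synthetic, and there is no intercept. It uses `sklearn`'s scaling of the objective, $\frac{1}{2n} \lVert \vec y - X \vec w \rVert^2 + \alpha \sum_j |w_j|$, so `alpha` here corresponds to $\frac{\lambda}{2}$ in this lecture's notation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning exactly 0 if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * sum(|w_j|), one coordinate at a time."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            residual_without_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual_without_j / n
            # Best value of w[j] with every other coordinate held fixed.
            w[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n)
    return w

# Synthetic data (an assumption): only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

alpha = 0.1
w_sketch = lasso_coordinate_descent(X, y, alpha)
w_sklearn = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=10_000).fit(X, y).coef_

print(np.round(w_sketch, 4))
print(np.round(w_sklearn, 4))
```

The soft-thresholding step is what lets a coordinate land on exactly 0, rather than just near it.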

In [None]:
from sklearn.linear_model import Lasso

- Let's use LASSO to fit a degree 25 polynomial to Sample 1.<br><small>Here, we'll **fix** the degree, and cross-validate to find $\lambda$.</small>

In [None]:
hyperparams = {
    'lasso__alpha': 10.0 ** np.arange(-2, 15)
}
model_regularized_lasso = GridSearchCV(
    estimator=make_pipeline(PolynomialFeatures(25, include_bias=False), Lasso()),
    param_grid=hyperparams,
    scoring='neg_mean_squared_error'
)
model_regularized_lasso.fit(X_train, y_train)

- Our cross-validation routine ends up choosing $\lambda = 0.1$, though on its own, this doesn't really tell us anything.

In [None]:
model_regularized_lasso.best_params_

### Visualizing the regularized degree 25 model, fit with LASSO

- What do the <b><span style="color:red">resulting predictions</span></b> look like, relative to the fit polynomials from earlier in the lecture?

In [None]:
util.plot_given_model_dict(X_train, y_train,
                                     {'Unregularized Polynomial of Degree 25': (model, '#ff7f0f'),
                                      'Regularized Polynomial of Degree 25': (model_regularized, 'green'),
                                      'Regularized Polynomial of Degree 3': (model_regularized_degree, 'skyblue'),
                                      'Regularized Polynomial of Degree 25, using LASSO': (model_regularized_lasso, 'red')})

- What do you notice about the coefficients of the polynomial themselves?

In [None]:
util.display_features(model_regularized_lasso.best_estimator_.named_steps['lasso'], precision=8)

- **Important**: Note that we fit a degree 25 polynomial, but many of the higher-order terms are missing, since their coefficients ended up being **exactly** 0!<br><small>There are no $x^{18}, x^{19}, x^{20}, ..., x^{25}$ terms above, and also no $x$ term.</small>

- The <b><span style="color:red">resulting polynomial</span></b> ends up being of degree 17.

### When using LASSO, many coefficients are set to 0!

- When using $L_1$ regularization – that is, when performing LASSO – many of the optimal coefficients $w_1^*, w_2^*, ..., w_d^*$ end up being **exactly 0**.

- This was not the case in ridge regression – there, the optimal coefficients were all very small, but none were exactly 0.

In [None]:
display(Markdown('#### Fit using Ridge:'))
util.display_features(model_regularized.best_estimator_.named_steps['ridge'], precision=8)

In [None]:
util.plot_coefficient_magnitudes(model_regularized.best_estimator_.named_steps['ridge'])

- If a feature has a coefficient of 0, it means it's not being used at all in making predictions.

In [None]:
display(Markdown('#### Fit using LASSO (notice the larger coefficient on $x^3$):'))
util.display_features(model_regularized_lasso.best_estimator_.named_steps['lasso'], precision=8)

In [None]:
util.plot_coefficient_magnitudes(model_regularized_lasso.best_estimator_.named_steps['lasso'])

- LASSO implicitly performs **feature selection** for us – it automatically tells us which features we don't need to use.<br><small>Here, it told us "don't use $x$, don't use $x^{18}$, don't use $x^{19}$, ..., don't use $x^{25}$, and instead weigh the $x^2$ and $x^3$ terms more."</small>

- This is where the name "least absolute shrinkage and **selection** operator" comes from.
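
The same contrast between ridge and LASSO shows up on a small synthetic dataset. In this sketch (the data and the choice of `alpha=0.5` are assumptions), only the first 2 of 10 features actually influence `y`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption): 10 features, but only the first 2 affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print('LASSO coefficients exactly 0:', int(np.sum(lasso.coef_ == 0)))
print('Ridge coefficients exactly 0:', int(np.sum(ridge.coef_ == 0)))
print('Features LASSO kept:', np.flatnonzero(lasso.coef_))
```

Note that `alpha=0.5` does not mean the same amount of regularization in `Lasso` and `Ridge`, since the two penalties and `sklearn`'s scalings of the two objectives differ. The point is only which coefficients are exactly 0.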

### Why does LASSO encourage sparsity?

- The fact that many of the optimal coefficients – $w_1^*, w_2^*, ..., w_d^*$ – are 0 when performing LASSO is often stated as:

<br><br>

<center><b>LASSO encourages <i>sparsity</i>.</b></center>

- To make sense of this, let's look at the equivalent formulation of LASSO as a **constrained optimization problem**.

    $$\text{minimize} \:\:\:\: \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d | w_j |$$

    is equivalent to:
    
    $$\text{minimize} \:\:\:\:\frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \text{  such that   } \sum_{j = 1}^d | w_j | \leq Q$$

- Again, $Q$ and $\lambda$ are **inversely related**: the larger $Q$ is, the less of a penalty we're putting on the size of $\vec{w}_\text{LASSO}^*$, so the smaller $\lambda$ is.

### Visualizing LASSO as a constrained optimization problem

$$\text{minimize} \:\:\:\:\frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \text{  such that   } \sum_{j = 1}^d |w_j| \leq Q; \qquad \lambda \approx \frac{1}{Q}$$

- As before:

    - The **contour plot of the loss surface** for just the mean squared error component is in <span style="color:#440154"><b>v</b></span><span style="color:#482778"><b>i</b></span><span style="color:#3e4a89"><b>r</b></span><span style="color:#31688e"><b>i</b></span><span style="color:#26828e"><b>d</b></span><span style="color:#1f9a8c"><b>i</b></span><span style="color:#a3d8cf"><b>s</b></span>.
    - The constraint, $\sum_{j = 1}^d |w_j| \leq Q$, is in <span style="color:red"><b>red</b></span>. LASSO says, minimize mean squared error, <span style="color:red"><b>while staying in the red diamond</b></span>.<br><small>The larger $Q$ is, the larger the side length of the diamond is.</small>

In [None]:
util.show_lasso_contour()

- Notice that the <span style="color:red"><b>constraint set</b></span> has clearly defined "corners," which lie on the axes. The axes are where the parameter values, $w_1$ and $w_2$ here, are 0.

- Due to the shape of the constraint set, it's likely that the minimum value of <b><span style="color:blue">mean squared error</span></b>, among all options in the <b><span style="color:red">red diamond</span></b>, will occur at a corner, where some of the parameter values are 0.
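
The corner behavior can be checked by brute force. The sketch below uses a hypothetical loss with circular contours whose unconstrained minimum, $(2, 0.3)$, lies outside both constraint sets, so the constrained minimum must sit on the boundary. It then searches points along the diamond and along the circle for the one with the lowest loss. The loss function, the value of $Q$, and the grid resolutions are all assumptions made for this example.

```python
import numpy as np
from itertools import product

def loss(w1, w2):
    """Hypothetical loss with circular contours, minimized at (2, 0.3)."""
    return (w1 - 2) ** 2 + (w2 - 0.3) ** 2

Q = 1.0

# Points on the boundary of the diamond |w1| + |w2| = Q (its four edges).
t = np.linspace(0, 1, 101)
diamond = np.array([(s1 * Q * (1 - ti), s2 * Q * ti)
                    for s1, s2 in product([1, -1], repeat=2) for ti in t])

# Points on the boundary of the circle w1^2 + w2^2 = Q.
theta = np.linspace(0, 2 * np.pi, 3601)
circle = np.sqrt(Q) * np.column_stack([np.cos(theta), np.sin(theta)])

best_diamond = diamond[np.argmin(loss(diamond[:, 0], diamond[:, 1]))]
best_circle = circle[np.argmin(loss(circle[:, 0], circle[:, 1]))]

print('Best point on the diamond:', best_diamond)  # lands on the corner where w2 = 0
print('Best point on the circle: ', best_circle)   # both coordinates are nonzero
```

On the diamond, the best point is the corner $(1, 0)$, so $w_2$ is exactly 0. On the circle, the best point is just the direction of $(2, 0.3)$ scaled down, so $w_2$ is small but nonzero.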

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What questions do you have about LASSO, or regularization in general?

## Example: Commute times

---

### Another example: Commute times

- Last class, **before we learned about regularization**, we used $k$-fold cross-validation to choose between the following five models that predict commute time in `'minutes'`.

<center><img src="imgs/five-pipelines.png" width=900></center>

- The most complicated model, labeled `departure_hour with poly features + day OHE + month OHE + week`, didn't generalize well to unseen data, relative to simpler models.<br><small>At least, not when we used ordinary least squares to train it.</small>

- Let's use ordinary least squares, ridge regression, **and** LASSO to train the most complicated model from above, and compare the results.

In [None]:
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df['month'] = pd.to_datetime(df['date']).dt.month_name()
df.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('minutes', axis=1), df['minutes'], random_state=23)

### Ordinary least squares for commute times

- The Pipeline below implements the feature transformations necessary to produce the 

    <center><code>departure_hour with poly features + day OHE + month OHE + week</code></center>

    model from the last slide.

In [None]:
week_converter = FunctionTransformer(lambda s: 'Week ' + ((s - 1) // 7 + 1).astype(str),
                                     feature_names_out='one-to-one')
day_of_month_transformer = make_pipeline(week_converter, OneHotEncoder(drop='first'))
# Note the include_bias=False once again!
commute_feature_pipe = make_pipeline(
    make_column_transformer(
        (PolynomialFeatures(3, include_bias=False), ['departure_hour']),
        (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']),
        (day_of_month_transformer, ['day_of_month']),
    )
)

- First, we'll fit a "vanilla" linear regression model, i.e. one that just minimizes mean squared error, with no regularization.

In [None]:
commute_model_ols = make_pipeline(commute_feature_pipe, LinearRegression())
commute_model_ols

- There are no hyperparameters to grid search for here, so we'll just fit the model directly.

In [None]:
commute_model_ols.fit(X_train, y_train)

- We'll keep `commute_model_ols` aside for now, and compare its performance to the fit regularized models in a few moments.

### Ridge regression for commute times

- Again, let's instantiate a Pipeline for the steps we want to execute.

In [None]:
commute_pipe_ridge = make_pipeline(commute_feature_pipe, Ridge())
commute_pipe_ridge

- Now, since we need to choose the regularization penalty, $\lambda$, we'll fit a `GridSearchCV` instance with a hyperparameter grid.

In [None]:
lambdas = 10.0 ** np.arange(-10, 15)
hyperparams = {
    'ridge__alpha': lambdas 
}

In [None]:
commute_model_ridge = GridSearchCV(
    commute_pipe_ridge,
    param_grid = hyperparams,
    scoring='neg_mean_squared_error',
    cv=10
)
commute_model_ridge.fit(X_train, y_train)

- Which $\lambda$ did it choose?<br><small>On its own, this value of $\lambda$ doesn't really tell us anything.</small>

In [None]:
commute_model_ridge.best_params_

### Aside: average validation error vs. $\lambda$

- How did the average validation MSE change with $\lambda$?<br><small>Here, large values of $\lambda$ mean **less complex models**, not more complex.</small>

In [None]:
(
    pd.Series(-commute_model_ridge.cv_results_['mean_test_score'], 
              index=np.log10(lambdas))
    .to_frame()
    .reset_index()
    .plot(kind='line', x='index', y=0)
    .update_layout(xaxis_title=r'$\log_{10}(\lambda)$', yaxis_title='Average Validation MSE')
)

### LASSO for commute times

- Let's instantiate a third Pipeline for the steps we want to execute.

In [None]:
commute_pipe_lasso = make_pipeline(commute_feature_pipe, Lasso())
commute_pipe_lasso

- Again, we'll grid search to find the best $\lambda$.

In [None]:
lambdas = 10.0 ** np.arange(-10, 15)
hyperparams = {
    'lasso__alpha': lambdas 
}

In [None]:
commute_model_lasso = GridSearchCV(
    commute_pipe_lasso,
    param_grid = hyperparams,
    scoring='neg_mean_squared_error',
    cv=10
)
commute_model_lasso.fit(X_train, y_train)

- Which $\lambda$ did it choose? Is it the same as when we used ridge regression?<br><small>On its own, this value of $\lambda$ doesn't really tell us anything.</small>

In [None]:
commute_model_lasso.best_params_

Run the cell below to set up the next slide.

In [None]:
commute_results = pd.concat([
    util.display_commute_coefs(commute_model_ols),
    util.display_commute_coefs(commute_model_ridge.best_estimator_),
    util.display_commute_coefs(commute_model_lasso.best_estimator_)
], axis=1)
commute_results.columns = ['ols', 'ridge', 'lasso']

### Comparing coefficients across models

- What do the resulting coefficients look like in all three models?

In [None]:
display_df(commute_results, rows=22)

- The coefficients in the OLS model tend to be the largest in magnitude.

- In the ridge model, the coefficients are all generally small, but none are 0.

- In the LASSO model, many coefficients are 0 exactly.

### Feature selection

- Which features did LASSO "select", i.e. assign a nonzero coefficient to?

In [None]:
display_df(
    commute_results.loc[commute_results['lasso'] != 0, 'lasso'],
    rows=22
)

- How does this change if we **increase** the regularization penalty, $\lambda$?

In [None]:
def control_alpha(lamb):
    commute_pipe_lasso = make_pipeline(commute_feature_pipe, Lasso(alpha=lamb))
    commute_pipe_lasso.fit(X_train, y_train)
    coefs = commute_pipe_lasso[-1].coef_
    names = commute_pipe_lasso[0].get_feature_names_out()
    s = pd.Series(coefs, index=names)
    fig = px.bar(x=s, y=s.index, title=f'Coefficients using LASSO with lambda={lamb}', height=800, width=800)
    fig.update_layout(xaxis_title='Coefficient', yaxis_title='Feature')
    return fig
interact(control_alpha, lamb=(0, 3, 0.01));

### Comparing training and test errors across models

In [None]:
model_dict = {'ols': commute_model_ols, 'ridge': commute_model_ridge, 'lasso': commute_model_lasso}
df = pd.DataFrame().assign(**{
    'Model': model_dict.keys(),
    'Training MSE': [mean_squared_error(y_train, model_dict[model].predict(X_train)) for model in model_dict],
    'Test MSE': [mean_squared_error(y_test, model_dict[model].predict(X_test)) for model in model_dict]
}).set_index('Model')
df.plot(kind='barh', barmode='group')

- The best-fitting LASSO model seems to have a lower training and testing MSE than the best-fitting ridge model.

- But, in general, sometimes LASSO performs better on unseen data, and sometimes ridge does. Cross-validate!<br><small>Sometimes, machine learning practitioners say "there's no free lunch" – there's no universal always-best technique to use to make predictions, it always depends on the specific data you have.</small>

### Standardize when regularizing

- As we discussed a few lectures ago, by **standardizing** our features, we bring them all to the same scale.

- Standardizing features in ordinary least squares doesn't change our model's **performance**; rather, it impacts the interpretability of the coefficients.

- But, when regularizing, we're penalizing the sizes of the coefficients, which can be on wildly different scales if the features are on different scales.

$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \mathbf{+} \underbrace{\lambda \sum_{j = 1}^d w_j^2}_{\substack{\text{regularization} \\ \text{penalty}}}$$

- So, **when regularizing a linear model, you should standardize the features first**, so the coefficients for all features are on the same scale, and are penalized equally.
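
Here is a minimal sketch of why this matters, on synthetic data (an assumption). The same feature is expressed in two different units, and a ridge model is fit to each version, with and without a `StandardScaler` step. The helper `fit_predict` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): think of column 0 as a distance in kilometers.
rng = np.random.default_rng(0)
X_km = rng.normal(size=(50, 2))
y = 2 * X_km[:, 0] + X_km[:, 1] + rng.normal(scale=0.5, size=50)

# The same data, but with column 0 re-expressed in meters.
X_m = X_km.copy()
X_m[:, 0] *= 1000

def fit_predict(model, X):
    """Fit the model on (X, y) and return its predictions on X."""
    return model.fit(X, y).predict(X)

# Without standardizing, changing units changes the ridge predictions.
raw_km = fit_predict(Ridge(alpha=10), X_km)
raw_m = fit_predict(Ridge(alpha=10), X_m)

# With standardizing, the unit change is undone before the penalty is applied.
scaled_km = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=10)), X_km)
scaled_m = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=10)), X_m)

print(np.allclose(raw_km, raw_m))        # False
print(np.allclose(scaled_km, scaled_m))  # True
```

Without standardizing, the feature measured in meters needs a coefficient 1000 times smaller, so it is barely penalized. After standardizing, the choice of units no longer affects the fit model.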

In [None]:
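# In other words, commute_feature_pipe should've been this!
# Caveat: if the column transformer's output is sparse (possible with one-hot encoding),
# StandardScaler needs with_mean=False or a dense input.
make_pipeline(commute_feature_pipe, StandardScaler())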

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What questions do you have about regularization in general?