In [None]:
from lec_utils import *
import lec18_util as util
def show_cv_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTydTrLDr-y4nxQu1OMsaoqO5EnPEISz2VYmM6pd83ke8YnnTBJlp40NfNLI1HMgoaKx6GBKXYE4UcA/embed?start=false&loop=false&delayms=60000&rm=minimal"
    display(IFrame(src, width=900, height=361))


<div class="alert alert-info" markdown="1">

#### Lecture 18

# Generalization and Cross-Validation

### EECS 398: Practical Data Science, Spring 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/sp25">github.com/practicaldsc/sp25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/78535/discussion/6647877) </small>
    
</div>


### Agenda 📆

- Generalization 🔭.
- Hyperparameters and train-test splits 🎛️.
- Cross-validation.



For additional reading, take a look at [mlu-explain.github.io](https://mlu-explain.github.io/), a site with interactive explanations for a lot of core machine learning topics, like:
- [Linear Regression](https://mlu-explain.github.io/linear-regression/).
- [The Bias-Variance Tradeoff](https://mlu-explain.github.io/bias-variance/).
- [Train, Test, and Validation Sets](https://mlu-explain.github.io/train-test-validation/).
- [Cross-Validation](https://mlu-explain.github.io/cross-validation/).
- and other ideas we'll see later in the semester!

We've linked these articles in the [Resources](https://practicaldsc.org/resources) tab of the course website, too.

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Remember that you can always ask questions anonymously at the link above!

</div>

## Generalization 🔭

---

### Motivation

- You and Billy are studying for an upcoming exam. You both decide to test your understanding by taking a **practice exam**.<br><small>Your logic: If you do well on the practice exam, you should do well on the real exam.</small>

- You each take the practice exam once and look at the solutions afterwards.

- **Your strategy**: Memorize the answers to all practice exam questions, e.g. "Question 1: A; Question 2: C; Question 3: A."

- **Billy's strategy**: Learn high-level concepts from the solutions, e.g. "the TF-IDF of term $t$ in document $d$ is large when $t$ occurs often in $d$ but rarely overall."

- Who will do better on the **practice exam**? Who will probably do better on the **real exam**? 🧐

<center><img src="imgs/interpolation.png" width=700></center>

### Evaluating the quality of a model

- So far, we've computed the MSE of our fit regression models on the **data that we used to fit them**, i.e. the **training data**.<br><small>This mean squared error is called the **training MSE**, or **training error**.</small>

- We've said that Model A is <b><span style="color:green">better</span></b> than Model B if Model A's MSE is <b><span style="color:green">lower</span></b> than Model B's MSE.
    - Remember, our training data is a sample from some population.
    - Just because a model fits the training data well doesn't mean it will **generalize** and work well on **similar, unseen samples** from the same population!

### Overfitting and underfitting

- Let's collect two samples $\{(x_i, y_i)\}$ from the same population.

In [None]:
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + (np.random.normal(0, 3, size=n))
    return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
sample_2 = sample_from_pop()

- For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly **cubic**; that is, $y \approx x^3$.<br><small>Remember, in reality, you won't get to see the population distribution. If you could, there'd be no need to build a model!</small>

In [None]:
px.scatter(sample_1, x='x', y='y', title='Sample 1')

### Polynomial regression

- Let's fit three **polynomial** models on Sample 1: degree 1, degree 3, and degree 25.<br><small>We'll use the `PolynomialFeatures` transformer, which was part of one of our Pipelines from last class.</small>

In [None]:
from sklearn.preprocessing import PolynomialFeatures
# fit_transform fits and transforms the same input.
# We tell it not to add a column of 1s, because
# LinearRegression() does this automatically later on.
d3 = PolynomialFeatures(3, include_bias=False)
d3.fit_transform(np.array([1, 2, 3, 4, -2]).reshape(-1, 1))

- Below, we look at our three models' predictions on Sample 1, which they were **trained** on.

In [None]:
# Look at the definition of train_and_plot in lec18_util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25], data_name='Sample 1')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')

- The degree 25 polynomial has the lowest MSE on Sample 1.

- How do the same fit polynomials look on Sample 2?

In [None]:
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25], data_name='Sample 2')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')

- The degree 3 polynomial has the lowest MSE on Sample 2. 

- Note that **we didn't get to see Sample 2 when fitting our models**! 

- As such, it seems that the degree 3 polynomial **generalizes better** to unseen data than the degree 25 polynomial does.

- What if we fit a degree 1, degree 3, and degree 25 polynomial **on Sample 2** as well?

In [None]:
fig = util.plot_multiple_models(sample_1, sample_2, degs=[1, 3, 25])
fig

- **Key idea**: Degree 25 polynomials seem to **vary more when trained on different samples** than degree 3 and 1 polynomials do.
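The plots above come from helper functions in `util`, whose source isn't shown here. As a rough sketch of the underlying computation, the code below fits each degree on one sample and reports the MSE on both that sample and a second, unseen one. It draws its own two samples with a local random generator (an assumption, so the numbers won't exactly match the plots), and `draw_sample` and `mse_table` are hypothetical helper names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(23)  # Assumption: a local generator, not the global seed used above.

def draw_sample(n=100):
    # Same population as above: y = x^3 plus Normal(0, 3) noise.
    x = np.linspace(-2, 3, n)
    return pd.DataFrame({'x': x, 'y': x ** 3 + rng.normal(0, 3, size=n)})

def mse_table(train, test, degs):
    # Hypothetical helper: fit each degree on `train` only, then score on both samples.
    rows = {}
    for d in degs:
        pl = make_pipeline(PolynomialFeatures(d, include_bias=False), LinearRegression())
        pl.fit(train[['x']], train['y'])
        rows[d] = {
            'Training MSE': mean_squared_error(train['y'], pl.predict(train[['x']])),
            'Unseen-sample MSE': mean_squared_error(test['y'], pl.predict(test[['x']])),
        }
    return pd.DataFrame(rows).T.rename_axis('Degree')

demo_1, demo_2 = draw_sample(), draw_sample()
mses = mse_table(demo_1, demo_2, degs=[1, 3, 25])
print(mses)
```

Comparing the two columns row by row shows how much each model's error changes when it is scored on data it wasn't fit on.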

### Bias and variance

- The training data we have access to is a sample from the population. We are concerned with our model's ability to **generalize** and work well on **different datasets** drawn from the same population.

- Suppose we **fit** a model $H^*$ (e.g. a degree 3 polynomial) on **several different datasets** from a population.

- There are three main sources of error in our model's predictions:
    - Model bias: Two slides from now.
    - Model variance: Next slide.
    - **Observation error**: The error due to the random noise in the process we are trying to model (e.g. measurement error). This noise is part of the data-generating process, so no choice of model can reduce it.


### Model variance

- **Model variance: The variance of a model's predictions, across all datasets.**

- In other words, for a given $\vec x_i$, how much does $H^*(\vec x_i)$ vary across all datasets?

In [None]:
fig

- Low model variance is good! ✅

- High model variance is a sign of **overfitting**, i.e. that our model is too **complicated** and is prone to fitting to the noise in our training data.

### Model bias

- **Model bias: The averaged deviation between a predicted value and an actual value, across all datasets**.

- In other words, for a given $\vec x_i$, how far is $H^*(\vec x_i)$ from the true $y_i$, on average?

In [None]:
fig = util.plot_multiple_models(sample_1, sample_2, degs=[1, 3, 25], data=True)
fig

- Low bias is good! ✅

- High bias is a sign of **underfitting**, i.e. that our model is too **basic** to capture the relationship between our features and response.
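Because we're simulating the population here, we can estimate model bias and model variance directly by fitting the same kind of model on many datasets. The sketch below is only possible in a simulation (in real life we never see the true $x^3$ relationship). It uses degrees 1, 3, and 10 rather than 25 to keep the fits fast and numerically stable, which is an assumption of this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-2, 3, 100)
X = x.reshape(-1, 1)
true_y = x ** 3          # Known only because we are simulating the population.
n_datasets = 200

summary = {}
for d in [1, 3, 10]:
    preds = np.empty((n_datasets, len(x)))
    for i in range(n_datasets):
        y = true_y + rng.normal(0, 3, size=len(x))   # A fresh dataset each time.
        pl = make_pipeline(PolynomialFeatures(d, include_bias=False), LinearRegression())
        preds[i] = pl.fit(X, y).predict(X)
    avg_pred = preds.mean(axis=0)                    # Average prediction at each x.
    summary[d] = {
        'bias^2': np.mean((avg_pred - true_y) ** 2), # Squared bias, averaged over all x.
        'variance': np.mean(preds.var(axis=0)),      # Prediction variance, averaged over all x.
    }

for d, s in summary.items():
    print(f"degree {d:>2}: bias^2 = {s['bias^2']:.3f}, variance = {s['variance']:.3f}")
```

The degree 1 model should show the largest squared bias, while the estimated variance should grow with the degree.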

<center><img src="imgs/image_5.png" width="600"></center>

- Here, suppose:
    - The <span style='color:#c6283f'><b>red bulls-eye</b></span> represents your **true weight and height** 🧍.
    - The <span style='color:#080c6f'><b>dark blue darts</b></span> represent **predictions of your weight and height** using different models that were fit using different samples drawn from the same population. 


- We'd like our models to be in the top left, but in practice that's hard to achieve!

### Risk vs. empirical risk

- Since Lecture 11, we've minimized **empirical risk** to find optimal model parameters $\vec{w}^*$:

$$\vec{w}^* = \underset{\vec{w}}{\text{argmin}} \frac{1}{n} \sum_{i = 1}^n \left( y_i - H(\vec x_i) \right)^2$$

- **Key idea**: A model that works well on past data should work well on future data, if future data looks like past data.

- What we really want is for the:
    - **expected** loss for a new data point $(\vec x_{\text{new}}, y_{\text{new}})$, 
    - drawn from the same population as the training set, to be small.
    
    That is, we want to minimize **risk**:
    $$\text{risk} = \mathbb{E}\left[\left(y_{\text{new}} - H(\vec x_{\text{new}})\right)^2\right]$$

- In general, we don't know the entire population distribution of $x$s and $y$s, so we can't compute risk exactly.<br>That's why we compute **empirical** risk!

$$\mathbb{E}\left[\left(y_{\text{new}} - H(\vec x_{\text{new}})\right)^2\right] \approx \frac{1}{n} \sum_{i = 1}^n \left( y_i - H(\vec x_i) \right)^2 = R(H)$$
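In a simulation, where we control the population, we can compare the two quantities side by side. The sketch below fits a degree 3 polynomial on 100 points, computes its empirical risk on those points, and approximates its risk with the average squared loss on 100,000 fresh points. The population used (uniform $x$ on $[-2, 3]$, cubic signal, Normal(0, 3) noise) and the helper `draw` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(398)

def draw(n):
    # Simulated population (an assumption): x uniform on [-2, 3], y = x^3 + Normal(0, 3) noise.
    x = rng.uniform(-2, 3, size=n)
    return x.reshape(-1, 1), x ** 3 + rng.normal(0, 3, size=n)

X_small, y_small = draw(100)
H = make_pipeline(PolynomialFeatures(3, include_bias=False), LinearRegression())
H.fit(X_small, y_small)

# Empirical risk: average squared loss on the data used to fit H.
empirical_risk = np.mean((y_small - H.predict(X_small)) ** 2)

# Monte Carlo approximation of risk: average squared loss on many fresh points.
X_new, y_new = draw(100_000)
approx_risk = np.mean((y_new - H.predict(X_new)) ** 2)

print(f'empirical risk: {empirical_risk:.2f}, approximate risk: {approx_risk:.2f}')
```

Since the noise has variance $3^2 = 9$, the approximate risk can't fall meaningfully below 9, no matter how good the model is.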

### The bias-variance decomposition

- Risk can be decomposed as follows:<br><small>Remember, this expectation $\mathbb{E}$ is over the entire population of $x$s and $y$s. In real life, we don't know what this population distribution is, so we can't put actual numbers to this.</small>

$$\mathbb{E}\left[\left(y_{\text{new}} - H(\vec x_{\text{new}})\right)^2\right] = \text{model bias}^2 + \text{model variance} + \text{observation error}$$

- We won't cover the proof of the decomposition here – read [**this**](https://learningds.org/ch/17/inf_pred_gen_prob.html#probability-behind-model-selection) for more – but note that in Homework 6, Question 3, you proved a related formula for $R_\text{sq}(h)$:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2 = \underbrace{\frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2}_{\text{variance of } y} + (\bar{y} - h)^2$$
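As a quick sanity check, the identity for $R_\text{sq}(h)$ above can be verified numerically. The values of `y` and `h` below are arbitrary choices for illustration; any others should work equally well.

```python
import numpy as np

y = np.array([9, 1, 58, 3, 6, 4, -2, 8, 1, 10, 1.1, -45])  # Arbitrary values.
h = 2.5                                                     # Arbitrary constant prediction.

lhs = np.mean((y - h) ** 2)               # R_sq(h): mean squared error of the constant h.
rhs = np.var(y) + (np.mean(y) - h) ** 2   # Variance of y, plus (y-bar - h)^2.

print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding.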

- **Key takeaway**: If we care about minimizing (empirical) risk, we can equivalently try to minimize both model bias and model variance.

- As model variance increases, model bias tends to decrease, and vice versa.<br>That is, there is a **tradeoff** between bias and variance:

<center><img src="imgs/bv-decomp.svg" width=800></center>

## Hyperparameters and train-test splits 🎛️

---

### Example: Polynomial regression

- We recently looked at an example of **polynomial regression**.

In [None]:
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25], data_name='Sample 2')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')

- When building these models:
    - We **got to choose** the degree of the polynomials – we chose 1, 3, and 25.
    - We didn't get to choose the exact formulas for the three polynomials – their formulas were **learned from data**.<br><small>No matter what the data looked like, the left-most model **had** to look like a line, because we chose its degree in advance.</small>

### Hyperparameters

- A **parameter** defines the relationship between variables in a model. **We learn parameters from data**.
    - For instance, suppose we fit a degree 3 polynomial to data, and end up with:
    
    $$H^*(x_i) = 1 - 2x_i + 13x_i^2 - 4x_i^3$$
    
    - 1, -2, 13, and -4 are parameters.

- A **hyperparameter** is a parameter that we choose _before_ our model is fit to the data.
    - Think of hyperparameters as knobs 🎛 – we get to pick and tune them!
    - **Polynomial degree** was a hyperparameter in the previous example, and we tried three different values: 1, 3, and 25.

- **Question**: How do we choose the "right" hyperparameter(s)?<br><small>Degree 3 was a better choice than degree 25, for example – but how do we systematically choose?</small>
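To make the distinction between parameters and hyperparameters concrete, here's a minimal sketch. It generates noise-free data from the example polynomial above (an assumption made so that the learned values are easy to recognize), chooses the degree by hand, and lets `fit` learn the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-2, 3, 50)
y = 1 - 2 * x + 13 * x ** 2 - 4 * x ** 3   # Noise-free data from the example polynomial.

degree = 3   # Hyperparameter: we choose this before fitting.
pl = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
pl.fit(x.reshape(-1, 1), y)

# Parameters: learned from the data during fit.
intercept = pl[-1].intercept_
coefs = pl[-1].coef_
print(intercept, coefs)
```

Changing `degree` changes the shape of model we're allowed to learn; the intercept and coefficients are whatever `fit` finds within that shape.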

### Train-test splits 🚆

- Suppose we're choosing between many different models.<br><small>Here, by "model" we really mean "hyperparameter value", e.g. one "model" is a degree 3 polynomial, while another is a degree 4 polynomial.</small>

- We won't know whether a model has **overfit** to our sample unless we get to see how well it performs on a new sample from the same population.

- 💡**Idea**: **Split** our dataset into a <span style='color: blue'><b>training set</b></span> and <span style='color: orange'><b>test set</b></span>.

<center><img src='imgs/train-test-first.png' width=700></center>

- For each model we're considering (e.g. each polynomial degree):
    - Use **only** the training set to fit that model (i.e. find $\vec{w}^*$).
    - Use the test set to evaluate that model's error (e.g. compute its MSE).

- Pick the model with the **lowest** test error.

- Why? The test set is like a new sample of data from the same population as the training data!

### Train-test split 

- `sklearn.model_selection.train_test_split` implements a train-test split for us! 🙏🏼 

- If `X` is an array/DataFrame of features and `y` is an array/Series of responses,
    ```py
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    ```
    randomly splits the features and responses into training and test sets, such that the test set contains 25% of the full dataset.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Read the documentation!
train_test_split?

- Let's perform a train/test split on `sample_1`, since we'll need it to find the optimal polynomial degree.

In [None]:
sample_1

In [None]:
X = sample_1[['x']] # DataFrame. 
y = sample_1['y'] # Series. 
# We don't have to choose 0.2.
# We also don't have to set a random_state;
# we've done this so that we get the same results in lecture every time.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23) 

- Before proceeding, let's check the sizes of `X_train` and `X_test`.

In [None]:
print('Rows in X_train:', X_train.shape[0])
display(X_train.head())
print('Rows in X_test:', X_test.shape[0])
display(X_test.head())

### Remember: Train _only_ using the training data!

- Now that we've performed a train/test split of Sample 1, we'll create models with degree 1 through 25 polynomial features and compute their train and test errors.

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
train_errs = []
test_errs = []
for d in range(1, 26):
    pl = make_pipeline(PolynomialFeatures(d, include_bias=False), LinearRegression())
    pl.fit(X_train, y_train)
    train_errs.append(mean_squared_error(y_train, pl.predict(X_train)))
    test_errs.append(mean_squared_error(y_test, pl.predict(X_test)))
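# Index the rows by degree (1 through 25), so that row labels and plot axes match the degree.
errs = pd.DataFrame({'Train Error': train_errs, 'Test Error': test_errs}, index=range(1, 26))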
errs

- **Notice that we only call `pl.fit` on the training data!**

### Polynomial degree vs. train/test error

- Let's look at the plots of training error vs. degree and test error vs. degree.

In [None]:
fig = px.line(errs.iloc[:-1])
fig.update_layout(showlegend=True, xaxis_title='Polynomial Degree', yaxis_title='Mean Squared Error')

- Training error appears to decrease as polynomial degree increases.

- Test error appears to decrease until a "valley", and then increases again.

- Here, we'd choose a degree of 3, since that degree has the **lowest test error**.

### Training error vs. test error

- The pattern we saw in the previous example is true more generally.

<center><img src='imgs/tt-errors.png' width=600></center>

- We pick the hyperparameter(s) at the "valley" of test error.

- Note that training error **tends** to underestimate test error, but it doesn't have to – i.e., it is possible for test error to be lower than training error (say, if the test set is "easier" to predict than the training set).

- The results – and the bias-variance tradeoff more generally – hold true for "classic" machine learning models, like the ones we're studying here.<br>But in deep neural networks, this pattern is often violated; extremely complex models can have low test error as well.<br><small>This phenomenon is known as "double descent"; learn more [**here**](https://en.wikipedia.org/wiki/Double_descent).</small>

### Conducting train-test splits

- Recall, <span style='color: blue'><b>training data</b></span> is used to fit our model, and <span style='color: orange'><b>test data</b></span> is used to evaluate our model.

<center><img src='imgs/train-test-first.png' width=40%></center>

- **Question**: _How_ should we split?
    - `sklearn`'s `train_test_split` splits **randomly**, which usually works well.
    - However, if there is some element of **time** in the training data (say, when predicting the future price of a stock), a better split is "past" and "future".

- **Question**: How _large_ should the split be, e.g. 90%-10% vs. 75%-25%?
    - There's a tradeoff – a larger training set should lead to a "better" model, while a larger test set should lead to a better estimate of our model's ability to generalize.
    - There's no "right" choice, but we usually choose between 10% and 25% for the test set.
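For data with a time component, a "past" vs. "future" split can be done by sorting and slicing rather than shuffling. Here's a minimal sketch; the `prices` DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily price data; the column names are made up for illustration.
rng = np.random.default_rng(1)
prices = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=200, freq='D'),
    'price': 100 + rng.normal(0, 1, size=200).cumsum(),
})

prices = prices.sort_values('date')
cutoff = int(len(prices) * 0.8)   # The first 80% of rows are the "past".
train = prices.iloc[:cutoff]
test = prices.iloc[cutoff:]       # The last 20% of rows are the "future".

print(train['date'].max(), test['date'].min())
```

If the rows are already sorted by time, `train_test_split(..., shuffle=False)` produces the same kind of split.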

### But wait...

- With our current strategy, we are choosing the hyperparameter that creates the model that **performs best on the test set**.

- As such, we are **overfitting to the test set** – the best hyperparameter for the test set might not be the best hyperparameter for a totally unseen dataset!

- It seems like we need **another** split.

## Cross-validation

---

### Idea: A single validation set

<center><img src='imgs/train-test-val.png' width=500></center>

1. Split the data into three sets: <span style='color: blue'><b>training</b></span>, <span style='color: green'><b>validation</b></span>, and <span style='color: orange'><b>test</b></span>.

2. For each hyperparameter choice, <span style='color: blue'><b>train</b></span> the model only on the <span style='color: blue'><b>training set</b></span>, and <span style='color: green'><b>evaluate</b></span> the model's performance on the <span style='color: green'><b>validation set</b></span>.

3. Find the hyperparameter with the best <span style='color: green'><b>validation</b></span> performance.

4. Retrain the final model on the <span style='color: blue'><b>training</b></span> and <span style='color: green'><b>validation</b></span> sets, and report its performance on the <span style='color: orange'><b>test set</b></span>.

- **Issue**: This strategy is too dependent on the <span style='color: green'><b>validation</b></span> set, which may be small and/or not a representative sample of the data. **We're not going to do this.** ❌

### A better idea: $k$-fold cross-validation

- Instead of relying on a single <span style='color: green'><b>validation</b></span> set, we can create $k$ <span style='color: green'><b>validation</b></span> sets, where $k$ is some positive integer (5 in the example below).

<center><img src='imgs/k-fold.png' width=500></center>

- Since each data point is used for <span style='color: blue'><b>training</b></span> $k-1$ times and <span style='color: green'><b>validation</b></span> once, the (averaged) <span style='color: green'><b>validation</b></span> performance should be a good metric of a model's ability to generalize to unseen data.


- $k$-fold cross-validation (or simply "cross-validation") is **the** technique we will use for finding hyperparameters, or more generally, for choosing between different possible models. **It's what you should use in your Final Project!** ✅

### Illustrating $k$-fold cross-validation

- To illustrate $k$-fold cross-validation, let's use the following example dataset with $n = 12$ rows.<br><small>Suppose this dataset represents our **training set**, i.e. suppose we already performed a train-test split.</small>

In [None]:
training_data = pd.DataFrame().assign(x=range(0, 120, 10),
                                      y=[9, 1, 58, 3, 6, 4, -2, 8, 1, 10, 1.1, -45])        
display_df(training_data, rows=12)

- Suppose we choose $k = 4$. Then, each fold has $\frac{12}{4} = 3$ rows.

In [None]:
show_cv_slides()

### $k$-fold cross-validation, in general

- First, **shuffle** the entire training set randomly and **split** it into $k$ disjoint folds, or "slices". Then:

- For each hyperparameter:
    - For each slice:
        - Let the slice be the "validation set", $V$.
        - Let the rest of the data be the "training set", $T$.
        - Train a model using the selected hyperparameter on the training set $T$.
        - Evaluate the model on the validation set $V$.
    - Compute the **average** validation error (e.g. MSE) for the particular hyperparameter.

- Choose the hyperparameter with the **lowest** average validation error.
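The procedure above can be written out by hand in a few lines. The sketch below is one possible implementation, not what `sklearn` does internally. It assumes a simulated training set of 80 points from the cubic population, $k = 4$, and candidate degrees 1 through 6.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(23)
x = np.linspace(-2, 3, 80)
X, y = x.reshape(-1, 1), x ** 3 + rng.normal(0, 3, size=80)   # Simulated training set.

k = 4
# Shuffle the row positions, then slice them into k disjoint folds.
folds = np.array_split(rng.permutation(len(X)), k)

avg_val_mse = {}
for degree in range(1, 7):                 # For each hyperparameter...
    fold_mses = []
    for i in range(k):                     # ...and for each slice:
        val_idx = folds[i]                 # The slice is the validation set V.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])   # The rest is T.
        pl = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
        pl.fit(X[train_idx], y[train_idx])
        fold_mses.append(np.mean((y[val_idx] - pl.predict(X[val_idx])) ** 2))
    avg_val_mse[degree] = np.mean(fold_mses)   # Average validation MSE for this degree.

best_degree = min(avg_val_mse, key=avg_val_mse.get)
print(avg_val_mse, best_degree)
```

Each row lands in exactly one fold, so it is used for validation once and for training $k - 1$ times per hyperparameter.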

### `GridSearchCV`

- Let's use $k$-fold cross-validation to choose a polynomial degree that best generalizes to unseen data.<br>As before, we'll choose our polynomial degree from the list [1, 2, ..., 25].

- `GridSearchCV` takes in:
    - an **un-`fit`** instance of an estimator, and
    - a **dictionary** of hyperparameter values to try,
    
  and performs $k$-fold cross-validation to find the **combination of hyperparameters** with the best average validation performance.

In [None]:
from sklearn.model_selection import GridSearchCV
GridSearchCV?

- Why do you think it's called "grid search"?

### Grid searching for the best polynomial degree

- Here, we want to try values of degree from 1 through 25, so we'll need to specify these values in a dictionary.

In [None]:
# The key names in this dictionary are chosen very carefully.
# They need to be of the format pipelinestep__hyperparametername (two underscores),
# where pipelinestep is the name of the pipeline step that we want to tune
# (make_pipeline names each step with the lowercase version of its class name), and
# hyperparametername is the formal name of the hyperparameter (see the documentation).
hyperparams = {
    'polynomialfeatures__degree': range(1, 26)
}
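If you're unsure what the valid key names are, you can ask the pipeline itself. A small sketch: `steps` lists the automatically generated step names, and `get_params` lists every name that can appear as a key in the dictionary.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

pl = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())

# Step names generated by make_pipeline.
step_names = [name for name, _ in pl.steps]
print(step_names)

# Every tunable name, in the stepname__hyperparametername format.
tunable = sorted(name for name in pl.get_params() if '__' in name)
print(tunable)
```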

- The scoring metric we need to provide is `'neg_mean_squared_error'`.<br><small>The `scoring` argument is used to specify that we want to compute the MSE; by default, it computes the $R^2$. It's called "neg" MSE because, by default, `sklearn` likes to "maximize" scores, and maximizing -MSE is the same as minimizing MSE.</small>

In [None]:
searcher = GridSearchCV(
    make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression()),
    param_grid=hyperparams,
    cv=5, # k = 5.
    scoring='neg_mean_squared_error'
)
searcher

- Like any other estimator, `GridSearchCV` instances need to be `fit`.<br><small>Again, notice that we're only fitting it with our training data.</small>

In [None]:
searcher.fit(X_train, y_train)

- Once fit, `searcher` can tell us what it found!

In [None]:
searcher.best_params_

- `searcher` is now a fit regression model. There's no _need_ to refit it on the entire training set; it was already fit on the entire training set automatically.

In [None]:
# Predict using a DataFrame with the same column name that the model was fit on.
searcher.predict(pd.DataFrame({'x': [4, -1, 0]}))
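The automatic refit happens because `GridSearchCV`'s `refit` argument defaults to `True`. The sketch below shows this on a small simulated dataset; `X_demo`, `y_demo`, and the reduced grid of degrees 1 through 8 are assumptions made to keep the example quick and self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(23)
x = rng.uniform(-2, 3, size=80)
X_demo = pd.DataFrame({'x': x})
y_demo = x ** 3 + rng.normal(0, 3, size=80)

demo_searcher = GridSearchCV(
    make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression()),
    param_grid={'polynomialfeatures__degree': range(1, 9)},  # Smaller grid, to keep this quick.
    cv=5,
    scoring='neg_mean_squared_error',
)   # refit=True is the default.
demo_searcher.fit(X_demo, y_demo)

best_pl = demo_searcher.best_estimator_   # A Pipeline, refit on all of X_demo and y_demo.
new_points = pd.DataFrame({'x': [4, -1, 0]})

print(demo_searcher.best_params_)
print(-demo_searcher.best_score_)         # Average validation MSE of the chosen degree.
print(best_pl.predict(new_points))
```

Calling `demo_searcher.predict` gives the same predictions as calling `predict` on `best_estimator_`.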

### Interpreting the results of $k$-fold cross-validation

- Let's peek under the hood.

In [None]:
errs_df = util.format_results(searcher)
errs_df

- Note that for each choice of degree (our hyperparameter), we have **five** MSEs, one for each "fold" of the data. This means that in total, $5 \cdot 25 = 125$ models were trained!

In [None]:
errs_df

- Remember, our goal is to choose the **degree** with the **lowest average** validation error.

In [None]:
errs_df.mean(axis=0)

In [None]:
fig = errs_df.mean(axis=0).iloc[:18].plot(kind='line', title='Average Validation Error')
fig.update_layout(xaxis_title='Degree', yaxis_title='Average Validation MSE', showlegend=False)

In [None]:
# This should match the degree that sklearn chose automatically (see searcher.best_params_).
errs_df.mean(axis=0).idxmin()

- Note that if we didn't perform $k$-fold cross-validation, but instead just used a single validation set, we may have ended up with a different result:

In [None]:
errs_df.idxmin(axis=1)
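The source of `util.format_results` isn't shown here, but a table with one row per fold and one column per degree can be built directly from a fit searcher's `cv_results_` attribute. The following is a sketch on a small simulated dataset, with a reduced grid of degrees 1 through 8; it is not necessarily how the helper is implemented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(23)
x = rng.uniform(-2, 3, size=80)
X_demo = pd.DataFrame({'x': x})
y_demo = x ** 3 + rng.normal(0, 3, size=80)

n_folds = 5
demo_searcher = GridSearchCV(
    make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression()),
    param_grid={'polynomialfeatures__degree': range(1, 9)},
    cv=n_folds,
    scoring='neg_mean_squared_error',
).fit(X_demo, y_demo)

cv = demo_searcher.cv_results_
degrees = [p['polynomialfeatures__degree'] for p in cv['params']]

# cv['split0_test_score'] holds fold 1's negated MSE for every degree, and so on.
fold_mses = pd.DataFrame(
    [-cv[f'split{i}_test_score'] for i in range(n_folds)],
    index=[f'Fold {i + 1}' for i in range(n_folds)],
    columns=degrees,
)
print(fold_mses.round(2))
print(fold_mses.mean(axis=0).idxmin())   # Degree with the lowest average validation MSE.
```

Note that `sklearn` calls the validation scores "test" scores in `cv_results_`, even though they come from validation folds within the training set.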

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
        
- Suppose you have a training dataset with 1000 rows.
- You want to decide between 20 different hyperparameter values for a particular model.
- To do so, you perform 10-fold cross-validation.
- **How many times is the first row in the training dataset (`X.iloc[0]`) used for training a model?**

</div>


### Another example: Commute times

- We can also use $k$-fold cross-validation to determine which subset of features to use in a linear model that predicts commute times!

In [None]:
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df['month'] = pd.to_datetime(df['date']).dt.month_name()
df.head()

- Let's make several candidate pipelines. But first, **as always**, a train-test split.

In [None]:
# Here, we're letting X_train and X_test keep all of the columns in the DataFrame
# OTHER than 'minutes'.
X_train, X_test, y_train, y_test = train_test_split(df.drop('minutes', axis=1), df['minutes'], random_state=23)

### Creating many pipelines

In [None]:
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

In [None]:
selecter = FunctionTransformer(lambda x: x) # Shortcut to say "keep just these columns."
week_converter = FunctionTransformer(lambda s: 'Week ' + ((s - 1) // 7 + 1).astype(str))
day_of_month_transformer = make_pipeline(week_converter, OneHotEncoder(drop='first')) # From last class.
pipes = {
    'departure_hour only': make_pipeline(
        make_column_transformer((selecter, ['departure_hour'])),
        LinearRegression()
    ),
    'departure_hour + day_of_month': make_pipeline(
        make_column_transformer((selecter, ['departure_hour', 'day_of_month'])),
        LinearRegression()
    ),
    'departure_hour + day OHE': make_pipeline(
        make_column_transformer(
            (selecter, ['departure_hour']),
            (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day'])
        ),
        LinearRegression()
    ),
    'departure_hour + day OHE + month OHE': make_pipeline(
        make_column_transformer(
            (selecter, ['departure_hour']),
            (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month'])
        ),
        LinearRegression()
    ),
    'departure_hour with poly features + day OHE + month OHE + week': make_pipeline(
        make_column_transformer(
            (PolynomialFeatures(3, include_bias=False), ['departure_hour']),
            (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']),
            (day_of_month_transformer, ['day_of_month']),
        ),
        LinearRegression()
    )
}

- Here, we will have to call `GridSearchCV` once for each pipeline.<br><small>That's because we're choosing between many different pipelines, **not** between hyperparameters for a particular pipeline.</small>

In [None]:
results = pd.DataFrame(columns=['Average Training MSE', 'Average Validation MSE'])
for pipe in pipes:
    fitted = GridSearchCV(
        pipes[pipe],
        param_grid={}, # No hyperparameters, but we could have them.
        scoring='neg_mean_squared_error',
        cv=10, # Change this and see what happens!
        return_train_score=True # So that we can compute training MSEs, too.
    )
    fitted.fit(X_train, y_train)
    results.loc[pipe] = [-fitted.cv_results_['mean_train_score'][0], -fitted.cv_results_['mean_test_score'][0]]
commute_models_summarized = (
    results
    .sort_values('Average Training MSE')
    .plot(kind='barh', barmode='group', width=1000)
    .update_layout(xaxis_title='Mean Squared Error', yaxis_title='Model')
)
commute_models_summarized

- Which model is most likely to perform best in practice? Which model has the highest bias? Variance?
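When there are no hyperparameters to search over, `sklearn.model_selection.cross_validate` is an alternative to `GridSearchCV` with an empty grid; it's a different function from the one used above, shown here only for comparison. The sketch below runs it on a made-up stand-in for the commute data, since the real CSV isn't loaded here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Hypothetical stand-in for the commute data; the numbers are simulated.
rng = np.random.default_rng(0)
hours = rng.uniform(6, 11, size=60)
fake = pd.DataFrame({
    'departure_hour': hours,
    'minutes': 140 - 8 * hours + rng.normal(0, 5, size=60),
})

scores = cross_validate(
    LinearRegression(),
    fake[['departure_hour']], fake['minutes'],
    cv=10,
    scoring='neg_mean_squared_error',
    return_train_score=True,   # So that we get training MSEs, too.
)
avg_train_mse = -scores['train_score'].mean()
avg_val_mse = -scores['test_score'].mean()
print(avg_train_mse, avg_val_mse)
```

Unlike `GridSearchCV`, `cross_validate` only reports scores; it doesn't refit a final model on the whole training set.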

### Summary

- We care about how well our models **generalize** to unseen data.<br><small><ul><li>The more complex a model is, the more it will **overfit** to the noise in the training data, and have high **model variance**.</li><li>The less complex a model is, the more it will **underfit** the training data, and have high **bias**.</li></ul></small>

- To navigate this **bias-variance tradeoff**, we choose model complexity by choosing the model with the lowest error on unseen data.

<center><img src="imgs/tt-errors.png" width=500></center>

- To do so, use cross-validation:
    1. Split the data into two sets: <span style='color: blue'><b>training</b></span> and <span style='color: orange'><b>test</b></span>.
    2. Use only the <span style='color: blue'><b>training</b></span> data when designing, training, and tuning the model.
        - Use <span style='color: green'><b>$k$-fold cross-validation</b></span> to choose hyperparameters and estimate the model's ability to generalize.
        - Do not ❌ look at the <span style='color: orange'><b>test</b></span> data in this step!
    3. Commit to your final model and train it using the entire <span style='color: blue'><b>training</b></span> set.
    4. Test the model using the <span style='color: orange'><b>test</b></span> data. If the performance (e.g. MSE) is not acceptable, return to step 2.
    5. Finally, train on **all available data** and ship the model to production! 🛳

- 🚨 This is the process you should **always** use! 🚨 
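Here's a compact sketch of the whole process on simulated data. The dataset, the grid of degrees 1 through 8, and the variable names are all assumptions made for illustration; the structure of the steps is what matters.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(23)
x = rng.uniform(-2, 3, size=200)
X_all = pd.DataFrame({'x': x})
y_all = pd.Series(x ** 3 + rng.normal(0, 3, size=200), name='y')

# Step 1: train-test split.
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=23)

# Steps 2 and 3: tune with k-fold cross-validation on the training set only.
# GridSearchCV then refits the best pipeline on the entire training set.
search = GridSearchCV(
    make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression()),
    param_grid={'polynomialfeatures__degree': range(1, 9)},
    cv=5,
    scoring='neg_mean_squared_error',
).fit(X_tr, y_tr)

# Step 4: evaluate the chosen model on the test set.
test_mse = mean_squared_error(y_te, search.predict(X_te))

# Step 5: refit an unfitted copy of the chosen pipeline on all available data.
final_model = clone(search.best_estimator_).fit(X_all, y_all)
print(search.best_params_, test_mse)
```

`clone` makes an unfitted copy with the same hyperparameters, so the final fit on all of the data doesn't modify the model we evaluated.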

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
        
What questions do you have?

</div>