In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw09.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 9

# Cross-Validation and Regularization

### EECS 398: Practical Data Science, Winter 2025

#### Due Wednesday, April 2nd at 11:59PM (one day later than usual)
    
</div>

## Instructions

Welcome to Homework 9! In this homework, you'll implement cross-validation to choose model hyperparameters,
understand why ridge regression works the way it does (down to the linear algebraic details), and build more sophisticated `sklearn` Pipelines using these techniques. Only content through Lecture 19 is necessary.

You are given **eight** slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/wn25/). The [Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps.
<div class="alert alert-warning" markdown="1">

<div class="alert alert-warning">
This homework features a mix of autograded programming questions and manually-graded questions.
    
- Question 1 is **manually graded**, like in Homework 8, and its parts say **[Written ✏️]** in the title. For this question, **do not write your answers in this notebook**! Instead, like in Homework 8, write **all** of your answers to the written questions in this homework in a separate PDF. You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in. Submit this separate PDF to the **Homework 9 (Question 1; written problem)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

- Questions 2-3 are **fully autograded**, and their parts say **[Autograded 💻]** in the title. For these questions, all you need to do is write your code in this notebook, run the local `grader.check` tests, and submit to the **Homework 9 (Questions 2-3; autograded problems)** assignment on Gradescope to have your code graded by the hidden autograder. This is the same workflow you followed in previous homeworks.

Your Homework 9 submission time will be the **later** of your two individual submissions.
</div>
</div>

**Make sure to show your work for all written questions! Answers without work shown may not receive full credit.**

    
This homework is worth a total of **39 points**, 15 of which are manually graded and 24 of which come from the autograder. This is not including the potential extra credit provided by Question 3.4, which only 5 students in the class can receive (see Question 3.4 for more details). The number of points each question is worth is listed at the start of each question. **All questions in the assignment are independent, so feel free to move around if you get stuck**, but keep in mind that you'll need to submit this homework twice – one submission for your written problems, and one submission for your autograded problems. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the cell below, plus the cell at the top of the notebook that imports and initializes `otter`. 

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

## Question 1: Ruffles Have Ridges ⛰️

---

In this question, you'll gain deep familiarity with **why** and **how** ridge regression – that is, least squares regression with an $L_2$ regularization penalty – works.

<div class="alert alert-success">
    
To get started, read [**this guide**](https://practicaldsc.org/guides/machine-learning/ridge-regression/) we've written about ridge regression, after completing Lecture 19 (Regularization). Think of it as an extension of the homework spec.
    
</div>

<!-- BEGIN QUESTION -->

### Question 1.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Prove that all of the eigenvalues of $X^TX$, where $X$ is the design matrix, are non-negative.

Some guidance:
- Start by letting $\lambda_i$ and $\vec{v}_i$ be an arbitrary eigenvalue-eigenvector pair of $X^TX$. By the definition of eigenvalues and eigenvectors, what does this mean? (Hint: can $\vec{v}_i = \vec{0}$?)
- Left-multiply both sides of the equation by $\vec{v}_i^T$. What does this give us?
- From here, use the facts that $(AB)^T = B^T A^T$ and that $\lVert \vec u \rVert^2 = \vec u \cdot \vec u$ to show that $\lambda_i \geq 0$.

<!-- END QUESTION -->
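
If you'd like to sanity-check the claim in Question 1.1 numerically before proving it, here is a minimal sketch. The design matrix below is made up, and a numerical check on one matrix is **not** a proof; it only shows what the statement looks like in `numpy`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: 20 rows, a column of 1s plus 3 random features.
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
gram = X.T @ X

# eigh is designed for symmetric matrices; it returns real eigenvalues in
# ascending order and the matching eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(gram)

# Each column v of eigenvectors satisfies (X^T X) v = lambda * v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(gram @ v, lam * v)

print(eigenvalues)
print(eigenvalues.min() >= 0)
```

Your written answer still needs to argue this for an arbitrary $X$, not just for one example.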

<!-- BEGIN QUESTION -->

### Question 1.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

If $\lambda_i$ and $\vec{v}_i$ are an eigenvalue-eigenvector pair of $X^TX$, then show that $\vec{v}_i$ is **also** an eigenvector of $X^TX + n \lambda_\text{ridge} I$, with a different eigenvalue. What is the corresponding eigenvalue?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Putting the results of 1.1 and 1.2 together, explain why it is **guaranteed** that $X^TX + n \lambda_\text{ridge} I$ is invertible, for any $\lambda_\text{ridge} > 0$.

<!-- END QUESTION -->
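
To see what Questions 1.2 and 1.3 are getting at, the sketch below builds a small, made-up design matrix with a duplicated column, so that $X^TX$ is singular, and then compares it to $X^TX + n \lambda_\text{ridge} I$. The matrix and the value of $\lambda_\text{ridge}$ are arbitrary choices for illustration, and the output is a sanity check rather than a substitute for your written argument.

```python
import numpy as np

# Hypothetical design matrix whose third column duplicates the second,
# so the columns of X are linearly dependent.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0],
              [1.0, 7.0, 7.0]])
n, d = X.shape
lam_ridge = 0.5  # arbitrary positive regularization strength

gram = X.T @ X
shifted = gram + n * lam_ridge * np.eye(d)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eig_gram = np.linalg.eigvalsh(gram)
eig_shifted = np.linalg.eigvalsh(shifted)

print(np.linalg.matrix_rank(gram))     # less than d, so gram is not invertible
print(np.linalg.matrix_rank(shifted))  # compare with d
print(eig_shifted - eig_gram)          # how far did each eigenvalue move?
```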

<!-- BEGIN QUESTION -->

### Question 1.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

<div class="alert alert-success">
    
Before proceeding, make sure you've read the last two sections of [**this guide**](https://practicaldsc.org/guides/machine-learning/ridge-regression/), since we'll use definitions and terms from there that aren't in this homework notebook.
    
</div>

Show that:

$$ X_{\text{adj}}^T \vec{y}_\text{adj} = X_c^T \vec{y}_c $$

Some guidance: Make sure you've carefully read our work in Step 1 of [**"The final derivation" in the guide**](https://practicaldsc.org/guides/machine-learning/ridge-regression/#the-final-derivation), because you'll need to follow a similar sequence of reasoning; when multiplying $X_\text{adj}^T \vec{y}_\text{adj}$, many of the terms in the resulting matrix are 0 – make sure you're clear on which terms those are.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.5 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
Step 3 of [**"The final derivation" in the guide**](https://practicaldsc.org/guides/machine-learning/ridge-regression/#the-final-derivation) relates the objective functions we seek to minimize in linear and ridge regression. Again, we'll reason about how the adjusted terms $X_{\text{adj}}$ and $\vec{y}_\text{adj}$, in the context of un-regularized linear regression, relate to the ridge regression objective function. Using the work done in Question 1.4 and before, show that:

$$\frac{1}{n} \lVert \vec{y}_{\text{adj}} - X_{\text{adj}} \vec{w} \rVert_2^2 = \frac{1}{n} \lVert \vec{y}_c - X_c \vec{w} \rVert_2^2 + \lambda_\text{ridge} \sum_{j = 1}^d w_j^2$$

Some guidance: 
- One way to proceed is to recall the definition $\lVert \vec v \rVert_2^2 = \vec v \cdot \vec v = \vec v^T \vec v$ and expand from there. A useful fact is that:

    $$(\vec a - \vec b) \cdot (\vec a - \vec b) = \vec a \cdot \vec a - 2 \vec a \cdot \vec b + \vec b \cdot \vec b,$$

    which follows from the commutative and distributive properties of the dot product. Another way to proceed is to re-write $\lVert \vec v \rVert_2^2 = \sum_{i = 1}^n v_i^2$ and separate sums from there.
- Note that $\sum_{j = 1}^d w_j^2 = \lVert \vec w \rVert_2^2$; we've written it in this summation notation throughout to remain consistent with the notation from class.

<!-- END QUESTION -->
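
The identities in Questions 1.4 and 1.5 can also be checked numerically. The sketch below **assumes** that $X_\text{adj}$ stacks $\sqrt{n \lambda_\text{ridge}} \, I$ underneath $X_c$ and that $\vec{y}_\text{adj}$ stacks $d$ zeros underneath $\vec{y}_c$; this is the usual augmentation trick, but defer to the guide if its definitions differ. All data here is randomly generated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
lam_ridge = 0.7  # arbitrary positive regularization strength

X = rng.normal(size=(n, d))
y = rng.normal(size=n)
X_c = X - X.mean(axis=0)  # mean-centered features
y_c = y - y.mean()        # mean-centered targets

# ASSUMPTION: these definitions of the adjusted quantities may not match
# the guide exactly; they are the standard augmented-data construction.
X_adj = np.vstack([X_c, np.sqrt(n * lam_ridge) * np.eye(d)])
y_adj = np.concatenate([y_c, np.zeros(d)])

# Question 1.4: the extra rows of y_adj are all 0, so they contribute nothing.
print(np.allclose(X_adj.T @ y_adj, X_c.T @ y_c))

# Question 1.5: the identity holds for any w, not just the optimal one.
w = rng.normal(size=d)
lhs = np.sum((y_adj - X_adj @ w) ** 2) / n
rhs = np.sum((y_c - X_c @ w) ** 2) / n + lam_ridge * np.sum(w ** 2)
print(np.isclose(lhs, rhs))
```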

<!-- BEGIN QUESTION -->

### Question 1.6 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Finally, argue why:
   $$
   \vec{w}^* = \left(X_c^T X_c + n \lambda_\text{ridge} I\right)^{-1} X_c^T \vec{y}
   $$

minimizes the ridge regression objective function: $$\frac{1}{n} \lVert \vec{y}_c - X_c \vec{w} \rVert_2^2 + \lambda_\text{ridge} \sum_{j = 1}^d w_j^2$$

<!-- END QUESTION -->
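
As a final sanity check on Question 1.6, the sketch below computes $\vec{w}^*$ from the formula on randomly generated data and confirms two things: the gradient of the ridge objective is (numerically) zero there, and random nearby vectors never achieve a lower objective value. The data, $\lambda_\text{ridge}$, and the helper name `ridge_objective` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 3
lam_ridge = 0.25  # arbitrary positive regularization strength

X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.3, size=n)
X_c = X - X.mean(axis=0)
y_c = y - y.mean()

def ridge_objective(w):
    """Mean squared error on centered data plus the L2 penalty."""
    return np.sum((y_c - X_c @ w) ** 2) / n + lam_ridge * np.sum(w ** 2)

# Solve (X_c^T X_c + n * lambda * I) w = X_c^T y instead of forming an inverse.
w_star = np.linalg.solve(X_c.T @ X_c + n * lam_ridge * np.eye(d), X_c.T @ y)

# Gradient of the objective with respect to w, evaluated at w_star.
gradient = -2 / n * X_c.T @ (y_c - X_c @ w_star) + 2 * lam_ridge * w_star
print(np.allclose(gradient, 0))

# Objective values at 1000 randomly perturbed copies of w_star.
perturbed = [ridge_objective(w_star + rng.normal(scale=0.1, size=d))
             for _ in range(1000)]
print(min(perturbed) >= ridge_objective(w_star))
```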

Thanks for bearing with us through this (extremely) long question! We hope you've left it with a deeper understanding of what ridge regression is, and how and why it works.

## Question 2: $k$-Nearest Neighbors Returns! 🏡🏠

---

In Homework 8, you implemented $k$-nearest neighbors regression. For a refresher of how the method works, review the writeup to Question 1.3 in [Homework 8](https://github.com/practicaldsc/wn25/blob/main/homeworks/hw08/hw08.ipynb). (In Lecture 21, you will also learn about the $k$-nearest neighbors classifier; the classifier and regressor work similarly, but our exploration here is about the **regressor**.)

In $k$-NN regression, $k$ was a **hyperparameter** – you got to choose it before the model was fit to the data. In Question 1.4, we had you estimate, intuitively, a value of $k$ that would create a regressor that generalized well to unseen data. In this question, we'll have you use a more principled approach – cross-validation. And, you'll have to implement the cross-validation yourself!

Let's start by loading in the same `homeruns` dataset from Homework 8, Question 1.

In [None]:
homeruns = pd.read_csv('data/homeruns.csv')
homeruns.head()

You'll notice that the `'Year'` values aren't integers. That's because we've added a small amount of artificial noise to the `'Year'` column so that `sklearn`'s $k$-NN regressor – which you'll use in this question – _doesn't_ encounter any ties when determining the $k$ nearest points. This was problematic last semester, since ties were broken by `sklearn` non-deterministically in a way that we cannot control.

Once you run the cell below, you will see that this additional noise does not affect the general trend we observed last time from the `homeruns` dataset.

In [None]:
homeruns.plot(kind='scatter', x='Year', y='Homeruns')

We'll continue trying to predict `'Homeruns'` as a function of `'Year'`. Last time, we had you implement $k$-nearest neighbors regression by hand. Since you know how to do that already, here, we'll use `sklearn`'s implementation. `KNeighborsRegressor` is imported for you below, along with another useful function.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

### Question 2.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

Assign `X_train`, `X_test`, `y_train`, and `y_test` to the result of performing a train-test split on `homeruns`. Use the default train-test split size, and set `random_state=98`.
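
As a reminder of how `train_test_split` behaves, here is a small sketch on a made-up DataFrame (the column names `'x'` and `'y'` and the seed are placeholders, not the ones you should use). By default, 25% of the rows are held out for testing, and the rows are shuffled before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({'x': np.arange(20), 'y': np.arange(20) ** 2})

# Passing a DataFrame of features and a Series of targets returns four pieces,
# in the order: training features, test features, training targets, test targets.
X_tr, X_te, y_tr, y_te = train_test_split(toy[['x']], toy['y'], random_state=0)

print(X_tr.shape, X_te.shape)
print(X_tr.index.equals(y_tr.index))  # features and targets stay aligned
```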

In [None]:
...

In [None]:
grader.check("q02_01")

### Question 2.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Eventually, we'll want to use cross-validation to identify the value of $k$ – that is, the number of neighbors – that specifies a model that best generalizes to unseen data. But first, we need a function that can perform cross-validation for us.

While `GridSearchCV` and `cross_val_score` exist, **you cannot use them in this question** – instead, the goal here is to implement cross-validation yourself to help really understand how it works.

<div class="alert alert-warning">
    
Note that the terminology is a little confusing, since we're using $k$-fold cross-validation to choose a $k$ for $k$-nearest neighbors regression.

<b>In this question, $k$ will always refer to the number of neighbors to use in $k$-nearest neighbors regression.</b> We'll use other terminology to refer to the number of folds in cross-validation.
    
</div>

Complete the implementation of the function `cross_validate_model`, which takes in:
- `model`, an **un-fit** instance of an `sklearn` estimator object, like `LinearRegression()` or `KNeighborsRegressor(10)`,
- `X_train`, a 2D array/DataFrame with $x$-values being used to train a model.
- `y_train`, a 1D array/Series with $y$-values being used to train a model, with the same number of rows as `X_train`.
- `cv`, a number of folds to use for cross-validation (normally, we call this the $k$ in $k$-fold cross-validation).

`cross_validate_model` should implement `cv`-fold cross-validation, as described [**here in Lecture 18**](https://practicaldsc.org/resources/lectures/lec18/lec18-filled.html#Illustrating-$k$-fold-cross-validation). Specifically, it should:
- Divide `X_train` and `y_train` into `cv` disjoint folds of equal size.
    - `cross_validate_model` **should not** shuffle before creating these folds – instead, it should divide the data as-is into the folds.
    - For example, if `X_train` and `y_train` have 30 rows, and `cv = 3`, fold 0 should be rows 0-9, fold 1 should be rows 10-19, and fold 2 should be rows 20-29.
- Train `model` `cv` times, such that:
    - Each fold is used for validation once and for training `cv - 1` times.
    - Each time `model` is trained, compute its **training mean squared error** on the `cv - 1` training folds and its **validation mean squared error** on the validation fold.
- Return a **DataFrame** with `cv` rows and 2 columns, `'training_mse'` and `'validation_mse'`.
    - There should be one row per fold; the index of the returned DataFrame should be `'Fold 0'`, `'Fold 1'`, and so on.
    - If `out` is the returned DataFrame, then for example, `out.loc['Fold 4', 'training_mse']` should be the training mean squared error when fold 4 was used for validation (and the other `cv - 1` folds were used for training) and `out.loc['Fold 4', 'validation_mse']` should be the validation mean squared error when fold 4 was used for validation.
    - The example above assumes that `cv >= 5`; note that in general, the only restriction on `cv` is that `cv >= 2`.
    

Example behavior is given below.

```python
>>> cross_validate_model(KNeighborsRegressor(2), X_train, y_train, 10)
```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>training_mse</th>
      <th>validation_mse</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Fold 0</th>
      <td>44926.432099</td>
      <td>281430.638889</td>
    </tr>
    <tr>
      <th>Fold 1</th>
      <td>52946.635802</td>
      <td>99437.972222</td>
    </tr>
    <tr>
      <th>Fold 2</th>
      <td>47598.129630</td>
      <td>119365.750000</td>
    </tr>
    <tr>
      <th>Fold 3</th>
      <td>54323.262346</td>
      <td>115330.555556</td>
    </tr>
    <tr>
      <th>Fold 4</th>
      <td>67494.194444</td>
      <td>77006.083333</td>
    </tr>
    <tr>
      <th>Fold 5</th>
      <td>49634.175926</td>
      <td>281051.194444</td>
    </tr>
    <tr>
      <th>Fold 6</th>
      <td>55767.966049</td>
      <td>50324.750000</td>
    </tr>
    <tr>
      <th>Fold 7</th>
      <td>54990.524691</td>
      <td>373829.111111</td>
    </tr>
    <tr>
      <th>Fold 8</th>
      <td>55223.456790</td>
      <td>73802.888889</td>
    </tr>
    <tr>
      <th>Fold 9</th>
      <td>52366.061728</td>
      <td>97442.916667</td>
    </tr>
  </tbody>
</table>

For more context on the example above:
- `X_train` and `y_train` are divided into 10 folds. `X_train` and `y_train` each have 90 rows, so each fold has 9 points.
- When fold 0 is used for validation:
    - Fold 0 corresponds to rows 0-8 of `X_train` and `y_train`.
    - The rows used for training, then, are folds 1-9, i.e. rows 9-89 of `X_train` and `y_train`.
    - A `KNeighborsRegressor(2)` instance is fit on rows 9-89 of `X_train` and `y_train`.
    - The mean squared error of that model instance, when evaluated on rows 9-89, is `44926.432099`, so `out.loc['Fold 0', 'training_mse']` is `44926.432099`.
    - The mean squared error of that model instance, when evaluated on rows 0-8, is `281430.638889`, so `out.loc['Fold 0', 'validation_mse']` is `281430.638889`.
- When fold 1 is used for validation:
    - Fold 1 corresponds to rows 9-17 of `X_train` and `y_train`.
    - The rows used for training, then, are folds 0, 2, 3, 4, ..., 9, i.e. rows 0-8 and rows 18-89 of `X_train` and `y_train`. **A big part of the question is determining, programmatically, which rows are to be used for training!**
    - A `KNeighborsRegressor(2)` instance is fit on rows 0-8 and 18-89 of `X_train` and `y_train`.
    - The mean squared error of that model instance, when evaluated on rows 0-8 and 18-89, is `52946.635802`, so `out.loc['Fold 1', 'training_mse']` is `52946.635802`.
    - The mean squared error of that model instance, when evaluated on rows 9-17, is `99437.972222`, so `out.loc['Fold 1', 'validation_mse']` is `99437.972222`.
- And so on!
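
The row ranges in the walkthrough above can be reproduced with integer positions alone. The sketch below is one possible way to do the bookkeeping, using the sizes from the example; the helper name `split_positions` is hypothetical, and it does no model fitting at all.

```python
import numpy as np

n_rows, cv = 90, 10        # sizes from the example above
fold_size = n_rows // cv   # assumes n_rows is divisible by cv
positions = np.arange(n_rows)

def split_positions(fold):
    """Return (training positions, validation positions) for one fold."""
    start, stop = fold * fold_size, (fold + 1) * fold_size
    val_pos = positions[start:stop]
    # Everything before and after the validation block is used for training.
    train_pos = np.concatenate([positions[:start], positions[stop:]])
    return train_pos, val_pos

train_pos, val_pos = split_positions(1)
print(val_pos)         # rows 9 through 17
print(train_pos[:12])  # rows 0 through 8, then 18, 19, 20
print(len(train_pos))
```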
    

Some guidance:
- Assume that the number of rows in `X_train` is divisible by `cv`, i.e. assume all folds are of the same size. Furthermore, assume that `cv >= 2`.
- Assume that `X_train` and `y_train` have the same number of rows, but **don't** assume that they have the same index values! The `X_train` and `y_train` you produced in Question 2.1 do have the same index, but we should make your `cross_validate_model` more general-purpose. Separate data into folds based on integer positions. 
- Remember that `model` could be any un-fit `sklearn` estimator instance, not just `KNeighborsRegressor(2)`.
- Remember that **you must implement cross-validation from scratch here – you cannot use any pre-built implementation of it**. The animation in Lecture 18 will be helpful.
    - If it helps you in understanding what the goal is, note that `cross_validate_model` does something very similar to the built-in function `cross_val_score` – but again, you can't use it (we're checking!).
    - **You can't use `sklearn`'s `mean_squared_error` either – please implement it yourself, we'll be checking!**
- You can use a `for`-loop – our solution had two (more specifically, one loop and one list comprehension).
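
Two of the points above are easy to trip over: selecting rows by integer position when the index labels don't line up, and computing mean squared error by hand. Here is a minimal sketch of both on made-up data; the `predictions` array is a placeholder for what a fit model's `predict` method would return.

```python
import numpy as np
import pandas as pd

# Hypothetical features and targets with deliberately mismatched index labels.
X_toy = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0]}, index=[7, 3, 9, 1])
y_toy = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# .iloc selects by integer position, so it ignores the index labels entirely.
pos = np.array([1, 2])
X_part = X_toy.iloc[pos]
y_part = y_toy.iloc[pos]
print(X_part.index.tolist(), y_part.index.tolist())

# Mean squared error: the average of the squared differences.
predictions = np.array([2.5, 2.0])  # placeholder predictions
mse = np.mean((y_part.to_numpy() - predictions) ** 2)
print(mse)
```

Converting to a `numpy` array before subtracting avoids any accidental alignment on index labels.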

In [None]:
def cross_validate_model(model, X_train, y_train, cv):
    # Remember: Do **not** shuffle X_train or y_train here;
    # train_test_split already did the shuffling for us!
    ...
    
# Feel free to change this input to make sure your function works correctly.
cross_validate_model(KNeighborsRegressor(2), X_train, y_train, 10)

In [None]:
grader.check("q02_02")

### Question 2.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Now, let's use your implementation of `cross_validate_model` to find the value of $k$ – that is, the number of neighbors – that best generalizes to unseen data.

Complete the implementation of the function `plot_validation_error_vs_k`, which takes in:
- `k_max`, a positive integer.
- `X_train`, `y_train`, and `cv`, all of which are defined the same way as when implementing `cross_validate_model`.

`plot_validation_error_vs_k` should return a `plotly` Figure object like the one below:

<br>

<center><img src="imgs/knn.png" width=550></center>

There are several steps involved here.
- `plot_validation_error_vs_k` should call `cross_validate_model` `k_max` times.
    - Once, `cross_validate_model` should be called with `model = KNeighborsRegressor(1)`.
    - Then, `cross_validate_model` should be called with `model = KNeighborsRegressor(2)`.
    - and so on, until `model = KNeighborsRegressor(k_max)`.
    - Each time `cross_validate_model` is called, `X_train`, `y_train`, and `cv` should all be passed as-is without modification.
- After calling `cross_validate_model` for a particular value of $k$, you should compute the **average** training MSE and **average** validation MSE for that $k$ and store it somewhere.
- Then, create a DataFrame (likely, one that has `k_max` rows and 2 columns) with the average training and validation MSE for each value of $k$, and create a `plotly` line chart with the values in that DataFrame.
- Some properties that must be true of your `plotly` Figure:
    - It must have exactly two lines.
    - The two lines should have different names in the legend; one should have `'training'` somewhere in the name (in any case), and the other should have `'validation'`, and these names should correspond to the types of errors the lines are depicting.
    - The $x$-axis and $y$-axis titles must be exactly the same as ours (including capitalization).
    - The $x$-axis ticks should say `'k = 1'`, `'k = 2'`, and so on, as well. It doesn't matter if the tick labels are rotated on an angle or not (this is determined automatically by `plotly`, depending on what you set `k_max` to, and is not super relevant).
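
The averaging step described above might look something like the sketch below. The per-fold numbers are invented, and `per_fold` is a hypothetical stand-in for the DataFrames that `cross_validate_model` would return for each value of $k$.

```python
import pandas as pd

# Hypothetical per-fold results for two values of k; the numbers are made up.
per_fold = {
    1: pd.DataFrame({'training_mse': [0.0, 0.0], 'validation_mse': [8.0, 12.0]},
                    index=['Fold 0', 'Fold 1']),
    2: pd.DataFrame({'training_mse': [3.0, 5.0], 'validation_mse': [6.0, 8.0]},
                    index=['Fold 0', 'Fold 1']),
}

# .mean() on a DataFrame averages each column, giving one Series per k.
# Stacking those Series as columns and transposing gives one row per k.
averages = pd.DataFrame(
    {f'k = {k}': folds.mean() for k, folds in per_fold.items()}
).T

print(averages)
```

A DataFrame shaped like `averages`, with one row per $k$ and one column per type of error, should be convenient to hand to a line-chart function, since each column can become its own line. You'll still need to set the axis titles and legend names yourself.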

In [None]:
def plot_validation_error_vs_k(k_max, X_train, y_train, cv):
    ...
    
# Feel free to change this input to make sure your function works correctly.
plot_validation_error_vs_k(20, X_train, y_train, 5)

In [None]:
grader.check("q02_03")

Now that you've completed `plot_validation_error_vs_k`, let's take a look at the results one more time:

In [None]:
plot_validation_error_vs_k(20, X_train, y_train, 5)

You reflected on the behavior of $k$ in $k$-nearest neighbor regression models last week, and gave an intuitive choice for a "good" value of $k$.

Now, it's unambiguously clear what the "best" choice of $k$ is: $k = 4$. But, any choice in the range of $k = 3$ to $k = 11$ seems to produce roughly the same average validation mean squared error. Note that this plot looks a little different than the plots of training/validation error vs. model complexity that we saw in lecture because **here, as Number of Neighbors ($k$) increases, model complexity decreases**. $k = 1$ is the most overfit model, since it simply memorizes the $(x_i, y_i)$ pairs in the dataset. In our first lecture example, as our hyperparameter (there, polynomial degree) increased, model complexity increased, too.
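
To see the "memorization" behavior of $k = 1$ concretely, here is a small sketch on synthetic data (not the `homeruns` dataset). It assumes all $x$-values are distinct, which is the same reason noise was added to `'Year'` earlier. The helper `training_mse` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)

# Synthetic data: a noisy sine curve with 40 distinct x-values.
x = np.sort(rng.uniform(0, 10, size=40)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=40)

def training_mse(k):
    """Fit a k-NN regressor and evaluate it on the same data it was fit on."""
    model = KNeighborsRegressor(k).fit(x, y)
    return np.mean((model.predict(x) - y) ** 2)

errors = {k: training_mse(k) for k in [1, 5, 20]}
print(errors)
```

With $k = 1$, each training point's nearest neighbor is itself, so the training error is 0 even though the model would likely do poorly on new data.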

Nice job! You've manually implemented every single calculation that produced the results above. In the last homework, you implemented the regressor yourself, and in this homework you cross-validated it yourself.

## Question 3: In This Economy? 🏡

---

In this question, you'll put your understanding of `sklearn` Pipeline objects and cross-validation to practical use as you aim to predict housing prices. The dataset we're using, originally compiled by Professor Dean De Cock at Truman State University **specifically for** teaching regression, contains information about houses sold in Ames, Iowa from 2006 to 2010.

Run the cell below to load in a **subset of the full dataset, which we've designated as your training set**:

In [None]:
houses_training = pd.read_csv('data/houses-training.csv')
houses_training

There are 82 columns in the dataset! `'SalePrice'` is what we're aiming to predict; everything else _could_ be used as a feature. You'll notice there are some categorical features (some ordinal, some nominal) and some numeric features, and many missing values. Many of the features are self-explanatory, but some are not. Rather than trying to define each feature ourselves, we'll point you to the data description written by the curator of the dataset.

<center><b>Read the data description <a href="https://jse.amstat.org/v19n3/decock/DataDocumentation.txt">here</a>.</b></center>

Before we build any models, as always, we should explore the data.

In the cell below, draw a histogram depicting the distribution of `'SalePrice'`.

In [None]:
...

In the cell below, draw a scatter plot of `'SalePrice'` vs. `'Gr Liv Area'` (which represents square footage, not including the basement).

In [None]:
...

Normally, we'd have you perform a train-test split. However, we've already done this for you, in that `houses_training` is just the training data for this dataset.

In [None]:
X_train_houses = houses_training.drop(columns=['SalePrice'])
X_train_houses.head()

In [None]:
y_train_houses = houses_training['SalePrice']
y_train_houses.head()

The test set's features are below. Note that there is no `'SalePrice'` column in this DataFrame.

In [None]:
X_test_houses = pd.read_csv('data/housing-test-X.csv')
X_test_houses.head()

The test set's actual `'SalePrice'` values (i.e., actual $y$-values) are **intentionally** hidden from you. You won't need them at all in Questions 3.1-3.3. In Question 3.4, you'll have the (optional) opportunity to enter a prediction competition, in which you engineer a model that minimizes testing mean squared error. If you enter, your predictions will be compared against the true `'SalePrice'` values in the test set.

For now, let's just work with `X_train_houses` and `y_train_houses`.

<div class="alert alert-danger">
   
**A common issue last semester was that students would unknowingly use the argument `validate=False` or `validate=True` in various Pipeline-related function calls in this question, causing the Gradescope autograder to crash. Please avoid using the `validate` keyword argument here. (This will be easy to do if you _don't_ use ChatGPT to write all of your code!)**
    
</div>

### Question 3.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

To start, you'll build a `sklearn` Pipeline that does the following to predict `'SalePrice'`:

- Creates a new feature that results from **adding** `'Gr Liv Area'` (non-basement square footage) and `'Total Bsmt SF'` (basement square footage). This is the total square footage of the house.
- Creates degree-2 polynomial features from the total square footage feature defined above.
- One hot encodes `'Neighborhood'`.
- Fits a `LinearRegression` model.

Complete the implementation of the function `create_pipe_sqft_and_neighborhood`, which takes in a DataFrame like `X_train_houses` and a Series like `y_train_houses` and returns a **fit** Pipeline that follows all of the steps above.

Example behavior is given below.

```python
>>> pipe = create_pipe_sqft_and_neighborhood(X_train_houses, y_train_houses)
>>> pipe
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="imgs/first-example-pipe.png" width=400>

```python
>>> pipe.predict(pd.DataFrame([{
    'Gr Liv Area': 2500,
    'Total Bsmt SF': 1500,
    'Neighborhood': 'CollgCr'
}]))
array([312269.07580775])

```


Some guidance:
- Like in the Pipelines you created in Homework 8, all transformations should be done within the Pipeline – you **cannot** preprocess the training data using vanilla `pandas` before creating your Pipeline!
    - So, for instance, to create the total square footage feature, you should create a `FunctionTransformer` that takes in a DataFrame with two columns, adds those two columns, and returns a new DataFrame with a single column that contains the result. This `FunctionTransformer` should be part of a larger Pipeline whose next step is `PolynomialFeatures`.
    - Remember, a `ColumnTransformer` – which you can create easily using `make_column_transformer` – is how you specify which transformations you want applied to which columns. In this particular case, you may want to make a (nested) Pipeline that does the two steps above (namely, the `FunctionTransformer` and `PolynomialFeatures`), and then tell the `ColumnTransformer` that you want to use this nested Pipeline on just the two original square footage columns. `'Neighborhood'` is then one hot encoded as its own entry in the `ColumnTransformer`.
    - If you try and preprocess the data using `pandas` before creating your final Pipeline, the example call to `pipe.predict` at the bottom of the cell below won't work.
    - It's okay if the graphical representation of your Pipeline isn't exactly the same as ours.
- Remember to set `include_bias=False` when creating `PolynomialFeatures` so that your model doesn't end up trying to create two intercept terms.
- Remember to use `drop='first'` when using `OneHotEncoder` to avoid multicollinearity, and `handle_unknown='ignore'` so that your Pipeline doesn't error if we try to predict the `'SalePrice'` of a house in a `'Neighborhood'` we've never seen before.
- You'll need to fill missing `'Total Bsmt SF'` values with 0. (There's only one house in `X_train_houses` with this property; it likely just doesn't have a basement.) **Do this within your `FunctionTransformer` using `fillna(0)`, not by using `SimpleImputer`.**
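
To illustrate the "sum two columns, then create polynomial features" idea on its own, here is a minimal sketch on a made-up DataFrame. The column names `'main'` and `'extra'` and the helper `add_two_columns` are hypothetical, and this is only one piece of a full Pipeline: there is no `ColumnTransformer`, no one hot encoding, and no model here.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

def add_two_columns(df):
    # Treat missing entries as 0, then sum across each row.
    # Returning a one-column DataFrame keeps the output 2D.
    return df.fillna(0).sum(axis=1).to_frame('total')

# Hypothetical toy data with one missing value.
toy = pd.DataFrame({'main': [1000.0, 1500.0, 2000.0],
                    'extra': [500.0, np.nan, 800.0]})

nested = make_pipeline(
    FunctionTransformer(add_two_columns),
    PolynomialFeatures(degree=2, include_bias=False),
)

# Each row becomes [total, total ** 2]; there is no column of 1s.
features = np.asarray(nested.fit_transform(toy))
print(features)
```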

In [None]:
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression

def create_pipe_sqft_and_neighborhood(X_train_houses, y_train_houses):
    ...

# Feel free to change this input to make sure your function works correctly.
# In particular, try changing X_train_houses and y_train_houses to something like
# X_train_houses.head(20) and y_train_houses.head(20)!
pipe = create_pipe_sqft_and_neighborhood(X_train_houses, y_train_houses)
pipe
# Once the above looks right, uncomment the expression below.
# pipe.predict(pd.DataFrame([{
#     'Gr Liv Area': 2500,
#     'Total Bsmt SF': 1500,
#     'Neighborhood': 'CollgCr'
# }]))

In [None]:
grader.check("q03_01")

### Question 3.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Now, let's create a Pipeline that executes all of the steps that `create_pipe_sqft_and_neighborhood` does, but:

1. Instead of fixing the degree of `PolynomialFeatures` to 2, try any possible degree from 1 to 5 (inclusive).
2. Instead of using `LinearRegression`, use `Ridge`, i.e. $L_2$-regularized linear regression. Try regularization penalties $2^{-5}, 2^{-4}, ..., 2^{8}, 2^{9}$, plus $0$. (In class, we called the regularization penalty hyperparameter $\lambda$, but `sklearn` calls it `alpha`.)

The biggest difference here is that you need to use **cross-validation** to choose which polynomial degree and regularization penalty to use. Use `GridSearchCV` to do this; use the default number of folds. **Remember to tell `GridSearchCV` that you want the hyperparameter combination that yields the lowest mean squared error**; by default, this is not what it does!

Complete the implementation of the function `create_pipe_cross_validated_degree_ridge`, which takes in a DataFrame like `X_train_houses` and a Series like `y_train_houses` and returns a **fit** Pipeline that follows all of the steps above.

Example behavior is given below.

```python
>>> pipe_cv = create_pipe_cross_validated_degree_ridge(X_train_houses, y_train_houses)
>>> pipe_cv
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="imgs/second-example-pipe.png" width=400>

```python
>>> pipe_cv.predict(pd.DataFrame([{
    'Gr Liv Area': 2500,
    'Total Bsmt SF': 1500,
    'Neighborhood': 'CollgCr'
}]))
array([304362.63752408])

```

Some guidance:
- At some point, you'll need to create a grid of hyperparameters to pass to `GridSearchCV`. You'll need to supply this grid as a **dictionary**, mapping hyperparameter names to lists (or ranges) of values. When doing this, the hyperparameter names will be a bit more complicated than seen in lecture – for instance, the `PolynomialFeatures` `degree` hyperparameter is a hyperparameter of a nested Pipeline, which itself is likely part of a `ColumnTransformer`, so the key for the degree will likely look something like `'columntransformer__pipeline__ ...'`.
- If you're getting the wrong output for the example prediction above, verify that you've set the `scoring` argument of `GridSearchCV` correctly.
- `create_pipe_cross_validated_degree_ridge` shouldn't run instantly – it may take ~5 seconds to run. This means that `grader.check("q03_02")` won't run instantly either; it may take ~20 seconds to run.
- Once again, fill in the lone missing `'Total Bsmt SF'` value within your `FunctionTransformer` using `fillna(0)`, not by using `SimpleImputer`.
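
To illustrate the first two points, here is a minimal sketch on made-up data. The step names `'features'`, `'poly'`, and `'model'`, and the arrays `X_toy` and `y_toy`, are assumptions chosen for this example only; your Pipeline will be structured differently and so will have different hyperparameter names. The idea it shows is general, though: `get_params()` lists every tunable name, with nested steps joined by double underscores.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical toy data, unrelated to the houses dataset.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = 3 * X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(scale=0.1, size=60)

# A Pipeline nested inside another Pipeline, with explicitly chosen step names.
toy_pipe = Pipeline([
    ('features', Pipeline([
        ('scale', StandardScaler()),
        ('poly', PolynomialFeatures()),
    ])),
    ('model', Ridge()),
])

# Nested hyperparameter names are the step names joined by '__'.
print([name for name in toy_pipe.get_params() if name.endswith('degree')])

toy_grid = {
    'features__poly__degree': [1, 2, 3],
    'model__alpha': [0.1, 1, 10],
}

# GridSearchCV maximizes its score, so MSE is supplied in negated form.
toy_search = GridSearchCV(toy_pipe, toy_grid, scoring='neg_mean_squared_error')
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
```

Without the `scoring` argument, `GridSearchCV` falls back to the estimator's own `score` method, which for regressors is $R^2$ rather than mean squared error.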

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

def create_pipe_cross_validated_degree_ridge(X_train_houses, y_train_houses):
    ...

# Feel free to change this input to make sure your function works correctly.
pipe_cv = create_pipe_cross_validated_degree_ridge(X_train_houses, y_train_houses)
pipe_cv
# Once the above looks right, uncomment the expression below.
# pipe_cv.predict(pd.DataFrame([{
#     'Gr Liv Area': 2500,
#     'Total Bsmt SF': 1500,
#     'Neighborhood': 'CollgCr'
# }]))

In [None]:
grader.check("q03_02")

### Question 3.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

So far, we've only used a small subset of the features in `X_train_houses`. But, there are significantly more!

For our final Pipeline, you're required to use **all features in the dataset**. You'll need to handle numeric and categorical features separately; you can extract all of the numeric columns in a DataFrame using `df.select_dtypes('number')`, for example.

- For **numeric features**:
    - Very few columns have missing values. (For your own exploration, determine which columns these are.) One guess is that these values are missing when the house doesn't have one of those features, e.g. a missing `'Bsmt Half Bath'` likely means that the house doesn't have a basement half-bathroom. **So, use `SimpleImputer` to fill all of the missing numerical values with 0. Make sure to instantiate your `SimpleImputer` instance with `strategy='constant'`.**
    - Then, once missing values are imputed, standardize all numeric features (and only numeric features!) using a `StandardScaler`.
- For **categorical features**:
    - There are many more missing values. **Use `SimpleImputer` to fill all of the missing values in a column with the _most frequently observed_ value in that column.**
    - Then, one hot encode the resulting categorical columns, making sure to use the same arguments you did earlier to handle multicollinearity and unknown categories (`drop='first'` and `handle_unknown='ignore'`).
    - Here, we are using **`SimpleImputer`** because it makes it easier to apply the imputation technique to groups of columns (one imputation strategy for all numeric columns, and one imputation strategy for all categorical columns), not individual columns like we did in 3.1 and 3.2.
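
The two imputation strategies described above can be sketched on toy data as follows. The DataFrames `toy_num` and `toy_cat` and their column names are hypothetical and only serve to show what each strategy fills in; they are not taken from the houses dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, each with one missing value.
toy_num = pd.DataFrame({'baths': [1.0, np.nan, 2.0]})
toy_cat = pd.DataFrame({
    'style': pd.Series(['ranch', 'ranch', np.nan, 'split'], dtype=object)
})

# strategy='constant' replaces every missing value with fill_value.
zero_filled = SimpleImputer(strategy='constant', fill_value=0).fit_transform(toy_num)

# strategy='most_frequent' replaces missing values with each column's mode.
mode_filled = SimpleImputer(strategy='most_frequent').fit_transform(toy_cat)

print(zero_filled.ravel())
print(mode_filled.ravel())
```

Both imputers operate column by column, which is why a single instance can handle a whole group of columns at once.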

After you've created all of your features, fit a `Lasso` object. Use cross-validation to try different regularization penalties from $10^{0}, 10^{1}, ..., 10^{5}$, plus $0$; again, make sure `GridSearchCV` knows that you want the hyperparameter that minimizes mean squared error. Note the different range of hyperparameters as compared to before!

Complete the implementation of the function `create_pipe_all_features_lasso`, which takes in a DataFrame like `X_train_houses` and a Series like `y_train_houses` and returns a **fit** Pipeline that follows all of the steps above.

Example behavior is given below.

```python
>>> pipe_all = create_pipe_all_features_lasso(X_train_houses, y_train_houses)
>>> pipe_all
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="imgs/third-example-pipe.png" width=400>

```python
>>> pipe_all.predict(X_train_houses.head(1))
array([174683.44773055])

```


Some guidance:
- **Pipelines involving `Lasso` take significantly longer to `fit` than Pipelines involving `Ridge`.** This may be due to the fact that the LASSO objective function involves non-differentiable pieces that are harder to optimize. Eventually, when you call `create_pipe_all_features_lasso`, it may take ~1 minute to run.
- To make sure that your transformations are working correctly without having to wait a minute each time you want to test them out, you may want to start by just providing a single regularization penalty hyperparameter for `GridSearchCV` to choose. Once that works without error, switch to providing the range specified above.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

def create_pipe_all_features_lasso(X_train_houses, y_train_houses):
    ...

# Feel free to change this input to make sure your function works correctly.
pipe_all = create_pipe_all_features_lasso(X_train_houses, y_train_houses)
pipe_all
# Once the above looks right, uncomment the expression below.
# pipe_all.predict(X_train_houses.head(1))

In [None]:
grader.check("q03_03")

We've now created three pipelines, all of which predict `'SalePrice'` using some combination of features in `X_train`. The Pipeline that uses **all** features has the lowest training mean squared error, by far:

In [None]:
models = {'pipe (3.1)': pipe, 'pipe_cv (3.2)': pipe_cv, 'pipe_all (3.3)': pipe_all}
for model in models:
    mse = mean_squared_error(y_train_houses, models[model].predict(X_train_houses))
    print(f'Mean squared error for {model}: {mse:.2e}')

But, it's still not clear which of the three Pipelines generalizes best to unseen test data. While we used cross-validation to fit `pipe_cv` and `pipe_all`, we didn't directly compare them to one another when doing cross-validation, so it's _possible_ that `pipe_cv` generalizes better than `pipe_all`.

Of course, we **do** have a test set that we could use to assess how well these Pipelines all generalize, but we can't give you access to it just yet.
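
For a sense of what a direct comparison could look like, the sketch below scores two candidate models with the same cross-validation setup. Everything here is an assumption made for illustration: the data is synthetic, the two candidates are simple polynomial regressions rather than the Pipelines from this question, and `cross_val_score` with `cv=5` is just one reasonable way to set up the comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship between x and y.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(80, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.5, size=80)

candidates = {
    'degree 1': make_pipeline(PolynomialFeatures(1), LinearRegression()),
    'degree 2': make_pipeline(PolynomialFeatures(2), LinearRegression()),
}

# Score each candidate with the same 5 folds; negate to recover the MSE.
cv_mse = {
    name: -cross_val_score(model, x, y, cv=5,
                           scoring='neg_mean_squared_error').mean()
    for name, model in candidates.items()
}

for name, mse in cv_mse.items():
    print(f'Average validation MSE for {name}: {mse:.2f}')
```

Since every candidate is evaluated on the same held-out folds, the average validation errors are comparable in a way that training errors are not.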

One last thing before Question 3.4: recall, LASSO (which we used in `pipe_all`) encourages **sparsity**, meaning that we should expect the coefficients of many features to end up being 0. We can see exactly which features had a coefficient of 0 in `pipe_all` here:

In [None]:
feature_names = pipe_all.best_estimator_[:-1].get_feature_names_out()
coefficients = pipe_all.best_estimator_[-1].coef_

In [None]:
coefs = pd.Series(coefficients, index=feature_names)
coefs[coefs == 0]

So, LASSO was implicitly telling us **not** to use those features, if we care about building a model that generalizes well to unseen data! This feature selection process might be useful to you should you choose to complete Question 3.4.
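
To see this sparsity effect in isolation, here is a small sketch on synthetic data in which only two of five features influence the response. The data, the variable names, and the penalty `alpha=0.5` are all arbitrary choices made for illustration, not values from this homework.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of the five columns affect y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 4 * X_toy[:, 0] - 3 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

lasso_coefs = Lasso(alpha=0.5).fit(X_toy, y_toy).coef_
ridge_coefs = Ridge(alpha=0.5).fit(X_toy, y_toy).coef_

# Count how many coefficients each model sets to exactly 0.
n_zero_lasso = int(np.sum(lasso_coefs == 0))
n_zero_ridge = int(np.sum(ridge_coefs == 0))

print('LASSO coefficients:', lasso_coefs.round(3))
print('Ridge coefficients:', ridge_coefs.round(3))
print(f'Exact zeros: LASSO {n_zero_lasso}, ridge {n_zero_ridge}')
```

Ridge shrinks the coefficients of the irrelevant features toward 0 without making them exactly 0, whereas LASSO removes them from the model entirely.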

### Optional: Question 3.4 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">Extra Credit</div>

**This part of the question is OPTIONAL, and just counts for extra credit!**

In Questions 3.1 through 3.3, we specified exactly how you should create your Pipelines. But now, it's your job to choose features and transformations that make good, generalizable predictions.

If you decide to complete Question 3.4, here's the process.
1. Below, complete the implementation of the function `create_pipe_custom`, which takes in a DataFrame like `X_train_houses` and a Series like `y_train_houses` and returns a **fit** Pipeline, created however you'd like.
2. Then, run the cell below so that `pipe_custom` is assigned to a fit Pipeline.
3. Run the cell below that starts with `# EXPORT CELL!` to create a CSV of your Pipeline's predictions on the hidden test set.
4. Upload the CSV created (it should be named something like `'test-predictions-2025-03.....csv'`) to the **Homework 9, Question 3.4 Leaderboard (Optional!)** autograder on Gradescope.
5. Ignore the "score" that appears on Gradescope, since everyone receives a score of 0 – all that matters is your ranking on the leaderboard, linked [**here**](https://www.gradescope.com/courses/930446/assignments/5994908/leaderboard). **The rankings are computed using your mean squared error on the unseen testing set; the lower the MSE, the higher your ranking.**
6. The top 5 **good-faith** submissions on the leaderboard will receive extra credit on Homework 9 as follows:
    - The top submission – that is, the one with the lowest MSE – will earn 10 points of extra credit on Homework 9. Since the homework is out of 39 points, this equates to around **25% of extra credit**.
    - The 2nd-best submission will earn 8 extra points on Homework 9 ($\approx$20% of extra credit).
    - The 3rd-best submission will earn 6 extra points on Homework 9 ($\approx$15% of extra credit).
    - The 4th-best submission will earn 4 extra points on Homework 9 ($\approx$10% of extra credit).
    - The 5th-best submission will earn 2 extra points on Homework 9 ($\approx$5% of extra credit).
     
Some guidance:
- You can use _any_ regression class **in `sklearn`** to make your predictions. By **good-faith submission**, we mean a submission that doesn't somehow determine the true $y$-values in the test set and hardcodes them, and a submission that is unique, i.e. not copied from someone else (that would, of course, be an honor code violation) or from a Kaggle competition online. Before assigning extra credit, we will manually inspect your submitted notebook (which you need to submit anyways for the rest of the homework to be graded), and any submissions that don't use an `sklearn` Pipeline will not receive the extra credit.
- Don't just guess arbitrarily which features might be useful and how to engineer them. **Do some exploratory data analysis!** Look at the relationships between various features and `'SalePrice'`. You might discover various new features you want to engineer.
- For reference, you'll find a submission titled **Submitted by Suraj: Baseline Model from Question 3.3** on the leaderboard. This shows you the test MSE achieved by the Pipeline that `create_pipe_all_features_lasso` returns. Your model's MSE should be lower than this!
- Have fun with it, and use what you've learned in this question to improve your models in the Portfolio Homework!

In [None]:
def create_pipe_custom(X_train_houses, y_train_houses):
    ...

# Make sure this has been run before you try and run the export cell below!
pipe_custom = create_pipe_custom(X_train_houses, y_train_houses)
pipe_custom

Once you've implemented `create_pipe_custom` and defined `pipe_custom` at the bottom of the cell above, run the cell below to generate your CSV of test set predictions. Upload this CSV to the Gradescope assignment titled **Homework 9, Question 3.4 Leaderboard (Optional!)**.

In [None]:
# EXPORT CELL!

import datetime
current_time = str(datetime.datetime.now())[:19]

y_pred = pipe_custom.predict(X_test_houses)
y_pred_df = pd.DataFrame().assign(predictions=y_pred)
y_pred_df.to_csv(f'test-predictions-{current_time}.csv', index=False)
print(f'Saved test-predictions-{current_time}.csv; upload this to Gradescope.') 

## Finish Line 🏁

Congratulations! You're ready to submit Homework 9. **Remember, you need to submit Homework 9 twice (or three times, if you're participating in the optional competition for Question 3.4)**:

### To submit the manually graded problem (Question 1; marked [Written ✏️])

- Make sure your answers **are not** in this notebook, but rather in a separate PDF.
    - You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in.
- Submit this separate PDF to the **Homework 9 (Question 1; written problem)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

### To submit the autograded problems (Questions 2-3; marked [Autograded 💻])

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under "Homework 9 (Questions 2-3; autograded problems)". Make sure your notebook is still named `hw09.ipynb` and the name has not been changed.
5. Stick around while the Gradescope autograder grades your work.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

Your Homework 9 submission time will be the **later** of your two individual submissions.

### To submit to the optional prediction competition (Question 3.4)
See the details in Question 3.4.