In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw08.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 8

# Feature Engineering and Pipelines

### EECS 398: Practical Data Science, Winter 2025

#### Due Wednesday, March 26th at 11:59PM (one day later than usual)
    
</div>

## Instructions

Welcome to Homework 8! In this homework, you'll learn how to create new features for model building using both `pandas` and `sklearn`, as well as how to expand on this by building `sklearn` modeling Pipelines. The most relevant lectures are Lectures 16 and 17.

You are given **eight** slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/wn25/). The [Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps.
<div class="alert alert-warning" markdown="1">

<div class="alert alert-warning">
This homework features a mix of autograded programming questions and manually-graded questions.
    
- Questions 1.4, 3, and 4 are **manually graded**, like in Homework 7, and say **[Written ✏️]** in the title. For these questions, **do not write your answers in this notebook**! Instead, like in Homework 7, write **all** of your answers to the questions in this homework in a separate PDF. You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in. Submit this separate PDF to the **Homework 8 (Questions 1.4, 3-4; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

- Questions 1.1-1.3 and 2 are **fully autograded**, and say **[Autograded 💻]** in the title. For these questions, all you need to do is write your code in this notebook, run the local `grader.check` tests, and submit to the **Homework 8 (Questions 1.1-1.3, 2; autograder problems)** assignment on Gradescope to have your code graded by the hidden autograder. This is the same workflow you followed in earlier homeworks.

Your Homework 8 submission time will be the **later** of your two individual submissions.
</div>
</div>

**Make sure to show your work for all written questions! Answers without work shown may not receive full credit.**

This homework is worth a total of **46 points**, 23 of which are manually graded and 23 of which come from the autograder. The number of points each question is worth is listed at the start of each question. **All questions in the assignment are independent, so feel free to move around if you get stuck**, but keep in mind that you'll need to submit this homework twice – one submission for your written problems, and one submission for your autograded problems. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the cell below, plus the cell at the top of the notebook that imports and initializes `otter`. 

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

## Question 1: Play Ball ⚾️

---

In this question, you'll get a feel for the process of creating new features from existing ones and how to _think_ about model generalizability, an idea we'll see more in Lectures 18 onwards.

As we discussed in Lecture 16, a numerical-to-numerical transformation results from taking the values in some numerical column $x_1, x_2, ..., x_n$ and applying some function $f$ to each value, to produce a new set of numbers $f(x_1), f(x_2), ..., f(x_n)$. These **transformed** values, $f(x_1), f(x_2), ..., f(x_n)$, can then either be used as a feature, or as the target ($y$) variable.

A common goal of applying a numerical-to-numerical transformation is to modify the data from a complicated, non-linear relationship into a **linear** relationship. Linear relationships are easy to understand and are well-described using linear models.

However, non-linear growth is common in real-world datasets. Sometimes this growth is by a **fixed power** and sometimes it is **exponential**. The transformation functions, $f$, that turn these types of growth linear are **root** and **log** transformations respectively. (Generally, it is more difficult to determine which transformation is appropriate for a given dataset, though the [Tukey-Mosteller bulge diagram](https://freakonometrics.hypotheses.org/files/2014/06/Selection_005.png) from Lecture 16 is useful.)
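As a quick sanity check of the claim above, here is a minimal sketch on noise-free toy data (the constants `3`, `2`, and `0.5` and the helper `linear_corr` are assumptions chosen for illustration, not part of the homework). It compares how linear each relationship is before and after the matching transformation, using the correlation coefficient as a rough measure of linearity.

```python
import numpy as np

# Noise-free toy data (an assumption for illustration; real data is noisier).
x = np.linspace(1, 10, 50)
y_power = 3 * x ** 2          # growth by a fixed power (quadratic)
y_exp = 2 * np.exp(0.5 * x)   # exponential growth

def linear_corr(a, b):
    """Correlation coefficient between a and b; +1 or -1 means perfectly linear."""
    return np.corrcoef(a, b)[0, 1]

r_power_raw = linear_corr(x, y_power)
r_power_root = linear_corr(x, np.sqrt(y_power))  # sqrt(3 x^2) = sqrt(3) * x
r_exp_raw = linear_corr(x, y_exp)
r_exp_log = linear_corr(x, np.log(y_exp))        # log(2 e^{0.5 x}) = log(2) + 0.5 x

print(r_power_raw, r_power_root)
print(r_exp_raw, r_exp_log)
```

In both cases the correlation after transforming should be 1 up to floating-point error, while the untransformed correlation is strictly smaller. With noisy data the improvement is less clean, which is why residual plots are still worth checking.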

Let's start by looking at some examples of transformations.

### Example 1

Run the cell below to generate a scatter plot.

In [None]:
# By setting a seed, we guarantee that we will see the same results each time we run this cell.
np.random.seed(23)

# Generates a random scatter plot
x = np.arange(1, 101) + np.random.normal(0, 0.5, 100)
y = 2 * ((x + np.random.normal(0, 1, 100)) ** 2) + np.abs(x) * np.random.normal(0, 30, 100)
df_1 = pd.DataFrame().assign(x=x, y=y)

px.scatter(df_1, x='x', y='y', trendline="ols", trendline_color_override="#ff7f0e")

It doesn't appear to be the case that `'x'` and `'y'` are linearly associated here, and they aren't – there is a **quadratic** relationship between them. 

One way we may be able to notice this is with a **residual plot**, where we visualize the residuals (or errors), $e_i = y_i - H^*(x_i)$, as defined in Question 1. Note that if we were to create a **residual plot** based on the data above, there would be a pattern – the residuals for the smallest and largest `'x'` would mostly be positive, and the residuals for intermediate `'x'` would mostly be negative. Patterns in a residual plot imply that the relationship between the two variables is non-linear.
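To make that kind of pattern concrete, here is a minimal sketch on noise-free quadratic data (an assumption made for illustration; `df_1` above is noisy). It fits a least squares line with `np.polyfit` and looks at the sign of the residuals at the two ends of the `'x'` range and in the middle.

```python
import numpy as np

x = np.arange(1, 101, dtype=float)
y = 2 * x ** 2                              # noise-free quadratic (an assumption)

# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)  # least squares line
residuals = y - (intercept + slope * x)

left = residuals[:10]       # smallest x values
middle = residuals[45:55]   # x values near the center
right = residuals[-10:]     # largest x values

print(left.mean(), middle.mean(), right.mean())
```

For a line fit with an intercept, the residuals always average to zero, so the non-linearity shows up as a curve (positive at both ends, negative in between) rather than as an overall upward or downward drift.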

Let's take a look at the residual plot, using a helper function defined below. This function fits a `LinearRegression` model to `'x'` and `'y'`, adds a `'residuals'` column to the `df`, and plots that against the predictions `'pred'`. Note that it's equally valid to plot the residuals against `'x'`: to do that, change `x = 'pred'` to `x = x` in the call to `px.scatter` below. You'll see the trend is the same, but the x-axis will have different numbers. That's because `'pred'` is just a linear transformation of `'x'`.

In [None]:
# Feel free to use this function directly to help you answer Question 1.
def create_residual_plot(df, x, y):
    df = df.copy()
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(df[[x]], df[y])
    df['pred'] = model.predict(df[[x]])
    df[f'{y} residuals'] = df[y] - model.predict(df[[x]])
    return px.scatter(df, x='pred', y=f'{y} residuals', trendline='ols', trendline_color_override='red')

create_residual_plot(df_1, 'x', 'y')

To linearize the relationship, we can take the square root of each `'y'` value:

In [None]:
df_1['root y'] = np.sqrt(df_1['y'])

px.scatter(df_1, x='x', y='root y', trendline="ols", trendline_color_override="#ff7f0e")

That looks much better!

### Example 2

Run the cell below to generate another scatter plot.

In [None]:
# By setting a seed, we guarantee that we will see the same results each time we run this cell
np.random.seed(32)

# Generates a different random scatter plot
x = np.linspace(2, 5, 100)
y = 10 * (np.e ** x) + np.abs(x) * np.random.normal(0, 5, 100) + np.random.normal(0, 30, 100)
df_2 = pd.DataFrame().assign(x=x, y=y)

px.scatter(df_2, x='x', y='y', trendline="ols", trendline_color_override="#ff7f0e")

Again, the relationship between `'x'` and `'y'` is not quite linear. Let's try the square root transformation we tried in Example 1:

In [None]:
df_2['root y'] = np.sqrt(df_2['y'])

px.scatter(df_2, x='x', y='root y', trendline="ols", trendline_color_override="#ff7f0e")

Hmm... the relationship certainly looks _more_ linear than before, but still not quite linear. Let's look at the residual plot:

In [None]:
create_residual_plot(df_2, 'x', 'root y')

There is clearly a pattern in the residual plot. Let's instead try another transformation for the `'y'` values – $\log$.

In [None]:
df_2['log y'] = np.log(df_2['y'])

px.scatter(df_2, x='x', y='log y', trendline="ols", trendline_color_override="#ff7f0e")

That looks much better! We can verify that the residual plot has no "patterns":

In [None]:
create_residual_plot(df_2, 'x', 'log y')

Note – there is still evidence of **heteroscedasticity**, or "uneven spread", in this scatter plot, but the relationship is as close to linear as we'll get.

Now that we've learned how to perform transformations with example datasets, it's your job to apply these ideas to a real dataset. Below, we load in a dataset that describes the [number of home runs in the MLB per year](https://www.mlb.com/glossary/standard-stats/home-run). The relationship between the two variables, `'Year'` and `'Homeruns'`, is not linear.

In [None]:
homeruns = pd.read_csv('data/homeruns.csv')
homeruns.head()

In [None]:
homeruns.plot(kind='scatter', x='Year', y='Homeruns')

**Throughout this entire question**, suppose we're modeling `'Homeruns'` as a function of `'Year'`, i.e. `'Homeruns'` is the $y$ variable and `'Year'` is the $x$.

### Question 1.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

**Your first job is to determine what the appropriate transformation to apply to the `'Homeruns'` column is, in order to linearize the relationship.** Specifically, try out the transformations below, and then draw and examine residual plots to identify which numerical-to-numerical transformation is best.

While you'll have to write a bunch of code, this is a multiple-choice question. Assign `best_transformation` to either 1, 2, 3, or 4, with the value corresponding to one of the following choices:

1. Square root transformation.
2. Log transformation.
3. Both work the same.
4. Neither gives a transformation revealing a linear relationship.

If you find that both residual plots have some sort of pattern, choose the transformation whose residual plot has the more constant vertical spread. There is one clearly correct answer.

In [None]:
best_transformation = ...
best_transformation

In [None]:
grader.check("q01_01")

### Question 1.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Recall, our goal in this question is to model `'Homeruns'` as a function of `'Year'`. In the previous part, we had you apply a numerical-to-numerical transformation to `'Homeruns'`, which is the $y$ variable.

In this part, you'll be required to engineer new quantitative features **of your own choosing**, all based on transformations of the $x$ variable, `'Year'`.

Complete the implementation of the function `fit_model_and_return_predictions`, which takes in:
- `X`, a DataFrame with a single column of `'Year'` values from `homeruns`, and
- `y`, an array or Series with a sequence of `'Homeruns'` values from `homeruns`.

`fit_model_and_return_predictions` should:
- Create new numerical features by applying various transformations to the values in `X['Year']` (look at Question 5 in Homework 7 for inspiration),
- Fit a `sklearn` LinearRegression object using your custom design matrix as the `X` argument and our passed-in `y` as the `y` argument, and
- **Return an array of predictions** that result from calling the `predict` method on the fit model, using your custom design matrix as the `X` argument.

For example, suppose our `fit_model_and_return_predictions` function creates polynomial features of degrees 2 through 10, and adds no other new features. Example behavior of `fit_model_and_return_predictions` may then be as follows:

```python
>>> fit_model_and_return_predictions(homeruns[['Year']], homeruns['Homeruns'])[:5]
array([165.07808666, 300.52105073, 174.28363771, 288.87689757, 395.065479  ])
```

A plot of the predictions returned by `fit_model_and_return_predictions` might then look like:

<center><img src="imgs/fit-model.png" width=500></center>

Is this a "good" model? Sure, it has a low training MSE, but it's not likely to generalize well to unseen $x$-values – in this case, future `'Year'`.

**You can create your features however you'd like!** Don't just use our example of using polynomial features of degrees 2 to 10. Try, intuitively, to come up with a fit hypothesis function that _you think_ is likely to generalize well to future `'Year'`s for which we don't know the number of `'Homeruns'`. We will formalize how to develop models that generalize well in the coming lectures.

All we can autograde in this question are your resulting predictions – practically, we have no way of knowing how you come up with them. Other than what's described above, here are the only added requirements of your function:

- It should be able to take in a **subset** of the rows in `homeruns`, and should do all calculations (feature creation, fitting, predicting) using that subset. So, this should work too:
    ```python
        >>> fit_model_and_return_predictions(homeruns.head()[['Year']], homeruns.head()['Homeruns'])
        
    ```
    Note that in `fit_model_and_return_predictions`, the `X` data used to fit the model is always the same as the data used to make predictions. In future examples in this class, this is not necessarily how model building will work – after all, we typically build models using historical data and use them to make predictions about future data – but this is how we'll use and test `fit_model_and_return_predictions`.

- The array that `fit_model_and_return_predictions` returns should be **deterministic**. That is, if `fit_model_and_return_predictions` is called twice with the exact same inputs `X` and `y`, the output should not change.
- The mean squared error of the predictions, when called on `X=homeruns[['Year']]` and `y=homeruns['Homeruns']`, should be **between 100,000 and 200,000**. Yes, it's possible to achieve a mean squared error of less than 100,000, but such a model is likely **overfitting** significantly to the data. (In fact, in Homework 7, you learned how to build models with 0 MSE, using Lagrange Interpolation! **Don't do that here – try and build more general-purpose models.**)

In [None]:
from sklearn.linear_model import LinearRegression

def fit_model_and_return_predictions(X, y):
    X = X.copy()
    # Below, create your features and design matrix,
    # instantiate a LinearRegression object,
    # fit it, and then call model.predict on it.
    ...

# Feel free to change this input to make sure your function works correctly.
preds = fit_model_and_return_predictions(homeruns[['Year']], homeruns['Homeruns'])

# Uncomment the code below to see a graph of your
# fit hypothesis function's predictions.
# fig = homeruns.plot(kind='scatter', x='Year', y='Homeruns')
# fig.add_trace(go.Scatter(
#     x=homeruns['Year'],
#     y=preds,
#     mode='lines',
#     line=dict(width=4),
#     name='Fit Model'
# ))

In [None]:
grader.check("q01_02")

### Question 1.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Above, you had to manually create features that resulted in a hypothesis function that fit the data well (but not too well). You may wonder, is there a way to do this automatically?

One _kind-of_ solution is to use **nearest neighbors regression**. In nearest neighbors regression, to evaluate the hypothesis function $H^*$ on the (scalar) input $x_\text{new}$:

1. First, choose a value of $k$. (This is why the technique is sometimes called **$k$-nearest neighbors, or $k$-NN regression**.)
1. Then:
    1. Find the $k$ points in the original dataset whose $x$-values are closest to $x_\text{new}$ in terms of absolute difference (note that since we're essentially dealing with just a single $x$ feature, using squared distance would also work the same way).
    1. Return the mean of the $y$-values corresponding to the $k$ points found in the step above.

For example, suppose our original dataset is:

| x | y |
| --- | --- |
| 10 | 5 |
| 11 | 17 |
| 12 | 26 |
| 19 | -5 |
| 25 | 3 |

Suppose we choose $k = 3$, and suppose we want to predict the $y$-value for $x_\text{new} = 20$. Then:
- The $k = 3$ points with the closest $x$-values are $(19, -5)$, $(25, 3)$, and $(12, 26)$.
- The mean of the $y$-values of the three points above is $\frac{-5 + 3 + 26}{3} = 8$.
- So, we predict a $y$-value of 8 for input $x_\text{new} = 20$.

This is a regression technique, because it allows us to predict real-valued outputs. However, it is different from linear regression in that it is **non-parametric** – there are no **parameters** $w_0^*, w_1^*, ...$ that we're learning from the data in order to make our predictions. Another way of thinking about the idea of a parametric model is that parametric methods make assumptions about the shape of the data and/or its underlying probability distribution; linear regression assumes that the underlying data looks linear (among other things), while $k$-NN regression doesn't assume anything about the shape of the data.


We can choose $k$ to be whatever we want it to be, but some values of $k$ are "better" in some sense than others. We'll explore this idea in Question 1.4, when we tie things back into the `homeruns` dataset.

**Your job is to** complete the implementation of the function `create_knn_regressor`, which takes in:
- `x`, a 1D array/Series of $x$-values, e.g. `homeruns['Year']`,
- `y`, a 1D array/Series of $y$-values, e.g. `homeruns['Homeruns']`, and
- `k`, a positive integer corresponding to the value of $k$ (where `k <= len(x)`).

`create_knn_regressor` should return a **function** that can take in a single number `x_new` and return the predicted $y$-value for the input `x_new`, according to the process outlined above.

Example behavior is given below.

```python
>>> regressor = create_knn_regressor(x=np.array([10, 11, 12, 19, 25]),
...                                  y=np.array([5, 17, 26, -5, 3]),
...                                  k=3)
>>> regressor(20)
8.0
```

Some guidance: 
- The bulk of the work in this question is in understanding how nearest neighbor regression works. Our implementation is very short (5 lines total).
- **You're not allowed to use `sklearn` here**, but don't forget to use what you know about `pandas` DataFrames! You shouldn't use a `for`-loop.
- Don't worry about cases in which there are ties in distance (e.g. if $k = 3$ but there are 4 points that are all equidistant from $x_\text{new}$ such that they are all the closest); our tests are written in a way that will not penalize your handling of this situation if it's different from ours.

In [None]:
def create_knn_regressor(x, y, k):
    ...

# Feel free to change these inputs to make sure your function works correctly.
# It's a good idea to test out create_knn_regressor on the homeruns dataset!
regressor = create_knn_regressor(x=np.array([10, 11, 12, 19, 25]),
                                 y=np.array([5, 17, 26, -5, 3]),
                                 k=3)
regressor(20)

In [None]:
grader.check("q01_03")

Once you've implemented `create_knn_regressor`, run the cell below to see an **interactive** widget that will allow you to choose different values of $k$ and see the resulting $k$-NN regressor plotted on top of the `homeruns` dataset.

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

def plot_given_k(k):
    x = homeruns['Year']
    y = homeruns['Homeruns']
    regressor = create_knn_regressor(x, y, k)
    preds = [regressor(xi) for xi in x]

    fig = px.scatter(x=x, y=y).update_layout(xaxis_title='Year', yaxis_title='Homeruns', title=f'Fit kNN Model with k={k}')

    return fig.add_trace(go.Scatter(
        x=x,
        y=preds,
        mode='lines',
        line=dict(width=4),
        name='Fit Model'
    ))

widgets.interact(plot_given_k, k=widgets.IntSlider(min=1, max=140, step=1, value=5));

Try different values of $k$ – what do you notice?

<!-- BEGIN QUESTION -->

### Question 1.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Play around with the interactive cell above. Then, comment on the following points **in your PDF writeup, along with your answers to the rest of the written problems in this homework (that is, _not_ in this notebook)**:
1. When $k = 1$, what does the resulting fit model look like, and how does it relate to models we've seen in earlier lectures/homeworks?
2. When $k = 140$, what does the resulting fit model look like, and how does it relate to models we've seen in earlier lectures/homeworks?
3. Which value of $k$ do you _feel_ best captures the trend in the data, and why? (Just give a one sentence intuitive answer – no calculations needed.)

<!-- END QUESTION -->

## Question 2: March Madness 🏀

---

Michigan's men's basketball team is back in the NCAA Tournament! Our first game, on Thursday, March 20th, is against Suraj's old team, UC San Diego. (By the time you're reading this, we may already know the result!)

In this question, we'll continue with our theme of using sports data. This time, we'll use a dataset that has one row for each of the 364 Division 1 men's basketball teams this season, taken from [Sports Reference](https://www.sports-reference.com/cbb/seasons/men/2025-school-stats.html). Run the cell below to load in the dataset.

In [None]:
teams = pd.read_csv('data/ncaa-2025.csv')
teams.head()

There are several pieces of information for each `'School'`, based on their performance in the 2024-25 season leading up to the NCAA Tournament, i.e. **not** including the NCAA Tournament.

In [None]:
teams.columns

We won't use most of these features, and will explain the features we do end up using.

First, we'll say that we're not going to try and make a perfect bracket, or predict who is going to win any particular March Madness game – that's far too difficult to do. There's a reason that [nobody has ever correctly predicted the results of all 63 games in the NCAA Tournament](https://www.ncaa.com/news/basketball-men/bracketiq/2023-03-16/perfect-ncaa-bracket-absurd-odds-march-madness-dream).

Let's start simpler, by looking at the relationship between the **total** number of points each team scored in the regular season, `'Points For'`, and the total number of points their opponents scored against them, `'Points Against'`. We'll color each school by whether or not they `'Qualified'` for the NCAA Tournament.

In [None]:
fig = px.scatter(teams, x='Points For', y='Points Against', hover_name='School', color='Qualified', 
                 color_discrete_map={True: '#D81B60', False: '#1E88E5'})

fig.update_layout(width=800, title='Points Against vs. Points For for All 364 D1 Teams',
                  legend={'title': 'Qualified for NCAA Tournament'})

If you hover over a point in the plot above, you'll see the name of the corresponding school. For fun, take a look at the NCAA's [Official Rankings](https://www.cbssports.com/college-basketball/news/march-madness-2025-committee-reveals-official-ncaa-tournament-bracket-seed-list-from-1-68/) for all 68 teams that qualified for the tournament. Where do the top teams on the NCAA's list appear in the plot above?

You'll notice that teams that qualified for the tournament appear in the bottom right quadrant of the graph. Typically, to qualify for the tournament, a team has to be pretty good, and good teams score lots of points. There are also not-so-good teams that automatically qualified for the tournament; see if you can spot them in the graph above.

It's important to note that there are 364 teams in the dataset, but each team only played ~35 games leading up to the tournament. So, each individual team only played a small subset of other teams. Each team mostly played teams in its own "conference". Michigan, for example, is in the Big Ten Conference, and mostly played other Big Ten teams.

Now that we're somewhat familiar with the dataset, let's get started. In Question 1, you only used `sklearn` to train your model, once you had already created your features. In this question, we will have you create various `sklearn` Pipelines that implement the end-to-end modeling process.

### Question 2.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Let's start by creating a linear regression model that uses a team's `'Points For'` and `'Points Against'` to predict `'W-L%'`, the percentage of games that the team won leading up to the tournament. Specifically, let's fit the model:

$$\text{pred. W-L%}_i = w_0 + w_1 \cdot \text{Points For}_i + w_2 \cdot \text{Points Against}_i$$

Complete the implementation of the function `create_model_points_for_against`, which takes in a DataFrame like `teams` and returns an **already fit** `LinearRegression` object that implements the hypothesis function above. Example behavior is given below.

```python
>>> model = create_model_points_for_against(teams)

>>> model.coef_
array([ 0.0007842 , -0.00077308])

>>> model.predict([[2400, 2200]])[0]
0.6289662049554058
```

In [None]:
def create_model_points_for_against(teams):
    # Make sure not to fit the model with all of the columns in `teams`,
    # just the two that are relevant.
    ...
    
# Feel free to change this input to make sure your function works correctly.
# Remember that your function should work on subsets of the teams DataFrame as well.
model = create_model_points_for_against(teams)

# Note that to predict W-L%, I only need to pass in two values!
model.predict([[2400, 2200]])[0]

In [None]:
grader.check("q02_01")

Let's visualize your fit model. Run the cell below to see your model's predictions in <b><span style="color:orange">orange</span></b>, along with the original dataset in <b><span style="color:blue">blue</span></b>.

In [None]:
# Thanks, Claude, for helping us generate this visualization code!
model = create_model_points_for_against(teams)

x_range = np.linspace(teams['Points For'].min(), teams['Points For'].max(), 20)
y_range = np.linspace(teams['Points Against'].min(), teams['Points Against'].max(), 20)
x_grid, y_grid = np.meshgrid(x_range, y_range)

grid_points = np.column_stack([x_grid.flatten(), y_grid.flatten()])
z_pred = model.predict(grid_points)
z_grid = z_pred.reshape(x_grid.shape)

fig = go.Figure()

fig.add_trace(go.Scatter3d(
    x=teams['Points For'],
    y=teams['Points Against'],
    z=teams['W-L%'],
    mode='markers',
    name='Actual Values'
))

fig.add_trace(go.Surface(
    x=x_grid,
    y=y_grid,
    z=z_grid,
    name='Predictions',
    showscale=False,
    colorscale='oranges'
))

fig.update_layout(
    scene=dict(
        xaxis_title='Points For',
        yaxis_title='Points Against',
        zaxis_title='Win-Loss Percentage'
    ),
    title='Actual vs. Predicted Win-Loss Percentage',
    width=900,
    height=500
)

fig.show()

It seems like our model is doing a decent job at modeling win-loss percentage. Let's peek at our optimal model parameters:

In [None]:
model.intercept_, model.coef_

If you did this correctly, you should see that our model looks like:

$$\text{pred. W-L%}_i \approx 0.447660 + 0.000784 \cdot \text{Points For}_i - 0.000773 \cdot \text{Points Against}_i$$

It appears that the coefficients on $\text{Points For}$ and $\text{Points Against}$ are _almost_ cancelling each other out! This could imply that we could achieve _almost_ as good performance by looking at just the **difference** between a team's $\text{Points For}$ and $\text{Points Against}$.
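To see why near-cancelling coefficients point toward the difference, it may help to regroup the two terms. This is only a sketch using the rounded coefficients shown above, with $w_1 \approx 0.000784$ and $w_2 \approx -0.000773$:

$$w_1 \cdot \text{Points For}_i + w_2 \cdot \text{Points Against}_i = w_1 \left( \text{Points For}_i - \text{Points Against}_i \right) + (w_1 + w_2) \cdot \text{Points Against}_i$$

Here $w_1 + w_2 \approx 0.000011$, so the leftover term changes by only about $0.000011 \cdot 1200 \approx 0.013$ across a 1200-point range of $\text{Points Against}$. Most of what the model predicts is therefore driven by the difference term.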

### Question 2.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Recall from Lecture 16 that standardizing a model's features **does not** change its performance; rather, it makes the model's coefficients more directly comparable to one another, since standardizing brings features to the same scale. Here, standardizing is _truly_ not necessary since both the `'Points For'` and `'Points Against'` columns are already roughly on the same scale (both features are in the 1800-3000 range, roughly speaking).

But, while it won't help with our model's performance, we'll have you standardize our model's features here, to practice creating a Pipeline. Specifically, you'll build an `sklearn` Pipeline that does the following to predict `'W-L%'`:

- Standardizes `'Points For'` and `'Points Against'`.
- Fits a `LinearRegression` model.

Complete the implementation of the function `create_model_points_for_against_standardized`, which takes in a DataFrame like `teams` and returns a fit Pipeline that follows all of the steps above. Example behavior is given below.


```python
>>> model = create_model_points_for_against_standardized(teams)
>>> model
```

<div style="text-align: left;">
  <img src="imgs/base-pipe.png" width="200">
</div>

```python
>>> model[-1].coef_
array([ 0.16508223, -0.12354681]) # Different from before!

>>> model.predict([[2400, 2200]])[0]
0.6289662049554058 # Identical to before!
```

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

def create_model_points_for_against_standardized(teams):
    ...
    
# Feel free to change this input to make sure your function works correctly.
# Remember that your function should work on subsets of the teams DataFrame as well.
model = create_model_points_for_against_standardized(teams)
model.predict([[2400, 2200]])[0]

In [None]:
grader.check("q02_02")

Above, you should have noticed that both the original model and the standardized Pipeline model give the same predicted win-loss percentage for a team that scored 2400 points and had 2200 points scored against them. Again, standardizing _does not_ change a model's predictions. In the rest of this question, we'll treat our two models so far as one and the same model.
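If you'd like to convince yourself of this outside of `sklearn`, here is a minimal sketch that fits least squares directly with `np.linalg.lstsq`. The data is synthetic (random stand-ins on roughly the same scale as the points columns, not the real `teams` data), and `fit_predict` is a hypothetical helper written just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two features on a similar scale (not the real data).
X = rng.uniform(1800, 3000, size=(50, 2))
y = 0.45 + 0.0008 * X[:, 0] - 0.0008 * X[:, 1] + rng.normal(0, 0.05, 50)

def fit_predict(features, target):
    """Least squares with an intercept; returns (feature coefficients, predictions)."""
    design = np.column_stack([np.ones(len(features)), features])
    w, *_ = np.linalg.lstsq(design, target, rcond=None)
    return w[1:], design @ w

# Standardize each column: subtract its mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

coef_raw, preds_raw = fit_predict(X, y)
coef_std, preds_std = fit_predict(Z, y)

print(np.allclose(preds_raw, preds_std))            # predictions agree
print(np.allclose(coef_std, coef_raw * X.std(axis=0)))  # coefficients are rescaled
```

The predictions match because standardizing is an invertible linear change of each feature, so both design matrices span the same set of possible predictions. Only the coefficients change: each standardized coefficient is the original coefficient multiplied by that feature's standard deviation.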

How well is our model so far performing? It's useful to compute some performance metrics that we can refer back to later on. Once you finish Question 2.2, run the cell below.

In [None]:
# Run this cell, and don't edit it.

from sklearn.metrics import mean_squared_error

def model_performances(model, X, y):
    rmse_100 = 100 * np.sqrt(mean_squared_error(y, model.predict(X)))
    r2 = model.score(X, y)
    return pd.Series({'100 * RMSE': rmse_100, 'R^2': r2})

model = create_model_points_for_against_standardized(teams)

perf = pd.DataFrame(columns=['100 * RMSE', 'R^2'])
perf.loc['Points For + Points Against'] = model_performances(
    model, teams[['Points For', 'Points Against']], teams['W-L%']
)
perf

We've computed two metrics above:
- **100 * RMSE**: This is the square root of our model's mean squared error, multiplied by 100. Since our model predicts win percentages – or more precisely, win **proportions** – the square root of its mean squared error can be interpreted, **roughly**, as the average difference between a team's actual win proportion and our prediction of its win proportion. (Remember that a model's mean absolute error is _different_ from the square root of its mean squared error, though abstractly, these are both similar measurements of the model's quality.) If we multiply the model's root mean squared error by 100, we get, roughly, the average number of **percentage points** our predictions were wrong by. Here, it seems that a typical predicted win percentage was off by 6.43 percentage points.
- **R^2**: This metric, computed by calling `model.score` on any fit regression model, is called the **multiple $R^2$** coefficient (among [other things](https://en.wikipedia.org/wiki/Coefficient_of_determination)). It is a measure of the quality of a model's predictions, or more precisely, a measure of what proportion of the $y$-variable (`'W-L%'`)'s variance our model captured. It ranges between 0 and 1, where 0 means we captured none of the $y$-variable's variance (<span style="color:orange"><b>bad</b></span>) and 1 means we captured all of the $y$-variable's variance (<span style="color:blue"><b>good</b></span>). There are a few equivalent ways of computing it, assuming we're dealing with a linear model with an intercept term:

    1. $R^2 = \frac{\text{variance}(\text{predicted } y \text{ values})}{\text{variance}(\text{actual } y \text{ values})}$
    1. $R^2 = \left[ r(\text{actual } y \text { values}, \text{predicted } y \text { values}) \right]^2$, where $r$ denotes the correlation coefficient from Lecture 12.

In [None]:
# Same value as above!
np.var(model.predict(teams[['Points For', 'Points Against']])) / np.var(teams['W-L%'])
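
The correlation-based formula can be checked the same way. The sketch below uses synthetic data (not `teams`) and a separate variable, `toy_model`, so that it doesn't overwrite `model`. It assumes a linear model with an intercept, evaluated on the same data it was fit on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(size=100)

toy_model = LinearRegression().fit(X_toy, y_toy)
toy_preds = toy_model.predict(X_toy)

r2_from_score = toy_model.score(X_toy, y_toy)
r2_from_variances = np.var(toy_preds) / np.var(y_toy)
r2_from_correlation = np.corrcoef(y_toy, toy_preds)[0, 1] ** 2

# All three values should match, up to floating-point error.
print(r2_from_score, r2_from_variances, r2_from_correlation)
```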

Let's keep these metrics in mind as we build more sophisticated models to predict win-loss percentage.

### Question 2.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

In the basketball world, two common "advanced stats" that analysts use to rank teams are **offensive efficiency** and **defensive efficiency**. We define them below, **using variable names that correspond to column names in `teams`**.

$$\text{Offensive Efficiency}_i = 100 \cdot \frac{\text{Points For}_i}{0.96 \cdot \text{FGA}_i + \text{TOV}_i + 0.44 \cdot \text{FTA}_i - \text{ORB}_i}$$

$$\text{Defensive Efficiency}_i = 100 \cdot \frac{\text{Points Against}_i}{0.96 \cdot \text{FGA}_i + \text{TOV}_i + 0.44 \cdot \text{FTA}_i - \text{ORB}_i}$$

Large offensive efficiency values mean a team is good at scoring points. Small defensive efficiency values mean a team is good at preventing their opponents from scoring points. Perhaps there's an argument to be made that we can make better predictions if we use offensive and defensive efficiency, rather than just the number of points a team scored and had scored against them. Let's try it out!
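
As a quick sanity check of the arithmetic, here is a sketch that evaluates both formulas for a single hypothetical team, using plain Python numbers rather than a DataFrame. The function names are made up for illustration, and this is not the Pipeline-based approach the question asks for.

```python
def efficiency_denominator(fga, tov, fta, orb):
    # The denominator shared by both formulas above.
    return 0.96 * fga + tov + 0.44 * fta - orb

def offensive_efficiency(points_for, fga, tov, fta, orb):
    return 100 * points_for / efficiency_denominator(fga, tov, fta, orb)

def defensive_efficiency(points_against, fga, tov, fta, orb):
    return 100 * points_against / efficiency_denominator(fga, tov, fta, orb)

# A hypothetical team: 2400 points for, 2200 points against.
off_eff = offensive_efficiency(2400, fga=1998, tov=500, fta=700, orb=300)
def_eff = defensive_efficiency(2200, fga=1998, tov=500, fta=700, orb=300)
print(off_eff, def_eff)
```

Since both statistics share a denominator, a team that scores more points than it allows will always have a higher offensive efficiency than defensive efficiency.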

Your job here is to build an `sklearn` Pipeline that does the following to predict `'W-L%'`:

- Takes the existing features in the `teams` DataFrame and creates two new features, one for offensive efficiency and one for defensive efficiency.
- One hot encodes the `'Qualified'` column, with `drop='first'`.
- Fits a `LinearRegression` model.

Our final hypothesis function will then look like:

$$\text{pred. W-L%}_i = w_0 + w_1 \cdot \text{Offensive Efficiency}_i + w_2 \cdot \text{Defensive Efficiency}_i + w_3 \cdot (\text{Qualified}_i == \text{True})$$

<center><small><small>The last feature could also be $(\text{Qualified}_i == \text{False})$; <code>sklearn</code> will decide.</small></small></center>

Complete the implementation of the function `create_model_advanced`, which takes in a DataFrame like `teams` and returns a fit Pipeline that follows all of the steps above. Example behavior is given below.

```python
>>> model = create_model_advanced(teams)
>>> model
```

<img src="imgs/pipe-complex.png" width=500>

```python
# Three coefficients: 
# - one for offensive efficiency.
# - one for defensive efficiency.
# - one for the one hot encoded 'Qualified' column.
>>> model[-1].coef_
array([ 0.01484763, -0.0168983 ,  0.03888132])

>>> model.predict(pd.DataFrame([{
    'Points For': 2400,
    'Points Against': 2200,
    'FGA': 1998,
    'TOV': 500,
    'FTA': 700,
    'ORB': 300,
    'Qualified': False
}]))[0]
0.6343109351729046
```



Some guidance:
- All transformations should be done within the Pipeline – you **cannot** preprocess the training data using vanilla `pandas` before creating your Pipeline!
    - Specifically, you're **not** supposed to make an `'Offensive Efficiency'` column directly within `teams` before fitting a model. If you do that, your Pipeline won't behave like our example above. Your Pipeline needs to be able to take in **original, raw data** (without `'Offensive Efficiency'` values) and use them for prediction.
    - So, for instance, to create a column of offensive efficiency values **within** your Pipeline, you should create a `FunctionTransformer`! 
        - That `FunctionTransformer` will take in a function, say `f`, as input.
        - `f` itself will take in a DataFrame like `teams`, and return a **DataFrame with just a single column** of offensive efficiency values. (This is something we mentioned in [Lecture 17](https://practicaldsc.org/resources/lectures/lec17/lec17-annotated.pdf#page=18).)
        - Then, `FunctionTransformer(f)` will be given to `make_column_transformer`, along with a list of column names that are needed for `f` to work (including `'Points For'`, `'FGA'`, etc.).
    - You'll make a **separate** `FunctionTransformer` to create defensive efficiency values.
    - Remember, a `ColumnTransformer` – which you can create easily using `make_column_transformer` – is how you specify which transformations you want applied to which columns. 
    - The tests assume that you add your offensive efficiency transformer to your `ColumnTransformer` **before** your defensive efficiency transformer.
    - It's okay if the graphical representation of your Pipeline isn't exactly the same as ours.
- Remember to use `drop='first'` when using `OneHotEncoder` to avoid multicollinearity.
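
To illustrate the mechanics described above on something unrelated to basketball, here is a minimal sketch with a toy dataset. Every name in it (`trips`, `speed`, `'distance'`, `'time'`, `'is_express'`, `fare`) is made up, and it is only one possible way to wire these pieces together.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Toy data with hypothetical columns; the target is an exact linear
# function of speed and express status, so the fit is easy to verify.
trips = pd.DataFrame({
    'distance': [100, 200, 150, 300, 120, 240],
    'time': [2.0, 2.5, 3.0, 4.0, 1.5, 3.0],
    'is_express': [False, True, False, True, True, False],
})
fare = 1 + 2 * (trips['distance'] / trips['time']) + 5 * trips['is_express']

def speed(df):
    # df contains only the columns listed for this transformer below.
    # Return a DataFrame with a single column.
    return (df['distance'] / df['time']).to_frame('speed')

preprocessor = make_column_transformer(
    (FunctionTransformer(speed), ['distance', 'time']),
    (OneHotEncoder(drop='first'), ['is_express']),
)
toy_model = make_pipeline(preprocessor, LinearRegression())
toy_model.fit(trips, fare)

# The fit Pipeline accepts raw data; the 'speed' feature is built inside it.
new_trip = pd.DataFrame([{'distance': 90, 'time': 1.5, 'is_express': True}])
toy_pred = toy_model.predict(new_trip)[0]
print(toy_model[-1].coef_, toy_pred)
```

The order of the tuples passed to `make_column_transformer` determines the order of the output columns, and hence the order of the coefficients in `toy_model[-1].coef_`.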

In [None]:
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

def create_model_advanced(teams):
    # Create your transformers here.
    ...
    
    # Make your Pipeline here.
    model = ...
    
    # Don't change the line below.
    # We're ensuring that your model is only fit using the columns that are actually needed,
    # so that when we call model.predict, we only need to specify these values as inputs.
    feature_cols = ['Points For', 'Points Against', 'FGA', 'TOV', 'FTA', 'ORB', 'Qualified']
    return model.fit(teams[feature_cols], teams['W-L%'])
    
# Feel free to change this input to make sure your function works correctly.
# Remember that your function should work on subsets of the teams DataFrame as well.
model = create_model_advanced(teams)
model.predict(pd.DataFrame([{
    'Points For': 2400,
    'Points Against': 2200,
    'FGA': 1998,
    'TOV': 500,
    'FTA': 700,
    'ORB': 300,
    'Qualified': False
}]))[0]

In [None]:
grader.check("q02_03")

Nice! Let's see how our shiny new model compares to our original model.

In [None]:
feature_cols = ['Points For', 'Points Against', 'FGA', 'TOV', 'FTA', 'ORB', 'Qualified']

model = create_model_advanced(teams)

perf.loc['With OffEff, DefEff, and Qualifying Status'] = model_performances(
    model,
    teams[feature_cols],
    teams['W-L%']
)
perf

You should notice that the predictions aren't significantly better! This is a good life lesson: engineering more sophisticated features _can_ lead to better model performance, but that isn't guaranteed.

To wrap up, let's use your model to look at how the offensive and defensive efficiencies of the teams in our dataset are related. If you've implemented `create_model_advanced` correctly, you should see a matrix with 3 columns below. You'll know that your columns are in the right order if the top-left value is `103.42113452`.

In [None]:
transformed = model[0].transform(teams[feature_cols])
transformed

If that looks good, run the cell below.

In [None]:
fig = px.scatter(teams, x='Points For', y='Points Against', hover_name='School', color='Qualified', 
                 )
fig.update_layout(width=800)

fig = px.scatter(x=transformed[:, 0], 
                 y=transformed[:, 1], 
                 hover_name=teams['School'], 
                 color=teams['Qualified'],
                 color_discrete_map={True: '#D81B60', False: '#1E88E5'})

fig.update_layout(width=800, title='Defensive Efficiency vs. Offensive Efficiency for All 364 D1 Teams',
                  xaxis_title='Offensive Efficiency',
                  yaxis_title='Defensive Efficiency',
                  legend={'title': 'Qualified for NCAA Tournament'})

Can you find Michigan 〽️? If you did everything correctly, Michigan should appear around (115, 105).

Best of luck to your bracket!

## Question 3: Sums of Residuals 🤔

---

In this problem, we will prove that the sum of the residuals of a fit regression model is 0.

We define the $i$th **residual** to be the difference between the actual and predicted values for individual $i$ in our dataset, when the predictions are made using a regression model whose coefficients $w_0^*$ and $w_1^*$ (or, for multiple linear regression models, $w_0^*$, $w_1^*$, $w_2^*$, ..., $w_d^*$) are all optimal.

In other words, the $i$th residual $e_i$ is: $$e_i = y_i - H^*(\vec x_i)$$

We use the letter $e$ for residuals because residuals are also known as errors.
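
Before proving the claim, it can help to see it numerically. The sketch below fits a model with an intercept to synthetic data and sums the residuals. It is a sanity check on made-up data, not a proof.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 3))
y_toy = 4.0 + X_toy @ np.array([1.5, -2.0, 0.7]) + rng.normal(size=200)

# LinearRegression includes an intercept term by default.
toy_model = LinearRegression().fit(X_toy, y_toy)
residuals = y_toy - toy_model.predict(X_toy)

print(residuals.sum())  # extremely close to 0, up to floating-point error
```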

We'll get to the proof soon, but first, a warmup.

<!-- BEGIN QUESTION -->

### Question 3.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>
Suppose $\vec{1} \in \mathbb{R}^n$ is a vector containing the value 1 for each element, i.e. $\vec{1} = \begin{bmatrix} 1 \\ 1 \\ ... \\ 1 \end{bmatrix}$.

For any other vector $\vec{b} = \begin{bmatrix} b_1 \\ b_2 \\ ... \\ b_n \end{bmatrix}$, what is the value of $\vec{1}^T \vec{b}$, i.e. what is the dot product of $\vec{1}$ and $\vec{b}$?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
Back to the main problem at hand.

Consider the typical multiple regression scenario where our hypothesis function has an intercept term:
$$H(\vec{x}_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}$$

Note that another way of writing the $i$th residual, $e_i = y_i - H^*(\vec x_i)$, is:

$$e_i = (\vec{y} - X \vec{w}^*)_i$$

Here, $X$ is a $n \times (d + 1)$ design matrix, $\vec{y} \in \mathbb{R}^n$ is an observation vector, and $\vec{w} \in \mathbb{R}^{(d+1)}$ is the parameter vector. We'll use $\vec{w}^*$ to denote the optimal parameter vector, or the one that satisfies the normal equations. $(\vec{y} - X \vec{w}^*)_i$ is referring to element $i$ of the vector $\vec{y} - X \vec{w}^*$.

Using facts about $\vec{w}^*$ we learned in Lectures 14 and 15, prove that for multiple linear regression models with an intercept term, the sum of the residuals is 0. That is, prove that: $$\sum_{i = 1}^n e_i = 0$$

*Hint: Refer to the derivation of $\vec w^*$ in Lecture 14. How did we define $X$? Your proof should not be very long.*

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.3 [Written ✏️]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
Now suppose our hypothesis function does not have an intercept term, but is otherwise linear with multiple features: $$H(\vec x_i) = w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_d x_i^{(d)}$$

- Is it still guaranteed that $\displaystyle\sum_{i = 1}^n e_i = 0$? Why or why not?
- Is it still possible that $\displaystyle\sum_{i = 1}^n e_i = 0$? If you believe the answer is yes, come up with a simple example where a linear hypothesis function without an intercept has residuals that sum to 0. If you believe the answer is no, state why not.

<!-- END QUESTION -->

## Question 4: Real Estate 🏡

---

You are given a dataset containing information on recently sold houses in Ann Arbor, including:

- square footage
- number of bedrooms
- number of bathrooms
- year the house was built
- asking price, or how much the house was originally listed for, before negotiations
- sale price, or how much the house actually sold for, after negotiations

The table below shows the first few rows of the data set. Note that since you don't have the full dataset, you cannot answer the questions that follow based on calculations; you must answer conceptually.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>House</th>
      <th>Square Feet</th>
      <th>Bedrooms</th>
      <th>Bathrooms</th>
      <th>Year</th>
      <th>Asking Price</th>
      <th>Sale Price</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>1247</td>
      <td>3</td>
      <td>3</td>
      <td>2005</td>
      <td>500,000</td>
      <td>494,000</td>
    </tr>
    <tr>
      <td>2</td>
      <td>1670</td>
      <td>3</td>
      <td>2</td>
      <td>1927</td>
      <td>1,000,000</td>
      <td>985,000</td>
    </tr>
    <tr>
      <td>3</td>
      <td>716</td>
      <td>1</td>
      <td>1</td>
      <td>1993</td>
      <td>335,000</td>
      <td>333,850</td>
    </tr>
    <tr>
      <td>4</td>
      <td>1600</td>
      <td>4</td>
      <td>2</td>
      <td>1962</td>
      <td>830,000</td>
      <td>815,000</td>
    </tr>
    <tr>
      <td>5</td>
      <td>2635</td>
      <td>4</td>
      <td>3</td>
      <td>1993</td>
      <td>1,250,000</td>
      <td>1,250,000</td>
    </tr>
    <tr>
      <td>&#8943;</td> <!-- ellipsis -->
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
    </tr>
  </tbody>
</table>


<!-- BEGIN QUESTION -->

### Question 4.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
First, suppose we fit a multiple linear regression model to predict the sale price of a house given all five of the other variables.  Which feature would you expect to have the largest magnitude weight? Why? (Remember that the weight of a feature is the value of $w^*$ for that feature.)

Then, suppose we standardize each variable separately. (Recall, to standardize a column $x_1, x_2, ..., x_n$, we replace each value $x_i$ with $\frac{x_i - \bar{x}}{\sigma_x}$.) Suppose we fit another multiple linear regression model to predict the sale price of a house given all five of the other standardized variables. Now, which feature would you expect to have the largest magnitude weight? Why?

Some guidance: There _could_ be multiple answers to one of the parts; if that's the case, you only need to list and justify one possible answer.
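
As a reminder of what standardizing does, here is a small sketch that applies the formula above to the five Square Feet values shown in the table. It assumes $\sigma_x$ is the population standard deviation, which is `numpy`'s default, and it is only meant to illustrate the transformation, not to answer the question.

```python
import numpy as np

# The five Square Feet values visible in the table above.
square_feet = np.array([1247, 1670, 716, 1600, 2635], dtype=float)

# (x_i - mean) / SD, using the population SD (np.std's default, ddof=0).
standardized = (square_feet - square_feet.mean()) / square_feet.std()

print(standardized)
print(standardized.mean(), standardized.std())  # approximately 0 and 1
```

After standardizing, every column is measured in the same units (standard deviations from its own mean), regardless of its original scale.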

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Suppose we fit a multiple linear regression model to predict the sale price of a house given all five of the other variables in their original, unstandardized form. Suppose the weight for the Year feature is $\alpha$.

Now, suppose we replace Year with a new feature, Age, which is 0 if the house was built in 2025, 1 if the house was built in 2024, 2 if the house was built in 2023, and so on. If we fit a new multiple linear regression model on all five variables, but using Age instead of Year, what will the weight for the Age feature be, in terms of $\alpha$?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Now, suppose we fit a multiple linear regression model to predict the sale price of a house given all five of the other features, plus a new sixth feature named $\text{Rooms}$, which is the total number of bedrooms and bathrooms in the house. Will our new regression model with an added sixth feature make better predictions than the models we fit in Questions 4.1 or 4.2?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Now, suppose we fit two multiple linear regression models to predict the sale price of a house.

The first uses the features $\text{Square Feet}$ and $\text{Bathrooms}$:

$$H_\gamma(\vec x_i) = \gamma_0 + \gamma_1 \cdot \text{Square Feet}_i +\gamma_2 \cdot \text{Bathrooms}_i$$

The second model uses the features $\text{Square Feet}$ and $\text{Bathrooms}$ and a new seventh feature named $\text{Length of Street Name}$, which is the number of letters in the name of the street that the house is on:

$$H_\lambda(\vec x_i) = \lambda_0 + \lambda_1 \cdot \text{Square Feet}_i +\lambda_2 \cdot \text{Bathrooms}_i + \lambda_3 \cdot \text{Length of Street Name}_i$$

Let $\text{TMSE}$ refer to the "training" mean squared error, that is, the mean squared error of a hypothesis function **on the same dataset we used to fit it**. (Through Lecture 16, we just referred to this idea as MSE.)

Argue why $\text{TMSE}(H_\lambda) \leq \text{TMSE}(H_\gamma)$.

<!-- END QUESTION -->

## Finish Line 🏁

Congratulations! You're ready to submit Homework 8. **Remember, you need to submit Homework 8 twice**:

### To submit the manually graded problems (Questions 1.4, 3-4; marked [Written ✏️])

- Make sure your answers **are not** in this notebook, but rather in a separate PDF.
    - You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in.
- Submit this separate PDF to the **Homework 8 (Questions 1.4, 3-4; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

### To submit the autograded problems (Questions 1.1-1.3, 2; marked [Autograded 💻])

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under "Homework 8 (Questions 1.1-1.3, 2; autograded problems)".
5. Stick around while the Gradescope autograder grades your work.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

Your Homework 8 submission time will be the **later** of your two individual submissions.