In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw07.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 7

# Multiple Linear Regression

### EECS 398: Practical Data Science, Winter 2025

#### Due Tuesday, March 18th at 11:59PM
    
</div>

## Instructions

Welcome to Homework 7! In this homework, you'll develop a deeper understanding of the inner workings of linear regression through the lens of linear algebra. The concepts in this homework are all _crucial_ to modern machine learning.

You are given 8 slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/wn25/). The [Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps.

<div class="alert alert-warning">

### Submission
    
This homework features a mix of autograded programming questions and manually-graded questions.
    
- Questions 1-2 are **manually graded**, and say **[Written ✏️]** in their titles. For these questions, **do not write your answers in this notebook**! Instead, write **all** of your answers in a separate PDF. Submit this separate PDF to the **Homework 7 (Questions 1-2; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**! Make sure to show your work for all written questions, as answers without work shown may not receive full credit.
  
- Questions 3-5 are **fully autograded**, and each part will say **[Autograded 💻]** in the title. For these questions, all you need to do is write your code in this notebook, run the local `grader.check` tests, and submit to the **Homework 7 (Questions 3-5; autograded problems)** assignment on Gradescope to have your code graded by the autograder.

    
Your Homework 7 submission time will be the **later** of your two individual submissions. Please start early and submit often. You can submit as many times as you'd like to Gradescope, and we'll take your **most recent** submission.
</div>

This homework is worth a total of **60 points**, 37 of which come from the autograder (Questions 3-5) and 23 of which are manually graded by us (Questions 1-2). The number of points each question is worth is listed at the start of each question. **All questions in the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents. You can also view a static version of this homework notebook [**at this link**](https://practicaldsc.org/resources/homeworks/hw07/hw07.html).

To get started, run the cell below, plus the cell at the top of the notebook that imports and initializes `otter`.

In [None]:
import pandas as pd
import numpy as np

import plotly
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center")
    )
)
pio.templates.default = "simple_white+pds"
pd.options.plotting.backend = 'plotly'

import warnings
warnings.simplefilter('ignore')

## Question 1: Correlation Bounds 📈

---

In class, you were told that the correlation coefficient, $r$, ranges between $-1$ and $1$, where $r = -1$ implies a perfect negative linear association and $r = 1$ implies a perfect positive linear association. However, you were never given a proof of the fact that $-1 \leq r \leq 1$.

Here, you will prove this fact, using linear algebra. Before proceeding, you'll want to review 
[slide 47 onwards in Lecture 12](https://practicaldsc.org/resources/lectures/lec12/lec12-filled.pdf#page=47).
**Remember to show your work all throughout!**

<!-- BEGIN QUESTION -->

### Question 1.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Let $\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix}$. We define the "mean-centered" version of $\vec{x}$ to be $\vec{x_c} = \begin{bmatrix} x_1 - \bar{x} \\ x_2 - \bar{x} \\ \vdots \\ x_n - \bar{x}\end{bmatrix}$, where $\bar{x}$ is the mean of the components of $\vec{x}$.

The mean-centered version of $\vec{y}$, named $\vec{y_c}$, is defined similarly. Express $\vec{x_c} \cdot \vec{y_c}$ using summation notation.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
Prove that: $$r=\frac{\vec{x_c}\cdot \vec{y_c}}{\lVert \vec{x_c} \rVert \lVert \vec{y_c} \rVert}$$

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div> 
**Explain** why the result in Question 1.2 implies that $-1 \leq r \leq 1$.
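
As a quick numerical sanity check (not a replacement for the proof), you can compare the right-hand side of the identity in Question 1.2 against `np.corrcoef` on a small dataset. The arrays `x` and `y` below are arbitrary illustrative values, and the variable names are our own.

```python
import numpy as np

# Arbitrary illustrative data (an assumption, chosen only for this check).
x = np.array([1.0, 2.0, -1.0, 4.0])
y = np.array([15.0, 6.0, 7.0, 8.0])

# Mean-centered versions of x and y.
x_c = x - x.mean()
y_c = y - y.mean()

# Right-hand side of the identity in Question 1.2.
r_from_vectors = (x_c @ y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c))

# numpy's built-in correlation coefficient, for comparison.
r_from_numpy = np.corrcoef(x, y)[0, 1]

print(r_from_vectors, r_from_numpy)
```

The two printed values should agree up to floating-point error. Agreement on one dataset is evidence, not a proof.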

<!-- END QUESTION -->

## Question 2: Same, but Different 😱

---

In Lecture 12, we were introduced to one of many formulas for the optimal slope, $w_1^*$, and optimal intercept, $w_0^*$, for the simple linear regression model $H(x_i) = w_0 + w_1 x_i$ when using squared loss:

$$w_1^* = r \frac{\sigma_{y}}{\sigma_{x}} \qquad w_0^* = \bar y - w_1^* \bar x$$

Then, in Lectures 14 and 15, we revisited the simple linear regression model in terms of linear algebra. When $X \in \mathbb{R}^{n \times 2}$ is the design matrix, $\vec{y} \in \mathbb{R}^n$ is the observation vector, and $\vec{w} \in \mathbb{R}^2$ is the parameter vector, we found that the optimal parameter vector $\vec{w}^*$ is one that satisfies the normal equations:

$$X^TX \vec{w} = X^T \vec y$$

When $X^TX$ is invertible, $\vec{w}^*$ can be expressed as:

$$\vec{w}^* = (X^TX)^{-1}X^T \vec y$$

In this problem, we will prove that both of these formulations are equivalent, for any dataset $(x_1, y_1)$, $(x_2, y_2)$, ..., $(x_n, y_n)$. Specifically, we'll show that the vector $\vec{w}^* = (X^TX)^{-1}X^T \vec y$ has two components, the first of which is $w_0^*$ and the second of which is $w_1^*$. (To do this, we'll need to assume that $X^TX$ is invertible, which it is in this setting whenever the $x_i$ are not all equal.)

Note that at first glance, this problem looks quite long, since it has seven subparts. However, the subparts are meant to guide you through the proof. The problem would take much longer if we just said "prove it!"
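
Before starting the proof, it can help to see the claim hold numerically. The sketch below assumes the design matrix has a column of ones followed by the column of $x_i$ values, uses arbitrary illustrative data, and relies on `np.std` dividing by $n$ by default. It checks one example and doesn't replace the written proof.

```python
import numpy as np

# Arbitrary illustrative data (an assumption, chosen only for this check).
x = np.array([1.0, 2.0, -1.0, 4.0])
y = np.array([15.0, 6.0, 7.0, 8.0])

# Design matrix: a column of ones, then the x column (assumed layout).
X = np.column_stack([np.ones(len(x)), x])

# Solve the normal equations X^T X w = X^T y for w.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Formulas from Lecture 12; np.std divides by n by default.
r = np.corrcoef(x, y)[0, 1]
w1 = r * y.std() / x.std()
w0 = y.mean() - w1 * x.mean()

print(w_normal, (w0, w1))
```

Both approaches should print the same intercept and slope, up to floating-point error.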

<!-- BEGIN QUESTION -->

### Question 2.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Throughout this question, assume $X$ refers to the $\mathbb{R}^{n \times 2}$ design matrix and $\vec y \in \mathbb{R}^n$ refers to the observation vector, as defined on [slide 29 of Lecture 14](https://practicaldsc.org/resources/lectures/lec14/lec14-filled.pdf#page=29).

Express the vector $X^T \vec y$ using constants and/or summations involving $x_i$ and/or $y_i$. Make sure that your answer has the correct dimensions.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>
Express the matrix $X^TX$ using constants and/or summations involving $x_i$ and/or $y_i$. Make sure that your answer has the correct dimensions.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>
Recall, if $M = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is a $2 \times 2$ matrix, then the inverse of $M$ is given by:

$$M^{-1} = \frac{1}{ad - bc} \begin{bmatrix} 
 d & -b \\ -c & a\end{bmatrix}$$

Express the matrix $(X^TX)^{-1}$ using constants and/or summations involving $x_i$ and/or $y_i$.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>
At this point, the expressions you have for $(X^TX)^{-1}$ and $X^T \vec{y}$ likely involve many summations and look... complicated. Let's take a step back and simplify things before we proceed.

Prove that:

$$\sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2$$

<!-- % $n^2 \sigma_x^2 = n \sum_{i = 1}^n x_i^2 - n^2 \bar{x}^2$ -->

where $\bar{x}$ and $\sigma_x$ are the mean and standard deviation of $x_1, x_2, ..., x_n$, respectively.

Some guidance: Start with the definition of $\sigma_x^2 = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2$ and expand the sum.
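
If you'd like to convince yourself the identity is plausible before proving it, here is a small numerical check on arbitrary illustrative values. It relies on `np.var` dividing by $n$ by default, which matches the definition of $\sigma_x^2$ above.

```python
import numpy as np

# Arbitrary illustrative values (an assumption, chosen only for this check).
x = np.array([1.0, 2.0, -1.0, 4.0])
n = len(x)

# Left-hand side: the sum of squared x values.
lhs = np.sum(x ** 2)

# Right-hand side: n * variance + n * mean^2 (np.var divides by n by default).
rhs = n * np.var(x) + n * x.mean() ** 2

print(lhs, rhs)
```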

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.5 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>
Using your work in Questions 2.3 and 2.4, prove that:

$$(X^TX)^{-1} = \frac{1}{n\sigma_x^2} \begin{bmatrix} \sigma_x^2 + \bar{x}^2 & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix}$$

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.6 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

$(X^TX)^{-1}$ is about as simplified as it can be for now. But, before we multiply $(X^TX)^{-1}$ and $X^T \vec{y}$, we should deal with the fact that at least one of the components in the vector $X^T \vec{y}$ still involves a summation.

Prove that:

$$\sum_{i=1}^n x_i y_i = nr \sigma_x \sigma_y + n \bar{x}\bar{y}$$
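
As with the earlier identity, a quick numerical check on arbitrary illustrative data can confirm you're proving the right thing. The sketch assumes population standard deviations (dividing by $n$), which is what `np.std` computes by default.

```python
import numpy as np

# Arbitrary illustrative data (an assumption, chosen only for this check).
x = np.array([1.0, 2.0, -1.0, 4.0])
y = np.array([15.0, 6.0, 7.0, 8.0])
n = len(x)

r = np.corrcoef(x, y)[0, 1]

# Left-hand side: the sum of products x_i * y_i.
lhs = np.sum(x * y)

# Right-hand side: n * r * sigma_x * sigma_y + n * mean(x) * mean(y).
rhs = n * r * x.std() * y.std() + n * x.mean() * y.mean()

print(lhs, rhs)
```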

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.7 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>
Now, put it all together. That is, prove that:

$$(X^TX)^{-1}X^T \vec{y} = \begin{bmatrix} \bar{y} - r \frac{\sigma_y}{\sigma_x} \bar{x} \\ r \frac{\sigma_y}{\sigma_x}  \end{bmatrix}$$

Note that the second component of the vector above is $w_1^* =  r \frac{\sigma_y}{\sigma_x}$ and the first component of the vector above is $w_0^* = \bar{y} -  r \frac{\sigma_y}{\sigma_x} \bar{x} = \bar{y} - w_1^* \bar{x}$, as we first saw in Lecture 12! This concludes our proof that both formulations of the optimal parameters of the simple linear regression model are equivalent.

<!-- END QUESTION -->

## Question 3: Simple LADs 🧍

---

In lecture, we explored simple linear regression, and defined it as the problem of finding the values of $w_0$ (intercept) and $w_1$ (slope) that minimize mean squared error:
\begin{align*}
R_{\text{sq}}(w_0, w_1) &= \frac{1}{n} \sum_{i=1}^{n} (y_i -(w_0 + w_1x_i))^2
\end{align*}

The optimal slope and intercept were denoted $w_1^*$ and $w_0^*$, respectively, and have closed-form solutions that we derived in lecture (and even restated in Question 2: Same, but Different 😱). When using squared loss to find our optimal parameters, linear regression is often called "least squares regression." 

**What if we used a different loss function instead?**

In this question, we'll implement another type of linear regression: simple least absolute deviation (LAD) regression. LAD regression uses absolute loss to measure the quality of predictions, rather than squared loss. Put another way, to find the optimal slope $w_1^*$ and intercept $w_0^*$ for LAD regression, we minimize mean absolute error:

\begin{align*}
R_{\text{abs}}(w_0, w_1) &= \frac{1}{n} \sum_{i=1}^{n} |y_i -(w_0 + w_1x_i)|
\end{align*}

The "simple" in "simple LAD" refers to the fact that our hypothesis function $H(x_i) = w_0 + w_1 x_i$, like in regular simple linear regression, only uses a single input feature.

Since the absolute value function is not differentiable everywhere (it has a corner at 0), we cannot just take partial derivatives of $R_{\text{abs}}$ with respect to $w_0$ and $w_1$, set them equal to zero, and solve for the values of $w_0$ and $w_1$, as we did to minimize $R_{\text{sq}}$.

In order to generate the optimal LAD regression line we are going to leverage the following theorem (which, luckily, we won't need to prove):

> The regression model that minimizes mean absolute error passes directly through at least $k$ points, where $k$ is the number of parameters of the model.

This theorem is useful to us because it allows us to adopt a very conceptually simple, albeit not very efficient, strategy to compute an optimal simple LAD regression line. Since our hypothesis function has $k = 2$ parameters, an intercept $w_0$ and a slope $w_1$, we can simply:

1. Generate all possible pairs of points. We know that the optimal LAD line will pass through at least one of these pairs.
1. For each pair of points:
    1. Find the equation of the line that passes through the pair. Denote the intercept and slope of this line $w_0$ and $w_1$, respectively.
    1. Compute the mean absolute error of the line with intercept $w_0$ and slope $w_1$, i.e. compute $R_\text{abs}(w_0, w_1)$.
1. Return the $(w_0, w_1)$ combination with the minimum value of $R_\text{abs}(w_0, w_1)$. By the above theorem, this line is guaranteed to minimize mean absolute error.

Notice that unlike with simple linear regression, the optimal simple LAD regression line may not be unique!
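
To build intuition for the theorem, consider the simplest case: a constant model $H(x_i) = h$, which has $k = 1$ parameter. The sketch below uses arbitrary illustrative values `y` and a helper we've named `mae_of_constant` (not part of the homework interface). It compares the best constant taken from the data points against the best constant on a fine grid. This only illustrates the theorem and is not the `SimpleLAD` implementation.

```python
import numpy as np

# Arbitrary illustrative values (an assumption, chosen only for this check).
y = np.array([15.0, 6.0, 7.0, 8.0])

def mae_of_constant(h):
    """Mean absolute error of always predicting the constant h."""
    return np.mean(np.abs(y - h))

# Candidates restricted to the data values themselves.
best_at_data_point = min(mae_of_constant(h) for h in y)

# Candidates from a fine grid covering the range of the data.
grid = np.linspace(y.min(), y.max(), 9001)
best_on_grid = min(mae_of_constant(h) for h in grid)

print(best_at_data_point, best_on_grid)
```

The grid search shouldn't find a constant that beats the best data point, which is what the theorem predicts for $k = 1$.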

In this question, you will ultimately complete the implementation of the `SimpleLAD` **class**, which can be used as follows:

```python
>>> model = SimpleLAD()
>>> model.fit([1, 2, -1, 4], [15, 6, 7, 8])
>>> model.intercept_
7.2
>>> model.coef_
0.2
>>> model.predict(5)
8.2
>>> model.predict([5, -3.5, 5])
array([8.2, 6.5, 8.2])
```

But first, we'll have you implement several standalone helper functions.

### Question 3.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `generate_pairs`, which takes in two 1D lists/arrays, `x` and `y`, where the points in our dataset are `(x[0], y[0])`, `(x[1], y[1])`, and so on. `generate_pairs` should return a **list** containing all unique **pairs** of points in `x` and `y`. Each pair in the returned list should be a tuple of tuples.

Example behavior is given below.

```python
>>> generate_pairs([1, 2, -1, 4], [15, 6, 7, 8])
[((1, 15), (2, 6)),
 ((1, 15), (-1, 7)),
 ((1, 15), (4, 8)),
 ((2, 6), (-1, 7)),
 ((2, 6), (4, 8)),
 ((-1, 7), (4, 8))]
```

For more context on the example above:
- There are four points in the dataset: $(1, 15), (2, 6), (-1, 7), (4, 8)$. In Python, we represent each point as a tuple.
- There are 6 pairs of points:

    $$(1, 15) \text { and } (2, 6)$$$$(1, 15) \text { and } (-1, 7)$$$$(1, 15) \text { and } (4, 8)$$$$(2, 6) \text{ and } (-1, 7)$$$$(2, 6) \text{ and } (4, 8)$$$$(-1, 7) \text { and } (4, 8)$$
    
    Note that we don't consider $(2, 6) \text{ and } (1, 15)$ to be a different pair than $(1, 15) \text { and } (2, 6)$. That is, order does not matter.
    
- We represent each pair as a tuple, e.g. the first pair in the list above is represented as `((1, 15), (2, 6))` (or `((2, 6), (1, 15))`; either is fine, but not both).
- The returned list contains 6 tuples of tuples.

You may assume that `len(x) == len(y) >= 2`. The order in which the resulting pairs appear in the returned list does not matter. If there are duplicated points, there may be duplicated pairs, and that's to be expected:

```python
>>> generate_pairs([1, 1, 1], [1, 1, 1])
[((1, 1), (1, 1)), ((1, 1), (1, 1)), ((1, 1), (1, 1))]
```
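
For reference, `itertools.combinations(iterable, 2)` yields every unordered pair of elements exactly once, in the order the elements appear. The list `letters` below is a made-up example, unrelated to the homework data.

```python
from itertools import combinations

# A made-up list, used only to show what combinations produces.
letters = ['a', 'b', 'c', 'd']

# Each pair appears once; ('b', 'a') is not produced separately from ('a', 'b').
pairs_of_letters = list(combinations(letters, 2))

print(pairs_of_letters)
print(len(pairs_of_letters))
```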

In [None]:
# This is a big hint! Using this, our solution only took 2 lines,
# and didn't use any for-loops or list comprehensions (but yours can).
from itertools import combinations 

def generate_pairs(x, y):
    ...

# Feel free to change this input to make sure your function works correctly.
generate_pairs([1, 2, -1, 4], [15, 6, 7, 8])

In [None]:
grader.check("q03_01")

### Question 3.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `generate_lines`. 
- `generate_lines` takes in a list, `pairs`, in which each element is a tuple. Each tuple is itself made up of two tuples, corresponding to a pair of points. The input to `generate_lines` may look like:

```python
    [((1, 2), (3, 7)), ((1, 10), (-4, 20))]
```

- `generate_lines` returns a list with the same length as `pairs`, in which each element is a tuple of the form `(intercept, slope)`. Element `i` of the returned list should be a tuple containing the intercept and slope of the line passing through the two points in `pairs[i]` (the order of the outputted lines should be the same as the order of the inputted pairs).

Example behavior is given below.

```python
>>> generate_lines([((1, 2), (3, 7)), ((1, 10), (-4, 20))])
[(-0.5, 2.5), (12.0, -2.0)]
```

For more context on the example above:
- The input to `generate_lines` contains two pairs of points.
- The first pair of points, $(1, 2) \text{ and } (3, 7)$, sit on the line $y = -0.5 + 2.5x$. The intercept of this line is -0.5 and the slope is 2.5, so the first returned tuple is `(-0.5, 2.5)`.
- The second pair of points, $(1, 10) \text{ and } (-4, 20)$, sit on the line $y = 12 - 2x$. The intercept of this line is 12 and the slope is -2, so the second returned tuple is `(12.0, -2.0)`.


Some guidance:
- A fact from high school algebra is that given any two distinct points, there is exactly one line that passes through them. You'll need to figure out how to programmatically find the intercept and slope of this line, given any two arbitrary points $(x_1, y_1)$ and $(x_2, y_2)$.
- There is theoretically the risk of a `ZeroDivisionError`, if a pair contains two points with the same $x$-coordinate. We won't test your code on such examples.
- You don't have to manually convert the values in the output tuples to floats – this will likely happen automatically because your calculations will involve division, and if it doesn't, don't worry about it.
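
As a worked instance of the first guidance point, the sketch below computes the intercept and slope for a single pair of points. The helper name `line_through` is hypothetical and not part of the required interface. It shows one possible way to organize the calculation, not the reference solution.

```python
def line_through(point_a, point_b):
    """Return (intercept, slope) of the line through two points with distinct x-values."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # rearranged from y1 = intercept + slope * x1
    return intercept, slope

print(line_through((1, 2), (3, 7)))
print(line_through((1, 10), (-4, 20)))
```

These two calls use the same pairs as the example above, so you can compare the printed tuples against it.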

In [None]:
def generate_lines(pairs):
    ...
    
# Feel free to change this input to make sure your function works correctly.
generate_lines([((1, 2), (3, 7)), ((1, 10), (-4, 20))])

In [None]:
grader.check("q03_02")

### Question 3.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `mae_of_candidate_line`, which takes in four inputs:
- `intercept`, a float,
- `slope`, a float,
- `x`, a 1D list/array of numbers, and
- `y`, a 1D list/array of numbers.

`mae_of_candidate_line` should return the mean absolute error from using the line with intercept `intercept` and slope `slope` to predict `y` from `x`.

Example behavior is given below.

```python
>>> mae_of_candidate_line(5, 2, [1, 2, -1, 4], [15, 6, 7, 8])
5.0
```

For more context on the example above:

- There are four points in the dataset provided: $(1, 15)$, $(2, 6)$, $(-1, 7)$, and $(4, 8)$.
- The line we're using to make predictions is $H(x_i) = 5 + 2x_i$. This line, and the four points above, are visualized below:

<center><img src="imgs/mae-example.png" width=400></center>

- The absolute errors of the line's predictions are 4, 8, 3, and 5. So, the mean of absolute errors is $\frac{4+8+3+5}{4} = 5$.

Don't use a `for`-loop!

In [None]:
def mae_of_candidate_line(intercept, slope, x, y):
    ...

# Feel free to change this input to make sure your function works correctly.  
mae_of_candidate_line(5, 2, [1, 2, -1, 4], [15, 6, 7, 8])

In [None]:
grader.check("q03_03")

### Question 3.4 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Now, put it all together. Complete the implementation of the `SimpleLAD` class, which has two methods, apart from the constructor.

#### `fit`

`fit` takes in two* 1D lists/arrays, `x` and `y`. Using the previously-defined helper functions, `fit` determines the intercept and slope that minimize mean absolute error on the dataset defined by `x` and `y`.
                
`fit` should not return anything, but should instead set the values of `self.intercept_` (the optimal intercept) and `self.coef_` (the optimal slope; we use the attribute name `coef_` instead of `slope_` to match `sklearn`'s naming conventions).

If there are multiple optimal combinations of intercepts and slopes, set `self.intercept_` and `self.coef_` to any one of those combinations.

*As you'll see in the method stub, `fit` takes in a third argument (at the start), named `self`. The role of the `self` argument is to be able to access attributes and methods of the current instance of the `SimpleLAD` class. Read [this article](https://www.geeksforgeeks.org/self-in-python-class/) for more information on the role of the `self` argument.

<br>

#### `predict`

`predict` takes in a single (non-`self`) input, named `x_new`, which can either be a single value or list/array of values.

- If `x_new` is a single value, `predict` should return a single value, corresponding to the predicted $y$-value for the passed in $x$-value, using the already-found `self.intercept_` and `self.coef_`.
- If `x_new` is a list or array, `predict` should return an **array** corresponding to the predicted $y$-values for the passed in $x$-values, using the already-found `self.intercept_` and `self.coef_`.

`fit` must be called before `predict`; if not, raise an `AttributeError`.
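
For context on why this requirement is easy to meet: reading an instance attribute that was never assigned already raises an `AttributeError` in Python. The class `Toy` below is a hypothetical stand-in used only to show that behavior; it is not `SimpleLAD`.

```python
class Toy:
    def fit(self):
        # The attribute only exists once fit has been called.
        self.intercept_ = 0.0

    def predict(self):
        # Raises AttributeError if fit has not run yet.
        return self.intercept_

toy = Toy()
try:
    toy.predict()
    raised = False
except AttributeError:
    raised = True

print(raised)

toy.fit()
print(toy.predict())
```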

<br>

Example behavior is given below.

```python
>>> model = SimpleLAD()
>>> model.fit([1, 2, -1, 4], [15, 6, 7, 8])
>>> model.intercept_
7.2
>>> model.coef_
0.2
>>> model.predict(5)
8.2
>>> model.predict([5, -3.5, 5])
array([8.2, 6.5, 8.2])
```

For more context on the example above:

- There are four points in the dataset provided: $(1, 15)$, $(2, 6)$, $(-1, 7)$, and $(4, 8)$.
- The helper functions `generate_pairs`, `generate_lines`, and `mae_of_candidate_line` helped us deduce that the line with the minimum mean absolute error on this dataset is $H^*(x_i) = 7.2 + 0.2x_i$, so `model.intercept_` is `7.2` and `model.coef_` is `0.2`.
- Using the fit hypothesis function $H^*(x_i) = 7.2 + 0.2x_i$ on the inputs 5, -3.5, and 5 gives us the predictions $H(5) = 8.2$, $H(-3.5) = 6.5$, and $H(5) = 8.2$, so we return an array with those three values. (Note that we return an array even though the inputs were provided as a list.) When using this hypothesis function on the single input 5, we return just the value $H(5) = 8.2$, not as an array.

In [None]:
class SimpleLAD:
    
    def __init__(self):
        """
        __init__ is the name given to the constructor method in a Python class.
        We don't need to do anything to initialize a SimpleLAD object, so this constructor
        doesn't actually do anything.
        """
        pass
    
    def fit(self, x, y):
        if len(x) != len(y):
            raise ValueError(f'Dimension mismatch: x has length {len(x)} while y has length {len(y)}')
            
        ...
        
        # The last two lines in the body of `fit` should be the two below.
        self.intercept_ = ...
        self.coef_ = ...
        
    def predict(self, x_new):
        if isinstance(x_new, list):
            x_new = np.array(x_new)
        try:
            ...
        except AttributeError:
            raise AttributeError('Cannot use `predict` before `fit`.')
            
            
# Feel free to change the inputs below to make sure your class implementation works correctly.
model = SimpleLAD()
model.fit([1, 2, -1, 4], [15, 6, 7, 8])
preds = model.predict([5, -3.5, 5])
print(f'''
model.intercept_ = {model.intercept_}
model.coef_ = {model.coef_}
model.predict([5, -3.5, 5]) = {preds}
''')

In [None]:
grader.check("q03_04")

Now that our implementation of `SimpleLAD` is complete, we can use it to fit real datasets! Run the cell below to load in a DataFrame with two columns, `'x'` and `'y'`.

In [None]:
data_for_lad = pd.read_csv('data/data-for-lad.csv')
data_for_lad.head()

Run the cell below to draw a scatter plot of `'y'` vs. `'x'`.

In [None]:
px.scatter(data_for_lad, x='x', y='y', color_discrete_sequence=['#444']).update_layout(width=800, height=400)

There's a clear linear association at the bottom, with some outliers spread throughout the top. Let's see how the best-fitting lines look on this dataset, when the lines are chosen by minimizing mean squared error vs. mean absolute error.

First, we'll find the standard simple linear regression line, i.e. the one that minimizes mean squared error. We'll use `sklearn` to do this.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model_mse = LinearRegression()
model_mse.fit(X=data_for_lad[['x']], y=data_for_lad['y'])

In [None]:
model_mse.intercept_

In [None]:
# An array with one optimal parameter.
# sklearn's LinearRegression supports multiple regression, meaning it stores
# the coef_ attribute in a way that is flexible enough to hold multiple slope parameters.
model_mse.coef_

Now, let's compute the least absolute deviations line, i.e. the one that minimizes mean absolute error. **This is where your hard work comes in!**

In [None]:
model_lad = SimpleLAD()
model_lad.fit(data_for_lad['x'].to_numpy(), data_for_lad['y'].to_numpy())

In [None]:
model_lad.intercept_

In [None]:
model_lad.coef_

Let's graph both of these lines!

In [None]:
fig = px.scatter(data_for_lad, x='x', y='y', color_discrete_sequence=['#888']).update_layout(width=800, height=400)

fig.add_trace(
    go.Scatter(
        x=[-1, 11],
        y=model_mse.predict([[-1], [11]]),
        mode='lines',
        name='Best Line when Minimizing MSE',
        line={'color': '#00274C'}
    )
)

fig.add_trace(
    go.Scatter(
        x=[-1, 11],
        y=model_lad.predict([-1, 11]),
        mode='lines',
        name='Best Line when Minimizing MAE',
        line={'color': '#FFCB05'}
    )
)

What do you notice? There's nothing you need to write or comment on here, but you should think about what makes the lines appear so different, and **why** this is happening.
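
One way to probe this in a simplified setting: for a constant model, squared loss is minimized by the mean and absolute loss by the median. The sketch below uses made-up values to show how much a single outlier moves each one. The variable names are our own.

```python
import numpy as np

# Made-up values, plus the same values with one large outlier appended.
values = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
with_outlier = np.append(values, 100.0)

# How far the mean (squared loss minimizer) moves when the outlier is added.
mean_shift = with_outlier.mean() - values.mean()

# How far the median (absolute loss minimizer) moves when the outlier is added.
median_shift = np.median(with_outlier) - np.median(values)

print(mean_shift, median_shift)
```

Think about how the same effect might carry over from constant models to lines.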

### Question 3.5 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

We've built a naïve implementation of simple LAD (least absolute deviations) regression. Suppose $n$ is the number of points in the dataset that we fit a `SimpleLAD` object on. Which of the following most accurately describes the runtime of `SimpleLAD.fit`, in Big-O notation? Assign `naive_lad_runtime` to an integer between 1 and 8, inclusive, corresponding to your answer among the choices below.

1. $O(1)$
2. $O(n)$
3. $O(n^2)$
4. $O(n^3)$
5. $O(\log n)$
6. $O(n \log n)$
7. $O(n!)$
8. $O(2^n)$

_Hint: When computing the theoretical runtime of an algorithm, it doesn't matter which language or package an operation is implemented in – a fast `numpy` vectorized operation still involves a loop!_

In [None]:
naive_lad_runtime = ...
naive_lad_runtime

In [None]:
grader.check("q03_05")

## Question 4: Interpol...ation 🚔 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">6 Points</div>

---

So far in this class, our primary tool for making predictions has been linear regression – that is, a straight line, or if using two features, a plane. However, one potential issue with linear regression, depending on your use case, is that a straight line doesn't pass through each data point, leading to prediction errors.

<center><img src="imgs/model.png" width=500></center>

In this question, we will explore the idea of **polynomial interpolation**, which is a method of constructing a polynomial that passes directly through a given set of points. Interpolation is widely used in numerical analysis, a subfield of mathematics that deals with approximating solutions to problems that (often) can't be solved exactly by hand, typically by writing code.

The specific method for interpolation we'll study in this problem is called Lagrange Interpolation. It solves the following problem:

> Given a set of $n+1$ points, $(x_1, y_1), (x_2, y_2), ..., (x_{n+1}, y_{n+1})$, what is the equation of the degree $n$ polynomial that passes through all $n+1$ points?

<div class="alert alert-success">

To get started, read [**this guide**](https://practicaldsc.org/guides/machine-learning/interpolation) we've written about the method of Lagrange interpolation. Think of it as an extension of the homework spec.
    
</div>

<br>

**Your job** is to complete the implementation of the function `interpolate`.
- `interpolate` takes in a list of tuples, `points`, corresponding to points $(x_1, y_1), (x_2, y_2), ..., (x_{n+1}, y_{n+1})$.
- `interpolate` returns a **function**, that:
    - takes in a single number, $x$, and
    - returns the output of passing $x$ into the polynomial that **interpolates** the points in `points`.

Example behavior is given below.

```python
>>> f = interpolate([(1, 3), (3, 19), (4, 33)])
>>> type(f)
function
>>> f(1)
3.0
>>> f(3)
19.0
>>> f(100)
20001
```

For more context on the example above:

- As discussed in the linked guide, the polynomial that interpolates the points $(1, 3)$, $(3, 19)$, and $(4, 33)$ is $p(x) = 1 + 2x^2$.
- So, `interpolate([(1, 3), (3, 19), (4, 33)])` returns a function that takes in an `x` and outputs `1 + 2 * (x ** 2)`.
- This expression, `1 + 2 * (x ** 2)`, is not hard-coded anywhere. Rather, `interpolate` creates a function using the Lagrange interpolation process outlined in the guide.

Some guidance:
- The bulk of your work is in figuring out how to implement the basis polynomials in code. Feel free to use `for`-loops and list comprehensions as necessary.
- For context, the body of `f` in our solution only had 6 lines of code, and we added a few lines before `def f...`.
- Don't worry about cases where two points have the same $x$-coordinate – as mentioned in the guide, in Lagrange interpolation, we assume there are no duplicate $x_i$s.
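
To cross-check your `interpolate` output, one option is to compare against `np.polyfit`, which fits a polynomial by least squares. When the degree is one less than the number of points (and the $x$-values are distinct), the fitted polynomial passes through every point. Note that this is a different technique from Lagrange interpolation, so it's useful for checking answers but isn't a substitute for implementing the method in the guide.

```python
import numpy as np

# The same three points used in the example above.
points = [(1, 3), (3, 19), (4, 33)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]

# Degree is (number of points - 1); coefficients come back highest power first.
coeffs = np.polyfit(xs, ys, len(points) - 1)
print(np.round(coeffs, 6))

# Evaluate the fitted polynomial at a new input.
print(np.polyval(coeffs, 100))
```

Up to floating-point error, the coefficients should correspond to $p(x) = 1 + 2x^2$.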

In [None]:
def interpolate(points):
    def f(x):
        ...
    return f

# Feel free to change this input to make sure your function works correctly.
f = interpolate([(1, 3), (3, 19), (4, 33)])
f(100)

In [None]:
grader.check("q04")

Now that you've implemented `interpolate`, let's implement a function that interpolates a set of points and then plots the resulting polynomial. We've done this for you below.

In [None]:
def interpolate_and_plot(points, x_min=None, x_max=None):
    f = interpolate(points)
    xs = [point[0] for point in points]
    ys = [point[1] for point in points]
    if x_min is None:
        x_min = min(xs) - 1
    if x_max is None:
        x_max = max(xs) + 1
    x_range = np.linspace(x_min, x_max, 10000)
    outs = [f(xi) for xi in x_range]
    
    fig = px.scatter(x=xs, y=ys, size=[1] * len(xs), size_max=15, title='Original Data and Interpolated Polynomial')

    fig.add_trace(
        go.Scatter(
            x=x_range,
            y=outs,
            mode='lines',
            name='Interpolated Polynomial',
            line={'color': 'red'}
        )
    )
    
    return fig

interpolate_and_plot([(0, -1), (1, 0), (2, -11), (3, 2), (4, 99)])

Let's look at an even more interesting example: the commute times dataset from lecture!

In [None]:
commutes = pd.read_csv('data/commute-times.csv')
commutes.head()

As in lecture, let's let our $x$ variable be `'departure_hour'` and $y$ be `'minutes'`, representing the length of our commute to school in minutes.

Unfortunately, the `'departure_hour'` column isn't unique, meaning that there are duplicated $x_i$s, which would cause Lagrange interpolation to fail (think about why!).

In [None]:
commutes['departure_hour'].value_counts()

So, let's keep just the first instance of each unique $x_i$.

In [None]:
commutes = commutes.groupby('departure_hour').first().reset_index()

Now, we can take the `'departure_hour'` and `'minutes'` columns and create an $(x_i, y_i)$ point out of each row.

In [None]:
departure_hours = commutes['departure_hour'].to_numpy()
minutes = commutes['minutes'].to_numpy()
as_points = list(zip(departure_hours, minutes))

Finally, let's call our freshly-minted `interpolate_and_plot` on `as_points`:

In [None]:
interpolate_and_plot(as_points)

Pay close attention to the numbers on the $y$-axis. What do they mean?

Let's zoom in closer to the middle of the graph.

In [None]:
interpolate_and_plot([point for point in as_points if 7.9 <= point[0] <= 8.1], x_min=7.9, x_max=8.1)

Given these outputs, here's something to think about.

The process of Lagrange Interpolation finds a polynomial $p(x)$ with a mean squared error of 0, which is generally much lower than the mean squared error of the simple linear regression line, $H^*(x)$, on a given dataset (unless the dataset consists of points that all fall on a straight line). Why isn't Lagrange Interpolation used very often in the context of finding hypothesis functions to use for prediction, and why do we prefer empirical risk minimization in general?

You don't have to write your answer anywhere, because Question 4 is entirely autograded, but it's important that you understand the nature of _how_ interpolation works.

## Question 5: Billy the Waiter 🧑‍🍳

---

Run the cell below to load in a dataset containing information about the tips Billy received over the last month as a waiter at Mani Osteria.

In [None]:
tips = px.data.tips().rename(columns={'size': 'table_size'}).replace('Fri', 'Thur')
tips

Each row corresponds to a single table that he served. Throughout this question, our goal will be to predict `'tip'` using some or all of the other features in the DataFrame. We will do so by implementing all aspects of the linear regression model-building process **manually using `numpy`, i.e. WITHOUT using `sklearn` or other machine learning packages**.

Let's start by just using `'total_bill'` to predict `'tip'`. Here's a scatter plot showing the relationship between the two variables:

In [None]:
tips.plot(kind='scatter', x='total_bill', y='tip', title='Total Bill vs. Tip')

Before we get started actually making predictions, we'll need to implement several helper functions. We've defined two for you; read their implementations to make sure you understand what they do.

In [None]:
def solve_normal_equations(X, y):
    '''
    Equivalent to returning np.linalg.inv(X.T @ X) @ X.T @ y
    when X.T @ X is invertible, but more efficient and numerically stable.
    '''
    return np.linalg.solve(X.T @ X, X.T @ y)

def compute_mse(X, y, w):
    return np.mean((y - X @ w) ** 2)

Your first task is to complete the last of the three helper functions.

### Question 5.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `create_design_matrix`, which takes in a DataFrame `df` like `tips` and a list of column names, `columns`, and returns a 2D **array** where:
- The first column of the output array is all 1s.
- All other columns of the output array are the columns of `df` specified in the list `columns`, in the same order in which they appear in `columns`.

Example behavior is given below.

```python
>>> create_design_matrix(tips.head(), ['total_bill', 'table_size'])
array([[ 1.  , 16.99,  2.  ],
       [ 1.  , 10.34,  3.  ],
       [ 1.  , 21.01,  3.  ],
       [ 1.  , 23.68,  2.  ],
       [ 1.  , 24.59,  4.  ]])
```

Some guidance:
- Make sure your function doesn't make in-place modifications to the passed-in DataFrame!
- Assume that all of the column names in `columns` refer to numeric columns in `df`.
- There could be repeated column names in `columns`; if this happens, include the specified columns multiple times, as requested.

In [None]:
def create_design_matrix(df, columns):
    ...

# Feel free to change this input to make sure your function works correctly.
create_design_matrix(tips.head(), ['total_bill', 'table_size'])

In [None]:
grader.check("q05_01")

**If you implemented `create_design_matrix` correctly, you should be able to run the next few cells without any issues.**

Recall, our goal is to start with a simple linear model that uses `'total_bill'` to predict `'tip'`. Let's use your implementation of `create_design_matrix` to set up our $X$ and $\vec{y}$.

In [None]:
X_one_feature = create_design_matrix(tips, ['total_bill'])
y = tips['tip']

# Notice that X_one_feature has two columns.
X_one_feature

Next, let's find the optimal parameter vector, $\vec{w}^*$.

In [None]:
# Finding w*.
w_one_feature = solve_normal_equations(X_one_feature, y)
w_one_feature

We can now use this hypothesis function to make predictions:

In [None]:
# Dot product of an augmented feature vector for a total bill of 15 with the optimal parameter vector.
np.array([1, 15]) @ w_one_feature

In [None]:
px.scatter(tips, x='total_bill', y='tip', title='Tip vs. Total Bill')

x_range = np.linspace(0, 60)

fig = go.Figure()
fig.add_trace(go.Scatter(x=tips['total_bill'], y=y, mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_one_feature[0] + w_one_feature[1] * x_range, 
                         name='Linear Hypothesis Function', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip')

The mean squared error of this hypothesis function is as follows:

In [None]:
mse_one_feature = compute_mse(X_one_feature, y, w_one_feature)
mse_one_feature

We'll define the DataFrame `hypothesis_functions` solely to keep track of the hypothesis functions we've used so far along with their MSEs. (We'll update this DataFrame for you.)

In [None]:
hypothesis_functions = pd.DataFrame(index=['total_bill'], columns=['MSE'])
hypothesis_functions.loc['total_bill'] = mse_one_feature
hypothesis_functions

### Question 5.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div> 

Let's suppose Billy works for a day as a waiter at the [Gandy Dancer](https://www.gandydancerrestaurant.com/), a fancy restaurant. He waits a table whose total bill is \$350. He decides to use the above linear hypothesis function to predict the tip that he will receive.

1. What tip would the above single-feature model predict for a total bill of \$350? In the cell below, assign the answer to the variable `prediction_for_350`. (Try and use the `@` symbol as part of your answer!)
1. Is this prediction likely to be accurate? If so, in the cell below, assign the variable `is_accurate` to `True`, otherwise, assign it to `False`. Before assigning `is_accurate` to either `True` or `False`, you should think about what makes a prediction about the future likely to be accurate vs. not.

**You should not round any numbers at any point in this question**!

In [None]:
prediction_for_350 = ...
is_accurate = ...

# Don't change the line below.
print(f'The predicted tip for a total bill of $350 is ${round(prediction_for_350, 2)}, and we {"do" if is_accurate else "do not"} think this prediction is likely to be accurate.')

In [None]:
grader.check("q05_02")

### Question 5.3 [Autograded 💻]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Now, let's suppose we want to use `'total_bill'` **and** `'table_size'` to predict `'tip'`.

Below, complete the following tasks:

1. Assign `X_two_features` to the design matrix for this new hypothesis function.
1. Assign `w_two_features` to the optimal parameter vector for this new hypothesis function.
1. Assign `mse_two_features` to the mean squared error of this hypothesis function.
1. Did adding `'table_size'` as a feature make our hypothesis function significantly more accurate as compared to the hypothesis function that used just `'total_bill'`? If so, assign `much_more_accurate` to `True`, otherwise assign it to `False`.

Tasks 1, 2, and 3 should each only take one line; remember to use the helper functions we've already defined.

In [None]:
X_two_features = ...
w_two_features = ...
mse_two_features = ...
much_more_accurate = ...

# Don't change the lines below.
print('first five rows of design matrix:\n', X_two_features[:5])
print('optimal parameter vector:', w_two_features)
print('MSE:', mse_two_features)
print('much more accurate:', 'yes' if much_more_accurate else 'no')

In [None]:
grader.check("q05_03")

If you completed Question 5.3 correctly, you should see a 3D scatter plot of the original data points and your hypothesis function below.

In [None]:
XX, YY = np.mgrid[0:60:2, 0:8:2]
Z = w_two_features[0] + w_two_features[1] * XX + w_two_features[2] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Reds')

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=tips['total_bill'], 
                           y=tips['table_size'], 
                           z=tips['tip'], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene = dict(
    xaxis_title='Total Bill',
    yaxis_title='Table Size',
    zaxis_title='Tip'), title='Tip vs. Total Bill')

Don't change this cell, just run it.

In [None]:
hypothesis_functions.loc['total_bill and table_size'] = mse_two_features
hypothesis_functions

### Question 5.4 [Autograded 💻]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Which feature is more important in predicting tip – `'total_bill'` or `'table_size'`?

Assuming you answered Question 5.3 correctly, run the cell below to create a **standardized** design matrix, where the two columns for `'total_bill'` and `'table_size'` are standardized to have mean 0 and standard deviation 1.

In [None]:
X_two_features_standardized = X_two_features.copy()
X_two_features_standardized[:, 1:] = (X_two_features[:, 1:] - np.mean(X_two_features[:, 1:], axis=0)) / X_two_features[:, 1:].std(axis=0, ddof=0)
X_two_features_standardized[:5]

Below,

1. Assign `w_two_features_standardized` to an array containing the standardized regression coefficients for our two-feature hypothesis function.
1. Assign `more_important` to either `'total_bill'` or `'table_size'`, depending on which of the two features you think is more important in predicting `'tip'`.

In [None]:
w_two_features_standardized = ...
more_important = ...
w_two_features_standardized, more_important

In [None]:
grader.check("q05_04")

Don't change this cell, just run it.

In [None]:
hypothesis_functions.loc['total_bill and table_size std'] = compute_mse(X_two_features_standardized, y, w_two_features_standardized)
hypothesis_functions

The MSEs of the last two hypothesis functions were the same! The only difference is that when we standardized the features in creating the most recent hypothesis function, we were able to compare the coefficients directly.
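
To see why standardizing leaves the MSE unchanged, here is a minimal sketch on made-up data (not the `tips` dataset). The variable names and the helper `fit_and_mse` are hypothetical and exist only for this illustration; the helper simply combines the normal equations and the MSE computation from earlier in this question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two features on very different scales.
n = 50
bill = rng.uniform(5, 50, size=n)
size = rng.integers(1, 7, size=n).astype(float)
tip_amount = 1 + 0.1 * bill + 0.2 * size + rng.normal(0, 0.5, size=n)

def fit_and_mse(X, y):
    # Solve the normal equations, then compute the mean squared error.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w, np.mean((y - X @ w) ** 2)

features = np.column_stack([bill, size])
X_raw = np.column_stack([np.ones(n), features])

# Standardize each feature column to mean 0 and standard deviation 1.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
X_std = np.column_stack([np.ones(n), standardized])

w_raw, mse_raw = fit_and_mse(X_raw, tip_amount)
w_std, mse_std = fit_and_mse(X_std, tip_amount)

print(np.isclose(mse_raw, mse_std))
print(w_raw)
print(w_std)
```

Standardizing only shifts and rescales each column, so both design matrices can produce exactly the same set of predictions. The coefficients change to compensate: each standardized slope is the original slope multiplied by that feature's standard deviation.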

Now, let's revisit the scatter plot of `'tip'` vs. `'total_bill'`:

In [None]:
fig = px.scatter(tips, x='total_bill', y='tip', title='Tip vs. Total Bill')
fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip')

Let's see if using higher-degree polynomial features yields a better hypothesis function. Specifically, let's try and create a degree 4 polynomial hypothesis function, using the features `'total_bill'`, `'total_bill^2'`, `'total_bill^3'`, and `'total_bill^4'`.

(We'll see this in more detail in Lecture 16; this part is meant to introduce you to the idea of polynomial features.)

In [None]:
# Making a copy of the tips DataFrame so that we don't modify the original data.
tips_with_poly_features = tips.copy()

In [None]:
# Computing total_bill^2.
tips_with_poly_features['total_bill^2'] = tips_with_poly_features['total_bill'] ** 2
tips_with_poly_features.head()

### Question 5.5 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Below,

1. Add columns `'total_bill^3'` and `'total_bill^4'` to the DataFrame `tips_with_poly_features`.
1. Define `X_poly`, `w_poly`, and `mse_poly` to be the design matrix, optimal parameter vector, and mean squared error of our new 4th degree polynomial hypothesis function. Note that this hypothesis function should be of the form:

    $$H(x_i) = w_0 + w_1 x_i + w_2 x_i^2 + w_3 x_i^3 + w_4 x_i^4$$

    where $x_i$ is the `'total_bill'`.

Again, this subpart should only take a few minutes.

In [None]:
tips_with_poly_features = ...
X_poly = ...
w_poly = ...
mse_poly = ...

# Don't change the lines below.
print('first five rows of design matrix:\n', X_poly[:5])
print('optimal parameter vector:', w_poly)
print('MSE:', mse_poly)

In [None]:
grader.check("q05_05")

Don't change this cell, just run it.

In [None]:
hypothesis_functions.loc['total_bill 4th degree poly'] = mse_poly
hypothesis_functions

Assuming you completed Question 5.5 correctly, run the following cell to see a visualization of our 4th degree polynomial hypothesis function.

In [None]:
x_range = np.linspace(0, 50)

fig = go.Figure()
fig.add_trace(go.Scatter(x=tips['total_bill'], y=tips['tip'], mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_poly[0] + w_poly[1] * (x_range) + w_poly[2] * (x_range**2) + \
                             w_poly[3] * (x_range**3) + w_poly[4] * (x_range**4),
                         name='4th Degree Polynomial Hypothesis Function', 
                         line=dict(color='red', width=5)))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', title='Tip vs. Total Bill')

The 4th degree polynomial hypothesis function seems to fit the data the best so far, since its MSE is the lowest.

In [None]:
hypothesis_functions

But let's see what happens when we "zoom out" and look at how this hypothesis function behaves.

In [None]:
x_range = np.linspace(-20, 70)

fig = go.Figure()
fig.add_trace(go.Scatter(x=tips['total_bill'], y=tips['tip'], mode='markers', name='actual'))
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_poly[0] + w_poly[1] * (x_range) + w_poly[2] * (x_range**2) + \
                             w_poly[3] * (x_range**3) + w_poly[4] * (x_range**4),
                         name='4th Degree Polynomial Hypothesis Function', 
                         line=dict(color='red', width=5)))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', title='Tip vs. Total Bill')

**This is precisely the same behavior we saw in Question 4, when we learned about Lagrange Interpolation!** Indeed, if we keep increasing the degrees of the polynomial features we use, our hypothesis function will look more and more like the interpolated polynomial. As we had you ponder in Question 4, **think** about **why** a hypothesis function with a lower MSE is not necessarily better than a hypothesis function with a higher MSE. You don't need to write your answer anywhere, but discuss it with someone (either a peer or IA/GSI/Professor) before submitting this homework.

We'll explore polynomial regression in more detail in Lecture 16, as mentioned above, but we'll discuss this more general idea – of why we shouldn't make our hypothesis functions overly complex – in Lectures 17 and 18.
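
If you'd like to experiment with this idea on your own, the sketch below fits polynomials of increasing degree to made-up data that truly comes from a line plus noise. It uses `np.polynomial.Polynomial.fit` instead of the helper functions from this homework, purely for numerical convenience; the data, the degrees tried, and the helper `training_mse` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: a truly linear relationship plus noise.
x = np.linspace(0, 10, 30)
y = 2 + 0.5 * x + rng.normal(0, 1, size=x.shape)

def training_mse(degree):
    # Polynomial.fit rescales x internally, which keeps the fit numerically stable.
    poly = np.polynomial.Polynomial.fit(x, y, deg=degree)
    return np.mean((y - poly(x)) ** 2), poly

results = {degree: training_mse(degree) for degree in [1, 4, 8]}

for degree, (mse, poly) in results.items():
    # poly(15) is a prediction well outside the range of the training data.
    print(degree, mse, poly(15))
```

The training MSE can only go down (or stay the same) as the degree increases, since each higher-degree model contains the lower-degree ones as special cases. Compare the predictions at $x = 15$ to what the underlying line would give there, $2 + 0.5 \cdot 15 = 9.5$.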

### Question 5.6 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Let's again suppose Billy works for a day as a waiter at [The Gandy Dancer](https://www.gandydancerrestaurant.com/). He waits a table whose total bill is \$350. He decides to use the above 4th degree polynomial hypothesis function to predict the tip that he will receive.

What tip would the above polynomial model predict for a total bill of \$350? In the cell below, assign the answer to the variable `poly_prediction_for_350`.

In [None]:
poly_prediction_for_350 = ...

# Don't change the line below.
print(f'The predicted tip for a total bill of $350 is ${round(poly_prediction_for_350, 2)}.')

In [None]:
grader.check("q05_06")

There was another column in our original DataFrame, `tips`, that we haven't yet looked at: `'day'`.

In [None]:
tips.head()

In [None]:
px.bar(tips['day'].value_counts().loc[['Thur', 'Sat', 'Sun']])

Note that unlike `'total_bill'` and `'table_size'`, `'day'` is **categorical**. This means there's no easy way to put it in our design matrix or find the best hypothesis function.

A naïve solution would be to encode `'Thur'` as 1, `'Sat'` as 2, and `'Sun'` as 3, but this would make it seem like Sunday is "more" than Saturday or Thursday in some regard, which it is not – these are all just different days of the week.

A more robust and common solution is called **one hot encoding** (OHE). You will be exposed to it in more detail in Lecture 16, but we want to show you an example of how it works now since it's a natural extension of what we've already covered.

Let's first get it working on a toy example. Let's pretend we have a DataFrame with just 5 rows and 2 columns, `'total_bill'` and `'day'`. Call it `mini_tips`.

In [None]:
mini_tips = pd.DataFrame()
mini_tips['total_bill'] = tips['total_bill'].iloc[:5]
mini_tips['day'] = ['Sat', 'Sun', 'Sun', 'Thur', 'Sat']
mini_tips

When we **one hot encode** a categorical variable, we create a new column for each unique value of that categorical variable. In this case, we'd create three new columns, one each for `'Thur'`, `'Sat'`, and `'Sun'`.

Each of these new columns is binary, meaning they only contain the values 1 and 0. 
- The new column for `'Thur'`, which we'll call `'is_thur'`, will contain a 1 for rows where the value of `'day'` is `'Thur'`, and 0 for all other rows. 
- Similarly, the new column for `'Sun'`, which we'll call `'is_sun'`, will contain a 1 for rows where the value of day is `'Sun'`, and 0 for all other rows.

Again, you'll see more efficient ways to do this in Lecture 16, but here's one way to one hot encode using our understanding of `pandas`.

In [None]:
(mini_tips['day'] == 'Thur')

Repeating this for all columns:

In [None]:
mini_tips['is_thur'] = (mini_tips['day'] == 'Thur').astype(int)
mini_tips['is_sat'] = (mini_tips['day'] == 'Sat').astype(int)
mini_tips['is_sun'] = (mini_tips['day'] == 'Sun').astype(int)

# Dropping the 'day' column. We've encoded it numerically, we don't need it anymore.
mini_tips = mini_tips.drop(columns=['day'])
mini_tips

Now we've converted a categorical feature into three numerical features, so we're good to go!

**There's just one more thing.** Since we're used to fitting linear hypothesis functions with an intercept term, our design matrix generally has a column of all 1s in it. In the case of `mini_tips`, which contains three binary columns, this would look like:

In [None]:
create_design_matrix(mini_tips, list(mini_tips.columns))

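This design matrix contains redundant information! Specifically, we can recreate the column of all 1s by adding together the three one-hot encoded columns, as the next two cells show. This matters because the normal equations,

$$X^TX\vec{w} = X^T\vec{y}$$

only have the unique solution

$$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$

when $X^TX$ is invertible, which requires the columns of $X$ to be linearly independent.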

In [None]:
X_not_full_rank = create_design_matrix(mini_tips, list(mini_tips.columns))
X_not_full_rank

In [None]:
# Note that the 0, 1, 2, 3, 4 that you see is the index of this Series, which is irrelevant for our purposes.
mini_tips['is_thur'] + mini_tips['is_sat'] + mini_tips['is_sun']

What this means is that our design matrix $X$ suffers from multicollinearity, and is not **full rank**. There are multiple nasty side effects of this – there is no unique solution for $\vec{w}^*$ and it makes our optimal parameters more difficult to interpret.

Again, we'll address this idea in lectures to come, so don't worry if this is a bit confusing. This is more meant to be a preview of what's to come.
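
One way to check the rank claim above numerically is with `np.linalg.matrix_rank`. The sketch below rebuilds the `mini_tips` design matrix by hand from the values shown earlier, so it runs on its own; the names `X_all` and `X_dropped` are made up for this illustration.

```python
import numpy as np

# The five rows of mini_tips, written out by hand.
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
is_thur = np.array([0, 0, 0, 1, 0])
is_sat = np.array([1, 0, 0, 0, 1])
is_sun = np.array([0, 1, 1, 0, 0])
ones = np.ones(5)

# Intercept column plus all three one-hot encoded columns.
X_all = np.column_stack([ones, total_bill, is_thur, is_sat, is_sun])

# The same matrix with the 'is_thur' column left out.
X_dropped = np.column_stack([ones, total_bill, is_sat, is_sun])

# Number of columns vs. rank: a full rank matrix has rank equal to its number of columns.
print(X_all.shape[1], np.linalg.matrix_rank(X_all))          # 5 columns, rank 4
print(X_dropped.shape[1], np.linalg.matrix_rank(X_dropped))  # 4 columns, rank 4
```

With all three one-hot encoded columns, the rank is one less than the number of columns, which is exactly what "not full rank" means here.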

**For now, know this – the way to avoid this problem is to drop one of the one hot encoded columns.** That way, there is no redundant information in the design matrix, and we don't run into any issues. This is not "getting rid" of any information, so it will not impact our predictions – if we know it is not Saturday or Sunday, it must be Thursday.
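
As a preview of the more efficient approaches mentioned earlier, `pandas` can one hot encode and drop a column in a single step with `pd.get_dummies`. This is a sketch of an alternative, not the approach this notebook uses: the DataFrame `toy` is made up to mirror `mini_tips`, and `pd.get_dummies` names its columns differently (`'day_Sun'` rather than `'is_sun'`).

```python
import pandas as pd

toy = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'day': ['Sat', 'Sun', 'Sun', 'Thur', 'Sat'],
})

# drop_first=True drops the first category in sorted order ('Sat' here),
# leaving one indicator column for each remaining day.
encoded = pd.get_dummies(toy, columns=['day'], drop_first=True, dtype=int)
print(encoded)
```

Here the dropped category is `'Sat'` rather than `'Thur'`. Which column gets dropped doesn't affect the predictions, but it does change how the intercept and the remaining coefficients should be interpreted.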

In [None]:
# We've arbitrarily chosen to drop 'is_thur', but it would make no difference if we instead dropped 'is_sat' or 'is_sun'.
mini_tips = mini_tips.drop(columns=['is_thur'])
mini_tips

In [None]:
create_design_matrix(mini_tips, list(mini_tips.columns))

Now we have a design matrix that is ready to go. Let's replicate this process on our full dataset.

In [None]:
# Run this cell.
tips_ohe = tips.copy()
tips_ohe['is_sat'] = (tips_ohe['day'] == 'Sat').astype(int)
tips_ohe['is_sun'] = (tips_ohe['day'] == 'Sun').astype(int)

# Design matrix with two one-hot encoded columns.
X_ohe = create_design_matrix(tips_ohe, ['total_bill', 'is_sat', 'is_sun'])
print('first five rows of design matrix:\n', X_ohe[:5])

In [None]:
w_ohe = solve_normal_equations(X_ohe, y)
w_ohe

Let's now plot the resulting hypothesis function. We've zoomed into the region where the `'total_bill'`s are less than \\$30 and `'tip'`s are less than \\$4 to make the hypothesis function clearer.

In [None]:
x_range = np.linspace(0, 30)

under_30 = tips[(tips['total_bill'] < 30) & (tips['tip'] < 4)]

fig = go.Figure()
fig.add_trace(go.Scatter(x=under_30['total_bill'], y=under_30['tip'], mode='markers', name='actual'))

# Line for Thursday.
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_ohe[0] + w_ohe[1] * x_range, 
                         name='Thursday', 
                         line=dict(color='blue', width=4)))

# Line for Saturday.
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_ohe[0] + w_ohe[2] + w_ohe[1] * x_range, 
                         name='Saturday', 
                         line=dict(color='orange', width=4)))

# Line for Sunday.
fig.add_trace(go.Scatter(x=x_range, 
                         y=w_ohe[0] + w_ohe[3] + w_ohe[1] * x_range, 
                         name='Sunday', 
                         line=dict(color='red', width=4)))

fig.update_layout(xaxis_title='Total Bill', yaxis_title='Tip', title='Tip vs. Total Bill')

It looks like the hypothesis function is actually three separate lines, each of which has the same slope but a different intercept!

Let's try and understand why this is the case.

In [None]:
w_ohe

Our hypothesis function is of the following form (approximately, since the coefficients are rounded):

$$\text{predicted tip} = 0.925 + 0.105 (\text{total bill}) - 0.072 (\text{is saturday}) + 0.089 (\text{is sunday})$$

### Question 5.7 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Below, assign `intercept_thur`, `intercept_sat`, and `intercept_sun` to the **$y$-intercepts** of the three lines above, corresponding to when the `'day'` is Thursday, Saturday, or Sunday. You should do this using code, pulling values from `w_ohe`, but you should think conceptually about where each of the three intercepts is coming from.

In [None]:
intercept_thur = ...
intercept_sat = ...
intercept_sun = ...

# Don't change the lines below.
print('Intercept for Thursday:', intercept_thur)
print('Intercept for Saturday:', intercept_sat)
print('Intercept for Sunday:', intercept_sun)

In [None]:
grader.check("q05_07")

Just for completeness, we'll also compute the MSE of this hypothesis function:

In [None]:
mse_ohe = compute_mse(X_ohe, y, w_ohe)
hypothesis_functions.loc['total_bill + OHE day'] = mse_ohe
hypothesis_functions

This new hypothesis function didn't have a much lower MSE than the hypothesis function that used `'total_bill'` only. That's not all that surprising, since the three lines above look quite similar.

## Finish Line 🏁

Congratulations! You're ready to submit Homework 7.

You need to submit Homework 7 twice:

### To submit the manually graded problems (Questions 1-2; marked [Written ✏️])

- Make sure your answers **are not** in this notebook, but rather in a separate PDF.
    - You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in.
- Submit this separate PDF to the **Homework 7 (Questions 1-2; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

### To submit the autograded problems (Questions 3-5; marked [Autograded 💻])

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under **Homework 7 (Questions 3-5; autograded problems)**.
4. Stick around while the Gradescope autograder grades your work.
5. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

Your Homework 7 submission time will be the **later** of your two individual submissions.