<div class="alert alert-success" markdown="1">

#### Homework 7

# Loss Functions and Linear Algebra

### EECS 398-003: Practical Data Science, Fall 2024

#### Due Thursday, October 24th at 11:59PM
    
</div>

## Instructions

Welcome to Homework 7! In this homework, you'll gain a strong understanding of a key concept in machine learning: loss functions. Along the way, you'll practice working with summation notation, derivatives, limits, and linear algebra, all concepts that are crucial to machine learning. See the [Readings section of the Resources tab on the course website](https://practicaldsc.org/resources/#readings) for supplemental resources.

You are given six slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/fa24/). The [⚙️ Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps.
<div class="alert alert-warning" markdown="1">
    
Unlike other homeworks, you **are not** going to submit this notebook! Instead, you will write **all** of your answers to the questions in this homework in a separate PDF. You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in. 

**Make sure to show your work for all questions! Answers without work shown may not receive full credit.**
</div>
    
This homework is worth a total of **56 points**, all of which come from **manual grading**. The number of points each question is worth is listed at the start of each question. **All questions in the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the cell below. There's no need to `import otter` at the top, since there are no autograder tests. But, you will still run some code.

In [None]:
import pandas as pd
import numpy as np

import plotly
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"
pd.options.plotting.backend = 'plotly'

import warnings
warnings.simplefilter('ignore')

## Question 1: Imputation Returns 📏

---

Earlier in the semester, we learned about mean imputation, a technique for handling missing values in a dataset. Specifically, with mean imputation, we fill in all of the missing values in a column with the mean of the observed values in that column. (Here, we'll just consider _unconditional_ mean imputation.)

One of the observations we made in [Lecture 8](https://practicaldsc.org/resources/lectures/lec08/lec08-filled.html#Idea:-Mean-imputation) is that when performing mean imputation:

- the mean of the imputed column is **the same as** the mean of the observed values, pre-imputation, and
- the standard deviation of the imputed column is **less than** the standard deviation of the observed values, pre-imputation.

For clarity, let's illustrate that fact once more. Run the cell below to load in the same `heights` DataFrame we used in Lecture 8.

In [None]:
heights = pd.read_csv('data/heights-missing-2.csv')
heights.head()

The `'child'` column in `heights` has many missing values.

In [None]:
# 169 missing values, 765 present values.
original_heights = heights['child']
original_heights.isna().value_counts()

Ignoring the missing values (which `mean` and `std` do by default), the mean and standard deviation of the `'child'` column are shown below:

In [None]:
original_heights = heights['child']
original_heights.mean()

In [None]:
# We'll talk about what ddof=0 does later in the question.
# For now, just interpret this result as the standard deviation.
original_heights.std(ddof=0)

After filling in missing `'child'` heights with the mean of the observed heights, we have:

In [None]:
imputed_heights = original_heights.fillna(original_heights.mean())

# The same as original_heights.mean()!
imputed_heights.mean()

In [None]:
# Smaller than original_heights.std(ddof=0)!
imputed_heights.std(ddof=0)

Again, we see that the mean pre-imputation and post-imputation is the same, but the standard deviations are different. But is the mean always guaranteed to be the same after performing mean imputation? And what is the relationship between the standard deviation pre-imputation, $\approx 3.52047$, and the standard deviation post-imputation, $\approx 3.18609$?

In this question, we will mathematically **prove** the relationships we're seeing above in more generality, to better understand the properties of mean imputation. This will also give us practice with manipulating equations involving summations and squares, which we'll need to do a lot to understand the machinery behind machine learning.
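Before proving anything, it may help to see the same behavior on a tiny dataset. The sketch below uses made-up numbers (the `observed` array and `n_missing` are hypothetical, not taken from `heights`) and checks numerically that mean imputation leaves the mean alone while shrinking the standard deviation. It's only a sanity check, not a proof.

```python
import numpy as np

# Hypothetical toy data: 5 observed values and 3 missing ones.
observed = np.array([60.0, 64.0, 66.0, 70.0, 75.0])
n_missing = 3

# Mean imputation: every missing value is replaced by the observed mean.
imputed = np.concatenate([observed, np.full(n_missing, observed.mean())])

mean_before, mean_after = observed.mean(), imputed.mean()
# np.std uses ddof=0 by default, matching the std(ddof=0) calls above.
std_before, std_after = observed.std(), imputed.std()

print(mean_before, mean_after)  # the two means agree
print(std_before, std_after)    # the second standard deviation is smaller
```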

<!-- BEGIN QUESTION -->

### Question 1.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Consider a column of $n$ numbers, $y_1, y_2, ..., y_n$ with mean $M$ and standard deviation $S$, where the standard deviation is defined as follows:

$$S = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - M)^2}$$

Suppose we introduce $k$ new values to the dataset, $y_{n+1}, y_{n+2}, ... , y_{n+k}$, all of which are equal to $M$. (This is like mean imputation, if the full dataset had $n + k$ values, with $n$ observed/not missing and $k$ missing, and we imputed the $k$ missing values with the mean of the observed, $M$.)

Let the new mean and standard deviation of all $n+k$ values be $M'$ and $S'$, respectively.

**Prove that $M' = M$.**

Some guidance: It's not sufficient to provide a verbal argument. Instead, start with the definition of $M' = \frac{1}{n+k} \sum_{i = 1}^{n + k} y_i$, and manipulate the sum to show that it's equal to $M$.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Using the same definitions as in Question 1.1, find $S'$ in terms of $M$, $n$, $k$, and $S$. Show your work.

Some guidance: 
- You may not need to use all of these variables in your answer.
- To verify your answer is correct, plug in $M = 67.10340$, $n = 765$, $k = 169$, and $S = 3.52047$. The answer you get should be approximately $3.18609$, as we saw at the start of the question.

<!-- END QUESTION -->

## Question 2: Relative Squared Loss 🧑‍🧑‍🧒‍🧒

---

In [Lecture 14](https://practicaldsc.org/resources/lectures/lec14/lec14-filled.pdf), we introduced the "modeling recipe" for making predictions:

1. Choose a model.
1. Choose a loss function.
1. Minimize average loss to find optimal model parameters.

The first instance of this recipe saw us choose:
1. The constant model, $H(x) = h$.
1. The squared loss function: $L_\text{sq}(y_i, h) = (y_i - h)^2$.
1. The average squared loss across our entire dataset, then, was:
$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2$$
which, using calculus, we showed is minimized when: $$h^* = \text{Mean}(y_1, y_2, ..., y_n)$$
This means that using the squared loss function, the **best** constant prediction is $h^* = \text{Mean}(y_1, y_2, ..., y_n)$.
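As a rough numerical check of the recipe above, the sketch below evaluates $R_\text{sq}(h)$ on a grid of candidate predictions and picks the best one. The dataset `y` and the helper name `mean_squared_error` are made up for illustration; a grid search like this only approximates the minimizer, but here it lands on the mean.

```python
import numpy as np

y = np.array([2.0, 3.0, 7.0, 8.0])  # hypothetical toy dataset

def mean_squared_error(h, y):
    """Average squared loss R_sq(h) of the constant prediction h."""
    return np.mean((y - h) ** 2)

# Evaluate R_sq on a fine grid of candidate predictions.
hs = np.linspace(0, 10, 1001)
risks = np.array([mean_squared_error(h, y) for h in hs])
best_h = hs[np.argmin(risks)]

print(best_h, y.mean())  # the grid minimizer matches the mean
```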

In this question, you will find the best constant prediction when using a different loss function. In particular, here, we'll explore the **relative squared loss** function, $L_{\text{rsq}}(y_i, h)$:

$$L_{\text{rsq}}(y_i, h) = \frac{(y_i - h)^2}{y_i}$$

Throughout this question, assume that each of $y_1, y_2, ..., y_n$ is positive.
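To get a feel for how this loss behaves, here is a small sketch that evaluates it for two made-up (value, prediction) pairs. The function name `relative_squared_loss` is our own choice for illustration. Both predictions miss by the same amount, yet the losses differ because of the division by $y_i$.

```python
def relative_squared_loss(y_i, h):
    """Relative squared loss of predicting h when the true value is y_i > 0."""
    return (y_i - h) ** 2 / y_i

# Both predictions are off by 2, but the losses differ.
loss_small = relative_squared_loss(4, 2)     # (4 - 2)^2 / 4
loss_large = relative_squared_loss(100, 98)  # (100 - 98)^2 / 100
print(loss_small, loss_large)
```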

<!-- BEGIN QUESTION -->

### Question 2.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Determine $\frac{d}{d h} L_{\text{rsq}}(y_i, h)$, the derivative of the relative squared loss function with respect to $h$.

(Technically, this is a **partial** derivative, since there are other variables in the definition of $L_\text{rsq}(y_i, h)$.)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

What value of $h$ minimizes average loss when using the relative squared loss function – that is, what is $h^*$? Your answer should only be in terms of the variables $n, y_1, y_2, ..., y_n$, and any constants.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Let $C(y_1, y_2, ..., y_n)$ be your minimizer $h^*$ from Question 2.2. That is, for a particular dataset $y_1, y_2, ..., y_n$, $C(y_1, y_2, ..., y_n)$ is the value of $h$ that minimizes empirical risk for relative squared loss on that dataset.

What is the value of $\displaystyle\lim_{y_4 \rightarrow \infty} C(1, 3, 5, y_4)$ in terms of $C(1, 3, 5)$? Your answer should involve the function $C$ and/or one or more constants.

Some guidance: To notice the pattern, evaluate $C(1, 3, 5, 100)$, $C(1, 3, 5, 10000)$, and $C(1, 3, 5, 1000000)$.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

What is the value of $\displaystyle\lim_{y_4 \rightarrow 0} C(1, 3, 5, y_4)$? Again, your answer should involve the function $C$ and/or one or more constants.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.5 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Based on the results of Questions 2.3 and 2.4, when is the prediction $C(y_1, y_2, ..., y_n)$ robust to outliers? When is it not robust to outliers?

<!-- END QUESTION -->

## Question 3: Bye, Calculus 👋

---

As we discussed in the previous question, in Lecture 14, we found that $h^* = \text{Mean}(y_1, y_2, ..., y_n)$ is the constant prediction that minimizes mean squared error:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i-h)^2$$

To arrive at this result, we used calculus: we took the derivative of $R_\text{sq}(h)$ with respect to $h$, set it equal to 0, and solved for the resulting value of $h$, which we called $h^*$.

In this question, we will minimize $R_\text{sq}(h)$ in a way that **doesn't** use calculus. The general idea is this: if $f(x) = (x - c)^2 + k$, then we know that $f$ is a quadratic function that opens upwards with a vertex at $(c, k)$, meaning that $x = c$ minimizes $f$. As we saw in class (see [Lecture 14, Slide 35](https://practicaldsc.org/resources/lectures/lec14/lec14-filled.pdf#page=35)), $R_\text{sq}(h)$ is a quadratic function of $h$!
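If you'd like to convince yourself numerically that $R_\text{sq}(h)$ really is a parabola in $h$, the sketch below fits a degree-2 polynomial to values of $R_\text{sq}(h)$ for a made-up dataset and checks that the fit is essentially exact. The dataset `y` is hypothetical, and a numerical fit like this is not a substitute for the algebra you'll do in this question.

```python
import numpy as np

y = np.array([1.0, 4.0, 6.0, 9.0])  # hypothetical toy dataset
hs = np.linspace(-5, 15, 41)
risks = np.array([np.mean((y - h) ** 2) for h in hs])

# Fit a degree-2 polynomial a*h^2 + b*h + c to the (h, R_sq(h)) pairs.
a, b, c = np.polyfit(hs, risks, deg=2)
max_residual = np.max(np.abs(np.polyval([a, b, c], hs) - risks))

# For a parabola a*h^2 + b*h + c with a > 0, the vertex is at h = -b / (2a).
vertex = -b / (2 * a)
print(a, vertex, max_residual)
```

The leading coefficient is positive, so the parabola opens upwards, and the residual is tiny, meaning the quadratic describes $R_\text{sq}(h)$ up to floating-point error.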

Throughout this problem, let $y_1, y_2, ..., y_n$ be an arbitrary dataset, and let $\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i$ be the mean of the $y$'s.


<!-- BEGIN QUESTION -->

### Question 3.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

What is the value of $\sum_{i = 1}^n (y_i - \bar{y})$? Show your work.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Show that:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n \left( (y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - h) + (\bar{y} - h)^2 \right)$$

Some guidance:
- To proceed, start by rewriting $y_i - h$ in the definition of $R_\text{sq}(h)$ as $(y_i - \bar{y}) + (\bar{y} - h)$. Why is this a valid step?
- Make sure not to expand unnecessarily. Your work should only take ~3 lines.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Show that:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2 + (\bar{y} - h)^2$$

This is called the **bias-variance decomposition** of $R_\text{sq}(h)$, which is an idea we'll revisit in the coming weeks.

Some guidance: At some point, you will need to use your result from Question 3.1.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

Why does the result in Question 3.3 prove that $h^* = \text{Mean}(y_1, y_2, ..., y_n)$ minimizes $R_\text{sq}(h)$?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.5 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

In Question 3.3, you showed that:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2 + (\bar{y} - h)^2$$

Take a close look at the equation above, then fill in the blank below with **a single word**:

> The value of $R_\text{sq}(h^*)$, when $h^* = \text{Mean}(y_1, y_2, ..., y_n)$, is equal to the ____ of the data.

<!-- END QUESTION -->

## Question 4: Probability 🤝 Statistics

---

In Lecture 14, we discussed the relationship between probability and statistics:
- In probability questions, we're given some model of how the universe works, and it's our job to determine how various samples could turn out.<br><small>Example: If we have 5 blue marbles and 3 green marbles and pick 2 at random, what are the chances we see one marble of each?</small>
- In statistics questions, we're given information about a sample, and it's our job to figure out how the universe – or **data generating process** – works.<br><small>Example: Repeatedly, I picked 2 marbles at random from a bag with replacement. I don't know what's inside the bag. One time, I saw 2 blue marbles, the next time I saw 1 of each, the next time I saw 2 red marbles, and so on. What marbles are inside the bag?</small>
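The probability example in the first bullet (5 blue marbles, 3 green marbles, pick 2) can be worked out exactly and also estimated by simulation. The sketch below does both, assuming the 2 marbles are drawn without replacement, which the example doesn't state explicitly. The variable names are our own.

```python
import random
from math import comb

# Exact answer: choose 1 of 5 blue and 1 of 3 green,
# out of all ways to choose 2 of the 8 marbles.
exact = comb(5, 1) * comb(3, 1) / comb(8, 2)

# Simulation: draw 2 marbles without replacement (an assumption) many times.
random.seed(0)
bag = ["blue"] * 5 + ["green"] * 3
trials = 100_000
hits = sum(len(set(random.sample(bag, 2))) == 2 for _ in range(trials))
estimate = hits / trials

print(exact, estimate)  # the estimate should be close to the exact value
```

Here the model (the contents of the bag) is known and we reason about samples. Statistics questions run in the other direction.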

In this question, we'll gain a deeper understanding of this relationship, through the lens of your probability knowledge from EECS 203. To do so, we'll introduce you to a key idea in machine learning and statistics, called **maximum likelihood estimation**.

<div class="alert alert-success">

<h4>Click <a href="https://practicaldsc.org/mle"><b>here</b></a> to read a lecture note (written by us) that introduces you to maximum likelihood estimation!</h4>You <i>can</i> attempt the question without reading the note, but it'll be significantly more difficult.

</div>

<!-- BEGIN QUESTION -->

### Question 4.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

When you step on campus, each person you see has a $0.1$ chance of saying "Go Blue!" to you, independent of all other people.

Tomorrow, what's the probability that the first person to say "Go Blue!" to you is the **6th** person you see?

Leave your answer in unsimplified form. This question should not take very long; think back to the probability distributions you learned in EECS 203 (other than the binomial distribution).

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Again, assume that each person you see has a $0.1$ chance of saying "Go Blue!" to you, independent of all other people.

What's the probability that:
- the first person to say "Go Blue!" to you tomorrow is the **6th** person you see, **and**
- the first person to say "Go Blue!" to you the day after tomorrow is the **10th** person you see, **and**
- the first person to say "Go Blue!" to you the day after that is the **2nd** person you see?

Again, leave your answer in unsimplified form. Note that we're asking for a single probability, not three separate probabilities.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Now, suppose that the probability that each person you see says "Go Blue!" to you is some **unknown parameter**, $\pi$. (That is, $0.1$ will not appear in the rest of this question.)

Suppose you go to campus on $n$ straight days, and you collect a dataset $x_1, x_2, ..., x_n$, where:
- On Day 1, the first person to say "Go Blue!" to you is the $x_1$th person you saw (or "person $x_1$"), **and**
- On Day 2, the first person to say "Go Blue!" to you is person $x_2$, **and**
- On Day 3, the first person to say "Go Blue!" to you is person $x_3$, **and** so on.
- In general, for $i = 1, 2, ..., n$, on Day $i$, the first person to say "Go Blue!" to you is person $x_i$.

For example, the dataset $x_1 = 5, x_2 = 10, x_3 = 2$ would mean that on Day 1, person 5 was the first to say "Go Blue!"; on Day 2, person 10 was the first to say "Go Blue!"; and on Day 3, person 2 was the first to say "Go Blue!".

**Prove** that $\log L(\pi)$, the log of the likelihood function for $\pi$, is:

$$\log L(\pi) = \log(1 - \pi) \sum_{i = 1}^n (x_i - 1) + n \log \pi$$

Some guidance: Try and generalize the calculation you made in Question 4.2.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Using the result of Question 4.3, find $\pi^*$, the maximum likelihood estimate of $\pi$ given the dataset $x_1, x_2, ..., x_n$. Once you've done that, give a brief English explanation of why the value of $\pi^*$ makes intuitive sense.

<!-- END QUESTION -->

## Question 5: More and More Losses 🅻

---

As we mentioned in Questions 2 and 3, $h^* = \text{Mean}(y_1, y_2, ..., y_n)$ is the constant prediction that minimizes mean squared error, i.e. average squared loss:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i-h)^2$$

Relatedly, $h^* = \text{Median}(y_1, y_2, ..., y_n)$ is the constant prediction that minimizes mean absolute error, i.e. average absolute loss:

$$R_{\text{abs}}(h) = \frac{1}{n} \sum_{i=1}^n \left|y_i - h\right|$$

You may notice that the formulas for $R_\text{sq}(h)$ and $R_\text{abs}(h)$ look awfully similar – they're nearly identical besides the exponent. More generally, for any positive integer $p$, define the $L_p$ loss as follows:

$$L_p(y_i, h) = |y_i - h|^p$$

With this definition, $L_2$ loss is the same as squared loss and $L_1$ loss is the same as absolute loss. The corresponding average loss, for any value of $p$, is then:

$$ R_{p}(h) = \frac{1}{n} \sum_{i=1}^n \left|y_i - h\right| ^ p $$

Written in terms of $R_p(h)$, we know – from the top of this question – that:

- The minimizer of $R_1(h)$ is $\text{Median}(y_1, y_2, ..., y_n)$:

$$\text{Median}(y_1, y_2, ..., y_n) = \underset{h}{\mathrm{argmin}} \: R_1(h)$$

- The minimizer of $R_2(h)$ is $\text{Mean}(y_1, y_2, ..., y_n)$:

$$\text{Mean}(y_1, y_2, ..., y_n) = \underset{h}{\mathrm{argmin}} \: R_2(h)$$

But what constant prediction $h^*$ minimizes $R_3(h)$, or $R_{10}(h)$, or $R_{10000}(h)$? In this question, we'll explore this idea – more specifically, we'll study how $h^*$ changes as $p$ (the exponent on $|y_i - h|$) increases.

[Lecture 14](https://practicaldsc.org/resources/lectures/lec14/lec14-filled.pdf#page=38) worked through how to solve for the constant prediction $h^*$ that minimizes average squared loss (i.e. minimizes $R_2(h)$), and we linked to a [video](https://youtu.be/0s7M8OsnBNA?si=lHm6eN3rns7PzPOW) that works through a similar derivation for average absolute loss (i.e. $R_1(h)$). Unfortunately, $p = 1$ and $p = 2$ are the only cases in which we can solve for the minimizer of $R_p(h)$ by hand.

For all other values of $p$, there is no closed-form solution (i.e. no "formula" for the best constant prediction), and so we need to approximate the solution using the computer. Later in the class, we'll learn how to minimize functions using code we write ourselves (the idea is called gradient descent if you're curious), but for now, we're going to use `scipy.optimize.minimize`, which does the hard work for us.

The `minimize` function is a versatile tool from the `scipy` library that can help us find the input that minimizes the output of a function. Let's test it out.

In [None]:
from scipy.optimize import minimize

Below, we've defined and plotted a quadratic function. We can see 👀 that it's minimized when $x = -4$.

In [None]:
def f(x):
    return (x + 4) ** 2 - 1

In [None]:
xs = np.linspace(-20, 20)
ys = f(xs)
px.line(x=xs, y=ys)

But Python doesn't have eyes, so it can't see that the graph is minimized at $x = -4$. `minimize`, though, magically **can** do this minimization!

In [None]:
# To call minimize, we have to provide an array of initial "guesses"
# as to where the minimizing input might be.
# For our purposes, using 0 as an initial guess will work fine.
minimize(f, x0=[0])

Above, the `x` attribute of the output tells us that the minimizing input to `f` is `-4.0000`, which is what we were able to see ourselves! Cool.
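The object that `minimize` returns has a few attributes worth knowing about. The short sketch below repeats the call above and pulls out the pieces you're most likely to need; the variable names `result`, `x_min`, and `f_min` are our own.

```python
from scipy.optimize import minimize

def f(x):
    return (x + 4) ** 2 - 1

result = minimize(f, x0=[0])

# result.x is an array with one entry per input variable, so index into it
# to get a plain number.
x_min = result.x[0]
# result.fun is the value of f at the minimizing input.
f_min = result.fun

print(x_min, f_min)
```

Since `minimize` works numerically, `x_min` will be very close to, but probably not exactly, `-4`.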

In this question, we'll deal with the following example array of values, `vals`:

In [None]:
vals = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 5, 10, 10, 39])

For context, let's see what the distribution of `vals` looks like:

In [None]:
pd.Series(vals).hist(nbins=40)

To reiterate, the constant prediction $h^*$ that minimizes $R_1(h)$ for `vals` is:

In [None]:
np.median(vals)

And the constant prediction $h^*$ that minimizes $R_2(h)$ for `vals` is:

In [None]:
np.mean(vals)

### Question 5.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">0 Points</div>

Complete the implementation of the function `h_star`, which takes in a positive integer `p` and an array `vals` and returns the value of the constant prediction $h^*$ that minimizes average $L_p$ loss for `vals`, i.e. the value of $h^*$ that minimizes $R_p(h)$ for `vals`. Example behavior is given below.

```python
>>> h_star(1, vals)
2.0

>>> h_star(2, vals)
5.714285345730987
```

Some guidance:
- Your solution should use `minimize`, and will likely involve defining a helper function inside.
- It's okay if your example values are slightly different than those above, but they should be roughly the same. (So, it's fine if `h_star(1, vals)` gives you `1.9999999920558864` or something similar.)

<div class="alert alert-warning">

**We're not autograding Question 5.1, and it's not worth any points.** But, you need to do it in order to answer Question 5.2, which is worth points (and which you will answer on paper!).
    
</div>

In [None]:
def h_star(p, vals):
    ...

# Feel free to change this input to make sure your function works correctly.
h_star(2, vals)

Before proceeding, make sure that the following cells both say `True`. If they don't, something is wrong with your implementation:

In [None]:
np.isclose(h_star(1, vals), np.median(vals))

In [None]:
np.isclose(h_star(2, vals), np.mean(vals))

Once you have a working implementation of `h_star`, run the cell below.

In [None]:
ps = np.arange(1, 91)
hs = [h_star(p, vals) for p in ps]
px.line(x=ps, y=hs).update_layout(xaxis_title=r'$p$', yaxis_title=r'$h^* = \text{minimizer of } R_p(h)$')

It seems like as $p$ increases, the value of $h^*$ that minimizes $R_p(h)$ approaches some fixed value. But what is that value? For context, look at `vals` again:

In [None]:
vals

<!-- BEGIN QUESTION -->

### Question 5.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Use the plot above to answer the following prompts:

1. In the `vals` dataset, as $p$ increases, what does the value of $h^*$ that minimizes $R_p(h)$ approach?
1. In any general dataset of values $y_1, y_2, ..., y_n$, as $p$ increases, what does the value of $h^*$ that minimizes $R_p(h)$ approach? Why?

Put another way, we're asking you to evaluate the following limit, but using your plot, not calculus (you're welcome 😊):

$$\lim_{p \rightarrow \infty} \left( \underset{h}{\mathrm{argmin}} \frac{1}{n} \sum_{i = 1}^n |y_i - h|^p \right)$$

Some guidance:
- To answer the second prompt, try calling `h_star` with different arrays that you create. Try and see if you can find a pattern in the values that `h_star` returns when `p` is very large.
- If you experiment in the way we're suggesting above, you may run into _overflow_ errors, where the numbers you're dealing with are too big for Python to represent (e.g., something like $|100 - 90|^{2000}$ is far too big).
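To see what overflow looks like in practice, the sketch below computes $10^{2000}$ as a floating-point number in two ways. The variable names are our own. Note that the setup cell at the top of this notebook silences warnings, so with NumPy you may see `inf` appear without any message.

```python
import numpy as np

# Plain Python floats raise an error when a power is too large to represent.
base = 10.0
try:
    base ** 2000
    python_overflowed = False
except OverflowError:
    python_overflowed = True

# NumPy floats instead overflow to infinity (normally with a RuntimeWarning,
# which is silenced here).
with np.errstate(over="ignore"):
    numpy_value = np.float64(10.0) ** 2000

print(python_overflowed, numpy_value)
```

If `h_star` starts returning strange values for very large `p`, overflow like this is a likely culprit.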

<!-- END QUESTION -->

## Question 6: Algebra, Too 📐

---

In the coming lectures, we'll start formulating the problem of making predictions about future data given past data in terms of matrices and vectors. Why? The answer is simple: doing so will allow us to build models that use multiple input variables (i.e. features) in order to make predictions.

This question serves to review the key linear algebra knowledge you'll need to be familiar with as we start using matrices and vectors in lecture. If any of this feels foreign – and it's totally fine if it does! – review [LARDS: Linear Algebra Review for Data Science](https://practicaldsc.org/lin-alg/). We'll link to specific sections in LARDS for each part of this question.
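You're expected to answer this question by hand, but NumPy is handy for checking your arithmetic. The sketch below shows the relevant operations on made-up vectors `u`, `a`, and `b`, which are deliberately unrelated to the vectors in this question. It's one way of doing the computations, using the standard projection formulas.

```python
import numpy as np

# Hypothetical vectors, unrelated to the ones used in this question.
u = np.array([2.0, 3.0, 4.0])
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, -1.0, 0.0])

# Two vectors are orthogonal when their dot product is 0.
a_dot_b = a @ b

# Projection of u onto the single vector a: (u . a / a . a) * a.
proj_onto_a = (u @ a) / (a @ a) * a

# Projection of u onto span(a, b): stack a and b as the columns of X,
# then compute X (X^T X)^{-1} X^T u.
X = np.column_stack([a, b])
weights = np.linalg.inv(X.T @ X) @ X.T @ u
proj_onto_span = X @ weights

# The columns of X are linearly independent when the rank of X
# equals its number of columns.
rank = np.linalg.matrix_rank(X)

print(a_dot_b, proj_onto_a, weights, proj_onto_span, rank)
```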

Throughout this question, consider the following vectors in $\mathbb{R}^3$, where $\beta \in \mathbb{R}$ is a scalar:

$$
\vec{v}_1 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \quad 
\vec{v}_2 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}, \quad 
\vec{v}_3 = \begin{bmatrix} \beta \\ 1 \\ 2 \end{bmatrix}
$$

<!-- BEGIN QUESTION -->

### Question 6.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

For what value(s) of $\beta$ are $\vec{v}_1, \vec{v}_2,$ and $\vec{v}_3$ linearly **in**dependent?

<small><small>📕 To review, read LARDS [Section 5](https://practicaldsc.org/lin-alg/#linear-independence).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

For what value(s) of $\beta$ are $\vec{v}_1$ and $\vec{v}_3$ orthogonal?

<small><small>📕 To review, watch LARDS [Section 2](https://practicaldsc.org/lin-alg/#the-dot-product-angles-and-orthogonality).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

For what value(s) of $\beta$ are $\vec{v}_2$ and $\vec{v}_3$ orthogonal?

<small><small>📕 To review, watch LARDS [Section 2](https://practicaldsc.org/lin-alg/#the-dot-product-angles-and-orthogonality).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Regardless of your answers to the previous three parts, in this part, let $\beta = 3$.

Is the vector $\begin{bmatrix}
3 \\
5 \\
8
\end{bmatrix}$ in $\text{span}(\vec{v}_1, \vec{v}_2, \vec{v}_3)$? Why or why not?

<small><small>📕 To review, watch LARDS [Section 4](https://practicaldsc.org/lin-alg/#linear-combinations-and-span) and read [Section 5](https://practicaldsc.org/lin-alg/#linear-independence).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.5 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

What is the projection of the vector $\begin{bmatrix}
3 \\
15 \\
21
\end{bmatrix}$ onto $\vec{v}_1$?  Give your answer in the form of a vector.

<small><small>📕 To review, watch LARDS [Section 6](https://practicaldsc.org/lin-alg/#projecting-onto-a-single-vector).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.6 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

What is the orthogonal projection of the vector $\begin{bmatrix}
3 \\
15 \\
21
\end{bmatrix}$ 
onto $\text{span}(\vec{v}_1, \vec{v}_2)$?

The answer is a vector, $\vec{z}$, which can be written in the form:

$$\vec{z} = \lambda_1 \vec v_1 + \lambda_2 \vec v_2$$

**Your job** is to find the values of scalars $\lambda_1$ and $\lambda_2$, and then, the vector $\vec z$. As done in LARDS [Section 8](https://practicaldsc.org/lin-alg/#projecting-onto-the-span-of-multiple-vectors-again), one of the intermediate steps in answering this question involves defining a particular matrix $X$ and computing $(X^T X) ^{-1}X^T$.

<small><small>📕 To review, watch LARDS [Sections 6-8](https://practicaldsc.org/lin-alg/#projecting-onto-the-span-of-multiple-vectors-again).</small></small>

<!-- END QUESTION -->

## Finish Line 🏁

Congratulations! You're ready to submit Homework 7.

Remember, you'll submit Homework 7 as a **PDF** to the **Homework 7 (PDF)** assignment on Gradescope. You won't submit this notebook anywhere; Homework 7 will be entirely manually graded.