In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw06.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 6

# GPTEECS and Loss Functions

### EECS 398: Practical Data Science, Winter 2025

#### Due Tuesday, March 11th at 11:59PM (due after Spring Break!)
    
</div>

## Instructions

Welcome to Homework 6! In the first part of the homework (Question 1), you will apply your knowledge of cosine similarity and TF-IDF to implement a supercharged ChatGPT-like bot. Throughout the rest of the homework, you will develop your understanding of the theoretical foundations of machine learning – specifically, loss functions and simple linear regression – which will enable us to build useful, practical models in the latter half of the semester.

You are given 8 slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/wn25/). The [Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps. Once you're done, you'll submit your completed notebook to Gradescope.

<div class="alert alert-info" markdown="1">

### No Hidden Tests; Late Deadline; Disclaimer

Unlike other homeworks, Homework 6 <b>has no hidden tests</b>, because of its proximity to the Midterm Exam. This means the tests you see in your notebook are the exact same as the ones that will be used to grade your work on Gradescope. When you submit on Gradescope, you'll see your score shortly after you submit, once the autograder finishes running.
<br><br>
<b>Even though Homework 6 is due after the Midterm Exam, you should work on it before, since most of the homework is in scope for the exam!</b> In particular, we recommend working on Questions 2-5 before the exam, because they provide core practice with the constant model and simple linear regression, both of which will appear on the exam. 
- Question 6 involves linear algebra review, which will be important for the content we cover after the exam, but not relevant before (except for in how it relates to cosine similarity and dot products, which we covered in Lecture 10).
- Question 1 is more "applied", and while cosine similarity, bag of words, and TF-IDF will appear on the exam, the best way to practice with those ideas is by working on relevant old exam problems at the <a href="https://study.practicaldsc.org"><b>Study Site</b></a>.
</div>


<div class="alert alert-warning">

### Submission
    
This homework features a mix of autograded programming questions and manually-graded questions.
  
- Question 1 is **fully autograded**, and each part will say **[Autograded 💻]** in the title. For these parts, all you need to do is write your code in this notebook, run the local `grader.check` tests, and submit to the **Homework 6 (Question 1; autograded problems)** assignment on Gradescope to have your code graded by the autograder.
    
- Questions 2-6 are **manually graded**, and say **[Written ✏️]** in their titles. For these questions, **do not write your answers in this notebook**! Instead, write **all** of your answers in a separate PDF. Submit this separate PDF to the **Homework 6 (Questions 2-6; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**! Make sure to show your work for all written questions, as answers without work shown may not receive full credit.

    
Your Homework 6 submission time will be the **later** of your two individual submissions. Please start early and submit often. You can submit as many times as you'd like to Gradescope, and we'll take your **most recent** submission. 
</div>

This homework is worth a total of **76 points**, 19 of which come from the autograder (Question 1), and **58 of which are manually graded by us (Questions 2-6)**. The number of points each question is worth is listed at the start of each question. **The 6 questions in the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the cell below, plus the cell at the top of the notebook that imports and initializes `otter`.

In [None]:
import pandas as pd
import numpy as np
import os
import re
import groq
from IPython.display import Markdown

import plotly
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"
pd.options.plotting.backend = 'plotly'

import warnings
warnings.simplefilter('ignore')

## Question 1: GPTEECS 🤖

---

### Overview

Large Language Models (LLMs), like GPT-4 by OpenAI, Claude by Anthropic, Llama by Meta, or DeepSeek-R1, are statistical models that were trained on massive datasets for the purpose of generating useful new text. [ChatGPT](https://chat.openai.com) and other similar chat interfaces make calls to an LLM API under the hood, and show you the results in a text message-like format.

Open ChatGPT or your favorite other LLM chat interface, and ask it:

> What are some courses related to machine learning I can take at Michigan, other than EECS 445?

Until very recently (when ChatGPT started being able to search the internet), ChatGPT would either tell you that it doesn't know what EECS 445 is, or make up an answer. Even if it did give you an answer, it's not necessarily clear whether it pulled the answer from a reliable source, or whether it's still true today (it may have found syllabi or course descriptions online several years ago, and could be hallucinating). 

### Retrieval-Augmented Generation (RAG)

A solution to this issue is **Retrieval-Augmented Generation (RAG)**. **In this question, we will use RAG to implement GPTEECS, a chat interface designed to answer questions about courses at the University of Michigan (UM).** Here's the general idea behind RAG, and how we'll use it in this question:

1. We want to implement a chat bot that can answer questions about something specific.<br><small>**Here**, we want our chat bot to answer questions about classes at UM.</small>
1. To do so, we download and store documents that contain the relevant context that we wish our LLM knew about.<br><small>**Here**, we'll download the course descriptions of various classes at UM. We've already done this for you.</small>
1. Then, when the user asks a question – called a **query** – we determine which of our locally-stored documents (course descriptions) are most relevant in answering their question.<br><small>**Here**, when a user asks a question about UM classes, we'll determine the course description(s) that are most relevant.</small>
1. Once we find the most relevant documents, we send the user's query, **along with** the most relevant documents, to our language model, allowing it to find the answer for us with the context it needs.

<center><img src="imgs/retrieval-augmented-generation.png" width=700><br>(<a href="https://towhee.io/tasks/detail/pipeline/retrieval-augmented-generation">image source</a>)</center>

RAG enables organizations to create customized chat interfaces that are better equipped to answer questions about the organization than an out-of-the-box language model. For instance, if you operated a store and wanted an AI-powered customer support chat, you may use RAG to create a chat bot that knows about your store's catalog, return policies, etc. ChatGPT even allows you to make custom GPTs [yourself](https://openai.com/index/introducing-gpts/) by uploading customized knowledge bases, and these (likely) use a process similar to RAG.

### FAQs

- **How do we determine which documents are most relevant to the user's query?** Here, we'll implement this using TF-IDF and cosine similarity, as we've seen in [Lecture 10](https://practicaldsc.org/resources/lectures/lec10/lec10-filled.html)! In practice, more sophisticated, state-of-the-art techniques for converting text to numbers are used (if you're curious, look into "word embeddings").
- **Why not just send all of the documents to our language model, instead of finding the documents that are most relevant?** LLMs have a [context window](https://www.hopsworks.ai/dictionary/context-window-for-llms), which is a limit on the length of the input query they can take in. If your query is too long, an LLM may not be able to process it. (And, if it includes unnecessary information, it can be hard for the LLM to give you an accurate response.)
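To make the first FAQ concrete, here's a minimal sketch of cosine similarity on made-up vectors. The three-word vocabulary and all of the numbers are invented for illustration; they don't come from the course data.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths (norms).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word-weight vectors over the vocabulary ['data', 'biology', 'learning'].
query = np.array([1.0, 0.0, 1.0])
doc_a = np.array([2.0, 0.0, 2.0])  # Points in the same direction as the query.
doc_b = np.array([0.0, 3.0, 0.0])  # Shares no words with the query.

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
```

Here, `doc_a` is as similar to the query as a document can be, even though its entries are twice as large, while `doc_b` has a similarity of 0 because it has no words in common with the query.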

### Your Task
The file `data/courses.txt` contains course descriptions for every class offered at UM. Think of each description as its own document, even though all descriptions are technically just in one `.txt` file. All of these descriptions together comprise our **corpus**.

Shortly, using the ideas from Lecture 10, you will develop a working implementation of the following function:

```python
>>> top_n_similar_documents('probability theory and randomness', 4, bow)
['STATS 425', 'MATH 425', 'STATS 412', 'ECE 501']

```

And even cooler, you'll implement a function that can fully answer questions, like:

```python
>>> ask_gpteecs("What are some courses related to machine learning I can take at Michigan, other than EECS 445?")
```

And get back answers like:

> Here are 5 relevant courses to Machine Learning that you can take at the University of Michigan:
> - ECE 559 Optimization Methods in Signal Processing and Machine Learning: This course covers optimization methods for signal and image processing and machine learning problems.
> - EECS 553 Machine Learning (ECE): This course covers the fundamentals of supervised, unsupervised, and sequential learning, including linear and nonlinear regression, logistic regression, and neural networks.
> - ASTRO 416 Data Science for Astrophysicists: Although targeted towards astrophysics, this course covers essential skills in machine learning, including unsupervised and supervised learning, regularization, and neural networks.
> - EECS 492 Introduction to Artificial Intelligence: This course introduces the core concepts of AI, including search, logic, knowledge representation, reasoning, and decision making under uncertainty.
> - EECS 481 Software Engineering: Although not exclusively focused on machine learning, this course deals with software engineering principles, including development of large, complex software systems, and covers software development techniques that are applicable to machine learning projects.

### Question 1.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

First, let's figure out how to call a Large Language Model directly from our notebook. Another term for "calling" an API is "querying" the API – confusing, we know, since for us "querying" means "selecting a subset of rows in a DataFrame", but this is an industry-standard term.

OpenAI does have a Python API, but it's relatively limited on the free plan. Instead, we'll use tools from [Groq](https://groq.com/). Groq is a hardware company designing processors for running LLMs efficiently, and allows for fast, free access to open-source LLM APIs. We'll use the [Groq API](https://console.groq.com/docs/quickstart) to query Meta's (formerly Facebook's) Llama 3.1 model.

Go [**here**](https://console.groq.com/docs/quickstart) and create a Groq API key. Then, complete the implementation of `query_llm`, a function that takes in a string (`query_string`) and returns the text response that results from passing `query_string` to Groq. The function has largely been implemented for you; most of what you need to do is create an API key and put it in the right place below.

<div class="alert alert-info">
    
APIs will limit the number of queries you can make per minute, and the number of tokens (words) you send per minute. We call these limits "rate limits". Each LLM also has a maximum "context window" size, which is the length (in words) of the largest query you can make. If you run into rate limit or context window issues while working on this homework, try switching to a new model. You can find the list of models and their IDs [**here**](https://console.groq.com/docs/models).
- We recommend starting with `'llama-3.1-8b-instant'`, as it offers very high usage limits, but you can test out different models if you'd like.
- If you want, you can use a _reasoning_ model, like the brand-new `'deepseek-r1-distill-qwen-32b'` model. Reasoning models "think" before they say a final answer, sometimes improving the quality of the final response. But, our queries are relatively simple, and the reasoning steps make the responses much longer, so we've chosen to _not_ use a reasoning model by default (`'llama-3.1-8b-instant'` is _not_ a reasoning model).
</div>
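If you do hit a rate limit, one common workaround is to wait a bit and try again. Below is a minimal sketch of that idea. The helper `call_with_retries` and the `flaky` function are hypothetical names made up for this illustration; they are not part of the Groq library, and the sketch catches every exception rather than a Groq-specific one.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    # Calls fn() up to max_attempts times, waiting longer after each failure.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts, so re-raise the last error.
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay.

# A stand-in for an API call that fails twice before succeeding.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

result = call_with_retries(flaky, max_attempts=3, base_delay=0.01)
```

With a real API call, you might wrap it as `call_with_retries(lambda: query_llm('Tell me a joke about data science'))`. This is optional; switching models, as suggested above, is usually enough.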

In [None]:
def query_llm(query_string):
    client = groq.Groq(
        api_key= ...
    )
    
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": query_string
            }
        ],
        model="llama-3.1-8b-instant",
        # temperature=0 # Try uncommenting this and running the call to query_llm below many times. What do you notice? Recomment it out afterwards.
    )

    return chat_completion.choices[0].message.content

# Feel free to change the input below to test out your implementation of query_llm.
# The Markdown function behaves like the print function,
# but renders text formatting (e.g. bolding, bullet points) when the output from the LLM
# contains these elements.
Markdown(query_llm('Tell me a joke about data science'))

In [None]:
grader.check("q01_01")

Now, we can call `query_llm`! Run the cell below.

In [None]:
Markdown(query_llm('''
What are some courses related to machine learning I can take at Michigan, other than EECS 445? 
Keep it concise: just one paragraph.'''))

To experiment:
- Run the cell many times. You'll notice that the response is very different every time; most of the course numbers it tells you about are made up. Google them!
- Uncomment the line that says `temperature=0` in your definition of `query_llm`, and then run the above cell many times again. What do you notice now? (To see what this argument is doing, go to the [documentation](https://console.groq.com/docs/api-reference#chat-create) and search for "temperature".) Recomment out the line before proceeding.
- If you remove "Keep it concise: just one paragraph.", what do you notice?
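For some intuition about temperature, here's a simplified picture of what it does. Roughly speaking, a language model assigns a score to each possible next token, and those scores are divided by the temperature before being turned into probabilities. The scores below are made up, and the sketch is only an illustration of the general idea, not a description of how Groq implements `temperature` (in particular, a temperature of exactly 0 can't be plugged into this formula).

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide the scores by the temperature, then apply a softmax.
    scaled = np.array(logits, dtype=float) / temperature
    scaled -= scaled.max()  # Subtracting the max avoids overflow without changing the result.
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [2.0, 1.0, 0.5]  # Made-up scores for three candidate next tokens.
warm = softmax_with_temperature(logits, temperature=1.0)
cold = softmax_with_temperature(logits, temperature=0.1)
```

At the lower temperature, nearly all of the probability lands on the highest-scoring token, which is why responses become much more repeatable as the temperature approaches 0.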

Now we have a way of passing queries to an LLM and getting back results. Right now, it's not knowledgeable enough to answer questions about specific UM classes. Soon, we'll change that.

We'll get back to using `query_llm` in the final part of this question. For now, we need to switch our attention to implementing RAG – that is, being able to find the course descriptions that are most related to our input query. Once we implement it, when we pass our (new) function the input `'What are some courses related to machine learning I can take at Michigan, other than EECS 445? Keep it concise: just one paragraph.'`, it'll provide accurate, up-to-date information about real courses, because we'll send real course descriptions to the LLM along with the original input. **Keep this goal in mind. The next few parts may seem unrelated, but they all come together at the end!**

### Question 1.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

A **token** is an alphanumeric string. In Lecture 10, we referred to tokens as "terms". Before computing any numbers, we need to find the terms in each course description, i.e. we need to **tokenize** each course description.

Complete the implementation of the function `tokenize`, which takes in a string (`string`) of text and returns a list containing all of the tokens in `string`. Convert all characters to lowercase before extracting tokens.

Example behavior is given below.

```python
>>> tokenize("EECS 398 Practical Data Science's about data management and applied machine learning.")
['eecs',
 '398',
 'practical',
 'data',
 'science',
 's',
 'about',
 'data',
 'management',
 'and',
 'applied',
 'machine',
 'learning']
>>> tokenize('This course is an introduction to the modern. qualitative. theory... of ordinary differential equations')
['this',
 'course',
 'is',
 'an',
 'introduction',
 'to',
 'the',
 'modern',
 'qualitative',
 'theory',
 'of',
 'ordinary',
 'differential',
 'equations']
```

Note that this part is only worth 1 point, so it shouldn't take very long!

In [None]:
def tokenize(string):
    ...

# Feel free to change the input below to test out your implementation of tokenize.
tokenize("EECS 398 Practical Data Science's about data management and applied machine learning.")

In [None]:
grader.check("q01_02")

Before we move on to Question 1.3, it's worth mentioning that in practice, we'd put a bit more care into tokenizing our documents. For one, we might **lemmatize** our tokens, which would allow us to group words like `'eating'`, `'ate'`, and `'eaten'` all to `'eat'`. We've omitted such steps here for simplicity.

Where do the course descriptions come from? They're all already stored in `courses.txt`, which we've scraped for you. Run the cell below to extract course information out of `courses.txt` and store it in the DataFrame `courses_df`.

In [None]:
# Run this cell, and don't change it!
def extract_course_data(path):
    # path = 'data/courses.txt'
    with open(path, 'r', encoding='utf-8') as file:
        content = file.read()
    
    course_sections = content.split("# COURSE")[1:]  # Ignore first empty split.
    
    course_codes = []
    course_titles = []
    descriptions = []
    for section in course_sections:
        lines = section.strip().split("\n")
        course_code = lines[0].split('-')[0].strip()
        course_title = lines[0].split('-')[1].strip()
        description = " ".join(lines[1:]) 
        # Some courses have an empty description, so just fill it with placeholder to take care of possible NaNs later on.
        if not description:
            description = 'No course description'
            
        description = course_code + ' ' + course_title + ' ' + description
        course_codes.append(course_code)
        descriptions.append(description)
    
    courses = pd.DataFrame({
        'Course' : course_codes,
        'Description' : descriptions
    })
    
    courses = courses.set_index('Course')
    return courses

courses_df = extract_course_data('data/courses.txt')
courses_df

Throughout the rest of this homework, we'll need `courses_df` to be defined as it is above. In the next part, we'll compute the frequencies of every token in every course description. **For the rest of this question, by "course description", we're referring to a value from the `'Description'` column of `courses_df`, which technically also includes the course number and title.**

### Question 1.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `courses_to_bow`, which takes in the DataFrame `courses_df` and returns the corresponding **bag of words matrix** as a DataFrame, with:
- One row per course, indexed by the course code, just like in `courses_df`.
- One column per unique word (token) among all course descriptions (i.e. across the entire corpus). The order of the columns in the returned DataFrame does not matter. 
- Values corresponding to the number of occurrences of each word in each course description.

Example behavior is given below.

```python
>>> out = courses_to_bow(courses_df)
>>> out.shape
(1638, 10865) # Same number of rows as courses_df
>>> out.loc['EECS 280', 'data']
4
```

Some guidance:
- You must implement all of the steps by hand, i.e. no using `sklearn`'s `CountVectorizer`.
- Since we already have a function for tokenizing each description, it's not necessary to use regular expressions to count the number of occurrences of particular words in each document. Look into the list `count` method, which you can use in conjunction with a `for`-loop or the Series `apply` method. Our solution follows the work in Lecture 10 closely. That said, feel free to use a `for`-loop if needed (we did).
- Our solution takes ~10-20 seconds to run; make sure yours is similarly quick.
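To see what a bag of words matrix looks like, here's a minimal sketch on a made-up, already-tokenized corpus of two documents. It uses `collections.Counter` instead of the list `count` method, so treat it as one possible way to count tokens, not as the approach from lecture.

```python
import pandas as pd
from collections import Counter

# A made-up corpus of already-tokenized documents (not the course data).
toy_tokens = pd.Series({
    'DOC 1': ['data', 'science', 'data'],
    'DOC 2': ['science', 'of', 'cooking'],
})

# Count the tokens in each document, then stack the counts into a DataFrame.
# Words missing from a document show up as NaN, so we fill those with 0.
toy_bow = (
    pd.DataFrame([Counter(tokens) for tokens in toy_tokens], index=toy_tokens.index)
    .fillna(0)
    .astype(int)
)
```

The result has one row per document and one column per unique word, with `toy_bow.loc['DOC 1', 'data']` equal to 2.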

In [None]:
def courses_to_bow(courses_df):
    # Make sure your solution does not actually modify courses_df!
    ...

# Uncomment the two lines below once you've implemented courses_to_bow.
# The autograder tests will not work unless you've run these two lines.
# bow = courses_to_bow(courses_df)
# bow

In [None]:
grader.check("q01_03")

**Make sure that throughout the rest of your notebook, `bow` is defined exactly as below!**

In [None]:
# Run this cell.
bow = courses_to_bow(courses_df)
bow.head()

### Question 1.4 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `bow_to_tfidf`, which takes in a bag of words matrix (`bow`) returned by `courses_to_bow`. `bow_to_tfidf` should return a DataFrame with the same index and column names as `bow`, in the same order, but with all values converted to TF-IDFs – that is, the outputted DataFrame should contain the TF-IDF of every word in every course description.

Example behavior is given below.

```python
# Here, we're referring to the globally-defined bow.
>>> out = bow_to_tfidf(bow)
>>> out.shape == bow.shape
True
>>> out.loc['EECS 280', 'data']
0.19708242779519058
```

Some guidance:
- Follow our logic from Lecture 10 to convert `bow` to a TF-IDF matrix. Your implementation here should be relatively short (< 10 lines).
- While not strictly required (in that we won't test it), we recommend you implement `compute_idfs`, which takes in a DataFrame like `bow` and returns a **Series** containing the inverse document frequency (IDF) of each word in `bow`. Not only will this help compartmentalize your work for this question, but it'll make your life much easier in Question 1.5, when you'll again need to use the IDFs of every word in the corpus.
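As a reminder of how the pieces fit together, here's a minimal sketch of the TF-IDF calculation on a made-up bag of words matrix with three documents. One assumption to flag: the sketch uses the natural logarithm for the IDF, so if lecture used a different base, use that one instead.

```python
import numpy as np
import pandas as pd

# A made-up bag of words matrix: rows are documents, columns are words.
toy_bow = pd.DataFrame(
    {'data': [2, 0, 1], 'science': [1, 1, 0], 'cooking': [0, 2, 0]},
    index=['DOC 1', 'DOC 2', 'DOC 3'],
)

# Term frequency: each count divided by the total number of tokens in its document.
tf = toy_bow.div(toy_bow.sum(axis=1), axis=0)

# Inverse document frequency: log(total # of documents / # of documents containing the word).
# Assumption: natural log; swap in the base used in lecture if it differs.
idf = np.log(toy_bow.shape[0] / (toy_bow > 0).sum(axis=0))

toy_tfidf = tf * idf  # Multiplies each column of tf by that column's IDF.
```

For example, `'data'` makes up 2 of the 3 tokens in `'DOC 1'` and appears in 2 of the 3 documents, so its TF-IDF there is $\frac{2}{3} \cdot \log \left( \frac{3}{2} \right)$.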

In [None]:
def compute_idfs(bow):
    # Not required, but suggested!
    ...

def bow_to_tfidf(bow):
    ...

bow_to_tfidf(bow).head()

In [None]:
grader.check("q01_04")

Before we move forward, it's worth stopping and looking at what we've already accomplished. Run the cell below to see the 5 words with the highest TF-IDFs in each course description.

In [None]:
def five_largest(row):
    return ', '.join(row.index[row.argsort()][-5:])

bow_to_tfidf(bow).apply(five_largest, axis=1)

Compare that to the 5 words with the highest frequencies in each course description:

In [None]:
bow.apply(five_largest, axis=1)

Hopefully, the value of TF-IDF is clear, but it's also clear that TF-IDF isn't perfect in summarizing documents. But, as we'll soon see, it'll serve our purposes well!

Before you move to Question 1.5, there's one piece of syntax that you'll find useful: the Series `reindex` method. Here's an example of how it works:

In [None]:
things = pd.Series({'a': 2, 'b': 5, 'c': 1})
things

In [None]:
stuff = pd.Series({'a': 'hello', 'b': 'hi', 'x': 9})
stuff

In [None]:
things.reindex(stuff.index)

In [None]:
things.reindex(stuff.index).fillna(0)

### Question 1.5 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `new_query_to_tfidf`, which takes in a string (`query_string`) and a bag of words matrix (`bow`) and returns a Series such that:
- The index contains the same labels as `bow`'s columns (meaning that if `bow` has 10865 columns, the outputted Series should have 10865 elements).
- The values contain the TF-IDF of each word, using `query_string` to compute TFs and **the entire corpus of course descriptions (not including the new query)** to compute IDFs.

Example behavior is given below.

```python
>>> out = new_query_to_tfidf('yooo I am very very very interested in biochemistry and cellular biology courses', bow)
>>> out.shape
(10865,)

# Most of the values in out are 0, since
# "yooo I am very very very interested in biochemistry and cellular biology courses"
# doesn't contain most of the 10865 words in bow.
# Since 'yooo' is not in bow.columns, it doesn't appear in the index of out, either.
# The order of the returned Series does not matter; your Series may be in a different order.
```
```
>>> out[out > 0]
cellular        0.346989
i               0.220664
biochemistry    0.361014
in              0.020182
courses         0.248219
and             0.007285
interested      0.291563
very            0.930908
biology         0.274815
dtype: float64
```

To be clear, the TF-IDF of a word $t$ in a new query string $q$ is:

$$\text{tfidf}(t, q) = \underbrace{\frac{\text{# of occurrences of $t$ in $q$}}{\text{total # of tokens in $q$}}}_{\text{computed using } q \: (\texttt{query_string})} \cdot \underbrace{\log \left(\frac{\text{total # of course descriptions}}{\text{# of course descriptions in which $t$ appears}} \right)}_{\text{computed solely using bow}}$$

Note that this means that the IDFs of each word have nothing to do with the `query_string` that is passed in. This is precisely why we suggested you implement `compute_idfs(bow)` in the previous part – because it would help your implementation of `bow_to_tfidf`, and also help your implementation of `new_query_to_tfidf`.

Some additional guidance:
- This function should only take a few lines to implement, but requires combining several steps, going all the way back to Question 1.2. Think about how the `reindex` method might be useful.
- In the function signature below, you'll see `new_query_to_tfidf(query_string, bow=bow)`. `bow=bow` sets the default value of the `bow` argument to the globally-defined value of `bow`, meaning if we only pass one argument (`query_string`) to `new_query_to_tfidf`, it will automatically use the global `bow`. It's important for our function to be able to take in bag of words matrices other than our globally-defined `bow`, in case we want to use it on a different corpus of documents. But, most of the time we will call it on the global `bow`, so this is done for convenience.
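Here's a minimal sketch of the formula above on made-up numbers, to show where `reindex` fits in. The IDFs and the query tokens are invented; in your implementation, the IDFs would come from `bow` and the tokens from your own tokenizer.

```python
import pandas as pd

# Made-up IDFs for a three-word vocabulary, as if computed from some corpus.
toy_idfs = pd.Series({'data': 0.5, 'science': 1.0, 'cooking': 2.0})

# An already-tokenized query. 'yooo' is not in the vocabulary.
query_tokens = ['yooo', 'data', 'data', 'cooking']

# Occurrences of each token in the query.
counts = pd.Series(query_tokens).value_counts()

# Align the counts to the vocabulary: 'yooo' is dropped, and 'science' is filled in with 0.
counts = counts.reindex(toy_idfs.index).fillna(0)

# TF (dividing by all 4 tokens in the query, as in the formula above) times IDF.
query_tfidf = counts / len(query_tokens) * toy_idfs
```

Note that `'yooo'` still counts towards the total number of tokens in the query, even though it doesn't appear in the index of the result.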

In [None]:
def new_query_to_tfidf(query_string, bow=bow):
    ...

# Feel free to change the input below to test out your implementation of new_query_to_tfidf.
# Remember that the order of the returned Series does not matter.
out = new_query_to_tfidf('yooo I am very very very interested in biochemistry and cellular biology courses')
out[out > 0]

In [None]:
grader.check("q01_05")

Let's take stock of what we have so far.
- We have the TF-IDFs of every word in every document in our corpus. This means that we have a **vector representation** of each course description.
- We have a function that can take any query string and turn it into a **vector** of TF-IDF scores, as well.

Now, we can use techniques from Lecture 10 – specifically, cosine similarity – to find the course descriptions that are most similar (and, hence, most relevant) to our query string!

### Question 1.6 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `top_n_similar_documents`, which takes in a string (`query_string`), a positive integer `n`, and a bag of words matrix (`bow`) and returns a list containing the course codes of the `n` most similar courses to `query_string`, based on course descriptions.

Use cosine similarity to measure the similarity between two vectors; you can implement cosine similarity however you'd like. Remember that course codes are stored in the index of `bow`. The documents in the returned list should be sorted in **decreasing order of similarity**.

Example behavior is given below.

```python
>>> top_n_similar_documents('yooo I am very interested in biochemistry and cellular biology courses', 3)
['CHEM 351', 'BIOLOGY 172', 'CHEM 218']

>>> top_n_similar_documents("What are some courses related to machine learning I can take at Michigan, other than EECS 445?", 5)
['EECS 445', 'ALA 171', 'EECS 492', 'MECHENG 499', 'ECE 559']
```
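To illustrate the ranking step on its own, here's a minimal sketch that computes the cosine similarity between a query vector and every row of a made-up TF-IDF matrix, and then keeps the top 2. All names and numbers here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up TF-IDF vectors for three documents over a three-word vocabulary.
toy_tfidf = pd.DataFrame(
    [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.5, 0.5, 0.0]],
    index=['DOC 1', 'DOC 2', 'DOC 3'],
    columns=['data', 'science', 'cooking'],
)
query_vec = pd.Series({'data': 1.0, 'science': 0.0, 'cooking': 0.0})

# Cosine similarity between the query and every row at once:
# dot products divided by the product of the vector lengths.
dots = toy_tfidf @ query_vec
norms = np.sqrt((toy_tfidf ** 2).sum(axis=1)) * np.sqrt((query_vec ** 2).sum())
similarities = dots / norms

# Sort in decreasing order of similarity and keep the labels of the top 2.
top_2 = similarities.sort_values(ascending=False).index[:2].tolist()
```

One thing this sketch doesn't handle: if a document's vector (or the query's) is all zeros, its length is 0, and the division produces `NaN`.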

In [None]:
def top_n_similar_documents(query_string, n, bow=bow):
    ...

# Feel free to change the inputs below to test out your implementation of top_n_similar_documents.
top_n_similar_documents('yooo I am very interested in biochemistry and cellular biology courses', 3)

In [None]:
grader.check("q01_06")

Awesome! You've implemented the retrieval step in RAG. That is, given a query, you're able to automatically find the most relevant documents in our "knowledge database" for answering that query. Note that this process isn't flawless, as we see below:

In [None]:
top_n_similar_documents("What are some courses related to machine learning I can take at Michigan, other than EECS 445?", 
                        n=20)

<b><span style="color:red">Stop</span></b> and think about why courses that are truly related to EECS 445, like STATS 415, don't appear in the list of 20 "related" courses above, and think about what impacts this will have on our results in the next step.

Once you've done that, it's time for that final step: passing a `query_string`, along with the contents of the most relevant documents, to a Large Language Model (which we already learned how to access, using `query_llm`).

### Question 1.7 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `ask_gpteecs`, which takes in a string (`query_string`) containing a question about University of Michigan courses, a positive integer `n`, and a bag of words matrix `bow`. `ask_gpteecs` should return a **string** containing the result of:

- querying a Large Language Model using the function `query_llm` from Question 1.1,
- where the query contains **both** the contents of `query_string` and
- the **top `n`** most similar course descriptions (`n=20` by default),
- stitched together in a way that you deem appropriate.

Here's what we mean by "in a way that you deem appropriate." Suppose our query is `'yooo I am very interested in biochemistry and cellular biology courses'`, and suppose `n=3`.
- The top 3 most similar courses to this query are `'CHEM 351'`, `'BIOLOGY 172'`, and `'CHEM 218'`.
- If we just ask an LLM, `'yooo I am very interested in biochemistry and cellular biology courses'`, it won't know anything about any of these three courses. If we ask it, `'yooo I am very interested in biochemistry and cellular biology courses, tell me about them: CHEM 351, BIOLOGY 172, CHEM 218'`, it still won't know anything about those courses.
- Instead, once we identify which 3 courses are most relevant, we need to create a new `query_string` that looks something like:

```python
'''
Hi! I'm looking to answer this query that a student sent me, regarding courses at the University of Michigan:

yooo I am very interested in biochemistry and cellular biology courses

Here are some potentially relevant courses from my knowledge base; they might not all be relevant, so double-check.

CHEM 351 Fundamentals of Biochemistry This course is designed to serve as an introduction to biochemistry for students intending to pursue the BS major in biochemistry and for others who are interested in gaining an overview of the fundamental chemistry underlying cellular functions. The material includes an introduction to the structures of biological macromolecules and an overview of the fundamental cellular processes associated with metabolism, biosynthesis, and replication. It is taught from a chemical perspective with and emphasis on understanding biochemical phenomena through chemical structure and mechanism.

BIOLOGY 172 Introductory Biology BIOLOGY 172 is a one-term course in molecular, cellular, and developmental biology that, together with BIOLOGY 171 and 173, collectively forms the introductory biology course sequence.

CHEM 218 Independent Study in Biochemistry This course provides an introduction to independent biochemistry research under the direction of a faculty member whose project is in the biochemistry area.  The Chemistry Department encourages students to get involved with undergraduate research as early as possible.
'''
```

- Remember, course descriptions are stored in their original form in `courses_df`.
- You can structure your final query string however you'd like, and you're encouraged to experiment with different phrasings to see if they influence your results; you can start by copying the example format above, but then try and make it your own. This is called **prompt engineering**.
- You'll need to figure out a way of programmatically adding the course information to your prompt string. By default, we've set `n=20`, in case the most similar descriptions aren't actually relevant and the most relevant descriptions have lower similarities. But when calling `ask_gpteecs`, we could set `n` to something else.
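To make the "stitching" idea concrete, here is a minimal sketch of building a prompt string from a question and a list of descriptions. The helper name `build_prompt` and the placeholder strings in `example_courses` are assumptions made for illustration; in your own solution, the descriptions would come from `courses_df`, and the wording is yours to choose.

```python
def build_prompt(query_string, course_texts):
    """Combine a student's question with retrieved course descriptions.

    `build_prompt` is a hypothetical helper, not part of the homework's API.
    """
    header = (
        "Hi! I'm looking to answer this query that a student sent me, "
        "regarding courses at the University of Michigan:"
    )
    context_note = (
        "Here are some potentially relevant courses from my knowledge base; "
        "they might not all be relevant, so double-check."
    )
    # Separate every piece with a blank line so the pieces are easy to tell apart.
    return "\n\n".join([header, query_string, context_note, *course_texts])


# Placeholder strings standing in for full descriptions looked up in courses_df.
example_courses = [
    "CHEM 351 Fundamentals of Biochemistry (full description goes here)",
    "BIOLOGY 172 Introductory Biology (full description goes here)",
]
prompt = build_prompt("yooo I am very interested in biochemistry", example_courses)
print(prompt)
```

Joining with blank lines is only one reasonable layout; numbering the courses or labeling each section are equally valid choices to experiment with.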

In [None]:
def ask_gpteecs(query_string, n=20, bow=bow):
    ...

# Feel free to change the inputs below to test out your implementation of ask_gpteecs.
# The Markdown function behaves like the print function,
# but renders text formatting (e.g. bolding, bullet points) when the output from the LLM contains these elements.
Markdown(ask_gpteecs('yooo I am very interested in biochemistry and cellular biology courses'))

In [None]:
grader.check("q01_07")

**Great work!** You've now implemented Retrieval-Augmented Generation, and have your very own ChatGPT-like interface that knows about University of Michigan courses.

Let's wrap up by asking GPTEECS the question we posed at the start of this exploration:

In [None]:
Markdown(ask_gpteecs("What are some courses related to machine learning I can take at Michigan, other than EECS 445?"))

Feel free to keep toying with `query_llm` (see the [Groq documentation here](https://console.groq.com/docs/quickstart) to see what you can customize) and `ask_gpteecs` to try and improve the performance of your implementation.

## Question 2: Relative Squared Loss 🧑‍🧑‍🧒‍🧒

---

In [Lecture 11](https://practicaldsc.org/resources/lectures/lec11/lec11-filled.pdf), we introduced the "modeling recipe" for making predictions:

1. Choose a model.
1. Choose a loss function.
1. Minimize average loss to find optimal model parameters.

The first instance of this recipe saw us choose:
1. The constant model, $H(x_i) = h$.
1. The squared loss function: $L_\text{sq}(y_i, h) = (y_i - h)^2$.
1. The average squared loss across our entire dataset, then, was:
$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2$$
which, using calculus, we showed is minimized when: $$h^* = \text{Mean}(y_1, y_2, ..., y_n)$$
This means that using the squared loss function, the **best** constant prediction is $h^* = \text{Mean}(y_1, y_2, ..., y_n)$.
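If you'd like to sanity-check this result numerically, the sketch below evaluates $R_\text{sq}(h)$ on a grid of candidate values of $h$ and picks the one with the smallest average loss. The dataset `ys` and the helper name `avg_squared_loss` are made up for illustration.

```python
import numpy as np

ys = np.array([2.0, 3.0, 7.0, 8.0])  # hypothetical toy dataset


def avg_squared_loss(h, ys):
    """Average squared loss of the constant prediction h on the data ys."""
    return np.mean((ys - h) ** 2)


# Try candidate predictions 0.00, 0.01, ..., 10.00 and keep the best one.
candidates = np.linspace(0, 10, 1001)
losses = np.array([avg_squared_loss(h, ys) for h in candidates])
best_h = candidates[np.argmin(losses)]

print(best_h, ys.mean())
```

A grid search like this only finds the best value among the candidates tried, so treat it as a check on the calculus, not a replacement for it.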

In this question, you will find the best constant prediction when using a different loss function. In particular, here, we'll explore the **relative squared loss** function, $L_{\text{rsq}}(y_i, h)$:

$$L_{\text{rsq}}(y_i, h) = \frac{(y_i - h)^2}{y_i}$$

Throughout this question, assume that each of $y_1, y_2, ..., y_n$ is positive.

<div class="alert alert-success">
    
Before attempting this question, you may want to watch [**this video 🎥**](https://youtu.be/NSIEP74ifyg), which walks through most of the proof of the above, in addition to reviewing Lecture 11's [**filled slides**](https://practicaldsc.org/resources/lectures/lec11/lec11-filled.pdf).
    
</div>

<!-- BEGIN QUESTION -->

### Question 2.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Determine $\frac{d}{d h} L_{\text{rsq}}(y_i, h)$, the derivative of the relative squared loss function with respect to $h$.

(Technically, this is a **partial** derivative, since there are other variables in the definition of $L_\text{rsq}(y_i, h)$, but in our setting, $h$ is the only _unknown_.)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.2 [Written ✏️]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

What value of $h$ minimizes average loss when using the relative squared loss function – that is, what is $h^*$? Your answer should only be in terms of the variables $n, y_1, y_2, ..., y_n$, and any constants.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Let $C(y_1, y_2, ..., y_n)$ be your minimizer $h^*$ from Question 2.2. That is, for a particular dataset $y_1, y_2, ..., y_n$, $C(y_1, y_2, ..., y_n)$ is the value of $h$ that minimizes average relative squared loss (also called empirical risk) on that dataset.

What is the value of $\displaystyle\lim_{y_4 \rightarrow \infty} C(1, 3, 5, y_4)$ in terms of $C(1, 3, 5)$? Your answer should involve the function $C$ and/or one or more constants.

Some guidance: To notice the pattern, evaluate $C(1, 3, 5, 100)$, $C(1, 3, 5, 10000)$, and $C(1, 3, 5, 1000000)$.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

What is the value of $\displaystyle\lim_{y_4 \rightarrow 0} C(1, 3, 5, y_4)$? Again, your answer should involve the function $C$ and/or one or more constants.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.5 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Based on the results of Questions 2.3 and 2.4, when is the prediction $C(y_1, y_2, ..., y_n)$ robust to outliers? When is it not robust to outliers?

<!-- END QUESTION -->

## Question 3: Bye, Calculus 👋

---

As we discussed in the previous question, in Lecture 11, we found that $h^* = \text{Mean}(y_1, y_2, ..., y_n)$ is the constant prediction that minimizes mean squared error:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i-h)^2$$

To arrive at this result, we used calculus: we took the derivative of $R_\text{sq}(h)$ with respect to $h$, set it equal to 0, and solved for the resulting value of $h$, which we called $h^*$.

In this question, we will minimize $R_\text{sq}(h)$ in a way that **doesn't** use calculus. The general idea is this: if $f(x) = (x - c)^2 + k$, then we know that $f$ is a quadratic function that opens upwards with a vertex at $(c, k)$, meaning that $x = c$ minimizes $f$. As we saw in class (see [Lecture 11, Slide 29](https://practicaldsc.org/resources/lectures/lec11/lec11-filled.pdf#page=29)), $R_\text{sq}(h)$ is a quadratic function of $h$!

Throughout this problem, let $y_1, y_2, ..., y_n$ be an arbitrary dataset, and let $\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i$ be the mean of the $y$'s.
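To see the "quadratic function of $h$" claim in action, the sketch below samples $R_\text{sq}(h)$ at several values of $h$, fits a degree-2 polynomial to those samples, and reads off the vertex. The dataset `ys` and the helper name `avg_squared_loss` are made up for illustration; this is a numerical illustration, not a proof.

```python
import numpy as np

ys = np.array([2.0, 3.0, 7.0, 8.0])  # hypothetical toy dataset


def avg_squared_loss(h, ys):
    """Average squared loss of the constant prediction h on the data ys."""
    return np.mean((ys - h) ** 2)


hs = np.linspace(-10, 20, 61)
losses = np.array([avg_squared_loss(h, ys) for h in hs])

# Fit a*h^2 + b*h + c to the sampled losses.
a, b, c = np.polyfit(hs, losses, deg=2)

# A parabola a*h^2 + b*h + c that opens upwards has its vertex at h = -b / (2a).
vertex_h = -b / (2 * a)
print(a, vertex_h, ys.mean())
```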


<!-- BEGIN QUESTION -->

### Question 3.1 [Written ✏️]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

What is the value of $\sum_{i = 1}^n (y_i - \bar{y})$? Show your work.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.2 [Written ✏️]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Show that:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n \left( (y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - h) + (\bar{y} - h)^2 \right)$$

Some guidance:
- To proceed, start by rewriting $y_i - h$ in the definition of $R_\text{sq}(h)$ as $(y_i - \bar{y}) + (\bar{y} - h)$. Why is this a valid step?
- Make sure not to expand unnecessarily. Your work should only take ~3 lines.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.3 [Written ✏️]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Show that:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2 + (\bar{y} - h)^2$$

This is called the **bias-variance decomposition** of $R_\text{sq}(h)$, which is an idea we'll revisit in the coming weeks.

Some guidance: At some point, you will need to use your result from Question 3.1.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

Why does the result in Question 3.3 prove that $h^* = \text{Mean}(y_1, y_2, ..., y_n)$ minimizes $R_\text{sq}(h)$?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.5 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

In Question 3.3, you showed that:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2 + (\bar{y} - h)^2$$

Take a close look at the equation above, then fill in the blank below with **a single word**:

> The value of $R_\text{sq}(h^*)$, when $h^* = \text{Mean}(y_1, y_2, ..., y_n)$, is equal to the ____ of the data.

<!-- END QUESTION -->

## Question 4: More and More Losses 🅻

---

As we mentioned in Questions 2 and 3, $h^* = \text{Mean}(y_1, y_2, ..., y_n)$ is the constant prediction that minimizes mean squared error, i.e. average squared loss:

$$R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i-h)^2$$

Relatedly, $h^* = \text{Median}(y_1, y_2, ..., y_n)$ is the constant prediction that minimizes mean absolute error, i.e. average absolute loss:

$$R_{\text{abs}}(h) = \frac{1}{n} \sum_{i=1}^n \left|y_i - h\right|$$

You may notice that the formulas for $R_\text{sq}(h)$ and $R_\text{abs}(h)$ look awfully similar – they're nearly identical besides the exponent. More generally, for any positive integer $p$, define the $L_p$ loss as follows:

$$L_p(y_i, h) = |y_i - h|^p$$

With this definition, $L_2$ loss is the same as squared loss and $L_1$ loss is the same as absolute loss. The corresponding average loss, for any value of $p$, is then:

$$ R_{p}(h) = \frac{1}{n} \sum_{i=1}^n \left|y_i - h\right| ^ p $$

Written in terms of $R_p(h)$, we know – from the top of this question – that:

- The minimizer of $R_1(h)$ is $\text{Median}(y_1, y_2, ..., y_n)$:

$$\text{Median}(y_1, y_2, ..., y_n) = \underset{h}{\mathrm{argmin}} \: R_1(h)$$

- The minimizer of $R_2(h)$ is $\text{Mean}(y_1, y_2, ..., y_n)$:

$$\text{Mean}(y_1, y_2, ..., y_n) = \underset{h}{\mathrm{argmin}} \: R_2(h)$$
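As a quick illustration of these two facts, here is a sketch that computes $R_p(h)$ directly and compares the median and the mean under $p = 1$ and $p = 2$. The helper name `avg_lp_loss` and the dataset `ys` are assumptions made for this example.

```python
import numpy as np


def avg_lp_loss(h, ys, p):
    """Average L_p loss R_p(h) of the constant prediction h on the data ys."""
    return np.mean(np.abs(ys - h) ** p)


ys = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical data with one outlier
median, mean = np.median(ys), np.mean(ys)

# Under p = 1, the median should have the lower average loss.
print(avg_lp_loss(median, ys, p=1), avg_lp_loss(mean, ys, p=1))

# Under p = 2, the mean should have the lower average loss.
print(avg_lp_loss(median, ys, p=2), avg_lp_loss(mean, ys, p=2))
```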

But what constant prediction $h^*$ minimizes $R_3(h)$, or $R_{10}(h)$, or $R_{10000}(h)$? In this question, we'll explore this idea – more specifically, we'll study how $h^*$ changes as $p$ (the exponent on $|y_i - h|$) increases.

[Lecture 11](https://practicaldsc.org/resources/lectures/lec11/lec11-filled.pdf#page=37) (and the [video linked above](https://youtu.be/NSIEP74ifyg)) worked through how to solve for the constant prediction $h^*$ that minimizes average squared loss (i.e. minimizes $R_2(h)$), and we linked to [another video](https://youtu.be/0s7M8OsnBNA?si=lHm6eN3rns7PzPOW) that works through a similar derivation for average absolute loss (i.e. $R_1(h)$). Unfortunately, $p = 1$ and $p = 2$ are the only cases in which we can solve for the minimizer of $R_p(h)$ by hand.

For all other values of $p$, there is no closed-form solution (i.e. no "formula" for the best constant prediction), and so we need to approximate the solution using the computer. Later in the class, we'll learn how to minimize functions using code we write ourselves (the idea is called gradient descent if you're curious), but for now, we're going to use `scipy.optimize.minimize`, which does the hard work for us.

The `minimize` function is a versatile tool from the `scipy` library that can help us find the input that minimizes the output of a function. Let's test it out.

In [None]:
from scipy.optimize import minimize

Below, we've defined and plotted a quadratic function. We can see 👀 that it's minimized when $x = -4$.

In [None]:
def f(x):
    return (x + 4) ** 2 - 1

In [None]:
xs = np.linspace(-20, 20)
ys = f(xs)
px.line(x=xs, y=ys)

In [None]:
# To call minimize, we have to provide an array of initial "guesses"
# as to where the minimizing input might be.
# For our purposes, using 0 as an initial guess will work fine.
minimize(f, x0=[0])

Above, the `x` attribute of the output tells us that the minimizing input to `f` is `-4.0000`, which is what we were able to see ourselves! Cool.
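If you want to use the minimizing input in later code instead of just reading it off the printed output, you can pull it out of the result object. The sketch below uses a different made-up function, `g`, to show the `x`, `fun`, and `success` attributes of the object that `minimize` returns.

```python
from scipy.optimize import minimize


def g(x):
    # minimize passes x as an array, so we grab its only element.
    return (x[0] - 3) ** 2 + 2


result = minimize(g, x0=[0])

best_x = result.x[0]  # result.x is an array with one entry per input variable
best_value = result.fun  # the value of g at the minimizing input
print(best_x, best_value, result.success)
```

Since `result.x` is an array even when there's only one input variable, indexing with `[0]` gives you a single number.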

In this question, we'll deal with the following example array of values, `vals`:

In [None]:
vals = np.array([ 10,  10,  10,  10,  15,  15,  15,  15,  15,  20,  25,  50,  50, 150])
vals

For context, let's see what the distribution of `vals` looks like:

In [None]:
pd.Series(vals).hist(nbins=40)

To reiterate, the constant prediction $h^*$ that minimizes $R_1(h)$ for `vals` is:

In [None]:
np.median(vals)

And the constant prediction $h^*$ that minimizes $R_2(h)$ for `vals` is:

In [None]:
np.mean(vals)

### Question 4.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">0 Points</div>

Complete the implementation of the function `h_star`, which takes in a positive integer `deg` (the exponent $p$) and an array `vals` and returns the value of the constant prediction $h^*$ that minimizes average $L_p$ loss for `vals`, i.e. the value of $h^*$ that minimizes $R_p(h)$ for `vals`. Example behavior is given below.

```python
>>> h_star(1, vals)
2.999999995854613

>>> h_star(2, vals)
6.499999104239989
```

Some guidance:
- Your solution should use `minimize`, and will likely involve defining a helper function inside.
- It's okay if your example values are slightly different than those above, but they should be roughly the same. (So, it's fine if `h_star(1, vals)` gives you `1.9999999920558864` or something similar.)

<div class="alert alert-warning">

**We're not autograding Question 4.1, and it's not worth any points.** But, you need to do it in order to answer Question 4.2, which is worth points (and which you will answer on paper!).
    
</div>

In [None]:
def h_star(deg, vals):
    ...

# Feel free to change this input to make sure your function works correctly.
h_star(2, vals)

Before proceeding, make sure that the following cells both say `True`. If they don't, something in your implementation is incorrect:

In [None]:
np.isclose(h_star(1, vals), np.median(vals))

In [None]:
np.isclose(h_star(2, vals), np.mean(vals))

Once you have a working implementation of `h_star`, run the cell below.

In [None]:
ps = np.arange(1, 61)
hs = [h_star(p, vals) for p in ps]
px.line(x=ps, y=hs).update_layout(xaxis_title=r'$p$', yaxis_title=r'$h^* = \text{minimizer of } R_p(h)$')

It seems like as $p$ increases, the value of $h^*$ that minimizes $R_p(h)$ approaches some fixed value. But what is that value? For context, look at `vals` again:

In [None]:
vals

<div class="alert alert-danger">
    
If your graph has a sharp sudden drop, that is **not** what we're referring to here. That sharp drop is instead a consequence of choosing very large values of $p$, which correspond to very large exponents, leading to numerical overflow, i.e. numbers too big to be represented as floating-point values. $|100 - 90|^{2000}$ is too big for a floating-point number!

If you run into this issue, change `61` to something smaller in the line `ps = np.arange(1, 61)`. As $p$ increases, you should see the graph of $h^*$ look more and more like a flat line.    
</div>
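To see what overflow looks like in practice, here is a small sketch. It relies on the fact that Python integers can be arbitrarily large, while floating-point numbers (including NumPy's `float64`) cannot; the variable names are just for illustration.

```python
import numpy as np

# Python integers have arbitrary precision, so this works and has 2001 digits.
big_int = 10 ** 2000
print(len(str(big_int)))

# Plain Python floats raise an OverflowError when the result is too large.
try:
    10.0 ** 2000
    raised = False
except OverflowError:
    raised = True

# NumPy floats overflow to inf instead (we silence the warning here).
with np.errstate(over="ignore"):
    big_float = np.float64(10.0) ** 2000

print(raised, big_float)
```

Once a loss value becomes `inf`, an optimizer can no longer compare nearby inputs meaningfully, which is one plausible reason for the sudden drop described above.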

<!-- BEGIN QUESTION -->

### Question 4.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Use the plot above to answer the following prompts:

1. In the `vals` dataset, as $p$ increases, what does the value of $h^*$ that minimizes $R_p(h)$ approach?
1. In any general dataset of values $y_1, y_2, ..., y_n$, as $p$ increases, what does the value of $h^*$ that minimizes $R_p(h)$ approach? Why?

Put another way, we're asking you to evaluate the following limit, but using your plot, not calculus (you're welcome 😊):

$$\lim_{p \rightarrow \infty} \left( \underset{h}{\mathrm{argmin}} \frac{1}{n} \sum_{i = 1}^n |y_i - h|^p \right)$$

To answer the second prompt, try calling `h_star` with different arrays that you create. Try and see if you can find a pattern in the values that `h_star` returns when `p` is very large.

<!-- END QUESTION -->

## Question 5: Zoe's Bakery 🧁

---

Zoe owns a bakery and wants to figure out how to sell the most baked goods.

<!-- BEGIN QUESTION -->

### Question 5.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
For each of her five baked goods, Zoe recorded the cost in dollars for making the baked good, $x$, and the number of orders for that baked good on a particular day, $y$.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>baked good</th>
      <th>cost ($x$)</th>
      <th>number of baked goods sold ($y$)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cookies</td>
      <td>4</td>
      <td>70</td>
    </tr>
    <tr>
      <td>Brownies</td>
      <td>11</td>
      <td>80</td>
    </tr>
    <tr>
      <td>Croissants</td>
      <td>8</td>
      <td>40</td>
    </tr>
    <tr>
      <td>Cupcakes</td>
      <td>7</td>
      <td>57</td>
    </tr>
    <tr>
      <td>Muffins</td>
      <td>5</td>
      <td>43</td>
    </tr>
  </tbody>
</table>

For example, Zoe spent \$11 making brownies, and sold 80 brownies.

Find the optimal parameters $c_0^*$ and $c_1^*$ that minimize mean squared error for the hypothesis function $H(x_i) = c_0 + c_1 x_i$, which predicts the number of baked goods sold of a particular item as a function of the cost of baking that item. Give exact, fractional values for $c_0^*$ and $c_1^*$; do not round.

You may use a calculator, but you must show all of your work when you submit.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 5.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Let's interpret the meaning of the hypothesis function $H(x_i) = c_0^*+c_1^*x_i$ that you found in Question 5.1.

- What does $50 \cdot c_1^*$ represent in terms of Zoe's bakery?
- What does the reciprocal of the slope, $\frac{1}{c_1^*}$, represent in terms of Zoe's bakery?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 5.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>
What is the mean squared error, $\text{MSE}_x$, for this dataset, using the line you found in Question 5.1? Round your final answer to two decimal places. Again, you may use a calculator, but you must show all of your work.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 5.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
Zoe knows that baking each baked good takes a significant amount of time. She decides to quantify the value of baking time in terms of the number of baked goods sold. For each baked good she baked, Zoe recorded the number of hours to bake one unit of the good, $z$, and the number of items sold on a particular day, $y$.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>baked good</th>
      <th>baking time ($z$)</th>
      <th>number of baked goods sold ($y$)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cookies</td>
      <td>30</td>
      <td>70</td>
    </tr>
    <tr>
      <td>Brownies</td>
      <td>44</td>
      <td>80</td>
    </tr>
    <tr>
      <td>Croissants</td>
      <td>38</td>
      <td>40</td>
    </tr>
    <tr>
      <td>Cupcakes</td>
      <td>36</td>
      <td>57</td>
    </tr>
    <tr>
      <td>Muffins</td>
      <td>32</td>
      <td>43</td>
    </tr>
  </tbody>
</table>

Find the optimal parameters $d_0^*$ and $d_1^*$ that minimize mean squared error for the hypothesis function $H(z_i) = d_0 + d_1 z_i$, which predicts the number of baked goods sold of a particular item as a function of the baking time of that item. Give exact, fractional values for $d_0^*$ and $d_1^*$; do not round. You may use a calculator, but you must show all of your work.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 5.5 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;"> 2 Points</div>
What is the mean squared error, $\text{MSE}_z$, for this dataset, using the line you found in Question 5.4? Round your final answer to two decimal places. Again, you may use a calculator, but you must show all of your work.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 5.6 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>
You should have found that $\text{MSE}_x = \text{MSE}_z$, which says that for this data, the mean squared error is the same if we use the variable $x$ or the variable $z$ to make our hypothesis function $H$. This happens because the number of hours required to bake one unit of a baked good ($z$) is linearly related to the cost of baking that baked good ($x$) by the formula: $$\text{baking time} = 22 + 2 \cdot \text{cost}$$

In the rest of this question, we'll verify some general properties concerning the scenario where we predict some variable $y$ based on $x$, as compared to predicting $y$ based on $z$, when $z$ is a linear transformation of $x$. We'll no longer use the bakery data given above, but we'll prove properties in general.

First, suppose we have a dataset $\{x_1, x_2, \dots, x_n\}$ and we define a dataset $\{z_1, z_2, \dots, z_n\}$ by the linear transformation:
$$z_i = ax_i + b$$

Suppose also we have a dataset $\{y_1, y_2, \dots, y_n\}$.

Let $c_0^*$ and $c_1^*$ be the optimal intercept and slope of the regression line (that is, the optimal linear hypothesis function) for $y$ with $x$ as the predictor variable,
$$H(x_i) = c_0^* + c_1^* x_i$$
Similarly, let $d_0^*$ and $d_1^*$ be the optimal intercept and slope of the regression line for $y$ with $z$ as the predictor variable,
$$H(z_i) = d_0^* + d_1^*z_i$$

Express $d_0^*$ and $d_1^*$ in terms of $c_0^*$, $c_1^*$, $a$, and $b$, and/or one or more constants.

_Hint: You can use the fact that if $z_i = ax_i + b$, then $\bar{z} = a \bar{x} + b$ without proof._
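If you'd like to convince yourself of the fact in the hint before using it, here is a quick numerical check using the bakery costs and the relationship $\text{baking time} = 22 + 2 \cdot \text{cost}$ from earlier in this question. The variable names are just for illustration.

```python
import numpy as np

xs = np.array([4.0, 11.0, 8.0, 7.0, 5.0])  # costs from the bakery table
a, b = 2.0, 22.0

# Apply the linear transformation to every data point.
zs = a * xs + b

# The mean of the transformed data equals the transformed mean.
print(zs.mean(), a * xs.mean() + b)
```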

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 5.7 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
Let $\text{MSE}_x$ be the mean squared error for the dataset $\{y_1, y_2, \dots, y_n\}$ using the hypothesis function:
$$H(x_i) = c_0^* + c_1^* x_i$$
Similarly, let $\text{MSE}_z$ be the mean squared error for the dataset $\{y_1, y_2, \dots, y_n\}$ using the hypothesis function:
$$H(z_i) = d_0^* + d_1^* z_i$$
Show that $\text{MSE}_x = \text{MSE}_z$.

<!-- END QUESTION -->

## Question 6: Algebra, Too 📐

---

In the coming lectures, we'll start formulating the problem of making predictions about future data given past data in terms of matrices and vectors. Why? The answer is simple: doing so will allow us to build models that use multiple input variables (i.e. features) in order to make predictions.

This question serves to review the key linear algebra knowledge you'll need to be familiar with as we start using matrices and vectors in lecture. If any of this feels foreign – and it's totally fine if it does! – review the following pages of our [Linear Algebra Guide](https://practicaldsc.org/guides/linear-algebra/):
1. [Vectors and angles](https://practicaldsc.org/guides/linear-algebra/vectors-angles/).
1. [Linear combinations](https://practicaldsc.org/guides/linear-algebra/linear-combinations/).
1. [Matrices](https://practicaldsc.org/guides/linear-algebra/matrices/).
1. [Projections](https://practicaldsc.org/guides/linear-algebra/projections/).


We'll link to specific sections of our linear algebra guide for each part of this question.

<div class="alert alert-success">
    
Don't worry – the content in Question 6 is **not** in scope for the Midterm Exam.
    
</div>

Throughout this question, consider the following vectors in $\mathbb{R}^3$, where $\beta \in \mathbb{R}$ is a scalar:

$$
\vec{v}_1 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \quad 
\vec{v}_2 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}, \quad 
\vec{v}_3 = \begin{bmatrix} \beta \\ 1 \\ 2 \end{bmatrix}
$$
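If you'd like to check your by-hand work with code, vectors like these can be represented as NumPy arrays. The sketch below uses only $\vec{v}_1$ and $\vec{v}_2$ (which don't involve $\beta$) to show how to compute a dot product and the angle between two vectors; the variable names are just for illustration.

```python
import numpy as np

v1 = np.array([0.0, 1.0, 1.0])
v2 = np.array([1.0, 0.0, 1.0])

# Two vectors are orthogonal exactly when their dot product is 0.
dot = v1 @ v2

# cos(theta) = (v1 . v2) / (||v1|| * ||v2||)
cos_theta = dot / (np.linalg.norm(v1) * np.linalg.norm(v2))
angle_degrees = np.degrees(np.arccos(cos_theta))

print(dot, angle_degrees)
```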

<!-- BEGIN QUESTION -->

### Question 6.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

For what value(s) of $\beta$ are $\vec{v}_1$ and $\vec{v}_3$ orthogonal?

<small><small>📕 To review, read the guide on [Vectors and angles](https://practicaldsc.org/guides/linear-algebra/vectors-angles/).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

For what value(s) of $\beta$ are $\vec{v}_2$ and $\vec{v}_3$ orthogonal?

<small><small>📕 To review, read the guide on [Vectors and angles](https://practicaldsc.org/guides/linear-algebra/vectors-angles/).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

For what value(s) of $\beta$ are $\vec{v}_1, \vec{v}_2,$ and $\vec{v}_3$ linearly **in**dependent?

<small><small>📕 To review, read the guide on [Linear combinations](https://practicaldsc.org/guides/linear-algebra/linear-combinations/).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Regardless of your answers to the previous three parts, in this part, let $\beta = 3$.

Is the vector $\begin{bmatrix}
3 \\
5 \\
8
\end{bmatrix}$ in $\text{span}(\vec{v}_1, \vec{v}_2, \vec{v}_3)$? Why or why not?

<small><small>📕 To review, read the guide on [Linear combinations](https://practicaldsc.org/guides/linear-algebra/linear-combinations/).</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.5 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

What is the projection of the vector $\begin{bmatrix}
3 \\
15 \\
21
\end{bmatrix}$ onto $\vec{v}_1$?  Give your answer in the form of a vector.

<small><small>📕 To review, read the guide on [Projections](https://practicaldsc.org/guides/linear-algebra/projections/). The first video is all you need for this part.</small></small>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6.6 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

What is the orthogonal projection of the vector $\begin{bmatrix}
3 \\
15 \\
21
\end{bmatrix}$ 
onto $\text{span}(\vec{v}_1, \vec{v}_2)$?

The answer is a vector, $\vec{z}$, which can be written in the form:

$$\vec{z} = \lambda_1 \vec v_1 + \lambda_2 \vec v_2$$

**Your job** is to find the values of scalars $\lambda_1$ and $\lambda_2$, and then, the vector $\vec z$. As done in the [Projections](https://practicaldsc.org/guides/linear-algebra/projections) guide, one of the intermediate steps in answering this question involves defining a particular matrix $X$ and computing $(X^T X) ^{-1}X^T$.

<small><small>📕 To review, read the guide on [Projections](https://practicaldsc.org/guides/linear-algebra/projections/). </small></small>

<!-- END QUESTION -->

## Finish Line 🏁

Congratulations! You're ready to submit Homework 6.

You need to submit Homework 6 twice:

### To submit the manually graded problems (Questions 2-6; marked [Written ✏️])

- Make sure your answers **are not** in this notebook, but rather in a separate PDF.
    - You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in.
- Submit this separate PDF to the **Homework 6 (Questions 2-6; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

### To submit the autograded problems (Question 1; marked [Autograded 💻])

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under **Homework 6 (Question 1; autograded problems)**.
4. Stick around while the Gradescope autograder grades your work. **Remember that Homework 6 has no hidden tests! This means the tests you see in your notebook are the exact same as the ones that will be used to grade your work on Gradescope. When you submit on Gradescope, you'll see your score shortly after you submit, once the autograder finishes running.** 
5. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

Your Homework 6 submission time will be the **later** of your two individual submissions.