In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw10.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 10

# Gradient Descent and Classification

### EECS 398: Practical Data Science, Winter 2025

#### Due Friday, April 11th at 11:59PM (note the later than usual deadline)
    
</div>

## Instructions

Welcome to Homework 10, and thanks for your patience! In this homework, you'll get your hands dirty and implement gradient descent from scratch, and gain familiarity with various classification algorithms and their behaviors. This homework touches on ideas from Lectures 20 and 21.

You are given **eight** slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/wn25/). The [Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps.

<div class="alert alert-warning">
    
This homework is **fully autograded, and has no hidden tests**. All you need to do is write your code in this notebook, run the local `grader.check` tests, and submit to the **Homework 10** assignment on Gradescope to make sure your final score matches the test cases in this notebook.
</div>

This homework is worth a total of **39 points**, all of which come from the autograder. The number of points each question is worth is listed at the start of each question. **All questions in the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the cell below, plus the cell at the top of the notebook that imports and initializes `otter`. 

In [None]:
import hw10_util as util

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

from IPython.display import Markdown, display

## Question 1: Spam or Ham 🍔

---

Whenever you receive an email at your @umich.edu account, Gmail's spam filter predicts whether the email is **spam** or **ham** (not spam). In other words, it's performing binary classification! **In this question, we'll build our own spam classifiers using techniques from class.**

Run the cell below to load in a dataset of emails, all labeled as either `'spam'` or `'ham'`.

In [None]:
emails = pd.read_csv('data/spam_ham_dataset.csv')
emails.head()

To stay consistent with what we've seen in class, **we'll convert `'spam'` to 1 and `'ham'` to 0**.

In [None]:
emails['label'] = (emails['label'] == 'spam').astype(int)

Before we do any analysis, we'll perform a train-test split, as usual.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(emails[['text']], emails['label'], random_state=98)

In the training set, about 30% of emails are spam (1) and about 70% are not spam (0).

In [None]:
y_train.value_counts(normalize=True)

Right now, **none of our features are numeric**, so we'll need to engineer numeric features out of the `'text'` column. Fortunately, we learned how to create text features out of documents in [Lecture 10](https://practicaldsc.org/resources/lectures/lec10/lec10-filled.html), when we learned about the bag of words model and TF-IDF!

Each value in `'text'` clearly starts with a subject (from `'Subject:'` to `'\r\n'`), which is separate from the body of the email (everything after the first `'\r\n'`). For simplicity's sake, we'll treat all words in `'text'` as being equivalent and **not** distinguish between email subjects and bodies.

### Question 1.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

To start, you'll build a `sklearn` Pipeline that does the following to predict the `'label'` of an email:
- Uses `sklearn`'s `TfidfVectorizer` to turn all emails in the `'text'` column into TF-IDF features. `TfidfVectorizer` **automatically** creates a "TF-IDF matrix" as seen in Lecture 10, with one row per document (email) and one column per unique term. **It handles tokenization (using regular expressions) and punctuation removal for us; fortunately, we don't need to think about those details.**
- Fits a `LogisticRegression` model with the default settings. While we will only formally introduce logistic regression (which is a classification technique!) in Lecture 22, it works similarly to `KNeighborsClassifier` and `DecisionTreeClassifier` in that it can be used for binary classification. The added benefit of logistic regression is that it allows us to predict the **probability** of belonging to a class, i.e. it will allow us to predict the probability that an email is spam (class 1) or ham (class 0), given its text.

Complete the implementation of the function `create_pipe_tfidf`, which takes in a DataFrame like `X_train` and a Series like `y_train` and returns a **fit** Pipeline that follows all of the steps above.

Example behavior is given below.

```python
>>> pipe = create_pipe_tfidf(X_train, y_train)

# As we'll see in Lecture 22, fit LogisticRegression estimators
# have a predict_proba method, which returns the predicted probabilities of each class.
# Here, we're seeing that this email has a 59.45% predicted probability of being ham (class 0),
# and a 40.55% predicted probability of being spam (class 1).
>>> pipe.predict_proba(pd.DataFrame([{
    'text': 'hey eecs 398 students attached is where you are sitting for the exam'
}]))
array([[0.59450521, 0.40549479]])
```

Some guidance:
- **Our implementation is just two lines long**, one of which is fairly long.
- We created a Pipeline with three components, one of which is a `FunctionTransformer` instance. The `FunctionTransformer` takes in an input DataFrame with one column and returns a Series containing just that column; this is necessary because `X_train` is a DataFrame, but `TfidfVectorizer` expects a Series/1D array of documents.
- This part has no hidden test cases, since the rest of Question 1 depends on a correct implementation of `create_pipe_tfidf`.
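
To make the `FunctionTransformer` idea concrete, here is a minimal sketch on a made-up DataFrame. The column name `'message'` and the `toy_` names are hypothetical; this only shows the one-column-DataFrame-to-Series step in isolation, not a full Pipeline.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical one-column DataFrame, standing in for something like X_train.
toy_X = pd.DataFrame({"message": ["first document", "second document"]})

# The wrapped function receives the whole DataFrame and returns its only column.
toy_selector = FunctionTransformer(lambda df: df.iloc[:, 0])
toy_series = toy_selector.fit_transform(toy_X)

print(type(toy_series).__name__)
print(toy_series.tolist())
```

A transformer like this can sit at the front of a Pipeline so that later steps receive a 1D collection of documents.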

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def create_pipe_tfidf(X_train, y_train):
    ...

emails = pd.read_csv('data/spam_ham_dataset.csv')
X_train, X_test, y_train, y_test = train_test_split(emails[['text']], emails['label'], random_state=98)

# Feel free to change these inputs to make sure your function works correctly.
pipe = create_pipe_tfidf(X_train, y_train)
pipe.predict_proba(pd.DataFrame([{
    'text': 'hey eecs 398 students attached is where you are sitting for the exam'
}]))

In [None]:
grader.check("q01_01")

We tested your code using the `predict_proba` method of your fit Pipeline. But, as we'll see in Lecture 22, the `predict` method of a fit  `LogisticRegression` model predicts the class with the larger probability.
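
The relationship between `predict` and `predict_proba` can be checked on a tiny made-up dataset. This is only a sketch: the one-feature data and the `toy_` names are assumptions, chosen so the example runs without the email data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset with string labels.
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array(["ham", "ham", "ham", "spam", "spam", "spam"])

toy_model = LogisticRegression().fit(toy_X, toy_y)

# The columns of predict_proba follow the order of toy_model.classes_.
toy_proba = toy_model.predict_proba(toy_X)
toy_most_likely = toy_model.classes_[toy_proba.argmax(axis=1)]

# predict returns the class whose predicted probability is largest.
print(np.array_equal(toy_model.predict(toy_X), toy_most_likely))
```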

In [None]:
pipe = create_pipe_tfidf(X_train, y_train)
pipe.predict(pd.DataFrame([{
    'text': 'hey eecs 398 students attached is where you are sitting for the exam'
}]))

Let's make a nicer UI for this! Run the cell below and play with the resulting widget. (This may not work in VSCode, in which case you can call `predict_on_new_email` directly.)

In [None]:
def predict_on_new_email(email, pipe):
    pred = pipe.predict(pd.DataFrame([{'text': email}]))[0]
    pred_proba = pipe.predict_proba(pd.DataFrame([{'text': email}]))[0, 1]
    if pred == 'spam':
        piece = 'Spam ❌'
    else:
        piece = 'Not Spam ✅'
    display(Markdown(f'### Predicted {piece} \n Spam Probability: {round(pred_proba * 100, 2)}%'))


def email_widget(pipe):    
    from ipywidgets import interact
    interact(lambda email: predict_on_new_email(email, pipe=pipe), email='replace this text!');

email_widget(pipe)

The classifier doesn't seem to work particularly intuitively. Try inputting a non-spammy email, like `'hi mom I love you'`, and you'll see a relatively high predicted probability of spam. **Why do you think this is the case?**

You may be curious to know **where** the emails came from. For that, read [**this Wikipedia article**](https://en.wikipedia.org/wiki/Enron_Corpus), then manually look through a few of the emails in `X_train`. You should notice terms like `'enron'` and `'gas'` are common.

In [None]:
X_train

All of that said, as you see below, the **test** accuracy of our fit classifier is quite high.

In [None]:
pipe = create_pipe_tfidf(X_train, y_train)
pipe.score(X_test, y_test)

### Question 1.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

But, what are the precision, recall, and false positive rate of our fit classifier? Instead of using the built-in implementations of precision and recall in `sklearn`, we'll have you implement functions that calculate these metrics manually. Remember:

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$

$$\text{FPR} = \frac{FP}{FP + TN}$$

where $TP$ is the number of true positives, $FP$ is the number of false positives, $FN$ is the number of false negatives, and $TN$ is the number of true negatives. Here, a "positive" is a prediction of spam, and a "negative" is a prediction of ham.

Complete the implementations of the following three functions:

- `calculate_precision`, which takes in two binary Series/1D arrays, `y_actual` and `y_pred`, and returns the **precision** of the predictions `y_pred` relative to the actual $y$-values `y_actual`.
- `calculate_recall`, which takes in two binary Series/1D arrays, `y_actual` and `y_pred`, and returns the **recall** of the predictions `y_pred` relative to the actual $y$-values `y_actual`.
- `calculate_fpr`, which takes in two binary Series/1D arrays, `y_actual` and `y_pred`, and returns the **false positive rate (FPR)** of the predictions `y_pred` relative to the actual $y$-values `y_actual`.

Example behavior is given below.

```python
>>> y_actual_ex = np.array([0, 1, 1, 0, 1, 1, 1])
>>> y_pred_ex = np.array([  1, 1, 0, 0, 0, 1, 1])

# 3 true positives, 1 false positive: 3 / (3 + 1) = 0.75.
>>> calculate_precision(y_actual_ex, y_pred_ex)
0.75

# 3 true positives, 2 false negatives: 3 / (3 + 2) = 0.6.
>>> calculate_recall(y_actual_ex, y_pred_ex)
0.6

# 1 false positive, 1 true negative: 1 / (1 + 1) = 0.5.
>>> calculate_fpr(y_actual_ex, y_pred_ex)
0.5
```

Remember that you **cannot** use any existing implementations, and also **cannot** use any loops! Each implementation here should be very short.
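
If you'd like to sanity-check your functions afterwards, one option is to compare against `sklearn.metrics`. The sketch below is for checking only, not a permitted implementation; the `check_` names are made up here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_actual_ex = np.array([0, 1, 1, 0, 1, 1, 1])
y_pred_ex = np.array([1, 1, 0, 0, 0, 1, 1])

# For binary 0/1 labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual_ex, y_pred_ex).ravel()

check_precision = precision_score(y_actual_ex, y_pred_ex)  # 3 / (3 + 1)
check_recall = recall_score(y_actual_ex, y_pred_ex)        # 3 / (3 + 2)
check_fpr = fp / (fp + tn)                                  # 1 / (1 + 1)

print(check_precision, check_recall, check_fpr)
```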

In [None]:
def calculate_precision(y_actual, y_pred):
    ...

def calculate_recall(y_actual, y_pred):
    ...

def calculate_fpr(y_actual, y_pred):
    ...

# Feel free to change these inputs to make sure your functions work correctly.
y_actual_ex = np.array([0, 1, 1, 0, 1, 1, 1])
y_pred_ex = np.array([  1, 1, 0, 0, 0, 1, 1])
display(Markdown(f'#### Precision: {round(calculate_precision(y_actual_ex, y_pred_ex), 5)}'))
display(Markdown(f'#### Recall: {round(calculate_recall(y_actual_ex, y_pred_ex), 5)}'))
display(Markdown(f'#### FPR: {round(calculate_fpr(y_actual_ex, y_pred_ex), 5)}'))

In [None]:
grader.check("q01_02")

### Question 1.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Now that we have implementations of `calculate_precision`, `calculate_recall`, and `calculate_fpr`, we can use them to measure our Pipeline's performance.

Below, complete the implementation of the function `pipeline_stats`, which takes in a fit Pipeline (like `pipe`), and a test set like `X_test` and `y_test`, and returns a **Series** with the following index-value pairs:
- `'% spam correctly filtered'`: The percentage **of actually spam emails** that our classifier **correctly** identifies as spam, in the test set.
- `'% lost'`: The percentage **of actually ham (not spam) emails** that our classifier **incorrectly** identifies as spam, in the test set.
- `'% incorrectly flagged'`: The percentage **of predicted spam emails** that **aren't** actually spam, in the test set.

Example behavior is given below.

```python
>>> pipe = create_pipe_tfidf(X_train, y_train)
>>> stats = pipeline_stats(pipe, X_test, y_test)

# Between 4% and 5% of the emails that are predicted to be spam are actually ham.
>>> 4 < stats.loc['% incorrectly flagged'] < 5
```

Some guidance:
- To determine your Pipeline's predictions, use its `predict` method (not `predict_proba`).
- By default, the values in `y_test` – and the values that result from `pipe.predict(...)` – won't be binary, but will be strings. You'll need to convert them to binary Series/1D arrays, so that they can work with the functions you defined in the previous part.
- Most of the work is in deeply understanding what you're being asked to calculate – the code you'll write here is relatively short.

In [None]:
def pipeline_stats(pipe, X_test, y_test):
    ...

emails = pd.read_csv('data/spam_ham_dataset.csv')
X_train, X_test, y_train, y_test = train_test_split(emails[['text']], emails['label'], random_state=98)
pipe = create_pipe_tfidf(X_train, y_train)

# Feel free to change these inputs to make sure your function works correctly.
# In particular, we may call your function on subsets of X_test and y_test!
pipeline_stats(pipe, X_test, y_test)

In [None]:
grader.check("q01_03")

If you did everything correctly, `pipeline_stats(pipe, X_test, y_test)` should tell you that **the percentage of predicted spam emails that aren't actually spam is somewhere around 4.5%**. This seems relatively low, which at first may sound appealing. But, consider that you receive 100s of **real, non-spam** emails a week – every real email that is incorrectly flagged as spam is one you may never see, and `'% lost'` measures how often that happens. Losing even a small fraction of real emails is unacceptable! Let's see if we can dig deeper into how our Pipeline is making predictions.

Linear regression is a **paramet**ric model, with one **paramet**er per feature. Logistic regression is **also** a parametric model, with one parameter per feature. Assuming the parameters $w_0^*, w_1^*, ..., w_d^*$ have already been found to minimize empirical risk, the logistic regression model makes predictions as follows:

$$P(y_i = 1 | \vec{x}_i) = \sigma (w_0^* + w_1^* x_i^{(1)} + w_2^* x_i^{(2)} + ... + w_d^* x_i^{(d)})$$

In our particular context:

$$P(\text{email } i \text{ is spam}) = \sigma \big( w_0^* + w_1^* \cdot \text{tf-idf(``http", email $i$}) + w_2^* \cdot \text{tf-idf(``thanks", email $i$}) + w_3^* \cdot \text{tf-idf(``michigan", email $i$}) \: + \: ...  \big)$$

- The $\sigma$ represents the logistic function, or sigmoid function, $\sigma(t) = \frac{1}{1 + e^{-t}}$. One of the reasons we use it is because it guarantees our predicted probabilities are between 0 and 1.
- By default, `sklearn`'s implementation of `LogisticRegression` predicts class 1 if the predicted probability is above 0.5, and class 0 otherwise.
- The larger the input to $\sigma( \cdot )$ is, the closer the predicted probability is to 1:

In [None]:
def sigma(t):
    return 1 / (1 + np.e ** (-t))

sigma(10)

In [None]:
sigma(-0.5)

In [None]:
xs = np.linspace(-5, 5)
ys = sigma(xs)
px.line(x=xs, y=ys)

In the explanation above, $\text{"http"}$, $\text{"thanks"}$, and $\text{"michigan"}$ were used as examples of what the first three words in our corpus may be. Remember, here, all $d$ features being used in our model correspond to TF-IDF scores. As we learned in [Lecture 10](https://practicaldsc.org/resources/lectures/lec10/lec10-filled.html), 

$$\text{tf-idf(word $j$, email $i$)}$$ 

is large when word $j$ is important to email $i$ – that is, when word $j$ is common in email $i$, but rare across other emails.

**Here's the main idea we'll now explore**: the features whose coefficients (i.e. optimal parameters) are largest in magnitude influence the predictions the most! Since all TF-IDF scores are non-negative:
- If word $j$'s coefficient is very large and **positive**, it means that as $\text{tf-idf(word $j$, email $i$})$ increases, the predicted probability that the email is **spam** increases.
- If word $j$'s coefficient is very large and **negative**, it means as $\text{tf-idf(word $j$, email $i$})$ increases, the predicted probability that the email is **ham (not spam)** increases.
- If word $j$'s coefficient is close to 0, it means that the value of $\text{tf-idf(word $j$, email $i$})$ does not change the predicted probability of being spam very much.

So, **for example**, if we had:

$$P(\text{email } i \text{ is spam}) = \sigma \big( -5.5 + 3.1 \cdot \text{tf-idf("http", email $i$}) - 10 \cdot \text{tf-idf("thanks", email $i$}) + 0.02 \cdot \text{tf-idf("michigan", email $i$}) \: + \: ...  \big)$$

then, if $\text{"thanks"}$ is very important to email $i$, then the predicted probability that email $i$ is spam would be relatively low. This is because the input to $\sigma( \cdot )$ would involve $-10 \cdot \text{some relatively large number}$, and the more negative the input to $\sigma( \cdot )$ is, the smaller the predicted probability of being spam is.
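
To see the arithmetic, here is a small sketch that plugs made-up TF-IDF scores into the example coefficients above. The scores (such as 0.9) and the `spam_probability` helper are hypothetical, and the "..." terms are ignored.

```python
import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

# Coefficients from the example above; the remaining "..." terms are ignored.
w0, w_http, w_thanks, w_michigan = -5.5, 3.1, -10, 0.02

def spam_probability(tfidf_http, tfidf_thanks, tfidf_michigan):
    # Hypothetical helper: the linear combination, passed through the sigmoid.
    z = w0 + w_http * tfidf_http + w_thanks * tfidf_thanks + w_michigan * tfidf_michigan
    return sigma(z)

p_baseline = spam_probability(0.0, 0.0, 0.0)        # only the intercept
p_http_heavy = spam_probability(0.9, 0.0, 0.0)      # "http" is important
p_thanks_heavy = spam_probability(0.0, 0.9, 0.0)    # "thanks" is important
p_michigan_heavy = spam_probability(0.0, 0.0, 0.9)  # "michigan" is important

print(p_baseline, p_http_heavy, p_thanks_heavy, p_michigan_heavy)
```

Relative to the baseline, the "http"-heavy email gets a higher predicted probability of spam, the "thanks"-heavy email gets a much lower one, and the "michigan"-heavy email barely moves, matching the three cases listed above.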

### Question 1.4 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Given the intuition above, let's try and understand how our particular Pipeline is making predictions.

Below, complete the implementation of the function `d_largest_coefficients`, which takes in a fit Pipeline like `pipe`, a positive integer `d`, and an integer `sign` which is either `1` or `-1`. `d_largest_coefficients` should:
- Determine the coefficient of each word (i.e. each TF-IDF feature) used by `pipe`.
- If `sign == 1`:
    - Return a horizontal bar chart visualizing the coefficients of the `d` words with the **largest, most positive coefficients**.
- If `sign == -1`:
    - Return a horizontal bar chart visualizing the coefficients of the `d` words with the **largest, most negative coefficients**.
 
Example behavior is given below.

```python
>>> pipe = create_pipe_tfidf(X_train, y_train)
>>> d_largest_coefficients(pipe, d=5, sign=1)
```

<div align="left">
<img src="imgs/pos_coefs.png" width=400>
</div>

```python
>>> d_largest_coefficients(pipe, d=5, sign=-1)
```

<div align="left">
<img src="imgs/neg_coefs.png" width=400>
</div>

Some guidance:
- You'll need to use some of the Pipeline methods and attributes introduced in [Lecture 17](https://practicaldsc.org/resources/lectures/lec17/lec17-filled.html) to get the names and coefficients of each feature.
- As in the examples above, make sure your bar charts are sorted such that **the longest bar is always at the top**.

In [None]:
def d_largest_coefficients(pipe, d, sign):
    ...

emails = pd.read_csv('data/spam_ham_dataset.csv')
X_train, X_test, y_train, y_test = train_test_split(emails[['text']], emails['label'], random_state=98)
pipe = create_pipe_tfidf(X_train, y_train)

# Feel free to change these inputs to make sure your function works correctly.
d_largest_coefficients(pipe, d=5, sign=1)

In [None]:
grader.check("q01_04")

The function `d_largest_coefficients` should make clear which words increase the predicted probability that an email is spam, and which decrease that probability.

Below, we reintroduce the interactive widget you saw after Question 1.1. Can you try and make the predicted probability 99.9%? 0.01%?

In [None]:
email_widget(pipe)

### Question 1.5 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

The Pipeline we produced in Question 1.1 achieved a very high accuracy, precision, and recall, without us really needing to do anything. By default, `TfidfVectorizer` uses all words in the corpus, i.e. it uses a very large **vocabulary**. You should verify that the Pipeline you've produced creates **43061 features**, i.e. uses 43061 words.

Recall, $L_1$ regularization – called LASSO in the context of linear regression – is a form of regularization that not only prevents overfitting, but also encourages **sparsity**, in that many of the optimal parameters end up being set to 0. In Homework 9, we used LASSO as a form of feature selection – coefficients that were set to 0 are "turned off" or "disabled" by the model, since they're not needed in order to maximize cross-validation performance.
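
As a reminder of what sparsity looks like, here is a minimal sketch in the linear regression setting, on synthetic data. The data-generating process, the choice `alpha=0.1`, and the `toy_` names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 20 features, but only the first three affect the response.
rng = np.random.default_rng(98)
toy_X = rng.normal(size=(200, 20))
toy_y = (
    3 * toy_X[:, 0]
    - 2 * toy_X[:, 1]
    + 1.5 * toy_X[:, 2]
    + rng.normal(scale=0.5, size=200)
)

toy_ols = LinearRegression().fit(toy_X, toy_y)
toy_lasso = Lasso(alpha=0.1).fit(toy_X, toy_y)

# Count how many coefficients each model left non-zero.
n_nonzero_ols = int(np.sum(toy_ols.coef_ != 0))
n_nonzero_lasso = int(np.sum(toy_lasso.coef_ != 0))
print(n_nonzero_ols, n_nonzero_lasso)
```

The unregularized model assigns a non-zero coefficient to every feature, while LASSO zeroes out many of the features that don't matter.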

Let's use $L_1$-regularized logistic regression here, to determine **which** words are actually important for building a generalizable model. Typically, when regularizing, we've used `GridSearchCV` to use cross-validation to find an optimal choice of $\lambda$, the regularization hyperparameter. But instead, here, we'll pick a value of $\lambda$ for you to use in advance, to illustrate a point.

Below, complete the implementation of the function `create_pipe_l1_reg`, which takes in a DataFrame like `X_train` and a Series like `y_train` and returns a **tuple** such that:
- the first element is a fit Pipeline, just like the one in Question 1.1, but instead with an $L_1$-regularized logistic regression model at the end (see details below).
- the second element is the **number of non-zero coefficients in the resulting fit model**, i.e. the number of words that the final model ended up using, after $L_1$ regularization assigned a coefficient of 0 to some words.

Example behavior is given below.

```python
>>> pipe_l1_reg, num_words = create_pipe_l1_reg(X_train, y_train)
>>> pipe_l1_reg.predict_proba(pd.DataFrame([{
    'text': 'hey eecs 398 students attached is where you are sitting for the exam'
}]))
array([[0.64731552, 0.35268448]])

>>> num_words
31
```

Some guidance:

- The `LogisticRegression` class supports regularization directly, but you'll need to look into the arguments that its constructor accepts. In our implementation, we only specified four arguments to `LogisticRegression`:
    - `C=0.2`: `C` is the name for the regularization hyperparameter for `LogisticRegression` in `sklearn`, and **it behaves opposite to $\lambda$ / `alpha` in `Ridge`/`Lasso`, in that small values of `C` imply _more_ regularization**! So, `C=0.2` will result in a very regularized model.
    - `penalty`: Read the spec above to see what this should be.
    - `solver`: You'll need to change this from the default in order to use a non-default `penalty`; the role of the solver is to minimize empirical risk and find optimal model parameters, similar to gradient descent.
    - `random_state=98`: The solver you'll choose above behaves non-deterministically; set the `random_state` to `98` so that you get the same results (as us) every time.
- **You'll _have_ to read the documentation!**

In [None]:
def create_pipe_l1_reg(X_train, y_train):
    ...

emails = pd.read_csv('data/spam_ham_dataset.csv')
X_train, X_test, y_train, y_test = train_test_split(emails[['text']], emails['label'], random_state=98)

# Feel free to change these inputs to make sure your function works correctly.
pipe_l1_reg, num_words = create_pipe_l1_reg(X_train, y_train)
print(num_words)
pipe_l1_reg.predict_proba(pd.DataFrame([{
    'text': 'hey eecs 398 students attached is where you are sitting for the exam'
}]))

In [None]:
grader.check("q01_05")

If you completed Question 1.5 correctly, you should have noticed that the number of coefficients with non-zero values was relatively small, compared to the total number of words in the corpus!

In [None]:
pipe_l1_reg, num_words = create_pipe_l1_reg(X_train, y_train)
num_words

But, even just by using a handful of words from the corpus, we're still able to achieve relatively high accuracy!

In [None]:
pipe_l1_reg.score(X_test, y_test)

See if you can figure out which 31 words it ended up using – and see if you can do it in one line below (using what you've learned in Question 1.4!).

In [None]:
...

Nice work! You've built a spam email classifier, and more importantly, you **understand** how it works under the hood.

If you're curious to learn more about how Gmail's spam classifier works, [read this article](https://workspace.google.com/blog/identity-and-security/an-overview-of-gmails-spam-filters).

## Question 2: Vino 🍷

If you're not super familiar with wine – and, legally, perhaps you're not supposed to be! – wine is an alcoholic beverage made from **grapes** 🍇. There are several different types of grapes (formally known as "cultivars") grown for the purposes of making wine. In this question, we'll build various classifiers to predict the type of grape used to create a particular wine, given other properties of that wine.

Run the cell below to load in the full dataset, taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/109/wine). The dataset was originally collected in 1991 in a region of Italy.

In [None]:
wine = pd.read_csv('data/wine.csv')
wine

The first column, `'Class'`, contains the type of grape used to produce the wine. There are three classes:

In [None]:
wine['Class'].value_counts().sort_index()

The other 13 columns describe various attributes of the wines. Conveniently, they're all already numerical, so we don't need to perform any one hot encoding.

Before we build any models or draw any visualizations, let's perform a train-test split. The function below performs such a train-test split; we've implemented a function for this to make testing your code easier.

In [None]:
from sklearn.model_selection import train_test_split

def split_wine_data(path='data/wine.csv'):
    wine = pd.read_csv(path)
    return train_test_split(wine.iloc[:, 1:], wine.iloc[:, 0], random_state=98)
    
X_train_wine, X_test_wine, y_train_wine, y_test_wine = split_wine_data()

Run the cell below to draw a scatter plot. The plotting code is abstracted away in the file `hw10_util.py` to keep this notebook shorter.

In [None]:
util.wine_scatter(X_train_wine, y_train_wine)

Remember, each point corresponds to a different wine. You'll notice there are three variables on display:
- The class (type) of grape used to create the wine.
- The wine's alcohol by volume percentage. For instance, an `'Alcohol'` value of 13 means that 13% of the wine, by volume, is alcohol; the other 87% is made up of water, grape, etc. Larger values correspond to "stronger" wines.
- The wine's color intensity; presumably, larger values mean darker colors.

To start, we'll aim to predict the class of grape of a wine, given its alcohol and color intensity values. Later, we'll consult the other features in the dataset. Let's get started!

### Question 2.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `create_wine_tree`, which takes in a training set like `X_train_wine` and `y_train_wine` and returns a fit **decision tree**, such that:

- The `max_depth` hyperparameter is chosen via cross-validation. Try all values between 1 and 25, inclusive, and use the default 5-fold cross-validation.
- The tree is only trained using the `'Alcohol'` and `'Color Intensity'` features.

Example behavior is given below.

```python
# Technically, wine_tree is a fit GridSearchCV object,
# not a fit DecisionTreeClassifier object.
>>> wine_tree = create_wine_tree(X_train_wine, y_train_wine)
>>> wine_tree
```

<div align="left">
<img src="imgs/wine_tree_repr.png" width=300>
</div>

```python
# Note that we're only specifying 'Alcohol' and 'Color Intensity' values when
# making predictions.
>>> wine_tree.predict(pd.DataFrame([{
    'Alcohol': 13,
    'Color Intensity': 5
}]))
array(['Grape 3'], dtype=object)
```

Some guidance:
- Since there is randomness in how decision trees are fit, set `random_state=98` when instantiating your `DecisionTreeClassifier`.
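
As a reminder of how a fit `GridSearchCV` object behaves, here is a sketch that deliberately uses a different model and dataset (a $k$-nearest neighbors classifier on `sklearn`'s built-in iris data). The grid and the `toy_` names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

toy_X, toy_y = load_iris(return_X_y=True)

# Try every value of n_neighbors from 1 to 10; cv defaults to 5 folds.
toy_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 11))},
)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)         # the winning hyperparameter value
print(toy_search.best_estimator_)      # the model refit with that value
print(toy_search.predict(toy_X[:3]))   # predictions come from best_estimator_
```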

In [None]:
def create_wine_tree(X_train_wine, y_train_wine):
    # Hint: You'll need to import several classes yourself.
    # Do so within your function.
    ...
    
X_train_wine, X_test_wine, y_train_wine, y_test_wine = split_wine_data()

# Feel free to change these inputs to make sure your function works correctly.
wine_tree = create_wine_tree(X_train_wine, y_train_wine)
wine_tree.predict(pd.DataFrame([{
    'Alcohol': 12.5,
    'Color Intensity': 3
}]))

In [None]:
grader.check("q02_01")

One benefit to using a decision tree is that we can visualize the resulting model as... a tree 🌲! Run the cell below to see your fit tree.

In [None]:
X_train_wine, X_test_wine, y_train_wine, y_test_wine = split_wine_data()
wine_tree = create_wine_tree(X_train_wine, y_train_wine)
util.show_wine_decision_tree(wine_tree.best_estimator_, X_train_wine[['Alcohol', 'Color Intensity']]);

While the maximum depth of your tree is 4 (which should have been the optimal value of `max_depth` that `GridSearchCV` found), you'll notice that some branches terminate before a max depth of 4.

Let's visualize the decision boundaries of your fit decision tree, in terms of the feature space (that is, in a plot of `'Color Intensity'` versus `'Alcohol'`). The scatter plot drawn is of the **training set**.

In [None]:
util.show_wine_decision_boundary(wine_tree, X_train_wine, y_train_wine, title=f"Decision Tree of Depth {wine_tree.best_params_['max_depth']}, Drawn on Training Set")

As we learned in Lecture 21, decision trees partition the feature space into rectangles!
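
One way to see why the regions are rectangles is to print a tree's rules: every split compares a single feature to a threshold. The sketch below uses `sklearn`'s built-in iris data rather than the wine data, and the feature names and `toy_` variables are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
toy_X = iris.data[:, :2]  # keep two features, so the feature space is 2D
toy_feature_names = ["sepal length", "sepal width"]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=98)
toy_tree.fit(toy_X, iris.target)

# Each rule has the form "feature <= threshold", i.e. an axis-aligned cut.
toy_rules = export_text(toy_tree, feature_names=toy_feature_names)
print(toy_rules)
```

Since each cut is parallel to one of the axes, stacking cuts can only ever carve the plane into rectangles.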

Let's switch our attention to our tree's performance on the **test set**. First, we'll compute the accuracy of your model, just so we have a baseline to refer to later on.

In [None]:
wine_tree.score(X_test_wine[['Alcohol', 'Color Intensity']], y_test_wine)

And finally, let's draw the **confusion matrix** for your decision tree's predictions – using the **test** set once again.

In [None]:
util.show_wine_confusion_matrix(wine_tree, X_test_wine[['Alcohol', 'Color Intensity']], y_test_wine)

Above, we're seeing that for instance, in 2 cases, we predicted Grape 1 for a wine that actually used Grape 3.
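
To practice reading a three-class confusion matrix, here is a sketch on made-up labels using `sklearn.metrics.confusion_matrix`, where rows are actual classes and columns are predicted classes. The labels below are hypothetical, and the plot drawn by `util.show_wine_confusion_matrix` may be laid out differently.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_labels = ["Grape 1", "Grape 2", "Grape 3"]

# Hypothetical actual and predicted classes for five wines.
toy_actual = ["Grape 1", "Grape 3", "Grape 3", "Grape 2", "Grape 3"]
toy_pred = ["Grape 1", "Grape 1", "Grape 3", "Grape 2", "Grape 1"]

# Entry [i, j] counts wines whose actual class is toy_labels[i]
# and whose predicted class is toy_labels[j].
toy_cm = confusion_matrix(toy_actual, toy_pred, labels=toy_labels)
print(toy_cm)

# Correct predictions lie on the diagonal, so accuracy is trace / total.
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(toy_accuracy)
```

Here, the entry in the Grape 3 row and Grape 1 column is 2: twice, Grape 1 was predicted for a wine that actually used Grape 3.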

With these baselines in mind, let's see how other classification techniques perform!

### Question 2.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

In this question, your job is to train **six** different classifiers on the wine training set, so that we can compare and contrast their results. 

Complete the implementation of the function `train_wine_models`, which takes in a training set like `X_train_wine` and `y_train_wine` and returns a **dictionary** in which
- the keys are model names as strings (specified below), and
- the values are corresponding **already-fit** model objects, fit on just the `'Alcohol'` and `'Color Intensity'` features of `X_train_wine` like before.

The models you need to fit are described below.

| Model Name | Model Description |
| --- | --- |
| `'Decision Tree (max_depth=4)'` | Decision tree classifier with maximum depth hard-coded to 4 |
| `'Random Forest (max_depth=4)'` | Random forest classifier with maximum depth hard-coded to 4 |
| `'KNN (k=10)'` | $k$-nearest neighbors classifier with $k=10$ |
| `'Naive Bayes'` | Gaussian Naïve Bayes classifier |
| `'Logistic Regression'` | Logistic regression* (see "Some guidance" for more details) with `C=np.inf` (no regularization; `sklearn` regularizes by default)|
| `'Neural Network'` | Multi-layer perceptron, i.e. a basic neural network |

Example behavior is given below.

```python
>>> model_dict = train_wine_models(X_train_wine, y_train_wine)
>>> model_dict.keys()
dict_keys(['Decision Tree (max_depth=4)', 'Random Forest (max_depth=4)', 'KNN (k=10)', 'Naive Bayes', 'Logistic Regression', 'Neural Network'])

>>> model_dict['Naive Bayes'].predict(pd.DataFrame([{'Alcohol': 15, 'Color Intensity': 15}]))[0]
'Grape 3'
```

Some guidance:
- **In all models that accept a `random_state` argument, set `random_state=98`.** Otherwise, if a hyperparameter isn't specified, don't set it. You shouldn't need to grid search anything in this question.
- We've only covered a few of the models listed above in lecture. Fortunately, the `sklearn` model interface works the same way for the rest too; you just need to determine what they're called and how to use them.
- *As we'll see in Lecture 22, logistic regression is natively designed for binary classification, but can be extended to support multiple classes, in a way called "multinomial" logistic regression. Since `y_train_wine` has three classes, multinomial logistic regression will be performed automatically, so you don't really need to think about this.
- In Question 2.3, we're going to retrain the classifiers above, but with some feature engineering steps. To make this process easier, we recommend defining a helper function that returns a dictionary of **un-fit** model instances, that `train_wine_models` then takes in, loops through, and trains. If you do this, you can reuse your helper function later on.
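
The helper-function pattern from the last bullet can be sketched on a toy problem as follows. The names `make_unfit_models` and `fit_all`, the two toy models, and the dataset are all hypothetical and are not the models this question asks for; the point is only the shape of the pattern.

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def make_unfit_models():
    # Hypothetical helper: returns fresh, un-fit estimators keyed by name.
    return {
        'Toy Tree': DecisionTreeClassifier(max_depth=2, random_state=0),
        'Toy KNN': KNeighborsClassifier(n_neighbors=3),
    }

def fit_all(models, X, y):
    # Fit a clone of each estimator, so the originals stay un-fit and reusable.
    return {name: clone(model).fit(X, y) for name, model in models.items()}

# Tiny made-up dataset: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

unfit = make_unfit_models()
fitted = fit_all(unfit, X_toy, y_toy)
prediction = fitted['Toy KNN'].predict(np.array([[5.5, 5.5]]))[0]
print(prediction)  # 'b'
```

Using `clone` is one design choice among several; calling the helper again to get a new dictionary of un-fit models would work just as well.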

In [None]:
def train_wine_models(X_train_wine, y_train_wine):
    ...

X_train_wine, X_test_wine, y_train_wine, y_test_wine = split_wine_data()
model_dict = train_wine_models(X_train_wine, y_train_wine)
# Feel free to experiment with the behavior of any of these models below.
model_dict['Neural Network'].predict(pd.DataFrame([{'Alcohol': 15, 'Color Intensity': 15}]))[0]

In [None]:
grader.check("q02_02")

Now that you've done the hard work of specifying these various models, let's look at how well they perform. First, let's visualize all six decision boundaries on the **training** set.

In [None]:
util.show_wine_decision_boundaries_grid(model_dict, X_train_wine, y_train_wine)

<div class="alert alert-success">

Before proceeding, **look 👀** at how all six decision boundaries are shaped, and think about _why_ they are shaped the way they are.

</div>

How do they perform on the test set?

In [None]:
util.compute_and_plot_accuracies(model_dict, X_train_wine, y_train_wine, X_test_wine, y_test_wine, features=['Alcohol', 'Color Intensity'])

Interesting! Many of our models happen to perform better on the test set than they do on the training set. **Think about some possible reasons as to how this may have happened!**

It seems that the (multinomial) logistic regression and Naive Bayes models happen to perform the best on this specific test set, both with a **test accuracy of 88.89%**. That's not to say logistic regression and Naive Bayes are _always_ better than the other techniques. If we tuned hyperparameters for various other models, we could almost surely improve our performance.

Until now, we've only used two features to predict the type of grape used in a wine – `'Alcohol'` and `'Color Intensity'` – though there are 11 others that we haven't used. We also haven't _processed_ any of the features. In the next part of this question, we'll address the latter point.

### Question 2.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `train_standardized_wine_models`, which takes in a training set like `X_train_wine` and `y_train_wine` and returns a **dictionary** with the exact same keys as the dictionary returned by `train_wine_models`. The values in the returned dictionary should still be **already-fit** models, the difference here being that all models should be **Pipelines that start with a `StandardScaler`**, i.e. all models must first standardize their features before training.

Example behavior is given below.

```python
>>> standardized_model_dict = train_standardized_wine_models(X_train_wine, y_train_wine)

# Keep the same specific hyperparameters (and names) we specified in Question 2.2!
>>> standardized_model_dict.keys()
dict_keys(['Decision Tree (max_depth=4)', 'Random Forest (max_depth=4)', 'KNN (k=10)', 'Naive Bayes', 'Logistic Regression', 'Neural Network'])

>>> standardized_model_dict['Naive Bayes']
```

<div align="left">
<img src="imgs/pipe.png" width=200>
</div>

Some guidance:
- You **can't** start by calling `train_wine_models`, because the models in the dictionary that `train_wine_models` returns are all already trained. Instead, try following the guidance in Question 2.2 about writing a helper function.
- Try to avoid copy-and-pasting: our implementation only has one call to `make_pipeline`, and one call to `fit` (both in a `for`-loop).
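
As a refresher on the mechanics of a single Pipeline, here is a small sketch on made-up data (the dataset and the choice of a 1-nearest-neighbor model are assumptions for illustration only). Calling `fit` on the pipeline first fits the `StandardScaler` on the training features and then fits the final model on the standardized values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: the second feature lives on a much larger scale than the first.
X_toy = np.array([[0.0, 100.0], [1.0, 300.0], [2.0, 200.0], [3.0, 400.0]])
y_toy = np.array(['a', 'a', 'b', 'b'])

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name.
scaler = pipe.named_steps['standardscaler']
learned_means = scaler.mean_  # column means learned from the training data
print(learned_means)          # [  1.5 250. ]
```

At prediction time, `pipe.predict` reuses those learned means and standard deviations, so new data is standardized with training-set statistics.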

In [None]:
def train_standardized_wine_models(X_train_wine, y_train_wine):
    ...

X_train_wine, X_test_wine, y_train_wine, y_test_wine = split_wine_data()
standardized_model_dict = train_standardized_wine_models(X_train_wine, y_train_wine)
# Feel free to experiment with the behavior of any of these models below.
standardized_model_dict['Naive Bayes']

In [None]:
grader.check("q02_03")

What was the effect of standardizing on model performance? Let's see:

In [None]:
util.compare_model_dictionaries(model_dict, standardized_model_dict, X_train_wine, y_train_wine, X_test_wine, y_test_wine, features=['Alcohol', 'Color Intensity'])

Let's summarize what we're seeing:

| Performance UNCHANGED due to standardization | Performance CHANGED due to standardization |
| --- | --- |
| Logistic regression<br>Naive Bayes<br>Random forest<br>Decision tree | Neural network<br>KNN |

It seems that in some cases, standardizing our features impacts model performance, and in other cases, it does not! **Why?**

<div class="alert alert-success">

**Action Item**: Do a little bit of research into all six models above. In which of them is model performance impacted by standardization? In the case of logistic (and linear) regression, why do we still sometimes standardize, even though it doesn't impact performance? Would the results be different if we regularized our logistic regression model (i.e. set `C=1` in `LogisticRegression`)?

You don't need to submit your answers to these prompts anywhere (since this homework is fully autograded), but you **must** think about them!

</div>
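
As one possible starting point for the KNN part of that investigation, the sketch below uses two made-up training points and one made-up query point (none of this is wine data). It computes Euclidean distances before and after standardizing, to show how a feature on a large scale can dominate the distance.

```python
import numpy as np

# Made-up points: feature 0 ranges over [0, 1], feature 1 over [0, 100].
X = np.array([[0.0, 0.0],      # point A
              [1.0, 100.0]])   # point B
q = np.array([0.9, 40.0])      # query point

def nearest(points, query):
    # Index of the row of `points` closest to `query` in Euclidean distance.
    return int(np.argmin(np.linalg.norm(points - query, axis=1)))

nearest_raw = nearest(X, q)

# Standardize with the training points' own means and standard deviations.
mu, sigma = X.mean(axis=0), X.std(axis=0)
nearest_standardized = nearest((X - mu) / sigma, (q - mu) / sigma)

print(nearest_raw)           # 0, i.e. point A
print(nearest_standardized)  # 1, i.e. point B
```

On the raw scale, the 40-unit gap in feature 1 swamps everything else, so A looks closest; once both features are on comparable scales, B is closer. Whether a similar argument applies to the other five models is left for you to think about.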

### Question 2.4 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

You may have noticed that the best models above still only achieve **88.89%** test set accuracy. Complete the implementation of the function `create_best_wine_model`, which takes in a training set like `X_train_wine` and `y_train_wine` and returns a **single, already-fit model** that achieves **at least 91% test set accuracy**. As before, your model should be trained using just the `'Alcohol'` and `'Color Intensity'` features of `X_train_wine`.

Some guidance:
- You're free to use any classifier in `sklearn`, though it's possible to satisfy the requirements of the assignment using one of the types of models you've already used somewhere in Question 2.
- Feel free to tune hyperparameters and engineer features as you wish.
- If your model accepts a `random_state` argument, provide one, so that you don't accidentally achieve an accuracy of under 91% on Gradescope.

In [None]:
def create_best_wine_model(X_train_wine, y_train_wine):
    ...

X_train_wine, X_test_wine, y_train_wine, y_test_wine = split_wine_data()

# The value below must be at least 91%!
best_model = create_best_wine_model(X_train_wine, y_train_wine)
best_model.score(X_test_wine[['Alcohol', 'Color Intensity']], y_test_wine)

In [None]:
grader.check("q02_04")

Feel free to continue experimenting with other features in the data and other types of classifiers built into `sklearn`. Happy exploring! 🍷

## Question 3: Descendents 🧑‍🧑‍🧒‍🧒

In this question, you'll develop a deep understanding of gradient descent, a numerical method (first introduced in [Lecture 20](https://practicaldsc.org/resources/lectures/lec20/lec20-filled.html)) designed to minimize functions computationally. We'll switch our focus back to regression, unlike in the first two questions, which were about classification.

To motivate our specific goals, let's look at the commute times dataset that we're now very familiar with from lecture.

In [None]:
df = pd.read_csv('data/commute-times.csv')
df.head()

One of our running examples has been to build a simple linear regression model of the form $H(x_i) = w_0 + w_1 x_i$, that predicts commute time in `'minutes'` given `'departure_hour'`.

In [None]:
fig = px.scatter(df, x='departure_hour', y='minutes').update_layout(xaxis_title='Home Departure Time (AM)', yaxis_title='Minutes', title='Commuting Time vs. Home Departure Time')
fig

The "default" approach has been to choose squared loss, meaning that we choose the intercept, $w_0^*$, and slope, $w_1^*$, that minimize **mean squared error**:

$$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$

We solved for $w_0^*$ and $w_1^*$ by hand using calculus in Lecture 12, and in Lectures 14 and 15 we looked at a solution that was derived using linear algebra (which is how we found the normal equations). For reference, we'll calculate these below.

In [None]:
from sklearn.linear_model import LinearRegression
baseline = LinearRegression()
baseline.fit(df[['departure_hour']], df['minutes'])
w_squared_loss = baseline.intercept_, baseline.coef_[0]
w_squared_loss

So, under squared loss, we have $w_0^* = 142.45$ and $w_1^* = -8.19$. Keep these values in mind throughout the question. For reference, here's what <b><span style="color:red">the line that minimizes mean squared error</span></b> looks like:

In [None]:
fig.add_trace(
    go.Scatter(x=[5.5, 11.5], y=[baseline.predict([[5.5]])[0], baseline.predict([[11.5]])[0]], mode='lines', line=dict(color='red'), name='Best Line (MSE)')
)

<br>

Another approach is to choose absolute loss, meaning that we choose intercept and slope that minimize **mean absolute error**:

$$R_\text{abs}(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n \left| y_i - (w_0 + w_1 x_i) \right|$$

In Homework 7, Question 3, you implemented a brute-force computational routine that found the optimal slope and intercept in $O(n^3)$ time.

<br>

What we'll explore here is a **new** loss function, Tukey's loss function, named after [John Tukey](https://en.wikipedia.org/wiki/John_Tukey), the creator of the box plot. It is defined as follows:

$$L_T(y_i, H(x_i)) = \begin{cases} 1 - \left( 1 - \left( \frac{y_i - H(x_i)}{50} \right)^2 \right)^3 && \text{if} \: |y_i - H(x_i)| \leq 50, \\ 1 && \text{otherwise} \end{cases}$$

To make sense of how the loss function behaves, let's graph it.

In [None]:
def tukey(y_actual, y_pred, c=50):
    error = y_actual - y_pred
    if np.abs(error) <= c:
        return 1 - (1 - (error / c) ** 2) ** 3
    else:
        return 1

hs = np.linspace(-200, 200, 10000)
px.line(x=hs, y=[tukey(0, h) for h in hs], title='Tukey Loss').update_layout(xaxis_title='h', yaxis_title='L(0, h)')

Note that Tukey loss is defined **piecewise**. For predictions in which $|y_i - H(x_i)|$ is more than 50, the loss is capped at 1, resulting in a loss function that is _very_ robust to outliers (unlike squared loss, which is influenced heavily by outliers). For predictions in which $|y_i - H(x_i)| \leq 50$, the loss looks very similar to squared loss. The choice of 50 as the threshold was arbitrary; we could have chosen a different transition point (and, once you finish the question, it's worth investigating how your results would have changed if 50 were replaced with 5 or 100).

You'll also notice that the "handoff" from the curved part to the flat part at $h = \pm 50$ is smooth, meaning Tukey loss is differentiable, unlike absolute loss. This will be important!
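
To make the two regimes concrete, here is a vectorized sketch of the same loss. The name `tukey_vectorized` is ours and is not part of `hw10_util.py`; it computes the same values as `tukey`, but accepts arrays by using `np.where` in place of an `if`-statement.

```python
import numpy as np

def tukey_vectorized(y_actual, y_pred, c=50):
    # Same formula as `tukey`, applied elementwise to arrays.
    error = np.asarray(y_actual, dtype=float) - np.asarray(y_pred, dtype=float)
    inside = 1 - (1 - (error / c) ** 2) ** 3
    return np.where(np.abs(error) <= c, inside, 1.0)

errors = np.array([0.0, 1.0, 25.0, 50.0, 100.0])
losses = tukey_vectorized(errors, 0)

# For small errors, expanding the cube gives 3u^2 - 3u^4 + u^6 with u = error / c,
# so the loss is approximately 3 * (error / c) ** 2, a rescaled squared loss.
small_error_approximation = 3 * (errors[1] / 50) ** 2

print(losses[2])  # 0.578125, the loss for an error of 25
print(losses[3], losses[4])  # 1.0 1.0, the loss is capped once |error| >= 50
```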

Let's continue to consider the simple linear regression model, $H(x_i) = w_0 + w_1 x_i$, where $x_i$ is a scalar, not a vector. In that case, the empirical risk $R_T$ using Tukey loss looks like:

$$R_T(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n  \begin{cases} 1 - \left( 1 - \left( \frac{y_i - (w_0 + w_1 x_i)}{50} \right)^2 \right)^3 && \text{if} \: |y_i - (w_0 + w_1 x_i)| \leq 50, \\ 1 && \text{otherwise} \end{cases}$$

Let's graph the loss surface, using some help from the helper functions we've defined in `hw10_util.py`. Note that we're specifying that we want the $x$'s and $y$'s to come from the `'departure_hour'` and `'minutes'` columns in `df`, respectively.

In [None]:
util.draw_loss_surface_tukey_commute(xs=df['departure_hour'], ys=df['minutes'])

Let's use gradient descent to find the intercept, $w_0^*$, and slope, $w_1^*$, that minimize the loss surface above. That is, let's find the best intercept and slope to use in a simple linear model that predicts `'minutes'` using `'departure_hour'`, using Tukey loss. We can think of this problem as trying to find the vector, $\vec w^*$, that minimizes $R_T(\vec w)$, where $\vec w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$.

The gradient descent update rule is as follows:

$$\vec w^{(t+1)} = \vec w^{(t)} - \alpha \nabla R_T( \vec w^{(t)})$$

where $\vec w^{(t)}$ is our guess for the minimizing $\vec w^*$ at timestep $t$. To start the process, we'll need to decide on an initial guess, $\vec w^{(0)}$, and a step size, $\alpha$, which we will do later.

But, more crucially, to run gradient descent ourselves, we'll need to be able to compute $\nabla R_T(\vec w^{(t)})$, i.e. the **gradient** of $R_T$ at any point $\vec w$.
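
Before working with $R_T$, it may help to see the update rule on a function whose minimizer is known. The sketch below is an illustration only: it uses a made-up convex function, $f(\vec w) = (w_0 - 3)^2 + (w_1 + 1)^2$, a hand-picked step size, and a fixed number of steps rather than a convergence check.

```python
import numpy as np

def toy_gradient(w):
    # Gradient of f(w) = (w0 - 3)^2 + (w1 + 1)^2, which is minimized at (3, -1).
    return 2 * (w - np.array([3.0, -1.0]))

w = np.array([0.0, 0.0])  # initial guess, w^(0)
alpha = 0.1               # step size
for t in range(100):
    # The update rule: w^(t+1) = w^(t) - alpha * gradient at w^(t).
    w = w - alpha * toy_gradient(w)

print(w)  # very close to [ 3. -1.]
```

For this particular function, each step shrinks the distance to the minimizer by a factor of $1 - 2\alpha = 0.8$, which is why 100 steps are plenty. $R_T$ is not this well behaved, as you'll see later in the question.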

### Question 3.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

As a refresher, the empirical risk function that we're trying to minimize is:

$$R_T(\vec w) = R_T(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n  \begin{cases} 1 - \left( 1 - \left( \frac{y_i - (w_0 + w_1 x_i)}{50} \right)^2 \right)^3 && \text{if} \: |y_i - (w_0 + w_1 x_i)| \leq 50, \\ 1 && \text{otherwise} \end{cases}$$

Complete the implementation of the function `tukey_gradient`, which takes in:
- `w`, an **array of length 2**, corresponding to a value of $w_0$ and a value of $w_1$, and
- `xs` and `ys`, two Series/1D arrays with the same length, corresponding to sequences of $x$-values (like `df['departure_hour']`) and $y$-values (like `df['minutes']`), respectively.

`tukey_gradient` should return an **array of length 2**, containing the value of the gradient of $R_T$, evaluated at the point `w` that is passed in. Example behavior is given below.

```python
# This says that dR/dw0, when (w0, w1) = (100, -5), is -0.02389409, and
# dR/dw1, when (w0, w1) = (100, -5), is -0.20369193.
>>> tukey_gradient(np.array([100, -5]), xs=df['departure_hour'], ys=df['minutes'])
array([-0.02389409, -0.20369193])
```

Some guidance:
- Remember, the gradient of a function is a vector of partial derivatives. So, $\nabla R_T(\vec w) = \begin{bmatrix} \frac{\partial R_T}{\partial w_0} \\ \frac{\partial R_T}{\partial w_1} \end{bmatrix}$.
- Because $R_T$ is a piecewise function, both partial derivatives will also be piecewise, using the same condition as in $R_T$. So, your definition of `tukey_gradient` can involve `if`-statements.
- You'll need to do a substantial amount of math on-paper to complete this implementation, and accurately translate it to code. `for`-loops are fine in your implementation.
    - Back in Lecture 11, we stressed that the derivative of a sum of functions is equal to the sum of the derivatives of those functions. In other words, $\frac{\partial R_T}{\partial w_0} = \frac{1}{n} \sum_{i = 1}^n \frac{\partial L_T}{\partial w_0}$. Given this, it's easiest to start with finding the partial derivatives of just the loss function $L_T$ with respect to $w_0$ and $w_1$, and summing these partial derivatives in a `for`-loop.
    - When computing the partial derivatives of $L_T$ with respect to the two parameters, you may find it helpful to perform a substitution, e.g. $e_i = y_i - (w_0 + w_1 x_i)$, and then use the chain rule from calculus, i.e. $\frac{\partial L_T}{\partial w_0} = \frac{\partial L_T}{\partial e_i} \cdot \frac{\partial e_i}{\partial w_0}$.
    - $\frac{\partial R_T}{\partial w_0}$ and $\frac{\partial R_T}{\partial w_1}$ will look very similar, but with one small difference.
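
One optional way to sanity-check a hand-derived gradient is to compare it against a finite-difference approximation. The sketch below is not part of the assignment, and `numerical_gradient` and `toy_function` are hypothetical names; it approximates each partial derivative with a central difference, demonstrated on a function whose gradient is easy to verify by hand.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-6):
    # Approximates each partial derivative of f at w with a central difference.
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

def toy_function(w):
    # f(w0, w1) = w0^2 + 3 * w0 * w1, with gradient (2 * w0 + 3 * w1, 3 * w0).
    return w[0] ** 2 + 3 * w[0] * w[1]

approx = numerical_gradient(toy_function, [1.0, 2.0])
print(approx)  # approximately [8. 3.]
```

If you pass in a function that evaluates $R_T$ at a given `w`, the result should roughly agree with your `tukey_gradient` at the same point. A large disagreement usually points to an algebra or sign error.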

In [None]:
def tukey_gradient(w, xs, ys):
    ...

# Feel free to change these inputs to make sure your functions work correctly.
tukey_gradient(np.array([100, -5]), xs=df['departure_hour'], ys=df['minutes'])

In [None]:
grader.check("q03_01")

### Question 3.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `run_gradient_descent_tukey`, which takes in:
- `w_initial`, an **array of length 2**, corresponding to an initial guess, $\vec w^{(0)}$, of the minimizer.
- `alpha`, a positive number corresponding to a step size/learning rate.
- `xs` and `ys`, two Series/1D arrays with the same length, corresponding to sequences of $x$-values (like `df['departure_hour']`) and $y$-values (like `df['minutes']`), respectively.
- `tol`, a float representing the **convergence criterion**, the default value of which will be set to `0.0001` (i.e. $10^{-4}$).
- `verbose`, a Boolean flag.

`run_gradient_descent_tukey` should run as many iterations of gradient descent as necessary, terminating once $\lVert \nabla R_T(\vec w^{(t)}) \rVert_2 \leq \text{tol}$, i.e. once the $L_2$ norm of `tukey_gradient(w, xs, ys)` (where `w` is the current guess of $\vec w^*$) is less than or equal to `tol`.
`run_gradient_descent_tukey` should return a 2D array containing all vectors $\vec w^{(t)}$ that were visited by gradient descent. In other words, it should return a 2D array of shape `(num_iterations, 2)`, where row 0 is $\vec w^{(0)}$ (the initial guess), row 1 is $\vec w^{(1)}$, row 2 is $\vec w^{(2)}$, and so on, until row -1 is our final guess of $\vec w^*$.

If `verbose=True`, then on iterations 0, 1000, 2000, and so on, use the `display` function to show the iteration number $t$, the value of $\vec w^{(t)}$, the value of  $R_T(\vec w^{(t)})$, and the value of $\lVert \nabla R_T(\vec w^{(t)}) \rVert_2$ (i.e. the norm of the gradient vector) at the current iteration. You can compute $R_T(\vec w^{(t)})$ using the `tukey` function defined at the top of Question 3, or using `util.empirical_risk`. We won't test your code with `verbose=True`, so you have some flexibility in how to implement it, but you'll need to complete this step in order for the remainder of the question to make sense.

Finally, if more than 50,000 iterations have been completed (including the first and current guesses), terminate the algorithm and return an array of shape `(50000, 2)` containing the 50,000 vectors $\vec w^{(0)}, \vec w^{(1)}, \ldots, \vec w^{(49,999)}$ that were visited by the algorithm.

Example behavior is given below.

```python
>>> path = run_gradient_descent_tukey(w_initial=np.array([100, 0]), 
                                      alpha=10, xs=df['departure_hour'], ys=df['minutes'])

# Our guess of w^*, the optimal parameter vector.
>>> path[-1]
array([121.5414978 ,  -5.85456359])

# The number of steps it took to reach the above guess, including the first and last guesses.
>>> len(path)
7616
```

In [None]:
def run_gradient_descent_tukey(w_initial, alpha, xs, ys, tol=0.0001, verbose=False):
    ...

# Feel free to change these inputs to make sure your functions work correctly.
path = run_gradient_descent_tukey(w_initial=np.array([100, 0]), 
                           alpha=10, 
                           xs=df['departure_hour'], 
                           ys=df['minutes'], 
                           verbose=True)
path[-1]

In [None]:
grader.check("q03_02")

Nice work! Let's experiment with what you've done. The expression below will call your implementation of `run_gradient_descent_tukey`, and draw out the path that gradient descent took to minimize $R_T(\vec w)$ on the loss surface of $R_T(\vec w)$ itself, using an initial guess of $\vec w^{(0)} = \begin{bmatrix} 100 \\ 0 \end{bmatrix}$ and $\alpha = 10$ (as in the example output provided in Question 3.2). If you hover over a point in gold, you'll see its iteration number, $t$.

In [None]:
path = run_gradient_descent_tukey(w_initial=np.array([100, 0]), alpha=10, xs=df['departure_hour'], ys=df['minutes'])
util.draw_loss_surface_tukey_commute(xs=df['departure_hour'], ys=df['minutes'], path=path)

You should notice that within a few steps, gradient descent gets down to the valley, but it takes thousands more iterations to inch sufficiently close to the true minimum. If you called `run_gradient_descent_tukey` with `verbose=True` above, you should have seen that the value of $R_T(\vec w^{(t)})$ decreased very slowly every 1000 iterations, and the norm of the gradient vector very, very slowly approached 0. If we settled for a greater tolerance – say, $0.0005$ instead of $0.0001$, we'd have converged quicker:

In [None]:
run_gradient_descent_tukey(w_initial=np.array([100, 0]), 
                           alpha=10, 
                           xs=df['departure_hour'], 
                           ys=df['minutes'], 
                           tol=0.0005,
                           verbose=True)

But, the resulting $\vec w^*$ of $\begin{bmatrix} 105.234 \\ -3.978 \end{bmatrix}$ is quite far from what we got when using a tolerance of $0.0001$, which gave us $\begin{bmatrix} 121.541 \\ -5.855 \end{bmatrix}$. Why do you think this is happening?

Instead of weakening our tolerance, perhaps we can try a different learning rate:

In [None]:
run_gradient_descent_tukey(w_initial=np.array([100, 0]), 
                           alpha=15, 
                           xs=df['departure_hour'], 
                           ys=df['minutes'], 
                           tol=0.0001,
                           verbose=True)

If you've implemented everything correctly, you should see that the above call to gradient descent maxes out at 50,000 iterations. Something strange must be going on, since the guesses of $\vec w^{(t)}$ seem to be relatively constant every 1000 iterations, as do the values of $R_T(\vec w^{(t)})$ and $\lVert \nabla R_T(\vec w^{(t)}) \rVert_2$. What's going on? Make an educated guess, then run the cell below. (It should take ~30 seconds to render.)

In [None]:
path = run_gradient_descent_tukey(w_initial=np.array([100, 0]), alpha=15, xs=df['departure_hour'], ys=df['minutes'])
util.draw_loss_surface_tukey_commute(xs=df['departure_hour'], ys=df['minutes'], path=path)

It seems that a learning rate of $\alpha = 15$ is too large, and results in oscillatory behavior between iterations $t$ and $t+1$. Since we're only printing every 1000 iterations when `verbose=True`, we only get to see one of the two oscillatory states (e.g., in the sequence $a$, $b$, $a$, $b$, $a$, $b$, ..., if we sample just the even positions, we'd think the entire sequence was made up of $b$'s).
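
The same phenomenon shows up in a much simpler setting. The sketch below is a one-dimensional illustration, not the Tukey problem: it runs gradient descent on $f(w) = w^2$ with a step size chosen to be exactly large enough that each step jumps to the mirror-image point on the other side of the minimum.

```python
def gradient(w):
    # Derivative of f(w) = w^2.
    return 2 * w

alpha = 1.0  # deliberately too large for this function
w = 1.0
path = [w]
for t in range(10):
    w = w - alpha * gradient(w)  # w - 2w = -w, so the sign flips every step
    path.append(w)

every_other = path[::2]
print(path[:4])     # [1.0, -1.0, 1.0, -1.0]
print(every_other)  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Looking only at every other iterate makes the sequence appear constant, even though the algorithm never settles at the minimum, $w = 0$.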

Maybe an even larger learning rate will work better – let's try $\alpha = 50$.

In [None]:
run_gradient_descent_tukey(w_initial=np.array([100, 0]), 
                           alpha=50, 
                           xs=df['departure_hour'], 
                           ys=df['minutes'], 
                           tol=0.0001,
                           verbose=True)

That was quick... what happened?

In [None]:
path = run_gradient_descent_tukey(w_initial=np.array([100, 0]), alpha=50, xs=df['departure_hour'], ys=df['minutes'])
util.draw_loss_surface_tukey_commute(xs=df['departure_hour'], ys=df['minutes'], path=path)

Why did gradient descent get stuck up top? Something you may have noticed is that $R_T(\vec w)$ is **not convex**. This means that there can be regions in which the gradient of $R_T(\vec w)$ is 0 that don't correspond to a global minimum (which isn't possible for convex functions). The step size of $\alpha = 50$ is so large that, after a few iterations, gradient descent lands us in the flat region, and once a $\vec w^{(t)}$ lands there, $\nabla R_T(\vec w^{(t)}) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, terminating the algorithm instantly.
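
A one-dimensional analogue makes the flat region easy to see. The sketch below reuses the definition of `tukey` from earlier in this question (written with the built-in `abs` so it runs on its own) and estimates the slope of $L(0, h)$ numerically; `numerical_slope` is a hypothetical helper introduced only for this illustration.

```python
def tukey(y_actual, y_pred, c=50):
    # Same definition as the notebook's `tukey` function, using built-in abs.
    error = y_actual - y_pred
    if abs(error) <= c:
        return 1 - (1 - (error / c) ** 2) ** 3
    return 1

def numerical_slope(h, eps=1e-3):
    # Central-difference estimate of the derivative of L(0, h) with respect to h.
    return (tukey(0, h + eps) - tukey(0, h - eps)) / (2 * eps)

slope_near_minimum = numerical_slope(10.0)     # positive: loss rises as h moves away from 0
slope_in_flat_region = numerical_slope(200.0)  # exactly 0: both evaluations return 1

# A gradient descent step taken from the flat region does not move at all.
h = 200.0
h_next = h - 50 * slope_in_flat_region
print(slope_in_flat_region, h_next)  # 0.0 200.0
```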

So, to summarize:
- Due to the nature of the loss surface, gradient descent can get _near_ the minimum fairly quickly, but may take many iterations to actually converge.
- If we choose the step size to be too large, gradient descent may oscillate indefinitely, "bouncing" over the minimum.
- Since the loss surface is not convex, gradient descent can get "trapped" in flat regions and terminate mistakenly.

In future courses – and in the real world – you'll look at solutions to all of these problems. But for now, you hopefully have a better understanding of how gradient descent works, and how it can go wrong.

Finally, let's actually take a look 👀 at the line that minimizes mean Tukey loss on the commute times dataset.

In [None]:
w0, w1 = run_gradient_descent_tukey(w_initial=np.array([100, 0]), 
                                    alpha=10, xs=df['departure_hour'], ys=df['minutes'],
                                    verbose=True)[-1]
w0, w1

In [None]:
fig.add_trace(
    go.Scatter(x=[5.5, 11.5], y=[w0 + w1 * 5.5, w0 + w1 * 11.5], mode='lines', line=dict(color='gold'), name='Best Line (Mean Tukey Loss)')
)

Knowing what you know about Tukey loss, how does the <b><span style="color:red">line that minimizes mean squared error</span></b> compare to the <b><span style="color: gold">line that minimizes mean Tukey loss</span></b>? You don't need to write the answer to this question – or any of the leading questions in this notebook – anywhere, but you _should_ think about them.

## Finish Line 🏁

Congratulations! You're ready to submit Homework 10. Remember, Homework 10 is fully autograded, so you only need to submit it once.

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under **"Homework 10"**. Make sure your notebook is still named `hw10.ipynb` and the name has not been changed.
5. Stick around while the Gradescope autograder grades your work.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.