> <p><small><small>Copyright 2022 DeepMind Technologies Limited.</small></small></p>
> <p><small><small> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at </small></small></p>
> <p><small><small> <a href="https://www.apache.org/licenses/LICENSE-2.0">https://www.apache.org/licenses/LICENSE-2.0</a> </small></small></p>
> <p><small><small> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. </small></small></p>

# **Introduction to Supervised Learning 1 - Regression**
<a href="https://colab.research.google.com/github/deepmind/educational/blob/master/colabs/summer_schools/intro_to_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to a practical session that will teach you a few basic concepts used across modern machine learning.
The practical assumes prior knowledge of NumPy, as well as basic linear algebra.

**Learning objectives**

This practical is designed to help you see the wood (some basic concepts in supervised learning) for the trees (the ever-growing body of approaches).
In this practical you will predict a real-valued output $y$ from a scalar input $x$.
Don't worry, you will encounter multivariate inputs soon in the next practical (Introduction to Supervised Learning 2 - Classification).

In this practical, you will:
- learn what a model for data is and what model parameters are;
- experience modeling assumptions;
- experience basic optimization of a loss function to fit data;
- get introduced to automatic differentiation;
- extend linear features to nonlinear features in a linear model;
- learn about underfitting and overfitting and basic regularization.

**Disclaimer**

This code is intended for educational purposes, and in the name of readability for a non-technical audience does not always follow best practices for software engineering.


# **Linear Regression**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

import jax
import jax.numpy as jnp

%matplotlib inline
%config InlineBackend.figure_format="retina"

### **Let's plot basic data!**

In [None]:
x_data_list = [1, 2, 3, 4, 5]
y_data_list = [3, 2, 3, 1, 0]

def plot_basic_data(parameters_list=None, title="Observed data"):
  xlim = [-1, 7]
  fig, ax = plt.subplots()
  
  if parameters_list is not None:
    x_pred = np.linspace(xlim[0], xlim[1], 100)
    for parameters in parameters_list:
      y_pred = parameters[0] + parameters[1] * x_pred
      ax.plot(x_pred, y_pred, ':', color=[1, 0.7, 0.6])

    parameters = parameters_list[-1]
    y_pred = parameters[0] + parameters[1] * x_pred
    ax.plot(x_pred, y_pred, "-", color=[1, 0, 0], lw=2)

  ax.plot(x_data_list, y_data_list, "ob")
  ax.set(xlabel="Input x", ylabel="Output y",
         title=title,
         xlim=xlim, ylim=[-2, 5])
  ax.grid()

plot_basic_data()

In [None]:
parameters_list = []   # A list of parameters for the next exercise.

## **Tuning parameters by hand...**

Above you can see some data points where we have outputs for each input.
We want to predict output $y$ given input values for $x$.
We start by modelling the data with a simple linear function $f(x) = \color{red}{w} x + \color{red}{b}$.
There are two numbers, $\color{red}{b}$ and $\color{red}{w}$, which we call the model's parameters.
If we change them, then $f(x)$ will change accordingly!

Your next challenge is to find a "good" setting of parameters $\color{red}{b}$ and $\color{red}{w}$ by hand.
"But I came here to learn about deep learning!" we hear you say. True.
But we are going to start small, and after this manual exercise, we'll ask you questions about assumptions that you made that you didn't even know you made! Ready? Ready!

**Exercise: Finding two "good" parameters**
1. Move the two sliders below to set $\color{red}{b}$ and $\color{red}{w}$, and press "Run cell" on the code cell below. 
2. Is your $f(x)$ close to the blue data points? Can you find a better fit?
3. Adjust the two sliders a bit more, and press "Run cell" again on the cell...
4. Is your $f(x)$ now closer to all the blue data points? Repeat step 3 until you get a manual fit that you are happy with.

In [None]:
b = 0.74 #@param {type:"slider", min:-1, max:8, step:0.01}
w = 0.88 #@param {type:"slider", min:-3, max:3, step:0.01}
print("Plotting line", w, "* x +", b)
parameters = [b, w]
parameters_list.append(parameters)
plot_basic_data(parameters_list,
                title="Observed data and my first predictions")

## **Weights and biases**
What happened to the function when you changed $\color{red}{b}$?
And when you changed $\color{red}{w}$? It's the intercept and slope, we hear you say!
We picked notation "$\color{red}{w}$" and "$\color{red}{b}$" for a reason, as our models will become more complicated than linear functions!
- "$\color{red}{w}$" stands for "weight", which is multiplied with $x$ (or in more complicated models, other functions of $x$).
- "$\color{red}{b}$" stands for "bias". It shifts the whole line up or down, and it is the model's prediction at $x = 0$.

## **You're a born optimizer!**

We will now plot your sequence of choices for $\color{red}{b}$ and $\color{red}{w}$ on a $(\color{red}{b}, \color{red}{w})$ axis. Press "run" on the cell below:

In [None]:
fig, ax = plt.subplots()
opt = {"head_width": 0.2, "head_length": 0.2,
       "length_includes_head": True, "color": "r"}
if parameters_list:  # Nothing to draw until at least one setting is recorded.
  b_old = parameters_list[0][0]
  w_old = parameters_list[0][1]
  for i in range(1, len(parameters_list)):
    b_next = parameters_list[i][0]
    w_next = parameters_list[i][1]
    ax.arrow(b_old, w_old, b_next - b_old, w_next - w_old, **opt)
    b_old, w_old = b_next, w_next

  ax.scatter(b_old, w_old, s=200, marker="o", color="y")
  bs = [parameters[0] for parameters in parameters_list]
  ws =  [parameters[1] for parameters in parameters_list]
  ax.scatter(bs, ws, s=40, marker='o', color='k')

ax.set(xlabel="Bias b", ylabel="Weight w",
       title="My sequence of b\'s and w\'s",
       xlim=[-1, 8], ylim=[-3, 3])
plt.show()

## **Is your neighbour a born optimizer?**

Look at the plot of your sequence of choices for $\color{red}{b}$ and $\color{red}{w}$.
Do you notice how they changed? If you're doing this practical in a group, pause here and compare your solution with that of your neighbours:

- Did they change $\color{red}{b}$ and $\color{red}{w}$ with big steps or small steps each time?
- Did they start with small steps, and then progress to bigger steps? Or the other way round? What about you?
- Did the magnitude of your previous steps influence your next choice? Why? Or why not?
- Did you all converge to roughly the same endpoint for $\color{red}{b}$ and $\color{red}{w}$, or did your sequences end up in different places?

## **Did you make any assumptions?**

Every model makes assumptions. One assumption that we made is that our model is a *linear* model, i.e. that our best guess for $y$ is $f(x) = \color{red}{w} x + \color{red}{b}$.
Turn to your neighbours and tell them how you would approach guessing $\color{red}{b}$ and $\color{red}{w}$ if:
- someone paid you a million dollars if you predicted $y$ correctly at $x = 4.9$, but paid you nothing if you predicted $y$ correctly at $x = 1.1$;
- one of the entries for $y$ actually contained a typo.

Would your solution change? Why? Or why not?

## **A loss function**

You tweaked two numbers (or parameters), $\color{red}{b}$ and $\color{red}{w}$, to find a good fit.
We want to define a function called a "loss function" that is at its smallest when *you* think the data fit is great, and that is big when the data fit is not so great.

Below, we've created a loss function $\mathrm{loss}(\color{red}{b}, \color{red}{w})$.
Is it small at the best $\color{red}{b}$ and $\color{red}{w}$ that you found manually?
Your sequence of choices for $\color{red}{b}$ and $\color{red}{w}$ is also plotted on the $(\color{red}{b}, \color{red}{w})$ axis.
Does your sequence progressively move toward a parameter setting for which the loss function is small?
We plotted two views of the loss function, so that it is easier to see the minimum *and* the function.

In [None]:
def l1_loss(b, w):
  loss = 0 * b
  for x, y in zip(x_data_list, y_data_list):
    f = w * x + b
    loss += np.abs(f - y)
  return loss / len(x_data_list)

bs, ws = np.linspace(-1, 8, num=25), np.linspace(-3, 3, num=25)
b_grid, w_grid = np.meshgrid(bs, ws)
loss_grid = l1_loss(b_grid, w_grid)

def plot_loss(parameters_list, title, show_stops=False):
  fig, ax = plt.subplots(1, 2, figsize=(18, 8),
                         subplot_kw={"projection": "3d"})
  ax[0].view_init(10, -30)
  ax[1].view_init(30, -30)

  if parameters_list is not None:
    b_old = parameters_list[0][0]
    w_old = parameters_list[0][1]
    loss_old = l1_loss(b_old, w_old)
    ls = [loss_old]

    for i in range(1, len(parameters_list)):
      b_next = parameters_list[i][0]
      w_next = parameters_list[i][1]
      loss_next = l1_loss(b_next, w_next)
      ls.append(loss_next)

      ax[0].plot([b_old, b_next], [w_old, w_next], [loss_old, loss_next],
                color="red", alpha=0.8, lw=2)
      ax[1].plot([b_old, b_next], [w_old, w_next], [loss_old, loss_next],
                color="red", alpha=0.8, lw=2)
      b_old, w_old, loss_old = b_next, w_next, loss_next

    if show_stops:
      ax[0].scatter(b_old, w_old, loss_old, s=100, marker="o", color="y")
      ax[1].scatter(b_old, w_old, loss_old, s=100, marker="o", color="y")
      bs = [parameters[0] for parameters in parameters_list]
      ws = [parameters[1] for parameters in parameters_list]
      ax[0].scatter(bs, ws, ls, s=40, marker="o", color="k")
      ax[1].scatter(bs, ws, ls, s=40, marker="o", color="k")
    else:
      ax[0].scatter(b_old, w_old, loss_old, s=40, marker='o', color='k')
      ax[1].scatter(b_old, w_old, loss_old, s=40, marker='o', color='k')

  ax[0].plot_surface(b_grid, w_grid, loss_grid, cmap=cm.coolwarm,
                     linewidth=0, alpha=0.4, antialiased=False)
  ax[1].plot_surface(b_grid, w_grid, loss_grid, cmap=cm.coolwarm,
                     linewidth=0, alpha=0.4, antialiased=False)
  ax[0].set(xlabel="Bias b", ylabel="Weight w", zlabel="Loss", title=title)
  ax[1].set(xlabel="Bias b", ylabel="Weight w", zlabel="Loss", title=title)
  plt.show()

plot_loss(parameters_list,
          "An example loss function and my sequence of b\'s and w\'s",
          show_stops=True)

## **Let's construct a loss function**

When you manually adjusted your parameters $\color{red}{b}$ and $\color{red}{w}$, you probably looked at how close each $f(x)$ was to the $y$ that it tries to predict.
Maybe you glanced at the distance from the red line to each of the blue dots, and imagined the average of the distances (marked in purple) below.
If the average was small, your fit was good!

<img src='https://storage.googleapis.com/dm-educational/assets/supervised-learning/loss-function-intro.svg' width='400'>

To formalize this notion, let $x_1 = 1$, $x_2 = 2$, $x_3 = 3$... and let $y_1 = 3$, $y_2 = 2$, $y_3 = 3$... The blue dots are therefore a sequence of input-output $(x, y)$ pairs.
Assuming that the order of the data points doesn't matter, and $n = 1, ..., N$ (where $N=5$ in our case) indexes the data, our loss will look something like this:

$\mathrm{loss}(\color{red}{b}, \color{red}{w}) = \frac{1}{N} \sum_{n=1}^N \mathrm{error}(\color{red}{b}, \color{red}{w} ; x_n, y_n)$.

We take all 5 purple bars (or errors), add them together, and then divide by 5 to get the average amount of purple ink used.
So far, we've said nothing about $\mathrm{error}(\color{red}{b}, \color{red}{w} ; x_n, y_n)$, except that:
- all the error terms depend on $\color{red}{b}$ and $\color{red}{w}$;
- each term only considers one data point $(x_n, y_n)$;
- it doesn't matter in which order we sum the purple bars.

What would you like the "error" to be? Choices, choices! We can just let it be the purple distance itself, so that the loss is the average of the purple distances,

$\mathrm{loss}(\color{red}{b}, \color{red}{w}) = \frac{1}{N} \sum_{n=1}^N \Big|y_n - \underbrace{(\color{red}{w} x_n + \color{red}{b})}_{f(x_n)} \Big|$.

We can also let the error be the *squared* distance, so that the loss is the average of the squared distances, also called the "mean squared error" (MSE).
We'll write the mean squared error in pencil in light gray for now, as we'll return to that choice later.

$\color{lightgray}{\mathrm{loss}(\color{red}{b}, \color{red}{w}) = \frac{1}{N} \sum_{n=1}^N \Big(y_n - \underbrace{(\color{red}{w} x_n + \color{red}{b})}_{f(x_n)} \Big)^2}$.

**Exercise**

Before we proceed, take a pencil and paper and draw a figure to illustrate the two error terms for one data point.
Let the horizontal axis be $f(x)$ and let the vertical axis be $\mathrm{error}(\color{red}{b}, \color{red}{w} ; x, y)$ for one data point.
- For which value of $f(x)$ is each error minimized?
- Explain to your neighbour how the penalties that the error terms give differ when $f(x)$ is close to $y$, and when $f(x)$ is far from $y$.
- Can you design your own error term? Explain to your neighbour or your tutor the motivation behind your choice.
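
To check your drawing numerically, here is a minimal sketch that evaluates both error terms for one data point. The target `y_target` and the candidate predictions in `f_values` are made-up values chosen only for illustration.

```python
import numpy as np

y_target = 2.0                             # one observed output y (made-up value)
f_values = np.array([2.0, 2.5, 3.0, 5.0])  # candidate predictions f(x)

abs_error = np.abs(f_values - y_target)    # grows linearly with the distance
sq_error = (f_values - y_target) ** 2      # grows quadratically with the distance

for f, a, s in zip(f_values, abs_error, sq_error):
  print(f"f(x) = {f:.1f}   absolute error = {a:.2f}   squared error = {s:.2f}")
```

Within a distance of 1 from `y_target` the squared error is the smaller of the two, and beyond that it is the larger one.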

## **Gradients**

When you manually tweaked $\color{red}{b}$ and $\color{red}{w}$ to adjust the red line, you tried to adjust it so that the fit would be better.
If you were an experienced manual parameter adjuster, you might even have adjusted the $\color{red}{b}$ and $\color{red}{w}$ so that the fit gets *maximally better* with each adjustment.
That notion is the **direction of maximum decrease** of the loss function, or minus the gradient.

Let's take the mean of absolute errors, $\mathrm{loss}(\color{red}{b}, \color{red}{w}) = \frac{1}{N} \sum_{n=1}^N |\color{red}{w} x_n + \color{red}{b} - y_n|$.
We want to know in which direction to adjust $\color{red}{b}$ **and** in which direction to adjust $\color{red}{w}$.
We require two gradients,

$\nabla_{\color{red}{b}} \mathrm{loss}(\color{red}{b}, \color{red}{w}) = \frac{1}{N} \sum_{n=1}^N
\begin{cases}
    -1 & \text{if } \color{red}{w} x_n + \color{red}{b} - y_n < 0 \\
    1 & \text{if } \color{red}{w} x_n + \color{red}{b} - y_n \ge 0
\end{cases}$ ,

$\nabla_{\color{red}{w}} \mathrm{loss}(\color{red}{b}, \color{red}{w}) = \frac{1}{N} \sum_{n=1}^N
\begin{cases}
    -x_n & \text{if } \color{red}{w} x_n + \color{red}{b} - y_n < 0 \\
    x_n & \text{if } \color{red}{w} x_n + \color{red}{b} - y_n \ge 0
\end{cases}$ .

*(For the mathematical purists, yes: the gradient of the absolute value is technically not defined at zero, but we'll ignore that technicality for the purposes of finding a minimum for now.)*

In the code snippet below, we compute the two gradients using a for-loop over examples.
This is just to illustrate how the gradient is computed. Very soon, we'll throw away the for-loop over data points and do it "all at once" in vectorized operations!


In [None]:
def manual_grad(b, w):
  grad_b = 0
  grad_w = 0
  for x, y in zip(x_data_list, y_data_list):
    f = w * x + b
    grad_b += np.sign(f - y)
    grad_w += np.sign(f - y) * x
  grad_b /= len(x_data_list)
  grad_w /= len(x_data_list)
  return grad_b, grad_w
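
One way to gain confidence in hand-derived gradients is to compare them with finite differences. The sketch below restates the data and the sign-based formula with arrays so that it runs on its own. The helper names `sign_grad` and `finite_difference_grad` and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

x_data = np.array([1, 2, 3, 4, 5])
y_data = np.array([3, 2, 3, 1, 0])

def l1_loss(b, w):
  return np.mean(np.abs(w * x_data + b - y_data))

def sign_grad(b, w):
  # The same formula as manual_grad, written with arrays instead of a loop.
  s = np.sign(w * x_data + b - y_data)
  return np.mean(s), np.mean(s * x_data)

def finite_difference_grad(b, w, eps=1e-4):
  # Central differences: nudge one parameter at a time and measure the slope.
  grad_b = (l1_loss(b + eps, w) - l1_loss(b - eps, w)) / (2 * eps)
  grad_w = (l1_loss(b, w + eps) - l1_loss(b, w - eps)) / (2 * eps)
  return grad_b, grad_w

b, w = 2.0, 3.0
print("Formula:           ", sign_grad(b, w))
print("Finite differences:", finite_difference_grad(b, w))
```

The two agree as long as no prediction sits exactly on a kink of the absolute value, where the gradient is not defined.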

## **Gradient descent: No more tuning parameters by hand!**

To get to the minimum of the loss function -- which hopefully gives us the data fit we want -- we want to repeatedly use the gradients to tweak the parameters $\color{red}{b}$ and $\color{red}{w}$ in the right direction.
But how? That opens a whole direction of research! The simplest idea is to start with an initial guess $\color{red}{b}$, and then repeatedly update

$\color{red}{b} \leftarrow \color{red}{b} - \color{blue}{\eta} \nabla_{\color{red}{b}} \mathrm{loss}(\color{red}{b}, \color{red}{w})$ ,

while also doing the same for $\color{red}{w}$.
The parameter $\color{blue}{\eta}$ just tells us how much we are going to scale the gradient before we use it to update our parameters:
are we going to try to walk downhill with big steps or small steps?
It is called a **learning rate**.
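
To build intuition for the learning rate, here is a minimal sketch on a made-up one-parameter loss, $(b - 3)^2$. This is a toy and not the loss used in this practical, and the function name `run_gradient_descent` is hypothetical.

```python
def run_gradient_descent(learning_rate, b=0.0, num_steps=20):
  # Toy loss (b - 3)^2 with gradient 2 * (b - 3); its minimum is at b = 3.
  for _ in range(num_steps):
    grad_b = 2.0 * (b - 3.0)
    b = b - learning_rate * grad_b
  return b

for learning_rate in [0.001, 0.1, 1.1]:
  b_final = run_gradient_descent(learning_rate)
  print(f"learning_rate = {learning_rate}: b after 20 steps = {b_final:.3f}")
```

On this toy problem a tiny learning rate barely moves `b`, a moderate one gets close to 3, and a large one overshoots further on every step. The absolute-value loss in this practical behaves differently in the details, so treat this only as a first intuition.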

**Exercise**
1. Run the code snippet below, and note the $(\color{red}{b}, \color{red}{w})$ trajectory as we use the gradient to (try to) get to the minimum.
2. Adjust the starting values for $\color{red}{b}$ or $\color{red}{w}$ or the value of $\color{blue}{\eta}$ and see how the resulting trajectory to the minimum changes.
3. Can you find a setting for $\color{blue}{\eta}$ where things start spiralling out of control and the loss gets bigger and bigger (and not smaller)?
4. Can you find a setting for $\color{blue}{\eta}$ so that we're still far away from the minimum after `200` parameter update steps?

In [None]:
b = -1                # Change me! Try 2, 7
w = -2                # Change me! Try -1, 2
learning_rate = 0.1   # Change me! Try 0.01, 0.5, ...

parameters_step_list = []

for _ in range(200):
  parameters_step_list.append([b, w]) 
  grad_b, grad_w = manual_grad(b, w)
  b = b - learning_rate * grad_b
  w = w - learning_rate * grad_w

plot_loss(parameters_step_list,
          "A loss function, and minimizing it with gradient descent")

## **Autodiff: No more manual gradients!**

You don't need to do the above calculations yourself thanks to automatic differentiation, or "autodiff"!
While you can probably derive and code the gradients of the loss function for our linear model without making a mistake somewhere, getting the gradients right for more complex models can be much more work. Much, much more work!
Before we introduce autodiff, you should spare a thought for the generations of machine learning researchers who did amazing work without it ;-)

Here is autodiff in action in the lines of code below. We construct a function called `auto_grad`,

`auto_grad = jax.grad(loss_function, argnums=(0, 1))`

and call it in the same way as we called `manual_grad`.

In [None]:
x = np.array(x_data_list)
y = np.array(y_data_list)

def loss_function(b, w):
  f = w * x + b
  errors = jnp.abs(y - f)
  # Instead of summing over individual data points in a for-loop, and then
  # dividing to get the average, we do it in one go. No more for-loops!
  return jnp.mean(errors)

# This is it! One line of code.
auto_grad = jax.grad(loss_function, argnums=(0, 1))

# Let's see if it works. Does auto_grad match our manual version?
b, w = 2.0, 3.0

grad_b, grad_w = auto_grad(b, w)
print("Autograd         grad_b:", grad_b, "  grad_w", grad_w)

grad_b, grad_w = manual_grad(b, w)
print("Manual gradients grad_b:", grad_b, "  grad_w", grad_w)

## **JAX and other tools**

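To automatically get the gradients of our loss function `loss_function(b, w)`, we called `jax.grad(loss_function, argnums=(0, 1))`, where `argnums=(0, 1)` asks for the gradients with respect to both arguments (by default, `jax.grad` differentiates with respect to the first argument only).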
The result was a function `auto_grad(b, w)`, which we could call for any setting of $\color{red}{b}$ and $\color{red}{w}$ to get $\nabla_{\color{red}{b}} \mathrm{loss}(\color{red}{b}, \color{red}{w})$ and $\nabla_{\color{red}{w}} \mathrm{loss}(\color{red}{b}, \color{red}{w})$.
This magic happened thanks to JAX.
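
The small sketch below illustrates the role of `argnums` on a made-up function `toy_loss`, which is not part of this practical. It is only meant to show the shape of what `jax.grad` returns.

```python
import jax

def toy_loss(b, w):
  # A made-up smooth function of two scalars, only for illustration.
  return (2.0 * b + 3.0 * w) ** 2

grad_first_only = jax.grad(toy_loss)            # argnums defaults to 0
grad_both = jax.grad(toy_loss, argnums=(0, 1))  # one gradient per argument

print(grad_first_only(1.0, 1.0))   # derivative with respect to b only
print(grad_both(1.0, 1.0))         # tuple: (d/db, d/dw)
```

At $(b, w) = (1, 1)$ the chain rule gives $2 \cdot 5 \cdot 2 = 20$ for $b$ and $2 \cdot 5 \cdot 3 = 30$ for $w$, which is what the two calls should report.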

[JAX](https://jax.readthedocs.io/en/latest/index.html) is NumPy on the CPU, GPU, and TPU, with great automatic differentiation for high-performance machine learning research.
It does a lot more than automatic differentiation!
You'll notice that we didn't use a for-loop like in our illustrative `manual_grad(b, w)` function. We
- put all the data points $(x_1, y_1)$ to $(x_N, y_N)$ in vectors `x` and `y`;
- computed $\color{red}{w}$ * `x` + $\color{red}{b}$ in one line to give a vector `f`, after which `y - f` gave another vector, and `errors = jnp.abs(y - f)` gave the error for each data point in a **vector**;
- computed `jnp.mean(errors)`, the average of the errors for each data point.

You'll notice `jnp`, short for `jax.numpy`.
Writing the loss with `jnp` instead of `np` is what lets JAX differentiate your NumPy-style code, and compile and run it on accelerators like GPUs and TPUs!
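
As a hedged illustration of the "compile" part, the sketch below wraps the gradient function in `jax.jit`. The name `fast_grad` is an assumption for this example, and no speed-up is claimed for a dataset this small.

```python
import jax
import jax.numpy as jnp

x = jnp.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = jnp.array([3.0, 2.0, 3.0, 1.0, 0.0])

def loss_function(b, w):
  return jnp.mean(jnp.abs(y - (w * x + b)))

# jax.jit traces the function on its first call and compiles it with XLA.
fast_grad = jax.jit(jax.grad(loss_function, argnums=(0, 1)))

grad_b, grad_w = fast_grad(2.0, 3.0)
print("grad_b:", grad_b, " grad_w:", grad_w)
```

The compiled function is called exactly like `auto_grad`, so it could be dropped into the same gradient descent loop.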

We now minimize the loss function again with gradient descent, this time using the handy automatic gradients that [JAX](https://jax.readthedocs.io/en/latest/index.html) gave us.

In [None]:
b, w = -1.0, -2.0
learning_rate = 0.1

parameters_step_list = []

for _ in range(200):
  parameters_step_list.append([b, w])
  grad_b, grad_w = auto_grad(b, w)      # Now with JAX automatic differentiation
  b = b - learning_rate * grad_b
  w = w - learning_rate * grad_w

plot_loss(parameters_step_list, "Gradient descent with automatic gradients")

## [Optional] **Analytical solution to linear regression with quadratic loss**
Before we move to models that are nonlinear in $x$ we'll show the analytical solution to linear regression under a *mean squared error* loss function.
As a reminder, the MSE loss is the average of quadratic functions

$\mathrm{loss}(\color{red}{b}, \color{red}{w}) = \frac{1}{N} \sum_{n=1}^N \big(y_n - (\color{red}{w} x_n + \color{red}{b}) \big)^2$.

It is indeed a special case! If we replaced $(y_n - (\color{red}{w} x_n + \color{red}{b}) )^2$ with $|y_n - (\color{red}{w} x_n + \color{red}{b})|$ -- like we did earlier -- we wouldn't be able to derive a closed-form analytical solution any more.

Like we've done before, we will put all data points $(x_n, y_n)$ in vectors, and instead of treating $\color{red}{b}$ and $\color{red}{w}$ separately, we'll treat them together as a vector $(\color{red}{w}, \color{red}{b})$.
We convert to vector notation by defining the following terms. (The advantage of using vector notation is that the same derivation applies to any number of parameters, as we will see later!)

$\color{red}{\mathbf{w}} =
\begin{pmatrix}
    \color{red}{w} \\
    \color{red}{b}
\end{pmatrix} \ , \qquad
\mathbf{X} =
\begin{pmatrix}
    x_1 & 1 \\
    x_2 & 1 \\
    \vdots & \vdots\\
    x_N & 1
\end{pmatrix} \ , \qquad
\mathbf{y} =
\begin{pmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_N
\end{pmatrix}$

**Exercise**

Using this notation, convince yourself that the *mean squared error* loss is equivalent to

$\mathrm{loss}(\color{red}{\mathbf{w}}) = \frac{1}{N} (\mathbf{y} - \mathbf{X}\color{red}{\mathbf{w}})^\mathsf{T}(\mathbf{y} - \mathbf{X}\color{red}{\mathbf{w}})$.

See [this blog post](https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/) for more details.
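
If you would like a numerical sanity check to go with the algebra, the sketch below evaluates the sum form and the matrix form of the loss on the five data points from earlier. The parameter values `w` and `b` are arbitrary and chosen only for the check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 2.0, 3.0, 1.0, 0.0])
w, b = -0.5, 3.0   # arbitrary parameter values, chosen only for the check

# Sum form: average the squared errors one data point at a time.
loss_sum = sum((y_n - (w * x_n + b)) ** 2 for x_n, y_n in zip(x, y)) / len(x)

# Matrix form: stack [x_n, 1] into X and [w, b] into a parameter vector.
X = np.column_stack([x, np.ones_like(x)])
w_vec = np.array([w, b])
residual = y - X @ w_vec
loss_matrix = residual @ residual / len(x)

print(loss_sum, loss_matrix)
```

Both expressions should print the same number, because `residual @ residual` is exactly the sum of the squared entries of the residual vector.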

**Exercise**

Convince yourself that the MSE loss is minimized at

$\color{red}{\mathbf{w}^*} = (\mathbf{X}^\mathsf{T} \mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y}$

For illustrative purposes we'll consider the *mean squared error* and analytical solution below, as knowing the minimum in the convex quadratic case (as we have here) will help us develop intuition around basic gradient descent as well.
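
As a small worked example, the sketch below applies this formula to the five data points from the start of the practical. It solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly. That is a numerical choice made for this sketch and gives the same minimizer as the formula above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 2.0, 3.0, 1.0, 0.0])
X = np.column_stack([x, np.ones_like(x)])

# Solve (X^T X) w = X^T y instead of forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print("w, b =", w_star)

# At the minimum the gradient (2/N) X^T (X w - y) vanishes.
gradient = 2.0 / len(x) * X.T @ (X @ w_star - y)
print("gradient at w*:", gradient)
```

For this dataset the solution is $w = -0.7$ and $b = 3.9$, which you can compare with the line you found by hand.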

# **Non-linear regression**

So far we've looked at data that could be fitted fairly accurately with a line. Despite its simplicity, linear regression tends to be very useful in practice, especially as a starting point in data analysis! However, there are cases where a linear fit is unsatisfying. For example, consider the following dataset in blue with the best linear fit in red:

In [None]:
def generate_wave_like_dataset(min_x=-1, max_x=1, n=100):
  xs = np.linspace(min_x, max_x, n)
  ys = np.sin(5 * xs) + np.random.normal(size=len(xs), scale=0.2)
  return xs, ys

def regression_analytical_solution(X, y):
  return ((np.linalg.inv(X.T.dot(X))).dot(X.T)).dot(y)

def gradient_descent(X, y, learning_rate=1e-2, num_steps=1000):
  # Report roughly ten times; never less often than every step.
  report_every = max(1, num_steps // 10)

  def loss_fn(current_w, X, y):
    y_hat = jnp.dot(X, current_w)
    loss = jnp.mean((y_hat - y) ** 2)
    # Return the predictions as auxiliary output alongside the loss.
    return loss, y_hat

  loss_and_grad = jax.value_and_grad(loss_fn, has_aux=True)
  # Initialize the parameters
  w = np.random.normal(size=(X.shape[1]))

  # Run a few steps of gradient descent
  for i in range(num_steps):
    # Use the `y` passed to this function, not the global `ys`.
    (loss, y_hat), grad = loss_and_grad(w, X, y)

    if i % report_every == 0:
      print(f"Step {i}: w: {w}, Loss: {loss}, Grad: {grad}")

    w = w - learning_rate * grad

  return w

def plot_data(y_hat, xs, ys, title):
  plt.figure()
  plt.scatter(xs, ys, label="Data")
  plt.plot(xs, y_hat, 'r', label=title)

  plt.title(title)
  plt.xlabel("Input x")
  plt.ylabel("Output y")
  plt.legend();

In [None]:
xs, ys = generate_wave_like_dataset()
X = np.vstack([xs, np.ones(len(xs))]).T
w = regression_analytical_solution(X, ys)
y_hat = X.dot(w)

plot_data(y_hat, xs, ys, "Linear regression (analytic minimum)")

The linear regression is clearly missing some important wave-like structure in this data. This is also known as **under-fitting**.

## **From $x$ to a feature vector**

Imagine you didn't know that the data generating process was a noisy sine-wave, and instead were given this data by someone else.
How would you proceed? One option is to consider increasing the complexity of the linear model by trying to fit a higher order polynomial, for example a 4th degree polynomial:
$\hat{y} = \color{red}{w_4}x^4 + \color{red}{w_3}x^3 + \color{red}{w_2}x^2 + \color{red}{w_1}x + \color{red}{w_0}$

Luckily we can still solve for the least squares parameters $\color{red}{w_4}, \color{red}{w_3}, \color{red}{w_2}, \color{red}{w_1}, \color{red}{w_0}$ using the same techniques we used for fitting a line. 

Given the dataset $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$, we construct a *feature* matrix $\mathbf{\Phi}$ by expanding the original features, being careful to include terms corresponding to each power of $x$, as follows:

$\mathbf{\Phi} =
\begin{pmatrix}
x_1^4 & x_1^3 & x_1^2 & x_1 & 1 \\
x_2^4 & x_2^3 & x_2^2 & x_2 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
x_n^4 & x_n^3 & x_n^2 & x_n & 1
\end{pmatrix}
$

And just like before, our $\mathbf{y}$ vector is $(y_1, y_2, ..., y_n)^\mathsf{T}$
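
To make the layout of $\mathbf{\Phi}$ concrete, here is a minimal sketch that builds it for three made-up inputs (`xs_demo` is hypothetical). It compares the hand-stacked version with NumPy's `np.vander`, which orders its columns by decreasing power by default.

```python
import numpy as np

xs_demo = np.array([-1.0, 0.5, 2.0])   # three made-up inputs

# Stack the powers by hand, one row per power, then transpose.
phi_manual = np.vstack([xs_demo ** 4, xs_demo ** 3, xs_demo ** 2,
                        xs_demo, np.ones(len(xs_demo))]).T
phi_vander = np.vander(xs_demo, N=5)   # columns x^4, x^3, x^2, x, 1

print(phi_vander)
print("Same matrix:", np.allclose(phi_manual, phi_vander))
```

Each row holds the features of one data point, so the matrix has one row per input and one column per weight.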

Next we fit a 4th degree polynomial to our data and find that the fit is visually a lot better and captures the wave-like pattern of the data! 


In [None]:
def create_data_matrix(xs):
  return np.vstack([np.power(xs, 4),
                    np.power(xs, 3),
                    np.power(xs, 2),
                    xs,
                    np.ones(len(xs))]).T

phi = create_data_matrix(xs)

## **Gradient descent with JAX**

We fit our fourth degree polynomial using automatic differentiation and gradient descent!

In [None]:
w = gradient_descent(phi, ys, num_steps=5000)
print("w:", w)
y_hat = phi.dot(w)

plot_data(y_hat, xs, ys, 'Polynomial regression (gradient descent steps)')

## [Optional] **An analytic solution**

Our fourth degree polynomial is still linear in the weights $\color{red}{\mathbf{w}}$, and we can fit our polynomial with
$\color{red}{\mathbf{w}^*} = (\mathbf{\Phi}^\mathsf{T} \mathbf{\Phi})^{-1} \mathbf{\Phi}^\mathsf{T} \mathbf{y}$ as before.

In [None]:
w = regression_analytical_solution(phi, ys)
print("w:", w)
y_hat = phi.dot(w)

plot_data(y_hat, xs, ys, "Polynomial regression (analytic solution)")

## **Can you make the fit better?**

Are you pleased with the fit? We certainly aren't! Revisit the code earlier in the Colab, and see if you could adapt the features matrix $\mathbf{\Phi}$ to include higher powers than $x^4$. It was coded as:

```
phi = np.vstack([np.power(xs, 4), np.power(xs, 3),
                 np.power(xs, 2), xs, np.ones(len(xs))]).T
```

If you rerun the cells, do higher powers help? What about:

```
phi = np.vstack([np.power(xs, 5), np.power(xs, 4), np.power(xs, 3),
                 np.power(xs, 2), xs, np.ones(len(xs))]).T
```

What about other features? What if you included periodic features, and used features like $[1, x, x^2, \sin(x)]$?

```
phi = np.vstack([np.sin(xs),
                 np.power(xs, 2),
                 xs,
                 np.ones(len(xs))]).T
```

In our examples here, we could analytically determine the global minimum $\color{red}{\mathbf{w}^*}$. In some of the examples, basic gradient descent took a long time to get close to $\color{red}{\mathbf{w}^*}$. Why do you think that is?
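
One possible lens on this question, offered as a sketch and not as the only answer, is the condition number of $\mathbf{\Phi}^\mathsf{T} \mathbf{\Phi}$. A large value means the quadratic loss surface is a long, narrow valley. The inputs `xs_demo` below are a noise-free stand-in for the practical's `xs`.

```python
import numpy as np

xs_demo = np.linspace(-1, 1, 100)
X_line = np.vander(xs_demo, N=2)   # features [x, 1]
X_poly = np.vander(xs_demo, N=5)   # features [x^4, x^3, x^2, x, 1]

# Ratio of the largest to the smallest curvature of the quadratic loss.
cond_line = np.linalg.cond(X_line.T @ X_line)
cond_poly = np.linalg.cond(X_poly.T @ X_poly)

print("Condition number, line:      ", cond_line)
print("Condition number, polynomial:", cond_poly)
```

With a single learning rate, the step size has to stay small enough for the steepest direction, which leaves progress along the flattest direction slow.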

## **What happens if we extend our predictions out a bit?**

*We will now return to the fourth-order polynomial.*

In the plot below we fill in some extra data points from the true function (in orange) for comparison, but bear in mind that these were not used to fit the regression model. We are **extrapolating** the model into a previously unseen region!
We see that while the fit looks good in the blue region that the model was fitted on, the fit seems to diverge significantly in the orange region.
The model is able to **interpolate** well (fill in gaps in the region it was fitted on), but it **extrapolates** (outside the fitted region) poorly.
This is a common concern with models in general: unless you can be sure that you have the correct *inductive biases* (assumptions about the data generating process) built into the model, you should be cautious about extrapolating from it.

To simplify the explanation, we will use the analytic minimum of the loss function below.

In [None]:
# Recover the analytic solution.
phi = create_data_matrix(xs)
w = regression_analytical_solution(phi, ys)

# Extend the x's and y's.
more_xs, more_ys = generate_wave_like_dataset(min_x=-1.2, max_x=-1, n=20)
all_xs = np.concatenate([more_xs, xs])
all_ys = np.concatenate([more_ys, ys])

# Get the design matrix for the extended data, so that we could make predictions
# for it.
phi = create_data_matrix(all_xs)

# Note that we don't recompute w, we use the previously computed values that
# only saw x values in the range [-1, 1]
y_hat = phi.dot(w)

plt.scatter(xs, ys, label="Data")
plt.scatter(more_xs, more_ys, label="Unseen Data")
plt.plot(all_xs, y_hat, 'r', label='Polynomial Regression')

plt.title("A wave-like dataset with the fourth-degree polynomial fit")
plt.xlabel("Input x")
plt.ylabel("Output y")
plt.legend()
plt.show()

# **Generalisation**

In the previous example you've seen how the red line -- our fourth order polynomial -- does not extrapolate well. It does not generalise well to unseen examples.
You've also seen how it interpolates the observed data with varying degrees of success.
We will now introduce the themes of under-fitting and over-fitting data.

## **Under-fitting**
We have already seen an example of **under-fitting**, where we tried to fit a linear model to a noisy sine-wave dataset.
The model did not explain the *training data* (the data that was used to fit the model) well, nor would we expect it to *generalise* to any previously unseen data from the same data generating process.
Under-fitting is usually the result of a model that is not complex enough to fit the given data.
One can detect under-fitting by observing a loss that plateaus at a larger than expected value, or by using other metrics such as "r-squared" from statistics, which measures "goodness of fit".
As we saw already, the easiest way to address under-fitting is by increasing the complexity of the model, for example by adding more parameters as we did in the case of the 4th degree polynomial.
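
As a rough sketch of the "r-squared" idea, the code below compares a line with a 4th degree polynomial on a freshly generated noisy sine wave. The helper `r_squared`, the seed, and the use of `np.linalg.lstsq` for the fit are assumptions made for this illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
  ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
  ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
  return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
xs_demo = np.linspace(-1, 1, 100)
ys_demo = np.sin(5 * xs_demo) + rng.normal(scale=0.2, size=xs_demo.shape)

scores = {}
for degree in [1, 4]:
  features = np.vander(xs_demo, N=degree + 1)   # powers x^degree, ..., x, 1
  w_fit, *_ = np.linalg.lstsq(features, ys_demo, rcond=None)
  scores[degree] = r_squared(ys_demo, features @ w_fit)
  print(f"degree {degree}: r-squared = {scores[degree]:.3f}")
```

A value near 1 means the model explains most of the variation in the training data, and a value near 0 means it does little better than predicting the mean.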

## **Over-fitting**
Over-fitting, on the other hand, is the case where the model fits the training data very well but **fails to generalise** to previously unseen data from the same data generating process.
This is usually the result of the model having sufficient degrees of freedom to fit the noise in the training data.
Let's look at a new dataset, consisting of just 5 points, to illustrate this.

In [None]:
np.random.seed(2)

xs_small, ys_small = generate_wave_like_dataset(n=5)

plt.figure()
plt.scatter(xs_small, ys_small)
plt.title("A small dataset generated from a sine wave with measurement noise")
plt.xlabel("Input x")
plt.ylabel("Output y")
plt.show()

Now, as we did before, we fit a 4th degree polynomial to this dataset using regression.

In [None]:
phi_small = create_data_matrix(xs_small)
w_first = regression_analytical_solution(phi_small, ys_small)

xs_plot = np.linspace(-1, 1, 1000)
phi_plot = create_data_matrix(xs_plot)
y_hat = phi_plot.dot(w_first)

plt.figure()
plt.scatter(xs_small, ys_small, label="Data")
plt.plot(xs_plot, y_hat, c='r', label="Model")
plt.ylim(-4, 4)

plt.legend()
plt.title("5 point dataset and regression model fit")
plt.xlabel("x")
plt.ylabel("y");
plt.show()

print(w_first)
loss = np.mean(np.power(phi_small.dot(w_first) - ys_small, 2))
print("Loss:", loss)

As we can see, the model fits the training points very well, passing exactly through each of them. The loss is essentially 0, again indicating that the fit to the training data is extremely good. But will it be that good in practice?

Let's plot a few more points from the original function (in orange below) and see how well the model fits these points.

In [None]:
xs_more, ys_more = generate_wave_like_dataset(n=32)

plt.figure()
plt.scatter(xs_small, ys_small, label="Training Data")
plt.scatter(xs_more, ys_more, label="Unseen Data")
plt.plot(xs_plot, y_hat, c="r", label="Model")
plt.ylim(-4, 4)

plt.legend()
plt.title("5 point dataset and regression model fit")

plt.xlabel("x")
plt.ylabel("y");

phi_more = create_data_matrix(xs_more)
loss_training = np.mean(np.power(phi_small.dot(w_first) - ys_small, 2))
loss_validation = np.mean(np.power(phi_more.dot(w_first) - ys_more, 2))
print("Loss on training data:", loss_training)
print("Loss on unseen data:  ", loss_validation)

The loss on this previously unseen data is much larger than on the training data.
This is a sign of over-fitting.
The model fits the training data very well but fails to generalise to previously unseen data, *even in close vicinity to the training data*.

## **What shall we do? Pause here!**

Before progressing with this practical, take a moment to think about the problem. In machine learning, there are many practical approaches to getting a model that generalizes well. As you can guess, much theory is devoted to the problem too!

With what you've seen so far, try to explain to your neighbour

1.   every factor that you can think of, that could cause a model to generalize poorly;
2.   some ideas that you could think of to improve the model's fit to (unseen) data;
3.   any underlying assumptions that you are making about unseen data.

Don't proceed until you've had a solid discussion on the topic. If someone is tutoring this practical, they might contribute to the discussion!

We shall now look at some approaches.

## **Train, validate, test**

Before we implement solutions for overfitting, we first introduce a technique for detecting it based on the idea of the loss being low on the training data and higher on unseen data.
This technique is known as **model selection**.
We split our data into two sets, a **training set** and a **validation set**.

We will use gradient descent to fit the model parameters.
While fitting the model to the training set, we periodically evaluate it on the validation set.
While training, the loss initially decreases on both the training set and the validation set.
However, after some number of training steps, the loss may start increasing on the validation set while continuing to decrease on the training set.
This is an indication that our model's performance on unseen data will also get worse, even though our training curves look great.
For small datasets, a more general procedure is to create many different training and validation splits of the same data and to average the validation loss over them; this technique is called **cross-validation**.
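
A minimal sketch of k-fold cross-validation, assuming a small hypothetical dataset and NumPy only: the data is split into `num_folds` parts, each part is used once for validation while the model is fitted on the rest, and the validation losses are averaged. The helper names (`polynomial_features`, `k_fold_cross_validation`) and the dataset are made up for this illustration, and `np.linalg.lstsq` stands in for the analytical least squares solution.

```python
import numpy as np

def polynomial_features(x, degree):
  """Returns a matrix whose columns are 1, x, x^2, ..., x^degree."""
  return np.vander(x, degree + 1, increasing=True)

def k_fold_cross_validation(phi, y, num_folds, rng):
  """Returns the validation loss averaged over num_folds splits."""
  folds = np.array_split(rng.permutation(len(y)), num_folds)
  val_losses = []
  for i, val_idx in enumerate(folds):
    # Train on every fold except the i-th one, validate on the i-th one.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    w, *_ = np.linalg.lstsq(phi[train_idx], y[train_idx], rcond=None)
    val_losses.append(np.mean((phi[val_idx].dot(w) - y[val_idx]) ** 2))
  return float(np.mean(val_losses))

# Hypothetical small dataset: a noisy sine wave.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)

cv_losses = {}
for degree in [1, 4]:
  cv_losses[degree] = k_fold_cross_validation(
      polynomial_features(x, degree), y, num_folds=5, rng=rng)
  print("Degree", degree, "cross-validated loss:", cv_losses[degree])
```

Every data point is used for validation exactly once, which makes better use of a small dataset than a single split.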

In [None]:
def gradient_descent_with_validation(phi_train, y_train, phi_val, y_val,
                                     learning_rate=1e-1, num_steps=1000):
  def loss(current_w, phi, y):
    y_hat = jnp.dot(phi, current_w)
    loss = jnp.mean((y_hat - y) ** 2)
    return loss, y_hat

  loss_and_grad = jax.value_and_grad(loss, has_aux=True)

  # Initialize the parameters
  w =  0.2 * np.random.normal(size=(phi_train.shape[1]))

  train_losses = []
  val_losses = []

  # Run a few steps of gradient descent
  for i in range(num_steps):
    (train_loss, y_hat), grad = loss_and_grad(w, phi_train, y_train)
    (val_loss, y_hat), _ = loss_and_grad(w, phi_val, y_val)

    train_losses.append(train_loss)
    val_losses.append(val_loss)

    w = w - learning_rate * grad

  plt.figure()
  plt.plot(range(len(train_losses)), train_losses, label="Train")
  plt.plot(range(len(val_losses)), val_losses, label="Validation")
  plt.title("Train and validation loss curves")
  plt.xlabel("Step")
  plt.ylabel("Loss")
  plt.legend()

  return w

In [None]:
np.random.seed(2)

n=10
x_all, y_all = generate_wave_like_dataset(n=n)

# In practice, a smaller proportion of data is used for validation. The example
# below is for illustrative purposes only.
training_set = np.random.permutation(n) < 0.5 * n
x_train = x_all[training_set]
y_train = y_all[training_set]
phi_train = create_data_matrix(x_train)

x_val = x_all[~training_set]
y_val = y_all[~training_set]
phi_val = create_data_matrix(x_val)

w_gd_val = gradient_descent_with_validation(phi_train, y_train, phi_val, y_val)

y_hat = phi_plot.dot(w_gd_val)

plt.figure()
plt.scatter(x_train, y_train, label="Training Data")
plt.scatter(x_val, y_val, label="Validation Data")
plt.plot(xs_plot, y_hat, c="r", label="Model")
plt.ylim(-4, 4)

plt.legend()
plt.title("Train and validation dataset with regression model fit")

plt.xlabel("x")
plt.ylabel("y");

If you look at the train and validation loss curves above, you'll notice a point where the validation error becomes bigger again, even though the training error continues to shrink.
Shall we stop our training process there, and not minimize the training loss exactly?
It is a reasonable practical suggestion, as the validation error tells us something about the model overfitting the data.

### **Early stopping**



This technique of stopping training when the validation loss is minimised can also be automated in your training script by monitoring the validation loss and stopping as soon as it starts increasing. This technique is known as **early stopping** and is a form of model **regularization**.
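
One way this automation might look is sketched below, in NumPy rather than JAX to keep it self-contained. It is an illustration under assumptions: the `patience` parameter (how many steps without improvement we tolerate before stopping), the helper names, and the small dataset are all made up for this example and are not part of the practical's code.

```python
import numpy as np

def mse(w, phi, y):
  return float(np.mean((phi.dot(w) - y) ** 2))

def mse_grad(w, phi, y):
  # Gradient of the mean squared error with respect to w.
  return 2.0 / len(y) * phi.T.dot(phi.dot(w) - y)

def gradient_descent_early_stopping(phi_train, y_train, phi_val, y_val,
                                    learning_rate=0.1, max_steps=1000,
                                    patience=20):
  w = np.zeros(phi_train.shape[1])
  best_w, best_val_loss, best_step = w.copy(), mse(w, phi_val, y_val), 0
  for step in range(1, max_steps + 1):
    w = w - learning_rate * mse_grad(w, phi_train, y_train)
    val_loss = mse(w, phi_val, y_val)
    if val_loss < best_val_loss:
      # Remember the parameters with the lowest validation loss so far.
      best_w, best_val_loss, best_step = w.copy(), val_loss, step
    elif step - best_step >= patience:
      # No improvement for `patience` steps: stop training.
      break
  return best_w, best_val_loss, best_step

# Hypothetical dataset: noisy sine wave with 4th degree polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=x.shape)
phi = np.vander(x, 5, increasing=True)

best_w, best_val_loss, best_step = gradient_descent_early_stopping(
    phi[:5], y[:5], phi[5:], y[5:])
print("Best validation loss", best_val_loss, "at step", best_step)
```

Returning the best parameters seen so far, rather than the last ones, means a noisy validation curve does not cost us the best model.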

Let's now stop training at around 200 training steps.

In [None]:
w_early_stopping = gradient_descent_with_validation(
    phi_train, y_train, phi_val, y_val, num_steps=200)

y_hat = phi_plot.dot(w_early_stopping)

plt.figure()
plt.scatter(x_train, y_train, label="Training Data")
plt.scatter(x_val, y_val, label="Validation Data")
plt.plot(xs_plot, y_hat, c="r", label="Model")
plt.ylim(-4, 4)

plt.legend()
plt.title("Train and validation dataset with regression model fit")

plt.xlabel("x")
plt.ylabel("y");

### **Regularisation**

Regularisation refers to a set of techniques that are used to control and prevent overfitting in machine learning models.
We've already seen an example of regularisation in the form of **early stopping**.

To find a clue as to why the model we found by early stopping generalized better than the first model, which overfit, let's compare their parameter vectors.
The magnitudes of the individual components, as well as the L2-norm (the square root of the sum of the squared components of the vector), are smaller for our "early stopped" model.

In [None]:
print("First model parameters:", w_first)
print("First model parameter L2-norm:", np.linalg.norm(w_first, 2))
print()
print("Early stopping model parameters:", w_early_stopping)
print("Early stopping model parameter L2-norm:", np.linalg.norm(w_early_stopping, 2))

Notice how the magnitude (as measured by the L2 norm) of the first parameter vector is significantly larger than the magnitude of the early stopping parameter vector.
It is often the case that a model that overfits has a larger magnitude parameter vector than one that doesn't.
A well-known and principled way of controlling the magnitude of the parameter vector is to add a "regularization term" to the objective function.
This term discourages weight vectors from having large magnitudes.

In particular, we add an
**L2-regularisation term** $\|\color{red}{\mathbf{w}}\|_2^2$ -- which is scaled by a coefficient $\lambda > 0$ -- to the 
linear regression loss function that we encountered earlier,
$\mathrm{loss}(\color{red}{\mathbf{w}}) =  \frac{1}{N} (\mathbf{y} - \mathbf{\Phi}\color{red}{\mathbf{w}})^\mathsf{T}(\mathbf{y} - \mathbf{\Phi}\color{red}{\mathbf{w}})$.
Our objective is then to minimize a new loss function

$\underset{\color{red}{\mathbf{w}}}{\text{minimize}} \
\frac{1}{N} (\mathbf{y} - \mathbf{\Phi}\color{red}{\mathbf{w}})^\mathsf{T}(\mathbf{y} - \mathbf{\Phi}\color{red}{\mathbf{w}})
+ \lambda \|\color{red}{\mathbf{w}}\|_2^2$

with respect to parameters $\color{red}{\mathbf{w}}$, where $\|\color{red}{\mathbf{w}}\|_2^2 = \sum_{j} \color{red}{w_j}^2$.

The optimizer now has to *trade off* improving the fit to the training data (a decrease in the first term) against the increase in magnitude of the parameter vector (an increase in the second term).
The *hyperparameter* $\lambda$ controls the strength of this trade-off and is something that you as a user can tune.
We can tune it by fitting the model on the training set for a few values of this hyperparameter (this is called a **hyperparameter sweep**) and picking the value that gives the smallest error on the validation set.
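
To make this trade-off concrete, here is a small sketch that minimizes the regularized loss with plain gradient descent for a few values of $\lambda$ and reports the training loss and the L2-norm of the resulting parameter vector. The dataset, the helper name `fit_regularized`, and the learning rate are assumptions chosen for this illustration.

```python
import numpy as np

def fit_regularized(phi, y, regularization_coef, learning_rate=0.05,
                    num_steps=2000):
  """Gradient descent on mean squared error + regularization_coef * ||w||^2."""
  w = np.zeros(phi.shape[1])
  for _ in range(num_steps):
    grad_fit = 2.0 / len(y) * phi.T.dot(phi.dot(w) - y)  # data-fit term
    grad_penalty = 2.0 * regularization_coef * w          # L2 penalty term
    w = w - learning_rate * (grad_fit + grad_penalty)
  return w

# Hypothetical dataset: noisy sine wave with 4th degree polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=x.shape)
phi = np.vander(x, 5, increasing=True)

norms = []
for regularization_coef in [0.0, 0.01, 0.1, 1.0]:
  w = fit_regularized(phi, y, regularization_coef)
  train_loss = np.mean((phi.dot(w) - y) ** 2)
  norms.append(np.linalg.norm(w, 2))
  print("lambda:", regularization_coef, "train loss:", train_loss,
        "L2-norm:", norms[-1])
```

Larger values of $\lambda$ give parameter vectors with a smaller norm, at the price of a worse fit to the training data.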

There is an analytical solution to this optimization problem:

$\color{red}{\mathbf{w}^*} = (\mathbf{\Phi}^\mathsf{T} \mathbf{\Phi} + \lambda \mathbf{I})^{-1} \mathbf{\Phi}^\mathsf{T} \mathbf{y}$

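Notice that the only difference from the least squares solution is that we add $\lambda \mathbf{I}$ to $\mathbf{\Phi}^\mathsf{T} \mathbf{\Phi}$; see **Appendix B** for a derivation. (Strictly speaking, with the $\frac{1}{N}$ factor in the loss above, the added term is $N \lambda \mathbf{I}$; since $N$ is a constant, we absorb it into $\lambda$.)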

Now we can practice how to perform regularized regression for our data. Let's start coding and running the experiments!

In [None]:
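# Adapted from the function regression_analytical_solution, with a regularizer
def regularized_regression_analytical_solution(phi, y, regularization_coef):
  # phi.T.dot(phi) has shape (num_features, num_features), so the identity
  # matrix needs one row per feature (column of phi), not one per data point.
  num_features = phi.shape[1]
  return (np.linalg.inv(phi.T.dot(phi) +
                        regularization_coef * np.identity(num_features)
                        ).dot(phi.T)).dot(y)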

# Put some values for the hyperparameter lambda to an array: these are our
# hyperparameter values to sweep
regularization_coefs = [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2]
min_validation_error = np.inf

# Accumulate training errors in a list
training_errors = []
# Accumulate validation errors in a list
validation_errors = []

# Train the model for each hyperparameter value and calculate the corresponding
# validation error for each
for regularization_coef in regularization_coefs:
  w_reg = regularized_regression_analytical_solution(phi_train, y_train,
                                                     regularization_coef)
  training_error = jnp.mean((jnp.dot(phi_train, w_reg) - y_train) ** 2)
  training_errors.append(training_error)
  validation_error = jnp.mean((jnp.dot(phi_val, w_reg) - y_val) ** 2)
  validation_errors.append(validation_error)

  # Find the hyperparameter value that gives the best validation error
  if validation_error < min_validation_error:
    min_validation_error = validation_error
    best_regularization_coef = regularization_coef
    best_w_reg = w_reg

We just performed a basic form of cross-validation, with a single training and validation split, to pick the best hyperparameter value.
Let's check the best $\lambda$ value and the magnitudes of the weights:

In [None]:
print("Best value for hyperparameter lambda:", best_regularization_coef)
print("Weights for the regularized regression:", best_w_reg)
print("L2-norm for the weights for the regularized regression:",
      np.linalg.norm(best_w_reg, 2))

We collected the training error and validation error for each value of the regularization coefficient $\lambda$.
You can see in the plot below that as we increase $\lambda$, the training error increases while the validation error at first decreases.
This means that for small values of $\lambda$ the model was overfitting; thus, regularizing more by increasing the coefficient helps. However, as we increase $\lambda$ further, the model starts to underfit and both the training and validation errors increase.
This is why we used validation to find the best value: we loop over different hyperparameter values, fit the model for each, log the validation error, and pick the hyperparameter value that gives the minimum validation error. You will see in the next colabs that other models have other hyperparameters, and we will apply the same technique to pick their values.

In [None]:
plt.figure()
plt.semilogx(np.array(regularization_coefs), np.array(training_errors), label="Train")
plt.semilogx(np.array(regularization_coefs), np.array(validation_errors), label="Validation")
plt.title("Train and validation losses vs regularization coefficient")
plt.xlabel(r"Regularization coefficient $\lambda$")
plt.ylabel("Loss")
plt.legend()
plt.show()

We reached a different solution with early stopping than we did by adding a regularizer.
The analytical solution was convenient as we didn't have to worry whether the optimizer would converge to a (global) minimum. But optimization is not always that convenient!
The models we use for more complicated problems usually can't be minimized analytically, and we normally use optimizers such as gradient-descent-based algorithms, possibly with early stopping or regularizers.



### [Optional] **Ridge and Lasso**
The special case of linear regression with L2 regularisation is also known in the statistics community as **Ridge Regression**.
Ridge regression has a tendency to shrink the parameters in a parameter vector, but not necessarily to set them all the way to zero.
Replacing the L2-norm in the regularised loss with an L1 norm leads to another model called **The Lasso**.
This has the tendency to set components of the parameter vector to 0 and can therefore also be used as a *variable selection technique*.
For example, suppose you are trying to predict house prices and you have 10 different inputs you could use to predict the price. However, you are unsure which (if any) of these 10 inputs is actually related to the house price. You could construct a Lasso model with all 10 inputs and after fitting it, find that it sets the parameter associated with 5 of these inputs to 0. This would suggest that those inputs were not useful in predicting the output house price.
If you're interested in finding out more, including why Ridge regression shrinks parameters and Lasso sets them to zero, see [this resource](https://online.stat.psu.edu/stat508/lesson/5).
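
The difference between the two penalties can be seen in a small experiment. The sketch below is an illustration under assumptions: it uses a synthetic dataset with 10 inputs of which only 2 influence the output, fits ridge regression with the analytical solution, and fits the Lasso with proximal gradient descent (also called ISTA, a standard algorithm that is not covered in this practical). The function names and the choice of $\lambda = 0.5$ are made up for this example.

```python
import numpy as np

def ridge(phi, y, regularization_coef):
  """Analytical minimizer of mean squared error + coef * ||w||_2^2."""
  n, d = phi.shape
  # The factor n appears because the loss is a mean over n data points.
  return np.linalg.solve(
      phi.T.dot(phi) + n * regularization_coef * np.identity(d),
      phi.T.dot(y))

def soft_threshold(v, threshold):
  # Shrinks every component towards 0; small components become exactly 0.
  return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def lasso(phi, y, regularization_coef, num_steps=500):
  """Proximal gradient descent on mean squared error + coef * ||w||_1."""
  n, d = phi.shape
  # Step size 1 / L, with L the largest eigenvalue of the Hessian of the
  # mean squared error, (2 / n) * phi^T phi.
  learning_rate = 1.0 / (2.0 / n * np.linalg.eigvalsh(phi.T.dot(phi)).max())
  w = np.zeros(d)
  for _ in range(num_steps):
    grad = 2.0 / n * phi.T.dot(phi.dot(w) - y)
    w = soft_threshold(w - learning_rate * grad,
                       learning_rate * regularization_coef)
  return w

# Synthetic data: only the first 2 of the 10 inputs affect the output.
rng = np.random.default_rng(0)
num_points, num_inputs = 200, 10
phi = rng.normal(size=(num_points, num_inputs))
w_true = np.zeros(num_inputs)
w_true[:2] = [3.0, -2.0]
y = phi.dot(w_true) + 0.1 * rng.normal(size=num_points)

w_ridge = ridge(phi, y, regularization_coef=0.5)
w_lasso = lasso(phi, y, regularization_coef=0.5)
print("Ridge weights:", np.round(w_ridge, 3))
print("Lasso weights:", np.round(w_lasso, 3))
print("Exact zeros (ridge, lasso):",
      int(np.sum(w_ridge == 0.0)), int(np.sum(w_lasso == 0.0)))
```

The soft-thresholding step is what produces exact zeros for the Lasso, whereas the ridge solution only rescales the weights.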


# **Appendices**

## **Appendix A: When gradient descent goes wrong**
Let's revisit a small variation of the wave-like dataset from earlier and attempt to fit a 4th degree polynomial to it using gradient descent.
You'll notice from the output that the loss starts off very large and seems to increase at every step before becoming "infinity"! What's going on here?

In [None]:
xs, ys = generate_wave_like_dataset(n=100)
xs = 5 * (xs + 1)
phi = create_data_matrix(xs)

w = regression_analytical_solution(phi, ys)
y_hat = phi.dot(w)

plot_data(y_hat, xs, ys, "Linear regression (analytic minimum)")

In [None]:
gradient_descent(phi, ys, num_steps=10)

The first issue seems to be that the magnitude of the gradient is large.
This means that at each gradient descent step, the changes to the parameters $\color{red}{\mathbf{w}}$ are extremely large, causing $\color{red}{\mathbf{w}}$ to overshoot and end up in a region with a worse loss.

The following illustrates how this may happen in the simple case of a single parameter $\color{red}{w}$, a mean squared error loss function $l(\color{red}{w})$, plotted in blue, and an initial value $\color{red}{w}^{(0)} = 2.5$ shown as the single red dot where the value of the loss is $2.5^2 = 6.25$.

In [None]:
ws = np.linspace(-10, 10, 1000)
losses = ws ** 2

plt.plot(ws, losses)
plt.title("Loss function for a single parameter w")
plt.xlabel("$w$")
plt.ylabel("$l(w)$")

points = [(2.5, 2.5**2)]

ws, losses = list(zip(*points))
plt.scatter(ws, losses, c='r', s=50, zorder=100)

plt.show()

The gradient of this loss function is simply $l'(\color{red}{w}) = 2\color{red}{w}$. Suppose now that our learning rate was 1.5. If we take a gradient step, we arrive at a new w value of:

$\color{red}{w}^{(1)} = \color{red}{w}^{(0)} - 1.5 \cdot 2\color{red}{w}^{(0)} = 2.5 - 7.5 = -5$

At $\color{red}{w}^{(1)} = -5$, the value of the loss is now $(-5)^2 = 25$! This point and 2 more steps of gradient descent are illustrated below. We see that the loss keeps increasing and the parameter $\color{red}{w}$ moves further away from the minimum.

In [None]:
LEARNING_RATE = 1.5

ws = np.linspace(-20, 20, 1000)
losses = ws ** 2

fig, ax = plt.subplots()

plt.plot(ws, losses)
plt.title("Loss function for a single parameter w")
plt.xlabel("$w$")
plt.ylabel("$l(w)$")

w = 2.5
points = []
for _ in range(4):
  points.append((w, w**2))
  w = w - LEARNING_RATE * 2 * w

ws, losses = list(zip(*points))
plt.scatter(ws, losses, c='r', s=50, zorder=10)

for point, next_point in zip(points, points[1:]):
  ax.annotate("", xy=next_point, xytext=point,
              arrowprops=dict(arrowstyle="-|>", ls="--", color="grey"))

plt.show()

One way to address this is to use a lower learning rate. With a learning rate of 0.3 and a few steps of gradient descent, the picture looks quite different: this time the loss decreases with every step and we move towards the minimum loss at $\color{red}{w}=0$.
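
Both behaviours follow from the update rule itself. For $l(\color{red}{w}) = \color{red}{w}^2$, each gradient step multiplies $\color{red}{w}$ by the constant factor $(1 - 2 \cdot \text{learning rate})$, so the iterates shrink only if that factor has a magnitude below 1. The short sketch below, with a hypothetical helper `gradient_descent_on_square`, checks this for the two learning rates used in this appendix.

```python
def gradient_descent_on_square(w0, learning_rate, num_steps):
  """Runs gradient descent on l(w) = w**2 and returns all visited values."""
  ws = [w0]
  for _ in range(num_steps):
    gradient = 2 * ws[-1]
    ws.append(ws[-1] - learning_rate * gradient)
  return ws

for learning_rate in [1.5, 0.3]:
  factor = 1 - 2 * learning_rate
  ws = gradient_descent_on_square(2.5, learning_rate, num_steps=3)
  print("learning rate:", learning_rate, "factor per step:", factor, "w:", ws)
```

With a learning rate of 1.5 the factor is $-2$, so $|\color{red}{w}|$ doubles at every step; with 0.3 the factor is $0.4$, so $\color{red}{w}$ shrinks towards 0.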

In [None]:
LEARNING_RATE = 0.3

ws = np.linspace(-5, 5, 1000)
losses = ws ** 2

fig, ax = plt.subplots()

plt.plot(ws, losses)
plt.title("Loss function for a single parameter w")
plt.xlabel("$w$")
plt.ylabel("$l(w)$")

w = 2.5
points = []
for _ in range(4):
  points.append((w, w**2))
  w = w - LEARNING_RATE * 2 * w

ws, losses = list(zip(*points))
plt.scatter(ws, losses, c='r', s=50, zorder=10)

for point, next_point in zip(points, points[1:]):
  ax.annotate("", xy=next_point, xytext=point,
              arrowprops=dict(arrowstyle="-|>", ls="--", color="grey"))

plt.show()

Let's try setting a lower learning rate with our earlier example.

**Exercise**

Try setting some different learning rates in the cell below to see what happens.
With a learning rate as low as 1e-8, the loss doesn't explode, but decreases extremely slowly.
We will eventually get to the optimal value but it will take a long time (over 100 000 steps!).

The issue seems to be that when we're far away from the optimal setting of the parameters $\color{red}{\mathbf{w}}$, the gradient magnitude is extremely large, so we set a lower learning rate to compensate.
However, when we get into a better range for the parameters, the gradient magnitude shrinks into a more manageable range, but now our learning rate is so small that we need to do thousands of tiny steps to reach our goal.

In [None]:
LEARNING_RATE = 1e-8

w = gradient_descent(phi, ys, num_steps=100, learning_rate=LEARNING_RATE)

### **Optimization!**

By now, you can guess that **optimization** is a big and important research area.
Surely, there must be cleverer things that we could do than basic gradient descent. We've shown how we can run into serious difficulty with an apparently very simple problem.

There are much, much cleverer things that we can do! We will leave you with this last open-ended exercise:
Can you reimplement a better version of 

```
def gradient_descent(X, y, ...):
  ...

  for i in range(num_steps):
    _, grad = loss_and_grad(w, X, y)
    
    w = w - learning_rate * grad
  
  return w
```

following some of the approaches in [this blog](https://ruder.io/optimizing-gradient-descent/index.html)?
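
As one possible starting point (not a reference solution), the sketch below adds a momentum term, one of the approaches described in the blog, to plain gradient descent. To stay self-contained it uses a hypothetical two-parameter quadratic loss with very different curvature in each direction instead of the practical's `gradient_descent` and dataset; the function name, learning rate, and momentum coefficient are assumptions for this example.

```python
import numpy as np

# Hypothetical loss 0.5 * w^T A w, much steeper in the second direction.
A = np.diag([1.0, 100.0])

def loss(w):
  return 0.5 * w.dot(A).dot(w)

def grad(w):
  return A.dot(w)

def gradient_descent_with_momentum(w0, learning_rate, momentum, num_steps):
  w = np.array(w0, dtype=float)
  velocity = np.zeros_like(w)
  for _ in range(num_steps):
    # The velocity is a decaying sum of past gradients.
    velocity = momentum * velocity + grad(w)
    w = w - learning_rate * velocity
  return w

# momentum=0.0 recovers plain gradient descent.
w_plain = gradient_descent_with_momentum([1.0, 1.0], 0.009, 0.0, 100)
w_momentum = gradient_descent_with_momentum([1.0, 1.0], 0.009, 0.9, 100)
print("Loss without momentum:", loss(w_plain))
print("Loss with momentum:   ", loss(w_momentum))
```

The learning rate has to stay small because of the steep direction, which makes plain gradient descent slow in the flat direction; accumulating past gradients speeds up progress there.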



## **Appendix B: Linear regression analytical solution**
Using the vector notation we introduced previously, we can write the mean of the squared loss over the data points as follows:

\begin{align}
\newcommand{\T}{\mathsf{T}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\w}{\color{red}{\mathbf{w}}}
l(\w) &= \frac{1}{N} (\y - \X\w)^\T(\y - \X\w) & (1) \\
&= \frac{1}{N} (\y^\T - (\X\w)^\T)(\y - \X\w) & (2) \\
&= \frac{1}{N} [\y^\T\y - \y^\T(\X\w) - (\X\w)^\T\y + (\X\w)^\T(\X\w)] & (3)  \\
&= \frac{1}{N}[\y^\T\y - 2(\X\w)^\T \y + (\X\w)^\T(\X\w)] & (4)  \\
&= \frac{1}{N}[\y^\T\y - 2(\X\w)^\T \y + \w^\T\X^\T\X\w] & (5)
\end{align}

**Exercise**

Convince yourself that all the steps make sense,
1.  using the matrix-transpose identity $(A+B)^\mathsf{T} = A^\mathsf{T} + B^\mathsf{T}$;
2.  multiplying terms;
3.  $\y^\T(\X\w)$ is a scalar, so it equals its own transpose $(\X\w)^\T\y$ and the two middle terms can be combined;
4.  using the identity $(AB)^\T = B^\T A^\T$.

Now we can calculate the derivative of the loss with respect to the parameter vector $\mathbf{w}$.
Using matrix-vector derivatives from [matrixcheatsheet](http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf), which collects the relevant parts of the [matrixcookbook](https://www2.imm.dtu.dk/pubdb/edoc/imm3274.pdf), we get:

\begin{align}
\frac{\partial l}{\partial \w} & = \frac{1}{N} \left( -2\X^\T \y + 2\X^\T\X\w \right) &  (6)
\end{align}

Finally, because the loss is a convex quadratic function, setting this derivative equal to zero and solving for $\w$ gives the optimum value $\w^{\color{red}{*}}$, which is the minimum of the squared loss (the constant factor $\frac{1}{N}$ cancels):

\begin{align}
\frac{\partial l}{\partial \w} = \frac{1}{N} \left( -2\X^\T \y + 2\X^\T\X\w^{\color{red}{*}} \right) & = 0 \\
\Rightarrow \X^\T \y & = \X^\T\X\w^{\color{red}{*}} \\
\Rightarrow \w^{\color{red}{*}} & = (\X^\T\X)^{-1} \X^\T \y
\end{align}

We can use the above derivative (6) to calculate the derivative of the squared loss plus the L2-norm regularization term $\lambda \|\w\|_2^2$ (which we covered in the Regularisation section above),
and again find the parameter value that makes the derivative zero:
\begin{align}
\frac{1}{N} \left( -2\X^\T \y + 2\X^\T\X\w^{\color{red}{*}} \right) + 2 \lambda \w^{\color{red}{*}} & = 0 \\
\Rightarrow \X^\T \y & = \X^\T\X\w^{\color{red}{*}} + N \lambda \w^{\color{red}{*}}\\
\Rightarrow \X^\T \y & = (\X^\T\X+ N \lambda \mathbf{I})\w^{\color{red}{*}}\\
\Rightarrow \w^{\color{red}{*}} & = (\X^\T\X + N \lambda \mathbf{I})^{-1} \X^\T \y
\end{align}

Since $N$ is a constant, it can be absorbed into $\lambda$, which gives the form $(\X^\T\X + \lambda \mathbf{I})^{-1} \X^\T \y$ used in the main text.
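
As a sanity check of the derivation, the sketch below compares the analytical gradient of the regularized loss with a finite-difference approximation and verifies that the gradient vanishes at the closed-form solution. The data is random and the helper names are hypothetical. The loss in the code is a mean over $N$ data points, so the closed form there multiplies $\lambda$ by $N$.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
  residual = y - X.dot(w)
  return residual.dot(residual) / len(y) + lam * w.dot(w)

def analytical_gradient(w, X, y, lam):
  # Gradient of the mean squared loss (with its 1/N factor) plus the
  # gradient of the penalty lam * ||w||^2.
  n = len(y)
  return (-2 * X.T.dot(y) + 2 * X.T.dot(X).dot(w)) / n + 2 * lam * w

def numerical_gradient(f, w, eps=1e-6):
  # Central finite differences, one coordinate at a time.
  grad = np.zeros_like(w)
  for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
  return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
lam = 0.1

grad_error = np.max(np.abs(
    analytical_gradient(w, X, y, lam)
    - numerical_gradient(lambda v: regularized_loss(v, X, y, lam), w)))

# Closed-form minimizer; the factor n appears because the loss is a mean.
n = len(y)
w_star = np.linalg.solve(X.T.dot(X) + n * lam * np.identity(3), X.T.dot(y))
grad_at_optimum = analytical_gradient(w_star, X, y, lam)
print("Largest gradient discrepancy:", grad_error)
print("Gradient at the closed-form solution:", grad_at_optimum)
```

Setting `lam = 0.0` gives the same check for the unregularized least squares solution.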

Not all problems have an analytical solution; for those problems we use methods like gradient descent to find the optimum.
That is why we explained how to apply a gradient-descent-based training algorithm to find the minimum of a squared loss function, even though in this case we could calculate it directly.