In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ER 131] Homework 6: Gradient Descent

This homework focuses on gradient descent. 



### Table of Contents
* [Project](#project)<br>
1. [A Simple Model](#model)<br>
1. [Fitting the Model](#fitting)<br>
1. [Increasing Model Complexity](#complexity)<br>
1. [Gradient Descent](#gd)<br>

---

## Section A: Project<a id='project'></a>

This week, each member of your team should conduct an EDA of a different dataset that you are considering for your analysis.

**Question A.1** In a few sentences, explain where and how you obtained this data and how it was collected.

*YOUR ANSWER HERE*

**Question A.2** Briefly summarize the structure, granularity, scope, temporality and faithfulness (write 1-2 sentences for each of structure, granularity, etc). Is there any aspect of this dataset that is limiting, or any reason to question its validity?

*YOUR ANSWER HERE*

**Question A.3** Specify three data cleaning operations that you will have to perform on this dataset.

*YOUR ANSWER HERE*

## Introduction to Gradient Descent
Before we dive into the data and the homework, let's set up our motivation for exploring gradient descent. So far, we've been finding model parameters for linear regression by defining a loss function: a function that we want to minimize. Specifically, this loss function has been the mean squared error (MSE). The linear regression fitting that we did in homework 5 and lab 5 worked by solving for the $\theta$ values that minimize the mean squared error on the training data. To minimize the MSE, we take its derivative with respect to the parameters, set it to zero, and solve for the parameters.<br>
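
As a small worked illustration of that recipe, consider a hypothetical constant model $\hat{y} = c$ (an assumption made here for illustration only; it is not one of the models used in this homework). Its MSE and derivative are:

$$
L(c) = \frac{1}{n}\sum_{i=1}^n (y_i - c)^2, \qquad
\frac{dL}{dc} = -\frac{2}{n}\sum_{i=1}^n (y_i - c)
$$

Setting the derivative to zero gives

$$
c^* = \frac{1}{n}\sum_{i=1}^n y_i
$$

so under this assumed model the loss-minimizing constant is the mean of the observed $y$ values.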

This process isn't always feasible. One reason is computational cost: when a problem has many explanatory variables (features), setting the MSE derivative to zero and solving involves inverting a very large matrix, and when a model has a more complex form than linear regression, finding the derivative of the loss function and setting it to zero can be difficult. A second reason is that some of the loss functions you might encounter can't be massaged into a form that allows you to find the parameters algebraically.<br>

This is where gradient descent comes in! For complex models, or models with many features, it's often a more efficient way to find a minimum of the loss function.
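
To make the idea concrete, here is a minimal sketch of gradient descent on a toy one-parameter loss. The loss `toy_loss`, its derivative `toy_gradient`, the helper `toy_descent`, and the chosen step size are all assumptions for illustration; they are not the model or the functions you will write in this homework.

```python
def toy_loss(theta):
    """Hypothetical loss with a single minimum at theta = 3."""
    return (theta - 3.0) ** 2


def toy_gradient(theta):
    """Derivative of toy_loss with respect to theta."""
    return 2.0 * (theta - 3.0)


def toy_descent(theta, alpha=0.1, num_iter=50):
    """Repeatedly step against the gradient and return the final theta."""
    for _ in range(num_iter):
        theta = theta - alpha * toy_gradient(theta)
    return theta


theta_final = toy_descent(theta=0.0)
print(theta_final)  # ends up very close to 3, the minimizer of toy_loss
```

Each step moves `theta` a little in the direction that lowers the loss, so no derivative ever has to be set to zero and solved by hand.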


**Dependencies:**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import warnings 
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight') 

# Set some parameters
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 14
np.set_printoptions(4)

In [None]:
# We will use the plot_3d helper function to help us visualize the gradient
from hw7_utils import plot_3d

----

## Load Data
For this homework, we'll be working with theoretical data.
Load the `hw7_data.csv` file into a pandas dataframe.

In [None]:
# Run this cell to load our sample data
data = pd.read_csv('hw7_data.csv', index_col=0)
data.head()

---

## Section 1: A Simple Model<a id='model'></a>
Let's start by examining our data and creating a simple model that can represent this data.<br>

**Question 1.1:** Define a function `scatter()` that produces a scatter plot. It should take the x and y values as input and produce a scatter plot with axis labels. Then, plot the $x$ and $y$ data from the `data` dataframe you loaded above. Later cells refer to these values as the variables `x` and `y`, so assign them to those names here.<br>

In [None]:
def scatter(x, y):
    """
    Generate a scatter plot using x and y

    Keyword arguments:
    x -- the vector of values x
    y -- the vector of values y
    """
    #YOUR CODE HERE

In [None]:
#YOUR CODE HERE

**Question 1.2:** Describe any significant observations about the distribution of the data. How can you describe the relationship between $x$ and $y$?

*Your answer here*

**Question 1.3:** For now, let's assume that the data follows some linear model, parametrized by $\theta$:

$\Large
\hat{y} = \theta \cdot x
$

Define a linear model function `linear_model()` that produces a value $\hat{y}$ given $x$ and $\theta$, where $x$ is a vector of observations and $\theta$ is a scalar value.

In [None]:
def linear_model(x, theta):
    """
    Returns the estimate of y given x and theta

    Keyword arguments:
    x -- the vector of values x
    theta -- the scalar theta
    """
    y_hat = ... # YOUR CODE HERE

    return y_hat

In [None]:
# run this cell, do not change it
assert linear_model(0, 1) == 0
assert linear_model(10, 10) == 100
assert np.sum(linear_model(np.array([3, 5]), 3)) == 24
assert linear_model(np.array([7, 8]), 4).mean() == 30

**Question 1.4:** In class, we learned that the $L^2$ loss function (i.e., **the mean squared error**) is smooth and continuous. Let's use $L^2$ loss to find an optimal value for $\theta$. First, define the $L^2$ loss function `l2_loss` below, that calculates the value of $L^2$ given a set of actual observations $y$ and predictions $\hat{y}$.

In [None]:
def l2_loss(y, y_hat):
    """
    Returns the average l^2 loss given y and y_hat

    Keyword arguments:
    y -- the vector of true values y
    y_hat -- the vector of predicted values y_hat
    """
    return ... # YOUR CODE HERE


In [None]:
# run this cell, do not change it
assert l2_loss(2, 1) == 1
assert l2_loss(2, 0) == 4 
assert l2_loss(5, 1) == 16
assert l2_loss(np.array([5, 6]), np.array([1, 1])) == 20.5
assert l2_loss(np.array([1, 1, 1]), np.array([4, 1, 4])) == 6.0

**Question 1.5:** Write a function `l2_plot()` that produces a plot of $L^2$ loss as a function of the coefficient $\theta$. Your function should take inputs $x$ and $y$, which are vectors of $x$ and $y$ observations, and input `thetas`, which is a list of possible thetas. You should end up with a plot of $\theta$ values on the x-axis, and the $L^2$ loss corresponding with those $\theta$ values on the y-axis.  Make sure to label your axes and add a title.<br>
<br>

In [None]:
def l2_plot(x, y, thetas):
    """
    Plots the average l2 loss for given x, y as a function of theta.
    Use the functions you wrote for linear_model and l2_loss.

    Keyword arguments:
    x -- the vector of values x
    y -- the vector of values y
    thetas -- the vector containing different estimates of theta
    """
    # Calculate the loss here for each value of theta
    # Create your plot here

**Question 1.6:** Run the function `l2_plot()` using the $x$ and $y$ values from dataframe `data` above and a list of `thetas` (you can define this range yourself - the `np.linspace()` function might be helpful here). 

What appears to be the optimal $\theta$ value based on the visualization? We'll call this value $\theta^*$.  Set the variable `theta_star_guess` to the value of $\theta$ that appears to minimize our loss.
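
As a rough sketch of how `np.linspace()` can build a grid of candidate values, the snippet below evaluates a toy loss on that grid and reads off the grid point with the smallest loss. The grid range, the toy loss, and the variable names are assumptions for illustration and are unrelated to the homework data.

```python
import numpy as np

# Hypothetical grid of candidate values and a toy loss, not the homework data
candidates = np.linspace(-1, 1, 21)        # 21 evenly spaced values from -1 to 1
grid_losses = (candidates - 0.5) ** 2      # toy loss that is smallest at 0.5

best_index = np.argmin(grid_losses)        # position of the smallest loss
best_candidate = candidates[best_index]    # grid value at that position
print(best_candidate)
```

A finer grid gives a more precise visual estimate, at the cost of more loss evaluations.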

In [None]:
thetas = ... # define possible thetas
l2_plot(...) # call your L^2 loss plotting function
theta_star_guess = ... # Your guess here

In [None]:
assert l2_loss(3, 2) == 1
assert l2_loss(0, 10) == 100
assert -3 <= theta_star_guess <= -2

---
## Section 2: Fitting our Simple Model<a id='fitting'></a>
Now that we have defined a simple linear model and loss function, let's begin working on fitting our model to the data.

**Question 2.1:** Let's confirm our visual findings for our optimal coefficient $\theta^*$. First, identify the analytical solution for the optimal $\theta^*$ that minimizes average $L^2$ loss. Which of the three options below correctly gives $\theta^*$ for $n$ observations $(x_i, y_i)$? Highlight your answer in <font color = "red">red</font> (double click this cell if you don't know how to do this).

1. $$\Large {\theta}^* = \frac{\sum x_i + y_i}{\sum x_i^2}$$ <br>
2. $$\Large {\theta}^* = \frac{\sum x_iy_i}{\sum x_i^2}$$ <br>
3. $$\Large {\theta}^* = \frac{\sum x_iy_i}{\sum x_i}$$ <br>

**Question 2.2:** 
Now that we have the analytic solution for $\theta^*$, implement the function `find_theta` that calculates the numerical value of $\theta^*$ based on our data $x$, $y$.

In [None]:
def find_theta(x, y):
    """
    Find optimal theta given x and y

    Keyword arguments:
    x -- the vector of values x
    y -- the vector of values y
    """
    # YOUR CODE HERE
    return ...

t_star = find_theta(...) # Your code here to get theta star

In [None]:
# run this cell; do not change it
print(f'theta_opt = {t_star}')
assert -2.5 <= t_star <= -2

**Question 2.3:** Now, let's plot our loss function again using the `l2_plot()` function. This time, add a vertical line at the optimal value of theta (i.e. plot the line $x = \theta^*$). The function `plt.axvline()` is helpful here.

In [None]:
l2_plot(...) # plot loss function
# add a vertical line

<br> 
**Question 2.4:** We now have an optimal value for $\theta$ that minimizes our loss. In the cell below, plot the scatter plot of the data from Question 1.1 (you can reuse the `scatter()` function here). Add the best-fit line $\hat{y} = \theta^* \cdot x$ using the $\theta^*$ you computed above.

In [None]:
# YOUR CODE HERE

**Question 2.5:** Great! It looks like our estimate for $\theta$ is able to capture a lot of the data with a single parameter. Now let's try to plot the residual to see what we've missed.<br>  

The residual is defined as $r=y-\theta^* \cdot x$. Below, write a function to find the residual and plot the residuals as a function of the independent variable in a scatter plot. Plot a horizontal line at $y=0$ to assist with visualization. Add axis labels.

In [None]:
def visualize_residual(x, y):
    """
    Plot a scatter plot of the residuals, the remaining 
    values after removing the linear model from our data.

    Keyword arguments:
    x -- the vector of values x
    y -- the vector of values y
    """
    # calculate residual
    r = ...
    
    # plot residual, including axis labels and horizontal line at y=0

visualize_residual(x, y)
plt.show()

**Question 2.6:** What does the residual look like? Do you notice a relationship between $x$ and $r$?

*YOUR ANSWER HERE*

---
## Section 3: Increasing Model Complexity<a id='complexity'></a>

It looks like the remaining data is cosinusoidal, meaning our original data follows a linear function and a cosinusoidal function. Let's define a new model to address this discovery and find optimal parameters to best fit the data:

$$\Large
\hat{y} = \theta_1 x + \cos(\theta_2 x)
$$

Now, our model is parameterized by both $\theta_1$ and $\theta_2$, or composed together, $\vec{\theta}$.

Note that a generalized cosine function $a\cos(bx+c)$ has three parameters: an amplitude scaling parameter $a$, a frequency parameter $b$, and a phase shifting parameter $c$. We can assume that the scaling and shifting parameters ($a$ and $c$ in this case) are 1 and 0 respectively.

**Question 3.1:** In the following cell, **explain why we can assume the scaling parameter to be 1 and shifting parameter to be 0 based on the residual plot in Question 2.5**.

You might find the following code helpful in visualizing all three parameters.

```python
def plot_cos_generalized(a,b,c,label=None):
    """Plot a cosine function with three parameters"""
    X = np.linspace(-5, 5)
    Y = a * np.cos(b*X + c)
    plt.plot(X, Y, ':',label=label)
    plt.legend()
 ```

You can try plotting: 
```python
plot_cos_generalized(1,1,0, label='cos(x)')
plot_cos_generalized(1,1,2, label='cos(x + 2)')
plot_cos_generalized(1,2,2, label='cos(2x + 2)')
plot_cos_generalized(2,2,2, label='2cos(2x + 2)')
```

In [None]:
# use this cell for scratch work

*Your answer here*

**Question 3.2:** As in Question 1.3, write a function that predicts a value $\hat{y}$ given an input $x$ based on our new model.

*Hint:* Try to do this without using for loops. The `np.cos` function may help you.
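
As a brief sketch of what "without for loops" means here, `np.cos` applies elementwise to a whole NumPy array in one call. The array `angles` and the factor of 2 below are hypothetical values chosen for illustration, not part of the homework model.

```python
import numpy as np

# Hypothetical inputs, chosen so the cosines are easy to check by hand
angles = np.array([0.0, np.pi / 2, np.pi])

# One vectorized call computes cos(0), cos(pi), cos(2*pi) without a loop
values = np.cos(2 * angles)
print(values)
```

Arithmetic such as multiplying an array by a scalar works elementwise in the same way, so a whole model prediction can usually be written as a single array expression.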

In [None]:
def cos_model(x, theta_1, theta_2):
    """
    Predict the estimate of y given x, theta_1, theta_2

    Keyword arguments:
    x -- the vector of values x
    theta_1 -- the scalar value theta_1
    theta_2 -- the scalar value theta_2
    """
    # YOUR CODE HERE

In [None]:
print(np.isclose(cos_model(1, 1, np.pi), 0))
# Check that we accept x as arrays
assert len(cos_model(x, 2, 2)) > 1

**Question 3.3:** In this question your job is to match the left and right sides of the equations for:
1. The $L^2$ loss for the `cos` model, $\hat{y} = \theta_1x + \cos(\theta_2x)$.  We'll call that $L(x, y, \theta_1, \theta_2)$.
2. The partial derivatives of the `cos` model loss function, $\frac{\partial L }{\partial \theta_1}, \frac{\partial L }{\partial \theta_2}$. 

Notice that we now have $\vec{x}$ and $\vec{y}$ instead of $x$ and $y$. This means that when determining the loss function $L(x, y, \theta_1, \theta_2)$, you'll need to take the average of the squared losses for each $y_i$, $\hat{y_i}$ pair.

As your answer below, match the right side (letters) to the correct left sides (numbers).

1. $L(x, y, \theta_1, \theta_2)$ <br>
2. $\frac{\partial L}{\partial \theta_1}$ <br>
3. $\frac{\partial L}{\partial \theta_2}$ <br>


A. $\frac{2}{n} \sum_{i=1}^n (x_i y_i \sin(\theta_2 x_i) - \theta_1 x_i ^ 2 \sin(\theta_2 x_i) - x_i \cos(\theta_2 x_i)\sin(\theta_2 x_i))$ <br>
B. $-\frac{2}{n} \sum_{i=1}^n (x_i y_i - \theta_1 x_i ^ 2 - x_i \cos(\theta_2 x_i))$<br>
C.  $\frac{1}{n} \sum_{i=1}^n (y_i - \theta_1 x_i - \cos(\theta_2 x_i)) ^ 2$ 


*Your answer*: <br>
1. ... <br>
2. ... <br>
3. ... <br>

**Question 3.4:** Now, implement the functions `dt1` and `dt2`, which should compute $\frac{\partial L }{\partial \theta_1}$ and $\frac{\partial L }{\partial \theta_2}$ respectively. Use the formulas you wrote for $\frac{\partial L }{\partial \theta_1}$ and $\frac{\partial L }{\partial \theta_2}$ in the previous exercise. In the functions below, the parameter `theta` is a vector that looks like $( \theta_1, \theta_2 )$.
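
One optional way to sanity-check a hand-derived partial derivative is to compare it against a finite-difference approximation. The sketch below does this for a toy two-parameter function; `toy_loss_2d`, `toy_dt1`, `central_difference`, and the step size `h` are assumptions for illustration and are not the homework's loss or gradient functions.

```python
def toy_loss_2d(theta_1, theta_2):
    """Hypothetical two-parameter loss used only for illustration."""
    return theta_1 ** 2 + 3.0 * theta_1 * theta_2


def toy_dt1(theta_1, theta_2):
    """Hand-derived partial derivative of toy_loss_2d with respect to theta_1."""
    return 2.0 * theta_1 + 3.0 * theta_2


def central_difference(f, theta_1, theta_2, h=1e-6):
    """Approximate the partial derivative of f with respect to theta_1."""
    return (f(theta_1 + h, theta_2) - f(theta_1 - h, theta_2)) / (2.0 * h)


analytic = toy_dt1(1.0, 2.0)
numeric = central_difference(toy_loss_2d, 1.0, 2.0)
print(abs(analytic - numeric) < 1e-4)  # the two estimates should agree closely
```

If the analytic and numeric values disagree by much more than the step size would explain, the hand derivation or its implementation likely has a sign or term error.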

In [None]:
def dt1(x, y, theta):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_1

    Keyword arguments:
    x -- the vector of all x values
    y -- the vector of all y values
    theta -- the vector of values theta
    """
    # YOUR CODE HERE

In [None]:
def dt2(x, y, theta):
    """
    Compute the numerical value of the partial of l2 loss with respect to theta_2

    Keyword arguments:
    x -- the vector of all x values
    y -- the vector of all y values
    theta -- the vector of values theta
    """
    # YOUR CODE HERE

In [None]:
# This function calls dt1 and dt2 and returns the gradient dt. It is already implemented for you.
def dt(x, y, theta):
    """
    Returns the gradient of l2 loss with respect to vector theta

    Keyword arguments:
    x -- the vector of values x
    y -- the vector of values y
    theta -- the vector of values theta
    """
    return np.array([
        dt1(x, y, theta),
        dt2(x, y, theta)
    ])

In [None]:
# run this cell, do not change it. Both outputs should be True
print(np.isclose(dt1(x, y, [0, np.pi]), 153.0123127740174))
print(np.isclose(dt2(x, y, [0, np.pi]), 0.8562500798403736))

---
## Section 4: Gradient Descent<a id='gd'></a>
Now try to solve for the optimal $\theta^*$ analytically...

**Just kidding!**

You can try but we don't recommend it. When finding an analytic solution becomes difficult or impossible, we resort to alternative optimization methods for finding an approximate solution.

So let's try implementing a numerical optimization method: gradient descent!


**Question 4.1:** Implement the `grad_desc` function that performs gradient descent for a finite number of iterations. This function takes in array $x$, array $y$, and an initial value for $\theta$ (`theta`). `alpha` will be the learning rate (or step size, whichever term you prefer). In this part, we'll use a static learning rate that is the same at every time step. 

At each time step, use the gradient and `alpha` to update your current `theta`. Also at each time step, be sure to save the current `theta` in `theta_history`, along with the $L^2$ loss (computed with the current `theta`) in `loss_history`.

Hints:
- Write out the gradient update equation (1 step). What variables will you need for each gradient update? Of these variables, which ones do you already have, and which ones will you need to recompute at each time step?
- You may need a loop here to update `theta` several times.

In [None]:
# Run me
def init_t():
    """Creates an initial theta [0, 1.1] as a starting point for gradient descent"""
    return np.array([0,1.1])

In [None]:
def grad_desc(x, y, theta, num_iter=20, alpha=0.01):
    """
    Run gradient descent update for a finite number of iterations and static learning rate

    Keyword arguments:
    x -- the vector of values x
    y -- the vector of values y
    theta -- the vector of values theta to use at first iteration
    num_iter -- the max number of iterations
    alpha -- the learning rate (also called the step size)
    
    Return:
    theta -- the value of theta after num_iter iterations of gradient descent
    theta_history -- the series of theta values over each iteration of gradient descent
    loss_history -- the series of loss values over each iteration of gradient descent
    """
    # YOUR CODE HERE

In [None]:
# run this cell, do not change it
t = init_t() # set initial theta values to [0, 1.1]
t_est, ts, loss = grad_desc(x, y, t)

print(len(ts) == len(loss) == 20) # theta history and loss history each have 20 items
print(ts[0].shape == (2,)) # theta history contains theta values
print(np.isscalar(loss[0])) # loss history is a list of scalar values, not vector

print(loss[1] - loss[-1] > 0) # loss is decreasing

print(np.allclose(np.sum(t_est), -0.75, atol=2e-1))  # theta_est should be close to our value

**Question 4.2:** Let's visually inspect our results of running gradient descent to optimize $\theta$. Plot our x values with our model's predicted y values over the original scatter plot. Did gradient descent successfully optimize $\theta$?

In [None]:
# Run me
t = init_t()
t_est, ts, loss = grad_desc(x, y, t)

In [None]:
# YOUR CODE HERE
# get predicted y values from cosinusoidal model based on thetas obtained through gradient descent
# plot model's predicted values
# plot actual observations

*YOUR ANSWER HERE* 

**Question 4.3:** Let's visualize gradient descent to see how it converges. Plot the loss values on the y-axis and the iteration number on the x-axis for your gradient descent. 

In [None]:
# YOUR CODE HERE

**Question 4.4:** Create a single plot that shows the iteration (x-axis) vs the loss value (y-axis) for different values of `alpha`: try using `alpha` = 0.01, `alpha` = 0.005, and `alpha` = 0.0005. Add a legend. How does the loss value change over different iterations when alpha varies? Based on what you know about gradient descent, why does the loss value change in this way?<br>

*Note*: if you have a function that returns multiple values, but you only care about some of those values, you can assign the others to `_` to signal that you don't intend to use them. For instance, running: `_, _, loss = grad_desc(x, y, t)` keeps a usable name only for `loss_history`, and not for `theta` or `theta_history`. By convention, `_` marks a throwaway value, which keeps your namespace tidy.
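
A minimal sketch of this unpacking convention, using a hypothetical stand-in function (`toy_three_outputs` and its return values are made up for illustration and are not your `grad_desc`):

```python
def toy_three_outputs():
    """Hypothetical function that returns three values, in the same order grad_desc does."""
    final_value = "final"
    value_history = ["step 0", "step 1"]
    loss_history = [4.0, 1.0]
    return final_value, value_history, loss_history


# Keep only the third return value; the first two are assigned to the throwaway name _
_, _, kept_losses = toy_three_outputs()
print(kept_losses)
```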

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

## Extra credit
How--and why--does your model change if you set the initial conditions in the `init_t` function to significantly different values (e.g., 0,1)?

In [None]:
# Scratch work here

*YOUR ANSWER HERE*


----

## Bibliography

+ Data 100 - HW 5: Modeling, Estimation and Gradient Descent

<hr/>

Data Science Modules: http://data.berkeley.edu/education/modules