# Midterm Part II

In this midterm, we will compare three ways of fitting a linear regression (analytic solution, gradient descent, and Adam) without using the ``torch.nn`` / ``torch.optim`` packages.

For this exam, you need to 
1. fill out ``midterm.py``
2. run the current jupyter notebook
3. explain the two algorithms, GD and Adam, by comparing your results (answer at the end of this notebook)


## Google Colab Setup
Set up the work environment following Homework 1 (mount Google Drive, set the path).

## Setup Code
Before getting started, we need to run some boilerplate code to set up our environment, same as Assignment 1. You'll need to rerun this setup code each time you start the notebook.

First, run this cell to load the autoreload extension. This allows us to edit .py source files and re-import them into the notebook for a seamless editing and debugging experience.

In [None]:
%load_ext autoreload
%autoreload 2

## Generate a simulation model

In [None]:
import torch
import midterm

import math
import matplotlib.pyplot as plt

midterm.reset_seed(1)
trainX, trainy, true_beta = midterm.gen_linear(N=500,D=10)
testX, testy, beta2 = midterm.gen_linear(N=500,D=10, beta=torch.squeeze(true_beta))
print(true_beta-beta2)

## Fit linear regression with analytic solution
Here, you need to fill out the ``train_analytic`` and ``predict`` functions in ``midterm.py``.

In [None]:
model_a = midterm.LinearRegress(trainX,trainy)
model_a.train_analytic()
print('Training Mean squared error (MSE):', (model_a.beta-true_beta).square().mean())
print('Training Mean squared prediction error (MSPE):', (model_a.yhat_train-trainy).square().mean())
testy_hat = model_a.predict(testX)
print('Test Mean squared prediction error (MSPE):', (testy_hat-testy).square().mean())
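
As a reference point for the analytic fit, the least-squares estimate solves the normal equations $X^\top X \beta = X^\top y$. The snippet below is a minimal standalone sketch on hypothetical toy data. It does not reproduce the ``midterm.LinearRegress`` class, and the intercept column, noise level, and variable names are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: N samples, D features, plus an intercept column of ones.
N, D = 200, 3
X = torch.cat([torch.ones(N, 1), torch.randn(N, D)], dim=1)   # shape (N, D + 1)
beta_true = torch.tensor([[1.0], [2.0], [-1.0], [0.5]])        # shape (D + 1, 1)
y = X @ beta_true + 0.1 * torch.randn(N, 1)

# Solve (X^T X) beta = X^T y directly instead of forming an explicit inverse.
beta_hat = torch.linalg.solve(X.T @ X, X.T @ y)

# Same two diagnostics as in the notebook cell above.
mse_beta = (beta_hat - beta_true).square().mean()
mspe_train = (X @ beta_hat - y).square().mean()
print(mse_beta.item(), mspe_train.item())
```

Using ``torch.linalg.solve`` rather than ``torch.linalg.inv`` is a design choice here: it is usually more accurate numerically, though either approach gives the same estimate on well-conditioned data.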


## Ridge with analytic solution
There is no additional coding required. We only need to change $\lambda$ through the ``lam`` argument.

In [None]:
model_a.train_analytic(lam=.5)
print('Training Mean squared error (MSE):', (model_a.beta-true_beta).square().mean())
print('Training Mean squared prediction error (MSPE):', (model_a.yhat_train-trainy).square().mean())
testy_hat = model_a.predict(testX)
print('Test Mean squared prediction error (MSPE):', (testy_hat-testy).square().mean())
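
To see why changing $\lambda$ is enough, here is a hedged sketch of a closed-form ridge fit. The function name ``ridge_fit`` is hypothetical, and the sketch assumes the penalty is $\lambda\|\beta\|^2$ applied to every coefficient without scaling by $N$. The convention in your ``midterm.py`` may differ.

```python
import torch

def ridge_fit(X, y, lam=0.0):
    """Closed-form ridge estimate; lam=0 reduces to ordinary least squares."""
    # Assumption: penalty lam * ||beta||^2 on all coefficients, not scaled by N.
    eye = torch.eye(X.shape[1], dtype=X.dtype)
    return torch.linalg.solve(X.T @ X + lam * eye, X.T @ y)

torch.manual_seed(0)
X = torch.randn(100, 5)
y = X @ torch.ones(5, 1) + 0.5 * torch.randn(100, 1)

beta_ols = ridge_fit(X, y, lam=0.0)
beta_ridge = ridge_fit(X, y, lam=50.0)

# Ridge shrinks the coefficients and cannot lower the training error below OLS.
print(beta_ols.norm().item(), beta_ridge.norm().item())
print((X @ beta_ols - y).square().mean().item(),
      (X @ beta_ridge - y).square().mean().item())
```

A larger ``lam`` pulls the coefficients toward zero, which increases the training MSPE but can help on test data when the unpenalized fit is unstable.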


## Using gradient descent
Let's use the MSPE as our loss function,
$$ L(y,\beta) =\frac{1}{N} \sum_{i=1}^{N} \left(y_i-({\beta}_0 + {\beta}_1 x_{i1}+ \ldots + {\beta}_D x_{iD})\right)^2, $$
because we do not know the true ``beta`` in a real application.

In matrix form (after a proper modification, such as prepending a column of ones to $X$ so that it absorbs the intercept $\beta_0$),
$$ L(y,\beta) = \frac{1}{N}\|y-X\beta\|^2 $$
Write your gradient descent code in ``train_gd``. Don't forget the factor $\frac{1}{N}$: without it, the gradient is $N$ times larger and you would need a much smaller learning rate.
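
For reference, differentiating the matrix-form loss with respect to $\beta$ gives the gradient and the resulting update. Here $\eta$ is a symbol introduced for the learning rate ``lr``, and the superscript $(t)$ counts iterations:
$$ \nabla_\beta L(y,\beta) = -\frac{2}{N} X^\top (y - X\beta), \qquad \beta^{(t+1)} = \beta^{(t)} - \eta \, \nabla_\beta L(y,\beta^{(t)}). $$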

In the code below, you can modify ``lr`` and ``niter`` as you want. 

In [None]:
model_gd = midterm.LinearRegress(trainX,trainy)

lr = 1e-2
niter = 2000
model_gd.train_gd(niter = niter, lr=lr)

print('Training Mean squared error (MSE):', (model_gd.beta-true_beta).square().mean())
print('Training Mean squared prediction error (MSPE):', (model_gd.yhat_train-trainy).square().mean())
testy_hat = model_gd.predict(testX)
print('Test Mean squared prediction error (MSPE):', (testy_hat-testy).square().mean())

# plot the loss history
plt.plot(model_gd.loss_history, 'o')
plt.xlabel('Iteration')
plt.ylabel('training loss')
plt.title('Training Loss history')
plt.show()
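
The gradient descent loop can be sketched on its own as follows. This is an illustrative version on hypothetical toy data, not the expected content of ``train_gd``. The function name ``gd_linear``, the zero initialization, and the omission of an intercept are assumptions.

```python
import torch

def gd_linear(X, y, lr=1e-2, niter=2000):
    """Plain gradient descent on L(beta) = ||y - X beta||^2 / N."""
    N, D = X.shape
    beta = torch.zeros(D, 1, dtype=X.dtype)        # assumed starting point
    loss_history = []
    for _ in range(niter):
        resid = y - X @ beta                        # shape (N, 1)
        loss_history.append(resid.square().mean().item())
        grad = -2.0 / N * (X.T @ resid)             # gradient of the mean squared error
        beta = beta - lr * grad
    return beta, loss_history

torch.manual_seed(0)
X = torch.randn(200, 3)
beta_true = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ beta_true + 0.1 * torch.randn(200, 1)

beta_gd, history = gd_linear(X, y, lr=1e-2, niter=2000)
beta_exact = torch.linalg.solve(X.T @ X, X.T @ y)  # analytic solution for comparison
print((beta_gd - beta_exact).abs().max().item(), history[0], history[-1])
```

With enough iterations and a stable learning rate, the gradient descent estimate should approach the analytic one. If the loss history grows instead of shrinking, the learning rate is likely too large for the scale of the gradient.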


## Adam
Now let's use Adam as the optimizer. Fill out ``train_adam``. In your code, you can add ``t=t+1`` at the beginning of each iteration so that the iteration number starts from 1 instead of 0.
- Try different values of ``lr`` to see how sensitive the result is.
- Try different values of ``niter`` to see how fast it converges.
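
For reference, the standard Adam update for a gradient $g_t = \nabla_\beta L(y,\beta^{(t-1)})$ can be written as below. To avoid a clash with the regression coefficients, the decay rates ``beta1`` and ``beta2`` are written here as $\rho_1$ and $\rho_2$. The symbols $\eta$ (for ``lr``) and $\epsilon$ (a small stabilizing constant) are also introduced for this sketch, and squares and square roots are taken elementwise.
$$ \begin{aligned} m_t &= \rho_1 m_{t-1} + (1-\rho_1)\, g_t, & v_t &= \rho_2 v_{t-1} + (1-\rho_2)\, g_t^2, \\ \hat m_t &= \frac{m_t}{1-\rho_1^{\,t}}, & \hat v_t &= \frac{v_t}{1-\rho_2^{\,t}}, \\ \beta^{(t)} &= \beta^{(t-1)} - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}. \end{aligned} $$
The bias-correction denominators explain why the iteration number should start at 1: at $t=0$ both $1-\rho_1^{\,0}$ and $1-\rho_2^{\,0}$ equal zero, so the corrected moments would be undefined.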

In [None]:
model_adam = midterm.LinearRegress(trainX,trainy)

lr = 1
niter = 200
model_adam.train_adam(niter = niter, lr=lr, beta1=0.9, beta2=0.999)

print('Training Mean squared error (MSE):', (model_adam.beta-true_beta).square().mean())
print('Training Mean squared prediction error (MSPE):', (model_adam.yhat_train-trainy).square().mean())
testy_hat = model_adam.predict(testX)
print('Test Mean squared prediction error (MSPE):', (testy_hat-testy).square().mean())

# plot the loss history
plt.plot(model_adam.loss_history, 'o')
plt.xlabel('Iteration')
plt.ylabel('training loss')
plt.title('Training Loss history')
plt.show()
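
A standalone sketch of an Adam loop for the same loss is shown below. It is illustrative only and is not the expected content of ``train_adam``. The function name ``adam_linear``, the zero initialization, the default ``eps``, and the toy data are assumptions.

```python
import torch

def adam_linear(X, y, lr=0.1, niter=500, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on L(beta) = ||y - X beta||^2 / N, written without torch.optim."""
    N, D = X.shape
    beta = torch.zeros(D, 1, dtype=X.dtype)   # assumed starting point
    m = torch.zeros_like(beta)                # running mean of gradients
    v = torch.zeros_like(beta)                # running mean of squared gradients
    loss_history = []
    for step in range(niter):
        t = step + 1                          # start at 1 so bias correction is defined
        resid = y - X @ beta
        loss_history.append(resid.square().mean().item())
        grad = -2.0 / N * (X.T @ resid)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad.square()
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        beta = beta - lr * m_hat / (v_hat.sqrt() + eps)
    return beta, loss_history

torch.manual_seed(0)
X = torch.randn(200, 3)
beta_true = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ beta_true + 0.1 * torch.randn(200, 1)

beta_adam, history = adam_linear(X, y)
beta_exact = torch.linalg.solve(X.T @ X, X.T @ y)  # analytic solution for comparison
print((beta_adam - beta_exact).abs().max().item(), history[0], history[-1])
```

Because each step is divided by the running root-mean-square of the gradient, the size of an Adam step depends mostly on ``lr`` rather than on the raw gradient scale, which is one reason its sensitivity to ``lr`` differs from that of plain gradient descent.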


### Compare test MSPE 

In [None]:
testy_hat = model_a.predict(testX)
print('Analytic-Test Mean squared prediction error (MSPE):', (testy_hat-testy).square().mean())

testy_hat = model_gd.predict(testX)
print('GD-Test Mean squared prediction error (MSPE):', (testy_hat-testy).square().mean())

testy_hat = model_adam.predict(testX)
print('adam-Test Mean squared prediction error (MSPE):', (testy_hat-testy).square().mean())

# Explain the two optimization algorithms by comparing them

## Fill out your answer below. A paragraph would be fine!
answer: 