# Project 1 Report - Nick Vega

## Task 1A : Linear Regression

In this task, my group implemented linear regression with gradient descent optimization from scratch. 

The dataset used was a synthetic dataset that was provided. 

In this task, my group learned how to
- implement linear regression from scratch (no ML libraries)
- implement loss functions and their gradients
- implement and understand gradient descent optimization
- implement visualizations and analyze model performance

Before implementing linear regression with gradient descent, we were given a problem setup that included:
- necessary dependencies
- utility functions (data generation and visualization)

The generated data provided and used in this task is shown below.

![Generated Data](generated_data.png)

### Task 1A.1: Modeling

**TASK** 

In linear regression, we must find coefficients $\hat{\mathbf{w}}$ such that the residual between $\hat{y} = \hat{\mathbf{w}}^\top \tilde{\mathbf{x}}$ and $y$ is minimized.

To find the optimal coefficients $\hat{\mathbf{w}}$ that minimize the residuals, we need to implement gradient descent optimization. Before doing so, we must first implement a loss function that returns the regularized empirical risk and its gradient with respect to $\mathbf{w}$, so that both can be evaluated at each iteration of gradient descent.

In order to do this, we start with a ridge regression risk function, defined as 

$$ R(\mathbf{w}) = \mathbb{E}\left[(y - \mathbf{w}^\top \mathbf{x})^2\right] + \lambda \mathbf{w}^\top \mathbf{w}$$

Because the expectation cannot be computed directly, the ridge regression risk is approximated by the empirical risk over the $n$ training samples:

$$\hat{R}_{\text{ridge}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$$

In this task, we were instructed to construct a custom function that returns the empirical risk and its gradient at parameter $\mathbf{w}$.

**SOLUTION**

We were tasked with constructing a custom function that returns the empirical risk and its gradient at parameter $\mathbf{w}$.

We defined a function named `lossFunction` that takes NumPy arrays `w`, `X`, and `y` and a regularization parameter `lambda`, and returns the ridge regression empirical risk and its gradient.

In the function, the code initializes `regLoss` to 0 and creates a gradient vector of zeros with the same shape as `w`.

Then, the code computes 

$$\hat{R}_{\text{ridge}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$$

where n is the number of rows in X. 
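
As a bridge to the matrix notation used for the gradient, the same quantity can be written compactly. This assumes (the notation is not spelled out above) that $\mathbf{X}$ stacks the $\mathbf{x}_i^\top$ as rows and $\mathbf{y}$ stacks the $y_i$:

$$\hat{R}_{\text{ridge}}(\mathbf{w}) = \frac{1}{n}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda \mathbf{w}^\top \mathbf{w}$$

Differentiating each term with respect to $\mathbf{w}$ gives

$$\nabla_{\mathbf{w}}\left[\frac{1}{n}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})\right] = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}), \qquad \nabla_{\mathbf{w}}\left[\lambda \mathbf{w}^\top \mathbf{w}\right] = 2\lambda\mathbf{w}$$

and the gradient of the empirical risk is the sum of these two terms.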

The gradient, i.e. the derivative of the empirical risk with respect to $\mathbf{w}$, is then computed in the code using the following formula:

$$\nabla \hat{R}_{\text{ridge}}(\mathbf{w}) = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda\mathbf{w}$$

where n is the number of rows in X.
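
The two formulas above could be sketched in NumPy roughly as follows. This is an illustration rather than the submitted code: it assumes `X` has shape `(n, d)`, `y` has shape `(n, 1)`, and `w` has shape `(d, 1)`, names the regularization argument `lam` because `lambda` is a reserved word in Python, and returns the loss as a plain float.

```python
import numpy as np


def lossFunction(w, X, y, lam):
    """Sketch of the ridge empirical risk and its gradient at w.

    Assumed shapes (not stated in the report): X is (n, d),
    y is (n, 1), w is (d, 1). `lam` is the regularization strength.
    """
    n = X.shape[0]                      # number of rows (samples) in X
    residual = y - X @ w                # y - Xw, shape (n, 1)

    # (1/n) * sum of squared residuals + lambda * w^T w
    regLoss = float(np.sum(residual ** 2) / n + lam * np.sum(w ** 2))

    # -(2/n) * X^T (y - Xw) + 2 * lambda * w
    grad = -(2.0 / n) * (X.T @ residual) + 2.0 * lam * w
    return regLoss, grad
```

Writing both quantities in terms of the shared `residual` vector avoids computing $\mathbf{X}\mathbf{w}$ twice per call.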


**IMPLEMENTATION** 

Given the generated data, the loss function is then used to compute and display the initial loss and gradient for regularized linear regression. 

The result is the following: 

Loss at initial w (zeros): [[8.12449231]]

### Task 1A.2: Training Your Ridge Regressor: Gradient Descent 

**TASK**

As mentioned in Task 1A.1, we must find coefficients $\hat{\mathbf{w}}$ such that the residual between $\hat{y} = \hat{\mathbf{w}}^\top \tilde{\mathbf{x}}$ and $y$ is minimized. To find these coefficients, we implement gradient descent optimization from scratch.

In Task 1A.1, we constructed a loss function that computes the empirical risk and its gradient at parameter $\mathbf{w}$.

In the current task, we train our ridge regressor using gradient descent.

The parameters $\hat{\mathbf{w}}$ can be updated via a gradient descent rule: 

$$ \hat{\mathbf{w}}_{t+1} \gets \hat{\mathbf{w}}_t - \eta_t \left.\frac{\partial \hat{R}}{\partial \mathbf{w}} \right|_{\mathbf{w}=\hat{\mathbf{w}}_t},$$

where $\eta_t$ is a parameter of the algorithm, $t$ is the iteration index, and $\frac{\partial \hat{R}}{\partial \mathbf{w}}$ is the gradient of the empirical risk function w.r.t. $\mathbf{w}$.

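We keep $\eta_t$ constant. The computational complexity of gradient descent is $O(n_{\text{iter}} \cdot n d)$ for $n$ samples and $d$ features, since each iteration computes $\mathbf{X}\mathbf{w}$ and $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$, each at a cost of $O(nd)$.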

In this task, we write a custom gradient descent function that returns an array of empirical risk values, one for each iteration, as well as the final model parameters.



**SOLUTION**

We were tasked with writing a custom gradient descent function that returns an array of empirical risk values, one for each iteration, as well as the final model parameters.

We defined a gradient descent function that takes NumPy arrays `X`, `y`, and `w`, floats `eta`, `lambda`, and `tolerance`, and an integer maximum number of iterations. It returns a tuple containing the final coefficients (a NumPy array) and the loss history (a list).

`tolerance` specifies the stopping condition: the gradient descent algorithm terminates once the observed loss values converge (i.e. two consecutive losses differ by at most `tolerance`).

In the function, the code initializes an empty `Loss_history` list and sets `prev_loss`, the loss from the previous iteration of gradient descent, to infinity.

To initialize gradient descent, the function calls the `lossFunction` function from Task 1A.1 and appends the resulting loss to the list.

The function then enters a while loop that keeps running as long as the difference between the loss in the current iteration and the loss in the previous iteration is greater than the tolerance, and terminates once that difference is at most the tolerance.

In each iteration of the loop, the gradient descent rule

$$ \hat{\mathbf{w}}_{t+1} \gets \hat{\mathbf{w}}_t - \eta_t \left.\frac{\partial \hat{R}}{\partial \mathbf{w}} \right|_{\mathbf{w}=\hat{\mathbf{w}}_t},$$

is applied, using the gradient returned by `lossFunction`.

`prev_loss` is then set to the current loss.

`lossFunction` from Task 1A.1 is then called again with the updated `w`, and the new loss is appended to the `Loss_history` list.

The function also terminates if the specified maximum number of iterations is reached.

Once terminated, the function then returns the current w and Loss_history list.
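
The steps above could be sketched as follows. This is an illustration under stated assumptions, not the submitted code: the function name `gradientDescent`, the argument name `lam`, the use of an absolute difference in the stopping test, and the toy data at the bottom are all hypothetical, and a compact `lossFunction` is repeated so the block runs on its own.

```python
import numpy as np


def lossFunction(w, X, y, lam):
    """Ridge empirical risk (float) and its gradient; assumed shapes (n,d), (n,1), (d,1)."""
    n = X.shape[0]
    residual = y - X @ w
    regLoss = float(np.sum(residual ** 2) / n + lam * np.sum(w ** 2))
    grad = -(2.0 / n) * (X.T @ residual) + 2.0 * lam * w
    return regLoss, grad


def gradientDescent(X, y, w, eta, lam, tolerance, max_iter):
    """Sketch of constant-step gradient descent; returns (w, Loss_history)."""
    Loss_history = []
    prev_loss = np.inf                       # no previous loss before the first step

    # Loss and gradient at the starting point.
    loss, grad = lossFunction(w, X, y, lam)
    Loss_history.append(loss)

    n_iter = 0
    # Continue while consecutive losses differ by more than `tolerance`
    # and the iteration budget has not been used up.
    while abs(prev_loss - loss) > tolerance and n_iter < max_iter:
        w = w - eta * grad                   # gradient descent update
        prev_loss = loss
        loss, grad = lossFunction(w, X, y, lam)
        Loss_history.append(loss)
        n_iter += 1

    return w, Loss_history


# Hypothetical toy data: y = 1 + 2x, with a leading column of ones for the bias.
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([[1.0], [3.0], [5.0]])
w_toy, history_toy = gradientDescent(
    X_toy, y_toy, np.zeros((2, 1)), eta=0.1, lam=0.0, tolerance=1e-6, max_iter=10_000
)
print(w_toy.ravel(), len(history_toy))
```

Starting `prev_loss` at infinity guarantees that the loop body runs at least once, and `Loss_history` ends up with one entry for the starting point plus one per update.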

**IMPLEMENTATION** 

The gradient descent function we constructed is called on the generated data (`X = X_train`, `y = y_train`, and `w = initial_w`), with `eta = 0.1`, `tolerance = 1e-6`, and `max_iter = 1e4`.

The result of performing gradient descent using the customized gradient descent function is: 

The regularized w using ridge regression:
 [[1.8996459 ]
 [0.64674945]]

![Loss Function Using GD](loss_function_gd.png)

From the graph, the loss reaches its approximate minimum after roughly 10 to 15 iterations.

### Task 1A.3: Test Module

### Task 1A.4: True Risk [Bonus]