# Project 1 Report - Nick Vega


## Task 1A : Linear Regression

In this task, my group implemented linear regression with gradient descent optimization from scratch. 

The dataset used was a synthetic dataset that was provided. 

In this task, my group learned how to
- implement linear regression from scratch (no ML libraries)
- implement loss functions and their gradient
- implement and understand gradient descent optimization
- implement visualizations and analyze model performance

Before we implemented linear regression with gradient descent optimization from scratch, we were provided with a given problem setup that included 
- necessary dependencies
- utility functions (data generation and visualization)

The generated data provided for this task is shown below.

![Generated Data](/workspaces/regression-and-model-selection/images/generated_data.png)

### Task 1A.1: Modeling

**TASK** 

In linear regression, we must find coefficients $\hat{\mathbf{w}}$ such that the residual between $\hat{y} = \hat{\mathbf{w}}^\top \tilde{\mathbf{x}}$ and $y$ is minimized.

To find the optimal coefficients $\hat{\mathbf{w}}$ that minimize the residuals, we will implement gradient descent optimization. Before doing so, we must first implement a loss function that computes the regularized empirical risk and its gradient with respect to $\mathbf{w}$, since both are needed at every iteration of gradient descent.

In order to do this, we start with a ridge regression risk function, defined as 

$$ R(\mathbf{w}) = \mathbb{E}[(y-\mathbf{w}^\top \mathbf{x})^2] + \lambda \mathbf{w}^\top \mathbf{w}$$

The ridge regression risk function will need to be approximated by the empirical risk. The result is:

$$\hat{R}_{\text{ridge}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$$

In this task, we were instructed to construct a customized function that returns the empirical risk and its gradient at parameter w.

**SOLUTION**

We are tasked with constructing a customized function that returns the empirical risk and its gradient at parameter w.

We defined a function named lossFunction that takes numpy arrays w, X, and y and a regularization parameter lambda, and returns the ridge regression empirical risk and its gradient.

In the function, the code initializes regLoss to 0 and creates a gradient vector of zeros with the same shape as w.

Then, the code computes 

$$\hat{R}_{\text{ridge}}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$$

where n is the number of rows in X. 

The gradient, i.e. the derivative of the empirical risk with respect to w, is then calculated in the code using the following formula:

$$\nabla \hat{R}_{\text{ridge}}(\mathbf{w}) = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda\mathbf{w}$$

where n is the number of rows in X.
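
As a rough illustration of the two formulas above, the following NumPy sketch computes the empirical ridge risk and its gradient. The function name `ridge_loss_and_gradient`, the argument order, the column-vector shapes, and the demo data are assumptions made for this example and may differ from the actual lossFunction in the notebook.

```python
import numpy as np


def ridge_loss_and_gradient(w, X, y, lam):
    """Empirical ridge risk and its gradient at w.

    Assumed shapes (hypothetical): w is (d, 1), X is (n, d), y is (n, 1).
    """
    n = X.shape[0]
    residual = y - X @ w                                   # (n, 1)
    # (1/n) * sum of squared residuals + lambda * w^T w, kept as a (1, 1) array
    loss = (residual.T @ residual) / n + lam * (w.T @ w)
    # -(2/n) * X^T (y - Xw) + 2 * lambda * w
    grad = -(2.0 / n) * (X.T @ residual) + 2.0 * lam * w   # (d, 1)
    return loss, grad


# Illustrative data only (not the dataset used in the report).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = rng.normal(size=(20, 1))
w_demo = np.zeros((2, 1))
loss_demo, grad_demo = ridge_loss_and_gradient(w_demo, X_demo, y_demo, lam=0.1)
```

At w = 0 the regularization term vanishes, so the loss reduces to the mean of the squared targets, which gives a quick sanity check for an implementation like this.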


**IMPLEMENTATION** 

Given the generated data, the loss function is then used to compute and display the initial loss and gradient for regularized linear regression. 

The result is the following: 

Loss at initial w (zeros): [[8.12449231]]

### Task 1A.2: Training Your Ridge Regressor: Gradient Descent 

**TASK**

As mentioned in Task 1A.1, we must find coefficients $\hat{\mathbf{w}}$ such that the residual between $\hat{y} = \hat{\mathbf{w}}^\top \tilde{\mathbf{x}}$ and $y$ is minimized. To find these coefficients, we will implement gradient descent optimization from scratch.

In Task 1A.1, we constructed a loss function that computes the empirical risk and its gradient at parameter w.

Now, in the current task, we will train our ridge regression by implementing gradient descent optimization from scratch. 

The parameters $\hat{\mathbf{w}}$ can be updated via a gradient descent rule: 

$$ \hat{\mathbf{w}}_{t+1} \gets \hat{\mathbf{w}}_t - \eta_t \left.\frac{\partial \hat{R}}{\partial \mathbf{w}} \right|_{\mathbf{w}=\hat{\mathbf{w}}_t},$$

where $\eta_t$ is a parameter of the algorithm, $t$ is the iteration index, and $\frac{\partial \hat{R}}{\partial \mathbf{w}}$ is the gradient of the empirical risk function w.r.t. $\mathbf{w}$.

We will keep $\eta_t$ constant. The computational complexity of gradient descent is $O(n_{\text{iter}} \cdot  n d)$. 

In this task, we will write a customized gradient descent optimization function that returns an array of empirical risk values, one for each iteration, as well as the final model parameter.



**SOLUTION**

We are tasked with writing a customized gradient descent optimization function that returns an array of empirical risk values, one for each iteration, as well as the final model parameter.

We defined a gradient descent function that takes numpy arrays for X, y, and w, floats for eta, lambda, and tolerance, and an integer for the maximum number of iterations. The function returns a tuple containing the final coefficients (a numpy array) and the loss history (a list).

`tolerance` specifies the stopping condition: the gradient descent algorithm terminates when the observed loss values converge (i.e. two consecutive losses differ by at most `tolerance`).

In the function, the code initializes an empty Loss_history list and sets prev_loss, the loss from the previous iteration of gradient descent, to infinity.

To initialize gradient descent, the function calls the lossFunction function from Task 1A.1 and appends the loss to the list.

The function then enters a while loop that continues as long as the difference between the current loss and the previous loss is greater than the tolerance.

In the loop, for each iteration, the gradient descent rule

$$ \hat{\mathbf{w}}_{t+1} \gets \hat{\mathbf{w}}_t - \eta_t \left.\frac{\partial \hat{R}}{\partial \mathbf{w}} \right|_{\mathbf{w}=\hat{\mathbf{w}}_t},$$

is applied in code, with the gradient supplied by the lossFunction function.

The prev_loss is then set to the current loss.

The lossFunction function from task 1A.1 is then called, and the loss is appended to the Loss_history list. 

The function will also terminate if the specified maximum number of iterations is reached.

Once terminated, the function then returns the current w and Loss_history list.
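
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the names `ridge_loss_and_gradient` and `gradient_descent`, the argument order, the scalar loss, and the demo data are all assumptions made for the example.

```python
import numpy as np


def ridge_loss_and_gradient(w, X, y, lam):
    # Hypothetical stand-in for lossFunction: scalar ridge risk and its gradient.
    n = X.shape[0]
    residual = y - X @ w
    loss = np.sum(residual ** 2) / n + lam * np.sum(w ** 2)
    grad = -(2.0 / n) * (X.T @ residual) + 2.0 * lam * w
    return loss, grad


def gradient_descent(X, y, w, eta, lam, tolerance, max_iter):
    """Constant-step gradient descent; returns final w and the loss history."""
    loss, grad = ridge_loss_and_gradient(w, X, y, lam)
    loss_history = [loss]
    prev_loss = np.inf
    n_iter = 0
    # Continue while consecutive losses differ by more than the tolerance
    # and the iteration budget has not been used up.
    while abs(prev_loss - loss) > tolerance and n_iter < max_iter:
        w = w - eta * grad                     # gradient descent update
        prev_loss = loss
        loss, grad = ridge_loss_and_gradient(w, X, y, lam)
        loss_history.append(loss)
        n_iter += 1
    return w, loss_history


# Illustrative data only: one standard normal feature plus a bias column.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X_demo = np.column_stack([x0, np.ones(200)])
w_gen = np.array([[2.0], [-2.0]])              # illustrative generating weights
y_demo = X_demo @ w_gen + 0.6 * rng.normal(size=(200, 1))
w_hat, history = gradient_descent(
    X_demo, y_demo, np.zeros((2, 1)),
    eta=0.1, lam=0.0, tolerance=1e-10, max_iter=10_000,
)
```

With lambda set to 0 in this demo, the result can be compared against an ordinary least-squares solve, and the recorded losses should never increase for a sufficiently small step size.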

**IMPLEMENTATION** 

The gradient descent function we constructed is called with the generated data: X = X_train, y = y_train, and w = initial_w.

The hyperparameters are eta = 0.1, tolerance = 1e-6, and max_iter = 1e4.

The result of performing gradient descent using the customized gradient descent function is: 

The regularized w using ridge regression:
 [[1.8996459 ]
 [0.64674945]]

![Loss Function Using GD](/workspaces/regression-and-model-selection/images/loss_function_gd.png)

From the graph, loss(w) flattens out after roughly 10-15 iterations, indicating that gradient descent has converged.

### Task 1A.3: Test Module

**TASK**

In this task, we will evaluate the model we have constructed by plotting the predicted values on the test data, together with the training points. The code was implemented for us.

**IMPLEMENTATION**

![Ridge Regression](/workspaces/regression-and-model-selection/images/ridge_regression.png)

From the graph, the ridge regression fit appears to minimize the residuals on the generated data effectively. Generating the fit involved two steps: constructing a customized loss function to compute the empirical risk and its gradient, and constructing a customized gradient descent function that calls this loss function and applies the gradient descent rule to compute the optimal parameter w.

### Task 1A.4: True Risk [Bonus]

In this task, we will compute the expected error (true risk) of our model and then answer questions related to model selection based on our observations.

**TASK 1A.4.1**



In this task, we will compute and compare the expected error and estimated error of the trained model. 

The true risk for our linear regression model can be directly computed since we know:
- Input is $\mathbf{x} = [x_0, 1]^\top$, where $x_0 \sim \mathcal{N}(0,1)$ is sampled from the standard normal distribution and the second entry is a constant bias term
- True parameter is $\mathbf{w}_\text{true} = [2, -2]^T$
- Noise follows $\epsilon \sim \mathcal{N}(0, \sigma^2)$ where $\sigma = 0.6$

For a learned parameter $\hat{\mathbf{w}}$, the true error given MSE as the loss function is:

\begin{align}
E(\hat{\mathbf{w}}) &= \mathbb{E}_{\mathbf{x},\epsilon}[(\mathbf{w}_\text{true}^\top \mathbf{x} + \epsilon - \hat{\mathbf{w}}^\top \mathbf{x})^2] \\
&= \mathbb{E}_\mathbf{x}[(\mathbf{w}_\text{true}^\top \mathbf{x} - \hat{\mathbf{w}}^\top \mathbf{x})^2] + \mathbb{E}[\epsilon^2] \\
&= (\mathbf{w}_\text{true} - \hat{\mathbf{w}})^\top \mathbb{E}[\mathbf{x}\mathbf{x}^\top](\mathbf{w}_\text{true} - \hat{\mathbf{w}}) + \sigma^2 \\
&= (\mathbf{w}_\text{true} - \hat{\mathbf{w}})^\top I(\mathbf{w}_\text{true} - \hat{\mathbf{w}}) + \sigma^2 \\
&= \|\mathbf{w}_\text{true} - \hat{\mathbf{w}}\|_2^2 + \sigma^2
\end{align}


where we used:
- $\mathbb{E}[\mathbf{x}\mathbf{x}^T] = I$ (for $x_0 \sim \mathcal{N}(0,1)$)
- $\mathbb{E}[\epsilon^2] = \sigma^2$
- Independence of $\mathbf{x}$ and $\epsilon$
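
The closed-form result above can be checked numerically with a small Monte Carlo simulation. The sketch below draws fresh samples from the stated data model and compares the average squared error with $\|\mathbf{w}_\text{true} - \hat{\mathbf{w}}\|_2^2 + \sigma^2$. The value of `w_hat` and the sample size are hypothetical choices made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = np.array([2.0, -2.0])     # true parameter from the setup above
w_hat = np.array([1.5, -1.0])      # hypothetical learned parameter
sigma = 0.6
n_samples = 200_000                # hypothetical Monte Carlo sample size

# x = [x_0, 1]^T with x_0 ~ N(0, 1), and y = w_true^T x + eps
x0 = rng.normal(size=n_samples)
X = np.column_stack([x0, np.ones(n_samples)])
y = X @ w_true + rng.normal(scale=sigma, size=n_samples)

# Monte Carlo estimate of the expected squared error of w_hat
mc_risk = np.mean((y - X @ w_hat) ** 2)

# Closed-form true risk: ||w_true - w_hat||^2 + sigma^2 (1.61 for these values)
closed_form_risk = np.sum((w_true - w_hat) ** 2) + sigma ** 2
```

The two quantities should agree up to sampling noise, which shrinks as the number of samples grows.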

**K-fold Cross-validation:**

We can compare this theoretical expected error with estimates from $k$-fold cross-validation:

$$\hat{E}(\hat{\mathbf{w}}) = \frac{1}{k} \sum_{i=1}^k \left[ \frac{1}{|V_i|} \sum_{(\mathbf{x},y)\in V_i} (y - \hat{\mathbf{w}}_i^\top \mathbf{x})^2 \right]$$

where
- $k$ is the number of folds
- $V_i$ is the validation set ($i$-th fold)
- $\hat{\mathbf{w}}_i$ is the model trained on all folds except $V_i$
- $|V_i|$ is the size of the validation fold
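
To make the estimator concrete, here is a minimal k-fold sketch. It is not the provided course function: the names `fit_ridge_closed_form` and `k_fold_cv_mse` are hypothetical, and each fold is fit with the closed-form ridge solution rather than gradient descent, purely to keep the example short.

```python
import numpy as np


def fit_ridge_closed_form(X, y, lam):
    # Minimizer of (1/n)||y - Xw||^2 + lam * ||w||^2 (swapped in for gradient descent).
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


def k_fold_cv_mse(X, y, lam, k=5, seed=0):
    """Mean and standard deviation of the validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)
    fold_mse = []
    for i in range(k):
        val_idx = folds[i]                                   # V_i
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w_i = fit_ridge_closed_form(X[train_idx], y[train_idx], lam)
        fold_mse.append(np.mean((y[val_idx] - X[val_idx] @ w_i) ** 2))
    return float(np.mean(fold_mse)), float(np.std(fold_mse))


# Illustrative data only, generated from the data model described above.
rng = np.random.default_rng(1)
x0 = rng.normal(size=500)
X_demo = np.column_stack([x0, np.ones(500)])
y_demo = X_demo @ np.array([2.0, -2.0]) + 0.6 * rng.normal(size=500)
cv_mean, cv_std = k_fold_cv_mse(X_demo, y_demo, lam=0.0, k=5)
```

For a well-fit model on data like this, the cross-validated MSE should land near the noise variance $\sigma^2$.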

**SOLUTION 1A.4.1**

The task is to calculate the true expected error. We are given a function called true_gen_error that takes numpy arrays w (learned polynomial coefficients) and w_true (true polynomial coefficients) and a float sigma (the standard deviation of the Gaussian noise) and returns the true expected error, or true risk.

The function is implemented as follows. An error of 0 is initialized.

Then, the true risk is computed using the following formula:

$$\|\mathbf{w}_\text{true} - \hat{\mathbf{w}}\|_2^2 + \sigma^2$$

The formula is computed with numpy functions: np.subtract(w_true, w) forms the difference, np.linalg.vector_norm takes its 2-norm, and the result is squared and then added to sigma squared.

The estimated expected error using k-fold cross-validation is implemented for us.
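
A compact version of this computation might look like the sketch below. It is an illustration under assumptions rather than the submitted code: it uses `np.linalg.norm` for the 2-norm so that it also runs on NumPy versions older than 2.0, and the example inputs are hypothetical.

```python
import numpy as np


def true_gen_error(w, w_true, sigma):
    """True risk ||w_true - w||_2^2 + sigma^2 for the linear model above."""
    diff = np.subtract(w_true, w)
    # For a 1-D vector or a single column, the default norm is the Euclidean 2-norm.
    return float(np.linalg.norm(diff) ** 2 + sigma ** 2)


# Hypothetical inputs for illustration only.
example_error = true_gen_error(np.array([1.5, -1.0]), np.array([2.0, -2.0]), 0.6)
```

When the learned and true parameters coincide, the function returns just the noise variance, which is the irreducible part of the error.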

**IMPLEMENTATION 1A.4.1**

The learned parameter $\hat{\mathbf{w}}$ is computed using the customized gradient descent function we constructed previously. The true risk is then computed using the true_gen_error function we implemented, with w set to the learned parameter $\hat{\mathbf{w}}$ from gradient descent, w_true set to an np.array of 2 and 0.6, and sigma set to 0.6. The 5-fold CV estimate is computed using the provided function.

The results are the following:

True expected error: 0.4319

5-fold CV estimate: 0.4056 (±0.1941)

## Task 1B: Model Selection for House Value Prediction

In this task, we will use the California housing dataset from the scikit-learn package to predict the house values in California districts given some summary stats about them based on the 1990 census data.

The dataset has 8 features: longitude, latitude, housing median age, total rooms, total bedrooms, population, households, and median income. The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($\$100,000$). We split the dataset into 80\% training data and 20\% testing data.

We were given scripts that load the California housing dataset into a pandas dataframe. The dataframe is shown below:

![California Housing Dataset](/workspaces/regression-and-model-selection/images/california_housing_dataset.png)


### Task 1B.1: Ridge Regression

**TASK**

We were given a script that generated the training and test data.

Task 1B.1 includes two subsections, *Training and Evaluation* and *Model Selection via k-fold Cross Validation*.

*Training and Evaluation*

In this subsection, we are tasked with writing a customized function to fit the ridge regression model on the training data and calculate the MSE on the test set.

The function should choose $\lambda$ from $\{10^{-10}, 10^{-6}, 10^{-4}, 10^{-2}, 10^{-1}, 1, 10, 20, 50, 100\}$, compute the estimate $\hat{\mathbf{y}}$ for different values $\lambda$, and plot the test MSE as a function of $\lambda$. 

*Model Selection via k-fold Cross Validation*

In this subsection, we are tasked with implementing 10-fold cross validation on the training set to select lambda. 

**SOLUTION**

*Training and Evaluation*

In this subsection, we are tasked with writing a customized function to fit the ridge regression model on the training data and calculate the MSE on the test set.

The function can be described as follows. The customized function is named train_and_eval; it takes numpy arrays X_train, y_train, X_eval, and y_eval and a lambda value, and returns the mean squared error.

Initially, the mse is set to 0, eta is set to 0.001, and tolerance is set to 1e-4. Then, the customized gradient descent function is called using X_train, y_train, initial_w, eta, a lambda value, and tolerance. The optimized weight computed from the gradient descent function is used to make predictions on the evaluation set as follows:

$$\hat{y} = \mathbf{X}\mathbf{w}$$

where X is X_eval and w is the optimized weight from gradient descent.

The mean squared error is then calculated using the following formula:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$

The function then returns the calculated mse.
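
The fit-predict-score flow described above can be sketched as follows. This is an assumption-laden illustration: the `fit_fn` argument and the helper `fit_ridge_closed_form` are hypothetical stand-ins for the gradient descent call, and the demo data is synthetic rather than the California housing data.

```python
import numpy as np


def fit_ridge_closed_form(X, y, lam):
    # Hypothetical stand-in for the gradient descent fit:
    # minimizer of (1/n)||y - Xw||^2 + lam * ||w||^2.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


def train_and_eval(X_train, y_train, X_eval, y_eval, lam, fit_fn=fit_ridge_closed_form):
    """Fit on the training split, predict on the evaluation split, return the MSE."""
    w = fit_fn(X_train, y_train, lam)
    y_hat = X_eval @ w                          # predictions on the evaluation set
    return float(np.mean((y_eval - y_hat) ** 2))


# Synthetic demo data with an 80/20 split (illustration only).
rng = np.random.default_rng(0)
X_all = np.column_stack([rng.normal(size=(500, 3)), np.ones(500)])
y_all = X_all @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.5 * rng.normal(size=500)
X_tr, X_te, y_tr, y_te = X_all[:400], X_all[400:], y_all[:400], y_all[400:]

# Lambda grid from the task statement.
lambdas = [1e-10, 1e-6, 1e-4, 1e-2, 1e-1, 1, 10, 20, 50, 100]
test_mse = [train_and_eval(X_tr, y_tr, X_te, y_te, lam) for lam in lambdas]
```

Passing the fitting routine as an argument keeps the evaluation logic independent of how the weights are obtained, so the same function could wrap a gradient descent fit instead.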

*Model Selection via k-fold Cross Validation*

In this subsection, we are tasked with implementing 10-fold cross validation on the training set to select lambda. 

The function can be described as follows. The customized function is named cross_validation and takes numpy arrays X_train and y_train.

**IMPLEMENTATION**

*Training and Evaluation*

The customized train_and_eval function is called for each lambda value in the list $\{10^{-10}, 10^{-6}, 10^{-4}, 10^{-2}, 10^{-1}, 1, 10, 20, 50, 100\}$ with the given X_train, y_train, X_test, and y_test. The mse from each lambda is then put into an array.

The resulting array of mse values, one row per lambda, is:

| Test MSE | $\lambda$ |
|---|---|
| 4.93873158e+00 | 1.00000000e-10 |
| 4.93873173e+00 | 1.00000000e-06 |
| 4.93874647e+00 | 1.00000000e-04 |
| 4.94171940e+00 | 1.00000000e-02 |
| 4.96865133e+00 | 1.00000000e-01 |
| 5.17883497e+00 | 1.00000000e+00 |
| 5.52571622e+00 | 1.00000000e+01 |
| 5.57193418e+00 | 2.00000000e+01 |
| 5.60276505e+00 | 5.00000000e+01 |
| 5.61368148e+00 | 1.00000000e+02 |

The array of mse is then visualized as follows:

![Test MSE](/workspaces/regression-and-model-selection/images/test_mse.png)

### Task 1B.2 Polynomial Model Selection

**TASK**

**SOLUTION**

**IMPLEMENTATION**