### Problem 1
<img src="problem1.png" width="800" height="600" alt="Problem 1 Image"> 




### Solution 1

### (i) Analytical solution for $\beta$

The loss function provided is the ridge regression loss: a linear regression objective with an L2 penalty. The penalty term, weighted by the regularization parameter $\lambda$, discourages large coefficients and thereby helps prevent overfitting.


Analytically solving for the parameters $\beta$ involves finding the values of $\beta$ that minimize the loss function $L(\beta)$. For ordinary linear regression without regularization (i.e., when $\lambda = 0$), this can be done by setting the derivative of the loss function with respect to each $\beta_j$ to zero and solving the resulting normal equations. However, with the addition of the L2 regularization term, the solution is not the same as ordinary least squares (OLS) because of the penalty on the size of the coefficients.

To find the analytical expression for the $\beta$ parameters in ridge regression, we also set the derivative of $L(\beta)$ with respect to each $\beta_j$ to zero. This will give us a set of equations that we can solve for $\beta$.

Let's denote our design matrix as $X$ (with each row corresponding to an observation and each column to a feature) and our response vector as $y$. The loss function can be written in matrix form as:

$$ L(\beta) = (y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta $$

To minimize this loss function, we take the derivative with respect to $\beta$ and set it to zero:

$$ \frac{\partial L(\beta)}{\partial \beta} = -2X^T(y - X\beta) + 2\lambda\beta = 0 $$

This gives us the ridge regression normal equations:

$$ X^TX\beta + \lambda I\beta = X^Ty $$

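where $I$ is the identity matrix. For $\lambda > 0$ the matrix $X^TX + \lambda I$ is positive definite and therefore invertible, and the Hessian $2(X^TX + \lambda I)$ is positive definite as well, so this stationary point is the unique minimum. The solution is: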

$$ \beta = (X^TX + \lambda I)^{-1}X^Ty $$
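As a quick sanity check, the closed form can be evaluated numerically. The sketch below is a minimal illustration, not part of the problem statement: the function name `ridge_closed_form` and the synthetic data are hypothetical, and it assumes NumPy is available and that any intercept is handled through a column of $X$.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 1.0
beta_hat = ridge_closed_form(X, y, lam)

# The gradient of the ridge loss should vanish at the minimizer.
grad_at_solution = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad_at_solution, 0.0, atol=1e-8))
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a design choice for numerical stability; both correspond to the same formula.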

----------------
### (ii) Gradient of the loss function $\nabla L(\beta)$ 

$$\nabla L(\beta) = -2X^T(y - X\beta) + 2\lambda\beta $$


where:

- $X$ is the matrix of input features, with rows representing samples and columns representing features.
- $X^T$ is the transpose of the matrix $X$.
- $y$ is the vector of observed values (target values).
- $\beta$ is the vector of parameters that we are trying to learn.
- $\lambda$ is the regularization parameter that controls the amount of shrinkage: larger values of $\lambda$ shrink the parameters more toward zero.
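One way to gain confidence in this expression is to compare it against a finite-difference approximation of the loss. The following is a minimal sketch under assumed names (`ridge_loss`, `ridge_gradient`, `numerical_gradient` are hypothetical helpers) and random test data:

```python
import numpy as np

def ridge_loss(beta, X, y, lam):
    """L(beta) = (y - X beta)^T (y - X beta) + lam * beta^T beta."""
    r = y - X @ beta
    return r @ r + lam * (beta @ beta)

def ridge_gradient(beta, X, y, lam):
    """Analytical gradient: -2 X^T (y - X beta) + 2 lam beta."""
    return -2 * X.T @ (y - X @ beta) + 2 * lam * beta

def numerical_gradient(f, beta, eps=1e-6):
    """Central-difference approximation of the gradient of f at beta."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return grad

# Hypothetical random data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
beta = rng.normal(size=4)
lam = 0.5

g_exact = ridge_gradient(beta, X, y, lam)
g_approx = numerical_gradient(lambda b: ridge_loss(b, X, y, lam), beta)
print(np.max(np.abs(g_exact - g_approx)))  # should be close to zero
```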

----------------

### (iii) Update step for gradient descent

$$  \beta_j^{(t+1)} = \beta_j^{(t)} - \eta \cdot \frac{\partial L(\beta)}{\partial \beta_j} $$

where:

- $\beta_j^{(t+1)}$ is the updated value of the $j$-th parameter at iteration $t+1$.
- $\beta_j^{(t)}$ is the current value of the $j$-th parameter at iteration $t$.
- $\eta$ is the learning rate, a positive scalar determining the step size at each iteration.
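- $\frac{\partial L(\beta)}{\partial \beta_j}$ is the partial derivative of the loss function with respect to the $j$-th parameter, i.e. the rate of change of the loss along the $j$-th coordinate. Collected over all $j$, the partial derivatives form the gradient $\nabla L(\beta)$, which points in the direction of steepest increase of the loss.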

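Subtracting the gradient scaled by the learning rate moves the parameters in the direction that most steeply reduces the loss function. Substituting the gradient from (ii) and collecting all components into a vector, the update reads:

$$ \beta^{(t+1)} = \beta^{(t)} - \eta\left(-2X^T(y - X\beta^{(t)}) + 2\lambda\beta^{(t)}\right) $$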
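The update rule can be sketched as a short full-batch gradient descent loop. This is an illustrative sketch only: the function name `ridge_gradient_descent`, the zero initialization, the fixed iteration count, and the hand-picked step size are assumptions, and the step size has to be small enough for the iteration to converge on the given data.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, eta, n_iters=5000):
    """Full-batch gradient descent on the ridge loss."""
    beta = np.zeros(X.shape[1])  # assumed starting point
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ beta) + 2 * lam * beta
        beta = beta - eta * grad  # update step from (iii)
    return beta

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

lam = 1.0
beta_gd = ridge_gradient_descent(X, y, lam=lam, eta=1e-3)

# Compare against the closed-form solution from (i).
beta_exact = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.max(np.abs(beta_gd - beta_exact)))  # should be close to zero
```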

----------------
### (iv) Pseudo-code for a stochastic gradient descent (SGD) algorithm to estimate the parameters $\beta$

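- Initialize $\beta$ at random
- Choose a learning rate $\eta$

- Repeat the following until an approximate minimum is obtained (for example, until the change in $\beta$ falls below a tolerance or a maximum number of passes is reached):
    - Shuffle the dataset randomly
    - For each example $(x_i, y_i)$ in the dataset:
        - Calculate the gradient of the loss with respect to $\beta$ on that example. One common choice splits the penalty evenly over the $n$ examples, giving $g_i = -2x_i(y_i - x_i^T\beta) + \frac{2\lambda}{n}\beta$, so that the $g_i$ sum to $\nabla L(\beta)$
        - Update $\beta \leftarrow \beta - \eta\, g_i$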
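The pseudo-code above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function name `ridge_sgd`, the fixed number of epochs as the stopping rule, the constant learning rate, and the even split of the penalty across the $n$ examples are all illustrative choices.

```python
import numpy as np

def ridge_sgd(X, y, lam, eta, n_epochs=50, seed=0):
    """Stochastic gradient descent on the ridge loss, one example per update."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = rng.normal(scale=0.01, size=p)  # small random initialization
    for _ in range(n_epochs):
        # rng.permutation(n) visits the examples in a fresh random order.
        for i in rng.permutation(n):
            residual = y[i] - X[i] @ beta
            # Per-example gradient; the penalty is split evenly (assumption).
            grad_i = -2 * X[i] * residual + (2 * lam / n) * beta
            beta = beta - eta * grad_i
    return beta

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)

lam = 1.0
beta_sgd = ridge_sgd(X, y, lam=lam, eta=0.01)
beta_exact = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(beta_sgd, beta_exact)
```

With a constant learning rate, SGD typically settles in a small neighborhood of the minimizer rather than converging exactly, so the two printed vectors should agree only approximately. A decaying learning rate is a common way to tighten this.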


