# 1.8 Gradient Descent #

## ***Vocabulary & Code*** ##

**Convexity**:
- A function is convex if the chord connecting any 2 points of its graph lies on or above the graph between those points.
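
Stated as an inequality, the chord condition can be written as follows. This is the standard textbook formulation rather than something taken from the lecture, and the symbols $a$, $b$, and $\lambda$ are introduced here only for illustration:

$$f(\lambda a + (1-\lambda) b) \le \lambda f(a) + (1-\lambda) f(b) \quad \text{for all points } a, b \text{ and all } \lambda \in [0, 1]$$

The left side is the function evaluated between $a$ and $b$; the right side is the height of the chord at the same location.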

<br>
<center>
    <img src="images/1.8.2.png" alt="Professor Notes" />
</center>
<br>

# Lecture Notes #

## ***1.8.0 Introduction*** ##

Goal: Find minimizer of a function.

We can evaluate $f(x)$ at some point $x$, and also evaluate the derivative, $f'(x)$, to find the slope of the tangent line at that point.

**Update Rule:**
- If $f'(x) < 0$, move a bit to the right
- Else if $f'(x) > 0$, move a bit to the left
- Else if $f'(x) = 0$ or is close to 0, then stop and output $x$.
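
The rule above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the function name `minimize_1d`, the parameters `step`, `tol`, and `max_iters`, and the choice to scale each move by the size of the slope are all assumptions made for illustration.

```python
def minimize_1d(f_prime, x0, step=0.1, tol=1e-6, max_iters=10_000):
    """Follow the sign-based update rule, scaling each move by the slope."""
    x = x0
    for _ in range(max_iters):
        slope = f_prime(x)
        if abs(slope) < tol:      # slope is (close to) 0: stop and output x
            break
        x -= step * slope         # slope < 0 moves right, slope > 0 moves left
    return x

# Hypothetical example: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3).
x_min = minimize_1d(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))
```

Starting to the left of the minimizer, the slope is negative, so every step moves right until the slope is nearly 0 around $x = 3$.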

<br>
<center>
    <img src="images/1.8.1.png" alt="Professor Notes" />
</center>
<br>

What about a more complicated function? Using the same update rule, how can we reach the global minimum instead of the local minimum?

If the function is convex, every local minimum is also a global minimum, so the update rule cannot get stuck at a merely local one.

---

Another example: $f(x) = w^Tx + b$ (a linear function in $d$ dimensions). We are currently at point $x$ and want to know which direction to move in to minimize $f$. By "direction" we mean a unit vector.

Consider $f(x + u)$, where $u$ is some arbitrary unit vector:

$$f(x+u) = w^Tx+w^Tu+b$$

The correct choice of $u$ is $\frac{-w}{||w||}$.

Only the term $w^Tu$ depends on $u$. Over unit vectors, the inner product $w^Tu$ is maximized by $u = \frac{w}{||w||}$, and since we want to minimize, we use $\frac{-w}{||w||}$ instead.

Also, if we move in direction $\frac{-w}{||w||}$, then $f$ decreases by $||w||_2$.
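
A quick numerical check of this claim, as a sketch under assumed values: the weights `w`, offset `b`, and point `x` below are hypothetical, and the comparison only samples random unit vectors in 2 dimensions rather than proving anything.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

w = [3.0, -4.0]          # hypothetical weights, ||w|| = 5
b = 1.0
x = [0.5, 2.0]           # hypothetical current point

def f(p):
    return dot(w, p) + b

# Change in f when stepping along -w / ||w||.
best_u = [-wi / norm(w) for wi in w]
best_change = f([xi + ui for xi, ui in zip(x, best_u)]) - f(x)

# Change in f along 1000 random unit vectors, for comparison.
random.seed(0)
other_changes = []
for _ in range(1000):
    theta = random.uniform(0, 2 * math.pi)
    u = [math.cos(theta), math.sin(theta)]
    other_changes.append(f([xi + ui for xi, ui in zip(x, u)]) - f(x))

print(best_change, min(other_changes))
```

The step along $\frac{-w}{||w||}$ changes $f$ by about $-||w||_2 = -5$, and none of the sampled directions does better.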

## ***1.8.1 Putting it Together*** ##

Thus far, our idea has been to look at tangent lines. This idea works for linear functions and simple convex functions.

To minimize more complicated functions, we assume they are "locally" linear.

$f$ at point $x$ can be Taylor expanded:
$$f(x+\epsilon) = f(x) + \epsilon f'(x)+ \frac{\epsilon^2}{2!} f''(x)+\frac{\epsilon^3}{3!} f'''(x)+\cdots$$
The above is an expression in terms of $\epsilon$. The first two terms, $f(x) + \epsilon f'(x)$, form a linear function of $\epsilon$, and when $\epsilon$ is small, the rest of the terms are negligible. This means that in a very small neighborhood of $x$ (small values of $\epsilon$), $f$ is approximately a linear function.
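
To see how quickly the remaining terms become negligible, the sketch below measures the gap between $f(x+\epsilon)$ and the linear part $f(x) + \epsilon f'(x)$. The choice of $f = \sin$ and $x = 1$ is a hypothetical example, not one from the lecture.

```python
import math

f = math.sin            # hypothetical example function
f_prime = math.cos      # its derivative
x = 1.0

def linear_error(eps):
    """Gap between f(x + eps) and the linear approximation f(x) + eps * f'(x)."""
    return abs(f(x + eps) - (f(x) + eps * f_prime(x)))

errors = {eps: linear_error(eps) for eps in (0.1, 0.01, 0.001)}
for eps, err in errors.items():
    print(eps, err)
```

Each time $\epsilon$ shrinks by a factor of 10, the error shrinks by roughly a factor of 100, which is what the $\frac{\epsilon^2}{2!} f''(x)$ term predicts.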

---

Taylor's theorem also holds in $d$ dimensions, meaning even in $d$ dimensions, the function will look linear in a small enough neighborhood. Instead of taking derivatives, in higher dimensions, we must look at gradients.

The gradient of $f$ at point $x$ is written:

$$\nabla f(x)=(\frac{\partial f}{\partial x_1}(x), ... , \frac{\partial f}{\partial x_d}(x)) $$

Which is a $d$-dimensional vector.

---

Example:

$$f = w^Tx+b \leadsto \frac{\partial f}{\partial x_i} = w_i \leadsto \nabla f = w $$
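
For the linear function, every partial derivative is the matching weight, so the gradient is $w$. One way to sanity-check a gradient like this is a finite-difference approximation. The helper `numerical_gradient`, the step `h`, and the example weights below are assumptions for illustration only.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

w = [2.0, -1.0, 0.5]     # hypothetical weights
b = 4.0

def linear(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

grad = numerical_gradient(linear, [1.0, 2.0, 3.0])
print([round(g, 6) for g in grad])
```

The approximate gradient agrees with $w$ up to floating-point error, regardless of the point at which it is evaluated.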

Another Example:

<br>
<center>
    <img src="images/1.8.2.png" alt="Professor Notes" />
</center>
<br>

## ***1.8.2 Defining and Applying*** ##

Define Gradient Descent:

Imagine we are trying to minimize $f(w)$. Initially we choose $w$ to be some arbitrary (for example, random) starting value.

If $||\nabla f(w)||_2 \lt \epsilon$, stop and output $w$.\
Otherwise, $w_{new} = w_{old} - \eta \nabla f(w_{old})$, where $\eta$ is the step-size parameter.

This is also written as $w_j^{new} = w_j^{old} - \eta \frac{\partial f}{\partial w_j}(w_{old})$. 

The step-size parameter is usually set to be relatively small, because the linearity of these complicated functions only holds locally.
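
The procedure can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the function name `gradient_descent`, the default values, the `max_iters` safety cap, and the example function are all assumptions.

```python
import math

def gradient_descent(grad, w0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Repeat w <- w - eta * grad(w) until the gradient norm drops below eps."""
    w = list(w0)
    for _ in range(max_iters):
        g = grad(w)
        if math.sqrt(sum(gi * gi for gi in g)) < eps:   # ||grad f(w)||_2 < eps
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical convex example: f(w) = (w_1 - 1)^2 + 2 * (w_2 + 3)^2.
def grad_f(w):
    return [2 * (w[0] - 1), 4 * (w[1] + 3)]

w_min = gradient_descent(grad_f, w0=[0.0, 0.0])
print([round(wi, 4) for wi in w_min])
```

With this step size the iterates settle near $(1, -3)$, the minimizer of the example function. A much larger `eta` would overshoot, which is the practical reason for keeping the step small.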

---

Let's apply GD to linear regression.

In linear regression, we have a training set $S$ of size $m$. \
We are searching for this function: $h(x) = w^Tx+b$, and our loss function was MSE (mean squared error): $M.S.E.(w) = \frac{1}{m}\sum_{j=1}^m(w^Tx^j+b-y^j)^2$, where $(x^j, y^j)$ is the $j$-th training example.

We are trying to minimize the MSE, so that is the function $f$ we will run GD on. GD is applied to the loss.

Let $g_j$ denote the term $w^Tx^j+b-y^j$ inside $f$. We need to compute the gradient of the MSE at the point $w$, so our first step is to compute the partial derivative of $g_j$ w.r.t. $w_i$.
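
Written out with the chain rule, the computation can be sketched as below. The notation $x_i^j$ for the $i$-th coordinate of the $j$-th training point is assumed here for illustration.

$$\frac{\partial g_j}{\partial w_i} = x_i^j \qquad \leadsto \qquad \frac{\partial MSE}{\partial w_i} = \frac{1}{m}\sum_{j=1}^m 2 g_j \frac{\partial g_j}{\partial w_i} = \frac{2}{m}\sum_{j=1}^m (w^Tx^j+b-y^j)x_i^j$$

Stacking the partial derivatives into a vector gives $\nabla MSE(w) = \frac{2}{m}\sum_{j=1}^m (w^Tx^j+b-y^j)x^j$.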

<br>
<center>
    <img src="images/1.8.4.png" alt="Professor Notes" />
</center>
<br>

In this case, the update rule is:

$$ w_{new} = w_{old} - \eta \nabla MSE(w_{old}) $$

It is important to note that MSE is a convex function. The running time for computing this gradient is $\mathcal{O}(m \cdot n)$, where $n$ is the number of coordinates of each $x^j$. The computation is also easily parallelizable: the terms of the sum for different $j$'s can be sent to different processors and then added together.
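
As a sketch of what this looks like in code, the example below runs full-batch gradient descent on a tiny regression problem. The data, step size, and iteration count are hypothetical, and updating $b$ with its own partial derivative alongside $w$ is an assumption, since the update rule above is written for $w$ only.

```python
def mse_gradient(w, b, X, y):
    """Gradient of (1/m) * sum_j (w.x_j + b - y_j)^2 with respect to w and b."""
    m, d = len(X), len(w)
    grad_w, grad_b = [0.0] * d, 0.0
    for xj, yj in zip(X, y):                       # loop over the m points ...
        residual = sum(wi * xi for wi, xi in zip(w, xj)) + b - yj
        for i in range(d):                         # ... and over each coordinate
            grad_w[i] += 2 * residual * xj[i] / m
        grad_b += 2 * residual / m                 # assumed: b is learned too
    return grad_w, grad_b

# Hypothetical data generated from y = 2 * x + 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w, b, eta = [0.0], 0.0, 0.05
for _ in range(5000):
    grad_w, grad_b = mse_gradient(w, b, X, y)
    w = [wi - eta * gi for wi, gi in zip(w, grad_w)]
    b -= eta * grad_b

print(round(w[0], 3), round(b, 3))
```

The nested loop makes the cost of one gradient evaluation visible: one pass over the points, with work proportional to the number of coordinates for each. On this data the parameters approach $w = 2$, $b = 1$.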

---

**Stochastic Gradient Descent**

Previously in the linear regression example, we summed over all points in the training set. 

Instead, we can choose an index $j$ at random and compute the gradient with respect to this point only.

The new update rule for MSE would be:

$$ w_{new} = w_{old} - 2\eta (w^Tx^j+b-y^j)x^j$$

Question: Why does this make sense as an update rule?
- In expectation, the update equals the full gradient descent update, because each point is chosen with equal probability $\frac{1}{m}$.
- $\mathbf{E}[w_{new}] = w_{old} -2 \eta \cdot \frac{1}{m}\sum_{j=1}^m(w_{old}^Tx^j+b-y^j)x^j$
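
A minimal sketch of the stochastic update on a toy problem is shown below. The data, seed, step size, and number of steps are hypothetical, and updating $b$ in the same way as $w$ is an assumption not spelled out in the rule above.

```python
import random

# Hypothetical noiseless data generated from y = 2 * x + 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

random.seed(0)                       # fixed seed so the run is repeatable
w, b, eta = [0.0], 0.0, 0.02
for _ in range(20_000):
    j = random.randrange(len(X))     # pick one index uniformly at random
    residual = sum(wi * xi for wi, xi in zip(w, X[j])) + b - y[j]
    w = [wi - 2 * eta * residual * xi for wi, xi in zip(w, X[j])]
    b -= 2 * eta * residual          # assumed: b is updated like a weight

print(round(w[0], 3), round(b, 3))
```

Each step touches a single training point, so it is far cheaper than a full pass over the data. Because this toy data has no noise, the parameters still settle near $w = 2$, $b = 1$.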

---

**Batches** can be used to interpolate between gradient descent and pure stochastic gradient descent. Averaging the gradient over a batch reduces the variance of the resulting (random) weight vectors.
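
For the MSE example, a batched update could be written as follows, where $B$ is a randomly chosen subset of the indices $\{1, \dots, m\}$. The symbol $B$ is introduced here for illustration and does not come from the lecture:

$$ w_{new} = w_{old} - 2\eta \cdot \frac{1}{|B|}\sum_{j \in B}(w_{old}^Tx^j+b-y^j)x^j $$

Taking $|B| = 1$ recovers the stochastic update, and taking $B$ to be all $m$ indices recovers full gradient descent.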

## ***1.8.3 Choosing Step Size*** ##

Question: How do we choose $\eta$, the step-size?
- More art than science
- Use cross-validation to pick $\eta$
- Many techniques for adaptively choosing $\eta$ that are very successful
- **Momentum**:
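    - Has a "velocity" variable $V$, which is initially 0 ($V_0 = 0$).
    - $V_i = \alpha V_{i-1}-\eta g_i$, where $g_i$ is the gradient at step $i$ and $\alpha$ weights the previous velocity.
    - This accumulates a weighted sum of the past $\eta g_i$'s, with older gradients discounted by powers of $\alpha$.
    - $w_{new} = w_{old} + V_i$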
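
The momentum update can be sketched as below. The function name `momentum_descent`, the values of `eta` and `alpha`, the fixed number of steps, and the example function are all assumptions chosen for illustration.

```python
def momentum_descent(grad, w0, eta=0.05, alpha=0.9, steps=500):
    """Gradient descent with a velocity term: V_i = alpha * V_{i-1} - eta * g_i."""
    w = list(w0)
    v = [0.0] * len(w)                        # velocity starts at 0
    for _ in range(steps):
        g = grad(w)
        v = [alpha * vi - eta * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]  # w_new = w_old + V_i
    return w

# Hypothetical example: f(w) = w_1^2 + 10 * w_2^2, minimized at the origin.
def grad_f(w):
    return [2 * w[0], 20 * w[1]]

w_min = momentum_descent(grad_f, w0=[5.0, 5.0])
print([round(wi, 4) for wi in w_min])
```

The velocity carries information from earlier gradients into the current step, so the iterate keeps moving in directions where successive gradients agree.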
 
<br>
<center>
    <img src="images/1.8.5.png" alt="Professor Notes" />
</center>
<br>

**Accelerated Gradient Descent**: will be studied in optimization classes.

# Personal Notes #

### Summary of GD Process:

1. If $||\nabla f(w)||_2 \lt \epsilon$, stop and output $w$.
2. Otherwise, $w_{new} = w_{old} - \eta \nabla f(w_{old})$, where $\eta$ is the step-size parameter. This is also written as $w_j^{new} = w_j^{old} - \eta \frac{\partial f}{\partial w_j}(w_{old})$. 

- **Evaluate the loss function**: Compute the loss function value (e.g., hinge loss, squared loss) for a data point $x_i$.
- **Take the derivative**: Compute the gradient of the loss function with respect to the model parameters (e.g., w).
- **Plug into the update rule**: Update the model parameters using the gradient in the SGD update rule.

This iterative process continues until the model converges or a stopping criterion is met.

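Note: $||\nabla f(w)||_2 = \sqrt{\left(\frac{\partial f}{\partial w_1}(w)\right)^2 + \left(\frac{\partial f}{\partial w_2}(w)\right)^2 + ... + \left(\frac{\partial f}{\partial w_n}(w)\right)^2}$, the Euclidean length of the gradient vector.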

---

### Generalized Gradient Descent (GD) Update Rule:

For **Gradient Descent (GD)**, the update rule is derived from minimizing a loss function $L(w)$ over the entire dataset. The generalized form is:

$$
w_{\text{new}} = w_{\text{old}} - \eta \nabla L(w_{\text{old}})
$$

Where:
- $w_{\text{new}}$ is the updated weight vector.
- $w_{\text{old}}$ is the current weight vector.
- $\eta$ is the learning rate (step size).
- $\nabla L(w_{\text{old}})$ is the **gradient of the loss function** with respect to $w$, computed using **all the data points** in the dataset.

---

### Generalized Stochastic Gradient Descent (SGD) Update Rule:

For **Stochastic Gradient Descent (SGD)**, the gradient is computed using only **one random data point** (or a small batch of data) instead of the entire dataset. The generalized form of the SGD update rule is:

$$
w_{\text{new}} = w_{\text{old}} - \eta \nabla L_i(w_{\text{old}})
$$

Where:
- $w_{\text{new}}$ and $w_{\text{old}}$ are the updated and current weight vectors.
- $\eta$ is the learning rate.
- $\nabla L_i(w_{\text{old}})$ is the gradient of the loss function with respect to $w$, but evaluated only on a **single data point** (or a mini-batch) $i$.

---

### Differences Between GD and SGD:

1. **Full Gradient vs. Stochastic Gradient**:
   - **GD** computes the gradient $\nabla L(w)$ using all training examples, which is computationally expensive for large datasets.
   - **SGD** computes the gradient using a **single random data point** (or a small batch), which makes it much faster per iteration but introduces some noise due to the randomness.

2. **Convergence**:
   - **GD** moves more smoothly towards the minimum since it uses the full dataset to compute the gradient at each step.
   - **SGD** introduces more variability in updates, which can help avoid local minima but may require more iterations to converge.

---

### Summary of Update Rules:

- **GD Update Rule**:
  $$
  w_{\text{new}} = w_{\text{old}} - \eta \nabla L(w_{\text{old}})
  $$
  (Gradient computed over the **entire dataset**)

- **SGD Update Rule**:
  $$
  w_{\text{new}} = w_{\text{old}} - \eta \nabla L_i(w_{\text{old}})
  $$
  (Gradient computed over a **single data point** $i$ or mini-batch)
