# Gradient Descent

## Fixed step size
- Initialize $x = 0$
- Repeat:
  - $x \leftarrow x - \eta \nabla f(x)$, where $\eta > 0$ is the step size
  - until the stopping criterion is met
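
The update rule above can be sketched in a few lines of NumPy. This is a minimal sketch, not a definitive implementation: the objective $f(x) = \frac{1}{2}\lVert x - c \rVert^2$, the vector `c`, and the fixed iteration budget used as the stopping criterion are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(grad, x0, eta, num_steps):
    """Fixed-step gradient descent: repeat x <- x - eta * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad(x)
    return x

# Hypothetical objective (not from the notes): f(x) = 0.5 * ||x - c||^2,
# whose gradient is x - c and whose minimizer is c.
c = np.array([1.0, -2.0])
x_final = gradient_descent(lambda x: x - c, x0=np.zeros(2), eta=0.1, num_steps=200)
```

For this objective each step shrinks the distance to `c` by a factor of $1 - \eta$, so `x_final` ends up very close to `c`.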

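However, a fixed step size is not a good choice in many situations: a step that is too large can overshoot the minimum and diverge, while a step that is too small converges slowly.

## Convergence Theorem for Fixed Step Size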
Suppose $f: \mathbb{R}^{d} \to \mathbb{R}$ is convex and differentiable, and $\nabla f$ is **Lipschitz** continuous with constant $L > 0$, i.e.   
$$\lVert \nabla f(x) - \nabla f(y)\rVert \leq L \lVert x - y \rVert \quad \forall x, y \in \mathbb{R}^d$$  
Then gradient descent with fixed step size $\eta \leq \frac{1}{L}$ converges. In particular, if $x^{*}$ is a minimizer of $f$ and $x^{(k)}$ is the iterate after $k$ steps, we have:  
$$f(x^{(k)}) - f(x^{*}) \leq \frac{\lVert x^{(0)} - x^* \rVert^2}{2\eta k} $$  
<br/>

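Proof:  
The iterates are  
$x^{(1)}=x^{(0)}-\eta \nabla f\left(x^{(0)}\right)$  
$x^{(2)}=x^{(1)}-\eta \nabla f\left(x^{(1)}\right)$  
$\vdots$  
$x^{(k)}=x^{(k - 1)}-\eta \nabla f\left(x^{(k - 1)}\right)$  
Write $x^{+} = x - \eta \nabla f(x)$ for one step from $x$. Because $\nabla f$ is $L$-Lipschitz, $f(y) \leqslant f(x) + \nabla f(x)^T (y - x) + \frac{L}{2}\|y - x\|^2$ for all $x, y$. Taking $y = x^{+}$ and using $\eta \leqslant \frac{1}{L}$:  
$$\begin{aligned}
f(x^{+}) &\leqslant f(x) - \eta\|\nabla f(x)\|^2 + \frac{L\eta^2}{2}\|\nabla f(x)\|^2 \\
&\leqslant f(x) - \frac{\eta}{2}\|\nabla f(x)\|^2
\end{aligned}$$  
so the objective never increases. By convexity, $f(x) \leqslant f(x^{*}) + \nabla f(x)^T(x - x^{*})$, hence  
$$\begin{aligned}
f(x^{+}) - f(x^{*}) &\leqslant \nabla f(x)^T(x - x^{*}) - \frac{\eta}{2}\|\nabla f(x)\|^2 \\
&= \frac{1}{2\eta}\left(\|x - x^{*}\|^2 - \|x - \eta\nabla f(x) - x^{*}\|^2\right) \\
&= \frac{1}{2\eta}\left(\|x - x^{*}\|^2 - \|x^{+} - x^{*}\|^2\right)
\end{aligned}$$  
Summing over the steps $i = 1, \dots, k$, the right-hand side telescopes:  
$$\sum_{i=1}^{k}\left(f(x^{(i)}) - f(x^{*})\right) \leqslant \frac{1}{2\eta}\left(\|x^{(0)} - x^{*}\|^2 - \|x^{(k)} - x^{*}\|^2\right) \leqslant \frac{\|x^{(0)} - x^{*}\|^2}{2\eta}$$  
Since $f(x^{(i)})$ is non-increasing in $i$, the last term of the sum is the smallest, so  
$$f(x^{(k)}) - f(x^{*}) \leqslant \frac{1}{k}\sum_{i=1}^{k}\left(f(x^{(i)}) - f(x^{*})\right) \leqslant \frac{\|x^{(0)} - x^{*}\|^2}{2\eta k}$$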
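
The bound in the theorem can be checked numerically. The sketch below assumes a hypothetical convex objective $f(x) = \frac{1}{2}\lVert Ax - b \rVert^2$ with random `A` and `b` (neither is from the notes); its gradient $A^T(Ax - b)$ is Lipschitz with $L$ equal to the largest eigenvalue of $A^TA$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))   # hypothetical data, for illustration only
b = rng.normal(size=20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad_f(x):
    return A.T @ (A @ x - b)

# Lipschitz constant of the gradient: largest eigenvalue of A^T A.
L = np.linalg.eigvalsh(A.T @ A).max()
eta = 1.0 / L

x_star = np.linalg.lstsq(A, b, rcond=None)[0]  # exact minimizer
x0 = np.zeros(5)
x = x0.copy()
gaps, bounds = [], []
for k in range(1, 101):
    x = x - eta * grad_f(x)
    gaps.append(f(x) - f(x_star))                             # f(x^(k)) - f(x*)
    bounds.append(np.sum((x0 - x_star) ** 2) / (2 * eta * k)) # theorem's bound

# Small slack absorbs floating-point rounding.
bound_holds = all(g <= bnd + 1e-9 for g, bnd in zip(gaps, bounds))
```

Here `bound_holds` records whether the suboptimality gap stayed below $\frac{\lVert x^{(0)} - x^* \rVert^2}{2\eta k}$ at every one of the 100 steps.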

## When to Stop?
- Wait until $\|\nabla f(x)\| \leqslant \epsilon$, for some $\epsilon$ you choose
- Stop when the result worsens or stops improving
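
Both stopping rules can be combined in one loop. This is a sketch under stated assumptions: the function name `gd_with_stopping`, the `patience` counter, and the `max_steps` budget are hypothetical, and "the result" is taken to be the objective value itself (in practice it might be a validation metric).

```python
import numpy as np

def gd_with_stopping(f, grad, x0, eta, eps=1e-6, patience=5, max_steps=10_000):
    """Fixed-step gradient descent with two stopping rules."""
    x = np.asarray(x0, dtype=float)
    best = f(x)
    stalled = 0
    for step in range(max_steps):
        g = grad(x)
        # Rule 1: the gradient norm has fallen below the tolerance.
        if np.linalg.norm(g) <= eps:
            return x, step, "small gradient"
        x = x - eta * g
        value = f(x)
        # Rule 2: the objective has not improved for `patience` steps in a row.
        if value < best:
            best, stalled = value, 0
        else:
            stalled += 1
            if stalled >= patience:
                return x, step + 1, "no improvement"
    return x, max_steps, "step budget exhausted"

# Hypothetical objective: f(x) = 0.5 * ||x - c||^2 with gradient x - c.
c = np.array([1.0, -2.0])
x, steps, reason = gd_with_stopping(
    f=lambda x: 0.5 * np.sum((x - c) ** 2),
    grad=lambda x: x - c,
    x0=np.zeros(2),
    eta=0.5,
)
```

On this example the objective decreases at every step, so the loop exits through the gradient-norm rule.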

# Gradient Descent for Empirical Risk

## Linear Least Squares Regression
**Setup**  
- Input Space $\mathcal{X} = \mathbb{R}^d$  
- Output Space $\mathcal{Y} = \mathbb{R}$
- Action Space $\mathcal{A} = \mathbb{R}$  
- Loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$  
- Hypothesis Space $\mathcal{F} = \{f: \mathbb{R}^d \to \mathbb{R} \mid f(x) = w^Tx, w\in \mathbb{R}^d \}$  

**Empirical Risk**  
$$\hat{R}_{n}(w)=\frac{1}{n} \sum_{i=1}^{n}\frac{1}{2}\left(w^{T} x_{i}-y_{i}\right)^{2}$$  
where $w \in \mathbb{R}^d$ parameterizes the hypothesis space $\mathcal{F}$. 
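
With the loss $\ell(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$ from the setup, the gradient of the empirical risk is $\frac{1}{n}\sum_{i=1}^{n}(w^Tx_i - y_i)x_i$. A minimal NumPy sketch follows; the function names and the synthetic noiseless data are assumptions made for illustration, with `X` holding one input per row.

```python
import numpy as np

def empirical_risk(w, X, y):
    """Average of 0.5 * (w^T x_i - y_i)^2 over the rows of X."""
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2)

def empirical_risk_grad(w, X, y):
    """Gradient of the empirical risk: (1/n) * sum_i (w^T x_i - y_i) * x_i."""
    residuals = X @ w - y
    return X.T @ residuals / len(y)

# Hypothetical noiseless data generated from a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

# Full-batch gradient descent with a fixed step size.
w = np.zeros(3)
eta = 0.1
for _ in range(2000):
    w = w - eta * empirical_risk_grad(w, X, y)
```

Because the data are noiseless, the minimizer is `w_true` and the risk there is zero.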

## Gradient Descent for Empirical Risk Averages
Generally, our hypothesis space is $\mathcal{F} = \{f_w: \mathcal{X} \to \mathcal{A} \mid w\in \mathbb{R}^d \}$, and ERM finds the $w$ minimizing 
$$\hat{R}_{n}(w)=\frac{1}{n} \sum_{i=1}^{n} \ell(f_{w}(x_{i}), y_{i})$$  
Suppose $\ell(f_{w}(x_{i}), y_{i})$ is differentiable with respect to $w$; then we can run gradient descent on $\hat{R}_{n}(w)$.

## Question: How does it scale with n?
$$\nabla \hat{R}_{n}(w)=\frac{1}{n} \sum_{i=1}^{n} \nabla_{w} \ell\left(f_{w}\left(x_{i}\right), y_{i}\right)$$
- Each iteration has to touch all $n$ data points, which is computationally expensive when $n$ is large.  
- Can we make progress without looking at all the data?

# Minibatch Gradient
Suppose $\mathcal{D}_{n} = \{(x_1, y_1),...,(x_n, y_n)\}$ is our full dataset. Draw a random subsample of size $N$, with each index $m_i$ chosen uniformly from $\{1, ..., n\}$:  
$\{(x_{m_1}, y_{m_1}),...,(x_{m_N}, y_{m_N})\}$.  
Then the minibatch gradient is 
$$\nabla \hat{R}_{N}(w)=\frac{1}{N} \sum_{i=1}^{N} \nabla_{w} \ell\left(f_{w}\left(x_{m_{i}}\right), y_{m_{i}}\right)$$  

## What can we say about the minibatch gradient?
- What is the expected value?
$$\begin{aligned} \mathbb{E}\left[\nabla \hat{R}_{N}(w)\right] &=\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[\nabla_{w} \ell\left(f_{w}\left(x_{m_{i}}\right), y_{m_{i}}\right)\right] \\ 
&=\mathbb{E}\left[\nabla_{w} \ell\left(f_{w}\left(x_{m_{1}}\right), y_{m_{1}}\right)\right] \\
&= \sum_{i = 1}^{n}P(m_{1} = i)\nabla_{w}\ell(f_w(x_i), y_i) \\
&= \frac{1}{n}\sum_{i = 1}^{n}\nabla_{w}\ell(f_w(x_i), y_i) \\
&= \nabla \hat{R}_n(w)
\end{aligned}$$ 
- The minibatch gradient is an **unbiased estimator** of the full-batch gradient.
- Tradeoffs of minibatch size:
  - Bigger $N$: better estimate of the gradient, but slower
  - Smaller $N$: quicker, but a worse estimate
- Even $N = 1$ works; this is called **stochastic gradient descent (SGD)**
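
The unbiasedness claim can be verified exactly on a tiny dataset by enumerating every possible minibatch. The sketch below assumes the squared loss $\frac{1}{2}(w^Tx - y)^2$, sampling without replacement, and hypothetical random data; `batch_grad` is an illustrative helper, not part of the notes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 6, 3, 2
X = rng.normal(size=(n, d))   # hypothetical data, for illustration only
y = rng.normal(size=n)
w = rng.normal(size=d)

def batch_grad(w, idx):
    """Average gradient of 0.5 * (w^T x_i - y_i)^2 over the indices in idx."""
    idx = list(idx)
    residuals = X[idx] @ w - y[idx]
    return X[idx].T @ residuals / len(idx)

full_grad = batch_grad(w, range(n))

# Every size-N subset is equally likely under uniform sampling without
# replacement, so the expectation is the plain average over all subsets.
subset_grads = [batch_grad(w, s) for s in itertools.combinations(range(n), N)]
expected_minibatch_grad = np.mean(subset_grads, axis=0)
```

Up to floating-point rounding, `expected_minibatch_grad` equals `full_grad`, matching the derivation above.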

## Minibatch Algorithm
- Initialize w = 0  
- Repeat 
   - Randomly choose a subsample of $N$ points $\{(x_{m_1}, y_{m_1}),...,(x_{m_N}, y_{m_N})\}$
   - $w \leftarrow w-\eta\left[\frac{1}{N} \sum_{i=1}^{N} \nabla_{w} \ell\left(f_{w}\left(x_{m_{i}}\right), y_{m_{i}}\right)\right]$
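
The steps above might look as follows for linear least squares. This is a sketch, assuming the squared loss $\frac{1}{2}(w^Tx - y)^2$, a fixed number of steps, and synthetic noiseless data; the function name `minibatch_gd` and its parameters are hypothetical.

```python
import numpy as np

def minibatch_gd(X, y, eta, batch_size, num_steps, seed=0):
    """Minibatch gradient descent for the loss 0.5 * (w^T x - y)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        # Randomly choose a subsample of batch_size distinct points.
        idx = rng.choice(n, size=batch_size, replace=False)
        residuals = X[idx] @ w - y[idx]
        # Step along the negative minibatch gradient.
        w = w - eta * (X[idx].T @ residuals / batch_size)
    return w

# Hypothetical noiseless data generated from a known weight vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true
w_hat = minibatch_gd(X, y, eta=0.05, batch_size=16, num_steps=3000)
```

Each step touches only 16 of the 200 points, yet `w_hat` still approaches `w_true` on this noiseless problem.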

## Stochastic Gradient Descent
- Initialize w = 0
- Repeat 
   - Randomly choose a training point $(x_i, y_i)$ 
   - $w \leftarrow w-\eta \nabla_{w} \ell\left(f_{w}\left(x_{i}\right), y_{i}\right)$
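
A matching sketch of SGD for linear least squares, again assuming the squared loss $\frac{1}{2}(w^Tx - y)^2$ and synthetic noiseless data; the function name `sgd` and its parameters are hypothetical. The per-example gradient is $(w^Tx_i - y_i)x_i$.

```python
import numpy as np

def sgd(X, y, eta, num_steps, seed=0):
    """Stochastic gradient descent for the loss 0.5 * (w^T x - y)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        i = rng.integers(n)  # randomly choose one training point
        # Gradient of the loss at (x_i, y_i) is (w^T x_i - y_i) * x_i.
        w = w - eta * (X[i] @ w - y[i]) * X[i]
    return w

# Hypothetical noiseless data generated from a known weight vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true
w_sgd = sgd(X, y, eta=0.05, num_steps=5000)
```

Each update costs a single dot product, independent of $n$; the data here are noiseless, which is why a fixed step size can reach `w_true` so closely.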