## Chapter 8

## Optimization for Training Deep Models

### 8.1 How Learning Differs from Pure Optimization

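<strong>ML/DL Optimization</strong>: <br>
 1. acts indirectly: we reduce a cost function $J(\theta)$ in the hope of improving a performance measure $P$, which is defined on the test set and may be intractable to optimize directly

<strong>Pure Optimization</strong>: <br>
1. acts directly: minimizing $J$ is itself the goal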

The objective we would ideally minimize is the expected loss over the data-generating distribution $p_{data}$: <br>


\begin{equation*}
J^{*}(\theta) =  \mathop{\mathbb{E}}_{(x,y) \sim p_{data}} L(f(x;\theta),y)
\end{equation*}


#### 8.1.1 Empirical Risk Minimization

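The quantity $J^{*}(\theta)$ above is known as the <strong>risk</strong>.
<br> In practice, the true underlying distribution $p_{data}(x,y)$ is unknown; we only have a training set.
<br> Replacing the true distribution $p_{data}(x,y)$ with the empirical distribution $\hat{p}_{data}(x,y)$ defined by the training set gives a quantity we can actually compute.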

<strong> empirical risk: </strong>
\begin{equation*}
\mathop{\mathbb{E}}_{x,y \sim \hat{p}_{data}(x,y)} [L(f(x;\theta),y)] = \frac{1}{m} \sum\limits_{i=1}^m L(f(x^{(i)};\theta), y^{(i)})
\end{equation*}
where $m$ is the number of training examples
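
The empirical risk is just the average per-example loss over the training set. A minimal sketch, assuming a hypothetical one-parameter model `linear_model` and squared error as the loss (both are illustrative choices, not taken from the notes above):

```python
def squared_error(prediction, target):
    """Per-example loss L(f(x; theta), y); squared error is an illustrative choice."""
    return (prediction - target) ** 2


def linear_model(x, theta):
    """Hypothetical model f(x; theta) = theta * x with a single scalar parameter."""
    return theta * x


def empirical_risk(theta, examples, model=linear_model, loss=squared_error):
    """Average the per-example loss over the m training examples."""
    m = len(examples)
    return sum(loss(model(x, theta), y) for x, y in examples) / m


# Toy training set generated by y = 2x, so theta = 2 fits it exactly.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
risk_at_2 = empirical_risk(2.0, train)  # zero loss on every example
risk_at_1 = empirical_risk(1.0, train)  # (1 + 4 + 9) / 3
```

Minimizing `empirical_risk` over `theta` is empirical risk minimization; nothing in this computation refers to the unknown $p_{data}$.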

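However, empirical risk minimization has two drawbacks: <br>
 - it is prone to overfitting: models with high capacity can simply memorize the training set
 - it is often infeasible with gradient-based methods: many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere)
<br>$\therefore$ empirical risk minimization is rarely used in DL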

#### 8.1.2 Surrogate Loss Functions and Early Stopping

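A <strong>surrogate loss function</strong> is used as a proxy when the loss we actually care about (e.g. classification error) cannot be optimized efficiently; it has advantages over the original loss function, such as useful gradients.
<br> The negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss.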
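
One such advantage can be seen on a toy case. The sketch below assumes a binary classifier that outputs the probability of the positive class and thresholds it at 0.5 (an illustrative setup). The 0-1 loss is flat once an example is classified correctly, while the negative log-likelihood keeps decreasing as the prediction becomes more confident:

```python
import math


def zero_one_loss(p_positive, label):
    """0-1 loss: 0 if the thresholded prediction matches the label, else 1."""
    predicted = 1 if p_positive >= 0.5 else 0
    return 0 if predicted == label else 1


def negative_log_likelihood(p_positive, label):
    """Negative log-probability the model assigns to the correct class."""
    p_correct = p_positive if label == 1 else 1.0 - p_positive
    return -math.log(p_correct)


# Three increasingly confident (and correct) predictions for a positive example.
probs = [0.6, 0.9, 0.99]
zero_one = [zero_one_loss(p, 1) for p in probs]        # flat: no signal to follow
nll = [negative_log_likelihood(p, 1) for p in probs]   # strictly decreasing
```

Because `nll` still changes with the predicted probability, it provides a gradient to follow even where `zero_one` is constant.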

An important difference between optimization in general and optimization in ML/DL:
<br> ML usually minimizes a surrogate loss function but <strong>halts when a convergence criterion based on early stopping is satisfied</strong>, often while the surrogate loss still has large derivatives,
<br>whereas pure optimization is considered to have converged when the <strong>gradient becomes very small</strong>.
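
A rough sketch of such a halting rule, assuming the criterion is "stop once the validation loss has not improved for `patience` consecutive epochs". The function name, the `patience` parameter, and the loss values are hypothetical and only illustrate the idea:

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch at which training would halt."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best validation loss: reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The criterion never fired: training ran for all recorded epochs.
    return len(validation_losses) - 1


# Made-up validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

Note that the rule never inspects the gradient of the surrogate loss, which is the contrast with pure optimization drawn above.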

#### 8.1.3 Batch and Minibatch Algorithms

In ML, the objective function usually decomposes as a <strong>sum over the training examples</strong>.

For example, maximum likelihood estimation (MLE): <br>

\begin{equation*}
\theta_{ML} = \arg\max\limits_{\theta} \sum\limits_{i=1}^m \log p_{model} (x^{(i)}, y^{(i)}; \theta)
\end{equation*}


Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set: <br>
\begin{equation*}
J(\theta) = \mathop{\mathbb{E}}_{x,y \sim \hat{p}_{data}} \log p_{model}(x, y; \theta)
\end{equation*}

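Most properties of $J$ used by optimization algorithms are also expectations over the training set. The most commonly used property is the gradient: <br>
\begin{equation*}
\nabla_{\theta} \:J(\theta) = \mathop{\mathbb{E}}_{x,y \sim \hat{p}_{data}} \nabla_{\theta}\: \log\: p_{model} (x, y; \theta)
\end{equation*}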

Computing this expectation exactly is very expensive, because it requires evaluating the model on every example in the dataset. <br>
 - Instead, estimate the expectation by randomly sampling a small number of examples from the dataset and averaging over only those examples.
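
The sampling idea can be sketched as follows. The example assumes a toy one-parameter model with a squared-error loss in place of the log-likelihood, and a minibatch size of 32; all names and values are illustrative:

```python
import random


def per_example_gradient(theta, x, y):
    """Gradient of 0.5 * (theta * x - y)**2 with respect to theta."""
    return (theta * x - y) * x


def mean_gradient(theta, examples):
    """Average the per-example gradients over the given examples."""
    return sum(per_example_gradient(theta, x, y) for x, y in examples) / len(examples)


# Toy dataset of 1000 examples generated by y = 3x (an assumption for illustration).
dataset = [(i / 1000, 3 * i / 1000) for i in range(1000)]
theta = 0.0

# Exact gradient: touches every example in the dataset.
full_gradient = mean_gradient(theta, dataset)

# Minibatch estimate: touches only 32 randomly sampled examples.
rng = random.Random(0)
minibatch_gradient = mean_gradient(theta, rng.sample(dataset, 32))

# Averaging many independent minibatch estimates approaches the exact gradient.
estimates = [mean_gradient(theta, rng.sample(dataset, 32)) for _ in range(200)]
average_estimate = sum(estimates) / len(estimates)
```

Each minibatch estimate costs about 3% of the full computation here, and it is noisy but centred on `full_gradient`.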

1. The standard error of the mean estimated from $m$ samples is given by: <br>
\begin{equation*}
\mathrm{SE}(\hat{\mu}_m) = \sqrt{\mathrm{Var}\bigg[\frac{1}{m}\sum\limits_{i=1}^m x^{(i)}\bigg]} = \frac{\sigma}{\sqrt{m}}
\end{equation*}
<br> where $\sigma$ is the true standard deviation of the value of the samples.

The denominator of $\sqrt{m}$ shows that there are <strong>less than linear returns</strong> to using more examples to estimate the gradient.

2. The training set may contain a large number of examples that all make very similar contributions to the gradient (<strong>redundant samples</strong>), so a small sample can capture most of the information in the full gradient.
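
The less-than-linear returns from the $\sigma / \sqrt{m}$ formula can be checked with a small calculation. This sketch compares estimates based on 100 and 10,000 examples, with $\sigma = 1$ assumed for convenience:

```python
import math


def standard_error(sigma, m):
    """Standard error of the mean of m samples with true standard deviation sigma."""
    return sigma / math.sqrt(m)


sigma = 1.0  # assumed value; the ratio below does not depend on it
se_small = standard_error(sigma, 100)
se_large = standard_error(sigma, 10_000)

compute_ratio = 10_000 / 100        # 100x more examples per gradient estimate
error_ratio = se_small / se_large   # only a 10x smaller standard error
```

The larger estimate costs 100 times as much computation but is only 10 times more accurate, which is why many cheap, approximate gradient steps usually converge faster than a few exact ones.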

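<strong>batch / deterministic</strong>: use the entire training set for each update <br>
<strong>stochastic / online</strong>: use a single example at a time; the term online usually refers to examples drawn from a continuous stream of new data <br>
<strong>minibatch / minibatch stochastic</strong>: use more than one but fewer than all of the training examples; these are now commonly just called stochastic methods <br>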

Minibatch sizes are decided by the following factors:
- larger batches give a more accurate estimate of the gradient, but with less than linear returns
- very small batches underutilize multi-core architectures
- small batches can offer a regularizing effect, due to the noise they add to the learning process. Generalization error is often best for a batch size of 1, but training then takes longer: it needs more steps, plus a reduced learning rate to maintain stability given the high variance of the gradient estimate

Update methods based only on the gradient $\mathbf{g}$ are usually relatively robust and can handle smaller batch sizes, such as 100. <br>
Second-order methods, which also use the Hessian $\mathbf{H}$ and compute updates such as $\mathbf{H}^{-1}\mathbf{g}$, typically require much larger batch sizes, such as 10,000, to minimize fluctuations in the estimates of $\mathbf{H}^{-1}\mathbf{g}$.

If $\mathbf{H}$ is poorly conditioned, very small changes in the estimate of $\mathbf{g}$ can cause large changes in the update $\mathbf{H}^{-1}\mathbf{g}$, even if $\mathbf{H}$ were estimated perfectly.

Recall from **4.2 Poor Conditioning**: <br>
Consider $f(x) \; = \; A^{-1}x$. <br>
When $A \; \in \; \mathbb{R}^{n\times n}$ has an eigenvalue decomposition with eigenvalues $\lambda_i$, its<br>
condition number is: $\max \limits_{i,j}\left|\frac{\lambda_i}{\lambda_j}\right|$
<br> This is the ratio of the magnitudes of the largest and smallest eigenvalues.
<br> When the condition number is large, matrix inversion is particularly sensitive to error in the input.
<br> Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse.
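
The amplification can be seen in a small numeric sketch. It assumes a diagonal $\mathbf{H}$, so the eigenvalues are simply the diagonal entries and $\mathbf{H}^{-1}\mathbf{g}$ is an element-wise division; the specific numbers are made up for illustration:

```python
def solve_diagonal(eigenvalues, g):
    """Compute H^{-1} g for a diagonal H with the given eigenvalues."""
    return [g_i / lam for g_i, lam in zip(g, eigenvalues)]


def condition_number(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue magnitude."""
    magnitudes = [abs(lam) for lam in eigenvalues]
    return max(magnitudes) / min(magnitudes)


eigenvalues = [1.0, 0.001]   # poorly conditioned: condition number 1000
g = [1.0, 1.0]               # "true" gradient
g_noisy = [1.0, 1.01]        # gradient estimate with a small error of 0.01

update = solve_diagonal(eigenvalues, g)
update_noisy = solve_diagonal(eigenvalues, g_noisy)

# The 0.01 error in g becomes an error of about 10 in the update,
# even though H itself is known exactly.
error_in_update = update_noisy[1] - update[1]
```

The error is scaled by the inverse of the smallest eigenvalue, which is why noisy minibatch gradients are a particular problem for second-order updates.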

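It is important to select minibatches **randomly**: an unbiased gradient estimate requires independent samples, so if the dataset is stored in a meaningful order, it should be shuffled before minibatches are drawn.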
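
One common way to do this is to shuffle the example order once and then walk through it in consecutive chunks. A minimal sketch, with a hypothetical helper `shuffled_minibatches` and a fixed seed chosen only to make the example reproducible:

```python
import random


def shuffled_minibatches(examples, batch_size, seed=0):
    """Yield minibatches that cover the examples once, in a random order."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)  # break any meaningful ordering in the stored dataset
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


# Stand-in dataset: ten examples stored in sorted order.
data = list(range(10))
batches = list(shuffled_minibatches(data, batch_size=4))
batch_sizes = [len(batch) for batch in batches]
```

Every example appears in exactly one minibatch per pass, and the last minibatch is smaller when the dataset size is not a multiple of `batch_size`.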