## Chapter 3: First order methods


# 3.11 Mini-batch optimization 

- In machine learning applications we are almost never tasked with minimizing a single mathematical function, but rather one that consists of a *sum* of $P$ functions.  


- In other words the sort of function $g$ we very often need to minimize in machine learning applications takes the general form 

\begin{equation}
g\left(\mathbf{w}\right) = \sum_{p=1}^P g_p\left(\mathbf{w}\right).
\end{equation}

where $g_1,\,g_2,\,...,g_P$ are mathematical functions themselves.  


- In machine learning applications these functions $g_1,\,g_2,\,...,g_P$ are almost always of the same type, e.g., they can be convex quadratic functions with different constants parameterized by the same weights $\mathbf{w}$.  


- This special *summation structure* allows for a simple but very effective enhancement to virtually any local optimization scheme, called *mini-batch optimization*. 
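The summation structure can be made concrete with a small sketch. The data `X`, `y` and the function names below are hypothetical and not from the text; the sketch assumes each summand is the convex quadratic $g_p\left(\mathbf{w}\right) = \left(\mathbf{x}_p^T\mathbf{w} - y_p\right)^2$. It illustrates that both $g$ and its gradient are sums over the $P$ summands.

```python
import numpy as np

# Hypothetical data (an assumption for illustration): each summand g_p is a
# convex quadratic in w with its own constants (x_p, y_p).
rng = np.random.default_rng(0)
P, N = 5, 2
X = rng.normal(size=(P, N))   # row p holds x_p
y = rng.normal(size=P)        # entry p holds y_p

def g_p(w, p):
    """The p-th summand, g_p(w) = (x_p . w - y_p)^2."""
    return (X[p] @ w - y[p]) ** 2

def g(w):
    """The full cost: the sum of all P summands."""
    return sum(g_p(w, p) for p in range(P))

def grad_g_p(w, p):
    """Gradient of the p-th summand with respect to w."""
    return 2.0 * (X[p] @ w - y[p]) * X[p]

def grad_g(w):
    """Gradient of the full cost: the sum of the P summand gradients."""
    return sum(grad_g_p(w, p) for p in range(P))

w = np.ones(N)
print("g(w)      =", g(w))
print("grad g(w) =", grad_g(w))
```

Because the gradient of $g$ splits into the gradients of the individual $g_p$, a descent step can be taken with respect to all of the summands or only some of them.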


In [1]:
## This code cell will not be shown in the HTML version of this notebook
# import standard tools
import sys
sys.path.append('../../')
import autograd.numpy as np
import time

# import custom plotting tools
from mlrefined_libraries import math_optimization_library as optlib
from mlrefined_libraries import calculus_library as callib
static_plotter = optlib.static_plotter.Visualizer();
anime_plotter = optlib.animation_plotter.Visualizer();

# The next three lines are needed to compensate for matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

## 3.11.1  A simple idea with powerful consequences

- Suppose we were to apply a local optimization scheme to minimize a function $g$ of the form

\begin{equation}
g\left(\mathbf{w}\right) = \sum_{p=1}^P g_p\left(\mathbf{w}\right).
\end{equation}


where $g_1,\,g_2,\,...,g_P$ are all functions of the same kind (e.g., quadratics with different constants parameterized by $\mathbf{w}$).  

- What would happen if we were to try to minimize $g$ by taking descent steps in the summand functions $g_1,\,g_2,\,...,g_P$ one-at-a-time?   


- As we will see empirically throughout this text, starting with the examples below, in many instances this idea can lead to considerably faster optimization of a function $g$ consisting of a sum of $P$ functions, as described above.


- The gist of this idea is drawn graphically in the figure below for the case $P = 3$, where we compare taking a single descent step simultaneously in $g_1,\,g_2,\,...,g_P$ with taking a sequence of $P$ descent steps, first in $g_1$, then in $g_2$, and so on up to $g_P$. 

<img src= '../../mlrefined_images/math_optimization_images/batch_vs_miinbatch_functions.png' width="80%" height="auto"/>
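The two strategies compared in this section can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: it assumes gradient descent as the local scheme, hypothetical quadratic summands $g_p\left(\mathbf{w}\right) = \left(\mathbf{x}_p^T\mathbf{w} - y_p\right)^2$, and an arbitrarily chosen steplength `alpha`.

```python
import numpy as np

# Hypothetical data for P = 3 quadratic summands (an assumption for illustration).
rng = np.random.default_rng(1)
P, N = 3, 2
X = rng.normal(size=(P, N))
y = rng.normal(size=P)

def grad_g_p(w, p):
    """Gradient of the p-th summand g_p(w) = (x_p . w - y_p)^2."""
    return 2.0 * (X[p] @ w - y[p]) * X[p]

def batch_step(w, alpha):
    """One descent step taken in all P summands simultaneously."""
    full_grad = sum(grad_g_p(w, p) for p in range(P))
    return w - alpha * full_grad

def one_at_a_time_epoch(w, alpha):
    """P sequential descent steps: first in g_1, then g_2, ..., up to g_P."""
    for p in range(P):
        w = w - alpha * grad_g_p(w, p)   # each step starts from the updated w
    return w

w0 = np.zeros(N)
alpha = 0.05
w_batch = batch_step(w0, alpha)
w_mini = one_at_a_time_epoch(w0, alpha)
print("after one full step:          ", w_batch)
print("after P one-at-a-time steps:  ", w_mini)
```

Both routines touch every summand exactly once. The difference is that the sequential version re-evaluates each gradient at the most recently updated weights.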

## 3.11.2  Descending with larger mini-batch sizes

- Instead of taking $P$ sequential steps in single functions $g_p$ one-at-a-time (a mini-batch of *size $1$*), we can more generally take fewer steps in one epoch (one sweep through all $P$ functions), with each step taken with respect to *several* of the functions $g_p$ at once, e.g., two functions at-a-time, or three functions at-a-time, etc.  


- With this slight twist on the idea detailed above we take fewer steps per epoch, each with respect to a larger non-overlapping subset of the functions $g_1,\,g_2,\,...,g_P$, while still sweeping through each function exactly once per epoch. 
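A sketch of one epoch with a general mini-batch size is given below. The helper names `minibatch_indices` and `minibatch_epoch`, the example data, and the steplength are all hypothetical. The sketch assumes the batches are consecutive blocks of indices, which is one simple way to obtain non-overlapping subsets, and that the last batch may be smaller when the batch size does not divide $P$.

```python
import numpy as np

def minibatch_indices(P, batch_size):
    """Split the indices 0..P-1 into consecutive non-overlapping batches."""
    return [np.arange(start, min(start + batch_size, P))
            for start in range(0, P, batch_size)]

def minibatch_epoch(w, grad_g_p, P, batch_size, alpha):
    """One epoch: one descent step per batch, each summand used exactly once."""
    for batch in minibatch_indices(P, batch_size):
        batch_grad = sum(grad_g_p(w, p) for p in batch)
        w = w - alpha * batch_grad
    return w

# Hypothetical quadratic summands g_p(w) = (x_p . w - y_p)^2 for the demo.
rng = np.random.default_rng(2)
P, N = 6, 2
X = rng.normal(size=(P, N))
y = rng.normal(size=P)

def grad_g_p(w, p):
    """Gradient of the p-th summand."""
    return 2.0 * (X[p] @ w - y[p]) * X[p]

w0 = np.zeros(N)
for batch_size in (1, 2, 3, P):
    steps = len(minibatch_indices(P, batch_size))
    w1 = minibatch_epoch(w0, grad_g_p, P, batch_size, alpha=0.05)
    print(f"batch size {batch_size}: {steps} steps per epoch, w = {w1}")
```

Setting the batch size to $1$ recovers the one-at-a-time scheme, and setting it to $P$ recovers a single full descent step.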

## 3.11.3  Mini-batch optimization general performance

- Is the trade-off - taking more steps per epoch with a mini-batch approach as opposed to a full descent step - worth the extra effort?  Typically *yes*.  


- Often in practice when minimizing machine learning functions, an epoch of mini-batch steps like those detailed above will drastically outperform an analogous full descent step, often referred to as a *full batch* or simply a *batch* epoch in the context of mini-batch optimization.  


- A prototypical comparison of the cost function histories of batch descent and the corresponding epochs of mini-batch optimization, applied to the same hypothetical function $g$ with the same initialization $\mathbf{w}^0$, is shown in the figure below.  


- Because we take far more steps with the mini-batch approach and because each $g_p$ takes the same form, each epoch of the mini-batch approach typically outperforms its full batch analog.  


- Even when taking into account that far more descent steps are taken during an epoch of mini-batch optimization, the method often greatly outperforms its full batch analog.

<img src= '../../mlrefined_images/math_optimization_images/minibatch_functions.png' width="80%" height="auto"/>
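A cost-history comparison of this kind can be reproduced on a toy problem. The sketch below rests on several assumptions that are not in the text: hypothetical noise-free least-squares data, gradient descent as the local scheme, a steplength of `0.1`, a mini-batch size of $10$, and each step using the gradient *averaged* over the summands in its batch so that the same steplength is reasonable for both methods. The behavior on other functions, steplengths, or batch sizes may differ.

```python
import numpy as np

# Hypothetical noise-free least-squares data: g_p(w) = (x_p . w - y_p)^2.
rng = np.random.default_rng(0)
P, N = 100, 2
X = rng.normal(size=(P, N))
w_true = np.array([1.0, -2.0])
y = X @ w_true

def cost(w):
    """Full cost g(w): the sum of all P summands."""
    return float(np.sum((X @ w - y) ** 2))

def averaged_grad(w, idx):
    """Gradient of the summands indexed by idx, averaged over the batch
    (the averaging is an assumption of this sketch)."""
    residual = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ residual / len(idx)

def run(batch_size, alpha, epochs):
    """Return the cost at initialization and after each epoch, from w = 0."""
    w = np.zeros(N)
    history = [cost(w)]
    for _ in range(epochs):
        for start in range(0, P, batch_size):
            idx = np.arange(start, min(start + batch_size, P))
            w = w - alpha * averaged_grad(w, idx)
        history.append(cost(w))
    return history

batch_history = run(batch_size=P, alpha=0.1, epochs=5)       # 1 step per epoch
minibatch_history = run(batch_size=10, alpha=0.1, epochs=5)  # 10 steps per epoch

for epoch, (b, m) in enumerate(zip(batch_history, minibatch_history)):
    print(f"epoch {epoch}: batch cost = {b:.4f}, mini-batch cost = {m:.4f}")
```

Both runs start from the same initialization and process every summand once per epoch, so the two printed histories can be compared epoch by epoch.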