---
title: 12.4 SGD
subject:  Optimization
subtitle: 
short_title: 12.4 SGD
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: 
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/10_Ch_11_PCA_Apps/121-Apps.ipynb)

{doc}`Lecture notes <../lecture_notes/Lecture 22 - An Introduction to Backpropagation.pdf>`

## Reading

Material related to this page, as well as additional exercises, can be

## Learning Objectives

By the end of this page, you should know:
- 

\section*{Stochastic Gradient Descent}

Last class, motivated by machine learning applications, we discussed how to apply gradient descent to solve the following optimization problem:

\begin{equation}
\text{minimize } \text{loss}((z_i,y_i),x) = \text{minimize } \sum_{i=1}^N l(m(z_i;x)-y_i) \quad (L)
\end{equation}

over the parameters $x$ of a model $m$ so that $m(z_i;x) \approx y_i$ for all training data input/output pairs $(z_i,y_i)$, $i=1,\ldots,N$.

The bulk of our effort was spent on understanding how to compute the gradient of $l(m(z_i;x)-y_i)$ with respect to the model parameters $x$, with a particular focus on models $m$ that can be written as the following composition of models:

\begin{align*}
O_0 &= z_i \\
O_1 &= m_1(O_0;x_1), \quad O_1 \in \mathbb{R}^{p_1}, O_0 \in \mathbb{R}^{p_0} \\
O_2 &= m_2(O_1;x_2), \quad O_2 \in \mathbb{R}^{p_2}, O_1 \in \mathbb{R}^{p_1} \quad (DNN) \\
&\vdots \\
O_L &= m_L(O_{L-1};x_L), \quad O_L \in \mathbb{R}^{p_L}, O_{L-1} \in \mathbb{R}^{p_{L-1}}
\end{align*}

which is the structure of contemporary models used in machine learning called deep neural networks (we'll talk about these much more later). We made two key observations about the model structure that allowed us to effectively apply the matrix chain rule to compute gradients $\frac{\partial l}{\partial x_1}, \ldots, \frac{\partial l}{\partial x_L}$:

1) computing $\frac{\partial l}{\partial x_j}$ only requires partial derivatives of $l$ and of layers $j, j+1, \ldots, L$;

2) if we start from layer $L$ (i.e., with $\frac{\partial l}{\partial x_L}$) and work our way backwards, we can
   (i) reuse previously computed partial derivatives, and
   (ii) save memory by exploiting that $\frac{\partial l}{\partial x_L}$ is a row vector.

The resulting algorithm is called backpropagation, and is a key enabling technology in modern machine learning. You will learn more about this in ESE 5460.

Now, despite all of this cleverness, when the model parameter vectors $x_1, \ldots, x_L$ and layer outputs $O_1, \ldots, O_L$ are very high dimensional (it is not uncommon for each $x_i$ to have hundreds of thousands or even millions of components), computing the gradient $\nabla_x l(m(z_i;x)-y_i)$ of a single term in the sum (L) can be quite costly. Add to that the fact that the number of data points $N$ is often very large (on the order of millions in many settings), and we quickly run into serious computational bottlenecks. And remember, this is all just so we can run a single iteration of gradient descent. This may seem hopeless, but luckily, there is a very simple trick that lets us work around this problem: stochastic gradient descent.

Stochastic Gradient Descent (SGD) is the workhorse algorithm of modern machine learning and has been rediscovered by various communities over the past 70 years, although it is usually credited to Robbins and Monro for a paper they wrote in 1951.

\textbf{Key Idea:} Since our loss function can be written as a sum over examples, i.e.

\begin{equation}
(LL) \quad \text{loss}((z_i,y_i),x) = \frac{1}{N} \sum_{i=1}^N l(m(z_i;x)-y_i) \quad (\text{loss}(x) = \mathbb{E} l_i(x))
\end{equation}

then the gradient is also a sum: $\nabla_x \text{loss}(x) = \frac{1}{N} \sum_{i=1}^N \nabla_x l_i(x)$. We therefore expect each individual gradient $\nabla_x l_i$ to carry some useful information. SGD minimizes (LL) by following the gradient of a \textbf{single randomly selected example} (or of a small batch of $B$ randomly selected examples).

The SGD algorithm can be summarized as follows: Start with an initial guess $x^{(0)}$, and at each iteration $k=0,1,2,\ldots$, do:

(i) Select an index $i \in \{1,\ldots,N\}$ at random
(ii) Update
\begin{equation}
x^{(k+1)} = x^{(k)} - s^{(k)} \nabla_x l_i(x^{(k)}) \quad (SGD)
\end{equation}

using the gradient of only the $i$th loss term $l_i(x) = l(m(z_i;x)-y_i)$.

As before, $s^{(k)} > 0$ is a step size that can change from one iteration to the next.
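The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `sgd` and `grad_squared_error` are hypothetical, the model is the toy affine map $m(z;x) = x_1 z + x_2$, the loss is $l(r) = r^2$, the data are noiseless, and the step size is held constant.

```python
import random

def sgd(data, grad_li, x0, step_size, num_iters, seed=0):
    """Run SGD: at each iteration pick one example and step along its gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(num_iters):
        i = rng.randrange(len(data))      # step (i): random index
        g = grad_li(x, data[i])           # gradient of the i-th loss term only
        s = step_size(k)                  # step size s^(k) > 0
        x = [xj - s * gj for xj, gj in zip(x, g)]   # step (ii): update
    return x

def grad_squared_error(x, example):
    """Gradient of l_i(x) = (x[0]*z + x[1] - y)**2 with respect to x."""
    z, y = example
    residual = x[0] * z + x[1] - y
    return [2.0 * residual * z, 2.0 * residual]

# Toy data generated from y = 2 z + 1 (an assumption made for this example).
data = [(j / 10.0, 2.0 * (j / 10.0) + 1.0) for j in range(-10, 11)]

x_hat = sgd(data, grad_squared_error, x0=[0.0, 0.0],
            step_size=lambda k: 0.1, num_iters=5000)
print(x_hat)  # should be close to [2.0, 1.0]
```

Each iteration touches a single example, so the per-iteration cost does not grow with $N$.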

This method works shockingly well in practice and is computationally tractable because, at each iteration, only the gradient of the $i$th loss term needs to be computed. Modern versions of this algorithm use a mini-batch: step (i) selects $B$ indices $i_1,\ldots,i_B$ at random, and step (ii) replaces $\nabla_x l_i(x^{(k)})$ with the average gradient:

\begin{equation}
\frac{1}{B} \sum_{b=1}^B \nabla_x l_{i_b}(x^{(k)}) \quad (\hat{G})
\end{equation}

The overall idea behind why SGD works (take ESE 605U if you want to see a rigorous proof) is that while the gradient used in each individual update (SGD) may not be an accurate estimate of the gradient of the overall loss function $\text{loss}(x)$, we are still following $\nabla_x \text{loss}(x)$ "on average". This also explains why you may want to use a mini-batch of size $B$ to compute a better gradient estimate $(\hat{G})$: averaging more loss terms gives a better approximation of the true gradient. The tradeoff is that as $B$ becomes larger, computing $(\hat{G})$ becomes more computationally demanding.
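To make the "on average" statement concrete, here is a short calculation. It assumes (this is not stated above) that the index $i$ is drawn uniformly at random from $\{1,\ldots,N\}$, and that the mini-batch indices $i_1,\ldots,i_B$ are drawn independently and uniformly:

\begin{align*}
\mathbb{E}_i\left[\nabla_x l_i(x)\right] &= \sum_{i=1}^N \frac{1}{N} \nabla_x l_i(x) = \nabla_x \text{loss}(x), \\
\mathbb{E}\left[\frac{1}{B}\sum_{b=1}^B \nabla_x l_{i_b}(x)\right] &= \nabla_x \text{loss}(x), \\
\mathbb{E}\left\|\frac{1}{B}\sum_{b=1}^B \nabla_x l_{i_b}(x) - \nabla_x \text{loss}(x)\right\|^2 &= \frac{1}{B}\, \mathbb{E}_i\left\|\nabla_x l_i(x) - \nabla_x \text{loss}(x)\right\|^2.
\end{align*}

The first two lines say that both the single-example gradient and the mini-batch average are correct in expectation. The last line says that, under these sampling assumptions, the mean squared error of the mini-batch estimate shrinks like $1/B$.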

\section*{Linear Classification and the Perceptron}

An important problem in machine learning is that of binary classification. In one of the online case studies, you saw how to use least squares to solve this problem. Here, we offer an alternate perspective that will lead us to one important historical reason for the emergence of deep neural networks.

The problem setup for linear binary classification is as follows. We are given a set of $N$ vectors $z_1,\ldots,z_N \in \mathbb{R}^n$ with associated binary labels $y_1,\ldots,y_N \in \{-1,+1\}$. The objective in linear classification is to find an affine function $x^Tz+v$, defined by unknown parameters $x \in \mathbb{R}^n$ and $v \in \mathbb{R}$, that strictly separates the two classes. We can pose this as finding a feasible solution to the following linear inequalities:

\begin{equation}
(LC) \quad
\begin{cases}
x^Tz_i + v > 0 & \text{if } y_i = +1 \\
x^Tz_i + v < 0 & \text{if } y_i = -1
\end{cases}
\end{equation}

The geometry of this problem is illustrated on the right. There are three key components:

1) The separating hyperplane $H = \{z \in \mathbb{R}^n : x^Tz+v = 0\}$. This is an affine subspace of $\mathbb{R}^n$ (a subspace only when $v = 0$): it is the solution set to the linear equation

\[x^Tz = -v.\]

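The coefficient matrix here is $x^T \in \mathbb{R}^{1\times n}$, and so rank$(x^T) = 1$ (assuming $x \neq 0$). This tells us that $\dim \text{Null}(x^T) = n-1$, and since $H$ is a translate of $\text{Null}(x^T)$, we also have $\dim H = n-1$. In $\mathbb{R}^2$, this is the equation of a line (here $z_1, z_2$ denote the components of $z$, and we assume $x_2 \neq 0$):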

\[\begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + v = x_1z_1 + x_2z_2 + v = 0 \implies z_2 = -\frac{x_1}{x_2}z_1 - \frac{v}{x_2}\]

In $\mathbb{R}^3$, this is the equation of a plane with normal vector $x$ passing through the point $-\frac{v}{\|x\|^2}x$. In $\mathbb{R}^n$, it is called a hyperplane. A key feature of a hyperplane is that it splits $\mathbb{R}^n$ into two half-spaces, i.e., the subsets of $\mathbb{R}^n$ on either side.

2) The half-space $H^+ = \{z \in \mathbb{R}^n : x^Tz+v > 0\}$, which is the "half" of $\mathbb{R}^n$ for which $x^Tz+v > 0$. We want all of our positive (+) examples to live here.

3) The half-space $H^- = \{z \in \mathbb{R}^n : x^Tz+v < 0\}$, which is the "half" of $\mathbb{R}^n$ for which $x^Tz+v < 0$. We want all of our negative (-) examples to live here.
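As a small illustration of these three sets, the sketch below evaluates $x^Tz+v$ and reports which set a point belongs to, then checks the inequalities (LC) for a labeled data set. The function names and the numbers are hypothetical, chosen only for this example.

```python
def affine_score(x, v, z):
    """Evaluate the affine function x^T z + v."""
    return sum(xj * zj for xj, zj in zip(x, z)) + v

def side_of_hyperplane(x, v, z):
    """Return which of H+, H-, or H the point z belongs to."""
    score = affine_score(x, v, z)
    if score > 0:
        return "H+"
    if score < 0:
        return "H-"
    return "H"

def strictly_separates(x, v, points, labels):
    """Check (LC): y_i * (x^T z_i + v) > 0 for every labeled example."""
    return all(y * affine_score(x, v, z) > 0 for z, y in zip(points, labels))

x, v = [1.0, -1.0], 0.5
points = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.5]]
labels = [+1, -1, +1]

print([side_of_hyperplane(x, v, z) for z in points])
print(strictly_separates(x, v, points[:2], labels[:2]))
print(strictly_separates(x, v, points, labels))
```

The third point lies exactly on $H$, so it violates the strict inequality in (LC) regardless of its label.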


The problem of finding the parameters $(x,v)$ defining the classifier in (LC) can be solved using linear programming, a family of optimization algorithms that you'll learn about in BE 3040 and ESE 6050. It can also be solved using SGD as applied to a special loss function called the hinge loss:

\[
\text{loss}((z_i,y_i);(x,v)) = \frac{1}{N} \sum_{i=1}^N \max\{1-y_i(x^Tz_i+v), 0\},
\]

which is a commonly used loss function for classification (you'll learn why in your ML classes).
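As a sketch of how SGD would act on this loss, consider a single term $l_i(x,v) = \max\{1-y_i(x^Tz_i+v), 0\}$. Away from the kink at $y_i(x^Tz_i+v) = 1$, its gradient with respect to the stacked parameters $(x,v)$ is

\[
\nabla_{(x,v)}\, l_i(x,v) =
\begin{cases}
-y_i \begin{bmatrix} z_i \\ 1 \end{bmatrix} & \text{if } y_i(x^Tz_i+v) < 1 \\
0 & \text{if } y_i(x^Tz_i+v) > 1.
\end{cases}
\]

At the kink itself the function is not differentiable; using the value $0$ there is one valid subgradient choice, and it is an assumption of this sketch. With a step size of $s^{(k)} = 1$ (also an assumption), an SGD step on example $i$ therefore either adds $y_i \begin{bmatrix} z_i \\ 1 \end{bmatrix}$ to the parameters or leaves them unchanged.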

The reason we are taking this little digression is that applying SGD to the hinge loss gives us The Perceptron Algorithm:

Start with an initial guess $(x^{(0)},v^{(0)})$. For each iteration $k=0,1,2,\ldots$, do:
\begin{enumerate}
    \item Draw a random index $i \in \{1,\ldots,N\}$.
    \item If $y_i(x^{(k)T}z_i+v^{(k)}) < 1$, update
    \[
    \begin{bmatrix} x^{(k+1)} \\ v^{(k+1)} \end{bmatrix} = \begin{bmatrix} x^{(k)} \\ v^{(k)} \end{bmatrix} + y_i\begin{bmatrix} z_i \\ 1 \end{bmatrix}. \quad \text{(U)}
    \]
    Otherwise, if $y_i(x^{(k)T}z_i+v^{(k)}) \geq 1$, do not update: $(x^{(k+1)},v^{(k+1)}) = (x^{(k)},v^{(k)})$.
\end{enumerate}

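This algorithm goes through randomly drawn examples $(z_i,y_i)$ one at a time, and applies the update (U) only when the margin condition $y_i(x^{(k)T}z_i+v^{(k)}) \geq 1$ is violated, which includes every example the classifier currently gets wrong. The intuition is that (U) "nudges" the classifier to be "less" wrong on that example: since $y_i^2 = 1$, the update increases $y_i(x^Tz_i+v)$ by exactly $\|z_i\|^2+1$.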
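A minimal Python sketch of these steps is given below. The function names, the zero initial guess, the fixed iteration budget, and the four-point data set are all assumptions made for illustration.

```python
import random

def predict(x, v, z):
    """Return +1 if x^T z + v > 0 and -1 otherwise."""
    return 1 if sum(xj * zj for xj, zj in zip(x, z)) + v > 0 else -1

def perceptron(points, labels, num_iters=1000, seed=0):
    """Perceptron updates: SGD on the hinge loss with unit step size."""
    rng = random.Random(seed)
    x, v = [0.0] * len(points[0]), 0.0      # initial guess (x^(0), v^(0))
    for _ in range(num_iters):
        i = rng.randrange(len(points))      # draw a random index
        z, y = points[i], labels[i]
        margin = y * (sum(xj * zj for xj, zj in zip(x, z)) + v)
        if margin < 1:                      # margin condition violated
            x = [xj + y * zj for xj, zj in zip(x, z)]   # update (U)
            v = v + y
    return x, v

# A small linearly separable data set (hypothetical).
points = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]]
labels = [+1, +1, -1, -1]

x_hat, v_hat = perceptron(points, labels)
print([predict(x_hat, v_hat, z) for z in points])
```

In practice one would stop once no example violates the margin condition; the fixed budget here just keeps the sketch short.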

This incremental update works, and you can show that if there exists a solution to (LC), the perceptron algorithm will find it. People got REALLY EXCITED ABOUT THIS. See next page for a NYT article about the perceptron algorithm, which in hindsight seems a little silly given that we now know it's just SGD applied to a particular loss function. But then again, so is most of today's AI!

\section*{Single and Multi Layer Perceptrons}

Given the excitement about the perceptron, why do we not use it anymore? It turns out that it is very easy to stump! Consider the following set of positive and negative examples:

[XOR function diagram: no linear classifier can separate the + from the -. Is AI doomed?]

These define an XOR function: the positive examples are in quadrants where $\text{sign}(z_1) \neq \text{sign}(z_2)$ and the negative ones are in quadrants for which $\text{sign}(z_1) = \text{sign}(z_2)$. These can't be separated by a linear classifier!

**TO DO**: Electronic Brain teaches itself pic


But suppose we were allowed to have two classifiers, and then combine them using a nonlinearity?

\begin{tikzpicture}
\draw[->] (-2,0) -- (2,0) node[right] {$z_1$};
\draw[->] (0,-2) -- (0,2) node[above] {$z_2$};
\draw[blue, thick] (-2,1) -- (2,1) node[right] {$2$};
\draw[purple, thick] (1,-2) -- (1,2);
\draw[magenta, thick] (-2,-2) -- (2,2);
\node[magenta] at (1.5,1.5) {$H_1^+ = \{z_2 - z_1 > 0\}$};
\node[purple] at (-1.5,1.5) {$H_2^+ = \{z_1 - z_2 > 0\}$};
\end{tikzpicture}

In the image above, we define a pink classifier that returns $f_1(z) = z_1$ and
a purple classifier that returns $f_2(z) = -z_2$. If we define our output to be

\[f(z) = f_1(z) f_2(z) = -z_1 z_2,\]

then we see that this 'works':

\begin{tabular}{c|c|c|c|c}
Quadrant & 1 & 2 & 3 & 4 \\
\hline
sign $f_1(z)$ & +1 & -1 & -1 & +1 \\
sign $f_2(z)$ & -1 & -1 & +1 & +1 \\
sign $f(z)$ & -1 & +1 & -1 & +1
\end{tabular}
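The table can be checked numerically with one representative point per quadrant. The sketch assumes $f_1(z) = z_1$ and $f_2(z) = -z_2$, which is the choice consistent with both the table and $f(z) = -z_1z_2$; the representative points are hypothetical.

```python
def sign(t):
    """Sign of a nonzero number (the points below avoid zero)."""
    return 1 if t > 0 else -1

def f1(z):
    return z[0]

def f2(z):
    return -z[1]

def f(z):
    """Combine the two linear classifiers with a product nonlinearity."""
    return f1(z) * f2(z)

# One representative point per quadrant 1, 2, 3, 4.
representatives = {1: (1.0, 1.0), 2: (-1.0, 1.0), 3: (-1.0, -1.0), 4: (1.0, -1.0)}

table = {q: (sign(f1(z)), sign(f2(z)), sign(f(z)))
         for q, z in representatives.items()}
for q, row in table.items():
    print(q, row)

# XOR labels: +1 when the signs of z_1 and z_2 differ, -1 when they agree.
xor_labels = {q: (1 if sign(z[0]) != sign(z[1]) else -1)
              for q, z in representatives.items()}
```

The last entry of each row agrees with the XOR label of that quadrant.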

This 'worked'! The two key ingredients here are:
\begin{enumerate}
\item Having intermediate computation, called hidden layers
\item Allowing for some nonlinearity.
\end{enumerate}

These two ingredients are combined to define the Multilayer Perceptron (MLP).

A single hidden layer MLP is defined by the equations:

\begin{align*}
h &= \sigma(W_1 z + b_1) \quad \text{(MLP1)} \\
o &= W_2 h + b_2
\end{align*}

The key features of (MLP1) are:
\begin{itemize}
\item An element-wise nonlinearity $\sigma$, called an activation function.
\item The input is $z \in \mathbb{R}^n$.
\item The hidden layer is defined by a weight matrix $W_1 \in \mathbb{R}^{h \times n}$ and a bias vector $b_1 \in \mathbb{R}^h$
\item The output layer is defined by a weight matrix $W_2 \in \mathbb{R}^{p \times h}$ and bias vector $b_2 \in \mathbb{R}^p$
\item The overall model maps an input $z \in \mathbb{R}^n$ to an output $o \in \mathbb{R}^p$.
\end{itemize}

In practice, (MLP1) is trained to find $W_1, W_2, b_1, b_2$ using SGD and backpropagation.
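The forward pass of (MLP1) can be sketched in plain Python as follows. The helper names are hypothetical, ReLU is used as the activation purely as an example, and the weights are hand-picked (not trained) so that the output equals $|z_1 - z_2|$.

```python
def matvec(W, u):
    """Matrix-vector product, with W stored as a list of rows."""
    return [sum(wij * uj for wij, uj in zip(row, u)) for row in W]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def relu(u):
    """Element-wise activation; ReLU is an assumed choice here."""
    return [max(ui, 0.0) for ui in u]

def mlp1(z, W1, b1, W2, b2, sigma=relu):
    """Forward pass of (MLP1): h = sigma(W1 z + b1), o = W2 h + b2."""
    h = sigma(add(matvec(W1, z), b1))
    return add(matvec(W2, h), b2)

# n = 2 inputs, 3 hidden units, p = 1 output.
W1 = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 1.0, 0.0]]
b2 = [0.0]

print(mlp1([2.0, 0.5], W1, b1, W2, b2))
print(mlp1([0.5, 2.0], W1, b1, W2, b2))
```

Training would adjust `W1, b1, W2, b2`; this sketch only evaluates the model for fixed parameters.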

\section*{Why do we need a nonlinear activation function?}

Suppose we didn't include $\sigma(\cdot)$, and defined our MLP as $h = W_1 z + b_1$, $o = W_2 h + b_2$.
If we eliminate the hidden layer variable $h$, we get

\[o = W_2(W_1 z + b_1) + b_2 = \underbrace{W_2 W_1}_W z + \underbrace{W_2 b_1 + b_2}_b = Wz + b.\]

This shows that adding a hidden layer does not increase the expressivity of our model:
without the activation function, our model class reduces to affine functions. In some
sense, this nonlinearity is the "secret sauce" of MLPs.
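The collapse to a single affine map is easy to confirm numerically. In the sketch below, the helper names and the particular matrices are hypothetical; the check compares the two-layer computation without $\sigma$ against $Wz + b$ with $W = W_2W_1$ and $b = W_2b_1 + b_2$.

```python
def matvec(W, u):
    """Matrix-vector product, with W stored as a list of rows."""
    return [sum(wij * uj for wij, uj in zip(row, u)) for row in W]

def matmul(A, B):
    """Matrix-matrix product of A (p x h) and B (h x n)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -2.0]
W2 = [[1.0, -1.0, 2.0], [0.5, 0.0, 1.0]]
b2 = [0.0, 1.0]
z = [1.0, -2.0]

# Two layers with no activation function.
two_layer = add(matvec(W2, add(matvec(W1, z), b1)), b2)

# Equivalent single affine map.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, z), b)

print(two_layer, one_layer)
```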

Some common activation functions include:

\subsection*{The Sigmoid function:}

\begin{itemize}
\item Maps its input into the interval $(0,1)$:

\[\text{sigmoid}(x) = \frac{1}{1+e^{-x}}\]

[Insert sigmoid function graph here]

\item It can be viewed as a "soft version" of the step function $\sigma(x) = 1$ if $x > 0$ and $\sigma(x) = 0$ if $x \leq 0$.
\item This allows for binary classification over the classes $\{0,1\}$.
\end{itemize}

\subsection*{The Rectified Linear Unit (ReLU):}

\[\text{ReLU}(x) = \max\{x, 0\}.\]

[Insert ReLU function graph here]

Which activation function to use is a bit of an art, but there are generally accepted tricks
of the trade that you'll learn about in ESE 5460. There are also many more activation functions
than these two, with new ones still being invented.
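Both activation functions are one-liners in code. The sketch below uses only the Python standard library; the branch on the sign of $x$ in `sigmoid` is an implementation choice (not from the notes) that avoids overflow in `math.exp` for very negative inputs.

```python
import math

def sigmoid(x):
    """Sigmoid 1 / (1 + exp(-x)), evaluated in a numerically safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # x < 0, so exp(x) cannot overflow
    return e / (1.0 + e)

def relu(x):
    """Rectified linear unit max{x, 0}."""
    return max(x, 0.0)

for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(x, sigmoid(x), relu(x))
```

In an MLP these scalar functions are applied to each component of the hidden vector separately.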

\section*{Deep MLPs}

There is nothing preventing us from adding more hidden layers. The $L$-hidden-layer
MLP is defined as:

\begin{align*}
h_1 &= \sigma(W_1 z + b_1) \\
h_2 &= \sigma(W_2 h_1 + b_2) \\
&\vdots \\
h_L &= \sigma(W_L h_{L-1} + b_L) \\
o &= W_{L+1} h_L + b_{L+1}
\end{align*}

Shown on the right is an example with 3 hidden layers.

The important thing to notice is these functions are
compatible with our discussion on backpropagation, meaning computing gradients with respect to
the parameters $W_1,\ldots,W_{L+1}, b_1,\ldots,b_{L+1}$ can be done efficiently!
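The forward pass of a deep MLP is the same loop repeated once per hidden layer. The sketch below is a plain-Python illustration with hypothetical helper names, ReLU as an assumed activation, the network input passed as a vector `z`, and small hand-picked weights for a network with two hidden layers.

```python
def matvec(W, u):
    """Matrix-vector product, with W stored as a list of rows."""
    return [sum(wij * uj for wij, uj in zip(row, u)) for row in W]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def relu(u):
    """Element-wise ReLU activation (an assumed choice)."""
    return [max(ui, 0.0) for ui in u]

def deep_mlp(z, layers, sigma=relu):
    """Forward pass: hidden layers apply sigma, the output layer is affine.

    `layers` is a list [(W_1, b_1), ..., (W_{L+1}, b_{L+1})].
    """
    *hidden_layers, (W_out, b_out) = layers
    h = z
    for W, b in hidden_layers:
        h = sigma(add(matvec(W, h), b))
    return add(matvec(W_out, h), b_out)

layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, -1.0]),   # hidden layer 1
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),   # hidden layer 2
    ([[2.0, -1.0]], [0.5]),                    # output layer
]

print(deep_mlp([1.0, 3.0], layers))
```

For the input `[1.0, 3.0]` the hidden vectors are `[1.0, 2.0]` and `[3.0, 0.0]`, which gives the output `[6.5]`.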

In the online notes, we'll show you how to take advantage of automatic differentiation to
efficiently train MLPs in code.

\begin{tikzpicture}[scale=0.7]
\node[circle,draw,fill=blue!20] (x1) at (0,0) {$x_1$};
\node[circle,draw,fill=blue!20] (x2) at (2,0) {$x_2$};
\node[circle,draw,fill=blue!20] (x3) at (4,0) {$x_3$};
\node[circle,draw,fill=blue!20] (x4) at (6,0) {$x_4$};

\node[circle,draw,fill=blue!20] (h11) at (0,2) {$h_1$};
\node[circle,draw,fill=blue!20] (h12) at (1.5,2) {$h_2$};
\node[circle,draw,fill=blue!20] (h13) at (3,2) {$h_3$};
\node[circle,draw,fill=blue!20] (h14) at (4.5,2) {$h_4$};
\node[circle,draw,fill=blue!20] (h15) at (6,2) {$h_5$};

\node[circle,draw,fill=blue!20] (h21) at (1,4) {$h_1$};
\node[circle,draw,fill=blue!20] (h22) at (3,4) {$h_2$};
\node[circle,draw,fill=blue!20] (h23) at (5,4) {$h_3$};

\node[circle,draw,fill=blue!20] (h31) at (2,6) {$h_1$};
\node[circle,draw,fill=blue!20] (h32) at (4,6) {$h_2$};

\node[circle,draw,fill=blue!20] (o1) at (2,8) {$o_1$};
\node[circle,draw,fill=blue!20] (o2) at (4,8) {$o_2$};

\draw (x1) -- (h11) -- (h21) -- (h31) -- (o1);
\draw (x2) -- (h12) -- (h22) -- (h32) -- (o2);
\draw (x3) -- (h13) -- (h23);
\draw (x4) -- (h14);
\draw (h15) -- (h23);

\node[text width=3cm] at (-2,0) {Input layer};
\node[text width=3cm] at (-2,2) {Hidden layer};
\node[text width=3cm] at (-2,4) {Hidden layer};
\node[text width=3cm] at (-2,6) {Hidden layer};
\node[text width=3cm] at (-2,8) {Output layer};

\end{tikzpicture}

