# Chapter 3: Stochastic Gradient Descent

## Introduction

In this chapter we will go over stochastic gradient descent (SGD). SGD is an optimization algorithm used to train (artificial) neural networks.

## Gradient Descent

We train most ML models by minimizing a loss function.

For example, let's say our model is described by a differentiable function $f$ of two arguments $x$ and $w$, where $x$ is the vector of features, $w$ is the vector of weights, and the output $y = f(x, w)$ is a single number.

## Gradient Descent

Let's say we are trying to solve a regression problem. Then we could train our model $f$ by minimizing the following loss function:

$$
  L(w) = \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i, w))^2,
$$
where $(x_i, y_i)_{i=1}^n$ represents our training data. This loss function is called **mean squared error** (MSE).

That is, we are trying to find the weights $w$ for which the model's predictions match the training data as closely as possible.

## Gradient Descent

Note that the loss function is a function of the weights only, since the training data is fixed.

Since we assumed that $f$ is differentiable we can apply the **Gradient Descent** (GD) algorithm.

The idea behind GD is very simple. Because $f$ is differentiable, $L$ is also differentiable and its gradient $\nabla L$ (read as "del L") is defined:
$$
\nabla L (w) = \left( \frac{\partial L}{\partial w_1}(w), \dots, \frac{\partial L}{\partial w_m}(w) \right),
$$
where $m$ is the dimension of $w$.
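
To make the definition concrete, the partial derivatives can be approximated numerically with central differences. This is only a sketch: the helper `numerical_gradient`, the step `h`, and the loss `example_loss` are illustrative assumptions and are not used elsewhere in the chapter.

```python
import numpy as np

def numerical_gradient(L, w, h=1e-6):
    """Approximate the gradient of L at w with central differences."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = h  # perturb only the k-th weight
        grad[k] = (L(w + e) - L(w - e)) / (2 * h)
    return grad

# Hypothetical loss L(w) = w_1^2 + 3 w_2^2, whose gradient is (2 w_1, 6 w_2).
def example_loss(w):
    return w[0]**2 + 3 * w[1]**2

approx = numerical_gradient(example_loss, [1.0, 2.0])
print(approx)  # should be close to the exact gradient (2, 12)
```

A check like this is also a handy way to test hand-derived gradients before using them in training.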

## Gradient Descent

The gradient is a vector. The geometric interpretation of the gradient $\nabla L(w)$ is that it points in the direction in which $L$ increases most quickly at the point $w$. The magnitude of $\nabla L(w)$ is the rate of increase of $L$ at $w$ in that direction. Consequently, $-\nabla L(w)$ points in the direction of fastest decrease.

The idea of GD is to minimize $L$ by making small steps in the direction $-\nabla L.$

## Gradient Descent

Define a parameter $\eta$ called the **learning rate**. It has to be a small positive number. Suppose the initial approximation of the minimum is $w^0$ (the superscript counts iterations, it is not a power). Then GD computes the approximation $w^{i+1}$ from the approximation $w^i$ by using the simple formula
$$
w^{i+1} = w^i - \eta \nabla L(w^i).
$$
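
The update rule can be sketched as a short loop. The function `gradient_descent` and the quadratic loss used to try it out are illustrative assumptions, chosen so that the true minimum is known in advance.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    """Repeat w <- w - eta * grad(w) for n_steps, starting from w0."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad(w)
    return w

# Hypothetical loss L(w) = (w_1 - 1)^2 + (w_2 + 2)^2 with its minimum at (1, -2).
def example_grad(w):
    return 2 * (w - np.array([1.0, -2.0]))

w_min = gradient_descent(example_grad, w0=[0.0, 0.0])
print(w_min)  # should be close to (1, -2)
```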

## Gradient Descent

Here is an illustration of GD. Note that in the picture the loss function is called the cost.

![](../images/GD.png){fig-align="center"}

## Gradient Descent

Let's implement GD in the case of simple linear regression. That is, we will try to predict a continuous variable $y$ from one input variable $x$, where our $f$ will be
$$
f(x; w_0, w_1) = w_0+w_1x.
$$

We will use MSE as a loss function, so $L$ will be
$$
L(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0+w_1x_i))^2.
$$

## Gradient Descent

We need to compute $\nabla L.$ We get
$$
\frac{\partial L(w_0, w_1)}{\partial w_0} = -\frac{2}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))
$$
and
$$
\frac{\partial L(w_0, w_1)}{\partial w_1} = -\frac{2}{n} \sum_{i=1}^n x_i(y_i - (w_0 + w_1 x_i)).
$$
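
Both formulas follow from the chain rule applied to each summand. As a sketch, write $r_i = y_i - (w_0 + w_1 x_i)$ for the residual of the $i$-th sample (a shorthand introduced only for this derivation). Then
$$
\frac{\partial r_i^2}{\partial w_0} = 2 r_i \frac{\partial r_i}{\partial w_0} = -2 r_i
\qquad \text{and} \qquad
\frac{\partial r_i^2}{\partial w_1} = 2 r_i \frac{\partial r_i}{\partial w_1} = -2 x_i r_i,
$$
and averaging over $i = 1, \dots, n$ gives the two partial derivatives of $L$.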

## Gradient Descent

Now we can write the code. First, define the loss and the gradient:

In [1]:
import numpy as np

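def loss(w, y, x):
  # Mean squared error of the line w[0] + w[1]*x on the data (x, y)
  return np.mean(np.square(y-w[0]-w[1]*x))

def gradient(w, y, x):
  # Partial derivatives of the MSE with respect to w[0] and w[1]
  return np.array([
    -2*np.mean(y-w[0]-w[1]*x),
    -2*np.mean(x*(y-w[0]-w[1]*x))
  ])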

## Gradient Descent

Now let's see if we can learn what the input linear function is.

In [2]:
# Let's see if we can learn that w_0 = 2 and w_1 = 5
x = np.linspace(-1, 1, 100)
y = 2+5*x

learning_rate = 0.1
w = np.array([0.0, 0.0]) # Initial weights

print(f"Initial loss: {loss(w, y, x)}")
for _ in range(200): # 200 GD steps
  w = w - learning_rate*gradient(w, y, x)
print(f"Loss after training: {loss(w, y, x)}")
print(f"Learnt weights: {w}")

Initial loss: 12.501683501683504
Loss after training: 4.935822348948857e-12
Learnt weights: [2.         4.99999619]
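
The requirement that the learning rate be small can be checked on the same data. The sketch below repeats the loss and gradient so that it runs on its own. The three learning rates are assumed values picked for illustration, with the largest one chosen so that the iterates diverge on this particular data set.

```python
import numpy as np

def loss(w, y, x):
    return np.mean(np.square(y - w[0] - w[1] * x))

def gradient(w, y, x):
    residual = y - w[0] - w[1] * x
    return np.array([-2 * np.mean(residual), -2 * np.mean(x * residual)])

x = np.linspace(-1, 1, 100)
y = 2 + 5 * x

final_losses = {}
for learning_rate in (0.01, 0.1, 1.1):
    w = np.zeros(2)  # same initial weights for every run
    for _ in range(50):
        w = w - learning_rate * gradient(w, y, x)
    final_losses[learning_rate] = loss(w, y, x)
    print(f"learning rate {learning_rate}: loss after 50 steps = {final_losses[learning_rate]:.3e}")
```

With the smallest rate the loss decreases slowly, and with the largest one each step overshoots the minimum by more than the previous step, so the loss grows instead of shrinking.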


## Stochastic Gradient Descent

In ML applications loss functions usually have the following form:
$$
  L(w) = \sum_{j=1}^nL_j(w),
$$
where the sum is either over the samples of our training set or batches of samples.

## Stochastic Gradient Descent

For example, when we were discussing logistic regression we saw that the loss function had the form
$$
  L(w) = \frac{1}{n}\sum_{j=1}^n \left[-y_j\log(f(x_j, w))-(1-y_j)\log(1-f(x_j, w))\right],
$$
where the sum is over the training samples. Here each $L_j$ is the $j$-th summand together with the factor $\frac{1}{n}$.

By the way, this loss function is called cross-entropy.

## Stochastic Gradient Descent

Suppose we have two finite discrete random variables $X$ and $Y$ taking the same values $x_1, \dots, x_n$. Let's say that the pmf of $X$ is $p$ and the pmf of $Y$ is $q$. Then cross-entropy is defined to be
$$
  H(X, Y) = -\sum_{i=1}^n p(x_i) \log(q(x_i)).
$$

When $X$ and $Y$ have the same distribution, cross-entropy is equal to the regular entropy of $X$. If they do not have the same distribution, then cross-entropy is strictly larger than the entropy of $X$.

## Stochastic Gradient Descent

Cross-entropy measures how similar the distribution of $Y$ is to that of $X$: the smaller it is, the closer $q$ is to $p$. This is not the fully correct interpretation, but it is sufficient for our purposes. The correct interpretation is a bit more subtle; you can read about it [here](https://en.wikipedia.org/wiki/Cross-entropy).

Cross-entropy is used as a loss function for classification problems.

Also note that it is not symmetric, that is
$$
H(X, Y) \ne H(Y, X)
$$
in general.
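
These properties are easy to check numerically. The sketch below uses two made-up pmfs on three values and assumes that every entry of `q` is positive, so that the logarithm is defined.

```python
import numpy as np

def cross_entropy(p, q):
    """H(X, Y) = -sum_i p(x_i) log q(x_i) for pmfs given as arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

# Two hypothetical pmfs on the same three values.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

entropy_p = cross_entropy(p, p)  # same distribution: the entropy of X
h_pq = cross_entropy(p, q)       # H(X, Y)
h_qp = cross_entropy(q, p)       # H(Y, X)
print(entropy_p, h_pq, h_qp)
```

Here `h_pq` comes out larger than `entropy_p`, and `h_pq` differs from `h_qp`.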

## Stochastic Gradient Descent

Getting back on topic, suppose our loss function has the following form
$$
  L(w) = \sum_{j=1}^nL_j(w).
$$

If there are a lot of training samples or the model has a lot of weights, then computing the full gradient $\nabla L$ becomes very computationally expensive.

## Stochastic Gradient Descent

It would be much nicer if we could compute the gradient on only a batch of our training data and use that to update the weights. Mathematically, we would like our minimization step to look like
$$
  w^{i+1} = w^i - \eta \nabla L_j(w^i),
$$
where now we only compute the gradient of the $j$-th component of our loss function. When making subsequent minimization steps we then iterate over the $L_j$.

## Stochastic Gradient Descent

This algorithm indeed works and is called **Stochastic Gradient Descent** (SGD).

When applying SGD we loop over our training set, usually in small batches. One loop over the full training set is called an **epoch**.
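
As a sketch, here is mini-batch SGD applied to the simple linear regression example from earlier in the chapter. The batch size, the number of epochs, and the random visiting order are assumptions made for illustration.

```python
import numpy as np

def gradient(w, y, x):
    # Gradient of the MSE computed on the given samples only
    residual = y - w[0] - w[1] * x
    return np.array([-2 * np.mean(residual), -2 * np.mean(x * residual)])

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 2 + 5 * x

learning_rate = 0.1
batch_size = 10  # assumed value
w = np.zeros(2)

for epoch in range(100):  # one epoch = one pass over all samples
    order = rng.permutation(len(x))  # visit the samples in a random order
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]
        # One SGD step, using the current batch only
        w = w - learning_rate * gradient(w, y[batch], x[batch])

print(f"Learnt weights: {w}")
```

Because this toy data set contains no noise, every batch gradient vanishes at the true weights, so the iterates can settle there exactly. On noisy data, SGD with a fixed learning rate would keep fluctuating around the minimum.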

## Stochastic Gradient Descent

There is one simple improvement we can make: adding **momentum**.

Define the momentum parameter $\alpha$. It has to be a positive number smaller than 1.

Set $\Delta w^0 = 0$ and recursively define $\Delta w^{i+1} = \alpha \Delta w^i - \eta \nabla L_j(w^i)$. Our minimization step is now
$$
  w^{i+1} = w^i + \Delta w^{i+1}.
$$

## Stochastic Gradient Descent

The idea is that if we stepped in the $\Delta w^{i}$ direction in the previous step, we should continue going in that direction in the current step, since the minimum is probably still that way. Hence the name momentum.

Since $\alpha < 1$, the influence of the $i$-th step decays geometrically: after $k$ further steps its contribution is scaled by $\alpha^k$. Old steps are therefore eventually forgotten, which limits how far momentum can carry us past the minimum.
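
The momentum update can be sketched on the same linear regression data. The values of `learning_rate`, `alpha` and `batch_size`, and the choice to start from $\Delta w^0 = 0$, are assumptions made for illustration.

```python
import numpy as np

def gradient(w, y, x):
    # Gradient of the MSE computed on the given samples only
    residual = y - w[0] - w[1] * x
    return np.array([-2 * np.mean(residual), -2 * np.mean(x * residual)])

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 2 + 5 * x

learning_rate = 0.02   # assumed value
alpha = 0.9            # momentum parameter, assumed value
batch_size = 10        # assumed value
w = np.zeros(2)
delta_w = np.zeros(2)  # assumed starting value Delta w^0 = 0

for epoch in range(100):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]
        # Keep a fraction alpha of the previous step, then add the new gradient step
        delta_w = alpha * delta_w - learning_rate * gradient(w, y[batch], x[batch])
        w = w + delta_w

print(f"Learnt weights: {w}")
```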

## Stochastic Gradient Descent

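One last thing: GD has the drawback that it can get stuck in a local minimum instead of reaching the global one.

When doing SGD, it is best practice to shuffle your training set after each epoch, because the randomness in the order of the batches makes it less likely to get stuck in this way. Shuffling also keeps the model from picking up patterns in the order of the samples, which may help against overfitting.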
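
One common way to shuffle features and labels together is to index both arrays with the same random permutation. The toy arrays below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 samples with 2 features each; sample k has label k.
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 2, 3, 4])

order = rng.permutation(len(y))  # a random ordering of the sample indices
X_shuffled = X[order]            # rows reordered ...
y_shuffled = y[order]            # ... and labels reordered in the same way

print(X_shuffled)
print(y_shuffled)
```

Each row of `X_shuffled` still sits next to its own label in `y_shuffled`, which is the property that matters for training.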

## Stochastic Gradient Descent

The main shortcoming of SGD is that the learning rate is fixed. Ideally we would like the learning rate to vary: far away from the minimum we could take bigger steps to train more efficiently, and near the minimum we would like to take smaller steps so that we do not overshoot.

So one easy way to improve SGD is to add a mechanism for adjusting the learning rate automatically during training. Probably the most popular algorithm that does this is [Adam](https://arxiv.org/abs/1412.6980).
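
The simplest mechanism of this kind is a fixed decay schedule that shrinks the learning rate as training progresses. The sketch below is not Adam, which instead adapts the step for each weight from running estimates of the gradient's moments. The schedule `decayed_learning_rate`, its constants, and the one-dimensional loss are assumptions made for illustration.

```python
def decayed_learning_rate(eta0, step, decay=0.01):
    """Hypothetical schedule: eta_t = eta0 / (1 + decay * t)."""
    return eta0 / (1.0 + decay * step)

# Minimize the hypothetical loss L(w) = w^2, whose gradient is 2 w.
w = 5.0
for step in range(200):
    eta = decayed_learning_rate(0.4, step)  # big steps early, smaller steps later
    w = w - eta * 2 * w

print(f"final w: {w}")
print(f"learning rate at step 0: {decayed_learning_rate(0.4, 0)}")
print(f"learning rate at step 199: {decayed_learning_rate(0.4, 199)}")
```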

## Practice task

Try to write your own implementation of logistic regression using SGD with momentum for training.

Some tips and reminders:

1. Logistic regression has the following form, where $d$ is the number of features ($n$ is already taken by the number of training samples):
$$
  f(x; w) = \frac{1}{1+e^{-(w_0+w_1x_1 + \dots + w_dx_d)}}.
$$

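2. Write $\sigma(z) = \frac{1}{1+e^{-z}}$ for the logistic function, so that $f$ is $\sigma$ applied to the linear combination in tip 1. It satisfies the following nice identity:
$$
  \sigma(-z) = 1-\sigma(z).
$$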

## Practice task

3. Use cross-entropy as a loss function:
$$
L(w) = \frac{1}{n}\sum_{j=1}^n \left[-y_j\log(f(x_j, w))-(1-y_j)\log(1-f(x_j, w))\right].
$$

4. You can derive all the partial derivatives that you need quite easily by using the [chain rule of differentiation](https://en.wikipedia.org/wiki/Chain_rule).

5. To learn how to shuffle numpy arrays, see answers [here](https://stackoverflow.com/questions/4601373/better-way-to-shuffle-two-numpy-arrays-in-unison).

## Practice task

6. Initially, set your learning rate and momentum to be very small, something like $0.0001.$

7. If you succeeded in implementing SGD, try implementing Adam as well.

8. You can generate some mock data for testing your implementation using sklearn:

In [3]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10, n_classes=2, n_redundant=10, random_state=34)
print(f"X shape: {X.shape}, y shape: {y.shape}")

X shape: (10000, 20), y shape: (10000,)


## Practice task

Keep in mind that this is a toy example. If you implement both SGD and GD, running times for GD will probably be lower. The performance benefits of SGD start to show up when you have a model with many more weights and more training data.
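
As a possible starting point for the task, the logistic function and the cross-entropy loss from the tips above can be sketched as two helpers. The names `sigmoid` and `cross_entropy_loss` and the clipping constant `eps` are assumptions. The clipping is one simple way to avoid taking the logarithm of zero.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy of predicted probabilities p against labels y."""
    p = np.clip(p, eps, 1.0 - eps)  # assumed guard against log(0)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Check the symmetry of the logistic function in its scalar argument z.
z = np.linspace(-5, 5, 11)
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))

# Loss for four made-up labels and predicted probabilities.
y_true = np.array([1, 0, 1, 0])
p_pred = np.array([0.9, 0.1, 0.8, 0.3])
example_loss = cross_entropy_loss(y_true, p_pred)
print(example_loss)
```

The gradient and the training loop are left out on purpose, since deriving and writing them is the point of the exercise.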