In [None]:
using MLDatasets
using PyPlot
using Random, Statistics
using Flux: onehotbatch

# Rapid intro to supervised learning with neural nets I: from scratch

This notebook gives a rapid introduction to supervised learning with neural networks. The example is based on [Chapter 1 of Nielsen's online book "Neural Networks and Deep Learning"](http://neuralnetworksanddeeplearning.com/chap1.html) and it guides you to set up the neural network training completely from scratch.

For further reading I also recommend the review article ["A high-bias, low-variance introduction to Machine Learning for physicists"](https://arxiv.org/abs/1803.08823).

## The MNIST hand-written digits data set

Let's first get a simple exemplary data set - the MNIST hand-written digits. The following cell downloads both the test and training parts of the data set.

In [None]:
# load full training set
train_x, train_y = float.(MNIST.traindata())
# load full test set
test_x,  test_y  = float.(MNIST.testdata());

`train_x` is now an array of shape `(28, 28, 60000)`, meaning that we have 60k images of 28$\times$28 pixels (grayscale), each showing one hand-written digit. `train_y` holds the corresponding *labels*, i.e. one number for each image, stating which digit it shows.

Let's have a look at some examples:

In [None]:
fig, axs = subplots(4,4,figsize=(7,7))

for ax in axs
    idx = rand(1:128)
    ax.imshow(train_x[:,:,idx]')
    ax.set_title("This is a $(Int(train_y[idx]))")  # labels were converted to floats above
    ax.set_xticks([])
    ax.set_yticks([])    
end

Our goal is now to train a neural network that takes an image as input and processes it in such a way as to tell us which digit it shows. For this purpose, we can take a supervised learning approach, because we have a *labeled data set*, where we know for each example the answer we would like the neural network to give.

Formally, we would like to find a function $f$, which maps every training sample $x^{(j)}$ to the corresponding label $y^{(j)}$:

$$y^{(j)}=f(x^{(j)})$$

## Neural networks

Neural networks are constructed as alternating sequences of affine-linear and non-linear maps. Therefore, it is natural to talk about *layers* as building blocks. The output of the $l$-th layer is a vector of *activations* $\mathbf{a}^{(l)}\equiv(a_1^{(l)},\ldots,a_{N_l}^{(l)})$ and $N_l$ is the *width* of the $l$-th layer. This output is obtained by processing the activations of the previous layer by an affine-linear map followed by a non-linear map:

$$a_i^{(l)} = \sigma\bigg(\sum_j W_{ij}^{(l)}a_j^{(l-1)}+b_i^{(l)}\bigg)$$

Here the **weights** $W_{ij}^{(l)}$ are the entries of an $N_l\times N_{l-1}$ matrix and $b_i^{(l)}$ are $N_l$ **biases**. $\sigma$ is some non-linear function. A common choice is the *sigmoid* function:



In [None]:
function sigmoid(x)
    # Scalar definition; apply it element-wise to arrays via broadcasting, sigmoid.(x)
    return 1 / (1 + exp(-x))
end

plot(-10.:10.,sigmoid.(-10:10.))
xlabel(L"x")
ylabel(L"\sigma(x)");

A neural network layer is commonly represented graphically as follows:


A neural network is a function constructed from neural network layers by stacking them on top of each other. The input is interpreted as the activations of the $0$-th layer:

$$\mathbf{a}^{(0)}(\mathbf{x}) = \mathbf{x}$$

The activations in the following layers, $\mathbf{a}^{(l)}(\mathbf{x})$, are obtained according to the iterative prescription given above. The resulting activations of the last layer, $\mathbf{a}^{(D)}$, constitute the output of the neural network. Therefore, a neural network is a function

$$f_\theta:\mathbf x\mapsto f_\theta(\mathbf x)$$

Here $\theta$ denotes the **parameters** of the network, i.e., the set of all weights and biases.

Remarkably, neural networks are **universal function approximators** in the limit of infinite depth, $D\to\infty$, or width, $N_l\to\infty$. This means that, provided the network is big enough, the parameters $\theta$ can be chosen such that $f_\theta$ approximates any continuous function on a compact input domain to arbitrary accuracy. *Therefore, we can be optimistic that there is also a set of parameters $\theta$ for which the neural network maps our images of hand-written digits to the corresponding digits.*

But first, let us set up our neural network:

In [None]:
"""
    initialize_network(dimensions, seed=123)

Helper function that returns a set of random initial parameters for a given network size.
The size is given as a list of widths, one width for each layer.

Args:
* `dimensions`: List of layer widths.
* `seed`: PRNG seed.

Returns:
A dictionary holding a list of "weights" and a list of "biases", where each entry
is the weight matrix/bias vector of the respective layer.
"""
function initialize_network(dimensions, seed=123)
    Random.seed!(seed)  # seed the global RNG so the initial parameters are reproducible
    
    params=Dict("weights"=> [], "biases"=> [])
    
    for j in 1:length(dimensions)-1
        weights=0.01*Random.randn(dimensions[j+1],dimensions[j])
        biases=zeros(dimensions[j+1])
        
        push!(params["weights"],weights)
        push!(params["biases"],biases)
    end
    return params
end
        
"""
    neural_network(params, x)

Evaluate the neural network with the given parameters.

Args:
* `params`: Neural network parameters.
* `x`: Batch of input images, stacked along the last dimension, e.g. shape (28, 28, batch size).

Returns: Activations of the last layer, one column per input image.
"""
function neural_network(params, x)
    
    a = reshape(x, (:,size(x)[end])) # flatten input and assign it to the activations of the zeroth layer
    # ! evaluate the network
    for (W,b) in zip(params["weights"],params["biases"])
        a = sigmoid.(W*a .+ b)
    end
    # ! return activations of last layer
    return a
end

Now we are set to see what the network thinks about our images of digits.

Clearly, there are some constraints on the network size for this purpose. The width of the first layer has to match the number of pixels in the images ($28\times28$). Also, we would like to have ten numbers as output - the network is supposed to indicate its answer ($0-9$) through the maximum value of the ten outputs. In addition we introduce one intermediate layer of width $N_1=100$.
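
To get a feeling for the size of such a network, the number of trainable parameters can be counted directly from the layer widths: each layer contributes an $N_l\times N_{l-1}$ weight matrix and $N_l$ biases. The following is a minimal sketch; the helper name `count_parameters` is hypothetical and not used elsewhere in this notebook.

```julia
# Hypothetical helper (not part of the notebook): count all weights and biases
# of a fully connected network with the given layer widths.
count_parameters(dims) = sum(dims[l+1] * dims[l] + dims[l+1] for l in 1:length(dims)-1)

# Layer widths described in the text: 28*28 inputs, 100 hidden units, 10 outputs
n_params = count_parameters((28*28, 100, 10))
println(n_params)  # 78400 + 100 + 1000 + 10 = 79510
```

Almost all of these parameters sit in the first weight matrix, which connects the 784 pixels to the hidden layer.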

*Remark: here you might get a warning regarding the absence of GPUs. No reason to be concerned.*

In [None]:
# ! Get random parameters for the desired network size (28*28,100,10)
params = initialize_network((28*28,100,10))

# ! Evaluate the network on the first three examples in our data set
neural_network(params, float.(train_x[:,:,1:3]))

## Cost function

Now, how can we find parameters $\theta$ that allow our neural network to do the job? How can our network *learn* to predict the correct labels for the input images? For this purpose we set up a *cost function* that defines the objective and quantifies how well the neural network solves the task.

$$\mathcal L_{\mathcal T}(\theta)=\frac{1}{|\mathcal T|}\sum_{(x,y)\in \mathcal T}\big(y-f_\theta(x)\big)^2$$

Here, $\mathcal T$ denotes the training data set.

Our network gives ten numbers as output ($f_\theta^r(x)$ for $r=0\ldots 9$) and we would like to interpret it such that the index of the maximal output indicates the digit shown in the input image. Therefore, we rewrite the cost function as

$$\mathcal L_{\mathcal T}(\theta)=\frac{1}{|\mathcal T|}\sum_{(x,y)\in \mathcal T}\sum_{r=0}^9\big(\delta_{r,y}-f_\theta^r(x)\big)^2$$
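
The Kronecker delta $\delta_{r,y}$ in this expression is a *one-hot encoding* of the label. As a small illustration, the contribution of a single sample to the cost can be written out by hand. This is a sketch under assumptions: the helper `one_hot` and the constant network output of `0.1` are made up for this example and are not part of the notebook.

```julia
# Hypothetical helper: one-hot vector with entries delta_{r,y} for r = 0,...,9
one_hot(y, classes=0:9) = Float64[r == y for r in classes]

target = one_hot(3)       # 1.0 at the position of digit 3 (index 4, since Julia counts from 1)
output = fill(0.1, 10)    # made-up network output: 0.1 for every digit

# Contribution of this single sample to the cost: sum_r (delta_{r,y} - f^r(x))^2
sample_cost = sum((target .- output) .^ 2)
println(sample_cost)      # 9 * 0.1^2 + 0.9^2, i.e. approximately 0.9
```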

In [None]:
"""
    cost_function(predictions, labels)

Evaluate the cost function for given predictions and labels.

Args:
* `predictions`: Predictions from the neural net. Array of shape 10 x |T|.
* `labels`: Correct labels for the corresponding images. Array of |T| labels from 0 to 9.

Returns: Cost associated with the neural network predictions for the given data.
"""
function cost_function(predictions, labels)

    labels = onehotbatch(labels, 0:9)

    cost = sum((predictions-labels).^2)
    return cost / size(labels)[2]
end

In [None]:
batch = train_x[:,:,1:128]    # select a batch of images
labels = train_y[1:128] # and corresponding labels

# ! compute neural network predictions
predictions = neural_network(params,batch)

# ! evaluate the cost function
cost_function(predictions,labels)

## Stochastic gradient descent and backpropagation

You saw above that our network does not do particularly well yet in classifying the digits. We now have to **train** it and the basic idea is to do **gradient-based optimization** to minimize the cost function.

Generally, we can attempt to minimize the cost function using gradient descent, an iterative procedure where in each step the parameter update
$$\theta^{(j+1)}\leftarrow\theta^{(j)}-\eta\nabla_\theta\mathcal L_{\mathcal T}(\theta^{(j)})$$
is performed with some **learning rate** $\eta$. This way, we can reduce the loss until we reach a stationary state with $\nabla_\theta\mathcal L(\theta)=0$. Unfortunately, however, the cost landscape $\mathcal L_{\mathcal T}(\theta)$ is highly non-convex, meaning that it typically comprises a large number of local minima and saddle points. Plain gradient descent is prone to getting stuck in these very sub-optimal stationary points.

This is one reason why neural networks are in practice trained using **stochastic gradient descent** (SGD) or some more sophisticated variants of it. The term *stochastic* refers to the fact that in each step gradients of the cost function are computed only on a small randomly chosen subset of the full training set $\mathcal B_j\subset\mathcal T$. These subsets $\mathcal B_j$ are called **(mini-)batches** and the SGD update rule for step number $j$ is
$$\theta^{(j+1)}\leftarrow\theta^{(j)}-\eta\nabla_\theta\mathcal L_{\mathcal B_j}(\theta^{(j)})$$
The stochastic noise introduced in this way enables us to avoid getting stuck in saddle points and to overcome *cost barriers* such that we ultimately reach better minima. Besides that, the batch-wise evaluation of gradients has practical advantages, because typical data sets of interest in machine learning often exceed the available memory capacities, such that computing gradients on the full data set would be extremely costly.
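
The SGD update rule can be tried out on a toy problem before applying it to the neural network. The sketch below fits a single slope parameter $w$ in $y=wx$. The data, the function names `batch_gradient` and `sgd`, and the hyperparameter values are all assumptions chosen for illustration only.

```julia
using Random

# Toy data (an assumption for illustration): noise-free samples of y = 2x
Random.seed!(0)
xs = randn(256)
ys = 2.0 .* xs

# Gradient of the mini-batch cost L_B(w) = mean((w*x - y)^2) with respect to w
batch_gradient(w, x, y) = 2 * sum((w .* x .- y) .* x) / length(x)

function sgd(xs, ys; eta=0.1, batch_size=32, epochs=20)
    w = 0.0
    for _ in 1:epochs
        order = shuffle(1:length(xs))              # new random order in every epoch
        for start in 1:batch_size:length(xs)
            idx = order[start:min(start + batch_size - 1, end)]   # indices of one mini-batch
            w -= eta * batch_gradient(w, xs[idx], ys[idx])        # SGD update step
        end
    end
    return w
end

w_fit = sgd(xs, ys)
println(w_fit)   # should end up close to the true slope 2
```

Each update only sees 32 of the 256 samples, so individual steps are noisy, but on this simple convex problem the parameter still converges to the true slope.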

But how do we compute these gradients? In fact, this can be done very efficiently thanks to the layered structure of neural networks, which allows us to use the **backpropagation** algorithm. Backpropagation is essentially based on the observation that knowing the gradients of the cost function with respect to activations in the $(l+1)$-th layer enables us to very easily compute gradients of the cost function with respect to activations in the $l$-th layer because of the chain rule:

$$\frac{\partial\mathcal L_{\mathcal B}}{\partial a_i^{(l)}}=\sum_j\frac{\partial\mathcal L_{\mathcal B}}{\partial a_j^{(l+1)}}\frac{\partial a_j^{(l+1)}}{\partial a_i^{(l)}}$$

To write down the backpropagation rules for our fully connected neural network, we introduce, moreover, the pre-activations

$$z_i^{(l)}=\sum_j W_{ij}^{(l)}a_j^{(l-1)}+b_i^{(l)}$$

which are related to the activations via $a_i^{(l)}=\sigma\big(z_i^{(l)}\big)$. Including the pre-activations, we can write derivatives of the cost function w.r.t. our variational parameters $W_{ij}^{(l)}$ and $b_i^{(l)}$ as

$$\frac{\partial\mathcal L_{\mathcal B}}{\partial W_{ij}^{(l)}}=\sum_k\frac{\partial\mathcal L_{\mathcal B}}{\partial 
z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial W_{ij}^{(l)}}=\frac{\partial\mathcal L_{\mathcal B}}{\partial z_i^{(l)}}a_j^{(l-1)}$$
and
$$\frac{\partial\mathcal L_{\mathcal B}}{\partial b_{i}^{(l)}}=\sum_k\frac{\partial\mathcal L_{\mathcal B}}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial b_{i}^{(l)}}=\frac{\partial\mathcal L_{\mathcal B}}{\partial z_i^{(l)}}$$

Therefore, we introduce

$$\begin{aligned}
\Delta_j^{(D)}&=\frac{\partial\mathcal L_{\mathcal B}}{\partial z_j^{(D)}}\equiv\frac{\partial\mathcal L_{\mathcal B}}{\partial a_j^{(D)}}\sigma'\big(z_j^{(D)}\big)\\
\Delta_j^{(l)}&=\frac{\partial\mathcal L_{\mathcal B}}{\partial z_j^{(l)}}=
\sum_k\frac{\partial\mathcal L_{\mathcal B}}{\partial z_k^{(l+1)}}\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}=\sum_k
\Delta_k^{(l+1)}W_{kj}^{(l+1)}\sigma'\big(z_j^{(l)}\big)
\end{aligned}$$

Here $\sigma'(\cdot)$ denotes the derivative of the non-linearity $\sigma(\cdot)$. The last four equations give us the prescription to obtain the gradients of the cost function as follows:

1. Perform a *forward evaluation* of the network and keep the activations $a_j^{(l)}$ as well as the pre-activations $z_j^{(l)}$ in memory.
2. Perform a *backward pass* to iteratively obtain the $\Delta_j^{(l)}$, and at each step compute the gradients with respect to the variational parameters in the corresponding layer.
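
The two steps can be checked numerically on a single layer, where the backward pass consists only of the top-layer rule. The sketch below compares the backpropagated gradient for one weight with a finite-difference estimate. It is self-contained and uses its own names (`sig`, `sig_deriv`, `layer_cost`) and random toy values, all of which are assumptions made for this illustration. The factor 2 stems from differentiating the squared cost.

```julia
using Random

sig(x) = 1 / (1 + exp(-x))
sig_deriv(x) = sig(x) * (1 - sig(x))

# Cost of a single layer for one sample: sum_r (y_r - a_r)^2 with a = sig(W*x + b)
layer_cost(W, b, x, y) = sum((y .- sig.(W * x .+ b)) .^ 2)

# Random toy layer with 4 inputs and 3 outputs
Random.seed!(1)
W, b = 0.5 .* randn(3, 4), 0.5 .* randn(3)
x, y = rand(4), [1.0, 0.0, 0.0]

# 1. Forward evaluation, keeping pre-activations z and activations a
z = W * x .+ b
a = sig.(z)

# 2. Backward pass: Delta = dL/dz, then the gradients with respect to W and b
Delta = 2 .* (a .- y) .* sig_deriv.(z)
grad_W = Delta * x'    # outer product, entries Delta_i * x_j
grad_b = Delta

# Central finite-difference estimate of dL/dW[2,3] for comparison
h = 1e-6
Wp = copy(W); Wp[2, 3] += h
Wm = copy(W); Wm[2, 3] -= h
numeric = (layer_cost(Wp, b, x, y) - layer_cost(Wm, b, x, y)) / (2h)

println(grad_W[2, 3], "  ", numeric)   # the two values should agree closely
```

Such a gradient check is a common way to catch index or sign mistakes in a hand-written backward pass.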

Notice that due to the linearity of the gradient we can first compute the per-sample gradients $\nabla_\theta\mathcal L_{\{x\}}(\theta)$ for all $x\in\mathcal B$ and then obtain the mini-batch gradient as their average:

$$\nabla_\theta\mathcal L_{\mathcal B}(\theta)=\frac{1}{|\mathcal B|}\sum_{x\in\mathcal B}\nabla_\theta\mathcal L_{\{x\}}(\theta)$$

In [None]:
"""
    sigmoid_deriv(x)

Evaluate the derivative of the sigmoid non-linearity.

Args:
* `x`: Number or array of numbers.

Returns: sigma'(x), evaluated element-wise.
"""
function sigmoid_deriv(x)
    # ! compute and return the derivative of the sigmoid function
    y = sigmoid.(x)
    return y .* (1 .- y)
end

"""
    cost_function_individual_gradients(params, x, label)

Compute the per-sample gradient by backpropagation.

Args:
* `params`: Neural network parameters.
* `x`: Input image (2D array).
* `label`: Label associated with the input image.

Returns:
Gradient of the cost function for the given sample.
"""
function cost_function_individual_gradients(params, x, label)

    label = onehotbatch(label, 0:9) # get one-hot encoding of label

    a = vec(x) # flatten input
    # Set up storage space for (pre-)activations
    a_list = [a]
    z_list = []
    
    # FORWARD PASS
    # ! forward evaluation of the network, storing (pre-)activations
    for (W,b) in zip(params["weights"],params["biases"])
        
        z = W*a + b
        a = sigmoid.(z)
        
        push!(z_list,z)
        push!(a_list,a)
    end

    # BACKWARD PASS
    # Set up storage space for gradients (layer-wise)
    weight_gradients = [zero(w) for w in params["weights"]]
    bias_gradients = [zero(b) for b in params["biases"]]

    # ! Apply backpropagation rules for top layer
    # (the constant factor 2 from differentiating the squared cost is left out;
    # it only rescales the learning rate)
    Delta = (a_list[end]-label) .* sigmoid_deriv(z_list[end])
    weight_gradients[end] = Delta * a_list[end-1]'
    bias_gradients[end] = Delta

    # ! Propagate Delta backwards through the remaining layers
    for l in 1:length(a_list)-2
        z = z_list[end-l]
        sp = sigmoid_deriv(z)
        Delta = (transpose(params["weights"][end-l+1]) * Delta) .* sp
        weight_gradients[end-l] = Delta * a_list[end-l-1]'  # outer product
        bias_gradients[end-l] = Delta
    end
    
    return Dict("weights" => weight_gradients, "biases" => bias_gradients)
end

"""
    cost_function_gradients(params, samples, labels)

Compute the gradients for a given mini-batch of data.

Args:
* `params`: Neural network parameters.
* `samples`: Batch of training images.
* `labels`: Labels corresponding to the given images.

Returns:
Tuple `(weight_gradients, bias_gradients)` holding the sum of the per-sample gradients
over the mini-batch. The division by the batch size is left to the caller.
"""
function cost_function_gradients(params, samples, labels)
    
    weight_gradients = [zero(w) for w in params["weights"]]
    bias_gradients = [zero(b) for b in params["biases"]]
    
    # Evaluate individual gradients
    for i in 1:length(labels)
        grads = cost_function_individual_gradients(params, samples[:,:,i], labels[i])
        for j in 1:length(params["weights"])
            weight_gradients[j] += grads["weights"][j]
            bias_gradients[j] += grads["biases"][j]
        end
    end
    
    return weight_gradients, bias_gradients
end

## Training and test data sets

You saw that above we loaded the training data (`train_x`, `train_y`) as well as the test data (`test_x`, `test_y`). The reason for this is that we would like to test the performance of our trained network on examples that it has not seen before. This check tells us how well the network *learned to generalize* from the examples in our training data. Moreover, we want to prevent our network from learning to solve the task by exploiting specific features that are only present in our training data set (*overfitting*). Once the network starts to overfit, the cost achieved on the test data set (**test error**) grows as training progresses, i.e. the generalization quality deteriorates. Therefore, it is important to monitor the test error during training.

## Training loop

We are now ready to compose the training loop from all the pieces defined above. Each iteration of the loop is called an **epoch**. In each epoch, the training data set is shuffled and split into equally sized **(mini-)batches**. Then the gradients are computed sequentially for each mini-batch and the parameters of the neural network are updated according to the SGD update rule.

After completion of each epoch we assess the performance of the network by evaluating the cost function on the test data set and, in addition, by counting how many of the images in the test data set are classified correctly.

In [None]:
"""
    evaluate_predictions(predictions, labels)

Helper function that counts how many of the given predictions match the labels.

Args:
* `predictions`: Predictions from neural network (=activations on output layer).
* `labels`: Correct labels.

Returns: Number of correct predictions, i.e., number of cases in which the index of the maximal
activation matches the given label.
"""
function evaluate_predictions(predictions, labels)
    # Julia indices start at 1, so subtract 1 to map the index of the maximum to a digit 0-9
    pred_labels = [argmax(predictions[:, i]) - 1 for i in 1:size(predictions)[2]]
        
    return sum(pred_labels .== labels)
end


prng_key = Random.seed!(123)
params = initialize_network((28*28,100,10))

# Here we define the hyperparameters
num_epochs = 10 # Number of epochs to loop over
learning_rate = 3.0 # Learning rate
batch_size = 128 # Size of mini-batches

# Compute the number of mini-batches that matches the chosen mini-batch size
batch_number = floor(Int,size(train_x)[end] / batch_size)

# Evaluate network and assess performance
predictions = neural_network(params, test_x)
current_cost = cost_function(predictions, test_y)  # reuse the predictions computed above
correct_predictions = evaluate_predictions(predictions, test_y)
println("Initial cost: $(current_cost)")
println("Correctly predicted labels: $(correct_predictions) / $(length(test_y))")

for n in 1:num_epochs
    
    println("Epoch $(n)")
    order = shuffle(1:length(train_y))
    n_used = batch_number * batch_size  # samples that fit into full batches; the rest is dropped
    samples = reshape(train_x[:, :, order[1:n_used]], 28, 28, batch_size, :)
    labels = reshape(train_y[order[1:n_used]], batch_size, :)

    
    for i in 1:batch_number
        # compute gradients
        weight_gradients, bias_gradients = cost_function_gradients(params, samples[:,:,:,i], labels[:,i])  

        # Perform SGD parameter update step
        for j in 1:length(params["weights"])
            params["weights"][j] -= learning_rate*weight_gradients[j]/batch_size
            params["biases"][j] -= learning_rate*bias_gradients[j]/batch_size
        end
        
    end

    # Evaluate network and assess performance
    predictions = neural_network(params, test_x)
    current_cost = cost_function(predictions, test_y)    
    correct_predictions = evaluate_predictions(predictions, test_y)
    println("Current cost: $(current_cost)")
    println("Correctly predicted labels: $(correct_predictions) / $(length(test_y))")
end