# Tutorial: Classification with Flux

This tutorial introduces classification in [Flux](https://github.com/fluxml/flux.jl) using the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset of handwritten digits to compare the classification ability of different networks. MNIST consists of 60,000 training images and 10,000 test images, with reported error rates on the dataset as low as $0.23\%$. An extended version of the dataset (EMNIST) has been available since 2017, but MNIST remains a common benchmark for different approaches. The tutorial combines and adapts two examples from the [Flux model zoo](https://github.com/FluxML/model-zoo/blob/master/vision/mnist/conv.jl).

Structure:

1. Classification using a multi-layer perceptron
2. Classification using a convolutional neural network
3. Exercises

In [None]:
using Flux, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated, partition
using Printf

## 1. Classification using a multi-layer-perceptron (MLP)

The most basic neural network architecture is the multi-layer perceptron (MLP), often also called a feed-forward neural network. It has an input layer whose size matches the number of features we feed to the network, one or more hidden layers that capture nonlinearities inherent to the data through activation functions such as rectified linear units (`relu`) or `tanh`, and an output layer. For classification, the output layer typically applies `softmax`, which returns values between $0$ and $1$ that sum to $1$ and can be read as class probabilities.

<img src="presentation/imgs/MLP.png" width="600" height="300" />

We begin by loading the images from MNIST, flattening each image into a vector, and concatenating these vectors as the columns of a single large matrix `X`.

In [None]:
# Load dataset
imgs = Flux.Data.MNIST.images()

# Stack into one batch
X = hcat(float.(reshape.(imgs, :))...);

Load the labels and encode them using one-hot encoding. Here this corresponds to creating a matrix with 10 rows, one for each digit 0-9 found in MNIST, and one column per image. Each column contains a single `true` entry in the row of the image's digit and `false` everywhere else. This lets us encode the class membership of every image in a computer-friendly manner.

In [None]:
# Load labels
labels = Flux.Data.MNIST.labels()

# One-hot-encode the labels
Y = onehotbatch(labels, 0:9);
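
To make the encoding concrete, here is a minimal sketch in plain Julia that builds the same kind of matrix by hand for three made-up labels. It does not use `onehotbatch`; the names `toy_labels`, `toy_Y` and `decoded` are hypothetical and only serve the illustration.

```julia
# Toy labels standing in for MNIST labels (hypothetical values)
toy_labels = [3, 0, 9]

# One row per digit class 0-9, one column per label
toy_Y = [label == digit for digit in 0:9, label in toy_labels]

# Decoding: row index of the single `true` in each column, shifted back to 0-9
decoded = [findfirst(col) - 1 for col in eachcol(toy_Y)]
```

Each column of `toy_Y` contains exactly one `true`, and decoding recovers the original labels. This is the round trip that `onehotbatch` and `onecold` perform in the tutorial.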

Use Flux's `Chain` to stack the individual layers, with `softmax` as the final layer to return classification probabilities for each image. `crossentropy` then acts as the loss, using the logarithm of the predicted probabilities so that confident wrong classifications are penalized heavily:

<img src="presentation/imgs/cross_entropy.png" width="450" height="450" />

In [None]:
# Set up the MLP
m = Chain(
    Dense(28^2, 32, relu),
    Dense(32, 10),
    softmax
)

loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y));
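
As a rough illustration of what the loss measures, the following sketch computes softmax and cross-entropy by hand for a single sample in plain Julia. The helpers `toy_softmax` and `toy_crossentropy` are hypothetical stand-ins, not Flux's implementations, and the scores are made up.

```julia
# Numerically stable softmax over a vector of raw scores
function toy_softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Cross-entropy between predicted probabilities p and a one-hot target y
toy_crossentropy(p, y) = -sum(y .* log.(p))

scores = [1.0, 2.0, 3.0]   # raw outputs for three hypothetical classes
p = toy_softmax(scores)

# Loss when the true class is the one with the highest score
right_loss = toy_crossentropy(p, [0.0, 0.0, 1.0])

# Loss when the true class is the one with the lowest score
wrong_loss = toy_crossentropy(p, [1.0, 0.0, 0.0])
```

Because only the true class contributes to the sum, the loss is the negative log of the probability assigned to that class. It is therefore much larger when the network puts most of its probability on a wrong class.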

Set up a data iterator with `repeated` that yields the full dataset 200 times, so that training runs for 200 full-batch epochs, and use the `ADAM` optimizer.

In [None]:
# Set up the training
dataset = repeated((X, Y), 200)
evalcb = () -> @show(loss(X, Y))
opt = ADAM()

Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10))

Assess the accuracy of the trained model on the training data.

In [None]:
# Assess the accuracy
accuracy(X, Y)

Assess the accuracy of the trained model on the previously unseen test data.

In [None]:
# Compute the test set accuracy
tX = hcat(float.(reshape.(Flux.Data.MNIST.images(:test), :))...)
tY = onehotbatch(Flux.Data.MNIST.labels(:test), 0:9)

accuracy(tX, tY)

## 2. Classification using a convolutional neural network

An alternative to MLPs is the use of convolutional neural networks, which through a series of kernel operations, pooling and fully-connected layers condense the input down to its essence and use that for classification/regression. An example CNN architecture for MNIST would look like this:

<img src="presentation/imgs/CNN.png" width="800" height="530" />

The model built below has one more convolutional layer than the one in the image. A completely classical CNN architecture would furthermore look like this:

$\text{Input} \rightarrow \text{Conv} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{ReLU} \rightarrow \text{Pool} \rightarrow \text{ReLU} \rightarrow \text{Conv} \rightarrow \text{ReLU} \rightarrow \text{Pool} \rightarrow \text{Fully Connected}$

The convolution kernels in the first layer detect edges and curves. Subsequent layers condense this information further, with their activation maps representing more and more complex features.

In [None]:
# Load labels and images
train_labels = Flux.Data.MNIST.labels()
train_imgs = Flux.Data.MNIST.images();

Collect the images together with their respective labels and structure them as minibatches, which makes them easier to digest for our CPU/GPU. For larger models, the size of the GPU memory governs the size of the batches. If you want to see the effect of minibatch size, change `batch_size` and time the training with the `@time` or `@elapsed` macros. The test set is structured as one batch.

In [None]:
# Construct minibatches
function make_minibatch(X, Y, idxs)
    X_batch = Array{Float32}(undef, size(X[1])..., 1, length(idxs))
    for i in 1:length(idxs)
        X_batch[:, :, :, i] = Float32.(X[idxs[i]])
    end
    Y_batch = onehotbatch(Y[idxs], 0:9)
    return (X_batch, Y_batch)
end

batch_size = 128
mb_idxs = partition(1:length(train_imgs), batch_size)
train_set = [make_minibatch(train_imgs, train_labels, i) for i in mb_idxs]

# Test set as one minibatch
test_imgs = Flux.Data.MNIST.images(:test)
test_labels = Flux.Data.MNIST.labels(:test)
test_set = make_minibatch(test_imgs, test_labels, 1:length(test_imgs));

Construct a classical convolutional architecture with three $\text{Conv} \rightarrow \text{ReLU} \rightarrow \text{MaxPool}$ blocks followed by a final dense layer, whose output is passed through softmax to squash the values into probabilities over the ten digits.

In [None]:
model = Chain(
    # 1st convolutional layer, taking a 28x28 image
    Conv((3, 3), 1=>16, pad=(1, 1), relu),
    MaxPool((2, 2)),
    
    # 2nd convolutional layer, taking a 14x14 image
    Conv((3, 3), 16=>32, pad=(1, 1), relu),
    MaxPool((2, 2)),
    
    # 3rd convolutional layer, taking a 7x7 image
    Conv((3, 3), 32=>32, pad=(1, 1), relu),
    MaxPool((2, 2)),
    
    # Reshape the 4d tensor of shape (3, 3, 32, N) into a 2d matrix of shape (288, N)
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),
    
    # Softmax output layer
    softmax,
);
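
The input size of 288 for the dense layer follows from the layer shapes. The sketch below redoes that arithmetic in plain Julia, assuming the usual output-size formulas for a strided convolution and for non-overlapping pooling that rounds down. The helpers `conv_out`, `pool_out` and `flattened_features` are hypothetical and not part of Flux.

```julia
# Output width of a convolution: floor((n + 2*pad - kernel) / stride) + 1
conv_out(n; kernel=3, pad=1, stride=1) = fld(n + 2pad - kernel, stride) + 1

# Output width of non-overlapping max pooling (rounding down)
pool_out(n; window=2) = fld(n, window)

# Apply one Conv -> MaxPool block per entry and flatten the final feature maps
function flattened_features(width, channels_per_block)
    for _ in channels_per_block
        width = pool_out(conv_out(width))
    end
    return width^2 * last(channels_per_block)
end

features = flattened_features(28, [16, 32, 32])
```

With 3x3 kernels and a padding of 1, each convolution keeps the width unchanged, while each pooling step halves it: 28, 14, 7, 3. The final 3x3 maps with 32 channels give 3 * 3 * 32 = 288 features per image.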

Optionally, train on a GPU by sending the data and the model to the available GPU.

In [None]:
# If GPU is enabled uncomment the lines below to load the model onto a GPU
#train_set = gpu.(train_set)
#test_set = gpu.(test_set)
#model = gpu(model)

Run the model once on a minibatch before training. Julia compiles code on first use, which makes the first call the slowest, so this moves the compilation cost out of the training loop.

In [None]:
model(train_set[1][1])

We again use crossentropy as the loss function, but additionally add Gaussian noise to the inputs during training. This is a simple form of data augmentation that tends to make the trained model more robust.

In [None]:
# Crossentropy loss between prediction and ground truth, add Gaussian noise to make model more robust
function loss(x, y)
    # Add random noise to x
    x_aug = x .+ 0.1f0 * randn(eltype(x), size(x))
    
    y_hat = model(x_aug)
    return crossentropy(y_hat, y)
end

accuracy(x, y) = mean(onecold(model(x)) .== onecold(y));

Configure the ADAM optimizer and initialize the variables that track the best accuracy and the epoch of the last improvement.

In [None]:
opt = ADAM(0.001)

best_acc = 0.0
last_improvement = 0;

Construct the training loop. After each epoch it computes the test accuracy, stops early once the target accuracy is reached, and reduces the learning rate when the accuracy has stopped improving, which can happen when the optimizer overshoots the minimum because its learning rate is too large. One would usually also save the model parameters whenever a new best model is found.

In [None]:
# Define the training loop and train for up to 100 epochs
for epoch_idx in 1:100
    global best_acc, last_improvement
    
    # Train for one epoch
    Flux.train!(loss, params(model), train_set, opt)
    
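    # Calculate the accuracy
    acc = accuracy(test_set...)
    @info(@sprintf("[%d]: Test accuracy: %.4f", epoch_idx, acc))

    # Record the best accuracy so far, so the plateau checks below measure real progress
    if acc >= best_acc
        best_acc = acc
        last_improvement = epoch_idx
    end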
    
    # Stop if accuracy is good enough
    if acc >= 0.999
        @info(" -> Early-exiting: We reached our target accuracy of 99.9%")
        break
    end
    
    # Reduce learning rate if there has been no improvement for 5 epochs
    if epoch_idx - last_improvement >= 5 && opt.eta > 1e-6
        opt.eta /= 10.0
        @warn(" -> Haven't improved in a while, dropping learning rate to $(opt.eta)!")
        
        last_improvement = epoch_idx
    end
    
    if epoch_idx - last_improvement >= 10
        @warn(" -> The model has converged.")
        break
    end
end
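
The plateau logic can be exercised in isolation, without training a model. The following sketch replays a made-up sequence of test accuracies through the same kind of bookkeeping, assuming the best accuracy is recorded each epoch. The function `replay_schedule`, its keyword names and the accuracy values are all hypothetical.

```julia
# Replay a sequence of accuracies and return the best accuracy and final learning rate
function replay_schedule(accs; eta=1e-3, patience=5, min_eta=1e-6)
    best_acc, last_improvement = 0.0, 0
    for (epoch_idx, acc) in enumerate(accs)
        # Record progress whenever the accuracy matches or beats the best so far
        if acc >= best_acc
            best_acc, last_improvement = acc, epoch_idx
        end
        # Plateau: no improvement for `patience` epochs, so reduce the learning rate
        if epoch_idx - last_improvement >= patience && eta > min_eta
            eta /= 10
            last_improvement = epoch_idx   # give the new rate a few epochs to help
        end
    end
    return best_acc, eta
end

# Accuracy peaks at epoch 3, stalls for five epochs, then improves again
accs = [0.90, 0.95, 0.97, 0.96, 0.96, 0.96, 0.96, 0.96, 0.98]
best, final_eta = replay_schedule(accs)
```

In this replay, the learning rate is divided by 10 once, at epoch 8, after five epochs without a new best accuracy.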

## 3. Exercise: Construct a different neural network for classification

- Experiment with other neural network architectures for classification:
    - Construct a radial basis network and test it on MNIST
         Hint: Combine it with Stheno to use Gaussians as special instances of radial basis functions as activations in a feed-forward network
    - Construct an autoencoder and test its classification ability on MNIST
         Hint: An [autoencoder](https://www.jeremyjordan.me/autoencoders/) consists of an encoder-decoder structure, which can be made up of only feed-forward dense layers, or of convolutional layers, which then amounts to a convolutional autoencoder.
    - Challenge: Build a variational autoencoder and test its performance
- Experiment with a mixture of the initial MLP-classification and convolutional classification, by introducing dense layers between convolutional layers.
    - How does the training time change?
    - How large is the influence of the activation functions on the performance?
- Construct the confusion matrix for the convolutional classification