# Tutorial: Introduction to Machine Learning with Flux

This tutorial introduces [Flux](https://github.com/fluxml/flux.jl) as a machine learning framework, along with its stylistic features and syntactic quirks. It condenses the 60-minute tutorial from the [Flux model-zoo](https://github.com/FluxML/model-zoo/blob/master/tutorials/60-minute-blitz.jl) and lecture content created for the scientific machine learning course at TUM.

## Outline

**Section 1.** [Machine Learning Frameworks](#frameworks)

**Section 2.** [Reminder - Arrays in Flux](#arrays)

**Section 3.** [Sending Arrays to GPUs](#gpu)

**Section 4.** [Automatic Differentiation](#autodiff)

**Section 5.** [Example - Training a Classifier](#example)

In [None]:
using Pkg
Pkg.add("Metalhead")
Pkg.add("Images")
Pkg.add("Tracker")

## Agenda:

- Quick overview of machine learning frameworks in Julia
- Introduction to Flux with an example
    - Arrays
    - Autodifferentiation
    - Example: Let's build a classifier
- Self-guided Tutorials
    - Regression in Flux
    - Classification in Flux
    - Gaussian Processes using Stheno

## 1: Machine Learning Frameworks <a name="frameworks"></a>

### What is the logic behind Julia's Machine learning ecosystem?

- PyTorch and TensorFlow both have to cope with the constraints of Python, while being constrained DSLs of their own that make customization, such as custom gradients, difficult
- The explosion in the computational power needed to achieve state-of-the-art results opens the door for high-performance languages
- Julia is one of the languages in a position to address problems of expressiveness while also satisfying the high-performance computing needs of cutting-edge research
- Julia's ecosystem treats machine learning inherently as a language-level problem
- Computing derivatives with AD directly in Julia allows gradients to be taken through any other Julia model, whereas other frameworks have to handle models written in C++

### The Julia Ecosystem:

- Flux (in tutorials throughout the day)
- ONNX
- Probabilistic Programming
    - Gen (in tutorials later today)
    - Turing (in tutorials later today)
    - Stheno (in tutorials this morning)
- Automatic Differentiation
    - Zygote (used in some tutorials)
    - Cassette

### Reinforcement Learning:

- Underdeveloped as of now
- If you want the absolute best performance
    - Use PyTorch, whose core is natively written in C++
    - Interface with Julia using Cxx.jl

## 2: Reminder - Arrays in Flux <a name="arrays"></a>

At its core, Flux operates on ordinary Julia `arrays`; a two-dimensional array is a matrix

In [None]:
x_matrix = [1 2; 3 4]

To generate large arrays automatically one can rely on `rand`; the call below produces an array of shape $5 \times 3$ filled with uniform random numbers between 0 and 1

In [None]:
x = rand(5, 3)

Matrix multiplication works with a MATLAB-like syntax

In [None]:
W = randn(5, 10)
x = rand(10)
W * x

## 3: Sending Arrays to GPUs <a name="gpu"></a>

Julia can be fully integrated with NVIDIA GPUs with the `CUDAdrv`, `CUDAnative`, and `CuArrays` packages

For example,

In [None]:
using CuArrays

N = 2^5

x_d = CuArrays.fill(1.0f0, N)
y_d = CuArrays.fill(2.0f0, N)

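Benchmarking the performance of the addition (`@btime` comes from the `BenchmarkTools` package, which has to be installed separately)

In [None]:
using BenchmarkTools
@btime CuArrays.@sync $y_d .+= $x_d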

While CuArrays and the other CUDA integrations produce fast code, we want to stay at a higher level of abstraction and therefore send our arrays to the GPU with the `cu` function, which can be as easy as

In [None]:
x = cu(rand(5, 3))

It is possible to write your own GPU kernels using `CUDAnative`, but most users will fare better by staying at a high level of abstraction. As with the rest of Julia, though, the option is always there to dive into the code of individual functions and write your own operations.

For example, we can define our own custom addition operation in CUDA

In [None]:
using CUDAnative

function gpu_add1!(y, x)
    for i = 1:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

@cuda gpu_add1!(y_d, x_d)

## 4: Automatic Differentiation <a name="autodiff"></a>

The second cornerstone of Flux is its automatic differentiation ability, which seamlessly interfaces with the rest of the Julia ecosystem:

*"Automatic Differentiation (AD) is a set of techniques based on the mechanical application of the chain rule to obtain derivatives of a function given as a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations such as additions or elementary functions such as exp(). By applying the chain rule of derivative calculus repeatedly to these operations, derivatives of arbitrary order can be computed automatically, and accurate to working precision."*

*Conceptually, AD is different from symbolic differentiation and approximations by divided differences.*

(Source: [www.autodiff.org/?module=Introduction](http://www.autodiff.org/?module=Introduction))

Let's start with a polynomial

In [None]:
f(x) = 3x^2 + 2x + 1

Let Flux take the derivative for us

In [None]:
using Tracker: gradient

df(x) = gradient(f, x; nest=true)[1]

Breaking the command down: we pass the AD engine the function `f` and the point `x` at which to differentiate. The keyword `nest=true` keeps the result differentiable so that higher-order derivatives can be taken, and the index `[1]` picks the derivative with respect to the first (here the only) argument from the returned tuple. Any generic Julia code can be differentiated this way, as long as the mathematical functions it uses are differentiable. Take, for example, a Taylor approximation of the `sin` function:

In [None]:
mysin(x) = sum((-1)^k * x^(1 + 2k)/factorial(1 + 2k) for k in 0:5)

Testing the AD-engine

In [None]:
x = 0.5

val = gradient(mysin, x)
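
A quick way to sanity-check such a result without any AD package is a central finite difference. The sketch below redefines `mysin` so that it is self-contained; the helper `central_diff` and its step size `h` are illustrative choices and not part of Tracker or Flux. Since `mysin` is a truncated Taylor series of `sin`, its derivative at `0.5` should be close to `cos(0.5)`.

```julia
# Truncated Taylor series of sin, as defined above
mysin(x) = sum((-1)^k * x^(1 + 2k) / factorial(1 + 2k) for k in 0:5)

# Hypothetical helper: central finite-difference approximation of f'(x)
central_diff(f, x; h = 1e-6) = (f(x + h) - f(x - h)) / (2h)

x = 0.5
fd = central_diff(mysin, x)

# The numerical derivative should agree with cos(x) to several digits
println(isapprox(fd, cos(x); atol = 1e-8))
```

Unlike AD, the finite difference is only an approximation whose quality depends on `h`, which is why AD is preferred for actual training.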

This machinery can also be applied to `arrays`. Testing it with a custom loss function, we obtain gradients for each of the three inputs `W`, `b`, and `x`. Such gradients can be used in classical optimization as well as in machine learning models.

In [None]:
myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

gradient(myloss, W, b, x)
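
For this particular loss the gradients can also be written down by hand, which gives a useful reference for what the AD engine should return. Because `sum(W * x .+ b)` equals $\sum_i \sum_j W_{ij} x_j + \sum_i b_i$, the derivative with respect to $W_{ij}$ is $x_j$, with respect to $b_i$ it is $1$, and with respect to $x_j$ it is $\sum_i W_{ij}$. The following sketch uses only plain Julia; the names `dW`, `db`, and `dx` are illustrative.

```julia
myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

# Closed-form gradients of sum(W * x .+ b)
dW = ones(3) * x'           # each row of dW equals x
db = ones(3)                # every bias entry enters the sum with weight 1
dx = vec(sum(W, dims = 1))  # column sums of W

# Spot check one entry of dW with a forward finite difference
h = 1e-6
Wh = copy(W)
Wh[2, 3] += h
fd = (myloss(Wh, b, x) - myloss(W, b, x)) / h
println(isapprox(fd, dW[2, 3]; atol = 1e-6))
```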

A second way to obtain gradients in Flux is to mark arrays with `param`, which tells Flux to trace the operations performed on these arrays for later differentiation. This approach has sadly been deprecated.

In [None]:
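using Tracker: param, back!, grad

W = param(randn(3, 5))
b = param(zeros(3))
x = rand(5)

y = sum(W * x .+ b)

back!(y)  # backpropagate from the scalar y through the traced operations
grad(W)   # gradient of y with respect to W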

Although this approach is deprecated, it still illustrates gradient descent well. We update the weights by stepping against the gradient, following the formula

<p align="center">
    $\text{weights} = \text{weights} - \text{learning\_rate} \times \text{gradient}$
</p>

In [None]:
using Tracker: update!

η = 0.1
# Step each tracked parameter defined above against its gradient
for p in (W, b)
    update!(p, -η * grad(p))
end
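
To see the update rule converge, it helps to apply it to a problem with a known answer. The sketch below is plain Julia and independent of Tracker: the objective `(w - 3)^2`, its hand-written derivative, and the helper `descend` are all chosen purely for illustration.

```julia
# Toy objective with a known minimum at w = 3 (chosen for illustration)
objective(w) = (w - 3)^2
dobjective(w) = 2 * (w - 3)   # its derivative, written by hand

# Hypothetical helper: repeat w = w - η * gradient for a fixed number of steps
function descend(w, η, steps)
    for _ in 1:steps
        w -= η * dobjective(w)
    end
    return w
end

w_final = descend(0.0, 0.1, 100)
println(w_final)
```

With this step size the distance to the minimum shrinks by a factor of `0.8` per step, so `w_final` ends up very close to `3`.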

Working with the optimisers built into Flux, we only have to define the learning rate for gradient descent

In [None]:
using Flux
opt = Descent(0.01)

We now only need to construct a data iterator, which determines how many batches (and, by repetition, how many epochs) are seen, and which is then fed to Flux's `train!` function for the actual training process. An alternative to this approach is to loop over our dataset ourselves and update the parameters in the process.

In [None]:
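# small illustrative model: 10 input features, 2 output classes
m = Chain(Dense(10, 5, relu), Dense(5, 2), softmax)
data, labels = rand(10, 100), fill(0.5, 2, 100)
loss(x, y) = sum(Flux.crossentropy(m(x), y))
Flux.train!(loss, params(m), [(data, labels)], opt)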

## 5: Example - Training a Classifier <a name="example"></a>

We import `Metalhead.jl`, a computer vision package that ships predefined and pretrained models as well as data loaders.

In [2]:
using Statistics
using Flux, Flux.Optimise
using Metalhead, Images
using Metalhead: trainimgs
using Images.ImageCore
using Flux: onehotbatch, onecold
using Base.Iterators: partition

┌ Info: CUDAdrv.jl failed to initialize, GPU functionality unavailable (set JULIA_CUDA_SILENT or JULIA_CUDA_VERBOSE to silence or expand this message)
└ @ CUDAdrv /home/lpaehler/.julia/packages/CUDAdrv/mCr0O/src/CUDAdrv.jl:69


Working with the CIFAR10 dataset we are faced with the following classification task:

<p align="center"><img src="imgs/cifar10.png" width="600" height="300" /></p>


We begin by one-hot encoding the classes, i.e. each class label is turned into a vector with a single 1 at the index of its class and 0 everywhere else. These binary labels are easier for the model to digest.

In [3]:
Metalhead.download(CIFAR10)
X = trainimgs(CIFAR10)
labels = onehotbatch([X[i].ground_truth.class for i in 1:50000], 1:10);
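
To make the encoding concrete, here is a minimal plain-Julia sketch of the idea behind `onehotbatch`. The helpers `onehot_vec` and `onehot_mat` are hypothetical names for illustration and produce ordinary Boolean arrays rather than Flux's dedicated one-hot types.

```julia
# Hypothetical helper: one-hot vector for `label` over the ordered `classes`
onehot_vec(label, classes) = [label == c for c in classes]

# Stack the one-hot vectors as columns, one column per sample
onehot_mat(labels, classes) = reduce(hcat, [onehot_vec(l, classes) for l in labels])

classes = 1:4
Y = onehot_mat([2, 4, 1], classes)
println(Y)
```

Every column of `Y` contains exactly one `true`, located at the row of the corresponding class.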

Functions to access images from the data set and assess the ground truth

In [4]:
image(x) = x.img
ground_truth(x) = x.ground_truth

ground_truth (generic function with 1 method)

The images of the CIFAR10 dataset can be viewed as $32 \times 32$ matrices with 3 color channels. The images are then rearranged into batches and a validation set in preparation for minibatch learning. The noise introduced by minibatches is meant to help the optimiser avoid getting stuck at saddle points.

In [5]:
getarray(X) = float.(permutedims(channelview(X), (2, 3, 1)))
imgs = [getarray(X[i].img) for i in 1:50000];

Split into a training and validation dataset with $98\%$ reserved for training and $2\%$ left for validation

In [6]:
train = gpu.([(cat(imgs[i]..., dims=4), labels[:, i]) for i in partition(1:49000, 1000)])
valset = 49001:50000
valX = cat(imgs[valset]..., dims=4) |> gpu
valY = labels[:, valset] |> gpu;

10×1000 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
 0  0  0  0  1  0  1  0  0  0  0  0  0  …  0  0  0  0  1  0  1  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  1  0  0  0  1  0  0  0  0  1  1
 0  0  0  0  0  0  0  0  1  0  0  0  0     0  0  0  1  0  0  0  1  0  0  0  0
 0  0  0  0  0  0  0  0  0  1  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 0  0  1  0  0  0  0  0  0  0  0  0  0     0  0  1  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  1  0  0  0  0  0  0  0  …  1  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  1  0  0  0
 0  0  0  0  0  0  0  0  0  0  1  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  0  0  0  1  0  0  0  1  0     0  0  0  0  0  0  0  0  0  0  0  0
 0  1  0  1  0  0  0  0  0  0  0  0  1     0  0  0  0  0  0  0  0  0  1  0  0
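
The batching above relies on `partition` from `Base.Iterators`, which splits an index range into consecutive chunks. A small sketch with toy data (the matrix `data` is made up for illustration) shows the mechanics on a scale that is easy to inspect:

```julia
using Base.Iterators: partition

# Split the indices 1:10 into consecutive batches of at most 4 elements
batches = collect(partition(1:10, 4))
println(batches)

# Each batch of indices can then select the matching columns of a data matrix
data = reshape(collect(1.0:20.0), 2, 10)   # toy data: 2 features, 10 samples
minibatches = [data[:, idx] for idx in batches]
println(size.(minibatches))
```

The last batch is shorter whenever the number of samples is not a multiple of the batch size; with 49000 training images and batches of 1000 this does not occur in the tutorial.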

A convolutional neural network is well suited to extracting features from images: it slides its convolution kernels over the image and returns intermediate representations that capture successively higher-order features.

In [7]:
m = Chain(
    Conv((5, 5), 3=>16, relu),
    MaxPool((2, 2)),
    Conv((5, 5), 16=>8, relu),
    MaxPool((2, 2)),
    x -> reshape(x, :, size(x, 4)),
    Dense(200, 120),
    Dense(120, 84),
    Dense(84, 10),
    softmax) |> gpu

Chain(Conv((5, 5), 3=>16, relu), MaxPool((2, 2), pad = (0, 0, 0, 0), stride = (2, 2)), Conv((5, 5), 16=>8, relu), MaxPool((2, 2), pad = (0, 0, 0, 0), stride = (2, 2)), #9, Dense(200, 120), Dense(120, 84), Dense(84, 10), softmax)
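
The input size `200` of the first `Dense` layer follows from the spatial sizes after each layer. The sketch below traces them, assuming unpadded convolutions with stride 1 and non-overlapping $2 \times 2$ pooling as in the model above; the helpers `conv_out` and `pool_out` are illustrative names, not Flux functions.

```julia
# Output width of an unpadded, stride-1 convolution with a k×k kernel
conv_out(n, k) = n - k + 1
# Output width of non-overlapping 2×2 max pooling
pool_out(n) = n ÷ 2

n = 32                 # CIFAR10 images are 32×32
n = conv_out(n, 5)     # first Conv((5, 5), 3=>16)  -> 28
n = pool_out(n)        # first MaxPool((2, 2))      -> 14
n = conv_out(n, 5)     # second Conv((5, 5), 16=>8) -> 10
n = pool_out(n)        # second MaxPool((2, 2))     -> 5

features = n * n * 8   # 8 channels after the second convolution
println(features)      # 200, the input size of Dense(200, 120)
```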

We rely on the cross-entropy loss, whose logarithm penalizes confident misclassifications heavily:

<p align="center"><img src="imgs/cross_entropy.png" width="450" height="450" /></p>

This is used in conjunction with the ADAM optimiser.

In [8]:
using Flux: crossentropy, Momentum

loss(x, y) = sum(crossentropy(m(x), y))
opt = ADAM(0.001);
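
To get a feeling for how the loss behaves, the following plain-Julia sketch computes a cross-entropy by hand on two toy predictions. The helper `xent` is a hypothetical stand-in that averages over samples (columns); whether this matches the exact normalisation of Flux's `crossentropy` in a given version is an assumption.

```julia
# Hypothetical helper: cross-entropy averaged over the columns (samples);
# yhat holds predicted probabilities, y the one-hot targets
xent(yhat, y) = -sum(y .* log.(yhat)) / size(y, 2)

y    = [1.0 0.0; 0.0 1.0]   # two samples, two classes
good = [0.9 0.2; 0.1 0.8]   # mostly correct predictions
bad  = [0.1 0.8; 0.9 0.2]   # mostly wrong predictions

println(xent(good, y))
println(xent(bad, y))
```

The wrong predictions receive a much larger loss than the correct ones, because the logarithm grows quickly as the probability assigned to the true class approaches zero.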

Keep track of the model's accuracy

In [9]:
accuracy(x, y) = mean(onecold(m(x), 1:10) .== onecold(y, 1:10));
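
The accuracy above compares the most probable predicted class with the true class for every sample. A minimal sketch of the same idea in plain Julia, with the hypothetical helpers `colargmax` and `toy_accuracy` standing in for `onecold` and `accuracy`:

```julia
using Statistics: mean

# Hypothetical helper: index of the largest entry in every column
colargmax(A) = [argmax(A[:, j]) for j in 1:size(A, 2)]

# Fraction of columns whose predicted class matches the target class
toy_accuracy(yhat, y) = mean(colargmax(yhat) .== colargmax(y))

y    = [1 0 0; 0 1 1]                # targets: classes 1, 2, 2
yhat = [0.7 0.4 0.6; 0.3 0.6 0.4]    # predictions: classes 1, 2, 1

println(toy_accuracy(yhat, y))
```

Two of the three toy samples are classified correctly, so the result is two thirds.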

Train the network for 10 epochs of optimisation

In [10]:
epochs = 10

# Callback function
evalcb = () -> @show(accuracy(valX, valY))

for epoch = 1:epochs
    Flux.train!(loss, params(m), train, opt, cb=evalcb)
end;

accuracy(valX, valY) = 0.085
accuracy(valX, valY) = 0.094
accuracy(valX, valY) = 0.13
accuracy(valX, valY) = 0.149
accuracy(valX, valY) = 0.155
accuracy(valX, valY) = 0.162
accuracy(valX, valY) = 0.183
accuracy(valX, valY) = 0.209
accuracy(valX, valY) = 0.194
accuracy(valX, valY) = 0.191
accuracy(valX, valY) = 0.207
accuracy(valX, valY) = 0.216
accuracy(valX, valY) = 0.206
accuracy(valX, valY) = 0.217
accuracy(valX, valY) = 0.226
accuracy(valX, valY) = 0.248
accuracy(valX, valY) = 0.255
accuracy(valX, valY) = 0.267
accuracy(valX, valY) = 0.282
accuracy(valX, valY) = 0.257
accuracy(valX, valY) = 0.252
accuracy(valX, valY) = 0.268
accuracy(valX, valY) = 0.285
accuracy(valX, valY) = 0.289
accuracy(valX, valY) = 0.32
accuracy(valX, valY) = 0.299
accuracy(valX, valY) = 0.315
accuracy(valX, valY) = 0.309
accuracy(valX, valY) = 0.308
accuracy(valX, valY) = 0.312
accuracy(valX, valY) = 0.314
accuracy(valX, valY) = 0.321
accuracy(valX, valY) = 0.328
accuracy(valX, valY) = 0.328
accuracy(valX, v

accuracy(valX, valY) = 0.503
accuracy(valX, valY) = 0.492
accuracy(valX, valY) = 0.517
accuracy(valX, valY) = 0.521
accuracy(valX, valY) = 0.515
accuracy(valX, valY) = 0.514
accuracy(valX, valY) = 0.513
accuracy(valX, valY) = 0.503
accuracy(valX, valY) = 0.508
accuracy(valX, valY) = 0.503
accuracy(valX, valY) = 0.504
accuracy(valX, valY) = 0.506
accuracy(valX, valY) = 0.501
accuracy(valX, valY) = 0.498
accuracy(valX, valY) = 0.502
accuracy(valX, valY) = 0.51
accuracy(valX, valY) = 0.517
accuracy(valX, valY) = 0.51
accuracy(valX, valY) = 0.513
accuracy(valX, valY) = 0.496
accuracy(valX, valY) = 0.501
accuracy(valX, valY) = 0.499
accuracy(valX, valY) = 0.497
accuracy(valX, valY) = 0.51
accuracy(valX, valY) = 0.493
accuracy(valX, valY) = 0.496
accuracy(valX, valY) = 0.512
accuracy(valX, valY) = 0.502
accuracy(valX, valY) = 0.48
accuracy(valX, valY) = 0.511
accuracy(valX, valY) = 0.489
accuracy(valX, valY) = 0.502
accuracy(valX, valY) = 0.495
accuracy(valX, valY) = 0.496
accuracy(valX, val

Predict the class labels from the neural network's outputs and compare them to the ground truth

In [11]:
# Preprocessing of the validation data set
valset = valimgs(CIFAR10)
valimg = [getarray(valset[i].img) for i in 1:10000]
labels = onehotbatch([valset[i].ground_truth.class for i in 1:10000],1:10)
test = gpu.([(cat(valimg[i]..., dims = 4), labels[:,i]) for i in partition(1:10000, 1000)])

ids = rand(1:10000, 10)
image.(valset[ids])

# Assess the model's performance
rand_test = getarray.(image.(valset[ids]))
rand_test = cat(rand_test..., dims = 4) |> gpu
rand_truth = ground_truth.(valset[ids])
m(rand_test)

# Check the model's performance on test data
accuracy(test[1]...)

0.527

Examine the classifier's performance on each class separately

In [12]:
class_correct = zeros(10)
class_total = zeros(10)
for i in 1:10
    preds = m(test[i][1])
    lab = test[i][2]
    for j = 1:1000
        pred_class = findmax(preds[:, j])[2]
        actual_class = findmax(lab[:, j])[2]
        if pred_class == actual_class
            class_correct[pred_class] += 1
        end
        class_total[actual_class] += 1
    end
end

class_correct ./ class_total

10-element Array{Float64,1}:
 0.523
 0.583
 0.321
 0.32 
 0.298
 0.47 
 0.809
 0.506
 0.697
 0.631