# Intro to Flux.jl

We have learned how machine learning allows us to classify data as apples or bananas with a single neuron. However, some of those details are pretty fiddly! Fortunately, Julia has a powerful package that does much of the heavy lifting for us, called [`Flux.jl`](https://fluxml.github.io/).

*Using `Flux` will make classifying data and images much easier!*

## Using `Flux.jl`

In the next notebook, we are going to see how Flux allows us to redo the calculations from the previous notebook in a simpler way. We can get started with `Flux.jl` via:

#### Helpful built-in functions

When working with `Flux`, we'll make use of built-in functionality that we've had to create for ourselves in previous notebooks.

For example, the sigmoid function, σ, that we have been using already lives within `Flux`:

Importantly, `Flux` allows us to *automatically create neurons* with the **`Dense`** function. For example, in the last notebook, we were looking at a neuron with 2 inputs and 1 output:
 
 <img src="data/single-neuron.png" alt="Drawing" style="width: 500px;"/>
 
We could create a neuron with two inputs and one output via
 
This `model` object comes with places to store weights and biases:

Unlike in previous notebooks, note that `W` is no longer a `Vector` (1D `Array`) and `b` is no longer a number! Both are now stored in so-called `TrackedArray`s and `W` is effectively being treated as a matrix with a single row. We'll see why in the next notebook.

Other helpful built-in functionality includes ways to automatically calculate gradients and also the cost function that we've used in previous notebooks - 

$$L(w, b) = \sum_i \left[y_i - f(x_i, w, b) \right]^2$$

Up to a constant factor of $1/N$ (averaging over the $N$ data points instead of summing), this is the "mean squared error" function, which in `Flux` is named **`Flux.mse`**.
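To make the relationship between the summed cost and its mean concrete, here is a minimal sketch in plain Julia (no `Flux` required). The values in `predictions` and `targets` are made up for illustration:

```julia
# Made-up predictions and true labels, purely for illustration
predictions = [0.9, 0.2, 0.6]
targets     = [1.0, 0.0, 1.0]

# Sum of squared differences, as in the formula for L(w, b)
sum_of_squares = sum((targets .- predictions) .^ 2)

# Dividing by the number of data points gives the mean squared error
mean_squared_error = sum_of_squares / length(targets)

println("sum of squares:     ", sum_of_squares)
println("mean squared error: ", mean_squared_error)
```

Minimizing either quantity leads to the same parameters, since the two differ only by the constant factor `length(targets)`.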

In [None]:
import Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("Flux")
Pkg.add("TextParse")
using CSV
using DataFrames
using Flux
using TextParse

In [None]:
?σ

In [None]:
model = Dense(2, 1, σ)

In [None]:
model.W

In [None]:
model.b

In [None]:
typeof(model.W)

# Learning with a single neuron using Flux.jl

In this notebook, we'll use `Flux` to create a single neuron and teach it to learn, as we did by hand in notebook 10!

### Read in data and process it

Let's start by reading in our data

and processing it to extract information about the red and green coloring in our images:

In [None]:
applecols, applecolnames = TextParse.csvread("data/Apple_Golden_1.dat", '\t')
bananacols, bananacolnames = TextParse.csvread("data/bananas.dat", '\t')

apples = DataFrame(Dict(strip(name)=>col for (name, col) in zip(applecolnames, applecols)))
bananas = DataFrame(Dict(strip(name)=>col for (name, col) in zip(bananacolnames, bananacols)));

In [None]:
col1 = :red
col2 = :green

x_apples  = [ [apples[i, col1], apples[i, col2]] for i in 1:size(apples)[1] ]
x_bananas = [ [bananas[i, col1], bananas[i, col2]] for i in 1:size(bananas)[1] ]

xs = vcat(x_apples, x_bananas)

ys = vcat( zeros(size(x_apples)[1]), ones(size(x_bananas)[1]) );

The input data is in `xs` and the labels (true classifications as bananas or apples) in `ys`.

### Using `Flux.jl`

Now we can load `Flux` to really get going!

We saw in the last notebook that σ is a built-in function in `Flux`.

Another function that is used a lot in neural networks is called `ReLU`; in Julia, the function is called `relu`.

#### Exercise 1

Use the docs to discover what `ReLU` is all about.

`relu.([-3, 3])` returns

A) [-3, 3] <br>
B) [0, 3] <br>
C) [0, 0] <br>
D) [3, 3] <br>

### Making a single neuron in Flux

Let's use `Flux` to build our neuron with 2 inputs and 1 output:

 <img src="data/single-neuron.png" alt="Drawing" style="width: 500px;"/>
 
We previously put the two weights in a vector, $\mathbf{w}$. Flux instead puts weights in a $1 \times 2$ matrix (i.e. a matrix with 1 *row* and 2 *columns*). 

Previously, to compute the dot product of $\mathbf{w}$ and $\mathbf{x}$ we had to use either the `dot` function, or we had to transpose the vector $\mathbf{w}$:

```julia
using LinearAlgebra  # provides `dot`

# transpose w
b = w' * x
# or use dot!
b = dot(w, x)
```
If the weights are instead stored in a $1 \times 2$ matrix, `W`, then we can simply multiply `W` and `x` together instead!
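As a small sketch of the difference, using made-up numbers and plain Julia only, we can compare the two ways of storing the same weights:

```julia
using LinearAlgebra  # provides `dot`

# Made-up weights and input, purely for illustration
w = [0.5, -1.0]        # weights stored as a vector, as in earlier notebooks
x = [2.0, 3.0]

as_number = dot(w, x)  # a single number

W = reshape(w, 1, 2)   # the same weights stored as a 1×2 matrix
as_vector = W * x      # a vector with one element

println(as_number)
println(as_vector)
```

Both computations use the same arithmetic; only the container of the result differs.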

We start off with random values for our parameters and data:

Note that the product of `W` and `x` will now be an array (vector) with a single element, rather than a single number:

This means that our bias `b` is treated as an array when we're using `Flux`:

#### Exercise 2

Write a function `mypredict` that takes a single input, an array `x`, and uses `W`, `b`, and the built-in `σ` to generate an output prediction (stored as an array). This function defines our neural network!

Hint: This function will look very similar to $f_{\mathbf{w},\mathbf{b}}$ from the last notebook, but it must change because the data structures that store our parameters have changed!

#### Exercise 3

Define a loss function called `loss`.

`loss` should take two inputs: a vector storing data, `x`, and a vector storing the correct "labels" for that data. `loss` should return the sum of the squares of differences between the predictions and the correct labels.

## Calculating gradients using Flux: backpropagation

For learning, we know that what we need is a way to calculate derivatives of the `loss` function with respect to the parameters `W` and `b`. So far, we have been doing that using finite differences. 

`Flux.jl` instead implements a numerical method called **backpropagation** that calculates gradients (essentially) exactly, in an automatic way, by indirectly applying the rules of calculus.
To do so, it provides a new type of object called "tracked" arrays. These are arrays that store not only their current value, but also information about gradients, which is used by the backpropagation method.

[If you want to understand the maths behind backpropagation, we recommend e.g. [this lecture](https://www.youtube.com/watch?v=i94OvYb6noo).]
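Backpropagation automates what we would otherwise do by hand with the chain rule. As a rough illustration (this is not how `Flux` is implemented), the sketch below compares a finite-difference estimate with a hand-derived exact derivative for a tiny neuron with one weight. The data point and the helper names `exact_derivative` and `finite_difference` are made up for this example:

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# A single made-up data point and a neuron with one weight (no bias, for brevity)
x, y = 0.5, 1.0
loss(w) = (y - sigmoid(w * x))^2

# Exact derivative, obtained by applying the chain rule by hand
function exact_derivative(w)
    s = sigmoid(w * x)
    return -2 * (y - s) * s * (1 - s) * x
end

# Finite-difference approximation with step size h
finite_difference(w; h = 1e-6) = (loss(w + h) - loss(w)) / h

w0 = 0.3
println("exact:             ", exact_derivative(w0))
println("finite difference: ", finite_difference(w0))
```

The two numbers agree to several digits, but the finite-difference value depends on the choice of `h`, while the chain-rule value does not.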

`Flux` provides a function `param` to define such objects, each of which will contain the information for a *param*eter.

Let's start, as usual, by setting up some random initial values for the parameters:

In [None]:
W = rand(1,2)

In [None]:
x = rand(2)

In [None]:
W_data = rand(1, 2)  
b_data = rand(1)

W_data, b_data

In [None]:
W = Flux.param(W_data)
b = Flux.param(b_data)

Here, `param` is a function that `Flux` provides to create an object that represents a parameter of a machine learning model, i.e. an object which has both a value and derivative information, and such that other objects know how to *keep track* of when it is used in an expression.

#### Exercise 4

What type does `W` have?

A) Array (1D) <br>
B) Array (2D) <br>
C) TrackedArray (1D) <br>
D) TrackedArray (2D) <br>
E) Parameter (1D) <br>
F) Parameter (2D) <br>

#### Exercise 5

`W` stores not only its current value, but also has space to store gradient information. You can access the values and gradient of the weights as follows:

```julia
W.data
W.grad
```

At this point, are the values of the weights or the gradient of the weights larger?

A) the values of the weights <br>
B) the gradient of the weights

#### Exercise 6

For data `x` and `y` where

```julia
x, y = [0.413759, 0.692204], [0.845677]
```
apply the loss function to `x` and `y` to give a new variable `l`. What is the type of `l`? (How many dimensions does it have?)

A) Array (0D) <br>
B) Array (1D) <br>
C) TrackedArray (0D) <br>
D) TrackedArray (1D)<br> 
E) Float64<br>
F) Int64<br>

### Stochastic gradient descent

We can now use these features to reimplement stochastic gradient descent, following the method we used in the previous notebook, but now using backpropagation!

#### Exercise 7

Modify the code from the previous notebook for stochastic gradient descent to use Flux instead.

### Investigating stochastic gradient descent

Let's look at the values stored in `b` before we run stochastic gradient descent:

In [None]:
b

In [None]:
x, y = [0.413759, 0.692204], [0.845677]

In [None]:
# `Descent` is assumed to be the stochastic gradient descent function you wrote in Exercise 7
W_final, b_final = Descent(loss, W, b, xs, ys, 100000)

W_final
b_final

#### Exercise 8

Plot the data and the learned function.
    
#### Exercise 9

Do this plot every so often as the learning process is proceeding in order to have an animation of the process.

### Automation with Flux.jl

We will need to repeat the above process for a lot of different systems.
Fortunately, `Flux.jl` provides us with tools to automate this!

Flux allows us to create a neuron in a simple way:

In [None]:
using Flux
model = Dense(2,1,σ)

In [None]:
typeof(model)

The `2` and `1` refer to the number of inputs and outputs, and the neuron is defined using the $\sigma$ function.

We have made an object of type `Dense`, defined by `Flux`, with the name `model`. This represents a "dense neural network layer" (see later for more on neural network layers).
The parameters that will be modified during the learning process live *inside* the `model` object.

#### Exercise 10

Investigate which variables live inside the `model` object and what type they are. How does that compare to the call to create the `Dense` object that we started with?

### Model object as a function

We can apply the `model` object to data just as if it were a standard function:

#### Exercise 11

Prove to yourself that you understand what is going on when we call `model`. Create two arrays `W` and `b` with the same elements as `model.W` and `model.b`. Use `W` and `b` to generate the same answer that you get when we call `model([.5, .5])`.

### Using Flux

We now need to provide Flux with three pieces of information: 

1. A loss function
2. Some training data
3. An optimization method

### Loss functions

Flux has various loss functions built in, for example the mean-squared error (`mse`) that we have been using:

Another common one is the cross entropy, `Flux.crossentropy`.

### Data

The data can take a couple of different forms. 
One form is a single **iterator**, consisting of pairs $(x, y)$ of data and labels. We can achieve this with `zip`.

#### Exercise 12

Use `zip` to "zip together" `xs` and `ys`. Then use the `collect` function to check what `zip` actually does.

### Optimization routine

Now we need to tell Flux what kind of optimization routine to use. It has several built in; the standard stochastic gradient descent algorithm that we have been using is called `SGD`. We must pass it two things: a list of parameter objects which will be modified by the optimization routine, and a step size:

The gradient calculations and parameter updates will be carried out by this optimizer function; we do not see those details, but if you are curious, you can, of course, look at the `Flux.jl` source code!
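For the curious, the following plain-Julia sketch shows the kind of work such an optimizer does. It is not Flux's implementation: the gradients are derived by hand for a neuron with a single input, and the data and the helper names `one_pass` and `train` are made up for this example.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Made-up one-dimensional data: small inputs are labeled 0, large inputs are labeled 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.0, 1.0]

predict(w, b, x) = sigmoid(w * x + b)
total_loss(w, b) = sum((y - predict(w, b, x))^2 for (x, y) in zip(xs, ys))

# One pass through the data: one gradient-descent update per data point
function one_pass(w, b, η)
    for (x, y) in zip(xs, ys)
        s = predict(w, b, x)
        common = -2 * (y - s) * s * (1 - s)  # chain rule for this point's squared error
        w -= η * common * x                  # step against ∂loss/∂w
        b -= η * common                      # step against ∂loss/∂b
    end
    return w, b
end

# Repeat the pass many times, starting from w = b = 0
function train(passes; η = 0.5)
    w, b = 0.0, 0.0
    for _ in 1:passes
        w, b = one_pass(w, b, η)
    end
    return w, b
end

w_trained, b_trained = train(1000)
println("loss before: ", total_loss(0.0, 0.0))
println("loss after:  ", total_loss(w_trained, b_trained))
```

Each update moves the parameters a small step (of size `η`) in the direction that reduces the cost for the current data point.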

### Training

We now have all the pieces in place to actually **train** our model (a single neuron) on the data. 
"Training" refers to using pre-labeled data to learn the function that relates the input data to the desired output data given by the labels.

`Flux` provides the function `train!`, which performs a single pass through the data and does a single step of optimization using the partial cost function for each data point:

We can then just repeat this several times to train the network more and coax it towards the minimum of the cost function:

Now let's look at the parameters after training:

Instead of writing out a list of parameters to modify, `Flux` provides the function `params`, which extracts all available parameters from a model:

## Adding more features

#### Exercise 13

So far we have just used two features, red and green. 

(i) Add a third feature, blue. Plot the new data.

(ii) Train a neuron with 3 inputs and 1 output on the data.

(iii) Can you find a good way to visualize the result?

In [None]:
model(rand(2))

In [None]:
loss(x,y) = Flux.mse(model(x),y)

In [None]:
opt = SGD([model.W, model.b], 0.01)
# give a list of the parameters that will be modified, followed by the step size

In [None]:
Flux.train!(loss,data,opt)

In [None]:
for i in 1:100
    Flux.train!(loss, data, opt)
end

In [None]:
model.W

In [None]:
model.b

In [None]:
opt = SGD(params(model), 0.01)

In [None]:
params(model)

## Neural networks

Now that we know what neurons are, we are ready for the final step: the neural network! A neural network is literally made out of a network of neurons that are connected together. 

So far, we have just looked at single neurons, which have only a single output.
What if we want multiple outputs?


### Multiple output models

What if we wanted to distinguish between apples, bananas, *and* grapes? We could use *vectors* of `0` or `1` values to symbolize each output.

<img src="data/fruit-salad.png" alt="Drawing" style="width: 300px;"/>

The idea of using vectors is that different directions in the space of outputs encode information about different types of inputs.

Now we extend our previous model to give multiple outputs by repeating it with different weights. For the first element of the array we'd use:

$$\sigma(x;w^{(1)},b^{(1)}) := \frac{1}{1 + \exp\left(-\left(w^{(1)} \cdot x + b^{(1)}\right)\right)};$$

then for the second we'd use

$$\sigma(x;w^{(2)},b^{(2)}) := \frac{1}{1 + \exp\left(-\left(w^{(2)} \cdot x + b^{(2)}\right)\right)};$$

and if you wanted $n$ outputs, you'd have for each one

$$\sigma(x;w^{(i)},b^{(i)}) := \frac{1}{1 + \exp\left(-\left(w^{(i)} \cdot x + b^{(i)}\right)\right)}.$$

Notice that these equations are all the same, except for the parameters, so we can write this model more succinctly, as follows. Let's write $b$ in an array:

$$b=\left[\begin{array}{c}
b^{(1)}\\
b^{(2)}\\
\vdots\\
b^{(n)}
\end{array}\right]$$

and put our array of weights as a matrix, with one row per output and one column per input (here $m$ is the number of inputs, i.e. the length of $x$):

$$ \mathsf{W}=\left[\begin{array}{cccc}
w_{1}^{(1)} & w_{2}^{(1)} & \ldots & w_{m}^{(1)}\\
w_{1}^{(2)} & w_{2}^{(2)} & \ldots & w_{m}^{(2)}\\
\vdots & \vdots &  & \vdots\\
w_{1}^{(n)} & w_{2}^{(n)} & \ldots & w_{m}^{(n)}
\end{array}\right]
$$

We can write this all in one line, with the exponential and the division applied to each element separately, as:

$$\sigma(x;\mathsf{W},b)= \left[\begin{array}{c}
\sigma^{(1)}\\
\sigma^{(2)}\\
\vdots\\
\sigma^{(n)}
\end{array}\right] = \frac{1}{1 + \exp\left(-\left(\mathsf{W} x + b\right)\right)}$$

$\mathsf{W} x$ is the operation called "matrix multiplication".

[Show small matrix multiplication]

It takes each row of weights and computes its dot product with $x$ (remember, that's how $\sigma^{(i)}$ was defined), collecting the results from all the rows into a vector. Because the result is a vector, this version of the function gives a vector of outputs, which we can use to encode a larger set of choices. 
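Here is a small matrix multiplication worked through in plain Julia. The numbers in `W` and `x` are made up; the point is only that the matrix product and the row-by-row dot products give the same vector:

```julia
using LinearAlgebra  # provides `dot`

# Made-up weights for 3 neurons (rows) with 2 inputs each (columns)
W = [1.0  2.0;
     0.0 -1.0;
     3.0  0.5]
x = [2.0, 4.0]

# Matrix multiplication in one go
Wx = W * x

# The same result, one row (i.e. one neuron) at a time
row_by_row = [dot(W[i, :], x) for i in 1:size(W, 1)]

println(Wx)
println(row_by_row)
```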

Matrix multiplication is also interesting since **GPUs (Graphics Processing Units, i.e. graphics cards) are basically just matrix multiplication machines**, which means that by writing the equation this way, the result can be calculated really fast.

This "multiple input and multiple output" version of the sigmoid function is known as a *layer of neurons*.

Previously we worked with a single neuron, which we visualized as

<img src="data/single-neuron.png" alt="Drawing" style="width: 300px;"/>

where we have two pieces of data (green) coming into a single neuron (pink) that returned a single output. We could use this single output to do binary classification - to identify an image of a fruit as `1`, meaning banana or as `0`, meaning not a banana (or an apple).

To do non-binary classification, we can use a layer of neurons, which we can visualize as

<img src="data/single-layer.png" alt="Drawing" style="width: 300px;"/>

We have now stacked several neurons on top of each other, in the hope that they will work together and can be trained to produce more complicated outputs. 

We still have two input pieces of data, but now have several neurons, each of which produces an output for a given binary classification: 
* neuron 1: "is it an apple?"
* neuron 2: "is it a banana?"
* neuron 3: "is it a grape?"
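A layer like this can be sketched in a few lines of plain Julia. The weights, biases and input below are made up, and choosing the fruit whose neuron gives the largest output is one simple way to combine the three answers (an assumption for this sketch, not the only option):

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Made-up parameters for a layer of 3 neurons with 2 inputs each
W = [ 2.0 -1.0;
     -1.0  2.0;
      0.5  0.5]
b = [0.0, 0.0, -1.0]

# The whole layer at once: sigmoid applied elementwise to W * x + b
layer(x) = sigmoid.(W * x .+ b)

fruits = ["apple", "banana", "grape"]
x = [0.9, 0.1]                   # made-up input features
outputs = layer(x)               # one value between 0 and 1 per neuron
guess = fruits[argmax(outputs)]  # the neuron that is most confident

println(outputs)
println(guess)
```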

# Building a single neural network layer using `Flux.jl`

In this notebook, we'll move beyond binary classification. We'll try to distinguish between three fruits now, instead of two. We'll do this using **multiple** neurons arranged in a **single layer**.

## Read in and process data

We can start by loading the necessary packages and getting our data into working order with similar code we used at the beginning of the previous notebooks, except that now we will combine three different apple data sets, and will add in some grapes to the fruit salad!

In [None]:
using DelimitedFiles  # provides `readdlm`

# Load apple data in with `readdlm` for each file
apples1, applecolnames1 = readdlm("data/Apple_Golden_1.dat", '\t', header = true)
apples2, applecolnames2 = readdlm("data/Apple_Golden_2.dat", '\t', header = true)
apples3, applecolnames3 = readdlm("data/Apple_Golden_3.dat", '\t', header = true)

# Check that the column names are the same for each apple file
println( applecolnames1 == applecolnames2 == applecolnames3)

Since each apple file has columns with the same headers, we know we can concatenate these columns from the different files together:

In [None]:
apples = vcat(apples1, apples2, apples3)

And now let's build an array called `x_apples` that stores data from the `red` and `blue` columns of `apples`. From `applecolnames1`, we can see that these are the 3rd and 5th columns of `apples`:

In [None]:
applecolnames1[3], applecolnames1[5]

In [None]:
length(apples[:,1])

In [None]:
x_apples  = [ [apples[i, 3], apples[i, 5]] for i in 1:length(apples[:, 3]) ]

In [None]:
# similarly, let's create arrays called x_bananas and x_grapes
# Load data from *.dat files
bananas, bananacolnames = readdlm("data/Banana.dat", '\t', header = true)
grapes1, grapecolnames1 = readdlm("data/Grape_White.dat", '\t', header = true)
grapes2, grapecolnames2 = readdlm("data/Grape_White_2.dat", '\t', header = true)

# Concatenate data from two grape files together
grapes = vcat(grapes1, grapes2)

# Check that column 3 and column 5 refer to the "red" and "blue" columns from each file
println("All column headers are the same: ", bananacolnames == grapecolnames1 == grapecolnames2 == applecolnames1)

# Build x_bananas and x_grapes from bananas and grapes
x_bananas  = [ [bananas[i, 3], bananas[i, 5]] for i in 1:length(bananas[:, 3]) ]
x_grapes = [ [grapes[i, 3], grapes[i, 5]] for i in 1:length(grapes[:, 3]) ]

## One-hot vectors

Now we wish to classify *three* different types of fruit. It is not clear how to encode these three types using a single output variable; indeed, in general this is not possible.

Instead, we have the idea of encoding $n$ output types from the classification into *vectors of length $n$*, called "one-hot vectors":

$$
\textrm{apple} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix};
\quad
\textrm{banana} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix};
\quad
\textrm{grape} = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.
$$

The term "one-hot" refers to the fact that each vector has a single $1$, and is $0$ otherwise.

Effectively, the first neuron will learn whether or not (1 or 0) the data corresponds to an apple, the second whether or not (1 or 0) it corresponds to a banana, etc.
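To see the encoding in action without any packages, here is a plain-Julia sketch. The helper `simple_onehot` is hypothetical and written only for this example; it is not Flux's `onehot`:

```julia
# A plain-Julia stand-in for a one-hot encoder (hypothetical helper):
# returns a vector with `true` where `label` sits in `classes`, and `false` elsewhere
simple_onehot(label, classes) = [class == label for class in classes]

fruits = ["apple", "banana", "grape"]

apple  = simple_onehot("apple", fruits)
banana = simple_onehot("banana", fruits)
grape  = simple_onehot("grape", fruits)

println(apple)
println(banana)
println(grape)
```

Each result has exactly one `true` entry (which Julia treats as `1` in arithmetic), in the position of the corresponding fruit.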

`Flux.jl` provides an efficient representation for one-hot vectors, using advanced features of Julia so that it does not actually store these vectors, which would be a waste of memory; instead `Flux` just records in which position the non-zero element is. To us, however, it looks like all the information is being stored:

In [None]:
using Flux: onehot

onehot(1, 1:3)

#### Exercise 1

Make an array `labels` that gives the labels (1, 2 or 3) of each data point. Then use `onehot` to encode the information about the labels as a vector of `OneHotVector`s.

## Single layer in Flux

Let's suppose that there are two pieces of input data, as in the previous single neuron notebook. Then the network has 2 inputs and 3 outputs:

In [None]:
include("draw_neural_net.jl")
draw_network([2, 3])

In [None]:
using Flux

In [None]:
model = Dense(2, 3, σ)

#### Exercise 2

Now what do the weights inside `model` look like? How does this compare to the diagram of the network layer above?

## Training the model

Despite the fact that the model is now more complicated than the single neuron from the previous notebook, the beauty of `Flux.jl` is that the rest of the training process **looks exactly the same**!

#### Exercise 3

Implement training for this model.

#### Exercise 4

Visualize the result of the learning for each neuron. Since each neuron is sigmoidal, we can get a good idea of the function by just plotting a single contour level where the function takes the value 0.5, using the `contour` function with keyword argument `levels=[0.5, 0.501]`.

#### Exercise 5

Interpret the results by checking which fruit each neuron was supposed to learn and what it managed to achieve.

In [None]:
using Plots

plot()

contour!(0:0.01:1, 0:0.01:1, (x,y)->model([x,y]).data[1], levels=[0.5, 0.501], color = cgrad([:blue, :blue]))
contour!(0:0.01:1, 0:0.01:1, (x,y)->model([x,y]).data[2], levels=[0.5,0.501], color = cgrad([:green, :green]))
contour!(0:0.01:1, 0:0.01:1, (x,y)->model([x,y]).data[3], levels=[0.5,0.501], color = cgrad([:red, :red]))

scatter!(first.(x_apples), last.(x_apples), m=:cross, label="apples")
scatter!(first.(x_bananas), last.(x_bananas), m=:circle, label="bananas")
scatter!(first.(x_grapes), last.(x_grapes), m=:square, label="grapes")

## Going deep: Deep neural networks

So far, we've learned that if we want to classify more than two fruits, we'll need to go beyond using a single neuron and use *multiple* neurons to get multiple outputs. We can think of stacking these multiple neurons together in a single neural layer.

Even so, we found that using a single neural layer was not enough to fully distinguish between bananas, grapes, **and** apples. To do this properly, we'll need to add more complexity to our model. We need not just a neural network, but a *deep neural network*. 

There is one step remaining to build a deep neural network. We have been saying that a neural network takes in data and then spits out `0` or `1` predictions that together declare what kind of fruit the picture is. However, what if we instead put the output of one neural network layer into another neural network layer?

This is pictured below:

<img src="data/deep-neural-net.png" alt="Drawing" style="width: 500px;"/>

On the left we have 3 data points in blue. Those 3 data points each get fed into 4 neurons in purple. Each of those 4 neurons produces a single output, but those outputs are each fed into three neurons (the second layer of purple). Each of those 3 neurons spits out a single value, and those values are fed as inputs into the last layer of 6 neurons. The 6 values that those final neurons produce are the output of the neural network. This is a deep neural network.
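The structure just described (3 inputs, then layers of 4, 3 and 6 neurons) can be sketched in plain Julia by feeding each layer's output into the next. The helper `random_layer` is hypothetical, and the weights are random rather than trained, so only the shapes are meaningful here:

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# A hypothetical helper that builds one layer with random weights and biases
function random_layer(n_inputs, n_outputs)
    W = randn(n_outputs, n_inputs)
    b = randn(n_outputs)
    return x -> sigmoid.(W * x .+ b)
end

layer1 = random_layer(3, 4)  # 3 inputs -> 4 neurons
layer2 = random_layer(4, 3)  # 4 values -> 3 neurons
layer3 = random_layer(3, 6)  # 3 values -> 6 neurons

# The deep network feeds the output of each layer into the next
network(x) = layer3(layer2(layer1(x)))

x = rand(3)          # 3 made-up input values
output = network(x)  # 6 output values
println(length(output))
```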

### Why would a deep neural network be better?

This is a little perplexing when you first see it. We used neurons to train the model before: why would sticking the output from neurons into other neurons help us fit the data better? The answer can be understood by drawing pictures. Geometrically, the matrix multiplication inside of a layer of neurons is stretching and rotating the axes that we can vary:

[Show linear transformation of axis, with data]

A nonlinear transformation, such as the sigmoid function, then adds a bump to the line:

[Show the linear transformed axis with data, and then a bumped version that fits the data better]

Now let's repeat this process. When we send the data through another layer of neurons, we get another rotation and another bump:

[Show another rotation, then another bump]

Visually, we see that if we keep doing this process we can make the axis line up with any data. What this means is that **if we have enough layers, then our neural network can approximate any model**. 

The trade-off is that with more layers we have more parameters, so it may be harder (i.e. computationally intensive) to train the neural network. But we have the guarantee that the model has enough freedom such that there are parameters that will give the correct output. 

Because this model is so flexible, the problem is reduced to that of learning: do the same gradient descent method on this much larger model (but more efficiently!) and we can make it classify our data correctly. This is the power of deep learning.
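One way to check why the nonlinear "bump" between layers matters: without it, two layers collapse into a single linear layer, so stacking would gain nothing. The sketch below uses made-up weights and plain Julia to show this collapse:

```julia
# Made-up weights and biases for two layers *without* any nonlinear function
W1 = [1.0 2.0; 0.0 1.0; -1.0 1.0]   # 2 inputs -> 3 outputs
b1 = [0.5, -0.5, 0.0]
W2 = [1.0 0.0 2.0; 0.0 1.0 -1.0]    # 3 inputs -> 2 outputs
b2 = [1.0, 2.0]

two_linear_layers(x) = W2 * (W1 * x .+ b1) .+ b2

# The same function written as a single linear layer
W = W2 * W1
b = W2 * b1 .+ b2
one_linear_layer(x) = W * x .+ b

x = [3.0, -2.0]
println(two_linear_layers(x))
println(one_linear_layer(x))
```

Both lines print the same vector. Inserting a nonlinear function such as the sigmoid between the layers is what prevents this collapse.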

# Multiple neural network layers with `Flux.jl`

In a previous notebook, we saw that one layer of neurons wasn't enough to distinguish between three types of fruit (apples, bananas *and* grapes), since the data is quite complex. To solve this problem, we need to use more layers, which takes us into the territory of **deep learning**!

By adding another layer between the inputs and the output neurons, a so-called "hidden layer", we will get our first serious **neural network**, looking something like this:

We will continue to use two inputs and try to classify into three types, so we will have three output neurons. We have chosen to add a single "hidden layer" in between, and have arbitrarily chosen to put 4 neurons there.

Much of the *art* of deep learning is choosing a suitable structure for the neural network that will allow the model to be sufficiently complex to model the data well, but sufficiently simple to allow the parameters to be learned in a reasonable length of time.
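To get a feel for how the choice of structure affects the number of parameters to be learned, here is a small plain-Julia sketch. It assumes fully connected layers with one bias per neuron; `count_parameters` is a hypothetical helper written for this example:

```julia
# Number of neurons in each layer: 2 inputs, 4 hidden neurons, 3 outputs
layer_sizes = [2, 4, 3]

# Each layer has one weight per (input, output) pair, plus one bias per output
function count_parameters(sizes)
    total = 0
    for i in 1:length(sizes) - 1
        n_in, n_out = sizes[i], sizes[i + 1]
        total += n_in * n_out + n_out
    end
    return total
end

println(count_parameters(layer_sizes))  # 2*4 + 4 + 4*3 + 3 = 27
println(count_parameters([2, 3]))       # a single layer: 2*3 + 3 = 9
```

Adding the hidden layer of 4 neurons triples the number of parameters compared with the single layer used before.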

## Read in and process data

As before, let's load some pre-processed data using code we've seen in the previous notebook.

In [None]:
include("draw_neural_net.jl")
draw_network([2, 4, 3])

In [None]:
using CSV
using DataFrames
using Flux
using Flux: onehot

apples_1 = CSV.read("data/Apple_Golden_1.dat",DataFrame, delim='\t')
apples_2 = CSV.read("data/Apple_Golden_2.dat",DataFrame, delim='\t')
apples_3 = CSV.read("data/Apple_Golden_3.dat",DataFrame, delim='\t')
bananas = CSV.read("data/Banana.dat", DataFrame, delim='\t')
grapes_1 = CSV.read("data/Grape_White.dat", DataFrame, delim='\t')
grapes_2 = CSV.read("data/Grape_White_2.dat", DataFrame, delim='\t');

apples = vcat(apples_1, apples_2, apples_3)
grapes = vcat(grapes_1, grapes_2);

In [None]:
col1 = :red
col2 = :blue

x_apples  = [ [apples_1[i, col1], apples_1[i, col2]] for i in 1:size(apples_1)[1] ]
append!(x_apples, [ [apples_2[i, col1], apples_2[i, col2]] for i in 1:size(apples_2)[1] ])
append!(x_apples, [ [apples_3[i, col1], apples_3[i, col2]] for i in 1:size(apples_3)[1] ])

x_bananas = [ [bananas[i, col1], bananas[i, col2]] for i in 1:size(bananas)[1] ]

x_grapes = [ [grapes_1[i, col1], grapes_1[i, col2]] for i in 1:size(grapes_1)[1] ]
append!(x_grapes, [ [grapes_2[i, col1], grapes_2[i, col2]] for i in 1:size(grapes_2)[1] ])

xs = vcat(x_apples, x_bananas, x_grapes);

In [None]:
labels = [ones(length(x_apples)); 2*ones(length(x_bananas)); 3*ones(length(x_grapes))];

ys = [onehot(label, 1:3) for label in labels];  # onehotbatch(labels, 1:3)

In [None]:
inputs = 2
hidden = 4
outputs = 3

layer1 = Dense(inputs, hidden, σ)
layer2 = Dense(hidden, outputs, σ)

In [None]:
model = Chain(layer1, layer2)

We now wish to classify the three types of fruit, so we again use one-hot vectors to represent the desired outputs $y^{(i)}$:

The input data is in `xs` and the one-hot vectors are in `ys`.

## Multiple layers in Flux

Let's tell Flux what structure we want the network to have. We first specify the number of neurons in each layer, and then construct each layer as a `Dense` layer:

To stitch together multiple layers to make a multi-layer network, we use Flux's `Chain` function:

#### Exercise 1

What is the internal structure and sub-structure of this `model` object?

## Training the model

We have now set up a model and we have some training data.
How do we train the model on the data?
    
The amazing thing is that the rest of the code in `Flux` is **exactly the same as before**. This is possible thanks to the design of Julia itself, and of the `Flux` package.

#### Exercise 2

Train the model as before, now using the popular `ADAM` optimizer. You may need to train the network for longer than before, since we have many more parameters.

## Visualizing the results

What does this neural network represent? It is simply a more complicated function with two inputs and three outputs, i.e. a function $f: \mathbb{R}^2 \to \mathbb{R}^3$. 
Before, with a single layer, each component of the function $f$ basically corresponded to a hyperplane; now it will instead be a **more complicated nonlinear function** of the input data!

#### Exercise 3

Visualize each component of the output separately as a heatmap and/or contours superimposed on the data. Interpret the results.

## What we have learned

Adding an intermediate layer allows the network to start to deform the separating surfaces that it is learning into more complicated, nonlinear (curved) shapes. This allows it to separate data that could not be separated before!

However, using only two features means that data from different classes overlap. To distinguish them we would need to use more features.

#### Exercise 4

Use three features (red, green and blue) and build a network with one hidden layer. Does this help to distinguish the data better?

# Learning to recognize handwritten digits using a neural network

We have now reached the point where we can tackle a very interesting task: applying the knowledge we have gained with machine learning in general, and `Flux.jl` in particular, to create a neural network that can recognize handwritten digits! The data are from a data set called MNIST, which has become a classic in the machine learning world.

[We could also try to apply the techniques to the original images of fruit instead. However, the fruit images are much larger than the MNIST images, which makes training a suitable neural network too slow.]

## Data munging

As we know, the first difficulty with any new data set is locating it, understanding what format it is stored in, reading it in and decoding it into a useful data structure in Julia.

The original MNIST data is available [here](http://yann.lecun.com/exdb/mnist); see also the [Wikipedia page](https://en.wikipedia.org/wiki/MNIST_database). However, the format that the data is stored in is rather obscure.

Fortunately, various packages in Julia provide nicer interfaces to access it. We will use the one provided by `Flux.jl`.

The data are images of handwritten digits, and the corresponding labels that were determined by hand (i.e. by humans). Our job will be to get the computer to **learn** to recognize digits by learning, as usual, the function that relates the input and output data.

### Loading and examining the data

First we load the required packages:

In [None]:
using Flux, Flux.Data.MNIST

labels = MNIST.labels();
images = MNIST.images();  
# the semi-colon (`;`) here is important: 
# it prevents Julia from displaying the object

#### Exercise 1

Examine the `labels` data. Then examine the first few images. *Do not try to view the whole of the `images` object!* Try to drill down to discover how the data is laid out.

#### Exercise 2

Convert the first image to a matrix of `Float64`.

### Munging the data

In the previous notebooks, we arranged the input data for Flux as a `Vector` of `Vector`s.
Now we will use an alternative arrangement, as a matrix, since that allows `Flux` to use matrix operations, which are more efficient.

The column $i$ of the matrix is a vector consisting of the $i$th data point $\mathbf{x}^{(i)}$.  Similarly, the desired outputs are given as a matrix, with the $i$th column being the desired output $\mathbf{y}^{(i)}$.
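The following plain-Julia sketch, with made-up data and weights, illustrates this arrangement and why it is convenient: a single matrix product processes every data point at once.

```julia
# Three made-up data points, each with two features
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Stack the data points side by side: column i of X is the i-th data point
X = reduce(hcat, xs)

# Made-up weights for a layer with 2 inputs and 3 outputs
W = [1.0  0.0;
     0.0  1.0;
     1.0 -1.0]

# One matrix product handles every data point at once ...
all_at_once = W * X

# ... and gives the same columns as multiplying one data point at a time
one_by_one = [W * x for x in xs]

println(size(X))
println(all_at_once)
```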

#### Exercise 3

An image is a matrix of colours, but now we need a vector instead. To do so, we just arrange all of the elements of the matrix in a certain way into a single list; fortunately, Julia already provides the function `vec` to do so!

1. Which order does `vec` use? [This reflects the underlying way in which the matrix is stored in memory.]

2. How can you convert an image into a `Vector` of `Float64`?

3. Define a variable $n$ that is the length of these vectors.

#### Exercise 4
Make a function `rewrite` that accepts a range and converts that range of images to floating-point vectors and stacks them horizontally using `hcat` and the "splat" operator `...`. 

We also want a matrix of one-hot vectors. `Flux` provides a function `onehotbatch` to do this (you will need to import it). It works like `onehot`, but takes in a vector of labels and outputs a matrix `Y`.

Return the pair `(X, Y)`.
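
To make the shape of `Y` concrete, here is a plain-Julia sketch of the kind of matrix a one-hot batch encoding produces. It does not call `Flux`; the names `sample_labels`, `classes` and `Y_sketch` are hypothetical and exist only for this illustration.

```julia
# Hypothetical labels for four images.
sample_labels = [5, 0, 4, 9]
classes = 0:9

# Row k corresponds to classes[k]; column j corresponds to the j-th label.
Y_sketch = [label == c for c in classes, label in sample_labels]

size(Y_sketch)             # (10, 4): 10 classes, 4 labels
findall(Y_sketch[:, 1])    # [6]: the single `true` in column 1 sits in the row for digit 5
```

Each column contains exactly one `true`; note that the row index is the digit plus one, because the classes start at 0.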

## Setting up the neural network

Now we must set up a neural network. Since the data is complicated, we may expect to need several layers.
But we can start with a single layer.

- The network will take as inputs the vectors $\mathbf{x}^{(i)}$, so the input layer has $n$ nodes. 

- The output will be a one-hot vector encoding the desired digit, from 0 to 9. There are 10 possible categories, so we need an output layer of size 10. 

It is then our task as neural network designers to insert layers between these input and output layers, whose weights will be tuned during the learning process. *This is an art, not a science*! But major advances have come from finding a good structure for the network.

### Softmax

We will make a network with a single layer; let's choose each neuron in the layer to use the `relu` activation function. 
The output of `relu` can be arbitrarily large, but in the end we will wish to compare the network's output with one-hot vectors, i.e. values between $0$ and $1$.

In order to make this work, we will thus use an extra function at the end that takes in a vector of arbitrary real numbers and maps it ("squashes it down") to a vector of numbers between $0$ and $1$.

The most used function with this property is $\mathrm{softmax}$. Firstly we take the exponential of each input variable to make them positive. Then we divide by the sum to make sure they lie between $0$ and $1$.

$$\mathrm{softmax}(\mathbf{x})_i := \frac{\exp (x_i)}{\sum_j \exp(x_j)}$$

Note that here we have written the result for the $i$th component of the function $\mathbf{R}^n \to \mathbf{R}^n$. Note also that the function returns a vector of numbers that are positive, and whose components sum to $1$. Thus, in fact, they can be thought of as probabilities.

In the neural network context, using a `softmax` after the final layer thus allows us to interpret the outputs as probabilities, in our case the probability that the network assigns that a given image represents each possible output value ($0$-$9$)!
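
The formula above can be sketched in a few lines of plain Julia. `Flux` exports its own `softmax`; the function `softmax_sketch` below is a hypothetical stand-in written only to show the arithmetic. Subtracting the maximum before exponentiating is an extra step, assumed here for numerical safety, that leaves the result unchanged.

```julia
# A plain-Julia sketch of softmax.
# Subtracting maximum(x) does not change the result but avoids overflow in exp.
function softmax_sketch(x::AbstractVector{<:Real})
    e = exp.(x .- maximum(x))
    return e ./ sum(e)
end

p = softmax_sketch([2.0, 1.0, 0.1])
sum(p)       # ≈ 1
argmax(p)    # 1: the largest input receives the largest probability
```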

#### Exercise 5

Make a neural network with one single layer, using the function $\sigma$, and a final `softmax`.

## Training

As we know, **training** consists of iteratively adjusting the model's parameters to decrease the `loss` function. Which parameters need to be adjusted? All of them!

Since the `loss` function contains a call to the `model` function, calling `back!` on the result of the loss function updates the information about the gradient of the loss function with respect to *every node in the network!*:

This is what is going on inside the `train!` function. 
In fact, `train!(loss, data, opt)` iterates over each object in `data` and runs this function.
For this reason, `data` must consist of an iterable object that returns pairs `(X, Y)` at each step.
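
The following is a minimal sketch of that idea, not `Flux`'s actual implementation: a toy model with a single weight, a gradient worked out by hand, and a loop that takes one step per `(x, y)` pair. The names `toy_train!`, `toy_loss`, `toy_grad`, `w` and the learning rate `η` are all assumptions made for the illustration.

```julia
# Toy model: a single weight w, prediction w*x, squared-error loss.
w = [0.0]                                           # stored in an array so it can be mutated
toy_loss(x, y) = sum((w[1] .* x .- y).^2)
toy_grad(x, y) = sum(2 .* (w[1] .* x .- y) .* x)    # d(loss)/dw, derived by hand

function toy_train!(data; η = 0.01, cb = () -> nothing)
    for (x, y) in data            # one gradient step per (x, y) pair
        w[1] -= η * toy_grad(x, y)
        cb()                      # optional callback after each step
    end
end

x = [1.0, 2.0, 3.0]; y = 2 .* x   # data generated with w = 2
toy_train!(Iterators.repeated((x, y), 100))
w[1]                              # close to 2
toy_loss(x, y)                    # close to 0
```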

Alternatively, we can make one call to the `train!` function iterate over several copies of `data`, using `repeated`. This is an **iterator**; it does not copy the data 100 times, which would be very wasteful; it just gives an object that repeatedly loops over the same data:
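
A quick check of this behaviour, using small made-up matrices (`X_demo` and `Y_demo` are hypothetical):

```julia
X_demo = rand(3, 5)     # hypothetical small input matrix
Y_demo = rand(2, 5)     # hypothetical small output matrix

reps = Iterators.repeated((X_demo, Y_demo), 100)

length(reps)                             # 100
all(pair -> pair[1] === X_demo, reps)    # true: every item refers to the same array, not a copy
```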

#### Exercise 6

Train the model on a subset of $N$ images with $N = 5000$.

This is (approximately) equivalent to just doing a `for` loop to run the previous `train!` command 100 times.

### Using callbacks

The `train!` function can take an optional keyword argument, `cb` (short for "*c*all*b*ack"). A callback function is a function that you provide as an argument to a function `f`, which "calls back" your function every so often.

This makes it possible to supply a function that is called at each step, or every so often, during the training process.
A common use case is to provide a visual trace of the training process by printing out the current value of the `loss` function:

However, it is expensive to calculate the complete `loss` function and it is not necessary to output it every step. So `Flux` also provides a function `throttle`, that provides a mechanism to call a given function at most once every certain number of seconds:
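
To show the idea behind throttling, here is a minimal sketch. `simple_throttle` is a hypothetical helper written for this illustration; it is not `Flux`'s own code and may differ from `Flux.throttle` in its details.

```julia
# Wrap f so that it runs at most once every `timeout` seconds.
function simple_throttle(f, timeout)
    last_call = -Inf                  # time of the most recent call that got through
    return function (args...)
        t = time()
        if t - last_call >= timeout
            last_call = t
            f(args...)
        end
        return nothing
    end
end

counter = Ref(0)
tick = simple_throttle(() -> counter[] += 1, 10)   # at most once every 10 seconds
for _ in 1:1000
    tick()
end
counter[]   # 1: only the first of the 1000 rapid calls got through
```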

## Testing phase

We now have trained a model, i.e. we have found the parameters `W` and `b` for the network layer(s). In order to **test** if the learning procedure was really successful, we check how well the resulting trained network performs when we test it with images that the network has not yet seen! 

Often, a dataset is split up into "training data" and "test (or validation) data" for this purpose, and indeed the MNIST data set has a separate pool of test data. We can instead use the images that we have not included in our reduced training process.

#### Exercise 8

Use the `argmax` function (called `indmax` in Julia versions before 0.7) to write a function `prediction` that reports which digit `model` predicts, as the index with the maximum weight.
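
As a hint, here is how the index of the largest entry relates to a class label on a made-up output vector. In Julia 1.0 and later that index is returned by `argmax`. The values in `scores` and `classes` are hypothetical.

```julia
scores = [0.05, 0.02, 0.80, 0.13]   # hypothetical model output for 4 classes
classes = 0:3                        # class labels, in the same order as the entries

i = argmax(scores)   # 3: the position of the largest entry
classes[i]           # 2: the label stored at that position
```

Because the labels start at 0, the position and the label differ by one.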

#### Exercise 9

Count the number of correct predictions over the whole data set, and hence the percentage of images that are correctly predicted. [This percentage is what is used to compare different machine learning techniques.]

## Improving the prediction

So far we have used a single layer. In order to improve the prediction, we probably need to use more layers.

#### Exercise 10

Introduce an intermediate, hidden layer. Does it give a better prediction?

In [None]:
l = loss(X, Y)

Flux.Tracker.back!(l)

In [None]:
data = ((X, Y), )  # one-element tuple

In [None]:
dataset = Base.Iterators.repeated((X, Y), 100)

In [None]:
N = 5_000
X, Y = rewrite(1:N)

In [None]:
loss(X, Y)

In [None]:
@time Flux.train!(loss, data, opt)

In [None]:
@time Flux.train!(loss, dataset, opt)

In [None]:
callback() = @show(loss(X, Y))

Flux.train!(loss, data, opt; cb = callback)

In [None]:
Flux.train!(loss, dataset, opt; cb = callback)

In [None]:
Flux.train!(loss, dataset, opt; cb = Flux.throttle(callback, 1))

In [None]:
for i in 1:100
    Flux.train!(loss, dataset, opt; cb = Flux.throttle(callback, 1))
end

In [None]:
X_test, Y_test = rewrite(N+1:N+100)

In [None]:
loss(X_test, Y_test)

In [None]:
display(images[N+1])
labels[N+1]

In [None]:
[model(X_test[:,1]) Y_test[:,1]]

In [None]:
loss(X_test[:,1], Y_test[:,1])

In [None]:
loss(X_test, Y_test)

# Autodiff: <br> Calculus from another angle
(and the special role played by Julia's multiple dispatch and compiler technology)


At the heart of modern machine learning, so popular in 2018, is an optimization
problem.  Optimization means gradients, so suddenly differentiation, especially automatic differentiation, is exciting.


The first time one hears about automatic differentiation, it is easy to imagine what it is. Surely it is straightforward symbolic differentiation applied to code. One imagines automatically doing what is learned in a calculus class.
  <img src="http://www2.bc.cc.ca.us/resperic/math6a/lectures/ch5/1/IntegralTable.gif" width="190">
...and anyway, if it is not that, then it must be finite differences, like one learns in a numerical computing class.
  
<img src="http://image.mathcaptain.com/cms/images/122/Diff%202.png" width="150">
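
For comparison with what follows, this is what the finite-difference approach looks like in code. The helper `central_diff` and the step size `h` are choices made for this sketch.

```julia
# Central finite difference: (f(x+h) - f(x-h)) / 2h.
central_diff(f, x; h = 1e-6) = (f(x + h) - f(x - h)) / (2h)

approx = central_diff(sqrt, 2.0)
exact  = 0.5 / sqrt(2.0)
abs(approx - exact)   # small; how small depends on the choice of h
```

The result is only an approximation, and its accuracy has to be traded off against rounding error through `h`. Autodiff, as shown below, is neither this nor symbolic differentiation.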

## Babylonian sqrt

We start with a simple example, the computation of sqrt(x), where the way autodiff works comes as both a mathematical surprise and a computing wonder.  The example is the Babylonian algorithm, known to mankind for millennia, to compute sqrt(x):


 > Repeat $ t \leftarrow  (t+x/t) / 2 $ until $t$ converges to $\sqrt{x}$.
 
 Each iteration has one add and two divides. For illustration purposes, 10 iterations suffice.
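
 As a side note that the rest of the notebook does not depend on, one way to see why the iteration converges to $\sqrt{x}$ is that it coincides with Newton's method applied to $f(t) = t^2 - x$:

$$t_{\text{new}} = t - \frac{f(t)}{f'(t)} = t - \frac{t^2 - x}{2t} = \frac{1}{2}\left(t + \frac{x}{t}\right)$$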

In [None]:
function Babylonian(x; N = 10) 
    t = (1+x)/2
    for i = 2:N; t=(t + x/t)/2  end    
    t
end

In [None]:
α = π
Babylonian(α), √α

In [None]:
x=2; Babylonian(x),√x  # Type \sqrt+<tab> to get the symbol

In [None]:
import Pkg
Pkg.add("Plots")
using Plots
gr()  # select one plotting backend; plotly() or pyplot() can be used instead if installed

In [None]:
# Warning: the first plot loads packages, which takes time
xs = 0:.01:49   # x values to plot over

plot([x->Babylonian(x,N=i) for i=1:5],xs,label=["Iteration $j" for i=1:1,j=1:5])

plot!(sqrt,xs,c="black",label="sqrt",
      title = "Those Babylonians really knew how to √")

## ...and now the derivative, almost by magic

Eight lines of Julia!  No mention of 1/2 over sqrt(x).
D for "dual number", invented by the famous algebraist Clifford in 1873.

The same algorithm, with no rewrite at all, correctly computes
the derivative, as the check shows.

In [None]:
struct D <: Number  # D is a function-derivative pair
    f::Tuple{Float64,Float64}
end

Sum Rule: (x+y)' = x' + y' <br>
Quotient Rule: (x/y)' = (yx'-xy') / y^2

In [None]:
import Base: +, /, convert, promote_rule
+(x::D, y::D) = D(x.f .+ y.f)
/(x::D, y::D) = D((x.f[1]/y.f[1], (y.f[1]*x.f[2] - x.f[1]*y.f[2])/y.f[1]^2))
convert(::Type{D}, x::Real) = D((x,zero(x)))
promote_rule(::Type{D}, ::Type{<:Number}) = D

In [None]:
x=49; Babylonian(D((x,1))), (√x,.5/√x)

In [None]:
x=π; Babylonian(D((x,1))), (√x,.5/√x)

## It just works!

How does it work?  We will explain in a moment.  Right now marvel that it does.  Note we did not
import any autodiff package.  Everything is just basic vanilla Julia.

## The assembler

Most folks don't read assembler, but one can see that it is short.
The shortness is a clue that suggests speed!

## Symbolically

We haven't yet explained how it works, but it may be of some value to understand that the below is mathematically
equivalent, though not what the computation is doing.

Notice in the below that Babylonian works on SymPy symbols.

Note: Python and Julia are good friends.  It's not a competition!  Watch how nicely we can use the same code now with SymPy.

In [None]:
@inline function Babylonian(x; N = 10) 
    t = (1+x)/2
    for i = 2:N; t=(t + x/t)/2  end    
    t
end  
@code_native(Babylonian(D((2,1))))

In [None]:
import Pkg;
Pkg.add("SymPy")
using SymPy

In [None]:
x = symbols("x")
display("Iterations as a function of x")
for k = 1:5
 display( simplify(Babylonian(x,N=k)))
end

display("Derivatives as a function of x")
for k = 1:5
 display(simplify(diff(simplify(Babylonian(x,N=k)),x)))
end

In [None]:
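function dBabylonian(x; N = 10) 
    t = (1+x)/2
    t′ = 1/2
    for i = 2:N   # same number of steps as Babylonian
        # update t′ first: the derivative rule needs the old value of t
        t′ = (t′ + (t - x*t′)/t^2)/2
        t = (t + x/t)/2
    end    
    t′
end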

The code computes answers that are mathematically equivalent to the functions above, but it does so numerically rather than symbolically.

## How autodiff is getting the answer
Let us by hand take the "derivative" of the Babylonian iteration with respect to x. Specifically t′=dt/dx.  This is the old fashioned way of a human rewriting code.
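
Written out, the step uses the sum and quotient rules, with $t$ treated as a function of $x$ and $t' = dt/dx$:

$$t_{\text{new}} = \frac{1}{2}\left(t + \frac{x}{t}\right) \quad\Longrightarrow\quad t'_{\text{new}} = \frac{1}{2}\left(t' + \frac{t - x\,t'}{t^2}\right)$$

Here $t$ and $t'$ on the right-hand side are the values from *before* the update. The starting value $t = (1+x)/2$ gives $t' = 1/2$.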

Notice that this rewritten code gets the right answer.  The trick is for the computer system to do the rewriting for you, without any loss of speed or convenience.

What just happened?  Answer: We created an iteration by hand for t′ given our iteration for t. Then we ran the iteration alongside the iteration for t.

How did the dual-number version work?  It created the same derivative iteration that we did by hand, using very general rules that are set once and need not be written by hand.

Important: The derivative is substituted before the JIT compiler runs, and thus efficient compiled code is executed.

## Dual Number Notation

Instead of D(a,b) we can write a + b ϵ, where ϵ satisfies ϵ^2=0.  (Some people like to recall imaginary numbers where an i is introduced with i^2=-1.) 

Others like to think of how engineers just drop the O(ϵ^2) terms.

The four rules are

$ (a+b\epsilon) \pm (c+d\epsilon) = (a \pm c) + (b \pm d)\epsilon$

$ (a+b\epsilon) * (c+d\epsilon) = (ac) + (bc+ad)\epsilon$

$ (a+b\epsilon) / (c+d\epsilon) = (a/c) + \left((bc-ad)/c^2\right) \epsilon $
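
Both the product and the quotient rule follow from $\epsilon^2 = 0$ alone. For the product, expand and drop the $\epsilon^2$ term:

$$(a+b\epsilon)(c+d\epsilon) = ac + (ad + bc)\epsilon + bd\,\epsilon^2 = ac + (bc + ad)\epsilon$$

For the quotient, multiply numerator and denominator by $c - d\epsilon$, noting that $(c+d\epsilon)(c-d\epsilon) = c^2 - d^2\epsilon^2 = c^2$:

$$\frac{a+b\epsilon}{c+d\epsilon} = \frac{(a+b\epsilon)(c-d\epsilon)}{c^2} = \frac{ac + (bc - ad)\epsilon}{c^2} = \frac{a}{c} + \frac{bc-ad}{c^2}\,\epsilon$$

The $\epsilon$ coefficients are exactly the product and quotient rules of calculus, with $b$ and $d$ playing the role of the derivatives.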

In [None]:
x = π; dBabylonian(x), .5/√x

In [None]:
Babylonian(D((x,1)))

In [None]:
Base.show(io::IO,x::D) = print(io,x.f[1]," + ",x.f[2]," ϵ")

In [None]:
# Add the last two rules
import Base: -,*
-(x::D, y::D) = D(x.f .- y.f)
*(x::D, y::D) = D((x.f[1]*y.f[1], (x.f[2]*y.f[1] + x.f[1]*y.f[2])))

In [None]:
D((1,0))

In [None]:
D((0,1))^2

In [None]:
D((2,1)) ^2

In [None]:
ϵ = D((0,1))
@code_native(ϵ^2)

In [None]:
ϵ * ϵ

In [None]:
ϵ^2

In [None]:
1/(1+ϵ)  # Exact power series:  1-ϵ+ϵ²-ϵ³+...

In [None]:
(1+ϵ)^5 ## Note this just works (we didn't train powers)!!

## Generalization to arbitrary roots

In [None]:
function nthroot(x, n=2; t=1, N = 10) 
    for i = 1:N;   t += (x/t^(n-1)-t)/n; end   
    t
end

In [None]:
nthroot(2,3), ∛2 # take a cube root

In [None]:
nthroot(2+ϵ,3)

In [None]:
nthroot(7,12), 7^(1/12)

In [None]:
x = 2.0
nthroot( x+ϵ,3), ∛x, 1/x^(2/3)/3

## Forward Diff
Now that you understand it, you can use the official package, `ForwardDiff.jl`:

In [None]:
Pkg.add("ForwardDiff")
using ForwardDiff

In [None]:
ForwardDiff.derivative(sqrt, 2)

In [None]:
ForwardDiff.derivative(Babylonian, 2)

In [None]:
@which ForwardDiff.derivative(sqrt, 2)

In [None]:
setprecision(3000)
round.(Float64.(log10.([Babylonian(BigFloat(2),N=k) for k=1:10] .- √BigFloat(2))), digits=3)

In [None]:
struct D1{T} <: Number  # D1 is a function-derivative pair with entries of type T
    f::Tuple{T,T}
end

In [None]:
z = D((2.0,1.0))
z1 = D1((BigFloat(2.0),BigFloat(1.0)))

In [None]:
import Base: +, /, convert, promote_rule
+(x::D1, y::D1) = D1(x.f .+ y.f)
/(x::D1, y::D1) = D1((x.f[1]/y.f[1], (y.f[1]*x.f[2] - x.f[1]*y.f[2])/y.f[1]^2))
convert(::Type{D1{T}}, x::Real) where {T} = D1((convert(T, x), zero(T)))
promote_rule(::Type{D1{T}}, ::Type{S}) where {T,S<:Number} = D1{promote_type(T,S)}

In [None]:
A = randn(3,3)

In [None]:
x = randn(3)

In [None]:
ForwardDiff.gradient(x->x'A*x,x)

In [None]:
(A+A')*x

In [None]:
using LinearAlgebra  # provides SymTridiagonal
n = 4
Strang = SymTridiagonal(2*ones(n),-ones(n-1))

## But wait there's more!

Many packages need to be taught how to compute autodiffs of matrix factorizations such as the svd or lu.  Julia will "just do it," no
teaching necessary, for reasons such as the above.  This is illustrated in another notebook, not included here.

# Express path to classifying images

In this notebook, we will show how to run classification software similar to how Google Images works.

Julia allows us to load in various pre-trained models for classifying images, via the `Metalhead.jl` package.

Let's download an image of an elephant:

We'll use the VGG19 model, which is a deep convolutional neural network trained on a subset of the ImageNet database. As this is your first notebook, the words "convolutional," "neural net," and "deep" may well seem mysterious.  By the end of this course these words will no longer be mysterious.

In [None]:
import Pkg
Pkg.add("Images")
Pkg.add("Metalhead")
using Images
using Metalhead  # To run type <shift> + enter
using Metalhead: classify

In [None]:
download("http://www.mikebirkhead.com/images/EyeForAnElephant.jpg", "elephant.jpg")

In [None]:
image = load("elephant.jpg") # open up a new cell type ESC + b (for below)

In [None]:
vgg = VGG19()

In [None]:
for i=1:28
  println(vgg.layers[i])
end

In [None]:
image

In [None]:
classify(vgg, image)

In [None]:
image = load("data/philip.jpg")

In [None]:
classify(vgg, image)

In [None]:
Metalhead.imagenet_classes(vgg)[rand(1:1000,1,1)]  # called with the model, as in the later cell

In [None]:
probs = Metalhead.forward(vgg, image)

In [None]:
perm = sortperm(probs)
probs[273]

In [None]:
[Metalhead.imagenet_classes(vgg)[perm] probs[perm]][end:-1:end-10,:]

## What is going on here?

VGG19 classifies images according to the following 1000 different classes:

The model is a Convolutional Neural Network (CNN), made up of a sequence of layers of "neurons" with interconnections. The huge number of parameters making up these interconnections were learnt beforehand, so that the model correctly predicts a set of training images representing each class.

Running the model on an image spits out the probability that the model assigns to each class:

We can now see which are the most likely few labels:
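
The ranking step on its own can be sketched with made-up numbers. The names and probabilities below are hypothetical stand-ins for the model's class list and output.

```julia
# Hypothetical class names and probabilities, standing in for the model output.
class_names = ["cat", "dog", "elephant", "tusker", "zebra"]
class_probs = [0.03, 0.01, 0.60, 0.34, 0.02]

order = sortperm(class_probs, rev = true)   # indices from most to least likely
top3 = [(class_names[i], class_probs[i]) for i in order[1:3]]
# top3 lists "elephant", "tusker" and "cat" with their probabilities
```

`sortperm` returns the permutation of indices rather than the sorted values, so the same permutation can be used to reorder the names and the probabilities together.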

## What are the questions to get a successful classifier via machine learning?

The key questions to obtain a successful classifier in machine learning are:

- How do we define a suitable model that can model the data adequately?

- How do we train it on suitably labelled data?

These are the questions that this course is designed to address.