# Deep Learning in Julia: An Introduction to Neural Network Models

This example requires the Flux package. Install it before running the code below.

```
] add Flux
```

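Note: Flux is updated frequently. Make sure your Julia is v1.3 or later and Flux is v0.10.4 or later. The code below follows the Flux v0.10-era API; more recent releases have renamed or deprecated some of the functions used here (for example `ADAM` and `@epochs`), so newer versions may need adjustments.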

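- Flux is a well-known deep learning framework for Julia. It is implemented entirely in Julia, so its performance relies on the Julia language itself. The package works directly with Julia's native arrays and is compatible with the language's syntax.
- Flux's automatic differentiation is provided by Zygote.
- Flux offers a Keras-like, layer-based way of building networks.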

In [1]:
# using Pkg
# Pkg.add("Flux")
# Pkg.add("MLDatasets")

In [2]:
# If an error occurs during precompiling, close Jupyter and re-open it as administrator.
using Flux

In [3]:
using Flux.Data: DataLoader # bring just DataLoader into scope from Flux.Data
using Flux: @epochs, onecold, onehotbatch, throttle, logitcrossentropy
using MLDatasets
using Statistics

In [4]:
using Images

## Loading the data
- `train_X`: inputs for the training phase
- `test_X`: inputs for making predictions
- `train_y`: answers (category IDs) for the training phase
- `test_y`: answers (category IDs) for the prediction phase

In [4]:
# We use the MNIST dataset from the MLDatasets package.
train_X, train_y0 = MNIST.traindata(Float32)
test_X, test_y = MNIST.testdata(Float32)

(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]

Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]

Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]

...

Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]

Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]

Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [7, 2, 1, 0, 4, 1, 4, 9, 5, 9  …  7, 8, 9, 0, 1, 2, 3, 4, 5, 6])

In [11]:
println("Training data X: type = $(typeof(train_X)), size = $(size(train_X))")
println("Training data y: type = $(typeof(train_y0)), size = $(size(train_y0))")
println("Testing data X: type = $(typeof(test_X)), size = $(size(test_X))")
println("Testing data y: type = $(typeof(test_y)), size = $(size(test_y))")

Training data X: type = Array{Float32,3}, size = (28, 28, 60000)
Training data y: type = Array{Int64,1}, size = (60000,)
Testing data X: type = Array{Float32,3}, size = (28, 28, 10000)
Testing data y: type = Array{Int64,1}, size = (10000,)


### The data must first be split into minibatches

flatten: convert each input sample into a one-dimensional vector
  - e.g. after flattening, `train_X` below has size $784\times 60000$, where $784 = 28 \times 28$ is the length of each flattened image and $60000$ is the total number of samples.
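To make the flattening step concrete, here is a minimal sketch in plain Julia that uses `reshape` rather than Flux. The array `images` is a random stand-in for MNIST, and the assumption is that `Flux.flatten` has the same effect of collapsing every dimension except the last.

```julia
# Three fake 28×28 "images" stacked along the third dimension (stand-in for MNIST).
images = rand(Float32, 28, 28, 3)

# Collapse the first two dimensions; `:` lets Julia infer 28 * 28 = 784.
flat = reshape(images, :, size(images, 3))

println(size(flat))                            # (784, 3)

# Julia arrays are column-major, so column k of `flat` is image k read column by column.
println(flat[:, 2] == vec(images[:, :, 2]))    # true
```

Because `reshape` shares memory with the original array, this step does not copy any pixel data.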

In [15]:
# Transform (w, h, c, b)-shaped input into (w × h × c, b)-shaped output by
#   linearizing all values for each element in the batch.
train_X = Flux.flatten(train_X)
test_X = Flux.flatten(test_X)
println("Flattened training data X: type = $(typeof(train_X)), size = $(size(train_X))")
println("Flattened testing data X: type = $(typeof(test_X)), size = $(size(test_X))")

Flattened training data X: type = Array{Float32,2}, size = (784, 60000)
Flattened testing data X: type = Array{Float32,2}, size = (784, 10000)


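onehot: encode each category ID as a column containing a single 1,
e.g. convert the category IDs [1 2 2 1] into the two-dimensional array [1 0 0 1; 0 1 1 0] (one row per category, one column per sample)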
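The small example above can be reproduced with plain broadcasting. This is only a sketch of the idea, not how `onehotbatch` is implemented; the names `labels` and `classes` are illustrative.

```julia
labels  = [1, 2, 2, 1]   # category IDs, one per sample
classes = 1:2            # all possible categories

# Row i, column j is true when sample j belongs to class i.
onehot = classes .== permutedims(labels)

# The result is a 2×4 Boolean matrix equal to [1 0 0 1; 0 1 1 0].
println(onehot == [1 0 0 1; 0 1 1 0])   # true
```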

In [16]:
# all distinct categories in the training labels
unique(train_y0)

10-element Array{Int64,1}:
 5
 0
 4
 1
 9
 2
 3
 6
 7
 8

In [17]:
train_y = onehotbatch(train_y0, 0:9) # 0:9 because the values in unique(train_y0) are 0 through 9
test_y = onehotbatch(test_y, 0:9)

10×10000 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
 0  0  0  1  0  0  0  0  0  0  1  0  0  …  0  0  0  0  0  1  0  0  0  0  0  0
 0  0  1  0  0  1  0  0  0  0  0  0  0     0  0  0  0  0  0  1  0  0  0  0  0
 0  1  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  1  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  1  0  0  0
 0  0  0  0  1  0  1  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  1  0  0
 0  0  0  0  0  0  0  0  1  0  0  0  0  …  1  0  0  0  0  0  0  0  0  0  1  0
 0  0  0  0  0  0  0  0  0  0  0  1  0     0  1  0  0  0  0  0  0  0  0  0  1
 1  0  0  0  0  0  0  0  0  0  0  0  0     0  0  1  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  1  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  1  0  1  0  0  1     0  0  0  0  1  0  0  0  0  0  0  0

In [18]:
train_y0[1] # category 5 (category from 0 to 9)

5

In [19]:
train_y[:,1] # converted to a vector identifying the category by 1 or 0.

10-element Flux.OneHotVector:
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0

Split the data into smaller batches of `batchsize` samples each (i.e. minibatches).

`DataLoader`: an object that iterates over mini-batches of data
- batchsize
    - Suppose `train_X` has size (256, n) and `train_y` has size (10, n), where n = 10000 is the total number of samples (each column is one input to the first layer):
        - `DataLoader(train_X, train_y, batchsize=128)` splits the data into batches. Each batch of `train_X` has size (256, 128) and each batch of `train_y` has size (10, 128); the last batch may be smaller.
    - Batching keeps each training step small enough to be tractable on a GPU.
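The batching idea can be sketched without Flux by partitioning the column indices. This is an illustration under simplified assumptions (tiny random data, no shuffling), not the `DataLoader` implementation; `X`, `Y` and `batches` are hypothetical names.

```julia
n = 10                         # total number of samples (tiny, for illustration)
X = rand(Float32, 4, n)        # 4 features per sample, one sample per column
Y = rand(Bool, 2, n)           # stand-in labels, one column per sample
batchsize = 4

# Split the column indices 1:n into consecutive chunks of at most `batchsize`.
batches = [(X[:, idx], Y[:, idx]) for idx in Iterators.partition(1:n, batchsize)]

for (x, y) in batches
    println(size(x), " ", size(y))
end
# The batches hold 4, 4 and 2 columns: the final batch is smaller.
```

Shuffling, as requested with `shuffle=true` in the training loader, would correspond to permuting the indices before partitioning them.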

In [20]:
batchsize = 1024
train = DataLoader(train_X, train_y, batchsize=batchsize, shuffle=true)
test = DataLoader(test_X, test_y, batchsize=batchsize)

DataLoader((Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], Bool[0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0]), 1024, 10000, true, 10000, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999, 10000], false)

## FFN (feedforward neural network) model
- `Dense(dim_in, dim_out, activ_fn)`
    - `dim_in`: dimension of the input variable
    - `dim_out`: dimension of the output variable
    - `activ_fn`: activation function (written `σ` in the Flux documentation); `identity` by default
    - a layer is a function
- `softmax`: the softmax function, also called the "normalized exponential function". It maps a vector (for example, the final output of a machine learning classifier, with one score per class) to a vector whose elements all lie in (0, 1), so the result can be read as a probability distribution over the classes. Since it is a probability distribution, the elements sum to 1. [Source](https://clay-atlas.com/blog/2019/10/20/machine-learning-chinese-softmax-function/)
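As a minimal sketch of the softmax definition above, the following plain-Julia function exponentiates the scores and normalizes them. The name `softmax_sketch` and the example scores are illustrative; Flux's own `softmax` should be preferred in real code.

```julia
# Subtracting the maximum does not change the result but avoids overflow in exp.
function softmax_sketch(scores::AbstractVector)
    shifted = exp.(scores .- maximum(scores))
    return shifted ./ sum(shifted)
end

probs = softmax_sketch([2.0, 1.0, 0.1])
println(probs)        # every element lies in (0, 1)
println(sum(probs))   # the elements sum to 1 (up to rounding)
```

The largest score always receives the largest probability, so the ranking of the classes is preserved.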

#### Link layers with `Flux.Chain`
- `Chain` can also be applied to ordinary functions, not only to layers.

#### Create a traditional `Dense` layer with parameters `W` and `b`
- `y = σ.(W * x .+ b)`
- The input `x` must be a vector of length `in`, or a batch of vectors represented as an `in × N` matrix. The output `y` is a vector of length `out`, or an `out × N` batch.
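The formula `y = σ.(W * x .+ b)` can be tried with plain arrays. The sketch below hard-codes small weights so the result can be checked by hand; `relu_sketch`, `dense_sketch` and the numbers are illustrative and not taken from the trained model.

```julia
relu_sketch(z) = max(zero(z), z)

W = [1.0 -1.0  0.5;
     0.0  2.0 -1.0]      # dim_out × dim_in weights (2 × 3)
b = [0.5, -1.0]          # one bias per output

# A dense layer is just a function of its input.
dense_sketch(x) = relu_sketch.(W * x .+ b)

x_single = [1.0, 2.0, 3.0]            # one input vector of length 3
x_batch  = hcat(x_single, zeros(3))   # 3 × N batch with N = 2

println(dense_sketch(x_single))       # [1.0, 0.0]
println(size(dense_sketch(x_batch)))  # (2, 2)
```

The same function handles a single vector and a batch because `W * x` is a matrix product in both cases and `.+ b` broadcasts the bias over the columns.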

In [6]:
chainfunc = Chain(
x -> x+2,
x -> x*3 
)
chainfunc(2)

12

In [21]:
layer1in = size(train_X,1);
layer1out = 256;
layer2out = 128;
finalout = size(train_y,1);

model = Chain(
  Dense(layer1in, layer1out, relu), # First layer: each input column has 784 elements.
  Dense(layer1out, layer2out, relu), # a layer is a function
  Dense(layer2out, finalout), # Final layer: 10 outputs, one per category.
  softmax)

Chain(Dense(784, 256, relu), Dense(256, 128, relu), Dense(128, 10), softmax)

In [28]:
# Redefine the model with `logsoftmax` as the last layer; this is the model trained below.
model = Chain(
  Dense(784, 256, relu), # First layer: each input column has 784 elements.
  Dense(256, 128, relu), # a layer is a function
  Dense(128, 10), # Final layer: 10 outputs, one per category.
  logsoftmax)

Chain(Dense(784, 256, relu), Dense(256, 128, relu), Dense(128, 10), logsoftmax)

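## Loss function
- Minimize the residuals
    - $\text{residual} = y - \hat{y}$, where $y$ is the actual class and $\hat{y}$ is the predicted class
- Popular loss functions
    - For regression
        - Mean squared error (MSE)
        - Mean absolute error (MAE)
        - MAE is more robust to outliers, but its derivative is discontinuous (the absolute-value function is not differentiable at $x=0$), which can cause trouble during optimization. MSE is more sensitive to outliers, but it is easier to find a stable solution with it.
        - See also L1 and L2 regularization
- A common loss function for **classification problems**: **cross-entropy**
    - Suppose A passes an exam with probability $p(x_A)=0.4$ and B passes with probability $p(x_B)=0.99$. Then $I(x_A)=-\log_2(0.4)\approx 1.322$ and $I(x_B)=-\log_2(0.99)\approx 0.014$.
        - A's information content is larger than B's. How should this be read? A is unlikely to pass, so if A suddenly passes, everyone notices, and the information content is relatively large. B almost always passes, so everyone takes it for granted and nobody pays much attention when it happens, and the information content is small.
        - The less predictable the outcome (for example, scores that swing between high and low), the larger the information content.
        - **Entropy measures uncertainty**: in this example we know before the exam that B passes with probability 0.99. Put plainly, B fails about once in a hundred exams, so a guess is almost never wrong (high certainty) and the computed entropy is small. A passes with probability 0.4, i.e. about 40 times in a hundred exams, so it is hard to guess whether A will pass and guesses are often wrong (high uncertainty), and the computed entropy is large. Entropy is largest when $p=0.5$, the case where the outcome cannot be guessed at all; there the entropy equals 1 (in bits).
        - This is also related to KLD (Kullback-Leibler divergence, which is based on entropy): a measure of surprise
    - The smaller the cross-entropy, the better the model.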
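The numbers in the exam example can be checked directly. The sketch below assumes base-2 logarithms (bits), which is what reproduces the values 1.322 and 0.014; `information` and `binary_entropy` are illustrative helper names.

```julia
# Self-information in bits: rarer events carry more information.
information(p) = -log2(p)

# Entropy (in bits) of a pass/fail outcome with pass probability p.
binary_entropy(p) = -(p * log2(p) + (1 - p) * log2(1 - p))

println(round(information(0.4),  digits=3))           # 1.322
println(round(information(0.99), digits=3))           # 0.014
println(binary_entropy(0.5))                          # 1.0
println(binary_entropy(0.4) > binary_entropy(0.99))   # true
```

The last line confirms that A's outcome (p = 0.4) is more uncertain than B's (p = 0.99).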
        
[A good article on loss functions (in Chinese)](https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8-%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E5%9F%BA%E7%A4%8E%E4%BB%8B%E7%B4%B9-%E6%90%8D%E5%A4%B1%E5%87%BD%E6%95%B8-loss-function-2dcac5ebb6cb)

- `logitcrossentropy(ŷ, y; weight = 1)`
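As a rough sketch of what a cross-entropy on logits computes, the code below applies a numerically stable log-softmax to each column and averages the per-sample losses over the batch. It is an assumption that this matches the essentials of Flux's `logitcrossentropy` with the default `weight`; the helper names are hypothetical.

```julia
# Numerically stable log-softmax over each column.
function logsoftmax_sketch(logits::AbstractMatrix)
    shifted = logits .- maximum(logits, dims=1)
    return shifted .- log.(sum(exp.(shifted), dims=1))
end

# Mean over the batch (columns) of -sum(y .* logsoftmax(ŷ)).
logit_ce_sketch(logits, y) = -sum(y .* logsoftmax_sketch(logits)) / size(y, 2)

logits = [2.0 0.0;
          0.0 0.0]          # 2 classes × 2 samples
y      = [1 0;
          0 1]              # one-hot targets, one column per sample

println(logit_ce_sketch(logits, y))
```

Applying log-softmax to values that are already log-softmax outputs leaves them unchanged, so in this sketch the loss is the same whether or not the model itself ends in `logsoftmax`.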

In [29]:
loss(x, y) = logitcrossentropy(model(x), y)

loss (generic function with 1 method)

## Callback function

In [None]:
typeof(test)

In [None]:
iterate(test)[1][1]

`test` (type: `DataLoader`) is an iterable that yields `(x, y)` batches.


In [30]:
function test_loss()
    L = 0f0 # Use Float32 to save GPU memory
    for (x, y) in test
        L += loss(x, y)
    end
    L/length(test)
end

test_loss (generic function with 1 method)

`evalcb()`: the evaluation callback

In [31]:
evalcb() = @show(test_loss()) # for displaying current progress only

evalcb (generic function with 1 method)

## 模型訓練

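- `cb`: an optional keyword argument of `Flux.train!` for callbacks, used here so that you can follow the training process. Wrapping the callback in `throttle` limits it to at most one call every `timeout_in_seconds` seconds.
- `ADAM`: an optimiser; `0.005` is its learning rate.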
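The effect of throttling a callback can be sketched with a small closure. This is a simplified illustration, not Flux's `throttle` implementation: it assumes the wrapped function runs on the first call and is then skipped until `timeout` seconds have passed. The name `throttle_sketch` is hypothetical.

```julia
# Return a wrapper that calls `f` at most once every `timeout` seconds.
function throttle_sketch(f, timeout)
    last_call = -Inf                      # time of the most recent real call
    return function (args...)
        now = time()
        if now - last_call >= timeout
            last_call = now
            return f(args...)
        end
        return nothing                    # call skipped
    end
end

counter = Ref(0)
report = throttle_sketch(() -> counter[] += 1, 10)

report(); report(); report()              # three calls in quick succession
println(counter[])                        # 1: only the first call went through
```

With a time-based limit like this, the number of progress lines per epoch depends on how long each epoch takes, which is consistent with a training log where some epochs print no `test_loss()` line and others print two.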

In [32]:
epochs = 20
timeout_in_seconds = 10
ps = Flux.params(model)
@epochs epochs Flux.train!(loss, ps, train, ADAM(0.005), cb=throttle(evalcb, timeout_in_seconds))

┌ Info: Epoch 1
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 1.8612522f0
test_loss() = 1.0767492f0


┌ Info: Epoch 2
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.35899606f0


┌ Info: Epoch 3
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.23955083f0


┌ Info: Epoch 4
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.12156515f0


┌ Info: Epoch 5
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.16015524f0


┌ Info: Epoch 6
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.100254f0


┌ Info: Epoch 7
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.1186793f0


┌ Info: Epoch 8
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121
┌ Info: Epoch 9
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.105872974f0
test_loss() = 0.11349766f0


┌ Info: Epoch 10
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.10412069f0


┌ Info: Epoch 11
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121
┌ Info: Epoch 12
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.12895682f0
test_loss() = 0.12616928f0


┌ Info: Epoch 13
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.10845351f0


┌ Info: Epoch 14
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.14078231f0


┌ Info: Epoch 15
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121
┌ Info: Epoch 16
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.13358337f0
test_loss() = 0.14387587f0


┌ Info: Epoch 17
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121
┌ Info: Epoch 18
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.16247737f0
test_loss() = 0.16038367f0


┌ Info: Epoch 19
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121
┌ Info: Epoch 20
└ @ Main C:\Users\HSI\.julia\packages\Flux\Fj3bt\src\optimise\train.jl:121


test_loss() = 0.13691534f0


## Model evaluation
- `onecold`: the inverse operation of `onehot`; for each column it returns the position (or label) of the largest entry
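The decoding step can be sketched as a column-wise `argmax`. The helper `onecold_sketch` and the example scores are illustrative only; they are not Flux's implementation.

```julia
# For each column, return the label whose row holds the largest value.
onecold_sketch(m::AbstractMatrix, labels) = [labels[argmax(col)] for col in eachcol(m)]

scores = [0.1 0.7;
          0.8 0.2;
          0.1 0.1]                      # 3 classes × 2 samples
println(onecold_sketch(scores, 0:2))    # [1, 0]
```

Comparing the decoded predictions with the decoded targets and taking the mean of the matches gives the accuracy, which is the approach used in the cell below.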

In [33]:
accuracy(x, y) = mean(onecold(model(x)) .== onecold(y))

accuracy (generic function with 1 method)

In [34]:
accuracy(test_X, test_y)

0.9766