In [1]:
using Flux, DiffEqFlux, Plots, DifferentialEquations, Random, Optim, CuArrays
Random.seed!(42)
plotlyjs() # optional backend for plotting

┌ Info: CUDAdrv.jl failed to initialize, GPU functionality unavailable (set JULIA_CUDA_SILENT or JULIA_CUDA_VERBOSE to silence or expand this message)
└ @ CUDAdrv /home/julius/.julia/packages/CUDAdrv/b1mvw/src/CUDAdrv.jl:67


Plots.PlotlyJSBackend()

# Neural Differential Equations in Julia
> Exploring the [Flux.jl](https://github.com/FluxML/Flux.jl) and [DiffEqFlux.jl](https://github.com/JuliaDiffEq/DiffEqFlux.jl) packages. 


## Warm-Up: Using Flux for Linear Regression

[Flux](https://julialang.org/blog/2018/12/ml-language-compiler/): "...typical frameworks are all-encompassing monoliths in hundreds of thousands of lines of C++, Flux is only a thousand lines of straightforward Julia code. Simply take one package for gradients (Zygote.jl), one package for GPU support (CuArrays.jl), sprinkle with some light convenience functions, bake for fifteen minutes and out pops a fully-featured ML stack."

**Problem:** Given data $(x_i,y_i)_{i=0}^m$ we want to approximately solve the problem 

$$ \min_{W,b} \sum_{i=0}^m | Wx_i+b - y_i |^2. $$
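
Because the model is affine, this is an ordinary least-squares problem, so a closed-form solution is available as a reference for the iterative methods used later. The following is a minimal sketch using only Julia's standard library. The names `X`, `Y`, `A`, `θ` and the ground-truth values are illustrative assumptions and not part of the notebook.

```julia
using LinearAlgebra, Random

Random.seed!(0)

# synthetic data in the notebook's layout: one column per sample
m = 30
X = rand(2, m)
W_true = [0.3 0.9]             # hypothetical ground-truth weights (1×2)
b_true = 0.5                   # hypothetical ground-truth bias
Y = W_true * X .+ b_true       # noise-free targets (1×m)

# append a column of ones so that the bias becomes a third weight
A = [permutedims(X) ones(m)]   # m×3 design matrix

# for a tall matrix, backslash returns the least-squares solution
θ = A \ vec(Y)
W_ls, b_ls = θ[1:2]', θ[3]
```

With noise-free targets, `W_ls` and `b_ls` reproduce the ground truth up to rounding. With noisy targets they give the minimizer of the sum above.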

### Data

In [2]:
# data specification
samples = 30
noise_lvl = 0.1

# underlying (unknown) model; note: uncomment '|> gpu' to run on the GPU
Ŵ = rand(1,2) #|> gpu
b̂ = rand(1) #|> gpu
ξ = rand(1, samples) #|> gpu

# affine linear mapping
aff(mat,vec) = (x -> mat*x .+ vec)

# create data
x = rand(2, samples) #|> gpu
y = aff(Ŵ, b̂)(x) .+ noise_lvl.*ξ

# plot helper function 
function plotting(model = nothing)
  x1 = x2 = range(0, 1; length=100) 
  p = scatter(x[1,:], x[2,:], vec(y), markersize = 2, label="data")
  if model !== nothing
      # on gpu: plot!(x1, x2, (x1, x2) -> model(cu([x1,x2]))[], st=:surface, fα=0.5, colorbar_entry=false) 
      plot!(x1, x2, (x1, x2) -> model([x1,x2])[], st=:surface, fα=0.5, colorbar_entry=false)
  end
  return p
end

# plot
plotting()

### Model

In [3]:
# initial model 
W = rand(1,2) #|> gpu
b = rand(1) #|> gpu
model = aff(W,b)

# squared error loss, summed over the samples (not averaged, despite the name MSE)
MSE(y1, y2) = sum(abs2, y1 .- y2) 
loss(model) = MSE(model(x), y)

# print loss
printloss(model, i=0) = i%20==0 ? println("Step: $i Loss: $(loss(model))") : nothing
printloss(model)

# plot
plotting(model)

Step: 0 Loss: 1.2746710559117715


### Gradient Descent

**Idea:** To improve the prediction we can take the gradient of the loss w.r.t. $W$ and $b$ and perform gradient descent.

In contrast to TensorFlow or PyTorch in Python, this requires no separate tracing or graph-construction step: Julia is just-in-time compiled, and the *computational graph* is Julia’s own syntax.
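
For the affine model the gradients can also be written down by hand. With residuals $r_i = Wx_i + b - y_i$, the loss has $\nabla_W = 2\sum_i r_i x_i^\top$ and $\nabla_b = 2\sum_i r_i$. The sketch below checks the first formula against a central finite difference. It uses only the standard library, and all names (`X`, `Y`, `grads`, `fd`) are illustrative rather than taken from the notebook.

```julia
using Random

Random.seed!(1)

X = rand(2, 30)                  # inputs, one column per sample
Y = rand(1, 30)                  # arbitrary targets
W = rand(1, 2)
b = rand(1)

sqloss(W, b) = sum(abs2, W * X .+ b .- Y)

# hand-derived gradients of the summed squared error
function grads(W, b)
    r = W * X .+ b .- Y          # residuals, 1×m
    return (2 .* r) * X', [2 * sum(r)]
end

gW, gb = grads(W, b)

# central finite difference in the first entry of W
h = 1e-6
E = [h 0.0]
fd = (sqloss(W .+ E, b) - sqloss(W .- E, b)) / (2h)
```

Here `gW[1]` and `fd` should agree closely, since the loss is quadratic and the central difference is therefore exact up to rounding.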

In [4]:
# gradient steps and learning rate
steps = 100
ν = 0.02

# gradient descent
for i=1:steps
  g = gradient(() -> loss(model), params(W, b))
  W .-= ν .* g[W]
  b .-= ν .* g[b]
  printloss(model, i)
end

# plot
plotting(model)

Step: 20 Loss: 0.08186884586587508
Step: 40 Loss: 0.026613589974306313
Step: 60 Loss: 0.023346350154615253
Step: 80 Loss: 0.02312973935262624
Step: 100 Loss: 0.02311477129711602


### Shortcut

Let us use predefined Flux functions!

In [5]:
# model and initial parameters
flux_model = Dense(2, 1) #|> gpu
ps = Flux.params(flux_model)

# loss
printloss(flux_model)

Step: 0 Loss: 25.56036611261563


In [6]:
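# train
# note: ADAM(0.02) is constructed anew in every iteration, so its moment estimates are reset at each step;
# create the optimiser once before the loop to let Adam accumulate its state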
for i=1:steps
  Flux.train!((x,y) -> MSE(flux_model(x), y), ps, [(x,y)], ADAM(0.02))
  printloss(flux_model, i)
end

# plot
plotting(flux_model)

Step: 20 Loss: 1.948328258011641
Step: 40 Loss: 0.26326352321651353
Step: 60 Loss: 0.08121725247885071
Step: 80 Loss: 0.08121723392122139
Step: 100 Loss: 0.08121724600080926
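
The plateau in this output is consistent with how Adam behaves when its state is discarded: a first step taken from a fresh state has magnitude of roughly the learning rate, whatever the size of the gradient. The sketch below implements the textbook Adam update for a scalar parameter in plain Julia. It is not Flux's implementation, and it assumes that the optimiser keeps its moment estimates inside the `ADAM` object, so that a newly constructed one starts from zero. The names `adam_step` and `fresh` are hypothetical.

```julia
# one textbook Adam update for a scalar parameter; `state` holds (m, v, t)
function adam_step(θ, g, state; η = 0.02, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    m, v, t = state
    t += 1
    m = β1 * m + (1 - β1) * g        # first-moment estimate
    v = β2 * v + (1 - β2) * g^2      # second-moment estimate
    mhat = m / (1 - β1^t)            # bias corrections
    vhat = v / (1 - β2^t)
    return θ - η * mhat / (sqrt(vhat) + ϵ), (m, v, t)
end

fresh = (0.0, 0.0, 0)

# starting from a fresh state, a tiny and a huge gradient move θ by about η each
θ_small, _ = adam_step(1.0, 1e-3, fresh)
θ_large, _ = adam_step(1.0, 1e+3, fresh)
```

Both calls move the parameter from `1.0` to about `0.98`. Steps of fixed size cannot shrink as the iterate approaches the minimum, which may explain why the loss stops improving here.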


In [7]:
# compare the model parameters
println("Parameter of first model: W = $W, b = $b") 
println("Parameter of second model: W = $(flux_model.W), b = $(flux_model.b)") 

Parameter of first model: W = [0.26711052520295 0.9777963713187146], b = [0.49359334045346415]
Parameter of second model: W = Float32[0.18328257 1.1058999], b = Float32[0.4800001]


## Neural Differential Equations using DiffEqFlux

[DiffEqFlux](https://julialang.org/blog/2019/01/fluxdiffeq/): "Layers have traditionally been simple functions like matrix multiply, but in the spirit of differentiable programming people are increasingly experimenting with much more complex functions, such as ray tracers and physics engines. Turns out that differential equations solvers fit this framework, too."


**Problem:** Given data $(t_i, u(t_i))_{i=0}^m$ of the solution to an *unknown* ODE

$$ u'(t) = f(u), \quad u(t_0) = u_0 $$

**Goal:**  Train a neural network model $\mathcal{N}_\Phi$ (with learnable parameters $\Phi$) to approximately recover $f$, i.e. learn the underlying ODE from data.

**Idea:** Numerically solve the *neural* ODE 

$$ \tilde{u}_\Phi'(t) = \mathcal{N}_{\Phi}(\tilde{u}_\Phi), \quad \tilde{u}_\Phi(t_0) = u_0 $$

at times $(t_i)_{i=0}^m$ with a package that allows computing the gradient of the error 
$$\sum_{i=0}^m \big( \tilde{u}_\Phi(t_i)-u(t_i)\big)^2$$

w.r.t. $\Phi$ in order to perform first-order optimization. 
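
The same idea can be tried on a toy problem without any packages. In the sketch below the network is replaced by a one-parameter right-hand side $p\sin(u)$, a hand-written classical Runge-Kutta scheme stands in for `Tsit5()`, and a central finite difference stands in for automatic differentiation. All three are simplifying assumptions made for illustration, and none of the names (`rk4`, `rhs`, `odeloss`, `fit`) belong to DiffEqFlux.

```julia
# classical RK4 integrator; returns the state at every time in `ts` (ts[1] is the initial time)
function rk4(f, u0, p, ts; substeps = 10)
    u = u0
    us = [u0]
    for k in 1:length(ts)-1
        h = (ts[k+1] - ts[k]) / substeps
        for _ in 1:substeps
            k1 = f(u, p)
            k2 = f(u + h / 2 * k1, p)
            k3 = f(u + h / 2 * k2, p)
            k4 = f(u + h * k3, p)
            u += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        end
        push!(us, u)
    end
    return us
end

rhs(u, p) = p * sin(u)            # parametrised right-hand side, stands in for the network
ts = 0.0:0.1:2.0
data = rk4(rhs, 2.0, 2.0, ts)     # "measurements" generated with the true parameter p = 2

# squared error between the model trajectory and the data
odeloss(p) = sum(abs2, rk4(rhs, 2.0, p, ts) .- data)

# central finite difference as a stand-in for automatic differentiation
dloss(p; h = 1e-4) = (odeloss(p + h) - odeloss(p - h)) / (2h)

# plain gradient descent on the single parameter
function fit(p; η = 0.2, steps = 200)
    for _ in 1:steps
        p -= η * dloss(p)
    end
    return p
end

p_fit = fit(1.0)
```

Starting from `p = 1`, the fitted parameter should approach the true value `2`. With a neural network in place of `rhs`, finite differences become impractical because of the number of parameters, which is why a package that differentiates through the solver is needed.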

### Underlying (Unknown) Dynamics

In [8]:
# initial condition 
u0 = Float32[2.0f0] 

# time horizon, number of samples, and uniformly distributed points 
tspan = (0.0f0,15f0)
datasize = 100
t = sort(tspan[1] .+ rand(Float32, datasize)*(tspan[2]-tspan[1]))

# noise and noise_lvl
noise_lvl = 0.1
ξ = rand(Float32, datasize)

# true du/dt
f(u,p,t) = 2*sin.(u)

# solution of the true ODE at time-points t and initial condition u0 with additional noise
# (we could also use the exact solution u(t) = 2cot^{-1}(e^{-2t}cot(1)), but we want to explore DifferentialEquations.jl)
noisy_u(u0, t, ξ) = Array(solve(ODEProblem(f, u0, tspan), Tsit5(), saveat=t)) .+ noise_lvl.* ξ'
u = noisy_u(u0, t, ξ)

# plot helper function
function plotsol(u, û = nothing)
  p = scatter(t, vec(u), label="data")
  if û !== nothing
    scatter!(t, vec(û), label="prediction") 
  end
  return p
end   

# plot solution
plotsol(u)
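
The closed-form solution of this ODE follows from separation of variables: $\frac{d}{dt}\ln\tan(u/2) = u'/\sin(u) = 2$, hence $\cot(u(t)/2) = e^{-2t}\cot(1)$ for $u(0)=2$. The short check below evaluates the ODE residual of that formula with a central difference. It uses only Base Julia, and the names `u_exact` and `residual` are illustrative.

```julia
# closed-form solution of u' = 2 sin(u) with u(0) = 2
u_exact(t) = 2 * acot(exp(-2t) * cot(1))

# residual u'(t) - 2 sin(u(t)), with u' approximated by a central difference
h = 1e-5
residual(t) = (u_exact(t + h) - u_exact(t - h)) / (2h) - 2 * sin(u_exact(t))

# largest residual on a coarse grid over the notebook's time horizon
maxres = maximum(abs, residual.(0.0:0.5:15.0))
```

The residual stays at the level of the finite-difference error, and `u_exact(0.0)` returns the initial value `2`.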

### Neural Network Model

In [9]:
# neural network model
model = Chain(Dense(1,50,relu), Dense(50,100,relu), Dense(100,1))

# ODE solver for the neural network model and initial model parameters
n_ode = NeuralODE(model, tspan, Tsit5(), saveat=t)
Φ = n_ode.p

# prediction for given initial condition
ũ(Φ) = n_ode(u0,Φ)

# plot of the data and the (untrained) neural ODE prediction
plotsol(u, ũ(Φ))

### Optimization

In [10]:
# loss 
loss(Φ) = MSE(ũ(Φ), u)

# callback for training
function callback(p, l) 
  println("Loss: $l")
  return false
end

# optimize with ADAM
res1 = DiffEqFlux.sciml_train(loss, Φ, ADAM(0.008), cb=callback, maxiters=300)

Loss: 2605.7443089360518
Loss: 446.86031976794976
Loss: 728.2911896094091
Loss: 788.7606012134377
Loss: 789.2819245140869
Loss: 760.8749516380996
Loss: 704.5445661725139
Loss: 604.3947895015518
Loss: 436.0798531016153
Loss: 178.96486845518362
Loss: 80.11227997667746
Loss: 156.8499433707727
Loss: 29.187423572241187
Loss: 91.83110417255699
Loss: 132.33080602317355
Loss: 125.90860021984945
Loss: 86.06812047360269
Loss: 36.82487973222664
Loss: 29.795640428764028
Loss: 82.92604207875057
Loss: 61.87491452383769
Loss: 23.46347535010954
Loss: 29.910784396627456
Loss: 48.1245955355483
Loss: 55.40550964861706
Loss: 47.968759158169156
Loss: 32.07667071827135
Loss: 20.554653524196308
Loss: 25.201341987282174
Loss: 37.76316730050019
Loss: 34.39183617634949
Loss: 22.064602359282848
Loss: 19.378997531305224
Loss: 24.741740818352223
Loss: 29.293207305308172
Loss: 28.494026708432784
Loss: 23.378723773568034
Loss: 18.66754497914894
Loss: 18.856607372422843
Loss: 22.830494429455165
Loss: 23.4908822426573

 * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Minimizer: [-1.06e-01, -3.56e-02, -2.60e-01,  ...]
    Minimum:   4.410904e-01

 * Found with
    Algorithm:     ADAM
    Initial Point: [-1.59e-01, -7.84e-02, -3.17e-01,  ...]

 * Convergence measures
    |x - x'|               = NaN ≰ 0.0e+00
    |x - x'|/|x'|          = NaN ≰ 0.0e+00
    |f(x) - f(x')|         = NaN ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = NaN ≰ 0.0e+00
    |g(x)|                 = NaN ≰ 0.0e+00

 * Work counters
    Seconds run:   2485  (vs limit Inf)
    Iterations:    300
    f(x) calls:    300
    ∇f(x) calls:   300


In [11]:
# plot
plotsol(u, ũ(res1.minimizer))

In [12]:
# optimize with LBFGS
res2 = DiffEqFlux.sciml_train(loss, res1.minimizer, LBFGS(), cb=callback)

Loss: 0.43741249165506785
Loss: 0.43641017754418465
Loss: 0.09886923455194511
Loss: 0.07735333562666946
Loss: 0.07706489876682218
Loss: 0.07705192363471956
Loss: 0.07704946863259016
Loss: 0.07704946863259016
Loss: 0.07703972669377326
Loss: 0.07703972669377326
Loss: 0.07703972669377326


 * Status: success

 * Candidate solution
    Minimizer: [-1.06e-01, -3.56e-02, -2.60e-01,  ...]
    Minimum:   7.703973e-02

 * Found with
    Algorithm:     L-BFGS
    Initial Point: [-1.06e-01, -3.56e-02, -2.60e-01,  ...]

 * Convergence measures
    |x - x'|               = 9.31e-10 ≰ 0.0e+00
    |x - x'|/|x'|          = 9.29e-10 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 6.38e-03 ≰ 1.0e-08

 * Work counters
    Seconds run:   232  (vs limit Inf)
    Iterations:    10
    f(x) calls:    156
    ∇f(x) calls:   156


In [13]:
# plot
plotsol(u, ũ(res2.minimizer))

### Extrapolate

In [14]:
# compare the neural ODE solution to the ground truth for a different initial value
u0 = Float32[6.] # new initial condition
t = sort(tspan[1] .+ rand(Float32, datasize)*(tspan[2]-tspan[1])) # new time points
plotsol(noisy_u(u0, t, 0), NeuralODE(model, tspan, Tsit5(), saveat=t)(u0, res2.minimizer)) # plot

In [15]:
# compare the functions f and the neural network model directly 
du = range(1, 5; length=100)
plot(du, f(du, (), ()), label="f")
plot!(du, vec(n_ode.re(res2.minimizer)(du')), label="neural network")

Note that we can also continue our training for different initial conditions (if such data is available).