From 24ea6aa3bab42602af9e2a308964bec8a74be957 Mon Sep 17 00:00:00 2001 From: Chris Rackauckas Date: Sun, 13 Dec 2020 07:20:36 -0500 Subject: [PATCH] Add diffeq machine learning lecture --- README.md | 25 ++ lecture15/diffeq_machine_learning.html | 337 +++++++++++++++---------- lecture15/diffeq_machine_learning.jmd | 159 +++++++++++- 3 files changed, 380 insertions(+), 141 deletions(-) diff --git a/README.md b/README.md index 6c69022b..7d92db23 100644 --- a/README.md +++ b/README.md @@ -514,6 +514,31 @@ those in scientific computing by looking at the relationship between convolution neural networks and partial differential equations. It turns out they are more than just similar: the two are both stencil computations on spatial data! +## Lecture 15: + +- [Mixing Differential Equations and Neural Networks for Physics-Informed Learning (Lecture)](https://youtu.be/VHtugbwyNKg) +- [Mixing Differential Equations and Neural Networks for Physics-Informed Learning (Notes)](https://mitmath.github.io/18337/lecture15/diffeq_machine_learning) + +Neural ordinary differential equations and physics-informed neural networks are +only the tip of the iceberg. In this lecture we will look into other algorithms +which are utilizing the connection between neural networks and machine learning. +We will generalize to augmented neural ordinary differential equations and +universal differential equations with DiffEqFlux.jl, which now allows for stiff +equations, stochasticity, delays, constraint equations, event handling, etc. to +all take place in a neural differential equation format. Then we will dig into +the methods for solving high dimensional partial differential equations through +transformations to backwards stochastic differential equations (BSDEs), and the +applications to mathematical finance through Black-Scholes along with stochastic +optimal control through Hamilton-Jacobi-Bellman equations. We then look into +alternative training techniques using reservoir computing, such as continuous-time +echo state networks, which alleviate some of the gradient issues associated with +training neural networks on stiff and chaotic dynamical systems. We showcase a +few of the methods which are being used to automatically discover equations in +their symbolic form such as SINDy. To end it, we look into methods for +accelerating differential equation solving through neural surrogate models, and +uncover the true idea of what's going on, along with understanding when these +applications can be used effectively. + ## Lecture 16: Probabilistic Programming - [From Optimization to Probabilistic Programming (Lecture)](https://youtu.be/32rAwtTAGdU) diff --git a/lecture15/diffeq_machine_learning.html b/lecture15/diffeq_machine_learning.html index f7124efe..43eb8297 100644 --- a/lecture15/diffeq_machine_learning.html +++ b/lecture15/diffeq_machine_learning.html @@ -631,7 +631,7 @@ pre, samp { font-family: Menlo, Monaco, Consolas, "Courier New", monospace; - font-size: 13px; + font-size: 0.9em; } @@ -646,132 +646,138 @@ div.title {text-align: center;} + + +
+
+
+
+

Mixing Differential Equations and Neural Networks for Physics-Informed Learning

+
Chris Rackauckas
+
December 13th, 2020
+
- - - -
-
-
- -
-

Mixing Differential Equations and Neural Networks for Physics-Informed Learning

-
Chris Rackauckas
-
November 19th, 2019
-
- -

Given this background in both neural network and differential equation modeling, let's take a moment to survey some methods which integrate the two ideas. Neural differential equations are of course one possible method, but there are methods which utilize the composition of these ideas. Some are good, some are not.

-

Julia codes for these methods are being developed, optimized, and tested in the NeuralNetDiffEq.jl package. A generalized neural differential equation solver can be found at DiffEqFlux.jl. Note that as a standard and maintained Julia package, these methods are available today for solving these kinds of problems with these neural-enhanced methods.

-

Generalized Neural Differential Equations

-

While our previous lectures focused on ordinary differential equations, the larger classes of differential equations can also have neural networks, for example:

+

Given this background in both neural network and differential equation modeling, let's take a moment to survey some methods which integrate the two ideas. In this course we have fully described how Physics-Informed Neural Networks (PINNs) and neural ordinary differential equations are both trained and used. There are many other methods which utilize the composition of these ideas.

+

Julia codes for these methods are being developed, optimized, and tested in the SciML organization. Some packages to note are

+

and many more collaborations with scientists around the world (too many to note). And there are some scattered packages in other languages to note too, such as:

+ -

For each of these equations, one can come up with an adjoint definition in order to define a backpropogation. However, there are limitatinos to this. In many cases, such as with stochastic differential equations, the time complexity of the continuous adjoint method is actually worse than backpropogation through the solver itself. Secondly, every single one of these equations requires an adjoint method, and some of them need different adjoint methods in different contexts (state-dependent delay differential equations vs constant-time delay differential equations).

-

DiffEqFlux.jl is the first library to support the wide gambit of possible differential equations. It does so by using Julia's language-wide AD tooling, such as Tracker.jl, ForwardDiff.jl, and Zygote.jl, along with specializations available whenever adjoint methods are known (and the choice between the two is given to the user).

-

For example, a neural stochastic differential equation can be trained using the following example, and known differential equations can be mixed with unknown portions. Additionally, this blog post showcases the mixing of event handling with neural stochastic differential equations and using stiff solvers for ODEs to train neural partial differential equations.

-

Many of the methods below can be encapsolated as a choice of a neural differential equation and trained with higher order, adaptive, and more efficient methods with DiffEqFlux.jl. A paper detailing this is to come out in the next few weeks.

-

Artificial neural networks for solving ordinary and partial differential equations

-

Solving Ordinary Differential Equations

-

This is a result due to Lagaris et. al from 1998. The idea is to solve differential equations using neural networks by representing the solution by a neural network and training the resulting network to satisfy the conditions required by the differential equation.

-

Let's say we want to solve a system of ordinary differential equations

-

\[ -u' = f(u,t) -\]

-

with $t \in [0,1]$ and a known initial condition $u(0)=u_0$. To solve this, we approximate the solution by a neural network:

-

\[ -N(t) \approx u(t) -\]

-

If $N(t)$ was the true solution, then it would hold that $N'(t) = f(N(t),t)$ for all $t$. Thus we turn this condition into our loss function. This motivates the loss function:

-

\[ -L(p) = \sum_i \left(\frac{dN(t_i)}{dt} - f(N(t_i),t_i) \right)^2 -\]

-

The choice of $t_i$ could be done in many ways: it can be random, it can be a grid, etc. Anyways, when this loss function is minimized (gradients computed with standard reverse-mode automatic differentiation), then we have that $\frac{dN(t_i)}{dt} \approx f(N(t_i),t_i)$ and thus $N(t)$ approximately solves the differential equation.

-

Quick Question for Understanding: What is an effective way to compute dN/dt?

-

If you thought reverse-mode automatic differentiation, go back and think about why that is incorrect! Hint: what is the dimensionality of the input and the output?

-

Note that we still have to handle the initial condition. One simple way to do this is to add an initial condition term to the cost function. While that would work, it can be more efficient to encode the initial condition into the function itself so that it's trivially satisfied for any possible set of parameters. For example, instead of directly using a neural network, we can use:

-

\[ -T(t) = u_0 + tN(t) -\]

-

as our solution. Notice that this will always satisfy the initial condition, so if we train this to satisfy the derivative function then it will automatically be a solution to the derivative function.

-

Solving Partial Differential Equations

-

This same idea can be employed to solve partial differential equations. For example, take the Poisson equation:

-

\[ -\Delta u = f(x,y) -\]

-

for the 2-dimensional Poisson equation with $x \in [0,1]$ and $y \in [0,1]$. Assume that the boundary conditions are Dirichlet, i.e. $u(0,y) = u_{0x}(y)$, $u(1,y) = u_{1x}(y)$, $u(x,0) = u_{0y}(x)$, and $u(x,1)=u_{1y}(y)$ are all given. If this is the case, we can have

-

\[ -T(x,y) = A(x,y) + x(1-x)y(1-y)N(x,y) -\]

-

where $A(x,y)$ is a function that is made so $T(x,y)$ satisfies the boundary conditions. Then we have the loss function:

-

\[ -L(p) = \sum_i \left(\Delta T(x_i,y_i) - f(x_i,y_i) \right)^2 -\]

-

Julia Implementation

-

A Julia implementation for solving ODEs via neural networks through this method exists in NeuralNetDiffEq.jl. Take a look at the code and make sure you understand it. It should be easily understandable given this description of the method!

-

Analysis of the Method and Dimensionality

-

While the method "will work" with large enough neural networks due to the universal approximation theorem, this method is highly inefficient since no information about differential equations are embedded into the problem. There is no convergence rate guarantee, and there are no strict bounds like one would expect from numerical analysis. All implementations of the method that I have found (including my own) show that this method is very very inefficient. For example, it can be hard to train a solver on the Lotka-Volterra equation using a few GPU hours, while a Runge-Kutta method will solve it in nanoseconds...

-

That said, the method itself is an interesting starting point. While other ODE and PDE solvers need to approximate an equation at specific points, i.e. on a mesh, this method is mesh-free in the sense that there is no mesh and just a direct representation of the continuous solution itself. This algorithm was reinvented in 2018 as the Deep Galerkin Method (DGM) and exploited the mesh-free behavior to see this as a method for high dimensional partial differential equations. While a finite difference method would require $N^d$ many points, a neural network just needs to have its parameters in memory, and thus it can scale well to high dimensional partial differential equation problems.

-

Physics-Informed Deep Learning

-

The results on this are found across three papers, the main being this article from the Journal of Computational Physics. with this JMLR paper having a similar flavor and all stemming from this arxiv paper. The authors are M.Raissi, P.Perdikaris, and G.E.Karniadakis, two at Brown university and Perdikaris at UPenn.

-

The main idea behind the Physics-Informed Neural Network (PINN) is to use known physical information as a form of prior information in the structure of the neural network architecture in order to build a more data-efficient form of machine learning for scientific computing applications. This setup starts by assuming that the underlying data follows some nonlinear partial differential equation

-

\[ -u_t + \mathcal{N}[u;\lambda] = 0 -\]

-

for some $\lambda$ parameterized PDE operator over some space $x \in \Omega$ and $t \in [0,T]$. From here, we proceed by approximating $u(t,x)$ by a neural network $N(t,x)$. Let

+

and many more. This lecture is a quick servey on different directions that people have taken so far in this field. It is by no means comprehensive.

+

The Augmented Neural Ordinary Differential Equation

+

Note that not every function can be represented by an ordinary differential equation. Specifically, $u(t)$ is an $\mathbb{R} \rightarrow \mathbb{R}^n$ function which cannot loop over itself except when the solution is cyclic. The reason is because the flow of the ODE's solution is unique from every time point, and for it to have "two directions" at a point $u_i$ in phase space would have two solutions to the problem

\[ -f(t,x) = N_t + \mathcal{N}[N;\lambda] +u' = f(u,p,t) \]

-

The loss function that is used is:

-

\[ -MSE = MSE_u + MSE_f -\]

-

where

-

\[ -MSE_u = \frac{1}{N_u} \sum_i |N(t_i^u,x_i^u) - u_i |^2 -\]

-

is the difference between the neural network and the data $u$, while

-

\[ -MSE_f = \frac{1}{N_f} \sum_i |f(t_i^f,x_i^f)|^2 -\]

-

.

-

is the loss on the PDE function itself. Notice that, if $f=0$ then the PDE is satisfied, and thus the second term indeed looks like a regularization to push towards the solution of the PDE.

-

Thus in some sense one can understand this as an extension to the previous deep learning methods for solving partial differential equations, where here the $MSE_f$ term is enforcing that the neural network solves the PDE, just like before, but it is mixed with a loss function that drives $N(t,x)$ towards the known data $u_i(t_i,x_i)$. In this sense, this is a method not for solving a partial differential equation with a neural network, but by using the structure of a neural network PDE solver as a component within a data-driven deep learning approach to relax the solution towards an underlying known physical structure.

-

Once this is established, the authors present two separate ways of training the physics-informed neural network $f$. One way to do this is to directly evaluate the loss function $MSE_f$ at chosen or random points in time using automatic differentiation. However, the number of points needed for this method to work is seen as a computational bottleneck (indeed, as noted before, the continuous mesh-free algorithm for partial differential equations is highly ineffecient!). To overcome this difficulty, they move towards a discrete-time model.

-

For this discrete time model, they discretize the partial differential equation by a Runge-Kutta method. For example, we the simplest Runge-Kutta method is the Euler method, which we can write out as:

-

\[ -u_{n+1} = u_n - \Delta t \mathcal{N}[u_n;\lambda] -\]

-

Now in this case, instead of making a neural network apply to $u(t,x)$ and try to approximate the full continuous function of time, the authors make a neural networks

-

\[ -N_n(x) = u_{n}(x) -\]

-

In this form, the previous loss function can be reinterpreted to be on this discretized set of neural networks. Specifically,

-

\[ -SSE = SSE_n + SSE_f + SSE_b -\]

-

where

+

where $u(0)=u_i$, and thus this cannot happen (with $f$ sufficiently nice). However, if we have another degree of freedom we can ensure that the ODE does not overlap with itself. This is the augmented neural ordinary differential equation.

+

We only need one degree of freedom in order to not collide, so we can do the following. We can add a fake state to the ODE which is zero at every single data point. This then allows this extra dimension to "bump around" as neccessary to let the function be a universal approximator. In code this looks like:

+ + +
+dudt = Chain(...) # Flux neural network
+p,re = Flux.destructure(dudt)
+dudt_(u,p,t) = re(p)(u)
+prob = ODEProblem(dudt_,[u0,0f0],tspan,p)
+augmented_data = vcat(ode_data,zeros(1,size(ode_data,2)))
+
+ + +

Extensions to other Differential Equations

+

While our previous lectures focused on ordinary differential equations, the larger classes of differential equations can also have neural networks, for example:

+ +

For each of these equations, one can come up with an adjoint definition in order to define a backpropogation, or perform direct automatic differentiation of the solver code. One such paper in this area includes neural stochastic differential equations

+

The Universal Ordinary Differential Equation

+

This formulation of the nueral differential equation in terms of a "knowledge-embedded" structure is leading. If we already knew something about the differential equation, could we use that information in the differential equation definition itself? This leads us to the idea of the universal differential equation, which is a differential equation that embeds universal approximators in its definition to allow for learning arbitrary functions as pieces of the differential equation.

+

The best way to describe this object is to code up an example. As our example, let's say that we have a two-state system and know that the second state is defined by a linear ODE. This mean we want to write:

\[ -SSE_n = \sum_i |N_i(x_i) - u_i|^2 +x' = NN(x,y) \]

-

and $SSE_b$ is a loss function of each $u_i$ against the boundary conditions. Here $SSE_f$ is simply the loss on the equation of the discretization:

\[ -SSE_f = \sum_i N_{n+1}(x_i) - N_n(x_i) + \Delta t \mathcal{N}[N_n;\lambda] +y' = p_1 x + p_2 y \]

-

to relax towards the solution of the PDE.

-

Using this style, the authors show that this method is a data-efficient form of learning because it can use the known physics as a form of heavy prior information.

+

We can code this up as follows:

+ + +
+u0 = Float32[0.8; 0.8]
+tspan = (0.0f0,25.0f0)
+
+ann = Chain(Dense(2,10,tanh), Dense(10,1))
+
+p1,re = Flux.destructure(ann)
+p2 = Float32[-2.0,1.1]
+p3 = [p1;p2]
+ps = Flux.params(p3)
+
+function dudt_(du,u,p,t)
+    x, y = u
+    du[1] = re(p[1:41])(u)[1]
+    du[2] = p[end-1]*y + p[end]*x
+end
+prob = ODEProblem(dudt_,u0,tspan,p3)
+concrete_solve(prob,Tsit5(),u0,p3,abstol=1e-8,reltol=1e-6)
+
+ + +

and we can train the system to be stable at 1 as follows:

+ + +
+function predict_adjoint()
+  Array(concrete_solve(prob,Tsit5(),u0,p3,saveat=0.0:0.1:25.0))
+end
+loss_adjoint() = sum(abs2,x-1 for x in predict_adjoint())
+loss_adjoint()
+
+data = Iterators.repeated((), 300)
+opt = ADAM(0.01)
+iter = 0
+cb = function ()
+  global iter += 1
+  if iter % 50 == 0
+    display(loss_adjoint())
+    display(plot(solve(remake(prob,p=p3,u0=u0),Tsit5(),saveat=0.1),ylim=(0,6)))
+  end
+end
+
+# Display the ODE with the current parameter values.
+cb()
+
+Flux.train!(loss_adjoint, ps, data, opt, cb = cb)
+
+ + +

DiffEqFlux.jl supports the wide gambit of possible universal differential equations with combinations of stiffness, delays, stochasticity, etc. It does so by using Julia's language-wide AD tooling, such as ReverseDiff.jl, Tracker.jl, ForwardDiff.jl, and Zygote.jl, along with specializations available whenever adjoint methods are known (and the choice between the two is given to the user).

+

Many of the methods below can be encapsolated as a choice of a universal differential equation and trained with higher order, adaptive, and more efficient methods with DiffEqFlux.jl.

Deep BSDE Methods for High Dimensional Partial Differential Equations

The key paper on deep BSDE methods is this article from PNAS by Jiequn Han, Arnulf Jentzen, and Weinan E. Follow up papers have like this one have identified a larger context in the sense of forward-backwards SDEs for a large class of partial differential equations.

Understanding the Setup for Terminal PDEs

@@ -794,7 +800,7 @@

The Deep BSDE Method

\[ dX_t = \mu(t,X_t) dt + \sigma (t,X_t) dW_t \]

-

with initial condition $X(0)=\zeta$. Previous work has shown that the solution to satisfies the following BSDE:

+

with initial condition $X(0)=\zeta$. Previous work has shown that the solution satisfies the following BSDE:

\[ \begin{align} u(t, &X_t) - u(0,\zeta) = \\ @@ -802,12 +808,12 @@

The Deep BSDE Method

& + \int_0^t \left[\nabla u(s,X_s) \right]^T \sigma (s,X_s) dW_s,\end{align} \]

with terminating condition $g(X_T) = u(X_T,W_T)$.

-

At this point, the authors approximate $\left[\nabla u(s,X_s) \right]^T \sigma (s,X_s)$ and $u(0,\zeta)$ as neural networks. Using the Euler-Maruyama discretization of the stochstic differential equation system, one arrives at a recurrent neural network:

+

At this point, the authors approximate $\left[\nabla u(s,X_s) \right]^T \sigma (s,X_s)$ and $u(0,\zeta)$ as neural networks. Using the Euler-Maruyama discretization of the stochastic differential equation system, one arrives at a recurrent neural network:

Deep BSDE

Julia Implementation

-

A Julia implementation for the deep BSDE method can be found at NeuralNetDiffEq.jl. The examples considered below are part of the standard test suite.

+

A Julia implementation for the deep BSDE method can be found at NeuralPDE.jl. The examples considered below are part of the standard test suite.

Financial Applications of Deep BSDEs: Nonlinear Black-Scholes

-

Now let's look at a few applications which have PDEs that are solved by this method. One set of problems that are solved, given our setup, are Black-Scholes types of equations. Unless a lot of previous literature, this works for a wide class of nonlinear extensions to Black-Scholes with large portfolios. Here, the dimension of the PDE for $V(t,x)$ is the dimension of $x$, where the dimension is the number of stocks in the portfolio that we want to consider. If we want to track 1000 stocks, this means our PDE is 1000 dimensional! Traditional PDE solvers would need around $N^{1000}$ points evolving over time in order to arrive at the solution, which is completely impractical.

+

Now let's look at a few applications which have PDEs that are solved by this method. One set of problems that are solved, given our setup, are Black-Scholes types of equations. Unlike a lot of previous literature, this works for a wide class of nonlinear extensions to Black-Scholes with large portfolios. Here, the dimension of the PDE for $V(t,x)$ is the dimension of $x$, where the dimension is the number of stocks in the portfolio that we want to consider. If we want to track 1000 stocks, this means our PDE is 1000 dimensional! Traditional PDE solvers would need around $N^{1000}$ points evolving over time in order to arrive at the solution, which is completely impractical.

One example of a nonlinear Black-Scholes equation in this form is the Black-Scholes equation with default risk. Here we are adding to the standard model the idea that the companies that we are buying stocks for can default, and thus our valuation has to take into account this default probability as the option will thus become value-less. The PDE that is arrived at is:

\[ \frac{\partial u}{\partial t}(t,x) + \bar{\mu}\cdot \nabla u(t, x) + \frac{\bar{\sigma}^{2}}{2} \sum_{i=1}^{d} \left |x_{i} \right |^{2} \frac{\partial^2 u}{\partial {x_{i}}^2}(t,x) \\ - (1 -\delta )Q(u(t,x))u(t,x) - Ru(t,x) = 0 @@ -833,7 +839,7 @@

Financial Applications of Deep BSDEs: Nonlinear Black-Scholes

The Julia code for this exact problem in 100 dimensions can be found here

Stochastic Optimal Control as a Deep BSDE Application

Another type of problem that fits into this terminal PDE form is the stochastic optimal control problem. The problem is a generalized context to what motivated us before. In this case, there are a set of agents which undergo some known stochastic model. What we want to do is apply some control (push them in some direction) at every single timepoint towards some goal. For example, we have the physics for the dynamics of drone flight, but there's randomness in the wind condition, and so we want to control the engine speeds to move in a certain direction. However, there is a cost associated with controling, and thus the question is how to best balance the use of controls with the natural stochastic evolution.

-

It turns out this is in the same form as the Black-Scholes problem. There is a model evolving forwards, and when we get to the end we know how much everything "cost" because we know if the drone got to the right location and how much energy it took. So in the same sense as Black-Scholes, we can know the value at the end and try and propogate it backwards given the current state of the system $x$, to find out $u(0,\zeta)$, i.e. how should we control right now given the current system is in the state $x = \zeta$. It turns out that the solution of $u(t,x)$ where $u(T,x)=g(x)$ is given and we want to find $u(0,\zeta)$ is given by a partial differential equation which is known as the Hamilton-Jacobi-Bellman equation, which is one of these terminal PDEs that is representable by the deep BSDE method.

+

It turns out this is in the same form as the Black-Scholes problem. There is a model evolving forwards, and when we get to the end we know how much everything "cost" because we know if the drone got to the right location and how much energy it took. So in the same sense as Black-Scholes, we can know the value at the end and try and propogate it backwards given the current state of the system $x$, to find out $u(0,\zeta)$, i.e. how should we control right now given the current system is in the state $x = \zeta$. It turns out that the solution of $u(t,x)$ where $u(T,x)=g(x)$ and we want to find $u(0,\zeta)$ is given by a partial differential equation which is known as the Hamilton-Jacobi-Bellman equation, which is one of these terminal PDEs that is representable by the deep BSDE method.

Take the classical linear-quadratic Gaussian (LQG) control problem in 100 dimensions

\[ dX_t = 2\sqrt{\lambda} c_t dt + \sqrt{2} dW_t @@ -847,7 +853,7 @@

Stochastic Optimal Control as a Deep BSDE Application

\frac{\partial u}{\partial t}(t,x) + \Delta u(t,x) - \lambda \Vert \nabla u(t,x) \Vert^2 = 0 \]

has a solution $u(t,x)$ which at $t=0$ represents the optimal cost of starting from $x$.

-

This PDE can be rewritten into the canonical form of \Cref{pdeform} by setting:

+

This PDE can be rewritten into the canonical form of the deep BSDE method by setting:

\[ \begin{align} \mu &= 0, \\ @@ -857,9 +863,83 @@

Stochastic Optimal Control as a Deep BSDE Application

\]

where $\overline{\sigma} = \sqrt{2}$, T = 1 and $X_0 = (0,. . . , 0) \in R^{100}$.

The Julia code for solving this exact problem in 100 dimensions can be found here

+

Connections of Reservoir Computing to Scientific Machine Learning

+

Reservoir computing techniques are an alternative to the "full" neural network techniques we have previously discussed. However, the process of training neural networks has a few caveats which can cause difficulties in real systems:

+
    +
  1. The tangent space diverges exponentially fast when the system is chaotic, meaning that results of both forward and reverse automatic differentiation techniques (and the related adjoints) are divergent on these kinds of systems.
  2. It is hard for neural networks to represent stiff systems. There are many reasons for this, one being that neural networks tend to drop high frequency behavior.
+

There are ways being investigated to alleviate these issues. For example, shadow adjoints can give a non-divergent average sense of a derivative on ergodic chaotic systems, but is significantly more expensive than the traditional adjoint.

+

To get around these caveats, some research teams have investigated alternatives which do not require gradient-based optimization. The clear frontrunner in this field is a type of architecture called echo state networks. A simplified formulation of an echo state network essentially fixes a neural network that defines a reservoir, i.e.

+

\[ +x_{n+1} = \sigma(W x_n + W_{fb} y_n) +\]

+

\[ +y_n = g(W_{out} x_n) +\]

+

where $W$ and $W_{fb}$ are fixed random matrices that are chosen before the training process, $x_n$ is called the reservoir state, and $y_n$ is the output state for the observables. The idea is to find a projection $W_{out}$ from the high dimensional random reservoir $x$ to model the timeseries by $y$. If the reservoir is a big enough and nonlinear enough random system, there should in theory exist a projection from that random system that matches any potential timeseries. Indeed, one can prove that echo state networks are universal adaptive filters under certain conditions.

+

If $g$ is invertible (and in many cases $g$ is taken to be the identity), then one can directly apply the inversion of $g$ to the data. This turns the training of $W_{out}$, the only non-fixed portion, into a standard least squares regression between the reservoir and the observation series. This is then solved by classical means like SVD factorizations which can be stable in ill-conditioned cases.

+

Echo state networks have been shown to accurately reproduce chaotic attractors which are shown to be hard to train RNNs against. A demonstration via ReservoirComputing.jl clearly highlights this prediction ability:

+

+

However, this methodology still is not tailored to the continuous nature of dynamical systems found in scientific computing. Recent work has extended this methodolgy to allow for a continuous reservoir, i.e. a continuous-time echo state network. It is shown that using the adaptive points of a stiff ODE integrator gives a non-uniform sampling in time that makes it easier to learn stiff equations from less training points, and demonstrates the ability to learn equations where standard physics-informed neural network (PINN) training techniques fail.

+

+

This area of research is still far less developed than PINNs and neural differential equations but shows promise to more easily learn highly stiff and chaotic systems which are seemingly out of reach for these other methods.

+

Automated Equation Discovery: Outputting LaTeX for Dynamical Systems from Data

+

The SINDy algorithm enables data-driven discovery of governing equations from data. It leverages the fact that most physical systems have only a few relevant terms that define the dynamics, making the governing equations sparse in a high-dimensional nonlinear function space. Given a set of observations

+

\[ +\begin{array}{c} +\mathbf{X}=\left[\begin{array}{c} +\mathbf{x}^{T}\left(t_{1}\right) \\ +\mathbf{x}^{T}\left(t_{2}\right) \\ +\vdots \\ +\mathbf{x}^{T}\left(t_{m}\right) +\end{array}\right]=\left[\begin{array}{cccc} +x_{1}\left(t_{1}\right) & x_{2}\left(t_{1}\right) & \cdots & x_{n}\left(t_{1}\right) \\ +x_{1}\left(t_{2}\right) & x_{2}\left(t_{2}\right) & \cdots & x_{n}\left(t_{2}\right) \\ +\vdots & \vdots & \ddots & \vdots \\ +x_{1}\left(t_{m}\right) & x_{2}\left(t_{m}\right) & \cdots & x_{n}\left(t_{m}\right) +\end{array}\right] \\ +\end{array} +\]

+

and a set of derivative observations

+

\[ +\begin{array}{c} +\dot{\mathbf{X}}=\left[\begin{array}{c} +\mathbf{x}^{T}\left(t_{1}\right) \\ +\dot{\mathbf{x}}^{T}\left(t_{2}\right) \\ +\vdots \\ +\mathbf{x}^{T}\left(t_{m}\right) +\end{array}\right]=\left[\begin{array}{cccc} +\dot{x}_{1}\left(t_{1}\right) & \dot{x}_{2}\left(t_{1}\right) & \cdots & \dot{x}_{n}\left(t_{1}\right) \\ +\dot{x}_{1}\left(t_{2}\right) & \dot{x}_{2}\left(t_{2}\right) & \cdots & \dot{x}_{n}\left(t_{2}\right) \\ +\vdots & \vdots & \ddots & \vdots \\ +\dot{x}_{1}\left(t_{m}\right) & \dot{x}_{2}\left(t_{m}\right) & \cdots & \dot{x}_{n}\left(t_{m}\right) +\end{array}\right] +\end{array} +\]

+

we can evaluate the observations in a basis $\Theta(X)$:

+

\[ +\Theta(\mathbf{X})=\left[\begin{array}{llllllll} +1 & \mathbf{X} & \mathbf{X}^{P_{2}} & \mathbf{X}^{P_{3}} & \cdots & \sin (\mathbf{X}) & \cos (\mathbf{X}) & \cdots +\end{array}\right] +\]

+

where $X^{P_i}$ stands for all $P_i$th order polynomial terms. For example,

+

\[ +\mathbf{X}^{P_{2}}=\left[\begin{array}{cccccc} +x_{1}^{2}\left(t_{1}\right) & x_{1}\left(t_{1}\right) x_{2}\left(t_{1}\right) & \cdots & x_{2}^{2}\left(t_{1}\right) & \cdots & x_{n}^{2}\left(t_{1}\right) \\ +x_{1}^{2}\left(t_{2}\right) & x_{1}\left(t_{2}\right) x_{2}\left(t_{2}\right) & \cdots & x_{2}^{2}\left(t_{2}\right) & \cdots & x_{n}^{2}\left(t_{2}\right) \\ +\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ +x_{1}^{2}\left(t_{m}\right) & x_{1}\left(t_{m}\right) x_{2}\left(t_{m}\right) & \cdots & x_{2}^{2}\left(t_{m}\right) & \cdots & x_{n}^{2}\left(t_{m}\right) +\end{array}\right] +\]

+

Using these matrices, SINDy finds this sparse basis $\mathbf{\Xi}$ over a given candidate library $\mathbf{\Theta}$ by solving the sparse regression problem $\dot{X} =\mathbf{\Theta}\mathbf{\Xi}$ with $L_1$ regularization, i.e. minimizing the objective function $\left\Vert \mathbf{\dot{X}} - \mathbf{\Theta}\mathbf{\Xi} \right\Vert_2 + \lambda \left\Vert \mathbf{\Xi}\right\Vert_1$. This method and other variants of SInDy, along with specialized optimizers for the LASSO $L_1$ optimization problem, have been implemented in packages like DataDrivenDiffEq.jl and pysindy. The result of these methods is LaTeX for the missing dynamical system.

+

Notice that to use this method, derivative data $\dot{X}$ is required. While in most publications on the subject this information is assumed. To find this, $\dot{X}$ is calculated directly from the time series $X$ by fitting a cubic spline and taking the approximated derivatives at the observation points. However, for this estimation to be stable one needs a fairly dense timeseries for the interpolation. To alleviate this issue, the universal differential equations work estimates terms of partially described models and then uses the neural network as an oracle for the derivative values to learn from subsets of the dynamical system. This allows for the neural network's training to smooth out the derivative estimate between points while incorporating extra scientific information.

+

Other ways are being investigated for incorporating deep learning into the model discovery process. For example, extensions have been investigated where elements are defined by neural networks reprsenting a basis of the Koopman operator. Additionally, much work is going on in improving the efficiency of the symbolic regression methods themselves, and making the methods implicit and parallel.

Surrogate Acceleration Methods

Another approach for mixing neural networks with differential equations is as a surrogate method. These methods are more mathematically trivial than the previous ideas, but can still achieve interesting results. A full example is explained in this video.

-

Say we have some function $g(p)$ which depends on a solution to a differential equation $u(t;p)$ and choices of parameters $p$. Computationally how we evaluate this equation is we do the following:

+

Say we have some function $g(p)$ which depends on a solution to a differential equation $u(t;p)$ and choices of parameters $p$. Computationally how we evaluate this function is we do the following:

  • Solve the differential equation with parameters $p$

  • @@ -868,20 +948,19 @@

    Surrogate Acceleration Methods

However, this process is computationally expensive since it requires the numerical solution of $u$ for every evaluation. Thus, one can look at this setup and see $g(p)$ itself is a nonlinear function. The idea is to train a neural network to be the function $g(p)$, i.e. directly put in $p$ and return the appropriate value without ever solving the differential equation.

The video highlights an important fact about this method: it can be computationally expensive to train this kind of surrogate since many data points $(p,g(p))$ are required. In fact, many more data points than you might use. However, after training, the surrogate network for $g(p)$ can be a lot faster than the original simulation-based approach. This means that this is a method for accelerating real-time solutions by doing upfront computations. The total compute time will always be more, but in some sense the cost is amortized or shifted to be done before hand, so that the model does not need to be simulated on the fly. This can allow for things like computationally expensive models of drone flight to be used in a real-time controller.

-

This technique goes a long way back, but some recent examples of this have been shown. For example, there's this paper which "accelerated" the solution of the 3-body problem using a neural network surrogate trained over a few days to get a 1 million times acceleration (after generating many points beforehand of course! In the paper, notice that it took 10 days to generate the training dataset). Additionally, there is this deep learning trebuchet example which showcased that inverse problems, i.e. control or finding parameters, can be completely encapsolated as a $g(p)$ and learned with sufficient data.

- - - -
- +

This technique goes a long way back, but some recent examples of this have been shown. For example, there's this paper which "accelerated" the solution of the 3-body problem using a neural network surrogate trained over a few days to get a 1 million times acceleration (after generating many points beforehand of course! In the paper, notice that it took 10 days to generate the training dataset). Additionally, there is this deep learning trebuchet example which showcased that inverse problems, i.e. control or finding parameters, can be completely encapsulated as a $g(p)$ and learned with sufficient data.

+
+
- +
+ + diff --git a/lecture15/diffeq_machine_learning.jmd b/lecture15/diffeq_machine_learning.jmd index 465e7433..adead11f 100644 --- a/lecture15/diffeq_machine_learning.jmd +++ b/lecture15/diffeq_machine_learning.jmd @@ -1,7 +1,7 @@ --- title: Mixing Differential Equations and Neural Networks for Physics-Informed Learning author: Chris Rackauckas -date: December 11th, 2020 +date: December 13th, 2020 --- Given this background in both neural network and differential equation modeling, @@ -94,7 +94,7 @@ $$y' = p_1 x + p_2 y$$ We can code this up as follows: -```julia +```julia;eval=false u0 = Float32[0.8; 0.8] tspan = (0.0f0,25.0f0) @@ -116,7 +116,7 @@ concrete_solve(prob,Tsit5(),u0,p3,abstol=1e-8,reltol=1e-6) and we can train the system to be stable at 1 as follows: -```julia +```julia;eval=false function predict_adjoint() Array(concrete_solve(prob,Tsit5(),u0,p3,saveat=0.0:0.1:25.0)) end @@ -140,8 +140,8 @@ cb() Flux.train!(loss_adjoint, ps, data, opt, cb = cb) ``` -DiffEqFlux.jl is the first library to support the wide gambit of possible -universal differential equations with combinations of stiffness, delays, stochasticity, +DiffEqFlux.jl supports the wide gambit of possible universal differential +equations with combinations of stiffness, delays, stochasticity, etc. It does so by using Julia's language-wide AD tooling, such as ReverseDiff.jl, Tracker.jl, ForwardDiff.jl, and Zygote.jl, along with specializations available whenever adjoint methods are known (and the choice between the two is given to @@ -351,16 +351,151 @@ The Julia code for solving this exact problem in 100 dimensions [can be found he ## Connections of Reservoir Computing to Scientific Machine Learning - - - -ADD STUFF! - - - +Reservoir computing techniques are an alternative to the "full" neural network +techniques we have previously discussed. However, the process of training +neural networks has a few caveats which can cause difficulties in real systems: + +1. The tangent space diverges exponentially fast when the system is chaotic, + meaning that results of both forward and reverse automatic differentiation + techniques (and the related adjoints) are divergent on these kinds of systems. +2. It is hard for neural networks to represent stiff systems. There are many + reasons for this, one being that neural networks [tend to drop high frequency behavior](https://arxiv.org/abs/1806.08734) + +There are ways being investigated to alleviate these issues. For example, +[shadow adjoints](https://www.sciencedirect.com/science/article/pii/S0021999117304783) +can give a non-divergent average sense of a derivative on ergodic chaotic systems, +but is significantly more expensive than the traditional adjoint. + +To get around these caveats, some research teams have investigated alternatives +which do not require gradient-based optimization. The clear frontrunner in this +field is a type of architecture called [echo state networks](http://www.scholarpedia.org/article/Echo_state_network). +A simplified formulation of an echo state network essentially fixes a neural +network that defines a reservoir, i.e. + +$$x_{n+1} = \sigma(W x_n + W_{fb} y_n)$$ +$$y_n = g(W_{out} x_n)$$ + +where $W$ and $W_{fb}$ are fixed random matrices that are chosen before the +training process, $x_n$ is called the reservoir state, and $y_n$ is the output +state for the observables. The idea is to find a projection $W_{out}$ from the +high dimensional random reservoir $x$ to model the timeseries by $y$. 
If the +reservoir is a big enough and nonlinear enough random system, there should in +theory exist a projection from that random system that matches any potential +timeseries. Indeed, one can prove that echo state networks are universal adaptive +filters under certain conditions. + +If $g$ is invertible (and in many cases $g$ is taken to be the identity), then +one can directly apply the inversion of $g$ to the data. This turns the training +of $W_{out}$, the only non-fixed portion, into a standard least squares regression +between the reservoir and the observation series. This is then solved by classical +means like SVD factorizations which can be stable in ill-conditioned cases. + +Echo state networks have been shown to [accurately reproduce chaotic attractors](https://arxiv.org/pdf/1906.08829.pdf) +which are shown to be hard to train RNNs against. A demonstration via +[ReservoirComputing.jl](https://github.com/SciML/ReservoirComputing.jl) clearly +highlights this prediction ability: + +![](https://user-images.githubusercontent.com/10376688/81470264-42f5c800-91ea-11ea-98a2-a8a8d7d96155.png) +![](https://user-images.githubusercontent.com/10376688/81470281-5a34b580-91ea-11ea-9eea-d2b266da19f4.png) + +However, this methodology still is not tailored to the continuous nature of +dynamical systems found in scientific computing. Recent work has extended +this methodolgy to allow for a continuous reservoir, i.e. a +[continuous-time echo state network](https://arxiv.org/abs/2010.04004). It is +shown that using the adaptive points of a stiff ODE integrator gives a non-uniform +sampling in time that makes it easier to learn stiff equations from less training +points, and demonstrates the ability to learn equations where standard physics-informed +neural network (PINN) training techniques fail. + +![](https://user-images.githubusercontent.com/1814174/102009514-dc97d180-3d05-11eb-9542-bcd8d0f8b3a4.PNG) + +This area of research is still far less developed than PINNs and neural differential +equations but shows promise to more easily learn highly stiff and chaotic systems +which are seemingly out of reach for these other methods. ## Automated Equation Discovery: Outputting LaTeX for Dynamical Systems from Data +[The SINDy algorithm](https://www.pnas.org/content/116/45/22445) enables +data-driven discovery of governing equations from data. It leverages the fact +that most physical systems have only a few relevant terms that define the +dynamics, making the governing equations sparse in a high-dimensional nonlinear +function space. 
Given a set of observations + +$$\begin{array}{c} +\mathbf{X}=\left[\begin{array}{c} +\mathbf{x}^{T}\left(t_{1}\right) \\ +\mathbf{x}^{T}\left(t_{2}\right) \\ +\vdots \\ +\mathbf{x}^{T}\left(t_{m}\right) +\end{array}\right]=\left[\begin{array}{cccc} +x_{1}\left(t_{1}\right) & x_{2}\left(t_{1}\right) & \cdots & x_{n}\left(t_{1}\right) \\ +x_{1}\left(t_{2}\right) & x_{2}\left(t_{2}\right) & \cdots & x_{n}\left(t_{2}\right) \\ +\vdots & \vdots & \ddots & \vdots \\ +x_{1}\left(t_{m}\right) & x_{2}\left(t_{m}\right) & \cdots & x_{n}\left(t_{m}\right) +\end{array}\right] \\ +\end{array}$$ + +and a set of derivative observations + +$$\begin{array}{c} +\dot{\mathbf{X}}=\left[\begin{array}{c} +\mathbf{x}^{T}\left(t_{1}\right) \\ +\dot{\mathbf{x}}^{T}\left(t_{2}\right) \\ +\vdots \\ +\mathbf{x}^{T}\left(t_{m}\right) +\end{array}\right]=\left[\begin{array}{cccc} +\dot{x}_{1}\left(t_{1}\right) & \dot{x}_{2}\left(t_{1}\right) & \cdots & \dot{x}_{n}\left(t_{1}\right) \\ +\dot{x}_{1}\left(t_{2}\right) & \dot{x}_{2}\left(t_{2}\right) & \cdots & \dot{x}_{n}\left(t_{2}\right) \\ +\vdots & \vdots & \ddots & \vdots \\ +\dot{x}_{1}\left(t_{m}\right) & \dot{x}_{2}\left(t_{m}\right) & \cdots & \dot{x}_{n}\left(t_{m}\right) +\end{array}\right] +\end{array}$$ + +we can evaluate the observations in a basis $\Theta(X)$: + +$$\Theta(\mathbf{X})=\left[\begin{array}{llllllll} +1 & \mathbf{X} & \mathbf{X}^{P_{2}} & \mathbf{X}^{P_{3}} & \cdots & \sin (\mathbf{X}) & \cos (\mathbf{X}) & \cdots +\end{array}\right]$$ + +where $X^{P_i}$ stands for all $P_i$th order polynomial terms. For example, + +$$\mathbf{X}^{P_{2}}=\left[\begin{array}{cccccc} +x_{1}^{2}\left(t_{1}\right) & x_{1}\left(t_{1}\right) x_{2}\left(t_{1}\right) & \cdots & x_{2}^{2}\left(t_{1}\right) & \cdots & x_{n}^{2}\left(t_{1}\right) \\ +x_{1}^{2}\left(t_{2}\right) & x_{1}\left(t_{2}\right) x_{2}\left(t_{2}\right) & \cdots & x_{2}^{2}\left(t_{2}\right) & \cdots & x_{n}^{2}\left(t_{2}\right) \\ +\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ +x_{1}^{2}\left(t_{m}\right) & x_{1}\left(t_{m}\right) x_{2}\left(t_{m}\right) & \cdots & x_{2}^{2}\left(t_{m}\right) & \cdots & x_{n}^{2}\left(t_{m}\right) +\end{array}\right]$$ + +Using these matrices, SINDy finds this sparse basis $\mathbf{\Xi}$ over a +given candidate library $\mathbf{\Theta}$ by solving the sparse regression +problem $\dot{X} =\mathbf{\Theta}\mathbf{\Xi}$ with $L_1$ regularization, i.e. +minimizing the objective function +$\left\Vert \mathbf{\dot{X}} - \mathbf{\Theta}\mathbf{\Xi} \right\Vert_2 + \lambda \left\Vert \mathbf{\Xi}\right\Vert_1$. +This method and other variants of SInDy, along with specialized optimizers for +the LASSO $L_1$ optimization problem, have been implemented in packages like +[DataDrivenDiffEq.jl](https://github.com/SciML/DataDrivenDiffEq.jl) and +[pysindy](https://github.com/dynamicslab/pysindy). The result of these methods +is LaTeX for the missing dynamical system. + +Notice that to use this method, derivative data $\dot{X}$ is required. While in +most publications on the subject this information is assumed. To find this, +$\dot{X}$ is calculated directly from the time series $X$ by fitting a cubic +spline and taking the approximated derivatives at the observation points. However, +for this estimation to be stable one needs a fairly dense timeseries for the +interpolation. 
To alleviate this issue, the +[universal differential equations work](https://arxiv.org/abs/2001.04385) +estimates terms of partially described models and then uses the neural network +as an oracle for the derivative values to learn from subsets of the dynamical +system. This allows for the neural network's training to smooth out the derivative +estimate between points while incorporating extra scientific information. + +Other ways are being investigated for incorporating deep learning into the model +discovery process. For example, extensions have been investigated where +[elements are defined by neural networks reprsenting a basis of the Koopman operator](https://www.nature.com/articles/s41467-018-07210-0). +Additionally, much work is going on in improving the efficiency of the symbolic +regression methods themselves, and making the methods +[implicit and parallel](https://royalsocietypublishing.org/doi/full/10.1098/rspa.2020.0279). + ## Surrogate Acceleration Methods Another approach for mixing neural networks with differential equations is as a