## 1. Residual Neural Network


### 1.1 Inspiration

Neural ordinary differential equations (neural ODEs) were first proposed in the 2018 paper "Neural Ordinary Differential Equations", which won a Best Paper Award at NeurIPS that same year. The inspiration came from a specific neural network model, the residual neural network (ResNet), introduced by a research team at Microsoft in 2015. Unlike traditional neural networks, ResNet incorporates residual connections: the input of each layer is added back to that layer's output.
$$
h_{t+1} = h_t + \mathrm{ReLU}(W_t h_t + b_t)\tag{1}
$$
$$
h_{t+1} = h_t + f(h_t, \theta_t)\tag{2}
$$

Here, the second term in equation $(1)$ has the form of a typical neural network layer: it combines the current state $h_t$ with the weight matrix $W_t$ and bias vector $b_t$ of layer $t$, then applies the non-linear activation function $\mathrm{ReLU}$. The first term, $h_t$, is the residual (skip) connection, which carries the hidden state at layer $t$ forward unchanged. Together they give the state at layer $t+1$.
We can write equation $(1)$ in the form of equation $(2)$, where $f$ is a function of the state at layer $t$ and the parameters $\theta_t$ of that layer. Rearranging equation $(2)$ gives:
$$
h_{t+1} - h_t = f(h_t, \theta_t)
$$
$$
\frac{h_{t+1} - h_t }{1}= f(h_t, \theta_t)
$$
$$
\frac{h_{t+\Delta t} - h_t }{\Delta t}\Bigg|_{\Delta t=1}= f(h_t, \theta_t)
$$

Such a form inspires us to make $\Delta t$ infinitesimally small, allowing us to transform this discrete form into a continuous one:
$$
\lim_{\Delta t\to 0}\frac{h_{t+\Delta t} - h_t }{\Delta t}= f(h_t, \theta_t,t)
$$
This motivates us to consider ordinary differential equations, which can be written as:
$$
\frac{d h(t) }{dt}= f(h(t), \theta,t)
$$
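
The step from equation $(2)$ to the ODE can be checked numerically. The following is a minimal sketch, not taken from the original paper: it assumes a scalar hidden state and a scalar stand-in for the residual branch, and the names `residual_forward`, `euler_forward` and `euler_shared` are hypothetical helpers introduced only for illustration. It shows that a stack of residual updates is exactly a forward Euler scheme with step size $\Delta t = 1$, and that shrinking the step size (with shared parameters) approaches the continuous solution.

```python
import math

def f(h, w, b):
    # Scalar stand-in for the residual branch ReLU(W h + b) (an assumption).
    return max(0.0, w * h + b)

def residual_forward(h0, layers):
    # Equation (2): h_{t+1} = h_t + f(h_t, theta_t), one update per layer.
    h = h0
    for w, b in layers:
        h = h + f(h, w, b)
    return h

def euler_forward(h0, layers, dt):
    # Forward Euler: h(t + dt) = h(t) + dt * f(h(t), theta_t).
    h = h0
    for w, b in layers:
        h = h + dt * f(h, w, b)
    return h

# Three layers with their own (w, b): the residual network and the
# Euler scheme with dt = 1 perform the same arithmetic.
layers = [(0.5, 0.1), (-0.3, 0.2), (0.8, -0.1)]
h_res = residual_forward(1.0, layers)
h_euler = euler_forward(1.0, layers, dt=1.0)

def euler_shared(h0, w, b, t_end, n_steps):
    # Same parameters at every step: more steps means a smaller dt.
    return euler_forward(h0, [(w, b)] * n_steps, t_end / n_steps)

# With w = 0.5, b = 0 and h > 0 the ReLU is inactive, so dh/dt = 0.5 * h
# and the exact solution at t = 1 is exp(0.5).
exact = math.exp(0.5)
errors = [abs(euler_shared(1.0, 0.5, 0.0, 1.0, n) - exact)
          for n in (1, 10, 100, 1000)]
print(h_res, h_euler, errors)
```

The error list shrinks as the number of steps grows, which is the discrete-to-continuous limit taken above.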

### 1.2 Comparing Neural ODEs with ResNet

The inspiration for neural ODEs thus originates from ResNet: the discrete sequence of layer updates is replaced by a continuous transformation. In a ResNet with $L$ layers, the transition from one layer's state to the next is discrete, and each layer has its own function that alters the current state. An ODE network instead defines a continuous vector field, which can be viewed as a neural network with infinitely many layers. The change of state is then interpreted as a flow within this vector field.

<p align="center">
<img src='./images/ResNetvsODENet.png' width="500"></img>
</p>
<p align='center'>Figure 1</p>

## 2. Backpropagation in Neural ODEs

Similar to classical neural networks, we aim to optimize the parameters of the neural ODE model so that it fits the data better. We achieve this by minimizing a cost function. Viewing the model as a vector field, an input $z(t_0)$ produces the output $z(t_1)$, with:
$$
z(t_1)=\mathrm{ODESolve}(z(t_0),f,\theta(t),t_0,t_1)
$$
The cost function is therefore applied to $z(t_1)$. The next step is to develop a method for updating the parameters $\theta$, which is equivalent to computing the gradient of the cost function with respect to $\theta$. For traditional neural networks, the cost of backpropagation grows with the number of layers. For neural ODEs, we cannot simply reuse the method used for residual networks, because backpropagating through the internal operations of an ODE solver would be very inefficient. This motivates a method that computes the gradient without needing the explicit solution.
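
To make the $\mathrm{ODESolve}$ call concrete, here is a minimal sketch of a fixed-step solver. It is an illustration under stated assumptions rather than the solver used in the original paper: the classical fourth-order Runge-Kutta scheme is swapped in for simplicity, the state is a scalar, $\theta$ is constant, and the function name `odesolve`, its `n_steps` argument and the test dynamics $f = \theta z$ are all hypothetical choices.

```python
import math

def odesolve(z0, f, theta, t0, t1, n_steps=100):
    """Fixed-step RK4 integration of dz/dt = f(z, theta, t) from t0 to t1."""
    dt = (t1 - t0) / n_steps  # negative when t1 < t0, so it also runs backward
    z, t = z0, t0
    for _ in range(n_steps):
        k1 = f(z, theta, t)
        k2 = f(z + 0.5 * dt * k1, theta, t + 0.5 * dt)
        k3 = f(z + 0.5 * dt * k2, theta, t + 0.5 * dt)
        k4 = f(z + dt * k3, theta, t + dt)
        z = z + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t = t + dt
    return z

def f(z, theta, t):
    # Assumed toy dynamics with a known solution: z(t) = z(0) * exp(theta * t).
    return theta * z

z1 = odesolve(1.0, f, theta=-0.5, t0=0.0, t1=2.0)
exact = math.exp(-1.0)

# The cost is evaluated on the solver output z(t1) only.
target = 0.2
cost = 0.5 * (z1 - target) ** 2

# Swapping the time limits integrates backward and recovers z(t0).
z0_back = odesolve(z1, f, theta=-0.5, t0=2.0, t1=0.0)
print(z1, exact, cost, z0_back)
```

The backward call matters later: the adjoint method relies on being able to run an ODE solver from $t_1$ back to $t_0$.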

### 2.1 Adjoint Sensitivity Method

The adjoint sensitivity method was first proposed by Pontryagin in 1962. It provides a way to find the desired gradients by solving another ODE. Consequently, we do not need to store all the intermediate states for backpropagation, which reduces the memory cost.

One important point to note is that in the adjoint method, we treat $\theta$ as a constant, meaning it does not change over time. Consequently, this method can only be used to solve specific classes of neural ODEs.

To minimize the cost function $\mathcal{L}$, we introduce the quantity $a(t) = \frac{\partial \mathcal{L}}{\partial z(t)}$, called the *adjoint*. It simplifies the computation of the gradients with respect to $z(t_0)$, $\theta$, $t_0$ and $t_1$. We first derive the gradient with respect to $z(t_0)$ in detail. By the chain rule:
$$
\frac{\partial \mathcal{L}}{\partial z(t)} = \frac{\partial \mathcal{L}}{\partial z(t+\epsilon)}\frac{\partial z(t+\epsilon)}{\partial z(t)}=a(t+\epsilon)\frac{\partial z(t+\epsilon)}{\partial z(t)}\tag{3}
$$
Additionally, the solution $z(t)$ of the ODE can be obtained by integration, which means:
$$
z(t+\epsilon) = z(t)+\int_{t}^{t+\epsilon}f(z(s),\theta,s)\,ds
$$
Taking the partial derivative of both sides with respect to $z(t)$, we have:
$$
\frac{\partial z(t+\epsilon)}{\partial z(t)} = 1+\frac{\partial}{\partial z(t)}\int_{t}^{t+\epsilon}f(z(s),\theta,s)\,ds\tag{4}
$$
Plugging $(4)$ into $(3)$, we get:
$$
a(t)=\frac{\partial \mathcal{L}}{\partial z(t)} = a(t+\epsilon)+a(t+\epsilon)\frac{\partial}{\partial z(t)}\int_{t}^{t+\epsilon}f(z(s),\theta,s)\,ds
$$

With these equations, now we can try to find the value for $\frac{da(t)}{dt}$:
$$
\begin{align}
\frac{da(t)}{dt} &= \lim_{\epsilon \to 0^{+}}\frac{a(t+\epsilon)-a(t)}{\epsilon}\notag\\
&=\lim_{\epsilon \to 0^{+}}\frac{-a(t+\epsilon)\frac{\partial}{\partial z(t)}\int_{t}^{t+\epsilon}f(z(s),\theta,s)\,ds}{\epsilon}\notag\\
&=-a(t)\frac{\partial f(z(t),\theta,t)}{\partial z(t)}\tag{5}
\end{align}
$$
Since the loss function depends only on $z(t_1)$, the value $a(t_1)$ is known, and we turn the problem into an initial value problem (IVP) that is solved backward in time from $t_1$:
$$\frac{da(t)}{dt} = -a(t)\frac{\partial f(z(t),\theta,t)}{\partial z(t)}\quad\quad\quad a(t_1) = \frac{d\mathcal{L}}{dz(t_1)}
$$

One can then compute:
$$
\frac{\partial \mathcal{L}}{\partial z(t_0)} =a(t_1)- \int_{t_1}^{t_0}a(s)\frac{\partial f(z(s),\theta,s)}{\partial z(s)}\,ds
$$
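
The IVP above can be tested on a small example. The sketch below is an illustration under stated assumptions, not the reference implementation: it uses a scalar state, the assumed dynamics $f(z,\theta,t)=\tanh(\theta z)$, a squared-error loss with a made-up target, and a hand-written fixed-step RK4 helper `rk4` (hypothetical). It integrates $z$ and $a$ together from $t_1$ back to $t_0$ and compares $a(t_0)$ with a central finite-difference estimate of $\partial\mathcal{L}/\partial z(t_0)$.

```python
import math

def rk4(state, rhs, t0, t1, n_steps=200):
    # Fixed-step RK4 on a list-valued state; dt < 0 integrates backward in time.
    dt = (t1 - t0) / n_steps
    s, t = list(state), t0
    for _ in range(n_steps):
        k1 = rhs(s, t)
        k2 = rhs([x + 0.5 * dt * k for x, k in zip(s, k1)], t + 0.5 * dt)
        k3 = rhs([x + 0.5 * dt * k for x, k in zip(s, k2)], t + 0.5 * dt)
        k4 = rhs([x + dt * k for x, k in zip(s, k3)], t + dt)
        s = [x + dt * (p + 2 * q + 2 * r + w) / 6
             for x, p, q, r, w in zip(s, k1, k2, k3, k4)]
        t += dt
    return s

THETA, T0, T1, TARGET = 0.7, 0.0, 1.0, 0.3  # assumed toy values

def f(z, t):
    return math.tanh(THETA * z)

def df_dz(z, t):
    return THETA * (1.0 - math.tanh(THETA * z) ** 2)

def forward(z0):
    return rk4([z0], lambda s, t: [f(s[0], t)], T0, T1)[0]

def loss(z0):
    return 0.5 * (forward(z0) - TARGET) ** 2

def grad_z0_adjoint(z0):
    z1 = forward(z0)
    a1 = z1 - TARGET  # a(t1) = dL/dz(t1) for the squared-error loss

    def rhs(s, t):
        z, a = s
        # z is re-integrated backward so no forward states need to be stored.
        return [f(z, t), -a * df_dz(z, t)]

    _, a0 = rk4([z1, a1], rhs, T1, T0)  # solve from t1 back to t0
    return a0

z0 = 1.0
g_adj = grad_z0_adjoint(z0)
eps = 1e-5
g_fd = (loss(z0 + eps) - loss(z0 - eps)) / (2 * eps)
print(g_adj, g_fd)
```

The two numbers agree up to discretization and finite-difference error, which supports the sign and the integration limits in the formula for $\partial \mathcal{L}/\partial z(t_0)$.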

For the gradient with respect to $\theta$ and $t$, similarly, we define two more adjoints:
$$a_{\theta}(t):=\frac{\partial\mathcal{L}}{\partial \theta(t)}
\quad\quad\quad a_{t}(t):=-\frac{\partial\mathcal{L}}{\partial t}$$
(Note that the minus sign in the second definition is added to simplify later calculations.)

Using the same method, we obtain the following two IVPs. For $a_\theta(t)$, to simplify the computation, we assume that $a_\theta(t_1)=0$:
$$
a_\theta(t_1)=0
$$
$$
\frac{da_\theta}{dt}=-a(t)\frac{\partial f}{\partial \theta}\tag{6}
$$

And for $a_t(t)$:
$$
a_t(t_1) = -\frac{d\mathcal{L}}{dz(t_1)}\frac{dz(t_1)}{dt_1}=-a(t_1)f(z(t_1),\theta, t_1)
$$
$$
\frac{da_t}{dt}=-a(t)\frac{\partial f}{\partial t}\tag{7}
$$


Equations $(5)$, $(6)$ and $(7)$ have a similar pattern, so we can define $a_{aug}:=[a,a_\theta,a_t]^T$. This lets us concatenate the three adjoints, giving:
$$
\frac{da_{aug}}{dt} = -[a\frac{\partial f}{\partial z},a\frac{\partial f}{\partial \theta},a\frac{\partial f}{\partial t}]^T(t)
$$
$$
a_{aug}(t_1)=[\frac{d\mathcal{L}}{dz(t_1)},0,-a(t_1)f(z(t_1),\theta, t_1)]^T
$$
Thus we only need to solve this one initial value problem. A single call to the ODE solver on the augmented adjoint yields all the necessary gradients with respect to $z(t_0)$, $\theta$, $t_0$ and $t_1$.
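
As a rough check of the augmented system, the sketch below solves for $z$, $a$ and $a_\theta$ in one backward pass and compares the results with finite differences. It rests on several assumptions that are not part of the derivation above: a scalar state and scalar $\theta$, the toy dynamics $f=\tanh(\theta z)$, a squared-error loss, and a hypothetical fixed-step RK4 helper `rk4`. The time adjoint $a_t$ is left out because this $f$ has no explicit time dependence and the endpoints are held fixed.

```python
import math

def rk4(state, rhs, t0, t1, n_steps=200):
    # Fixed-step RK4 on a list-valued state; dt < 0 integrates backward in time.
    dt = (t1 - t0) / n_steps
    s, t = list(state), t0
    for _ in range(n_steps):
        k1 = rhs(s, t)
        k2 = rhs([x + 0.5 * dt * k for x, k in zip(s, k1)], t + 0.5 * dt)
        k3 = rhs([x + 0.5 * dt * k for x, k in zip(s, k2)], t + 0.5 * dt)
        k4 = rhs([x + dt * k for x, k in zip(s, k3)], t + dt)
        s = [x + dt * (p + 2 * q + 2 * r + w) / 6
             for x, p, q, r, w in zip(s, k1, k2, k3, k4)]
        t += dt
    return s

T0, T1, TARGET = 0.0, 1.0, 0.3  # assumed toy values

def f(z, theta):
    return math.tanh(theta * z)

def forward(z0, theta):
    return rk4([z0], lambda s, t: [f(s[0], theta)], T0, T1)[0]

def loss(z0, theta):
    return 0.5 * (forward(z0, theta) - TARGET) ** 2

def adjoint_grads(z0, theta):
    z1 = forward(z0, theta)

    def rhs(s, t):
        z, a, a_theta = s
        sech2 = 1.0 - math.tanh(theta * z) ** 2
        df_dz = theta * sech2      # partial f / partial z
        df_dtheta = z * sech2      # partial f / partial theta
        # Equations (5) and (6), stacked with the state z itself.
        return [f(z, theta), -a * df_dz, -a * df_dtheta]

    # Initial values at t1: z(t1), a(t1) = dL/dz(t1), a_theta(t1) = 0.
    _, a0, a_theta0 = rk4([z1, z1 - TARGET, 0.0], rhs, T1, T0)
    return a0, a_theta0

z0, theta, eps = 1.0, 0.7, 1e-5
g_z0, g_theta = adjoint_grads(z0, theta)
fd_z0 = (loss(z0 + eps, theta) - loss(z0 - eps, theta)) / (2 * eps)
fd_theta = (loss(z0, theta + eps) - loss(z0, theta - eps)) / (2 * eps)
print(g_z0, fd_z0, g_theta, fd_theta)
```

A single backward solve returns both gradients, and only the final state $z(t_1)$ has to be kept from the forward pass.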