## 1. Residual Neural Network


### 1.1 Inspiration

The method of neural ODEs (neural ordinary differential equations) was first proposed in the 2018 paper "Neural Ordinary Differential Equations", which won a Best Paper Award at NeurIPS that same year. The inspiration for neural ODEs came from a specific neural network model, the residual neural network (ResNet), introduced by a research team at Microsoft in 2015. Unlike traditional neural networks, ResNet uses residual (skip) connections: each layer adds its input back to its output, so the layer only has to learn the change to the hidden state:
$$
h_{t+1} = h_t + \mathrm{ReLU}(W_t h_t + b_t)\tag{1}
$$
$$
h_{t+1} = h_t + f(h_t, \theta_t)\tag{2}
$$

Here, the second term in equation $(1)$ has the form of a typical neural network layer: it is computed from the current state $h_t$, the weight matrix $W_t$ and bias vector $b_t$ of layer $t$, and the non-linear activation function $\mathrm{ReLU}$. The first term of $(1)$, $h_t$, is the identity (skip) term, which carries the hidden state of layer $t$ forward unchanged, so the second term is the residual that the layer has to learn.
We can write equation $(1)$ in the form of equation $(2)$, where $f$ is a function of the state at layer $t$ and the parameters $\theta_t$ of that layer. Rearranging equation $(2)$, we get:
$$
h_{t+1} - h_t = f(h_t, \theta_t)
$$
$$
\frac{h_{t+1} - h_t}{1} = f(h_t, \theta_t)
$$
$$
\frac{h_{t+\Delta t} - h_t}{\Delta t}\Bigg|_{\Delta t=1} = f(h_t, \theta_t)
$$

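This form suggests letting $\Delta t$ become infinitesimally small, which turns the discrete update into a continuous one. The layer index $t$ becomes a continuous variable, the state becomes a function $h(t)$, and $t$ is passed to $f$ explicitly so that the dynamics can still vary with depth:
$$
\lim_{\Delta t\to 0}\frac{h(t+\Delta t) - h(t)}{\Delta t} = f(h(t), \theta, t)
$$
The left-hand side is the definition of the derivative of $h$ with respect to $t$, so we arrive at an ordinary differential equation:
$$
\frac{d h(t)}{dt} = f(h(t), \theta, t)
$$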
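
As a rough illustration of this correspondence, the sketch below treats a residual layer as one explicit Euler step of the ODE with step size $\Delta t = 1$. It uses a scalar state for brevity; the scalar parameters `w` and `b` and the helper names `residual_step` and `euler_step` are assumptions made for this example and do not come from any particular library.

```python
def relu(x):
    """ReLU activation for a scalar."""
    return max(0.0, x)


def f(h, w, b):
    """Hypothetical scalar layer function f(h, theta) = ReLU(w * h + b)."""
    return relu(w * h + b)


def residual_step(h, w, b):
    """One residual layer, as in equation (1): h_{t+1} = h_t + f(h_t, theta_t)."""
    return h + f(h, w, b)


def euler_step(h, w, b, dt):
    """One explicit Euler step of dh/dt = f(h, theta): h(t + dt) ~ h(t) + dt * f(h(t), theta)."""
    return h + dt * f(h, w, b)


# Illustrative values only.
h0, w, b = 1.0, 0.5, 0.1
res = residual_step(h0, w, b)
eul = euler_step(h0, w, b, dt=1.0)
print(res, eul)  # the two updates coincide when dt = 1
```

With `dt = 1.0` the two functions compute the same update, which is the sense in which a residual layer can be read as a coarse discretization of the ODE.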

### 1.2 Comparing Neural ODEs with ResNet

The inspiration for neural ODEs thus originates from ResNet: a neural ODE turns ResNet's discrete sequence of updates into a continuous one. In a ResNet with $L$ layers, the state changes in $L$ discrete steps, and each layer has its own function that alters the current state. An ODE network instead defines a continuous vector field, which can be thought of as a neural network with infinitely many layers. The change of the state can then be interpreted as a flow within this vector field.

<p align="center" width="100%">
<img src='pics/ResNetvs.ODENet.png' width="500">
</p>
<p align='center'>Figure 1</p>
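
To make the discrete-versus-continuous picture a little more concrete, the following sketch integrates a toy ODE with an increasing number of Euler steps, which plays the role of adding more and more "layers". The dynamics $dh/dt = -h$ are a hypothetical choice, picked only because the exact solution $h(1) = e^{-1}$ is known; `integrate` is an illustrative helper, not an API from the source.

```python
import math


def f(h, t):
    """Hypothetical dynamics dh/dt = -h, with exact solution h(t) = h(0) * exp(-t)."""
    return -h


def integrate(h0, t0, t1, n_steps):
    """Explicit Euler integration from t0 to t1 using n_steps equal steps."""
    dt = (t1 - t0) / n_steps
    h, t = h0, t0
    for _ in range(n_steps):
        h = h + dt * f(h, t)  # same form as a residual update, scaled by dt
        t += dt
    return h


exact = math.exp(-1.0)
step_counts = (1, 10, 100, 1000)
errors = [abs(integrate(1.0, 0.0, 1.0, n) - exact) for n in step_counts]
for n, err in zip(step_counts, errors):
    print(f"{n:5d} steps: error = {err:.6f}")
```

In this toy setting the error shrinks as the number of steps grows, so the discrete trajectory approaches the continuous flow through the vector field.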

## 2. Backpropagation in Neural ODEs

Similar to classical neural networks, we aim to optimize the parameters of the neural ODE model so that it better fits the real data. We achieve this by minimizing a cost function. Viewing the model as a vector field, for an input $z(t_0)$ the output produced by the model is $z(t_1)$, with:
$$
z(t_1)=\mathrm{ODESolve}(z(t_0),f,\theta(t),t_0,t_1)
$$
The cost function is therefore applied to $z(t_1)$. The next step is to develop a method for updating the parameters $\theta$, which is equivalent to calculating the gradient of the cost function with respect to $\theta$.
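
To fix ideas, here is a minimal sketch of what "a cost applied to $z(t_1)$" and "its gradient with respect to $\theta$" mean in code. Everything in it is an assumption made for illustration: `ode_solve` is a hand-written fixed-step Runge-Kutta (RK4) integrator standing in for $\mathrm{ODESolve}$, the dynamics $dz/dt = \theta z$ with a single constant scalar $\theta$ are a toy choice, and the gradient is estimated with central finite differences. That estimate is only a naive baseline for small problems, not the backpropagation method this section goes on to develop.

```python
import math


def f(z, theta, t):
    """Hypothetical dynamics dz/dt = theta * z; exact z(t1) = z(t0) * exp(theta * (t1 - t0))."""
    return theta * z


def ode_solve(z0, f, theta, t0, t1, n_steps=200):
    """Illustrative stand-in for ODESolve: classical fixed-step RK4 from t0 to t1."""
    dt = (t1 - t0) / n_steps
    z, t = z0, t0
    for _ in range(n_steps):
        k1 = f(z, theta, t)
        k2 = f(z + 0.5 * dt * k1, theta, t + 0.5 * dt)
        k3 = f(z + 0.5 * dt * k2, theta, t + 0.5 * dt)
        k4 = f(z + dt * k3, theta, t + dt)
        z = z + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += dt
    return z


def cost(theta, z0=1.0, target=2.0, t0=0.0, t1=1.0):
    """Squared error between the model output z(t1) and an assumed target value."""
    z1 = ode_solve(z0, f, theta, t0, t1)
    return (z1 - target) ** 2


def grad_fd(theta, eps=1e-5):
    """Central finite-difference estimate of d(cost)/d(theta)."""
    return (cost(theta + eps) - cost(theta - eps)) / (2.0 * eps)


# Plain gradient descent on the single parameter theta.
theta = 0.0
for _ in range(100):
    theta -= 0.1 * grad_fd(theta)

print(theta, math.log(2.0))  # theta should end up close to ln 2
```

For this toy problem the cost is minimized when $e^{\theta} = 2$, so gradient descent should drive $\theta$ towards $\ln 2$. Finite differences need two ODE solves per parameter, which is part of why a more efficient way of computing the gradient is needed for real models.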