# Integral Network and Derivative Network
### Notation of a neural network
Let $\aleph(X|\Theta) = \hat{Y}$ be a neural network with inputs $x_{i} \in X$, outputs $\hat{y}_{j} \in \hat{Y}$ and weights $\theta_{k} \in \Theta$. One can also treat $\aleph$ as a vector function $\aleph(X) = \hat{Y}$, where $\aleph_{j}(X) = \hat{y}_{j}$ denotes the $j$-th output of the neural network.
### To approximate a function
To approximate a vector function $F(X) = Y$, where $Y$ is a vector with elements $y_{j} \in Y$, from samples of $F(X)$, one can use the mean squared error loss function:
$$\mathcal{L}_{MSE}(X,Y|\Theta) = \mathbf{E}_{j}[(\aleph_{j}(X|\Theta) - y_{j})^{2}]$$
And the corresponding cost function:
$$\mathcal{J}_{MSE}(X,Y|\Theta) = \mathbf{E}_{sample}[\mathcal{L}_{MSE}(X,Y|\Theta)]$$
Where $\mathbf{E}$ denotes the expected value. One can then perform gradient descent to minimize the cost function by updating the weights with:
$$\theta_{k} \leftarrow \theta_{k} - \alpha\frac{\partial}{\partial\theta_{k}}\mathcal{J}_{MSE}(X,Y|\Theta)$$
Where $\alpha$ is the learning rate. As the cost converges toward 0, the quality of the approximation improves.
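The update rule above can be sketched in JAX as follows. This is a minimal illustration, not a reference implementation: the small tanh network, the hypothetical target function `target`, the sample range and the learning rate are all assumptions chosen for the example.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    """Random weights Theta for a fully connected network (illustrative choice)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in)
        params.append((w, jnp.zeros(n_out)))
    return params

def net(params, x):
    """N(X|Theta): maps one input vector to one output vector."""
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

def target(x):
    # Hypothetical F: R^2 -> R^2, used only to generate samples.
    return jnp.stack([jnp.sin(x[0]) * x[1], x[0] + x[1] ** 2])

def cost(params, xs, ys):
    # J_MSE: mean over outputs j (the loss) and over samples (the cost).
    preds = jax.vmap(lambda x: net(params, x))(xs)
    return jnp.mean((preds - ys) ** 2)

@jax.jit
def step(params, xs, ys, lr):
    # theta_k <- theta_k - alpha * dJ/dtheta_k, applied to every weight.
    grads = jax.grad(cost)(params, xs, ys)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

k_data, k_init = jax.random.split(jax.random.PRNGKey(0))
xs = jax.random.uniform(k_data, (256, 2), minval=-1.0, maxval=1.0)
ys = jax.vmap(target)(xs)

params = init_params(k_init, [2, 32, 2])
cost_before = float(cost(params, xs, ys))
for _ in range(500):
    params = step(params, xs, ys, 0.05)
cost_after = float(cost(params, xs, ys))
print(cost_before, cost_after)
```

Plain gradient descent is used here only because it matches the update rule literally; any other optimizer could be substituted.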
### To approximate the derivative of a function
To approximate the derivative of a function, or more precisely the Jacobian matrix of a vector function $F(X) = Y$, recall that the Jacobian matrix $\mathbf{J}_{F}$ of $F(X)$ is defined as:
$$\mathbf{J}_{F_{ji}} = \frac{\partial}{\partial x_{i}}F_{j}(X) = \frac{\partial y_{j}}{\partial x_{i}}$$
If a neural network $\aleph(X|\Theta)$ approximates $F(X)$, we write $F(X) \approx \aleph(X|\Theta)$. The Jacobian matrix of the neural network is then also expected to approximate the Jacobian matrix of the function, which we write as:
$$\mathbf{J}_{F}(X) \approx \mathbf{J}_{\aleph}(X)$$
$$\frac{\partial}{\partial x_{i}}F_{j}(X) \approx \frac{\partial}{\partial x_{i}}\aleph_{j}(X|\Theta),\forall j,\forall i$$
$$\frac{\partial y_{j}}{\partial x_{i}} \approx \frac{\partial \hat{y}_{j}}{\partial x_{i}},\forall j,\forall i$$
We call this transformed network, with inputs $x_{i} \in X$, outputs $\mathbf{J}_{\aleph}(X) = \frac{\partial \hat{y}_{j}}{\partial x_{i}} \in \mathbb{R}^{Dim(\hat{Y})\times Dim(X)}$ and shared weights $\theta_{k} \in \Theta$, the derivative network. We call the method of learning the Jacobian matrix end to end the derivative network method, nicknamed dNN/dx.
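One way to obtain such a derivative network is automatic differentiation. The sketch below assumes JAX and a small tanh network (both are choices made for this example); `jax.jacfwd` differentiates the network with respect to its input while reusing the same weights, and a finite-difference check confirms the result.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    """Random weights Theta for a fully connected network (illustrative choice)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in)
        params.append((w, jnp.zeros(n_out)))
    return params

def net(params, x):
    """N(X|Theta): maps one input vector to one output vector."""
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

def derivative_net(params, x):
    """J_N(X): entry [j, i] is d y_hat_j / d x_i, sharing the weights of net."""
    return jax.jacfwd(net, argnums=1)(params, x)

params = init_params(jax.random.PRNGKey(0), [3, 16, 2])
x = jnp.array([0.3, -0.5, 0.8])
jac = derivative_net(params, x)  # shape (Dim(Y_hat), Dim(X)) = (2, 3)

# Sanity check with central finite differences; column i perturbs x_i.
eps = 1e-3
fd = jnp.stack(
    [(net(params, x + eps * e) - net(params, x - eps * e)) / (2 * eps)
     for e in jnp.eye(3)],
    axis=1,
)
max_diff = float(jnp.max(jnp.abs(jac - fd)))
print(jac.shape, max_diff)
```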
### Training with the derivative network
If one instead has samples mapping $X$ to $\mathbf{J}_{F}$, one can still approximate $F(X) \approx \aleph(X|\Theta)$ by performing gradient descent on the derivative network with the modified loss function:
$$\mathcal{L}_{MSE}^{'}(X,\mathbf{J}_{F}(X)|\Theta) = \mathbf{E}_{j,i}[(\mathbf{J}_{\aleph_{ji}} - \mathbf{J}_{F_{ji}})^{2}]$$
And the corresponding cost function:
$$\mathcal{J}_{MSE}^{'}(X,\mathbf{J}_{F}(X)|\Theta) = \mathbf{E}_{sample}[\mathcal{L}_{MSE}^{'}(X,\mathbf{J}_{F}(X)|\Theta)]$$
And update the weights with:
$$\theta_{k} \leftarrow \theta_{k} - \alpha\frac{\partial}{\partial\theta_{k}}\mathcal{J}_{MSE}^{'}(X,\mathbf{J}_{F}(X)|\Theta)$$
We claim the derivative network is also a universal function approximator.
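Training on Jacobian samples can be sketched as below. Everything specific here is an assumption made for illustration: the hypothetical function `target` (used only to generate Jacobian samples, which in practice would be given), the network size, and the learning rate. The network never sees values of $F(X)$ itself.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    """Random weights Theta for a fully connected network (illustrative choice)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in)
        params.append((w, jnp.zeros(n_out)))
    return params

def net(params, x):
    """N(X|Theta): maps one input vector to one output vector."""
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

def derivative_net(params, x):
    """J_N(X): Jacobian of the network w.r.t. its input, same weights."""
    return jax.jacfwd(net, argnums=1)(params, x)

def target(x):
    # Hypothetical F: R^2 -> R^2; only its Jacobian is used for training.
    return jnp.stack([jnp.sin(x[0]) * x[1], x[0] + x[1] ** 2])

def cost(params, xs, jac_f):
    # J'_MSE: mean over Jacobian entries (j, i) and over samples.
    jac_n = jax.vmap(lambda x: derivative_net(params, x))(xs)
    return jnp.mean((jac_n - jac_f) ** 2)

@jax.jit
def step(params, xs, jac_f, lr):
    grads = jax.grad(cost)(params, xs, jac_f)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

k_data, k_init = jax.random.split(jax.random.PRNGKey(0))
xs = jax.random.uniform(k_data, (256, 2), minval=-1.0, maxval=1.0)
jac_f = jax.vmap(jax.jacfwd(target))(xs)  # samples X -> J_F, shape (256, 2, 2)

params = init_params(k_init, [2, 32, 2])
cost_before = float(cost(params, xs, jac_f))
for _ in range(1000):
    params = step(params, xs, jac_f, 0.02)
cost_after = float(cost(params, xs, jac_f))
print(cost_before, cost_after)
```

Differentiating the cost with respect to the weights requires a second pass of automatic differentiation through `derivative_net`, which is why a smooth activation such as tanh is assumed.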
### To approximate the integral of a function
We now show that when the derivative network approximates the Jacobian matrix of a function $F$:
$$\mathbf{J}_{F} \approx \mathbf{J}_{\aleph}$$
The original neural network $\aleph(X|\Theta)$ approximates $F(X)$ up to a constant shift vector $C \in \mathbb{R}^{Dim(\hat{Y})}$ with elements $c_{j}$:
$$\frac{\partial y_{j}}{\partial x_{i}} \approx \frac{\partial\hat{y}_{j}}{\partial x_{i}},\forall j,\forall i$$
$$\int\frac{\partial y_{j}}{\partial x_{i}}dx_{i} \approx \int\frac{\partial\hat{y}_{j}}{\partial x_{i}}dx_{i}, \forall j,\forall i$$
$$y_{j} \approx \hat{y}_{j} + c_{j},\forall j$$
$$F(X) \approx \aleph(X|\Theta) + C$$
Since the mapping $X \rightarrow Y$ of $F(X)$ is missing from the training samples, training recovers $F(X)$, up to the constant $C$, from samples of $\mathbf{J}_{F}$ alone. We claim that training with the derivative network is approximately equivalent to taking the indefinite integral of the Jacobian matrix elements of $F(X)$.
Once the antiderivative is approximated, the definite integral of a Jacobian matrix element $\mathbf{J}_{F_{ji}}$ with respect to $x_{i}$ can be computed as follows:
$$\int_{b}^{a}\mathbf{J}_{F_{ji}}dx_{i} \approx \aleph_{j}(x_{i}=a;X|\Theta) - \aleph_{j}(x_{i}=b;X|\Theta),\forall j,\forall i$$
When $F(X)$ is a scalar function $\mathbb{R}\rightarrow\mathbb{R}$, the Jacobian matrix $\mathbf{J}_{F}$ reduces to a $1 \times 1$ matrix, and the above reduces to the standard way of integrating a single-variable scalar function.
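As a worked instance of the scalar case, take $F(x) = x^{2}$ (chosen here only for illustration). The training samples are $\mathbf{J}_{F}(x) = 2x$, a trained network satisfies $\aleph(x|\Theta) \approx x^{2} - c$, and the unknown constant cancels in the definite integral:
$$\int_{1}^{2} 2x\,dx = 3 \approx \aleph(2|\Theta) - \aleph(1|\Theta) \approx (4 - c) - (1 - c)$$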
We call the method of learning the antiderivative end to end the integral network method, nicknamed ∫NNdx.
### The derivative network can also be trained with only part of the Jacobian matrix
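A minimal sketch of the scalar case follows. It assumes JAX, a small tanh network, and the hypothetical integrand $\cos(x)$ (so the antiderivative is $\sin(x) + C$). The network is trained only on derivative samples, and the definite integral is then read off as a difference of two network evaluations. How close the result is to the exact value depends on how well training converged.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    """Random weights Theta for a fully connected network (illustrative choice)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in)
        params.append((w, jnp.zeros(n_out)))
    return params

def net(params, x):
    """N(x|Theta) for a scalar input x and a scalar output."""
    h = jnp.asarray(x).reshape(1)
    for w, b in params[:-1]:
        h = jnp.tanh(h @ w + b)
    w, b = params[-1]
    return (h @ w + b)[0]

d_net = jax.grad(net, argnums=1)  # derivative network dN/dx

def cost(params, xs, dys):
    preds = jax.vmap(lambda x: d_net(params, x))(xs)
    return jnp.mean((preds - dys) ** 2)

@jax.jit
def step(params, xs, dys, lr):
    grads = jax.grad(cost)(params, xs, dys)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

xs = jnp.linspace(-2.0, 2.0, 200)
dys = jnp.cos(xs)  # samples of the integrand, i.e. of dF/dx for F = sin

params = init_params(jax.random.PRNGKey(0), [1, 32, 1])
cost_before = float(cost(params, xs, dys))
for _ in range(2000):
    params = step(params, xs, dys, 0.02)
cost_after = float(cost(params, xs, dys))

# Definite integral from b to a as a difference of network outputs.
a, b = 1.0, 0.0
approx = float(net(params, a) - net(params, b))
exact = float(jnp.sin(a) - jnp.sin(b))

# Cross-check: trapezoidal quadrature of the derivative network itself.
grid = jnp.linspace(b, a, 1001)
vals = jax.vmap(lambda x: d_net(params, x))(grid)
quad = float((a - b) / 1000 * (jnp.sum(vals) - 0.5 * (vals[0] + vals[-1])))

# Estimated shift C between F = sin and the network.
c_hat = float(jnp.mean(jnp.sin(xs) - jax.vmap(lambda x: net(params, x))(xs)))
print(approx, exact, quad, c_hat)
```

The quadrature cross-check holds regardless of training quality, since the network is by construction an exact antiderivative of its own derivative network.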
Consider a set of dynamic equations $G$ mapping inputs $x_{i} \in X$ to outputs $y_{i}^{G} \in Y^{G}$, where the numbers of inputs and outputs are equal, so that its Jacobian matrix $\mathbf{J}_{G}$ is square. Suppose the samples of $G$ consist only of $X$ mapped to the diagonal elements of its Jacobian matrix:
$$\mathbf{J}_{G}^{diag} = \mathbf{J}_{G_{ii}} \in \mathbf{J}_{G}, \forall i$$
One can perform gradient descent with the loss function restricted to the diagonal elements:
$$\mathcal{L}_{MSE}^{'}(X,\mathbf{J}_{G}^{diag}|\Theta) = \mathbf{E}_{i}[(\mathbf{J}_{\aleph_{ii}} - \mathbf{J}_{G_{ii}})^{2}]$$
If the outputs of the derivative network approximate the diagonal elements of the Jacobian matrix:
$$\mathbf{J}_{G_{ii}} \approx \mathbf{J}_{\aleph_{ii}},\forall i$$
The original neural network $\aleph(X|\Theta)$ will also approximate $G(X)$:
$$\aleph(X|\Theta) \approx G(X)$$
More importantly, the derivative network also learns the correlations between the equations in the dynamic equation set $G$ and approximates the off-diagonal elements of $\mathbf{J}_{G}$:
$$\mathbf{J}_{G_{ji}} \approx \mathbf{J}_{\aleph_{ji}},\forall j,\forall i,j \neq i$$
This applies not only when just the diagonal elements of $\mathbf{J}_{G}$ are present in training, but also for any combination of elements, as long as the elements are sufficient to describe the correlations between the equations in the dynamic equation set.
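The diagonal-only loss can be set up as in the sketch below, assuming JAX and a hypothetical two-equation system `g` chosen for illustration. The sketch only shows how the restricted loss is computed and minimized. It does not by itself establish the claim about off-diagonal elements, whose error is printed purely as a diagnostic.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    """Random weights Theta for a fully connected network (illustrative choice)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in)
        params.append((w, jnp.zeros(n_out)))
    return params

def net(params, x):
    """N(X|Theta): maps one input vector to one output vector."""
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

def derivative_net(params, x):
    return jax.jacfwd(net, argnums=1)(params, x)

def g(x):
    # Hypothetical equation set G: R^2 -> R^2 with a square Jacobian.
    return jnp.stack([jnp.sin(x[0]) * x[1], x[0] + x[1] ** 2])

def cost(params, xs, diag_g):
    # Only the diagonal entries J_N[i, i] enter the loss.
    diag_n = jax.vmap(lambda x: jnp.diag(derivative_net(params, x)))(xs)
    return jnp.mean((diag_n - diag_g) ** 2)

@jax.jit
def step(params, xs, diag_g, lr):
    grads = jax.grad(cost)(params, xs, diag_g)
    return jax.tree_util.tree_map(lambda p, gr: p - lr * gr, params, grads)

k_data, k_init = jax.random.split(jax.random.PRNGKey(0))
xs = jax.random.uniform(k_data, (256, 2), minval=-1.0, maxval=1.0)
jac_g = jax.vmap(jax.jacfwd(g))(xs)              # full Jacobian, diagnostics only
diag_g = jax.vmap(jnp.diag)(jac_g)               # training targets, shape (256, 2)

def offdiag_error(params):
    # Mean squared error on the off-diagonal entries (not used for training).
    jac_n = jax.vmap(lambda x: derivative_net(params, x))(xs)
    mask = 1.0 - jnp.eye(2)
    return float(jnp.sum(((jac_n - jac_g) * mask) ** 2) / (2 * xs.shape[0]))

params = init_params(k_init, [2, 32, 2])
cost_before = float(cost(params, xs, diag_g))
for _ in range(1000):
    params = step(params, xs, diag_g, 0.02)
cost_after = float(cost(params, xs, diag_g))
print(cost_before, cost_after, offdiag_error(params))
```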
### To integrate a multivariable scalar function over a single variable
Consider a multivariable scalar function $H(x,Z)$ mapping inputs $x \in \mathbb{R}, z_{n} \in Z$ to a scalar output $y^{H} \in \mathbb{R}$, and suppose one wants to learn the antiderivative $\int H(x,Z)dx$ end to end. This can be done by setting up the following neural network:
$$\aleph(x,Z|\Theta) = \hat{y}_{integrated}$$
And training with the following partial derivative network:
$$\dot{\aleph}(x,Z|\Theta) = \frac{\partial\hat{y}_{integrated}}{\partial x}$$
With the loss function:
$$\mathcal{L}_{MSE}^{partial}(x,Z,y^{H}|\Theta) = (\dot{\aleph}(x,Z|\Theta) - y^{H})^{2}$$
If $y^{H} \approx \dot{\aleph}(x,Z|\Theta)$, the antiderivative can be approximated by:
$$\int H(x,Z)dx \approx \aleph(x,Z|\Theta) + C$$
And the definite integral can be approximated by:
$$\int_{b}^{a} H(x,Z)dx \approx \aleph(a,Z|\Theta) - \aleph(b,Z|\Theta)$$
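The partial derivative network can be sketched as follows, assuming JAX and a hypothetical integrand $H(x,Z) = z_{1}\cos(x) + z_{2}$ picked for illustration. The network takes $(x, Z)$ as input, only its derivative with respect to $x$ is fitted to samples of $H$, and the definite integral is a difference of two network evaluations at fixed $Z$. Note that the integration constant may in general depend on $Z$; it cancels in the definite integral.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    """Random weights Theta for a fully connected network (illustrative choice)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in)
        params.append((w, jnp.zeros(n_out)))
    return params

def net(params, x, z):
    """N(x, Z|Theta): scalar output y_hat_integrated."""
    h = jnp.concatenate([jnp.asarray(x).reshape(1), z])
    for w, b in params[:-1]:
        h = jnp.tanh(h @ w + b)
    w, b = params[-1]
    return (h @ w + b)[0]

partial_net = jax.grad(net, argnums=1)  # d y_hat_integrated / dx

def H(x, z):
    # Hypothetical integrand; its x-antiderivative is z[0]*sin(x) + z[1]*x + C.
    return z[0] * jnp.cos(x) + z[1]

def cost(params, xs, zs, ys):
    preds = jax.vmap(lambda x, z: partial_net(params, x, z))(xs, zs)
    return jnp.mean((preds - ys) ** 2)

@jax.jit
def step(params, xs, zs, ys, lr):
    grads = jax.grad(cost)(params, xs, zs, ys)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

k_x, k_z, k_init = jax.random.split(jax.random.PRNGKey(0), 3)
xs = jax.random.uniform(k_x, (512,), minval=-2.0, maxval=2.0)
zs = jax.random.uniform(k_z, (512, 2), minval=-1.0, maxval=1.0)
ys = jax.vmap(H)(xs, zs)  # samples y^H

params = init_params(k_init, [3, 32, 1])
cost_before = float(cost(params, xs, zs, ys))
for _ in range(1000):
    params = step(params, xs, zs, ys, 0.02)
cost_after = float(cost(params, xs, zs, ys))

# Definite integral from b to a at a fixed Z.
a, b = 1.0, 0.0
z = jnp.array([0.5, 0.2])
approx = float(net(params, a, z) - net(params, b, z))
exact = float(z[0] * (jnp.sin(a) - jnp.sin(b)) + z[1] * (a - b))

# Cross-check: trapezoidal quadrature of the partial derivative network.
grid = jnp.linspace(b, a, 1001)
vals = jax.vmap(lambda x: partial_net(params, x, z))(grid)
quad = float((a - b) / 1000 * (jnp.sum(vals) - 0.5 * (vals[0] + vals[-1])))
print(approx, exact, quad)
```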