# Adaptive PDE discretizations on cartesian grids : Algorithmic tools

## Part : Automatic differentiation
## Chapter : Reverse automatic differentiation

This notebook illustrates *reverse first order* automatic differentiation. Recall that this approach is recommended for functions from a high-dimensional space to a low-dimensional space, whose jacobian does not have a sparse structure. It is typically appropriate for large scale optimization problems.

**Disclaimer.** Reverse first order automatic differentiation is a classical feature, found in many packages better optimized and maintained than this one. If you only need this specific feature, then it could be wise to use another implementation.

In [1]:
import sys; sys.path.append("..") # Allow imports from parent directory
from Miscellaneous import TocTools; TocTools.displayTOC('Reverse','Algo')

[**Summary**](Summary.ipynb) of this series of notebooks. 

[**Main summary**](../Summary.ipynb), including the other volumes of this work. 


# Table of contents

  * [1. Generating variables and requesting an expression's derivatives](#1.-Generating-variables-and-requesting-an-expression's-derivatives)
    * [1.1 Gradient](#1.1-Gradient)
    * [1.2 Hessian operator](#1.2-Hessian-operator)
  * [2. General operators and their adjoints](#2.-General-operators-and-their-adjoints)
    * [2.1 Linear mapping](#2.1-Linear-mapping)
    * [2.2 Linear inverse](#2.2-Linear-inverse)
    * [2.3 Non-linear operator](#2.3-Non-linear-operator)
  * [3 Loops](#3-Loops)




**Acknowledgement.** The experiments presented in these notebooks are part of ongoing research, 
some of it with PhD student Guillaume Bonnet, in co-direction with Frederic Bonnans.

Copyright Jean-Marie Mirebeau, University Paris-Sud, CNRS, University Paris-Saclay


**Known bugs and incompatibilities.** Our implementation of automatic differentiation is based on subclassing the numpy array class. While simple and powerful, this approach suffers from a few pitfalls, described in the notebook linked below.

In [2]:
TocTools.MakeLink('ADBugs','Algo')

Notebook [ADBugs](./Notebooks_Algo/ADBugs.ipynb) , from volume Algo [Summary](./Notebooks_Algo/Summary.ipynb) 

## 0. Importing the required libraries

In [3]:
import numpy as np
import scipy.sparse; import scipy.sparse.linalg

In [4]:
import NumericalSchemes.AutomaticDifferentiation as ad
import NumericalSchemes.FiniteDifferences as fd

In [5]:
def LInfNorm(a): return np.max(np.abs(a))

## 1. Generating variables and requesting an expression's derivatives

Reverse automatic differentiation works by keeping a history of the computations. The sensitivity of the result w.r.t some components is obtained by propagating the sensitivities backward in the computation queue and using the operator adjoints.
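To make the idea concrete, here is a minimal sketch of a computation history with backward propagation, written with plain numpy. The `Tape` class and its methods are hypothetical names introduced for illustration only; they are not the interface of the `ad.Reverse` class used in this notebook.

```python
import numpy as np

class Tape:
    """Hypothetical minimal history of computations (illustration only)."""
    def __init__(self):
        self.values = []   # value of each recorded variable
        self.steps = []    # (output id, input id, adjoint) triples, in execution order

    def variable(self, value):
        self.values.append(np.asarray(value, dtype=float))
        return len(self.values) - 1

    def apply(self, func, adjoint, i):
        """Record y = func(x_i); adjoint(x, co_y) returns the sensitivity w.r.t. x."""
        x = self.values[i]
        j = self.variable(func(x))
        self.steps.append((j, i, lambda co_y, x=x: adjoint(x, co_y)))
        return j

    def backpropagate(self, out, co_out):
        """Propagate the sensitivity co_out of variable `out` backward in the history."""
        co = {out: np.asarray(co_out, dtype=float)}
        for j, i, adj in reversed(self.steps):
            if j in co:
                co[i] = co.get(i, 0.) + adj(co[j])
        return co

tape = Tape()
x = tape.variable([1., 2., 3.])
y = tape.apply(lambda x: x**2, lambda x, co: 2*x*co, x)    # y = x**2
z = tape.apply(np.sin, lambda x, co: np.cos(x)*co, y)      # z = sin(y)
grad_x = tape.backpropagate(z, np.ones(3))[x]              # gradient of z.sum() w.r.t. x
```

Each recorded step only needs to know how to map a sensitivity of its output to a sensitivity of its input, which is precisely the role of the operator adjoint.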

Our implementation of reverseAD differs from the denseAD or sparseAD classes: here we do not define a data type overloading the arithmetic operators and the basic special functions. Instead, the reverseAD class is only meant to keep a history of user specified computations.

This section only shows how to generate variables, and request an expression's derivatives. See the next section for an actual use of automatic differentiation.
To begin, we create an empty history.

### 1.1 Gradient

In [6]:
x0=np.linspace(0,np.pi,4)
gridScale=x0[1]-x0[0]
u0=np.sin(x0)
v0=np.cos(x0)

In [7]:
# Create an empty history
rev = ad.Reverse.empty()
# Create AD variables w.r.t which the gradient will be required
u1 = rev.identity(constant=u0)
v1 = rev.identity(constant=v0)

The generated AD variables are of sparse AD type.

In [8]:
print("u1:",u1)
print("v1:",v1)

u1: spAD(array([0.00000000e+00, 8.66025404e-01, 8.66025404e-01, 1.22464680e-16]), array([[1.],
       [1.],
       [1.],
       [1.]]), array([[0],
       [1],
       [2],
       [3]]))
v1: spAD(array([ 1. ,  0.5, -0.5, -1. ]), array([[1.],
       [1.],
       [1.],
       [1.]]), array([[4],
       [5],
       [6],
       [7]]))


If we make some computation using the variables registered in the reverseAD class, then we can request the gradient of the final expression.

In [9]:
result = (u1**2+v1**2).sum()

In [10]:
rev.gradient(result)

array([ 0.00000000e+00,  1.73205081e+00,  1.73205081e+00,  2.44929360e-16,
        2.00000000e+00,  1.00000000e+00, -1.00000000e+00, -2.00000000e+00])

If needed, the gradient can be split and reshaped to match the input variables.

In [11]:
u_grad,v_grad = rev.to_inputshapes(rev.gradient(result))

In [12]:
assert LInfNorm(v_grad-2*v0) < 1e-6
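The reshaping step amounts to cutting the flat gradient into consecutive blocks, one per input variable. The following sketch shows one way to do this with numpy; `split_like_inputs` is a hypothetical helper written for this illustration, not the implementation of `rev.to_inputshapes`.

```python
import numpy as np

def split_like_inputs(flat, shapes):
    """Hypothetical helper: cut a flat gradient into blocks shaped like the inputs."""
    sizes = [int(np.prod(s)) for s in shapes]
    bounds = np.cumsum(sizes)[:-1]            # positions where one block ends
    blocks = np.split(flat, bounds)
    return tuple(block.reshape(s) for block, s in zip(blocks, shapes))

x0 = np.linspace(0, np.pi, 4)
u0, v0 = np.sin(x0), np.cos(x0)
flat_grad = np.concatenate([2*u0, 2*v0])      # gradient of (u**2+v**2).sum(), computed by hand
u_grad, v_grad = split_like_inputs(flat_grad, [u0.shape, v0.shape])
```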


### 1.2 Hessian operator

The hessian operator may be accessed as well, using the Reverse2 submodule. Note that the hessian matrix itself is never assembled, since in applications it would typically be dense and of large size.

In [14]:
# Create an empty history
rev = ad.Reverse2.empty()
# Create AD variables w.r.t which the gradient will be required
u1 = rev.identity(constant=u0)
v1 = rev.identity(constant=v0)
result = (u1**2+v1**2).sum()

In [15]:
grad = rev.gradient(result) # Gradient
hess = rev.hessian(result) # Hessian operator

In [16]:
grad

array([ 0.00000000e+00,  1.73205081e+00,  1.73205081e+00,  2.44929360e-16,
        2.00000000e+00,  1.00000000e+00, -1.00000000e+00, -2.00000000e+00])

In this specific example, the hessian operator is twice the identity.

In [17]:
hess(grad)

array([ 0.00000000e+00,  3.46410162e+00,  3.46410162e+00,  4.89858720e-16,
        4.00000000e+00,  2.00000000e+00, -2.00000000e+00, -4.00000000e+00])
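The claim that the hessian operator is twice the identity can be checked independently, again without assembling any matrix. The sketch below approximates the hessian-vector product by a centered difference of the gradient; this is a swapped-in technique for the purpose of the check, and the names `gradient` and `hessian_operator` are hypothetical, not part of the `ad.Reverse2` interface.

```python
import numpy as np

def gradient(w):
    """Gradient of f(w) = (w**2).sum(), where w concatenates u and v."""
    return 2*w

def hessian_operator(gradient, w, eps=1e-6):
    """Return the map h -> H(w) h, approximated by a centered difference of the gradient."""
    return lambda h: (gradient(w + eps*h) - gradient(w - eps*h)) / (2*eps)

x0 = np.linspace(0, np.pi, 4)
w0 = np.concatenate([np.sin(x0), np.cos(x0)])
grad = gradient(w0)
hess = hessian_operator(gradient, w0)
error = np.max(np.abs(hess(grad) - 2*grad))   # expected to be small
```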

## 2. General operators and their adjoints

We illustrate reverse automatic differentiation in its intended use case, involving operators whose jacobians should, in principle, be both high-dimensional and non-sparse. Note that linear operators often correspond to this description. For instance:
* A linear mapping, given in sparse form, but iterated many times.
* The inverse of a linear mapping, given in sparse form.
* The fast Fourier transform, etc.

Obviously, we also address non-linear operators.

### 2.1 Linear mapping

For the sake of exposition, we first construct a sparse matrix from a finite difference scheme.
Here we consider a transport scheme.

In [188]:
def TransportScheme(u,h,speed,dt):
    return u+dt*speed*fd.DiffUpwind(u,(1,),h,padding=0.)

In order to construct the matrix, we evaluate the scheme on a sparse AD variable.

In [189]:
speed = 1.; T=1.5; nsteps=5; dt = T/nsteps;
transport_ad = TransportScheme(ad.Sparse.identity(u0.shape),gridScale,speed,dt)

In [190]:
transport_matrix = scipy.sparse.coo_matrix(transport_ad.triplets()).tocsr()
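For readers who prefer an explicit formula, the same matrix can be written by hand. The sketch below assumes that `fd.DiffUpwind(u,(1,),h,padding=0.)` computes the forward difference $(u_{i+1}-u_i)/h$, with $u$ extended by zero beyond the last grid point. Under that assumption, iterating the matrix `nsteps` times on `u0**2` should reproduce the values of `u2` displayed further below.

```python
import numpy as np

x0 = np.linspace(0, np.pi, 4)
h = x0[1] - x0[0]
u0 = np.sin(x0)
speed = 1.; T = 1.5; nsteps = 5; dt = T/nsteps

n = x0.size
shift = np.eye(n, k=1)              # (shift @ u)[i] = u[i+1], zero beyond the last point
upwind = (shift - np.eye(n)) / h    # assumed action of the upwind finite difference
transport_dense = np.eye(n) + dt*speed*upwind

# Apply the scheme nsteps times to the squared initial condition
u = u0**2
for _ in range(nsteps):
    u = transport_dense @ u
```

The matrix is upper bidiagonal, hence sparse, but its powers fill the whole upper triangle, which is why the iterated mapping is better handled through its adjoint than through its jacobian.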

In [191]:
rev = ad.Reverse.empty()
u1 = rev.identity(constant=u0)
v1 = rev.identity(constant=v0)

u2 = rev.apply_linear_mapping(transport_matrix,u1**2,niter=nsteps)

result = (u2*v1).sum()

Due to the numerous iterations, the variable $u_2$ does not depend in a sparse manner on $u_1$. However, the dependence is linear, and thus has a simple adjoint, which will be exploited.
The variable $u_2$ is of sparseAD type, but features negative placeholder indices. Likewise for the result.
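The adjoint of the iterated mapping $x \mapsto A^n x$ is $y \mapsto (A^T)^n y$, so the backward pass costs as much as the forward pass. The sketch below reproduces the gradient by hand with a dense stand-in for `transport_matrix`, assuming the upwind scheme is the forward difference with zero padding; the helper `iterate` is a hypothetical name.

```python
import numpy as np

x0 = np.linspace(0, np.pi, 4)
h = x0[1] - x0[0]
u0, v0 = np.sin(x0), np.cos(x0)
n, nsteps, dt, speed = x0.size, 5, 0.3, 1.

# Dense stand-in for transport_matrix (assumed form of the upwind scheme)
A = np.eye(n) + dt*speed*(np.eye(n, k=1) - np.eye(n))/h

def iterate(matrix, x, niter):
    """Apply a linear mapping niter times."""
    for _ in range(niter):
        x = matrix @ x
    return x

# Forward pass : u2 = A^nsteps u1^2, result = <u2, v1>
u2 = iterate(A, u0**2, nsteps)
# Backward pass : the sensitivity of result w.r.t. u2 is v1. It is pulled back
# through the iterated mapping using the transposed matrix, then through the square.
grad_u = 2*u0*iterate(A.T, v0, nsteps)
grad_v = u2                          # result is linear in v1
```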

In [192]:
u2

spAD(array([5.02050076e-01, 4.17158585e-01, 1.38706040e-01, 2.77367654e-33]), array([[1.],
       [1.],
       [1.],
       [1.]]), array([[-1],
       [-2],
       [-3],
       [-4]]))

In [193]:
result

spAD(array(0.64127635), array([ 1.00000000e+00,  5.00000000e-01, -5.00000000e-01, -1.00000000e+00,
        5.02050076e-01,  4.17158585e-01,  1.38706040e-01,  2.77367654e-33]), array([-1, -2, -3, -4,  4,  5,  6,  7]))

In [194]:
rev.gradient(result)

array([ 0.00000000e+00,  8.03222547e-01,  6.77741743e-01, -2.49367755e-17,
        5.02050076e-01,  4.17158585e-01,  1.38706040e-01,  2.77367654e-33])

We can check the validity of the result using dense automatic differentiation, since this specific instance is small.

In [221]:
n=x0.size
#u1 = ad.Dense2.identity(constant=u0)
u1 = ad.Dense2.identity(constant=u0,shift=(0,n))
v1 = ad.Dense2.identity(constant=v0,shift=(n,0))

u2 = ad.apply_linear_mapping(transport_matrix,u1**2,niter=nsteps)
#u2 = ad.apply_linear_mapping(transport_matrix,u1**2)
#result=u2.sum() #
result = (u2*v1).sum()

In [222]:
result.to_first()

denseAD(array(0.64127635),
array([ 0.00000000e+00,  8.03222547e-01,  6.77741743e-01, -2.49367755e-17,
        5.02050076e-01,  4.17158585e-01,  1.38706040e-01,  2.77367654e-33]))

In [223]:
r1 = np.dot(result.coef2,result.coef1)

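Reverse automatic differentiation also works at second order. The quantity `r1` computed above, the hessian applied to the gradient, serves as the reference value.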

In [230]:
ad.reload_submodules()

In [231]:
rev = ad.Reverse2.empty()
u1 = rev.identity(constant=u0)
v1 = rev.identity(constant=v0)

u2 = rev.apply_linear_mapping(transport_matrix,u1**2,niter=nsteps) # Also inserted quadratic non-linearity
#u2 = rev.apply_linear_mapping(transport_matrix,u1**2)
#result=u2.sum() #
result = (u2*v1).sum()

grad = rev.gradient(result)
hess = rev.hessian(result)

In [232]:
grad

array([ 0.00000000e+00,  8.03222547e-01,  6.77741743e-01, -2.49367755e-17,
        5.02050076e-01,  4.17158585e-01,  1.38706040e-01,  2.77367654e-33])

In [233]:
r2 = hess(grad)

In [234]:
r2

array([ 0.00000000e+00,  1.20144921e+00,  1.10232870e+00,  6.28712488e-17,
        8.66488999e-01,  6.93122236e-01,  2.17099574e-01, -1.12957547e-33])

In [235]:
LInfNorm(r2-r1)

4.440892098500626e-16
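The second order result can also be derived by hand. Writing $f(u,v) = \langle v, A u^2\rangle$ with $A$ the iterated transport matrix, the hessian applied to a direction $(h_u,h_v)$ has components $2 (A^T v)\, h_u + 2 u\, (A^T h_v)$ and $A (2 u\, h_u)$, with componentwise products. The sketch below evaluates these formulas with a dense stand-in for the matrix, under the assumption that the upwind scheme is the forward difference with zero padding; the function name `hess_apply` is hypothetical.

```python
import numpy as np

x0 = np.linspace(0, np.pi, 4)
h = x0[1] - x0[0]
u, v = np.sin(x0), np.cos(x0)
n = x0.size

# Dense stand-in for the transport matrix iterated five times (assumed scheme, dt=0.3, speed=1)
A = np.linalg.matrix_power(np.eye(n) + 0.3*(np.eye(n, k=1) - np.eye(n))/h, 5)

# Gradient of f(u,v) = <v, A u^2>
grad_u = 2*u*(A.T @ v)
grad_v = A @ u**2

def hess_apply(hu, hv):
    """Hessian of f applied to the direction (hu,hv), without assembling the matrix."""
    out_u = 2*(A.T @ v)*hu + 2*u*(A.T @ hv)
    out_v = A @ (2*u*hu)
    return out_u, out_v

r_u, r_v = hess_apply(grad_u, grad_v)
```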

### 2.2 Linear inverse

In [22]:
def Scheme(u,h):
    return fd.DiffUpwind(u,(1,),h,padding=0.)

residue = Scheme(ad.Sparse.identity(u0.shape),gridScale)
matrix = scipy.sparse.coo_matrix(residue.triplets()).tocsr()

In [15]:


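def mysolve(u,co_output=None):
    """Solves matrix*x=u. If co_output is provided, applies the adjoint operator instead."""
    solver = scipy.sparse.linalg.spsolve
    if co_output is None:
        return solver(matrix,u) # Forward evaluation
    else: 
        # Adjoint evaluation : list of (sensitivity, input) pairs, using the transposed matrix
        return [(solver(matrix.T,co_output),u)]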

Since solving linear problems is a common operation, a factory method `ad.Reverse.linear_inverse_with_adjoint` is provided to produce functions equivalent to `mysolve`. It is illustrated in a later section.
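The adjoint of $u \mapsto A^{-1} u$ is $w \mapsto A^{-T} w$, which is why `mysolve` calls the solver with the transposed matrix. This can be verified with a dot-product test, sketched below with numpy only. The matrix `D` is a dense stand-in for `matrix`, assuming the upwind scheme is the forward difference with zero padding, and `solve_with_adjoint` is a hypothetical name mirroring `mysolve`.

```python
import numpy as np

x0 = np.linspace(0, np.pi, 4)
h = x0[1] - x0[0]
n = x0.size
D = (np.eye(n, k=1) - np.eye(n))/h      # dense stand-in for `matrix` (assumed upwind form)

def solve_with_adjoint(u, co_output=None):
    """Same calling convention as mysolve, with a dense solver."""
    if co_output is None:
        return np.linalg.solve(D, u)
    return [(np.linalg.solve(D.T, co_output), u)]

# Dot-product test : <co, D^{-1} f> must equal <D^{-T} co, f>
rng = np.random.default_rng(0)
f, co = rng.standard_normal(n), rng.standard_normal(n)
lhs = co @ solve_with_adjoint(f)
rhs = solve_with_adjoint(f, co_output=co)[0][0] @ f
```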

Two equivalent syntaxes are available to evaluate a function using reverse automatic differentiation:
* Using `rev.apply`, where rev is the reverseAD variable keeping the history of the computations.
* Using `ad.apply`, and specifying the `reverse_history = rev` optional argument. This latter approach has the advantage of being compatible with the `envelope` keyword for differentiating extrema.

In [16]:
rev = ad.Reverse.empty()
u1 = rev.identity(constant=u0)
v1 = rev.identity(constant=v0)

u2 = rev.apply(mysolve,u1)
u3 = ad.apply(mysolve,u2,reverse_history=rev) # Equivalently : u3=rev.apply(mysolve,u2) 

result = (u3*v1).sum()

The variables `u2`, `u3` and the result all contain *virtual* symbolic perturbations, with negative indices. They are eliminated in the backpropagation step.

In [17]:
result

spAD(array(5.69821876), array([ 1.00000000e+00,  5.00000000e-01, -5.00000000e-01, -1.00000000e+00,
        4.74851563e+00,  2.84910938e+00,  9.49703126e-01,  1.34297549e-16]), array([-5, -6, -7, -8,  4,  5,  6,  7]))

In [18]:
rev.to_inputshapes(rev.gradient(result))

(array([1.09662271, 2.74155678, 3.83817949, 3.83817949]),
 array([4.74851563e+00, 2.84910938e+00, 9.49703126e-01, 1.34297549e-16]))

In this simple case, the gradient may also be computed by hand.

In [19]:
def co_mysolve(co_output): return mysolve(u0,co_output=co_output)[0][0]
co_mysolve(co_mysolve(v0)), u3.value

(array([1.09662271, 2.74155678, 3.83817949, 3.83817949]),
 array([4.74851563e+00, 2.84910938e+00, 9.49703126e-01, 1.34297549e-16]))

### 2.3 Non-linear operator

## 3 Loops


**Note on complexity.** Reverse automatic differentiation requires saving the program state at each intermediate computation step.
If a program contains a loop, this may result in severe memory usage. Therefore, it is common to store only a small proportion of the intermediate states, referred to as keypoints, and to recompute the other steps when they are needed. For best efficiency, this procedure must be made recursive. 
*TODO : We will provide a helper function for that purpose.*
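As an illustration of the keypoint strategy, the sketch below differentiates a loop while storing only one state out of `stride`, and recomputing the missing states segment by segment during the backward pass. It is a single level, non-recursive variant, and all names (`step`, `step_adjoint`, `gradient_with_keypoints`) are hypothetical; it is not the helper function announced above.

```python
import numpy as np

def step(x):
    """One iteration of the loop (arbitrary componentwise example)."""
    return np.sin(x) + 0.5*x

def step_adjoint(x, co_y):
    """Adjoint of the linearization of `step` at x, applied to co_y."""
    return (np.cos(x) + 0.5)*co_y

def gradient_with_keypoints(x_init, nsteps, stride, co_final):
    # Forward pass : store only the states whose index is a multiple of stride
    keypoints = {}
    x = x_init
    for k in range(nsteps):
        if k % stride == 0:
            keypoints[k] = x
        x = step(x)
    # Backward pass : handle the segments from last to first
    co = co_final
    for start in sorted(keypoints, reverse=True):
        end = min(start + stride, nsteps)
        states = [keypoints[start]]          # recompute the states of this segment
        for _ in range(start + 1, end):
            states.append(step(states[-1]))
        for xk in reversed(states):          # backpropagate through the segment
            co = step_adjoint(xk, co)
    return co

x_init = np.array([0.3, 1.2, -0.7])
grad = gradient_with_keypoints(x_init, nsteps=7, stride=3, co_final=np.ones(3))
```

With this variant, the memory usage drops to about `nsteps/stride + stride` states, at the price of one additional forward evaluation of each step.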

In [20]:
mysolve = ad.Reverse.linear_inverse_with_adjoint(scipy.sparse.linalg.spsolve,matrix)

In [21]:
rev = ad.Reverse.empty()
u1 = rev.identity(constant=u0)
v1 = rev.identity(constant=v0)

u3 = rev.iterate(mysolve,u1,niter=2)
#u3 = rev.apply_linear_inverse(matrix,scipy.sparse.linalg.spsolve,u1,niter=2) # Equivalent

result = (u3*v1).sum()

In [22]:
result

spAD(array(5.69821876), array([ 1.00000000e+00,  5.00000000e-01, -5.00000000e-01, -1.00000000e+00,
        4.74851563e+00,  2.84910938e+00,  9.49703126e-01,  1.34297549e-16]), array([-5, -6, -7, -8,  4,  5,  6,  7]))

In [23]:
rev.gradient(result)

array([1.09662271e+00, 2.74155678e+00, 3.83817949e+00, 3.83817949e+00,
       4.74851563e+00, 2.84910938e+00, 9.49703126e-01, 1.34297549e-16])