# Lecture #18: Automatic Differentiation
## AM 207: Advanced Scientific Computing
### Stochastic Methods for Data Analysis, Inference and Optimization
### Fall, 2021

<img src="fig/logos.jpg" style="height:150px;">

In [1]:
### Import basic libraries
import numpy
import autograd.numpy as np
import autograd.numpy.random as npr
import autograd.scipy.stats.multivariate_normal as mvn
import autograd.scipy.stats.norm as norm
from autograd import grad
from autograd.misc.optimizers import adam
import scipy as sp
import pandas as pd
import sklearn as sk
import math
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import rc
from IPython.display import HTML
from IPython.display import YouTubeVideo
%matplotlib inline

## Outline
1. Review of BBVI
2. Automatic Differentiation

# Review of Black Box Variational Inference

## Developments in Computationally Efficient Variational Inference

1. **Black-box Variational Inference (2013)**
  Uses the log-derivative trick to rewrite the gradient of the ELBO as:
  
  $$\nabla_{\mu, \Sigma} \, ELBO(\mu, \Sigma) = \mathbb{E}_{\mathbf{W} \sim q(\mathbf{W} | \mu, \Sigma)}\left[ \nabla_{\mu, \Sigma} \log q(\mathbf{W} | \mu, \Sigma) \, \log \left( \frac{p(\mathbf{W}) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \mathbf{W})}{q(\mathbf{W} | \mu, \Sigma)} \right) \right]$$
  
  This requires **only** the gradient of $\log q(\mathbf{W} | \mu, \Sigma)$, which is generally much simpler to compute than the gradient of the joint density $p(\mathbf{W}) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \mathbf{W})$. 
  
  Implementing BBVI means hard-coding a large library of variational distributions $q(\mathbf{W} | \lambda)$ and their gradients. The user supplies the joint distribution of their Bayesian model and chooses a variational family $q(\mathbf{W} | \lambda)$; the variational parameters $\lambda$ are then optimized by gradient descent to best approximate the target posterior.
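
A minimal sketch of the log-derivative trick on a one-dimensional toy problem, not the BBVI algorithm itself. It assumes $q(w | \mu) = \mathcal{N}(\mu, 1)$ and replaces the log-ratio inside the expectation with a hypothetical integrand $f(w) = w^2$ that does not depend on $\mu$. The function name `score_function_gradient` is made up for this illustration.

```python
import random

def score_function_gradient(mu, f, num_samples=200_000, seed=0):
    """Monte Carlo estimate of d/dmu E_{w ~ N(mu, 1)}[f(w)] via the log-derivative trick."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w = rng.gauss(mu, 1.0)
        # For q = N(mu, 1): d/dmu log q(w | mu) = (w - mu)
        total += (w - mu) * f(w)
    return total / num_samples

# Toy integrand (an assumption) standing in for the log-ratio in the ELBO gradient.
estimate = score_function_gradient(mu=1.0, f=lambda w: w ** 2)
exact = 2.0 * 1.0  # E[w^2] = mu^2 + 1, so the exact gradient is 2 * mu
print(estimate, exact)
```

Note that `f` is only ever evaluated, never differentiated; the price is a noisy estimate that needs many samples.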

## Developments in Computationally Efficient Variational Inference

2. **Weight Uncertainty in Neural Networks (2015)**
  Assuming the variational family is mean-field Gaussian, uses the reparametrization trick to rewrite the gradient of the ELBO as:
  \begin{align}
  \nabla_{\mu, \Sigma} \, ELBO(\mu, \Sigma)=& \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \mathbf{I})}\left[\nabla_{\mu, \Sigma} \log \left[p(\epsilon^\top \Sigma^{1/2} + \mu) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \epsilon^\top \Sigma^{1/2} + \mu)\right]\right]\\
  &+ \nabla_{\mu, \Sigma}\underbrace{\left(-\mathbb{E}_{\mathbf{W} \sim \mathcal{N}(\mu, \Sigma )}\left[\log \mathcal{N}(\mathbf{W};\mu, \Sigma ) \right]\right)}_{\text{Gaussian entropy: has closed form}}
  \end{align}
  
  For Bayesian neural networks, the gradient inside the expectation above can be computed by backpropagation; the variational parameters $\mu, \Sigma$ are then optimized by gradient descent to best approximate the target posterior. This algorithm is called **Bayes by Backprop**.<br><br>
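
A minimal sketch of the reparametrization trick on a one-dimensional toy problem, not Bayes by Backprop itself. It assumes $w = \mu + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ and a hypothetical integrand $f(w) = w^2$ whose derivative is supplied by hand (in practice backpropagation would supply it). The function name `reparam_gradient` is made up for this illustration.

```python
import random

def reparam_gradient(mu, df, num_samples=10_000, seed=0):
    """Estimate d/dmu E_{w ~ N(mu, 1)}[f(w)] as E_eps[f'(mu + eps)], eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        w = mu + eps          # reparametrized sample, differentiable in mu
        total += df(w) * 1.0  # chain rule: f'(w) * dw/dmu, with dw/dmu = 1
    return total / num_samples

# Toy integrand f(w) = w^2 (an assumption), so f'(w) = 2 * w.
estimate = reparam_gradient(mu=1.0, df=lambda w: 2.0 * w)
exact = 2.0 * 1.0  # E[w^2] = mu^2 + 1, so the exact gradient is 2 * mu
print(estimate, exact)
```

Because the gradient moves inside the expectation, the integrand itself must be differentiable, which is exactly where automatic differentiation becomes useful.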

## Developments in Computationally Efficient Variational Inference

3. **Automatic Differentiation Variational Inference (2016)**
  Whenever the gradient of the ELBO can be written as the expectation of a gradient (of an expression without integrals), the gradient can be computed by any automatic differentiation package; the variational parameters $\lambda$ are then optimized by gradient descent to best approximate the target posterior.

# Automatic Differentiation

## Types of Computational Differentiation

1. Manually computing closed form expressions of derivatives and then coding them up (say using `numpy` functions).<br><br>

2. **numeric differentiation:** approximate the derivative $\frac{df(x)}{dx}$ at $x=a$ with the rate of change $\frac{f(a+h) - f(a)}{h}$ for a small step size $h$.<br><br>

3. **symbolic differentiation:** manipulate expressions of functions using pre-programmed rules (of differentiation).<br><br>

4. **automatic differentiation:** algorithmic computation of exact numeric derivatives. `autograd` is a `python` implementation of automatic differentiation. `pytorch` implements one particular mode of automatic differentiation, also called `autograd`.

## Numeric Differentiation

Given $f: \mathbb{R}^n \to \mathbb{R}^m$, approximate each gradient $\nabla f_j = \left(\frac{\partial f_j}{\partial x_1}, \ldots, \frac{\partial f_j}{\partial x_n}\right)$, for $j = 1, \ldots, m$, entry by entry with

$$
\frac{\partial{f_j}}{\partial x_i} \approx \frac{f_j(x + he_i) - f_j(x)}{h}, \quad i = 1, \ldots, n,
$$
where $e_i$ is the $i$-th standard basis vector of $\mathbb{R}^n$.

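This is numerically unstable when $h\approx 0$ (the subtraction in the numerator loses precision to floating-point round-off) and biased when $h$ is large (the truncation error grows with $h$). Each gradient $\nabla f_j$ requires $O(n)$ evaluations of $f$.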
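
The trade-off in the choice of $h$ can be seen in a small sketch. It assumes the test function $f(x) = \sin(x)$ at $x = 1$, chosen only because its derivative $\cos(x)$ is known exactly. The helper name `forward_difference` is made up for this illustration.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the rate of change over a step of size h."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)  # exact derivative of sin at x

errors = {}
for h in [1e-1, 1e-4, 1e-8, 1e-12, 1e-16]:
    approx = forward_difference(math.sin, x, h)
    errors[h] = abs(approx - exact)
    print(f"h = {h:.0e}   error = {errors[h]:.3e}")
```

The error first shrinks as $h$ decreases, then grows again once round-off dominates. For the smallest step, `x + h` is indistinguishable from `x` in double precision, so the approximation collapses to zero.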

## Symbolic Differentiation

Given an expression for a function $f: \mathbb{R}^n \to \mathbb{R}^m$, we represent the expression as a tree and automatically manipulate the expression tree by applying transformations representing differentiation:

$$
\frac{d}{dx}[h(x) + g(x)] \rightarrow \frac{d}{dx}h(x) + \frac{d}{dx}g(x)
$$

This can be computationally inefficient since expressions of derivatives can be exponentially longer than the original function expression:

\begin{align}
f(x) &= h(x)^2 g(x) + \ln(h(x)) + g(x)\\
f'(x) &= 2h(x)h'(x)g(x) + h(x)^2g'(x) + \frac{h'(x)}{h(x)} + g'(x)
\end{align}

Numerical evaluation of the derivative can also be inefficient due to the redundant evaluation of components of $f$: in the example above, $h(x)$, $h'(x)$ and $g'(x)$ each appear more than once in $f'(x)$.
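
A minimal sketch of rule-based symbolic differentiation, assuming a hypothetical tuple encoding of expression trees (`('+', left, right)` and `('*', left, right)`) and no simplification step. It is not how any particular computer algebra system works; it only illustrates how the derivative tree outgrows the original expression.

```python
def diff(expr, var):
    """Differentiate a tiny expression tree with respect to `var`.

    Expressions are numbers, variable names (str), or tuples
    ('+', left, right) / ('*', left, right).
    """
    if isinstance(expr, (int, float)):
        return 0
    if isinstance(expr, str):
        return 1 if expr == var else 0
    op, left, right = expr
    if op == '+':   # sum rule
        return ('+', diff(left, var), diff(right, var))
    if op == '*':   # product rule
        return ('+', ('*', diff(left, var), right), ('*', left, diff(right, var)))
    raise ValueError(f"unknown operator: {op}")

def size(expr):
    """Number of nodes in the expression tree."""
    if isinstance(expr, tuple):
        return 1 + size(expr[1]) + size(expr[2])
    return 1

def evaluate(expr, env):
    """Numerically evaluate an expression tree given variable values in `env`."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    l, r = evaluate(left, env), evaluate(right, env)
    return l + r if op == '+' else l * r

# f(x) = x * x * x * x, built as nested products
f = ('*', ('*', ('*', 'x', 'x'), 'x'), 'x')
df = diff(f, 'x')
print(size(f), size(df))          # the derivative tree is larger than f
print(evaluate(df, {'x': 2.0}))   # agrees with 4 * x**3 at x = 2
```

The sub-expression `('*', 'x', 'x')` is copied into several branches of `df` and re-evaluated each time, which is the redundancy described above.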

## Automatic Differentiation: The Idea

We decompose a function $f(a, b) = (a + b)(b + 1)$ into elementary operations:

<img src="fig/computation_graph.png" style="height:250px;">

We apply symbolic differentiation only to the elementary operations: arithmetic operations and elementary functions (exponential, logarithmic, trigonometric, power).

We keep intermediate values of the components of $f$ so that they can be reused.


## Evaluation Trace and Computational Graph

Given a function $f: \mathbb{R}^n \to \mathbb{R}^m$, we represent $f$ as a composition of elementary functions and operations. The sequence of intermediate values $v_k$ computed while evaluating $f$ is called the ***evaluation trace***. Representing the trace graphically, with one node per intermediate value and edges recording which values each one depends on, gives the ***computational graph***.

**Example:** Given $y = f(x_1, x_2) = \ln(x_1) + x_1x_2 - \sin(x_2)$, its evaluation trace and computational graph are:

<table>
    <tr>
        <td>
            <img src="fig/trace.jpg" style="height:250px;" align="center"/>
        </td>
        <td>
            <img src="./fig/graph.jpg" style="height: 200px;" align="center"/>
        </td>
    </tr>
</table>
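
The evaluation trace of the example can be written out as straight-line code. This is a sketch: the numbering `v1` to `v5` is an assumption and may differ from the figure, and the evaluation point $(x_1, x_2) = (2, 5)$ is chosen arbitrarily.

```python
import math

def evaluation_trace(x1, x2):
    """Evaluate y = ln(x1) + x1*x2 - sin(x2), recording every intermediate value."""
    trace = {}
    trace["v1"] = math.log(x1)                  # ln(x1)
    trace["v2"] = x1 * x2                       # x1 * x2
    trace["v3"] = math.sin(x2)                  # sin(x2)
    trace["v4"] = trace["v1"] + trace["v2"]     # ln(x1) + x1*x2
    trace["v5"] = trace["v4"] - trace["v3"]     # output y
    return trace

trace = evaluation_trace(2.0, 5.0)
for name, value in trace.items():
    print(name, "=", value)
```

Each line applies a single elementary operation to values computed earlier, so each line corresponds to one node of the computational graph.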


## Automatic Differentiation: Forward Mode

In ***forward mode automatic differentiation***, we start with the input and work towards the output, evaluating each intermediate value $v_k$ together with its derivative with respect to a fixed input $x_i$ using the chain rule: 

$$
\frac{\partial v_k}{\partial x_i} = \sum_{v\in \mathrm{parent}(v_k)}\frac{\partial v_k}{\partial v}\frac{\partial v}{\partial x_i}
$$
We denote $\frac{\partial v_k}{\partial x_i}$ by $\dot{v}_k$.

<img src="fig/forward.jpg" style="height:300px;" align="center"/>
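
One common way to implement forward mode is with dual numbers, where every value carries its derivative $\dot{v}_k$ alongside it. The sketch below is an illustration under that assumption, not the implementation used by `autograd`. The names `Dual`, `dual_log` and `dual_sin` are hypothetical.

```python
import math

class Dual:
    """A value paired with its derivative (tangent) w.r.t. one chosen input."""
    def __init__(self, value, dot=0.0):
        self.value = value
        self.dot = dot

    def __add__(self, other):
        return Dual(self.value + other.value, self.dot + other.dot)

    def __sub__(self, other):
        return Dual(self.value - other.value, self.dot - other.dot)

    def __mul__(self, other):
        # product rule
        return Dual(self.value * other.value,
                    self.dot * other.value + self.value * other.dot)

def dual_log(x):
    return Dual(math.log(x.value), x.dot / x.value)

def dual_sin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.dot)

def f(x1, x2):
    return dual_log(x1) + x1 * x2 - dual_sin(x2)

# Seed x1 with tangent 1 to get df/dx1; a second pass with x2 seeded gives df/dx2.
df_dx1 = f(Dual(2.0, 1.0), Dual(5.0, 0.0)).dot
df_dx2 = f(Dual(2.0, 0.0), Dual(5.0, 1.0)).dot
print(df_dx1, df_dx2)
```

Each pass yields the derivative with respect to one input only, so the full gradient of a function with $n$ inputs takes $n$ forward passes.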

## Automatic Differentiation: Reverse Mode

In ***reverse mode automatic differentiation***, we first do a forward pass to compute all intermediate values. Then we start at the output and work towards the input, evaluating the derivative of $f$ with respect to each intermediate value $v_k$ using the chain rule: 

$$
\frac{\partial f}{\partial v_k} = \sum_{v\in \mathrm{child}(v_k)}\frac{\partial f}{\partial v}\frac{\partial v}{\partial v_k}
$$
We denote $\frac{\partial f}{\partial v_k}$ by $\overline{v}_k$. 

<img src="fig/reverse.jpg" style="height:300px;" align="center"/>
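
The two passes can be written out by hand for $y = \ln(x_1) + x_1x_2 - \sin(x_2)$. This is a sketch: the numbering of the intermediate values is an assumption and may differ from the figure, and `value_and_gradient` is a hypothetical name.

```python
import math

def value_and_gradient(x1, x2):
    # Forward pass: compute and store every intermediate value.
    v1 = math.log(x1)
    v2 = x1 * x2
    v3 = math.sin(x2)
    v4 = v1 + v2
    v5 = v4 - v3          # output y

    # Reverse pass: adjoint vk_bar = dy/dvk, from the output back to the inputs.
    v5_bar = 1.0
    v4_bar = v5_bar * 1.0             # v5 = v4 - v3
    v3_bar = v5_bar * -1.0
    v1_bar = v4_bar * 1.0             # v4 = v1 + v2
    v2_bar = v4_bar * 1.0
    # Each input has two children, so its adjoint sums two contributions.
    x1_bar = v1_bar / x1 + v2_bar * x2            # via v1 = ln(x1) and v2 = x1 * x2
    x2_bar = v2_bar * x1 + v3_bar * math.cos(x2)  # via v2 = x1 * x2 and v3 = sin(x2)
    return v5, (x1_bar, x2_bar)

y, (dy_dx1, dy_dx2) = value_and_gradient(2.0, 5.0)
print(y, dy_dx1, dy_dx2)
```

A single reverse pass produces the derivative with respect to every input at once, which is why reverse mode suits functions with many inputs and a scalar output.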

## Implementing Reverse Mode AutoDiff

We see that each intermediate gradient computation $\frac{\partial f}{\partial v_k}$ in reverse mode autodiff is local. It only requires:
1. the current value of $v_k$
2. the derivative of $f$ with respect to every child of $v_k$: $\frac{\partial f}{\partial v},\, v\in \mathrm{child}(v_k)$.
3. the derivative of the elementary function $h_v$ describing the way $v$ depends on $v_k$.

We implement the computation graph of a function $f$ as a directed graph, where each node keeps track of the above three pieces of information and uses them to compute its own gradient.

## An Example of Reverse Mode AutoDiff in `python`

In [2]:
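'''small example of reverse mode autodiff as implemented in https://github.com/Rufflewind/revad/'''
class Var:
    def __init__(self, value):
        self.value = value       # value computed in the forward pass
        self.children = []       # (weight, child) pairs, where weight = d(child)/d(self)
        self.grad_value = None   # d(output)/d(self); filled in lazily by grad()

    def grad(self):
        # chain rule: sum over children of d(child)/d(self) * d(output)/d(child);
        # the result is cached so each node is computed only once
        if self.grad_value is None:
            self.grad_value = sum(weight * var.grad()
                                  for weight, var in self.children)
        return self.grad_value

    #overloading the '+' operator
    def __add__(self, other):
        z = Var(self.value + other.value)
        # d(z)/d(self) = d(z)/d(other) = 1
        self.children.append((1.0, z))
        other.children.append((1.0, z))
        return z

def sin(x):
    z = Var(math.sin(x.value))
    # d(z)/d(x) = cos(x)
    x.children.append((math.cos(x.value), z))
    return z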

In [3]:
x = Var(0.5)
y = Var(4.2)
z = y + sin(x)
z.grad_value = 1.0  # seed the reverse pass: dz/dz = 1

print('value of y + sin(x) evaluated at x=0.5, y=4.2: {}\nforward pass of our implementation: {}'.format(4.2 + np.sin(0.5), z.value))

value of y + sin(x) evaluated at x=0.5, y=4.2: 4.679425538604203
forward pass of our implementation: 4.679425538604203


In [5]:
x.grad()

0.8775825618903728

In [6]:
print('value of dz/dx = cos(x) evaluated at x=0.5, y=4.2: {}\nreverse pass of our implementation: {}'.format(np.cos(0.5), x.grad_value))

value of dz/dx = cos(x) evaluated at x=0.5, y=4.2: 0.8775825618903728
reverse pass of our implementation: 0.8775825618903728
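
As a sketch of how the same design extends to other operations, the block below adds a `__mul__` overload and differentiates $f(a, b) = (a + b)(b + 1)$ from the earlier slide. The class is repeated so that the block runs by itself. The `__mul__` method and the evaluation point $a = 2, b = 1$ are additions made for this illustration.

```python
class Var:
    """Reverse-mode node, following the structure of the example above."""
    def __init__(self, value):
        self.value = value
        self.children = []      # (d(child)/d(self), child) pairs
        self.grad_value = None

    def grad(self):
        if self.grad_value is None:
            self.grad_value = sum(weight * var.grad()
                                  for weight, var in self.children)
        return self.grad_value

    def __add__(self, other):
        z = Var(self.value + other.value)
        self.children.append((1.0, z))
        other.children.append((1.0, z))
        return z

    def __mul__(self, other):
        z = Var(self.value * other.value)
        # product rule: d(z)/d(self) = other, d(z)/d(other) = self
        self.children.append((other.value, z))
        other.children.append((self.value, z))
        return z

a, b, one = Var(2.0), Var(1.0), Var(1.0)
f = (a + b) * (b + one)
f.grad_value = 1.0   # seed: df/df = 1

# Analytically: df/da = b + 1 and df/db = a + 2b + 1
print(f.value, a.grad(), b.grad())
```

Because `b` feeds into both factors, its gradient sums the contributions from its two children, exactly as in the reverse mode chain rule.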
