# Backpropagation on DAGs

![Status](https://img.shields.io/static/v1.svg?label=Status&message=Finished&color=brightgreen)
[![Source](https://img.shields.io/static/v1.svg?label=GitHub&message=Source&color=181717&logo=GitHub)](https://github.com/particle1331/inefficient-networks/blob/master/docs/notebooks/fundamentals/backpropagation.ipynb)
[![Stars](https://img.shields.io/github/stars/particle1331/inefficient-networks?style=social)](https://github.com/particle1331/inefficient-networks)


---

## Introduction

In this notebook, we will look at **backpropagation** (BP) on **directed acyclic computational graphs** (DAG). Our main result is that a single training step for a single data point (consisting of both a forward and a backward pass) has a time complexity that is linear in the number of edges of the network. 

<!-- In the last section, we take a closer look at the implementation of `.backward` in PyTorch. -->

**Readings**
* [Evaluating $\nabla f(x)$ is as fast as $f(x)$](https://timvieira.github.io/blog/post/2016/09/25/evaluating-fx-is-as-fast-as-fx/)
* [Back-propagation, an introduction](http://www.offconvex.org/2016/12/20/backprop/)
* [PyTorch Autograd Explained - In-depth Tutorial](https://www.youtube.com/watch?v=MswxJw-8PvE)

## Gradient descent on the loss surface

```{margin}
**Constructing the loss surface**. The loss function $\ell$ acts as an almost-everywhere differentiable surrogate to the true objective. The empirical loss surface will generally vary for different samples drawn. But we expect these surfaces to be very similar, assuming the samples are drawn from the same distribution.
```

For every data point $(\mathbf x, y)$, the loss function $\ell$ assigns a nonnegative number $\ell(y, f_{\mathbf w}(\mathbf x))$ that approaches zero whenever the predictions $f_{\mathbf w}(\mathbf x)$ approach the target values $y$. Given the current parameters $\mathbf w \in \mathbb R^d$ of a neural network $f$, we can imagine the network to be at a certain point $(\mathbf w, \mathcal L_{\mathcal X}(\mathbf w))$ on a surface in $\mathbb R^d \times \mathbb R$ where $\mathcal L_{\mathcal X}(\mathbf w)$ is the average loss over the dataset:

$$
\mathcal L_{\mathcal X}(\mathbf w) = \frac{1}{|\mathcal X|} \sum_{(\mathbf x, y) \in \mathcal X} \ell(y, f_{\mathbf w}(\mathbf x)).
$$

So training a neural network is equivalent to finding the minimum of this surface. In practice, we use variants of gradient descent, characterized by the update rule $\mathbf w \leftarrow \mathbf w - \varepsilon \nabla_{\mathbf w} \mathcal L_{\mathcal X}$, to find a local minimum. Here $-\nabla_{\mathbf w} \mathcal L_{\mathcal X}$ is the direction of steepest descent at $\mathbf w$ and the learning rate $\varepsilon > 0$ is a constant that controls the step size.
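
The update rule can be sketched on a toy problem. The snippet below is only an illustration under made-up assumptions: a one-parameter model $f_w(x) = wx$, a squared-error loss, a three-point dataset, and a hand-picked learning rate `eps`. None of these come from the text above.

```python
# Minimal sketch: gradient descent on the average squared loss of f_w(x) = w * x.
# The dataset, learning rate, and step count are made up for illustration.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (x, y) with y = 2x


def average_loss(w):
    # L_X(w) = (1 / |X|) * sum of (y - w x)^2 over the dataset
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def average_loss_grad(w):
    # d/dw (y - w x)^2 = -2 x (y - w x), averaged over the dataset
    return sum(-2.0 * x * (y - w * x) for x, y in data) / len(data)


w = 0.0     # initial parameter
eps = 0.05  # learning rate (step size)
for step in range(100):
    # w <- w - eps * grad: move in the direction of steepest descent
    w -= eps * average_loss_grad(w)

print(w, average_loss(w))
```

Here the surface is a parabola in $w$ with its minimum at $w = 2$, so the iterates should approach that value.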


```{figure} ../../img/loss_surface_resnet.png
---
name: loss-surface-resnet
width: 35em
---
Loss surface for ResNet-56 with or without skip connections. Much of deep learning research is dedicated to studying the geometry of loss surfaces and its effect on optimization. {cite}`arxiv.1712.09913`
```



```{margin}
**Derivatives of comp. graphs**
```

In principle, we can perturb the current state of the network (obtained during forward pass) by perturbing the network weights/parameters. This results in perturbations flowing up to the final loss node (assuming each computation is differentiable). So it's not a mystery that we can compute derivatives of computational graphs which may appear, at first glance, as "discrete" objects. Another perspective is that a computational DAG essentially models a sequence of function compositions which can be easily differentiated using the chain rule. However, looking at the network structure allows us to easily code the computation into a computer, exploit modularity, and efficiently compute the flow of derivatives at each layer. This is further discussed below.
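
Both perspectives can be compared numerically. The sketch below assumes a hypothetical three-node chain $u = wx$, $v = \tanh(u)$, $\mathcal L = (v - y)^2$ with a made-up data point; it estimates the derivative by perturbing $w$ and also computes it with the chain rule.

```python
import math

x, y = 0.5, 1.0  # a single made-up data point


def loss(w):
    u = w * x            # first compute node
    v = math.tanh(u)     # second compute node
    return (v - y) ** 2  # loss node


def loss_grad_chain_rule(w):
    # Differentiate the composition node by node.
    u = w * x
    v = math.tanh(u)
    dL_dv = 2.0 * (v - y)
    dv_du = 1.0 - v ** 2
    du_dw = x
    return dL_dv * dv_du * du_dw


def loss_grad_perturb(w, h=1e-6):
    # Perturb the weight and observe the change at the loss node.
    return (loss(w + h) - loss(w - h)) / (2.0 * h)


w = 0.3
print(loss_grad_chain_rule(w), loss_grad_perturb(w))
```

The two printed numbers should agree up to the error of the finite-difference estimate.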


```{margin}
**The need for efficient BP**
```

Observe that $\nabla_\mathbf w \mathcal L_{\mathcal X}$ consists of partial derivatives for each weight in the network. These can easily number in the millions, so the backward pass can be a huge operation. To compute these values efficiently, we will perform both forward and backward passes in a dynamic programming fashion to avoid recomputing any known value. As an aside, this improvement in time complexity turns out to be insufficient for practical uses, and is supplemented with hardware for parallel computation (GPUs/TPUs), which can reduce training time by some factor, e.g. from days to hours.
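
To get a feel for why reusing known values matters, note that a naive expansion of the chain rule has one term per path from a node to the loss, whereas dynamic programming touches each edge once. The sketch below counts both quantities on a hypothetical fully connected layered DAG; the helper names `layered_dag`, `count_edges`, and `count_paths` are made up for this illustration.

```python
from functools import lru_cache


def layered_dag(depth, width):
    """Fully connected layers: node (t, i) feeds every node (t + 1, j)."""
    edges = {}
    for t in range(depth):
        for i in range(width):
            edges[(t, i)] = [(t + 1, j) for j in range(width)]
    for i in range(width):
        edges[(depth, i)] = []  # the last layer has no outgoing edges
    return edges


def count_edges(edges):
    return sum(len(children) for children in edges.values())


def count_paths(edges, source, sinks):
    # Memoization: the number of paths from each node is computed only once.
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node in sinks:
            return 1
        return sum(paths_from(child) for child in edges[node])

    return paths_from(source)


edges = layered_dag(depth=10, width=4)
sinks = frozenset((10, i) for i in range(4))
print(count_edges(edges))                 # 10 layers * 4 * 4 edges = 160
print(count_paths(edges, (0, 0), sinks))  # 4 ** 10 = 1048576 paths
```

The number of paths grows exponentially with depth while the number of edges grows linearly, which is the gap that the dynamic programming approach closes.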

## Backpropagation on Computational Graphs

A neural network can be modelled as a **directed acyclic graph** (DAG) of compute and parameter nodes that implements a function $f$. The graph can be extended with nodes that calculate the loss value for each training example given the current parameter values.

### Forward pass

In computing $f(\mathbf x)$, the input $\mathbf x$ is passed to the first layer and propagated forward through the network, computing the output value of each node. Every value in the nodes is stored to preserve the current state for backward pass, as well as to avoid recomputation for the nodes in the next layer. Assuming a node with $n$ inputs requires $n$ operations, one forward pass takes $O(E)$ calculations where $E$ is the number of edges of the graph.
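
A forward pass of this kind can be sketched as a loop over compute nodes in topological order. The graph below is hypothetical (its node names and functions are chosen only for illustration); the point is that each node's value is stored once and each incoming edge is visited once.

```python
import math

# Hypothetical graph. Input and parameter nodes have zero fan-in.
inputs = {"x": 2.0, "y": 1.0, "w0": 0.5, "w1": -1.0}

# Compute nodes in topological order: (name, parent nodes, function).
compute = [
    ("u", ("w0", "x"), lambda a, b: a * b),
    ("v", ("u",), math.tanh),
    ("z", ("w1", "v"), lambda a, b: a * b),
    ("loss", ("z", "y"), lambda a, b: (a - b) ** 2),
]


def forward(inputs, compute):
    values = dict(inputs)  # stored network state, reused by later nodes
    edges_visited = 0
    for name, parents, fn in compute:
        values[name] = fn(*(values[p] for p in parents))
        edges_visited += len(parents)  # one unit of work per incoming edge
    return values, edges_visited


values, edges_visited = forward(inputs, compute)
print(values["loss"], edges_visited)
```

In this example the graph has seven edges, and the counter reports exactly seven edge visits for one forward pass.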

```{margin}
[`source`](https://drive.google.com/file/d/1JCWTApGieKZmFW4RjCANZM8J6igcsdYg/view)
```
```{figure} ../../img/backprop-compgraph2.png
---
width: 25em
name: backprop-compgraph2
---
Backpropagation through a single layer neural network with weights $w_0$ and $w_1$, and input-output pair $(x, y).$ Shown here is the gradient flowing from the loss node $\mathcal L$ to the weight $w_0.$
```

### Backward pass

During backward pass, gradients are categorized into two types: local gradients of the form $\frac{\partial u}{\partial w}$ between adjacent nodes $u$ and $w,$ and backpropagated gradients $\frac{\partial{\mathcal L}}{\partial u}$ for each node ${u}.$ Our goal is to calculate the backpropagated gradient of the loss with respect to parameter nodes. Note that parameter nodes have zero fan-in ({numref}`backprop-compgraph2`).

BP proceeds inductively. For the base step, set the gradient of the compute node for the loss as $1$ (i.e. $\frac{\partial{\mathcal L}}{\partial \mathcal L} = 1$). Since the backpropagated gradient ${\frac{\partial{\mathcal L}}{\partial u}}$ for each compute node $u$ in the upper layer is stored, then after computing local gradients ${\frac{\partial{u}}{\partial w}}$, the backpropagated gradient ${\frac{\partial{\mathcal L}}{\partial w}}$ for each compute node $w$ in the current layer can be calculated via the chain rule:

$$
{\frac{\partial\mathcal L}{\partial w} } = \sum_{ {u} }\left( {{\frac{\partial\mathcal L}{\partial u}}} \right)\left( {{\frac{\partial{u}}{\partial w}}} \right).
$$

This continues the "flow" of gradients to the current layer. The process ends on nodes with zero fan-in. Note that the partial derivatives are evaluated on the current network state; these values are stored during forward pass, which precedes backward pass. Analogously, all backpropagated gradients are stored in each compute node for use by the next layer. On the other hand, there is no need to store local gradients; these are computed as needed. Hence, it suffices to compute all gradients with respect to compute nodes to get all gradients with respect to the weights of the network.
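
As a small worked instance of the chain rule sum (on a hypothetical graph, not the one in the figures), take an input $x$, a parameter $w$, and compute nodes $u = wx$, $a = u^2$, $b = 3u$, and $\mathcal L = a + b$, so that $u$ has fan-out 2. Starting from $\frac{\partial \mathcal L}{\partial \mathcal L} = 1$ and moving down the graph:

$$
\begin{aligned}
\frac{\partial \mathcal L}{\partial a} &= 1, \qquad \frac{\partial \mathcal L}{\partial b} = 1, \\
\frac{\partial \mathcal L}{\partial u} &= \frac{\partial \mathcal L}{\partial a}\frac{\partial a}{\partial u} + \frac{\partial \mathcal L}{\partial b}\frac{\partial b}{\partial u} = 2u + 3, \\
\frac{\partial \mathcal L}{\partial w} &= \frac{\partial \mathcal L}{\partial u}\frac{\partial u}{\partial w} = (2u + 3)\,x.
\end{aligned}
$$

With $x = 2$ and $w = 1$, forward pass stores $u = 2$, so $\frac{\partial \mathcal L}{\partial u} = 7$ and $\frac{\partial \mathcal L}{\partial w} = 14$. This agrees with differentiating $\mathcal L = w^2 x^2 + 3wx$ directly, which gives $2wx^2 + 3x = 14$.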
    
**Remark.** BP is a useful tool for understanding how derivatives flow through a model. This can be extremely helpful in reasoning about why some models are difficult to optimize. Classic examples are vanishing or exploding gradients as we go into deeper layers of the network.


```{margin}
[`source`](https://drive.google.com/file/d/1JCWTApGieKZmFW4RjCANZM8J6igcsdYg/view)
```
```{figure} ../../img/backprop-compgraph.png
---
width: 28em
name: backprop-compgraph
---
BP on a generic comp. graph with fan out > 1 on node <code>y</code>. Each backpropagated gradient computation is stored in the corresponding node. To calculate the backpropagated gradient for node <code>y</code>, we have to sum over the two incoming gradients. This can be implemented using matrix multiplication.
```

### Gradient descent with BP

We now know how to compute each backpropagated gradient. This is implemented as `u.backward()` for node `u`, which sends the gradient of the loss with respect to `u` to all its parent nodes (i.e. nodes on the lower layer connected to `u`). The complete recursive algorithm with SGD is implemented below. Note that this abstracts away autodifferentiation.

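```python
# Assumed to be defined elsewhere: lists of nodes `compute` (in topological
# order), `params`, `inputs`; the final compute node `loss`; and the
# constants `THRESHOLD` and `ETA`.

def Forward():
    for c in compute:
        c.forward()

def Backward(loss):
    # Reset gradients accumulated during the previous step.
    for c in compute + params + inputs:
        c.grad = 0

    # Base step, then visit compute nodes in reverse topological order.
    loss.grad = 1
    for c in compute[::-1]:
        c.backward()

def SGD(eta):
    for w in params:
        w.value -= eta * w.grad


Forward()
while loss.value >= THRESHOLD:
    Backward(loss)
    SGD(eta=ETA)
    Forward()
```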
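
To make the loop above concrete, here is one possible minimal instantiation. The `Node`, `Mul`, and `SquaredError` classes, the data point, and the constants are all assumptions made for this sketch; they are not taken from any library or from the original notebook.

```python
class Node:
    """Input or parameter node: holds a value and an accumulated gradient."""

    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0


class Mul(Node):
    def __init__(self, a, b):
        super().__init__()
        self.a, self.b = a, b

    def forward(self):
        self.value = self.a.value * self.b.value

    def backward(self):
        # Send (backpropagated gradient) * (local gradient) to each parent.
        self.a.grad += self.grad * self.b.value
        self.b.grad += self.grad * self.a.value


class SquaredError(Node):
    def __init__(self, pred, target):
        super().__init__()
        self.pred, self.target = pred, target

    def forward(self):
        self.value = (self.pred.value - self.target.value) ** 2

    def backward(self):
        d = 2.0 * (self.pred.value - self.target.value)
        self.pred.grad += self.grad * d
        self.target.grad -= self.grad * d


# Graph for loss = (w * x - y)^2 with a made-up data point (x, y) = (3, 6).
x, y, w = Node(3.0), Node(6.0), Node(0.0)
pred = Mul(w, x)
loss = SquaredError(pred, y)
inputs, params, compute = [x, y], [w], [pred, loss]  # compute: topological order


def forward():
    for c in compute:
        c.forward()


def backward(loss):
    for c in compute + params + inputs:
        c.grad = 0.0
    loss.grad = 1.0
    for c in compute[::-1]:
        c.backward()


def sgd(eta):
    for p in params:
        p.value -= eta * p.grad


forward()
while loss.value >= 1e-12:
    backward(loss)
    sgd(eta=0.05)
    forward()

print(w.value, loss.value)
```

Gradients are accumulated with `+=` so that a node with fan-out greater than one sums the contributions from all nodes that depend on it, which is why they are reset to zero at the start of each backward pass.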

<br>

Two important properties of the algorithm which make it the practical choice for training huge neural networks are as follows:

* **Modularity.** The dependence only on nodes belonging to the upper layer suggests a modularity in the computation, e.g. we can connect DAG subnetworks with possibly distinct network architectures by only connecting nodes that are exposed between layers. 

<br>

* **Efficiency.** From the backpropagation equation, the backpropagated gradient $\frac{\partial \mathcal L}{\partial w}$ for node $w$ is computed by a sum that is indexed by $u$ for every node connected to $w$. Iterating over all nodes $w$ in the network, we cover all the edges in the network with no edge counted twice. Assuming computing a local gradient takes constant time, backward pass requires $O(E)$ computations.

$\phantom{3}$

### BP equations for dense neural nets

```{margin}
Source:<br>**Figure 1** of {cite}`0483bd9444a348c8b59d54a190839ec9`
```
```{figure} ../../img/deep-nns.png
---
width: 40em
---

**Figure 1** of {cite}`0483bd9444a348c8b59d54a190839ec9` should hopefully make sense after reading this article. This figure shows (left) forward pass for a multilayer neural network, and (right) backward pass for the same network.
```

As shown in the above figure, multilayer neural networks can be clearly modelled as a computational DAG with edges between preactivation and activation values, and edges from weights and input values that fan into preactivations. Computation performed by the network at layer $t$ can be written in two steps as:

* ${\mathbf{y}}^{(t)} = {\mathbf{x}}^{(t-1)}{\boldsymbol{W}}^{(t)} + {\boldsymbol b}^{(t)}$ 
* ${\mathbf x}^{(t)} = \varphi({\mathbf y}^{(t)})$ 

Here we use row vectors for layer inputs and outputs. The following equations are obtained by simply matching input and output shapes, then trying to figure out the entries of the right hand side matrix by tracking node dependencies. For the compute nodes in the current layer, the BP equations are:
    
$$
\dfrac{\partial \mathcal L}{\partial {\mathbf x}^{(t)}} 
= 
\dfrac{\partial \mathcal L}{\partial {\mathbf y}^{(t+1)}} 
\dfrac{\partial {\mathbf y}^{(t+1)}}{\partial {\mathbf x}^{(t)}} 
=
\dfrac{\partial \mathcal L}{{\partial \mathbf{y}}^{(t+1)}}
\boldsymbol{W}^{(t+1)\top}
$$

and

$$
\dfrac{\partial \mathcal L}{\partial {{\mathbf y}}^{(t)}} 
= 
\dfrac{\partial \mathcal L}{\partial {\mathbf x}^{(t)}} 
\dfrac{\partial {\mathbf x}^{(t)}}{\partial {\mathbf y}^{(t)}} 
= 
\dfrac{\partial \mathcal L}{{\partial \mathbf{x}}^{(t)}}
{{\boldsymbol J}_\varphi}{^{(t)}}.
$$

Here row $i$ of the Jacobian matrix ${\boldsymbol J}_\varphi$ contains the gradient of the $i$th element of $\mathbf x^{(t)}$ with respect to $\mathbf y^{(t)}$. For commonly used activations, i.e. those which are applied entrywise on vector inputs, this matrix reduces to a diagonal matrix. For the parameter nodes, the backpropagated gradients are:

$$
\dfrac{\partial \mathcal L}{\partial {\boldsymbol W}^{(t)}} 
= 
\dfrac{\partial \mathcal L}{\partial {{\mathbf y}}^{(t)}} 
\dfrac{\partial {{\mathbf y}}^{(t)}}{\partial {\boldsymbol W}^{(t)}} 
= 
\mathbf{x}^{(t-1)\top}
\dfrac{\partial \mathcal L}{\partial {{\mathbf y}}^{(t)}} 
$$

and

$$
\dfrac{\partial \mathcal L}{\partial {\boldsymbol b}^{(t)}} 
= 
\dfrac{\partial \mathcal L}{\partial {{\mathbf y}}^{(t)}}
\dfrac{\partial {{\mathbf y}}^{(t)}}{\partial {\boldsymbol b}^{(t)}} 
= 
\dfrac{\partial \mathcal L}{\partial {{\mathbf y}}^{(t)}}.
$$

Backpropagated gradients for compute nodes have to be stored until weights are updated. Observe that derivatives of the compute nodes of layer $t+1$ are retrieved to compute gradients in layer $t.$ On the other hand, the local gradients ${\boldsymbol J}_\varphi$ are computed using autodifferentiation evaluated at the current network state based on values from forward pass. 
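
The structure of ${\boldsymbol J}_\varphi$ can be inspected numerically. The sketch below uses a finite-difference helper `jacobian` (a made-up name for this illustration, standing in for autodifferentiation) to compare an entrywise activation, $\tanh$, with softmax, which couples all entries and therefore has a dense Jacobian.

```python
import math


def tanh_vec(y):
    return [math.tanh(v) for v in y]


def softmax(y):
    m = max(y)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]


def jacobian(phi, y, h=1e-6):
    """Finite-difference estimate of J[i][j] = d phi(y)[i] / d y[j]."""
    n = len(y)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(y), list(y)
        plus[j] += h
        minus[j] -= h
        fp, fm = phi(plus), phi(minus)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return J


y = [0.2, -1.0, 0.7]
x = softmax(y)
J_tanh = jacobian(tanh_vec, y)
J_softmax = jacobian(softmax, y)

for row in J_tanh:
    print([round(v, 4) for v in row])
for row in J_softmax:
    print([round(v, 4) for v in row])
```

The $\tanh$ Jacobian is diagonal with entries $1 - \tanh^2(y_i)$, while the softmax Jacobian has entries $x_i(\delta_{ij} - x_j)$, so its off-diagonal entries are nonzero.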

```{figure} ../../img/jacobian.svg
---
name: jacobian
width: 75%
---
Deriving the equations by matching shapes of derivatives as matrices. In general, we just need to put the incoming gradients as input, and the current gradients as output of the operation. Then, what's left is figuring out the matrix in between that contains appropriate local derivatives. The fact that this is just matrix multiplication is the key idea of differential calculus.
```
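
The shape-matching argument can also be checked entry by entry. Dropping layer superscripts and writing $\mathbf x = \mathbf x^{(t-1)}$, $\mathbf y = \mathbf y^{(t)}$, $\boldsymbol W = \boldsymbol W^{(t)}$, and $\boldsymbol b = \boldsymbol b^{(t)}$, the affine step reads $y_j = \sum_i x_i W_{ij} + b_j$. The local derivatives are therefore

$$
\frac{\partial y_k}{\partial W_{ij}} = x_i\,\delta_{kj}, 
\qquad 
\frac{\partial y_k}{\partial b_j} = \delta_{kj}, 
\qquad 
\frac{\partial y_j}{\partial x_i} = W_{ij},
$$

and summing over the nodes $y_k$ that depend on each quantity gives

$$
\begin{aligned}
\frac{\partial \mathcal L}{\partial W_{ij}} 
&= \sum_k \frac{\partial \mathcal L}{\partial y_k}\, x_i\,\delta_{kj} 
= x_i \frac{\partial \mathcal L}{\partial y_j}, \\
\frac{\partial \mathcal L}{\partial b_j} 
&= \sum_k \frac{\partial \mathcal L}{\partial y_k}\,\delta_{kj} 
= \frac{\partial \mathcal L}{\partial y_j}, \\
\frac{\partial \mathcal L}{\partial x_i} 
&= \sum_j \frac{\partial \mathcal L}{\partial y_j}\, W_{ij}.
\end{aligned}
$$

Read as matrices with row-vector gradients, these are the outer product $\mathbf x^\top \frac{\partial \mathcal L}{\partial \mathbf y}$, the gradient $\frac{\partial \mathcal L}{\partial \mathbf y}$ itself, and the product $\frac{\partial \mathcal L}{\partial \mathbf y} \boldsymbol W^\top$, matching the BP equations for dense layers.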


### Example computation with TensorFlow

The above equations and process are demonstrated in the following computation:

```python
import tensorflow as tf
print(tf.__version__)

# Input and weight init.
x0 = tf.Variable(tf.random.normal(shape=(1, 4)))
W1 = tf.Variable(tf.random.normal(shape=(4, 3)))
b1 = tf.Variable(tf.random.normal(shape=(1, 3)))
W2 = tf.Variable(tf.random.normal(shape=(3, 2)))
```

```
2.8.0
```


```python
# Forward pass:
with tf.GradientTape(persistent=True) as tape:
    y1 = tf.matmul(x0, W1) + b1
    x1 = tf.keras.activations.softmax(y1)

    y2 = tf.matmul(x1, W2)
    loss = tf.reduce_sum(y2**2)

# Backward pass:
# (1) ∂L/∂x[t] = ∂L/∂y[t+1] W[t+1]T. Fetch ∂L/∂y[t+1] from upper layer.
loss_x1 = tf.matmul(tape.gradient(loss, y2), tf.transpose(W2))

# (2) ∂L/∂y[t] = ∂L/∂x[t] J[t]. Compute local gradients using autodiff.
loss_y1 = tf.matmul(loss_x1, tf.reshape(tape.jacobian(x1, y1), (3, 3)))

# (3) ∂L/∂W[t] = x[t-1]T ∂L/∂y[t]
loss_W1 = tf.matmul(tf.transpose(x0), loss_y1)

# (4) ∂L/∂b[t] = ∂L/∂y[t]
loss_b1 = loss_y1

print("(weights)")
print("TF autodiff:")
print(tape.gradient(loss, W1).numpy())
print("\nBP equations:")
print(loss_W1.numpy())

print("\n(biases)")
print("TF autodiff: ", tape.gradient(loss, b1).numpy())
print("BP equations:", loss_b1.numpy())
```

```
(weights)
TF autodiff:
[[-0.052407    0.10895263 -0.05654562]
 [-0.04306866  0.0895385  -0.04646983]
 [ 0.01126021 -0.02340965  0.01214944]
 [-0.03177317  0.0660555  -0.03428232]]

BP equations:
[[-0.052407    0.10895262 -0.05654562]
 [-0.04306867  0.08953848 -0.04646983]
 [ 0.01126021 -0.02340965  0.01214944]
 [-0.03177318  0.06605549 -0.03428232]]

(biases)
TF autodiff:  [[-0.03510599  0.07298434 -0.03787834]]
BP equations: [[-0.035106    0.07298433 -0.03787834]]
```


<!-- ## Autodifferentiation with PyTorch `autograd`

The `autograd` package allows automatic differentiation by building computational graphs on the fly every time we pass data through our model. Autograd tracks which data combined through which operations to produce the output. This allows us to take derivatives over ordinary imperative code. This functionality is consistent with the memory and time requirements outlined in above for BP.

<br>

**Backward for scalars.** Let $y = \mathbf x^\top \mathbf x = \sum_i {x_i}^2.$ In this example, we initialize a tensor `x` which initially has no gradient. Calling backward on `y` results in gradients being stored on the leaf tensor `x`.  -->

<!-- x = torch.arange(4, dtype=torch.float, requires_grad=True)
y = x.T @ x 

y.backward() 
(x.grad == 2*x).all() -->

<!-- **Backward for vectors.** Let $\mathbf y = g(\mathbf x)$ and let $\mathbf v$ be a vector having the same length as $\mathbf y.$ Then `y.backward(v)` implements   

$$\sum_i v_i \left(\frac{\partial y_i}{\partial x_j}\right)$$ 
  
resulting in a vector of same length as `x` that is stored in `x.grad`. Note that the terms on the right are the local gradients in backprop. Hence, if `v` contains backpropagated gradients of nodes that depend on `y`, then this operation gives us the backpropagated gradients with respect to `x`, i.e. setting $v_i = \frac{\partial \mathcal{L} }{\partial y_i}$ gives us the vector $\frac{\partial \mathcal{L} }{\partial x_j}.$ -->

<!-- x = torch.rand(size=(4,), dtype=torch.float, requires_grad=True)
v = torch.rand(size=(2,), dtype=torch.float)
y = x[:2]

# Computing the Jacobian by hand
J = torch.tensor(
    [[1, 0, 0, 0],
    [0, 1, 0, 0]], dtype=torch.float
)

# Confirming the above formula
y.backward(v)
(x.grad == v @ J).all() -->

<!-- **Locally disabling gradient tracking.** Disabling gradient computation is useful when computing values, e.g. accuracy, whose gradients will not be backpropagated into the network. To stop PyTorch from building computational graphs, we can put the code inside a `torch.no_grad()` context or inside a function with a `@torch.no_grad()` decorator.

Another technique is to use the `.detach()` method which returns a new tensor detached from the current graph but shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks. -->