# Backpropagation on DAGs

In this notebook, we look at **backpropagation** (BP) on **directed acyclic computational graphs** (DAG). Our main result is that a single training step (consisting of both a forward and a backward pass through the network) has a time complexity that is linear in the network size. In the last section, we take a closer look at the implementation of `.backward` in PyTorch.

**Readings**
* [Evaluating $\nabla f(x)$ is as fast as $f(x)$](https://timvieira.github.io/blog/post/2016/09/25/evaluating-fx-is-as-fast-as-fx/)
* [Back-propagation, an introduction](http://www.offconvex.org/2016/12/20/backprop/)

## Gradient descent on the loss surface

Training neural nets is an optimization problem, i.e. we want to minimize a function $\ell$ called the **loss function** over the data, which quantifies the difference between the network prediction and the ground truth. In theory, $\ell\colon \mathbb R \times \mathbb R \to [0, +\infty)$ can be any almost everywhere differentiable function where $\ell$ is close to zero whenever its arguments are close to each other. In practice, the function $\ell$ is a surrogate objective that we minimize as a proxy for some metric that we care about (in expectation) but which cannot be optimized directly, e.g. accuracy. If we assume that an underlying distribution generates the data, then the expected loss with the current weights is

$$\mathcal L  (\mathbf w) = \mathbb E_{(\mathbf x, y)} \ell(y, f(\mathbf x; \mathbf w)).$$ 

In practical deep learning tasks, where we only have samples $\mathcal X = \{(\mathbf x_i, y_i )\}_{i=1}^N$ from the distribution, we use the following approximation as our optimization objective:
  
$$\mathcal L(\mathbf w; \mathcal X) = \frac{1}{N}\sum_{i=1}^N \ell (y_i, f(\mathbf x_i; \mathbf w)).$$

```{margin}
**Loss surfaces from samples**
```

Observe that this forms a surface in $\mathbb R^d \times \mathbb R$ where $d$ is the number of parameters of the network, with the current parameter setting being a point $(\mathbf w, \mathcal L(\mathbf w; \mathcal X))$ on this surface which will generally vary with $\mathcal X.$ However, we expect these surfaces to be similar for large $N$ since the points are sampled from the same distribution. To find a minimum of this surface, we use variants of **gradient descent** characterized by the update rule

$$\mathbf w \leftarrow \mathbf w - \varepsilon \nabla_{\mathbf w} \mathcal L$$ 

where $-\nabla_{\mathbf w} \mathcal L$ is the direction of steepest descent at $\mathbf w$ and $\varepsilon > 0$ is some positive number called the **learning rate**. Note that we can compute the gradient as the average of the per-example gradients $\nabla_{\mathbf w}\, \ell(y_i, f(\mathbf x_i; \mathbf w)),$ each evaluated at the point $\mathbf w$ on the loss surface generated by the data point $(\mathbf x_i, y_i).$
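
To make the update rule concrete, here is a minimal sketch in plain Python. It assumes a hypothetical one-parameter model $f(x; w) = wx$ with squared-error loss and toy data generated from $y = 2x$; none of these choices come from the text above, and the helper names are made up for illustration. The full gradient is computed as the average of per-example gradients.

```python
# Toy data generated from y = 2x (an assumption made for this sketch).
data = [(x, 2.0 * x) for x in [-1.0, 0.5, 1.0, 2.0]]

def per_sample_grad(w, x, y):
    """Gradient of the squared error (y - w*x)**2 with respect to w."""
    return -2.0 * x * (y - w * x)

def full_grad(w, data):
    """Gradient of the averaged loss: the mean of per-example gradients."""
    return sum(per_sample_grad(w, x, y) for x, y in data) / len(data)

def gradient_descent(w, data, lr=0.1, steps=100):
    """Repeatedly apply the update w <- w - lr * grad."""
    for _ in range(steps):
        w = w - lr * full_grad(w, data)
    return w

w_final = gradient_descent(w=0.0, data=data)
print(w_final)  # should be very close to the generating slope 2.0
```

With this learning rate the error in `w` shrinks by a constant factor at every step, so the iterate approaches the slope used to generate the data.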

```{margin}
**The need for efficient BP.**
```

Since $\nabla_\mathbf w \mathcal L$ consists of partial derivatives for each weight in the network, this can be really large: millions, or even billions of entries for SOTA models. How do we compute these derivatives efficiently? Because we have to compute the gradient at the current state of the network, we first perform a forward pass to compute all node values given $\mathbf w$ up to the final node. This is followed by a backward pass where we compute every partial derivative by a clever use of the chain rule, recursively computing the partial derivatives from the output to the input layer. Both forward and backward passes have to be implemented efficiently for this to be usable in practice.

## Backpropagation on Computational Graphs

```{margin}
**Forward pass** 
```

A neural network can be modelled as a **directed acyclic graph** (DAG) of compute and parameter nodes that implements a function $f$ and can be extended to implement the calculation of the loss value for each training example and parameter values. To compute $f(\mathbf x),$ the values for each node are calculated from bottom to top starting from the input nodes. Every value in the nodes is stored to avoid recomputing any known value. Assuming each activation and each arithmetic operation between weights, biases and compute nodes takes constant time, one forward pass takes $O(V + E)$ calculations where $V$ is the number of activations and $E$ is the number of trainable parameters of the network, i.e. the network size. The memory complexity of the whole operation is of the same order, since every node value is stored.
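
The effect of storing node values can be seen in a small sketch. We assume a hypothetical chain-shaped DAG in which node $i$ adds the value of node $i-1$ to itself, so each node has two incoming edges from the same parent. Recomputing the parent for every edge takes exponentially many calls, whereas a single sweep in topological order with stored values touches each node and edge once.

```python
def naive_eval(i, x, counter):
    """Recompute the parent once per incoming edge (no stored values)."""
    counter[0] += 1
    if i == 0:
        return x
    return naive_eval(i - 1, x, counter) + naive_eval(i - 1, x, counter)

def cached_forward(n, x):
    """Visit nodes once in topological order, storing every value."""
    values = [x]
    edge_visits = 0
    for i in range(1, n + 1):
        values.append(values[i - 1] + values[i - 1])
        edge_visits += 2  # two edges from node i-1 into node i
    return values, edge_visits

n = 10
counter = [0]
naive_out = naive_eval(n, 1.0, counter)
values, edge_visits = cached_forward(n, 1.0)

print(counter[0])               # number of calls made by the naive recursion
print(len(values), edge_visits) # nodes evaluated and edges traversed
```

Both functions return the same output value; only the amount of work differs.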

```{margin}
**Backward pass** 
```

During the backward pass, we divide gradients into two groups: **local gradients** ${\frac{\partial{u}}{\partial w}}$ obtained when perturbing a node $w$ that feeds into an adjacent compute node $u$, and **backpropagated gradients** of the form ${\frac{\partial{\mathcal L}}{\partial u}}$ for a node ${u}.$ Our goal is to calculate the backpropagated gradient of the loss with respect to parameter nodes. Note that parameter nodes have zero fan-in. BP proceeds recursively. First, $\frac{\partial{\mathcal L}}{\partial \mathcal L} = 1$ is stored as the gradient of the node which computes the loss value. Suppose ${\frac{\partial{\mathcal L}}{\partial u}}$ are stored for each compute node $u$ in the upper layer. Then, after computing local gradients ${\frac{\partial{u}}{\partial w}}$, the backpropagated gradients ${\frac{\partial{\mathcal L}}{\partial w}}$ for compute nodes $w$ can be calculated via the chain rule:

$${ \frac{\partial\mathcal L}{\partial w} } = \sum_{ {u} }\left( {{\frac{\partial\mathcal L}{\partial u}}} \right)\left( {{\frac{\partial{u}}{\partial w}}} \right).$$

```{margin}
BP is a useful tool for understanding how derivatives flow through a model. This can be extremely helpful in reasoning about why some models are difficult to optimize. Classic examples are vanishing or exploding gradients as we go into deeper layers of the network.
```

This continues the "flow" of gradients to the current layer. The process ends on nodes with zero fan-in. Note that the partial derivatives are evaluated on the current network state; these values are stored during the forward pass which precedes the backward pass. Analogously, all backpropagated gradients are stored in each compute node for use by the next layer. On the other hand, there is no need to store local gradients; these are computed as needed. Hence, it suffices to compute all gradients with respect to compute nodes to get all gradients with respect to the weights of the network.
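
As a small sanity check of the chain rule with fan-out greater than one, consider a hypothetical scalar graph where a node $w$ feeds two nodes $u_1 = w^2$ and $u_2 = \sin w$, and the loss is $\mathcal L = u_1 u_2$. These functions are chosen only for illustration. The sketch below sums the two contributions and compares the result with a central finite difference.

```python
import math

def loss(w):
    u1 = w * w          # first node that consumes w
    u2 = math.sin(w)    # second node that consumes w
    return u1 * u2

def backprop_grad(w):
    # Forward pass: compute and keep the node values.
    u1, u2 = w * w, math.sin(w)
    # Backpropagated gradients dL/du1 and dL/du2 for L = u1 * u2.
    dL_du1, dL_du2 = u2, u1
    # Local gradients du1/dw and du2/dw at the current value of w.
    du1_dw, du2_dw = 2.0 * w, math.cos(w)
    # Chain rule: sum over every node u that depends on w.
    return dL_du1 * du1_dw + dL_du2 * du2_dw

w0 = 0.7
h = 1e-6
numeric = (loss(w0 + h) - loss(w0 - h)) / (2 * h)  # central difference
analytic = backprop_grad(w0)
print(analytic, numeric)  # the two numbers should agree closely
```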
    

```{figure} ../img/backprop-compgraph.png
---
width: 30em
name: backprop-compgraph
---
BP on a generic comp. graph with fan out > 1 on node <code>y</code>. Each backpropagated gradient computation is stored in the corresponding node. For node <code>y</code> to calculate the backpropagated gradient we have to sum over the two incoming gradients which can be implemented using matrix multiplication of the gradient vectors.
```


$\phantom{3}$

**Backpropagation algorithm.** We now know how to compute each backpropagated gradient. This is implemented as `u.backward()` for node `u`, which sends its gradient $\frac{\partial \mathcal L}{\partial u}$ to all its parent nodes, i.e. nodes on the lower layer. We now write the complete algorithm:

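```python
# Pseudocode: `compute`, `params`, and `inputs` are lists of graph nodes,
# with `compute` sorted in topological order (loss node last).

def Forward():
    # Evaluate every compute node once; each node stores its value.
    for c in compute:
        c.forward()

def Backward(loss):
    # Reset all stored gradients before accumulating new ones.
    for c in compute: c.grad = 0
    for c in params:  c.grad = 0
    for c in inputs:  c.grad = 0
    loss.grad = 1  # dL/dL = 1

    # Visit compute nodes in reverse topological order; each node adds
    # its contribution to the gradients of its parent nodes.
    for c in compute[::-1]:
        c.backward()

def SGD(eta):
    # Gradient descent update with learning rate eta.
    for w in params:
        w.value -= eta * w.grad
```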
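
The algorithm above leaves the node classes unspecified. The following is one possible runnable sketch, not the implementation used by any particular library. The classes `Node`, `Mul`, `Add`, and `SquaredError` are hypothetical, and the three functions take the node lists as arguments instead of relying on globals. The graph computes $\mathcal L = (wx + b - y)^2$ for a single example.

```python
class Node:
    """Leaf node (input or parameter): stores a value and a gradient."""
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0

class Mul(Node):
    def __init__(self, a, b):
        super().__init__()
        self.a, self.b = a, b
    def forward(self):
        self.value = self.a.value * self.b.value
    def backward(self):
        # Backpropagated gradient times local gradient, sent to each parent.
        self.a.grad += self.grad * self.b.value
        self.b.grad += self.grad * self.a.value

class Add(Node):
    def __init__(self, a, b):
        super().__init__()
        self.a, self.b = a, b
    def forward(self):
        self.value = self.a.value + self.b.value
    def backward(self):
        self.a.grad += self.grad
        self.b.grad += self.grad

class SquaredError(Node):
    def __init__(self, pred, target):
        super().__init__()
        self.pred, self.target = pred, target
    def forward(self):
        self.value = (self.pred.value - self.target.value) ** 2
    def backward(self):
        diff = self.pred.value - self.target.value
        self.pred.grad += self.grad * 2.0 * diff
        self.target.grad -= self.grad * 2.0 * diff

def forward(compute):
    for c in compute:                 # topological order
        c.forward()

def backward(compute, params, inputs, loss):
    for c in compute + params + inputs:
        c.grad = 0.0
    loss.grad = 1.0
    for c in reversed(compute):       # reverse topological order
        c.backward()

def sgd(params, eta):
    for w in params:
        w.value -= eta * w.grad

# Build the graph for one training example (x, y) = (3, 7).
x, y = Node(3.0), Node(7.0)
w, b = Node(1.0), Node(0.0)
z = Mul(w, x)
pred = Add(z, b)
loss = SquaredError(pred, y)
compute, params, inputs = [z, pred, loss], [w, b], [x, y]

forward(compute)
loss_before = loss.value
backward(compute, params, inputs, loss)
grad_w, grad_b = w.grad, b.grad
sgd(params, eta=0.01)
forward(compute)
loss_after = loss.value
print(loss_before, grad_w, grad_b, loss_after)
```

Each `backward` method only needs the node's own stored gradient and the stored values of its parents, which is the locality that the chain rule formula expresses.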

```{admonition} BP equations for MLPs

Consider an MLP which is clearly a computational DAG. Let ${z_j}^{[t]} = \sum_k {w_{jk}}^{[t]}{a_k}^{[t-1]} + {b_j}^{[t]}$ and ${a_j}^{[t]} = \phi^{[t]}({\mathbf z}^{[t]})_j$ be the values of compute nodes at the $t$-th layer of the network. The backpropagated gradients for the compute nodes of the current layer are given by
    
$$\begin{aligned}
        \dfrac{\partial \mathcal L}{\partial {a_j}^{[t]}} 
        &= \sum_{k}\dfrac{\partial \mathcal L}{\partial {z_k}^{[t+1]}} \dfrac{\partial {z_k}^{[t+1]}}{\partial {a_j}^{[t]}} = \sum_{k}\dfrac{\partial \mathcal L}{\partial {z_k}^{[t+1]}} {w_{kj}}^{[t+1]}
    \end{aligned}$$

and

$$\begin{aligned}
    \dfrac{\partial \mathcal L}{\partial {z_j}^{[t]}} 
    &= \sum_{l}\dfrac{\partial \mathcal L}{\partial {a_l}^{[t]}} \dfrac{\partial {a_l}^{[t]}}{\partial {z_j}^{[t]}}.
\end{aligned}$$

This sum typically reduces to a single term for activations such as ReLU, but not for activations which depend on multiple preactivations such as softmax. Similarly, the backpropagated gradients for the parameter nodes (weights and biases) are given by

$$\begin{aligned}
    \dfrac{\partial \mathcal L}{\partial {w_{jk}}^{[t]}} 
    &= \dfrac{\partial \mathcal L}{\partial {z_j}^{[t]}} \dfrac{\partial {z_j}^{[t]}}{\partial {w_{jk}}^{[t]}} = \dfrac{\partial \mathcal L}{\partial {z_j}^{[t]}} {a^{[t-1]}_k} \\
    \text{and}\qquad\dfrac{\partial \mathcal L}{\partial {b_{j}}^{[t]}} 
    &= \dfrac{\partial \mathcal L}{\partial {z_j}^{[t]}} \dfrac{\partial {z_j}^{[t]}}{\partial {b_{j}}^{[t]}} = \dfrac{\partial \mathcal L}{\partial {z_j}^{[t]}}.
\end{aligned}$$

Backpropagated gradients for compute nodes are stored until the weights are updated, e.g. $\frac{\partial \mathcal L}{\partial {z_k}^{[t+1]}}$ are retrieved in the compute nodes of the $t+1$-layer to compute gradients in the $t$-layer. On the other hand, the local gradients $\frac{\partial {a_k}^{[t]}}{\partial {z_j}^{[t]}}$ are computed directly using autodifferentiation and evaluated with the current network state obtained during forward pass.
```
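
The equations above can be checked numerically. The sketch below assumes a tiny hypothetical MLP (two inputs, two `tanh` hidden units, one linear output, loss $\frac{1}{2}(z^{[2]} - y)^2$) with arbitrary weights, all chosen for illustration. It computes the gradients layer by layer as in the admonition and compares the weight gradients of the first layer with central finite differences.

```python
import math

# Arbitrary small network and data (assumptions for this sketch).
W1 = [[0.1, -0.2], [0.4, 0.3]]
b1 = [0.0, 0.1]
W2 = [[0.5, -0.6]]
b2 = [0.2]
a0 = [1.0, 2.0]
y = [1.0]

def affine(W, b, a):
    """z_j = sum_k W[j][k] * a[k] + b[j]."""
    return [sum(W[j][k] * a[k] for k in range(len(a))) + b[j]
            for j in range(len(b))]

def forward(W1, b1, W2, b2, a0):
    z1 = affine(W1, b1, a0)
    a1 = [math.tanh(z) for z in z1]
    z2 = affine(W2, b2, a1)
    return z1, a1, z2

def loss_fn(z2, y):
    return 0.5 * sum((z - t) ** 2 for z, t in zip(z2, y))

def backward(W1, b1, W2, b2, a0, y):
    z1, a1, z2 = forward(W1, b1, W2, b2, a0)
    dz2 = [z - t for z, t in zip(z2, y)]                      # dL/dz2
    da1 = [sum(dz2[k] * W2[k][j] for k in range(len(dz2)))    # dL/da1
           for j in range(len(a1))]
    dz1 = [da1[j] * (1.0 - a1[j] ** 2) for j in range(len(a1))]  # tanh'
    dW2 = [[dz2[j] * a1[k] for k in range(len(a1))] for j in range(len(dz2))]
    dW1 = [[dz1[j] * a0[k] for k in range(len(a0))] for j in range(len(dz1))]
    return dW1, dz1[:], dW2, dz2[:]   # bias gradients equal dL/dz

def numeric_dW1(j, k, h=1e-6):
    """Central finite difference for dL/dW1[j][k]."""
    Wp = [row[:] for row in W1]
    Wm = [row[:] for row in W1]
    Wp[j][k] += h
    Wm[j][k] -= h
    lp = loss_fn(forward(Wp, b1, W2, b2, a0)[2], y)
    lm = loss_fn(forward(Wm, b1, W2, b2, a0)[2], y)
    return (lp - lm) / (2 * h)

dW1, db1, dW2, db2 = backward(W1, b1, W2, b2, a0, y)
max_err = max(abs(dW1[j][k] - numeric_dW1(j, k))
              for j in range(2) for k in range(2))
print(max_err)  # should be tiny if the equations are implemented correctly
```

Since `tanh` acts elementwise, the sum over $l$ in the equation for $\frac{\partial \mathcal L}{\partial {z_j}^{[t]}}$ reduces to a single term here.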

<br>

We highlight two important properties of the algorithm which make it the practical choice for training huge neural networks:

* **Modularity.** The dependence only on nodes belonging to the upper layer suggests a modularity in the computation, e.g. we can connect DAG subnetworks with possibly distinct network architectures by only connecting nodes that are exposed between layers. 

<br>

* **Bottleneck and complexity.** Assuming each computation of a local derivative takes constant time (e.g. with autodifferentiation), the backward pass requires $O(V + E)$ computations. This takes into account gradient flows across $E$ trainable parameters (including biases) and $V$ gradient flows across activations. Since these computations are largely matrix multiplications, fast matrix multiplication, e.g. on dedicated hardware such as GPUs, is needed to make neural networks train fast. Finally, since gradients are stored in nodes, the memory complexity should also depend on the network size (as well as the size of the gradients).

$\phantom{3}$

The following two figures show BP on a logistic regression model.

```{figure} ../img/backprop-compgraph2.png
---
width: 35em
name: backprop-compgraph2
---
Backprop with weights for a single layer neural network with sigmoid activation and cross-entropy loss. Observe the gradient flowing from node <code>L</code> to the node <code>w0</code>.
```

```{figure} ../img/backprop-compgraph3.png
---
width: 25em
name: backprop-compgraph3
---
Backprop with weights for a single layer neural network with sigmoid activation and cross-entropy loss. Local gradients require current values of the nodes, while backpropagated gradients are accessed from the layer above. Node <code>u</code> which has fan-in > 1 performs chain rule on the backpropagated gradients.
```

## Autodifferentiation with PyTorch `autograd`

The `autograd` package allows automatic differentiation by building computational graphs on the fly every time we pass data through our model. Autograd tracks which data are combined through which operations to produce the output. This allows us to take derivatives over ordinary imperative code. This functionality is consistent with the memory and time requirements outlined above for BP.

<br>

**Backward for scalars.** Let $y = \mathbf x^\top \mathbf x = \sum_i {x_i}^2.$ In this example, we initialize a tensor `x` which initially has no gradient. Calling backward on `y` results in gradients being stored on the leaf tensor `x`. 

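```python
import torch

x = torch.arange(4, dtype=torch.float, requires_grad=True)
y = x @ x  # dot product of a 1-D tensor with itself, i.e. sum of squares

y.backward()  # populates x.grad with dy/dx = 2x
(x.grad == 2*x).all()  # tensor(True)
```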

**Backward for vectors.** Let $\mathbf y = g(\mathbf x)$ and let $\mathbf v$ be a vector having the same length as $\mathbf y.$ Then `y.backward(v)` implements   

$$\sum_i v_i \left(\frac{\partial y_i}{\partial x_j}\right)$$ 
  
resulting in a vector of the same length as `x` that is stored in `x.grad`. Note that the terms on the right are the local gradients in backprop. Hence, if `v` contains backpropagated gradients of nodes that depend on `y`, then this operation gives us the backpropagated gradients with respect to `x`, i.e. setting $v_i = \frac{\partial \mathcal{L} }{\partial y_i}$ gives us the vector $\frac{\partial \mathcal{L} }{\partial x_j}.$

```python
import torch

x = torch.rand(size=(4,), dtype=torch.float, requires_grad=True)
v = torch.rand(size=(2,), dtype=torch.float)
y = x[:2]

# Computing the Jacobian of y with respect to x by hand
J = torch.tensor(
    [[1, 0, 0, 0],
     [0, 1, 0, 0]], dtype=torch.float
)

# Confirming the above formula: x.grad equals v @ J
y.backward(v)
(x.grad == v @ J).all()  # tensor(True)
```
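
The same check can be done for a function whose Jacobian depends on `x`. The sketch below uses a hypothetical map $g(\mathbf x) = (x_0 x_1,\; x_1 + x_2)$, chosen only for illustration, and obtains the Jacobian from `torch.autograd.functional.jacobian` instead of writing it by hand.

```python
import torch

def g(t):
    # Hypothetical vector-valued function from R^3 to R^2.
    return torch.stack([t[0] * t[1], t[1] + t[2]])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.5, -1.0])

y = g(x)
y.backward(v)  # accumulates sum_i v_i * dy_i/dx_j into x.grad

# Jacobian of g at the same point, with shape (2, 3).
J = torch.autograd.functional.jacobian(g, x.detach())
expected = v @ J

print(x.grad, expected)  # the two vectors should match
```

Here `y.backward(v)` never materializes the full Jacobian; it only computes the vector-Jacobian product, which is all that backpropagation needs.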

**Locally disabling gradient tracking.** To stop PyTorch from building computational graphs, we can put the code inside a `with torch.no_grad()` block. In this mode, the result of every computation will have `requires_grad=False`, even when the inputs have `requires_grad=True`. 
<br><br>
Another method is to use the `.detach()` method which returns a new tensor detached from the current graph but sharing the same storage with the original one. In-place modifications on either of them will be visible in the other, and may trigger errors in correctness checks. Disabling gradient computation is useful when computing values, e.g. accuracy, whose gradients will not be backpropagated into the network.
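
Both mechanisms can be seen in a short sketch. The tensor names are arbitrary and the example only illustrates the behavior described above.

```python
import torch

w = torch.ones(3, requires_grad=True)

# Inside no_grad, results do not track gradients even though w does.
with torch.no_grad():
    out = (w * 2).sum()
out_tracks_grad = out.requires_grad

# detach() returns a tensor without gradient tracking that shares storage.
d = w.detach()
d_tracks_grad = d.requires_grad
same_storage = d.data_ptr() == w.data_ptr()

# An in-place write through the detached tensor is visible in w.
d[0] = 5.0
w0_after = w[0].item()

print(out_tracks_grad, d_tracks_grad, same_storage, w0_after)
```

Because the storage is shared, writing to `d` changes the values seen through `w` as well, which is why in-place edits on detached tensors need care.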