# Demystifying Neural Networks 

---

# Autograd DAG

How can `autograd` actually apply the chain rule so many times?
It builds a *Directed Acyclic Graph* (DAG).

We will analyse one such graph, built for a tiny ANN.
A DAG like this grows very quickly,
so we will use only two layers.

In [1]:
import autograd.numpy as np
from autograd import grad

Since we will not actually train the ANN below,
we will define a very simple activation function: $y = 2x$.
The hyperbolic tangent is quite complex and would make the DAG
very long.

In [2]:
x = np.array([[0.3],
              [0.1],
              [0.5]])
y = np.array([[1.],
              [0.]])
w1 = np.array([[0.3,  0.1,  0.2],
               [0.2, -0.1, -0.1],
               [0.7,  0.5, -0.3],
               [0.5,  0.5, -0.5]])
w1b = np.array([[0.3],
                [0.2],
                [0.2],
                [0.3]])
w2 = np.array([[0.2,  0.3,  0.1, 0.1],
               [0.7, -0.2, -0.1, 0.3]])
w2b = np.array([[ 0.3],
                [-0.2]])
def act(x):
    return 2*x

We define the ANN function as usual and execute it against $\vec{x}$ and $\vec{y}$.
We are only interested in the gradients, not in the actual output of the ANN.

In [3]:
def netMSE(arg):
    x, w1, w1b, w2, w2b, y = arg
    y_hat = act(w2 @ act(w1 @ x + w1b) + w2b)
    return np.mean((y - y_hat)**2)

netMSE_grad = grad(netMSE)
grads = netMSE_grad([x, w1, w1b, w2, w2b, y])
for g in grads:
    print(g)

[[1.46144]
 [0.9392 ]
 [0.03264]]
[[ 0.9648   0.3216   1.608  ]
 [-0.0768  -0.0256  -0.128  ]
 [-0.06624 -0.02208 -0.1104 ]
 [ 0.42144  0.14048  0.7024 ]]
[[ 3.216 ]
 [-0.256 ]
 [-0.2208]
 [ 1.4048]]
[[0.928   0.3712  0.57536 0.464  ]
 [2.032   0.8128  1.25984 1.016  ]]
[[0.928]
 [2.032]]
[[-0.464]
 [-1.016]]


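These are the final gradients with respect to every argument of the function,
in the order in which the arguments were passed: `x`, `w1`, `w1b`, `w2`, `w2b` and `y`.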
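
To see what the chain rule amounts to for this network, the same gradients can be derived by hand with plain NumPy. This is only a sketch for comparison, not what `autograd` does internally; the intermediate names (`z1`, `h`, `z2` and the `d_*` variables) are made up for this example.

```python
import numpy as np

# Same values as in the cells above.
x = np.array([[0.3], [0.1], [0.5]])
y = np.array([[1.], [0.]])
w1 = np.array([[0.3,  0.1,  0.2],
               [0.2, -0.1, -0.1],
               [0.7,  0.5, -0.3],
               [0.5,  0.5, -0.5]])
w1b = np.array([[0.3], [0.2], [0.2], [0.3]])
w2 = np.array([[0.2,  0.3,  0.1, 0.1],
               [0.7, -0.2, -0.1, 0.3]])
w2b = np.array([[0.3], [-0.2]])

# Forward pass, keeping every intermediate value.
z1 = w1 @ x + w1b        # first layer, before activation
h = 2 * z1               # act(z1)
z2 = w2 @ h + w2b        # second layer, before activation
y_hat = 2 * z2           # act(z2)

# Backward pass: one chain rule step per forward operation, in reverse order.
d_y_hat = -2 * (y - y_hat) / y.size   # derivative of mean((y - y_hat)**2)
d_y = 2 * (y - y_hat) / y.size
d_z2 = 2 * d_y_hat                    # through act: d(2x)/dx = 2
d_w2 = d_z2 @ h.T
d_w2b = d_z2
d_h = w2.T @ d_z2
d_z1 = 2 * d_h                        # through act again
d_w1 = d_z1 @ x.T
d_w1b = d_z1
d_x = w1.T @ d_z1

for g in [d_x, d_w1, d_w1b, d_w2, d_w2b, d_y]:
    print(g)
```

The printed arrays should agree with the output of `netMSE_grad` above, up to floating point rounding.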

Below is the complete graph that was constructed
in order to compute these gradients.
The graph was constructed while the function executed.
Then, after the function finished executing, the graph was
walked backwards to calculate the gradients.

The IDs at the nodes of the graph increase when walking the graph
top to bottom and decrease when walking bottom to top.
`autograd` computes gradients in order from the biggest node ID
to the lowest node ID. This way one can be sure that all gradients needed
to compute the gradient at the current graph node have already been computed.
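
The ordering idea can be illustrated with a toy graph in pure Python. This is a minimal sketch and not `autograd`'s real implementation: the `Node` class and the `add`, `mul` and `backward` helpers are hypothetical names invented here, and only scalars are supported.

```python
class Node:
    """A value in the graph together with the nodes it was computed from."""
    next_id = 0

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, local_derivative_of_this_node_wrt_parent).
        self.parents = parents
        self.grad = 0.0
        self.id = Node.next_id      # IDs grow in execution order
        Node.next_id += 1


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def backward(output, nodes):
    output.grad = 1.0
    # Biggest ID first: every consumer of a node was created after it,
    # so a node's gradient is complete before it is passed on to its parents.
    for node in sorted(nodes, key=lambda n: n.id, reverse=True):
        for parent, local in node.parents:
            parent.grad += node.grad * local


a = Node(3.0)
b = Node(4.0)
c = mul(a, b)      # c = a * b
d = add(c, a)      # d = a * b + a
backward(d, [a, b, c, d])
print(a.grad, b.grad)   # 5.0 3.0
```

Here `a` is used twice, so its gradient is the sum of two contributions (`b` from the product and `1` from the sum). Processing nodes by decreasing ID guarantees both contributions have been accumulated before anything depending on `a.grad` would be computed.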

The computation at each node is performed using the *Vector-Jacobian Product*
(VJP) rule for the operation that was originally performed at the node.
Each operation that `autograd` can differentiate has a VJP rule.
For example, there are VJP rules for sum, subtraction, and even mean operations.
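
A rule of this kind takes the gradient arriving at a node's output and turns it into gradients for the node's inputs, which is a vector-Jacobian product. The sketch below writes a few such rules by hand in NumPy. The function names (`vjp_subtract`, `vjp_sum`, `vjp_mean`, `vjp_square`) are hypothetical and are not `autograd`'s own definitions; broadcasting is ignored for simplicity.

```python
import numpy as np

# In every rule, g is the gradient of the final result with respect to
# this operation's output; the return value is the gradient for the input(s).

def vjp_subtract(g, a, b):
    # out = a - b, with a and b of the same shape
    return g, -g

def vjp_sum(g, a):
    # out = np.sum(a): every element contributes with weight 1
    return g * np.ones_like(a)

def vjp_mean(g, a):
    # out = np.mean(a): every element contributes with weight 1 / a.size
    return g * np.ones_like(a) / a.size

def vjp_square(g, a):
    # out = a**2
    return g * 2 * a

# Walk the error term mean((y - y_hat)**2) backwards, one rule per operation.
y = np.array([[1.], [0.]])
# Network output for the weights above, worked out by hand for this example.
y_hat = np.array([[1.464], [1.016]])

diff = y - y_hat
g_squared = vjp_mean(1.0, diff**2)        # start with gradient 1 at the output
g_diff = vjp_square(g_squared, diff)
g_y, g_y_hat = vjp_subtract(g_diff, y, y_hat)
print(g_y)
```

The value of `g_y` should agree with the last array printed by `netMSE_grad` above, which is the gradient with respect to `y`.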

![graph-ann.svg](attachment:graph-ann.svg)

<div style="text-align:right;font-size:0.7em;">graph-ann.svg</div>

In summary: `autograd` builds a DAG and then walks it backwards,
applying the chain rule at each node.
It is this walk *backwards* that the term backpropagation refers to
in ANN training.