# Demystifying Neural Networks 

---

![autograd.svg](attachment:autograd.svg)

<div style="text-align:right;font-size:0.7em;">autograd.svg</div>

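The `autograd` library computes gradients of a function at a specific point,
as long as the function is built from operations that `autograd` knows how to differentiate.
It does so by taking advantage of the **chain rule**.
Given $f(g(h(x)))$, the chain rule says that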

$$
\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}
$$

I always had trouble getting to grips with this equation;
it is much easier to understand with an example.
Say that we have

$$
f(x) = 3x^2 + 7
$$

Taking the derivative of this is not particularly hard:

$$
\frac{df}{dx} = 6x
$$

Yet, let's take a different look at $f$.
Let's split $f$ into a handful of functions that perform only basic operations.

$$
g(v) = v^2 \\
h(w) = 3w \\
k(z) = 7 + z \\
$$

The derivatives here are even simpler.

$$
\frac{dg}{dv} = 2v \\
\frac{dh}{dw} = 3 \\
\frac{dk}{dz} = 1 \\
$$

Now, notice that:

$$
f(x) = 3x^2 + 7 = k(h(g(x)))
$$

And by the chain rule we have

$$
\frac{df}{dx} = \frac{dk}{dh} \cdot \frac{dh}{dg} \cdot \frac{dg}{dx}
$$

But we already computed every one of those, so

$$
\frac{df}{dx} = \frac{dk}{dh} \cdot \frac{dh}{dg} \cdot \frac{dg}{dx} =
1 \cdot 3 \cdot 2x = 6x
$$

The correct result.
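
The calculation above can be checked numerically.
The following is a minimal sketch in plain Python (no `autograd` needed);
the helper names `df_chain` and `df_numeric` are ours, introduced only for illustration.

```python
def g(v):
    return v ** 2


def h(w):
    return 3 * w


def k(z):
    return 7 + z


def f(x):
    return k(h(g(x)))


def df_chain(x):
    # local derivatives of the three basic operations
    dg_dx = 2 * x   # dg/dv evaluated at v = x
    dh_dg = 3       # dh/dw is constant
    dk_dh = 1       # dk/dz is constant
    return dk_dh * dh_dg * dg_dx


def df_numeric(x, eps=1e-6):
    # central finite difference, used only as an independent check
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = 2.0
print(f(x), df_chain(x), df_numeric(x))
```

At $x = 2$ the chain rule product gives $12$, which is $6x$, and the finite difference agrees up to rounding.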

The trick is that there is a limited number of basic operations.
If we keep track of which basic operations are performed whilst we execute $f$,
then we can just multiply together the derivatives of these basic operations,
and we have the derivative of $f$.

That's literally what `autograd` does.
It builds a directed graph of operations whilst they execute and
then evaluates the derivative by multiplying together the derivatives
of the basic operations.
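
To make the idea of recording operations concrete, here is a toy sketch in plain Python.
It is not how `autograd` is implemented internally: the `Var` class and the `derivative` function
are hypothetical names of ours, and only addition and multiplication are supported.
Every operation stores its inputs together with the local derivative with respect to each input,
which is enough to rebuild the chain rule afterwards.

```python
class Var:
    """A value that remembers which operation produced it."""

    def __init__(self, value, parents=()):
        self.value = value
        # tuples of (parent Var, local derivative of self w.r.t. that parent)
        self.parents = parents

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def derivative(out, wrt):
    """Sum, over every path from `out` back to `wrt`, the product of local derivatives."""
    if out is wrt:
        return 1.0
    return sum(local * derivative(parent, wrt) for parent, local in out.parents)


x = Var(2.0)
f = 3 * (x * x) + 7   # the same f(x) = 3x^2 + 7, built from basic operations
print(f.value, derivative(f, x))   # prints: 19.0 12.0
```

The recursion walks the recorded graph from the result back to `x`, multiplying local derivatives along the way.
A real library avoids revisiting shared nodes, but the bookkeeping idea is the same.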

The tricky bit comes when matrices come into play.
One could take the derivative of every sum and multiplication performed during
a matrix multiplication, but there are better ways.
The derivatives of vector and matrix operations can be collected into a
*Jacobian matrix*: the matrix of the partial derivatives of every output
with respect to every input.
The chain rule then becomes a product of Jacobians, and these grow quickly
with the size of the matrices involved.
Since we only need the derivative at a single point, we never have to build
the full Jacobians; it is enough to multiply them by a vector as we go.

If we accumulate the product from left to right, starting from the output side,
each step is a *Vector Jacobian Product* (VJP); this is reverse mode differentiation,
the idea behind backpropagation.
If we accumulate it from right to left, starting from the input side,
each step is a *Jacobian Vector Product* (JVP); this is forward mode differentiation.
But enough about VJPs and JVPs, we will see more of them later.
For the time being let's just say that matrix multiplication has
optimizations to the chain rule.
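
The effect of the multiplication order can be seen on a toy chain of three Jacobians.
This is a minimal sketch in plain Python with made-up numbers; the `matmul` helper and
the matrices `A`, `B` and `C` are ours for illustration and say nothing about how
`autograd` stores its derivatives.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


# Made-up Jacobians of three chained steps: R^2 -> R^3 -> R^2 -> R^1
A = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]   # 3x2, innermost step
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]     # 2x3, middle step
C = [[2.0, 1.0]]                           # 1x2, outermost step (scalar output)

# Left to right: the running product is always a row vector (VJP style).
row = matmul(C, B)             # 1x3
vjp_result = matmul(row, A)    # 1x2, the full gradient

# Right to left without a vector: B @ A is a full 2x2 matrix first.
full = matmul(B, A)            # 2x2
other_order = matmul(C, full)  # 1x2, same numbers

# Right to left with a column vector (JVP style): one input direction per pass.
v = [[1.0], [0.0]]
jvp_result = matmul(C, matmul(B, matmul(A, v)))   # 1x1

print(vjp_result, other_order, jvp_result)
```

Both orders give the same gradient, as matrix multiplication is associative.
With a scalar output the left to right order never holds anything bigger than a row vector
and yields the whole gradient in one pass, whereas the vector on the right yields
one directional derivative per pass.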

For now, let's build an ANN using `autograd`.
For a start, get the data.

In [1]:
import pandas as pd
df = pd.read_csv('./pulsars_tuned.csv')
X, y = df.values[:, :-1], df['label'].values
print(X.shape, y.shape)

(3278, 8) (3278,)


Modify the labels the same way as for the `pytorch` ANN, with one column per class.

In [2]:
import autograd.numpy as np
# one column per class: [is class 0, is class 1]
y = np.c_[y == 0, y == 1].astype(float)
print(y.shape)

(3278, 2)
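
To see what the `np.c_[y == 0, y == 1]` line does to the labels,
here is a small sketch in plain Python on made-up labels.
The `one_hot` helper is ours, it is not part of `autograd` or `numpy`.

```python
def one_hot(labels):
    # one row per label, one column per class: [is class 0, is class 1]
    return [[float(v == 0), float(v == 1)] for v in labels]


toy_labels = [0, 1, 1, 0]   # made-up labels, not the pulsar data
print(one_hot(toy_labels))
# prints: [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```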


With `autograd` we need to write the ANN evaluation ourselves.
We will separate the evaluation from the MSE,
since we will need both eventually.

In [3]:
from autograd import grad


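def net_exec(weights, X):
    # weights alternates weight matrices and bias columns: [W0, b0, W1, b1, ...]
    W = weights[::2]
    Wb = weights[1::2]
    # samples go in columns so that each layer is w @ y_hat + b
    y_hat = X.T
    for w, b in zip(W, Wb):
        y_hat = np.tanh(w @ y_hat + b)
    # back to one sample per row
    return y_hat.T


def netMSE(weights, X, y):
    y_hat = net_exec(weights, X)
    # mean over every entry, i.e. over samples and output columns
    return np.mean(((y_hat - y)**2))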


netMSE_grad = grad(netMSE)
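
In equation form, and numbering the layers $1$ to $L$ (the indices are our notation,
they do not appear in the code), `net_exec` and `netMSE` compute

$$
a_0 = X^T, \qquad a_l = \tanh(W_l a_{l-1} + b_l), \qquad \hat{y} = a_L^T
$$

$$
\mathrm{MSE} = \frac{1}{N C} \sum_{i=1}^{N} \sum_{c=1}^{C} (\hat{y}_{ic} - y_{ic})^2
$$

where $N$ is the number of samples and $C$ the number of output columns,
since `np.mean` averages over every entry of the squared difference.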

Generate the ANN itself, i.e. the matrices.

In [4]:
def layers(neurons):
    weights = []
    for nl, nr in zip(neurons, neurons[1:]):
        weights.append(np.random.normal(0, 1/np.sqrt(nr+nl), (nr, nl)))
        weights.append(np.random.normal(0, 1/np.sqrt(nr+nl), (nr, 1)))
    return weights


weights = layers([8, 25, 10, 2])
print(list(map(lambda x: x.shape, weights)))

[(25, 8), (25, 1), (10, 25), (10, 1), (2, 10), (2, 1)]


Check if it actually works.

In [5]:
print(net_exec(weights, X))

[[-0.48216253 -0.53857301]
 [-0.44014461 -0.67908871]
 [-0.41917221 -0.68076317]
 ...
 [-0.40235887 -0.55211719]
 [-0.15826022 -0.63573856]
 [-0.24043383 -0.68047183]]


And evaluate the gradient with the `autograd` evaluator built earlier;
we get one gradient per weight matrix, with matching shapes.

In [6]:
grads = netMSE_grad(weights, X, y)
print(list(map(lambda x: x.shape, grads)))

[(25, 8), (25, 1), (10, 25), (10, 1), (2, 10), (2, 1)]


Now we will use SGD the same way as we did with `pytorch`.
The only difference is that the ANN execution
and the gradient evaluation are separate functions.
This is just an artifact of the API differences between `pytorch` and `autograd`.
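
Written out, each iteration of the loop below samples a batch $(X_B, y_B)$
and applies the usual gradient descent update to every entry $w_j$ of `weights`,
with $\eta$ standing for `learning_rate` (the symbols are our notation):

$$
w_j \leftarrow w_j - \eta \, \frac{\partial\, \mathrm{MSE}(w, X_B, y_B)}{\partial w_j}
$$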

In [7]:
weights = layers([8, 25, 10, 2])
learning_rate = 0.01
batch = 100
for i in range(1000):
    idx = np.random.randint(0, len(y), batch)
    X_sample, y_sample = X[idx], y[idx]
    grads = netMSE_grad(weights, X_sample, y_sample)
    for j in range(len(weights)):
        weights[j] = weights[j] - grads[j]*learning_rate
    if (i+1)%50 == 0:
        mse = np.mean((y_sample - net_exec(weights, X_sample))**2)
        print(mse)

0.14983743872309568
0.10996737432348441
0.098450819481595
0.08002426657662799
0.09149217213971657
0.11733902818020667
0.13478658993492978
0.09546723540571865
0.07010690373065195
0.07906329737137129
0.07921059969386038
0.07342003749436164
0.04995424394384296
0.10770777075212713
0.0761222931599651
0.08486029315839973
0.044647727285919286
0.08137648751941416
0.08702051952086627
0.06249052636668296


And evaluate whether the performance is similar to the `pytorch` ANN.

In [8]:
y_hat = np.argmax(net_exec(weights, X), axis=1)
y_true = df['label'].values
print(sum(y_hat[y_true == 1] == y_true[y_true == 1])/sum(y_true == 1))
print(sum(y_hat[y_true == 0] == y_true[y_true == 0])/sum(y_true == 0))
print(sum(y_hat == y_true)/len(y_true))

0.9072605247101891
0.9145820622330689
0.9109212934716291
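
The three numbers are, in order, the fraction of class 1 samples predicted correctly,
the fraction of class 0 samples predicted correctly, and the overall accuracy.
The same quantities can be sketched in plain Python on made-up labels;
the helpers `class_recall` and `accuracy` are ours, introduced only for illustration.

```python
def class_recall(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


def accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels, made up for illustration (not the pulsar data)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 0, 1, 0, 1]
print(class_recall(y_true_toy, y_pred_toy, 1))   # prints: 0.75
print(class_recall(y_true_toy, y_pred_toy, 0))   # prints: 0.75
print(accuracy(y_true_toy, y_pred_toy))          # prints: 0.75
```

Reporting the per-class fractions next to the accuracy is useful when one class is rarer than the other,
because a high accuracy alone can hide poor results on the rare class.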
