# Demystifying Neural Networks 

---

# Pulsars

Until now we have been quite theoretical.
We will now introduce a dataset which we will train networks on.
The pulsar dataset contains data on 17898 stars,
of which 1639 are pulsars (neutron stars, or white dwarfs, with high angular momentum).

Since the original dataset is imbalanced, we have tuned the data to contain
the same number of pulsars and non-pulsars.  The file `./pulsars_raw.csv`
contains the original dataset, whilst the file `./pulsars_tuned.csv` contains
the tuned dataset with balanced classes.  We will always use the latter.
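
The text does not say how the tuned file was produced. One common way to balance classes is random undersampling of the majority class, and the sketch below assumes that approach. It uses synthetic labels with the same class counts as the raw dataset; the helper `undersample` and the fixed seed are illustrative names, not part of the original notebook.

```python
import numpy as np


def undersample(labels, rng):
    """Return sorted row indices with every class cut down to the rarest class size."""
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    keep = [rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
            for c in classes]
    return np.sort(np.concatenate(keep))


rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
# synthetic labels: 17898 stars, 1639 of them pulsars (label 1)
labels = np.zeros(17898, dtype=int)
labels[rng.choice(17898, size=1639, replace=False)] = 1

idx = undersample(labels, rng)
balanced = labels[idx]
print(len(balanced), balanced.sum())  # 3278 1639
```

With these counts the balanced set has 3278 rows, half of them pulsars, which agrees with the shape printed for the tuned file.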

In [1]:
import pandas as pd
df = pd.read_csv('./pulsars_tuned.csv')
X, y = df.values[:, :-1], df['label'].values
print(X.shape, y.shape)

(3278, 8) (3278,)


In [2]:
df.head()

Unnamed: 0,ip_mean,ip_std,ip_kurtosis,ip_skewness,dmsnr_mean,dmsnr_std,dmsnr_kurtosis,dmsnr_skewness,label
0,102.50781,58.88243,0.46532,-0.51509,1.67726,14.86015,10.57649,127.39358,0
1,142.07812,45.28807,-0.32033,0.28395,5.37625,29.0099,6.07627,37.83139,0
2,138.17969,51.52448,-0.03185,0.0468,6.33027,31.57635,5.15594,26.14331,0
3,114.36719,51.94572,-0.0945,-0.28798,2.73829,17.19189,9.05061,96.6119,0
4,99.36719,41.5722,1.5472,4.15411,27.55518,61.71902,2.20881,3.66268,1


The data contains four statistics derived from the first four statistical moments
(mean, standard deviation, kurtosis and skewness) of two indicators:

IP: Integrated Profile.
The profile (amount of light) of a pulse is often very different between pulses.
Integrated simply means we baseline the pulse so we have an intensity of zero
anywhere else but the pulses.
It is expected that pulsars have high variance (and standard deviation)
between pulses (between integrated profiles of pulses).

DM-SNR: Dispersion Measure, Signal to Noise Ratio.
Dispersion means a delay in the lower frequencies, due to interaction with free electrons.
In other words, higher frequencies arrive earlier than lower frequencies;
the dispersion shows up as a spread in the arrival times of the different frequencies.
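
To make the four statistics per indicator concrete, here is a minimal sketch of how they could be computed from a one-dimensional signal. It assumes population moments and excess kurtosis (zero for a Gaussian); the text does not state which conventions the dataset uses, and both `four_moments` and the toy `profile` are hypothetical.

```python
import numpy as np


def four_moments(signal):
    """Mean, standard deviation, excess kurtosis and skewness of a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    mean = signal.mean()
    std = signal.std()                  # population standard deviation
    z = (signal - mean) / std           # standardised values
    skewness = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4) - 3.0    # excess kurtosis, 0 for a Gaussian
    return mean, std, kurtosis, skewness


# hypothetical profile: flat baseline with one narrow pulse
profile = np.zeros(64)
profile[30:34] = [1.0, 4.0, 4.0, 1.0]
print(four_moments(profile))
```

The returned order (mean, std, kurtosis, skewness) follows the column order of the table above.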

# ANN with `pytorch`

We will train an ANN on this dataset.
For this purpose we will use `pytorch`.
The objective of this exercise is to show that `pytorch` performs exactly the
matrix multiplications we performed in the concepts section.

For a start we need to modify `y` a little.
We have two outputs: pulsar or non-pulsar.
In the data pulsar is indicated with a $1$ and non-pulsar with a $0$.
Yet, ANNs work better if each output neuron represents just a single class.
In this case we want to represent pulsars as $[0, 1]$ and non-pulsars as $[1, 0]$.

In [3]:
import numpy as np
# `np.float` was removed in NumPy 1.24; the builtin `float` is equivalent
y = np.c_[y == 0, y == 1].astype(float)
print(y.shape)

(3278, 2)


The following code looks very similar to `pytorch` tutorials on ANNs.
This is intentional: our objective here is not to learn `pytorch`
but instead to understand what it does under the hood.
That said, below we implement *Stochastic Gradient Descent* (SGD) by hand.
There are other ways to train a network, but most are variants of
*Gradient Descent*.  We will use SGD in all our examples.

In [4]:
import torch
import torch.nn as nn


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(8, 25)
        self.fc2 = nn.Linear(25, 10)
        self.fc3 = nn.Linear(10, 2)

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = torch.tanh(self.fc3(x))
        return x


net = Net()
criterion = nn.MSELoss()
learning_rate = 0.01
batch = 100
for i in range(1000):
    idx = np.random.randint(0, len(y), batch)
    X_sample, y_sample = torch.Tensor(X[idx]), torch.Tensor(y[idx])

    net.zero_grad()
    y_hat = net(X_sample)
    loss = criterion(y_hat, y_sample)
    loss.backward()
    if 0 == (i+1)%50:
        print(loss)
    for f in net.parameters():
        f.data.sub_(f.grad.data * learning_rate)

tensor(0.1832, grad_fn=<MseLossBackward>)
tensor(0.1631, grad_fn=<MseLossBackward>)
tensor(0.1124, grad_fn=<MseLossBackward>)
tensor(0.1028, grad_fn=<MseLossBackward>)
tensor(0.0937, grad_fn=<MseLossBackward>)
tensor(0.0999, grad_fn=<MseLossBackward>)
tensor(0.0878, grad_fn=<MseLossBackward>)
tensor(0.0628, grad_fn=<MseLossBackward>)
tensor(0.0806, grad_fn=<MseLossBackward>)
tensor(0.0808, grad_fn=<MseLossBackward>)
tensor(0.0893, grad_fn=<MseLossBackward>)
tensor(0.0976, grad_fn=<MseLossBackward>)
tensor(0.0742, grad_fn=<MseLossBackward>)
tensor(0.0768, grad_fn=<MseLossBackward>)
tensor(0.0620, grad_fn=<MseLossBackward>)
tensor(0.0796, grad_fn=<MseLossBackward>)
tensor(0.0467, grad_fn=<MseLossBackward>)
tensor(0.0761, grad_fn=<MseLossBackward>)
tensor(0.0697, grad_fn=<MseLossBackward>)
tensor(0.0716, grad_fn=<MseLossBackward>)


There are several ways of checking whether an ANN has converged:
having a tolerance of error, a tolerance of gradient,
or continually reducing the learning rate.  Yet we will ignore their
complexities and verify network convergence visually.
Namely: if the *Mean Square Error* (MSE) is decreasing, the network is learning something;
if the MSE keeps decreasing over the run, we accept that we have a trained network.
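
For comparison, a tolerance-of-error rule could look like the sketch below. It is only an illustration: the function `has_converged`, the window of 5 and the tolerance of 0.01 are arbitrary assumptions, and the input is the list of loss values printed by the training loop above.

```python
import numpy as np

# the 20 loss values printed by the training loop above
losses = np.array([0.1832, 0.1631, 0.1124, 0.1028, 0.0937, 0.0999, 0.0878,
                   0.0628, 0.0806, 0.0808, 0.0893, 0.0976, 0.0742, 0.0768,
                   0.0620, 0.0796, 0.0467, 0.0761, 0.0697, 0.0716])


def has_converged(history, window=5, tol=0.01):
    """True once the mean of the last `window` losses improves on the mean
    of the previous `window` losses by less than `tol`."""
    if len(history) < 2 * window:
        return False
    recent = np.mean(history[-window:])
    previous = np.mean(history[-2 * window:-window])
    return bool(previous - recent < tol)


# replay the printed losses one at a time, as a training loop would see them
for n in range(1, len(losses) + 1):
    if has_converged(losses[:n]):
        print("loss plateau detected after", n, "printed values")
        break
```

Averaging over a window matters here because the loss of a random batch is noisy; comparing single consecutive values would stop far too early.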

The simplification will be good enough for our purposes.
One thing we may do to prove to ourselves that the network did learn
is to check how many samples it identified correctly.

In [5]:
out = net(torch.Tensor(df.values[:, :-1]))
y_hat = out.argmax(dim=1).numpy()
y_true = df['label'].values
print(sum(y_hat[y_true == 1] == y_true[y_true == 1])/sum(y_true == 1))
print(sum(y_hat[y_true == 0] == y_true[y_true == 0])/sum(y_true == 0))
print(sum(y_hat == y_true)/len(y_true))

0.87614399023795
0.9438682123245882
0.910006101281269


For the time being we are ignoring overfitting of the network.
It is likely that our ANN is overfitting,
and this can be detected by holding out validation and test sets.
Yet, to keep it simple, we will ignore this need.
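
For reference, holding out a test set takes only a few lines. The sketch below uses synthetic stand-ins for `X` and the labels (the `_demo` names are hypothetical) and an arbitrary 20% hold-out fraction; the notebook itself keeps training and evaluating on the full tuned dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 3278                      # number of rows in the tuned dataset
# synthetic stand-ins for X and the 0/1 labels
X_demo = rng.normal(size=(n_samples, 8))
labels_demo = rng.integers(0, 2, size=n_samples)

order = rng.permutation(n_samples)    # shuffle row indices once
n_test = n_samples // 5               # hold out 20%, an arbitrary choice
test_idx, train_idx = order[:n_test], order[n_test:]

X_train, y_train = X_demo[train_idx], labels_demo[train_idx]
X_test, y_test = X_demo[test_idx], labels_demo[test_idx]
print(X_train.shape, X_test.shape)    # (2623, 8) (655, 8)
```

Training would then sample batches from `train_idx` only, and the accuracy check would run on the held-out rows.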

# Equations

All the network above is doing is matrix multiplication.
We keep repeating that but this is really important!
We used a network with three layers: one with 25 neurons,
one with 10 neurons and one with 2 neurons (pulsar or not pulsar).
We also have 8 features from our dataset.

A matrix is often written with the number of its rows and columns.
For example, matrix $A$ with $n$ rows and $p$ columns can be written:

$$
A_{n \times p}
$$

This means that we have the following matrices in the ANN:

$$
W_{25 \times 8}, W_{B\: 25 \times 1},
W'_{10 \times 25}, W'_{B\: 10 \times 1},
W''_{2 \times 10}, W''_{B\: 2 \times 1}
$$

We can see these matrices in the network parameters.

In [6]:
ws = []
for f in net.parameters():
    ws.append(f.data.numpy())
print(list(map(lambda x: x.shape, ws)))

[(25, 8), (25,), (10, 25), (10,), (2, 10), (2,)]


Another thing that we did use above is the activation function (*tanh()*).
If we multiply an input vector through the matrices
and apply the activation function after each multiplication
we should get the same output as `pytorch` gives us.

In [7]:
print(net(torch.Tensor(X[0, :])))

tensor([0.8634, 0.1527], grad_fn=<TanhBackward>)


And our multiplication of:

$$
\hat{y} = \tanh(W''_{2 \times 10} \cdot
  \tanh(W'_{10 \times 25} \cdot
    \tanh(W_{25 \times 8} \cdot \vec{x} + W_{B\: 25 \times 1})
    + W'_{B\: 10 \times 1})
  + W''_{B\: 2 \times 1})
$$

Gives the exact same result.

In [8]:
W = ws[::2]
Wb = ws[1::2]
vector = X[0, :]
for w, b in zip(W, Wb):
    vector = np.tanh(w @ vector + b)
print(vector)

[0.86337939 0.15266306]


There is a little more to it though.

We normally think of data as rows meaning samples and columns meaning features.
That is all good, but in our equation above and in the code we used the
vector as a column vector.
In other words, we had one column and eight rows.
`numpy` let us get away with it: `@` treats a one-dimensional array on its
right-hand side as a column vector, so we needed no explicit transpose.
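
A small check of how `numpy` handles a one-dimensional array on the right of `@` is sketched below. The weights are random placeholders with the shape of the first layer, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(25, 8))      # same shape as the first weight matrix
b = rng.normal(size=25)
x = rng.normal(size=8)            # one sample as a 1-D array

as_1d = w @ x + b                                     # shape (25,)
as_column = w @ x[:, np.newaxis] + b[:, np.newaxis]   # shape (25, 1)

print(as_1d.shape, as_column.shape)         # (25,) (25, 1)
print(np.allclose(as_1d, as_column[:, 0]))  # True
```

The numbers are the same either way; only the shape of the result differs.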

Now, we normally do not feed a single sample into a neural network;
instead we feed batches (above we used a batch of 100 samples at a time).
The `pytorch` network performs all the transformations needed internally:
we can feed it several rows and it predicts all of them.

In [9]:
print(net(torch.Tensor(X[:3, :])))

tensor([[ 0.8634,  0.1527],
        [ 0.8261, -0.1086],
        [ 0.8253, -0.1033]], grad_fn=<TanhBackward>)


Now, if we are going to feed several samples into our code above,
we will need to perform the transposes ourselves.

Let's remind ourselves that two matrices can only multiply
if the number of columns of the left matrix is the same as the number
of rows of the right matrix.
Moreover the resulting matrix has the number of rows equal to the number
of rows of the left matrix and the number of columns equal to the number
of columns of the right matrix.

$$
A_{n \times p} \cdot B_{p \times q} = C_{n \times q}
$$
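
The shape rule is easy to check in `numpy`. The matrices below are arbitrary placeholders with $n = 4$, $p = 3$ and $q = 5$.

```python
import numpy as np

A = np.ones((4, 3))   # n = 4, p = 3
B = np.ones((3, 5))   # p = 3, q = 5
C = A @ B
print(C.shape)        # (4, 5)

try:
    B @ A             # columns of B (5) differ from rows of A (4)
except ValueError as err:
    print("cannot multiply:", type(err).__name__)
```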

Let's say we feed 3 samples and note that:

$$
\hat{Y}_{2 \times 3} = \tanh(W''_{2 \times 10} \cdot
    \tanh(W'_{10 \times 25} \cdot
        \tanh(W_{25 \times 8} \cdot X_{8 \times 3} + W_{B\: 25 \times 1})
    + W'_{B\: 10 \times 1})
+ W''_{B\: 2 \times 1})
$$

But in our dataset we have $X_{3278 \times 8}$ and we need it to have
$8$ rows, not columns, to fit in the equation above.
The same goes for $\hat{Y}$.  We want the answers to be a row per input row,
but we are getting a column per input.  We need to transpose both.

In [10]:
W = ws[::2]
Wb = ws[1::2]
vector = X[:3, :].T
for w, b in zip(W, Wb):
    vector = np.tanh(w @ vector + b[:, np.newaxis])
print(vector.T)

[[ 0.86337939  0.15266306]
 [ 0.82607889 -0.10857165]
 [ 0.82526583 -0.10333694]]
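
An equivalent way to avoid both transposes is to keep samples as rows and multiply by the transposed weights instead, which is the form `nn.Linear` documents ($y = x A^T + b$). The sketch below compares the two conventions on random placeholder weights with the same shapes as our network; the `_demo` names are hypothetical and these are not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(25, 8), (10, 25), (2, 10)]
W_demo = [rng.normal(size=s) for s in shapes]
b_demo = [rng.normal(size=s[0]) for s in shapes]
X_demo = rng.normal(size=(3, 8))          # three samples as rows

# column convention: one sample per column, as in the cell above
cols = X_demo.T
for w, b in zip(W_demo, b_demo):
    cols = np.tanh(w @ cols + b[:, np.newaxis])

# row convention: one sample per row, weights transposed
rows = X_demo
for w, b in zip(W_demo, b_demo):
    rows = np.tanh(rows @ w.T + b)

print(rows.shape, np.allclose(rows, cols.T))  # (3, 2) True
```

In the row convention the bias broadcasts across rows without `np.newaxis`, because its length matches the last axis of the product.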


OK, we can argue that we understand how `pytorch` executes its ANNs.
We still do not know how it calculates its gradients, but we will get to that next.