This notebook uses the code from chapter 18 of the book "Data Science from Scratch" by Joel Grus, available on [GitHub][1].

[1]: https://github.com/joelgrus/data-science-from-scratch

## Coding the neural network

Our network will be simple: it consists of 25 input features (pixels), a hidden layer with 4 neurons (configurable via `num_hidden` below) and an output layer with 10 neurons. The outputs of neurons 0-9 in the output layer can be interpreted as the probabilities that the input is classified as the digit 0-9 (loosely speaking: each output is an independent sigmoid, so the ten values need not sum to 1).

In [None]:
import math, random
import matplotlib
import matplotlib.pyplot as plt

A few helper functions:

In [None]:
def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def argmax(l):
    return l.index(max(l))
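
A quick sanity check of these helpers may be useful before building on them. The sketch below repeats the three definitions so that it runs on its own; the input values are made up for illustration.

```python
import math

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def argmax(l):
    return l.index(max(l))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
print(sigmoid(0))                 # 0.5, the midpoint of the sigmoid
print(argmax([0.1, 0.7, 0.2]))    # 1, the index of the largest entry
```

Note that `argmax` returns the first index if the maximum occurs more than once.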

Functions for evaluating the network:

In [None]:
def neuron_output(weights, inputs):
    return sigmoid(dot(weights, inputs))

def feed_forward(neural_network, input_vector):
    """takes in a neural network (represented as a list of lists of lists of weights)
    and returns the output from forward-propagating the input"""

    outputs = []

    for layer in neural_network:

        input_with_bias = input_vector + [1]             # add a bias input (always 1)
        output = [neuron_output(neuron, input_with_bias) # compute the output
                  for neuron in layer]                   # for this layer
        outputs.append(output)                           # and remember it

        # the input to the next layer is the output of this one
        input_vector = output

    # outputs = two arrays (one array of size 4 for the hidden layer plus one array of size 10 for the output layer)
    return outputs 

def predict(network, input):
    """run input through the network and return output of last layer"""
    return feed_forward(network, input)[-1]
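
To see `feed_forward` at work on something small, here is a sketch with a 2-2-1 network that computes XOR. The weights are hand-picked for illustration (an assumption of this example, not values trained in this notebook), and the helper functions are repeated so the block runs on its own.

```python
import math

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def neuron_output(weights, inputs):
    return sigmoid(dot(weights, inputs))

def feed_forward(neural_network, input_vector):
    outputs = []
    for layer in neural_network:
        input_with_bias = input_vector + [1]  # the last weight of each neuron is its bias
        output = [neuron_output(neuron, input_with_bias) for neuron in layer]
        outputs.append(output)
        input_vector = output
    return outputs

# hand-picked weights (illustrative): 2 inputs -> 2 hidden neurons -> 1 output
xor_network = [
    [[20, 20, -30],    # hidden neuron 0: fires only if both inputs are 1 ("and")
     [20, 20, -10]],   # hidden neuron 1: fires if at least one input is 1 ("or")
    [[-60, 60, -30]],  # output neuron: "or, but not and"
]

xor_results = {(x, y): feed_forward(xor_network, [x, y])[-1][0]
               for x in [0, 1] for y in [0, 1]}

for pair, value in xor_results.items():
    print(pair, round(value, 3))
```

The list returned by `feed_forward` has one entry per layer, which is why `[-1]` picks the output layer and `[0]` its single neuron.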

Define the function for backpropagation that we'll need to train the network:

In [None]:
def backpropagate(network, input_vector, target):

    hidden_outputs, outputs = feed_forward(network, input_vector)

    # compute the delta (error term) of the output layer
    output_deltas = [output * (1 - output) * (output - target[i]) # (1)
                     for i, output in enumerate(outputs)]

    # back-propagate errors to hidden layer: compute the delta (error term) of the hidden layer
    hidden_deltas = [hidden_output * (1 - hidden_output) *
                     dot(output_deltas, [n[i] for n in network[-1]]) # (2)
                     for i, hidden_output in enumerate(hidden_outputs)]
    
    # adjust weights for output layer (network[-1])
    for i, output_neuron in enumerate(network[-1]): # loop over weights of neurons in output layer
        for j, hidden_output in enumerate(hidden_outputs + [1]): # loop over output of neurons in hidden layer + bias
            output_neuron[j] -= output_deltas[i] * hidden_output # (3)

    # adjust weights for hidden layer (network[0])
    for i, hidden_neuron in enumerate(network[0]): # loop over weights of neurons in hidden layer
        for j, input in enumerate(input_vector + [1]): # loop over output of neurons in first layer, i.e. the inputs + bias
            hidden_neuron[j] -= hidden_deltas[i] * input

In the two lines marked with **(1)** and **(3)** we rediscover
$$ \Delta w_{ij} = - \eta \frac{\partial E}{\partial w_{ij}} = - (\hat y - y) \varphi'(\text{net}_j) x_i,$$
which was introduced in the [short introduction][1] on backpropagation. 

[1]: NN_Activation.ipynb

In **(1)** we compute $(\hat y - y) \varphi'(\text{net}_j)$ with a learning rate $\eta$ of $\frac12$. The term $\hat y - y$ is `output - target[i]`, evaluated for each output value (activation) of the neurons in the last layer (`outputs`) and the corresponding entry of the vector of target values (`target`). $\varphi'$ is the derivative of the sigmoid function that we use as activation function of the output layer, i.e. $\varphi'(x) = \varphi(x) (1 - \varphi(x))$ or `output * (1 - output)` in code.

Finally, in **(3)** this is just multiplied by $x_i$ which is `hidden_output`, an entry in the vector of output values of the hidden layer (`hidden_outputs`).
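
The identity $\varphi'(x) = \varphi(x)(1 - \varphi(x))$ used in **(1)** can be checked numerically. The following sketch compares it with a central finite difference; the function names `sigmoid_derivative` and `numerical_derivative` and the step size `h` are choices made for this illustration only.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def sigmoid_derivative(t):
    """analytic derivative: sigmoid(t) * (1 - sigmoid(t))"""
    s = sigmoid(t)
    return s * (1 - s)

def numerical_derivative(f, t, h=1e-6):
    """central difference approximation of f'(t)"""
    return (f(t + h) - f(t - h)) / (2 * h)

differences = []
for t in [-2.0, 0.0, 0.5, 3.0]:
    analytic = sigmoid_derivative(t)
    numeric = numerical_derivative(sigmoid, t)
    differences.append(abs(analytic - numeric))
    print(t, analytic, numeric)
```

The two columns should agree to many decimal places; at `t = 0` the derivative is exactly 0.25, the maximum slope of the sigmoid.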

To understand equation **(2)** in the above code, which has not been spelled out in the [introduction][2], look again at the [Backpropagation Algorithm][1], which specifies the error terms $\delta_j^k$ for neuron $j$ in hidden layer $k$ as

$$\delta_j^k = g'(a_j^k) \sum_{l=1}^{r^{k+1}} w_{jl}^{k+1}\delta_l^{k+1}$$

In our case, $k$ is our (single) hidden layer and $k+1$ is the output layer. The values $a_j^k$ in this notation are the weighted input sums (pre-activations) of the hidden neurons. Since $g$ is again the sigmoid, $g'(a_j^k) = g(a_j^k)(1 - g(a_j^k))$, which is `hidden_output * (1 - hidden_output)` in code.
$r^{k+1}$ is the number of nodes in the output layer and the sum corresponds to the `dot` product of the `output_deltas` ($\delta_l^{k+1}$) with the weights that connect hidden neuron $j$ to each output neuron ($w_{jl}^{k+1}$), i.e. `[n[i] for n in network[-1]]`.

[1]: https://brilliant.org/wiki/backpropagation/
[2]: NN_Activation.ipynb

(Note that **(3)** comes after **(2)**, i.e. we update the weights of the output neurons only after computing the deltas for the hidden neurons. `network[-1]` is `output_layer`, so if `output_neuron` were changed in **(3)** first, the computation in **(2)** would use the already updated weights `n` instead of the ones that produced the outputs.)
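
One way to convince oneself that **(1)**, **(2)** and **(3)** fit together is a numerical gradient check. The sketch below assumes that the error is the sum of squared differences, $E = \sum_i (\hat y_i - y_i)^2$ (the notebook does not spell out $E$ here, so this is an assumption), and compares the weight changes with $-\frac12 \frac{\partial E}{\partial w}$ estimated by finite differences. `weight_changes` is a hypothetical variant of `backpropagate` that returns the changes instead of applying them; the tiny network and the sample are made up.

```python
import math, random

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def feed_forward(neural_network, input_vector):
    outputs = []
    for layer in neural_network:
        input_with_bias = input_vector + [1]
        output = [sigmoid(dot(neuron, input_with_bias)) for neuron in layer]
        outputs.append(output)
        input_vector = output
    return outputs

def squared_error(network, input_vector, target):
    """assumed error function: sum of squared differences"""
    outputs = feed_forward(network, input_vector)[-1]
    return sum((o - t) ** 2 for o, t in zip(outputs, target))

def weight_changes(network, input_vector, target):
    """same deltas as in backpropagate, but returned instead of applied"""
    hidden_outputs, outputs = feed_forward(network, input_vector)
    output_deltas = [o * (1 - o) * (o - target[i]) for i, o in enumerate(outputs)]
    hidden_deltas = [h * (1 - h) * dot(output_deltas, [n[i] for n in network[-1]])
                     for i, h in enumerate(hidden_outputs)]
    hidden_changes = [[-d * x for x in input_vector + [1]] for d in hidden_deltas]
    output_changes = [[-d * h for h in hidden_outputs + [1]] for d in output_deltas]
    return [hidden_changes, output_changes]

random.seed(1)
# made-up network: 3 inputs -> 2 hidden neurons -> 2 outputs (each with a bias weight)
tiny_network = [[[random.uniform(-1, 1) for _ in range(3 + 1)] for _ in range(2)],
                [[random.uniform(-1, 1) for _ in range(2 + 1)] for _ in range(2)]]
sample, target = [1, 0, 1], [1, 0]

changes = weight_changes(tiny_network, sample, target)
h = 1e-6
max_diff = 0.0
for l, layer in enumerate(tiny_network):
    for i, neuron in enumerate(layer):
        for j in range(len(neuron)):
            original = neuron[j]
            neuron[j] = original + h
            e_plus = squared_error(tiny_network, sample, target)
            neuron[j] = original - h
            e_minus = squared_error(tiny_network, sample, target)
            neuron[j] = original  # restore the weight
            gradient = (e_plus - e_minus) / (2 * h)
            max_diff = max(max_diff, abs(changes[l][i][j] - (-0.5 * gradient)))

print("largest deviation:", max_diff)
```

A deviation close to zero for every weight in both layers indicates that the hidden-layer deltas of **(2)** are consistent with the output-layer rule of **(1)** and **(3)**.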

## Training the neural network

The stylized digits that will serve as training inputs (the training set is tiny: just one example per label):

In [None]:
raw_digits = [
   0, """11111
         1...1
         1...1
         1...1
         11111""",

   1, """..1..
         ..1..
         ..1..
         ..1..
         ..1..""",

   2, """11111
         ....1
         11111
         1....
         11111""",

   3, """11111
         ....1
         11111
         ....1
         11111""",

   4, """1...1
         1...1
         11111
         ....1
         ....1""",

   5, """11111
         1....
         11111
         ....1
         11111""",

   6, """11111
         1....
         11111
         1...1
         11111""",

   7, """11111
         ....1
         ....1
         ....1
         ....1""",

   8, """11111
         1...1
         11111
         1...1
         11111""",

   9, """11111
         1...1
         11111
         ....1
         11111"""]

def make_digit(raw_digit):
    pixels = [1 if c == '1' else 0
              for row in raw_digit.split("\n")
              for c in row.strip()]
    # normalize -- one of the neurons always is "flat" if we don't normalize the inputs
    # pixels = [p/sum(pixels) for p in pixels]
    return pixels
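
As a small illustration of what `make_digit` produces, the sketch below converts the raw string for the digit 1 into its flat pixel vector. The function is repeated (without the commented-out normalization) so that the block runs on its own; `raw_one` and `rows` are names introduced here.

```python
def make_digit(raw_digit):
    return [1 if c == '1' else 0
            for row in raw_digit.split("\n")
            for c in row.strip()]  # strip() removes the indentation of rows 2-5

raw_one = """..1..
             ..1..
             ..1..
             ..1..
             ..1.."""

pixels = make_digit(raw_one)
print(len(pixels))  # 25 pixels, row by row
print(pixels[:5])   # first row: [0, 0, 1, 0, 0]

# regroup the flat vector into 5 rows of 5 pixels
rows = [pixels[i:i + 5] for i in range(0, 25, 5)]
for row in rows:
    print(row)
```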

Define the inputs (pixel images), labels and targets (= one-hot labels):

In [None]:
inputs  = list(map(make_digit, raw_digits[1::2]))
labels  = raw_digits[0::2]
targets = [[1 if i == j else 0 for i in range(10)]
           for j in labels]

We have ten output neurons, so we cannot directly use the labels...

In [None]:
labels

...to train but need to convert these into a "one-hot encoding":

In [None]:
targets
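
The conversion can also be written as a small helper. In the sketch below `one_hot` is a hypothetical name (the notebook builds `targets` with a list comprehension instead); `argmax` reverses the encoding.

```python
def one_hot(label, num_classes=10):
    """vector with a 1 at position `label` and 0 everywhere else"""
    return [1 if i == label else 0 for i in range(num_classes)]

def argmax(l):
    return l.index(max(l))

target = one_hot(3)
print(target)          # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(argmax(target))  # 3, i.e. the original label
```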

Define the network structure and initialize:

In [None]:
random.seed(0)    # to get repeatable results
input_size  = 25  # each input is a vector of length 25 (25 pixels)
num_hidden  =  4  # number of neurons in the hidden layer
output_size = 10  # we need 10 outputs for each input

# each hidden neuron has one weight per input, plus a bias weight
hidden_layer = [[random.random() for __ in range(input_size + 1)]
                for __ in range(num_hidden)]

# each output neuron has one weight per hidden neuron, plus a bias weight
output_layer = [[random.random() for __ in range(num_hidden + 1)]
                for __ in range(output_size)]

# the network starts out with random weights
network = [hidden_layer, output_layer]
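
To get a feeling for the size of this model, the number of trainable weights follows directly from the layer sizes. A short sketch using the same values as above (the variable names `hidden_weights` and `output_weights` are introduced here):

```python
input_size, num_hidden, output_size = 25, 4, 10

# every neuron has one weight per incoming value plus one bias weight
hidden_weights = num_hidden * (input_size + 1)   # 4 * 26 = 104
output_weights = output_size * (num_hidden + 1)  # 10 * 5 = 50

print(hidden_weights + output_weights)  # 154 weights in total
```

With only ten training samples, the network has far more weights than data points.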

Now we train the network using backpropagation:

In [None]:
# 10,000 iterations seems enough to converge
for x in range(10000):
    for input_vector, target_vector in zip(inputs, targets):
        backpropagate(network, input_vector, target_vector)
    if x % 1000 == 0:
        accuracy = sum([argmax(predict(network, input)) == label for input, label in zip(inputs, labels)])
        print("Iterations done: %d, accuracy: %.2f" % (x, accuracy / len(inputs)))


## Testing the neural network

Look at the probabilities of the labels the network predicts on the training data:

In [None]:
m = []
for i, input in enumerate(inputs):
    outputs = predict(network, input)
    print(i, [round(p,2) for p in outputs])
    m.append(outputs)

# This is not a confusion matrix.
plt.imshow(m, plt.cm.Blues);
plt.xlabel("Probability for label")
plt.ylabel("True label");

Let us test the network on some inputs it has not seen before:

In [None]:
print([round(x, 2) for x in
      predict(network,
                [0,1,1,1,0,    # .@@@.
                 0,0,0,1,1,    # ...@@
                 0,0,1,1,0,    # ..@@.
                 0,0,0,1,1,    # ...@@
                 0,1,1,1,0])]) # .@@@.

print([round(x, 2) for x in
      predict(network, 
                [0,1,1,1,0,    # .@@@.
                 1,0,0,1,1,    # @..@@
                 0,1,1,1,0,    # .@@@.
                 1,0,0,1,1,    # @..@@
                 0,1,1,1,0])]) # .@@@.

print([round(x, 2) for x in
      predict(network, 
                [1,1,1,1,1,    # @@@@@
                 0,0,0,0,1,    # ....@
                 0,0,0,1,0,    # ...@.
                 0,0,1,0,0,    # ..@..
                 0,0,1,0,0])]) # ..@..

(Note that the network is completely overfitted: it is perfect on the training data but generalizes very badly, even though the first two figures are still properly classified. In reality we'd want many more training samples.)


Show the weights the network has learned for each of the four hidden neurons (`num_hidden`):

In [None]:
def show_weights(neuron_idx, ax):
    weights = network[0][neuron_idx]

    grid = [weights[row:(row+5)]      # turn the weights into a 5x5 grid
            for row in range(0, input_size, 5)] # [weights[0:5], ..., weights[20:25]]

    pos = ax.imshow(grid,
                    cmap=matplotlib.cm.coolwarm,
                    interpolation='none', # plot blocks as blocks
                    vmin = -8, vmax = 8) # use the same color range for all subplots
    
    # print bias
    ax.set_xlabel("bias = %.2f" % weights[input_size])
    return pos

fig, ax = plt.subplots(figsize=(15, 3), ncols=num_hidden)
for idx in range(num_hidden):
    pos = show_weights(idx, ax[idx])
    #fig.colorbar(pos, ax = ax[0])


(blue = large negative, red = large positive)

In [None]:
plt.imshow(output_layer, cmap=matplotlib.cm.coolwarm)
plt.xlabel("Weight of hidden neuron (and bias)")
plt.ylabel("Output label");

(blue = large negative, red = large positive)

See how it discriminates e.g. 0 and 8 or 5 and 9?

Summed response of neurons in hidden layer to individual pixels (w/o bias) weighted by the weights of the output layer:

In [None]:
def DrawNNView(idx, ax):
    sum_weights = [sum([
                     sigmoid(network[0][neuron_idx][i])*output_layer[idx][neuron_idx] for neuron_idx in range(num_hidden)
                   ])
                   for i in range(input_size)
                  ]
    grid = [sum_weights[row:(row+5)] for row in range(0, input_size, 5)]

    pos = ax.imshow(grid,
                    cmap=matplotlib.cm.coolwarm,
                    interpolation='none',  # plot blocks as blocks
                    vmin = -40, vmax = 40) # use the same color range for all subplots
    ax.set_xlabel("output node %d" % idx)
    return pos

fig, ax = plt.subplots(figsize=(15, 6), ncols = 5, nrows = 2)
for idx in range(10):
    pos = DrawNNView(idx, ax.flatten()[idx])