# Lab 10 (Supp): Neural Network from Scratch

In this session, let's start off by building a neural network from scratch, so that you get an idea of what goes on under the hood of a neural network learner.



In [None]:
import numpy as np
import matplotlib.pyplot as plt

What is a neural network? It is a machine learning technique loosely inspired by the human brain, which it represents as a series of connected nodes arranged in layers. A human brain consists of roughly 100 billion cells called neurons, which are connected to one another by synapses. If the synaptic inputs to a neuron are strong enough, that (output) neuron fires, or is "activated". This is how signals propagate through our brain cells and eventually trigger other body functions.

<img src="https://cdn-images-1.medium.com/max/800/1*4-4XkuTZopk59wOV6E-RCg.jpeg" width=400>

We can model this process by creating a neural network on a computer. It's not necessary to model the biological complexity of the human brain at a molecular level, just its higher level rules and "thinking" capabilities. To start doing that, we will attempt to model just a single neuron, with three inputs and one output. Then we shall see how values are propagated both forward and backward inside a neural network.

We are going to train the neuron to solve the following simple problem. These four examples will be our training set. From a quick glance, surely you can work out the pattern in the values below, and the answer for '?' is simple.

<img src="https://cdn-images-1.medium.com/max/1600/1*nEooKljI8XbKQh4cFbZu1Q.png" width=400>

You might have noticed that the output is always equal to the value of the leftmost input column. Therefore the answer for the ‘?’ should be 1.

In [None]:
training_set_inputs = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
training_set_outputs = np.array([[0, 1, 1, 0]]).T
print(training_set_inputs)
print(training_set_outputs)
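
As a quick sanity check of the pattern described above, the short sketch below compares the leftmost input column with the output column. The variable names `first_column` and `matches_first_column` are introduced here only for illustration.

```python
import numpy as np

# Same training data as in the cell above.
training_set_inputs = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
training_set_outputs = np.array([[0, 1, 1, 0]]).T

# Compare the leftmost input column with the flattened output column.
first_column = training_set_inputs[:, 0]
matches_first_column = np.array_equal(first_column, training_set_outputs.ravel())
print("Output equals first input column:", matches_first_column)
```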

### Training a Neural Network

So, how do we teach our neuron to answer the question correctly? We will give each input a weight, which can be a positive or negative number. An input with a large positive weight or a large negative weight will have a strong effect on the neuron's output. Likewise, an input whose weight is close to zero will have very little effect. The output that the neuron produces from its weighted inputs is known as its "activation". Before we start, let us set each weight to a random number.

In [None]:
# Seed the random number generator, so it generates the same numbers
# every time the program runs.
np.random.seed(1)   # normally we get the seed from the machine clock

synaptic_weights = 2 * np.random.random((3, 1)) - 1
print(synaptic_weights)

Then, we begin the training process, which follows these steps:

1. Take the inputs from a training set example, adjust them by the weights, and pass them through a "special formula" to calculate the neuron's output.
2. Calculate the error, which is the difference between the neuron's output and the desired output in the training set example.
3. Depending on the direction of the error, adjust the weights slightly.
4. Repeat this process many times (hundreds or thousands of times).

This process involves a **feed-forward** step (step 1) and a **back propagation** step (step 3). Eventually the weights of the neuron will reach an optimum level for the training set, where they no longer change by much.

### The "Special Formula"

The so-called special formula for the neuron's output starts by taking the weighted sum of the neuron's inputs:

\begin{align}
\sum_i w_i x_i = w_1 x_1 + w_2 x_2 + w_3 x_3
\end{align}
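
To make the weighted sum concrete, here is a minimal sketch that computes it for a single example, once term by term and once as a dot product. The input `x` and the weights `w` are made-up values chosen only for illustration; they are not the weights used elsewhere in this lab.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
x = np.array([1, 0, 1])           # one example with three inputs
w = np.array([0.5, -0.2, 0.1])    # one weight per input

# Explicit sum: w_1*x_1 + w_2*x_2 + w_3*x_3
explicit_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))

# The same weighted sum written as a dot product.
dot_sum = np.dot(x, w)

print(explicit_sum, dot_sum)
```

Both expressions give 0.6 (up to floating-point rounding). The dot product form is the one that scales to a whole matrix of training examples at once.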

Next, we want to normalise this value so that it lies between 0 and 1. Ideally, the function should also push values towards the extremes of that range, so that strong activations are emphasised. For this, we use a non-linear activation function called the Sigmoid function.

<img src="https://cdn-images-1.medium.com/max/1600/1*sK6hjHszCwTE8GqtKNe1Yg.png" width=400>

\begin{align}
f(x) = \frac{1}{1+e^{-x}}
\end{align}

In [None]:
# The Sigmoid function, which describes an S shaped curve.
# We pass the weighted sum of the inputs through this function to
# normalise them between 0 and 1.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
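
To see how the Sigmoid squashes its input, the sketch below evaluates it at a handful of sample values. The array `z` holds illustrative weighted sums and is not taken from the training data.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative weighted sums, from strongly negative to strongly positive.
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
a = sigmoid(z)

for z_i, a_i in zip(z, a):
    print(f"sigmoid({z_i:6.1f}) = {a_i:.5f}")
```

An input of 0 maps to exactly 0.5, while large negative and large positive inputs are pushed very close to 0 and 1 respectively.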

So, by substituting the first equation into the second, the final formula for the output (activation) of the neuron is:

\begin{align}
a = f(\sum_i w_i x_i) = \frac{1}{1+e^{-(\sum_i w_i x_i)}}
\end{align}


In [None]:
output = sigmoid(np.dot(training_set_inputs, synaptic_weights))
print(output)

### Adjusting the weights

With the calculated output and the desired output (based on training data), we can now find the error, which tells us how far the current output is from the ground truth.

In [None]:
# Calculate the error (The difference between the desired output
# and the predicted output).
error = training_set_outputs - output
print(error)

To make an adjustment to the weights based on this error, we first multiply the error by the input, which is either 0 or 1. This makes the adjustment proportional to the size of the error, and means that an input of 0 causes no change to its weight. Then, we multiply by the gradient of the activation function (which is a Sigmoid).

The intuition behind this is that when the weighted sum is a large positive or large negative number, the output sits close to 1 or 0, which signifies that the neuron is confident. On the Sigmoid curve, large numbers (at both ends) have a small, shallow gradient. If the neuron is confident that the existing weight is correct, it should not adjust it by very much, and multiplying by the gradient of the Sigmoid curve achieves this.
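
Written out as a formula, the adjustment described above could be expressed as follows. The notation is introduced here only for illustration: $n$ indexes the training examples, $y^{(n)}$ is the desired output, $a^{(n)}$ is the neuron's output, and $e^{(n)}$ is the error. It relies on the fact that the Sigmoid gradient can be written in terms of its own output, $f'(z) = f(z)\,(1 - f(z))$.

\begin{align}
e^{(n)} &= y^{(n)} - a^{(n)} \\
\Delta w_i &= \sum_n x_i^{(n)} \, e^{(n)} \, a^{(n)} \left(1 - a^{(n)}\right)
\end{align}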

In [None]:
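# The derivative of the Sigmoid function.
# This is the gradient of the Sigmoid curve.
# It indicates how confident we are about the existing weight.
# Note: x here is a Sigmoid *output* (a value already passed through sigmoid()),
# so the derivative f'(z) = f(z) * (1 - f(z)) becomes x * (1 - x).
def sigmoid_derivative(x):
    return x * (1 - x)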
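
Because `sigmoid_derivative` expects the Sigmoid output rather than the raw weighted sum, it is worth checking it against a numerical estimate of the gradient. The sketch below does this with a central finite difference; the sample points `z` and the step size `h` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    # s is assumed to be a Sigmoid OUTPUT, i.e. s = sigmoid(z)
    return s * (1 - s)

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])   # illustrative weighted sums
h = 1e-5                                     # small step for the finite difference

# Analytic gradient: apply the derivative formula to the Sigmoid output.
analytic = sigmoid_derivative(sigmoid(z))

# Numerical gradient: slope of the curve estimated around each point.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(analytic)
print(numeric)
```

The two rows should agree closely. The gradient is largest (0.25) at a weighted sum of 0 and shrinks towards both ends of the curve, which is the "confidence" effect described above.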

In [None]:
# Multiply the error by the input and again by the gradient of the Sigmoid curve.
# This means less confident weights are adjusted more.
# This means inputs, which are zero, do not cause changes to the weights.

print(error * sigmoid_derivative(output))
adjustment = np.dot(training_set_inputs.T, error * sigmoid_derivative(output))
print(adjustment)

In [None]:
old_weights = synaptic_weights.copy()

synaptic_weights += adjustment
print("Old weights:\n",old_weights)
print("New weights:\n",synaptic_weights)

We can see that the weights have changed. One iteration (also known as an epoch) has passed once all training examples have gone through the network and the weights have been updated by back-propagation.

### Iterate it

What we need to do now is iterate the process and see whether the weights converge to stable values. The proper way to check this is to calculate the network loss at each iteration. In this example, the loss is the sum of squared errors between the predicted and desired outputs. Sometimes the average loss is used instead, because it gives an intuition of how far each sample is (on average) from the ground truth.
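
As a small worked example of the two loss variants, the sketch below computes both the sum and the average of the squared errors. The `predicted` values are hypothetical numbers chosen for illustration, not actual outputs of the neuron.

```python
import numpy as np

# Desired outputs from the training set, with hypothetical predictions.
desired = np.array([[0], [1], [1], [0]])
predicted = np.array([[0.1], [0.8], [0.9], [0.2]])   # made-up values

error = desired - predicted
sum_squared_error = np.sum(error ** 2)     # total loss over all samples
mean_squared_error = np.mean(error ** 2)   # average loss per sample

print(sum_squared_error, mean_squared_error)
```

Here the sum of squared errors is 0.1 and the average is 0.025 (up to floating-point rounding). The average is simply the sum divided by the number of samples, which makes it easier to compare across training sets of different sizes.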

In [None]:
synaptic_weights = 2 * np.random.random((3, 1)) - 1
loss = []
for iteration in range(1000):
    output = sigmoid(np.dot(training_set_inputs, synaptic_weights))
    error = training_set_outputs - output
    synaptic_weights += np.dot(training_set_inputs.T, error * sigmoid_derivative(output))
    rss = np.sum(error**2)
    loss.append(rss)
    
    print("Iteration",iteration," loss=", rss)

In [None]:
plt.plot(range(1000), loss, 'b-')
plt.xlabel("Iteration")
plt.ylabel("Loss (sum of squared errors)")
plt.show()

Finally, let's test out the model learned by the neural network with a new test input.

In [None]:
test_input = np.array([1, 0, 0])
test_output = sigmoid(np.dot(test_input, synaptic_weights))
print(test_output)

The answer is close to 1, which is correct. 
By examining the final weights learned by the neural network, we can see that the first weight is a large positive number, indicating that it contributes the most to the decision making. Recall that we observed the first column being identical to the output value!

In [None]:
print("Final weights:\n",synaptic_weights)

## Problem case: Lack of Features

Now let's look at a slightly more challenging set of training data. 
<table>
    <tr><th colspan=3>Input</th><th>Output</th></tr>
       <tr><td>0</td><td>0</td><td>1</td><td>0</td></tr>
       <tr><td>0</td><td>1</td><td>1</td><td>1</td></tr>
       <tr><td>1</td><td>0</td><td>1</td><td>1</td></tr>
       <tr><td>1</td><td>1</td><td>1</td><td>0</td></tr>
    </table>
So, what's the pattern here? The output appears to be completely unrelated to column three, which is always 1. Columns 1 and 2, however, provide more clarity: if exactly one of columns 1 and 2 is a 1 (but not both!), then the output is 1; if the two columns are equal (both 0 or both 1), then the output is 0. This is the exclusive-or (XOR) of the first two columns. The third column is as good as redundant, so we are hoping for the first two columns to help make correct predictions.
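
The pattern in the table corresponds to the exclusive-or (XOR) of the first two columns. The short sketch below checks this against the table; the name `xor_of_first_two` is introduced here only for illustration.

```python
import numpy as np

# The four rows of the table above.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

# XOR of column 1 and column 2: true when exactly one of them is 1.
xor_of_first_two = np.logical_xor(X[:, 0], X[:, 1]).astype(int)
matches_output = np.array_equal(xor_of_first_two, y.ravel())

print(xor_of_first_two)
print("Output equals XOR of columns 1 and 2:", matches_output)
```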

This is considered a "non-linear" pattern because there is no direct one-to-one relationship between the input and output. Instead, there is a one-to-one relationship between a combination of inputs, namely columns 1 and 2. This is going to be challenging. There is a lack of features (or rather, useful features). 
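
To see why this is challenging, the sketch below trains the same kind of single neuron used earlier on this new data. It is only an illustration under the same update rule as before; the seed and the number of iterations are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(1)
weights = 2 * np.random.random((3, 1)) - 1

# Same single-neuron training loop as before, applied to the new data.
for _ in range(1000):
    output = sigmoid(np.dot(X, weights))
    error = y - output
    weights += np.dot(X.T, error * output * (1 - output))

# Threshold the outputs at 0.5 to get hard 0/1 predictions.
predictions = (output > 0.5).astype(int)
all_correct = np.array_equal(predictions, y)

print(output.round(3).ravel())
print("All four rows classified correctly:", all_correct)
```

However long it trains, a single neuron cannot get all four rows right, because no single weighted sum of the inputs separates the rows with output 1 from the rows with output 0.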

<img src="https://www.pyimagesearch.com/wp-content/uploads/2016/08/knn_kaggle_dogs_vs_cats_sample.jpg" width=450>

Image recognition has a similar problem. Given a set of images of dogs or cats (assume they are identical in size), we will find that no individual pixel position correlates directly with the presence of a dog or a cat. From a purely statistical point of view, the pixels are as good as random. However, certain combinations of pixels are not entirely random: the combinations that form particular body parts of the cat or dog might be of good use. So, we need to find a "higher level" correlation between these combinations of pixels and the output values.

### Strategy

In order to combine pixels into something that can then have a one-to-one relationship with the output, we need to add another layer. The first layer will combine the inputs, and the second layer will then map them to the output with the output of the first layer as input. This new layer that is neither an input nor an output layer, is usually known as a *hidden layer*, because it is not observable in any sense, but derives a relationship between the input and output layers.

In terms of weight updating, this neural network will need to update the second layer of weights, which maps the hidden layer to the output, and also update the first layer of weights so that it gets better at producing the hidden layer from the input! So there will be two weight updates, linked to each other.

Let's start by creating the training data...

In [None]:
X = np.array([
[0,0,1],
[0,1,1],
[1,0,1],
[1,1,1]]) 
y = np.array([
[0],
[1],
[1],
[0]])
print(X)
print(y)

Then, we initialize the two weight matrices randomly. Their shapes need to be correct.

In [None]:
np.random.seed(1)

# randomly initialize our weights with mean 0
weight0 = 2*np.random.random((3,5)) - 1
weight1 = 2*np.random.random((5,1)) - 1
print(weight0)
print(weight1)
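
To see why these shapes are the correct ones, the sketch below traces the array shapes through the two matrix products. The Sigmoid is left out because it works element by element and does not change any shape; the names `hidden_pre` and `output_pre` are introduced here only for illustration.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])

np.random.seed(1)
weight0 = 2 * np.random.random((3, 5)) - 1   # 3 inputs  -> 5 hidden nodes
weight1 = 2 * np.random.random((5, 1)) - 1   # 5 hidden  -> 1 output node

hidden_pre = np.dot(X, weight0)              # (4, 3) @ (3, 5) -> (4, 5)
output_pre = np.dot(hidden_pre, weight1)     # (4, 5) @ (5, 1) -> (4, 1)

print(X.shape, weight0.shape, hidden_pre.shape)
print(weight1.shape, output_pre.shape)
```

The inner dimensions of each product must match: the 3 input columns meet the 3 rows of `weight0`, and the 5 hidden nodes meet the 5 rows of `weight1`. The final result has one row per training example and one column for the single output.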

The feedforward operation involves passing the inputs through all layers, until the predicted output is calculated.

In [None]:
# Feed forward through layers 0, 1, and 2
layer0 = X
layer1 = sigmoid(np.dot(X,weight0))
layer2 = sigmoid(np.dot(layer1,weight1))

In [None]:
# Then, calculate the error at the end of network : how far are we from target
l2_error = y - layer2
print(l2_error)

The back-propagation step starts once we have the error at the final layer. The amount of adjustment needs to be calculated in the last layer first.

In [None]:
l2_delta = l2_error*sigmoid_derivative(layer2)
print(l2_delta)

Then, it trickles back from the 2nd layer to the 1st layer. This is to find out how much each layer1 value contributes to the layer2 error.

In [None]:
l1_error = np.dot(l2_delta, weight1.T)
print(l1_error)

In [None]:
# this is the amount of adjustment needed on layer 1
l1_delta = l1_error * sigmoid_derivative(layer1)
print(l1_delta)

In [None]:
# copy old weights first
old_weight1 = weight1.copy()
old_weight0 = weight0.copy()

# update weights with the adjustment
weight1 += np.dot(layer1.T, l2_delta)
weight0 += np.dot(layer0.T, l1_delta)

# print to check if there are changes
print("Old weight 1\n",old_weight1)
print("New weight 1\n",weight1)
print("Old weight 0\n",old_weight0)
print("New weight 0\n",weight0)
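
One way to gain confidence in the two adjustments is a numerical gradient check. The sketch below assumes the loss is half the sum of squared errors (an assumption made here so that the adjustments equal the negative gradient exactly) and compares the adjustments with central finite differences. The helpers `loss_fn` and `numeric_gradient` are hypothetical names introduced for this check only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss_fn(X, y, w0, w1):
    # Assumed loss: half the sum of squared errors of the two-layer network.
    out = sigmoid(np.dot(sigmoid(np.dot(X, w0)), w1))
    return 0.5 * np.sum((y - out) ** 2)

def numeric_gradient(loss_of_w, w, h=1e-5):
    # Central finite difference of the loss for each entry of w.
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[idx] += h
        w_minus[idx] -= h
        grad[idx] = (loss_of_w(w_plus) - loss_of_w(w_minus)) / (2 * h)
    return grad

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])
np.random.seed(1)
weight0 = 2 * np.random.random((3, 5)) - 1
weight1 = 2 * np.random.random((5, 1)) - 1

# Adjustments computed with the same formulas as in the cells above.
layer1 = sigmoid(np.dot(X, weight0))
layer2 = sigmoid(np.dot(layer1, weight1))
l2_delta = (y - layer2) * layer2 * (1 - layer2)
l1_delta = np.dot(l2_delta, weight1.T) * layer1 * (1 - layer1)
adjust1 = np.dot(layer1.T, l2_delta)
adjust0 = np.dot(X.T, l1_delta)

# Numerical gradients of the loss with respect to each weight matrix.
grad0 = numeric_gradient(lambda w: loss_fn(X, y, w, weight1), weight0)
grad1 = numeric_gradient(lambda w: loss_fn(X, y, weight0, w), weight1)

matches0 = np.allclose(adjust0, -grad0, atol=1e-6)
matches1 = np.allclose(adjust1, -grad1, atol=1e-6)
print("weight0 adjustment matches negative gradient:", matches0)
print("weight1 adjustment matches negative gradient:", matches1)
```

If both checks pass, adding the adjustments to the weights moves them in the direction that reduces this loss, which is what gradient descent requires.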

**Q1**: In order to verify that the neural network is able to train correctly, collect and compile all relevant code above, and put them into a loop. As before, keep track of the network loss as well. 

**Q2**: Plot the loss vs. epoch curve. It should show some form of convergence towards a minimum level. 

**Q3**: Add a *learning rate* to your weight updating code. Find a good value for it such that it will give us the lowest possible loss.

Once your neural network has learned sufficiently well, take a look at the weights (both layers) and see if you can spot anything interesting that might give you some idea of what was learned.

The neural network flavour that you have just created is called a **Multilayer Perceptron (MLP)**. MLP is a class of feedforward artificial neural network which consists of at least three layers of nodes. Except for the input nodes, each node is a neuron that uses a non-linear activation function. MLP utilizes a supervised learning technique called back-propagation for training.