In [1]:
import numpy as np

# Epoch: 
One epoch is one full training cycle over the training set. Once every sample in the set has been seen, you start again, marking the beginning of the second epoch.
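As a rough illustration of the definition above, the sketch below counts how many samples are visited over a few epochs. The names `n_epochs` and `samples_seen` are hypothetical and used only here; the four samples are the same ones the notebook defines later as `X`.

```python
import numpy as np

# Same four training samples the notebook uses as X.
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2]])

n_epochs = 3        # hypothetical small value, for illustration only
samples_seen = 0

for epoch in range(n_epochs):
    for sample in X:            # one epoch = one pass over every sample
        samples_seen += 1

print(samples_seen)             # 12 = 3 epochs x 4 samples
```

The notebook itself processes all four samples at once with matrix operations, so there one epoch corresponds to a single forward and backward pass over `X`.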

In [13]:
epoches = 10000  # number of epochs (training iterations)
inputLayerSize=2
hiddenLayerSize=4
outputLayerSize=1
alpha = 0.1 #learning rate

In [20]:
X = np.array([[1,1],[1,2],[2,1],[2,2]])   #input data
Y = np.array([[0],[1],[2],[3]])            #output data

In [17]:
def sigmoid(x):                         #activation function
    return 1/(1+np.exp(-x))

In [18]:
def sigmoid_derivative(x):            # derivative of sigmoid; x is the sigmoid output, not its input
    return x*(1-x)

In [19]:
Wh = np.random.uniform(size=(inputLayerSize,hiddenLayerSize)) # weights between input and hidden layer
Wz = np.random.uniform(size=(hiddenLayerSize,outputLayerSize)) # weights between hidden and output layer

In [21]:
H = sigmoid(np.dot(X, Wh))                  # hidden layer activations
Z = np.dot(H, Wz)                           # output layer, no activation (linear)
E = Y - Z                                   # error: target minus prediction
dZ = E * alpha                              # output delta, scaled by the learning rate
dH = dZ.dot(Wz.T) * sigmoid_derivative(H)   # hidden delta, computed before Wz is updated
Wz += H.T.dot(dZ)                           # update output layer weights
Wh += X.T.dot(dH)                           # update hidden layer weights
print(Z)                                    # prediction from this single pass

[[2.17654678]
 [2.32612864]
 [2.47571744]
 [2.58035145]]


# Summary:
Here, X holds the input values, Y the expected outputs, and Z the outputs the network actually produces.

An activation function corresponds to the biological phenomenon of a neuron ‘firing’, i.e. triggering a nerve signal when the neuron’s inputs combine in some appropriate way. It should be chosen so that small changes of input produce proportionately small changes of output, within a bounded range. We’ll use the very popular sigmoid function, but note that there are others. We also need the sigmoid derivative for backpropagation.
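The notebook's `sigmoid_derivative` takes the sigmoid's output rather than its input, since the derivative can be written as s * (1 - s) with s = sigmoid(x). The sketch below checks that identity against a central finite difference; the step size `eps` and the test points are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    # Expects s = sigmoid(x), i.e. the activation, not the raw input x.
    return s * (1 - s)

x = np.linspace(-4, 4, 9)       # test points -4, -3, ..., 4 (assumed)
s = sigmoid(x)

eps = 1e-6                      # finite-difference step (assumed)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_derivative(s)

print(np.allclose(numeric, analytic, atol=1e-8))
print(x[analytic.argmax()], analytic.max())   # slope peaks at x = 0
```

The slope peaks at 0.25 where the input is zero and falls towards zero as the sigmoid saturates.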

We’ll make an initial guess using the random initial weights, propagating the input pairs in X to the hidden layer as the dot product of X and Wh. Recall that matrix multiplication proceeds along each row of the first matrix, multiplying each element by the corresponding element down a column of the second, and then summing the products. The resulting matrix goes into the sigmoid function to produce H. So H = sigmoid(np.dot(X, Wh)).

The Z (output) layer is computed the same way but, in the code above, without an activation function: Z = np.dot(H, Wz).
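The forward pass can be sketched on its own to make the array shapes explicit. This follows the notebook's code, where the output layer is linear; the seeded generator `rng` is an assumption added only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, not in the notebook

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2]])   # 4 samples, 2 inputs each
Wh = rng.uniform(size=(2, 4))                    # input -> hidden weights
Wz = rng.uniform(size=(4, 1))                    # hidden -> output weights

H = sigmoid(np.dot(X, Wh))   # shape (4, 4): one row of hidden activations per sample
Z = np.dot(H, Wz)            # shape (4, 1): one linear output per sample

print(H.shape, Z.shape)
```

Each row of `H` lies strictly between 0 and 1 because of the sigmoid, while `Z` is unbounded, which is what lets the network reach targets such as 2 and 3.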

Now we compare the guess with the training data, i.e. Y – Z, giving E.

Finally, backpropagation. This comprises computing changes (deltas), which are combined via the dot product with the values at the input and hidden layers to provide increments for the corresponding weights. A neuron whose output is zero or very close to it contributes little to the next layer. The sigmoid derivative used in the backprop scales the hidden-layer delta: it is greatest (0.25) where a neuron’s weighted input is zero and shrinks towards zero as the neuron saturates, so saturated neurons receive only small weight updates. The sigmoid activation function shapes the output of the hidden layer; the output layer in the code above is left linear.

E is the final error Y – Z.

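dZ is the change factor for the output layer. Because Z has no activation in the code above, its slope is 1, so dZ is simply the error E scaled by the learning rate alpha: a large error calls for a large change, and an error close to zero for very little. If the output layer used a sigmoid, E would also be multiplied by sigmoid_derivative(Z).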

dH is dZ backpropagated through the weights Wz and scaled by the slope of H, i.e. sigmoid_derivative(H).

Finally, Wz and Wh are adjusted by applying those deltas to the inputs of their layers (H for Wz, X for Wh): the larger an input, the more its weight contributed to the error and the more that weight is changed. The product of a layer’s input and its delta is, up to sign and the factor alpha, the gradient of the squared-error cost with respect to that layer’s weights, so adding it moves the weights down towards a minimum of the cost function.
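The notebook cell runs the forward and backward pass once. A minimal sketch of repeating it for many epochs is given below. It is not the author's code: the fixed seed, the `losses` list, the name `epochs`, and the smaller learning rate of 0.01 (chosen conservatively instead of the notebook's 0.1) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for repeatable runs

epochs = 10000                   # same count as the notebook's `epoches`
alpha = 0.01                     # assumption: smaller than the notebook's 0.1

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2]], dtype=float)
Y = np.array([[0], [1], [2], [3]], dtype=float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    return s * (1 - s)           # s is the sigmoid output

Wh = rng.uniform(size=(2, 4))    # input -> hidden weights
Wz = rng.uniform(size=(4, 1))    # hidden -> output weights

losses = []                      # hypothetical helper to track progress
for epoch in range(epochs):
    # Forward pass over the whole training set (one epoch).
    H = sigmoid(np.dot(X, Wh))
    Z = np.dot(H, Wz)
    E = Y - Z
    losses.append(float(np.sum(E ** 2)))

    # Backward pass: compute both deltas before touching any weights.
    dZ = E * alpha
    dH = dZ.dot(Wz.T) * sigmoid_derivative(H)
    Wz += H.T.dot(dZ)
    Wh += X.T.dot(dH)

print(losses[0], losses[-1])     # squared error before and after training
```

Tracking the summed squared error per epoch is one simple way to see whether the weights are actually moving downhill; if the value grows or becomes `nan`, the learning rate is likely too large.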