A modification of the basic neural network from iamtrask. Source of the initial script: [here](http://iamtrask.github.io/2015/07/12/basic-python-network/). His book: [here](https://www.manning.com/books/grokking-deep-learning). <br>
This updates his code from Python 2 to Python 3 (i.e. it simply changes the `xrange` function to `range`). <br>
Adds extra comments to explain the architecture and calculations. <br>
Modifies weight initialisation XXX <br>
Change from Sigmoid Transformation to ReLU XXX <br>
Modify optimisation method XXX <br> 

In [1]:
import numpy as np


## Feature values (4 instances, 3 features)

In [2]:
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
X.shape

(4, 3)

## Labels (4 instances, 1 label)

In [3]:
y = np.array([[0,1,1,0]]).T

print()
print("Y shape: {}".format(y.shape))
print()
print("Y (labels): {}".format(y))
print()
print("Data type for Y output: {}".format(y.dtype))


Y shape: (4, 1)

Y (labels): [[0]
 [1]
 [1]
 [0]]

Data type for Y output: int32


## Randomly initialise weights (shape 3,4) (i.e. features × hidden nodes): uniform distribution centered on zero.
#### Key point: in a dense network, the size of the weight matrix is obvious once you sketch the architecture and draw lines between the densely connected layers. The size of the matrix is independent of the number of instances: the full weight matrix gets multiplied by every instance.

In [4]:
syn0 = 2*np.random.random((3,4)) - 1

print("Shape of weights for first layer: {}".format(syn0.shape))
print()
print("The weights for first layer: {}".format(syn0))
print()
print("Smallest weight: {}".format(np.min(syn0)))
print()
print("Biggest weight: {}".format(np.max(syn0)))
print()
print("Mean weight: {}".format(np.mean(syn0)))

Shape of weights for first layer: (3, 4)

The weights for first layer: [[-0.9708614  -0.67080291  0.29191409 -0.89019752]
 [-0.25176646  0.23102564  0.5872154  -0.33048972]
 [ 0.399518   -0.96349127 -0.08503958 -0.85598729]]

Smallest weight: -0.9708613993238386

Biggest weight: 0.587215401629215

Mean weight: -0.29241358627153474
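
The key point about the weight shape can be checked directly. The snippet below is a minimal sketch, not part of the original script: `weights` is a hypothetical stand-in drawn the same way as `syn0`, and the random binary batches are made-up inputs used only to vary the number of instances.

```python
import numpy as np

n_features, n_hidden = 3, 4  # same layer sizes as syn0 above

# Hypothetical stand-in for syn0: uniform in [-1, 1), shape (features, hidden nodes)
weights = 2 * np.random.random((n_features, n_hidden)) - 1

# Push batches of different sizes through the same weight matrix
for n_instances in (1, 4, 100):
    batch = np.random.randint(0, 2, size=(n_instances, n_features))
    hidden_pre_activation = batch.dot(weights)
    # weights.shape never changes; only the first dimension of the result does
    print(n_instances, weights.shape, hidden_pre_activation.shape)
```

Only the number of rows in the result follows the batch size; the weight matrix stays at (3, 4) throughout.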


## Randomly initialise weights (shape 4,1) for hidden layer transition to output (i.e. hidden nodes × output nodes)

In [5]:
syn1 = 2*np.random.random((4,1)) - 1

print("Shape of weights for hidden layer to output:\n {}".format(syn1.shape))
print()
print("The weights for hidden layer to output:\n {}".format(syn1))
print()
print("Smallest weight: {}".format(np.min(syn1)))
print()
print("Biggest weight: {}".format(np.max(syn1)))
print()
print("Mean weight: {}".format(np.mean(syn1)))

Shape of weights for hidden layer to output:
 (4, 1)

The weights for hidden layer to output:
 [[0.56511337]
 [0.85871653]
 [0.82655379]
 [0.48099302]]

Smallest weight: 0.4809930225597252

Biggest weight: 0.8587165254354729

Mean weight: 0.6828441771932812


## Calculate values of nodes in Hidden Layer (i.e. forward prop)

In [6]:
### Multiply the inputs by the weights (i.e. np.dot(X, syn0)) and then apply a sigmoid transformation, i.e. 1/(1+np.exp(-NUM))
demo_hidden_layer  = 1/(1+np.exp(-(np.dot(X,syn0))))

print("First version of node values in Hidden Layer: \n  \n{}\n \n Shape of layer 1: {}".format(demo_hidden_layer, demo_hidden_layer.shape))

First version of node values in Hidden Layer: 
  
[[0.59857185 0.27617973 0.47875291 0.2981784 ]
 [0.53687083 0.32465389 0.62297052 0.23388961]
 [0.3609269  0.16324295 0.55153496 0.14852905]
 [0.30510392 0.19729795 0.68870884 0.11138469]]
 
 Shape of layer 1: (4, 4)


## Calculate values of Output Layer (i.e. forward prop)

In [7]:
demo_output_layer = 1/(1+np.exp(-(np.dot(demo_hidden_layer,syn1))))


print("First version of node values in Output Layer: {}\n \n Shape of Output Layer: {}".format(demo_output_layer, demo_output_layer.shape))

First version of node values in Output Layer: [[0.75297785]
 [0.77022754]
 [0.70504915]
 [0.72405614]]
 
 Shape of Output Layer: (4, 1)
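
The two forward-prop cells repeat the same sigmoid expression, so they could be wrapped in small helpers. This is a sketch under assumptions: the names `sigmoid`, `forward`, `w_hidden` and `w_output` are hypothetical and do not appear in the original script, and the weights here are freshly drawn stand-ins for `syn0` and `syn1`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, the same 1/(1+np.exp(-z)) used in the cells above."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(inputs, w_hidden, w_output):
    """Return (hidden activations, output activations) for the two-layer network."""
    hidden = sigmoid(inputs.dot(w_hidden))
    output = sigmoid(hidden.dot(w_output))
    return hidden, output

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
w_hidden = 2 * np.random.random((3, 4)) - 1  # stand-in for syn0
w_output = 2 * np.random.random((4, 1)) - 1  # stand-in for syn1

hidden, output = forward(X, w_hidden, w_output)
print(hidden.shape, output.shape)  # (4, 4) (4, 1)
```

Every activation lies strictly between 0 and 1 because it comes out of a sigmoid, which matches the printed layer values above.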


## Batch processing: wait until the end of the batch (i.e. all instances) before updating the weights

Derivative of the sigmoid function, written in terms of its own output `l2` (called `output_layer` in the code):   (l2*(1-l2))  <br>
<br>
This is a convenient form for efficiently calculating the gradients used in neural networks: if one keeps in memory the feed-forward activations of the logistic function for a given layer, the gradients for that layer can be evaluated using simple multiplication and subtraction rather than re-evaluating the sigmoid function, which requires extra exponentiation.
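
The derivative identity can be sanity-checked numerically. The sketch below compares `s*(1-s)`, computed from stored activations, with a central finite difference of the sigmoid; the helper name `sigmoid`, the test points `z` and the step `eps` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)  # arbitrary test points
s = sigmoid(z)              # activations kept from the forward pass

# Analytic derivative from the stored activations: no new exponentials needed
analytic = s * (1 - s)

# Central finite difference as an independent check
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The two estimates should agree up to finite-difference rounding error, which is why the training loop can reuse `output_layer` and `hidden_layer` instead of calling the sigmoid again.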

In [8]:
### note: was xrange in Py2 - now it's range in Py3

for j in range(60000):
    hidden_layer  = 1/(1+np.exp(-(np.dot(X,syn0))))
    output_layer = 1/(1+np.exp(-(np.dot(hidden_layer ,syn1))))
    
    ###############  (distance from correct answer)  X  (partial derivative of Sigmoid at this point)
    l2_delta = (y - output_layer)*(output_layer*(1-output_layer))
    
    ###############  propagate the output delta back through syn1, then multiply by the slope of the sigmoid at the hidden layer values
    l1_delta = l2_delta.dot(syn1.T) * (hidden_layer * (1 - hidden_layer))
    
    ###############  update the weights once per pass over the full batch (the step size is implicitly 1)
    syn1 = syn1 + hidden_layer.T.dot(l2_delta)
    syn0 = syn0 + X.T.dot(l1_delta)
    
    if j % 6000 == 0:
        print("\n Iteration {}: \n Distance from correct prediction: {}".format(j, y - output_layer))



 Iteration 0: 
 Distance from correct prediction: [[-0.75297785]
 [ 0.22977246]
 [ 0.29495085]
 [-0.72405614]]

 Iteration 6000: 
 Distance from correct prediction: [[-0.01054519]
 [ 0.01319209]
 [ 0.01706785]
 [-0.01503702]]

 Iteration 12000: 
 Distance from correct prediction: [[-0.00699539]
 [ 0.00853642]
 [ 0.01096649]
 [-0.0098401 ]]

 Iteration 18000: 
 Distance from correct prediction: [[-0.00556491]
 [ 0.00665954]
 [ 0.0086165 ]
 [-0.00777354]]

 Iteration 24000: 
 Distance from correct prediction: [[-0.00474119]
 [ 0.00558785]
 [ 0.00729363]
 [-0.00659081]]

 Iteration 30000: 
 Distance from correct prediction: [[-0.00419021]
 [ 0.0048776 ]
 [ 0.00642058]
 [-0.00580231]]

 Iteration 36000: 
 Distance from correct prediction: [[-0.00378921]
 [ 0.00436527]
 [ 0.00579071]
 [-0.00522965]]

 Iteration 42000: 
 Distance from correct prediction: [[-0.00348102]
 [ 0.00397472]
 [ 0.00530948]
 [-0.00479016]]

 Iteration 48000: 
 Distance from correct prediction: [[-0.00323489]
 [ 0.00
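
The loop above can also be packaged as a function so that runs are repeatable. This is a sketch, not the original author's code: the function name `train`, the `seed` and `learning_rate` parameters, and the use of `np.random.RandomState` are all assumptions added for illustration. With `learning_rate=1.0` the update rule is the same as in the loop above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, n_hidden=4, iterations=10000, learning_rate=1.0, seed=1):
    """Full-batch training of the 2-layer sigmoid network (hypothetical helper)."""
    rng = np.random.RandomState(seed)  # assumed seeding, not in the original
    w_hidden = 2 * rng.random_sample((X.shape[1], n_hidden)) - 1
    w_output = 2 * rng.random_sample((n_hidden, y.shape[1])) - 1
    mean_abs_errors = []
    for _ in range(iterations):
        # Forward pass over all instances at once
        hidden = sigmoid(X.dot(w_hidden))
        output = sigmoid(hidden.dot(w_output))
        error = y - output
        mean_abs_errors.append(np.mean(np.abs(error)))
        # Backward pass: deltas reuse the stored activations
        output_delta = error * output * (1 - output)
        hidden_delta = output_delta.dot(w_output.T) * hidden * (1 - hidden)
        # One weight update per pass over the batch
        w_output = w_output + learning_rate * hidden.T.dot(output_delta)
        w_hidden = w_hidden + learning_rate * X.T.dot(hidden_delta)
    return w_hidden, w_output, mean_abs_errors

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T

w_hidden, w_output, mean_abs_errors = train(X, y)
print("first error:", mean_abs_errors[0], "last error:", mean_abs_errors[-1])
```

Fixing the seed makes the printed errors repeatable from run to run, which the unseeded cells above are not.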

In [9]:
print("Final weights for features: \n{}".format(syn0))
print()
print("Final weights for hidden layer to output: \n{}".format(syn1))

Final weights for features: 
[[-6.84907556 -7.22261104 -4.51861542 -0.72959157]
 [-4.9640562   6.24814052  7.44527005 -2.99103074]
 [ 1.93997688 -3.15757874  1.52582139  2.66809815]]

Final weights for hidden layer to output: 
[[-7.29438893]
 [11.29629496]
 [-7.53537454]
 [ 6.68733369]]
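
As a quick check, the final weights can be run through the forward pass again to see what the network predicts. This is a sketch: the weight values are copied from the printed output above (so they are rounded to eight decimals), and the names `final_syn0`, `final_syn1` and the 0.5 threshold are assumptions introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T

# Copied from the printed final weights (rounded values)
final_syn0 = np.array([
    [-6.84907556, -7.22261104, -4.51861542, -0.72959157],
    [-4.9640562, 6.24814052, 7.44527005, -2.99103074],
    [1.93997688, -3.15757874, 1.52582139, 2.66809815],
])
final_syn1 = np.array([[-7.29438893], [11.29629496], [-7.53537454], [6.68733369]])

# Forward pass with the trained weights
hidden = sigmoid(X.dot(final_syn0))
predictions = sigmoid(hidden.dot(final_syn1))

# Threshold at 0.5 (an assumed cut-off) to get hard labels
predicted_labels = (predictions > 0.5).astype(int)
print(predictions.round(3).ravel())
print(np.array_equal(predicted_labels, y))
```

If the thresholded outputs match `y`, the trained network has learned the four training instances.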
