# Simple Neural Network

https://iamtrask.github.io/2015/07/12/basic-python-network/
https://iamtrask.github.io/2015/07/27/python-network-part2/

## The network

### Part 1

Import numpy (mathematics library)

In [1]:
import numpy as np

The sigmoid function is defined as $s(x) = \frac{1}{1+e^{-x}}$. Its derivative is $s(x) * (1 - s(x))$. The derivative branch below does not call the function recursively to obtain $s(x)$: in actual usage the already computed sigmoid output is passed in, so it only has to return `x * (1 - x)`.

In [2]:
def sigmoid(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))
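
As a sanity check of the derivative shortcut, the sketch below compares `s * (1 - s)` against a finite-difference estimate of the slope. The names `points`, `h`, `analytic`, and `numeric` are illustrative and not part of the original notebook.

```python
import numpy as np

def sigmoid(x, deriv=False):
    # With deriv=True, x is assumed to already be a sigmoid output.
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

points = np.array([-2.0, 0.0, 0.5, 3.0])  # arbitrary test inputs
h = 1e-6                                   # step for the central difference

outputs = sigmoid(points)
analytic = sigmoid(outputs, deriv=True)    # pass the outputs, not the raw inputs
numeric = (sigmoid(points + h) - sigmoid(points - h)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Passing the raw inputs instead of the sigmoid outputs to the `deriv=True` branch would give a wrong slope, which is why the training loops always pass a layer's activations.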

Define the inputs. Each row is a data point that will be used to train the network.

`numpy.array` - generates a vector if passed in a list or a matrix if passed in a list of lists

In [3]:
x = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1]
])

`numpy.array().T` - transposes a matrix; a one-dimensional vector is returned unchanged

The expected outputs, one per row of `x`. The transposed row used in the next cell is equivalent to
```
np.array([
    [0],
    [0],
    [1],
    [1]
])
```

In [4]:
y = np.array([
    [0, 0, 1, 1]
]).T
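
The equivalence between the transposed row and the explicit column can be checked directly. This is a small sketch; the variable names are illustrative.

```python
import numpy as np

row_then_transpose = np.array([[0, 0, 1, 1]]).T
explicit_column = np.array([[0], [0], [1], [1]])

print(row_then_transpose.shape)                             # (4, 1)
print(np.array_equal(row_then_transpose, explicit_column))  # True

# Transposing a one-dimensional vector leaves its shape unchanged.
vector = np.array([0, 0, 1, 1])
print(vector.T.shape)                                       # (4,)
```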

Seed the random number generator with a fixed number so that every run produces the same values and progress can be compared between runs.

A PRNG (pseudorandom number generator) produces a sequence that looks random but is fully determined by its initial value, the seed. If the seed is different each time, so is the sequence; if the seed is the same number, the sequence is deterministic. For example, if seeding with 1 yields a, b, and c when the random function is run three times, then seeding with 1 again is certain to yield a, b, and c again.

In [5]:
np.random.seed(1)
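
The determinism described above can be seen in a short sketch. It uses separate `np.random.RandomState` generators rather than `np.random.seed`, purely so that the global generator seeded in the cell above is left untouched; the variable names are illustrative.

```python
import numpy as np

# Two generators with the same seed, one with a different seed.
first = np.random.RandomState(1).random_sample(3)
second = np.random.RandomState(1).random_sample(3)
other = np.random.RandomState(2).random_sample(3)

print(np.array_equal(first, second))  # True: same seed, same sequence
print(np.array_equal(first, other))   # False: different seed, different sequence
```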

Initialize the first layer's weights randomly with a mean of 0.

The mean is 0 because the expected value $E(x)$ is the sum of all possible values, each multiplied by its probability. `np.random.random` yields values between 0 and 1, all equally probable, so its expected value is 0.5. Multiplying by 2 and subtracting 1 gives $2 \cdot 0.5 - 1 = 0$, with the weights spread evenly between -1 and 1.

In [6]:
syn0 = 2 * np.random.random((3, 1)) - 1
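
The expected-value argument can be checked empirically by drawing many samples. This is a rough sketch: the sample size and the seed of the separate `RandomState` generator are arbitrary choices, and the names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)       # separate generator; seed chosen arbitrarily
raw = rng.random_sample(100000)      # uniform on [0, 1)
weights = 2 * raw - 1                # rescaled to [-1, 1)

print(raw.mean())                    # close to 0.5
print(weights.mean())                # close to 0
print(weights.min() >= -1, weights.max() < 1)
```

With only three weights, as in `syn0`, the sample mean will usually not be exactly 0; the statement is about the distribution the weights are drawn from.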

The actual training takes place in the for loop.

The first layer (`layer0`) is just the inputs, in this case `x`.

The second layer (`layer1`) is derived by doing a matrix multiplication of `layer0` and `syn0`, aka the inputs and the weights. `layer0` is a 4x3 matrix and `syn0` is a 3x1 matrix, so the product is a 4x1 column vector. Before the sigmoid is applied, it is equivalent to:

```
[
    [ syn0[0][0]*layer0[0][0] + syn0[1][0]*layer0[0][1] + syn0[2][0]*layer0[0][2] ],
    [ syn0[0][0]*layer0[1][0] + syn0[1][0]*layer0[1][1] + syn0[2][0]*layer0[1][2] ],
    [ syn0[0][0]*layer0[2][0] + syn0[1][0]*layer0[2][1] + syn0[2][0]*layer0[2][2] ],
    [ syn0[0][0]*layer0[3][0] + syn0[1][0]*layer0[3][1] + syn0[2][0]*layer0[3][2] ],
]
```
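
The expansion above can be checked against `np.dot`. In this sketch `inputs` and `weights` stand in for `layer0` and `syn0`; the weights come from a separate generator, so the notebook's own `syn0` is not modified.

```python
import numpy as np

inputs = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1]
])
weights = 2 * np.random.RandomState(1).random_sample((3, 1)) - 1

# Row-by-row sum of products, written out as in the expansion.
manual = np.array([
    [sum(weights[k][0] * inputs[row][k] for k in range(3))]
    for row in range(4)
])
product = np.dot(inputs, weights)

print(product.shape)                 # (4, 1)
print(np.allclose(manual, product))  # True
```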

This is fed into the sigmoid function to squash each value into the range between 0 and 1. `layer1_error` is then calculated as the difference between the expected value and the value output from the sigmoid function.

Next the error is multiplied by the slope of the sigmoid function at the values in `layer1`, which gives `layer1_delta`. The weight update is the matrix multiplication of the transposed inputs and this delta.

The intuition about the delta is that the slope of the sigmoid is very low where its output is close to 0 or 1, that is, where the network is already confident, so those rows change the weights very little. If the output is uncertain, meaning it is near 0.5 (a pre-activation value near 0), the slope is at its largest and the weights change by a lot more. The update is then added to the weights, at which point we can rinse and repeat.
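
To make the intuition concrete, the sketch below evaluates the slope for three hypothetical sigmoid outputs and applies the same made-up error to each. The values are chosen for illustration only.

```python
import numpy as np

outputs = np.array([0.01, 0.5, 0.99])  # confident low, uncertain, confident high
slopes = outputs * (1 - outputs)        # sigmoid slope expressed via its output

error = 0.1                             # same hypothetical error for every case
deltas = error * slopes

print(slopes)   # largest in the middle, where the output is 0.5
print(deltas)   # the uncertain case gets the largest correction
```

The slope peaks at 0.25 for an output of 0.5 and falls to about 0.0099 for outputs of 0.01 and 0.99, so an identical error moves the weights roughly 25 times more in the uncertain case.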

In [7]:
for i in range(10000):
    # layer 0 aka the inputs
    layer0 = x

    # layer 1
    layer1 = sigmoid(np.dot(layer0, syn0))

    layer1_error = y - layer1
    layer1_delta = layer1_error * sigmoid(layer1, deriv=True)

    # adjust the weights
    syn0 += np.dot(layer0.T, layer1_delta)
    
print('Output after training: ')
print(layer1)

Output after training: 
[[ 0.00966449]
 [ 0.00786506]
 [ 0.99358898]
 [ 0.99211957]]


### Part 2

Backpropagation does not optimize. It just moves all the error information from the end of the network to all the weights inside the network. Another algorithm has to then work with the information generated.

One such algorithm is gradient descent. Simplified gradient descent can be seen as such:

1. Calculate slope at current position
2. If slope is negative, move right
3. If slope is positive, move left
4. Repeat above steps until slope is 0

Naive gradient descent:

1. Calculate slope at current x position
2. Change x by the negative of the slope ($x = x - slope$)
3. Repeat above steps until slope is 0
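
The naive steps above can be sketched on a single-variable curve. The curve $f(x) = 0.25 (x - 3)^2$, the starting position, and the tolerance are all assumptions made for illustration; in floating point the slope rarely reaches exactly 0, so the loop stops once it is sufficiently close.

```python
def curve_slope(position):
    # Derivative of the illustrative curve f(x) = 0.25 * (x - 3)**2
    return 0.5 * (position - 3)

position = 10.0                      # arbitrary starting point
for step in range(200):
    s = curve_slope(position)
    if abs(s) < 1e-9:                # treat a tiny slope as "slope is 0"
        break
    position = position - s          # x = x - slope

print(position)                      # very close to the minimum at 3
```

On this gentle curve each step halves the distance to the minimum. On a steeper curve the same rule can jump past the minimum, which is the overshooting problem described next.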

#### Problem 1: Overshooting

If a step is too large, the update jumps past the minimum, and repeated overshooting can lead to divergence. To resolve this, each step can simply be made smaller: the gradients are multiplied by a number between 0 and 1 called the alpha.

Improved gradient descent:

1. Set alpha to some number between 0 and 1
2. Calculate slope at current x position
3. $x = x - (alpha * slope)$
4. Repeat above steps until slope is 0
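
The effect of alpha can be sketched on the curve $f(x) = x^2$, whose slope is $2x$. The curve, the function name `descend_quadratic`, and the step count are illustrative assumptions. With alpha equal to 1 (the naive rule) every update lands on the mirror-image point, so the search never settles; with a smaller alpha it shrinks toward the minimum.

```python
def descend_quadratic(start, alpha, steps=50):
    # Gradient descent on the illustrative curve f(x) = x**2 (slope 2 * x)
    position = start
    for _ in range(steps):
        position = position - alpha * (2 * position)
    return position

print(descend_quadratic(1.0, alpha=1.0))   # 1.0: bounces between 1 and -1 forever
print(descend_quadratic(1.0, alpha=0.1))   # close to 0, the minimum
```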

#### Problem 2: Local Minima

The algorithm might get caught at a local minimum: the slope there is 0, so the stopping condition is satisfied and it quits even though a lower point exists elsewhere.

To resolve this, we can have multiple random starting points. Neural networks do this by having very large hidden layers. Each hidden node in a layer starts out in a different random starting state. This allows the nodes to converge to different patterns.
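
The idea of multiple random starting points can be sketched in one dimension. The curve $f(x) = x^4 - 3x^2 + x$ is an assumption chosen because it has a local minimum near 1.13 and a lower, global minimum near -1.30. The function names, alpha, and step count are likewise illustrative, and this is a one-variable analogy rather than a neural network.

```python
import numpy as np

def bumpy(x):
    # Illustrative curve with two minima
    return x**4 - 3 * x**2 + x

def bumpy_slope(x):
    return 4 * x**3 - 6 * x + 1

def descend_from(start, alpha=0.01, steps=2000):
    position = start
    for _ in range(steps):
        position = position - alpha * bumpy_slope(position)
    return position

rng = np.random.RandomState(1)                 # separate generator, fixed seed
starts = rng.uniform(-2, 2, size=5)            # several random starting points
finishes = np.array([descend_from(s) for s in starts])
best = finishes[np.argmin(bumpy(finishes))]    # keep the lowest end point

print(np.round(finishes, 2))
print(round(float(best), 2))
```

A single run started at 2.0 settles in the shallower minimum near 1.13, while keeping the best of several runs makes it much more likely that the lower one is found.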

#### Problem 3: Slopes are Too Small

This can occur if the alpha is too small: each update barely moves the position, so convergence takes a very long time. It can be remedied by increasing the alpha.
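
How much a small alpha slows things down can be sketched by counting updates until the slope is nearly flat. The curve $f(x) = x^2$, the tolerance, and the name `steps_until_flat` are assumptions made for illustration.

```python
def steps_until_flat(alpha, start=1.0, tolerance=1e-6, max_steps=1000000):
    # Count updates on f(x) = x**2 until the slope magnitude drops below tolerance
    position = start
    for step in range(max_steps):
        s = 2 * position
        if abs(s) < tolerance:
            return step
        position = position - alpha * s
    return max_steps

for a in (0.001, 0.01, 0.1):
    print(a, steps_until_flat(a))
```

Each tenfold reduction in alpha increases the number of updates needed by roughly a factor of ten on this curve.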

Initialize the dataset.

In [8]:
x = np.array([
    [0, 1],
    [0, 1],
    [1, 0],
    [1, 0]
])

The expected answers.

In [9]:
y = np.array([
    [0, 0, 1, 1]
]).T

Seed the random number generator with a fixed number so the run is deterministic.

In [10]:
np.random.seed(1)

Initialize the weights with a mean of 0.

In [11]:
synapse0 = 2 * np.random.random((2, 1)) - 1

The actual training is done below.

`layer0` is just the training data set, and `layer1` is the product of the data and the weights passed through the activation function.

`synapse0_derivative` is obtained by multiplying the transposed inputs by the delta. Because the error is computed here as `layer1 - y`, the derivative is subtracted from the weights.

In [12]:
for i in range(10000):
    layer0 = x
    layer1 = sigmoid(np.dot(layer0, synapse0))
    
    layer1_error = layer1 - y
    layer1_delta = layer1_error * sigmoid(layer1, deriv=True)

    synapse0_derivative = np.dot(layer0.T, layer1_delta)
    synapse0 -= synapse0_derivative

print('Output after training:')
print(layer1)

Output after training:
[[ 0.00505119]
 [ 0.00505119]
 [ 0.99494905]
 [ 0.99494905]]
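
Part 1 computed the error as `y - layer1` and added the update, while the loop above computes `layer1 - y` and subtracts it. The sketch below applies one step under each convention and confirms that they produce the same weights. The names `inputs`, `targets`, and `weights` are illustrative stand-ins, and the weights come from a separate generator so the notebook state is not modified.

```python
import numpy as np

def sigmoid(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

inputs = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])
targets = np.array([[0, 0, 1, 1]]).T
weights = 2 * np.random.RandomState(1).random_sample((2, 1)) - 1

output = sigmoid(np.dot(inputs, weights))
slope = sigmoid(output, deriv=True)

# Part 1 convention: error = target - output, update is added
part1_weights = weights + np.dot(inputs.T, (targets - output) * slope)
# Part 2 convention: error = output - target, derivative is subtracted
part2_weights = weights - np.dot(inputs.T, (output - targets) * slope)

print(np.allclose(part1_weights, part2_weights))  # True
```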


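The alpha parameter scales the size of each iteration's update. Right before we update the weights, we multiply the weight update by the alpha. The values tried below range from 0.001 to 1000, so the smaller ones shrink the update and the larger ones enlarge it.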

In [13]:
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

Initialize the inputs and the expected outputs.

In [14]:
x = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1]
])

y = np.array([[0, 1, 1, 0]]).T

The network is trained once per alpha to show the effect of alpha on the error.

In [15]:
for alpha in alphas:
    print('\nTraining with alpha: {}'.format(str(alpha)))
    
    np.random.seed(1)
    
    synapse0 = 2 * np.random.random((3, 4)) - 1
    synapse1 = 2 * np.random.random((4, 1)) - 1
    
    for j in range(60000):
        layer0 = x
        layer1 = sigmoid(np.dot(layer0, synapse0))
        layer2 = sigmoid(np.dot(layer1, synapse1))
        
        layer2_error = layer2 - y
        layer2_delta = layer2_error * sigmoid(layer2, deriv=True)
        
        if j % 10000 == 0:
            print('Error after {} iterations: {}'.format(str(j), str(np.mean(np.abs(layer2_error)))))
        
        layer1_error = layer2_delta.dot(synapse1.T)
        layer1_delta = layer1_error * sigmoid(layer1, deriv=True)
        
        synapse1 -= alpha * layer1.T.dot(layer2_delta)
        synapse0 -= alpha * layer0.T.dot(layer1_delta)


Training with alpha: 0.001
Error after 0 iterations: 0.496410031903
Error after 10000 iterations: 0.495164025493
Error after 20000 iterations: 0.493596043188
Error after 30000 iterations: 0.491606358559
Error after 40000 iterations: 0.489100166544
Error after 50000 iterations: 0.485977857846

Training with alpha: 0.01
Error after 0 iterations: 0.496410031903
Error after 10000 iterations: 0.457431074442
Error after 20000 iterations: 0.359097202563
Error after 30000 iterations: 0.239358137159
Error after 40000 iterations: 0.143070659013
Error after 50000 iterations: 0.0985964298089

Training with alpha: 0.1
Error after 0 iterations: 0.496410031903
Error after 10000 iterations: 0.0428880170001
Error after 20000 iterations: 0.0240989942285
Error after 30000 iterations: 0.0181106521468
Error after 40000 iterations: 0.0149876162722
Error after 50000 iterations: 0.0130144905381

Training with alpha: 1
Error after 0 iterations: 0.496410031903
Error after 10000 iterations: 0.00858452565325
Err