### Stochastic Gradient Descent
We iterate through the whole dataset and perform prediction and weight updates for each training example separately. 

In [10]:
import numpy as np

W = np.array([0.5, 0.48, -.7])

alpha = 0.1

X = np.array([[ 1, 0, 1 ],
              [ 0, 1, 1 ],
              [ 0, 0, 1 ],
              [ 1, 1, 1 ],
              [ 0, 1, 1 ],
              [ 1, 0, 1 ]])

Y = np.array([0,1,0,1,1,0])

def forward_propogation(X, W):
    return X.dot(W)

def l2_error(P, Y):
    return (Y-P)**2
    
def gradient(X, D):
    return np.multiply(X,D)

def weight_update(W, WD, alpha):
    W -= alpha * WD
    return W

def error_optimization(X, W, Y, epochs, alpha):
    for epoch in range(epochs):
        total_error = 0
        for rowIDX in range(len(X)):
            P = forward_propogation(X[rowIDX,:], W)
            E = l2_error(P, Y[rowIDX])
            total_error += E
            D = P-Y[rowIDX]              # delta: how far the prediction is from the target
            WD = gradient(X[rowIDX], D)  # weight deltas: the delta scaled by each input
            W = weight_update(W, WD, alpha)
            print(f"Epoch {epoch} | Prediction {round(P,2)} | Error {E}")
#             print(f"Weights {W} | Weight Deltas {WD}")
        print(f"Total Error: {total_error}")

In [11]:
error_optimization(X,W,Y,1000,alpha)

Epoch 0 | Prediction -0.2 | Error 0.03999999999999998
Epoch 0 | Prediction -0.2 | Error 1.44
Epoch 0 | Prediction -0.56 | Error 0.31359999999999993
Epoch 0 | Prediction 0.62 | Error 0.14745599999999992
Epoch 0 | Prediction 0.17 | Error 0.6842598400000001
Epoch 0 | Prediction 0.18 | Error 0.030807270400000003
Total Error: 2.6561231104
Epoch 1 | Prediction 0.14 | Error 0.019716653055999997
Epoch 1 | Prediction 0.31 | Error 0.48073921463296004
Epoch 1 | Prediction -0.35 | Error 0.11912040471029758
Epoch 1 | Prediction 1.01 | Error 4.4054335374336594e-05
Epoch 1 | Prediction 0.48 | Error 0.2719586253784771
Epoch 1 | Prediction 0.27 | Error 0.07129122555848956
Total Error: 0.9628701776715985
Epoch 2 | Prediction 0.21 | Error 0.04562638435743332
Epoch 2 | Prediction 0.53 | Error 0.21646497866936446
Epoch 2 | Prediction -0.26 | Error 0.0679506481084675
Epoch 2 | Prediction 1.13 | Error 0.017408924772739018
Epoch 2 | Prediction 0.63 | Error 0.13877681858052432
Epoch 2 | Prediction 0.25 | Error

### Learning Correlations
Each training example exerts a positive or a negative pressure on the weights. Based on this pressure the weights 'learn' a correlation with the input variables.

Our prediction is a weighted sum of our inputs. Our learning algorithm rewards inputs that correlate with our output by putting upward pressure on their weights, while penalizing inputs that show no correlation with downward pressure on theirs.
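To make the idea of pressure concrete, here is a minimal sketch that computes the change each training example would apply to the weights, using the same `X`, `Y`, `W`, and `alpha` values as the stochastic gradient descent cell. The helper name `pressure` is hypothetical and not part of the notebook. Every example is evaluated against the same starting weights, so this shows the direction of each push rather than a full training run.

```python
import numpy as np

# Same data and starting weights as the stochastic gradient descent cell.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 1]])
Y = np.array([0, 1, 0, 1, 1, 0])
W = np.array([0.5, 0.48, -0.7])
alpha = 0.1

def pressure(x, y, W, alpha):
    """Hypothetical helper: the change one example would apply to W."""
    P = x.dot(W)           # prediction: weighted sum of the inputs
    D = P - y              # positive when the prediction is too high
    return -alpha * D * x  # inputs equal to 0 receive no pressure at all

for x, y in zip(X, Y):
    print(f"Input {x} | Target {y} | Pressure {pressure(x, y, W, alpha)}")
```

Positive entries push a weight up and negative entries push it down. An input that is 0 for a given example contributes nothing to the prediction, so its weight is left alone by that example.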

Sometimes correlation happens by accident, and that is called overfitting. It occurs when the neural network stumbles on a weight configuration that makes the prediction match the output perfectly (such that error == 0) without actually allocating the heaviest weights to the best inputs. Once the error is zero, the weight updates are zero too, so the neural network stops learning.
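The following sketch illustrates why a zero error halts learning. The weights are hand-picked for illustration (an assumption, not values from the notebook) so that the prediction for one example is exactly right.

```python
import numpy as np

x = np.array([1, 0, 1])         # one training example
y = 0                           # its target
W = np.array([0.3, 0.9, -0.3])  # illustrative weights chosen so that x.dot(W) == y

P = x.dot(W)  # prediction
D = P - y     # the delta is 0 because the prediction matches the target
WD = D * x    # so every weight delta is 0 as well

print(f"Prediction {P} | Delta {D} | Weight Deltas {WD}")
```

Note that the weight on the second input (0.9 here) could be any value and this example would still produce zero error, so the network receives no signal about whether that weight is sensible.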

Neural networks are very flexible. They can find many different weight configurations that will correctly predict for a subset of your training data. In fact, if we trained our neural network on the first 2 training examples, it would likely stop learning at a point where it did NOT work well for our other training examples. 

In essence, it memorized the two training examples instead of actually finding the correlation that will generalize to any possible input configuration.
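As a rough illustration of this memorization effect, the sketch below trains the single-layer network on only the first two examples and then measures the error on two examples it never saw. The split into training and held-out rows, and the choice of 500 epochs, are assumptions made for this example.

```python
import numpy as np

X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
Y = np.array([0, 1, 0, 1])
W = np.array([0.5, 0.48, -0.7])
alpha = 0.1

# Train on the first two examples only.
for epoch in range(500):
    for x, y in zip(X[:2], Y[:2]):
        D = x.dot(W) - y       # delta for this example
        W = W - alpha * D * x  # gradient step

train_error = np.sum((X[:2].dot(W) - Y[:2]) ** 2)
held_out_error = np.sum((X[2:].dot(W) - Y[2:]) ** 2)

print(f"Training error: {train_error}")
print(f"Held-out error: {held_out_error}")
```

The training error shrinks to practically zero, at which point learning stops, while the error on the held-out rows stays clearly above zero.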

But what if there is no correlation between the input dataset and the output dataset? In that case, we can create our own intermediate dataset that has some limited amount of correlation with the output. We do this by using weights to convert our input dataset into values that show some limited correlation with the output.

You can think of this method as stacking two neural networks on top of each other. In the first neural network, we use some weights on our input dataset to convert it into our intermediate dataset. In the second neural network, we take our intermediate dataset and use weights to measure the correlation with the output.
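The stacking idea can be sketched as two matrix multiplications. The weights below are random and untrained, and the intermediate size of 4 is an assumption chosen for illustration. No activation function is applied in this sketch.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

hidden_size = 4  # assumed width of the intermediate dataset
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1  # input -> intermediate
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1  # intermediate -> output

intermediate = X.dot(weights_0_1)            # first network: 4 values per example
prediction = intermediate.dot(weights_1_2)   # second network: 1 prediction per example

print(intermediate.shape, prediction.shape)  # (4, 4) (4, 1)
```

Each row of `intermediate` is the re-encoded version of one input row, and the second set of weights only ever sees that re-encoded version.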

**Challenge**: The only problem here is: what do we use as the error function for the first network? How do we measure whether the intermediate dataset it creates is one that will correlate with the output? This is where _Backpropagation_ comes in.

### Backpropagation
What is the prediction from layer1 (our input layer) to layer2 (our intermediate dataset layer)? It's just a weighted sum of the values at layer1. So if layer2 is too "high" by _x_ amount, how do we know which values at layer1 contributed to the error?

It's simple: the ones with the _higher weights_ contributed more, and the ones with the lower weights contributed less. For example, if a value has a corresponding weight of 0, how much did that value contribute to the network's error? Zero!

Our weights from layer1 to layer2 describe how much each layer1 value contributes to the layer2 prediction. This means those weights also exactly describe how much each layer1 neuron contributes to the layer2 error. 

So, how do we use the delta at layer2 to figure out the delta at layer1? We simply multiply the layer2 delta by each of the weights connecting layer1 to layer2 (and, when an activation function is used, by its derivative).

Backpropagation lets us say: "If you want this neuron to be $X$ amount higher, then each of these previous 4 neurons needs to be $X \times \text{weight}$ amount higher/lower, because each weight was amplifying the prediction by a factor of that weight."
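Here is a small worked sketch of that statement for a single example with 4 neurons feeding one output. All numbers are made up for illustration, the first weight is deliberately 0, and no activation function is involved. The delta is computed as target minus prediction, the same sign convention as the `delta` function in the notebook's deep network cell.

```python
import numpy as np

layer_1 = np.array([[0.5, 1.0, 2.0, 3.0]])              # illustrative values of 4 neurons
weights_1_2 = np.array([[0.0], [0.5], [-1.0], [2.0]])   # illustrative weights; the first is 0

layer_2 = layer_1.dot(weights_1_2)  # prediction: weighted sum of the layer_1 values
target = np.array([[4.0]])          # illustrative target

layer_2_delta = target - layer_2                  # how much layer_2 should move
layer_1_delta = layer_2_delta.dot(weights_1_2.T)  # each neuron's share: delta times its weight

print(f"Prediction {layer_2} | Layer 2 delta {layer_2_delta}")
print(f"Layer 1 deltas {layer_1_delta}")
```

The neuron with weight 0 receives a delta of zero, the neuron with the largest weight receives the largest share, and the neuron with a negative weight is asked to move in the opposite direction.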

### Nonlinearity
If we just keep passing on the output of forward propagation without any modification, then no matter how many layers we stack, we will only ever have modeled a linear function. The hidden layers become useless, and we cannot model complex non-linear relationships between the input and output variables. To fix that, we apply non-linear activation functions like `relu` to the hidden layers.
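The sketch below shows why stacked linear layers collapse into one. The weights `W1` and `W2` are hand-picked for illustration (an assumption, not learned values). Without an activation, the two layers are equivalent to a single layer whose weight matrix is `W1.dot(W2)`. With `relu` in between, the same weights produce a pattern that no single linear layer over these inputs reproduces here.

```python
import numpy as np

X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

# Hand-picked illustrative weights (an assumption, not learned values).
W1 = np.array([[ 1, -1],
               [-1,  1],
               [ 0,  0]])  # input -> hidden
W2 = np.array([[1],
               [1]])       # hidden -> output

def relu(Z):
    return (Z > 0) * Z

stacked = X.dot(W1).dot(W2)    # two layers, no activation in between
collapsed = X.dot(W1.dot(W2))  # one layer using the merged weight matrix
print(np.array_equal(stacked, collapsed))  # True: matrix multiplication is associative

with_relu = relu(X.dot(W1)).dot(W2)  # same weights, relu applied to the hidden layer
print(stacked.ravel())    # [0 0 0 0]
print(with_relu.ravel())  # [1 1 0 0]
```

With these particular weights, the purely linear stack predicts 0 for every row, while the version with `relu` outputs 1 only when exactly one of the first two inputs is on.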

### Deep Neural Network

In [103]:
def relu(Z):
    return (Z > 0) * Z
 
# derivative of relu
def relu_d(O):
    return O>0
    
alpha = 0.2
hidden_size=4

np.random.seed(42)
# Generate weights
weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1
weights = [weights_0_1, weights_1_2]

X = np.array([[ 1, 0, 1 ],
              [ 0, 1, 1 ],
              [ 0, 0, 1 ],
              [ 1, 1, 1 ]])


Y = np.array([[ 1, 1, 0, 0]]).T

def forward_propogation(X, weights):
    layers = [X]
    for i in range(len(weights)-1):
        Z = layers[i].dot(weights[i])
        A = relu(Z)
        layers.append(A)
    output = layers[-1].dot(weights[-1])
    layers.append(output)
    return layers

def l2_error(P, Y):
    return (Y-P)**2

def delta(P, Y):
    return Y-P

def weight_delta(X, D):
    return np.multiply(X,D)
    
def error_optimization(X, weights, Y, epochs, alpha):
    for epoch in range(epochs):
        layer_2_error = 0
        for rowIDX in range(len(X)):
            layers = forward_propogation(X[rowIDX:rowIDX+1], weights)
            layer_2_error += np.sum(l2_error(layers[-1], Y[rowIDX:rowIDX+1]))
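            # delta at the output layer: how far off the prediction was
            layer_2_WD = delta(layers[-1], Y[rowIDX:rowIDX+1])
            # backpropagate: scale the output delta by the weights and zero it where relu was inactive
            layer_1_WD = layer_2_WD.dot(weights[-1].T) * relu_d(layers[1])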
            
            # update weights
            weights[1] += alpha * layers[1].T.dot(layer_2_WD)
            weights[0] += alpha * layers[0].T.dot(layer_1_WD)
            
        if epoch% 10 == 9:
            print(f"Error: {layer_2_error}")

In [104]:
error_optimization(X, weights, Y, 200, alpha)

Error: 0.776352386929149
Error: 0.3901796861670468
Error: 0.0691973149044412
Error: 0.0038678704918441455
Error: 0.00020694429674102688
Error: 1.0848218544878106e-05
Error: 5.513021298112976e-07
Error: 2.7816924042541508e-08
Error: 1.4012857061851108e-09
Error: 7.056455534577176e-11
Error: 3.553130326953479e-12
Error: 1.7890716109835663e-13
Error: 9.00829350317375e-15
Error: 4.535831731230958e-16
Error: 2.2838693146292132e-17
Error: 1.1499672386177982e-18
Error: 5.790283061238402e-20
Error: 2.9155160053867668e-21
Error: 1.468039353041297e-22
Error: 7.391133279676253e-24


Check predictive power:

In [112]:
for i in range(len(X)):
    print(f"Prediction: {round(forward_propogation(X[i], weights)[-1][0],2)} | Ground Truth: {Y[i]}")

Prediction: 1.0 | Ground Truth: [1]
Prediction: 1.0 | Ground Truth: [1]
Prediction: 0.0 | Ground Truth: [0]
Prediction: 0.0 | Ground Truth: [0]


### Intermediate Layers Intuition
If you take a picture of a cat and isolate a random handful of pixels from it, can you derive any correlation between those pixels and the picture of the cat in its entirety? Most likely not! A handful of random pixels in isolation could belong to anything and be placed anywhere, without showing any sign of being correlated with a picture of a cat. However, we can build the correlation up slowly: the random pixels correlate with a group of pixels in one portion of the picture, that group correlates with a larger group of pixels, the larger group correlates with a portion of the cat's ear, the portion of the ear correlates with the whole ear, and so on.

Deep Learning is all about building this correlation iteratively from small parts of the data. 