
**Understanding stochastic gradient descent, backpropagation and their implementation**

**Gradient descent**

The key idea in training a neural network is to minimize a loss function whose input variables are
the weights and biases. Because this optimization problem is non-convex, we use gradient descent, more concretely its stochastic form.





A good analogy is a ball rolling down a valley, at every step moving in the direction of steepest descent. That direction is given by the negative gradient of the cost function at the current point. We therefore need the gradients of the cost function with respect to all weights and biases, and this is where backpropagation comes into play.
The stochastic aspect of gradient descent comes from updating the weights and biases after each small batch of data points has run through the network, rather than only after a pass over all examples. Each update then uses a noisier estimate of the gradient but is much cheaper to compute. A good analogy is a drunk man stumbling down the hill, as opposed to a man who carefully calculates every step and takes ages to get down.
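
The minibatch idea can be tried out on a problem much simpler than a neural network. The following is a minimal sketch, assuming a toy linear model fitted with a mean squared error; the function name `minibatch_sgd`, the toy data and all hyperparameter values are illustrative assumptions and are not part of the network built later.

```python
import numpy as np

def minibatch_sgd(X, y, learning_rate=0.1, epochs=200, batch_size=2, seed=0):
    """Fit y ~ X @ w + b by minibatch SGD on half the mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # reshuffle before every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]      # the last batch may be smaller
            residual = X[idx] @ w + b - y[idx]
            grad_w = X[idx].T @ residual / len(idx)    # gradient averaged over the batch
            grad_b = residual.mean()
            w -= learning_rate * grad_w                # step against the gradient
            b -= learning_rate * grad_b
    return w, b

# Hypothetical toy data generated from y = 2*x + 1 without noise
X_toy = np.linspace(-1.0, 1.0, 8).reshape(-1, 1)
y_toy = 2.0 * X_toy[:, 0] + 1.0

w_fit, b_fit = minibatch_sgd(X_toy, y_toy)
print(w_fit, b_fit)
```

Each update uses only `batch_size` examples, so the parameters move many times per epoch instead of once.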

**Backpropagation**
The pseudocode for backpropagation is as follows:





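1.   Input a set of training examples with their true labels.
2.   For each training example x, set a_(1) = x and:
3.   Feedforward: for each l = 2,3,...,L compute z_(l) = w_(l) dot a_(l-1) + b_(l) and a_(l) = activation_function(z_(l)).
4.   Output error: d_(L) = dC/da_(L) * activation_function'(z_(L)), where * is the elementwise product.
5.   Backpropagate the error: for each l = L-1,...,2 compute d_(l) = (w_(l+1).T dot d_(l+1)) * activation_function'(z_(l)).
6.   Update weights and biases: w_(l) = w_(l) - learning_rate * (d_(l) dot a_(l-1).T) and b_(l) = b_(l) - learning_rate * d_(l).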







Imagine backpropagation as walking through the network backwards, where at every step you need a specific key to open a door. After computing the cost, we first go back to the output neuron: the derivative of the cost function w.r.t. the output gets us to the front of the neuron, and to get inside we need the derivative of the activation. The same pattern repeats for every earlier layer.
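
One way to check that the backpropagation formulas are right is to compare a backpropagated gradient with a numerical one. The sketch below assumes a 3-4-1 sigmoid network and the quadratic cost C = 0.5 * (a - y)^2; the helper names (`forward`, `quadratic_cost`, `backprop`) and the random parameters are made up for this illustration and are separate from the class defined later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)      # hidden activations
    a2 = sigmoid(W2 @ a1 + b2)     # output activation
    return a1, a2

def quadratic_cost(params, x, y):
    _, a2 = forward(params, x)
    return 0.5 * np.sum((a2 - y) ** 2)

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    d2 = (a2 - y) * a2 * (1 - a2)        # output error: dC/da times sigmoid'(z)
    d1 = (W2.T @ d2) * a1 * (1 - a1)     # error pushed back one layer
    return d1 @ x.T, d1, d2 @ a1.T, d2   # dC/dW1, dC/db1, dC/dW2, dC/db2

rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)), rng.standard_normal((4, 1)),
          rng.standard_normal((1, 4)), rng.standard_normal((1, 1))]
x = np.array([[0.0], [1.0], [1.0]])
y = np.array([[1.0]])

grads = backprop(params, x, y)

# Central finite difference on a single entry of the first weight matrix
eps = 1e-5
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][2, 1] += eps
minus[0][2, 1] -= eps
numeric = (quadratic_cost(plus, x, y) - quadratic_cost(minus, x, y)) / (2 * eps)

print(grads[0][2, 1], numeric)
```

The two printed numbers should agree to many decimal places; a mismatch usually points to a wrong transpose or a missing activation derivative.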

We load the libraries that we will need.

In [0]:
import numpy as np

import random 

from scipy.stats import truncnorm

In [0]:
def sigmoid(x):
  return 1/(1+np.exp(-x))

Let us now comment on the specific parts of the NeuralNetwork class.

1. **init:** Here we specify the main parts of the network (number of input nodes, number of hidden nodes, number of output nodes and learning rate) and initialize the weights and biases. Note the index order of the weights: with, for example, 3 input nodes and 4 hidden nodes, the first weight matrix, weights_in, has dimension 4 x 3.

2. **train:** This is the feedforward and backpropagation part of the network. We first transpose the input data so that its dimensions match the weights in the dot product, and then proceed exactly as in the pseudocode. Note that this implementation uses the **cross-entropy cost function.** With the L2 (quadratic) cost, the output error d1 contains the derivative of the sigmoid, which is close to zero when the output neuron saturates and slows learning significantly. With cross-entropy (CE) we get rid of it, because the derivative of CE with respect to the output cancels the sigmoid derivative. The gradients of all weights and biases are saved as attributes.

3. **SGD:** We first initialize zero matrices of the appropriate dimensions for the weights in/out and biases in/out. We also specify the number of epochs (the number of passes over the whole dataset). At the start of each epoch we randomly shuffle the data: we write the indices of the dataset to an array, shuffle them, and then rearrange X and y accordingly. Within each minibatch we accumulate the weight and bias gradients and then update the weights and biases by the average of these gradients times the learning rate. After each minibatch we reset the accumulators to zero and repeat. We also track the error, printing it along the way.

4. **run:** Simply runs a test example to check that our neural network is giving us the desired results.
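
The effect of the cost function on the output error can be seen numerically. This is a small sketch, assuming a single sigmoid output neuron with true label 1; the function names and the chosen pre-activation values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_delta_quadratic(a, y):
    # dC/dz for C = 0.5 * (a - y)^2 with a = sigmoid(z)
    return (a - y) * a * (1 - a)

def output_delta_cross_entropy(a, y):
    # dC/dz for C = -(y*log(a) + (1-y)*log(1-a)); the sigmoid derivative cancels
    return a - y

z = np.array([-6.0, -2.0, 0.0])   # pre-activations; the true label is 1 in all cases
a = sigmoid(z)
y = np.ones_like(a)

delta_l2 = output_delta_quadratic(a, y)
delta_ce = output_delta_cross_entropy(a, y)

for zi, d2, dc in zip(z, delta_l2, delta_ce):
    print(f"z={zi:+.1f}  quadratic delta={d2:+.5f}  cross-entropy delta={dc:+.5f}")
```

For a badly wrong, saturated output (z = -6) the quadratic delta is tiny, so the weights barely move, while the cross-entropy delta stays close to -1. Expanding `-y*(1-a) + (1-y)*a` gives `a - y`, which is the same expression written out per label.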

In [0]:
class NeuralNetwork:
  
  def __init__(self, n_in_nodes, n_hidden_nodes, n_out_nodes, learning_rate):
    
    self.n_in_nodes = n_in_nodes
    self.n_hidden_nodes = n_hidden_nodes
    self.n_out_nodes = n_out_nodes
    self.learning_rate = learning_rate
    self.weights_in = np.random.randn(n_hidden_nodes, n_in_nodes)
    self.weights_out = np.random.randn(n_out_nodes, n_hidden_nodes)
    self.bias_in = np.random.randn(n_hidden_nodes,1)
    self.bias_out = np.random.randn(n_out_nodes,1)
  
  def train(self, input_data, input_labels):
    
    # Turn the inputs into column vectors (one example: 1x3 -> 3x1 in the toy problem below,
    # where weights_in is 4x3 and weights_out is 1x4)
    
    input_data = np.array(input_data, ndmin = 2).T
    input_labels = np.array(input_labels, ndmin = 2).T
    
    # Forward propagate to the hidden layer: (4x3) . (3x1) -> 4x1
    
    l0 = sigmoid(np.dot(self.weights_in, input_data)+self.bias_in)
    
    # Hidden layer to output layer: (1x4) . (4x1) -> 1x1
    
    l1 = sigmoid(np.dot(self.weights_out, l0)+self.bias_out)
    
    # Quadratic (L2) alternative, kept for reference:
    # self.output_error_L2 = 1/2 * (input_labels - l1) ** 2
    # d1 = (l1 - input_labels) * l1 * (1 - l1) -> the sigmoid-derivative factor l1*(1-l1) is what slows learning
    
    self.output_error_CE = -(input_labels * np.log(l1) + (1-input_labels) * np.log(1-l1))
    
    # Sign check on the output delta (l1 - label): for output 0.8 and label 0 it is positive,
    # so the update (minus learning rate times gradient) pushes the output down;
    # for output 0.1 and label 1 it is negative, so the update pushes the output up.
    
    # Backpropagate: deltas for the output layer (d1) and hidden layer (d0), following Nielsen's formulas
    
    # d1 uses the derivative of the cross-entropy cost, which cancels the sigmoid derivative; the speed-up is clearly visible
    
    d1 = -input_labels*(1-l1) + (1-input_labels)*l1
    d0 = np.dot(self.weights_out.T,d1)*(l0*(1-l0))
    
    # Weight gradients from Nielsen's formula: delta of a layer times the transposed activations feeding it
    
    self.dw_out = np.dot(d1,l0.T)
    self.dw_in = np.dot(d0,input_data.T)
    
    self.db_out = d1
    self.db_in = d0
    
  def SGD(self, X, y, epochs, batch_size):
    
    # Accumulators for the gradients summed over one minibatch (float arrays)
    batch_w_in_update = np.zeros((self.n_hidden_nodes, self.n_in_nodes))
    batch_w_out_update = np.zeros((self.n_out_nodes, self.n_hidden_nodes))
    batch_b_in_update = np.zeros((self.n_hidden_nodes,1))
    batch_b_out_update = np.zeros((self.n_out_nodes,1))
    
    for epoch in range(epochs):
      
      # shuffle randomly data before start of each epoch
      
      data_shuffle = [i for i in range(X.shape[0])]
      data_shuffle = random.sample(data_shuffle, len(data_shuffle))

      X = np.array([X[i] for i in data_shuffle])
      y = np.array([y[i] for i in data_shuffle])
      
      for minibatch in range(0,X.shape[0],batch_size):
        
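        # Clip the end index so a final, smaller minibatch does not run past the data
        for example in range(minibatch, min(minibatch + batch_size, X.shape[0])):
          
          # Accumulate the weight and bias gradients of this example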
          
          self.train(X[example],y[example])
          batch_w_in_update = batch_w_in_update + self.dw_in
          batch_w_out_update = batch_w_out_update + self.dw_out
          batch_b_in_update = batch_b_in_update + self.db_in
          batch_b_out_update = batch_b_out_update + self.db_out
        
        # Update current weights to new ones with minibatch update
        
        self.weights_in = self.weights_in - self.learning_rate/batch_size * batch_w_in_update 
        self.weights_out = self.weights_out - self.learning_rate/batch_size * batch_w_out_update
        
        self.bias_in = self.bias_in - self.learning_rate/batch_size * batch_b_in_update 
        self.bias_out = self.bias_out - self.learning_rate/batch_size * batch_b_out_update
        
        # Reset the accumulators before the next minibatch
        batch_w_in_update = np.zeros((self.n_hidden_nodes, self.n_in_nodes))
        batch_w_out_update = np.zeros((self.n_out_nodes, self.n_hidden_nodes))
        batch_b_in_update = np.zeros((self.n_hidden_nodes,1))
        batch_b_out_update = np.zeros((self.n_out_nodes,1))
        
      if (epoch % 10) == 0:

        # Cross-entropy of the last training example seen in this epoch (not a dataset average)
        print("epoch:" + str(epoch) + " error: " + str(np.mean(np.abs(self.output_error_CE))))
    
  def run(self, test_data):
     
    test_data = np.array(test_data, ndmin = 2).T
    l0 = sigmoid(np.dot(self.weights_in, test_data) + self.bias_in)
    l1 = sigmoid(np.dot(self.weights_out, l0) + self.bias_out)

    # np.round rounds every element of the output array
    return np.round(l1, 4)
    
    
    

In [0]:
net = NeuralNetwork(3, 4, 1, 0.001)

In [0]:
X = np.array([[0,0,1],
             [0,1,1],
             [1,0,1],
             [1,1,1],])

y = np.array([[0,0,1,1]]).T


In [369]:
net.SGD(X,y,200000,2)

epoch:0 error: 0.22361667150137002
epoch:20000 error: 0.2457238359188545
epoch:40000 error: 0.10892929600986187
epoch:60000 error: 0.05522566845861739
epoch:80000 error: 0.022494715098972076
epoch:100000 error: 0.025256216365263727
epoch:120000 error: 0.01980409301821385
epoch:140000 error: 0.009137114922408512
epoch:160000 error: 0.014331275752464531
epoch:180000 error: 0.011440468759849392


In [0]:
print(net.weights_out)

print(net.weights_in)

In [371]:
point = np.array([[1,0,1]])

net.run(point)

array([[0.99419361]])

In [0]:
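# Note: tensorflow.examples.tutorials ships with TensorFlow 1.x and is not available in TensorFlow 2.x
from tensorflow.examples.tutorials.mnist import input_data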

mnist = input_data.read_data_sets("MNIST.data/", one_hot = True)

In [0]:
print(mnist.train.labels.shape)

In [0]:
net_mnist = NeuralNetwork(784, 10 , 10, 0.1)

In [391]:
print(mnist.test.labels[5000])

[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]


In [392]:
net_mnist.run(mnist.test.images[5000])

array([[8.82050096e-07],
       [2.60535144e-03],
       [8.40765692e-03],
       [9.28880392e-01],
       [1.58662237e-04],
       [5.49915933e-02],
       [1.24461404e-06],
       [2.41482192e-05],
       [7.40199882e-04],
       [6.44000153e-07]])

In [381]:
net_mnist.SGD(mnist.train.images,mnist.train.labels,100,10)

epoch:0 error: 0.039886536431082344
epoch:10 error: 0.15911996688977187
epoch:20 error: 0.0014496427664972829
epoch:30 error: 0.002949616116542424
epoch:40 error: 0.04341890145081058
epoch:50 error: 0.4786440456064943
epoch:60 error: 0.006113164638449813
epoch:70 error: 0.004400160056795515
epoch:80 error: 0.0027326767033939254
epoch:90 error: 0.004302507788731307
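
To judge the trained MNIST network, classification accuracy on the test set is a more direct measure than the error of a single example. The following is a minimal sketch of such a check; the helper `accuracy` and the stand-in predictor `toy_predict` with its toy data are hypothetical and only serve to make the example runnable on its own.

```python
import numpy as np

def accuracy(predict, images, one_hot_labels):
    """Fraction of examples whose arg-max prediction matches the one-hot label."""
    correct = 0
    for image, label in zip(images, one_hot_labels):
        scores = np.asarray(predict(image)).ravel()   # flatten a column vector of scores
        correct += int(np.argmax(scores) == np.argmax(label))
    return correct / len(images)

# Stand-in predictor: treats the first three pixel values as class scores (illustration only)
def toy_predict(image):
    return image[:3]

toy_images = np.array([[0.9, 0.1, 0.0, 0.5],
                       [0.2, 0.7, 0.1, 0.5],
                       [0.6, 0.3, 0.1, 0.5]])
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1]])

print(accuracy(toy_predict, toy_images, toy_labels))
```

Assuming the objects from the cells above are still in memory, the same helper could be called as `accuracy(net_mnist.run, mnist.test.images, mnist.test.labels)`.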
