# Neural Networks and Deep Learning: Code Steps

This notebook is based on Michael Nielsen's book [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html). It walks through the code step by step, with comments along the way.

# Chapter 1

## Initialization of network

Import the basic Python libraries.

In [1]:
#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

The list **sizes** contains the number of neurons in the respective layers of the network.  For example, if the list was ```[2, 3, 1]``` then it would be a three-layer network, with the first layer containing 2 neurons, the second layer 3 neurons, and the third layer 1 neuron.

In [1]:
sizes = [2, 3, 3, 1]

The biases and weights for the network are initialized randomly, using a Gaussian distribution with mean 0, and variance 1.

Note that the first layer is assumed to be an input layer, and by convention we won't set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

### 'numpy.random.randn' usage

#### note
---
For random samples from $N(\mu, \sigma^2)$, use:
~~~~
sigma * np.random.randn(...) + mu
~~~~


#### examples
---
~~~~
>>> np.random.randn()
2.1923875335537315 #random
~~~~


Two-by-four array of samples from N(3, 6.25):
~~~~
>>> 2.5 * np.random.randn(2, 4) + 3
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],  #random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]]) #random
~~~~
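
As a quick sanity check of the scaling rule above, the sketch below draws many samples and compares the empirical mean and variance with $\mu = 3$ and $\sigma^2 = 6.25$. The seed and the sample count are arbitrary choices made for this illustration, not part of the original notebook.

```python
import numpy as np

np.random.seed(0)            # arbitrary seed, only to make the run repeatable
mu, sigma = 3.0, 2.5         # target mean and standard deviation (variance 6.25)

# Scale and shift standard-normal samples, as in the rule above.
samples = sigma * np.random.randn(100000) + mu

sample_mean = samples.mean()  # should be close to mu
sample_var = samples.var()    # should be close to sigma ** 2
print(sample_mean, sample_var)
```

Note that the multiplier is the standard deviation $\sigma$, not the variance $\sigma^2$.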

In [8]:
bias = [np.random.randn(y, 1) for y in sizes[1:]]    # one column vector of biases per non-input layer

In [9]:
bias

[array([[ 0.48724328],
        [ 0.93406278],
        [ 0.42375429]]), array([[ 1.27325374],
        [ 0.06224019],
        [-0.74347155]]), array([[ 0.17992025]])]
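
The values themselves are random, but the shapes are fixed by `sizes`. A minimal sketch that checks only the shapes (the name `bias_shapes` is introduced here for illustration):

```python
import numpy as np

sizes = [2, 3, 3, 1]  # same layer sizes as above

# One (y, 1) column vector per layer after the input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]

bias_shapes = [b.shape for b in biases]
print(bias_shapes)  # [(3, 1), (3, 1), (1, 1)]
```

There are `len(sizes) - 1` bias vectors because the input layer has none.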

__zip()__ in conjunction with the \* operator can be used to unzip a list:
~~~~
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = list(zip(x, y))  # list() so the pairs can be printed and reused in Python 3 as well
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)
>>> x == list(x2) and y == list(y2)
True
~~~~

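Each weight matrix is determined by a *previousNumNeurons-nextNumNeurons* pair of adjacent layers, so we pair up ```sizes[:-1]``` and ```sizes[1:]``` with ```zip```.

This is a neat trick: ```np.random.randn(y, x)``` gives a matrix with one row per neuron in the next layer and one column per neuron in the previous layer.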

In [10]:
weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In [11]:
weights

[array([[-1.09087206, -1.02480915],
        [ 0.73758621, -0.13678059],
        [ 1.63535255,  0.7132227 ]]),
 array([[-0.12201119,  0.25199167,  1.48160204],
        [ 0.08937866, -0.19738687,  1.16034127],
        [ 0.34190555, -0.31483018,  1.76061759]]),
 array([[-0.19020864, -1.23776123, -0.77610618]])]
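
To make the pairing explicit, the sketch below prints the layer pairs produced by `zip` and the resulting matrix shapes. The names `layer_pairs`, `weight_shapes`, and `final_shape` are introduced only for this illustration.

```python
import numpy as np

sizes = [2, 3, 3, 1]  # same layer sizes as above

# Adjacent (previous layer, next layer) sizes.
layer_pairs = list(zip(sizes[:-1], sizes[1:]))
print(layer_pairs)  # [(2, 3), (3, 3), (3, 1)]

# Each matrix has shape (next, previous).
weights = [np.random.randn(y, x) for x, y in layer_pairs]
weight_shapes = [w.shape for w in weights]
print(weight_shapes)  # [(3, 2), (3, 3), (1, 3)]

# With this ordering, np.dot(w, a) maps an (x, 1) column to a (y, 1) column,
# so the matrices chain from the input layer to the output layer.
a = np.zeros((sizes[0], 1))
for w in weights:
    a = np.dot(w, a)
final_shape = a.shape
print(final_shape)  # (1, 1)
```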

## Summary of the initialization

In [17]:
class Network(object):
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1: ]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[ :-1], sizes[1: ])]

So, for example, to create a *Network* object with
* 2 neurons in the first layer,
* 3 neurons in the second layer,
* 1 neuron in the final layer,

we'd use the code:

In [18]:
net = Network([2, 3, 1])

In [26]:
net.weights[0]    # Numpy matrix storing weights connecting 1st and 2nd layers of neurons.

array([[ 1.59107265, -0.3271491 ],
       [ 0.50645772, -1.1246387 ],
       [ 1.44765442,  0.30688227]])

In [27]:
net.weights[1]    # Numpy matrix storing weights connecting 2nd and 3rd layers of neurons.

array([[-0.14067259, -1.20439729, -0.22918305]])

Define the sigmoid function separately, outside the class.

In [30]:
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
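
A short sketch of how this function behaves. Because it is built from NumPy operations, it applies elementwise to arrays. The input values below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mid = sigmoid(0.0)                      # 0.5, the midpoint of the curve
z = np.array([[-2.0], [0.0], [2.0]])    # arbitrary column of inputs
out = sigmoid(z)                        # applied elementwise; same shape as z

# The curve is symmetric about (0, 0.5): sigmoid(-z) = 1 - sigmoid(z).
symmetric = np.allclose(sigmoid(-z), 1.0 - sigmoid(z))
print(mid, out.shape, symmetric)
```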

Add a *feedforward* method to the *Network* class, which, given an input $a$ to the network, returns the corresponding output.

In [34]:
    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            # Apply one layer at a time, using that layer's biases b and weights w.
            a = sigmoid(np.dot(w, a) + b)
        return a
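
The method above is shown on its own, so it cannot be run as written. As a minimal sketch, the same loop can be tried as a standalone function; the function signature, the seed, and the input values below are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(biases, weights, a):
    """Standalone version of the method above (hypothetical helper)."""
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

np.random.seed(0)  # arbitrary seed, only to make the run repeatable
sizes = [2, 3, 1]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

x = np.array([[0.5], [-1.0]])            # input as a (2, 1) column vector
output = feedforward(biases, weights, x)
print(output.shape)  # (1, 1)

# A flat (2,) input does not raise an error, but it broadcasts against the
# (3, 1) biases and silently yields the wrong shape.
flat_output = feedforward(biases, weights, np.array([0.5, -1.0]))
print(flat_output.shape)  # (1, 3)
```

The input should therefore be an `(n, 1)` column vector, matching the shape of the biases.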


Learning process: stochastic gradient descent (SGD) is used to find better weights $w_k$ and biases $b_l$. The idea is to avoid computing the gradient over the whole *training data set*. Instead, the gradient is computed over a small, randomly chosen sample of training inputs, the so-called **mini_batch**, and used as an estimate of the gradient over the whole *training data set*.

* ```training_data```, a list of tuples ```(x, y)``` holding the training inputs and the desired outputs.
* ```epochs```, the number of epochs to train for.
* ```mini_batch_size```, the size of the mini-batches to use when sampling.
* ```eta```, the learning rate $\eta$.
* ```test_data```, optional; used for tracking progress after each epoch.

In [35]:
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """
        Train the neural network using mini-batch stochastic
        gradient descent.  
        
        The "training_data" is a list of tuples "(x, y)" representing 
        the training inputs and the desired outputs.  
        
        The other non-optional parameters are self-explanatory.  
        
        If "test_data" is provided then the network will be evaluated 
        against the test data after each epoch, and partial progress printed out.  
        This is useful for tracking progress, but slows things down substantially.
        
        """
        
        if test_data: 
            n_test = len(test_data)
            
        n = len(training_data)
        
        # range() is used so the code also runs on Python 3, where
        # range() is lazy and xrange() no longer exists.
        for j in range(epochs):
            # Shuffle the input data set, here is the training data set.
            random.shuffle(training_data)
            
            # Divide the training data set into groups, which here is called
            # mini_batches. Every group 'mini_batch' with the 'mini_batch_size'
            # is one of the element of group sets 'mini_batches'.
            mini_batches = [
                training_data[k : k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            
            if test_data:
                print("Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {0} complete".format(j))
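
The shuffle-and-slice step can be tried in isolation. The sketch below uses toy `(x, y)` pairs and an arbitrary seed, both made up for illustration, and shows what happens when the data set size is not a multiple of `mini_batch_size`.

```python
import random

# Toy stand-in for the training data: ten (x, y) pairs.
training_data = [(x, x * x) for x in range(10)]
mini_batch_size = 3
n = len(training_data)

random.seed(0)                 # arbitrary seed, only to make the run repeatable
random.shuffle(training_data)  # shuffles the list in place

# Consecutive slices of the shuffled list, as in SGD above.
mini_batches = [
    training_data[k : k + mini_batch_size]
    for k in range(0, n, mini_batch_size)]

batch_sizes = [len(mb) for mb in mini_batches]
print(batch_sizes)  # [3, 3, 3, 1]
```

The last mini-batch is smaller than the others. Every training example still appears in exactly one mini-batch per epoch.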


To summarize the steps above, for each epoch:
  1. Randomly shuffle the training data set.
  2. Partition it into mini-batches of the given size.
  3. For each mini-batch:
     apply a single step of gradient descent with ```update_mini_batch```.
     Here, the update step uses **backpropagation** to compute the gradients needed to update the weights and biases.

In [5]:
    def update_mini_batch(self, mini_batch, eta):
        """
        Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate.
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
            
        self.weights = [w - (eta / len(mini_batch)) * nw 
                        for w, nw in zip(self.weights, nabla_w)]
        
        self.biases = [b - (eta / len(mini_batch)) * nb 
                       for b, nb in zip(self.biases, nabla_b)]
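
In equation form, the last two statements can be read as the update rule below. This is a restatement of the code under two assumptions: $m$ stands for `len(mini_batch)`, and `self.backprop(x, y)` (not defined in this section) returns the gradients $\nabla_b C_x$ and $\nabla_w C_x$ of the cost for the single training example $x$.

```latex
\begin{aligned}
w &\rightarrow w - \frac{\eta}{m} \sum_{(x, y) \,\in\, \text{mini-batch}} \nabla_w C_x \\
b &\rightarrow b - \frac{\eta}{m} \sum_{(x, y) \,\in\, \text{mini-batch}} \nabla_b C_x
\end{aligned}
```

The sums correspond to `nabla_w` and `nabla_b`, which accumulate the per-example gradients in the loop. Dividing by $m$ turns each sum into an average, which is the mini-batch estimate of the gradient over the whole training data set.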
