### The Code

import random
import numpy as np

class Network():
    def __init__(self,sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y,1) for y in sizes[1:]]
        self.weights = [np.random.randn(y,x) for x,y in zip(sizes[:-1],sizes[1:])]
        
    def feedforward(self,a):
        for b,w in zip(self.biases,self.weights):
            a = sigmoid(np.dot(w,a)+b)
        return a
    
    def SGD(self, training_data, epochs, mini_batch_size,eta, test_data = None):
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size] for
                            k in range(0,n,mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch,eta)
            if test_data:
                # report correct classifications out of the test set size
                print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {0} complete".format(j))
                
    def update_mini_batch(self, mini_batch, eta):
        # accumulate the gradient over every example in the mini-batch
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x,y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x,y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip (self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb for b, nb in zip (self.biases, nabla_b)]
        
    def backprop(self,x,y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        activation = x
        activations = [x]
        zs = []
        for b,w in zip(self.biases, self.weights):
            z = np.dot(w,activation) +b
            zs.append(z)
            activation =  sigmoid(z)
            activations.append(activation)
        # backward pass: start from the error in the output layer
        delta = self.cost_derivative(activations[-1],y)*sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # l = 2 is the second-to-last layer, l = 3 the one before it, and so on
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(),delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta,activations[-l-1].transpose())
        return (nabla_b, nabla_w)
        
        
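    def evaluate(self, test_data):
        # count the test inputs whose most active output neuron matches the label
        test_results = [(np.argmax(self.feedforward(x)),y) for (x,y) in test_data]
        return sum(int(x==y) for (x,y) in test_results)

    def cost_derivative(self, output_activations, y):
        # partial derivatives of the quadratic cost with respect to the output activations
        return (output_activations-y)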
        

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))


### Chapter 2

$w^l_{jk} \Rightarrow$ weight from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer

![](images/weight.png)

$b_j^l \Rightarrow$ bias of the $j^{th}$ neuron in the $l^{th}$ layer

![](images/bias.png)

With this notation, the activation $a^l_j$ of the $j^{th}$ neuron in the $l^{th}$
layer is related to the activations in the $(l-1)^{th}$ layer by the equation

$$\color{blue}{a_j^l = \sigma\left(\sum_k w_{jk}^l a_k^{l-1}+b_j^l\right)}$$

where the sum is over all neurons $k$ in the $(l-1)^{th}$ layer and $\sigma$ is the activation function (the sigmoid, in the code above).

To rewrite this expression in matrix form we define a weight matrix $w^l$ for
each layer $l$. The entries of $w^l$ are the weights
connecting to the $l^{th}$ layer of neurons, that is, the entry in the $j^{th}$ row
and $k^{th}$ column is $w_{jk}^l$. Similarly, for each layer $l$ we define a bias vector $b^l$ whose components are the values $b_j^l$, one for each neuron in the $l^{th}$ layer. Finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$. With $\sigma$ applied elementwise, the equation above becomes

$$\color{blue}{a^l = \sigma(w^la^{l-1}+b^l)}$$
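The component form of the activation equation can be spelled out with plain Python lists, so that the indices $j$ and $k$ stay visible. This is only a minimal sketch: the function name `layer_activation` and the numbers are made up for illustration, and the sigmoid is assumed as the activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_activation(w, b, a_prev):
    """Compute a^l from w^l (row j, column k), b^l and a^{l-1}."""
    return [
        # weighted input of neuron j: sum over k of w_jk * a_k, plus b_j
        sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(w_j, a_prev)) + b_j)
        for w_j, b_j in zip(w, b)
    ]

# hypothetical layer: 2 neurons in layer l-1, 3 neurons in layer l
w = [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]]
b = [0.0, -1.0, 0.5]
a_prev = [1.0, 1.0]

a = layer_activation(w, b, a_prev)
print(a)
```

Each row of `w` holds the weights into one neuron of layer $l$, which is why the result has one entry per row.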

The quadratic cost has the form

$$\color{blue}{C = \frac{1}{2n}\sum_x{||y(x)-a^L(x)||}^2}$$

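where $n$ is the total number of training examples, the sum runs over the individual training inputs $x$, $y(x)$ is the desired output for $x$, and $a^L(x)$ is the network's output for $x$ ($L$ denotes the number of layers).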
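As a small illustration of the quadratic cost, the sketch below evaluates it for two made-up training examples. The name `quadratic_cost` and the output and target vectors are assumptions chosen for the example, not part of the network code.

```python
def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over examples of ||y - a^L||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        # squared Euclidean distance between target and network output
        total += sum((y_j - a_j) ** 2 for a_j, y_j in zip(a, y))
    return total / (2 * n)

# hypothetical network outputs a^L(x) and desired outputs y(x)
outputs = [[1.0, 0.0], [0.5, 0.5]]
targets = [[1.0, 0.0], [0.0, 1.0]]

cost = quadratic_cost(outputs, targets)
print(cost)  # 0.125
```

The first example is matched exactly and contributes nothing; the second contributes $0.25 + 0.25 = 0.5$, which is then divided by $2n = 4$.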

##### 2 Assumptions about Cost and Backpropagation

The **first assumption** we need is that the cost function can be written as an average
$C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training
examples $x$.

We need this assumption because what
backpropagation actually lets us do is compute the partial
derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example. We
then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over training examples.
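The averaging step can be checked numerically on a toy problem. The sketch below assumes a hypothetical one-parameter model $a = wx$ with per-example cost $C_x = \frac{1}{2}(y - wx)^2$, which is not the network above; it only illustrates that the average of the per-example gradients equals the gradient of the averaged cost.

```python
def cost_x(w, x, y):
    """Per-example cost C_x for the toy model a = w * x."""
    return 0.5 * (y - w * x) ** 2

def grad_x(w, x, y):
    """Analytic derivative dC_x/dw for the toy model."""
    return (w * x - y) * x

def total_cost(w, data):
    """C = (1/n) * sum over x of C_x."""
    return sum(cost_x(w, x, y) for x, y in data) / len(data)

# made-up (input, target) pairs
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]
w = 0.5

# average of the per-example gradients
avg_grad = sum(grad_x(w, x, y) for x, y in data) / len(data)

# central finite difference of the averaged cost, for comparison
eps = 1e-6
numeric_grad = (total_cost(w + eps, data) - total_cost(w - eps, data)) / (2 * eps)

print(avg_grad, numeric_grad)
```

The two numbers agree up to floating-point error, which is what lets the mini-batch update sum per-example gradients and divide by the batch size.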

The **second assumption** we make about the cost is that it can be
written as a function of the outputs from the neural network, that is, $C = C(a^L)$.

Next, we define the error $\delta_j^l$ of the $j^{th}$ neuron in the $l^{th}$ layer as

$$\color{blue}{\delta_j^l = \frac{\partial{C}}{\partial{z_j^l}}}$$

where $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ is the weighted input to that neuron, so that $a^l_j = \sigma(z^l_j)$.

Following our usual convention, we use $\delta^l$ to denote the vector of
errors associated with layer $l$. Backpropagation gives us a way of
computing $\delta^l$ for every layer, and then relating those errors to the
quantities of real interest, $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.

 - **An equation for the error in the output layer, $\delta^L$:** The components of $\delta^L$ are given by
 
$$\color{blue}{\delta_j^L = \frac{\partial{C}}{\partial{a^L_j}}\sigma^{'}(z^L_j)}$$ 

Or, in matrix-based form,

$$\color{blue}{\delta^L = \nabla_aC \odot \sigma^{'}(z^L)}$$

In the above equation, $\odot$ denotes the **Hadamard product** (elementwise multiplication), and $\nabla_a C$ is the vector whose components are the partial derivatives $\partial C / \partial a^L_j$. For the quadratic cost of a single training example, $C = \frac{1}{2}\|y - a^L\|^2$, we have $\nabla_a C = (a^L-y)$, so

$$\color{blue}{\delta^L = (a^L-y) \odot \sigma^{'}(z^L)}$$
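The output-layer error for the quadratic cost can be sketched with plain lists as follows. The helper name `output_error` and the numbers are illustrative assumptions; the sigmoid is assumed as the activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(z_L, y):
    """delta^L = (a^L - y) ⊙ sigma'(z^L), computed entry by entry."""
    a_L = [sigmoid(z) for z in z_L]
    return [(a - t) * sigmoid_prime(z) for a, t, z in zip(a_L, y, z_L)]

# hypothetical weighted inputs of a two-neuron output layer and the target
z_L = [0.0, 2.0]
y = [1.0, 0.0]

delta_L = output_error(z_L, y)
print(delta_L)
```

For the first neuron, $a = 0.5$ and $\sigma'(0) = 0.25$, so its error is $(0.5 - 1) \cdot 0.25 = -0.125$. The Hadamard product is just this entry-by-entry multiplication.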

 - **An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$:** In particular

$$\color{blue}{\delta^l = ((w^{l+1})^T\delta^{l+1}) \odot \sigma^{'}(z^l)}$$
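One step of moving the error backward through a layer might look like the sketch below. The function name `backpropagate_error` and the values are hypothetical, and the sigmoid is again assumed as the activation function.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backpropagate_error(w_next, delta_next, z):
    """delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)."""
    # entry k of (w^{l+1})^T delta^{l+1} sums over the neurons j of layer l+1
    wt_delta = [
        sum(w_next[j][k] * delta_next[j] for j in range(len(delta_next)))
        for k in range(len(z))
    ]
    return [v * sigmoid_prime(z_k) for v, z_k in zip(wt_delta, z)]

# hypothetical sizes: 3 neurons in layer l, 2 neurons in layer l+1
w_next = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
delta_next = [0.1, -0.2]
z = [0.0, 0.0, 0.0]

delta = backpropagate_error(w_next, delta_next, z)
print(delta)
```

Applying the transposed weight matrix carries the error from the 2 neurons of layer $l+1$ back to the 3 neurons of layer $l$; multiplying by $\sigma'(z^l)$ then moves it through the activation function.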

 - **An equation for the rate of change of the cost with respect to any bias in the network:** In particular:
 
$$\color{blue}{\frac{\partial C}{\partial b^l_j} = \delta^l_j}$$ 

Dropping the indices, this can be written in shorthand as $\frac{\partial C}{\partial b} = \delta$, where $\delta$ is evaluated at the same neuron as the bias $b$.

 - **An equation for the rate of change of the cost with respect to any weight in the network:** In particular:

$$\color{blue}{\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta^l_j}$$

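The equation can be rewritten in a less index-heavy notation as

$$\frac{\partial C}{\partial w} = a_{in}\delta_{out}$$

where $a_{in}$ is the activation of the neuron feeding into the weight $w$ and $\delta_{out}$ is the error of the neuron that the weight feeds into.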
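Given the error of a layer and the activations of the previous layer, the bias and weight gradients follow directly from the last two equations. The sketch below uses made-up values and hypothetical helper names (`bias_gradient`, `weight_gradient`).

```python
def bias_gradient(delta):
    """dC/db^l_j = delta^l_j: the bias gradient is the error itself."""
    return list(delta)

def weight_gradient(delta, a_prev):
    """dC/dw^l_jk = a^{l-1}_k * delta^l_j: row j, column k."""
    return [[d_j * a_k for a_k in a_prev] for d_j in delta]

# hypothetical error of a 2-neuron layer and activations of the 3 neurons before it
delta = [0.5, -1.0]
a_prev = [1.0, 2.0, 3.0]

grad_b = bias_gradient(delta)
grad_w = weight_gradient(delta, a_prev)
print(grad_b)  # [0.5, -1.0]
print(grad_w)  # [[0.5, 1.0, 1.5], [-1.0, -2.0, -3.0]]
```

The weight gradient has the same shape as $w^l$ (one row per neuron in layer $l$, one column per neuron in layer $l-1$), which is the outer product that `np.dot(delta, activations[-l-1].transpose())` computes in the code above.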
 