In [1]:
import random  # needed by Network.SGD to shuffle the training data

import pandas as pd
import numpy as np

The centerpiece is a `Network` class, which we use to represent a neural network.

In [2]:
class Network(object):
    
    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y,1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list `sizes` contains the number of neurons in the respective layers. So, for example, if we want to create a `Network` object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with:

    net = Network([2,3,1])

The biases and weights in the `Network` object are all initialised randomly, drawn from a Gaussian distribution with mean 0 and standard deviation 1 (via `np.random.randn`). This gives stochastic gradient descent a place to start from; it is not the optimal initialisation, but it will do for now.

We assume that the biases and weights are stored as lists of numpy matrices, and that the first layer of neurons is the input layer (omitting any biases for those neurons).
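
As a quick check of that layout, the sketch below repeats the initialisation from `__init__` inline for the `[2, 3, 1]` example and prints the resulting shapes. The names `bias_shapes` and `weight_shapes` are only illustrative and are not part of the `Network` class.

```python
import numpy as np

# Same initialisation scheme as Network.__init__, written inline so the
# snippet runs on its own.
sizes = [2, 3, 1]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

# One bias column vector per non-input layer.
bias_shapes = [b.shape for b in biases]
# Each weight matrix has shape (neurons in next layer, neurons in previous layer).
weight_shapes = [w.shape for w in weights]

print(bias_shapes)    # [(3, 1), (1, 1)]
print(weight_shapes)  # [(3, 2), (1, 3)]
```

Storing each weight matrix as (next layer, previous layer) is what lets the later code write `np.dot(w, a)` without any transposes.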

In [3]:
def sigmoid(z):
    return 1.0/(1.0 +np.exp(-z))

Note that when the input `z` is a vector or matrix, numpy automatically applies the function `sigmoid` elementwise (in vectorised form).
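
A small sketch of that elementwise behaviour, using an arbitrary column vector chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([[-2.0], [0.0], [2.0]])  # a (3, 1) column vector
out = sigmoid(z)

print(out.shape)   # (3, 1): same shape as the input
print(out[1, 0])   # 0.5, since sigmoid(0) = 1 / (1 + 1)
```

The output always has the same shape as the input, which is why the same function can be used for every layer regardless of its width.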

We then add a `feedforward` method to the `Network` class, which, given an input `a` for the network, returns the corresponding output. The input `a` is assumed to be an (n, 1) numpy array, where n is the number of inputs to the network.

In [4]:
def feedforward(self, a):
    """Return the output of the network if a is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

Network.feedforward = feedforward

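Each pass through the loop applies one layer's update, where `W * a` stands for the matrix-vector product `np.dot(w, a)`:

    a' = sigmoid(W * a + b)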
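
To see how the activation changes shape from layer to layer, here is a minimal sketch that runs the same loop outside the class for a `[2, 3, 1]` network. The seeded generator and the `shapes` list are assumptions added for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
sizes = [2, 3, 1]
biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]
weights = [rng.standard_normal((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]

a = np.array([[0.5], [-1.0]])   # (2, 1) input column vector
shapes = [a.shape]
for b, w in zip(biases, weights):
    a = sigmoid(np.dot(w, a) + b)   # one layer: a' = sigmoid(W a + b)
    shapes.append(a.shape)

print(shapes)   # [(2, 1), (3, 1), (1, 1)]
```

Passing a flat array of shape `(2,)` instead of `(2, 1)` would broadcast against the `(3, 1)` bias and produce a `(3, 3)` result, so the column-vector convention matters.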

We are now in a position to add a method that implements stochastic gradient descent.

In [5]:
def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
    """
    Train the neural network using mini-batch stochastic gradient descent.
    
    The training_data is a list of tuples (x,y) representing the training inputs
    and the desired outputs.
    If test data is provided then the network will be evaluated against the test data
    after each epoch, and partial progress will be printed out.

    """
    if test_data:
        n_test = len(test_data)
    
    n = len(training_data)
    for j in range(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in range(0, n, mini_batch_size)
        ]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print("Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test))
        else:
            print("Epoch {0} complete".format(j))

Network.SGD = SGD
            

The code works as follows: in each epoch, it starts by randomly shuffling the training data and then partitions it into mini-batches of the appropriate size, which is an easy way to sample randomly from the training data. Then, for each `mini_batch`, we apply a single step of gradient descent.
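
The shuffle-and-partition step can be tried in isolation. In this sketch a list of ten integers stands in for ten `(x, y)` pairs; the stand-in data and the name `batch_sizes` are assumptions for illustration.

```python
import random

training_data = list(range(10))  # stand-in for ten (x, y) pairs
mini_batch_size = 3

random.shuffle(training_data)    # shuffles the list in place
mini_batches = [
    training_data[k:k + mini_batch_size]
    for k in range(0, len(training_data), mini_batch_size)
]

batch_sizes = [len(mb) for mb in mini_batches]
print(batch_sizes)   # [3, 3, 3, 1]
```

When the data size is not a multiple of `mini_batch_size`, the last mini-batch is smaller. Because `random.shuffle` works in place, `training_data` needs to be a mutable sequence such as a list.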

This is done by means of `self.update_mini_batch`, which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in `mini_batch`.
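
The arithmetic of a single update can be followed by hand on a toy case. The sketch below uses one 1x2 weight matrix and two made-up per-example gradients (hypothetical values, not produced by backpropagation), and applies the same rule: sum the gradients over the mini-batch, then step by `eta / len(mini_batch)` times that sum.

```python
import numpy as np

eta = 0.5
w = np.array([[1.0, 2.0]])

# Hypothetical per-example gradients for a mini-batch of two examples.
grads = [np.array([[1.0, 0.0]]), np.array([[3.0, 4.0]])]

nabla_w = np.zeros(w.shape)
for g in grads:
    nabla_w = nabla_w + g                   # accumulates to [[4., 4.]]

w_new = w - (eta / len(grads)) * nabla_w    # step = 0.25 * [[4., 4.]] = [[1., 1.]]
print(w_new)   # [[0. 1.]]
```

Dividing by the mini-batch length means the step uses the average gradient, so the effective step size does not grow with the mini-batch size.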

In [6]:
def update_mini_batch(self, mini_batch, eta):
    """
    Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini_batch.
    eta is the learning rate.
    """
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x,y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        
    self.weights = [w - (eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)]

Network.update_mini_batch = update_mini_batch        

The function `backprop` is doing pretty much all the heavy lifting here.

In [7]:
def backprop(self, x, y):
    """
    Return a tuple (nabla_b, nabla_w) representing the gradient for the cost function C_x.
    
    nabla_b and nabla_w are layer-by-layer lists of numpy arrays, similar to self.biases and self.weights.
    """
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    
    #Feedforward
    activation = x
    activations = [x] # activations will be added here layer by layer
    
    zs = []
    
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
        
    # Backward Pass
    delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
    
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    
    # Note that l counts layers from the end: l = 1 is the last layer, l = 2 is the second-to-last, and so on.
    for l in range(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        
    return (nabla_b, nabla_w)

Network.backprop = backprop
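
One way to gain confidence in a backpropagation routine is to compare one of its gradients with a finite-difference estimate. The sketch below is a function-style copy of the backward pass, not the `Network` method itself, and it assumes the per-example cost is the quadratic cost 0.5 * ||a - y||^2, which is consistent with a cost derivative of `output_activations - y`. The helper `cost` and the seeded generator are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def cost(weights, biases, x, y):
    """Quadratic cost for one example (assumed, see lead-in)."""
    a = x
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * float(np.sum((a - y) ** 2))

def backprop(weights, biases, x, y):
    """Backward pass in function form, returning only nabla_w."""
    activation, activations, zs = x, [x], []
    for b, w in zip(biases, weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_w = [None] * len(weights)
    nabla_w[-1] = np.dot(delta, activations[-2].T)
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
        nabla_w[-l] = np.dot(delta, activations[-l - 1].T)
    return nabla_w

rng = np.random.default_rng(0)
sizes = [2, 3, 1]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((2, 1))
y = np.array([[1.0]])

# Gradient of the cost with respect to one first-layer weight.
analytic = backprop(weights, biases, x, y)[0][1, 0]

# Central finite difference on the same weight.
eps = 1e-5
weights[0][1, 0] += eps
c_plus = cost(weights, biases, x, y)
weights[0][1, 0] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[0][1, 0] += eps  # restore the original value
numeric = (c_plus - c_minus) / (2 * eps)

print(abs(analytic - numeric))  # should be very close to zero
```

A large gap between the two numbers usually points to an indexing or transpose mistake in the backward pass.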

The method makes use of a few extra functions to help in computing the gradient, namely `sigmoid_prime`, which computes the derivative of the sigmoid function, and `cost_derivative`.

In [8]:
def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

# sigmoid_prime stays a module-level function: backprop calls it directly,
# and binding it to Network would pass self in place of z.

def cost_derivative(self, output_activations, y):
    r"""Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""
    return output_activations - y

Network.cost_derivative = cost_derivative
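
As a sanity check on the closed form used in `sigmoid_prime`, the sketch below compares it with a central finite difference at a few arbitrary points. The test points and the name `max_err` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z = np.array([-1.0, 0.0, 2.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_err = float(np.max(np.abs(numeric - sigmoid_prime(z))))

print(max_err)             # tiny, limited by floating-point rounding
print(sigmoid_prime(0.0))  # 0.25, the largest value the derivative takes
```

Since the derivative never exceeds 0.25 and shrinks towards zero for large |z|, each `delta` in the backward pass is scaled down by this factor at every layer.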

In [9]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


In [24]:
training_data = mnist.train
mnist.train.next_batch(100)[0].shape

(100, 784)
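
`Network.SGD` calls `len(training_data)` and `random.shuffle(training_data)` and unpacks each element as `(x, y)`, so it expects a list of tuples of column vectors, whereas the batch above is a single (100, 784) array. A minimal sketch of the conversion follows. It uses synthetic arrays as stand-ins for one batch, and assumes the one-hot labels have shape (100, 10); `images` and `labels` are hypothetical names.

```python
import numpy as np

# Synthetic stand-ins with the shapes of one batch of 100 examples.
images = np.random.rand(100, 784)
labels = np.eye(10)[np.random.randint(0, 10, size=100)]  # one-hot rows, shape (100, 10)

# One (x, y) tuple per example, each reshaped to a column vector.
training_data = [
    (img.reshape(784, 1), lab.reshape(10, 1))
    for img, lab in zip(images, labels)
]

print(len(training_data))          # 100
print(training_data[0][0].shape)   # (784, 1)
print(training_data[0][1].shape)   # (10, 1)
```

With data in this form, a network whose first and last layer sizes are 784 and 10 would match the input and target shapes.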

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.