# MNIST Example
http://neuralnetworksanddeeplearning.com/chap1.html

## Simple Neural Network Implementation
Start by loading in the MNIST data

In [1]:
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

load_data_wrapper() returns a tuple (training_data, validation_data, test_data) in a format that is convenient for use in our implementation of neural networks.

In particular, training_data is a list containing 50,000 2-tuples (x, y).  x is a 784-dimensional numpy.ndarray containing the input image.  y is a 10-dimensional numpy.ndarray representing the unit vector corresponding to the correct digit for x.

validation_data and test_data are lists, each containing 10,000 2-tuples (x, y).  In each case, x is a 784-dimensional numpy.ndarray containing the input image, and y is the corresponding classification, i.e., the digit value (an integer) corresponding to x.

This means we're using slightly different formats for the training data and the validation / test data.  These formats turn out to be the most convenient for use in our neural network code.
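The relationship between the two label formats can be sketched with a pair of small helpers. This is an illustration only: `one_hot` and `to_digit` are hypothetical names, and plain Python lists stand in for the numpy.ndarray objects that mnist_loader actually returns.

```python
def one_hot(digit, num_classes=10):
    """Return a unit vector (as a list) with 1.0 at index `digit`."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit out of range")
    vec = [0.0] * num_classes
    vec[digit] = 1.0
    return vec


def to_digit(vec):
    """Recover the integer label: the index of the largest entry."""
    return max(range(len(vec)), key=lambda i: vec[i])


# Training-style label for the digit 3, then back to the test-style label.
label = one_hot(3)
print(label)
print(to_digit(label))
```

The unit-vector form matches the 10 output neurons of the network, while the integer form is easier to compare against the network's most active output when counting correct classifications.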

In [2]:
print "Number of training data tuples: ", len(training_data)
print "Data points per input value: ", len(training_data[0][0])
print "Output array size: ", len(training_data[0][1])
print "Number of validation data tuples: ",len(validation_data)
print "Number of test data tuples: ",len(test_data)

Number of training data tuples:  50000
Data points per input value:  784
Output array size:  10
Number of validation data tuples:  10000
Number of test data tuples:  10000


Set up a Network with 30 hidden neurons

In [32]:
import network
net = network.Network([784, 30, 10])

In [54]:
print "Number of neurons in the respective layers: ", net.sizes, "\n"
print "Number of biases between 1st and 2nd layers: ", len(net.biases[0])
print "Some example biases between 1st and 2nd layers: \n", net.biases[0][:3]
print "Number of biases between 2nd and 3rd layers: ", len(net.biases[1])
print "Some example biases between 2nd and 3rd layers: \n", net.biases[1][:3], "\n"
print "Number of weights between 1st and 2nd layers: ", len(net.weights[0])
print "Some example weights between 1st and 2nd layers: \n", net.weights[0][:3,:3]
print "Number of weights between 2nd and 3rd layers: ", len(net.weights[1])
print "Some example weights between 2nd and 3rd layers: \n", net.weights[1][:3,:3]

Number of neurons in the respective layers:  [784, 30, 10] 

Number of biases between 1st and 2nd layers:  30
Some example biases between 1st and 2nd layers: 
[[-0.65616326]
 [-0.66219892]
 [ 0.52779785]]
Number of biases between 2nd and 3rd layers:  10
Some example biases between 2nd and 3rd layers: 
[[ 0.11582079]
 [ 0.17365875]
 [-0.02760088]] 

Number of weights between 1st and 2nd layers:  30
Some example weights between 1st and 2nd layers: 
[[ 0.0431288  -1.39348301 -0.38850388]
 [-0.98134354 -0.57518265 -0.21679014]
 [ 0.26499918  0.8704585  -1.14835446]]
Number of weights between 2nd and 3rd layers:  10
Some example weights between 2nd and 3rd layers: 
[[-1.01367654  0.18505978 -0.60359269]
 [-1.02634188  0.63877486 -0.90054889]
 [ 1.16044398  0.22333662 -0.95617607]]
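
Note that len() applied to a weight matrix reports its number of rows (one row per neuron in the receiving layer), not the total number of weights. The sketch below works out the shapes implied by a list of layer sizes. It assumes, consistently with the output above, one column vector of biases per non-input layer and one (n_out, n_in) weight matrix per pair of adjacent layers; `layer_shapes` is a hypothetical helper, not part of network.py.

```python
def layer_shapes(sizes):
    """Return (bias_shapes, weight_shapes) for the given layer sizes.

    Assumes one (n, 1) bias column per non-input layer and one
    (n_out, n_in) weight matrix per pair of adjacent layers.
    """
    bias_shapes = [(n, 1) for n in sizes[1:]]
    weight_shapes = [(n_out, n_in)
                     for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    return bias_shapes, weight_shapes


bias_shapes, weight_shapes = layer_shapes([784, 30, 10])
print(bias_shapes)    # [(30, 1), (10, 1)]
print(weight_shapes)  # [(30, 784), (10, 30)]

# Total number of trainable parameters in the [784, 30, 10] network.
total = sum(rows * cols for rows, cols in bias_shapes + weight_shapes)
print(total)          # 23860
```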


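Use stochastic gradient descent to learn from the MNIST training_data over 10 epochs, with a mini-batch size of 10 and a learning rate of η = 3.0. Because test_data is supplied, the number of correctly classified test images is printed after each epoch.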

In [56]:
net.SGD(training_data, 10, 10, 3.0, test_data=test_data)

Epoch 0: 9343 / 10000
Epoch 1: 9383 / 10000
Epoch 2: 9393 / 10000
Epoch 3: 9404 / 10000
Epoch 4: 9419 / 10000
Epoch 5: 9435 / 10000
Epoch 6: 9425 / 10000
Epoch 7: 9458 / 10000
Epoch 8: 9432 / 10000
Epoch 9: 9433 / 10000
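
In the call above, the positional arguments after training_data are the number of epochs (10), the mini-batch size (10), and the learning rate (3.0). The following is a minimal sketch of how one epoch might be split into mini-batches, assuming the data is reshuffled before being cut into consecutive slices. `make_mini_batches` is a hypothetical helper, and integers stand in for the real (x, y) tuples.

```python
import random


def make_mini_batches(data, mini_batch_size, rng=random):
    """Shuffle a copy of `data` and cut it into consecutive mini-batches."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[k:k + mini_batch_size]
            for k in range(0, len(shuffled), mini_batch_size)]


# Stand-in for the 50,000 training tuples: integers instead of (x, y) pairs.
fake_training_data = range(50000)
batches = make_mini_batches(fake_training_data, 10)
print(len(batches))     # 5000 mini-batches, i.e. 5000 updates per epoch
print(len(batches[0]))  # 10
```

With these settings each epoch performs 5,000 parameter updates, each estimated from only 10 training images.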


Once a network has been trained, it can be run very quickly, on almost any computing platform.

Rerun the above experiment, changing the number of hidden neurons to 100.

In [59]:
net = network.Network([784, 100, 10])
net.SGD(training_data, 10, 10, 3.0, test_data=test_data)

Epoch 0: 6178 / 10000
Epoch 1: 7596 / 10000
Epoch 2: 7687 / 10000
Epoch 3: 7696 / 10000
Epoch 4: 7709 / 10000
Epoch 5: 7745 / 10000
Epoch 6: 7722 / 10000
Epoch 7: 7791 / 10000
Epoch 8: 7825 / 10000
Epoch 9: 8600 / 10000


Using more hidden neurons should improve results, but there is considerable variation between training runs of this experiment, and some runs give much worse results. The run above is one of them: it finishes at 8,600 / 10,000, well below the 9,433 / 10,000 reached with 30 hidden neurons. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

The number of epochs of training, the mini-batch size, and the learning rate η are known as hyper-parameters for our neural network, to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results.

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. We might worry not only about the learning rate, but about every other aspect of our neural network. Have we initialized the weights and biases in a way that makes it hard for the network to learn? Do we lack enough training data to get meaningful learning? Have we run for too few epochs? Is it impossible for a neural network with this architecture to learn to recognize handwritten digits? Is the learning rate too low, or too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. 
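
One simple heuristic is to try a small grid of hyper-parameter values and keep the combination that scores best. The sketch below is only an outline of that idea: `grid_search` is a hypothetical helper, and `toy_score` is a made-up stand-in for "train a network and measure its accuracy", not a real training run. In practice the score would more sensibly come from validation_data than from test_data, so that the test set stays untouched for the final evaluation.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return (best_score, best_params).

    `evaluate` takes the hyper-parameters as keyword arguments and
    returns a score where higher is better (e.g. validation accuracy).
    """
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Toy stand-in for a training run: NOT real results. It simply peaks
# at eta=0.5 with a mini-batch size of 10.
def toy_score(eta, mini_batch_size):
    return -abs(eta - 0.5) - 0.01 * abs(mini_batch_size - 10)


grid = {"eta": [0.01, 0.5, 3.0, 100.0], "mini_batch_size": [10, 20]}
best_score, best_params = grid_search(toy_score, grid)
print(best_params)
```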

## Naive implementation based on average image darkness
A naive classifier for recognizing handwritten digits from the MNIST data set.  The program classifies digits based on how dark they are --- the idea is that digits like "1" tend to be less dark than digits like "8", simply because the latter has a more complex shape.  When shown an image the classifier returns whichever digit in the training data had the closest average darkness.

In [3]:
import mnist_average_darkness
mnist_average_darkness.main()

Baseline classifier using average darkness of image.
2225 of 10000 values correct.
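
The result, 22.25 percent, is better than the 10 percent expected from random guessing, but far from the neural network's accuracy. The idea can be sketched in a few lines. This is not the code of mnist_average_darkness itself: the function names are hypothetical, the labels are assumed to be plain integers, and the four-pixel "images" are made-up values rather than MNIST data.

```python
from collections import defaultdict


def average_darknesses(training_pairs):
    """Map each digit to the mean total darkness of its training images."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pixels, digit in training_pairs:
        totals[digit] += sum(pixels)
        counts[digit] += 1
    return {digit: totals[digit] / counts[digit] for digit in totals}


def guess_digit(pixels, avgs):
    """Return the digit whose average darkness is closest to this image's."""
    darkness = sum(pixels)
    return min(avgs, key=lambda digit: abs(avgs[digit] - darkness))


# Toy 4-pixel "images" (made-up values, not MNIST data).
toy_training = [
    ([0.0, 0.9, 0.0, 0.1], 1),
    ([0.0, 1.0, 0.0, 0.2], 1),
    ([0.9, 0.8, 0.9, 0.8], 8),
    ([1.0, 0.9, 0.7, 1.0], 8),
]
avgs = average_darknesses(toy_training)
print(guess_digit([0.1, 0.8, 0.0, 0.0], avgs))  # 1
print(guess_digit([0.8, 0.9, 0.9, 0.9], avgs))  # 8
```

Because the classifier reduces each image to a single number, digits with similar amounts of ink are indistinguishable to it, which is why its accuracy stays low.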


## SVM-based Classifier
A classifier for recognizing handwritten digits from the MNIST data set, using a support vector machine (SVM).

In [1]:
import mnist_svm
mnist_svm.svm_baseline()

Baseline classifier using an SVM.
9435 of 10000 values correct.


If we run scikit-learn's SVM classifier using the default settings, it gets 9,435 of 10,000 test images correct. The SVM therefore performs roughly as well as our 30-neuron network, which peaked at 9,458 and finished at 9,433 in the run above. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.

SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. With some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy.

Well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly.

The moral, both of our results and of those in more sophisticated papers, is that for some problems:

__sophisticated algorithm ≤ simple learning algorithm + good training data.__

## Using the cross-entropy to classify MNIST digits
With the quadratic cost, learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output. With the cross-entropy cost, learning is faster when the neuron is unambiguously wrong.
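
The difference can be seen numerically for a single sigmoid neuron with one input. The sketch below assumes the usual definitions, C = (y - a)^2 / 2 for the quadratic cost and C = -[y ln a + (1 - y) ln(1 - a)] for the cross-entropy, which give weight gradients of (a - y) σ'(z) x and (a - y) x respectively. The function names and the example weights are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weight_gradients(w, b, x, y):
    """Return (quadratic, cross_entropy) values of dC/dw for one neuron."""
    a = sigmoid(w * x + b)
    sigmoid_prime = a * (1.0 - a)
    quadratic = (a - y) * sigmoid_prime * x  # C = (y - a)^2 / 2
    cross_entropy = (a - y) * x              # C = -[y ln a + (1-y) ln(1-a)]
    return quadratic, cross_entropy


# Badly wrong neuron: the target is 0 but the output is close to 1.
q_bad, ce_bad = weight_gradients(w=2.0, b=2.0, x=1.0, y=0.0)
# Partly corrected neuron: the target is 0 and the output is 0.5.
q_mid, ce_mid = weight_gradients(w=0.0, b=0.0, x=1.0, y=0.0)

print("badly wrong:  quadratic %.4f, cross-entropy %.4f" % (q_bad, ce_bad))
print("half-way:     quadratic %.4f, cross-entropy %.4f" % (q_mid, ce_mid))
```

For the badly wrong neuron the quadratic gradient is small because σ'(z) is nearly zero when the neuron saturates, while the cross-entropy gradient is proportional to the error alone.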

In [2]:
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

network2.py is an improved version of network.py, implementing the stochastic gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function, regularization, and better initialization of network weights.

For this experiment, net.large_weight_initializer() is used to initialize the weights and biases in the same way as described in Chapter 1.
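
As a rough illustration of what "large" means here, the sketch below compares two Gaussian initializations. It assumes that the Chapter 1 scheme draws weights with mean 0 and standard deviation 1, and that the improved scheme scales the standard deviation by 1/sqrt(n_in), where n_in is the number of inputs to the neuron. Both are assumptions about network2.py that this notebook does not spell out, and the function names are hypothetical.

```python
import math
import random


def large_weights(n_in, n_out, rng):
    """Gaussian weights with mean 0 and standard deviation 1."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)]
            for _ in range(n_out)]


def scaled_weights(n_in, n_out, rng):
    """Gaussian weights with mean 0 and standard deviation 1/sqrt(n_in)."""
    sd = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, sd) for _ in range(n_in)]
            for _ in range(n_out)]


def sample_std(rows):
    """Population standard deviation of all entries in a list of rows."""
    flat = [w for row in rows for w in row]
    mean = sum(flat) / len(flat)
    return math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
big = large_weights(784, 30, rng)
small = scaled_weights(784, 30, rng)
print(sample_std(big))    # close to 1
print(sample_std(small))  # close to 1/28
```

With 784 inputs, unscaled weights make the weighted sum into a hidden neuron very large in magnitude, so the neuron tends to start out saturated.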

In [4]:
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
net.SGD(training_data, 10, 10, 0.5, evaluation_data=test_data, monitor_evaluation_accuracy=True)

Epoch 0 training complete
Accuracy on evaluation data: 9113 / 10000

Epoch 1 training complete
Accuracy on evaluation data: 9243 / 10000

Epoch 2 training complete
Accuracy on evaluation data: 9227 / 10000

Epoch 3 training complete
Accuracy on evaluation data: 9316 / 10000

Epoch 4 training complete
Accuracy on evaluation data: 9394 / 10000

Epoch 5 training complete
Accuracy on evaluation data: 9409 / 10000

Epoch 6 training complete
Accuracy on evaluation data: 9388 / 10000

Epoch 7 training complete
Accuracy on evaluation data: 9419 / 10000

Epoch 8 training complete
Accuracy on evaluation data: 9451 / 10000

Epoch 9 training complete
Accuracy on evaluation data: 9473 / 10000



([], [9113, 9243, 9227, 9316, 9394, 9409, 9388, 9419, 9451, 9473], [], [])

Rerun the above experiment, changing the number of hidden neurons to 100

In [5]:
net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
net.SGD(training_data, 10, 10, 0.5, evaluation_data=test_data, monitor_evaluation_accuracy=True)

Epoch 0 training complete
Accuracy on evaluation data: 9315 / 10000

Epoch 1 training complete
Accuracy on evaluation data: 9465 / 10000

Epoch 2 training complete
Accuracy on evaluation data: 9533 / 10000

Epoch 3 training complete
Accuracy on evaluation data: 9583 / 10000

Epoch 4 training complete
Accuracy on evaluation data: 9567 / 10000

Epoch 5 training complete
Accuracy on evaluation data: 9570 / 10000

Epoch 6 training complete
Accuracy on evaluation data: 9605 / 10000

Epoch 7 training complete
Accuracy on evaluation data: 9623 / 10000

Epoch 8 training complete
Accuracy on evaluation data: 9649 / 10000

Epoch 9 training complete
Accuracy on evaluation data: 9645 / 10000



([], [9315, 9465, 9533, 9583, 9567, 9570, 9605, 9623, 9649, 9645], [], [])

With 30 hidden neurons the result (9,473 / 10,000) is pretty close to the result with the quadratic cost (9,433 / 10,000), but with 100 hidden neurons the cross-entropy cost function gives a substantial improvement (9,645 versus 8,600 / 10,000).

But for the improvement to be really convincing we'd need to do a thorough job optimizing hyper-parameters such as learning rate, mini-batch size, and so on. 
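
For reference, the final-epoch figures from the four runs above can be put side by side with a little arithmetic. The numbers are copied from the outputs in this notebook; note that the quadratic-cost runs used η = 3.0 and the cross-entropy runs used η = 0.5, so the comparison is not like for like.

```python
# Final-epoch test results copied from the runs above (out of 10,000).
final_correct = {
    ("quadratic", 30): 9433,
    ("quadratic", 100): 8600,
    ("cross-entropy", 30): 9473,
    ("cross-entropy", 100): 9645,
}


def accuracy_percent(correct, total=10000):
    return 100.0 * correct / total


gaps = {}
for hidden in (30, 100):
    quad = accuracy_percent(final_correct[("quadratic", hidden)])
    xent = accuracy_percent(final_correct[("cross-entropy", hidden)])
    gaps[hidden] = xent - quad
    print("%d hidden neurons: quadratic %.2f%%, cross-entropy %.2f%%, "
          "gap %.2f points" % (hidden, quad, xent, gaps[hidden]))
```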