## MNIST classification

In this notebook we tackle what is perhaps the best-known problem in all of machine learning: classifying hand-written digits.

The particular dataset we will use is MNIST (Modified National Institute of Standards and Technology).
The digits are 28x28 pixel images that look somewhat like this:

![](https://user-images.githubusercontent.com/2202312/32365318-b0ccc44a-c079-11e7-8fb1-6b1566c0bdc4.png)

Each digit has been labeled by hand, e.g. the digits above are labeled 9-7-0-9-0-...

Our task is to teach a machine to perform this classification, i.e. we want to find a function $\mathcal{T}_\theta$ such that

| | |
|-|-|
|$\mathcal{T}_\theta$(|<img align="center" src="https://user-images.githubusercontent.com/2202312/33177374-b134e572-d062-11e7-87c7-0574c6f5bee9.png" width="28"/>|) = 4|

# Import dependencies

This should run without errors if all dependencies are installed properly.

In [1]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

In [2]:
# Start a tensorflow session
session = tf.InteractiveSession()

# Set the random seed to enable reproducible code
np.random.seed(0)

# Get data and utilities

We now need to get the data we will use, which in this case is the famous [MNIST](http://yann.lecun.com/exdb/mnist/) dataset, a set of 70000 hand-written digits, of which 60000 are used for training and 10000 for testing.

In addition to this, we create a utility `evaluate(...)` that we will use to evaluate how good the classification is.

In [3]:
# Get MNIST data
mnist = input_data.read_data_sets('MNIST_data')

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [4]:
# Read the 10000 mnist test points
batch = mnist.test.next_batch(10000)
test_images = batch[0].reshape([-1, 28, 28, 1])
test_labels = batch[1]

def evaluate(result_tensor, data_placeholder):
    """Evaluate a classification method on the MNIST test set.

    Parameters
    ----------
    result_tensor : `tf.Tensor`, shape (None,)
        The tensorflow tensor containing the result of the classification.
    data_placeholder : `tf.Tensor`, shape (None, 28, 28, 1)
        The tensorflow tensor containing the input to the classification operator.

    Returns
    -------
    accuracy : float
        Fraction of the test images that are classified correctly.
    """
    result = result_tensor.eval(
        feed_dict={data_placeholder: test_images})

    return np.mean(result == test_labels)
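
To make the return value of `evaluate` concrete, here is a minimal NumPy sketch of the same comparison on made-up data. The arrays `predicted` and `labels` are hypothetical and are not taken from MNIST.

```python
import numpy as np

# Hypothetical predictions and true labels for six digits.
predicted = np.array([9, 7, 0, 9, 0, 4])
labels = np.array([9, 7, 0, 4, 0, 4])

# The elementwise comparison gives booleans; their mean is the
# fraction of correctly classified digits.
accuracy = np.mean(predicted == labels)
print(accuracy)
```

Five of the six hypothetical predictions match, so the value is 5/6. The numbers printed as `Average correct` during training are this same quantity computed over the 10000 test images.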

In [5]:
# Create placeholders. Placeholders are needed since tensorflow evaluates lazily:
# we first define the computational graph with placeholders as input, and later we evaluate it.
with tf.name_scope('placeholders'):
    images = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
    true_labels = tf.placeholder(tf.int32, shape=[None])

# Logistic regression

We start with [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), perhaps the best-known and most widely applied classification method.

The first problem we need to solve is that the values we try to regress against are discrete (the digits 0, 1, 2, ..., 9), which does not work very well with continuous optimization. To solve this we convert the values to a one-hot encoding, embedding them into $\mathbb{R}^{10}$:

```
>>> one_hot([0, 1, 2], depth=3)
[[ 1.,  0.,  0.],
 [ 0.,  1.,  0.],
 [ 0.,  0.,  1.]]
```

This can also be seen as a probabilistic encoding, i.e. we can estimate that a digit is a 1 with probability 10% and a 2 with probability 90%. For our training data, we have 100% certainty for each digit. We use the cross entropy to measure the discrepancy between two such probability distributions.
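
As a rough illustration of the encoding and the loss, the sketch below implements a one-hot encoding and the cross entropy in NumPy. The helpers `one_hot` and `cross_entropy` and the example distributions are written for this illustration only; they are not the TensorFlow functions used later in the notebook.

```python
import numpy as np

def one_hot(labels, depth):
    """Return an array of shape (len(labels), depth) with a single 1 per row."""
    return np.eye(depth)[np.asarray(labels)]

def cross_entropy(p_true, p_est):
    """Cross entropy -sum_i p_true[i] * log(p_est[i]), computed per row."""
    eps = 1e-12  # small offset that guards against log(0)
    return -np.sum(p_true * np.log(p_est + eps), axis=-1)

# The true digit is a 2, encoded with 100% certainty.
target = one_hot([2], depth=3)

# Hypothetical estimates: one fairly confident, one uninformative.
confident = np.array([[0.0, 0.1, 0.9]])   # 10% "1", 90% "2"
unsure = np.array([[1 / 3, 1 / 3, 1 / 3]])

loss_confident = cross_entropy(target, confident)[0]
loss_unsure = cross_entropy(target, unsure)[0]
```

The confident and correct estimate gives the smaller loss, which is the behaviour we want from a training objective.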

The estimator used for logistic regression is the softmax

$$
p_i = \frac{\exp(\langle w_i, x \rangle + b_i)}{\sum_{j=0}^9 \exp(\langle w_j, x \rangle + b_j)}
$$

where $p_i$ is the probability of a digit belonging to category $i$, $w_i \in \mathbb{R}^{28 \times 28}$ and $b_i \in \mathbb{R}$.
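
The TensorFlow code below turns the scores $\langle w_i, x \rangle + b_i$ into probabilities with a softmax (inside `softmax_cross_entropy_with_logits`). A minimal NumPy sketch of that step is given here, assuming a flattened image and randomly drawn parameters. The function `softmax_probabilities` and all values are illustrative and not part of the notebook.

```python
import numpy as np

def softmax_probabilities(x, W, b):
    """Class probabilities for one flattened image.

    x : shape (784,), W : shape (10, 784), b : shape (10,)
    """
    scores = W @ x + b                  # <w_i, x> + b_i for each class i
    scores = scores - np.max(scores)    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

rng = np.random.default_rng(0)
x = rng.random(28 * 28)                          # random stand-in for an image
W = 0.01 * rng.standard_normal((10, 28 * 28))    # untrained, random weights
b = np.zeros(10)

p = softmax_probabilities(x, W, b)
predicted_digit = int(np.argmax(p))
```

Since the exponential is increasing, the most probable class is also the one with the largest score, which is why the notebook can take `tf.argmax` of the logits directly.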

In [6]:
with tf.name_scope('logistic_regression'):
    x = tf.contrib.layers.flatten(images)
    logits = tf.contrib.layers.fully_connected(x, 10,
                                               activation_fn=None)
    pred = tf.argmax(logits, axis=1)
    
with tf.name_scope('optimizer'):
    one_hot_labels = tf.one_hot(true_labels, depth=10)
    
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot_labels,
                                                   logits=logits)
    optimizer = tf.train.AdamOptimizer().minimize(loss)

# Initialize all TF variables
session.run(tf.global_variables_initializer())

for i in range(10000):
    batch = mnist.train.next_batch(128)
    train_images = batch[0].reshape([-1, 28, 28, 1])
    train_labels = batch[1]

    session.run(optimizer, feed_dict={images: train_images, 
                                      true_labels: train_labels})

    if i % 100 == 0:
        print('{} Average correct: {}'.format(
                i, evaluate(pred, images)))

0 Average correct: 0.0971
100 Average correct: 0.8402
200 Average correct: 0.8738
300 Average correct: 0.8887
400 Average correct: 0.8983
500 Average correct: 0.9012
600 Average correct: 0.9061
700 Average correct: 0.9086
800 Average correct: 0.911
900 Average correct: 0.913
1000 Average correct: 0.9123
1100 Average correct: 0.9135
1200 Average correct: 0.9145
1300 Average correct: 0.9161
1400 Average correct: 0.9166
1500 Average correct: 0.9177
1600 Average correct: 0.9185
1700 Average correct: 0.9212
1800 Average correct: 0.9198
1900 Average correct: 0.9212
2000 Average correct: 0.9217
2100 Average correct: 0.921
2200 Average correct: 0.9211
2300 Average correct: 0.9221
2400 Average correct: 0.921
2500 Average correct: 0.9227
2600 Average correct: 0.9228
2700 Average correct: 0.9221
2800 Average correct: 0.9229
2900 Average correct: 0.9228
3000 Average correct: 0.9235
3100 Average correct: 0.9239
3200 Average correct: 0.9232
3300 Average correct: 0.9247
3400 Average correct: 0.9238

# Multilayer Perceptron

The first "deep" neural networks were [multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron). In these we have a function of the following form

$$
\rho(W_3\rho(W_2\rho(W_1 x + b_1) + b_2) + b_3)
$$

where $W_i$ are matrices, $b_i$ are vectors and $\rho$ is a nonlinear activation function applied elementwise. Note that the logistic regression can be cast into this form (how?).
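
The nested expression can be evaluated with a simple loop over the layers. The sketch below does this in NumPy, assuming ReLU as the activation and random, untrained parameters. The names `relu` and `mlp_forward` are made up for this illustration. Note that the TensorFlow cell below omits the activation on the last layer and works with the raw logits instead.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, rho=relu):
    """Evaluate rho(W_3 rho(W_2 rho(W_1 x + b_1) + b_2) + b_3)."""
    for W, b in zip(weights, biases):
        x = rho(W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [784, 128, 32, 10]   # layer widths matching the cell below

# Random stand-ins for the matrices W_i and zero vectors for the b_i.
weights = [0.01 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

out = mlp_forward(rng.random(784), weights, biases)
```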

In [7]:
with tf.name_scope('multilayer_perceptron'):
    x = tf.contrib.layers.flatten(images)
    x = tf.contrib.layers.fully_connected(x, 128)  # the default activation function is ReLU
    x = tf.contrib.layers.fully_connected(x, 32)
    logits = tf.contrib.layers.fully_connected(x, 10,
                                               activation_fn=None)
    pred = tf.argmax(logits, axis=1)
    
with tf.name_scope('optimizer'):
    one_hot_labels = tf.one_hot(true_labels, depth=10)
    
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot_labels,
                                                   logits=logits)
    optimizer = tf.train.AdamOptimizer().minimize(loss)

# Initialize all TF variables
session.run(tf.global_variables_initializer())

for i in range(10000):
    batch = mnist.train.next_batch(128)
    train_images = batch[0].reshape([-1, 28, 28, 1])
    train_labels = batch[1]

    session.run(optimizer, feed_dict={images: train_images, 
                                      true_labels: train_labels})

    if i % 100 == 0:
        print('{} Average correct: {}'.format(
                i, evaluate(pred, images)))

0 Average correct: 0.1431
100 Average correct: 0.9093
200 Average correct: 0.9247
300 Average correct: 0.9429
400 Average correct: 0.9459
500 Average correct: 0.9513
600 Average correct: 0.9529
700 Average correct: 0.9576
800 Average correct: 0.9587
900 Average correct: 0.9626
1000 Average correct: 0.9629
1100 Average correct: 0.9595
1200 Average correct: 0.9656
1300 Average correct: 0.9645
1400 Average correct: 0.9669
1500 Average correct: 0.9682
1600 Average correct: 0.9697
1700 Average correct: 0.9695
1800 Average correct: 0.9689
1900 Average correct: 0.9738
2000 Average correct: 0.9713
2100 Average correct: 0.9719
2200 Average correct: 0.9728
2300 Average correct: 0.9736
2400 Average correct: 0.9745
2500 Average correct: 0.9751
2600 Average correct: 0.9745
2700 Average correct: 0.9737
2800 Average correct: 0.9739
2900 Average correct: 0.9708
3000 Average correct: 0.9761
3100 Average correct: 0.9762
3200 Average correct: 0.975
3300 Average correct: 0.9766
3400 Average correct: 0.977

# Convolutional network

Convolutional neural networks are a cornerstone of the deep learning revolution. Instead of traditional fully-connected layers, which connect each point with all other points, we use spatial convolutions. By doing this, we get a translation invariant operator that acts locally. In order to get non-local behaviour we stack several of these on top of each other.
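
Both properties can be checked numerically. The sketch below is a simplified NumPy filter with periodic boundary conditions and stride 1. These are assumptions chosen so that the shift property holds exactly, and they differ from the padded, strided layers in the TensorFlow cell below. The helper `conv2d_periodic` is hypothetical.

```python
import numpy as np

def conv2d_periodic(image, kernel):
    """Apply a small spatial filter (cross-correlation) with periodic boundary."""
    result = np.zeros_like(image)
    kh, kw = kernel.shape
    for di in range(kh):
        for dj in range(kw):
            # Each output pixel only depends on a (kh x kw) neighbourhood.
            shifted = np.roll(image, (kh // 2 - di, kw // 2 - dj), axis=(0, 1))
            result += kernel[di, dj] * shifted
    return result

rng = np.random.default_rng(0)
image = rng.random((28, 28))            # random stand-in for a digit
kernel = rng.standard_normal((3, 3))    # random 3x3 filter
shift = (3, 5)

# Translation invariance: shifting then filtering equals filtering then shifting.
shifted_then_filtered = conv2d_periodic(np.roll(image, shift, axis=(0, 1)), kernel)
filtered_then_shifted = np.roll(conv2d_periodic(image, kernel), shift, axis=(0, 1))
commutes = np.allclose(shifted_then_filtered, filtered_then_shifted)

# Locality: a single bright pixel only influences its 3x3 neighbourhood.
impulse = np.zeros((28, 28))
impulse[10, 10] = 1.0
support = np.count_nonzero(conv2d_periodic(impulse, kernel))
```

Here `commutes` is true and `support` is at most 9, one entry per kernel element. Applying a second 3x3 filter would widen that neighbourhood to 5x5, which is how stacking layers produces non-local behaviour.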

The following code is a very simplified convolutional neural network for digit classification:

In [8]:
with tf.name_scope('convolutional_network'):
    x = tf.contrib.layers.conv2d(images, num_outputs=32, kernel_size=3, stride=2)
    x = tf.contrib.layers.conv2d(x, num_outputs=32, kernel_size=3, stride=2)
    x = tf.contrib.layers.flatten(x)
    
    x = tf.contrib.layers.fully_connected(x, 128)
    logits = tf.contrib.layers.fully_connected(x, 10,
                                               activation_fn=None)
    pred = tf.argmax(logits, axis=1)
    
with tf.name_scope('optimizer'):
    one_hot_labels = tf.one_hot(true_labels, depth=10)
    
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot_labels,
                                                   logits=logits)
    optimizer = tf.train.AdamOptimizer().minimize(loss)

# Initialize all TF variables
session.run(tf.global_variables_initializer())

for i in range(10000):
    batch = mnist.train.next_batch(128)
    train_images = batch[0].reshape([-1, 28, 28, 1])
    train_labels = batch[1]

    session.run(optimizer, feed_dict={images: train_images, 
                                      true_labels: train_labels})

    if i % 100 == 0:
        print('{} Average correct: {}'.format(
                i, evaluate(pred, images)))

0 Average correct: 0.1646
100 Average correct: 0.914
200 Average correct: 0.945
300 Average correct: 0.9602
400 Average correct: 0.9652
500 Average correct: 0.9708
600 Average correct: 0.9772
700 Average correct: 0.9763
800 Average correct: 0.9801
900 Average correct: 0.9819
1000 Average correct: 0.9837
1100 Average correct: 0.983
1200 Average correct: 0.983
1300 Average correct: 0.9846
1400 Average correct: 0.9857
1500 Average correct: 0.9859
1600 Average correct: 0.9841
1700 Average correct: 0.9848
1800 Average correct: 0.9839
1900 Average correct: 0.9838
2000 Average correct: 0.9855
2100 Average correct: 0.984
2200 Average correct: 0.9859
2300 Average correct: 0.9852
2400 Average correct: 0.9847
2500 Average correct: 0.9848
2600 Average correct: 0.9862
2700 Average correct: 0.9841
2800 Average correct: 0.987
2900 Average correct: 0.9858
3000 Average correct: 0.9861
3100 Average correct: 0.9878
3200 Average correct: 0.986
3300 Average correct: 0.9877
3400 Average correct: 0.9862
3500