# [Deep MNIST for Experts](https://www.tensorflow.org/get_started/mnist/pros)

## Setup

### Load MNIST Data

In [4]:
import tensorflow as tf
import numpy as np

In [13]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


### Start TensorFlow InteractiveSession

In [5]:
session = tf.InteractiveSession()

### Placeholders and variables

In [14]:
X = tf.placeholder(tf.float32, shape=[None, 784])
y = tf.placeholder(tf.float32, shape=[None, 10])

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

## Build a Multilayer Convolutional Network

### Weight Initialization

To create this model, we're going to need to create a lot of weights and biases. One should generally initialize weights with a small amount of noise for symmetry breaking, and to prevent 0 gradients. Since we're using [ReLU]( https://en.wikipedia.org/wiki/Rectifier_(neural_networks) neurons, it is also good practice to initialize them with a slightly positive initial bias to avoid "dead neurons". Instead of doing this repeatedly while we build the model, let's create two handy functions to do it for us.

In [9]:
def createWeights(shape):
    randoms = tf.truncated_normal(shape=shape, mean=0.5, stddev=0.1)
    return tf.Variable(randoms)

def createBiases(shape):
    n = tf.constant(value=0.1, shape=shape)
    return tf.Variable(n)

### Convolution and Pooling
TensorFlow also gives us a lot of flexibility in convolution and pooling operations. How do we handle the boundaries? What is our stride size? In this example, we're always going to choose the vanilla version. Our convolutions uses a stride of one and are zero padded so that the output is the same size as the input. Our pooling is plain old max pooling over 2x2 blocks. To keep our code cleaner, let's also abstract those operations into functions.

In [17]:
def conv2d(X, W):
    return tf.nn.conv2d(input=X, filter=W, strides=[1, 1, 1, 1], padding='SAME')

def maxPool2x2(X):
    return tf.nn.max_pool(value=X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

### First Convolutional Layer

We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of ``[5, 5, 1, 32]``. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels. We will also have a bias vector with a component for each output channel.

In [18]:
W_conv1 = createWeights([5, 5, 1, 32])
b_conv1 = createBiases([32])

To apply the layer, we first reshape ```X``` to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.

In [19]:
X_image = tf.reshape(X, [-1, 28, 28, 1])

We then convolve ``x_image`` with the weight tensor, add the bias, apply the ReLU function, and finally max pool. The ``max_pool_2x2`` method will reduce the image size to 14x14.

In [20]:
h_conv1 = tf.nn.relu(conv2d(X_image, W_conv1) + b_conv1)
h_pool1 = maxPool2x2(h_conv1)

### Second Convolutional Layer

In order to build a deep network, we stack several layers of this type. The second layer will have 64 features for each 5x5 patch.

In [21]:
W_conv2 = createWeights([5, 5, 32, 64])
b_conv2 = createBiases([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = maxPool2x2(h_conv2)

### Densely Connected Layer

Now that the image size has been reduced to 7x7, we add a fully-connected layer with 1024 neurons to allow processing on the entire image. We reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU.

In [22]:
W_fc1 = createWeights([7 * 7 * 64, 1024])
b_fc1 = createBiases([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

### Dropout

To reduce overfitting, we will apply [dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) before the readout layer. We create a `placeholder` for the probability that a neuron's output is kept during dropout. This allows us to turn dropout on during training, and turn it off during testing. TensorFlow's `tf.nn.dropout` op automatically handles scaling neuron outputs in addition to masking them, so dropout just works without any additional scaling.

In [23]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

### Readout Layer

Finally, we add a layer, just like for the one layer softmax regression above.

In [24]:
W_fc2 = createWeights([1024, 10])
b_fc2 = createBiases([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

### Train and Evaluate the Model

* We will replace the steepest gradient descent optimizer with the more sophisticated ADAM optimizer.

* We will include the additional parameter `keep_prob` in `feed_dict` to control the dropout rate.

* We will add logging to every 100th iteration in the training process.

In [27]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_conv))
train = tf.train.AdamOptimizer(0.01).minimize(cross_entropy)

compare = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(compare, tf.float32))

tf.global_variables_initializer().run()
for i in range(2000):
    XBatch, yBatch = mnist.train.next_batch(50)
    
    if i % 100 == 0:
        accRes = accuracy.eval(feed_dict={X: XBatch, y: yBatch, keep_prob: 1.0})
        print('step {0}: accuracy {1}'.format(i, accRes))
    
    train.run(feed_dict={X: XBatch, y: yBatch, keep_prob: 0.5})
    
print('Test accuracy: {0}'.format(accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, keep_prob: 1.0})))

step 0: accuracy 0.10000000149011612
step 100: accuracy 0.5199999809265137
step 200: accuracy 0.23999999463558197
step 300: accuracy 0.05999999865889549
step 400: accuracy 0.10000000149011612
step 500: accuracy 0.20000000298023224
step 600: accuracy 0.14000000059604645
step 700: accuracy 0.14000000059604645
step 800: accuracy 0.18000000715255737
step 900: accuracy 0.25999999046325684
step 1000: accuracy 0.2199999988079071
step 1100: accuracy 0.11999999731779099
step 1200: accuracy 0.11999999731779099
step 1300: accuracy 0.10000000149011612
step 1400: accuracy 0.05999999865889549
step 1500: accuracy 0.23999999463558197
step 1600: accuracy 0.18000000715255737
step 1700: accuracy 0.20000000298023224
step 1800: accuracy 0.18000000715255737
step 1900: accuracy 0.2199999988079071
Test accuracy: 0.20200000703334808
