# MNIST For ML Beginners

## The MNIST Data

Download the MNIST data:

In [30]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The MNIST data is split into three parts:

<table align="left">
    <tr>
        <th>training data</th><td>55,000</td>
    </tr>
    <tr>
        <th>test data</th><td>10,000</td>
    </tr>
    <tr>
        <th>validation data</th><td>5,000</td>
    </tr>
</table>

This split is very important: it's essential that we have separate data which we <b>don't</b> learn from so that we can make sure that what we've learned actually generalizes!

Each image is 28 pixels by 28 pixels, 8-bit grayscale. We can flatten each image into a 784-dimensional vector. For now, we are giving up information about the 2D structure, which will be exploited in later tutorials.

The result is that `mnist.train.images` is a tensor with a shape of `[55000, 784]`. Each entry is a pixel intensity normalized to lie between 0 and 1.
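To make the flattening and normalization concrete, here is a minimal plain-Python sketch (no TensorFlow). The tiny 2x3 "image" and the helper name `flatten_and_normalize` are illustrative assumptions, and the sketch assumes 8-bit intensities are scaled by dividing by 255.

```python
def flatten_and_normalize(image):
    """Flatten a 2D list of 8-bit pixel values row by row and scale to [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]

# Hypothetical 2x3 "image"; a real MNIST image is 28x28, giving 784 values.
tiny_image = [[0, 128, 255],
              [255, 64, 0]]

vector = flatten_and_normalize(tiny_image)
print(len(vector))  # number of entries: rows * columns
print(vector)
```

Flattening row by row discards which pixels were vertical neighbours, which is the 2D structure the text says we give up for now.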

Each image in MNIST has a corresponding label (i.e. the 'correct' answer), a number between 0 and 9 representing the digit drawn in the image.

In the program, the labels are converted to 'one-hot' vectors: vectors in which exactly one entry is 1 and all others are 0. For example, label 3 becomes [0,0,0,1,0,0,0,0,0,0]. Consequently, `mnist.train.labels` is a tensor with shape [55000, 10].
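The one-hot encoding can be sketched in a few lines of plain Python. The helper name `one_hot` is an assumption made for illustration; it is not the function the MNIST loader uses.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

encoded = one_hot(3)
print(encoded)

# Decoding is the index of the largest entry.
decoded = encoded.index(max(encoded))
print(decoded)
```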

## Softmax Regressions

There are only ten possible things that a given image in MNIST can be, and we want to be able to look at an image and give the <b>probabilities</b> for it being each digit.

Softmax regression is a suitable model for our purpose since it gives us a list of values between 0 and 1 that add up to 1. Even later on, when we train more sophisticated models, the final step will be a layer of softmax.

A softmax regression has two steps: first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.

The <b>evidence</b> for a certain class $i$ given an input $x$ is:

<div> $$\text{evidence}_i = \sum_j W_{i,~ j} x_j + b_i$$ </div>

where $W_i$ holds the weights and $b_i$ is the bias for class $i$,
and $j$ is an index for summing over the pixels in our input image $x$.
We then convert the evidence tallies into our predicted probabilities
$y$ using the "softmax" function:

<div> $$y = \text{softmax}(\text{evidence})$$ </div>

Here softmax is serving as an "activation" or "link" function, shaping
the output of our linear function into the form we want -- in this case, a
probability distribution over 10 cases.
We can think of it as converting tallies
of evidence into probabilities of our input being in each class.
It's defined as:

$$\text{softmax}(x) = \text{normalize}(\exp(x))$$

If we expand that equation out, we get:

<div> $$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$ </div>

See <a href="http://neuralnetworksanddeeplearning.com/chap3.html#softmax">this explanation</a> for more intuition about the softmax function.

We can write $y$ in the compact form:

<div> $$y = \text{softmax}(Wx + b)$$ </div>
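As a rough illustration of the two steps (evidence, then softmax), here is a plain-Python sketch on a toy problem. The 3-class, 2-feature weights and the function names are assumptions made for the example; subtracting the maximum before exponentiating is a common trick that leaves the result unchanged.

```python
import math

def softmax(values):
    """normalize(exp(values)); the max is subtracted first to avoid overflow."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def evidence(W, x, b):
    """evidence_i = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical toy problem: 3 classes, 2 input features.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
x = [2.0, 1.0]

ev = evidence(W, x, b)
y = softmax(ev)
print(ev)  # evidence tallies, one per class
print(y)   # probabilities that sum to 1
```

Classes with equal evidence receive equal probability, and a class with more evidence always receives a larger share.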

## Implementing the Regression

Import TensorFlow:

In [31]:
import tensorflow as tf

Create symbolic variable `x`:

In [32]:
x = tf.placeholder(tf.float32, [None, 784])

`x` is a <b>placeholder</b>, a value that we'll input when we ask TensorFlow to run a computation. Our input is a series of MNIST images, each flattened into a 784-dimensional vector. The input has a shape `[None, 784]`, where `None` means that a dimension can be of <b>any length</b>.

In our TensorFlow program, the weights and biases are <b>Variables</b>. Their values can be easily adjusted by the program during the learning process.

In [33]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

where `tf.zeros(shape)` produces a tensor of the given shape filled with zeros.

We can now implement our model. It only takes one line to define it!

In [34]:
y = tf.nn.softmax(tf.matmul(x, W) + b)

* `tf.matmul(x, W) + b`: the TensorFlow expression for the evidence $Wx + b$. The order of `x` and `W` is flipped in the TensorFlow expression because each input is a row of `x`, so the dimensions must match for matrix multiplication (`x` has shape `[None, 784]`, `W` has shape `[784, 10]`). `b` is the bias term.
* `tf.nn.softmax(...)`: applies softmax to the evidence. `nn` stands for 'neural network'.
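The shape bookkeeping can be checked with a small plain-Python sketch. The `matmul` helper and the toy shapes (`[2, 3]` inputs, `[3, 2]` weights) are assumptions for illustration; they stand in for `[None, 784]` and `[784, 10]`.

```python
def matmul(A, B):
    """Multiply an [n, k] matrix by a [k, m] matrix (both lists of rows)."""
    assert len(A[0]) == len(B), "inner dimensions must match"
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Hypothetical batch of 2 inputs with 3 features each: shape [2, 3].
X = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]
# Weights mapping 3 features to 2 classes: shape [3, 2].
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -0.5]  # one bias per class, added to every row of the product

logits = [[v + b_j for v, b_j in zip(row, b)] for row in matmul(X, W)]
print(logits)  # shape [2, 2]: one row per input, one column per class
```

With inputs stored as rows, `X` times `W` yields one row of class scores per input, which is why the batch dimension can be left open.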

## Training

In machine learning, the term <b>loss</b> represents how far off our model's prediction is from the desired outcome. The goal of <b>training</b> is to minimize the loss.

A commonly-used loss function is called <b>cross entropy</b>:

<div> $$H_{y'}(y) = -\sum_i y'_i \log(y_i)$$ </div>

where $y$ is our predicted probability distribution and $y'$ is the true distribution (the one-hot vector with the digit labels). It's good to <a href="http://colah.github.io/posts/2015-09-Visual-Information/">understand</a> how cross entropy works.
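A small worked example may help: with a one-hot label, the sum collapses to the negative log of the probability assigned to the correct class. The two prediction vectors below are made-up values for illustration.

```python
import math

def cross_entropy(y_true, y_pred):
    """H_{y'}(y) = -sum_i y'_i * log(y_i); terms with y'_i == 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0.0, 0.0, 1.0]        # one-hot label for class 2
confident = [0.05, 0.05, 0.90]  # hypothetical good prediction
unsure = [0.40, 0.40, 0.20]     # hypothetical poor prediction

loss_confident = cross_entropy(y_true, confident)  # equals -log(0.9)
loss_unsure = cross_entropy(y_true, unsure)        # equals -log(0.2)
print(loss_confident)
print(loss_unsure)
```

The loss shrinks toward 0 as the probability of the correct class approaches 1, and grows without bound as it approaches 0.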

For the implementation, first add a new placeholder for the correct answers (labels):

In [35]:
y_ = tf.placeholder(tf.float32, [None, 10])

Implement the cross-entropy function:

In [36]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_*tf.log(y), axis=1))

* `tf.log`: the logarithm operation.
* `y_*tf.log(y)`: element-wise multiplication
* `tf.reduce_sum(..., axis=1)`: see example below:

In [37]:
mat = tf.Variable([[1, 1], 
                 [2, 2]], dtype=tf.float32)
init_op =  tf.global_variables_initializer()

sum0_op = tf.reduce_sum(mat, axis=0)
sum1_op = tf.reduce_sum(mat, axis=1)

with tf.Session() as sess:
    sess.run(init_op) # initialize global variables
    sum0_val = sess.run(sum0_op)
    sum1_val = sess.run(sum1_op)
    print('axis=0: %s' % sum0_val)
    print('axis=1: %s' % sum1_val)

axis=0: [ 3.  3.]
axis=1: [ 2.  4.]


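* `tf.reduce_mean(...)`: works like `tf.reduce_sum`, but computes the mean instead of the sum. With no `axis` given, it averages over all entries, here the per-example losses in the batch.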

For more details, see documentation: <a href="https://www.tensorflow.org/api_docs/python/tf/reduce_mean">tf.reduce_mean</a>, <a href="https://www.tensorflow.org/api_docs/python/tf/reduce_sum">tf.reduce_sum</a>

Note that in the source code, we don't use this formulation since it's numerically unstable: the unnormalized `arg` in `tf.nn.softmax(arg)` can have very large entries, so that `exp(arg)` may overflow, and `tf.log(y)` returns `-inf` wherever a predicted probability underflows to 0. Instead, we apply `tf.nn.softmax_cross_entropy_with_logits` to the unnormalized logits:

In [38]:
y = tf.matmul(x, W) + b  # use the unnormalized model (no softmax)
# Returns one loss value per example (shape [batch_size]), not a single mean.
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y)

For more info, see <a href="https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits">tf.nn.softmax_cross_entropy_with_logits</a>.
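The stability issue can be reproduced in plain Python. This is a sketch of the general log-sum-exp idea, under the assumption of deliberately extreme logits; it is not a description of how `tf.nn.softmax_cross_entropy_with_logits` is implemented internally, and both function names are made up for the example.

```python
import math

def naive_cross_entropy(logits, label):
    """Softmax first, then log: exp() overflows for very large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label] / sum(exps))

def stable_cross_entropy(logits, label):
    """Uses -log softmax(z)_i = log(sum_j exp(z_j - m)) + m - z_i, with m = max(z)."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(z - m) for z in logits))
    return log_sum + m - logits[label]

logits = [1000.0, 0.0, -1000.0]  # hypothetical extreme values

try:
    naive_cross_entropy(logits, 0)
except OverflowError as err:
    print("naive version failed: %s" % err)

print(stable_cross_entropy(logits, 0))
print(stable_cross_entropy(logits, 2))
```

Working directly from the logits means the largest exponent is always `exp(0)`, so nothing overflows, and the logarithm is never taken of a probability that has underflowed to 0.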

Now that we know what we want our model to do, it's very easy to have TensorFlow train it to do so. Because TensorFlow knows the entire graph of your computations, it can automatically use the <a href="http://colah.github.io/posts/2015-08-Backprop/"><b>backpropagation algorithm</b></a> to efficiently determine how your variables affect the loss you ask it to minimize. Then it can apply your choice of optimization algorithm to modify the variables and reduce the loss.

In [39]:
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

In this case, we ask TensorFlow to minimize `cross_entropy` using the <a href="https://en.wikipedia.org/wiki/Gradient_descent"><b>gradient descent algorithm</b></a> with a learning rate of 0.05. Gradient descent is a simple procedure, where TensorFlow shifts each variable a little bit in the direction that reduces the loss. But TensorFlow also provides <a href="https://www.tensorflow.org/api_guides/python/train#Optimizers">many other optimization algorithms</a>: using one is as simple as tweaking one line.

What TensorFlow actually does here, behind the scenes, is to add new operations to your graph that implement backpropagation and gradient descent. Then it gives you back a single operation which, when run, does <b>a step</b> of gradient descent training, slightly tweaking your variables to reduce the loss.
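To see what a single update step does, here is a one-variable sketch in plain Python. The quadratic loss and its hand-written gradient are assumptions chosen for illustration; in TensorFlow the gradient comes from backpropagation over the graph instead.

```python
def gradient_descent(grad_fn, w, learning_rate, steps):
    """Repeatedly move w a small step against the gradient of the loss."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

def grad(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

one_step = gradient_descent(grad, w=0.0, learning_rate=0.05, steps=1)
w_final = gradient_descent(grad, w=0.0, learning_rate=0.05, steps=200)
print(one_step)  # a small move from 0.0 toward the minimizer
print(w_final)   # close to 3.0 after many steps
```

Each step here shrinks the distance to the minimizer by a constant factor, which is why many small steps are needed rather than one.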

We can now launch the model in an `InteractiveSession`:

In [40]:
sess = tf.InteractiveSession()

We first have to create an operation to initialize the variables we created:

In [41]:
tf.global_variables_initializer().run()

Let's train -- we'll run the training step 1000 times!

In [42]:
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

In each step of the loop, we get a 'batch' of 100 random data points from our training set. We run `train_step`, feeding in the batch data to replace the placeholders.

Using small batches of random data is called <b>stochastic training</b> -- in this case, stochastic gradient descent. Ideally, we'd like to use all our data for every step of training because that would give us a better sense of what we should be doing, but that's expensive. So, instead, we use a different subset every time. Doing this is cheap and has much of the same benefit.
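Drawing a random subset can be sketched in plain Python as follows. This is only an illustration of the idea; the `next_batch` helper below is hypothetical and is not claimed to match how `mnist.train.next_batch` selects its examples.

```python
import random

def next_batch(examples, labels, batch_size, rng):
    """Draw `batch_size` (example, label) pairs uniformly, without replacement."""
    indices = rng.sample(range(len(examples)), batch_size)
    return [examples[i] for i in indices], [labels[i] for i in indices]

# Hypothetical toy dataset of 10 examples with binary labels.
examples = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch_xs, batch_ys = next_batch(examples, labels, batch_size=4, rng=rng)
print(batch_xs)
print(batch_ys)
```

The key property is that each example stays paired with its own label, while the subset changes from step to step.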

## Evaluating Our Model

How well does our model do?

Well, first let's figure out where we predicted the correct label. `tf.argmax` is an extremely useful function which gives you the index of the highest entry in a tensor along some axis. 

Examples of `tf.argmax`:

In [43]:
vec = [[1, 2, 3],
       [6, 5, 4]]
print(sess.run(tf.argmax(vec, axis=1)))

[2 0]


Therefore, `tf.argmax(y, axis=1)` is the label our model thinks is most likely for each input, while `tf.argmax(y_, axis=1)` is the correct label.

We can use `tf.equal` to check whether our prediction matches the truth:

In [44]:
correct_prediction = tf.equal(tf.argmax(y, axis=1), tf.argmax(y_, axis=1))

That gives us a list of booleans. To determine what fraction are correct, we cast (type conversion) to floating point numbers and then take the mean. For example, `[True, False, True, True]` would become `[1, 0, 1, 1]` which would become `0.75`.

In [45]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
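The same argmax, compare, cast, and mean pipeline can be traced by hand in plain Python. The four predictions and labels below are made-up values chosen so that three of four are correct; the helper names are assumptions for the sketch.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return values.index(max(values))

def accuracy(predictions, one_hot_labels):
    """Fraction of rows where the predicted class equals the labelled class."""
    correct = [argmax(p) == argmax(t)
               for p, t in zip(predictions, one_hot_labels)]
    as_floats = [1.0 if c else 0.0 for c in correct]  # the "cast" step
    return sum(as_floats) / len(as_floats)            # the "mean" step

# Hypothetical 2-class predictions and one-hot labels.
predictions = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
true_labels = [[0, 1], [1, 0], [1, 0], [1, 0]]

result = accuracy(predictions, true_labels)
print(result)
```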

Finally, we ask for our accuracy on our test data:

In [46]:
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.8742


This accuracy is not very good because our model is very simple. To get better results, we need a more sophisticated model.