# MNIST for ML Beginners

- This tutorial goes through the code in [`mnist_softmax.py`](https://github.com/tensorflow/tensorflow/blob/v0.10.0/tensorflow/examples/tutorials/mnist/mnist_softmax.py) line by line.

In [3]:
from IPython.display import display, Math, Latex

In [1]:
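# Download the MNIST data (files already present in MNIST_data/ are reused)
from tensorflow.examples.tutorials.mnist import input_data
# one_hot=True loads each label as a 10-dimensional one-hot vector
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)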

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


- The MNIST data is split into three parts:
    - 55,000 data points of training data (`mnist.train`)
    - 10,000 data points of test data (`mnist.test`)
    - 5,000 data points of validation data (`mnist.validation`)

- [Visualizing the MNIST data](http://colah.github.io/posts/2014-10-Visualizing-MNIST/)

- Each MNIST image is a 28 x 28 pixel image of a handwritten digit.
- We flatten each image into a vector (or "tensor") of 28 x 28 = 784 dimensions, each dimension representing one pixel's intensity (a value between 0 and 1).
- Since there are 55,000 training images, our training data, `mnist.train.images`, has the shape [55000, 784].
- Each image has a corresponding label from 0 to 9.
- We represent our labels in "one-hot" encoding, where each label is a 10-dimensional vector with exactly one dimension "switched on". For example, 3 is encoded as [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].
- Our training labels, `mnist.train.labels`, will have the shape [55000, 10].
- Objective: given an image, produce a probability for each of the 10 labels.
- This is a classic case where softmax regression is a natural, simple model. To assign probabilities to an object being one of several different things, softmax is a good fit because it gives us a list of values between 0 and 1 that add up to 1.
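The flattening and one-hot encoding described above can be sketched in plain Python, without TensorFlow. This is only an illustration: `flatten` and `one_hot` are hypothetical helper names (not TensorFlow APIs), and the blank image stands in for a real digit.

```python
# Minimal sketch in plain Python (no TensorFlow needed).
# `flatten` and `one_hot` are illustrative helper names, not library functions.

def flatten(image):
    """Turn a 2-D grid of pixel intensities into one flat list, row by row."""
    return [pixel for row in image for pixel in row]


def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


# A blank 28 x 28 "image" (all intensities 0.0) standing in for a real digit.
image = [[0.0] * 28 for _ in range(28)]
flat = flatten(image)

print(len(flat))   # 784
print(one_hot(3))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Stacking 55,000 such flattened images row by row gives the [55000, 784] shape of `mnist.train.images`, and stacking their one-hot labels gives [55000, 10].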

## Softmax Regressions

A softmax regression has two steps:

1. Add up the evidence of our input being in certain classes.
2. Convert that evidence into probabilities.

- To tally up the evidence that a given image is in a particular class, we compute a weighted sum of the pixel intensities. The weight is negative if that pixel having a high intensity is evidence against the image being in that class, and positive if it is evidence in favor.
- We also add some extra evidence called a bias. The bias lets us say that some classes are more likely regardless of the input.
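As a rough illustration of the weighted sum plus bias, here is a sketch in plain Python. The function name `evidence` and the toy numbers are assumptions made for this example: it uses 2 classes and 3 "pixels", whereas MNIST would use 10 classes and 784 pixels.

```python
# Sketch of the evidence computation in plain Python.
# `evidence`, `weights`, and `biases` are illustrative names, not library APIs.

def evidence(weights, biases, x):
    """Return one evidence value per class.

    weights: one list of per-pixel weights for each class
    biases:  one bias value for each class
    x:       flat list of pixel intensities
    """
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i
        for w_i, b_i in zip(weights, biases)
    ]


# Toy input: 3 pixel intensities.
x = [1.0, 0.0, 0.5]
weights = [
    [2.0, -1.0, 0.0],  # class 0: pixel 0 counts in favor, pixel 1 against
    [-1.0, 1.0, 2.0],  # class 1: pixel 0 counts against, pixels 1 and 2 in favor
]
biases = [0.5, -0.5]

print(evidence(weights, biases, x))  # [2.5, -0.5]
```

Class 0 ends up with more evidence here because the bright pixel 0 carries a positive weight for it and a negative weight for class 1.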

In [12]:
display(Math(r'\text{evidence}_i = \sum_j W_{i,j} x_j + b_i'))
display(Math(r'\text{where }W_i\text{ are the weights and }b_i\text{ is the bias for class }i\text{, and }j\text{ is an index for summing over the pixels in our input image }x'))


- We then convert the evidence tallies into our predicted probabilities y using the "softmax" function:

In [10]:
display(Math(r'y = \text{softmax}(\text{evidence})'))


- Here softmax serves as an "activation" or "link" function, shaping the output of our linear function into the form we want: in this case, a probability distribution over the 10 classes. You can think of it as converting tallies of evidence into probabilities of our input being in each class.

In [13]:
display(Math(r'\text{softmax}(x) = \text{normalize}(\exp(x))'))


or, written out per component:

In [14]:
display(Math(r'\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}'))

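The softmax definition above translates almost directly into code. The following is a minimal sketch in plain Python, not the TensorFlow implementation. Subtracting the maximum before exponentiating is an addition not in the formula above: it cancels in the ratio, so the result is unchanged, but it keeps `math.exp` from overflowing on large inputs.

```python
import math


def softmax(evidence):
    """Exponentiate each evidence value, then normalize so the results sum to 1."""
    # Shifting by the maximum leaves the ratios unchanged but avoids overflow.
    shift = max(evidence)
    exps = [math.exp(e - shift) for e in evidence]
    total = sum(exps)
    return [e / total for e in exps]


# Toy evidence values for three classes (MNIST would have ten).
probs = softmax([2.5, -0.5, 0.0])
print(probs)       # three values, each between 0 and 1
print(sum(probs))  # approximately 1.0
```

The class with the most evidence gets the largest probability, and because of the exponential, a gap in evidence turns into a multiplicative gap in probability.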