In [6]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)


Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [13]:
print(mnist.train.images.shape)
print(mnist.train.labels.shape)

(55000, 784)
(55000, 10)


Before going further with the tutorial, let us look at softmax regression in detail. Like logistic regression, softmax regression is a classification method. However, logistic regression is a binary classifier, whereas softmax regression is a multi-class classifier. Since we need to distinguish 10 different digits, a softmax classifier is a natural fit for the MNIST dataset. Whenever we want to assign probabilities to an object being one of several different things, softmax can be used because it produces a list of values between 0 and 1 that add up to 1.

The basic idea of the softmax classifier has two steps:
First, add up the evidence for the input being in each class.
Next, convert that evidence into probabilities.

To tally up the evidence that a given image belongs to a particular class, we take a weighted sum of the pixel intensities.
The weight is negative if a high intensity at that pixel is evidence against the image being in that class, and positive if it is evidence in favour.

We can also add some extra evidence called a bias. The bias captures the fact that some classes are more likely than others, independent of the input.

The evidence for class $i$ given an input $x$ is expressed as:

$$\begin{align}\text{evidence}_i = \sum_j W_{i,j}x_j + b_i \end{align} $$


The evidence is then converted into predicted probabilities using the softmax function:

$$\begin{align} y = \text{softmax(evidence)} \end{align} $$

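Softmax is defined as the normalized exponential of its inputs:
$$\begin{align}\text{softmax}(x) = \text{normalize}(\exp(x))\end{align}$$

Written out for component $i$, this is:
$$\begin{align}\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\end{align}$$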
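
As a small illustration of the normalized exponential, here is a minimal NumPy sketch (not part of the original notebook). The function name `softmax` and the three evidence values are assumptions chosen only for illustration; subtracting the maximum before exponentiating is a common stabilisation trick that leaves the result unchanged.

```python
import numpy as np

def softmax(evidence):
    """Normalized exponential of a 1-D array of evidence values."""
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    shifted = evidence - np.max(evidence)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical evidence values for three classes (illustrative only).
evidence = np.array([2.0, 1.0, 0.1])
probs = softmax(evidence)
print(probs)        # each entry lies between 0 and 1
print(probs.sum())  # the entries add up to 1 (up to floating-point rounding)
```

The class with the largest evidence receives the largest probability, and the ordering of the classes is preserved.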

Writing this in matrix and vector form, we obtain:

$$\begin{align}y = \text{softmax}(\mathbf{W}\mathbf{x} + \mathbf{b}), \end{align}$$

where $\mathbf{W}$ is the weight matrix, $\mathbf{x}$ is the input vector and $\mathbf{b}$ is the bias vector. 

In [25]:
# Define x as a placeholder; None lets the first dimension hold any number of images.
x = tf.placeholder(tf.float32, [None, 784], "Input_data")

# x is not a specific value but a placeholder; we supply its value when we ask TensorFlow
# to run the computation.
print(x)

Tensor("Input_data_1:0", shape=(?, 784), dtype=float32)


In [26]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
print(W,b)

<tf.Variable 'Variable_4:0' shape=(784, 10) dtype=float32_ref> <tf.Variable 'Variable_5:0' shape=(10,) dtype=float32_ref>


In [27]:
y = tf.nn.softmax(tf.matmul(x,W) + b)
print(y)

Tensor("Softmax_2:0", shape=(?, 10), dtype=float32)


In the mathematical formulation above we obtained $y = \text{softmax}(Wx + b)$, but in the code we computed y = softmax(xW + b). The two are equivalent because all the dimensions are transposed: earlier $x$ and $y$ were column vectors, whereas in the code each input is a row of x and each prediction is a row of y. Accordingly, the W in the code has shape 784 x 10 and plays the role of $W^T$ in the formula.

i.e. $y^T = \text{softmax}(Wx + b)^T = \text{softmax}(x^TW^T + b^T)$
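
The transposition argument can be checked numerically. The following NumPy sketch (an illustration, not part of the original notebook) builds random arrays with the shapes used here and compares the evidence in the two conventions; the names `W_col`, `x_col`, `b_col` and their row counterparts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Column-vector convention from the formulas: W is 10 x 784, x is 784 x 1, b is 10 x 1.
W_col = rng.normal(size=(10, 784))
x_col = rng.normal(size=(784, 1))
b_col = rng.normal(size=(10, 1))
evidence_col = W_col @ x_col + b_col      # shape (10, 1)

# Row-vector convention from the code: x is 1 x 784, W is 784 x 10, b has 10 entries.
x_row = x_col.T
W_row = W_col.T
b_row = b_col.ravel()
evidence_row = x_row @ W_row + b_row      # shape (1, 10)

print(evidence_col.shape, evidence_row.shape)
print(np.allclose(evidence_col.T, evidence_row))
```

Since softmax only rescales the entries of the evidence vector, equal evidence in the two layouts implies equal predictions.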

The next step is to train the model we just constructed. Training requires the known correct outputs (labels) for the training images. We feed these in through a placeholder tensor y_:

In [28]:
y_ = tf.placeholder(tf.float32, [None, 10],'Known_Output')
print(y_)

Tensor("Known_Output_3:0", shape=(?, 10), dtype=float32)


Given the known outputs, we try to minimize the error made by our model. To measure that error we use a loss function called cross-entropy; minimizing it drives the weights and biases towards values that give better predictions.

Cross-entropy comes from information theory; we do not go into that background here. It is defined as:

$$\begin{align}H_{y'}(y) = - \sum_i y'_i\log (y_i) \end{align},$$

where $y'$ is the true distribution (here, the one-hot label) and $y$ is the predicted probability distribution.
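
To get a feel for the formula, here is a minimal NumPy sketch (an illustration, not part of the original notebook). It assumes one-hot label rows and averages the per-example cross-entropy over a small batch; the function name `cross_entropy` and all probability values are made up for the example.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Mean over examples of -sum_i y'_i * log(y_i)."""
    per_example = -np.sum(y_true * np.log(y_pred), axis=1)
    return per_example.mean()

# Two hypothetical examples with three classes each (illustrative values only).
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1]])
uncertain = np.array([[0.4, 0.3, 0.3],
                      [0.3, 0.4, 0.3]])

loss_confident = cross_entropy(y_true, confident)
loss_uncertain = cross_entropy(y_true, uncertain)
print(loss_confident, loss_uncertain)
```

With one-hot labels only the probability assigned to the correct class contributes, so the confident predictions give the smaller loss.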

In [29]:
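# Sum -y' * log(y) over the classes (axis 1), then average over the batch; tf.log(y) can be unstable for y near 0.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), axis=1))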

In [30]:
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

TensorFlow then adds new operations to the computational graph that implement backpropagation and gradient descent. It returns a single operation which, when run, performs one step of gradient descent training, slightly adjusting the variables to reduce the loss. Next we launch the model using an interactive session.

In [31]:
sess = tf.InteractiveSession()
init = tf.global_variables_initializer()
sess.run(init)

#We can now train the model after initializing the variables. 

In [32]:
# Run the training step 1000 times
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# In each step of the loop, we get a batch of 100 random data points from the training set. Using such small
# batches of random data is called stochastic training, in this case stochastic gradient descent. Using a
# different subset at every step is much cheaper than using all the data and gives nearly the same benefit.
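
To show roughly what a loop like this does under the hood, here is a self-contained NumPy sketch of mini-batch gradient descent for softmax regression. It is an illustration under stated assumptions, not TensorFlow's actual implementation: the data is a small synthetic stand-in for MNIST, batches are drawn by random index sampling, and the helper names `softmax_rows` and `mean_cross_entropy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST (an assumption for illustration): 3 classes, 4 features.
num_examples, num_features, num_classes = 600, 4, 3
true_W = rng.normal(size=(num_features, num_classes))
X = rng.normal(size=(num_examples, num_features))
labels = np.argmax(X @ true_W, axis=1)
Y = np.eye(num_classes)[labels]              # one-hot labels

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)     # stabilise exp
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(W, b, X, Y):
    probs = softmax_rows(X @ W + b)
    return -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))

# Start from zeros, as in the notebook, and use the same learning rate and batch size.
W = np.zeros((num_features, num_classes))
b = np.zeros(num_classes)
learning_rate, batch_size = 0.05, 100

loss_before = mean_cross_entropy(W, b, X, Y)
for _ in range(1000):
    idx = rng.choice(num_examples, size=batch_size, replace=False)  # random mini-batch
    xb, yb = X[idx], Y[idx]
    probs = softmax_rows(xb @ W + b)
    grad_logits = (probs - yb) / batch_size   # gradient of mean cross-entropy w.r.t. xW + b
    W -= learning_rate * (xb.T @ grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)
loss_after = mean_cross_entropy(W, b, X, Y)

print(loss_before, loss_after)
```

Each iteration touches only 100 of the 600 examples, yet the loss on the full synthetic set should end up lower than where it started.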

Next, let us determine how well our model does. As a first step, we figure out where we predicted the correct label.

tf.argmax is a very useful function that gives the index of the highest entry in a tensor along some axis.

Thus tf.argmax(y, 1) is the label predicted by the model for each input, while tf.argmax(y_, 1) is the correct label.

We can then check whether each prediction is correct by comparing the predicted label with the correct label using tf.equal():

In [34]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
print(correct_prediction)

Tensor("Equal_1:0", shape=(?,), dtype=bool)


In [35]:
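# Cast the booleans to floats (True -> 1.0, False -> 0.0); their mean is the fraction of correct predictions.
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(accuracy)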

Tensor("Mean_1:0", shape=(), dtype=float32)
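
To make the argmax comparison and the cast-then-mean step concrete, here is a small NumPy sketch with made-up numbers (an illustration, not part of the original notebook). The arrays `y_pred` and `y_true` are hypothetical stand-ins for the model output and the one-hot labels.

```python
import numpy as np

# Hypothetical predicted probabilities for four inputs and three classes.
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.3, 0.5],
                   [0.5, 0.4, 0.1]])
# One-hot true labels: classes 0, 1, 2, 1.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])

predicted_labels = np.argmax(y_pred, axis=1)   # index of the largest probability per row
true_labels = np.argmax(y_true, axis=1)        # index of the 1 in each one-hot row
correct = predicted_labels == true_labels      # boolean array, one entry per input
accuracy = correct.astype(np.float32).mean()   # fraction of True entries

print(correct)    # three matches followed by one mismatch
print(accuracy)   # 0.75
```

Only the last input is misclassified (predicted class 0, true class 1), so three of the four predictions are correct.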


In [38]:
# Evaluate the accuracy of the trained model on the test data
print(sess.run(accuracy, feed_dict={x:mnist.test.images, y_:mnist.test.labels}))


0.9011
