# Table of Contents
1. [Linear Regression](#Linear-Regression)
2. [Neural Network](#Neural-Network)
3. [Convolutional Networks](#Convolutional-Networks)

In [None]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('data',one_hot = True)

# Linear Regression
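We'll apply a single layer of weights to the inputs. Here, the inputs are all 28 x 28 = 784 pixels of an MNIST image. Strictly speaking, because the outputs go through a softmax, this model is multinomial logistic (softmax) regression rather than plain linear regression.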

In [None]:
# tf placeholder for the inputs we feed in at run time. None allows flexibility in that dimension (the batch size)
x = tf.placeholder(tf.float32,[None,784]) 

In [None]:
# directly go from inputs to outputs
W_linreg = tf.Variable(tf.random_normal([784, 10], stddev = 0.1)) 
b_linreg = tf.Variable(tf.random_normal([10], stddev = 0.1))

In [None]:
# Our prediction
y = tf.nn.softmax(tf.matmul(x,W_linreg) + b_linreg)

# correct answers
y_ = tf.placeholder(tf.float32,[None,10])

Many different optimizers are available. Here we use a basic gradient descent optimizer.

In [None]:
# loss: mean negative log-likelihood (cross-entropy) over the batch
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
# 0.5 is the learning rate
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
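
To see what the loss above computes, here is a small sketch in plain Python (no TensorFlow). The function names and the example scores are made up for illustration, and the `if t > 0` guard is our own addition so that classes with a zero label never hit `log(0)`.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """Negative log-likelihood of the true class for one example."""
    # Only the class with label 1 contributes to the sum.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

# Hypothetical scores for a 3-class problem
probs = softmax([2.0, 1.0, 0.1])
loss_right = cross_entropy([1, 0, 0], probs)  # true class has the top score
loss_wrong = cross_entropy([0, 0, 1], probs)  # true class has the lowest score

# A model that knows nothing (equal scores for 10 classes) has loss log(10)
loss_uniform = cross_entropy([1] + [0] * 9, softmax([0.0] * 10))
```

The loss is small when the true class gets a high probability and large when it doesn't, which is exactly what gradient descent pushes against.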

Very important! Variables have no values until they are initialized, so without this cell any later cell that uses them will fail.

In [None]:
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

In [None]:
## begin the training:
batch_size = 500

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(batch_size)
    sess.run(train_step, feed_dict={ x : batch_xs, y_: batch_ys})
    

In [None]:
### checking the results:
# check by comparing max values in vectors
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

print(sess.run(accuracy, feed_dict={x : mnist.test.images, y_:mnist.test.labels}))
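
The accuracy check above boils down to "does the largest entry of the prediction sit at the same index as the 1 in the label?". A minimal plain-Python sketch of that idea, with made-up predictions and labels:

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predictions, labels):
    """Fraction of examples whose predicted class matches the one-hot label."""
    hits = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]
    return sum(hits) / len(hits)

# Three hypothetical examples with 3 classes each
preds = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
acc = accuracy(preds, labels)  # first and third are right, second is wrong
```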

So about 92%. We ought to do better: the best algorithms get ~99% (see http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html).

# Neural Network
Let's try again, this time with a neural network.
To make it a neural network, we'll give it a hidden layer (i.e. add an additional set of weights) and apply non-linear transformations after each multiplication by a weight matrix.

In [None]:
# tf placeholder for the inputs we feed in at run time. None allows flexibility in that dimension (the batch size)
x = tf.placeholder(tf.float32,[None,784]) 

In [None]:
## parameters to consider:
batch_size = 200
learn_rate = 0.5
h1_layer_size = 500

In [None]:
# Go from inputs to hidden layer
W_NN1 = tf.Variable(tf.random_normal([784, h1_layer_size], stddev = 0.1)) 
b_NN1 = tf.Variable(tf.random_normal([h1_layer_size], stddev = 0.1))
# Now go from hidden layer to the 10 outputs
W_NN2 = tf.Variable(tf.random_normal([h1_layer_size, 10], stddev = 0.1)) 
b_NN2 = tf.Variable(tf.random_normal([10], stddev = 0.1))

In [None]:
# Our prediction: apply a ReLU non-linearity to the hidden layer, then softmax on the output
h1_layer = tf.nn.relu(tf.matmul(x,W_NN1) + b_NN1) 
y = tf.nn.softmax(tf.matmul(h1_layer, W_NN2)+b_NN2)
# correct answers
y_ = tf.placeholder(tf.float32,[None,10])

Many different optimizers are available. Here we again use a basic gradient descent optimizer.

In [None]:
# loss: mean negative log-likelihood (cross-entropy) over the batch
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
# use the learning rate set in the parameters cell (0.5)
train_step = tf.train.GradientDescentOptimizer(learn_rate).minimize(cross_entropy)

Very important! Variables have no values until they are initialized, so without this cell any later cell that uses them will fail.

In [None]:
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

In [None]:
## begin the training:
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(batch_size)
    sess.run(train_step, feed_dict={ x : batch_xs, y_: batch_ys})

In [None]:
### checking the results:
# check by comparing max values in vectors
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x : mnist.test.images, y_:mnist.test.labels}))

Much better after adding just one hidden layer: accuracy is up to 97.6%, from 92% for linear regression. Can we do better still?

Parameters to play around with:
- batch size
- hidden layer size
- Number of hidden layers (try adding another hidden layer on your own!)
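
When playing with the hidden layer size or the number of layers, it helps to know how many weights and biases each choice costs. A rough plain-Python sketch (the helper name is ours, not part of TensorFlow):

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases for a fully connected network.

    layer_sizes lists the width of every layer, inputs first, outputs last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output
    return total

linreg_params = mlp_param_count([784, 10])              # the first model
one_hidden_params = mlp_param_count([784, 500, 10])     # the network above
two_hidden_params = mlp_param_count([784, 500, 500, 10])  # one more hidden layer
```

With these sizes the counts are 7,850, 397,510 and 648,010, so almost all of the parameters sit in the first weight matrix.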

# Convolutional Networks

Convolutional networks allow us to reuse some of the parameters. Instead of requiring a separate weight for each input pixel, we have a small set of weights that slides over all the pixels and gets reused at every position. We also try to condense the representation so there are fewer overall 'nodes' in it, but each node carries its own set of features.

In [None]:
# some helper functions to help us set up the convolutional network
# strides : describes how to move the convolutional window over the input
# padding : whether to add extra zero-columns so that the window can be read to the last input column
#     'SAME'  - extra zero columns are added, so with stride 1 the number of columns stays N_width
#     'VALID' - no extra columns are added, so with stride 1 the number of columns is
#               N_width - W_width + 1 in the new representation

def weight_variable(size):
    initial = tf.truncated_normal(size, stddev = 0.02)
    return tf.Variable(initial)
    
def bias_variable(size):
    initial = tf.constant(0.1, shape=size)
    return tf.Variable(initial)

def conv2d(x, W):  
    return tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME') 

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')
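
To make the padding comments concrete, here is a plain-Python sketch of the output-size rules for 'SAME' and 'VALID' along one dimension. The function name is ours; the formulas are the ones TensorFlow documents for these two padding modes.

```python
import math

def conv_output_size(n, window, stride, padding):
    """Output width (or height) of a convolution or pooling window."""
    if padding == 'SAME':
        # zeros are added as needed, so only the stride shrinks the output
        return math.ceil(n / stride)
    if padding == 'VALID':
        # the window must fit entirely inside the input
        return math.ceil((n - window + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

# Trace a 28-pixel-wide image through the helpers defined above
after_conv1 = conv_output_size(28, 5, 1, 'SAME')           # conv2d with a 5 x 5 window
after_pool1 = conv_output_size(after_conv1, 2, 2, 'SAME')  # max_pool_2x2
after_conv2 = conv_output_size(after_pool1, 5, 1, 'SAME')
after_pool2 = conv_output_size(after_conv2, 2, 2, 'SAME')

# For comparison, the same first convolution without padding
valid_conv1 = conv_output_size(28, 5, 1, 'VALID')
```

The widths come out as 28, 14, 14 and 7, which is where the 7 x 7 in the fully connected layer comes from. With 'VALID' the first convolution would give 24 instead of 28.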

In [None]:
# creating the appropriate layers
# First, it's necessary to create a 4-tensor out of inputs
x = tf.placeholder(tf.float32,[None,784])
x_image = tf.reshape(x, [-1, 28, 28, 1])

Note that with a convolutional network, our first convolution only requires (5 x 5 + 1) * 32 = 832 parameters (weights plus biases).

If we had wanted a fully connected neural network going from all 784 input pixels to a hidden layer of size 32, plus a bias, we'd need (784 + 1) * 32 = 25,120 parameters instead.
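
The same arithmetic can be written as a small plain-Python sketch. The helper names are ours; the filter shape follows the `[height, width, in_channels, out_channels]` layout used for the weights in this notebook.

```python
def conv_param_count(filter_shape):
    """Weights plus biases for a conv layer with the given filter shape."""
    height, width, in_channels, out_channels = filter_shape
    # every output channel has its own window of weights and one bias
    return (height * width * in_channels + 1) * out_channels

def dense_param_count(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return (n_in + 1) * n_out

conv1_params = conv_param_count([5, 5, 1, 32])    # first convolution
dense_params = dense_param_count(784, 32)         # fully connected alternative
conv2_params = conv_param_count([5, 5, 32, 64])   # a second convolution on 32 channels
```

The convolution needs about 30 times fewer parameters than the fully connected layer here (832 versus 25,120), because the same 5 x 5 window is reused at every position of the image.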

In [None]:
W_conv1 = weight_variable([5,5,1,32])
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)  # this layer will be a 28 x 28 x 32 rep.
h_pool1 = max_pool_2x2(h_conv1)     # After the pooling, we now have a 14 x 14 x 32 rep.

In [None]:
# creating a second layer:
W_conv2 = weight_variable([5,5,32,64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)  # this layer will be 14 x 14 x 64
h_pool2 = max_pool_2x2(h_conv2)      # Now it will be 7 x 7 x 64 

In [None]:
# Now let's create a fully connected neural network layer
W_fc1  = weight_variable([7*7*64, 1024])
b_fc1 = bias_variable([1024])

# we will flatten the convolutional layers we had developed previously
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64]) 
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1)+b_fc1)

In [None]:
# Last set of weights to get to the output layer
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

# Output:
y_conv1 = tf.nn.softmax(tf.matmul(h_fc1, W_fc2)+b_fc2)

In [None]:
# set up training, this time using the fancier Adam optimizer
# loss: mean negative log-likelihood (cross-entropy) over the batch
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv1), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y_conv1,1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In [None]:
# some training parameters:
num_iters = 2000
batch_size = 50

sess.run(tf.initialize_all_variables())
for i in range(num_iters):
    batch = mnist.train.next_batch(batch_size)
    if i%100 == 0 :
        # report accuracy on the current training batch every 100 steps
        train_accuracy = sess.run(accuracy, feed_dict={ x: batch[0], y_: batch[1]})
        print('Step %d, training accuracy %g'%(i, train_accuracy))
    sess.run(train_step, feed_dict={x: batch[0], y_:batch[1]})

In [None]:
test_accuracy = sess.run(accuracy, feed_dict = {x:mnist.test.images, y_:mnist.test.labels})
print('Test accuracy for convnet without dropout: %g'%(test_accuracy))

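Wow, what happened? We only have 10% accuracy on the test set, which is chance level for ten classes, whereas we had nearly 100% accuracy on the training batches. A gap like this suggests that the model has overfit (chance-level accuracy can also point to a numerical problem such as NaNs in the loss, which is worth ruling out). Let's try it again with dropout.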

Dropout randomly switches off nodes in the fully connected hidden layer at each training step, so no node can rely on any particular other node being present. That should help the nodes learn features that are a little more general.
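
Here is a rough plain-Python sketch of what dropout does to a layer's activations. The function is our own illustration, not TensorFlow's implementation, but it follows the same convention as `tf.nn.dropout`: each value is kept with probability `keep_prob` and the kept values are scaled up by `1 / keep_prob` so the expected sum stays the same.

```python
import random

def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, rescale the rest."""
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # kept: scale up to preserve the expected value
        else:
            out.append(0.0)            # dropped
    return out

rng = random.Random(0)   # fixed seed so the sketch is repeatable
h = [1.0] * 1000         # hypothetical hidden-layer activations

dropped = dropout(h, 0.5, rng)    # training: roughly half the nodes are zeroed
unchanged = dropout(h, 1.0, rng)  # evaluation: keep_prob = 1.0 keeps everything
```

This is why `keep_prob` is fed as 0.5 during training and 1.0 when measuring accuracy: at test time we want the full network, with nothing switched off.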

In [None]:
# Setting up parameter for dropout:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

In [None]:
# Last set of weights to get to the output layer
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

# Output:
y_conv2 = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2)+b_fc2)

In [None]:
# set up training, again using the fancier Adam optimizer
# loss: mean negative log-likelihood (cross-entropy) over the batch
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv2), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y_conv2,1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In [None]:
# some training parameters:
num_iters = 2000
batch_size = 50

sess.run(tf.initialize_all_variables())
for i in range(num_iters):
    batch = mnist.train.next_batch(batch_size)
    if i%100 == 0 :
        # evaluate with dropout switched off (keep every node)
        train_accuracy = sess.run(accuracy, feed_dict={ x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('Step %d, training accuracy %g'%(i, train_accuracy))
    sess.run(train_step, feed_dict={x: batch[0], y_:batch[1], keep_prob: 0.5}) # during training, only keep nodes half the time

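To double check: are these all running in the same session, or a different one each time you call 'initialize_all_variables'? They all run in the same session (`sess`). Calling 'initialize_all_variables' again does not create a new session; it resets every variable in the graph to a fresh initial value, so the network above is trained from scratch.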

In [None]:
test_accuracy = sess.run(accuracy, feed_dict = {x:mnist.test.images, y_:mnist.test.labels, keep_prob: 1.0})
print('Test accuracy for convnet with dropout: %g'%(test_accuracy))

98.4%. That's much better!