In [None]:
%matplotlib inline
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
plt.rcParams['image.cmap'] = 'gray'  # show images in grayscale instead of the default heat-map colors

In [None]:
session = tf.InteractiveSession()

In [None]:
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Convolutional MNIST model

We want several hidden layers rather than just one, and we don't like code duplication, so we define a function that builds a convolutional layer.

In [None]:
def convolutional_layer(inp, output_channels, scope_name):
    # We need the number of channels in `inp`.
    # `get_shape` returns a tf.TensorShape describing the static shape of `inp`;
    # its last element is the number of channels.
    input_channels = inp.get_shape().as_list()[-1]
    # The kernel shape is [kernel_height, kernel_width, input_channels, output_channels]
    kernel_shape = [3, 3, input_channels, output_channels]
    with tf.variable_scope(scope_name):
        kernel = tf.get_variable('kernel', kernel_shape, initializer=tf.random_normal_initializer(stddev=1e-3))
        bias = tf.get_variable('bias', [output_channels], initializer=tf.zeros_initializer())
    # Strides tell TF where the next kernel center is relative to the current one.
    # Padding tells TF how to behave near the edges of the image.
    # More info about strides and padding: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
    processed = tf.nn.conv2d(inp, kernel, strides=[1, 2, 2, 1], padding='VALID') + bias
    return tf.nn.leaky_relu(processed)
    

## Static shape vs Dynamic shape
Each tensor has a shape. When we define variables we have to fully define that shape, but when we define placeholders we can leave some dimensions undefined (for example the batch size dimension). The shape known at definition time is the *static* shape of the tensor, and we get it with the `get_shape` method. Operations infer their static shape from the shapes of their operands.

When we feed the placeholders and actually compute the tensors, they have a fully defined shape without `None` dimensions. We call that shape the *dynamic* shape, and we can get it with the `tf.shape` function. The dynamic shape is useful in some circumstances, but in our case we need the static shape, because the kernel is a variable and its shape has to be fully defined when we build the graph.
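As a rough illustration of how the two shapes relate, the pure-Python sketch below models a static shape as a list that may contain `None` and checks whether a concrete (dynamic) shape fits it. The helper `is_compatible` is hypothetical and only mimics the idea; it is not a TensorFlow function.

```python
def is_compatible(static_shape, dynamic_shape):
    """Return True if a concrete shape fits a static shape that may contain None."""
    if len(static_shape) != len(dynamic_shape):
        return False
    return all(s is None or s == d for s, d in zip(static_shape, dynamic_shape))


static_shape = [None, 28, 28, 1]  # batch size left undefined, as a placeholder allows

print(is_compatible(static_shape, [10, 28, 28, 1]))  # a batch of 10 images
print(is_compatible(static_shape, [10, 784]))        # a flat batch has a different rank
```

A `None` entry accepts any size at run time, while every fully defined entry has to match exactly.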

## Variable sharing

Using `variable_scope` and `get_variable` is referred to in TF as variable sharing. We use them to define our reusable layers because we don't have to explicitly return and store the variables used in the layers, and the variables get meaningful names in the graph.

In [None]:
image = tf.placeholder(tf.float32, [None, 28, 28, 1], name='image')
gt_label = tf.placeholder(tf.float32, [None, 10], name='gt_label')

We now want `image` to be a batch of 28x28 images with one (luminance) channel.

In [None]:
hidden_layer_1 = convolutional_layer(image, 4, 'conv_1')

This creates two variables: `conv_1/kernel` for our kernel and `conv_1/bias` for the bias. If we need them later, we can get them in a similar way (with `variable_scope` and `get_variable`), but we must pass `reuse=True` or `reuse=tf.AUTO_REUSE` to `variable_scope`.

In [None]:
hidden_layer_2 = convolutional_layer(hidden_layer_1, 8, 'conv_1')

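The cell above raises a `ValueError`, because the variables in the `conv_1` scope already exist and we did not ask to reuse them. That reuse mechanism is a good way to catch copy/paste errors when we forget to rename our layer scopes.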
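To make the rule concrete, here is a toy model of it in plain Python. The `VariableStore` class is hypothetical and heavily simplified (it stores only shapes and does not model `tf.AUTO_REUSE`); it is meant to show the create-or-reuse decision, not how TF implements it.

```python
class VariableStore:
    """Toy stand-in for a variable store; names map to shapes only."""

    def __init__(self):
        self._variables = {}

    def get_variable(self, scope, name, shape, reuse=False):
        full_name = "{}/{}".format(scope, name)
        if full_name in self._variables:
            if not reuse:
                raise ValueError("Variable {} already exists".format(full_name))
            if self._variables[full_name] != list(shape):
                raise ValueError("Shape mismatch for {}".format(full_name))
            return full_name
        if reuse:
            raise ValueError("Variable {} does not exist".format(full_name))
        self._variables[full_name] = list(shape)
        return full_name


store = VariableStore()
store.get_variable("conv_1", "kernel", [3, 3, 1, 4])  # first use creates it

try:
    store.get_variable("conv_1", "kernel", [3, 3, 4, 8])  # forgotten rename
except ValueError as error:
    print("rejected:", error)

# Asking for the existing variable explicitly is allowed.
print(store.get_variable("conv_1", "kernel", [3, 3, 1, 4], reuse=True))
```

Creating a name twice without asking for reuse is treated as a mistake, which is what protects us from a forgotten rename.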

In [None]:
hidden_layer_2 = convolutional_layer(hidden_layer_1, 8, 'conv_2')
print(hidden_layer_2.get_shape())

In [None]:
hidden_layer_3 = convolutional_layer(hidden_layer_2, 16, 'conv_3')
print(hidden_layer_3.get_shape())

In [None]:
max_pooled_layer_3 = tf.nn.max_pool(hidden_layer_3, 
                                    ksize=[1, 2, 2, 1], # how big are our pooling windows (kernels)
                                    strides=[1, 2, 2, 1], # has the same meaning as tf.nn.conv2d
                                    padding='VALID', # has the same meaning as tf.nn.conv2d
                                   )
print(max_pooled_layer_3.get_shape())
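The printed shapes can be checked by hand. The sketch below applies the usual output-size formula for `'VALID'` padding to one spatial axis, using the window sizes and strides from the cells above; `valid_output_size` is a helper name made up for this illustration.

```python
def valid_output_size(input_size, window_size, stride):
    """Output length along one axis for 'VALID' padding (no padding is added)."""
    return (input_size - window_size) // stride + 1


size = 28  # height (and width) of an MNIST image
for layer_name in ["conv_1", "conv_2", "conv_3"]:
    size = valid_output_size(size, window_size=3, stride=2)
    print(layer_name, size)

size = valid_output_size(size, window_size=2, stride=2)
print("max_pool", size)
```

After pooling, each image is reduced to a single spatial position with 16 channels, which is why it can be flattened into 16 values per image.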

In [None]:
flattened_hidden_layer_3 = tf.reshape(max_pooled_layer_3, [-1, 16])

In [None]:
logits = tf.layers.dense(flattened_hidden_layer_3, 10, name='final_layer')

The dense layer is the same as the linear model used in the previous notebook. `tf.layers.dense` has a lot of options, but the only difference that matters here is that it's defined as a layer with `variable_scope` and `get_variable`. Implementing a linear layer using variable sharing is left as an exercise for the reader.
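For reference, the computation behind a linear layer is a matrix product plus a bias. The NumPy sketch below shows only that arithmetic on made-up values; the `dense` function here is an illustration, not `tf.layers.dense`, and it leaves out variable creation entirely.

```python
import numpy as np


def dense(inputs, weights, bias):
    """Linear layer: one weighted sum per output unit, plus a bias."""
    return inputs @ weights + bias


inputs = np.ones([2, 16], dtype=np.float32)         # two made-up flattened feature vectors
weights = np.full([16, 10], 0.5, dtype=np.float32)  # made-up weights
bias = np.arange(10, dtype=np.float32)              # made-up biases

outputs = dense(inputs, weights, bias)
print(outputs.shape)  # one row of 10 logits per input row
```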

In [None]:
predicted_label_probs = tf.nn.softmax(logits)
predicted_label = tf.argmax(predicted_label_probs, axis=-1)
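What these two operations compute for a single example can be sketched in plain Python. The logits below are made up, and the `softmax` function is a minimal illustration rather than TF's implementation.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits_example = [2.0, 1.0, 0.1]     # made-up logits for three classes
probs = softmax(logits_example)
predicted = probs.index(max(probs))  # position of the most probable class
print(probs, predicted)
```

Softmax keeps the ordering of the logits, so the predicted class is also the position of the largest logit.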

In [None]:
loss = tf.nn.softmax_cross_entropy_with_logits_v2(
    labels=gt_label, 
    logits=logits,
)
loss = tf.reduce_mean(loss) # since we want the average loss
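The loss takes logits rather than probabilities so that the logarithm of the softmax can be computed in a numerically stable way. A minimal pure-Python sketch of the per-example value and the batch average follows, with made-up labels and logits; it is not TF's implementation.

```python
import math


def softmax_cross_entropy(labels, logits):
    """Cross-entropy between a label distribution and softmax(logits) for one example."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
    log_probs = [v - log_sum_exp for v in logits]
    return -sum(label * log_prob for label, log_prob in zip(labels, log_probs))


one_hot = [0.0, 1.0, 0.0]
per_example = [
    softmax_cross_entropy(one_hot, [0.0, 5.0, 0.0]),  # confident and correct: small loss
    softmax_cross_entropy(one_hot, [5.0, 0.0, 0.0]),  # confident and wrong: large loss
]
mean_loss = sum(per_example) / len(per_example)  # the batch average
print(per_example, mean_loss)
```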

In [None]:
global_step = tf.Variable(0, trainable=False, dtype=tf.int32, name='global_step')
optimizer = tf.train.AdamOptimizer(1e-3)
train_op = optimizer.minimize(loss, global_step=global_step)

`AdamOptimizer` implements a variant of stochastic gradient descent that behaves better in various contexts (sparse variables, saddle points, etc.) and is known for fast convergence. Sadly, even though it's advertised as not needing learning rate tuning, it's still sensitive to the learning rate.
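To see why the learning rate still matters, it helps to look at a single update. The sketch below follows the update rule from the Adam paper for one scalar parameter; the function name `adam_step` is made up, and TensorFlow's implementation may differ in details such as where epsilon is applied.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# On the first step the update size is close to lr, whatever the gradient scale.
for grad in [0.01, 100.0]:
    new_param, _, _ = adam_step(param=1.0, grad=grad, m=0.0, v=0.0, t=1)
    print(grad, 1.0 - new_param)
```

Because the gradient is divided by its own running magnitude, the step size is governed mostly by `lr`, so a poorly chosen learning rate is not compensated for by the gradient scale.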

In [None]:
gt_label_index = tf.argmax(gt_label, axis=-1)
gt_match = tf.equal(predicted_label, gt_label_index)
accuracy = tf.reduce_mean(tf.cast(gt_match, tf.float32), axis=0)

In [None]:
init_op = tf.global_variables_initializer()

In [None]:
summary_writer = tf.summary.FileWriter('summaries_logdir/')

In [None]:
summary_writer.add_graph(session.graph)

In [None]:
tf.summary.scalar('loss', loss)
tf.summary.scalar('accuracy', accuracy)

In [None]:
session.run(init_op)

In [None]:
session.run(predicted_label_probs, {image: mnist.train.images[0:1]})

Since our model expects 28x28 images with 1 channel, we have to reshape our input: the cell above fails because the MNIST images are stored as flat vectors. If our images were of different sizes, we would have to resize them or make our model handle images of different sizes.

In [None]:
session.run(predicted_label_probs, {image: mnist.train.images[0:1].reshape([-1, 28, 28, 1])})
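The reshape only reinterprets the same numbers with a different shape. A small NumPy sketch with made-up data is shown below, assuming the flat vectors store each image row by row.

```python
import numpy as np

# Stand-in for 3 flat images of 784 values each (made-up data).
flat_batch = np.arange(3 * 784, dtype=np.float32).reshape([3, 784])
reshaped_batch = flat_batch.reshape([-1, 28, 28, 1])  # -1 lets NumPy infer the batch size

print(flat_batch.shape, reshaped_batch.shape)
# Pixel (row 2, column 5) of the first image is element 2 * 28 + 5 of its flat vector.
print(reshaped_batch[0, 2, 5, 0] == flat_batch[0, 2 * 28 + 5])
```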

In [None]:
val_summary_writer = tf.summary.FileWriter('summaries_logdir/validation/')

In [None]:
# Train for one epoch
batch_size = 10
summaries_op = tf.summary.merge_all()
for batch_start in range(0, mnist.train.images.shape[0], batch_size):
    # Get the corresponding batches for images and labels
    image_batch = mnist.train.images[batch_start:batch_start + batch_size].reshape([-1, 28, 28, 1])
    label_batch = mnist.train.labels[batch_start:batch_start + batch_size]
    
    # Execute one step of the model. Note that train_op doesn't have a value, but it **HAS** to be executed
    # in order to train the model
    global_step_value, loss_value, accuracy_value, summaries_value, _ = session.run(
        [global_step, loss, accuracy, summaries_op, train_op],
        {image: image_batch, gt_label: label_batch}
    )
    if global_step_value % 100 == 0:
        # Print the loss and accuracy of the model on the *training* *batch*
        print("{:6}: loss: {}, accuracy:{} ".format(global_step_value, loss_value, accuracy_value))
        summary_writer.add_summary(summaries_value, global_step_value)
    if global_step_value % 1000 == 0:
        # Compute the loss and accuracy on the whole *validation* set.
        # Note 0: usually the validation set is too big to evaluate in a single `run` call
        # Note 1: there is no `train_op` in the tensors we pass to run, so the model doesn't learn
        #   anything from the validation set (that would be bad)
        global_step_value, loss_value, accuracy_value, summaries_value = session.run(
            [global_step, loss, accuracy, summaries_op],
            {image: mnist.validation.images.reshape([-1, 28, 28, 1]), 
             gt_label: mnist.validation.labels}
        )
        val_summary_writer.add_summary(summaries_value, global_step_value)
        print("VAL: {:6}: loss: {}, accuracy:{} ".format(global_step_value, loss_value, accuracy_value))
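The slicing pattern used in the loop can be looked at in isolation. The sketch below lists the index ranges it visits for a made-up data set size; `batch_bounds` is a hypothetical helper, not part of the notebook.

```python
def batch_bounds(num_examples, batch_size):
    """Yield (start, stop) index pairs that cover the data set once, in order."""
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)


bounds = list(batch_bounds(num_examples=25, batch_size=10))
print(bounds)  # the last batch is shorter because 25 is not a multiple of 10
```

Slicing past the end of an array simply returns the remaining elements, which is why the loop above needs no special case for the last batch.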


# Performance exploration


Let's check out the examples that are causing problems.

In [None]:
predicted_label_probs_val, gt_match_val = session.run(
    [predicted_label_probs, gt_match],
    {image: mnist.validation.images.reshape([-1, 28, 28, 1]), gt_label: mnist.validation.labels}
)

In [None]:
np.sum(~gt_match_val)

In [None]:
failed_predicted_label_probs_val = predicted_label_probs_val[~gt_match_val]
failed_images_val = mnist.validation.images[~gt_match_val]
failed_labels_val = mnist.validation.labels[~gt_match_val]
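The `~` operator and the indexing used above are NumPy boolean-mask operations. Here is a minimal sketch on made-up values:

```python
import numpy as np

match = np.array([True, False, True, False])  # made-up per-example correctness flags
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])

num_failed = np.sum(~match)   # ~ flips the flags, so this counts the mistakes
failed_probs = probs[~match]  # keeps only the rows where the flag was False
print(num_failed, failed_probs)
```

Indexing every array with the same mask keeps the rows aligned, so position 0 refers to the same failed example in all of them.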

In [None]:
failed_predicted_label_probs_val[0], failed_labels_val[0]

In [None]:
plt.imshow(failed_images_val[0].reshape([28, 28]))

# Saving our models


To save our model we use `tf.train.Saver`.

In [None]:
saver = tf.train.Saver()

We pass the `save` method a session and a path for saving the model. Usually we also want to pass the `global_step` parameter, so we can save every few thousand steps and still know which checkpoint has been trained for longer.

A big downside is that our saved model consists of 3 types of files:
 - `.data-#####-of-#####` files, which contain the tensor values (currently we have only one such file).
 - an `.index` file, which contains tensor metadata: which tensors we have in our data files, in which file and at which offset, checksums, etc.
 - a `.meta` file, which describes our graph's structure in case we don't want to build it in our source. Since we build the graph in the source, we don't need it (for our use case) and we can disable its generation with `write_meta_graph=False`.

More info can be found in [this SO thread](https://stackoverflow.com/questions/41265035/tensorflow-why-there-are-3-files-after-saving-the-model).

In [None]:
saver.save(session, 'saved_model/model.ckpt', global_step=global_step_value, write_meta_graph=False)
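When a `global_step` is passed, the step number is appended to the path, and the result is the checkpoint prefix that the files share. The sketch below builds the names we would expect to find on disk; `checkpoint_prefix` is a hypothetical helper, the step value is made up, and the single-shard data file name is an assumption that matches our one data file.

```python
def checkpoint_prefix(save_path, global_step):
    """Prefix shared by the files of one checkpoint saved with a global step."""
    return "{}-{}".format(save_path, global_step)


prefix = checkpoint_prefix("saved_model/model.ckpt", 5000)  # 5000 is a made-up step
print(prefix + ".index")
print(prefix + ".data-00000-of-00001")
```

It is this prefix, not the name of any individual file, that identifies the checkpoint when restoring.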

In [None]:
saver.restore(session, '...')