<H1 style="text-align: center;"> TensorFlow Tutorial</H1>
<H2 style="text-align: center;"> Convolutional Neural Networks</H2>
<footer style="text-align: center;"> Kent Yu<br><br>10/21/2016</footer>

## Overview
* **Problem** 
    * Classify RGB 32x32 pixel images across 10 categories
    
* **Solution**
    * Build a relatively small convolutional neural network (CNN) for recognizing images
    * A mini version of the ImageNet classification network by Alex Krizhevsky et al. (University of Toronto, 2012): about 1.06 million vs. 60 million parameters (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-)
    
* **Implication**
    * Provides a template for constructing larger and more sophisticated models
    

## Program Structure

* Creating the model
* Training the model
* Evaluating the model
* Others
 * Preprocessing and preparation, e.g. reading image files, then shuffling, cropping and flipping the images
 * Reporting and visualization
 * Performance tuning, e.g. using multiple GPUs, using async services such as queues
 * Utility functions, e.g. calculating moving averages


## The model graph
![tensorboard graph](./cifar_graph.png "")

## Convolutional Network Illustration
<img src="./convolutional-network-demo.gif" width=700/>



## Conv1 Layer
* **Input: 128 X 24 X 24 X 3**
* **Kernel size: 5 (Height) X 5 (Width) X 3 (Channels) X 64 (# of Kernels)**
* **Stride: 1,1,1,1;  Padding: Same**
* **Activation: ReLU (Rectified Linear Unit)**
* **Output: 128 X 24 X 24 X 64**
* **Technique used: Weight Decay Regularization (to prevent over-fitting); note that `wd=0.0` in the code for this layer makes the decay term zero**

![tensorboard graph](./Conv1.jpg "")

```python
  # conv1
  with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay('weights',
                                         shape=[5, 5, 3, 64],
                                         stddev=5e-2,
                                         wd=0.0)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    bias = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)

```
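To make the layer shapes concrete, here is a minimal pure-Python sketch of what one output channel of a `SAME`-padded, stride-1 convolution followed by a bias add and ReLU computes. It handles a single image with a single input channel; the helper name `conv2d_same_relu` and the toy 3x3 inputs are illustrative assumptions and not part of the CIFAR-10 code. `tf.nn.conv2d` additionally sums over all input channels and runs over the whole batch.

```python
def conv2d_same_relu(image, kernel, bias=0.0):
    """Naive single-channel convolution with zero ('SAME') padding and stride 1,
    followed by a bias add and ReLU. Illustrative sketch only."""
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    pad_top, pad_left = (k_h - 1) // 2, (k_w - 1) // 2
    output = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            acc = bias
            for d_row in range(k_h):
                for d_col in range(k_w):
                    r = row + d_row - pad_top
                    c = col + d_col - pad_left
                    # Positions outside the image count as zeros (SAME padding).
                    if 0 <= r < height and 0 <= c < width:
                        acc += image[r][c] * kernel[d_row][d_col]
            output[row][col] = max(acc, 0.0)  # ReLU
    return output


image = [[1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0]]
box_kernel = [[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]]
feature_map = conv2d_same_relu(image, box_kernel)
# Corners overlap 4 real pixels, edges 6, and the centre all 9.
print(feature_map)  # [[4.0, 6.0, 4.0], [6.0, 9.0, 6.0], [4.0, 6.0, 4.0]]
```

The output has the same height and width as the input, which is why conv1 maps 24 X 24 to 24 X 24.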

## Weight Decay

* Given a cost (error) function E(w) and a learning rate η (eta):
![tensorboard graph](./WeightDecay.jpg "")
* The new term **−ηλw<sub>i</sub>**, which comes from the regularization, makes each weight decay in proportion to its size.
* As a rule of thumb, the more training examples you have, the weaker this term should be.
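
The update rule can be sketched in a few lines of plain Python. This assumes the regularized cost has the common form E(w) + (λ/2)·Σ w<sub>i</sub>², so that the penalty contributes λ·w<sub>i</sub> to the gradient; the function name and the toy numbers are illustrative assumptions.

```python
def sgd_step_with_weight_decay(weights, grads, learning_rate, weight_decay):
    """One gradient-descent step on E(w) + (weight_decay / 2) * sum(w_i ** 2).

    `grads` holds dE/dw_i; the extra term -learning_rate * weight_decay * w_i
    is the decay described above.
    """
    return [
        w - learning_rate * g - learning_rate * weight_decay * w
        for w, g in zip(weights, grads)
    ]


weights = [1.0, -2.0, 0.0]
zero_grads = [0.0, 0.0, 0.0]  # isolate the decay term by zeroing dE/dw
decayed = sgd_step_with_weight_decay(
    weights, zero_grads, learning_rate=0.1, weight_decay=0.5)
# Each weight shrinks by the factor (1 - learning_rate * weight_decay) = 0.95.
print(decayed)
```

Large weights lose more in absolute terms than small ones, which is the "decay in proportion to its size" behaviour.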

## Pool 1 Layer
* **Input: 128 X 24 X 24 X 64**

* **KSize:  1 (Batch) X 3 (Height) X 3 (Width) X 1 (Channel)**
* **Stride: 1 (Batch) X 2 (Height) X 2 (Width) X 1 (Channel)**
* **Type: Max**
* **Padding: Same**

* **Output: 128 X 12 X 12 X 64**
![tensorboard graph](./Pool1.jpg "")

```Python
  # pool1
  pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                         padding='SAME', name='pool1')
```                         
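The spatial sizes quoted for the layers follow from how `SAME` padding works: the input is padded so that the output size depends only on the stride, not on the kernel size. A small sketch of that arithmetic (the helper name is an illustrative assumption):

```python
import math


def same_padding_output_size(input_size, stride):
    """Spatial output size of a conv or pool op with padding='SAME'."""
    return math.ceil(input_size / stride)


conv1_size = same_padding_output_size(24, stride=1)          # conv1 keeps 24
pool1_size = same_padding_output_size(conv1_size, stride=2)  # pool1 halves it
pool2_size = same_padding_output_size(pool1_size, stride=2)  # pool2 halves it again
print(conv1_size, pool1_size, pool2_size)  # 24 12 6
```

The stride-1 conv2 layer between the two pooling layers leaves the 12 X 12 size unchanged.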



## Norm1 Layer

* tf.nn.lrn(**pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm1'**)
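* Purpose: to implement a form of "lateral inhibition", where strongly activated units damp the responses of neighbouring channels
* **According to CS231n, LRN is rarely used in recent architectures**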
> Many types of normalization layers have been proposed for use in ConvNet architectures,
> ...
> ...
> However, these layers have since fallen out of favor because in practice their contribution has been shown to be minimal, if any. 
* What it does
![tensorboard graph](./LRN.jpg "")

```Python
  # norm1
  norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm1')
```
## From ImageNet paper
![tensorboard graph](./LocalReponseNormalization.jpg "")
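
As a rough sketch of what the normalization computes, the snippet below applies it to the channel vector of a single pixel in plain Python. It assumes the formula documented for `tf.nn.lrn`: each value is divided by `(bias + alpha * sqr_sum) ** beta`, where `sqr_sum` is the sum of squares over a window of `depth_radius` channels on either side. The function name and the toy activations are illustrative assumptions.

```python
def lrn_along_depth(channels, depth_radius=4, bias=1.0,
                    alpha=0.001 / 9.0, beta=0.75):
    """Local response normalization over the channel (depth) axis of one pixel."""
    normalized = []
    for d, value in enumerate(channels):
        lo = max(0, d - depth_radius)
        hi = min(len(channels), d + depth_radius + 1)
        # Sum of squares over the neighbouring channels, including channel d.
        sqr_sum = sum(c * c for c in channels[lo:hi])
        normalized.append(value / (bias + alpha * sqr_sum) ** beta)
    return normalized


activations = [0.0, 1.0, 50.0, 1.0, 0.0]  # one very strong channel
print(lrn_along_depth(activations))
```

With `bias=1.0` the divisor is never below 1, so every activation is scaled down, and channels that sit next to a strong response are scaled down more than isolated ones.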

## Conv2 Layer
* **Input: 128 X 12 X 12 X 64**
* **Kernel size: 5 (Height) X 5 (Width) X 64 (Channel) X 64 (# of Kernels)**
* **Activation: ReLU**
* **Output: 128 X 12 X 12 X 64**
![tensorboard graph](./Conv2.jpg "")

```Python
  # conv2
  with tf.variable_scope('conv2') as scope:
    kernel = _variable_with_weight_decay('weights',
                                         shape=[5, 5, 64, 64],
                                         stddev=5e-2,
                                         wd=0.0)
    conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
    bias = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv2)
```    

## Norm2 Layer

* tf.nn.lrn(**conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm2'**)

```Python
  # norm2
  norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm2')
```               

## Pool 2 Layer
* **Input: 128 X 12 X 12 X 64**

* **KSize:  1 (Batch) X 3 (Height) X 3 (Width) X 1 (Channel)**
* **Stride: 1 (Batch) X 2 (Height) X 2 (Width) X 1 (Channel)**
* **Type: Max**
* **Padding: Same**

* **Output: 128 X 6 X 6 X 64**
![tensorboard graph](./Pool2.jpg "")

```Python
  # pool2
  pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1],
                         strides=[1, 2, 2, 1], padding='SAME', name='pool2')
```                    




## Local 3 Layer

* **Input: 128 X 6 X 6 X 64 (flattened to 128 X 2304)**
* **Number of Neurons: 384**
* **Output: 128 X 384**
![tensorboard graph](./Local3.jpg "")

```Python
  # local3
  with tf.variable_scope('local3') as scope:
    # Move everything into depth so we can perform a single matrix multiply.
    reshape = tf.reshape(pool2, [FLAGS.batch_size, -1])
    dim = reshape.get_shape()[1].value
    weights = _variable_with_weight_decay('weights', shape=[dim, 384],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [384], tf.constant_initializer(0.1))
    local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name)
    _activation_summary(local3)
```    


## Local 4 Layer

* **Input: 128 X 384**
* **Number of Neurons: 192**
* **Output: 128 X 192**
![tensorboard graph](./Local4.jpg "")
```Python
  # local4
  with tf.variable_scope('local4') as scope:
    weights = _variable_with_weight_decay('weights', shape=[384, 192],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [192], tf.constant_initializer(0.1))
    local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name=scope.name)
    _activation_summary(local4)
```    
    

## Output Layer (Softmax Linear)
* **Input: 128 X 192**
* **Number of classes: 10**
* **Output: 128 X 10 (unnormalized logits, not yet probabilities)**
![tensorboard graph](./Softmax.jpg "")
```Python
  # linear layer (WX + b); softmax is not applied here
  with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)
```    
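With every weighted layer now defined, the parameter count quoted in the overview can be checked by hand. The sketch below multiplies out the weight shapes from the layer definitions in this tutorial and adds the biases; the dictionary and helper names are illustrative assumptions.

```python
# (weight shape, number of biases) for each layer, read off the layer code.
layers = {
    'conv1': ((5, 5, 3, 64), 64),
    'conv2': ((5, 5, 64, 64), 64),
    'local3': ((6 * 6 * 64, 384), 384),  # pool2 output flattened to 2304 values
    'local4': ((384, 192), 192),
    'softmax_linear': ((192, 10), 10),
}


def count_parameters(weight_shape, num_biases):
    """Number of trainable values in one layer: product of the shape plus biases."""
    total = 1
    for dim in weight_shape:
        total *= dim
    return total + num_biases


per_layer = {name: count_parameters(*spec) for name, spec in layers.items()}
total_parameters = sum(per_layer.values())
for name, count in per_layer.items():
    print(name, count)
print('total', total_parameters)  # total 1068298
```

The total is about 1.07 million, in line with the roughly 1.06 million quoted in the overview, and most of it (885,120 parameters) sits in the first fully connected layer, `local3`.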

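### **Note: No tf.nn.softmax is applied**

* The layer returns raw logits; `tf.nn.sparse_softmax_cross_entropy_with_logits` in the loss function applies the softmax internally.
* **Compare this with the output layer in the Deep MNIST for Experts tutorial, which calls `tf.nn.softmax` explicitly**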
```Python
    W_fc2 = weight_variable([1024, 10])
    b_fc2 = bias_variable([10])

    y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
```
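Leaving the softmax out of the model does not change which class is predicted, because softmax preserves the ordering of the logits. A minimal pure-Python sketch (the `softmax` helper and the toy logits are illustrative assumptions):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.5, 1.5, 0.1]  # toy logits for one example and three classes
probs = softmax(logits)
predicted_from_logits = logits.index(max(logits))
predicted_from_probs = probs.index(max(probs))
print(probs)
print(predicted_from_logits, predicted_from_probs)  # 1 1
```

The probabilities are only needed when the actual confidence values matter; for picking the top class, the logits are enough.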

## Loss Function

```Python
def loss(logits, labels):
  """Add L2Loss to all the trainable variables.

  Add summary for "Loss" and "Loss/avg".
  Args:
    logits: Logits from inference().
    labels: Labels from distorted_inputs or inputs(). 1-D tensor
            of shape [batch_size]

  Returns:
    Loss tensor of type float.
  """
  # Calculate the average cross entropy loss across the batch.
  labels = tf.cast(labels, tf.int64)
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
      logits=logits, labels=labels, name='cross_entropy_per_example')
  cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
  tf.add_to_collection('losses', cross_entropy_mean)

  # The total loss is defined as the cross entropy loss plus all of the weight
  # decay terms (L2 loss).
  return tf.add_n(tf.get_collection('losses'), name='total_loss')
```  
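The way the total loss is assembled can be sketched without TensorFlow. The snippet assumes that `_variable_with_weight_decay` adds `wd * tf.nn.l2_loss(var)` to the `'losses'` collection, which its name and the comment in `loss()` suggest but which this document does not show; `tf.nn.l2_loss` computes half the sum of squares. The helper names and `toy_weights` are hypothetical.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross entropy of one example from raw logits and an integer class label."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[label]  # equals -log(softmax(logits)[label])


def l2_loss(values):
    """Half the sum of squares, the convention used by tf.nn.l2_loss."""
    return sum(v * v for v in values) / 2.0


batch_logits = [[0.5, 1.5, 0.1], [2.2, 1.3, 1.7]]
batch_labels = [1, 2]
per_example = [sparse_softmax_cross_entropy(row, label)
               for row, label in zip(batch_logits, batch_labels)]
cross_entropy_mean = sum(per_example) / len(per_example)

# Hypothetical weights of a single layer created with wd=0.004.
toy_weights = [0.5, -1.0, 2.0]
weight_decay_term = 0.004 * l2_loss(toy_weights)

# Counterpart of tf.add_n(tf.get_collection('losses')).
total_loss = cross_entropy_mean + weight_decay_term
print(per_example, cross_entropy_mean, total_loss)
```

In the real model there is one weight decay term per layer that was created with a non-zero `wd`, and all of them are summed together with the mean cross entropy.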



### Reduce Mean

`reduce_mean(input_tensor, reduction_indices=None, keep_dims=False, name=None)`

* `input_tensor`: The tensor to reduce. Should have numeric type.
* `reduction_indices`: The dimensions to reduce. If `None` (the default), reduces all dimensions.


```Python
import tensorflow as tf
import numpy as np

xarray = np.array([[1., 1.],
                   [2., 2.]])
x = tf.convert_to_tensor(xarray)
with tf.Session():
    print(tf.reduce_mean(x).eval())     # 1.5 (mean over all elements)
    print(tf.reduce_mean(x, 0).eval())  # [1.5, 1.5] (mean down each column)
    print(tf.reduce_mean(x, 1).eval())  # [1., 2.] (mean along each row)
```

Output:

```
1.5
[ 1.5  1.5]
[ 1.  2.]
```


```Python
# Exercise to see how tf.nn.softmax_cross_entropy_with_logits
# and tf.nn.sparse_softmax_cross_entropy_with_logits work

sess = tf.Session()

# Assume y_hat is the calculated result from the tensor graph (the logits)
y_hat_array = np.array([[0.5, 1.5, 0.1], [2.2, 1.3, 1.7]])
y_hat = tf.convert_to_tensor(y_hat_array)
print("y_hat\n", sess.run(y_hat), '\n')

# Softmax:
# For each example `i` in the batch and class `j` we have
#      softmax[i, j] = exp(logits[i, j]) / sum_k(exp(logits[i, k]))
y_hat_softmax = tf.nn.softmax(y_hat)
print("y_hat_softmax\n", sess.run(y_hat_softmax))

# The same values computed by hand with NumPy
e = np.exp(y_hat_array)
print("y_hat_softmax_2\n", e[0] / np.sum(e, axis=1)[0], '\n',
      e[1] / np.sum(e, axis=1)[1], '\n')
```

Output:

```
y_hat
 [[ 0.5  1.5  0.1]
 [ 2.2  1.3  1.7]] 

y_hat_softmax
 [[ 0.227863    0.61939586  0.15274114]
 [ 0.49674623  0.20196195  0.30129182]]
y_hat_softmax_2
 [ 0.227863    0.61939586  0.15274114] 
 [ 0.49674623  0.20196195  0.30129182] 
```


```Python
# One-hot labels: example 0 belongs to class 1, example 1 to class 2
y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))

loss_per_instance_1 = tf.nn.softmax_cross_entropy_with_logits(
    logits=y_hat, labels=y_true)
print("loss_per_instance_1=", sess.run(loss_per_instance_1))
# array([ 0.4790107 ,  1.19967598])

total_loss_1 = tf.reduce_mean(loss_per_instance_1)
print("loss_total_1=", sess.run(total_loss_1), '\n')
# 0.83934333897877922
```

Output:

```
loss_per_instance_1= [ 0.4790107   1.19967598]
loss_total_1= 0.839343338979 
```



```Python
# The same cross entropy written out by hand: -sum(y_true * log(softmax(y_hat)))
loss_per_instance_2 = -tf.reduce_sum(y_true * tf.log(y_hat_softmax),
                                     reduction_indices=[1])
print("loss_per_instance_2=", sess.run(loss_per_instance_2))

total_loss_2 = tf.reduce_mean(loss_per_instance_2)
print("total_loss_2=", sess.run(total_loss_2), '\n')
```

Output:

```
loss_per_instance_2= [ 0.4790107   1.19967598]
total_loss_2= 0.839343338979 
```



```Python
# Sparse variant: labels are class indices instead of one-hot vectors
y_true_2 = tf.convert_to_tensor(np.array([1, 2]).astype(np.int64))
loss_per_instance_3 = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=y_hat, labels=y_true_2)
print("loss_per_instance_3=", sess.run(loss_per_instance_3))

total_loss_3 = tf.reduce_mean(loss_per_instance_3)
print("loss_total_3=", sess.run(total_loss_3))
# 0.83934333897877922
```

Output:

```
loss_per_instance_3= [ 0.4790107   1.19967598]
loss_total_3= 0.839343338979
```


## Training Steps in MNIST
```Python
def training(loss, learning_rate):
  # Create the gradient descent optimizer with the given learning rate.
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  # Create a variable to track the global step.
  global_step = tf.Variable(0, name='global_step', trainable=False)
  # Use the optimizer to apply the gradients that minimize the loss
  # (and also increment the global step counter) as a single training step.
  train_op = optimizer.minimize(loss, global_step=global_step)
  return train_op
```

## Training Operation

```Python
def train(total_loss, global_step):
  """Train CIFAR-10 model.

  Create an optimizer and apply to all trainable variables. Add moving
  average for all trainable variables.

  Args:
    total_loss: Total loss from loss().
    global_step: Integer Variable counting the number of training steps
      processed.
  Returns:
    train_op: op for training.
  """
  # Variables that affect learning rate.
  num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size
  decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)

  # Decay the learning rate exponentially based on the number of steps.
  lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                  global_step,
                                  decay_steps,
                                  LEARNING_RATE_DECAY_FACTOR,
                                  staircase=True)
  tf.scalar_summary('learning_rate', lr)

  # Generate moving averages of all losses and associated summaries.
  loss_averages_op = _add_loss_summaries(total_loss)

  # Compute gradients.
  with tf.control_dependencies([loss_averages_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(total_loss)

  # Apply gradients.
  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

  # Add histograms for trainable variables.
  for var in tf.trainable_variables():
    tf.histogram_summary(var.op.name, var)

  # Add histograms for gradients.
  for grad, var in grads:
    if grad is not None:
      tf.histogram_summary(var.op.name + '/gradients', grad)

  # Track the moving averages of all trainable variables.
  variable_averages = tf.train.ExponentialMovingAverage(
      MOVING_AVERAGE_DECAY, global_step)
  variables_averages_op = variable_averages.apply(tf.trainable_variables())

  with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
    train_op = tf.no_op(name='train')

  return train_op
```  
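Two pieces of arithmetic hide behind `tf.train.exponential_decay` and `tf.train.ExponentialMovingAverage`. The sketch below reproduces them in plain Python, assuming the documented formulas: with `staircase=True` the learning rate is multiplied by the decay factor once every `decay_steps` steps, and when a step counter is passed to the moving average the effective decay is `min(decay, (1 + num_updates) / (10 + num_updates))`. The function names and all numeric values are illustrative assumptions, not the constants used by the CIFAR-10 code.

```python
def staircase_exponential_decay(initial_lr, global_step, decay_steps, decay_factor):
    """Learning rate after `global_step` steps with staircase=True."""
    return initial_lr * decay_factor ** (global_step // decay_steps)


def moving_average_update(shadow, value, decay, num_updates):
    """One exponential-moving-average update of a shadow variable."""
    # Early in training the effective decay is lowered so the average
    # catches up with the variable more quickly.
    effective_decay = min(decay, (1.0 + num_updates) / (10.0 + num_updates))
    return effective_decay * shadow + (1.0 - effective_decay) * value


# Illustrative values only: drop the rate by 10x every 1000 steps.
for step in (0, 999, 1000, 2000):
    print(step, staircase_exponential_decay(0.1, step, 1000, 0.1))

# At step 0 the effective decay is 0.1, so the average moves almost all the way.
print(moving_average_update(shadow=0.0, value=1.0, decay=0.9999, num_updates=0))
```

The averaged variables are typically the ones restored at evaluation time, since they tend to be less noisy than the raw weights.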

## Initialization and Training Step

```Python
def train():
  """Train CIFAR-10 for a number of steps."""
  with tf.Graph().as_default():
    global_step = tf.Variable(0, trainable=False)

    # Get images and labels for CIFAR-10.
    # The distorted_inputs function randomly crops and flips the images
    # and randomly changes their brightness and contrast.
    images, labels = cifar10.distorted_inputs()

    # Build a Graph that computes the logits predictions from the
    # inference model.
    logits = cifar10.inference(images)

    # Calculate loss.
    loss = cifar10.loss(logits, labels)

    # Build a Graph that trains the model with one batch of examples and
    # updates the model parameters.
    train_op = cifar10.train(loss, global_step)

    # Create a saver.
    saver = tf.train.Saver(tf.all_variables())

    # Build the summary operation based on the TF collection of Summaries.
    summary_op = tf.merge_all_summaries()

    # Build an initialization operation to run below.
    init = tf.initialize_all_variables()

    # Start running operations on the Graph.
    sess = tf.Session(config=tf.ConfigProto(
        log_device_placement=FLAGS.log_device_placement))
    sess.run(init)

    # Start the queue runners.
    tf.train.start_queue_runners(sess=sess)

    summary_writer = tf.train.SummaryWriter(FLAGS.train_dir, sess.graph)

    for step in range(FLAGS.max_steps):
      _, loss_value = sess.run([train_op, loss])  # Training
      # ...
      # ...
```

## Feeding with batch
> **Reading data**
>
> There are three main methods of getting data into a TensorFlow program:

> * Feeding: Python code provides the data when running each step.
> * Reading from files: an input pipeline reads the data from files at the beginning of a TensorFlow graph.
> * Preloaded data: a constant or variable in the TensorFlow graph holds all the data (for small data sets).


### Inside cifar10.distorted_inputs()

```Python
def _generate_image_and_label_batch(image, label, min_queue_examples,
                                    batch_size, shuffle):
  """Construct a queued batch of images and labels.

  Args:
    image: 3-D Tensor of [height, width, 3] of type.float32.
    label: 1-D Tensor of type.int32
    min_queue_examples: int32, minimum number of samples to retain
      in the queue that provides batches of examples.
    batch_size: Number of images per batch.
    shuffle: boolean indicating whether to use a shuffling queue.

  Returns:
    images: Images. 4D tensor of [batch_size, height, width, 3] size.
    labels: Labels. 1D tensor of [batch_size] size.
  """
  # Create a queue that shuffles the examples, and then
  # read 'batch_size' images + labels from the example queue.
  num_preprocess_threads = 16
  if shuffle:
    images, label_batch = tf.train.shuffle_batch(
        [image, label],
        batch_size=batch_size,
        num_threads=num_preprocess_threads,
        capacity=min_queue_examples + 3 * batch_size,
        min_after_dequeue=min_queue_examples)
  else:
    images, label_batch = tf.train.batch(
        [image, label],
        batch_size=batch_size,
        num_threads=num_preprocess_threads,
        capacity=min_queue_examples + 3 * batch_size)

  # Display the training images in the visualizer.
  tf.image_summary('images', images)

  return images, tf.reshape(label_batch, [batch_size])

```
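The roles of `capacity` and `min_after_dequeue` are easier to see in a simplified, single-threaded simulation of a shuffling queue. This is only a sketch of the idea and not how `tf.train.shuffle_batch` is implemented: the real op is filled concurrently by several threads and its input pipeline typically cycles through the data repeatedly. The function name and the toy sizes are assumptions.

```python
import random


def shuffle_batches(examples, batch_size, capacity, min_after_dequeue, seed=0):
    """Yield shuffled batches drawn from a bounded buffer (simplified model)."""
    if capacity < min_after_dequeue + batch_size:
        raise ValueError('capacity must cover min_after_dequeue plus one batch')
    rng = random.Random(seed)
    source = iter(examples)
    buffer = []
    exhausted = False
    while True:
        # Refill the buffer up to `capacity` while input is available.
        while not exhausted and len(buffer) < capacity:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        if len(buffer) < batch_size:
            return  # a final partial batch is dropped in this sketch
        # Draw random elements; while input remains, at least
        # `min_after_dequeue` elements stay behind to keep the mixing good.
        yield [buffer.pop(rng.randrange(len(buffer))) for _ in range(batch_size)]


batches = list(shuffle_batches(range(10), batch_size=2,
                               capacity=6, min_after_dequeue=4))
print(batches)
```

Because the first batch can only be drawn from the first `capacity` examples, a larger `min_queue_examples` gives better shuffling at the cost of more memory and a longer start-up time.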