# Batch Normalization – lab
[from Udacity's DLND](https://www.udacity.com/course/deep-learning-nanodegree--nd101)

* MNIST classifier 
* Neural network with 19 convolutional layers
* Fully connected layer

Although this is not a good architecture for an MNIST classifier, it is a good example for illustrating batch normalization because it is:

1. Complicated enough that training would benefit from batch normalization.
2. Simple enough that it would train quickly, since this is meant to be a short exercise just to give you some practice adding batch normalization.
3. Simple enough that the architecture would be easy to understand without additional resources.

More about batch normalization: https://padlet.com/nvmoyar/9dfpaqb93c4e

This notebook includes two versions of the network: 

1. [Batch Normalization with `tf.layers.batch_normalization`](#example_1) --> higher abstraction level package
2. [Batch Normalization with `tf.nn.batch_normalization`](#example_2) --> lower abstraction level package

The following cell loads TensorFlow, downloads the MNIST dataset if necessary, and loads it into an object named `mnist`. You'll need to run this cell before running anything else in the notebook.

In [13]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True, reshape=False)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


## Model without batch normalization

In [2]:
def fully_connected(prev_layer, num_units):
    """
    Create a fully connected layer with the given layer as input and the given number of neurons.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param num_units: int
        The size of the layer. That is, the number of units, nodes, or neurons.
    :returns Tensor
        A new fully connected layer
    """
    layer = tf.layers.dense(prev_layer, num_units, activation=tf.nn.relu)
    return layer

We'll use the following function to create convolutional layers in our network. They are very basic: 

* 3x3 kernel
* ReLU activation functions
* strides of 2x2 on layers whose depth is a multiple of 3
* strides of 1x1 on all other layers

We aren't bothering with pooling layers at all in this network.
This version of the function does not include batch normalization.

In [3]:
def conv_layer(prev_layer, layer_depth):
    """
    Create a convolutional layer with the given layer as input.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param layer_depth: int
        We'll set the strides and number of feature maps based on the layer's depth in the network.
        This is *not* a good way to make a CNN, but it helps us create this example with very little code.
    :returns Tensor
        A new convolutional layer
    """
    # If the depth is divisible by 3, use a stride of 2 (halving the image size); otherwise use a stride of 1 (same size)
    strides = 2 if layer_depth % 3 == 0 else 1
    conv_layer = tf.layers.conv2d(prev_layer, layer_depth*4, 3, strides, 'same', activation=tf.nn.relu)
    return conv_layer

**Run the following cell**, along with the earlier cells (to load the dataset and define the necessary functions). 

This cell builds the network **without** batch normalization, then trains it on the MNIST dataset. It displays loss and accuracy data periodically while training.

### Shapes

The network stacks 19 convolutional layers (`range(1, 20)`), and each layer's number of feature maps is set to `layer_i*4`, so the shapes change automatically. The spatial size depends on the strides chosen in `conv_layer()`: when the depth is divisible by 3 the stride is 2, so every third layer halves the height and width (rounding up).

So the shapes will be:

* 1 (?, 28, 28, 4)
* 2 (?, 28, 28, 8)
* 3 (?, 14, 14, 12)
* 4 (?, 14, 14, 16)
* 5 (?, 14, 14, 20)
* 6 (?, 7, 7, 24)
* 7 (?, 7, 7, 28)
* 8 (?, 7, 7, 32)
* 9 (?, 4, 4, 36)
* 10 (?, 4, 4, 40)
* 11 (?, 4, 4, 44)
* 12 (?, 2, 2, 48)
* 13 (?, 2, 2, 52)
* 14 (?, 2, 2, 56)
* 15 (?, 1, 1, 60)
* 16 (?, 1, 1, 64)
* 17 (?, 1, 1, 68)
* 18 (?, 1, 1, 72)
* 19 (?, 1, 1, 76)
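
The list above can be reproduced without TensorFlow. The following is a minimal sketch, assuming that `'same'` padding produces an output size of `ceil(input_size / stride)`; the helper name `conv_output_shapes` is hypothetical and not part of the notebook.

```python
import math

def conv_output_shapes(input_size=28, num_layers=19):
    """Return (height, width, channels) after each conv layer of this lab's network."""
    shapes = []
    size = input_size
    for layer_i in range(1, num_layers + 1):
        # Same rule as conv_layer(): stride 2 when the depth index is a multiple of 3
        stride = 2 if layer_i % 3 == 0 else 1
        # Assumption: with 'same' padding the spatial size becomes ceil(size / stride)
        size = math.ceil(size / stride)
        shapes.append((size, size, layer_i * 4))
    return shapes

shapes = conv_output_shapes()
for layer_i, shape in enumerate(shapes, start=1):
    print(layer_i, shape)
```

The rounding up explains why a 7x7 input becomes 4x4 at layer 9, and why the size stays at 1x1 from layer 15 onwards.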

In [8]:
def train(num_batches, batch_size, learning_rate):
    
    # BUILD THE MODEL #########################################
    
    # Build placeholders for the input samples and labels 
    inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.float32, [None, 10])
    
    # Feed the inputs into a series of 19 convolutional layers (range(1, 20))
    
    layer = inputs  # inputs  (?, 28, 28, 1)
    for layer_i in range(1, 20):
        layer = conv_layer(layer, layer_i) # we use layer_i to apply depths automatically
                  
    # Flatten the output from the convolutional layers 
    orig_shape = layer.get_shape().as_list() # [None, 1, 1, 76]
    layer = tf.reshape(layer, shape=[-1, orig_shape[1] * orig_shape[2] * orig_shape[3]]) # after reshape  (?, 76)

    # Add one fully connected layer
    layer = fully_connected(layer, 100) # (?, 100)

    # Create the output layer with 1 node for each of the 10 digit classes
    logits = tf.layers.dense(layer, 10) # shape=(?, 10)
    
    
    # Define loss and training operations
    model_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))
    train_opt = tf.train.AdamOptimizer(learning_rate).minimize(model_loss)
    
    # Create operations to test accuracy
    correct_prediction = tf.equal(tf.argmax(logits,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # END OF THE MODEL ###############################################
    
    # TRAIN AND TEST THE MODEL #######################################
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for batch_i in range(num_batches): # now we define batches 
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

            # train this batch
            sess.run(train_opt, {inputs: batch_xs, labels: batch_ys})
            
            # Periodically check the validation or training loss and accuracy
            if batch_i % 100 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: mnist.validation.images,
                                                              labels: mnist.validation.labels})
                print('Batch: {:>2}: Validation loss: {:>3.5f}, Validation accuracy: {:>3.5f}'.format(batch_i, loss, acc))
            elif batch_i % 25 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: batch_xs, labels: batch_ys})
                print('Batch: {:>2}: Training loss: {:>3.5f}, Training accuracy: {:>3.5f}'.format(batch_i, loss, acc))

        # At the end, score the final accuracy for both the validation and test sets
        acc = sess.run(accuracy, {inputs: mnist.validation.images,
                                  labels: mnist.validation.labels})
        print('Final validation accuracy: {:>3.5f}'.format(acc))
        acc = sess.run(accuracy, {inputs: mnist.test.images,
                                  labels: mnist.test.labels})
        print('Final test accuracy: {:>3.5f}'.format(acc))
        
        # Score the first 100 test images individually. This won't work if batch normalization isn't implemented correctly.
        correct = 0
        for i in range(100):
            correct += sess.run(accuracy,feed_dict={inputs: [mnist.test.images[i]],
                                                    labels: [mnist.test.labels[i]]})

        print("Accuracy on 100 samples:", correct/100)


num_batches = 800
batch_size = 64
learning_rate = 0.002

tf.reset_default_graph()
with tf.Graph().as_default():
    train(num_batches, batch_size, learning_rate)

Batch:  0: Validation loss: 0.69137, Validation accuracy: 0.09580
Batch: 25: Training loss: 0.33708, Training accuracy: 0.09375
Batch: 50: Training loss: 0.32516, Training accuracy: 0.04688
Batch: 75: Training loss: 0.32335, Training accuracy: 0.26562
Batch: 100: Validation loss: 0.32559, Validation accuracy: 0.11260
Batch: 125: Training loss: 0.32436, Training accuracy: 0.14062
Batch: 150: Training loss: 0.32702, Training accuracy: 0.01562
Batch: 175: Training loss: 0.32555, Training accuracy: 0.10938
Batch: 200: Validation loss: 0.32592, Validation accuracy: 0.11000
Batch: 225: Training loss: 0.32514, Training accuracy: 0.15625
Batch: 250: Training loss: 0.32563, Training accuracy: 0.09375
Batch: 275: Training loss: 0.32642, Training accuracy: 0.07812
Batch: 300: Validation loss: 0.32616, Validation accuracy: 0.11000
Batch: 325: Training loss: 0.32585, Training accuracy: 0.09375
Batch: 350: Training loss: 0.32257, Training accuracy: 0.15625
Batch: 375: Training loss: 0.32602, Trainin

* Final validation accuracy: 0.09900
* Final test accuracy: 0.10090
* Accuracy on 100 samples: 0.11
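
One plausible reading of this log (an interpretation, not something the notebook states) is that the network has collapsed to predicting the same probability for every class. The sketch below computes the mean sigmoid cross-entropy for a one-hot label when all 10 outputs share one probability; the function names are hypothetical.

```python
import math

def sigmoid_cross_entropy(prob, label):
    """Cross-entropy of one sigmoid output with predicted probability `prob`."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def mean_loss_for_constant_prediction(prob, num_classes=10):
    """Mean loss over the outputs of a one-hot example when every output equals `prob`."""
    positive = sigmoid_cross_entropy(prob, 1)                      # the one correct class
    negatives = (num_classes - 1) * sigmoid_cross_entropy(prob, 0)  # the nine wrong classes
    return (positive + negatives) / num_classes

initial_loss = mean_loss_for_constant_prediction(0.5)  # all logits near zero
plateau_loss = mean_loss_for_constant_prediction(0.1)  # every class predicted at 1/10
print(round(initial_loss, 5), round(plateau_loss, 5))
```

The two values come out close to the 0.69 seen at batch 0 and the roughly 0.325 where the loss stalls, which is consistent with an accuracy that stays near 10%.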


## Adding batch normalization and improving the previous results<a id="example_1"></a>

For this exercise, we use [`tf.layers.batch_normalization`](https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization) to handle most of the math, but a few other changes to the network are needed to integrate batch normalization.

To add batch normalization to the layers created by `conv_layer`, we made the following changes:

* Added the `is_training` parameter to the function signature so we can pass that information to the batch normalization layer.
* Removed the bias and activation function from the `conv2d` layer.
* Used `tf.layers.batch_normalization` to normalize the convolutional layer's output. Notice that we pass `is_training` to this layer to ensure the network updates its population statistics appropriately.
* Passed the normalized values into a ReLU activation function.

Batch normalization is a new enough idea that researchers are still discovering how best to use it. In general, people seem to agree on removing the layer's bias (because batch normalization already has terms for scaling and shifting) and on adding batch normalization before the layer's non-linear activation function. However, for some networks it will work well in other ways, too.
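
To see why the bias becomes redundant, here is a minimal pure-Python sketch of the batch normalization arithmetic for a single feature. The values, the bias, and the `epsilon` are made up for illustration, and `batch_norm` is a hypothetical helper rather than a TensorFlow function.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + epsilon) + beta for v in values]

pre_activations = [0.5, -1.2, 3.0, 0.7]  # made-up layer outputs for one feature
bias = 2.5                               # made-up layer bias

without_bias = batch_norm(pre_activations)
with_bias = batch_norm([v + bias for v in pre_activations])

# Subtracting the batch mean removes any constant offset, so both results match up to rounding
max_difference = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_difference)
```

Any constant shift the layer needs can be learned through `beta` instead, which is why the layers in this section pass `use_bias=False`.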

In [14]:
def fully_connected(prev_layer, num_units, is_training):
    
    """
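    Create a fully connected layer with the given layer as input and the given number of neurons.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param num_units: int
        The size of the layer. That is, the number of units, nodes, or neurons.
    :param is_training: bool or Tensor
        Indicates whether or not the network is currently training, which tells the batch
        normalization layer whether to use batch statistics or population statistics.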
    :returns Tensor
        A new fully connected layer
    """
  
    layer = tf.layers.dense(prev_layer, num_units, use_bias=False, activation=None)
    layer = tf.layers.batch_normalization(layer, training=is_training)
    layer = tf.nn.relu(layer)
    return layer

In [15]:
def conv_layer(prev_layer, layer_depth, is_training):
    """
    Create a convolutional layer with the given layer as input.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param layer_depth: int
        We'll set the strides and number of feature maps based on the layer's depth in the network.
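        This is *not* a good way to make a CNN, but it helps us create this example with very little code.
    :param is_training: bool or Tensor
        Indicates whether or not the network is currently training, which tells the batch
        normalization layer whether to use batch statistics or population statistics.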
    :returns Tensor
        A new convolutional layer
    """
    strides = 2 if layer_depth % 3 == 0 else 1
    conv_layer = tf.layers.conv2d(prev_layer, layer_depth*4, 3, strides, 'same', use_bias=False, activation=None)
    conv_layer = tf.layers.batch_normalization(conv_layer, training=is_training)
    conv_layer = tf.nn.relu(conv_layer)

    return conv_layer

### Modify the `train` function

* Add `is_training`, a placeholder that stores a boolean value indicating whether or not the network is training.
* Pass `is_training` to the `conv_layer` and `fully_connected` functions.
* Each time we call `run` on the session, add the appropriate value for **is_training** to `feed_dict`.
* Move the creation of `train_opt` inside the following `with` statement:

        with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):

    This statement works in conjunction with the `training` parameter we pass to `tf.layers.batch_normalization`. It is necessary to get the normalization layers to update their population statistics during training, which we need when performing inference. Without it, inference won't work properly and the network will report low accuracy.

Finally, whenever we train the network or perform inference, we use `feed_dict` to set `is_training` to `True` or `False`, respectively, as in the following line:

    sess.run(train_opt, {inputs: batch_xs, labels: batch_ys, is_training: True})

The `train` function below applies these changes.
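
The "population statistics" mentioned above are running estimates that get nudged toward each batch's mean and variance during training. The following pure-Python sketch shows that bookkeeping for one feature. The class `RunningStats`, the momentum values, and the batch are assumptions for illustration; the starting values of 0 and 1 mirror how the moving mean and variance are initialized.

```python
class RunningStats:
    """Tracks the population mean/variance of one feature with an exponential moving average."""

    def __init__(self, momentum=0.99):
        self.momentum = momentum  # assumed decay; higher means slower updates
        self.mean = 0.0           # running mean starts at 0
        self.variance = 1.0       # running variance starts at 1

    def update(self, batch):
        """Training step: fold this batch's statistics into the running estimates."""
        batch_mean = sum(batch) / len(batch)
        batch_var = sum((v - batch_mean) ** 2 for v in batch) / len(batch)
        self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
        self.variance = self.momentum * self.variance + (1 - self.momentum) * batch_var
        return batch_mean, batch_var

stats = RunningStats(momentum=0.9)
for _ in range(200):
    stats.update([4.0, 6.0, 8.0, 6.0])  # made-up batch with mean 6 and variance 2

untouched = RunningStats()  # what the estimates look like if the updates never run
print(stats.mean, stats.variance)
print(untouched.mean, untouched.variance)
```

If the update operations are never executed, the estimates stay at their initial values, so inference normalizes with a mean of 0 and a variance of 1 regardless of the data.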

In [17]:
# Edit the train function to support batch normalization and make sure it updates and uses its population 
# statistics correctly.


def train(num_batches, batch_size, learning_rate):
    # Build placeholders for the input samples and labels 
    inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.float32, [None, 10])
    
    # Add placeholder to indicate whether or not we're training the model
    is_training = tf.placeholder(tf.bool)

    
    # Feed the inputs into a series of 19 convolutional layers (range(1, 20))
    layer = inputs 
    for layer_i in range(1, 20):
        layer = conv_layer(layer, layer_i, is_training) # is_training

    # Flatten the output from the convolutional layers 
    orig_shape = layer.get_shape().as_list()
    layer = tf.reshape(layer, shape=[-1, orig_shape[1] * orig_shape[2] * orig_shape[3]])

    # Add one fully connected layer
    layer = fully_connected(layer, 100, is_training)  # is_training

    # Create the output layer with 1 node for each of the 10 digit classes
    logits = tf.layers.dense(layer, 10)
    
    # Define loss and training operations
    model_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))
    
    # Tell TensorFlow to update the population statistics while training
    # A context manager that specifies control dependencies for all operations constructed within the context.
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
        train_opt = tf.train.AdamOptimizer(learning_rate).minimize(model_loss)   # !IMPORTANT
       
    # Create operations to test accuracy
    correct_prediction = tf.equal(tf.argmax(logits,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # Train and test the network
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for batch_i in range(num_batches):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

            # train this batch
            sess.run(train_opt, {inputs: batch_xs, labels: batch_ys, is_training: True})  # is_training=True while training
            
            # Check the validation loss and accuracy every 100 batches, and the training loss and accuracy every 25
            if batch_i % 100 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: mnist.validation.images,
                                                              labels: mnist.validation.labels, is_training: False}) # now we are not training, we are validating
                print('Batch: {:>2}: Validation loss: {:>3.5f}, Validation accuracy: {:>3.5f}'.format(batch_i, loss, acc))
            elif batch_i % 25 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: batch_xs, labels: batch_ys, is_training: False})
                print('Batch: {:>2}: Training loss: {:>3.5f}, Training accuracy: {:>3.5f}'.format(batch_i, loss, acc))

        # At the end, score the final accuracy for both the validation and test sets
        acc = sess.run(accuracy, {inputs: mnist.validation.images,
                                  labels: mnist.validation.labels, is_training: False})
        print('Final validation accuracy: {:>3.5f}'.format(acc))
        acc = sess.run(accuracy, {inputs: mnist.test.images,
                                  labels: mnist.test.labels, is_training: False})
        print('Final test accuracy: {:>3.5f}'.format(acc))
        
        # Score the first 100 test images individually. This won't work if batch normalization isn't implemented correctly.
        correct = 0
        for i in range(100):
            correct += sess.run(accuracy,feed_dict={inputs: [mnist.test.images[i]],
                                                    labels: [mnist.test.labels[i]], is_training: False})

        print("Accuracy on 100 samples:", correct/100)


num_batches = 800
batch_size = 64
learning_rate = 0.002

tf.reset_default_graph()
with tf.Graph().as_default():
    train(num_batches, batch_size, learning_rate)

Batch:  0: Validation loss: 0.69091, Validation accuracy: 0.09760
Batch: 25: Training loss: 0.59167, Training accuracy: 0.07812
Batch: 50: Training loss: 0.49184, Training accuracy: 0.07812
Batch: 75: Training loss: 0.42548, Training accuracy: 0.06250
Batch: 100: Validation loss: 0.37773, Validation accuracy: 0.09240
Batch: 125: Training loss: 0.35851, Training accuracy: 0.06250
Batch: 150: Training loss: 0.33854, Training accuracy: 0.10938
Batch: 175: Training loss: 0.32412, Training accuracy: 0.15625
Batch: 200: Validation loss: 0.29513, Validation accuracy: 0.25780
Batch: 225: Training loss: 0.24947, Training accuracy: 0.45312
Batch: 250: Training loss: 0.23215, Training accuracy: 0.50000
Batch: 275: Training loss: 0.28767, Training accuracy: 0.42188
Batch: 300: Validation loss: 0.17280, Validation accuracy: 0.64820
Batch: 325: Training loss: 0.10398, Training accuracy: 0.78125
Batch: 350: Training loss: 0.14164, Training accuracy: 0.73438
Batch: 375: Training loss: 0.13091, Trainin

With batch normalization, you should now get an accuracy over 90%. Notice also the last line of the output: `Accuracy on 100 samples`. If this value is low while everything else looks good, that means you did not implement batch normalization correctly. Specifically, it means you either did not calculate the population mean and variance while training, or you are not using those values during inference.
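
The single-image check is sensitive to this mistake because a batch of one has no spread. The sketch below uses made-up numbers and hypothetical helper names, and considers a feature that has one value per sample (as in a fully connected layer).

```python
import math

def batch_statistics(values):
    """Mean and variance of one feature across the batch."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def normalize(values, mean, variance, epsilon=1e-3):
    """Batch-norm arithmetic for one feature, without the learned scale and shift."""
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

single_sample = [7.3]                       # a "batch" holding one made-up activation
population_mean, population_var = 6.0, 2.0  # made-up statistics gathered during training

# Batch statistics: the mean equals the value itself, so the output is always 0
wrong = normalize(single_sample, *batch_statistics(single_sample))
# Population statistics: the activation keeps its information
right = normalize(single_sample, population_mean, population_var)
print(wrong, right)
```

With batch statistics every single-sample input normalizes to zero, so the network's output no longer depends on the image, which would show up as a low `Accuracy on 100 samples`.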

## Batch Normalization using `tf.nn.batch_normalization`<a id="example_2"></a>

Most of the time you will be able to use higher level functions exclusively, but sometimes you may want to work at a lower level. For example, if you ever want to implement a new feature – something new enough that TensorFlow does not already include a high-level implementation of it, like batch normalization in an LSTM – then you may need to know these sorts of things.

This version of the network uses `tf.nn` for almost everything, and expects you to implement batch normalization using [`tf.nn.batch_normalization`](https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization).

**Optional TODO:** You can run the next three cells before you edit them just to see how the network performs without batch normalization. However, the results should be pretty much the same as you saw with the previous example before you added batch normalization. 

**TODO:** Modify `fully_connected` to add batch normalization to the fully connected layers it creates. Feel free to change the function's parameters if it helps.

**Note:** For convenience, we continue to use `tf.layers.dense` for the `fully_connected` layer. By this point in the class, you should have no problem replacing that with matrix operations between the `prev_layer` and explicit weights and biases variables.

In [None]:
def fully_connected(prev_layer, num_units):
    """
    Create a fully connected layer with the given layer as input and the given number of neurons.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param num_units: int
        The size of the layer. That is, the number of units, nodes, or neurons.
    :returns Tensor
        A new fully connected layer
    """
    layer = tf.layers.dense(prev_layer, num_units, activation=tf.nn.relu)
    return layer

**TODO:** Modify `conv_layer` to add batch normalization to the convolutional layers it creates. Feel free to change the function's parameters if it helps.

**Note:** Unlike in the previous example that used `tf.layers`, adding batch normalization to these convolutional layers _does_ require some slight differences to what you did in `fully_connected`. 
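
One likely reading of this note (an assumption about the intended solution, not something the notebook spells out) is that a convolutional layer keeps one mean and variance per feature map, computed across the batch, height, and width, whereas a fully connected layer averages over the batch only. In TensorFlow that choice corresponds to the `axes` argument of `tf.nn.moments`. The pure-Python sketch below shows the per-channel computation on a tiny made-up tensor; `channel_moments` is a hypothetical helper.

```python
def channel_moments(batch):
    """Mean and variance per channel of a nested list shaped [batch, height, width, channels]."""
    num_channels = len(batch[0][0][0])
    means, variances = [], []
    for c in range(num_channels):
        # Pool every value of channel c across batch, height, and width
        values = [pixel[c] for image in batch for row in image for pixel in row]
        mean = sum(values) / len(values)
        means.append(mean)
        variances.append(sum((v - mean) ** 2 for v in values) / len(values))
    return means, variances

# Made-up activations: 2 images, 2x2 pixels, 2 channels
batch = [
    [[[1.0, 10.0], [2.0, 10.0]], [[3.0, 10.0], [4.0, 10.0]]],
    [[[5.0, 20.0], [6.0, 20.0]], [[7.0, 20.0], [8.0, 20.0]]],
]
means, variances = channel_moments(batch)
print(means, variances)
```

Each channel ends up with a single mean and variance, so the learned scale and shift also have one entry per channel rather than one per pixel.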

In [None]:
def conv_layer(prev_layer, layer_depth):
    """
    Create a convolutional layer with the given layer as input.
    
    :param prev_layer: Tensor
        The Tensor that acts as input into this layer
    :param layer_depth: int
        We'll set the strides and number of feature maps based on the layer's depth in the network.
        This is *not* a good way to make a CNN, but it helps us create this example with very little code.
    :returns Tensor
        A new convolutional layer
    """
    strides = 2 if layer_depth % 3 == 0 else 1

    in_channels = prev_layer.get_shape().as_list()[3]
    out_channels = layer_depth*4
    
    weights = tf.Variable(
        tf.truncated_normal([3, 3, in_channels, out_channels], stddev=0.05))
    
    bias = tf.Variable(tf.zeros(out_channels))

    conv_layer = tf.nn.conv2d(prev_layer, weights, strides=[1,strides, strides, 1], padding='SAME')
    conv_layer = tf.nn.bias_add(conv_layer, bias)
    conv_layer = tf.nn.relu(conv_layer)

    return conv_layer

**TODO:** Edit the `train` function to support batch normalization. You'll need to make sure the network knows whether or not it is training.

In [None]:
def train(num_batches, batch_size, learning_rate):
    # Build placeholders for the input samples and labels 
    inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.float32, [None, 10])
    
    # Feed the inputs into a series of 19 convolutional layers (range(1, 20))
    layer = inputs
    for layer_i in range(1, 20):
        layer = conv_layer(layer, layer_i)

    # Flatten the output from the convolutional layers 
    orig_shape = layer.get_shape().as_list()
    layer = tf.reshape(layer, shape=[-1, orig_shape[1] * orig_shape[2] * orig_shape[3]])

    # Add one fully connected layer
    layer = fully_connected(layer, 100)

    # Create the output layer with 1 node for each of the 10 digit classes
    logits = tf.layers.dense(layer, 10)
    
    # Define loss and training operations
    model_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))
    train_opt = tf.train.AdamOptimizer(learning_rate).minimize(model_loss)
    
    # Create operations to test accuracy
    correct_prediction = tf.equal(tf.argmax(logits,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # Train and test the network
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for batch_i in range(num_batches):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

            # train this batch
            sess.run(train_opt, {inputs: batch_xs, labels: batch_ys})
            
            # Periodically check the validation or training loss and accuracy
            if batch_i % 100 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: mnist.validation.images,
                                                              labels: mnist.validation.labels})
                print('Batch: {:>2}: Validation loss: {:>3.5f}, Validation accuracy: {:>3.5f}'.format(batch_i, loss, acc))
            elif batch_i % 25 == 0:
                loss, acc = sess.run([model_loss, accuracy], {inputs: batch_xs, labels: batch_ys})
                print('Batch: {:>2}: Training loss: {:>3.5f}, Training accuracy: {:>3.5f}'.format(batch_i, loss, acc))

        # At the end, score the final accuracy for both the validation and test sets
        acc = sess.run(accuracy, {inputs: mnist.validation.images,
                                  labels: mnist.validation.labels})
        print('Final validation accuracy: {:>3.5f}'.format(acc))
        acc = sess.run(accuracy, {inputs: mnist.test.images,
                                  labels: mnist.test.labels})
        print('Final test accuracy: {:>3.5f}'.format(acc))
        
        # Score the first 100 test images individually. This won't work if batch normalization isn't implemented correctly.
        correct = 0
        for i in range(100):
            correct += sess.run(accuracy,feed_dict={inputs: [mnist.test.images[i]],
                                                    labels: [mnist.test.labels[i]]})

        print("Accuracy on 100 samples:", correct/100)


num_batches = 800
batch_size = 64
learning_rate = 0.002

tf.reset_default_graph()
with tf.Graph().as_default():
    train(num_batches, batch_size, learning_rate)