# Jonathan Halverson
# Thursday, November 16, 2017
# Batch normalization

One reason why DNN's were abandonned was that the gradients tended to vanish or explode in proceeding along the layers. This problem was solved using a combination of initialization scheme, activation function and batch normalization (BN). The idea behind BN is to maintain the strength of the gradients so backpropagration can be used. BN is typically applied after the summation and before the activation function is applied. The procedure essentially standardizes the outputs and then finds a scaling and offset parameter for each layer.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
tf.reset_default_graph()

In [3]:
n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

In [4]:
X = tf.placeholder(shape=(None, n_inputs), dtype=tf.float32, name="X")
y = tf.placeholder(shape=(None), dtype=tf.int32, name="y")

In [5]:
training = tf.placeholder_with_default(False, shape=(None), name="training")

In [6]:
from functools import partial

In [7]:
my_batch_norm_layer = partial(tf.layers.batch_normalization, training=training, momentum=0.9)

In [8]:
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

In [9]:
hidden2 = tf.layers.dense(bn1_act, n_hidden2, activation=tf.nn.elu, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

In [10]:
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, activation=None, name="logits_before_bn")
logits = my_batch_norm_layer(logits_before_bn)

In [11]:
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss")

In [12]:
learning_rate = 0.01
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

In [13]:
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [14]:
init = tf.global_variables_initializer()

In [15]:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

In [16]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [17]:
batch_size = 1000
n_epochs = 100

In [18]:
with tf.Session() as sess:
     init.run()
     for epoch in range(n_epochs):
          for iteration in range(mnist.train.num_examples // batch_size):
               X_batch, y_batch = mnist.train.next_batch(batch_size)
               sess.run([training_op, extra_update_ops], feed_dict={training:True, X:X_batch, y:y_batch})
     accuracy_val = accuracy.eval(feed_dict={training:False, X:mnist.test.images, y:mnist.test.labels})
     print(epoch, "Test accuracy=", accuracy_val)

(99, 'Test accuracy=', 0.97049999)


With only two hidden layers the effect of BN is not seen. It may have helped speed up training or not.