# Multi-Layer Perceptron
We introduce a multi-layer perceptron (MLP), here a three-layer fully connected neural network, once again using the MNIST data. We first write the model the way we did before. Next, we clean it up a bit using more of the TensorFlow API. Finally, we show how to organize the functions cleanly using `tf.estimator`.

## Preparation

In [1]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [2]:
num_features = mnist.train.images.shape[1]
num_classes  = mnist.train.labels.shape[1]
num_hidden_1 = 256
num_hidden_2 = 256

## Version 1

Let's build a three-layer fully connected neural network. It's actually very straightforward.
We have two hidden layers and one output layer. Each of the two hidden layers is followed by an activation function (chosen to be ReLU).
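
To make the shapes concrete, here is a minimal NumPy sketch of the same forward pass. It is only an illustration: the helper names (`relu`, `mlp_forward`) are ours, and the sizes assume 784 input features and 10 classes as in MNIST.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: max(z, 0)."""
    return np.maximum(z, 0.0)

def mlp_forward(X, W1, b1, W2, b2, Wout, bout):
    """Two ReLU hidden layers followed by a linear output layer."""
    H1 = relu(X @ W1 + b1)    # shape (batch, n_h1)
    H2 = relu(H1 @ W2 + b2)   # shape (batch, n_h2)
    return H2 @ Wout + bout   # shape (batch, n_out), raw scores

# Assumed sizes: 784 pixels, two hidden layers of 256 units, 10 classes
n_in, n_h1, n_h2, n_out = 784, 256, 256, 10
rng = np.random.default_rng(0)
demo_weights = (
    rng.standard_normal((n_in, n_h1)), rng.standard_normal(n_h1),
    rng.standard_normal((n_h1, n_h2)), rng.standard_normal(n_h2),
    rng.standard_normal((n_h2, n_out)), rng.standard_normal(n_out),
)

demo_X = rng.random((5, n_in))              # a fake batch of 5 "images"
demo_out = mlp_forward(demo_X, *demo_weights)
print(demo_out.shape)                       # (5, 10)
```

Each row of the result holds one score per class for the corresponding input row.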

In [3]:
tf.reset_default_graph() # Clearing all tensors before this

In [4]:
with tf.name_scope('data'):
    X = tf.placeholder(tf.float32, shape=[None, num_features], name='Input-Images')
    Y = tf.placeholder(tf.float32, shape=[None, num_classes], name='Output-Labels')

In [5]:
with tf.name_scope('fc1'): # first hidden layer variables
    W1 = tf.Variable(tf.random_normal([num_features, num_hidden_1]),name='weights')
    b1 = tf.Variable(tf.random_normal([num_hidden_1]),name='bias')

with tf.name_scope('fc2'): # second hidden layer variables
    W2 = tf.Variable(tf.random_normal([num_hidden_1, num_hidden_2]),name='weights')
    b2 = tf.Variable(tf.random_normal([num_hidden_2]),name='bias')

with tf.name_scope('out'): # output layer variables
    Wout = tf.Variable(tf.random_normal([num_hidden_2, num_classes]),name='weights')
    bout = tf.Variable(tf.random_normal([num_classes]),name='bias')

In [6]:
with tf.name_scope('multilayer_perceptron'): # 3 layer fully connected network
    H1 = tf.nn.relu(X @ W1 + b1, name='H1')
    H2 = tf.nn.relu(H1 @ W2 + b2, name='H2')
    logits = tf.add(H2 @ Wout, bout, name='out')

The final layer outputs `logits`, which are just the outputs of the neural network before `softmax` is applied to turn them into probabilities. We stop there because TensorFlow has a convenient API, `tf.nn.softmax_cross_entropy_with_logits`, that applies `softmax` to the logits and then computes the cross entropy for each example. So, all we have to do is average those per-example cross entropies (with `tf.reduce_mean`) to reduce them to a single loss.
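
To see what this loss computes, here is a minimal NumPy sketch of softmax followed by cross entropy. It is not TensorFlow's implementation; the function name `softmax_cross_entropy` and the toy arrays are ours, and the labels are assumed to be one-hot.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross entropy between one-hot labels and softmax(logits)."""
    # Subtract the row-wise max for numerical stability (softmax is unchanged)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

toy_logits = np.array([[2.0, 1.0, 0.1],
                       [0.0, 0.0, 0.0]])
toy_labels = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(toy_logits, toy_labels)  # one value per row
toy_loss = per_example.mean()                                # single scalar loss
print(per_example, toy_loss)
```

The second row has equal logits, so its softmax is uniform over three classes and its cross entropy equals log(3).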

In [7]:
with tf.name_scope('loss'):
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                             logits=logits, labels=Y),name='loss')
    correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
    accuracy           = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))   
    
with tf.name_scope('optimizer'):
    learning_rate = 0.01
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    update = optimizer.minimize(loss)

with tf.name_scope('summaries'):
    tf.summary.scalar('loss', loss)
    tf.summary.histogram('histogram-loss', loss)
    summary_op = tf.summary.merge_all()

We can train the model just like we did before. The test accuracy after 25 epochs is not very high yet (with more epochs, it should improve).

In [8]:
# Train
num_epochs  = 25
batch_size  = 100

with tf.Session() as sess:
    writer = tf.summary.FileWriter('log/multilayer_perceptron1', sess.graph)
    
    sess.run(tf.global_variables_initializer())
    total_batch = int(mnist.train.num_examples/batch_size)
    for epoch in range(num_epochs):
        average_cost = 0
        for batch in range(total_batch):
            batch_X, batch_Y = mnist.train.next_batch(batch_size)
            _, c = sess.run([update, loss], feed_dict={X: batch_X,
                                                       Y: batch_Y})
            average_cost += c / total_batch
            summary = sess.run(summary_op, feed_dict={X: batch_X,
                                                      Y: batch_Y})
            global_step = epoch*total_batch + batch
            writer.add_summary(summary, global_step=global_step)
        print("Epoch:",epoch,"Cost:",average_cost)
    
    print("Test Accuracy:", accuracy.eval({X: mnist.test.images, Y: mnist.test.labels}))    
    writer.close()

Epoch: 0 Cost: 63.5227008659
Epoch: 1 Cost: 11.7255829098
Epoch: 2 Cost: 7.04268976911
Epoch: 3 Cost: 4.94169215168
Epoch: 4 Cost: 3.7582281366
Epoch: 5 Cost: 3.02995569802
Epoch: 6 Cost: 2.51047183418
Epoch: 7 Cost: 2.11883776171
Epoch: 8 Cost: 1.82096150981
Epoch: 9 Cost: 1.56255665463
Epoch: 10 Cost: 1.36567232466
Epoch: 11 Cost: 1.20529388641
Epoch: 12 Cost: 1.07357674758
Epoch: 13 Cost: 0.954112300066
Epoch: 14 Cost: 0.868395346037
Epoch: 15 Cost: 0.793990999275
Epoch: 16 Cost: 0.716135683926
Epoch: 17 Cost: 0.652914618337
Epoch: 18 Cost: 0.601221448492
Epoch: 19 Cost: 0.552582887719
Epoch: 20 Cost: 0.508905344259
Epoch: 21 Cost: 0.476214547663
Epoch: 22 Cost: 0.442873191204
Epoch: 23 Cost: 0.410599114106
Epoch: 24 Cost: 0.384783977855
Test Accuracy: 0.9184


## Version 2

The only difference in this version is that each layer is replaced with a single API call, `tf.layers.dense`. The convenient thing about this is that 1) it's clean and 2) you don't need to create and initialize the weight and bias variables yourself. The rest is the same as before.

In [92]:
tf.reset_default_graph() # Clearing all tensors before this

In [93]:
with tf.name_scope('data'):
    X = tf.placeholder(tf.float32, shape=[None, num_features], name='Input-Images')
    Y = tf.placeholder(tf.float32, shape=[None, num_classes], name='Output-Labels')

In [94]:
with tf.name_scope('multilayer_perceptron'): # Now, it is only three lines
    # Hidden fully connected layer with 256 neurons
    layer_1 = tf.layers.dense(X, num_hidden_1, tf.nn.relu, name='fc1')
    # Hidden fully connected layer with 256 neurons
    layer_2 = tf.layers.dense(layer_1, num_hidden_2, tf.nn.relu, name='fc2')
    # Output fully connected layer with a neuron for each class
    logits = tf.layers.dense(layer_2, num_classes, name='out')

In [95]:
with tf.name_scope('loss'):
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                             logits=logits, labels=Y),name='loss')
    correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
    accuracy           = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))   
    
with tf.name_scope('optimizer'):
    learning_rate = 0.01
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    update = optimizer.minimize(loss)

with tf.name_scope('summaries'):
    tf.summary.scalar('loss', loss)
    tf.summary.histogram('histogram-loss', loss)
    summary_op = tf.summary.merge_all()

In [72]:
# Train
num_epochs  = 25
batch_size  = 100

with tf.Session() as sess:
    writer = tf.summary.FileWriter('log/multilayer_perceptron2', sess.graph)
    
    sess.run(tf.global_variables_initializer())
    total_batch = int(mnist.train.num_examples/batch_size)
    for epoch in range(num_epochs):
        average_cost = 0
        for batch in range(total_batch):
            batch_X, batch_Y = mnist.train.next_batch(batch_size)
            _, c = sess.run([update, loss], feed_dict={X: batch_X,
                                                       Y: batch_Y})
            average_cost += c / total_batch
            summary = sess.run(summary_op, feed_dict={X: batch_X,
                                                      Y: batch_Y})
            global_step = epoch*total_batch + batch
            writer.add_summary(summary, global_step=global_step)
        print("Epoch:",epoch,"Cost:",average_cost)
    
    print("Test Accuracy:", accuracy.eval({X: mnist.test.images, Y: mnist.test.labels}))    
    writer.close()

Epoch: 0 Cost: 1.1429620449651363
Epoch: 1 Cost: 0.44573915083299925
Epoch: 2 Cost: 0.35664164621721617
Epoch: 3 Cost: 0.31739128164269714
Epoch: 4 Cost: 0.2910888133265754
Epoch: 5 Cost: 0.2713912793858482
Epoch: 6 Cost: 0.25498610958456974
Epoch: 7 Cost: 0.2407954240658066
Epoch: 8 Cost: 0.22825071882117884
Epoch: 9 Cost: 0.21688424812121837
Epoch: 10 Cost: 0.20677164849909882
Epoch: 11 Cost: 0.19733745005320408
Epoch: 12 Cost: 0.1889122671769422
Epoch: 13 Cost: 0.18079517665234468
Epoch: 14 Cost: 0.17358795196495266
Epoch: 15 Cost: 0.16686928619037983
Epoch: 16 Cost: 0.1604892960935832
Epoch: 17 Cost: 0.15470593801953564
Epoch: 18 Cost: 0.14925623104653593
Epoch: 19 Cost: 0.14406697281382297
Epoch: 20 Cost: 0.1391177260740237
Epoch: 21 Cost: 0.1345416298576378
Epoch: 22 Cost: 0.13013896662741917
Epoch: 23 Cost: 0.12604238837618725
Epoch: 24 Cost: 0.12211634944108385
Test Accuracy: 0.9623


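The performance is much better now! Why? Most likely because the weight initialization is different. When we initialized the network variables manually, we drew the weights and biases from the standard normal distribution (`tf.random_normal`, standard deviation 1), which gives very large pre-activations and a large initial loss (63.5 in the first epoch above). By default, `tf.layers.dense` instead uses Glorot (Xavier) uniform initialization for the weights, which has a much smaller variance, and initializes the biases to zero.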
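
The effect of the weight scale can be illustrated with a small NumPy sketch. It assumes Glorot uniform initialization (weights drawn uniformly from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))) as the small-variance alternative, and it uses random values in [0, 1] as a stand-in for MNIST pixels. The variable names are ours and not part of TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256   # sizes of the first hidden layer

# Version 1 style: standard normal weights (standard deviation 1)
W_normal = rng.standard_normal((fan_in, fan_out))

# Glorot uniform weights (assumed here as the small-variance alternative)
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Stand-in inputs in [0, 1]; real MNIST pixels are sparser but similar in range
fake_images = rng.random((100, fan_in))

pre_normal = fake_images @ W_normal   # pre-activations of the first layer
pre_glorot = fake_images @ W_glorot

print("weight std:", W_normal.std(), W_glorot.std())
print("pre-activation std:", pre_normal.std(), pre_glorot.std())
```

With unit-variance weights the pre-activations are more than an order of magnitude larger, and that scale compounds through the following layers.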

## Version 3

The final version uses the recently added `tf.estimator` API. This is essentially a wrapper for the models you create. If we use `tf.estimator`, we no longer have to code up `for` loops for the mini-batch training. All we need to do is write the model and tell the estimator to run it for a certain number of steps. `tf.estimator` is also convenient because it makes prediction and evaluation very easy.

In [9]:
tf.reset_default_graph() # Clearing all tensors before this

First, create a multi-layer perceptron just like we did before, but this time, let's make it a function.

In [10]:
def multilayer_perceptron(X_dict):
    # Estimator input is a dict, in case of multiple inputs
    X = X_dict['images']
    layer_1 = tf.layers.dense(X, num_hidden_1, tf.nn.relu, name='fc1')
    layer_2 = tf.layers.dense(layer_1, num_hidden_2, tf.nn.relu, name='fc2')
    logits = tf.layers.dense(layer_2, num_classes, name='out')
    return logits

Next, we create the input to the estimator. We need to specify the entire model, including the loss, optimizer, etc. We build a function `model_fn` with a pre-specified signature.

`model_fn` should take in features, labels and mode. The mode (one of the `tf.estimator.ModeKeys`) tells the function whether you're running it in `TRAIN`, `EVAL` or `PREDICT`. You can code different behaviors for each mode.

In [5]:
def model_fn(features, labels, mode):
    # get logit prediction from NN
    logits = multilayer_perceptron(features)

    pred_classes = tf.argmax(logits, axis=1)

    # if prediction mode, early return
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=pred_classes)
    
    # these are all the same code as before
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                                 logits=logits, labels=labels),name='loss')
        # tf.metrics.accuracy expects labels first, then predictions
        accuracy = tf.metrics.accuracy(labels=tf.argmax(labels, 1),
                                       predictions=pred_classes)

    with tf.name_scope('optimizer'):
        learning_rate = 0.01
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
        update = optimizer.minimize(loss,
                                    global_step=tf.train.get_global_step())

    with tf.name_scope('summaries'):
        tf.summary.scalar('loss', loss)
        tf.summary.histogram('histogram-loss', loss)
        summary_op = tf.summary.merge_all()

    # The Estimator requires model_fn to return an EstimatorSpec, which
    # specifies the different ops for training, evaluating, ...
    estim_specs = tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=pred_classes,
        loss=loss,
        train_op=update,
        eval_metric_ops={'accuracy': accuracy})

    return estim_specs

Now, you can build the model with a single line.

In [None]:
model = tf.estimator.Estimator(model_fn) # build the Estimator

Training then becomes extremely simple. First, define an input using `tf.estimator.inputs`.

In [None]:
# define the input function for training
batch_size = 128
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': mnist.train.images}, y=mnist.train.labels,
    batch_size=batch_size, num_epochs=None, shuffle=True)
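
Because `num_epochs=None` makes the input function cycle through the data indefinitely, the length of training is set by the number of steps passed to `train`, not by a number of epochs. The small sketch below converts between the two. The helper names are ours, and the training set size of 55,000 is an assumption about what `mnist.train.num_examples` reports for this loader.

```python
import math

def steps_to_epochs(num_steps, batch_size, num_examples):
    """Number of passes over the data covered by num_steps mini-batches."""
    return num_steps * batch_size / num_examples

def epochs_to_steps(num_epochs, batch_size, num_examples):
    """Number of mini-batch steps needed to cover num_epochs passes."""
    return math.ceil(num_epochs * num_examples / batch_size)

assumed_train_size = 55000  # assumption: size of mnist.train for this loader

# 5000 steps with batches of 128 examples
print(steps_to_epochs(5000, 128, assumed_train_size))

# The earlier versions ran 25 epochs with batches of 100 examples
print(epochs_to_steps(25, 100, assumed_train_size))
```

Under that assumption, 5000 steps at a batch size of 128 cover somewhat fewer than 12 epochs, roughly half of the 25 epochs used in the first two versions.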

Then, training is a one-liner!

In [8]:
num_steps = 5000
model.train(input_fn, steps=num_steps) # train the Model

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/lj/p0jqksf54pldc98grzy8m6p00000gn/T/tmpv7e4ojg3', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x181c97af98>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/lj/p0jqksf54pldc98grzy8m6p00000gn/T/tmpv7e4ojg3/model.ckpt.
INFO:tensorflow:loss = 2.3240983, step = 1
INFO:tensorflow:global_step/sec: 239.482
INFO:tensorflow:loss = 1.7728766, step = 101 (0.419 sec)
INFO:tensorflow:global_step/sec: 243.212
INFO:tensorflow:loss = 1.2416974, 

To evaluate the model, you need to specify the test input and run `.evaluate` instead of `.train`. Then, the mode will be set to `EVAL`.

In [None]:

# define the input function for evaluating
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': mnist.test.images}, y=mnist.test.labels,
    batch_size=batch_size, shuffle=False)
# use the Estimator 'evaluate' method
e = model.evaluate(input_fn)

print("Testing Accuracy:", e['accuracy'])

That's it! You can choose whichever version you prefer when constructing your model. Although we didn't introduce it here, it is also a good strategy to create a Python class for each of your models and write them in an object-oriented style. In the sense that it "packages" a model, a class is similar to `tf.estimator.Estimator`.

## Exercise
Improve the above network, even by a bit.
For example, you can try a CNN, dropout, a different activation function, tuning the learning rate, etc.