Multi Layer Perceptron
=============

The Multi Layer Perceptron (MLP) is an extension of the classical [Perceptron](https://en.wikipedia.org/wiki/Perceptron) with one or more hidden layers. Historically, after the [book of Minsky and Papert](https://en.wikipedia.org/wiki/Perceptrons_(book)) was published in 1969, research on neural networks slowed down considerably. In 1985 Rumelhart, Hinton and Williams published an [article](http://www.dtic.mil/docs/citations/ADA164453) on the use of a generalized delta rule for training an MLP. They verified experimentally that, with the additional layer and the new update rule, the network was able to solve the XOR problem. This result reignited research on neural networks. The authors stated:

*“In short, we believe that we have answered Minsky and Papert's challenge and have found a learning result sufficiently powerful to demonstrate that their pessimism about learning in multilayer machines was misplaced.”*

The MLP, in its classical form, consists of an input layer, a hidden layer and an output layer. The transfer function used between the layers is generally a sigmoid. The loss function can be defined as the mean squared error between the output and the labels. Each layer of the MLP can be represented as a vector-matrix multiplication between an input vector $\boldsymbol{x}$ and a weight matrix $\boldsymbol{W}$. The result is added to a bias vector $\boldsymbol{b}$ and passed through an activation function $\sigma$, generating an output vector $\boldsymbol{y} = \sigma(\boldsymbol{x}\boldsymbol{W} + \boldsymbol{b})$. In other words, each unit of the layer computes a weighted sum of the input values, adds its bias and applies the activation function.
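
To make the layer computation concrete, here is a minimal plain-Python sketch of the forward pass through one hidden layer and one output layer. The helper names (`sigmoid`, `dense_layer`) and all the weight values are hypothetical and chosen only for illustration; they are not the weights learned later in this notebook.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(x, W, b):
    """One MLP layer: y_j = sigmoid(sum_i x_i * W[i][j] + b_j)."""
    y = []
    for j in range(len(b)):
        # Weighted sum of the inputs for unit j, plus its bias
        z = sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        y.append(sigmoid(z))
    return y

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output
x = [1.0, -1.0]
W_hidden = [[0.5, -0.5, 0.0],
            [0.5, 0.5, 0.0]]
b_hidden = [0.0, 0.0, 0.0]
W_out = [[1.0], [1.0], [1.0]]
b_out = [0.0]

h = dense_layer(x, W_hidden, b_hidden)  # hidden activations
y = dense_layer(h, W_out, b_out)        # network output, a single value in (0, 1)
print("hidden: " + str(h))
print("output: " + str(y))
```

Stacking more hidden layers only means feeding the output of one `dense_layer` call into the next one.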

<p align="center">
<img src="../etc/img/mlp_model.png" width="500">
</p>

It is possible to stack multiple hidden layers into a single feedforward network. An MLP with multiple hidden layers can be defined as a **deep** neural network. However, deep MLPs are not so common, because the dense connections introduce many parameters and in very deep models this causes a steep growth in the computational resources needed to train and store them. Nevertheless, the MLP is often used as the last stage of deep convolutional neural networks.
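
To get a feeling for how quickly dense connections add parameters, the following sketch counts the weights and biases of a fully connected network. The helper `count_dense_parameters` is hypothetical, and the larger architecture is only an illustrative assumption (784 inputs, as for flattened 28x28 images), not a model used in this notebook.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    A dense layer from n_in to n_out units has n_in * n_out weights
    plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# The 2-8-1 network implemented in this notebook
small = count_dense_parameters([2, 8, 1])
# A hypothetical deeper MLP with three hidden layers of 1024 units
deep = count_dense_parameters([784, 1024, 1024, 1024, 10])
print("small: {}, deep: {}".format(small, deep))  # small: 33, deep: 2913290
```

Each extra hidden layer of 1024 units adds more than one million parameters, which is why very deep dense networks become expensive.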

Implementing the model in TensorFlow
------------------------------------------

It is straightforward to implement the model in TensorFlow (this notebook uses the TensorFlow 1.x API). Using the `tf.layers` facilities we can define the network in three lines of code. Here I will use the implementation based on the `Estimator` class, which requires embedding the model into a function and associating it with the estimator object. The model is automatically stored in a folder (specified when you create the estimator) and checkpoints are saved during the training. Thanks to the checkpoints you can resume the training at any time. The output value returned by the MLP is a real number between 0 and 1 given by the sigmoid. The output can be approximated to the closest integer using `tf.round()`. Through this rounding it is possible to interpret the results in terms of classification and print the accuracy.
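
The rounding step can be illustrated without TensorFlow. The sketch below thresholds sigmoid outputs at 0.5 and computes the fraction of correct predictions. The helper names and the sample values are hypothetical. An output of exactly 0.5 is mapped to class 0 here, which matches `tf.round()` since it rounds half to even.

```python
def to_class(probability):
    """Threshold a sigmoid output at 0.5 to obtain a 0/1 class label."""
    return 1 if probability > 0.5 else 0

def accuracy(probabilities, labels):
    """Fraction of examples whose thresholded output equals the label."""
    predicted = [to_class(p) for p in probabilities]
    correct = sum(1 for p, t in zip(predicted, labels) if p == t)
    return correct / float(len(labels))

# Made-up network outputs and ground-truth labels
probs = [0.91, 0.12, 0.47, 0.66]
labels = [1, 0, 1, 1]
acc = accuracy(probs, labels)
print("accuracy: " + str(acc))  # accuracy: 0.75
```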

In [None]:
import tensorflow as tf

In [None]:
def my_model_fn(features, labels, mode):
    # Define the MLP: 2 inputs -> 8 sigmoid hidden units -> 1 sigmoid output
    x = tf.reshape(features, [-1, 2])
    h = tf.layers.dense(inputs=x, units=8, activation=tf.nn.sigmoid)
    y = tf.layers.dense(inputs=h, units=1, activation=tf.nn.sigmoid)
    # PREDICT mode: return the rounded class and the raw sigmoid output
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {"classes": tf.round(y),
                       "probabilities": y}
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    # TRAIN mode: minimize the mean squared error with Adam
    elif mode == tf.estimator.ModeKeys.TRAIN:
        loss = tf.losses.mean_squared_error(labels=labels, predictions=y)
        #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        # tf.metrics.accuracy returns a tuple (value, update_op)
        accuracy = tf.metrics.accuracy(labels=labels, predictions=tf.round(y))
        # accuracy[1] is the update op: running it returns the updated value
        tf.summary.scalar('accuracy', accuracy[1])
        logging_hook = tf.train.LoggingTensorHook({"accuracy": accuracy[1]}, every_n_iter=250)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op, training_hooks=[logging_hook])
    # EVAL mode: report the loss and the accuracy
    elif mode == tf.estimator.ModeKeys.EVAL:
        loss = tf.losses.mean_squared_error(labels=labels, predictions=y)
        accuracy = tf.metrics.accuracy(labels=labels, predictions=tf.round(y))
        eval_metric = {"accuracy": accuracy}
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric)

In [None]:
mlp = tf.estimator.Estimator(model_fn=my_model_fn, model_dir="./tf_mlp_model")

Training the model
---------------------

Once we have the model ready, we can train it on a dataset. Here I will use the **XOR dataset** that has been created in [another notebook](../xor/xor.ipynb) of this repository. You do not have to run that notebook, since a version of the dataset is included in TensorBag and is ready to be used. With the Estimator class of TensorFlow it is necessary to pass an input function to the trainer. Here I define this function and parse the dataset, which is stored in TFRecord format. The dataset is loaded as a TensorFlow `Dataset` object, which makes it very easy to return samples from it. Remember that you can monitor the training with **TensorBoard** by passing the model folder to the `--logdir` parameter from the terminal (for example `tensorboard --logdir=./tf_mlp_model`).
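
The input function chains three `Dataset` operations: shuffle, repeat and batch. The following plain-Python sketch mimics that order of operations on a toy list, assuming a shuffle buffer at least as large as the dataset. The generator names are hypothetical and this is only an illustration of the semantics, not how TensorFlow implements them.

```python
import itertools
import random

def shuffled_repeated(examples, rng):
    """Mimic shuffle (buffer >= dataset size) followed by repeat()."""
    while True:
        epoch = list(examples)
        rng.shuffle(epoch)  # full reshuffle on every pass over the data
        for example in epoch:
            yield example

def batched(stream, batch_size):
    """Group consecutive items of an endless stream into lists of batch_size."""
    while True:
        yield list(itertools.islice(stream, batch_size))

rng = random.Random(0)
data = list(range(10))  # stand-in for 10 dataset examples
stream = batched(shuffled_repeated(data, rng), batch_size=4)
first_batches = [next(stream) for _ in range(5)]  # 20 examples = 2 passes
for batch in first_batches:
    print(batch)
```

Because batching happens after repeating, a batch can contain examples from two consecutive passes over the data (the third batch above does), and every batch has the full size.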

In [None]:
def my_input_fn():  
    def _parse_function(example_proto):
        features = {"feature": tf.VarLenFeature(tf.float32),
                    "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
        parsed_features = tf.parse_single_example(example_proto, features)
        feature = tf.cast(parsed_features["feature"], tf.float32)
        feature = tf.sparse_tensor_to_dense(feature, default_value=0)
        label = tf.reshape(parsed_features["label"], [1])
        label = tf.cast(label, tf.float32)
        return feature, label

    tf_train_dataset = tf.data.TFRecordDataset("../xor/xor_train.tfrecord")
    tf_train_dataset = tf_train_dataset.map(_parse_function)
    # cache() returns a new dataset, so the result must be reassigned
    tf_train_dataset = tf_train_dataset.cache()  # caches the entire dataset
    # Setting a buffer_size greater than the number of examples in the Dataset
    # ensures that the data is completely shuffled.
    tf_train_dataset = tf_train_dataset.shuffle(buffer_size=8000 * 2)  # shuffle all the elements
    # The repeat method restarts the Dataset when it reaches the end.
    tf_train_dataset = tf_train_dataset.repeat()  # no argument: repeat indefinitely
    # The batch method collects a number of examples and stacks them to create batches.
    # This adds a new first dimension to their shape.
    # The static batch dimension is unknown because, in general, the last batch can have fewer elements.
    tf_train_dataset = tf_train_dataset.batch(32)  # batch size
    
    iterator = tf_train_dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)

In [None]:
mlp.train(input_fn=my_input_fn, steps=5000)

Evaluation on the test set
------------------------------

The XOR dataset also has a test set that can be used to estimate the accuracy of the MLP. Here we use the `evaluate()` method of the TensorFlow Estimator class and we pass an input function that preprocesses the dataset. The result is reported in terms of loss and accuracy.

In [None]:
def my_eval_input_fn():
    def _parse_function(example_proto):
        features = {"feature": tf.VarLenFeature(tf.float32),
                    "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
        parsed_features = tf.parse_single_example(example_proto, features)
        feature = tf.cast(parsed_features["feature"], tf.float32)
        feature = tf.sparse_tensor_to_dense(feature, default_value=0)
        label = tf.reshape(parsed_features["label"], [1])
        label = tf.cast(label, tf.float32)
        return feature, label

    tf_test_dataset = tf.data.TFRecordDataset("../xor/xor_test.tfrecord")
    tf_test_dataset = tf_test_dataset.map(_parse_function)
    # cache() returns a new dataset, so the result must be reassigned
    tf_test_dataset = tf_test_dataset.cache()  # caches the entire dataset
    tf_test_dataset = tf_test_dataset.repeat(1)  # a single pass over the test set
    tf_test_dataset = tf_test_dataset.batch(1)  # batch size
    
    iterator_test = tf_test_dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator_test.get_next()
    return batch_features, batch_labels

In [None]:
mlp.evaluate(input_fn=my_eval_input_fn, steps=2000)

Using the model on custom data
----------------------------------

To use the model on custom data it is possible to use the `predict()` method of the estimator class. Also in this case an input function must be used in order to return the input features. You may need the truth table of the XOR operator:

- 1 XOR 1 = 0
- 1 XOR 0 = 1
- 0 XOR 1 = 1
- 0 XOR 0 = 0

When the input values are equal (same sign) the output is False (zero). When the input values are different, the output is True (one). Here I hardcoded some values that belong to the four quadrants of the XOR plane. The output is printed on the terminal and should be: zero, one, one, zero. You can also try different values and verify the output from the model.
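
The expected outputs for the hardcoded points can be checked with a few lines of plain Python. This sketch assumes, as described above, that the label depends only on whether the two coordinates have the same sign. The helper `xor_label` is hypothetical and is not part of the dataset code.

```python
def xor_label(x1, x2):
    """Return 1 when the two coordinates have different signs, else 0."""
    return int((x1 > 0) != (x2 > 0))

# One point for each quadrant of the XOR plane
points = [(3.5, 2.9), (3.5, -2.9), (-3.5, 2.9), (-3.5, -2.9)]
expected = [xor_label(x1, x2) for x1, x2 in points]
print(expected)  # [0, 1, 1, 0]
```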

In [None]:
def my_predict_input_fn():
    # One point for each quadrant of the XOR plane
    feature_batch = tf.constant([[3.5, 2.9], [3.5, -2.9], [-3.5, 2.9], [-3.5, -2.9]])

    tf_predict_dataset = tf.data.Dataset.from_tensor_slices(feature_batch)
    tf_predict_dataset = tf_predict_dataset.repeat(1)
    
    iterator_predict = tf_predict_dataset.make_one_shot_iterator()
    batch_features = iterator_predict.get_next()
    return batch_features

In [None]:
predictions = mlp.predict(input_fn=my_predict_input_fn)

# print() with a single argument works in both Python 2 and Python 3
for prediction in predictions:
    print("Predicted class: " + str(prediction['classes']))
    print("Probabilities: " + str(prediction['probabilities']))
    print("")

Improving the performance
-----------------------------

You can further improve the performance by changing the **structure of the network**. For instance, you can increase the number of hidden units. Be careful, though: when the number of units is too large you may run into [overfitting](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) problems. Another important factor is the **learning rate**. In this tutorial we used Adam, an adaptive gradient method that adjusts the step size of each parameter during training. In their [article](https://arxiv.org/pdf/1412.6980.pdf), Kingma and Ba (2015) report that Adam compares favorably with other methods, reaching a lower loss in the same number of iterations. A comparison of different adaptive gradient methods on the MNIST dataset shows that Adam performs best among them.

<p align="center">
<img src="../etc/img/mlp_optimizers.png" width="300">
</p>

Although the Adam optimizer seems to be the best choice in many classification problems, you can also try different optimizers. For instance, you can train an MLP on the MNIST dataset and verify the results of Kingma and Ba (2015) on different adaptive methods.
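
To see what an adaptive method does at each step, here is a minimal sketch of the Adam update for a single scalar parameter, following the update rule given in the Kingma and Ba paper. The function name `adam_minimize`, the toy objective and the step size of 0.05 are assumptions made for this illustration. `tf.train.AdamOptimizer` applies the same idea to every weight of the network, with small implementation differences.

```python
import math

def adam_minimize(grad_fn, theta, steps, alpha=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient grad_fn."""
    m, v = 0.0, 0.0  # running averages of the gradient and squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The step is scaled by the recent gradient magnitude
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = (theta - 3)^2, with gradient 2 * (theta - 3)
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0,
                      steps=2000, alpha=0.05)
print("theta after training: " + str(theta))  # close to the minimum at 3
```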

Conclusions
-------------

In this tutorial we saw how the MLP works and how to train it on a simple non-linear dataset. A good exercise is to train the MLP on a different dataset. A suitable candidate is the [Iris Flower](../iris/iris.ipynb) dataset, available in this repository. To train the MLP on a new dataset you may need to change the number of units in the input and output layers, and the loss function. You can follow the [Quiz notebook](./mlp_quiz.ipynb) to implement an MLP for the Iris Flower dataset. Good luck!

**Copyright (c)** 2018 Massimiliano Patacchiola, MIT License