## Feed-forward Neural Networks
In this notebook we build a system that recognizes handwritten digits from the MNIST dataset using feed-forward networks (FFNs), also known as fully connected networks. The architecture of an FFN with one hidden layer is illustrated below.

<img src="./assets/ffn_1h.png" width="60%">

where, in mathematical terms,
* $W_{ih}\in\mathbb{R}^{784\times H}$ are the input-to-hidden weights, with $784 = 28\times 28$ input pixels and $H$ hidden units
* $b_{h}\in\mathbb{R}^{H}$ are the biases of the hidden layer
* $W_{ho}\in\mathbb{R}^{H\times 10}$ are the hidden-to-output weights
* $b_{o}\in\mathbb{R}^{10}$ are the biases of the output layer
* $a_h$ is the activation function of the hidden layer, applied element-wise
* given an input $x$ (represented as a row vector), the hidden layer $h$ and the output layer $o$ are computed by the following equations
    $$\left\{\begin{split} 
    h &= a_h\left(x\times W_{ih} + b_h\right)\\
    o &= \mathrm{softmax}\left(h\times W_{ho} + b_o\right)
    \end{split}
    \right.$$
   where 
   $$
   \mathrm{softmax}\left(v_1,\ldots,v_n\right) = \left(\frac{e^{v_1}}{\sum_{i=1}^n e^{v_i}},\ldots,\frac{e^{v_n}}{\sum_{i=1}^n e^{v_i}}\right)
   $$
* given the label $y$, the cross-entropy loss is defined as
$$
-\log(o_{y})
$$
Since $o_y\in(0,1)$, the loss is positive and approaches 0 as $o_y$ approaches 1, so minimizing the loss pushes the probability $o_y$ assigned to the correct class toward 1.
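
The equations above can be sketched in plain NumPy. This is a minimal illustration, not the implementation used later in the notebook: the function names `forward` and `cross_entropy`, the small hidden size `H = 4`, and the random inputs are assumptions made for the example, and subtracting the row maximum inside `softmax` is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(v):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = v - v.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W_ih, b_h, W_ho, b_o, a_h=np.tanh):
    """Compute h = a_h(x W_ih + b_h) and o = softmax(h W_ho + b_o)."""
    h = a_h(x @ W_ih + b_h)
    o = softmax(h @ W_ho + b_o)
    return h, o

def cross_entropy(o, y):
    """Mean of -log(o_y) over the batch; y holds integer class labels."""
    return -np.log(o[np.arange(len(y)), y]).mean()

rng = np.random.default_rng(0)
H = 4                                   # hypothetical hidden size, for illustration only
x = rng.random((3, 784))                # three stand-in "images" as row vectors
W_ih = rng.normal(scale=0.01, size=(784, H))
b_h = np.zeros(H)
W_ho = rng.normal(scale=0.01, size=(H, 10))
b_o = np.zeros(10)
y = np.array([3, 0, 7])                 # made-up labels

h, o = forward(x, W_ih, b_h, W_ho, b_o)
loss = cross_entropy(o, y)
print(o.shape, loss)
```

Each row of `o` sums to 1, so it can be read as a probability distribution over the ten digits.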

The activation function can take one of the following forms:
* $a_h(x) = \frac{1}{1+e^{-x}}$ is called the sigmoid activation
* $a_h(x) = \tanh(x)$ is called the tanh activation
* $a_h(x) = \max(x,0)$ is called the ReLU activation
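
As a quick illustration, the three activations can be written element-wise in NumPy. The function names and the sample input are chosen for this sketch only.

```python
import numpy as np

def sigmoid(x):
    """1 / (1 + e^{-x}); outputs lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent; outputs lie in (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """max(x, 0); negative inputs are clipped to 0."""
    return np.maximum(x, 0.0)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```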

## Implement FFN

We could implement an FFN from scratch, but it is much easier to use TensorFlow. TensorFlow also lets you harness the power of a GPU without even changing your code.

We start by loading some modules:

In [67]:
# import numpy, tensorflow
import numpy as np
import tensorflow as tf
import sys
from tensorflow.examples.tutorials.mnist import input_data
mnist_data = input_data.read_data_sets("MNIST_data/", one_hot=True)

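# Helper that lets the user choose a solver by name.
# Returns a callable that builds the optimizer from `learning_rate`.
def get_solver(solver_name):
    if solver_name == 'sgd':
        return tf.train.GradientDescentOptimizer
    elif solver_name == 'momentum':
        # MomentumOptimizer also requires `momentum`; 0.9 is an assumed value
        return lambda learning_rate: tf.train.MomentumOptimizer(
            learning_rate=learning_rate, momentum=0.9)
    elif solver_name == 'adam':
        return tf.train.AdamOptimizer
    else:
        raise ValueError('solver {} is not tested'.format(solver_name))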

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
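
Because the data is loaded with `one_hot=True`, each label is a length-10 vector with a single 1 at the index of the digit. The following NumPy sketch shows what that encoding looks like and how `argmax` recovers the digit; the helper name `one_hot` and the sample labels are assumptions for illustration, not part of the loader.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels into rows with a single 1 at the label index."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([5, 0, 4])          # made-up digit labels
encoded = one_hot(labels)
decoded = encoded.argmax(axis=1)      # argmax inverts the encoding

print(encoded)
print(decoded)
```

With a one-hot label vector $y$, the loss $-\log(o_y)$ can equivalently be written as $-\sum_k y_k \log(o_k)$, since only the term for the correct class is nonzero.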


We use the high-level `tf.layers` package to build our FFN:

In [68]:
class MnistFFn(object):
    """One-hidden-layer feed-forward network for MNIST (TensorFlow 1.x graph API)."""
    def __init__(self, num_hidden = 256, activation = tf.nn.sigmoid):
        self._num_hidden = num_hidden
        self._activation = activation
        self._build_model()
        
    def _build_model(self):
        self._add_placeholder()
        self._build_net()
        
    def _add_placeholder(self):
        with tf.variable_scope('mnist_input'):
            self._x = tf.placeholder(tf.float32, [None, 784], name = 'images')
            # one-hot labels; float32 so that labels and logits share a dtype in the loss
            self._y = tf.placeholder(tf.float32, [None, 10],  name = 'labels')
    
    def _build_net(self):
        xavier_init = tf.contrib.layers.xavier_initializer()
        with tf.variable_scope('input_to_hidden'):
            self._hiddens = tf.layers.dense(inputs = self._x, 
                                            units = self._num_hidden,
                                            activation = self._activation,
                                            kernel_initializer = xavier_init)
        
        with tf.variable_scope('hidden_to_output'):
            
            self._logits  = tf.layers.dense(inputs = self._hiddens,
                                            units = 10,
                                            activation = None,
                                            kernel_initializer = xavier_init)
            
            # the loss op applies softmax internally, so we pass raw logits, not probabilities
            self._loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits = self._logits,
                                                                                 labels = self._y))
        
        with tf.variable_scope('mnist_eval'):
            self._preds = tf.argmax(self._logits, axis=1)
            self._truth = tf.argmax(self._y, axis = 1)
            self._accuracy = tf.reduce_mean(tf.cast(tf.equal(self._preds, self._truth), tf.float32))
            
    def train(self, mnist_data, num_iters, batch_size=64, 
              solver = 'adam', learning_rate=1e-3, print_every=100):
        with tf.Session() as sess:            
            train_op = get_solver(solver)(learning_rate = learning_rate).minimize(self._loss)
            
            # init value
            sess.run(tf.global_variables_initializer())
            for i in range(num_iters):
                batch_xs, batch_ys = mnist_data.train.next_batch(batch_size)
                loss,_ = sess.run([self._loss, train_op], feed_dict={self._x: batch_xs, self._y: batch_ys})
                
                sys.stdout.write("\rIteration ({}/{})".format(i + 1, num_iters)
                                     + "Loss {:.4f} ".format(loss))
                if ((i+1)%print_every == 0 or i+1==num_iters):
                    train_acc = sess.run(self._accuracy, {self._x: mnist_data.train.images,
                                                          self._y: mnist_data.train.labels})
                    acc = sess.run(self._accuracy, {self._x: mnist_data.validation.images,
                                                    self._y: mnist_data.validation.labels})
                    print('\nTrain vs Validation accuracy {:.2f}% vs {:.2f}%\n'.format(100*train_acc, 
                                                                                       100*acc))
            
            # evaluate on test set
            acc = sess.run(self._accuracy, {self._x: mnist_data.test.images,
                                            self._y: mnist_data.test.labels})
            print('\nTest-accuracy is {:.2f}%\n'.format(100*acc))
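
To make the graph built by `_build_net` more concrete, here is a minimal NumPy sketch of the same steps: Xavier (Glorot) uniform initialization, two dense layers, and accuracy from `argmax`. It is an illustration under stated assumptions, not what TensorFlow executes internally: the helper names `xavier_uniform`, `dense`, and `accuracy` are hypothetical, the batch is random stand-in data rather than MNIST, and the limit $\sqrt{6/(\text{fan\_in}+\text{fan\_out})}$ and zero-initialized biases reflect our understanding of the defaults of `xavier_initializer` and `tf.layers.dense`.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample weights uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def dense(x, W, b, activation=None):
    """Affine map x W + b followed by an optional element-wise activation."""
    z = x @ W + b
    return z if activation is None else activation(z)

def accuracy(logits, one_hot_labels):
    """Fraction of rows whose argmax prediction matches the one-hot label."""
    preds = logits.argmax(axis=1)
    truth = one_hot_labels.argmax(axis=1)
    return float((preds == truth).mean())

rng = np.random.default_rng(0)
num_hidden = 256
W_ih, b_h = xavier_uniform(784, num_hidden, rng), np.zeros(num_hidden)
W_ho, b_o = xavier_uniform(num_hidden, 10, rng), np.zeros(10)

x = rng.random((8, 784))                        # stand-in batch, not real MNIST images
y = np.eye(10)[rng.integers(0, 10, size=8)]     # random one-hot labels

hiddens = dense(x, W_ih, b_h, activation=lambda z: np.maximum(z, 0.0))  # ReLU
logits = dense(hiddens, W_ho, b_o)              # no activation on the output layer
print(logits.shape, accuracy(logits, y))
```

Predictions are taken from the logits directly because softmax is monotonic: the largest logit is also the largest probability, so applying softmax before `argmax` would not change the result.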

Let's try our FFN on the MNIST data:

In [66]:
tf.reset_default_graph()
ffn_model = MnistFFn(num_hidden=256, activation=tf.nn.relu)

num_iters = 5000
batch_size = 256

ffn_model.train(mnist_data, num_iters, batch_size = batch_size, print_every=500)

Iteration (500/5000)Loss 0.1185 
Train vs Validation accuracy 96.35% vs 96.26%

Iteration (1000/5000)Loss 0.0607 
Train vs Validation accuracy 98.03% vs 97.42%

Iteration (1500/5000)Loss 0.0361 
Train vs Validation accuracy 98.79% vs 97.64%

Iteration (2000/5000)Loss 0.0292 
Train vs Validation accuracy 99.36% vs 98.02%

Iteration (2500/5000)Loss 0.0196 
Train vs Validation accuracy 99.66% vs 97.76%

Iteration (3000/5000)Loss 0.0084 
Train vs Validation accuracy 99.80% vs 97.78%

Iteration (3500/5000)Loss 0.0178 
Train vs Validation accuracy 99.91% vs 98.06%

Iteration (4000/5000)Loss 0.0033 
Train vs Validation accuracy 99.97% vs 97.98%

Iteration (4500/5000)Loss 0.0024 
Train vs Validation accuracy 99.98% vs 98.18%

Iteration (5000/5000)Loss 0.0034 
Train vs Validation accuracy 99.97% vs 97.88%


Test-accuracy is 97.86%

