<div style="    font-variant: small-caps;
    font-weight: normal;
    font-size: 30px;
    text-align: center;
    padding: 15px;
margin: 10px;">Convolutional Neural Network (CNN) for Handwritten Digits Recognition</div>
<div style="    font-variant: small-caps;
    font-weight: normal;
    font-size: 20px;
    text-align: center;
    padding: 15px;">Deep Learning</div>
<div style="  float:right;
    font-size: 12px;
    line-height: 12px;
padding: 10px 15px 8px;">Luca BENEDETTO | Alberto IBARRONDO</div>

<div style=" display: inline-block; font-family: 'Lato', sans-serif; font-size: 12px; font-weight: bold; line-height: 12px; letter-spacing: 1px; padding: 10px 15px 8px; ">29/05/2017</div>

# Summary

In the last notebook, we built a Multilayer Perceptron for recognizing handwritten digits from the MNIST dataset. The best accuracy achieved on the test data was about 97%, but modern implementations of Convolutional Neural Networks should surpass that mark.

In this notebook, we will build, train and optimize in TensorFlow one of the early Convolutional Neural Networks, **LeNet-5**, and push it beyond 99% accuracy. 






# 1. A first Neural Network model in TensorFlow

## 1.1 Import Modules & Load MNIST Data in TensorFlow

In [117]:
# __future__ imports must be the first statement of the cell
from __future__ import print_function
import time

import numpy as np
from numpy import array
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.contrib.layers import flatten

In [190]:
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
X_train, y_train           = mnist.train.images, mnist.train.labels
X_validation, y_validation = mnist.validation.images, mnist.validation.labels
X_test, y_test             = mnist.test.images, mnist.test.labels
print("Image Shape: {}".format(X_train[0].shape[0]))
print("Training Set:   {} samples".format(len(X_train)))
print("Validation Set: {} samples".format(len(X_validation)))
print("Test Set:       {} samples".format(len(X_test)))

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Image Shape: 784
Training Set:   55000 samples
Validation Set: 5000 samples
Test Set:       10000 samples


Before starting with CNNs, let's train and test a simple softmax regression model in TensorFlow:
**y = softmax(Wx + b)** 

This model should reach an accuracy of about 92%.
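
To make the formula concrete, here is a minimal pure-Python sketch of the softmax and of the cross-entropy loss used in the next cell. The toy scores are hypothetical and stand in for `Wx + b`; the helper names are ours, not part of TensorFlow.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Cross entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0)

# Hypothetical scores Wx + b for a 3-class toy problem, true class = 0
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, [1, 0, 0])

# With all-zero weights and biases every class gets probability 1/10,
# so the loss of an untrained 10-class model is ln(10)
uniform_loss = cross_entropy(softmax([0.0] * 10), [1] + [0] * 9)
print(probs, loss, uniform_loss)
```

The uniform case explains why a 10-class model that has not learned anything yet reports a loss close to 2.30.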

## 1.2 Coding the Graph and Training 

In [2]:
#GRAPH DEFINITION

# Parameters
learning_rate = 0.01
training_epochs = 100
batch_size = 128
display_step = 1
logs_path = 'log_files/'  # useful for tensorboard

# tf Graph Input:  mnist data image of shape 28*28=784
x = tf.placeholder(tf.float32, [None, 784], name='InputData')
# 0-9 digits recognition,  10 classes
y = tf.placeholder(tf.float32, [None, 10], name='LabelData')

# Set model weights
W = tf.Variable(tf.zeros([784, 10]), name='Weights')
b = tf.Variable(tf.zeros([10]), name='Bias')

# Construct the model, encapsulating all ops into scopes to make
# TensorBoard's graph visualization more convenient
with tf.name_scope('Model'):
    # Model
    pred = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax
with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
with tf.name_scope('SGD'):
    # Gradient Descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    acc = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
# Create a summary to monitor cost tensor
tf.summary.scalar("TrainingLoss", cost)
# Create a summary to monitor accuracy tensor
tf.summary.scalar("TrainingAccuracy", acc)
# Merge all summaries into a single op
merged_summary_op = tf.summary.merge_all()


#TRAINING


# Launch the graph for training
with tf.Session() as sess:
    sess.run(init)
    # op to write logs to Tensorboard
    summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                     feed_dict={x: batch_xs, y: batch_ys})
            # Write logs at every iteration
            summary_writer.add_summary(summary, epoch * total_batch + i)
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", 
                  "{:.9f}".format(avg_cost))

    print("Optimization Finished!")

    # Test model
    # Calculate accuracy
    print("Accuracy:", acc.eval({x: mnist.test.images, y: mnist.test.labels}))




Epoch:  01   =====> Loss= 1.286880517
Epoch:  02   =====> Loss= 0.732314337
Epoch:  03   =====> Loss= 0.600182705
Epoch:  04   =====> Loss= 0.536474698
Epoch:  05   =====> Loss= 0.497743477
Epoch:  06   =====> Loss= 0.471044735
Epoch:  07   =====> Loss= 0.451285015
Epoch:  08   =====> Loss= 0.435486047
Epoch:  09   =====> Loss= 0.423485379
Epoch:  10   =====> Loss= 0.413127333
Epoch:  11   =====> Loss= 0.404424905
Epoch:  12   =====> Loss= 0.396875077
Epoch:  13   =====> Loss= 0.390122826
Epoch:  14   =====> Loss= 0.384438272
Epoch:  15   =====> Loss= 0.379063723
Epoch:  16   =====> Loss= 0.374401041
Epoch:  17   =====> Loss= 0.370427270
Epoch:  18   =====> Loss= 0.366561572
Epoch:  19   =====> Loss= 0.362732460
Epoch:  20   =====> Loss= 0.359670188
Epoch:  21   =====> Loss= 0.356530713
Epoch:  22   =====> Loss= 0.353755475
Epoch:  23   =====> Loss= 0.351379176
Epoch:  24   =====> Loss= 0.348841268
Epoch:  25   =====> Loss= 0.346291858
Epoch:  26   =====> Loss= 0.344263495
Epoch:  27  

## 1.3 Visualization with Tensorboard

Using [TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard), we can now visualize the created graph, which gives an overview of the architecture and of how the major components are connected. We can also visualize and analyse the learning curves. 

To launch TensorBoard, we follow these steps: 
- Open a terminal and run the command **"tensorboard --logdir=log_files/"**. It prints an HTTP link, e.g. http://666.6.6.6:6006.
- Copy this link into a web browser.
- Display the images!

<img src="MNIST_99_Challenge_Figures/Screenshot from 2017-05-26 17:53:17.png" width="800" height="600" align="center">
<center><span>Figure 1: TensorBoard visualization </span></center>


# 2. The 99% MNIST Challenge using CNNs

## 2.1 LeNet5 Implementation

Now that we are familiar with **TensorFlow** and **TensorBoard**, we are going to build, train and test the baseline [LeNet-5](http://yann.lecun.com/exdb/lenet/) model for the MNIST digit recognition problem.  

Next, we will make some optimizations to surpass 99% accuracy. The best model so far has achieved over 99.7% accuracy ([List of Results](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html)).


<img src="lenet.png" width="800" height="600" align="center">
<center><span>Figure 2: LeNet-5 </span></center>





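The original LeNet architecture accepts a 32x32xC image as input, where C is the number of color channels. Since MNIST images are grayscale, C is 1 in this case. The MNIST images used here are 28x28, so the first convolution uses 'SAME' padding, which adds a 2-pixel zero border on each side and still yields the 28x28x6 output listed below.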

--------------------------
1. **Layer 1: Convolutional.** The output shape should be 28x28x6. **Activation.** sigmoid. **Pooling.** The output shape should be 14x14x6.

2. **Layer 2: Convolutional.** The output shape should be 10x10x16. **Activation.** sigmoid. **Pooling.** The output shape should be 5x5x16.

3. **Flatten.** Flatten the output of the final pooling layer so that it is 1D instead of 3D. You may need **flatten**, imported with `from tensorflow.contrib.layers import flatten`.

4. **Layer 3: Fully Connected.** This should have 120 outputs. **Activation.** sigmoid

5. **Layer 4: Fully Connected.** This should have 84 outputs. **Activation.** sigmoid

6. **Layer 5: Fully Connected.** This should have 10 outputs. **Activation.** softmax
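
The shapes in the list above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes TensorFlow's usual output-size rules for 'SAME' and 'VALID' padding; `output_size` is a hypothetical helper, not a TensorFlow function.

```python
def output_size(size, kernel, stride=1, padding='VALID'):
    """Spatial output size of a convolution or pooling window along one dimension."""
    if padding == 'SAME':
        return -(-size // stride)             # ceil(size / stride)
    return (size - kernel) // stride + 1      # 'VALID': the window must fit entirely

conv1 = output_size(28, kernel=5, padding='SAME')   # 28x28 MNIST input
pool1 = output_size(conv1, kernel=2, stride=2)
conv2 = output_size(pool1, kernel=5)                # 'VALID' convolution
pool2 = output_size(conv2, kernel=2, stride=2)
flat = pool2 * pool2 * 16                           # 16 feature maps after layer 2

print(conv1, pool1, conv2, pool2, flat)
```

The flattened size is the number of inputs that the first fully connected layer has to accept.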


### 2.1.1 LeNet5 model Implementation [Question 2.1.1]
The implementation draws classes and functions from the [Tensorflow API](https://www.tensorflow.org/api_docs/python/tf/nn). 


In [317]:
# LeNet5 variables init 
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.01)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
These functions are used for weight and bias initialization. The standard deviation of the weights can be tuned if we observe any strange behaviour in the CNN.
</div>

In [304]:
# LeNet5 convolutional and max pool layers
def conv2d(x, W,pad='SAME'):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding=pad)

def max_pool_2x2(x,pad='VALID'):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding=pad)


<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
With a stride of 1x1, the convolution yields a feature map of the same spatial size as its input when used with padding 'SAME', which zero-pads the borders so that the height and width are preserved. <br/><br/>
With a 2x2 window and a stride of 2x2, the max pool halves the height and width of the feature map. Padding 'VALID' (equivalent to no padding) is enough here because the input sizes are even.
</div>
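
As an illustration of how much border 'SAME' padding adds, the sketch below follows the padding rule described in the TensorFlow documentation for one spatial dimension. `same_padding` is a hypothetical helper written for this note.

```python
def same_padding(size, kernel, stride=1):
    """Return (pad_before, pad_after) added along one dimension by 'SAME' padding."""
    out = -(-size // stride)                              # ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)    # total zeros needed
    return total // 2, total - total // 2                 # extra zero goes at the end

conv_pad = same_padding(28, kernel=5)            # 5x5 convolution, stride 1
padded_size = 28 + sum(conv_pad)                 # size actually seen by the kernel
pool_pad = same_padding(28, kernel=2, stride=2)  # 2x2 max pool, stride 2

print(conv_pad, padded_size, pool_pad)
```

For the 5x5 convolution the padded image is 32 pixels wide, which matches the input size of the original LeNet. For the 2x2 pooling on an even-sized input no padding is needed, so 'SAME' and 'VALID' give the same result there.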

In [338]:
def LeNet5_Model(data, keep_pr=1, activFunc=tf.nn.sigmoid):
    input_data = tf.reshape(data,[-1,28,28,1])
    
    # -------------------------------------------------
    # --------------------VARIABLES--------------------
    # -------------------------------------------------
    # Convolutional layer 1 variables
    W_conv1 = weight_variable([5,5,1,6])
    b_conv1 = bias_variable([6])
    
    # Convolutional layer 2 variables
    W_conv2 = weight_variable([5,5,6,16])
    b_conv2 = bias_variable([16])
    
    # Fully connected layer 1 param
    W_fc1 = weight_variable([400, 120])
    b_fc1 = bias_variable([120])
    
    # Fully connected layer 2 param
    W_fc2 = weight_variable([120, 84])
    b_fc2 = bias_variable([84])
    
    # Fully connected layer 3 param
    W_fc3 = weight_variable([84, 10])
    b_fc3 = bias_variable([10])
    
    
    # ----------------------------------------------------
    # --------------------COMPUTATIONS--------------------
    # ----------------------------------------------------
    # Convolutional layer 1 & max pooling
    h_conv1 = activFunc(conv2d(input_data,W_conv1)+ b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)

    # Convolutional layer 2 & max pooling
    h_conv2 = activFunc(conv2d(h_pool1,W_conv2, pad='VALID')+ b_conv2)
    h_pool2 = max_pool_2x2(h_conv2)   

    # Flattening
    h_pool2_flat = tf.contrib.layers.flatten(h_pool2)

    # Fully connected layer 1, sigmoid activation
    h_fc1 = activFunc(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
    drop_fc1 = tf.nn.dropout(h_fc1, keep_pr)
    
    # Fully connected layer 2, sigmoid activation
    h_fc2 = activFunc(tf.matmul(drop_fc1, W_fc2) + b_fc2)
    drop_fc2 = tf.nn.dropout(h_fc2, keep_pr)
    
    # Fully connected layer 3, softmax activation
    predicted = tf.nn.softmax(tf.matmul(drop_fc2, W_fc3) + b_fc3)
    
    return predicted


<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
First we define all the variables, then we compute all the activations using the functions defined above. We included dropout after the first two fully connected layers, but it is switched off by default (keep_pr=1).
</div>

### 2.1.2 Number of parameters in LeNet5 [Question 2.1.2]

In [306]:
# Each layer has one weight per kernel entry (or per connection)
# and one bias per output channel (or per output unit)
NParameters_LeNet5 = \
    (5*5*1*6 + 6) + \
    (5*5*6*16 + 16) + \
    400*120 + \
    120 + \
    120*84 + \
    84 + \
    84*10 + \
    10


print('Number of parameters in LeNet5: %d'%NParameters_LeNet5)

Number of parameters in LeNet5: 61706


<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
<ul>
<li> Weights and biases for convolutional layer 1 = 5 x 5 x 1 x 6 + 6 = 156 </li>
<li> Weights and biases for convolutional layer 2 = 5 x 5 x 6 x 16 + 16 = 2416</li>
<li> Weights for fully connected layer 1 = 400 x 120</li>
<li> Biases for fully connected layer 1 = 120</li>
<li> Weights for fully connected layer 2 = 120 x 84</li>
<li> Biases for fully connected layer 2 = 84</li>
<li> Weights for fully connected layer 3 = 84 x 10</li>
<li> Biases for fully connected layer 3 = 10</li>
</ul>
TOTAL: 61706
</div>

### 2.1.3 CNNet: Tensorflow graph creation [Question 2.1.3]
The initial training uses the parameters listed below:

     Learning rate = 0.1
     Loss function: Cross entropy
     Optimizer: SGD
     Number of training iterations = 10000
     Batch size = 128

In [339]:
def CNNet ( modelName, 
            learning_rate = 0.1, 
            training_epochs = 100,
            batch_size = 128, 
            display_step = 1,
            keep_p = 1,
            activationFunc = tf.nn.sigmoid,
            optimFunc = tf.train.GradientDescentOptimizer,
            X_train=mnist.train.images, 
            y_train=mnist.train.labels,
            X_val=mnist.validation.images,
            y_val=mnist.validation.labels,
            X_test= mnist.test.images,
            y_test=mnist.test.labels,
            loadModel=None
          ):
    
    # ---------- DESCRIPTION OF DATASET ----------
    InputSize = X_train[0].shape[0]
    OutputSize = y_train[0].shape[0]
    TrainingSetSize = len(X_train)
    ValidationSetSize = len(X_val)  # use the argument, not the global X_validation
    TestSetSize = len(X_test)    
    
    # ---------- OUTPUT FOLDERS ----------
    logsFolder = 'log_files/' # useful for tensorboard
    saveFolder = 'Models/'    # useful to restore the model
    
    # ---------- RESET GRAPH ----------
    tf.reset_default_graph()

    # ---------- DEFINE VARIABLES ----------
    # tf Graph Input:  mnist data image of shape 28*28*1
    x = tf.placeholder(tf.float32, [None,InputSize], name='InputData')
    # 0-9 digits recognition,  10 classes
    y = tf.placeholder(tf.float32, [None,OutputSize], name='LabelData')
    # Dropout
    keep_prob = tf.placeholder(tf.float32, name='DropoutKeepProbability')

    # ---------- DEFINE GRAPH NODES ----------
    with tf.name_scope('Model'):
        # Model
        model = LeNet5_Model(x, keep_prob, activFunc=activationFunc)
    
    with tf.name_scope('Loss'):
        # Minimize error using cross entropy
        cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(model+1e-9), reduction_indices=1))
        #cost = tf.nn.softmax_cross_entropy_with_logits(model, tf.one_hot(y, 10))
    with tf.name_scope('Optimizer'):
        #Optimization, using cost reduction
        optimizer = optimFunc(learning_rate).minimize(cost)
        
    with tf.name_scope('Accuracy'):
        # Accuracies
        acc = CNNetAccuracy(model, y)
  
        
    # ---------- INITIALIZE VARIABLES ----------    
    init = tf.global_variables_initializer()


    # ---------- TRACK BATCH LOSS AND ACCURACY ----------
    # Create a summary to monitor cost tensor
    tf.summary.scalar("BatchLoss", cost)
    # Create a summary to monitor batch accuracy tensor
    tf.summary.scalar("BatchAccuracy", acc)
    # Merge all summaries into a single op
    merged_summary_op = tf.summary.merge_all()
    
    # ---------- TRAIN MODEL ----------
    CNNtrain(model, cost, optimizer, acc,
             x, y, keep_prob, TrainingSetSize,
             X_train, y_train, X_val, y_val, X_test, y_test,
             init, merged_summary_op,
             modelName, saveFolder,logsFolder,
             learning_rate, training_epochs, batch_size, display_step , keep_p,
             loadModel
            )
    

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
The CNNet function creates the whole graph from scratch (nodes, variables, summaries) and calls the function CNNtrain, which is in charge of creating a session and running the graph.
</div>

### 2.1.4 CNNet: Accuracy [Question 2.1.4]  
Here we implement the evaluation function for accuracy computation: 

In [308]:
def CNNetAccuracy(model, y):
    accuracy = tf.reduce_mean(
               tf.cast( tf.equal( tf.argmax(model, 1),
                                  tf.argmax(y, 1)), tf.float32))
    return accuracy

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
We have opted to define the computation the way it must be set up inside a TensorFlow graph. To execute it and obtain a value, however, we need to create the rest of the graph and use a session to perform the computation (it is called in the training function).
</div>
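
Outside of a graph, the same computation can be sketched in plain Python. The predictions and labels below are hypothetical toy values; `argmax` and `accuracy` are helpers written for this note and mirror what `tf.argmax`, `tf.equal` and `tf.reduce_mean` do in `CNNetAccuracy`.

```python
def argmax(values):
    """Index of the largest entry (first one in case of ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predictions, one_hot_labels):
    """Fraction of samples whose most probable class matches the one-hot label."""
    hits = [argmax(p) == argmax(t) for p, t in zip(predictions, one_hot_labels)]
    return sum(hits) / float(len(hits))

# Three hypothetical samples with three classes each
toy_predictions = [[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4]]
toy_labels = [[0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]]

toy_accuracy = accuracy(toy_predictions, toy_labels)
print(toy_accuracy)
```

The first and third samples are classified correctly and the second is not, so two out of three predictions count as hits.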

### 2.1.5 CNNet Training [Question 2.1.5]
Here we implement the training pipeline and run the training data through it to train the model. Other steps to consider are:

- Shuffling the training set before each epoch.
- Printing the loss per mini-batch and the training/validation accuracy per epoch (displaying results every 100 epochs).
- Saving the model after training.
- Printing the final testing accuracy after training.
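
The shuffling and batching steps can be sketched independently of TensorFlow. This is a minimal illustration on hypothetical toy data, not the notebook's training code; `minibatches` is a helper written for this note.

```python
import random

def minibatches(samples, labels, batch_size, rng):
    """Yield (samples, labels) mini-batches in a freshly shuffled order."""
    order = list(range(len(samples)))
    rng.shuffle(order)                       # new permutation on every call
    for start in range(0, len(order), batch_size):
        picked = order[start:start + batch_size]
        yield [samples[i] for i in picked], [labels[i] for i in picked]

rng = random.Random(0)                       # fixed seed, for a repeatable demo
toy_x = list(range(10))                      # hypothetical stand-in for the images
toy_y = [v % 2 for v in toy_x]               # hypothetical stand-in for the labels

for epoch in range(2):
    seen, batch_sizes = [], []
    for batch_x, batch_y in minibatches(toy_x, toy_y, 4, rng):
        seen.extend(batch_x)
        batch_sizes.append(len(batch_x))
    print("epoch", epoch + 1, "batch sizes", batch_sizes)
```

Unlike a loop over `int(num_examples / batch_size)` batches, this sketch also yields the last, smaller batch, so every sample is visited exactly once per epoch.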



In [340]:
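# NOTE: tf.SomeReader, tf.some_decoder and some_processing are placeholders
# from the TensorFlow "Reading data" guide, not real API calls. They must be
# replaced with a concrete reader and decoder before this function can run.
def read_my_file_format(filename_queue):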
    reader = tf.SomeReader()
    key, record_string = reader.read(filename_queue)
    example, label = tf.some_decoder(record_string)
    processed_example = some_processing(example)
    return processed_example, label

def input_pipeline(filenames, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
      filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    # min_after_dequeue defines how big a buffer we will randomly sample
    #   from -- bigger means better shuffling but slower start up and more
    #   memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    #   determines the maximum we will prefetch.  Recommendation:
    #   min_after_dequeue + (num_threads + a small safety margin) * batch_size
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
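These pipeline functions are taken from the TensorFlow guide on reading data: https://www.tensorflow.org/programmers_guide/reading_data. The reader, decoder and processing calls in them are placeholders to be filled in for a specific file format.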
</div>

In [336]:
def CNNtrain(model, cost, optimizer, acc,
             x, y, keep_prob, TrainingSetSize,
             X_train, y_train, X_val, y_val, X_test, y_test,
             init, merged_summary_op,
             modelName, saveFolder, logsFolder,
             learning_rate, training_epochs, batch_size, display_step, keep_p,
             loadModel, dataFileName=None
            ):
    
    # Initial model print
    print("*Model [", modelName,"] {l_r: %.4f; n_iter: %d; batch: %d}"%\
          (learning_rate, training_epochs, batch_size))
    
    # Start a tensorflow session
    with tf.Session() as sess:
        print ("   Start Training!")
        sess.run(init)
        
        # Load model if the parameter loadModel is not empty
        saver = tf.train.Saver()
        if(loadModel):
            saver.restore(sess=sess,save_path='Models/'+loadModel)
        
        # op to write logs to Tensorboard
        summary_writer = tf.summary.FileWriter(logsFolder,
                                               graph=tf.get_default_graph())
        
        # Training cycle
        t0 = time.time()
        for epoch in range(training_epochs):
            avg_cost = 0.
            total_batch = int(mnist.train.num_examples/batch_size)
            # Loop over all batches
            for i in range(total_batch):
                
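                # Fetch the next mini-batch as NumPy arrays, as feed_dict
                # requires; the dataset object reshuffles after every epoch
                batch_xs, batch_ys = mnist.train.next_batch(batch_size)
                # Queue-based alternative, kept for reference (it returns
                # tensors, which cannot be passed through feed_dict):
                # batch_xs, batch_ys = input_pipeline(filenames=dataFileName,
                #                                     batch_size=batch_size,
                #                                     num_epochs=total_batch)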
                # Run optimization op (backprop), cost op (to get loss value)
                # and summary nodes
                _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                          feed_dict={x: batch_xs,
                                                     y: batch_ys,
                                                     keep_prob: keep_p})
                # Write logs at every iteration
                summary_writer.add_summary(summary, epoch * total_batch + i)
                # Compute average loss
                avg_cost += c / total_batch

            # Display logs per epoch step
            if (epoch+1) % display_step == 0:
                tr_acc = acc.eval({x: X_train, y: y_train, keep_prob: 1})
                vl_acc = acc.eval({x: X_val, y: y_val, keep_prob: 1})
                print("   Epoch: %02d | Loss=%.9f | TrainAcc=%.3f %% | ValAcc=%.3f %%"% 
                      (epoch+1, avg_cost, tr_acc*100, vl_acc*100));
        
        
        print ("   Training Finished in %.1f seconds."%(time.time()-t0))
        
        # Evaluating model with the accuracies
        print ("   Final accuracies:")
        print ("   ~ TrainAcc: %.3f %%"%(100*acc.eval({x: X_train, y: y_train, keep_prob: 1})))
        print ("   ~ ValAcc: %.3f %%"%(100*acc.eval({x: X_val, y: y_val, keep_prob: 1})))
        print ("   ~ TestAcc: %.3f %%"%(100*acc.eval({x: X_test, y: y_test, keep_prob: 1})))
        
        # Saving Model
        saver.save(sess=sess,save_path=saveFolder+modelName)
        print ("   Saving model in file: %s"%(saveFolder+modelName))
        

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
This function creates the session, feeds mini-batches of data to the graph and runs it to obtain results.<br/><br/>
The model is evaluated at the end, and we have added the possibility of loading a previous model as well as saving the trained one. The training and validation accuracies are computed at every epoch, while the test accuracy is only computed at the end.<br/><br/>
Both ways of fetching batches are kept in the code, one of them commented out: the next_batch function of the MNIST dataset object and the queue-based input pipeline.
</div>

In [299]:
# Training our first model!
CNNet ('lenet5-model', 
           learning_rate = 0.1, 
           training_epochs = 100,
           batch_size = 128,
      )

*Model [ lenet5-model ] {l_r: 0.1000; n_iter: 100; batch: 128}
   Start Training!
   Epoch: 01 | Loss=2.306631047 | TrainAcc=10.391 %| ValAcc=11.000 %
   Epoch: 02 | Loss=2.305539821 | TrainAcc=10.251 %| ValAcc=9.860 %
   Epoch: 03 | Loss=2.304715242 | TrainAcc=11.235 %| ValAcc=11.260 %
   Epoch: 04 | Loss=2.304092920 | TrainAcc=9.945 %| ValAcc=9.760 %
   Epoch: 05 | Loss=2.303851811 | TrainAcc=10.391 %| ValAcc=11.000 %
   Epoch: 06 | Loss=2.302999963 | TrainAcc=10.391 %| ValAcc=11.000 %
   Epoch: 07 | Loss=2.301510114 | TrainAcc=9.945 %| ValAcc=9.760 %
   Epoch: 08 | Loss=2.298957096 | TrainAcc=9.945 %| ValAcc=9.760 %
   Epoch: 09 | Loss=2.290840972 | TrainAcc=23.744 %| ValAcc=23.080 %
   Epoch: 10 | Loss=2.217125939 | TrainAcc=45.135 %| ValAcc=45.980 %
   Epoch: 11 | Loss=1.557864240 | TrainAcc=63.400 %| ValAcc=63.560 %
   Epoch: 12 | Loss=0.869937158 | TrainAcc=79.955 %| ValAcc=80.900 %
   Epoch: 13 | Loss=0.587728509 | TrainAcc=85.653 %| ValAcc=86.280 %
   Epoch: 14 | Loss=0.442426

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
The first accuracy obtained is 98.76%, very close to the 99% objective!
</div>

### 2.1.6 Visualization of results with Tensorboard [Question 2.1.6]
We use TensorBoard to visualise and save the LeNet5 graph and all learning curves. 
The data is then exported to CSV using the TensorBoard GUI and plotted using Excel. The resulting figures are:

<img src="MNIST_99_Challenge_Figures/LeNet5_graph.png" width="800" height="600" align="center">
<center><span>Figure 3: LeNet5 Graph </span></center>
<img src="MNIST_99_Challenge_Figures/TrainingLeNet5.png" width="800" height="600" align="center">
<center><span>Figure 4: LeNet5 Training </span></center>

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
Training starts with a plateau while the optimizer searches for a steep region of the loss surface. Around epochs 8-9 it starts converging fast, and it fully stabilizes after epoch 50.
</div>

## 2.2 LeNet5 Optimization


### 2.2.1 Parameter Tuning [Question 2.2.1]  

We change the sigmoid function to a ReLU and perform the following steps:

- Retrain the network with SGD and AdamOptimizer, and compare them using the best parameters:


| Optimizer            | Gradient Descent | AdamOptimizer |
| :------------------- | :--------------: | ------------: |
| Validation Accuracy  |     98.760 %     |      99.080 % |
| Testing Accuracy     |     98.670 %     |      99.110 % |
| Training Time        |      8048 s      |        2670 s |


- Try different learning rates for each optimizer (0.0001 and 0.001) and different batch sizes (50 and 128) for 10000 epochs. 

- For each optimizer, plot (on the same curve) the **testing accuracies** as a function of **(learning rate, batch size)**.


- Did you reach 99% accuracy? What are the optimal parameters that gave you the best results? 
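
The combinations requested above form a small grid. A minimal sketch of how such a grid could be generated instead of typed by hand is shown here; the dictionary keys and the naming scheme are hypothetical and differ from the model names used in the cells below.

```python
from itertools import product

learning_rates = [0.001, 0.0001]
batch_sizes = [50, 128]

# One configuration per (learning rate, batch size) pair
grid = [{'learning_rate': lr,
         'batch_size': bs,
         'model_name': 'lenet5_lr%g_bs%d' % (lr, bs)}   # hypothetical naming scheme
        for lr, bs in product(learning_rates, batch_sizes)]

for config in grid:
    print(config['model_name'], config['learning_rate'], config['batch_size'])
```

Each dictionary could then be unpacked into a training call, one run per configuration.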








<img src="MNIST_99_Challenge_Figures/ParameterTuning.png" width="800" height="600" align="center">
<center><span>Figure 5: Parameter Tuning </span></center>

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
Clearly the Adam optimizer performs better than SGD, and in much less time. <br/><br/>
We tried several parameter combinations, and the best one we found is <b>AdamOptimizer, lr=0.001, bs=128</b>.<br/><br/>
Also, we achieved the desired accuracy of 99%!
</div>
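
One plausible reason for the gap at small learning rates is that Adam normalizes each update by a running estimate of the gradient magnitude, whereas the SGD step shrinks with the gradient. The sketch below applies one step of each rule to a hypothetical one-dimensional loss `f(x) = 0.01 * x**2`. It follows the update rule from the Adam paper with the default `beta1`, `beta2` and `epsilon` values of `tf.train.AdamOptimizer`, and it is an illustration only, not TensorFlow's implementation.

```python
import math

def sgd_step(x, grad, lr):
    """Plain gradient descent: the step is proportional to the gradient."""
    return x - lr * grad

def adam_step(x, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and both moment estimates."""
    state['t'] += 1
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad          # first moment
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad * grad   # second moment
    m_hat = state['m'] / (1 - beta1 ** state['t'])                # bias correction
    v_hat = state['v'] / (1 - beta2 ** state['t'])
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)

def gradient(x):
    return 2 * 0.01 * x          # derivative of the toy loss 0.01 * x**2

x0, lr = 5.0, 0.001
sgd_move = x0 - sgd_step(x0, gradient(x0), lr)
adam_move = x0 - adam_step(x0, gradient(x0), {'t': 0, 'm': 0.0, 'v': 0.0}, lr)
print(sgd_move, adam_move)
```

On this flat toy loss the first Adam step is roughly the learning rate itself, while the SGD step is ten times smaller because it is scaled by a gradient of 0.1.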

In [297]:
# Trying parameters with SGD
setOfParams = [ [0.001, 250, 50, 'lenet5-model_relu_lr0001_bs50'],
                [0.001, 100, 128, 'lenet5-model_relu_lr0001_bs128'],
                [0.0001, 250, 50, 'lenet5-model_relu_lr00001_bs50'],
                [0.0001, 100, 128, 'lenet5-model_relu_lr00001_bs128'],
    
]
for p in setOfParams:
    CNNet (modelName=p[3], 
           learning_rate = p[0], 
           training_epochs = p[1],
           batch_size = p[2],
           activationFunc=tf.nn.relu,
      )

*Model [ lenet5-model_relu_lr0001_bs50 ] {l_r: 0.0010; n_iter: 250; batch: 50}
   Start Training!
   Epoch: 01 | Loss=2.285461369 | TrainAcc=32.773 %| ValAcc=32.020 %
   Epoch: 02 | Loss=2.132812783 | TrainAcc=59.864 %| ValAcc=59.840 %
   Epoch: 03 | Loss=1.361398856 | TrainAcc=81.147 %| ValAcc=82.020 %
   Epoch: 04 | Loss=0.553005308 | TrainAcc=87.944 %| ValAcc=88.600 %
   Epoch: 05 | Loss=0.384707743 | TrainAcc=89.689 %| ValAcc=90.440 %
   Epoch: 06 | Loss=0.323377239 | TrainAcc=90.982 %| ValAcc=92.000 %
   Epoch: 07 | Loss=0.287973725 | TrainAcc=91.936 %| ValAcc=92.680 %
   Epoch: 08 | Loss=0.263154375 | TrainAcc=92.489 %| ValAcc=93.280 %
   Epoch: 09 | Loss=0.239348642 | TrainAcc=93.165 %| ValAcc=93.920 %
   Epoch: 10 | Loss=0.222846294 | TrainAcc=93.533 %| ValAcc=94.260 %
   Epoch: 11 | Loss=0.211532476 | TrainAcc=93.960 %| ValAcc=94.620 %
   Epoch: 12 | Loss=0.196222519 | TrainAcc=94.267 %| ValAcc=94.860 %
   Epoch: 13 | Loss=0.184968149 | TrainAcc=94.775 %| ValAcc=95.500 %
   Ep

In [298]:
# Trying parameters with Adam Optimizer
setOfParams = [ [0.001, 250, 50, 'lenet5-model_adam_relu_lr0001_bs50'],
                [0.001, 100, 128, 'lenet5-model_adam_relu_lr0001_bs128'],
                [0.0001, 250, 50, 'lenet5-model_adam_relu_lr00001_bs50'],
                [0.0001, 100, 128, 'lenet5-model_adam_relu_lr00001_bs128'],
    
]
for p in setOfParams:
    CNNet (modelName=p[3], 
           learning_rate = p[0], 
           training_epochs = p[1],
           batch_size = p[2],
           activationFunc=tf.nn.relu,
           optimFunc=tf.train.AdamOptimizer
      )

*Model [ lenet5-model_adam_relu_lr0001_bs50 ] {l_r: 0.0010; n_iter: 250; batch: 50}
   Start Training!
   Epoch: 01 | Loss=0.244328993 | TrainAcc=97.695 %| ValAcc=97.900 %
   Epoch: 02 | Loss=0.065261318 | TrainAcc=98.433 %| ValAcc=98.400 %
   Epoch: 03 | Loss=0.045404583 | TrainAcc=98.811 %| ValAcc=98.600 %
   Epoch: 04 | Loss=0.035783383 | TrainAcc=99.156 %| ValAcc=98.740 %
   Epoch: 05 | Loss=0.029785236 | TrainAcc=99.391 %| ValAcc=98.880 %
   Epoch: 06 | Loss=0.023469996 | TrainAcc=99.227 %| ValAcc=98.580 %
   Epoch: 07 | Loss=0.022254807 | TrainAcc=99.578 %| ValAcc=98.780 %
   Epoch: 08 | Loss=0.017690490 | TrainAcc=99.673 %| ValAcc=98.980 %
   Epoch: 09 | Loss=0.013129403 | TrainAcc=99.747 %| ValAcc=99.180 %
   Epoch: 10 | Loss=0.013963227 | TrainAcc=99.555 %| ValAcc=98.620 %
   Epoch: 11 | Loss=0.012480628 | TrainAcc=99.584 %| ValAcc=98.860 %
   Epoch: 12 | Loss=0.012241643 | TrainAcc=99.720 %| ValAcc=98.840 %
   Epoch: 13 | Loss=0.009946326 | TrainAcc=99.849 %| ValAcc=99.180 %


<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
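As seen in the training, the AdamOptimizer produces NaN values at a certain point. A likely cause is that the softmax output saturates to exactly 0 for some class, so the logarithm in the loss becomes infinite. To make the loss numerically stable, we have added a small value (1e-9) inside the logarithm of the loss node in the graph, and we have reduced the standard deviation of the weight initialization.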
</div>
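
The failure mode and two possible remedies can be reproduced in plain Python. This is a sketch on hypothetical values: the first function mirrors the `log(model + 1e-9)` fix, the second computes the loss directly from the raw scores with the log-sum-exp trick, which is the idea behind functions such as `tf.nn.softmax_cross_entropy_with_logits`.

```python
import math

def cross_entropy(probs, one_hot, eps=0.0):
    """-sum(t * log(p + eps)); with eps=0 a probability of exactly 0 breaks the log."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))

def cross_entropy_from_logits(logits, one_hot):
    """Cross entropy computed from raw scores, without forming the probabilities."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return -sum(t * (z - log_norm) for z, t in zip(logits, one_hot))

saturated = [1.0, 0.0, 0.0]     # softmax output that has collapsed onto class 0
label = [0, 1, 0]               # the true class is 1

try:
    cross_entropy(saturated, label)
    log_failed = False
except ValueError:
    log_failed = True           # math.log(0) raises; TensorFlow yields inf/NaN instead

patched = cross_entropy(saturated, label, eps=1e-9)              # finite, about 20.7
stable = cross_entropy_from_logits([50.0, -50.0, -50.0], label)  # finite, 100
print(log_failed, patched, stable)
```

The epsilon fix caps the loss at about 20.7, while the logits-based version keeps the true value without any cap.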

In [320]:
# Trying parameters with Adam Optimizer - AFTER CHANGING THE OPTIMIZER! 
# (Weight init & safe softmax)
setOfParams = [ [0.001, 100, 50, 'lenet5-model_adam_relu_lr0001_bs50'],
                [0.001, 100, 128, 'lenet5-model_adam_relu_lr0001_bs128'],
                [0.0001, 100, 50, 'lenet5-model_adam_relu_lr00001_bs50'],
                [0.0001, 100, 128, 'lenet5-model_adam_relu_lr00001_bs128'],
    
]
for p in setOfParams:
    CNNet (modelName=p[3], 
           learning_rate = p[0], 
           training_epochs = p[1],
           batch_size = p[2],
           activationFunc=tf.nn.relu,
           optimFunc=tf.train.AdamOptimizer
      )

*Model [ lenet5-model_adam_relu_lr0001_bs50 ] {l_r: 0.0010; n_iter: 100; batch: 50}
   Start Training!
   Epoch: 01 | Loss=0.502500334 | TrainAcc=95.858 % | ValAcc=96.600 %
   Epoch: 02 | Loss=0.106422822 | TrainAcc=97.651 % | ValAcc=97.320 %
   Epoch: 03 | Loss=0.073697202 | TrainAcc=98.216 % | ValAcc=98.040 %
   Epoch: 04 | Loss=0.058289272 | TrainAcc=98.545 % | ValAcc=98.040 %
   Epoch: 05 | Loss=0.045475034 | TrainAcc=98.578 % | ValAcc=98.200 %
   Epoch: 06 | Loss=0.040004382 | TrainAcc=99.122 % | ValAcc=98.800 %
   Epoch: 07 | Loss=0.032673832 | TrainAcc=99.055 % | ValAcc=98.660 %
   Epoch: 08 | Loss=0.030387537 | TrainAcc=99.180 % | ValAcc=98.800 %
   Epoch: 09 | Loss=0.024355082 | TrainAcc=99.173 % | ValAcc=98.700 %
   Epoch: 10 | Loss=0.023165833 | TrainAcc=99.229 % | ValAcc=98.820 %
   Epoch: 11 | Loss=0.018626527 | TrainAcc=99.598 % | ValAcc=99.080 %
   Epoch: 12 | Loss=0.018691430 | TrainAcc=99.309 % | ValAcc=98.560 %
   Epoch: 13 | Loss=0.016446427 | TrainAcc=99.685 % | Val

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
Another measure we took was to stop the training when reaching 100% accuracy. Nevertheless, even at 100% accuracy the loss keeps going down, which is why we implemented model loading and retrained the best model for another 30 epochs:
</div>

In [344]:
# Ending the training of the best optimizer
CNNet (modelName='lenet5-model_best', 
       learning_rate = 0.001, 
       training_epochs = 30,
       batch_size = 128,
       activationFunc=tf.nn.relu,
       optimFunc=tf.train.AdamOptimizer, 
       loadModel='lenet5-model_adam_relu_lr0001_bs128'
  )

*Model [ lenet5-model_best ] {l_r: 0.0010; n_iter: 30; batch: 128}
 Start Training!
 Epoch: 01 | Loss=0.000036285 | TrainAcc=100.000 % | ValAcc=99.040 %
 Epoch: 02 | Loss=0.000020620 | TrainAcc=100.000 % | ValAcc=99.060 %
 Epoch: 03 | Loss=0.000013097 | TrainAcc=100.000 % | ValAcc=99.060 %
 Epoch: 04 | Loss=0.000011698 | TrainAcc=100.000 % | ValAcc=99.060 %
 Epoch: 05 | Loss=0.000008156 | TrainAcc=100.000 % | ValAcc=99.040 %
 Epoch: 06 | Loss=0.000006704 | TrainAcc=100.000 % | ValAcc=99.020 %
 Epoch: 07 | Loss=0.000005328 | TrainAcc=100.000 % | ValAcc=99.040 %
 Epoch: 08 | Loss=0.000004981 | TrainAcc=100.000 % | ValAcc=99.040 %
 Epoch: 09 | Loss=0.000003284 | TrainAcc=100.000 % | ValAcc=99.020 %
 Epoch: 10 | Loss=0.000003292 | TrainAcc=100.000 % | ValAcc=99.040 %
 Epoch: 11 | Loss=0.000002527 | TrainAcc=100.000 % | ValAcc=99.040 %
 Epoch: 12 | Loss=0.000001922 | TrainAcc=100.000 % | ValAcc=99.020 %
 Epoch: 13 | Loss=0.000001719 | TrainAcc=100.000 % | ValAcc=99.020 %
 Epoch: 14 | Loss=0

### 2.2.2 Dropout layer [Question 2.2.2]
What about applying a dropout layer to the fully connected layers and then retraining the model with the best optimizer and parameters (learning rate and batch size) obtained in the previous section (probability of keeping units = 0.75)? During evaluation we ensure that the keep probability is set to 1.0, so that the performance of the network is measured with all nodes included.
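
As a reminder of what the keep probability does, here is a minimal pure-Python sketch of dropout with rescaling, the behaviour documented for `tf.nn.dropout`: kept units are scaled by `1 / keep_prob` so that the expected activation is unchanged. The `dropout` helper and the toy activations are ours.

```python
import random

def dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale it by 1 / keep_prob."""
    if keep_prob >= 1.0:
        return list(activations)             # evaluation mode: nothing is dropped
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                       # fixed seed, for a repeatable demo
toy_activations = [1.0] * 10000              # hypothetical layer output

dropped = dropout(toy_activations, 0.75, rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / float(len(dropped))
mean_activation = sum(dropped) / len(dropped)
unchanged = dropout(toy_activations, 1.0, rng)
print(kept_fraction, mean_activation)
```

About three quarters of the units survive, and the mean activation stays close to 1, which is why the same weights can be used with a keep probability of 1.0 at evaluation time.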

In [312]:
CNNet ('lenet5-model', 
           learning_rate = 0.001, 
           training_epochs = 100,
           batch_size = 128,
           activationFunc = tf.nn.relu,
           optimFunc=tf.train.AdamOptimizer,
           keep_p = 0.75
      )

*Model [ lenet5-model ] {l_r: 0.0010; n_iter: 100; batch: 128}
   Start Training!
   Epoch: 01 | Loss=1.462579143 | TrainAcc=86.182 % | ValAcc=86.920 %
   Epoch: 02 | Loss=0.320648048 | TrainAcc=95.773 % | ValAcc=95.900 %
   Epoch: 03 | Loss=0.156452648 | TrainAcc=97.318 % | ValAcc=97.500 %
   Epoch: 04 | Loss=0.107257504 | TrainAcc=97.940 % | ValAcc=97.900 %
   Epoch: 05 | Loss=0.079941446 | TrainAcc=98.598 % | ValAcc=98.300 %
   Epoch: 06 | Loss=0.067996642 | TrainAcc=98.916 % | ValAcc=98.600 %
   Epoch: 07 | Loss=0.056781804 | TrainAcc=98.987 % | ValAcc=98.820 %
   Epoch: 08 | Loss=0.049156536 | TrainAcc=99.084 % | ValAcc=98.780 %
   Epoch: 09 | Loss=0.046372587 | TrainAcc=99.335 % | ValAcc=98.820 %
   Epoch: 10 | Loss=0.039569505 | TrainAcc=99.242 % | ValAcc=98.860 %
   Epoch: 11 | Loss=0.035637989 | TrainAcc=99.491 % | ValAcc=99.100 %
   Epoch: 12 | Loss=0.033921816 | TrainAcc=99.558 % | ValAcc=99.000 %
   Epoch: 13 | Loss=0.029400493 | TrainAcc=99.427 % | ValAcc=98.780 %
   Epoch

<div class='alert alert-warning'>
<b>COMMENT:</b><br/>
Everything was already implemented in the functions above, so we only needed to call them.<br/><br/>

We once again surpass the 99% frontier! In fact, since the validation accuracy is 99.2%, further tuning of the keep probability could lead to an even better model. However, this is out of the scope of this notebook.
</div>