# About  
This Jupyter notebook is Assignment 3 of Lesson 3 of the Udacity class "Deep Learning".  
Finished on: 2018/06/17

Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in `1_notmnist.ipynb`.

In [2]:
pickle_file = './data/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
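
The one-hot step in `reformat` relies on NumPy broadcasting. A minimal sketch on a toy label vector (the three labels below are made up for illustration) shows what the comparison produces:

```python
import numpy as np

num_labels = 10
labels = np.array([1, 2, 9])  # toy integer class labels (illustrative only)

# labels[:, None] has shape (3, 1) and np.arange(num_labels) has shape (10,).
# Broadcasting the comparison gives a (3, 10) boolean matrix with exactly
# one True per row, in the column equal to the label.
one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)

print(one_hot.shape)           # (3, 10)
print(one_hot.argmax(axis=1))  # [1 2 9]
```

Taking `argmax` along axis 1 recovers the original integer labels, which is what the `accuracy` helper further down depends on.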


In [4]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])
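
As a quick sanity check of `accuracy`, here is a sketch on four hand-made rows (the toy arrays are assumptions for illustration, not data from the notebook). The function is repeated so the snippet runs on its own:

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows whose arg-max class matches the one-hot label.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.4, 0.3],   # predicted class 1, true class 2
                            [0.6, 0.3, 0.1]])
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [1, 0, 0]], dtype=np.float32)

toy_acc = accuracy(toy_predictions, toy_labels)
print(toy_acc)  # 75.0
```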

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.

---
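
To make the penalty concrete, here is a NumPy sketch of the quantity `tf.nn.l2_loss` computes, `sum(t ** 2) / 2`, and of how it enters the total loss. The toy weights, `beta`, and `data_loss` values are assumptions chosen for illustration:

```python
import numpy as np

def l2_loss(t):
    # NumPy stand-in for tf.nn.l2_loss: sum(t ** 2) / 2.
    return np.sum(np.square(t)) / 2.0

w = np.array([[1.0, -2.0],
              [3.0, 0.0]])  # toy weight matrix (illustrative)
beta = 0.1                  # illustrative regularization strength
data_loss = 0.5             # stand-in for the cross-entropy term

total_loss = data_loss + beta * l2_loss(w)

# The gradient of beta * l2_loss(w) with respect to w is simply beta * w,
# so each gradient step shrinks every weight toward zero.
penalty_grad = beta * w

print(l2_loss(w))  # 7.0
print(total_loss)
```

Because of the factor 1/2, the penalty gradient has no extra factor of 2, which is why this form is convenient for weight decay.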

## Logistic regression model (0 hidden layer NN)

In [11]:
train_subset = 10000
reg_beta = 0.2 # regularization hyperparameter beta

graph = tf.Graph()
with graph.as_default():
    
    # Constants 
    tf_train_dataset = tf.constant(train_dataset[:train_subset, :])
    tf_train_label = tf.constant(train_labels[:train_subset])
    tf_valid_dataset = tf.constant(valid_dataset[:train_subset, :])
    tf_valid_label = tf.constant(valid_labels[:train_subset])
    tf_test_dataset = tf.constant(test_dataset[:train_subset, :])
    tf_test_label = tf.constant(test_labels[:train_subset])
    
    # Variables
    weights = tf.Variable(tf.truncated_normal([image_size * image_size, num_labels]))
    bias = tf.Variable(tf.zeros([num_labels]))
    
    # Training 
    logits = tf.matmul(tf_train_dataset, weights) + bias 
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_label, logits = logits))
    # L2 penalty on the weights (loss is already a scalar, so no second reduce_mean is needed)
    loss = loss + reg_beta * tf.nn.l2_loss(weights)
    
    # Optimizer 
    optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01).minimize(loss)
    
    # Predicts
    pred_train = tf.nn.softmax(logits)
    pred_valid = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + bias )
    pred_test = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + bias)
    

In [13]:
num_steps = 1001
with tf.Session(graph = graph) as sess:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        _, l, predictions = sess.run([optimizer, loss, pred_train])
        if step % 50 == 0:
            print('Loss at step %d: %f' % (step, l))
            print('Training accuracy: %.1f%%' % accuracy(predictions, train_labels[:train_subset, :]))
            # Calling .eval() on pred_valid is basically like calling run(), but
            # just to get that one numpy array. Note that it recomputes all its graph dependencies.
            print('Validation accuracy: %.1f%%' % accuracy(pred_valid.eval(), valid_labels))
    print('Test accuracy: %.1f%%' % accuracy(pred_test.eval(), test_labels))

Initialized
Loss at step 0: 628.211304
Training accuracy: 10.7%
Validation accuracy: 10.3%
Loss at step 50: 511.313171
Training accuracy: 13.8%
Validation accuracy: 13.9%
Loss at step 100: 416.509979
Training accuracy: 17.9%
Validation accuracy: 18.3%
Loss at step 150: 339.625366
Training accuracy: 22.8%
Validation accuracy: 23.2%
Loss at step 200: 277.157776
Training accuracy: 28.1%
Validation accuracy: 27.7%
Loss at step 250: 226.312622
Training accuracy: 32.9%
Validation accuracy: 33.7%
Loss at step 300: 184.887939
Training accuracy: 38.6%
Validation accuracy: 38.4%
Loss at step 350: 151.133865
Training accuracy: 43.9%
Validation accuracy: 43.4%
Loss at step 400: 123.621254
Training accuracy: 48.7%
Validation accuracy: 48.1%
Loss at step 450: 101.173904
Training accuracy: 53.3%
Validation accuracy: 51.7%
Loss at step 500: 82.845200
Training accuracy: 56.6%
Validation accuracy: 55.0%
Loss at step 550: 67.873894
Training accuracy: 59.3%
Validation accuracy: 57.8%
Loss at step 600: 55.
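
The large loss values at the start of training come mostly from the L2 term, not from the cross-entropy. The sketch below estimates the expected penalty at initialization. It assumes the defaults of `tf.truncated_normal` (mean 0, standard deviation 1, values beyond two standard deviations re-drawn); `truncated_normal_variance` is a hypothetical helper written for this example:

```python
import math

def truncated_normal_variance(bound=2.0):
    # Variance of a standard normal restricted to [-bound, bound].
    pdf = math.exp(-0.5 * bound ** 2) / math.sqrt(2.0 * math.pi)
    mass = math.erf(bound / math.sqrt(2.0))  # P(|z| <= bound)
    return 1.0 - 2.0 * bound * pdf / mass

image_size, num_labels, reg_beta = 28, 10, 0.2
num_weights = image_size * image_size * num_labels

# For zero-mean i.i.d. entries, E[sum(W ** 2) / 2] = 0.5 * num_weights * Var(w).
expected_penalty = reg_beta * 0.5 * num_weights * truncated_normal_variance()
print(round(expected_penalty, 1))
```

Under these assumptions the penalty alone is roughly 600, which accounts for most of the step-0 loss of 628.21 reported above. The loss falls mainly because the weights shrink, so the accuracy figures are a better progress indicator than the raw loss.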

## Neural Network model (1 hidden layer, 1024 neurons)

In [26]:
train_subset = 10000
reg_beta = 0.2 # regularization hyperparameter beta
batch_size = 500 
hidden_nodes = 1024

graph = tf.Graph()
with graph.as_default():
    
    # Placeholders
    tf_train_dataset = tf.placeholder(tf.float32, shape = [batch_size, image_size * image_size])
    tf_train_label = tf.placeholder(tf.float32, shape = [batch_size, num_labels])
    
    # Constants
    tf_valid_dataset = tf.constant(valid_dataset[:train_subset, :])
    tf_valid_label = tf.constant(valid_labels[:train_subset])
    tf_test_dataset = tf.constant(test_dataset[:train_subset, :])
    tf_test_label = tf.constant(test_labels[:train_subset])
    
    # Variables
    weights_1 = tf.Variable(tf.truncated_normal([image_size * image_size, hidden_nodes]))
    bias_1 = tf.Variable(tf.zeros([hidden_nodes]))
    
    weights_2 = tf.Variable(tf.truncated_normal([hidden_nodes, num_labels]))
    bias_2 = tf.Variable(tf.truncated_normal([num_labels]))
    
    
    # Training 
    logits = tf.matmul(tf_train_dataset, weights_1) + bias_1
    relu = tf.nn.relu(logits)
    logits = tf.matmul(relu, weights_2) + bias_2
    
    # Loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_label, logits = logits))
    # L2 penalty (applied to the output-layer weights only; weights_1 is not penalized)
    loss = loss + reg_beta * tf.nn.l2_loss(weights_2)
    
    # Optimizer 
    optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01).minimize(loss)
    
    # Predicts
    pred_train = tf.nn.softmax(logits)
    
    logits_valid = tf.matmul(tf_valid_dataset, weights_1) + bias_1
    relu_valid = tf.nn.relu(logits_valid)
    pred_valid = tf.nn.softmax(tf.matmul(relu_valid, weights_2) + bias_2)
    
    logits_test = tf.matmul(tf_test_dataset, weights_1) + bias_1
    relu_test = tf.nn.relu(logits_test)
    pred_test = tf.nn.softmax(tf.matmul(relu_test, weights_2) + bias_2)
    

In [27]:
num_steps = 1001
with tf.Session(graph = graph) as sess:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset : (offset + batch_size), :]
        batch_labels = train_labels[offset : (offset + batch_size), :]
        
        feed_dict = {tf_train_dataset: batch_data, 
                    tf_train_label: batch_labels}
        
        _, l, predictions = sess.run([optimizer, loss, pred_train], feed_dict = feed_dict)
        
        if step % 50 == 0:
            print('Loss at step %d: %f' % (step, l))
            print('Training accuracy: %.1f%%' % accuracy(predictions, batch_labels))
            # Calling .eval() on pred_valid is basically like calling run(), but
            # just to get that one numpy array. Note that it recomputes all its graph dependencies.
            print('Validation accuracy: %.1f%%' % accuracy(pred_valid.eval(), valid_labels))
    print('Test accuracy: %.1f%%' % accuracy(pred_test.eval(), test_labels))

Initialized
Loss at step 0: 1210.989258
Training accuracy: 9.0%
Validation accuracy: 10.8%
Loss at step 50: 679.233582
Training accuracy: 66.4%
Validation accuracy: 60.9%
Loss at step 100: 556.882446
Training accuracy: 65.0%
Validation accuracy: 67.9%
Loss at step 150: 446.955048
Training accuracy: 73.6%
Validation accuracy: 70.9%
Loss at step 200: 364.582031
Training accuracy: 75.0%
Validation accuracy: 72.2%
Loss at step 250: 300.174927
Training accuracy: 73.8%
Validation accuracy: 73.3%
Loss at step 300: 243.154388
Training accuracy: 73.8%
Validation accuracy: 73.8%
Loss at step 350: 199.225189
Training accuracy: 75.6%
Validation accuracy: 74.5%
Loss at step 400: 164.146622
Training accuracy: 73.8%
Validation accuracy: 75.0%
Loss at step 450: 133.221405
Training accuracy: 73.4%
Validation accuracy: 75.4%
Loss at step 500: 107.265060
Training accuracy: 74.8%
Validation accuracy: 75.5%
Loss at step 550: 85.976402
Training accuracy: 81.2%
Validation accuracy: 76.2%
Loss at step 600: 71

---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---

## Neural Network with a few batches to overfit

In [28]:
train_subset = 10000
reg_beta = 0.2 # regularization hyperparameter beta
batch_size = 2500 # use a large batch size so that there are only a few batches, to check overfitting
hidden_nodes = 1024

graph = tf.Graph()
with graph.as_default():
    
    # Placeholders
    tf_train_dataset = tf.placeholder(tf.float32, shape = [batch_size, image_size * image_size])
    tf_train_label = tf.placeholder(tf.float32, shape = [batch_size, num_labels])
    
    # Constants
    tf_valid_dataset = tf.constant(valid_dataset[:train_subset, :])
    tf_valid_label = tf.constant(valid_labels[:train_subset])
    tf_test_dataset = tf.constant(test_dataset[:train_subset, :])
    tf_test_label = tf.constant(test_labels[:train_subset])
    
    # Variables
    weights_1 = tf.Variable(tf.truncated_normal([image_size * image_size, hidden_nodes]))
    bias_1 = tf.Variable(tf.zeros([hidden_nodes]))
    
    weights_2 = tf.Variable(tf.truncated_normal([hidden_nodes, num_labels]))
    bias_2 = tf.Variable(tf.truncated_normal([num_labels]))
    
    
    # Training 
    logits = tf.matmul(tf_train_dataset, weights_1) + bias_1
    relu = tf.nn.relu(logits)
    logits = tf.matmul(relu, weights_2) + bias_2
    
    # Loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_label, logits = logits))
    # L2 penalty (applied to the output-layer weights only; weights_1 is not penalized)
    loss = loss + reg_beta * tf.nn.l2_loss(weights_2)
    
    # Optimizer 
    optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01).minimize(loss)
    
    # Predicts
    pred_train = tf.nn.softmax(logits)
    
    logits_valid = tf.matmul(tf_valid_dataset, weights_1) + bias_1
    relu_valid = tf.nn.relu(logits_valid)
    pred_valid = tf.nn.softmax(tf.matmul(relu_valid, weights_2) + bias_2)
    
    logits_test = tf.matmul(tf_test_dataset, weights_1) + bias_1
    relu_test = tf.nn.relu(logits_test)
    pred_test = tf.nn.softmax(tf.matmul(relu_test, weights_2) + bias_2)
    

In [29]:
num_steps = 1001
with tf.Session(graph = graph) as sess:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset : (offset + batch_size), :]
        batch_labels = train_labels[offset : (offset + batch_size), :]
        
        feed_dict = {tf_train_dataset: batch_data, 
                    tf_train_label: batch_labels}
        
        _, l, predictions = sess.run([optimizer, loss, pred_train], feed_dict = feed_dict)
        
        if step % 50 == 0:
            print('Loss at step %d: %f' % (step, l))
            print('Training accuracy: %.1f%%' % accuracy(predictions, batch_labels))
            # Calling .eval() on pred_valid is basically like calling run(), but
            # just to get that one numpy array. Note that it recomputes all its graph dependencies.
            print('Validation accuracy: %.1f%%' % accuracy(pred_valid.eval(), valid_labels))
    print('Test accuracy: %.1f%%' % accuracy(pred_test.eval(), test_labels))

Initialized
Loss at step 0: 1199.322876
Training accuracy: 8.6%
Validation accuracy: 10.6%
Loss at step 50: 705.495239
Training accuracy: 59.1%
Validation accuracy: 58.8%
Loss at step 100: 569.368347
Training accuracy: 66.1%
Validation accuracy: 66.1%
Loss at step 150: 458.644135
Training accuracy: 69.9%
Validation accuracy: 69.2%
Loss at step 200: 371.726013
Training accuracy: 72.9%
Validation accuracy: 71.3%
Loss at step 250: 303.405090
Training accuracy: 73.7%
Validation accuracy: 72.4%
Loss at step 300: 247.893951
Training accuracy: 73.9%
Validation accuracy: 73.1%
Loss at step 350: 202.419571
Training accuracy: 74.6%
Validation accuracy: 73.7%
Loss at step 400: 163.972168
Training accuracy: 76.4%
Validation accuracy: 74.3%
Loss at step 450: 134.226044
Training accuracy: 76.3%
Validation accuracy: 75.0%
Loss at step 500: 109.968369
Training accuracy: 75.5%
Validation accuracy: 75.3%
Loss at step 550: 88.088905
Training accuracy: 76.8%
Validation accuracy: 75.4%
Loss at step 600: 72

**With only a few batches to train on, the model overfits, which leads to low test accuracy.**
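
To check how many distinct batches a loop of this form visits, the offset arithmetic can be replayed without TensorFlow. This is only a sketch: `distinct_offsets` is a hypothetical helper, and taking the modulus over `train_subset` is an assumption about one way to restrict training to a few batches, not something the cells above do.

```python
def distinct_offsets(num_examples, batch_size, num_steps):
    # Start offsets visited by a loop of the form
    #   offset = (step * batch_size) % (num_examples - batch_size)
    return sorted({(step * batch_size) % (num_examples - batch_size)
                   for step in range(num_steps)})

# Modulus over the whole training set, as in train_labels.shape[0] == 200000.
full = distinct_offsets(200000, 2500, 1001)

# Modulus over a 10000-example subset (assumed alternative, e.g. train_subset).
small = distinct_offsets(10000, 2500, 1001)

print(len(full))  # 79
print(small)      # [0, 2500, 5000]
```

With the modulus taken over all 200000 examples the loop cycles through 79 different batches, whereas a 10000-example subset would leave only 3, which is closer to the "just a few batches" setting the problem statement describes.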

---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

---

Introduce the keep probability `keep_prob` for dropout.
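
What `keep_prob` does can be sketched in NumPy. The version below mimics the behavior of `tf.nn.dropout`, which zeroes each unit with probability `1 - keep_prob` and scales the survivors by `1 / keep_prob`. The function `dropout`, the seed, and the all-ones `hidden` array are assumptions made for illustration:

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    # Keep each unit with probability keep_prob and scale the survivors by
    # 1 / keep_prob so the expected activation is unchanged.
    mask = rng.random_sample(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((1000, 1024), dtype=np.float32)  # toy hidden-layer activations

train_out = dropout(hidden, keep_prob=0.8, rng=rng)  # training path: stochastic
eval_out = hidden                                    # evaluation path: no dropout

print(np.unique(train_out))  # [0.   1.25]
print(train_out.mean())      # close to 1.0
```

Because of the rescaling, the evaluation path can use the plain activations with no correction factor. It also means that a training accuracy computed from the dropout path is measured on a noisier network than the validation accuracy.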

## Neural Network with dropout 

In [32]:
train_subset = 10000
reg_beta = 0.2 # regularization hyperparameter beta
batch_size = 2500 # use a large batch size so that there are only a few batches, to check overfitting
hidden_nodes = 1024
keep_prob = 0.8

graph = tf.Graph()
with graph.as_default():
    
    # Placeholders
    tf_train_dataset = tf.placeholder(tf.float32, shape = [batch_size, image_size * image_size])
    tf_train_label = tf.placeholder(tf.float32, shape = [batch_size, num_labels])
    
    # Constants
    tf_valid_dataset = tf.constant(valid_dataset[:train_subset, :])
    tf_valid_label = tf.constant(valid_labels[:train_subset])
    tf_test_dataset = tf.constant(test_dataset[:train_subset, :])
    tf_test_label = tf.constant(test_labels[:train_subset])
    
    # Variables
    weights_1 = tf.Variable(tf.truncated_normal([image_size * image_size, hidden_nodes]))
    bias_1 = tf.Variable(tf.zeros([hidden_nodes]))
    
    weights_2 = tf.Variable(tf.truncated_normal([hidden_nodes, num_labels]))
    bias_2 = tf.Variable(tf.truncated_normal([num_labels]))
    
    
    # Training 
    logits = tf.matmul(tf_train_dataset, weights_1) + bias_1
    relu = tf.nn.relu(logits)
    dropout = tf.nn.dropout(relu, keep_prob)
    logits = tf.matmul(dropout, weights_2) + bias_2
    
    # Loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_label, logits = logits))
    # L2 penalty (applied to the output-layer weights only; weights_1 is not penalized)
    loss = loss + reg_beta * tf.nn.l2_loss(weights_2)
    
    # Optimizer 
    optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01).minimize(loss)
    
    # Predicts
    pred_train = tf.nn.softmax(logits)
    
    logits_valid = tf.matmul(tf_valid_dataset, weights_1) + bias_1
    relu_valid = tf.nn.relu(logits_valid)
    pred_valid = tf.nn.softmax(tf.matmul(relu_valid, weights_2) + bias_2)
    
    logits_test = tf.matmul(tf_test_dataset, weights_1) + bias_1
    relu_test = tf.nn.relu(logits_test)
    pred_test = tf.nn.softmax(tf.matmul(relu_test, weights_2) + bias_2)
    

In [33]:
num_steps = 1001
with tf.Session(graph = graph) as sess:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset : (offset + batch_size), :]
        batch_labels = train_labels[offset : (offset + batch_size), :]
        
        feed_dict = {tf_train_dataset: batch_data, 
                    tf_train_label: batch_labels}
        
        _, l, predictions = sess.run([optimizer, loss, pred_train], feed_dict = feed_dict)
        
        if step % 50 == 0:
            print('Loss at step %d: %f' % (step, l))
            print('Training accuracy: %.1f%%' % accuracy(predictions, batch_labels))
            # Calling .eval() on pred_valid is basically like calling run(), but
            # just to get that one numpy array. Note that it recomputes all its graph dependencies.
            print('Validation accuracy: %.1f%%' % accuracy(pred_valid.eval(), valid_labels))
    print('Test accuracy: %.1f%%' % accuracy(pred_test.eval(), test_labels))

Initialized
Loss at step 0: 1219.139893
Training accuracy: 6.5%
Validation accuracy: 6.7%
Loss at step 50: 728.422241
Training accuracy: 52.4%
Validation accuracy: 63.5%
Loss at step 100: 576.903259
Training accuracy: 62.6%
Validation accuracy: 71.0%
Loss at step 150: 466.278717
Training accuracy: 65.4%
Validation accuracy: 73.9%
Loss at step 200: 375.564819
Training accuracy: 69.0%
Validation accuracy: 75.7%
Loss at step 250: 305.931152
Training accuracy: 71.5%
Validation accuracy: 76.8%
Loss at step 300: 250.590698
Training accuracy: 71.0%
Validation accuracy: 77.5%
Loss at step 350: 202.889252
Training accuracy: 73.1%
Validation accuracy: 77.9%
Loss at step 400: 164.611145
Training accuracy: 73.2%
Validation accuracy: 78.4%
Loss at step 450: 134.834412
Training accuracy: 73.0%
Validation accuracy: 78.8%
Loss at step 500: 109.871841
Training accuracy: 72.5%
Validation accuracy: 79.1%
Loss at step 550: 87.227806
Training accuracy: 74.8%
Validation accuracy: 79.5%
Loss at step 600: 72.

**Dropout improves the test accuracy compared with the previous few-batch overfitting case.**

---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
---


## Neural Network with Learning rate decay 

Implement learning rate decay for the 1-hidden-layer NN with 1024 neurons.
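
The schedule produced by `tf.train.exponential_decay` with its default (non-staircase) setting is `init_learning_rate * decay_rate ** (global_step / decay_steps)`. A plain-Python sketch of that formula, using the hyperparameter values from the cell below (the function here is a stand-in written for illustration, not the TensorFlow implementation):

```python
def exponential_decay(init_learning_rate, global_step, decay_steps, decay_rate):
    # Continuous (non-staircase) schedule:
    # init_learning_rate * decay_rate ** (global_step / decay_steps)
    return init_learning_rate * decay_rate ** (global_step / float(decay_steps))

# Learning rate at a few step counts for init 0.1, decay_steps 200, decay_rate 0.9.
schedule = {s: exponential_decay(0.1, s, 200, 0.9) for s in (0, 200, 1000)}
for s in sorted(schedule):
    print(s, schedule[s])
```

Over 1000 steps the rate drops from 0.1 to about 0.059. The schedule only advances if the step counter is incremented, which `minimize` does when it is given the `global_step` argument.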

In [38]:
train_subset = 10000
reg_beta = 0.01 # regularization hyperparameter beta
batch_size = 500 # mini-batch size
hidden_nodes = 1024
keep_prob = 0.8
init_learning_rate = 0.1
decay_steps = 200 
decay_rate = 0.9

graph = tf.Graph()
with graph.as_default():
    
    # Placeholders
    tf_train_dataset = tf.placeholder(tf.float32, shape = [batch_size, image_size * image_size])
    tf_train_label = tf.placeholder(tf.float32, shape = [batch_size, num_labels])
    
    # Constants
    tf_valid_dataset = tf.constant(valid_dataset[:train_subset, :])
    tf_valid_label = tf.constant(valid_labels[:train_subset])
    tf_test_dataset = tf.constant(test_dataset[:train_subset, :])
    tf_test_label = tf.constant(test_labels[:train_subset])
    
    # Variables
    weights_1 = tf.Variable(tf.truncated_normal([image_size * image_size, hidden_nodes]))
    bias_1 = tf.Variable(tf.zeros([hidden_nodes]))
    
    weights_2 = tf.Variable(tf.truncated_normal([hidden_nodes, num_labels]))
    bias_2 = tf.Variable(tf.truncated_normal([num_labels]))
    
    
    # Training 
    logits = tf.matmul(tf_train_dataset, weights_1) + bias_1
    relu = tf.nn.relu(logits)
    dropout = tf.nn.dropout(relu, keep_prob)
    logits = tf.matmul(dropout, weights_2) + bias_2
    
    # Loss
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_label, logits = logits))
    # L2 penalty (applied to the output-layer weights only; weights_1 is not penalized)
    loss = loss + reg_beta * tf.nn.l2_loss(weights_2)
    
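    # Optimizer with learning rate decay
    global_step = tf.Variable(0, trainable=False)  # counts the optimizer steps taken
    learning_rate = tf.train.exponential_decay(init_learning_rate, global_step, decay_steps, decay_rate)
    # Passing global_step makes minimize() increment it; without it the learning rate never decays.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(loss, global_step = global_step)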
    
    # Predicts
    pred_train = tf.nn.softmax(logits)
    
    logits_valid = tf.matmul(tf_valid_dataset, weights_1) + bias_1
    relu_valid = tf.nn.relu(logits_valid)
    pred_valid = tf.nn.softmax(tf.matmul(relu_valid, weights_2) + bias_2)
    
    logits_test = tf.matmul(tf_test_dataset, weights_1) + bias_1
    relu_test = tf.nn.relu(logits_test)
    pred_test = tf.nn.softmax(tf.matmul(relu_test, weights_2) + bias_2)
    

In [39]:
num_steps = 1001
with tf.Session(graph = graph) as sess:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset : (offset + batch_size), :]
        batch_labels = train_labels[offset : (offset + batch_size), :]
        
        feed_dict = {tf_train_dataset: batch_data, 
                    tf_train_label: batch_labels}
        
        _, l, predictions = sess.run([optimizer, loss, pred_train], feed_dict = feed_dict)
        
        if step % 50 == 0:
            print('Loss at step %d: %f' % (step, l))
            print('Training accuracy: %.1f%%' % accuracy(predictions, batch_labels))
            # Calling .eval() on pred_valid is basically like calling run(), but
            # just to get that one numpy array. Note that it recomputes all its graph dependencies.
            print('Validation accuracy: %.1f%%' % accuracy(pred_valid.eval(), valid_labels))
    print('Test accuracy: %.1f%%' % accuracy(pred_test.eval(), test_labels))

Initialized
Loss at step 0: 457.989288
Training accuracy: 9.8%
Validation accuracy: 27.9%
Loss at step 50: 85.762344
Training accuracy: 72.6%
Validation accuracy: 77.6%
Loss at step 100: 72.076103
Training accuracy: 75.6%
Validation accuracy: 80.1%
Loss at step 150: 54.977676
Training accuracy: 77.0%
Validation accuracy: 80.6%
Loss at step 200: 44.105583
Training accuracy: 80.4%
Validation accuracy: 81.5%
Loss at step 250: 40.892647
Training accuracy: 75.8%
Validation accuracy: 81.9%
Loss at step 300: 37.808998
Training accuracy: 77.2%
Validation accuracy: 80.9%
Loss at step 350: 29.687538
Training accuracy: 78.8%
Validation accuracy: 82.5%
Loss at step 400: 31.920956
Training accuracy: 76.6%
Validation accuracy: 82.6%
Loss at step 450: 26.198807
Training accuracy: 74.0%
Validation accuracy: 82.1%
Loss at step 500: 22.735498
Training accuracy: 78.6%
Validation accuracy: 82.4%
Loss at step 550: 15.746124
Training accuracy: 84.0%
Validation accuracy: 83.5%
Loss at step 600: 15.219471
Tra

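**Adding learning rate decay improved the test accuracy and, at the same time, reduced the run time.** Note that this run also uses a higher initial learning rate (0.1 instead of 0.01), a smaller `reg_beta` (0.01 instead of 0.2), and a batch size of 500, so the gain cannot be attributed to the decay alone.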

Future work: add more hidden layers to improve test accuracy.