## Visualizing the progress of neural network training with Tensorboard

There are many hyperparameters to tune when you're trying to improve the performance of a neural network. There are also many output measures during the training process. Keeping track of all these numbers is a challenging task. Fortunately, the good folks at Google knew this already and built [Tensorboard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard) exactly for this purpose.

In this project, we'll use the CIFAR-10 dataset and a simple convolutional neural network to learn how to use Tensorboard to keep track of and visualize the parameters you're interested in during the training process. We'll also learn how to better understand a computation graph using Tensorboard.

### Get the training data
Run the following cell to download the [CIFAR-10 dataset for python](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz).

In [1]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import tarfile

cifar10_dataset_folder_path = 'cifar-10-batches-py'
tar_gz_path = 'cifar-10-python.tar.gz'

class DLProgress(tqdm):
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

if not isdir(cifar10_dataset_folder_path):
    if not isfile(tar_gz_path):
        with DLProgress(unit='B', unit_scale=True, miniters=1, desc='CIFAR-10 Dataset') as pbar:
            urlretrieve('https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz', tar_gz_path, pbar.hook)
    with tarfile.open(tar_gz_path) as tar:
        # The with-block closes the archive automatically.
        tar.extractall()

In [2]:
import helper
import numpy as np

## Preprocessing Functions
### Normalize
In the cell below, we implement the `normalize` function to take in image data, `x`, and return it as a normalized Numpy array. The values should be in the range of 0 to 1, inclusive.  The return object should be the same shape as `x`.

In [3]:
def normalize(x):
    x = x.astype(np.float32)
    return ( x - x.min() ) / ( x.max() - x.min() )
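
As a quick sanity check, the sketch below applies the same min-max scaling to a tiny made-up batch. The array `fake_images` is an assumption for illustration only and is not CIFAR-10 data.

```python
import numpy as np

def normalize(x):
    # Min-max scale to [0, 1]; same logic as the cell above.
    x = x.astype(np.float32)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical stand-in for a batch of 2 RGB images of size 4x4.
fake_images = np.arange(2 * 4 * 4 * 3, dtype=np.uint8).reshape(2, 4, 4, 3)
scaled = normalize(fake_images)

print(scaled.shape)                # the shape is preserved
print(scaled.min(), scaled.max())  # smallest and largest scaled values
```

Note that this scaling divides by `x.max() - x.min()`, so it would fail on an input whose values are all identical.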

### One-hot encode
Here we implement the `one_hot_encode` function. The input, `x`, is a list of labels, and the function returns those labels as a one-hot encoded NumPy array. The possible values for labels are 0 to 9. The function returns the same encoding for a given label on every call because each label simply indexes its own column, so no separate map of encodings is needed.

In [4]:
def one_hot_encode(x):
    output = np.zeros([len(x), 10])   
    for idx, item in enumerate(x):
        output[idx, item] = 1
    return output
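
A short usage sketch, assuming three made-up labels. It also compares the loop against a vectorized alternative that indexes rows of an identity matrix, which is a common NumPy idiom rather than something the notebook itself uses.

```python
import numpy as np

def one_hot_encode(x):
    # Same logic as the cell above: one row per label, a 1 in the label's column.
    output = np.zeros([len(x), 10])
    for idx, item in enumerate(x):
        output[idx, item] = 1
    return output

labels = [3, 0, 9]  # hypothetical labels for illustration
encoded = one_hot_encode(labels)

# Vectorized alternative: row i of the identity matrix is the encoding of label i.
encoded_eye = np.eye(10)[labels]

print(encoded.shape)
print(np.array_equal(encoded, encoded_eye))
```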

### Randomize Data
The order of the samples in this dataset is already randomized. It doesn't hurt to shuffle it again, but you don't need to for this dataset.

## Preprocess all the data and save it
Running the code cell below will preprocess all the CIFAR-10 data and save it to file. The code below also uses 10% of the training data for validation.

In [5]:
# Preprocess Training, Validation, and Testing Data
helper.preprocess_and_save_data(cifar10_dataset_folder_path, normalize, one_hot_encode)
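
The helper's internals are not shown here, so the following is only a sketch of one way to hold out 10% of a batch for validation. The function name `split_validation` and the choice to take the last 10% of the samples are assumptions for illustration; `helper.preprocess_and_save_data` may do this differently.

```python
import numpy as np

def split_validation(features, labels, valid_fraction=0.1):
    # Hypothetical helper: hold out the last `valid_fraction` of the samples.
    n_valid = int(len(features) * valid_fraction)
    split = len(features) - n_valid
    train = (features[:split], labels[:split])
    valid = (features[split:], labels[split:])
    return train, valid

# Made-up data: 100 samples with one feature each.
features = np.arange(100).reshape(100, 1)
labels = np.arange(100) % 10
(train_x, train_y), (valid_x, valid_y) = split_validation(features, labels)

print(len(train_x), len(valid_x))
```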

## Checkpoint
This is your first checkpoint.  If you ever decide to come back to this notebook or have to restart the notebook, you can start from here.  The preprocessed data has been saved to disk.

In [6]:
import pickle
import helper

# Load the Preprocessed Validation data
with open('preprocess_validation.p', mode='rb') as f:
    valid_features, valid_labels = pickle.load(f)

## Build the network


### Input

Here, we're using [tf.name_scope](https://www.tensorflow.org/api_docs/python/tf/name_scope) to group the nodes within a single name scope. All the nodes in the same name scope will appear as a group when you later visualize the computation graph in Tensorboard. Note that every time you use `tf.name_scope`, you'll create a new one even when you're trying to use the same name. So for example, if you do `with tf.name_scope('inputs'):` twice, you'll actually create two name scopes. The first one is 'inputs', the second one is 'inputs_1'. 
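
The renaming behaviour described above can be mimicked with a toy counter. This is only an illustrative model written in plain Python: `ScopeNamer` is a hypothetical class, not part of TensorFlow, and it makes no claim about how TensorFlow implements name scopes internally.

```python
class ScopeNamer:
    """Toy model of unique scope naming (hypothetical, not a TensorFlow class)."""

    def __init__(self):
        self._counts = {}

    def unique(self, name):
        # The first use keeps the name; later uses get a numeric suffix.
        seen = self._counts.get(name, 0)
        self._counts[name] = seen + 1
        return name if seen == 0 else '{}_{}'.format(name, seen)

namer = ScopeNamer()
first = namer.unique('inputs')
second = namer.unique('inputs')
print(first, second)  # inputs inputs_1
```

In practice this means that reusing a scope name does not merge nodes into the existing group, so the graph view in Tensorboard will show two separate boxes.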

In [7]:
import tensorflow as tf

def neural_net_input(image_shape, n_classes):
    # Grouping nodes into a single name scope in the computation graph.
    with tf.name_scope('inputs'):
        x = tf.placeholder(tf.float32, [None, image_shape[0], image_shape[1], image_shape[2]], "x")
        y = tf.placeholder(tf.float32, [None, n_classes], "y")
        keep_prob = tf.placeholder(tf.float32, None, "keep_prob")
        learning_rate = tf.placeholder(tf.float32, None, "lr")
    return x, y, keep_prob, learning_rate

### Convolution and Max Pooling Layer
Here we're using [tf.summary](https://www.tensorflow.org/api_docs/python/tf/summary) to keep track of parameters we're interested in. `tf.summary` ops serialize the values of the parameters into protocol buffers, which can later be written to a log file using [tf.summary.FileWriter](https://www.tensorflow.org/api_docs/python/tf/summary/FileWriter) and viewed in Tensorboard. Since weights and biases are tensors rather than single scalars, we use histograms to keep track of them.

In [8]:
def conv2d_maxpool(x_tensor, conv_num_outputs, conv_ksize, conv_strides, pool_ksize, pool_strides):
    wc1 = tf.Variable(tf.truncated_normal( [ conv_ksize[0], conv_ksize[1], x_tensor.shape[3].value, conv_num_outputs ],
                                      mean=0.0, stddev=0.1, dtype=tf.float32))
    
    bc1 = tf.Variable(tf.truncated_normal([conv_num_outputs],
                                      mean=0.0, stddev=0.1, dtype=tf.float32))
    
    x_out = tf.nn.conv2d(x_tensor, wc1, strides=[1, conv_strides[0], conv_strides[1], 1], padding='SAME')
    x_out = tf.nn.bias_add(x_out, bc1)
    
    x_out = tf.nn.relu(x_out)
    x_out = tf.nn.max_pool(x_out, ksize=[1, pool_ksize[0], pool_ksize[1], 1], 
                           strides=[1, pool_strides[0], pool_strides[1], 1], padding='SAME')
    
    # Keeping track of weights and biases
    tf.summary.histogram('weights', wc1)
    tf.summary.histogram('biases', bc1)
    # ReLU was already applied before pooling, so the pooled output is returned as-is.
    return x_out
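
To make concrete what a histogram of a weight tensor looks like at initialization, here is a rough NumPy sketch. It assumes the documented behaviour of `tf.truncated_normal` (values more than two standard deviations from the mean are re-drawn) and uses `np.histogram` with 10 equal-width bins as a stand-in; Tensorboard's own bucketing differs, and the helper `truncated_normal` below is a hypothetical re-implementation, not TensorFlow code.

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, seed=0):
    # Hypothetical NumPy stand-in: re-draw values beyond 2 standard deviations.
    rng = np.random.default_rng(seed)
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(values) > 2 * stddev
    return values

# Same shape as the first conv layer's weights: 3x3 kernel, 3 input channels, 10 filters.
weights = truncated_normal((3, 3, 3, 10))
counts, edges = np.histogram(weights, bins=10)

print(counts)            # number of weights falling into each bin
print(edges[0], edges[-1])  # range covered by the bins
```

Watching how such a distribution shifts or collapses over training steps is the main use of the histogram tab.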

### Flatten Layer

In [9]:
def flatten(x_tensor):
    return tf.reshape(x_tensor, [tf.shape(x_tensor)[0], np.prod(x_tensor.shape[1:]).value])
    #return tf.reshape(x_tensor, [-1, np.prod(x_tensor.shape[1:]).value])
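
The size of the flattened vector follows from the layer settings used later in `conv_net`. The sketch below works through that arithmetic, assuming 32x32 inputs, stride-1 convolutions with `'SAME'` padding (which keep the spatial size), and 2x2 max pooling with stride 2 and `'SAME'` padding (which gives `ceil(size / stride)`). The function name `same_pool_output` is hypothetical.

```python
import math

def same_pool_output(size, stride):
    # With 'SAME' padding, the output size is ceil(input size / stride).
    return math.ceil(size / stride)

height = width = 32
for _ in range(3):  # three conv + max-pool blocks
    height = same_pool_output(height, 2)
    width = same_pool_output(width, 2)

channels = 30  # number of filters in the third conv layer
flat_size = height * width * channels
print(height, width, flat_size)  # 4 4 480
```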

### Fully-Connected Layer

In [10]:
def fully_conn(x_tensor, num_outputs):
    fc = tf.reshape(x_tensor, [-1, np.prod(x_tensor.shape[1:]).value])
    w = tf.Variable(tf.truncated_normal([np.prod(x_tensor.shape[1:]).value, num_outputs], mean=0.0, stddev=0.1, dtype=tf.float32))
    b = tf.Variable(tf.truncated_normal([num_outputs],mean=0.0, stddev=0.1, dtype=tf.float32))
    tf.summary.histogram('weights', w)
    tf.summary.histogram('biases', b)
    return tf.nn.relu(tf.add(tf.matmul(fc, w), b))

### Output Layer

In [11]:
def output(x_tensor, num_outputs):
    w = tf.Variable(tf.random_normal([x_tensor.shape[1].value, num_outputs]))
    b = tf.Variable(tf.random_normal([num_outputs]))
    return tf.add(tf.matmul(x_tensor, w), b)

### Create Convolutional Model

In [12]:
def conv_net(x, keep_prob):
    
    with tf.name_scope("CNN"):
        with tf.variable_scope("conv1"):
            conv1 = conv2d_maxpool(x, 10, (3, 3), (1, 1), (2, 2), (2, 2))
        with tf.variable_scope("conv2"):
            conv2 = conv2d_maxpool(conv1, 20, (5, 5), (1, 1), (2, 2), (2, 2))
        with tf.variable_scope("conv3"):
            conv3 = conv2d_maxpool(conv2, 30, (7, 7), (1, 1), (2, 2), (2, 2))

        # Apply a flatten layer
        f = flatten(conv3)

        # Apply three fully connected layers, with dropout after the first two
        with tf.variable_scope("fc1"):
            fc1 = fully_conn(f, 100)
            fc1 = tf.nn.dropout(fc1, keep_prob)
        with tf.variable_scope("fc2"):
            fc2 = fully_conn(fc1, 50)
            fc2 = tf.nn.dropout(fc2, keep_prob)
        with tf.variable_scope("fc3"):
            fc3 = fully_conn(fc2, 20)

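        # Feed the last fully connected layer into the output layer;
        # passing fc2 here would leave fc3 disconnected from the network.
        o = output(fc3, 10)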
    
    return o

### Building a training op

In [13]:
def build_op():
    
    # Remove previous weights, bias, inputs, etc..
    tf.reset_default_graph()

    # Inputs
    x, y, keep_prob, lr = neural_net_input((32,32,3), 10)
    
    # Model
    logits = conv_net(x, keep_prob)

    # Name the logits tensor so that it can be loaded from disk after training
    with tf.name_scope('logits'):
        logits = tf.identity(logits, name='logits')

    # Loss and Optimizer
    with tf.name_scope('cost'):
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y), name='cost')
    loss_summ = tf.summary.scalar('loss', cost)

    with tf.name_scope('train'):
        optimizer = tf.train.AdamOptimizer(learning_rate=lr).minimize(cost)

    # Accuracy
    with tf.name_scope('predictions'):
        correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
    accuracy_summ = tf.summary.scalar('accuracy', accuracy)
    
    #summary = tf.summary.merge([loss_summ, accuracy_summ], name='summary')
    #Alternatively, you can use tf.summary.merge_all()
    summary = tf.summary.merge_all()
    
    return x, y, keep_prob, lr, summary, optimizer, cost, accuracy
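
For intuition about the two scalars being logged, here is a NumPy sketch of what the cost and accuracy ops compute. It is a re-implementation for illustration, not TensorFlow's code, and the function names and the all-zero logits are assumptions. One useful reference point: with 10 classes, a model that assigns equal probability to every class has a cross-entropy of ln(10), about 2.3026.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Numerically stable log-softmax, then pick out the true class.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

def accuracy(logits, labels):
    # Fraction of samples whose highest logit matches the one-hot label.
    return np.mean(logits.argmax(axis=1) == labels.argmax(axis=1))

# Hypothetical batch: 4 samples, uniform (all-zero) logits, labels 0..3.
logits = np.zeros((4, 10))
labels = np.eye(10)[[0, 1, 2, 3]]

cost = softmax_cross_entropy(logits, labels).mean()
print(cost)  # ln(10), roughly 2.3026
```

A loss that stays near 2.30 with accuracy near 0.10 therefore suggests the network is not doing better than guessing.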

## Train the Neural Network

### Hyperparameters

In [14]:
# Tune these parameters
epochs = 20
batch_size = 128
keep_probability = 0.5
learning_rate = 0.001

### Train the Model

In [15]:
#for batch_size in [512, 128, 32, 8, 2]:
for learning_rate in [0.1, 0.01, 0.001, 0.0001, 0.00001]:

    x, y, keep_prob, lr, summary, optimizer, cost, accuracy = build_op()
    
    with tf.Session() as sess:
        steps = 0
        counter = 0
        
        # Initializing the variables
        sess.run(tf.global_variables_initializer())
        
        # If you want to compare training and validation, one good way to do it is to use two separate
        # file writers, one per process, and keep both logs in the same folder. This way, you can
        # later view them in the same plot.
        #train_log_string = 'log/train, learning_rate={:.5f}, batch_size={}'.format(learning_rate, batch_size)
        valid_log_string = 'log/valid, learning_rate={:.5f}, batch_size={}'.format(learning_rate, batch_size)
        #train_filewriter = tf.summary.FileWriter(train_log_string, sess.graph)
        valid_filewriter = tf.summary.FileWriter(valid_log_string, sess.graph)
        
        # Training cycle
        for epoch in range(epochs):
            # Loop over all batches
            n_batches = 5
            for batch_i in range(1, n_batches + 1):
                for batch_features, batch_labels in helper.load_preprocess_training_batch(batch_i, batch_size):
                    steps += batch_features.shape[0]
                    counter += batch_features.shape[0]
                    sess.run(optimizer, feed_dict={x: batch_features, y: batch_labels,
                                                   keep_prob: keep_probability, lr:learning_rate})
                
                    # Log only after the model is trained on every 500 samples. Getting summary takes time, so the 
                    # more frequently you look, the more extra time it'll cost you.
                    if counter > 500:
                        #train_summ = sess.run(summary, feed_dict={x: batch_features, y: batch_labels,
                        #                                                   keep_prob: 1., lr:learning_rate})
                        #train_filewriter.add_summary(train_summ, steps)
                        valid_summ = sess.run(summary, feed_dict={x: valid_features, y: valid_labels,
                                                                             keep_prob: 1., lr:learning_rate})
                        
                        valid_filewriter.add_summary(valid_summ, steps)
                        counter -= 500
                        
                loss, valid_acc= sess.run([cost, accuracy], feed_dict={x: valid_features, y: valid_labels, 
                                                                        keep_prob: 1., lr:learning_rate})
                
                print('Epoch {:>2}, CIFAR-10 Batch {}:  '.format(epoch + 1, batch_i), end='')
                print('Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(loss, valid_acc))

Epoch  1, CIFAR-10 Batch 1:  Loss:     2.3048 Validation Accuracy: 0.106800
Epoch  1, CIFAR-10 Batch 2:  Loss:     2.3067 Validation Accuracy: 0.094200
Epoch  1, CIFAR-10 Batch 3:  Loss:     2.3144 Validation Accuracy: 0.094600
Epoch  1, CIFAR-10 Batch 4:  Loss:     2.3099 Validation Accuracy: 0.099800
Epoch  1, CIFAR-10 Batch 5:  Loss:     2.3102 Validation Accuracy: 0.097800
Epoch  2, CIFAR-10 Batch 1:  Loss:     2.3044 Validation Accuracy: 0.097800
Epoch  2, CIFAR-10 Batch 2:  Loss:     2.3086 Validation Accuracy: 0.094200
Epoch  2, CIFAR-10 Batch 3:  Loss:     2.3144 Validation Accuracy: 0.094600
Epoch  2, CIFAR-10 Batch 4:  Loss:     2.3102 Validation Accuracy: 0.099800
Epoch  2, CIFAR-10 Batch 5:  Loss:     2.3100 Validation Accuracy: 0.097800
Epoch  3, CIFAR-10 Batch 1:  Loss:     2.3046 Validation Accuracy: 0.105000
Epoch  3, CIFAR-10 Batch 2:  Loss:     2.3091 Validation Accuracy: 0.094200
Epoch  3, CIFAR-10 Batch 3:  Loss:     2.3143 Validation Accuracy: 0.094600
[output truncated]

Epoch  3, CIFAR-10 Batch 5:  Loss:     1.4012 Validation Accuracy: 0.510800
Epoch  4, CIFAR-10 Batch 1:  Loss:     1.4278 Validation Accuracy: 0.488000
Epoch  4, CIFAR-10 Batch 2:  Loss:     1.3500 Validation Accuracy: 0.525200
Epoch  4, CIFAR-10 Batch 3:  Loss:     1.3236 Validation Accuracy: 0.537200
Epoch  4, CIFAR-10 Batch 4:  Loss:     1.3062 Validation Accuracy: 0.552600
Epoch  4, CIFAR-10 Batch 5:  Loss:     1.3343 Validation Accuracy: 0.536000
Epoch  5, CIFAR-10 Batch 1:  Loss:     1.3639 Validation Accuracy: 0.526400
Epoch  5, CIFAR-10 Batch 2:  Loss:     1.2486 Validation Accuracy: 0.567000
Epoch  5, CIFAR-10 Batch 3:  Loss:     1.3512 Validation Accuracy: 0.538200
Epoch  5, CIFAR-10 Batch 4:  Loss:     1.2129 Validation Accuracy: 0.577600
Epoch  5, CIFAR-10 Batch 5:  Loss:     1.2260 Validation Accuracy: 0.583200
Epoch  6, CIFAR-10 Batch 1:  Loss:     1.2135 Validation Accuracy: 0.581000
Epoch  6, CIFAR-10 Batch 2:  Loss:     1.2165 Validation Accuracy: 0.579600
[output truncated]

Epoch  6, CIFAR-10 Batch 4:  Loss:     2.4153 Validation Accuracy: 0.117200
Epoch  6, CIFAR-10 Batch 5:  Loss:     2.4138 Validation Accuracy: 0.118200
Epoch  7, CIFAR-10 Batch 1:  Loss:     2.4025 Validation Accuracy: 0.118800
Epoch  7, CIFAR-10 Batch 2:  Loss:     2.3985 Validation Accuracy: 0.120000
Epoch  7, CIFAR-10 Batch 3:  Loss:     2.3957 Validation Accuracy: 0.122200
Epoch  7, CIFAR-10 Batch 4:  Loss:     2.3906 Validation Accuracy: 0.122600
Epoch  7, CIFAR-10 Batch 5:  Loss:     2.3877 Validation Accuracy: 0.123000
Epoch  8, CIFAR-10 Batch 1:  Loss:     2.3735 Validation Accuracy: 0.125600
Epoch  8, CIFAR-10 Batch 2:  Loss:     2.3683 Validation Accuracy: 0.124800
Epoch  8, CIFAR-10 Batch 3:  Loss:     2.3667 Validation Accuracy: 0.126200
Epoch  8, CIFAR-10 Batch 4:  Loss:     2.3583 Validation Accuracy: 0.126200
Epoch  8, CIFAR-10 Batch 5:  Loss:     2.3530 Validation Accuracy: 0.128000
Epoch  9, CIFAR-10 Batch 1:  Loss:     2.3409 Validation Accuracy: 0.129800
[output truncated]
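
The training loop writes a summary whenever its sample counter passes 500 and then subtracts 500 from it. The sketch below replays just that counter logic to show at which sample counts a summary would be written; `logging_steps` is a hypothetical function, and it assumes every batch is full (128 samples).

```python
def logging_steps(n_batches, batch_size=128, log_every=500):
    # Replay the counter logic from the training loop without any training.
    steps = 0
    counter = 0
    logged = []
    for _ in range(n_batches):
        steps += batch_size
        counter += batch_size
        if counter > log_every:
            logged.append(steps)   # a summary would be written at this step
            counter -= log_every   # keep the remainder for the next interval
    return logged

print(logging_steps(12))  # [512, 1024, 1536]
```

Because the remainder carries over, the logging interval is usually four batches but occasionally three.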

### Visualize training logs in Tensorboard

With the above code, logs are written incrementally to disk once training begins, so you can monitor training while it is still running. To visualize the logs, open another terminal and run `tensorboard --logdir yourlogdir`, where `yourlogdir` is the directory the log files are saved to (`log` in the code above). After Tensorboard finds the log files, it'll give you a URL, most likely the equivalent of [localhost:6006](http://localhost:6006). Open this URL in your browser to visualize the training process!
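
Tensorboard treats each subdirectory of the log directory that contains event files as a separate run, which is why the training loop builds a different directory name for each learning rate. The sketch below only reproduces that naming with the same format string, so you can see which runs to expect in the run selector; it does not write any log files.

```python
batch_size = 128
learning_rates = [0.1, 0.01, 0.001, 0.0001, 0.00001]

# Same format string as in the training loop: one directory per hyperparameter setting.
run_dirs = ['log/valid, learning_rate={:.5f}, batch_size={}'.format(lr, batch_size)
            for lr in learning_rates]

for run_dir in run_dirs:
    print(run_dir)

# All runs share the parent folder, so a single command shows them together:
#   tensorboard --logdir log
```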