# Lab 1 - Intro to BlueCrystal and TensorFlow

In this first lab session, you will learn the basics of implementing deep learning models using TensorFlow 1.2 and how to use BlueCrystal Phase 4 (BC4) for training them. The aim is to learn the methodology for building, debugging and training models using this framework. You will start by training a shallow neural network for recognising objects using the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR-10 contains 60,000 examples in 10 object classes, with 6,000 examples per class.

[![CIFAR10 samples](cifar10-sample.png)](https://www.cs.toronto.edu/~kriz/cifar.html)

### Objectives:

1.- Build your first deep learning model using TensorFlow 1.2 for recognising objects using the CIFAR-10 dataset. 

2.- Train your model on BC4 and visualize the training process.

3.- Evaluate your model.


## NOTICE:

Please ensure you can successfully run the [GPU stress test](RunningTensorFlow.ipynb) before attempting Lab 1. This will ensure you have a working platform before proceeding with this lab sheet.

# 1. Building your first CNN

## 1.1 TensorFlow-1.2

TensorFlow was originally developed by the Google Brain team as an internal machine learning tool and was open-sourced under the Apache 2.0 License in November 2015. Since then, it has become a popular choice among researchers and companies due to its balanced trade-off between flexibility (required in research) and production-worthiness (required in industry). Additionally, it's well documented and maintained, backed by a large community (> 10,000 commits and > 3000 TF-related repos in one year). 

Its core is written in C++, with Python, C++, Java and Go frontends. For the lab sessions we will use **version 1.2** with **Python** as the frontend, due to its compatibility with NumPy and the [`tf.contrib.learn`](https://www.tensorflow.org/api_docs/python/tf/contrib/learn) and [`tf.contrib.slim`](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim) APIs that will come in handy later on.

> **NOTE** `tf.contrib.learn` is different from the independent project [tflearn](http://tflearn.org/)

### Graphs and sessions

TensorFlow does all its computation in graphs, which its creators refer to as **dataflow graphs**. [Danijar Hafner's website](https://danijar.com/what-is-a-tensorflow-session/) and [TensorFlow's documentation](https://www.tensorflow.org/programmers_guide/graphs) contain more details about the concept of computational graphs and their advantages; in this lab we will focus only on how to build and execute them.

The **graph** defines the variables and the computation. It does not compute anything, nor does it hold any values; it just defines the operations that we want to be performed.

The execution of the graph, referred to as a **session**, allocates resources, feeds the data, computes the operations and holds the values of intermediate results and variables. The figure below shows how the data flows from the *input readers* to *operations* such as convolutions and activations; the *gradients* are then computed and the error is *backpropagated* to the *weight* and *bias variables*.

![](https://www.tensorflow.org/images/tensors_flowing.gif)


**We encourage you to use a single graph and a single session** for your models in this course.
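To make the split between defining a graph and running it more concrete, here is a toy sketch in plain Python. It is an analogy only and not how TensorFlow is implemented: the names `Node`, `placeholder`, `add`, `mul` and `ToySession` are made up for this illustration.

```python
# Toy illustration only: hypothetical classes, NOT TensorFlow's implementation.

class Node:
    """A graph node: stores an operation and its input nodes, but no value."""
    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name


def placeholder(name):
    # A placeholder has no operation; its value is supplied at run time.
    return Node(op=None, name=name)


def add(a, b):
    return Node(lambda x, y: x + y, (a, b))


def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))


class ToySession:
    """Executes a graph: feeds data in and computes values on demand."""
    def run(self, node, feed_dict):
        if node.op is None:
            return feed_dict[node]
        args = [self.run(n, feed_dict) for n in node.inputs]
        return node.op(*args)


# Building the graph computes nothing; y is only a description of x * w + b.
x = placeholder('x')
w = placeholder('w')
b = placeholder('b')
y = add(mul(x, w), b)

# Values only exist once the graph is run with concrete inputs.
sess = ToySession()
result = sess.run(y, feed_dict={x: 3.0, w: 2.0, b: 1.0})
print(result)  # 7.0
```

The same pattern appears in the lab code: `tf.placeholder` declares the inputs, the layers describe the computation, and `sess.run(..., feed_dict=...)` produces the actual values.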

## 1.2 Downloading Relevant Files

* First, visit the [GitHub labsheet repository](https://github.com/COMSM0018-Applied-Deep-Learning/labsheets)
* Clone the repository `git clone "https://github.com/COMSM0018-Applied-Deep-Learning/labsheets.git"`
* Copy both `CIFAR10` and `Lab_1_intro` into a new folder, which we will refer to as `/path_to_files/`
* Using Jupyter notebook, open the file `Lab_1_intro/simple_train_cifar.py`

## 1.3 Building your first Model

Next, we will describe the code in the  ```simple_train_cifar.py``` file for implementing a CNN for recognising objects on the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset. Your very first deep learning model will be formed by two *convolutional layers* and two *fully connected layers* with the following sizes and hyperparameters: 

* Filter 5 x 5, with stride 1 and padding [`'SAME'`](https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t) for convolutions
* Kernel 2 x 2 , with stride 2 and padding [`'SAME'`](https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t) for max pooling 
* 1024 Neurons for the fully connected layer
* ReLU activation function for both convolutional layers, and fully connected layers
* Weight initialization using random values from a truncated normal distribution with $\sigma= 0.1$ and $\mu=0$; in the provided code, biases are initialized to a constant value of 0.1
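The effect of these strides and paddings on the feature-map size can be checked by hand. The sketch below uses the output-size rules documented for TensorFlow's `'SAME'` and `'VALID'` padding; the helper name `output_size` is hypothetical and only serves this illustration.

```python
import math


def output_size(in_size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling along one dimension."""
    if padding == 'SAME':
        # The input is padded so that only the stride shrinks the output.
        return int(math.ceil(in_size / float(stride)))
    if padding == 'VALID':
        # No padding: the kernel must fit entirely inside the input.
        return int(math.ceil((in_size - kernel + 1) / float(stride)))
    raise ValueError('padding must be SAME or VALID')


size = 32                                   # CIFAR-10 images are 32 x 32
size = output_size(size, 5, 1, 'SAME')      # first convolution  -> 32
size = output_size(size, 2, 2, 'SAME')      # first max pooling  -> 16
size = output_size(size, 5, 1, 'SAME')      # second convolution -> 16
size = output_size(size, 2, 2, 'SAME')      # second max pooling -> 8
print(size)                                 # 8

# For comparison, a 5 x 5 'VALID' convolution would shrink 32 to 28.
print(output_size(32, 5, 1, 'VALID'))       # 28
```

With 64 feature maps after the second convolution, the flattened input to the first fully connected layer therefore has 8 x 8 x 64 = 4096 values.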


### Imports

We begin with the imports that the lab will require. We will use ```FLAGS``` for passing command-line values to the graph.

In [None]:
# %load -r 12:35 simple_train_cifar.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys
import os

import tensorflow as tf

sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'CIFAR10'))
import cifar10 as cf

FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('data-dir', os.getcwd() + '/dataset/',
                           'Directory where the dataset will be stored and checkpoint.')
tf.app.flags.DEFINE_integer('max-steps', 1000,
                            'Number of mini-batches to train on.')
tf.app.flags.DEFINE_integer('log-frequency', 100,
                            'Number of steps between logging results to the console and saving summaries')
tf.app.flags.DEFINE_integer('save-model', 1000,
                            'Number of steps between model saves')

### Hyperparameters

We also define all values required for training globally as flags, including the batch size and learning rate, as well as the dataset's information: the input size (`img-width`, `img-height`, `img-channels`) and the number of classes.

`train-dir` is the directory where the event logs and checkpoints are written; its default name encodes the batch size and learning rate.
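As a quick check of how the default log directory name is built, the sketch below applies the same format string outside TensorFlow. The plain variables `batch_size` and `learning_rate` stand in for the flag values and are assumptions of this example.

```python
import os

# Stand-ins for FLAGS.batch_size and FLAGS.learning_rate (default values).
batch_size = 128
learning_rate = 1e-4

train_dir = '{cwd}/logs/exp_bs_{bs}_lr_{lr}'.format(cwd=os.getcwd(),
                                                    bs=batch_size,
                                                    lr=learning_rate)

run_name = os.path.basename(train_dir)
print(run_name)              # exp_bs_128_lr_0.0001
print(run_name + '_train')   # exp_bs_128_lr_0.0001_train
```

Because the name depends on the hyperparameters, runs with different batch sizes or learning rates end up in separate sub-directories of `logs/`.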

In [None]:
# %load -r 36:48 simple_train_cifar.py
# Optimisation hyperparameters
tf.app.flags.DEFINE_integer('batch-size', 128, 'Number of examples per mini-batch')
tf.app.flags.DEFINE_float('learning-rate', 1e-4, 'Learning rate for the optimiser.')
tf.app.flags.DEFINE_integer('img-width', 32, 'Image width')
tf.app.flags.DEFINE_integer('img-height', 32, 'Image height')
tf.app.flags.DEFINE_integer('img-channels', 3, 'Image channels')
tf.app.flags.DEFINE_integer('num-classes', 10, 'Number of classes')
tf.app.flags.DEFINE_string('train-dir',
                           '{cwd}/logs/exp_bs_{bs}_lr_{lr}'.format(cwd=os.getcwd(),
                                                                   bs=FLAGS.batch_size,
                                                                   lr=FLAGS.learning_rate),
                           'Directory where to write event logs and checkpoint.')

### Convolutional and Pooling Layers
We can now define a function which will create a 2D convolutional layer with full stride and a max pooling layer:



In [None]:
# %load -r 107:116 simple_train_cifar.py
def conv2d(x, W):
    '''conv2d returns a 2d convolution layer with full stride.'''
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME', name='convolution')


def max_pool_2x2(x):
    '''max_pool_2x2 downsamples a feature map by 2X.'''
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME', name='pooling')

### Variables and Operations

*Tensor Variables* such as Weights ($W$) and Biases ($b$) for Convolutional Neural Networks (CNNs) are declared via the `tf.Variable` class, while for *Tensor Constants* there are several numpy-like methods such as `tf.zeros`, `tf.ones`, `tf.constant`, etc. For this lab we will use only *Tensor Variables*; [here](https://www.tensorflow.org/api_guides/python/constant_op) you can see how to use constants if you require them later.

We will use `tf.Variable()` to create variables and give them an initial value: `tf.Variable(<initial-value>, name=<optional-name>)`. We will use a random truncated normal initialization via the `tf.truncated_normal(shape, stddev)` function.

With the *Tensor Variables* we can perform *arithmetic*, *basic math*, *matrix math* and *sequence indexing*   operations, check out [math operations](https://www.tensorflow.org/api_guides/python/math_ops). Common operations for neural networks such as *convolutions*, *activations*, *pooling*, *recurrent architectures* etc. are defined in the [neural network operations](https://www.tensorflow.org/api_guides/python/nn) module.
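To illustrate what a truncated normal initialization means, here is a minimal plain-Python sketch. TensorFlow's documentation states that values more than two standard deviations from the mean are dropped and re-picked; the rejection loop below reproduces that behaviour, but it is an assumption of this example and not necessarily how TensorFlow samples internally. The function name `truncated_normal_samples` is hypothetical.

```python
import random


def truncated_normal_samples(n, mean=0.0, stddev=0.1, seed=None):
    """Draw n values from a normal distribution, re-picking any value that
    lies more than two standard deviations away from the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            samples.append(value)
    return samples


weights = truncated_normal_samples(1000, stddev=0.1, seed=0)
# With stddev=0.1, every initial weight lies in [-0.2, 0.2].
print(max(abs(w) for w in weights) <= 0.2)  # True
```

Truncation avoids unusually large initial weights, which keeps the early activations in a reasonable range.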

### Weight and Bias Variables
Next we can define functions that will create weight and bias variables of given shapes:

In [None]:
# %load -r 118:128 simple_train_cifar.py
def weight_variable(shape):
    '''weight_variable generates a weight variable of a given shape.'''
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial, name='weights')


def bias_variable(shape):
    '''bias_variable generates a bias variable of a given shape.'''
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial, name='biases')

### Defining the CNN
With the previous code, we can now define and create our CNN. As mentioned previously, we want the network to have the following layers:
* Convolutional Layer - Conv1
* Pooling Layer - Pool1
* Convolutional Layer - Conv2
* Pooling Layer - Pool2
* Fully Connected Layer - FC1
* An Output Layer - FC2

We first create a function, `deepnn(x)`, which builds a graph that takes in our images `x` and returns a tensor of class scores (logits): for each image, a vector of length 10 whose entries indicate how strongly the network believes the image belongs to each class, as determined by the index into the vector. We also initialise some variables we will use later when creating the CNN.

In [None]:
# %load -r 50:68 simple_train_cifar.py

def deepnn(x):
    '''deepnn builds the graph for a deep net for classifying CIFAR10 images.

    Args:
        x: an input tensor with the dimensions (N_examples, 3072), where 3072 is
          the number of values (32 x 32 pixels x 3 channels) in a CIFAR10 image.

    Returns:
        y: a tensor of shape (N_examples, 10), with values equal to the logits
          of classifying the object images into one of 10 classes
          (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
        img_summary: a string tensor containing sampled input images.
    '''
    # Reshape to use within a convolutional neural net.  Last dimension is for
    # 'features' - it would be 1 for a grayscale image, 3 for an RGB image,
    # 4 for RGBA, etc.
    x_image = tf.reshape(x, [-1, FLAGS.img_width, FLAGS.img_height, FLAGS.img_channels])

    img_summary = tf.summary.image('Input_images', x_image)


With that finished, the following code defines the layers inside your CNN.

Even though the code is provided, check at each step what each layer takes as input.
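One way to check your understanding of the layer inputs is to count the trainable parameters implied by the weight and bias shapes. The sketch below is plain Python; the shapes are the ones passed to `weight_variable` and `bias_variable` in the lab code, while the `layers` table and the `num_params` helper are made up for this illustration.

```python
# (scope name, weight shape, number of biases), following the lab code
layers = [
    ('Conv_1', (5, 5, 3, 32), 32),
    ('Conv_2', (5, 5, 32, 64), 64),
    ('FC_1', (8 * 8 * 64, 1024), 1024),
    ('FC_2', (1024, 10), 10),
]


def num_params(weight_shape, n_biases):
    """Number of trainable values in one layer (weights plus biases)."""
    n_weights = 1
    for dim in weight_shape:
        n_weights *= dim
    return n_weights + n_biases


total = 0
for name, weight_shape, n_biases in layers:
    count = num_params(weight_shape, n_biases)
    total += count
    print('%-6s %9d' % (name, count))

print('total  %9d' % total)  # total    4259274
```

Almost all of the parameters sit in `FC_1`, which connects the 4096 flattened features to 1024 neurons.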

In [None]:
# %load -r 70:104 simple_train_cifar.py

    # First convolutional layer - maps one RGB image to 32 feature maps.
    with tf.variable_scope('Conv_1'):
        W_conv1 = weight_variable([5, 5, FLAGS.img_channels, 32])
        b_conv1 = bias_variable([32])
        h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

        # Pooling layer - downsamples by 2X.
        h_pool1 = max_pool_2x2(h_conv1)

    with tf.variable_scope('Conv_2'):
        # Second convolutional layer -- maps 32 feature maps to 64.
        W_conv2 = weight_variable([5, 5, 32, 64])
        b_conv2 = bias_variable([64])
        h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

        # Second pooling layer.
        h_pool2 = max_pool_2x2(h_conv2)

    with tf.variable_scope('FC_1'):
        # Fully connected layer 1 -- after 2 rounds of downsampling, our 32x32
        # image is down to 8x8x64 feature maps -- maps this to 1024 features.
        W_fc1 = weight_variable([8 * 8 * 64, 1024])
        b_fc1 = bias_variable([1024])

        h_pool2_flat = tf.reshape(h_pool2, [-1, 8*8*64])
        h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

    with tf.variable_scope('FC_2'):
        # Map the 1024 features to 10 classes, one for each object category
        W_fc2 = weight_variable([1024, FLAGS.num_classes])
        b_fc2 = bias_variable([FLAGS.num_classes])

        y_conv = tf.matmul(h_fc1, W_fc2) + b_fc2

We can now declare our main function, which uses a TensorFlow session that initialises the CNN we defined and allows us to check, through [TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard), that all the connections and data feeding are in the right place.

```python
def main(_):
    tf.reset_default_graph()

    # Import data - this loads the dataset from the data directory, using the defined batch size
    cifar = cf.cifar10(batchSize=FLAGS.batch_size, downloadDir=FLAGS.data_dir)
    with tf.variable_scope("inputs"):
        # Placeholder for the flattened input images
        x = tf.placeholder(tf.float32, [None, FLAGS.img_width * FLAGS.img_height * FLAGS.img_channels])
        # Placeholder for the ground-truth labels (one row per example)
        y_ = tf.placeholder(tf.float32, [None, FLAGS.num_classes])

    # Build the graph for the deep net
    y_conv, img_summary = deepnn(x)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Training loop
        for step in range(FLAGS.max_steps):
            # Fetch the next training and test batches
            (trainImages, trainLabels) = cifar.getTrainBatch()
            (testImages, testLabels) = cifar.getTestBatch()

            # train_step is defined in the next section (Optimisation and Gradient Descent)
            sess.run(train_step, feed_dict={x: trainImages, y_: trainLabels})


if __name__ == '__main__':
    tf.app.run(main=main)
```

### Optimisation and Gradient Descent

Now that we have a CNN we want to actually train it! Time to include a loss function. We're using the standard cross entropy loss between the logits and the labels from the ground truth.
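For intuition about what the loss and accuracy operations compute, here is a small plain-Python sketch of softmax cross entropy and argmax accuracy. It mirrors the mathematics only; the function names are hypothetical and the code does not use TensorFlow.

```python
import math


def softmax_cross_entropy(logits, labels):
    """Cross entropy between one-hot labels and softmax(logits) for one example."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(t * lp for t, lp in zip(labels, log_probs))


def mean_loss(batch_logits, batch_labels):
    """Average loss over a mini-batch (the role played by tf.reduce_mean)."""
    losses = [softmax_cross_entropy(l, t)
              for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


def batch_accuracy(batch_logits, batch_labels):
    """Fraction of examples whose largest logit matches the labelled class."""
    hits = [l.index(max(l)) == t.index(max(t))
            for l, t in zip(batch_logits, batch_labels)]
    return sum(hits) / float(len(hits))


# Ten equal logits give a loss of ln(10) whatever the true class is.
uniform_loss = softmax_cross_entropy([0.0] * 10, [1.0] + [0.0] * 9)
print('%.4f' % uniform_loss)  # 2.3026

acc = batch_accuracy([[2.0, 1.0, 0.1], [0.1, 0.2, 3.0]],
                     [[1, 0, 0], [0, 1, 0]])
print(acc)  # 0.5
```

As a rough sanity check, a 10-class model that has not learned anything yet should report a loss in the region of ln(10) ≈ 2.30, provided its logits are close to uniform.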

In [None]:
# %load -r 145:150 simple_train_cifar.py
with tf.variable_scope('x_entropy'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))

train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')

We can now add the main training loop to the main function, inside the TensorFlow session. We run the loop for `FLAGS.max_steps` iterations (1,000 by default), but only log the training information every `FLAGS.log_frequency` (100) steps. Note that we do not have to write any derivatives explicitly, thanks to TensorFlow's [automatic differentiation](http://www.columbia.edu/~ahd2125/post/2015/12/5/).

In [None]:
# %load -r 170:175 simple_train_cifar.py
for step in range(FLAGS.max_steps):
    # Training: Backpropagation using train set
    (trainImages, trainLabels) = cifar.getTrainBatch()
    (testImages, testLabels) = cifar.getTestBatch()


We have defined everything your TF model needs to be trained, so we only need to include some code for visualising the training progress and saving checkpoints.

### Summaries and Tensorboard
TensorBoard allows for the visualisation of training and testing statistics, in addition to a graphical view of the CNN that was trained. To do this we can run TensorBoard on BlueCrystal and, via port forwarding, view the results on the lab machines.

First, we need to indicate what we want to be saved in the summaries. For now we will save some of the images that are fed into the model, along with the loss and accuracy for every batch.



In [None]:
# %load -r 151:177 simple_train_cifar.py
    loss_summary = tf.summary.scalar('Loss', cross_entropy)
    acc_summary = tf.summary.scalar('Accuracy', accuracy)

    # summaries for TensorBoard visualisation
    validation_summary = tf.summary.merge([img_summary, acc_summary])
    training_summary = tf.summary.merge([img_summary, loss_summary])
    test_summary = tf.summary.merge([img_summary, acc_summary])

    # saver for checkpoints
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=1)

    with tf.Session() as sess:
        summary_writer = tf.summary.FileWriter(FLAGS.train_dir + '_train', sess.graph)
        summary_writer_validation = tf.summary.FileWriter(FLAGS.train_dir + '_validate', sess.graph)
        summary_writer_test = tf.summary.FileWriter(FLAGS.train_dir + '_test', sess.graph)

        sess.run(tf.global_variables_initializer())

        # Training and validation
        for step in range(FLAGS.max_steps):
            # Training: Backpropagation using train set
            (trainImages, trainLabels) = cifar.getTrainBatch()
            (testImages, testLabels) = cifar.getTestBatch()

            _, summary_str = sess.run([train_step, training_summary], feed_dict={x: trainImages, y_: trainLabels})


#### Saving checkpoints

Lastly, we include a saver for saving checkpoints so you can use them as a backup of your training and later for evaluation of your model.

Through the flags, we have set the code to save the model every 1000 steps.
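To see which steps actually trigger a save with the default flag values, the saving condition from the training loop can be evaluated on its own. This is a plain-Python sketch; `max_steps` and `save_model` are stand-ins for the corresponding flags.

```python
# Stand-ins for FLAGS.max_steps and FLAGS.save_model (default values).
max_steps = 1000
save_model = 1000

# Same condition as in the training loop.
save_steps = [step for step in range(max_steps)
              if step % save_model == 0 or (step + 1) == max_steps]

print(save_steps)  # [0, 999]
```

With the defaults, a checkpoint is written at the very first step and again at the final step. Since the saver is created with `max_to_keep=1`, only the most recent checkpoint is kept on disk.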

In [None]:
# %load -r 186:189 simple_train_cifar.py
# Save the model checkpoint periodically.
if step % FLAGS.save_model == 0 or (step + 1) == FLAGS.max_steps:
    checkpoint_path = os.path.join(FLAGS.train_dir + '_train', 'model.ckpt')
    saver.save(sess, checkpoint_path, global_step=step)

### Validating and evaluating results

Finally, below we show how to use CIFAR-10's *test set* to see how well the model performs when classifying unseen examples. In these lab sessions we will use the test set primarily for hyperparameter selection. Bear in mind that a held-out subsample of the training set (commonly denoted the *validation set*) is usually used for that purpose. Using validation and test sets helps to identify cases of under- and overfitting, as well as to benchmark the performance of different algorithms.
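The evaluation loop walks through the test set in mini-batches and allows the last batch to be smaller. The plain-Python sketch below shows how many batches that gives for the 10,000 CIFAR-10 test images with a batch size of 128, and how a plain average of per-batch accuracies compares with one weighted by batch size. The helper names and the accuracy values are made up for this illustration.

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of consecutive batches when the last one may be smaller."""
    n_full, remainder = divmod(n_samples, batch_size)
    return [batch_size] * n_full + ([remainder] if remainder else [])


def unweighted_mean(batch_accuracies):
    """Average of the per-batch accuracies, as done in the lab code."""
    return sum(batch_accuracies) / float(len(batch_accuracies))


def weighted_mean(batch_accuracies, sizes):
    """Accuracy over all examples, weighting each batch by its size."""
    correct = sum(a * n for a, n in zip(batch_accuracies, sizes))
    return correct / float(sum(sizes))


sizes = batch_sizes(10000, 128)
print(len(sizes), sizes[-1])  # 79 16

# Hypothetical per-batch accuracies: 0.5 everywhere, 1.0 on the small last batch.
accs = [0.5] * 78 + [1.0]
print('%.4f' % unweighted_mean(accs))       # 0.5063
print('%.4f' % weighted_mean(accs, sizes))  # 0.5008
```

The two averages differ only because the last batch is smaller than the others; in practice the gap is usually small, but the weighted version is the exact accuracy over the whole test set.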


In [None]:
# %load -r 190:209 simple_train_cifar.py

        # Testing (this code runs inside the `with tf.Session() as sess:` block)

        # resetting the internal batch indexes
        cifar.reset()
        evaluatedImages = 0
        test_accuracy = 0
        nRuns = 0

        while evaluatedImages != cifar.nTestSamples:
            # don't loop back when we reach the end of the test set
            (testImages, testLabels) = cifar.getTestBatch(allowSmallerBatches=True)
            test_accuracy_temp, _ = sess.run([accuracy, test_summary], feed_dict={x: testImages, y_: testLabels})

            nRuns = nRuns + 1
            test_accuracy = test_accuracy + test_accuracy_temp
            evaluatedImages = evaluatedImages + testLabels.shape[0]

        test_accuracy = test_accuracy / nRuns

## 1.4 Complete File

We now have a complete file that defines a CNN, trains and validates it, and saves checkpoints. Next, we will move on to training the model using BlueCrystal Phase 4.

In [None]:
# %load simple_train_cifar.py
############################################################
#                                                          #
#  Code for Lab 1: Intro to TensorFlow and Blue Crystal 4  #
#                                                          #
############################################################

'''Based on TensorFlow's tutorial: A deep MNIST classifier using convolutional layers.

See extensive documentation at
https://www.tensorflow.org/get_started/mnist/pros
'''

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys
import os

import tensorflow as tf

sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'CIFAR10'))
import cifar10 as cf

FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('data-dir', os.getcwd() + '/dataset/',
                           'Directory where the dataset will be stored and checkpoint.')
tf.app.flags.DEFINE_integer('max-steps', 1000,
                            'Number of mini-batches to train on.')
tf.app.flags.DEFINE_integer('log-frequency', 100,
                            'Number of steps between logging results to the console and saving summaries')
tf.app.flags.DEFINE_integer('save-model', 1000,
                            'Number of steps between model saves')

# Optimisation hyperparameters
tf.app.flags.DEFINE_integer('batch-size', 128, 'Number of examples per mini-batch')
tf.app.flags.DEFINE_float('learning-rate', 1e-4, 'Learning rate for the optimiser.')
tf.app.flags.DEFINE_integer('img-width', 32, 'Image width')
tf.app.flags.DEFINE_integer('img-height', 32, 'Image height')
tf.app.flags.DEFINE_integer('img-channels', 3, 'Image channels')
tf.app.flags.DEFINE_integer('num-classes', 10, 'Number of classes')
tf.app.flags.DEFINE_string('train-dir',
                           '{cwd}/logs/exp_bs_{bs}_lr_{lr}'.format(cwd=os.getcwd(),
                                                                   bs=FLAGS.batch_size,
                                                                   lr=FLAGS.learning_rate),
                           'Directory where to write event logs and checkpoint.')


def deepnn(x):
    '''deepnn builds the graph for a deep net for classifying CIFAR10 images.

    Args:
        x: an input tensor with the dimensions (N_examples, 3072), where 3072 is
          the number of values (32 x 32 pixels x 3 channels) in a CIFAR10 image.

    Returns:
        y: a tensor of shape (N_examples, 10), with values equal to the logits
          of classifying the object images into one of 10 classes
          (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
        img_summary: a string tensor containing sampled input images.
    '''
    # Reshape to use within a convolutional neural net.  Last dimension is for
    # 'features' - it would be 1 for a grayscale image, 3 for an RGB image,
    # 4 for RGBA, etc.

    x_image = tf.reshape(x, [-1, FLAGS.img_width, FLAGS.img_height, FLAGS.img_channels])

    img_summary = tf.summary.image('Input_images', x_image)

    # First convolutional layer - maps one RGB image to 32 feature maps.
    with tf.variable_scope('Conv_1'):
        W_conv1 = weight_variable([5, 5, FLAGS.img_channels, 32])
        b_conv1 = bias_variable([32])
        h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

        # Pooling layer - downsamples by 2X.
        h_pool1 = max_pool_2x2(h_conv1)

    with tf.variable_scope('Conv_2'):
        # Second convolutional layer -- maps 32 feature maps to 64.
        W_conv2 = weight_variable([5, 5, 32, 64])
        b_conv2 = bias_variable([64])
        h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

        # Second pooling layer.
        h_pool2 = max_pool_2x2(h_conv2)

    with tf.variable_scope('FC_1'):
        # Fully connected layer 1 -- after 2 rounds of downsampling, our 32x32
        # image is down to 8x8x64 feature maps -- maps this to 1024 features.
        W_fc1 = weight_variable([8 * 8 * 64, 1024])
        b_fc1 = bias_variable([1024])

        h_pool2_flat = tf.reshape(h_pool2, [-1, 8*8*64])
        h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

    with tf.variable_scope('FC_2'):
        # Map the 1024 features to 10 classes, one for each object category
        W_fc2 = weight_variable([1024, FLAGS.num_classes])
        b_fc2 = bias_variable([FLAGS.num_classes])

        y_conv = tf.matmul(h_fc1, W_fc2) + b_fc2
        return y_conv, img_summary


def conv2d(x, W):
    '''conv2d returns a 2d convolution layer with full stride.'''
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME', name='convolution')


def max_pool_2x2(x):
    '''max_pool_2x2 downsamples a feature map by 2X.'''
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME', name='pooling')


def weight_variable(shape):
    '''weight_variable generates a weight variable of a given shape.'''
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial, name='weights')


def bias_variable(shape):
    '''bias_variable generates a bias variable of a given shape.'''
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial, name='biases')


def main(_):
    tf.reset_default_graph()

    # Import data
    cifar = cf.cifar10(batchSize=FLAGS.batch_size, downloadDir=FLAGS.data_dir)

    with tf.variable_scope('inputs'):
        # Create the model
        x = tf.placeholder(tf.float32, [None, FLAGS.img_width * FLAGS.img_height * FLAGS.img_channels])
        # Define loss and optimizer
        y_ = tf.placeholder(tf.float32, [None, FLAGS.num_classes])

    # Build the graph for the deep net
    y_conv, img_summary = deepnn(x)

    with tf.variable_scope('x_entropy'):
        cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))

    train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(cross_entropy)
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')
    loss_summary = tf.summary.scalar('Loss', cross_entropy)
    acc_summary = tf.summary.scalar('Accuracy', accuracy)

    # summaries for TensorBoard visualisation
    validation_summary = tf.summary.merge([img_summary, acc_summary])
    training_summary = tf.summary.merge([img_summary, loss_summary])
    test_summary = tf.summary.merge([img_summary, acc_summary])

    # saver for checkpoints
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=1)

    with tf.Session() as sess:
        summary_writer = tf.summary.FileWriter(FLAGS.train_dir + '_train', sess.graph)
        summary_writer_validation = tf.summary.FileWriter(FLAGS.train_dir + '_validate', sess.graph)
        summary_writer_test = tf.summary.FileWriter(FLAGS.train_dir + '_test', sess.graph)

        sess.run(tf.global_variables_initializer())

        # Training and validation
        for step in range(FLAGS.max_steps):
            # Training: Backpropagation using train set
            (trainImages, trainLabels) = cifar.getTrainBatch()
            (testImages, testLabels) = cifar.getTestBatch()

            _, summary_str = sess.run([train_step, training_summary], feed_dict={x: trainImages, y_: trainLabels})

            if step % (FLAGS.log_frequency + 1) == 0:
                summary_writer.add_summary(summary_str, step)

            # Validation: Monitoring accuracy using validation set
            if step % FLAGS.log_frequency == 0:
                validation_accuracy, summary_str = sess.run([accuracy, validation_summary], feed_dict={x: testImages, y_: testLabels})
                print('step %d, accuracy on validation batch: %g' % (step, validation_accuracy))
                summary_writer_validation.add_summary(summary_str, step)

            # Save the model checkpoint periodically.
            if step % FLAGS.save_model == 0 or (step + 1) == FLAGS.max_steps:
                checkpoint_path = os.path.join(FLAGS.train_dir + '_train', 'model.ckpt')
                saver.save(sess, checkpoint_path, global_step=step)

        # Testing (still inside the session, which is closed when the `with` block exits)

        # resetting the internal batch indexes
        cifar.reset()
        evaluatedImages = 0
        test_accuracy = 0
        nRuns = 0

        while evaluatedImages != cifar.nTestSamples:
            # don't loop back when we reach the end of the test set
            (testImages, testLabels) = cifar.getTestBatch(allowSmallerBatches=True)
            test_accuracy_temp, _ = sess.run([accuracy, test_summary], feed_dict={x: testImages, y_: testLabels})

            nRuns = nRuns + 1
            test_accuracy = test_accuracy + test_accuracy_temp
            evaluatedImages = evaluatedImages + testLabels.shape[0]

        test_accuracy = test_accuracy / nRuns
        print('test set: accuracy on test set: %0.3f' % test_accuracy)


if __name__ == '__main__':
    tf.app.run(main=main)


## 2 Blue Crystal Phase 4

BlueCrystal Phase 4 (BC4) is the latest update to the University's High Performance Computing (HPC) machine. It includes 32 GPU-accelerated nodes, each with two NVIDIA Tesla P100 GPU accelerators, and a visualization node equipped with NVIDIA GRID GPUs. What matters to us are the Tesla P100 GPU accelerators, which you will use for training your deep learning algorithms.

Further information on BC4 and the support we have for it are available at: https://www.acrc.bris.ac.uk/acrc/phase4.htm

**NOTE**: You may try to debug and run programs on your own machine, but sadly we are unable to offer assistance with installing or setting up the dependencies on personal machines.


There are two *modes* for using BC4: *Interactive* and *Batch*. We will use *Interactive* as much as possible during lab sessions, since it allows the immediate execution of your program and you can see outputs directly in the terminal window (great for debugging). The *Batch* mode queues your job and generates files related to the execution of your program. You will use *Batch* as part of CW2, so we will revisit that later.

## 2.1   Copying Lab-1 files between your machine and BC4

You need to copy the provided folders `CIFAR10` and `Lab_1_intro` (which contains `simple_train_cifar.py`, `submit_job.sh`, `tensorboard_params.sh` and `go_interactive.sh`) to your account on BC4.

To copy an individual file from your machine to your home directory on BC4, follow this example with `go_interactive.sh`:

```bash
scp  /path_to_files/Lab_1_intro/go_interactive.sh <your_UoB_ID>@bc4login.acrc.bris.ac.uk:
```

or all files at once by using: 

```bash
scp -r /path_to_files/*  <your_UoB_ID>@bc4login.acrc.bris.ac.uk:
```

To copy files back from BC4 to your machine, use ```scp``` from a terminal on your machine. You can copy individual files, as shown below, as well as directories (by adding the `-r` flag):

```bash
scp  <your_UoB_ID>@bc4login.acrc.bris.ac.uk:/path_on_bc4/foo.foo   /path_in_your_machine/
```
 
Alternatively, you may wish to use SSHFS to mount a directory on BC4 onto a local directory:
 
```bash
mkdir -p ~/bc4 && sshfs <your_UoB_ID>@bc4login.acrc.bris.ac.uk:/dir_on_bc4/ ~/bc4
```
 

## 2.2  Logging in, running scripts  and managing your directory

Follow the steps below to log in to BC4.

### **Logging in** 

The connection to BC4 is made via SSH, so open a **new** Terminal window and type:

```bash
ssh <your_UoB_ID>@bc4login.acrc.bris.ac.uk
```

You should see something like this in your home directory:
 
```
CIFAR10
|----------cifar10.py
Lab_1_intro
|----------simple_train_cifar.py
|----------submit_job.sh
|----------tensorboard_params.sh
|----------go_interactive.sh
```
 
**NOTE: If you cannot see the file structure above, you have not copied the files correctly**

## 2.3 Training your first CNN.


**It's finally here, the moment you've been waiting for!** 

Follow the next steps for running the training script:

1. From the BlueCrystal SSH login (section 2.2), change to the Lab 1 directory:

    ```cd Lab_1_intro/```
    
2. Make all `.sh` files executable by using the command `chmod`:

   ```chmod +x go_interactive.sh submit_job.sh tensorboard_params.sh```
   
3. Switch to interactive mode, and note that the prompt changes from the login node to a reserved GPU node:

    ```./go_interactive.sh ```
    
4. Run the following script. It will print two values: `ipnport=XXXXX` and `ipnip=XX.XXX.X.X`.

    ```./tensorboard_params.sh```
    
    **Write them down since we will use them for using TensorBoard.**

5. Train the model using the command:
    
    ```bash
    python simple_train_cifar.py &
    tensorboard --logdir=logs/ --port=<ipnport>
    ```
   
   where `<ipnport>` is the value from the previous step. It might take a minute or two before you start seeing the accuracy on the validation batch, which is printed every 100 steps.

## 2.4 Visualising and Monitoring Your Training


1. Open a **new Terminal window** on your machine and type: 
    
    ```ssh <your_UoB_ID>@bc4login.acrc.bris.ac.uk -L 6006:<ipnip>:<ipnport>```

    where `ipnip` and `ipnport` come from step 4 in **2.3**.

2. Open your web browser (Use Chrome; Firefox currently has issues with tensorboard) and open the port 6006 (http://localhost:6006). This should open TensorBoard, and you can navigate through the summaries that we included.

3. Go to the **GRAPH** tab and navigate to your CNN, identifying the two convolutional layers `CONV_1` and `CONV_2` and their hyperparameters, as well as the two fully-connected layers `FC_1` and `FC_2`.

4. Go to the **SCALARS** tab 

5. Tick the box *Show data download links*

6. Click on **Loss**

7. Download the csv file `run_exp_bs_128_lr_0.0001_train,tag_Loss.csv`

**Keep the csv file for your Lab_1 portfolio**
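
Before filing the CSV away, it can be worth checking that it contains what you expect. The following is a minimal sketch, assuming Python 3 on your own machine and the usual TensorBoard export header `Wall time,Step,Value`. The helper names `parse_scalar_rows`, `load_scalar_csv` and `summarise` are hypothetical.

```python
import csv
import os


def parse_scalar_rows(lines):
    """Return (steps, values) from the lines of a TensorBoard scalar CSV export."""
    steps, values = [], []
    for row in csv.DictReader(lines):
        steps.append(int(float(row["Step"])))
        values.append(float(row["Value"]))
    return steps, values


def load_scalar_csv(path):
    """Read a downloaded CSV file from disk and parse it."""
    with open(path, newline="") as f:
        return parse_scalar_rows(f)


def summarise(steps, values):
    """Report the first, last and lowest loss, and the step of the lowest loss."""
    best = min(range(len(values)), key=values.__getitem__)
    return {
        "first": values[0],
        "last": values[-1],
        "min": values[best],
        "min_step": steps[best],
    }


csv_path = "run_exp_bs_128_lr_0.0001_train,tag_Loss.csv"
if os.path.exists(csv_path):  # only runs once the file has been downloaded
    steps, values = load_scalar_csv(csv_path)
    print(summarise(steps, values))
```

A loss that ends well below where it started suggests that training progressed, while a flat or rising curve is worth raising with a TA.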

By using TensorBoard you can monitor the training process. In the following labs you will perform experiments by varying hyperparameters such as learning rate, batch size, epochs, etc.

## 2.6 Saving your trained model

You should copy your log files back from BC4 and save them for your first lab portfolio:

```bash
# Run this on your own machine; the remote path is relative to your BC4 home directory
scp -r <your_UoB_ID>@bc4login.acrc.bris.ac.uk:Lab_1_intro/logs /path_in_your_machine/
```

Both your directory `logs/` and your `csv` file should be submitted as part of your Lab_1 portfolio (see [**section 5**](#5.-Preparing-Lab_1-Portfolio)).

# 3. Training your Second CNN

It is now time to train your own modification to the CNN above.

Choose one hyperparameter to change in your built CNN. You might add/remove layers, change layer sizes, ...

**Discuss your choice with a TA**

1. Duplicate your file ```simple_train_cifar.py``` and name the copy ```second_train_cifar.py```
2. Change the code to reflect **your chosen hyperparameter change**
3. Copy ```second_train_cifar.py``` to BC4 as you've done previously
4. Train the model using the following command. 
   ```bash
   python second_train_cifar.py &
   tensorboard --logdir=logs_second/ --port=<ipnport>
   ```
   **NOTE the change in the logs directory.**

5. Load TensorBoard and check that it reads from the *new logs* directory
6. Save the *logs_second* folder and the new *csv* file back to your machine
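
Per-step training loss is noisy, so comparing your two runs by their last raw value can be misleading. The following is a minimal sketch of one way to compare them, using an exponential moving average (similar in spirit to the smoothing slider in TensorBoard, not a reproduction of it). The function name `smooth`, the weight of 0.6 and the loss values are all hypothetical. In practice the lists would come from your two downloaded CSV files.

```python
def smooth(values, weight=0.6):
    """Exponential moving average; a higher weight gives a smoother curve."""
    smoothed = []
    last = values[0]
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed


# Hypothetical loss values standing in for the two downloaded CSV files.
first_run = [2.3, 2.0, 1.9, 1.6, 1.7, 1.4]
second_run = [2.3, 1.8, 1.9, 1.3, 1.5, 1.2]

final_first = smooth(first_run)[-1]
final_second = smooth(second_run)[-1]
better = "second" if final_second < final_first else "first"
print("smoothed final loss: first={:.3f}, second={:.3f}".format(final_first, final_second))
print("the {} run ends with the lower smoothed loss".format(better))
```

A lower training loss alone does not show that your change helped, so also compare the validation accuracy of the two runs in TensorBoard.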

# 4. Closing all sessions

Once the training has finished, **close all sessions** by typing `exit`. You need to do this twice for an **interactive session.** 

**Please make sure you close your session in order to release the GPU node.**

# 5. Preparing Lab_1 Portfolio

You should by now have the following files, which you can zip under the name `Lab_1_<username>.zip` 

**From your logs, include only the TensorBoard summaries and remove the checkpoints (model.ckpt-* files)**

 ```
 Lab_1_<username>.zip
 |----------logs/
            |----------exp_bs_128_lr_0.0001_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------exp_bs_128_lr_0.0001_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
 |----------run_exp_bs_128_lr_0.0001_train,tag_Loss.csv
 |----------logs_second/
            |----------<your second training>_train
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
            |----------<your second training>_validate
                       |----------events.out.tfevents.xxxxxxxxxx.gpuxx.bc4.acrc.priv
 |----------<your second training.csv>
 ```
 
Store this zip safely. You will be asked to upload the portfolios for all your labs to **SAFE after Week 10**; check SAFE for deadline details.
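
The packaging step above can be scripted so that checkpoints are never included by accident. The following is a minimal sketch, assuming you run it from the folder on your machine that contains `logs`, `logs_second` and your CSV files. The names `is_summary` and `build_portfolio`, and the archive name `Lab_1_username.zip`, are hypothetical. It keeps only files whose names start with `events.out.tfevents.`, which means `model.ckpt-*` files and anything else in the log directories are skipped.

```python
import fnmatch
import glob
import os
import zipfile


def is_summary(filename):
    """True for TensorBoard event files, False for checkpoints and other files."""
    return fnmatch.fnmatch(filename, "events.out.tfevents.*")


def build_portfolio(zip_path, log_dirs, csv_files):
    """Zip the TensorBoard summaries and CSV exports; return the paths added."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for log_dir in log_dirs:
            for root, _dirs, files in os.walk(log_dir):
                for name in files:
                    if is_summary(name):
                        path = os.path.join(root, name)
                        zf.write(path)  # relative path is kept inside the zip
                        added.append(path)
        for csv_file in csv_files:
            zf.write(csv_file)
            added.append(csv_file)
    return added


# Only runs when both log folders are present in the current directory.
if os.path.isdir("logs") and os.path.isdir("logs_second"):
    for path in build_portfolio(
        "Lab_1_username.zip", ["logs", "logs_second"], glob.glob("*.csv")
    ):
        print("added", path)
```

Check the printed list against the file tree above before you store the archive.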

# NOTE: Using the batch method 

During the course, very few GPUs may be available due to maintenance, other people using them, unforeseen events, etc., making it impossible to use the *Interactive Session* described previously. Should this happen, you will have to use the *batch method*. To do this, establish a connection to BC4 as described before. Open the file `submit_job.sh` using emacs, vim or your favourite CLI text editor, **modify line #10** to include your email (BlueCrystal will send you notifications about the jobs you submit) and **modify line #14** to the filename you are running. Now run the following command to submit a job to the BC4 queueing system:

```sbatch submit_job.sh```

You should then see two generated files, `hostname_<job_number>.err` and `hostname_<job_number>.out`, which contain the output of your Python script.
`hostname_<job_number>.out` contains the output you would normally see in the terminal window when using interactive mode, while `hostname_<job_number>.err` contains the error output produced if the script exited with a non-zero exit status.
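
If you submit jobs often, it can help to work out the output file names from the job number automatically. The following is a minimal sketch, assuming `sbatch` reports a line of the form `Submitted batch job <job_number>` and that the output files use the `hostname_` prefix shown above. The helper names `parse_job_id`, `job_output_files` and `submit`, and the job number in the example, are hypothetical.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the job number from the message printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("unexpected sbatch output: {!r}".format(sbatch_output))
    return int(match.group(1))


def job_output_files(job_id, prefix="hostname"):
    """Return the expected (.out, .err) file names for a job."""
    return "{}_{}.out".format(prefix, job_id), "{}_{}.err".format(prefix, job_id)


def submit(script="submit_job.sh"):
    """Submit the job on BC4 and return its job number (needs sbatch on PATH)."""
    output = subprocess.check_output(["sbatch", script]).decode()
    return parse_job_id(output)


# Hypothetical sbatch message; on BC4 you would call submit() instead.
job_id = parse_job_id("Submitted batch job 123456")
out_file, err_file = job_output_files(job_id)
print("watch", out_file, "for progress and", err_file, "for errors")
```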