# Notebook for Converting JPEGs to TFRecords with Features and Labels. Notebook (1/6) in the End-to-End Scalable Deep Learning Pipeline on Hops.

The TinyImageNet dataset contains a training set of 100,000 images, a validation set of 10,000 images, and a test
set of 10,000 images. The images are drawn from 200 different object classes and are downscaled from the original ImageNet size of 256x256 to 64x64.

The two initial tasks before we can train a model on this dataset are:

1. Group the images with their labels 
2. Convert the JPEGs into TFRecords

When using large datasets, like ImageNet, working directly with JPEGs is not very efficient, nor is it compatible with all of the functionality in the TensorFlow framework.

TFRecords is a binary format for representing features and labels in TensorFlow. Using a binary format for a large dataset can have a large impact on disk space and processing speed. In addition, TFRecords are easier to work with than raw JPEGs.

![step1.png](./../images/step2.png)

This notebook reads JPEGs and labels from:

- hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train
- hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/val
- hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/test

and dataset metadata from:

- hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/words.txt
- hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/val/val_annotations.txt

The notebook writes TFRecords to:

- hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw

and dataset statistics (number of records in each dataset (train/val/test)) to:

- hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/sizes.txt
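
The statistics file is plain text with one `name,count` line per dataset split (`train`, `val`, `test`). A later stage could read it back roughly as follows. This is a minimal sketch: the helper name `parse_sizes` is hypothetical and the counts are made up purely to show the layout.

```python
def parse_sizes(text):
    """Parse 'name,count' lines into a dict of split name -> record count."""
    sizes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        name, count = line.split(",")
        sizes[name] = int(count)
    return sizes

# Made-up counts, only to illustrate the file layout
example_text = "train,100\nval,10\ntest,10"
sizes = parse_sizes(example_text)
print(sizes)
```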

## Package Imports

Tested with versions:

- numpy: 1.14.5
- hops: 2.6.4
- pydoop: 2.0a3
- tensorboard: 1.8.0
- tensorflow: 1.8.0
- tensorflow-gpu: 1.8.0
- tfspark: 1.3.5

In [1]:
import tensorflow as tf
import pydoop.hdfs as py_hdfs
from hops import hdfs
import numpy as np

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
853,application_1536227070932_0179,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.

## Constants

In [2]:
PROJECT_DIR = hdfs.project_path()
DATASET_BASE_DIR = PROJECT_DIR + "tiny-imagenet/tiny-imagenet-200/"
TRAIN_DIR = DATASET_BASE_DIR + "train"
TEST_DIR = DATASET_BASE_DIR + "test"
VAL_DIR = DATASET_BASE_DIR + "val/images/"
ID_TO_CLASS_FILE = DATASET_BASE_DIR + "words.txt"
OUTPUT_DIR = DATASET_BASE_DIR + "tfrecords_raw/"
VAL_LABELS_FILE = DATASET_BASE_DIR + "val/val_annotations.txt"
FILE_PATTERN = "*.JPEG"
SIZES_FILE = DATASET_BASE_DIR + "sizes.txt"

## Parse Metadata about the Dataset

The dataset has some .txt files with annotation and other metadata that needs to be parsed.

In [3]:
def parse_metadata():
    """ 
    Parses the words.txt file into a map of label -> words and a list of ordered nids (index of nid = integer label).
    Also parses the val_annotations.txt file into a map of (validation_file_name --> nid)
    """
    # list all directories in the train set, the directory name is the "nid" and identifies the label
    train_dirs = py_hdfs.ls(TRAIN_DIR)
    
    # remove the path except the nid
    train_nid_list = list(map(lambda x: x.replace(TRAIN_DIR + "/", ""), train_dirs))
    
    # the number of nids equals the number of unique classes/labels
    num_classes = len(train_nid_list)
    
    # read the words.txt file that contains lines of the form "nid\twords"
    with py_hdfs.open(ID_TO_CLASS_FILE, 'r') as f:
        file_lines = f.read().decode("utf-8").split("\n")
    label_to_word = {}
    
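    for l in file_lines:
        # parse each line, skipping blank or malformed lines (e.g. after a trailing newline)
        tokens = l.split('\t', 1)
        if len(tokens) < 2:
            continue
        wnid, word = tokens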
        if wnid in train_nid_list:
            # convert the nids into integer labels by using the position in the index
            label = train_nid_list.index(wnid)
            word = str(label) + ": " + word
            # save the mapping of integer label --> words
            label_to_word[label] = word
    
    # read the val_annotations.txt file that contains lines of the form: 
    # "validation_image\tnid\tx_pos\ty_pos\tw_pos\th_pos"
    with py_hdfs.open(VAL_LABELS_FILE, 'r') as f:
        file_lines = f.read().decode("utf-8").split("\n")
    validation_file_to_nid = {}
    for l in file_lines:
        # parse each line
        tokens = l.split('\t')
        #skip corrupted lines
        if len(tokens) > 2:
            validation_img = tokens[0]
            wnid = tokens[1]
            # we only care about classification in this tutorial, not localization 
            if wnid in train_nid_list:
                validation_file_to_nid[validation_img] = wnid
    
    return train_nid_list, label_to_word, validation_file_to_nid
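
To see what the two tab-separated formats look like once parsed, the same logic can be exercised on in-memory strings without touching HopsFS. The sketch below is illustrative only: the helper names, the nids, and the sample lines are all hypothetical stand-ins for the real `words.txt` and `val_annotations.txt` contents.

```python
def parse_words(text, nid_list):
    """Map integer label -> 'label: words' for the nids present in nid_list."""
    label_to_word = {}
    for line in text.split("\n"):
        tokens = line.split("\t", 1)
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        wnid, word = tokens
        if wnid in nid_list:
            label = nid_list.index(wnid)  # position in the list = integer label
            label_to_word[label] = "{}: {}".format(label, word)
    return label_to_word

def parse_val_annotations(text, nid_list):
    """Map validation file name -> nid, ignoring the bounding-box columns."""
    file_to_nid = {}
    for line in text.split("\n"):
        tokens = line.split("\t")
        if len(tokens) > 2 and tokens[1] in nid_list:
            file_to_nid[tokens[0]] = tokens[1]
    return file_to_nid

# Hypothetical sample data in the same layout as the real files
nids = ["n00000001", "n00000002"]
words_txt = "n00000001\tfirst class\nn00000002\tsecond class, alias\nn00000003\tnot in train\n"
val_annotations_txt = "val_0.JPEG\tn00000002\t0\t32\t44\t62\n"

label_to_word = parse_words(words_txt, nids)
validation_file_to_nid = parse_val_annotations(val_annotations_txt, nids)
print(label_to_word)           # {0: '0: first class', 1: '1: second class, alias'}
print(validation_file_to_nid)  # {'val_0.JPEG': 'n00000002'}
```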

## Data Exploration

Let's look at a few sample images from the dataset:
```python
import matplotlib.pyplot as plt
%matplotlib inline
sample_images = ["hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n01443537/images/n01443537_327.JPEG","hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n01443537/images/n01443537_328.JPEG","hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n01698640/images/n01698640_381.JPEG","hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n01882714/images/n01882714_300.JPEG","hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n01770393/images/n01770393_101.JPEG","hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n01774750/images/n01774750_24.JPEG","hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n01784675/images/n01784675_101.JPEG","hdfs:///Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/train/n09332890/images/n09332890_300.JPEG"]
# Build one decode operation per sample image
img_ops = [tf.image.decode_jpeg(tf.read_file(path)) for path in sample_images]
with tf.Session() as sess:
    # Evaluate the operations one by one and collect the decoded images
    sample_images_parsed = [img_op.eval() for img_op in img_ops]

plt.rcParams["figure.figsize"] = (14,10)
count = 0
for img in sample_images_parsed:
    count += 1
    plt.subplot(4,4,count)
    plt.imshow(img)
    plt.axis("off")
plt.savefig("sample_images.png")
plt.show()
```
![sample_images.png](./../images/sample_images.png)

In [227]:
number_of_examples_per_class = list(map(lambda d: len(py_hdfs.ls(d + "/images/")), py_hdfs.ls(TRAIN_DIR)))

Let's look at the class distribution in the training set:
```python
plt.hist(number_of_examples_per_class, bins='auto')
plt.xlabel('Number of examples')
plt.ylabel('Number of classes')
plt.title('Class distribution histogram')
plt.savefig("class_distribution.png")
plt.show()
```
![class_distribution.png](./../images/class_distribution.png)

As we can see, the examples are uniformly distributed over the classes: the training dataset has 200 classes with 500 examples each (100,000 images in total).
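
The same observation can be checked numerically rather than read off the plot. The following is a small sketch with a stand-in list in place of `number_of_examples_per_class`; the helper name `summarize_class_sizes` is hypothetical.

```python
from collections import Counter

def summarize_class_sizes(sizes):
    """Return (number of classes, total examples, Counter of class size -> number of classes)."""
    return len(sizes), sum(sizes), Counter(sizes)

# Stand-in for number_of_examples_per_class: 200 classes with 500 examples each
sizes = [500] * 200
num_classes, total, histogram = summarize_class_sizes(sizes)
print(num_classes, total, dict(histogram))  # 200 100000 {500: 200}
```

A single key in the resulting counter means every class has the same number of examples.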

## Build a TensorFlow Computational Graph for Reading and Parsing Files and Labels

The graph uses queues to read files concurrently using all available threads on the machine. The graph consists of very few operations; the main ones are `img_reader.read` and `tf.image.decode_jpeg`.

In [228]:
def init_graph():
    """ 
    Initialize the graph and variables for Tensorflow engine 
    """
    # get operation for initializing the global variables in the graph
    init_g = tf.global_variables_initializer()
    
    # create a session for encapsulating the environment where 
    # operations can be run and tensors can be evaluated
    sess = tf.Session()
    
    # run the initialization operation
    sess.run(init_g)
    return sess

In [229]:
def build_graph_for_processing_dir(files_dir, file_pattern, base_dir, nid_list, test_dir=False, val_dir=False):
    """
    Builds the computational graph for parsing images and labels
    from a directory in HopsFS.
    The images are parsed into tensors and their corresponding label.
    """
    # Expand the glob file pattern into a list of files (images)
    img_filenames = tf.gfile.Glob(files_dir + file_pattern)
    
    # Create a queue of the filenames (images)
    img_queue = tf.train.string_input_producer(img_filenames)
    
    # Get the number of images to parse
    num_images = len(img_filenames)
    
    # Setup a reader for reading the queue of filenames
    # This reader will read the entire contents of a file as a value and returns (filename, filecontents)
    img_reader = tf.WholeFileReader()
    
    # Operation for reading a single file from the queue
    file_name_op, file_contents_op = img_reader.read(img_queue)
    
    if test_dir or val_dir:
        # We don't have labels for testset, for simplicity just set it to -1 and save in the same structure as
        # val/train records
        # Similarly, for validation directory we do not have a single label for a directory so we have to infer
        # it per file instead, it will be done dynamically when the graph is run so just set the label to -1 here.
        label = -1
    else:
        # Infer the label using the nid_list and the directory name
        label = nid_list.index(files_dir.replace(base_dir, "").replace("/images", "").replace("/", ""))
        
    # Operation for decoding JPEG to tensor
    img_to_tensor_op = tf.image.decode_jpeg(file_contents_op)
    
    return img_to_tensor_op, label, file_name_op, num_images
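
The label inference for training directories is plain string manipulation: strip the base directory, the `/images` suffix, and any slashes, then look up the remaining nid in the ordered list. A standalone sketch of that step, using a hypothetical base path, nid ordering, and helper name `infer_label`:

```python
def infer_label(files_dir, base_dir, nid_list):
    """Strip base_dir, '/images' and slashes from files_dir; return the index of the remaining nid."""
    nid = files_dir.replace(base_dir, "").replace("/images", "").replace("/", "")
    return nid_list.index(nid)

# Hypothetical base path and nid ordering
base_dir = "hdfs:///Projects/demo/train"
nid_list = ["n01443537", "n01698640", "n01882714"]
files_dir = base_dir + "/n01698640/images/"
print(infer_label(files_dir, base_dir, nid_list))  # 1
```

Because the integer label is a list position, the mapping is only stable as long as every caller sees the nid list in the same order.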

In [230]:
def run_graph_for_processing_dir(sess, img_op, label, file_name_op, num_images, files_dir, nid_list, validation_file_to_nid, val_dir=False):
    """
    Runs the computational graph for parsing images from HopsFS into tensors and their label
    """
    images = []
    labels = []
    # Log the number of files in the queue
    print("Reading {} files from directory {}".format(num_images, files_dir))
    for i in range(num_images):
        # The image and its file name must be fetched in the same call to sess.run(),
        # otherwise they get out of sync, which corrupts the labels for the validation set
        img_tensor, file_name_str = sess.run([img_op, file_name_op])
        # For validation directory we have to infer labels per file rather than per directory
        # We do this using the metadata file val_annotations.txt and the filename
        if val_dir:
            label = nid_list.index(validation_file_to_nid[file_name_str.decode("utf-8").replace(VAL_DIR, "")])
        # 2% of the data is not colored... just skipping these for now
        # Can convert these to 3 channel grey scale but not sure if it is worth it
        if img_tensor.shape == (64,64,3):
            images.append(img_tensor)
            labels.append(label)
    return np.array(images), np.array(labels)
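
The function above drops images that are not of shape `(64, 64, 3)`. If keeping the grayscale images were preferred, one possible approach (not what this notebook does) is to repeat the single channel three times with NumPy. The helper name `to_three_channels` and the all-zero sample arrays are hypothetical.

```python
import numpy as np

def to_three_channels(img):
    """Return a (H, W, 3) array; grayscale input of shape (H, W) or (H, W, 1) is repeated on the channel axis."""
    if img.ndim == 2:
        img = img[:, :, np.newaxis]
    if img.shape[2] == 1:
        img = np.repeat(img, 3, axis=2)
    return img

gray = np.zeros((64, 64, 1), dtype=np.uint8)  # stand-in for a decoded grayscale image
rgb = np.zeros((64, 64, 3), dtype=np.uint8)   # stand-in for a decoded color image
print(to_three_channels(gray).shape)  # (64, 64, 3)
print(to_three_channels(rgb).shape)   # (64, 64, 3)
```

Another option may be to pass `channels=3` to `tf.image.decode_jpeg`, which asks the decoder for RGB output; whether the extra 2% of images is worth it is left open here, as in the code comment.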

## Use Tensorflow Library to Save the Parsed Files and Labels into TFRecords

A TFRecords file contains a sequence of binary strings (multiple records). Since it is just bytes, you can write anything to this file, but for it to be easily read by TensorFlow later on, we should stick to TensorFlow's predefined binary formats.

TensorFlow has two predefined protocol buffer message types that you can use: Example and SequenceExample. SequenceExample is designed for data with a variable number of features, whereas Example is a better fit if every record stores the exact same data structure.

An Example protocol buffer message is simply a map of string keys to values, where the values are lists of integers, floats, or bytes.

In our case, we are working with multi-class classification of images and we will use an Example of the format:

```python
{
        'height': _int64_feature(height),
        'width': _int64_feature(width),
        'channel': _int64_feature(channels),
        'label': _int64_feature(label),
        'label_one_hot': _bytes_feature(label_one_hot_raw),
        'image_raw': _bytes_feature(img_raw)
}
```
Each value in the map is an instance of `tf.train.Feature`, which holds a list of int64, float, or bytes values.

In [231]:
def _int64_feature(value):
    """
    Wrapper for inserting int64 features into Example proto.
    """
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

In [232]:
def _bytes_feature(value):
    """
    Wrapper for inserting bytes features into Example proto.
    """
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

In [233]:
def _onehot_encoder(num_labels):
    """ 
    Creates a matrix where
    matrix[i] gives the one_hot_encoded version of integer label i
    """
    onehot_lookup = []
    for i in range(0, num_labels):
        temp = [0] * num_labels
        temp[i] = 1
        onehot_lookup.append(temp)
    return onehot_lookup
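
When a row of this lookup table is later passed through `bytes(...)`, each list entry becomes one byte. The sketch below shows that round trip, assuming Python 3 semantics for `bytes` (under Python 2, `bytes` of a list yields its string representation instead). The function here is a compact rewrite for illustration, not the notebook's own helper.

```python
def onehot_encoder(num_labels):
    """Row i is the one-hot encoding of integer label i."""
    return [[1 if j == i else 0 for j in range(num_labels)] for i in range(num_labels)]

lookup = onehot_encoder(4)
raw = bytes(lookup[2])   # Python 3: one byte per list entry
print(raw)               # b'\x00\x00\x01\x00'
print(list(raw))         # [0, 0, 1, 0]
```

A reader of the resulting records would therefore decode the `label_one_hot` field as unsigned 8-bit values, one per class.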

In [234]:
def convert_to_tf_records(images, labels, filename, onehot_lookup):
    """ 
    Saves a list of images and a list of corresponding labels into TFRecords on HopsFS
    """
    print("Converting the parsed images to TFRecords and saving to file: {}".format(filename))
    num_examples = labels.shape[0]
    if images.shape[0] != num_examples:
        raise ValueError("Images size %d does not match label size %d." % (images.shape[0], num_examples))
    rows = images.shape[1]
    cols = images.shape[2]
    depth = images.shape[3]
    writer = tf.python_io.TFRecordWriter(filename)
    for index in range(num_examples):
        image_raw = images[index].tostring()
        example = tf.train.Example(features=tf.train.Features(
            feature={
            'height': _int64_feature(rows),
            'width': _int64_feature(cols),
            'channel': _int64_feature(depth),
            'label': _int64_feature(int(labels[index])),
            'label_one_hot': _bytes_feature(bytes(onehot_lookup[int(labels[index])])),
            'image_raw': _bytes_feature(image_raw)
        }))
        writer.write(example.SerializeToString())
    # Close the writer so that all buffered records are flushed to HopsFS
    writer.close()
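
The `image_raw` field is a flat byte string with no shape information, which is why `height`, `width`, and `channel` are stored alongside it. A NumPy-only sketch of the encode and decode steps, using a synthetic image rather than a real JPEG:

```python
import numpy as np

# Synthetic 64x64 RGB image with uint8 pixels (a stand-in for a decoded JPEG)
image = (np.arange(64 * 64 * 3) % 256).astype(np.uint8).reshape(64, 64, 3)
height, width, channels = image.shape

# Serialize to raw bytes; tostring(), used in the notebook, is an older name for the same operation
image_raw = image.tobytes()

# The stored dimensions are needed to restore the original array
restored = np.frombuffer(image_raw, dtype=np.uint8).reshape(height, width, channels)
print(len(image_raw))                   # 12288
print(np.array_equal(image, restored))  # True
```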

## Put it All Together Into a Wrapper Function

The function is designed to process one directory at a time, so that we can parallelize the execution using Spark.

In [235]:
def process_dir(files_dir, file_pattern, base_dir, output_category):
    """
    This function orchestrates the conversion from JPEGs to TFRecords of a single directory
    It initiates and builds the computational graph,
    runs the conversion and saves resulting TFRecords to HopsFS
    """
    # Parse dataset metadata and annotation files
    # Get list of ordered nids, a map of label-->words from the words.txt file,
    # and a map of validation_img --> nid from the val_annotations.txt file
    nid_list, label_to_word, validation_file_to_nid = parse_metadata()
    # Create a matrix for converting integers into one-hot-encoding
    onehot_lookup = _onehot_encoder(len(nid_list))
    # If we are processing a test directory there are no labels
    test_dir = output_category == "test/"
    # If we are processing a validation directory the labels are not in the directory names but from
    # the val_annotations.txt file
    val_dir = output_category == "val/"
    # Build graph and get necessary tensorflow operations
    img_dir_op, dir_label, file_name_op, img_dir_filenames = build_graph_for_processing_dir(files_dir, file_pattern, base_dir, nid_list, test_dir=test_dir, val_dir=val_dir)

    # Initialize TF
    sess = init_graph()

    # Get coordinator for threads to be able to read
    coord = tf.train.Coordinator()

    # Start all queue runners in the graph and get the list of threads
    threads = tf.train.start_queue_runners(coord=coord, sess=sess)

    # Run the graph for each image to read it and convert to a tensor
    images, labels = run_graph_for_processing_dir(sess, img_dir_op, dir_label, file_name_op, img_dir_filenames, files_dir, nid_list, validation_file_to_nid, val_dir=val_dir)
    
    # Choose the output file name: one file for test, one for val, and one per class label for train
    if test_dir:
        filename = OUTPUT_DIR + output_category + "test" + ".tfrecords"
    elif val_dir:
        filename = OUTPUT_DIR + output_category + "val" + ".tfrecords"
    else:
        filename = OUTPUT_DIR + output_category + str(labels[0]) + ".tfrecords"
    
    num_written_records = len(images)
    
    convert_to_tf_records(images, labels, filename, onehot_lookup)
    
    print("num_written_records: {}".format(num_written_records))
    
    # Stop the queue runner threads, then clean up this session
    coord.request_stop()
    coord.join(threads)
    sess.close()
    
    return (filename, output_category, num_written_records) 

## Parallel/Distributed Processing of the Dataset

Processing the files and converting the data into TFRecords is an embarrassingly parallel operation, so we can use Spark to do it in a scalable manner.

A simple benchmark shows that preprocessing the TinyImageNet dataset on 1 machine takes around 20 minutes, whereas with 8 machines it completed in 4.5 minutes.

Using the same technique for processing the entire ImageNet will be crucial for performance.
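
The benchmark figures translate into a speedup and a parallel efficiency as follows. This is simple arithmetic on the two timings quoted above; the helper name `speedup_and_efficiency` is hypothetical.

```python
def speedup_and_efficiency(t_single, t_parallel, workers):
    """Speedup = single-machine time / parallel time; efficiency = speedup / number of workers."""
    speedup = t_single / t_parallel
    return speedup, speedup / workers

speedup, efficiency = speedup_and_efficiency(20.0, 4.5, 8)
print(round(speedup, 2), round(efficiency, 2))  # 4.44 0.56
```

The scaling is sub-linear. One plausible contributor is that the validation and test directories are each processed as a single task holding 10,000 images, compared with 500 images per training directory, so those two tasks may dominate the tail of the job.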

In [236]:
# Get all training directories from HopsFS and save into tuple (output_sub_dir, directory path)
train_dirs = list(map(lambda x: ("train/", x + "/images/"), py_hdfs.ls(TRAIN_DIR)))

# Get the single validation directory from HopsFS and save into tuple (output_sub_dir, directory path)
validation_dir = [("val/", VAL_DIR)]

# Get the single test directory from HopsFS and save into tuple (output_sub_dir, directory path)
test_dir = [("test/", TEST_DIR + "/images/")]

# Concat all directories that should be processed and convert into an RDD
dir_rdd = spark.sparkContext.parallelize(train_dirs + validation_dir + test_dir)

# Process the directories in parallel using Spark, writing out TFRecords to HopsFS
result = dir_rdd.map(lambda dir_tuple: process_dir(dir_tuple[1], FILE_PATTERN, TRAIN_DIR, dir_tuple[0])).collect()

# Collect results and write some statistics to HopsFS and to the log
files_written = list(map(lambda x: x[0], result))
train_records_count = sum(list(map(lambda x: int(x[2]), filter(lambda y: y[1] == "train/", result))))
val_records_count = sum(list(map(lambda x: int(x[2]), filter(lambda y: y[1] == "val/", result))))
test_records_count = sum(list(map(lambda x: int(x[2]), filter(lambda y: y[1] == "test/", result))))
print("The following files were written to HopsFS ({}): {}".format(len(files_written), ",".join(files_written)))
print("Number of train records total: {}".format(train_records_count))
print("Number of val records total: {}".format(val_records_count))
print("Number of test records total: {}".format(test_records_count))
py_hdfs.dump("train,{}\nval,{}\ntest,{}".format(train_records_count,val_records_count,test_records_count), SIZES_FILE)

The following files were written to HopsFS (202): hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw/train/0.tfrecords,hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw/train/1.tfrecords,hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw/train/2.tfrecords,hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw/train/3.tfrecords,hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw/train/4.tfrecords,hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw/train/5.tfrecords,hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipeline/tiny-imagenet/tiny-imagenet-200/tfrecords_raw/train/6.tfrecords,hdfs://10.0.104.196:8020/Projects/ImageNet_EndToEnd_MLPipelin