[TODO]
 - Create an example on how to achieve maximum local performance with caching. It can be a very simple function.
 - Improve the tests with Cache and Local Storage Cache.
 - It isn't clear why the amount of small files in Cloud Storage is the problem here.
 - It would be good to motivate why we want to bundle images into TFRecord files.
 - Explain why use Interleave before and after
 - Why is it ok to skip resizing now? This makes it sound like we might be comparing apples and oranges
 - it looks like the cache flag is never used


# Building High Performance Data Pipelines with tf.data and Google Cloud Storage

This article walks through the steps of building a high-performance data input pipeline with TensorFlow and Google Cloud Storage.
The concepts and techniques build on each other at each step, moving from the slowest solution to the fastest.

This article uses the Stanford Dogs Dataset [1] with ~20000 images and 120 classes.

[1] https://www.kaggle.com/jessicali9530/stanford-dogs-dataset

## Benchmark function

The benchmark measures the number of images ingested (read) per second from Cloud Storage into the host virtual machine. There are several ways to implement this measurement; here a simple function iterates through the dataset and measures the elapsed time.

The following Python function ('timeit'), taken from the TensorFlow documentation [1] (as of 03/18/2020, version 2.1), is used. Since tf.data.Dataset implements \__iter__, it is possible to iterate over the data and observe the progress.

[1] https://www.tensorflow.org/tutorials/load_data/images#performance

In [1]:
# First let's import Tensorflow
import tensorflow as tf

In [2]:
# Now import some additional libraries
from numpy import zeros
import numpy as np
from datetime import datetime

In [4]:
# Benchmark function for dataset
import time
default_timeit_steps = 1000

# Iterate through each element of a dataset. An element is a pair 
# of image and label.
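def timeit(ds: tf.data.Dataset, steps: int = default_timeit_steps,
           batch_size: int = 1) -> None:
    # batch_size must match the batch size used to build `ds`
    # (BATCH_SIZE, defined further below, is 1).
    start = time.time()
    it = iter(ds)

    for i in range(steps):
        batch = next(it)

        # Print a progress dot every 10 batches
        if i % 10 == 0:
            print('.', end='')
    print()
    end = time.time()

    duration = end - start
    print("{} batches: {} s".format(steps, duration))
    # Use the batch_size argument (not the global) to compute throughput
    print("{:0.5f} Images/s".format(batch_size * steps / duration))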
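
When comparing several runs, it can help to get the throughput back as a number instead of printed text. The sketch below is an assumption and not part of the TensorFlow tutorial: `measure_throughput` is a hypothetical helper that accepts any iterable, so it should work with a tf.data.Dataset as well as with a plain Python sequence.

```python
import time
from typing import Iterable


def measure_throughput(batches: Iterable, steps: int, batch_size: int = 1) -> float:
    """Consume up to `steps` batches and return the number of items per second."""
    it = iter(batches)
    consumed = 0
    start = time.perf_counter()
    for _ in range(steps):
        try:
            next(it)
        except StopIteration:
            # The iterable ended early; report what was actually consumed
            break
        consumed += 1
    duration = time.perf_counter() - start
    # Guard against a zero duration on very fast runs
    return batch_size * consumed / max(duration, 1e-9)


# Toy usage with a plain range standing in for a dataset
rate = measure_throughput(range(1000), steps=100, batch_size=1)
print("{:0.1f} items/s".format(rate))
```

Returning a float makes it straightforward to collect the results of each pipeline variant in a list or table.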

## Let's create the Dataset using tf.data - Reading images individually

All the images are located in a bucket in Google Cloud Storage (example: gs://cloud_bucket/label/image.jpeg).\
Labels are extracted (parsed) from the name of the folder that contains each image.

In this first step, the dataset is created from the image file paths (gs://...), and the labels are extracted and one-hot encoded.

This dataset maps each image in the bucket individually.
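
As a quick illustration of the label parsing, the second-to-last component of a path is the label folder. This is a minimal sketch; `label_from_path` is a hypothetical helper name used only here.

```python
def label_from_path(path: str) -> str:
    """Return the name of the folder that contains the image."""
    return path.split('/')[-2]


example_path = 'gs://tf-data-pipeline/n02085620-Chihuahua/n02085620_10074.jpg'
parsed_label = label_from_path(example_path)
print(parsed_label)
```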

In [18]:
# Global variables

# Paths where images are located
FILENAMES = 'gs://tf-data-pipeline/*/*.jpg'

# Paths where labels can be parsed
FOLDERS = 'gs://tf-data-pipeline/*'

# Image resolution and shape
RESOLUTION = (224,224)
IMG_SHAPE=(224,224,3)

# Batch size and tf.data AUTOTUNE
BATCH_SIZE = 1
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [19]:
# Get labels from folder's name and create a map to an ID
def get_label_map(path: str) -> (dict, dict):
    #list folders in this path
    folders_name = tf.io.gfile.glob(path)

    labels = []
    for folder in folders_name:
        labels.append(folder.split(sep='/')[-1])

    # Generate a Label Map and an Inverted Label Map
    label_map = {labels[i]:i for i in range(len(labels))}
    inv_label_map = {i:labels[i] for i in range(len(labels))}
    
    return label_map, inv_label_map

In [20]:
# One hot encode the image's labels
def one_hot_encode(label_map: dict, filepath: list) -> dict:
    labels = dict()
    
    for i in range(len(filepath)):
        encoding = zeros(len(label_map), dtype='uint8')
        encoding[label_map[filepath[i].split(sep='/')[-2]]] = 1
        
        labels.update({filepath[i]:list(encoding)})
    
    return labels

In [21]:
label_map, inv_label_map = get_label_map(FOLDERS)

In [22]:
list(label_map.items())[:5]

[('n02085620-Chihuahua', 0),
 ('n02085782-Japanese_spaniel', 1),
 ('n02085936-Maltese_dog', 2),
 ('n02086079-Pekinese', 3),
 ('n02086240-Shih-Tzu', 4)]

In [23]:
# List all files in bucket
filepath = tf.io.gfile.glob(FILENAMES)
NUM_TOTAL_IMAGES = len(filepath)

In [24]:
# Split the features (image path) from labels
dataset = one_hot_encode(label_map, filepath)
dataset = [[k,v] for k,v in dataset.items()]

features = [i[0] for i in dataset]
labels = [i[1] for i in dataset]

In [25]:
# Create Dataset from Features and Labels
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

In [26]:
# Example of one element of the dataset
# At this point we have a dataset containing the path and labels of an image
print(next(iter(dataset)))

(<tf.Tensor: shape=(), dtype=string, numpy=b'gs://tf-data-pipeline/n02085620-Chihuahua/n02085620_10074.jpg'>, <tf.Tensor: shape=(120,), dtype=int32, numpy=
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>)


Next we define some preprocessing functions to:
 - Read the data from Cloud Storage
 - Decode JPEG
 - Convert image to a range between 0 and 1, as float
 - Resize image

In [27]:
# Download image bytes from Cloud Storage
def get_bytes_label(filepath, label):
    raw_bytes = tf.io.read_file(filepath)
    return raw_bytes, label

In [28]:
# Preprocess Image
def process_image(raw_bytes, label):
    image = tf.io.decode_jpeg(raw_bytes, channels=3)
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = tf.image.resize(image, (224,224))
    
    return image, label

#### Building the dataset

From the dataset already built with image paths and labels, the preprocessing functions are applied to download the bytes from Cloud Storage and apply some transformations to the data. \
All of these steps are performed only when the dataset is iterated.

At this point, all the steps are executed while streaming the data, including:
 - IO-intensive operations such as downloading the images (get_bytes_label)
 - CPU-intensive operations such as decoding and resizing the images (process_image)

Since there are thousands of images, this process can take a long time.

Some observations:
 - "num_parallel_calls = tf.data.experimental.AUTOTUNE" lets the TensorFlow runtime choose the level of parallelism for the map functions.
 - "dataset.cache" is supported through the "cache" argument, but with a large amount of data the cache may not fit into memory, which makes it impractical to use.
 - "dataset.prefetch" buffers elements ahead of the consumer in order to increase performance.

In [29]:
# Map transformations for each element inside the dataset
# Maps are separated as IO Intensive and CPU Intensive
def build_dataset(dataset, batch_size=BATCH_SIZE, cache=False):
    
    if cache:
        if isinstance(cache, str):
            dataset = dataset.cache(cache)
        else:
            dataset = dataset.cache()
    
    dataset = dataset.shuffle(NUM_TOTAL_IMAGES)
    
    # Extraction: IO Intensive
    dataset = dataset.map(get_bytes_label, num_parallel_calls=AUTOTUNE)

    # Transformation: CPU Intensive
    dataset = dataset.map(process_image, num_parallel_calls=AUTOTUNE)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size=batch_size)
    
    # Pipeline next iteration
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    
    return dataset

In [30]:
train_ds = build_dataset(dataset)

## First Attempt: No cache, no tricks

In this first benchmark no caching mechanism is used and the images are read one by one from the bucket.

The biggest problem here is reading thousands of files one by one, which can really slow down the process.\
Let's call our "timeit" function to measure the time needed for the load. 

In [None]:
# Iterate through this dataset for 1000 batches.
timeit(train_ds, batch_size=1, steps=1000)

## Ok, let's put some local cache in action

tf.data.Dataset implements a cache function. 

If no parameter is passed to cache, it uses the memory of the host to cache the data. \
The problem arises when your dataset is bigger than the host memory and a full epoch cannot be cached in memory. In this case the cache won't help and we still have a bottleneck.\
It is also possible to cache the images in local storage for reuse in future epochs. But in this case we are measuring the throughput from Cloud Storage and want to achieve the best performance possible.

First let's test the throughput using the cache in memory and then as a local file.

In [33]:
# Memory
train_cache_ds = build_dataset(dataset, cache=True)
timeit(train_cache_ds, batch_size=1, steps=50000)

In [None]:
# Local Cache File
train_local_cache_ds = build_dataset(dataset, cache='./dog.tfcache')
timeit(train_local_cache_ds)

### Hum ...

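Ok, but no performance improvement?

In this case, even with the in-memory and local-file caches, the host VM is not able to fetch data any faster. Note that "build_dataset" applies the cache before the map transformations, so only the file paths and labels are cached. Every image is still fetched from Cloud Storage as a separate small file, and each of those requests pays its own latency.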
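
One thing worth checking in a pipeline like this is where "cache" sits relative to "map". The following is a minimal sketch on a toy dataset, not the benchmark itself: `expensive` is a hypothetical stand-in for the download and decode work, and a Python counter records how often it runs over two passes.

```python
import tensorflow as tf

calls = {"n": 0}


def expensive(x):
    # Stand-in for the per-element work (download, decode, ...)
    calls["n"] += 1
    return x * 2


def tf_expensive(x):
    # Wrap the Python function so the counter is updated on every call
    y = tf.py_function(expensive, [x], tf.int64)
    y.set_shape(())
    return y


source = tf.data.Dataset.range(5)
cache_before_map = source.cache().map(tf_expensive)
cache_after_map = source.map(tf_expensive).cache()

# Two full passes with the cache placed before the map
for _ in range(2):
    list(cache_before_map.as_numpy_iterator())
before_calls = calls["n"]

# Two full passes with the cache placed after the map
calls["n"] = 0
for _ in range(2):
    list(cache_after_map.as_numpy_iterator())
after_calls = calls["n"]

print(before_calls, after_calls)
```

With the cache placed after the map, the second pass is served from the cache and the work function is not called again. The trade-off is that the cached elements are the processed outputs, which may be much larger than the inputs.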

To solve this problem we can follow some best practices for designing a performant TensorFlow data input pipeline (from the TensorFlow documentation [1]):

 - Use the prefetch transformation to overlap the work of a producer and consumer.
 - Parallelize the data reading transformation using the interleave transformation.
 - Parallelize the map transformation by setting the num_parallel_calls argument.
 - Use the cache transformation to cache data in memory during the first epoch.
 - Vectorize user-defined functions passed in to the map transformation.
 - Reduce memory usage when applying the interleave, prefetch, and shuffle transformations.
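
As an illustration of the "vectorize" item in the list above, a function applied after "batch" runs once per batch rather than once per element. This is a small sketch on a toy dataset; `scale` is a hypothetical function chosen because it gives the same result either way.

```python
import tensorflow as tf


def scale(x):
    # Works on a scalar or on a whole batch of values
    return x * 2


source = tf.data.Dataset.range(10)

# map runs once per element, then elements are batched
per_element = source.map(scale).batch(5)

# elements are batched first, then map runs once per batch
vectorized = source.batch(5).map(scale)

per_element_out = [b.tolist() for b in per_element.as_numpy_iterator()]
vectorized_out = [b.tolist() for b in vectorized.as_numpy_iterator()]
print(vectorized_out)
```

Both pipelines yield the same batches; the vectorized version simply invokes the function fewer times.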
 
But before we continue, let's do some tracing to understand what is going on.

[1] https://www.tensorflow.org/guide/data_performance


In [None]:
tf.summary.trace_off()
tf.summary.trace_on(graph=False, profiler=True)

train_ds = build_dataset(dataset)
timeit(train_ds, steps=1000)

tf.summary.trace_export('Data Pipeline', profiler_outdir='/home/jupyter/tensorflow-data-pipeline/logs/')

In [None]:
# Load the TensorBoard notebook extension.
%load_ext tensorboard

In [None]:
%tensorboard --logdir=/home/jupyter/tensorflow-data-pipeline/logs

<table style="width:100%">
  <tr>
    <th>High Level View</th>
    <th>Zoom View</th> 
  </tr>
  <tr>
    <td><img src="https://storage.cloud.google.com/renatoleite-nb/images/trace1.png"></td>
    <td><img src="https://storage.cloud.google.com/renatoleite-nb/images/trace2.png"></td>
  </tr>
</table>

Two threads were created to read the files in parallel. \
The next step is to bundle together all the images in a TFRecord file, so let's do it.

## Using TFRecord to speed up the reading process

Up to now the images were read one by one, which proved to be a very inefficient process. \
To mitigate this problem, one solution is to preprocess and write the images and labels to TFRecord files.

In the following steps, the images are preprocessed and written to TFRecords:
 - Read the data from Cloud Storage
 - Decode the JPEG and resize the image
 - Re-encode the image as JPEG
 - Serialize the images as bytes (tf.train.BytesList) and the labels as integers (tf.train.Int64List)
 - Create a tf.train.Example with these two components and return it as a serialized string.

In [19]:
# Function to download bytes from Cloud Storage
def get_bytes_label_tfrecord(filepath, label):
    raw_bytes = tf.io.read_file(filepath)
    return raw_bytes, label

In [20]:
# Preprocess Image
def process_image_tfrecord(raw_bytes, label):
    image = tf.io.decode_jpeg(raw_bytes, channels=3)
    image = tf.image.resize(image, (224,224), method='nearest')
    image = tf.io.encode_jpeg(image, optimize_size=True)
    
    return image, label

In [21]:
# Read images, preprocess and return a dataset
def build_dataset_tfrecord(dataset):
    
    dataset = dataset.map(get_bytes_label_tfrecord, num_parallel_calls=AUTOTUNE)
    dataset = dataset.map(process_image_tfrecord, num_parallel_calls=AUTOTUNE)
    
    return dataset

In [22]:
def tf_serialize_example(image, label):
    
    def _bytes_feature(value):
        """Returns a bytes_list from a string / byte."""
        if isinstance(value, type(tf.constant(0))):
            value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    def _float_feature(value):
        """Returns a float_list from a float / double."""
        return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

    def _int64_feature(value):
        """Returns an int64_list from a bool / enum / int / uint."""
        return tf.train.Feature(int64_list=tf.train.Int64List(value=value))    
    
    def serialize_example(image, label):
        
        feature = {
            'image': _bytes_feature(image),
            'label': _int64_feature(label)
        }

        example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
        
        return example_proto.SerializeToString()
    
    tf_string = serialize_example(image, label)

    return tf_string

The following function splits the dataset into shards of images and labels. \
For each shard, the images and labels are serialized and written to a TFRecord file.

TFRecordWriter can compress the output files; GZIP is the compression chosen here.

In [23]:
# Create TFRecord with `n_shards` shards
def create_tfrecord(ds, n_shards):

    for i in range(n_shards):
        batch = map(lambda x: tf_serialize_example(x[0],x[1]), ds.shard(n_shards, i)
                    .apply(build_dataset_tfrecord)
                    .as_numpy_iterator())
        
        with tf.io.TFRecordWriter('output_file-part-{i}.tfrecord'.format(i=i), 'GZIP') as writer:
            print('Creating TFRecord ... output_file-part-{i}.tfrecord'.format(i=i))
            for a in batch:
                writer.write(a)

In [None]:
create_tfrecord(dataset, 4)

## Consuming the TFRecord and Re-Running the Benchmark

The TFRecords are saved in the local filesystem. To continue our benchmark, it is necessary to copy the files to a bucket in Cloud Storage.\
The files were copied to the following path:

In [24]:
TFRECORDS = 'gs://renatoleite-nb/tfrecords/*'

To read the serialized data inside each TFRecord, it is necessary to pass a description of the features (image and label) previously encoded as tf.train.Feature values. \
To do so, create a dictionary describing each component we will read.

In [25]:
# Create a description of the features.
feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenSequenceFeature([], tf.int64, allow_missing=True)
}

Then a function can parse an example from the TFRecord, using the description created before.

In [26]:
@tf.function
def _parse_function(example_proto):
    # Parse the input `tf.Example` proto using the dictionary above.
    return tf.io.parse_single_example(example_proto, feature_description)
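
The write and parse steps can be checked end to end on a single toy record before running them against the bucket. The sketch below makes several assumptions: it writes to a temporary local file, uses placeholder bytes instead of a real JPEG, and `make_example` and `toy_description` are hypothetical names that mirror the feature layout described above.

```python
import os
import tempfile

import tensorflow as tf


def make_example(image_bytes: bytes, label: list) -> bytes:
    """Serialize one image/label pair as a tf.train.Example."""
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=label)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


toy_description = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenSequenceFeature([], tf.int64, allow_missing=True),
}

# Write one GZIP-compressed record to a temporary file
path = os.path.join(tempfile.mkdtemp(), 'toy.tfrecord')
with tf.io.TFRecordWriter(path, 'GZIP') as writer:
    writer.write(make_example(b'placeholder-bytes', [0, 1, 0]))

# Read it back with the matching compression type and parse it
records = tf.data.TFRecordDataset(path, compression_type='GZIP')
parsed = next(iter(records.map(
    lambda proto: tf.io.parse_single_example(proto, toy_description))))

image_out = parsed['image'].numpy()
label_out = parsed['label'].numpy().tolist()
print(image_out, label_out)
```

The compression type used for reading has to match the one used for writing, otherwise the records cannot be decoded.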

First, all the files inside the specified bucket are listed and a dataset is created from the listing using ".from_tensor_slices". It would also be possible to create a TFRecordDataset directly from this listing.\
The listing dataset is built separately because it is used later.
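
For comparison, the direct alternative mentioned above could look like the following sketch. It assumes two small GZIP-compressed files written to a temporary local directory (hypothetical file names and contents), not the files in the bucket.

```python
import os
import tempfile

import tensorflow as tf

# Write two tiny GZIP-compressed TFRecord files to a temporary directory
tmp_dir = tempfile.mkdtemp()
for i in range(2):
    part = os.path.join(tmp_dir, 'part-{}.tfrecord'.format(i))
    with tf.io.TFRecordWriter(part, 'GZIP') as writer:
        writer.write('record-{}'.format(i).encode())

# List the files and hand the listing straight to TFRecordDataset
listing = sorted(tf.io.gfile.glob(os.path.join(tmp_dir, '*.tfrecord')))
direct_ds = tf.data.TFRecordDataset(listing, compression_type='GZIP')

direct_records = list(direct_ds.as_numpy_iterator())
print(direct_records)
```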

In [34]:
# List all the TFRecords and create a dataset from it
filenames = tf.io.gfile.glob(TFRECORDS)
filenames_dataset = tf.data.Dataset.from_tensor_slices(filenames)


In [28]:
# Preprocess Image
@tf.function
def process_image_tfrecord(record):  
    image = tf.io.decode_jpeg(record['image'], channels=3)
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    
    label = record['label']
    
    return image, label

TFRecordDataset has an argument "num_parallel_reads" that sets how many files are read in parallel. \
It is set to AUTOTUNE to let TensorFlow decide how many parallel reads are needed to optimize the process.

In [29]:
# Create a Dataset composed of TFRecords (paths to bucket)
@tf.function
def get_tfrecord(filename):
    return tf.data.TFRecordDataset(filename, compression_type='GZIP', num_parallel_reads=AUTOTUNE)

The new function to build the dataset has the following changes:
 - Use of "interleave" to parallelize the opening of files.
 - Use of the "\_parse_function" to extract and deserialize the image and label.
 - A lighter preprocessing step without resizing, since the images were already resized to 224x224 when the TFRecords were written.
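
To see what "interleave" does to the element order, here is a small sketch on toy data. It is deterministic because "num_parallel_calls" is left unset, and `repeat_twice` is a hypothetical stand-in for opening one file and reading its records.

```python
import tensorflow as tf


def repeat_twice(x):
    # Stand-in for "open one file and yield its records"
    return tf.data.Dataset.from_tensors(x).repeat(2)


source = tf.data.Dataset.range(3)

# cycle_length=1: each input is fully consumed before moving to the next
sequential = source.interleave(repeat_twice, cycle_length=1, block_length=1)

# cycle_length=3: one element is taken from each of three inputs in turn
interleaved = source.interleave(repeat_twice, cycle_length=3, block_length=1)

sequential_out = [int(v) for v in sequential.as_numpy_iterator()]
interleaved_out = [int(v) for v in interleaved.as_numpy_iterator()]
print(sequential_out)
print(interleaved_out)
```

In the benchmark pipeline the same idea applies to files: several TFRecord files are open at once and their records are mixed in the output stream.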

In [38]:
def build_dataset_test(dataset, batch_size=BATCH_SIZE, cache=False):
    
    if cache:
        if isinstance(cache, str):
            dataset = dataset.cache(cache)
        else:
            dataset = dataset.cache()
    
    dataset = dataset.interleave(get_tfrecord, num_parallel_calls=AUTOTUNE)
    
    # Transformation: IO Intensive 
    dataset = dataset.map(_parse_function, num_parallel_calls=AUTOTUNE)

    # Transformation: CPU Intensive
    dataset = dataset.map(process_image_tfrecord, num_parallel_calls=AUTOTUNE)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size=batch_size)
    
    # Pipeline next iteration
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    
    return dataset

In [39]:
test_ds = build_dataset_test(filenames_dataset)

In [40]:
timeit(test_ds, steps=50000)

This new benchmark gives us around 1700 images per second, a much better result than the original pipeline.\
To speed up the training process and make better use of resources such as GPUs and TPUs, it is critical to build a very efficient data pipeline. Otherwise, it can quickly become a bottleneck in your training loop.