In TensorFlow, we need to build a complete input pipeline to fully utilize the execution capacity of multiple cores or multiple nodes. TensorFlow provides native parallelism for this. The implementation is made up of 3 stages:
* I/O reads: Choose and read image files from disk;
* Image processing: Decode image records into images, preprocess them, and organize them into mini-batches;
* CPU-to-GPU Data Transfer: Transfer images from CPU to GPU.

The dominant part of each stage is executed in parallel with the other stages using data_flow_ops.StagingArea.
StagingArea is a queue-like operator similar to tf.FIFOQueue. The difference is that StagingArea offers simpler functionality and can be executed on both CPU and GPU in parallel with other stages.

Breaking the input pipeline into 3 stages that operate independently in parallel is scalable and takes full advantage of large multi-core environments.
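
The three-stage structure can be illustrated without TensorFlow. The following is a minimal sketch using only the Python standard library, in which small bounded queues stand in for the staging areas between stages. The names `fake_read`, `fake_preprocess`, `fake_transfer`, `run_stage` and `run_pipeline` are hypothetical and exist only for this illustration; they are not TensorFlow APIs.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def fake_read(name):
    # Hypothetical stand-in for stage 1 (I/O read).
    return "bytes(" + name + ")"


def fake_preprocess(raw):
    # Hypothetical stand-in for stage 2 (decode and preprocess).
    return "image(" + raw + ")"


def fake_transfer(image):
    # Hypothetical stand-in for stage 3 (CPU-to-GPU copy).
    return "gpu(" + image + ")"


def run_stage(work, inbox, outbox):
    """Apply `work` to every item from `inbox` and forward the result."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(filenames):
    files = queue.Queue()             # unbounded, filled before the stages start
    raw = queue.Queue(maxsize=2)      # buffer between read and processing
    batches = queue.Queue(maxsize=2)  # buffer between processing and transfer
    on_device = queue.Queue()         # unbounded sink, drained at the end
    for name in filenames:
        files.put(name)
    files.put(SENTINEL)

    stages = [
        threading.Thread(target=run_stage, args=(fake_read, files, raw)),
        threading.Thread(target=run_stage, args=(fake_preprocess, raw, batches)),
        threading.Thread(target=run_stage, args=(fake_transfer, batches, on_device)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()

    results = []
    while True:
        item = on_device.get()
        if item is SENTINEL:
            return results
        results.append(item)
```

Because each stage runs in its own thread and only blocks on its neighbouring queues, stage 1 can already be reading the next file while stage 2 is still processing the previous one. The bounded queues keep a fast stage from running arbitrarily far ahead of a slow one.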

h2. Parallelize I/O Reads

data_flow_ops.RecordInput is used to parallelize reading from disk. Given a list of input files representing TFRecords, RecordInput continuously reads records using background threads. The records are placed into its own large internal pool, and once the pool is at least half full, the op starts producing output tensors.

This op has its own internal threads, which are dominated by I/O time and consume minimal CPU. This allows it to run smoothly in parallel with the rest of the model.
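
The warm-up rule (no output until the pool is at least half full) can be modelled in a few lines of plain Python. This is a simplified, single-threaded sketch and not the actual RecordInput implementation: the real op fills its pool from background threads, while here the "reader" and the "consumer" simply alternate. The function name `shuffled_records` is hypothetical.

```python
import random


def shuffled_records(records, buffer_size, seed=301):
    """Yield `records` in randomized order through a bounded buffer.

    Nothing is yielded until the buffer holds at least buffer_size // 2
    records, mirroring the warm-up rule described above.
    """
    rng = random.Random(seed)
    buffer = []
    min_fill = max(1, buffer_size // 2)
    for record in records:
        buffer.append(record)        # the "reader" side adds one record
        if len(buffer) >= min_fill:  # warm-up finished: start yielding
            index = rng.randrange(len(buffer))
            # Swap the chosen record to the end so it can be popped cheaply.
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # Input exhausted: drain whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer
```

Every record is yielded exactly once, and the first record that comes out is always one of the first `buffer_size // 2` records read, since nothing else is in the buffer at that point. A larger buffer therefore gives better randomization at the cost of a longer warm-up.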


import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

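# `dataset` and `subset` are assumed to be defined by the surrounding code;
# tf_record_pattern() is expected to return a glob pattern for the TFRecord files.
batch_size = 256

record_input = data_flow_ops.RecordInput(
        file_pattern = dataset.tf_record_pattern(subset),
        seed = 301,
        parallelism = 64,
        buffer_size = 10000,
        batch_size = batch_size,
        name = 'record_input')
# A single string tensor of shape [batch_size] holding the serialized records.
records = record_input.get_yield_op()
# Split into batch_size tensors of shape [1], then reshape each one to a scalar.
records = tf.split(records, batch_size, 0)
records = [tf.reshape(record, []) for record in records]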

data_flow_ops provides a wrapper class, 'RecordInput', for reading input data in parallel. Its interface is shown below:
class RecordInput(object):
    '''RecordInput asynchronously reads and randomly yields TFRecords.
    A RecordInput Op will continuously read a batch of records asynchronously into a buffer of some fixed capacity.
    It can also asynchronously yield random records from this buffer.
    It will not start yielding until at least 'buffer_size / 2' elements have been placed into the buffer so that
    sufficient randomization can take place.
    
    The order the files are read will be shifted each epoch by 'shift_amount' so that the data is presented in a 
    different order every epoch.'''
    def __init__(self,
                file_pattern,
                batch_size = 1,
                buffer_size = 1,
                parallelism = 1,
                shift_ratio = 0,
                seed = 0,
                name = None):
        '''Constructs a RecordInput Op.

        Args:
            file_pattern: File path to the dataset, possibly containing wildcards. All matching files will be
                iterated over each epoch;
            batch_size: How many records to return at a time;
            buffer_size: The maximum number of records the buffer will contain. This must be smaller than the
                total number of records in an epoch or deadlock can occur;
            parallelism: How many reader threads to use for reading from files;
            shift_ratio: What percentage of the total number of files to move the start file forward by each epoch;
            seed: Specify the random number seed used by the generator that randomizes records;
            name: Optional name for the operation.
        Raises:
            ValueError: If one of the arguments is invalid.
        '''
Its member function 'get_yield_op' then calls the 'record_input' function of the gen_data_flow_ops module, which yields a mini-batch every time it is executed.

Walking through the code in gen_data_flow_ops, the definition of 'record_input' looks like this:
    def record_input(file_pattern, file_random_seed = None, file_shuffle_shift_ratio=None, file_buffer_size = None,
                    file_parallelism = None, batch_size = None, name = None):
        r'''Emits randomized records.
        
        Args:
            file_pattern: A 'string'. Glob pattern for the data files;
            file_random_seed: An optional 'int'. Defaults to '301'. Random seeds used to produce randomized records;
            file_shuffle_shift_ratio: An optional 'float'. Defaults to '0'. Shifts the list of files after the list 
            is randomly shuffled;
            file_buffer_size: An optional 'int'. Defaults to '10000'. The randomization shuffling buffer;
            file_parallelism: An optional 'int'. Defaults to '16'. How many sstables are opened and concurrently 
            iterated over.
            batch_size: An optional 'int'. Defaults to '32'. The batch size;
            name: A name for the operation (optional).
        Returns:
            A 'Tensor' of type 'string'. A tensor of shape [batch_size].
        '''
        # The result is produced by the native "RecordInput" op from the compiled C++ library.
        result = _op_def_lib.apply_op("RecordInput",
                file_pattern = file_pattern,
                file_random_seed = file_random_seed,
                file_shuffle_shift_ratio = file_shuffle_shift_ratio,
                file_buffer_size = file_buffer_size,
                file_parallelism = file_parallelism,
                batch_size = batch_size,
                name = name)
        return result

h2. Parallelize Image Processing

After images are read from RecordInput, they are passed as tensors to the image processing pipeline.
Here we assume that the input pipeline is targeting 8 GPUs with a batch size of 256 (32 per GPU).

256 records are read and processed individually in parallel. This starts with 256 independent RecordInput read ops in the graph. Each read op is followed by an identical set of ops for image preprocessing that are considered independent and executed in parallel. The image preprocessing ops include operations such as image decoding, distortion and resizing.

Once the images are through preprocessing, they are concatenated together into 8 tensors, each with a batch size of 32. Rather than using tf.concat for this purpose, which is implemented as a single op that waits for all the inputs to be ready before concatenating them together, tf.parallel_stack is used. tf.parallel_stack allocates an uninitialized tensor as an output, and each input tensor is written to its designated portion of the output tensor as soon as the input is available.

When all the input tensors are finished, the output tensor is passed along in the graph. This effectively hides the memory-copy latency behind the long tail of producing all the input tensors.
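
The "write each input into its own slot as soon as it is ready" idea can be sketched in plain Python. This is only an illustration of the technique under simplifying assumptions, not how tf.parallel_stack is implemented: a preallocated list stands in for the uninitialized output tensor, and `preprocess`, `stack_as_ready` and `build_gpu_batches` are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(record):
    # Hypothetical stand-in for decode/distort/resize: returns a tiny "image".
    return [record, record * 2]


def stack_as_ready(records, num_workers=4):
    # Allocate the output up front; each slot belongs to exactly one input.
    output = [None] * len(records)

    def fill_slot(index):
        # Write the result into its own slot as soon as it is ready,
        # without waiting for the other inputs.
        output[index] = preprocess(records[index])

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # list() waits for all slots and re-raises any worker exception.
        list(pool.map(fill_slot, range(len(records))))
    return output


def build_gpu_batches(records, num_gpus):
    # Group the records per GPU, then stack each group independently.
    per_gpu = len(records) // num_gpus
    return [
        stack_as_ready(records[i * per_gpu:(i + 1) * per_gpu])
        for i in range(num_gpus)
    ]
```

With the numbers assumed in this section, `build_gpu_batches(list(range(256)), 8)` produces 8 batches of 32 items each. Since no two workers ever write to the same slot, no locking is needed around the output, and the copy for an early input overlaps with the preprocessing of the later ones.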