# TensorFlow input pipeline: large datasets and data augmentation

## Introduction

The new high-level Dataset API makes it much easier to work with large datasets. There is no need to manage input queues by hand anymore.

This notebook gives a short example of how to use the Dataset API for:

1. loading multiple files for each input example,
2. data augmentation.

Loading several files per input example is useful in applications such as:

* image sequences,
* object detection with multiple cameras.

Data augmentation is a standard technique in deep learning, but the Dataset API documentation does not explain how to do it. The method used below works, though I am not sure it is the best one for this task. If you find a better way, don't hesitate to tell me!

__Note on TensorFlow version:__

> This notebook works with TensorFlow 1.4. It uses the Dataset API from *tf.data*. In earlier versions, the Dataset API lived in *tf.contrib.data*. Although I did not test it, the notebook should also run on TensorFlow 1.3 after replacing *tf.data* with *tf.contrib.data*.

## Let's go!

In [1]:
import tensorflow as tf
import numpy as np

### Create some dummy data

We first create some dummy data files; in a real application, you would use your own. Here we create four files, in two groups of two. Each file contains a 3x3 float32 matrix filled with a single constant that identifies the file.

In [2]:
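import os

height = 3
width = 3

# Make sure the output directory exists (tofile does not create it)
if not os.path.isdir('data'):
    os.makedirs('data')

# Create files containing raw matrices of shape 3x3
for i in range(2):
    for j in range(2):
        # Fill the matrix with a constant that identifies the file: i + 10*j
        matrix = np.zeros((height, width)) + i + (j * 10)
        # Save the matrix as raw float32 values (no header)
        matrix.astype('float32').tofile('data/file_' + str(i) + '_' + str(j) + '.raw')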

Let's have a look at one of our dummy data files.

In [3]:
# Print one of the generated files for checking
matrix = np.fromfile('data/file_1_1.raw', dtype=np.float32)
matrix = matrix.reshape((height,width))
print(matrix)

[[ 11.  11.  11.]
 [ 11.  11.  11.]
 [ 11.  11.  11.]]


OK, our dummy data file looks good.
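
To check every file rather than just one, a loop like the following can help. This is a minimal sketch using NumPy only. It writes to a temporary directory instead of *data/* (an assumption made here so that the snippet is self-contained), and the helper names *write_dummy_files* and *expected_value* are hypothetical.

```python
import os
import tempfile

import numpy as np

height, width = 3, 3


def expected_value(i, j):
    # Same constant as in the generation cell: i + 10*j
    return i + j * 10


def write_dummy_files(directory):
    # Write the four raw float32 files and return their paths keyed by (i, j)
    paths = {}
    for i in range(2):
        for j in range(2):
            path = os.path.join(directory, 'file_%d_%d.raw' % (i, j))
            matrix = np.zeros((height, width)) + expected_value(i, j)
            matrix.astype('float32').tofile(path)
            paths[(i, j)] = path
    return paths


# Hypothetical location: a temporary directory instead of data/
tmp_dir = tempfile.mkdtemp()
paths = write_dummy_files(tmp_dir)

for (i, j), path in sorted(paths.items()):
    # A raw file has no header, so its size is height * width * 4 bytes
    assert os.path.getsize(path) == height * width * 4
    matrix = np.fromfile(path, dtype=np.float32).reshape((height, width))
    assert np.all(matrix == expected_value(i, j))
    print(os.path.basename(path), matrix[0, 0])
```

Checking the file size is a cheap way to catch a wrong dtype, since raw files carry no shape or type information.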

### Create parser

We will need a parser to read our examples from the data files.

Each example is built by stacking the data from two different files, and comes with a label. The parser therefore takes the two filenames and the label as arguments, and returns two tensors: the stacked example and its label.

In [4]:
# Create parser
# Args: the two filenames and the label of one example
# Returns: tensor X of shape [2, height, width] (the two decoded matrices, stacked) and the label y
def _parse_data(filename0, filename1, label):
    # Read and decode first file
    matrix0 = tf.read_file(filename0)
    matrix0 = tf.decode_raw(matrix0, out_type=tf.float32)
    matrix0 = tf.reshape(matrix0, [height,width])
    
    # Read and decode second file
    matrix1 = tf.read_file(filename1)
    matrix1 = tf.decode_raw(matrix1, out_type=tf.float32)
    matrix1 = tf.reshape(matrix1, [height,width])
    
    # Stack the two elements together
    X = tf.stack([matrix0, matrix1])
    
    # Get label (you could implement more complex logic here if needed)
    y = label
    
    return X, y
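
When debugging, it can help to compare the parser's output with a version that runs outside the TensorFlow graph. The sketch below is a NumPy equivalent of the logic above, not part of the pipeline itself. *parse_data_numpy* is a hypothetical helper, and the two input files are written to a temporary directory so that the snippet runs on its own.

```python
import os
import tempfile

import numpy as np

height, width = 3, 3


def parse_data_numpy(filename0, filename1, label):
    # Read the raw float32 values of each file and restore the 2D shape
    matrix0 = np.fromfile(filename0, dtype=np.float32).reshape((height, width))
    matrix1 = np.fromfile(filename1, dtype=np.float32).reshape((height, width))
    # Stack along a new leading axis, as tf.stack does by default
    X = np.stack([matrix0, matrix1])
    return X, label


# Write two small files to parse (stand-ins for file_0_0.raw and file_1_0.raw)
tmp_dir = tempfile.mkdtemp()
path0 = os.path.join(tmp_dir, 'file_0_0.raw')
path1 = os.path.join(tmp_dir, 'file_1_0.raw')
np.zeros((height, width), dtype=np.float32).tofile(path0)
np.ones((height, width), dtype=np.float32).tofile(path1)

X, y = parse_data_numpy(path0, path1, 15)
print(X.shape)  # (2, 3, 3)
print(y)        # 15
```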

### Create data augmentation function

We will need a function to implement the data augmentation logic.

This function takes one example (X,y), and returns a dataset with two examples: the original example, and a second one generated on the fly.

In [5]:
# Data augmentation function: create several examples from one example
# Args: One example X,y
# Returns: Dataset containing several examples, after data augmentation
def _data_augment(X,y):
    # Generate new data from example X
    X0 = X
    # Dummy augmentation: any transformation could be used here,
    # and more than one extra example could be generated
    X1 = -X
    X_augmented = tf.stack([X0,X1])
    
    # Repeat y
    y_augmented = tf.stack([y,y])
    
    dataset = tf.data.Dataset.from_tensor_slices((X_augmented, y_augmented))
        
    return dataset
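
The function yields two examples because *from_tensor_slices* slices its input along the first axis. The NumPy sketch below mirrors the stacking step to make the shapes explicit. It only illustrates the shapes and is not a replacement for the TensorFlow function; the input values are made up.

```python
import numpy as np

# One parsed example: two stacked 3x3 matrices and a scalar label
X = np.stack([np.zeros((3, 3), dtype=np.float32),
              np.ones((3, 3), dtype=np.float32)])
y = 15

# Same stacking as in _data_augment: a new leading axis indexes the examples
X_augmented = np.stack([X, -X])
y_augmented = np.stack([y, y])
print(X_augmented.shape)  # (2, 2, 3, 3)
print(y_augmented.shape)  # (2,)

# Slicing along the first axis gives back one (X, y) pair per example
examples = list(zip(X_augmented, y_augmented))
print(len(examples))         # 2
print(examples[1][0].shape)  # (2, 3, 3)
```

Generating more augmented examples only means stacking more tensors along that leading axis, with the label repeated as many times.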

### It's almost done

Now that we have implemented our parsing and data augmentation logic, we can create the dataset.

As you can see, it is really easy with the TensorFlow Dataset API!

In [6]:
####################################################################
# Create dataset from files
####################################################################

# Get the filenames lists
filenames0 = ['data/file_0_0.raw', 'data/file_0_1.raw']
filenames1 = ['data/file_1_0.raw', 'data/file_1_1.raw']

# Create TensorFlow constants containing the filenames
tf_filenames0 = tf.constant(filenames0)
tf_filenames1 = tf.constant(filenames1)

# Create a constant tensor holding one label per example
labels = tf.constant([15, 25])

# Create dataset containing filenames and labels
dataset = tf.data.Dataset.from_tensor_slices((tf_filenames0, tf_filenames1, labels))

# Use our _parse_data function to read files and decode data
dataset = dataset.map(_parse_data)

#####################################################################
# Data augmentation
#####################################################################

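# Apply data augmentation: each example is mapped to a small dataset and
# the results are flattened. With cycle_length=1, examples are processed
# one at a time, so each original is directly followed by its augmented copy.
dataset = dataset.interleave(_data_augment, cycle_length=1)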
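
To see why *cycle_length=1* keeps each augmented example right after its original, the ordering can be modelled in plain Python. The generator below is a simplified model of *interleave* written for this explanation: it assumes the default block length of 1 and is not TensorFlow's implementation. *interleave_model* and *augment* are hypothetical names, and strings stand in for tensors.

```python
def interleave_model(elements, map_func, cycle_length):
    # Simplified model: visit up to cycle_length open sub-iterators in turn,
    # taking one element from each per visit (block length of 1)
    source = iter(elements)
    slots = [None] * cycle_length
    source_done = False
    while True:
        for i in range(cycle_length):
            # Open a new sub-iterator in an empty slot, if input remains
            if slots[i] is None and not source_done:
                try:
                    slots[i] = iter(map_func(next(source)))
                except StopIteration:
                    source_done = True
            if slots[i] is None:
                continue
            # Emit one element from this slot, or free the slot if exhausted
            try:
                yield next(slots[i])
            except StopIteration:
                slots[i] = None
        if source_done and all(slot is None for slot in slots):
            return


def augment(example):
    # Stand-in for _data_augment: the original plus one "negated" copy
    X, y = example
    return [(X, y), ('-' + X, y)]


examples = [('X_a', 15), ('X_b', 25)]

print(list(interleave_model(examples, augment, cycle_length=1)))
# [('X_a', 15), ('-X_a', 15), ('X_b', 25), ('-X_b', 25)]
print(list(interleave_model(examples, augment, cycle_length=2)))
# [('X_a', 15), ('X_b', 25), ('-X_a', 15), ('-X_b', 25)]
```

With *cycle_length=1*, the order matches what *flat_map* produces; a larger value alternates between the outputs of several input examples.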

### Print the dataset

We can now print the dataset to check that everything is working.

In [7]:
#####################################################################
# Print the dataset
#####################################################################

# Create a one-shot iterator over the dataset
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:

    # Iterate until the dataset is exhausted
    while True:
        try:
            elem = sess.run(next_element)
            print(elem)
        except tf.errors.OutOfRangeError:
            print("End of dataset")
            break

(array([[[ 0.,  0.,  0.],
        [ 0.,  0.,  0.],
        [ 0.,  0.,  0.]],

       [[ 1.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 1.,  1.,  1.]]], dtype=float32), 15)
(array([[[-0., -0., -0.],
        [-0., -0., -0.],
        [-0., -0., -0.]],

       [[-1., -1., -1.],
        [-1., -1., -1.],
        [-1., -1., -1.]]], dtype=float32), 15)
(array([[[ 10.,  10.,  10.],
        [ 10.,  10.,  10.],
        [ 10.,  10.,  10.]],

       [[ 11.,  11.,  11.],
        [ 11.,  11.,  11.],
        [ 11.,  11.,  11.]]], dtype=float32), 25)
(array([[[-10., -10., -10.],
        [-10., -10., -10.],
        [-10., -10., -10.]],

       [[-11., -11., -11.],
        [-11., -11., -11.],
        [-11., -11., -11.]]], dtype=float32), 25)
End of dataset


### It is working!

We can see that parsing and data augmentation work well:

* each example contains:
  * two matrices coming from two different files
  * a label
* each original example is immediately followed by its augmented copy (-X, y)