The tf.data API introduces two new abstractions to TensorFlow:

* A tf.data.Dataset represents a sequence of elements, in which each element contains one or more tf.Tensor objects.
    * Creating a source (e.g. Dataset.from_tensor_slices()) constructs a dataset from one or more tf.Tensor objects.
    * Applying a transformation (e.g. Dataset.batch()) constructs a dataset from one or more tf.data.Dataset objects.

* A tf.data.Iterator provides the main way to extract elements from a dataset.

In [2]:
import tensorflow as tf

# 0. Basic mechanics

A dataset comprises elements that each have the same structure. An element contains one or more tf.Tensor objects, called components. Each component has a tf.DType representing the type of elements in the tensor, and a tf.TensorShape representing the (possibly partially specified) static shape of each element. 

The nested structure of these properties maps to the structure of an element, which may be a single tensor, a tuple of tensors, or a nested tuple of tensors.

In [3]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4,8]))
print(dataset1.output_types)
print(dataset1.output_shapes)

<dtype: 'float32'>
(8,)


In [4]:
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4,100], maxval=100, dtype=tf.int32)))
print(dataset2.output_types)
print(dataset2.output_shapes)

(tf.float32, tf.int32)
(TensorShape([]), TensorShape([Dimension(100)]))


In [5]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)
print(dataset3.output_shapes)

(tf.float32, (tf.float32, tf.int32))
(TensorShape([Dimension(8)]), (TensorShape([]), TensorShape([Dimension(100)])))


In addition to tuples, you can use collections.namedtuple or a dictionary mapping strings to tensors to represent a single element of a Dataset.

In [6]:
dataset = tf.data.Dataset.from_tensor_slices(
   {"a": tf.random_uniform([4]),
    "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)  # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes)  # ==> "{'a': (), 'b': (100,)}"

{'a': tf.float32, 'b': tf.int32}
{'a': TensorShape([]), 'b': TensorShape([Dimension(100)])}


The Dataset transformations support datasets of any structure. When using the **Dataset.map(), Dataset.flat_map(), and Dataset.filter()** transformations, which apply a function to each element, the element structure determines the arguments of the function:

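```python
dataset1 = dataset1.map(lambda x: ...)

dataset2 = dataset2.flat_map(lambda x, y: ...)

# Note: tuple parameters such as `lambda x, (y, z): ...` are a syntax error
# in Python 3, so the nested component arrives as a single argument and is
# unpacked inside the function.
dataset3 = dataset3.filter(lambda x, yz: ...)
```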

Once you have built a Dataset to represent your input data, the next step is to create an Iterator to access elements from that dataset. The tf.data API currently supports the following iterators, in increasing level of sophistication:

1. one-shot,
2. initializable,
3. reinitializable, and
4. feedable.

In [11]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
dataset2 = tf.data.Dataset.from_tensor_slices((tf.random_uniform([4]), tf.random_uniform([4, 100])))
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

iterator = dataset3.make_initializable_iterator()

sess = tf.Session()
sess.run(iterator.initializer)
next1, (next2, next3) = iterator.get_next()

# 1. Reading input data

Consuming NumPy arrays

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use **Dataset.from_tensor_slices()**.
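As a rough mental model, slicing along the first dimension can be sketched in plain Python. This is not TensorFlow's implementation: the helper `slice_first_dim` and the sample data are hypothetical, and nested lists stand in for NumPy arrays.

```python
def slice_first_dim(*arrays):
    """Yield one element per index along the first dimension of the inputs.

    Hypothetical plain-Python model of the slicing behaviour; nested lists
    stand in for NumPy arrays or tensors.
    """
    lengths = {len(a) for a in arrays}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same size in dimension 0")
    for row in zip(*arrays):
        # A single input yields bare slices; several inputs yield tuples.
        yield row[0] if len(row) == 1 else row


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape (3, 2)
labels = [0, 1, 0]                               # shape (3,)

elements = list(slice_first_dim(features, labels))
print(elements[0])  # ([1.0, 2.0], 0)
```

Each resulting element pairs one row of `features` with the matching entry of `labels`, which is the shape of data a training loop usually wants.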


Consuming TFRecord data

The tf.data API supports a variety of file formats so that you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The **tf.data.TFRecordDataset** class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.

Consuming text data

Many datasets are distributed as one or more text files. The **tf.data.TextLineDataset** provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files. Like a TFRecordDataset, TextLineDataset accepts filenames as a tf.Tensor, so you can parameterize it by passing a tf.placeholder(tf.string).
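The "one string-valued element per line" behaviour can be modelled with a short plain-Python generator. This is only a sketch of the semantics, not how TextLineDataset is implemented; the helper `read_lines` and the temporary files are hypothetical.

```python
import os
import tempfile


def read_lines(filenames):
    """Yield one string per line, visiting the files in the order given.

    Plain-Python stand-in for the behaviour described above.
    """
    for name in filenames:
        with open(name, "r") as f:
            for line in f:
                yield line.rstrip("\n")  # drop the line terminator


# Hypothetical sample files, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["a\nb\n", "c\nd\n"]):
    path = os.path.join(tmp_dir, "file%d.txt" % i)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

lines = list(read_lines(paths))
print(lines)  # ['a', 'b', 'c', 'd']
```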

Consuming CSV data

The CSV file format is a popular format for storing tabular data in plain text. The **tf.contrib.data.CsvDataset** class provides a way to extract records from one or more CSV files that comply with RFC 4180. Given one or more filenames and a list of defaults, a CsvDataset will produce a tuple of elements whose types correspond to the types of the defaults provided, per CSV record. Like TFRecordDataset and TextLineDataset, CsvDataset accepts filenames as a tf.Tensor, so you can parameterize it by passing a tf.placeholder(tf.string). 
By default, a CsvDataset yields every column of every line of the file, which may not be desirable, for example if the file starts with a header line that should be ignored, or if some columns are not required in the input. These lines and fields can be removed with the header and select_cols arguments respectively.
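The roles of the defaults, header, and select_cols arguments can be illustrated with Python's standard csv module. This is a hedged model of the described behaviour, not the CsvDataset implementation: the function `read_csv` and the sample text are hypothetical, and it assumes one default per selected column.

```python
import csv
import io


def read_csv(text, defaults, header=False, select_cols=None):
    """Yield one tuple per CSV record, typed according to `defaults`.

    Assumes `defaults` has one entry per selected column. An empty field is
    replaced by its default; otherwise it is converted to the default's type.
    """
    rows = csv.reader(io.StringIO(text))
    if header:
        next(rows, None)  # skip the header line
    for row in rows:
        if select_cols is not None:
            row = [row[i] for i in select_cols]  # keep only wanted columns
        yield tuple(
            default if field == "" else type(default)(field)
            for field, default in zip(row, defaults)
        )


# Hypothetical file contents: a header line followed by two records.
sample = "id,name,score\n1,ann,3.5\n2,bob,\n"
records = list(
    read_csv(sample, defaults=[0, 0.0], header=True, select_cols=[0, 2])
)
print(records)  # [(1, 3.5), (2, 0.0)]
```

The second record has an empty score field, so the float default 0.0 is used in its place.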

# Preprocessing data with Dataset.map()

The Dataset.map(f) transformation produces a new dataset by applying a given function f to each element of the input dataset. It is based on the map() function that is commonly applied to lists (and other structures) in functional programming languages. The function f takes the tf.Tensor objects that represent a single element in the input, and returns the tf.Tensor objects that will represent a single element in the new dataset. Its implementation uses standard TensorFlow operations to transform one element into another.
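The element-by-element behaviour can be sketched in plain Python. The generator `map_elements` below is a hypothetical model, not TensorFlow code; it unpacks a tuple element into positional arguments, mirroring how the element structure determines the arguments of f.

```python
def map_elements(elements, f):
    """Lazily apply f to each element, one element at a time.

    Plain-Python model only. A tuple element is unpacked into positional
    arguments; any other element is passed as a single argument.
    """
    for element in elements:
        if isinstance(element, tuple):
            yield f(*element)
        else:
            yield f(element)


# Single-component elements: f takes one argument.
squares = list(map_elements([1, 2, 3], lambda x: x * x))

# Two-component elements: f takes two arguments and returns a new pair.
pairs = [(1, "a"), (2, "b")]
scaled = list(map_elements(pairs, lambda x, label: (x * 10, label.upper())))

print(squares)  # [1, 4, 9]
print(scaled)   # [(10, 'A'), (20, 'B')]
```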



Applying arbitrary Python logic with tf.py_func()

For performance reasons, we encourage you to use TensorFlow operations for preprocessing your data whenever possible. However, it is sometimes useful to call upon external Python libraries when parsing your input data. To do so, invoke the tf.py_func() operation in a Dataset.map() transformation.

# 2. Batching dataset elements

Simple batching

The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator, applied to each component of the elements: i.e. for each component i, all elements must have a tensor of the exact same shape.

In [10]:
inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

sess = tf.Session()
print(sess.run(next_element))  # ==> ([0, 1, 2,   3],   [ 0, -1,  -2,  -3])
print(sess.run(next_element))  # ==> ([4, 5, 6,   7],   [-4, -5,  -6,  -7])
print(sess.run(next_element))  # ==> ([8, 9, 10, 11],   [-8, -9, -10, -11])

(array([0, 1, 2, 3]), array([ 0, -1, -2, -3]))
(array([4, 5, 6, 7]), array([-4, -5, -6, -7]))
(array([ 8,  9, 10, 11]), array([ -8,  -9, -10, -11]))
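
The per-component stacking seen in this output can be modelled in plain Python. The generator `batch` below is a hypothetical sketch, not the TensorFlow implementation; it assumes elements are tuples of components and, like the default behaviour of Dataset.batch(), lets the final batch be smaller than n.

```python
def batch(elements, n):
    """Group n consecutive elements, stacking each component separately.

    Plain-Python model; elements are tuples of components. The final batch
    may hold fewer than n elements.
    """
    buf = []
    for element in elements:
        buf.append(element)
        if len(buf) == n:
            yield tuple(list(component) for component in zip(*buf))
            buf = []
    if buf:
        yield tuple(list(component) for component in zip(*buf))


inc = range(10)
dec = range(0, -10, -1)
batches = list(batch(zip(inc, dec), 4))
print(batches[0])   # ([0, 1, 2, 3], [0, -1, -2, -3])
print(batches[-1])  # ([8, 9], [-8, -9])
```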


Batching tensors with padding

The above recipe works for tensors that all have the same size. However, many models (e.g. sequence models) work with input data that can have varying size (e.g. sequences of different lengths). To handle this case, the Dataset.padded_batch() transformation enables you to batch tensors of different shape by specifying one or more dimensions in which they may be padded.

In [12]:
dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

sess = tf.Session()
print(sess.run(next_element))  # ==> [[0, 0, 0], [1, 0, 0], [2, 2, 0], [3, 3, 3]]
print(sess.run(next_element)) 

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]
[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
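
The output above shows that each batch is padded only to the length of its own longest element. A plain-Python sketch of that rule follows; `padded_batch` and `_pad` are hypothetical helpers, not TensorFlow functions, and the pad value of 0 is an assumption taken from the output shown.

```python
def padded_batch(sequences, n, pad_value=0):
    """Batch n consecutive variable-length lists, padding to the longest.

    Plain-Python model: each batch is padded only to the length of its own
    longest sequence, which is why batch shapes can differ.
    """
    buf = []
    for seq in sequences:
        buf.append(list(seq))
        if len(buf) == n:
            yield _pad(buf, pad_value)
            buf = []
    if buf:
        yield _pad(buf, pad_value)


def _pad(buf, pad_value):
    """Right-pad every sequence in buf to the longest length in buf."""
    width = max(len(seq) for seq in buf)
    return [seq + [pad_value] * (width - len(seq)) for seq in buf]


# Element x is the value x repeated x times, as in the cell above.
sequences = ([x] * x for x in range(8))
batches = list(padded_batch(sequences, 4))
print(batches[0])  # [[0, 0, 0], [1, 0, 0], [2, 2, 0], [3, 3, 3]]
```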


# 3. Training workflows

Processing multiple epochs

The tf.data API offers two main ways to process multiple epochs of the same data.

The simplest way to iterate over a dataset in multiple epochs is to use the **Dataset.repeat()** transformation. Applying the Dataset.repeat() transformation with no arguments will repeat the input indefinitely. The Dataset.repeat() transformation concatenates the repetitions of its input without signaling the end of one epoch and the beginning of the next epoch. If you want to receive a signal at the end of each epoch, you can instead write a training loop that catches the tf.errors.OutOfRangeError raised at the end of a dataset.
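
The difference between the two approaches can be illustrated with plain Python iterators. This is an analogy only: `make_dataset` is a hypothetical stand-in for a dataset, and StopIteration plays the role of tf.errors.OutOfRangeError.

```python
import itertools


def make_dataset():
    """Hypothetical three-element dataset."""
    return iter([10, 20, 30])


# Analogue of repeat(2): epochs are concatenated with no boundary signal.
repeated = list(itertools.chain.from_iterable(make_dataset() for _ in range(2)))

# Analogue of the explicit loop: the end of each pass is observable.
log = []
for epoch in range(2):
    iterator = make_dataset()  # start a fresh pass for each epoch
    while True:
        try:
            log.append(next(iterator))
        except StopIteration:
            log.append("end of epoch %d" % epoch)
            break

print(repeated)  # [10, 20, 30, 10, 20, 30]
print(log)
```

The explicit loop is the place to put per-epoch work such as evaluation or checkpointing, at the cost of a little more code.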

Randomly shuffling input data

The **Dataset.shuffle()** transformation randomly shuffles the input dataset using a similar algorithm to tf.RandomShuffleQueue: it maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.
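
The buffer scheme described above can be sketched in plain Python. This is a hedged illustration and TensorFlow's implementation may differ in detail; the function `shuffle` and its `seed` parameter are hypothetical names for this example.

```python
import random


def shuffle(elements, buffer_size, seed=None):
    """Yield elements in randomized order using a fixed-size buffer.

    The buffer is filled first; each further input element replaces a
    randomly chosen buffered element, which is emitted. At the end of the
    input the remaining buffered elements are emitted in random order.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buf = []
    for element in elements:
        if len(buf) < buffer_size:
            buf.append(element)
            continue
        idx = rng.randrange(len(buf))
        yield buf[idx]
        buf[idx] = element
    while buf:
        idx = rng.randrange(len(buf))
        buf[idx], buf[-1] = buf[-1], buf[idx]
        yield buf.pop()


shuffled = list(shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

In this sketch a buffer of size 1 leaves the order unchanged, and the first element emitted always comes from the first buffer_size inputs, so a buffer much smaller than the dataset gives only a local shuffle.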

Using high-level APIs

The **tf.train.MonitoredTrainingSession** API simplifies many aspects of running TensorFlow in a distributed setting. MonitoredTrainingSession uses the tf.errors.OutOfRangeError to signal that training has completed, so to use it with the tf.data API, we recommend using Dataset.make_one_shot_iterator().

In [None]:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()

next_example, next_label = iterator.get_next()
loss = model_function(next_example, next_label)

training_op = tf.train.AdagradOptimizer(...).minimize(loss)

with tf.train.MonitoredTrainingSession(...) as sess:
    while not sess.should_stop():
        sess.run(training_op)

To use a Dataset in the input_fn of a **tf.estimator.Estimator**, simply return the Dataset and the framework will take care of creating an iterator and initializing it for you. For example:

In [14]:
def dataset_input_fn(num_epochs=None):  # None repeats the data indefinitely
    filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
    dataset = tf.data.TFRecordDataset(filenames)

    # Use `tf.parse_single_example()` to extract data from a `tf.Example`
    # protocol buffer, and perform any additional per-record preprocessing.
    def parser(record):
        keys_to_features = {
            "image_data": tf.FixedLenFeature((), tf.string, default_value=""),
            "date_time": tf.FixedLenFeature((), tf.int64, default_value=0),
            "label": tf.FixedLenFeature((), tf.int64,
                                        default_value=tf.zeros([], dtype=tf.int64)),
        }
        parsed = tf.parse_single_example(record, keys_to_features)

        # Perform additional preprocessing on the parsed data.
        image = tf.image.decode_jpeg(parsed["image_data"])
        image = tf.reshape(image, [299, 299, 1])
        label = tf.cast(parsed["label"], tf.int32)

        return {"image_data": image, "date_time": parsed["date_time"]}, label

    # Use `Dataset.map()` to build a pair of a feature dictionary and a label
    # tensor for each example.
    dataset = dataset.map(parser)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)
    dataset = dataset.repeat(num_epochs)

    # Each element of `dataset` is a tuple containing a dictionary of features
    # (in which each value is a batch of values for that feature), and a batch of
    # labels.
    return dataset