# Datasets in TensorFlow

`feed_dict` is bad. As TensorFlow's [reading data guide](https://www.tensorflow.org/api_guides/python/reading_data) says, feeding is inefficient and should only be used for small experiments and debugging. Before TensorFlow 1.2, the main alternative was a [complex queue-based system](https://www.tensorflow.org/programmers_guide/threading_and_queues) which felt a bit too difficult for average humans (or at least for your humble author). The situation has greatly improved with the emergence of the [`Dataset` API](https://www.tensorflow.org/programmers_guide/datasets), but as good as the official guide is, it can still be a bit unclear how to use this system to implement complex data pipelines. The goal of this notebook is to supplement the official guide and help you (and me!) think about these situations. I will restrict myself to the most basic kind of iterator because it plays nicely with TensorFlow's higher-level APIs. I recommend you never write raw TensorFlow code unless you *absolutely have to*.

This guide was written in TensorFlow 1.3. `Dataset`s are still in `tf.contrib`-land, which means that the API is not guaranteed to be stable for future releases. It's also still a **work in progress**. Be warned!

## Introduction and Setup

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow.contrib.data as tfdat

sess = tf.InteractiveSession()
tf.set_random_seed(12)

This function makes a one-shot iterator and prints data values until the provided dataset is exhausted. It is used throughout this notebook.

In [2]:
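def run_through(dataset):
    iterator = dataset.make_one_shot_iterator()
    # Build the get_next op once; calling it inside the loop would add a new op to the graph on every pass.
    next_element = iterator.get_next()
    while True:
        try:
            print(sess.run(next_element))
        except tf.errors.OutOfRangeError:
            # Raised once the iterator has no more elements.
            break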

## Single Data Source

Let's first consider a single data source. That is, we will not be considering pairs of data points, which might be used in a Siamese neural network, or triplets, which might be used in an algorithm based on triplet loss. I will use an extremely small, simple dataset so we can be clear about exactly what happens when we create more `Dataset` objects later on.

In [3]:
dat = np.arange(4)
ds = tfdat.Dataset.from_tensor_slices(dat)

### Unmodified Dataset

First, I will iterate through the dataset unmodified, just to be crystal-clear about what we're discussing here:

In [4]:
run_through(ds)

0
1
2
3


### Dataset -> Repeat

As advertised, the `repeat(n)` method will repeat the dataset sequence n times.

In [5]:
ds_rep = ds.repeat(2)
run_through(ds_rep)

0
1
2
3
0
1
2
3


### Dataset -> Shuffle

#### Buffer Size >= n

Use `shuffle()` with a buffer size at least as large as the dataset to get a uniformly random shuffle of all of the data. Whatever the buffer size, you are guaranteed to get the entire dataset, with each element appearing exactly once. So, for example, there is no random seed for which this will produce two 3's.

In [6]:
ds_shuff = ds.shuffle(4)
run_through(ds_shuff)

3
2
0
1


In [7]:
ds_big_shuff = ds.shuffle(10)
run_through(ds_big_shuff)

3
2
1
0


#### Buffer Size < n

Using a smaller buffer size means that you won't get a uniformly random shuffle of the dataset. In this degenerate example, with a buffer size of 1:
1. The buffer is filled with data, which comes in order from the dataset. 0 is added to the buffer.
2. The iterator receives a `get_next()` command and randomly selects a value from the buffer. The only thing there is 0, so it produces 0.
3. The buffer is now empty. The next element in the original sequence (the 1) is added.
4. The iterator produces 1.

etc...
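
The steps above can be sketched in plain Python. This is a simplified model of a shuffle buffer, not TensorFlow's actual implementation; the name `buffered_shuffle` and its `seed` parameter are made up for illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by a fixed-size shuffle buffer.

    Simplified model (an assumption, not TensorFlow's code): keep the buffer
    topped up from the input, then emit a uniformly chosen buffered element.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    while True:
        # Top the buffer up from the input, in order.
        for item in source:
            buffer.append(item)
            if len(buffer) >= buffer_size:
                break
        if not buffer:
            return  # input exhausted and buffer drained
        # Emit one element chosen uniformly at random from the buffer.
        yield buffer.pop(rng.randrange(len(buffer)))


# A buffer of 1 cannot reorder anything.
print(list(buffered_shuffle(range(4), buffer_size=1)))  # [0, 1, 2, 3]
# Any buffer size still yields every element exactly once.
print(sorted(buffered_shuffle(range(4), buffer_size=2, seed=0)))  # [0, 1, 2, 3]
```

Under this model the buffer only decides which of the next few elements comes out first, so a small buffer can only reorder elements locally.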

In [8]:
ds_small_shuff = ds.shuffle(1)
run_through(ds_small_shuff)

0
1
2
3


Let's try a slightly larger buffer size and run through the dataset multiple times. Note how the value 2 will never appear as the first element, and 3 will never appear as one of the first two elements.

In [9]:
ds_not_as_small_shuff = ds.shuffle(2)
run_through(ds_not_as_small_shuff)

0
2
1
3


In [10]:
ds_not_as_small_shuff = ds.shuffle(2)
run_through(ds_not_as_small_shuff)

1
0
3
2


In [11]:
ds_not_as_small_shuff = ds.shuffle(2)
run_through(ds_not_as_small_shuff)

1
2
3
0


In [12]:
ds_not_as_small_shuff = ds.shuffle(2)
run_through(ds_not_as_small_shuff)

1
0
2
3


Note that while not providing true uniform randomness, having a buffer that's smaller than the size of your dataset is typical in "big data" situations where you cannot fit everything in memory. This is therefore an important case.
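
The earlier claim that, with a buffer of 2, the value 2 never comes first and 3 never lands in the first two positions can be checked by brute force. The sketch below enumerates every order a shuffle buffer could emit under a simplified model (refill in input order, then emit any buffered element). The function name `all_shuffle_orders` is hypothetical, and the model is an assumption rather than TensorFlow's code.

```python
def all_shuffle_orders(items, buffer_size):
    """Return every output order a shuffle buffer of this size could produce."""
    items = tuple(items)
    orders = set()

    def explore(buffer, next_index, emitted):
        # Refill the buffer from the input, in order.
        while len(buffer) < buffer_size and next_index < len(items):
            buffer = buffer + (items[next_index],)
            next_index += 1
        if not buffer:
            orders.add(emitted)  # nothing left to emit: one complete order
            return
        # Branch on every element that could be emitted next.
        for i, value in enumerate(buffer):
            explore(buffer[:i] + buffer[i + 1:], next_index, emitted + (value,))

    explore((), 0, ())
    return orders


orders = all_shuffle_orders(range(4), buffer_size=2)
print(len(orders))                                  # 8 of the 24 permutations
print(all(order[0] != 2 for order in orders))       # True
print(all(3 not in order[:2] for order in orders))  # True
```

With `buffer_size=4` the same function returns all 24 permutations, which matches the "Buffer Size >= n" case.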

### Dataset -> Shuffle -> Repeat

Here we see where things begin to get a bit confusing:

In [13]:
ds_shuff_rep = ds.shuffle(4).repeat(2)
run_through(ds_shuff_rep)

3
0
1
2
3
0
1
2


What happened here? We got the same shuffle on both of our repeats! Was it just bad luck? No, and as it turns out, we get the same effect even with a smaller buffer size.

In [14]:
ds_shuff_rep = ds.shuffle(2).repeat(2)
run_through(ds_shuff_rep)

0
2
3
1
0
2
3
1
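
One plausible way to picture this, and it is an assumption rather than something verified against TensorFlow's source, is that the shuffle restarts from the same random state on every pass through the data. The plain-Python sketch below models that reading and contrasts it with shuffling the already-repeated stream, which the next section covers. Both function names are made up for illustration.

```python
import random


def shuffle_then_repeat(items, count, seed):
    """Model of shuffle(...).repeat(count) where every pass reuses one seed."""
    output = []
    for _ in range(count):
        rng = random.Random(seed)  # assumption: the random state restarts on each pass
        epoch = list(items)
        rng.shuffle(epoch)
        output.extend(epoch)
    return output


def repeat_then_shuffle(items, count, seed):
    """Model of repeat(count).shuffle(...) with a buffer covering everything."""
    rng = random.Random(seed)
    stream = list(items) * count
    rng.shuffle(stream)  # a single shuffle over the whole repeated stream
    return stream


out = shuffle_then_repeat(range(4), count=2, seed=12)
print(out[:4] == out[4:])  # True: both epochs come out in the same order
print(sorted(repeat_then_shuffle(range(4), count=2, seed=12)))  # [0, 0, 1, 1, 2, 2, 3, 3]
```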


### Dataset -> Repeat -> Shuffle

The same does not happen when we do things in this order.

In [15]:
ds_rep_shuff = ds.repeat(2).shuffle(4)
run_through(ds_rep_shuff)

0
3
1
2
2
0
1
3


In [16]:
ds_rep_small_shuff = ds.repeat(2).shuffle(2)
run_through(ds_rep_small_shuff)

1
2
3
0
1
0
2
3


When the buffer is no larger than the dataset, it appears we will (at least fairly frequently) get one full copy of the dataset per epoch, although nothing guarantees it. Use a larger buffer size to mix examples across epoch boundaries.

In [17]:
ds_rep_big_shuff = ds.repeat(2).shuffle(8)
run_through(ds_rep_big_shuff)

1
2
3
3
0
0
2
1
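
To get a rough feel for how often the first epoch's worth of outputs is a complete copy of the dataset, here is a small Monte Carlo sketch in plain Python. It uses a simplified shuffle-buffer model, which is an assumption and not TensorFlow's implementation; the helper names, trial count, and seed are arbitrary choices for illustration.

```python
import random


def buffered_shuffle_list(items, buffer_size, rng):
    """Simplified shuffle-buffer model (an assumption, not TensorFlow's code)."""
    source = iter(items)
    buffer, output = [], []
    while True:
        for item in source:  # top the buffer up in input order
            buffer.append(item)
            if len(buffer) >= buffer_size:
                break
        if not buffer:
            return output
        # Emit one buffered element chosen uniformly at random.
        output.append(buffer.pop(rng.randrange(len(buffer))))


def full_first_epoch_rate(n, epochs, buffer_size, trials=2000, seed=0):
    """Fraction of trials whose first n outputs contain each of 0..n-1 once."""
    rng = random.Random(seed)
    stream = list(range(n)) * epochs  # models repeat(epochs)
    hits = 0
    for _ in range(trials):
        out = buffered_shuffle_list(stream, buffer_size, rng)
        hits += sorted(out[:n]) == list(range(n))
    return hits / trials


for size in (1, 2, 4, 8):
    print(size, full_first_epoch_rate(n=4, epochs=2, buffer_size=size))
```

A buffer of 1 always gives a rate of 1.0 because nothing is reordered. With a buffer of 8 the whole repeated stream is shuffled uniformly, so under this model the rate should land near 8/35, or about 0.23.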


## Two Data Sources

In my own work, I use Siamese networks, which take a pair of training examples as input. The network performs operations on both of these inputs in parallel and computes a similarity score based on them. A Siamese network working on a pair of examples is therefore not the same as a garden-variety network working on, say, a 2-example mini-batch. The latter involves no calculation that needs both examples at the same time, at least not before the gradient update step.

The examples here will generalize to still more complex cases requiring three or more examples at once.

In [18]:
ds1 = tfdat.Dataset.from_tensor_slices(dat)
ds2 = tfdat.Dataset.from_tensor_slices(dat)
ds = tfdat.Dataset.zip((ds1, ds2))
run_through(ds)

(0, 0)
(1, 1)
(2, 2)
(3, 3)


### Shuffle -> Repeat -> Zip

Same order in both epochs.

In [19]:
ds = tfdat.Dataset.zip((ds1.shuffle(4).repeat(2),
                        ds2.shuffle(4).repeat(2))
                      )
run_through(ds)

(1, 2)
(2, 0)
(3, 1)
(0, 3)
(1, 2)
(2, 0)
(3, 1)
(0, 3)


### Repeat -> Shuffle -> Zip

This looks much better. No guarantee that we'll have each example from each source in each epoch, though.

In [20]:
ds = tfdat.Dataset.zip((ds1.repeat(2).shuffle(4),
                        ds2.repeat(2).shuffle(4))
                      )
run_through(ds)

(2, 2)
(0, 0)
(3, 1)
(1, 2)
(2, 1)
(0, 3)
(3, 0)
(1, 3)


In [21]:
ds = tfdat.Dataset.zip((ds1.repeat(2).shuffle(4),
                        ds2.repeat(2).shuffle(4))
                      )
run_through(ds)

(3, 2)
(0, 0)
(0, 0)
(2, 3)
(1, 1)
(2, 1)
(1, 3)
(3, 2)


### Shuffle -> Concat -> Zip

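Here I create one separately shuffled dataset per epoch for each source, join them end to end with `concatenate` (via [reduce](https://docs.python.org/3/library/functools.html#functools.reduce)), and then zip the sources together. With a buffer that covers the whole dataset, every epoch contains each example from each source exactly once, and each epoch gets its own shuffle.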

In [27]:
from functools import reduce

def create_pairs_dataset(num_epochs):
    ds1_list = [tfdat.Dataset.from_tensor_slices(dat).shuffle(4) for _ in range(num_epochs)]
    ds2_list = [tfdat.Dataset.from_tensor_slices(dat).shuffle(4) for _ in range(num_epochs)]
    ds1 = reduce(lambda a, b: a.concatenate(b), ds1_list)
    ds2 = reduce(lambda a, b: a.concatenate(b), ds2_list)
    ds = tfdat.Dataset.zip((ds1, ds2))
    return ds
    
run_through(create_pairs_dataset(3))

(3, 2)
(2, 3)
(1, 1)
(0, 0)
(0, 2)
(1, 0)
(2, 1)
(3, 3)
(0, 2)
(1, 3)
(3, 1)
(2, 0)


In [28]:
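def create_tuple_dataset(dat, tuple_size, num_epochs, buffer_size):
    """Zip `tuple_size` sources, each made of `num_epochs` separately shuffled copies of `dat`."""
    ds_list = []
    for _ in range(tuple_size):
        # One shuffled dataset per epoch, so each epoch gets its own shuffle op.
        epochs = [tfdat.Dataset.from_tensor_slices(dat).shuffle(buffer_size) for _ in range(num_epochs)]
        # Join the epochs end to end into a single source.
        ds_list.append(reduce(lambda a, b: a.concatenate(b), epochs))
    return tfdat.Dataset.zip(tuple(ds_list))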

In [30]:
run_through(create_tuple_dataset(dat, 3, 3, 4))

(0, 1, 0)
(1, 2, 2)
(2, 0, 3)
(3, 3, 1)
(0, 1, 3)
(2, 2, 1)
(3, 3, 0)
(1, 0, 2)
(2, 2, 0)
(0, 0, 1)
(3, 1, 3)
(1, 3, 2)
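
The property this construction is after, namely that every source yields each example exactly once per epoch, can be checked with a pure-Python analogue. The sketch below assumes the shuffle buffer covers the whole dataset, so each epoch is a full permutation. `tuple_epochs` is a hypothetical stand-in for `create_tuple_dataset`, not TensorFlow code.

```python
import random


def tuple_epochs(data, tuple_size, num_epochs, seed=None):
    """Pure-Python analogue of shuffle -> concatenate -> zip.

    Assumes the shuffle buffer covers the whole dataset, so each epoch of
    each source is a full, independent permutation.
    """
    rng = random.Random(seed)
    data = list(data)

    def one_source():
        stream = []
        for _ in range(num_epochs):
            epoch = list(data)
            rng.shuffle(epoch)    # a fresh shuffle for every epoch
            stream.extend(epoch)  # concatenate the epochs end to end
        return stream

    # zip pairs up the i-th element of every source
    return list(zip(*(one_source() for _ in range(tuple_size))))


rows = tuple_epochs(range(4), tuple_size=3, num_epochs=3, seed=0)
n = 4
for start in range(0, len(rows), n):
    epoch = rows[start:start + n]
    # every source (column) holds each example exactly once per epoch
    print(all(sorted(column) == list(range(n)) for column in zip(*epoch)))  # True
```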
