# Fuel

## History

* Started as a part of **Blocks**, a framework for building and managing **Theano** graphs in the context of neural networks.
* Became its own project when we realized it was distinct enough that it could be used by other frameworks too.

## Goal

*Simplify downloading, storing, iterating over and preprocessing data used to train machine learning models.*

## Quick start

We'll go over a quick example to see how we can load arbitrary data into a dataset and set up a basic preprocessing pipeline that iterates over the data while transforming it on the fly.

Let's start by creating some random data to act as features and targets. We'll pretend that we have eight 2x2 grayscale images separated into four classes.

In [1]:
import numpy
seed = 1234
rng = numpy.random.RandomState(seed)
features = rng.randint(256, size=(8, 2, 2))
targets = rng.randint(4, size=(8, 1))

The first thing we need to do is to store this data in a dataset.

### IterableDataset

There are many dataset classes to choose from. We'll look at the simplest one, `IterableDataset`.

It is created by passing a `dict` mapping source names to the data they contain and, optionally, a `dict` mapping source names to tuples of axis labels.

In [2]:
from fuel.datasets import IterableDataset
dataset = IterableDataset(
    iterables={'features': features, 'targets': targets},
    axis_labels={'features': ('batch', 'height', 'width'),
                 'targets': ('batch', 'index')})

We can ask the dataset which sources of data it provides by accessing its `sources` attribute. Its `axis_labels` attribute tells us what each axis of each source represents, and its `num_examples` property tells us how many examples it contains.

In [3]:
print('Sources are {}.'.format(dataset.sources))
print('Axis labels are {}.'.format(dataset.axis_labels))
print('Dataset contains {} examples.'.format(dataset.num_examples))

Sources are ('features', 'targets').
Axis labels are {'features': ('batch', 'height', 'width'), 'targets': ('batch', 'index')}.
Dataset contains 8 examples.


Datasets themselves are stateless objects (as opposed to, say, an open file handle, or an iterator object). In order to request data from the dataset, we need to ask it to instantiate some stateful object with which it will interact. This is done through the `open` method:

In [4]:
state = dataset.open()
print(state.__class__.__name__)

imap


We see that in `IterableDataset`'s case the state is an iterator object. We can now visit the examples this dataset contains using its `get_data` method.

In [5]:
print(dataset.get_data(state=state))

(array([[ 47, 211],
       [ 38,  53]]), array([0]))


Note that the return order depends on the order of `dataset.sources`, which is nondeterministic if you use `dict` instances on a Python version that does not preserve insertion order (before 3.7). To get deterministic behaviour, it is recommended that you use `OrderedDict` instances instead.

Eventually, the iterator is depleted and it raises a `StopIteration` exception. We can iterate over the dataset again by requesting a fresh iterator through the dataset's `reset` method.

In [6]:
while True:
    try:
        dataset.get_data(state=state)
    except StopIteration:
        print('Iteration over')
        break
state = dataset.reset(state=state)
print(dataset.get_data(state=state))
dataset.close(state=state)

Iteration over
(array([[ 47, 211],
       [ 38,  53]]), array([0]))
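
To make the open / get_data / reset / close cycle concrete, here is a minimal sketch of the same protocol in plain Python. `ToyIterableDataset` is a hypothetical stand-in written for this example, not Fuel's implementation; it only assumes that the state is an iterator, as we saw above.

```python
class ToyIterableDataset:
    """Hypothetical toy class mimicking the protocol; not part of Fuel."""

    def __init__(self, examples):
        self.examples = list(examples)

    def open(self):
        # The state is a fresh iterator; the dataset object itself
        # keeps no record of how far iteration has progressed.
        return iter(self.examples)

    def get_data(self, state):
        # Raises StopIteration once the iterator is depleted.
        return next(state)

    def reset(self, state):
        # Discard the old state and hand back a fresh one.
        return self.open()

    def close(self, state):
        # Nothing to release for an in-memory iterator.
        pass


toy = ToyIterableDataset(['a', 'b'])
state = toy.open()
first = toy.get_data(state)
second = toy.get_data(state)
try:
    toy.get_data(state)
    depleted = False
except StopIteration:
    depleted = True
state = toy.reset(state)
after_reset = toy.get_data(state)
toy.close(state)
print(first, second, depleted, after_reset)
```

Because all progress lives in the state object, several independent iterations over the same dataset can coexist.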


### IndexableDataset

The `IterableDataset` implementation is pretty minimal. For instance, it only lets you iterate over your data sequentially, one example at a time.

If your data happens to be indexable (e.g. a list, or a numpy array), then `IndexableDataset` will let you do much more.

We instantiate `IndexableDataset` just like we would for `IterableDataset`.

In [7]:
from fuel.datasets import IndexableDataset
from collections import OrderedDict

dataset = IndexableDataset(
    indexables=OrderedDict([('features', features), ('targets', targets)]),
    axis_labels={'features': ('batch', 'height', 'width'), 'targets': ('batch', 'index')})

The main advantage of `IndexableDataset` over `IterableDataset` is that it allows random access of the data it contains. In order to do so, we need to pass an additional `request` argument to `get_data` in the form of a list of indices.

In [8]:
state = dataset.open()
print('State is {}'.format(state))
print(dataset.get_data(state=state, request=[0, 1]))
dataset.close(state=state)

State is None
(array([[[ 47, 211],
        [ 38,  53]],

       [[204, 116],
        [152, 249]]]), array([[0],
       [3]]))


See how `IndexableDataset` returns a `None` state: this is because there's no actual state to maintain in this case.
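
Conceptually, a request is just a list of indices applied to every source. The NumPy-only sketch below illustrates that idea; `get_batch` and the `toy_*` arrays are hypothetical names for this example and say nothing about how Fuel implements `get_data` internally.

```python
import numpy as np

# Toy sources: 8 examples each, indexed along the first axis.
toy_features = np.arange(8 * 2 * 2).reshape(8, 2, 2)
toy_targets = np.arange(8).reshape(8, 1)


def get_batch(sources, request):
    # Apply the same list of example indices to every source.
    return tuple(source[request] for source in sources)


# Indices can come in any order, which is what makes shuffling possible.
batch_features, batch_targets = get_batch((toy_features, toy_targets), [5, 2])
print(batch_features.shape, batch_targets.tolist())
```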

### Iteration schemes

Encapsulating and accessing our data is good, but if we're to integrate it into a training loop, we need to be able to iterate over the data. For that, we need to decide *which* indices to request and in *which order*. This is accomplished via an `IterationScheme` subclass.

At its most basic level, an iteration scheme is responsible, through its `get_request_iterator` method, for building an iterator that will return requests. Here are some examples:

In [9]:
from fuel.schemes import (SequentialScheme, ShuffledScheme,
                          SequentialExampleScheme, ShuffledExampleScheme)

schemes = [SequentialScheme(examples=8, batch_size=4),
           ShuffledScheme(examples=8, batch_size=4),
           SequentialExampleScheme(examples=8),
           ShuffledExampleScheme(examples=8)]
for scheme in schemes:
    print([request for request in scheme.get_request_iterator()])

[[0, 1, 2, 3], [4, 5, 6, 7]]
[[7, 2, 1, 6], [0, 4, 3, 5]]
[0, 1, 2, 3, 4, 5, 6, 7]
[7, 2, 1, 6, 0, 4, 3, 5]
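
To see what these requests amount to, here is a rough pure-Python sketch that produces the same kind of batches. The functions `sequential_batches` and `shuffled_batches` are hypothetical helpers for illustration, not Fuel's schemes, and the shuffled order will differ from the output above.

```python
import random


def sequential_batches(num_examples, batch_size):
    # Cut the index range 0..num_examples-1 into consecutive batches.
    indices = list(range(num_examples))
    return [indices[i:i + batch_size]
            for i in range(0, num_examples, batch_size)]


def shuffled_batches(num_examples, batch_size, seed=0):
    # Same as above, but shuffle the indices before cutting them up.
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size]
            for i in range(0, num_examples, batch_size)]


sequential = sequential_batches(8, 4)
shuffled = shuffled_batches(8, 4)
print(sequential)
print(shuffled)
```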


We can therefore use an iteration scheme to visit a dataset in some order.

In [10]:
state = dataset.open()
scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)
for request in scheme.get_request_iterator():
    data = dataset.get_data(state=state, request=request)
    print(data[0].shape, data[1].shape)
dataset.close(state=state)

(4, 2, 2) (4, 1)
(4, 2, 2) (4, 1)


### Data streams

Iteration schemes offer a more convenient way to visit the dataset than accessing the data by hand, but we can do better: the act of getting a fresh state from the dataset, getting a request iterator from the iteration scheme, using both to access the data and closing the state is repetitive. To automate this, we have data streams.

The most common data stream class is `DataStream`. It is instantiated with a dataset and an iteration scheme, and returns an epoch iterator through its `get_epoch_iterator` method, which iterates over the dataset in the order defined by the iteration scheme.

In [11]:
from fuel.streams import DataStream

data_stream = DataStream(dataset=dataset, iteration_scheme=scheme)
for data in data_stream.get_epoch_iterator():
    print(data[0].shape, data[1].shape)

(4, 2, 2) (4, 1)
(4, 2, 2) (4, 1)
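
The bookkeeping that a data stream takes off our hands can be pictured as a generator that walks through the requests and fetches each batch. This is only a sketch under that assumption; `epoch_iterator` and the `toy_*` arrays are hypothetical and do not reflect `DataStream`'s actual code.

```python
import numpy as np


def epoch_iterator(sources, requests):
    # One pass over the data: fetch a batch for every request, in order.
    for request in requests:
        yield tuple(source[request] for source in sources)


toy_features = np.zeros((8, 2, 2))
toy_targets = np.zeros((8, 1))
requests = [[0, 1, 2, 3], [4, 5, 6, 7]]

shapes = [(features.shape, targets.shape)
          for features, targets
          in epoch_iterator((toy_features, toy_targets), requests)]
print(shapes)
```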


### Transformers

Some data streams take data streams as input. We call them *transformers*, and they enable us to build complex data preprocessing pipelines.

Let's standardize the images we have by subtracting their mean and dividing by their standard deviation.

In [12]:
from fuel.transformers import ScaleAndShift
# Note: ScaleAndShift applies (batch * scale) + shift, as
# opposed to (batch + shift) * scale.
scale = 1.0 / features.std()
shift = - scale * features.mean()
standardized_stream = ScaleAndShift(data_stream=data_stream,
                                    scale=scale, shift=shift,
                                    which_sources=('features',))
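
If the choice of `scale` and `shift` looks surprising, the following NumPy-only check may help. It assumes nothing beyond the `(batch * scale) + shift` formula stated in the comment above, and uses hypothetical `toy_*` names so it does not touch the variables of the tutorial.

```python
import numpy as np

toy_rng = np.random.RandomState(1234)
toy_features = toy_rng.randint(256, size=(8, 2, 2)).astype('float64')

toy_scale = 1.0 / toy_features.std()
toy_shift = -toy_scale * toy_features.mean()

# (x * scale) + shift = x / std - mean / std = (x - mean) / std
standardized = toy_features * toy_scale + toy_shift
reference = (toy_features - toy_features.mean()) / toy_features.std()

matches = bool(np.allclose(standardized, reference))
print(matches, standardized.mean(), standardized.std())
```

The standardized values have a mean of zero and a standard deviation of one, up to floating-point error.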

The resulting data stream can be used to iterate over the dataset just like before, but this time the features will be standardized on the fly.

In [13]:
for batch in standardized_stream.get_epoch_iterator():
    print(batch)

(array([[[ 0.18530572, -1.54479571],
        [ 0.42249705,  0.24111545]],

       [[-1.30760439,  0.98059429],
        [-1.43317627, -1.2238898 ]],

       [[ 1.46892937,  1.58054882],
        [ 0.47830677, -1.2657471 ]],

       [[ 0.63178351, -0.28907693],
        [-0.40069638,  1.10616617]]]), array([[1],
       [0],
       [3],
       [2]]))
(array([[[ 1.32940506, -0.2332672 ],
        [-1.60060544, -0.31698179]],

       [[ 0.03182898,  0.50621164],
        [-1.64246273,  1.28754777]],

       [[ 0.88292727, -0.34488665],
        [ 0.15740086,  1.51078666]],

       [[-1.00065091, -0.84717417],
        [ 0.84106998, -0.19140991]]]), array([[2],
       [0],
       [3],
       [2]]))


Now, let's imagine that for some reason (e.g. running Theano code on GPU) we **need** features to have a data type of `float32`.

In [14]:
from fuel.transformers import Cast
cast_standardized_stream = Cast(data_stream=standardized_stream,
                                dtype='float32', which_sources=('features',))

As you can see, Fuel makes it easy to chain transformations to form a preprocessing pipeline. The complete pipeline now looks like this:

In [15]:
data_stream = Cast(
    ScaleAndShift(
        DataStream(
            dataset=dataset, iteration_scheme=scheme),
        scale=scale, shift=shift, which_sources=('features',)),
    dtype='float32', which_sources=('features',))
for batch in data_stream.get_epoch_iterator():
    print(batch)

(array([[[ 0.63178349, -0.28907692],
        [-0.40069637,  1.10616612]],

       [[-1.00065088, -0.84717417],
        [ 0.84107   , -0.19140992]],

       [[ 0.8829273 , -0.34488666],
        [ 0.15740086,  1.51078665]],

       [[-1.30760443,  0.98059428],
        [-1.43317628, -1.22388983]]], dtype=float32), array([[2],
       [2],
       [3],
       [0]]))
(array([[[ 0.03182898,  0.50621164],
        [-1.64246273,  1.28754783]],

       [[ 1.46892941,  1.58054876],
        [ 0.47830677, -1.26574707]],

       [[ 0.18530573, -1.54479575],
        [ 0.42249706,  0.24111545]],

       [[ 1.32940507, -0.2332672 ],
        [-1.60060549, -0.31698179]]], dtype=float32), array([[0],
       [3],
       [1],
       [2]]))
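
The nesting above works much like chaining Python generators, where each stage consumes the previous one lazily. The sketch below illustrates that pattern only; `scale_and_shift` and `cast` are hypothetical generator functions written for this example, not Fuel's transformer classes, and they hard-code which element of the batch is transformed.

```python
import numpy as np


def scale_and_shift(stream, scale, shift):
    # Transform the features of each batch; pass targets through untouched.
    for features, targets in stream:
        yield features * scale + shift, targets


def cast(stream, dtype):
    # Convert the features of each batch to the requested data type.
    for features, targets in stream:
        yield features.astype(dtype), targets


toy_batches = [(np.ones((4, 2, 2)), np.zeros((4, 1), dtype='int64'))]
pipeline = cast(scale_and_shift(iter(toy_batches), scale=2.0, shift=-1.0),
                'float32')

out_features, out_targets = next(pipeline)
print(out_features.dtype, out_targets.dtype)
```

Nothing is computed until a batch is requested, which is what lets the transformations run on the fly.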


## Large datasets

Sometimes, the dataset you're working on is too big to fit in memory. In that case, you'll want to use another common dataset class, `H5PYDataset`.

### H5PYDataset

As the name implies, `H5PYDataset` is a dataset class that interfaces with HDF5 files using the `h5py` library.

HDF5 is a wonderful storage format: files can be organized hierarchically and are self-describing. This allows us to make a basic set of assumptions about the structure of an HDF5 file which, if met, greatly simplify creating new datasets and interacting with them. These assumptions are:

* All data is stored in a single HDF5 file.
* Data sources reside in the root group, and their names define the source names.
* Data sources are not explicitly split into separate HDF5 datasets or separate HDF5 files. Instead, splits are defined in the `split` attribute of the root group.
    
Don't worry about that too much; Fuel has some built-in functions to take care of that.

Let's create new random data. This time, we'll pretend that we're given a training set and a test set.

In [16]:
train_image_features = rng.randint(256, size=(90, 3, 32, 32)).astype('uint8')
train_vector_features = rng.normal(size=(90, 16))
train_targets = rng.randint(10, size=(90, 1)).astype('uint8')

test_image_features = rng.randint(256, size=(10, 3, 32, 32)).astype('uint8')
test_vector_features = rng.normal(size=(10, 16))
test_targets = rng.randint(10, size=(10, 1)).astype('uint8')

We would normally need to:

* open an HDF5 file for writing,
* create three datasets in the root group, one for each data source,
* fill the datasets with our training and test data, and
* build the split array describing how the file is laid out.

Fortunately for us, Fuel has a built-in function that automates the process.

In [17]:
import h5py
from fuel.converters.base import fill_hdf5_file
f = h5py.File('dataset.hdf5', mode='w')
data = (('train', 'image_features', train_image_features),
        ('train', 'vector_features', train_vector_features),
        ('train', 'targets', train_targets),
        ('test', 'image_features', test_image_features),
        ('test', 'vector_features', test_vector_features),
        ('test', 'targets', test_targets))
fill_hdf5_file(f, data)
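
To picture what "describing how the file is laid out" means, here is a NumPy-only sketch. It assumes that each source is stored as a single array with the splits placed back to back along the first axis, and that the split information boils down to start/stop intervals. The `split_bounds` dictionary and the `toy_*` arrays are hypothetical; the actual encoding of the `split` attribute is not shown here.

```python
import numpy as np

toy_train = np.zeros((90, 16))
toy_test = np.ones((10, 16))

# Assumption: one array per source, splits stored one after the other.
stacked = np.concatenate([toy_train, toy_test], axis=0)

# Record which rows belong to which split as half-open intervals.
split_bounds = {}
start = 0
for name, array in (('train', toy_train), ('test', toy_test)):
    split_bounds[name] = (start, start + len(array))
    start += len(array)

# A split can then be recovered by slicing the stacked array.
test_start, test_stop = split_bounds['test']
recovered_test = stacked[test_start:test_stop]
print(split_bounds, recovered_test.shape)
```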

Before closing the file, let's also tag each axis with its label.

In [18]:
for i, label in enumerate(('batch', 'channel', 'height', 'width')):
    f['image_features'].dims[i].label = label
for i, label in enumerate(('batch', 'feature')):
    f['vector_features'].dims[i].label = label
for i, label in enumerate(('batch', 'index')):
    f['targets'].dims[i].label = label
f.flush()
f.close()

We now have everything we need to load this HDF5 file in Fuel.

We'll instantiate `H5PYDataset` by passing it the path to our HDF5 file as well as a tuple of splits to use. For now, we'll just load the train and test sets separately, but note that passing several splits in that tuple concatenates them (e.g. the training and validation sets).

In [19]:
from fuel.datasets import H5PYDataset
train_dataset = H5PYDataset('dataset.hdf5', which_sets=('train',))
test_dataset = H5PYDataset('dataset.hdf5', which_sets=('test',))

`H5PYDataset` instances allow the same level of introspection as `IndexableDataset` instances.

In [20]:
print('Sources are {}.'.format(train_dataset.sources))
print('Axis labels are {}.'.format(train_dataset.axis_labels))
print('Training set contains {} examples.'.format(train_dataset.num_examples))
print('Test set contains {} examples.'.format(test_dataset.num_examples))

Sources are ('image_features', 'targets', 'vector_features').
Axis labels are {'vector_features': ('batch', 'feature'), 'image_features': ('batch', 'channel', 'height', 'width'), 'targets': ('batch', 'index')}.
Training set contains 90 examples.
Test set contains 10 examples.


We can iterate over data the same way as well.

In [21]:
train_stream = DataStream(
    dataset=train_dataset,
    iteration_scheme=ShuffledScheme(
        examples=train_dataset.num_examples, batch_size=10))
for batch in train_stream.get_epoch_iterator():
    print([source.shape for source in batch])

[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]
[(10, 3, 32, 32), (10, 1), (10, 16)]


## Built-in datasets

### Defining where Fuel looks for data

You can tell Fuel where to look for data by setting the `data_path` variable in `~/.fuelrc`.

You can override it by setting the `FUEL_DATA_PATH` environment variable.

In both cases, Fuel expects a sequence of paths separated by an OS-specific delimiter (`:` for Linux / Mac OS, `;` for Windows).
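
Python exposes the OS-specific delimiter as `os.pathsep`, so a path sequence of this kind can be split as in the sketch below. The `parse_data_path` helper is hypothetical and only illustrates the format; how Fuel itself parses the value is not shown here.

```python
import os


def parse_data_path(value):
    # Split on the OS-specific delimiter and drop empty entries.
    return [path for path in value.split(os.pathsep) if path]


# Build an example value with the delimiter of the current platform.
example_value = os.pathsep.join(['/first/path', '/second/path'])
paths = parse_data_path(example_value)
print(paths)

# The same helper applied to the environment variable, if it is set.
env_paths = parse_data_path(os.environ.get('FUEL_DATA_PATH', ''))
```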

### Downloading raw data files

### Converting raw data files

### Using built-in datasets

## Parallelizing data processing

## Extending Fuel

### New dataset classes

### New transformers

### New iteration schemes

## Common tasks

### Preprocess once