Status | Proposed |
---|---|
Author(s) | Francois Chollet (fchollet@google.com), Frederic Branchaud-Charron (Frederic.Branchaud-Charron@usherbrooke.ca) |
Updated | 2019-08-21 |
tf.data.Dataset
is the main API for data loading and preprocessing in TensorFLow. It has two advantages:
- It supports GPU prefetching
- It supports distribution via the Distribution Strategies API
Meanwhile, keras.preprocessing
is a major API for data loading and preprocessing in Keras. It is based
on Numpy and Scipy, and it produces instances of the keras.utils.Sequence
class, which are finite-length,
resettable Python generators that yield batches of data.
Some features of keras.preprocessing
are highly useful and don't have straightforward equivalents in tf.data
(in particular image data augmentation and dynamic time series iteration).
Ideally, the utilities in keras.preprocessing
should be made compatible with tf.data
.
This presents the opportunity to improve on the existing API. In particular we don't have good support
for image segmentation use cases today.
Some features are also being supplanted by preprocessing layers, in particular text processing. As a result we may want move the current API to an API similar to Layers.
- Unify "keras.preprocessing" and the recently-introduced Preprocessing Layers API.
- Make all features of
keras.preprocessing
compatible withtf.data
. - As a by-product, add required ops to TensorFlow (
tf.image
).
- Deprecate
ImagePipelineGenerator
in favor of newImagePipeline
class similar to aSequential
model. - Inherits from
keras.layers.PreprocessingLayer
for all image transformations. - Deprecate
Tokenizer
class in favor ofTextVectorization
preprocessing layer. - Replace
TimeseriesGenerator
with a function-based API.
ImagePipeline
inherits from PreprocessingLayer
(or alternatively keras.model.Sequential
, whose behavior is similar) and takes a list of layers as inputs. In the future it will inherit from PreprocessingStage
.
ImagePipeline
is a preprocessing layer that encapsulate a series of image transformations. Since some of these transformations may be trained (featurewise normalization), it exposes the method adapt
, like all other preprocessing layers.
class ImagePipeline(Sequential):
def __init__(self, layers:List[Layer]):
...
preprocessor = ImagePipeline([
RandomFlip(horizontal=True),
RandomRotation(0.2, fill_mode='constant'),
RandomZoom(0.2, fill_mode='constant'),
RandomTranslation(0.2, fill_mode='constant'),
Normalization(), # This is the same Normalization introduced in preprocessing layers
])
preprocessor.adapt(sample_data) # optional step in case the object needs to be trained
dataset = preprocessor.from_directory(dir_name, image_size=(512, 512))
model.fit(dataset, epochs=10)
def from_directory(
self,
directory,
targets='inferred',
target_mode='categorical',
class_names='inferred',
color_mode='rgb',
batch_size=32,
image_size=(255, 255),
shuffle=True,
seed=None,
follow_links=False,
validation_split=None,
subset='training',
subset=None):
"""Generates a Dataset from files in a directory.
# Arguments:
directory: Directory where the data is located.
If `targets` is "inferred", it should contain
subdirectories, each containing images for a class.
Otherwise, the directory structure is ignored.
targets: Either
"inferred" (targets are generated from the directory structure),
None (no targets),
or a list of integer labels of the same size as the number of image
files found in the directory.
target_mode:
- 'categorical' means that the inferred labels are
encoded as a categorical vector (e.g. for categorical_crossentropy).
- 'binary' means that the inferred labels (there can be only 2)
are encoded as binary scalars (e.g. for binary_crossentropy).
class_names: Only valid if "targets" is "inferred". This is the explict
list of class names (must match names of subdirectories). Used
to control the order of the classes (otherwise alphanumerical order is used).
color_mode: One of "grayscale", "rgb", "rgba". Default: "rgb".
Whether the images will be converted to
have 1, 3, or 4 channels.
batch_size: Size of the batches of data (default: 32).
image_size: Size to resize images to after they are read from disk.
Since the pipeline processes batches of images that must all have the same size,
this must be provided.
shuffle: Whether to shuffle the data (default: True)
If set to False, sorts the data in alphanumeric order.
seed: Optional random seed for shuffling and transformations.
follow_links: Whether to follow links inside
subdirectories (default: False).
validation_split: Optional float between 0 and 1,
fraction of data to reserve for validation.
subset: One of "training" or "validation". Only used if `validation_split` is set.
"""
def from_dataframe(
self,
dataframe,
directory=None,
data_column='filename',
target_column='class',
target_mode='categorical',
weight_column=None,
color_mode='rgb',
batch_size=32,
image_size=(255, 255),
shuffle=True,
seed=None,
validation_split=None,
subset=None):
"""Generates a Dataset from a Pandas dataframe.
# Arguments:
dataframe: Pandas dataframe instance.
directory: The directory that image paths refer to.
data_column: Name of column with the paths for the input images.
target_column: Name of column with the class information.
target_mode:
- 'categorical' means that the inferred labels are
encoded as a categorical vector (e.g. for categorical_crossentropy).
- 'binary' means that the inferred labels (there can be only 2)
are encoded as binary scalars (e.g. for binary_crossentropy).
weight_column: Name of column with sample weight information.
color_mode: One of "grayscale", "rgb", "rgba". Default: "rgb".
Whether the images will be converted to
have 1, 3, or 4 channels.
batch_size: Size of the batches of data (default: 32).
image_size: Size to resize images to after they are read from disk.
Since the pipeline processes batches of images that must all have the same size,
this must be provided.
shuffle: Whether to shuffle the data (default: True)
If set to False, sorts the data in alphanumeric order.
seed: Optional random seed for shuffling and transformations.
validation_split: Optional float between 0 and 1,
fraction of data to reserve for validation.
subset: One of "training" or "validation". Only used if `validation_split` is set.
"""
def preview(self, data, save_to_directory=None, save_prefix=None, save_format='png'):
"""Enables users to preview the image augmentation configuration.
# Arguments
data: Image data. Could be strings (a list of image paths), a list of PIL image instances,
a list of arrays, or a list of eager tensors.
save_to_directory: Directory to save transformed images. Mandatory if not in a notebook.
If in a notebook and this is not specified, images are displayed in-line.
save_prefix: String, filename prefix for saved images.
save_format: String, extension for saved images.
"""
Note: from_arrays
is not included since it is possible to transform Numpy data simply by calling the ImagePipeline
object (like a layer).
The new data augmentation layers will inherit keras.layers.Layer
and work in a similar way.
Resizing(height, width) # Resize while distorting aspect ratio
CenterCrop(height, width) # Resize without distorting aspect ratio
RandomCrop(height, width, seed=None) # Return a (height, width) crop from a random location
Rescaling(value) # Divide by `value`
RandomFlip(horizontal=False, vertical=False, seed=None)
RandomTranslation(amplitude=0., fill_mode='constant', fill_value=0., seed=None)
RandomRotation(amplitude=0., fill_mode='constant', fill_value=0., seed=None)
RandomZoom(amplitude=0., fill_mode='constant', fill_value=0., seed=None)
RandomBrightness(amplitude=0., seed=None)
RandomContrast(amplitude=0., seed=None)
RandomSaturation(amplitude=0., seed=None)
RandomWidth(amplitude=0., seed=None) # Expand / shrink width while distorting aspect ratio
RandomHeight(amplitude=0., seed=None) # Expand / shrink height while distorting aspect ratio
The amplitude
argument may be:
- a positive float: it is understood as "fraction of total" (total is the current width, or height, or 180 degrees in the case
RandomRotation
). E.g.0.2
results in variations in the [-20%, +20%] range. If larger than 1, it is rounded to one for the lower boundary (but not the higher boundary). - a tuple of 2 positive floats: understood as a fractional range, e.g.
(0.2, 0.4)
is interpreted as the [-20%, +40%] range. The first float may not be larger than 1.
To do a random center crop that zooms in and discards part of the image, you would do:
preprocessor = ImagePipeline([
RandomZoom([0., 0.2]),
CenterCrop(height, width),
])
- We are dropping support for ZCA whitening as it is no longer popular in the computer vision community.
- We don't have immediate support for random translations along only one axis.
- We only plan on implementing support for
data_format='channels_last'
. As such this argument does not appear in the API.
class RandomFlip(PreprocessingLayer):
def __init__(self, horizontal=False, vertical=False, seed=None):
self.horizontal = horizontal
self.vertical = vertical
self.seed = seed or random_int()
self._rng = rng_from_seed(seed)
def call(self, inputs, training=None, seed=None):
seed = seed or self._rng.sample()
if training:
if self.horizontal:
inputs = tf.image.random_flip_left_right(inputs, seed=seed)
if self.vertical:
inputs = tf.image.random_flip_up_down(inputs, seed=seed)
return inputs
Requirements:
- Image loading and image augmentation should be synced across inputs and targets
- It should be possible to use different standardization preprocessing (outside of augmentation) across inputs and targets
Proposal:
# Shared spatial transformations for inputs and targets
augmenter = ImagePipeline([
RandomRotation(0.5),
RandomFlip(vertical=True)
])
input_pipeline = ImagePipeline([
augmenter,
RandomBrightness(0.2),
RandomContrast(0.2),
RandomSaturation(0.2),
])
target_pipeline = ImagePipeline([
augmenter,
OneHot(num_classes)
])
input_ds = input_pipeline.from_directory(
input_dir, targets=None, image_size=(150, 150), batch_size=32,
seed=123) # This seed supercedes the per-layer seed in all transformations
target_ds = target_pipeline.from_directory(
target_dir, # target_dir should have same structure as input_dir.
targets=None, image_size=(150, 150), batch_size=32, seed=123)
ds = tf.data.Dataset.zip((input_ds, target_ds))
model.fit(ds)
Note that the behavior of having the seed
argument in from_directory
supercedes the per-layer argument is achieved by using the seed
to sample new random ints (scalar tensors from tf.random.experimental.Generator
) to serve as the call
argument to each underlying layer.
- Deprecate existing
TimeSeriesGenerator
class - Introduce functional replacement
timeseries_dataset
:
def timeseries_dataset(
data, targets, length,
sampling_rate=1,
stride=1,
start_index=0,
end_index=None,
shuffle=False,
reverse=False,
batch_size=128):
"""Utility function for generating batches of temporal data.
This function takes in a sequence of data-points gathered at
equal intervals, along with time series parameters such as
stride, length of history, etc., to produce batches for
training/validation.
# Arguments
data: Indexable generator (such as list or Numpy array)
containing consecutive data points (timesteps).
The data should be at 2D, and axis 0 is expected
to be the time dimension.
targets: Targets corresponding to timesteps in `data`.
It should have same length as `data`.
length: Length of the output sequences (in number of timesteps).
sampling_rate: Period between successive individual timesteps
within sequences. For rate `r`, timesteps
`data[i]`, `data[i-r]`, ... `data[i - length]`
are used for create a sample sequence.
stride: Period between successive output sequences.
For stride `s`, consecutive output samples would
be centered around `data[i]`, `data[i+s]`, `data[i+2*s]`, etc.
start_index: Data points earlier than `start_index` will not be used
in the output sequences. This is useful to reserve part of the
data for test or validation.
end_index: Data points later than `end_index` will not be used
in the output sequences. This is useful to reserve part of the
data for test or validation.
shuffle: Whether to shuffle output samples,
or instead draw them in chronological order.
reverse: Boolean: if `true`, timesteps in each output sample will be
in reverse chronological order.
batch_size: Number of timeseries samples in each batch
(except maybe the last one).
# Returns
A Dataset instance.
"""