
image_dataset_from_directory() should return both training and validation datasets #15985

Closed
AdityaKane2001 opened this issue Jan 31, 2022 · 15 comments
Assignees
Labels
type:feature The user is asking for a new feature.

Comments

@AdityaKane2001
Contributor

AdityaKane2001 commented Jan 31, 2022

System information.

TensorFlow version (you are using): 2.7
Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.

Currently, image_dataset_from_directory() needs subset and seed arguments in addition to validation_split to produce a split. The resulting usage is as follows. The user needs to call the same function twice, which is slightly counterintuitive and confusing in my opinion.

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory="./",
    labels='inferred',
    label_mode='categorical',
    batch_size=32,
    image_size=(256, 256),
    validation_split=0.1,
    subset="training",
    seed=1024
)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory="./",
    labels='inferred',
    label_mode='categorical',
    batch_size=32,
    image_size=(256, 256),
    validation_split=0.1,
    subset="validation",
    seed=1024
)

Instead, I propose the following. This is in line (albeit loosely) with sklearn's well-known train_test_split function.

train_ds, val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory="./",
    labels='inferred',
    label_mode='categorical',
    batch_size=32,
    image_size=(256, 256),
    validation_split=0.1
)

I believe this is more intuitive for the user.

Who will benefit from this feature?
Any and all beginners looking to use image_dataset_from_directory to load image datasets.

Contributing

  • Do you want to contribute a PR? (yes/no): Yes
  • My candidate solution:
  1. Add a function get_training_and_validation_split in dataset_utils.py
  2. Change image_dataset_from_directory in image_dataset.py accordingly.

/cc @jvishnuvardhan @qlzh727 @fchollet

@AdityaKane2001 AdityaKane2001 added the type:feature The user is asking for a new feature. label Jan 31, 2022
@AdityaKane2001 AdityaKane2001 changed the title image_dataset_from_directory() should return both training and validation dataset image_dataset_from_directory() should return both training and validation datasets Jan 31, 2022
@jvishnuvardhan jvishnuvardhan added the keras-team-review-pending Pending review by a Keras team member. label Feb 1, 2022
@jvishnuvardhan jvishnuvardhan self-assigned this Feb 1, 2022
@fchollet
Member

fchollet commented Feb 2, 2022

Thanks for the suggestion, this is a good idea! Unfortunately it is not backwards compatible (when a seed is set); we would need to modify the proposal to ensure backwards compatibility.

Add a function get_training_and_validation_split

What API would it have? How would it work?

@AdityaKane2001
Contributor Author

AdityaKane2001 commented Feb 3, 2022

@fchollet

Thanks for the reply! Please let me know your thoughts on the following.

we would need to modify the proposal to ensure backwards compatibility.

We can keep image_dataset_from_directory as it is to ensure backwards compatibility, and declare a new function to cater to this requirement (its name could be decided later; coming up with a good name might be tricky).

What API would it have? How would it work?

Please take a look at the following existing code:

def get_training_or_validation_split(samples, labels, validation_split, subset):
    """Potentially restrict samples & labels to a training or validation split.

    Args:
        samples: List of elements.
        labels: List of corresponding labels.
        validation_split: Float, fraction of data to reserve for validation.
        subset: Subset of the data to return. Either "training", "validation",
            or None. If None, we return all of the data.

    Returns:
        tuple (samples, labels), potentially restricted to the specified subset.
    """

I propose to add a function get_training_and_validation_split which will return both splits.

Alternatively, we could have a function which returns all (train, val, test) splits (perhaps get_dataset_splits()?). I'm just thinking out loud here, so please let me know if this is not viable.

In any case, the implementation can be as follows:

# keras/preprocessing/dataset_utils.py
def get_dataset_splits(samples, labels, splits):
    """Divides given samples into train, validation and test sets.

    Args:
        samples: List of elements.
        labels: List of corresponding labels.
        splits: Tuple of floats containing two or three elements.

    Returns:
        Train, validation and test splits.
    """
    # Note: This function can be modified to return only the train and val
    # splits, as proposed with `get_training_and_validation_split`.
    if len(splits) == 2:
        train_split, val_split = splits
        test_split = 0.0
    elif len(splits) == 3:
        train_split, val_split, test_split = splits
    else:
        raise ValueError(
            "`splits` must have exactly two or three elements corresponding "
            "to (train, val) or (train, val, test) splits respectively. "
            f"Got {splits}.")

    # Compare with a tolerance rather than exact float equality.
    if abs(train_split + val_split + test_split - 1.0) > 1e-6:
        raise ValueError(
            "Train, val and test splits must add up to 1. "
            f"Got {train_split, val_split, test_split} respectively.")

    num_train_samples = int(train_split * len(samples))
    num_val_samples = int(val_split * len(samples))

    # Slice at consecutive offsets so that the subsets are disjoint.
    train_end = num_train_samples
    val_end = train_end + num_val_samples
    train_samples, train_labels = samples[:train_end], labels[:train_end]
    val_samples, val_labels = samples[train_end:val_end], labels[train_end:val_end]
    test_samples, test_labels = samples[val_end:], labels[val_end:]

    return ((train_samples, train_labels),
            (val_samples, val_labels),
            (test_samples, test_labels))
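To sanity-check the slicing logic, here is a standalone toy sketch of the proposed three-way split. The helper name `split_three_ways` is illustrative only, not a Keras API:

```python
# Toy illustration of a (train, val, test) split over parallel lists.
# Standalone sketch; not the actual Keras implementation.
def split_three_ways(samples, labels, splits):
    train_split, val_split, test_split = splits
    n = len(samples)
    n_train = int(train_split * n)
    n_val = int(val_split * n)
    # Slices must be taken at consecutive offsets so the subsets are disjoint.
    train = (samples[:n_train], labels[:n_train])
    val = (samples[n_train:n_train + n_val], labels[n_train:n_train + n_val])
    test = (samples[n_train + n_val:], labels[n_train + n_val:])
    return train, val, test

samples = list(range(10))
labels = [s % 2 for s in samples]
train, val, test = split_three_ways(samples, labels, (0.6, 0.2, 0.2))
print(len(train[0]), len(val[0]), len(test[0]))  # 6 2 2
```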

@haifeng-jin
Contributor

This also applies to text_dataset_from_directory and timeseries_dataset_from_array.

@haifeng-jin haifeng-jin removed the keras-team-review-pending Pending review by a Keras team member. label Feb 3, 2022
@fchollet
Member

fchollet commented Feb 9, 2022

What we could do here for backwards compatibility is add a possible string value for subset: subset="both", which would return both the training and validation datasets. Does that sound acceptable?

In addition, I agree it would be useful to have a utility in keras.utils in the spirit of get_train_test_split(). Let's call it split_dataset(dataset, split=0.2) perhaps? It could take either a list, an array, an iterable of list/arrays of the same length, or a tf.data Dataset.

When it's a Dataset, we would not have an easy way to execute the split efficiently, since Datasets are not indexable. In this case I would suggest... assuming that the data fits in memory, and simply extracting the data by iterating once over the dataset, then doing the split, then repackaging the output value as two Datasets.

What do you think?
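As a sketch, the indexable (list/array) case of the proposed utility could look like the following. The name and `split` parameter are taken from the proposal above; this is not the shipped Keras implementation, and the tf.data case is omitted:

```python
# Sketch of `split_dataset(data, split=0.2)` semantics for indexable data
# (lists or arrays). Hypothetical implementation, per the proposal above.
def split_dataset(data, split=0.2):
    n_right = int(split * len(data))
    n_left = len(data) - n_right
    # Return the leading portion and the reserved trailing portion.
    return data[:n_left], data[n_left:]

left, right = split_dataset(list(range(10)), split=0.2)
print(left, right)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```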

@fchollet fchollet assigned fchollet and unassigned haifeng-jin Feb 9, 2022
@haifeng-jin
Contributor

I think it is a good solution.
How do we warn the user when the tf.data.Dataset doesn't fit into memory and the split takes a long time?

@AdityaKane2001
Contributor Author

@fchollet

Thanks a lot for the comprehensive answer. Here are my thoughts; please let me know what you think.

subset="both"

Sounds great. However, I would also like to bring up the possibility of providing train, val and test splits of the dataset. The user could ask for (train, val) splits or (train, val, test) splits. Please share your thoughts on this.

get_train_test_split() ... It could take either a list, an array, an iterable of list/arrays of the same length, or a tf.data Dataset.

I have two things to say here.
Firstly, I was actually suggesting get_train_test_splits as an internal utility, to accompany the existing get_training_or_validation_split.

Secondly, a public get_train_test_splits utility will be of great help. However, most people who will use this utility will depend upon Keras to make a tf.data.Dataset for them. Generally, users who create a tf.data.Dataset themselves have a fixed pipeline (and mindset) to do so. Hence, I'm not sure whether get_train_test_splits would be of much use to the latter group. Please correct me if I'm wrong.

I agree that partitioning a tf.data.Dataset would not be easy without significant side effects and performance overhead.

@fchollet
Member

Secondly, a public get_train_test_splits utility will be of great help. However, most people who will use this utility will depend upon Keras to make a tf.data.Dataset for them. Generally, users who create a tf.data.Dataset themselves have a fixed pipeline (and mindset) to do so. Hence, I'm not sure whether get_train_test_splits would be of much use to the latter group. Please correct me if I'm wrong.

The corresponding sklearn utility seems very widely used, and this is a use case that has come up often in keras.io code examples. If we cover both numpy use cases and tf.data use cases, it should be useful to our users. In the tf.data case, due to the difficulty there is in efficiently slicing a Dataset, it will only be useful for small-data use cases, where the data fits in memory. This will still be relevant to many users.

@fchollet
Member

Sounds great. However, I would also like to bring up that we can also have the possibility to provide train, val and test splits of the dataset. The user can ask for (train, val) splits or (train, val, test) splits. Please share your thoughts on this.

This is something we had initially considered but we ultimately rejected it. A single validation_split covers most use cases, and supporting arbitrary numbers of subsets (each with a different size) would add a lot of complexity. For such use cases, we recommend splitting the test set in advance and moving it to a separate folder.
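For example, pre-splitting a test set into a separate folder could be done with a small script along these lines. `hold_out_test_set` is a hypothetical helper written for illustration, not a Keras utility:

```python
import random
import shutil
from pathlib import Path

def hold_out_test_set(src_dir, test_dir, test_fraction=0.1, seed=1024):
    """Move a random fraction of the files in each class subdirectory of
    `src_dir` into a parallel tree under `test_dir`.

    Illustrative sketch of the 'separate folder' recommendation; not a
    Keras API.
    """
    rng = random.Random(seed)
    for class_dir in sorted(Path(src_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_test = int(test_fraction * len(files))
        dest = Path(test_dir) / class_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for f in files[:n_test]:
            shutil.move(str(f), str(dest / f.name))
```

After running this once, both directories can be passed independently to image_dataset_from_directory.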

@AdityaKane2001
Contributor Author

@fchollet

I see. In that case, I'll go for a publicly usable get_train_test_split() supporting lists, arrays, iterables of lists/arrays, and tf.data.Dataset, as you said.

After that, I'll work on changing the image_dataset_from_directory aligning with that. Will this be okay?

@fchollet
Member

Sounds great -- thank you. About the first utility: what should be the name and argument signature?

@AdityaKane2001
Contributor Author

@fchollet

I was thinking get_train_test_split(). How about the following:

def get_train_test_split(arr, test_split=0.2, seed=1024, shuffle=True):
    """
    Args:
        arr: One of a Python list, a NumPy array, an iterable generating
            Python lists or NumPy arrays of the same length, or a
            tf.data.Dataset which fits in memory.
        test_split: Portion of `arr` to separate as the test set.
        seed: Seed for reproducibility of random ops.
        shuffle: Whether to shuffle the input dataset.

    Returns:
        A tuple of the same data structure passed as input, divided into
        two parts according to `test_split`.
    """

To be honest, I have not yet worked out the details of this implementation, so I'll do that first before moving on. My primary concern is speed.
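A minimal sketch of that signature, covering only the plain-list case (array, iterable, and tf.data.Dataset handling omitted), might look like this:

```python
import random

# Minimal sketch of the proposed `get_train_test_split` signature for the
# plain-list case. Hypothetical implementation, not a Keras API.
def get_train_test_split(arr, test_split=0.2, seed=1024, shuffle=True):
    indices = list(range(len(arr)))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible for a
        # given seed without touching global random state.
        random.Random(seed).shuffle(indices)
    n_test = int(test_split * len(arr))
    test_indices, train_indices = indices[:n_test], indices[n_test:]
    train = [arr[i] for i in train_indices]
    test = [arr[i] for i in test_indices]
    return train, test

train, test = get_train_test_split(list(range(100)), test_split=0.2)
print(len(train), len(test))  # 80 20
```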

@AdityaKane2001
Contributor Author

AdityaKane2001 commented Jun 10, 2022

@fchollet

Could you please take a look at the above API design? If that's fine I'll start working on the actual implementation.

@fchollet
Member

Hey @AdityaKane2001,

A bunch of updates happened since February.

This answers all questions in this issue, I believe.

@AdityaKane2001
Contributor Author

@fchollet

Yes, I saw those later. I'm glad they are now part of Keras! They were much-needed utilities.

@frigeriomtt

@fchollet Good morning, and thanks for mentioning that couple of features. However, despite upgrading TensorFlow to the latest version in my Colab notebook, the interpreter can neither find split_dataset as part of the utils module, nor accept "both" as a value for image_dataset_from_directory's subset parameter (a "must be 'train' or 'validation'" error is returned). I checked the TensorFlow version and it was successfully updated. Any idea of the reason behind this problem? Thank you
