
image_dataset_from_directory() should return both training and validation datasets #15985

Closed
AdityaKane2001 opened this issue Jan 31, 2022 · 15 comments
Assignees
Labels
type:feature The user is asking for a new feature.

Comments

@AdityaKane2001
Contributor

AdityaKane2001 commented Jan 31, 2022

System information.

TensorFlow version (you are using): 2.7
Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.

Currently, image_dataset_from_directory() needs subset and seed arguments in addition to validation_split to produce a split. The resulting usage is as follows. The user needs to call the same function twice, which is slightly counterintuitive and confusing in my opinion.

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory="./",
    labels='inferred',
    label_mode='categorical',
    batch_size=32,
    image_size=(256, 256),
    validation_split=0.1,
    subset="training",
    seed=1024
)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory="./",
    labels='inferred',
    label_mode='categorical',
    batch_size=32,
    image_size=(256, 256),
    validation_split=0.1,
    subset="validation",
    seed=1024
)

Instead, I propose the following. This is in line (albeit loosely) with sklearn's well-known train_test_split function.

train_ds, val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory="./",
    labels='inferred',
    label_mode='categorical',
    batch_size=32,
    image_size=(256, 256),
    validation_split=0.1
)

I believe this is more intuitive for the user.

Who will benefit from this feature?
Any and all beginners looking to use image_dataset_from_directory to load image datasets.

Contributing

  • Do you want to contribute a PR? (yes/no): Yes
  • My candidate solution:
  1. Add a function get_training_and_validation_split in dataset_utils.py
  2. Change image_dataset_from_directory in image_dataset.py accordingly.

/cc @jvishnuvardhan @qlzh727 @fchollet

@AdityaKane2001 AdityaKane2001 added the type:feature The user is asking for a new feature. label Jan 31, 2022
@AdityaKane2001 AdityaKane2001 changed the title image_dataset_from_directory() should return both training and validation dataset image_dataset_from_directory() should return both training and validation datasets Jan 31, 2022
@jvishnuvardhan jvishnuvardhan added the keras-team-review-pending Pending review by a Keras team member. label Feb 1, 2022
@jvishnuvardhan jvishnuvardhan self-assigned this Feb 1, 2022
@fchollet
Member

fchollet commented Feb 2, 2022

Thanks for the suggestion, this is a good idea! Unfortunately it is not backwards compatible (when a seed is set); we would need to modify the proposal to ensure backwards compatibility.

Add a function get_training_and_validation_split

What API would it have? How would it work?

@AdityaKane2001
Contributor Author

AdityaKane2001 commented Feb 3, 2022

@fchollet

Thanks for the reply! Please let me know your thoughts on the following.

we would need to modify the proposal to ensure backwards compatibility.

We can keep image_dataset_from_directory as it is to ensure backwards compatibility, and declare a new function to cater to this requirement (its name could be decided later; coming up with a good name might be tricky).

What API would it have? How would it work?

Please take a look at the following existing code:

def get_training_or_validation_split(samples, labels, validation_split, subset):
    """Potentially restrict samples & labels to a training or validation split.

    Args:
        samples: List of elements.
        labels: List of corresponding labels.
        validation_split: Float, fraction of data to reserve for validation.
        subset: Subset of the data to return. Either "training", "validation",
            or None. If None, we return all of the data.

    Returns:
        tuple (samples, labels), potentially restricted to the specified subset.
    """

I propose to add a function get_training_and_validation_split which will return both splits.

Alternatively, we could have a function which returns all (train, val, test) splits (perhaps get_dataset_splits()?). I'm just thinking out loud here, so please let me know if this is not viable.

In any case, the implementation can be as follows:

# keras/preprocessing/dataset_utils.py
def get_dataset_splits(samples, labels, splits):
    """Divides given samples into train, validation and test sets.

    Args:
        samples: List of elements.
        labels: List of corresponding labels.
        splits: Tuple of floats containing two or three elements.

    Returns:
        Train, validation and test splits.
    """
    # Note: This function can be modified to return only the train and val
    # splits, as proposed with `get_training_and_validation_split`.
    if len(splits) == 2:
        train_split, val_split = splits
        test_split = 0.0
    elif len(splits) == 3:
        train_split, val_split, test_split = splits
    else:
        raise ValueError(
            "`splits` must have exactly two or three elements corresponding "
            "to (train, val) or (train, val, test) splits respectively. "
            f"Got {splits}.")

    # Compare with a tolerance rather than exact float equality.
    if abs(train_split + val_split + test_split - 1.0) > 1e-6:
        raise ValueError(
            "Train, val and test splits must add up to 1. "
            f"Got {train_split, val_split, test_split} respectively.")

    num_train_samples = int(train_split * len(samples))
    num_val_samples = int(val_split * len(samples))

    # Slice at consecutive offsets so that the subsets are disjoint.
    train_end = num_train_samples
    val_end = train_end + num_val_samples
    train_samples, train_labels = samples[:train_end], labels[:train_end]
    val_samples, val_labels = samples[train_end:val_end], labels[train_end:val_end]
    test_samples, test_labels = samples[val_end:], labels[val_end:]

    return ((train_samples, train_labels),
            (val_samples, val_labels),
            (test_samples, test_labels))
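To sanity-check the slicing logic, here is a standalone toy sketch of the proposed three-way split. The helper name `split_three_ways` is illustrative only, not a Keras API:

```python
# Toy illustration of a (train, val, test) split over parallel lists.
# Standalone sketch; not the actual Keras implementation.
def split_three_ways(samples, labels, splits):
    train_split, val_split, test_split = splits
    n = len(samples)
    n_train = int(train_split * n)
    n_val = int(val_split * n)
    # Slices must be taken at consecutive offsets so the subsets are disjoint.
    train = (samples[:n_train], labels[:n_train])
    val = (samples[n_train:n_train + n_val], labels[n_train:n_train + n_val])
    test = (samples[n_train + n_val:], labels[n_train + n_val:])
    return train, val, test

samples = list(range(10))
labels = [s % 2 for s in samples]
train, val, test = split_three_ways(samples, labels, (0.6, 0.2, 0.2))
print(len(train[0]), len(val[0]), len(test[0]))  # 6 2 2
```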

@haifeng-jin
Contributor

This also applies to text_dataset_from_directory and timeseries_dataset_from_array.

@haifeng-jin haifeng-jin removed the keras-team-review-pending Pending review by a Keras team member. label Feb 3, 2022
@fchollet
Member

fchollet commented Feb 9, 2022

What we could do here for backwards compatibility is add a possible string value for subset: subset="both", which would return both the training and validation datasets. Does that sound acceptable?

In addition, I agree it would be useful to have a utility in keras.utils in the spirit of get_train_test_split(). Let's call it split_dataset(dataset, split=0.2) perhaps? It could take either a list, an array, an iterable of list/arrays of the same length, or a tf.data Dataset.

When it's a Dataset, we would not have an easy way to execute the split efficiently, since Datasets are not indexable. In this case I would suggest... assuming that the data fits in memory, and simply extracting the data by iterating once over the dataset, then doing the split, then repackaging the output value as two Datasets.

What do you think?
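As a sketch, the indexable (list/array) case of the proposed utility could look like the following. The name and `split` parameter are taken from the proposal above; this is not the shipped Keras implementation, and the tf.data case is omitted:

```python
# Sketch of `split_dataset(data, split=0.2)` semantics for indexable data
# (lists or arrays). Hypothetical implementation, per the proposal above.
def split_dataset(data, split=0.2):
    n_right = int(split * len(data))
    n_left = len(data) - n_right
    # Return the leading portion and the reserved trailing portion.
    return data[:n_left], data[n_left:]

left, right = split_dataset(list(range(10)), split=0.2)
print(left, right)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```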

@fchollet fchollet assigned fchollet and unassigned haifeng-jin Feb 9, 2022
@haifeng-jin
Contributor

I think it is a good solution.
How do we warn the user when the tf.data.Dataset doesn't fit into memory and the split takes a long time?

@AdityaKane2001
Contributor Author

@fchollet

Thanks a lot for the comprehensive answer. Here are my thoughts; please let me know what you think.

subset="both"

Sounds great. However, I would also like to bring up the possibility of providing train, val and test splits of the dataset. The user could ask for (train, val) splits or (train, val, test) splits. Please share your thoughts on this.

get_train_test_split() ... It could take either a list, an array, an iterable of list/arrays of the same length, or a tf.data Dataset.

I have two things to say here.
Firstly, I was actually suggesting get_train_test_splits as an internal utility, to accompany the existing get_training_or_validation_split.

Secondly, a public get_train_test_splits utility will be of great help. However, most people who will use this utility will depend upon Keras to make a tf.data.Dataset for them. Generally, users who create a tf.data.Dataset themselves have a fixed pipeline (and mindset) to do so. Hence, I'm not sure whether get_train_test_splits would be of much use to the latter group. Please correct me if I'm wrong.

I agree that partitioning a tf.data.Dataset would not be easy without significant side effects and performance overhead.

@fchollet
Member

Secondly, a public get_train_test_splits utility will be of great help. However, most people who will use this utility will depend upon Keras to make a tf.data.Dataset for them. Generally, users who create a tf.data.Dataset themselves have a fixed pipeline (and mindset) to do so. Hence, I'm not sure whether get_train_test_splits would be of much use to the latter group. Please correct me if I'm wrong.

The corresponding sklearn utility seems very widely used, and this is a use case that has come up often in keras.io code examples. If we cover both numpy use cases and tf.data use cases, it should be useful to our users. In the tf.data case, due to the difficulty there is in efficiently slicing a Dataset, it will only be useful for small-data use cases, where the data fits in memory. This will still be relevant to many users.

@fchollet
Member

Sounds great. However, I would also like to bring up that we can also have the possibility to provide train, val and test splits of the dataset. The user can ask for (train, val) splits or (train, val, test) splits. Please share your thoughts on this.

This is something we had initially considered but we ultimately rejected it. A single validation_split covers most use cases, and supporting arbitrary numbers of subsets (each with a different size) would add a lot of complexity. For such use cases, we recommend splitting the test set in advance and moving it to a separate folder.
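For example, pre-splitting a test set into a separate folder could be done with a small script along these lines. `hold_out_test_set` is a hypothetical helper written for illustration, not a Keras utility:

```python
import random
import shutil
from pathlib import Path

def hold_out_test_set(src_dir, test_dir, test_fraction=0.1, seed=1024):
    """Move a random fraction of the files in each class subdirectory of
    `src_dir` into a parallel tree under `test_dir`.

    Illustrative sketch of the 'separate folder' recommendation; not a
    Keras API.
    """
    rng = random.Random(seed)
    for class_dir in sorted(Path(src_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_test = int(test_fraction * len(files))
        dest = Path(test_dir) / class_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for f in files[:n_test]:
            shutil.move(str(f), str(dest / f.name))
```

After running this once, both directories can be passed independently to image_dataset_from_directory.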

@AdityaKane2001
Contributor Author

@fchollet

I see. In that case, I'll go for a publicly usable get_train_test_split() supporting lists, arrays, iterables of lists/arrays, and tf.data.Dataset, as you said.

After that, I'll work on changing the image_dataset_from_directory aligning with that. Will this be okay?

@fchollet
Member

Sounds great -- thank you. About the first utility: what should be the name and argument signature?

@AdityaKane2001
Contributor Author

@fchollet

I was thinking get_train_test_split(). How about the following:

def get_train_test_split(arr, test_split=0.2, seed=1024, shuffle=True):
    """
    Args:
        arr: One of a Python list, a NumPy array, an iterable generating
            Python lists or NumPy arrays of the same length, or a
            tf.data.Dataset which fits in memory.
        test_split: Portion of `arr` to separate as the test set.
        seed: Seed for reproducibility of random ops.
        shuffle: Whether to shuffle the input dataset.

    Returns:
        A tuple of the same data structure passed as input, divided into
        two parts according to `test_split`.
    """

To be honest, I have not yet worked out the details of this implementation, so I'll do that first before moving on. My primary concern is speed.
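A minimal sketch of that signature, covering only the plain-list case (array, iterable, and tf.data.Dataset handling omitted), might look like this:

```python
import random

# Minimal sketch of the proposed `get_train_test_split` signature for the
# plain-list case. Hypothetical implementation, not a Keras API.
def get_train_test_split(arr, test_split=0.2, seed=1024, shuffle=True):
    indices = list(range(len(arr)))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible for a
        # given seed without touching global random state.
        random.Random(seed).shuffle(indices)
    n_test = int(test_split * len(arr))
    test_indices, train_indices = indices[:n_test], indices[n_test:]
    train = [arr[i] for i in train_indices]
    test = [arr[i] for i in test_indices]
    return train, test

train, test = get_train_test_split(list(range(100)), test_split=0.2)
print(len(train), len(test))  # 80 20
```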

@AdityaKane2001
Contributor Author

AdityaKane2001 commented Jun 10, 2022

@fchollet

Could you please take a look at the above API design? If that's fine I'll start working on the actual implementation.

@fchollet
Member

Hey @AdityaKane2001,

A bunch of updates happened since February.

This answers all questions in this issue, I believe.

@AdityaKane2001
Contributor Author

@fchollet

Yes, I saw those later. I'm glad they are now part of Keras! They were much-needed utilities.

@frigeriomtt

@fchollet Good morning, and thanks for mentioning that couple of features. However, despite upgrading TensorFlow to the latest version in my Colab notebook, the interpreter can neither find split_dataset as part of the utils module, nor accept "both" as a value for image_dataset_from_directory's subset parameter (a "must be 'train' or 'validation'" error is returned). I checked the TensorFlow version and it was successfully updated. Any idea of the reason behind this problem? Thank you
