image_dataset_from_directory() should return both training and validation datasets #15985
Comments
Thanks for the suggestion, this is a good idea! Unfortunately it is non-backwards compatible (when a seed is set), we would need to modify the proposal to ensure backwards compatibility.
What API would it have? How would it work?
Thanks for the reply! Please let me know your thoughts on the following.
We can keep
Please take a look at the following existing code: keras/keras/preprocessing/dataset_utils.py, lines 164 to 177 at commit b4dca51.
I propose to add a function `get_training_and_validation_split` which will return both splits. Alternatively, we could have a function which returns all (train, val, test) splits (perhaps `get_dataset_splits`). In any case, the implementation can be as follows:

```python
# keras/preprocessing/dataset_utils.py

def get_dataset_splits(samples, labels, splits):
    """Divides given samples into train, validation and test sets.

    Args:
        samples: List of elements.
        labels: List of corresponding labels.
        splits: Tuple of floats containing two or three elements,
            (train, val) or (train, val, test).

    Returns:
        Train, validation and test splits.
    """
    # Note: This function can be modified to return only the train and val
    # splits, as proposed with `get_training_and_validation_split`.
    if len(splits) == 2:
        train_split, val_split = splits
        test_split = 0.0
    elif len(splits) == 3:
        train_split, val_split, test_split = splits
    else:
        raise ValueError(
            "`splits` must have exactly two or three elements corresponding "
            "to (train, val) or (train, val, test) splits respectively. "
            f"Got {splits}.")
    # Use a tolerance rather than exact float equality.
    if abs(train_split + val_split + test_split - 1.0) > 1e-6:
        raise ValueError(
            "Train, val and test splits must add up to 1. "
            f"Got {(train_split, val_split, test_split)}.")
    num_train_samples = int(train_split * len(samples))
    num_val_samples = int(val_split * len(samples))
    # Slice contiguous, non-overlapping ranges so the subsets are disjoint.
    train_end = num_train_samples
    val_end = train_end + num_val_samples
    train_samples, train_labels = samples[:train_end], labels[:train_end]
    val_samples, val_labels = samples[train_end:val_end], labels[train_end:val_end]
    test_samples, test_labels = samples[val_end:], labels[val_end:]
    return ((train_samples, train_labels),
            (val_samples, val_labels),
            (test_samples, test_labels))
```
This also applies to
What we could do here for backwards compatibility is add a possible string value for the `subset` argument. In addition, I agree it would be useful to have a utility in `keras.utils` for this. When the input is a Dataset, we would not have an easy way to execute the split efficiently, since Datasets are not indexable. In that case I would suggest assuming that the data fits in memory, and simply extracting the data by iterating once over the dataset, then doing the split, then repackaging the output as two Datasets. What do you think?
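The iterate-once approach described above can be sketched in plain Python, with lists standing in for `tf.data.Dataset` objects (a real implementation would materialize the dataset via iteration and repackage each half with `tf.data.Dataset.from_tensor_slices`; those TensorFlow calls are omitted here so the sketch stays dependency-free):

```python
import random

def split_in_memory(dataset_iterable, left_size, seed=None):
    """Materialize a (possibly non-indexable) dataset, then split it.

    `dataset_iterable` is any iterable of elements; `left_size` is the
    fraction of samples placed in the first split. Assumes the data fits
    in memory, as suggested in the discussion above.
    """
    # Step 1: extract the data by iterating once over the dataset.
    elements = list(dataset_iterable)
    # Optional shuffle so the split is not order-dependent.
    if seed is not None:
        random.Random(seed).shuffle(elements)
    # Step 2: do the split at the cutoff index.
    cutoff = int(left_size * len(elements))
    left, right = elements[:cutoff], elements[cutoff:]
    # Step 3: a real implementation would repackage each list as a
    # tf.data.Dataset; here we simply return the lists.
    return left, right

train, val = split_in_memory(range(10), left_size=0.8, seed=42)
print(len(train), len(val))  # 8 2
```

Since the whole dataset is held in memory at once, this is only practical for small-data use cases, which is the trade-off acknowledged above.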
I think it is a good solution.
Thanks a lot for the comprehensive answer. Following are my thoughts on the same. Please let me know what you think.
Sounds great. However, I would also like to bring up the possibility of providing train, val and test splits of the dataset. The user can ask for
I have two things to say here. Secondly, a public I agree that partitioning a
The corresponding sklearn utility seems very widely used, and this is a use case that has come up often in keras.io code examples. If we cover both numpy use cases and tf.data use cases, it should be useful to our users. In the tf.data case, due to the difficulty of efficiently slicing a Dataset, it will only be useful for small-data use cases, where the data fits in memory. This will still be relevant to many users.
This is something we had initially considered, but we ultimately rejected it. A single validation_split covers most use cases, and supporting arbitrary numbers of subsets (each with a different size) would add a lot of complexity. For such use cases, we recommend splitting the test set in advance and moving it to a separate folder.
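The recommendation above, carving a test set off on disk before training, can be sketched with the standard library. This is a hypothetical helper, not a Keras API; the directory layout (one subfolder per class, as `image_dataset_from_directory` expects) and the split fraction are illustrative:

```python
import pathlib
import random
import shutil

def move_test_split(source_dir, test_dir, test_fraction=0.1, seed=123):
    """Move a random fraction of files from each class folder in
    `source_dir` into a mirrored class folder under `test_dir`.

    `test_dir` should lie outside `source_dir` so the moved files are
    not picked up when loading the training data.
    """
    source = pathlib.Path(source_dir)
    for class_dir in sorted(p for p in source.iterdir() if p.is_dir()):
        files = sorted(class_dir.iterdir())
        random.Random(seed).shuffle(files)
        num_test = int(test_fraction * len(files))
        target = pathlib.Path(test_dir) / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for f in files[:num_test]:
            shutil.move(str(f), str(target / f.name))
```

Afterwards, `image_dataset_from_directory` can be pointed at each folder separately, with `validation_split` reserved for the train/val division only.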
I see. In that case, I'll go for a publicly usable utility first. After that, I'll work on changing image_dataset_from_directory accordingly.
Sounds great -- thank you. About the first utility: what should be the name and argument signature?
I was thinking
To be honest, I have not yet worked out the details of this implementation, so I'll do that first before moving on. My primary concern is speed.
Could you please take a look at the above API design? If that's fine, I'll start working on the actual implementation.
Hey @AdityaKane2001, a bunch of updates have happened since February.
This answers all the questions in this issue, I believe.
Yes, I saw those later. I'm glad that they are now a part of Keras! They were much-needed utilities.
@fchollet Good morning, thanks for mentioning those features. However, despite upgrading TensorFlow to the latest version in my Colab notebook, the interpreter can neither find split_dataset in the utils module nor accept "both" as a value for image_dataset_from_directory's subset parameter (a "must be 'train' or 'validation'" error is returned). I checked the TensorFlow version and it was successfully updated. Any idea of the reason behind this problem? Thank you
System information.
TensorFlow version (you are using): 2.7
Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
Currently, `image_dataset_from_directory()` needs `subset` and `seed` arguments in addition to `validation_split`, and the user needs to call the same function twice to obtain both splits, which is slightly counterintuitive and confusing in my opinion. Instead, I propose returning both splits from a single call. This is in line (albeit vaguely) with sklearn's well-known `train_test_split` function. I believe this is more intuitive for the user.
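To make the two-call pattern concrete, here is a dependency-free model of it. The function below is hypothetical, standing in for the shuffle-then-slice behavior behind `image_dataset_from_directory`; it is not the Keras implementation:

```python
import random

def load_subset(samples, validation_split, subset, seed):
    """Hypothetical stand-in for image_dataset_from_directory's
    split logic: shuffle with `seed`, then return the requested slice."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    num_val = int(validation_split * len(shuffled))
    if subset == "training":
        return shuffled[:-num_val]
    elif subset == "validation":
        return shuffled[-num_val:]
    raise ValueError(f"Unknown subset: {subset}")

samples = list(range(10))
# The user must call the same function twice, with identical seeds:
train = load_subset(samples, 0.2, "training", seed=1337)
val = load_subset(samples, 0.2, "validation", seed=1337)
# Identical seeds make the two slices complementary and disjoint:
assert sorted(train + val) == samples
```

With different seeds, the two shuffles would disagree and the slices could overlap or omit samples; returning both splits from one call removes that pitfall entirely.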
Who will benefit from this feature?
Any and all beginners looking to use `image_dataset_from_directory` to load image datasets.

Contributing
Implement `get_training_and_validation_split` in dataset_utils.py.
Modify `image_dataset_from_directory` in image_dataset.py accordingly.

/cc @jvishnuvardhan @qlzh727 @fchollet