load_dataset method returns Unknown split "validation" even if this dir exists #4895
Comments
I don't know the root cause, but it looks like the last directory is being ignored in your case. As a workaround, create a directory called 'zzz' in the same folder as train, validation and test. If that doesn't work, create a directory called "aaa" instead. It worked for me. |
@SamSamhuns could you please try to load it with the current main-branch version of |
I have a similar problem. The @polinaeterna Could you help here please? You can find the code here: https://huggingface.co/datasets/sberbank-ai/Peter/tree/add_splits (add_splits branch) |
@skalinin It seems the |
This code indeed behaves as expected on @polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if |
yes that makes sense ! |
Looks like the
I agree with this as well. I would expect the directory name to take precedence over the file name. Right now if I place a single file named |
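To make the precedence discussed above concrete, here is a minimal sketch of split inference in which a parent directory name wins over a keyword in the file name. It is an illustration only, not the library's implementation: `SPLIT_ALIASES` and `infer_split` are hypothetical names, and treating `val` as an alias of `validation` is an assumption made for this example.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical alias table for illustration; NOT the library's keyword list.
SPLIT_ALIASES = {
    "train": "train",
    "test": "test",
    "validation": "validation",
    "val": "validation",
}


def infer_split(path: str) -> Optional[str]:
    """Guess a split for `path`, preferring directory names over the file name."""
    parts = PurePosixPath(path).parts
    if not parts:
        return None
    *dirs, filename = parts

    # 1) Directory names take precedence, closest parent first.
    for directory in reversed(dirs):
        split = SPLIT_ALIASES.get(directory.lower())
        if split is not None:
            return split

    # 2) Fall back to keywords in the file name, e.g. "train_1012.png".
    stem = PurePosixPath(filename).stem.lower()
    for token in re.split(r"[^a-z]+", stem):
        if token in SPLIT_ALIASES:
            return SPLIT_ALIASES[token]

    # 3) Nothing in the path indicates a split.
    return None
```

With this ordering, a file such as `validation/train_1012.png` is assigned to `validation`, because the directory is checked before the file name, while `data/1012.png` yields no split at all.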
Thanks for the reply. I've created a separate issue for my problem. |
Sounds good to me! opened a PR: #4985 |
Hi there @polinaeterna @mariosasko ! I have installed 5.2.3.dev0, which should have this fix. Unfortunately, I am still getting the error: Any help would be greatly appreciated! |
hi @shaneacton ! could you please show your dataset structure? |
Hi there @polinaeterna . My local CSV files are stored as follows:
|
@shaneacton do you have |
@polinaeterna no. Does the name of the split need to match the name of the file exactly? My train file is not actually named 'train.py'; it's called 'XXXXXXXXX_train_XXXXXXXX.csv' |
@shaneacton what files do you expect to be included in the "validation" split? Yes, you should somehow indicate that a file belongs to a certain split, either by including the split name in the filename or by putting the file into a folder named after the split. You can also check out this documentation page :) |
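When file names do not follow a layout the loader recognizes, another option is to state the split-to-file mapping explicitly. The sketch below builds such a mapping with the standard library only. The helper name `build_data_files` and the `*_<split>_*.csv` naming pattern are assumptions for illustration, modeled on the 'XXXXXXXXX_train_XXXXXXXX.csv' file name mentioned above.

```python
import tempfile
from pathlib import Path


def build_data_files(root, splits=("train", "validation", "test")):
    """Map each split name to the CSV files under `root` whose names contain it.

    Assumes files are named like '<prefix>_<split>_<suffix>.csv' (hypothetical).
    Splits with no matching file are left out of the mapping.
    """
    root = Path(root)
    data_files = {}
    for split in splits:
        matches = sorted(str(p) for p in root.glob(f"*_{split}_*.csv"))
        if matches:
            data_files[split] = matches
    return data_files


# Small demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("abc_train_001.csv", "abc_validation_001.csv", "notes.txt"):
        Path(tmp, name).touch()
    demo_files = build_data_files(tmp)

print(sorted(demo_files))

# The resulting dict can then be handed to the loader, for example:
#   from datasets import load_dataset
#   ds = load_dataset("csv", data_files=data_files, split="validation")
```

Building the mapping by hand keeps the split assignment independent of how the loader parses directory and file names, at the cost of maintaining the pattern yourself.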
@polinaeterna I have specified my train/test/tune files via the |
@polinaeterna I have solved the issue. The solution was to call: |
For me it was resolved by adding the verification_mode param:
|
Describe the bug
datasets.load_dataset returns a ValueError: Unknown split "validation". Should be one of ['train', 'test']. when running load_dataset(local_data_dir_path, split="validation"), even if the validation sub-directory exists in the local data path.
The data directories are as follows and attached to this issue:
They contain the same image files and metadata.jsonl, but the images in test_data2 have the split names prepended, i.e. train_1012.png, val_234.png, and the images in test_data1 do not have the split names prepended to the image names, i.e. 1012.png, 234.png.
I actually saw in another issue that val was not recognized as a split name, but here I would expect the files to take the split from the parent directory name, i.e. val should become part of the validation split?
Steps to reproduce the bug
Expected results
Actual results
Environment info
datasets version:
Data files
test_data1.zip
test_data2.zip