load_dataset method returns Unknown split "validation" even if this dir exists #4895

Closed
SamSamhuns opened this issue Aug 25, 2022 · 18 comments · Fixed by #4985
Labels
bug Something isn't working

Comments

@SamSamhuns

SamSamhuns commented Aug 25, 2022

Describe the bug

datasets.load_dataset raises ValueError: Unknown split "validation". Should be one of ['train', 'test']. when running load_dataset(local_data_dir_path, split="validation"), even though the validation sub-directory exists in the local data path.

The data directories are as follows and attached to this issue:

test_data1
    |_ train
        |_ 1012.png
        |_ metadata.jsonl
        ...
    |_ test
        ...
    |_ validation
        |_ 234.png
        |_ metadata.jsonl
        ...
test_data2
    |_ train
        |_ train_1012.png
        |_ metadata.jsonl
        ...
    |_ test
        ...
    |_ validation
        |_ val_234.png
        |_ metadata.jsonl
        ...

They contain the same image files and metadata.jsonl, but the images in test_data2 have the split names prepended, i.e. train_1012.png and val_234.png, while the images in test_data1 do not, i.e. 1012.png and 234.png.

I saw in another issue that val was not recognized as a split name, but here I would expect the files to take their split from the parent directory name, i.e. val_234.png should become part of the validation split.

Steps to reproduce the bug

import datasets
datasets.logging.set_verbosity_error()
from datasets import load_dataset, get_dataset_split_names


# the following correctly finds the train, validation and test splits
path = "./test_data1"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)


# the following only finds train and test splits
path = "./test_data2"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)

Expected results

###################### ['train', 'test', 'validation'] ######################
###################### ['train', 'test', 'validation'] ######################

Actual results

Traceback (most recent call last):
  File "test_data_loader.py", line 11, in <module>

    dataset = load_dataset(path, split=spt)
  File "/home/venv/lib/python3.8/site-packages/datasets/load.py", line 1758, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 893, in as_dataset
    datasets = map_nested(
  File "/home/venv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 385, in map_nested
    return function(data_struct)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 924, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 993, in _as_dataset
    dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
    files = self.get_file_instructions(name, instructions, split_infos)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
    file_instructions = make_file_instructions(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
    absolute_instructions = instruction.to_absolute(name2len)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in to_absolute
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in <listcomp>
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
    raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
ValueError: Unknown split "validation". Should be one of ['train', 'test'].

Environment info

  • datasets version:
  • Platform: Linux Ubuntu 18.04
  • Python version: 3.8.12
  • PyArrow version: 9.0.0

Data files

test_data1.zip
test_data2.zip

SamSamhuns added the bug label on Aug 25, 2022
@MarkosMuche

I don't know the root cause, but it looks like it is ignoring the last directory in your case. So, create a directory called 'zzz' in the same folder as train, validation and test. If that doesn't work, create a directory called 'aaa'. It worked for me.

@polinaeterna
Contributor

@SamSamhuns could you please try to load it with the current main-branch version of datasets? I suppose the problem is that it tries to get split names from filenames in this case, ignoring directory names, but val wasn't among the keywords at that time; this was fixed recently in PR #4844.

@skalinin

skalinin commented Sep 15, 2022

I have a similar problem.
When I try to create dataset_infos.json using datasets-cli test Peter.py --save_infos --all_configs, I get an error:
ValueError: Unknown split "test". Should be one of ['train'].

The dataset_infos.json is created perfectly fine when I use only one split, datasets.Split.TRAIN.

@polinaeterna Could you help here please?

You can find the code here: https://huggingface.co/datasets/sberbank-ai/Peter/tree/add_splits (add_splits branch)

@mariosasko
Collaborator

mariosasko commented Sep 15, 2022

@skalinin It seems the dataset_infos.json of your dataset is missing the info on the test split (and datasets-cli doesn't ignore the cached infos at the moment, which is a known bug), so your issue is not related to this one. I think you can fix your issue by deleting all the cached dataset_infos.json (in the local repo and in ~/.cache/huggingface/modules) before running the datasets-cli test command. Let us know if that doesn't help, and I can try to generate it myself.
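
In case it helps, a rough sketch of clearing those cached files (illustrative only, not an official command; it assumes the default cache location under ~/.cache/huggingface/modules):

# Rough sketch: delete cached dataset_infos.json copies before re-running
# `datasets-cli test ... --save_infos`. Adjust the path if your cache lives elsewhere.
from pathlib import Path

for info_file in Path.home().joinpath(".cache/huggingface/modules").rglob("dataset_infos.json"):
    print("removing", info_file)
    info_file.unlink()
# ...and also delete dataset_infos.json in the local dataset repo.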

@mariosasko
Collaborator

This code indeed behaves as expected on main. But suppose val_234.png is renamed to something that does not contain one of these keywords; in that case, this issue becomes relevant again. The real cause is the order in which we check the predefined split patterns when assigning data files to splits: first we assign data files based on filenames, and only if that fails, meaning not a single split is found (val is not recognized in the older versions of datasets, which results in an empty validation split), do we assign based on directory names.

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?
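
To make the ordering concrete, here is a minimal sketch of the precedence described above (illustrative only, not the actual datasets resolution code; the keyword lists and function names are made up):

# Illustrative sketch of the problem: filename-based split patterns are tried
# before directory-based ones, so one keyword match suppresses the directory fallback.
from pathlib import Path

SPLIT_KEYWORDS = {"train": ("train",), "test": ("test",), "validation": ("valid", "val", "dev")}

def guess_splits(data_dir: str) -> dict:
    files = [p for p in Path(data_dir).rglob("*") if p.is_file()]

    # 1) try to assign files to splits from their filenames
    by_name = {
        split: [f for f in files if any(kw in f.stem.lower() for kw in kws)]
        for split, kws in SPLIT_KEYWORDS.items()
    }
    if any(by_name.values()):
        # If any filename matched, directory names are never consulted, so a
        # keyword-free file sitting under validation/ never reaches that split.
        return {split: fs for split, fs in by_name.items() if fs}

    # 2) fall back to the parent directory name only when step 1 found nothing
    by_dir = {
        split: [f for f in files if f.parent.name.lower() == split]
        for split in SPLIT_KEYWORDS
    }
    return {split: fs for split, fs in by_dir.items() if fs}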

@lhoestq
Member

lhoestq commented Sep 15, 2022

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?

Yes, that makes sense!

@SamSamhuns
Author

Looks like the val/validation dir name issue is fixed with the current main-branch version of the datasets repository.

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?

I agree with this as well. I would expect the directory name to take precedence over the file name. Right now, if I place a single file named train_00001.jpg under the validation directory, load_dataset cannot find the validation split.

@skalinin

Thanks for the reply

I've created a separate issue for my problem.

@polinaeterna
Contributor

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?

Sounds good to me! opened a PR: #4985

@shaneacton

shaneacton commented Oct 6, 2022

Hi there @polinaeterna @mariosasko! I have installed 5.2.3.dev0, which should have this fix. Unfortunately, when I call load_dataset("csv", data_files=files, split=split) I am still getting the error:
ValueError: Unknown split "validation". Should be one of ['train'].

Any help would be greatly appreciated!

@polinaeterna
Contributor

Hi @shaneacton! Could you please show your dataset structure?

@shaneacton

shaneacton commented Oct 6, 2022

Hi there @polinaeterna. My local CSV files are stored as follows:

binding
    |_ tune.csv
    |_ public_data
        |_ train.csv

self.list_shards(split) successfully finds the relevant data files.

@polinaeterna
Contributor

@shaneacton do you have a validation.csv/val.csv/valid.csv/dev.csv file in your data folder? I can't find one in the structure you provided.

@shaneacton

shaneacton commented Oct 6, 2022

@polinaeterna no, does the name of the split need to match the name of the file exactly?

But my train file is not actually named 'train.csv'; it's called 'XXXXXXXXX_train_XXXXXXXX.csv', and the code works fine for train but fails for validation.
Does the file name need to contain the split name?

@polinaeterna
Contributor

@shaneacton what files do you expect to be included in the "validation" split? Yes, you should somehow indicate that a file belongs to a certain split, either by including the split name in the filename or by putting the file into a folder named after the split; you can also check out this documentation page :)
By default, all the data goes to a single train split.
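
For example, both of these layouts should be picked up (a hedged sketch; the paths are illustrative, see the docs page above for the exact conventions):

# Illustrative only: two ways to signal splits for local CSV files.
from datasets import load_dataset

# (a) split name in the filename:
#     my_data/train.csv, my_data/validation.csv, my_data/test.csv
# (b) split name as the parent directory:
#     my_data/train/data.csv, my_data/validation/data.csv, my_data/test/data.csv
ds = load_dataset("csv", data_dir="my_data")
print(ds)  # expected to expose train/validation/test splits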

@shaneacton

@polinaeterna I have specified my train/test/tune files via the split_to_filepattern argument when initialising my FileDataSource class. This is how list_shards is able to find the right files.
After your last message, I tried renaming my data files to simply train.csv and validation.csv, but I am still getting the same error: Unknown split "validation". Should be one of ['train'].

@shaneacton

@polinaeterna I have solved the issue. The solution was to call:
load_dataset("csv", data_files={split: files}, split=split)

@mehran66

mehran66 commented Mar 26, 2024

For me it was resolved by adding the verification_mode param:

from datasets import load_dataset

imdb_ds = load_dataset(
    "imdb", verification_mode="no_checks"
)
imdb_ds
