load_dataset method returns Unknown split "validation" even if this dir exists #4895

Closed
SamSamhuns opened this issue Aug 25, 2022 · 18 comments · Fixed by #4985
Labels
bug Something isn't working

Comments

@SamSamhuns

SamSamhuns commented Aug 25, 2022

Describe the bug

datasets.load_dataset raises ValueError: Unknown split "validation". Should be one of ['train', 'test']. when running load_dataset(local_data_dir_path, split="validation"), even though the validation sub-directory exists in the local data path.

The data directories are as follows and attached to this issue:

test_data1
    |_ train
        |_ 1012.png
        |_ metadata.jsonl
        ...
    |_ test
        ...
    |_ validation
        |_ 234.png
        |_ metadata.jsonl
        ...
test_data2
    |_ train
        |_ train_1012.png
        |_ metadata.jsonl
        ...
    |_ test
        ...
    |_ validation
        |_ val_234.png
        |_ metadata.jsonl
        ...

They contain the same image files and metadata.jsonl, but the images in test_data2 have the split names prepended, i.e. train_1012.png and val_234.png, while the images in test_data1 do not, i.e. 1012.png and 234.png.

I saw in another issue that val was not recognized as a split name, but here I would expect the files to take their split from the parent directory name, i.e. val_234.png should become part of the validation split.

Steps to reproduce the bug

import datasets
datasets.logging.set_verbosity_error()
from datasets import load_dataset, get_dataset_split_names


# the following correctly finds the train, validation and test splits
path = "./test_data1"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)


# the following only finds train and test splits
path = "./test_data2"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)

Expected results

###################### ['train', 'test', 'validation'] ######################
###################### ['train', 'test', 'validation'] ######################

Actual results

Traceback (most recent call last):
  File "test_data_loader.py", line 11, in <module>

    dataset = load_dataset(path, split=spt)
  File "/home/venv/lib/python3.8/site-packages/datasets/load.py", line 1758, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 893, in as_dataset
    datasets = map_nested(
  File "/home/venv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 385, in map_nested
    return function(data_struct)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 924, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 993, in _as_dataset
    dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
    files = self.get_file_instructions(name, instructions, split_infos)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
    file_instructions = make_file_instructions(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
    absolute_instructions = instruction.to_absolute(name2len)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in to_absolute
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in <listcomp>
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
    raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
ValueError: Unknown split "validation". Should be one of ['train', 'test'].

Environment info

  • datasets version:
  • Platform: Linux Ubuntu 18.04
  • Python version: 3.8.12
  • PyArrow version: 9.0.0

Data files

test_data1.zip
test_data2.zip

SamSamhuns added the bug label on Aug 25, 2022
@MarkosMuche

I don't know the root cause, but it looks like it is ignoring the last directory in your case. So, create a directory called 'zzz' in the same folder as train, validation and test. If that doesn't work, create a directory called 'aaa'. It worked for me.

@polinaeterna
Contributor

@SamSamhuns could you please try to load it with the current main-branch version of datasets? I suppose the problem is that it tries to get split names from filenames in this case, ignoring directory names, but val wasn't among the keywords at that time; this was fixed recently in PR #4844.

@skalinin

skalinin commented Sep 15, 2022

I have a similar problem.
When I try to create dataset_infos.json using datasets-cli test Peter.py --save_infos --all_configs, I get an error:
ValueError: Unknown split "test". Should be one of ['train'].

The dataset_infos.json is created perfectly fine when I use only one split, datasets.Split.TRAIN.

@polinaeterna Could you help here please?

You can find the code here: https://huggingface.co/datasets/sberbank-ai/Peter/tree/add_splits (add_splits branch)

@mariosasko
Collaborator

mariosasko commented Sep 15, 2022

@skalinin It seems the dataset_infos.json of your dataset is missing the info on the test split (and datasets-cli doesn't ignore the cached infos at the moment, which is a known bug), so your issue is not related to this one. I think you can fix your issue by deleting all the cached dataset_infos.json (in the local repo and in ~/.cache/huggingface/modules) before running the datasets-cli test command. Let us know if that doesn't help, and I can try to generate it myself.
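
In case it helps, a rough sketch of clearing those cached files (illustrative only, not an official command; it assumes the default cache location under ~/.cache/huggingface/modules):

# Rough sketch: delete cached dataset_infos.json copies before re-running
# `datasets-cli test ... --save_infos`. Adjust the path if your cache lives elsewhere.
from pathlib import Path

for info_file in Path.home().joinpath(".cache/huggingface/modules").rglob("dataset_infos.json"):
    print("removing", info_file)
    info_file.unlink()
# ...and also delete dataset_infos.json in the local dataset repo.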

@mariosasko
Collaborator

This code indeed behaves as expected on main. But suppose val_234.png is renamed to something that does not contain one of these keywords; in that case, this issue becomes relevant again. The real cause is the order in which we check the predefined split patterns when assigning data files to splits: first we assign data files based on filenames, and only if that fails, meaning not a single split is found (val is not recognized in the older versions of datasets, which results in an empty validation split), do we assign based on directory names.

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?
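
To make the ordering concrete, here is a minimal sketch of the precedence described above (illustrative only, not the actual datasets resolution code; the keyword lists and function names are made up):

# Illustrative sketch of the problem: filename-based split patterns are tried
# before directory-based ones, so one keyword match suppresses the directory fallback.
from pathlib import Path

SPLIT_KEYWORDS = {"train": ("train",), "test": ("test",), "validation": ("valid", "val", "dev")}

def guess_splits(data_dir: str) -> dict:
    files = [p for p in Path(data_dir).rglob("*") if p.is_file()]

    # 1) try to assign files to splits from their filenames
    by_name = {
        split: [f for f in files if any(kw in f.stem.lower() for kw in kws)]
        for split, kws in SPLIT_KEYWORDS.items()
    }
    if any(by_name.values()):
        # If any filename matched, directory names are never consulted, so a
        # keyword-free file sitting under validation/ never reaches that split.
        return {split: fs for split, fs in by_name.items() if fs}

    # 2) fall back to the parent directory name only when step 1 found nothing
    by_dir = {
        split: [f for f in files if f.parent.name.lower() == split]
        for split in SPLIT_KEYWORDS
    }
    return {split: fs for split, fs in by_dir.items() if fs}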

@lhoestq
Member

lhoestq commented Sep 15, 2022

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?

Yes, that makes sense!

@SamSamhuns
Author

Looks like the val/validation dir name issue is fixed with the current main-branch version of the datasets repository.

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?

I agree with this as well. I would expect the directory name to take precedence over the file name. Right now, if I place a single file named train_00001.jpg under the validation directory, load_dataset cannot find the validation split.

@skalinin

Thanks for the reply

I've created a separate issue for my problem.

@polinaeterna
Contributor

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?

Sounds good to me! opened a PR: #4985

@shaneacton

shaneacton commented Oct 6, 2022

Hi there @polinaeterna @mariosasko! I have installed 5.2.3.dev0, which should have this fix. Unfortunately, when I call load_dataset("csv", data_files=files, split=split) I am still getting the error:
ValueError: Unknown split "validation". Should be one of ['train'].

Any help would be greatly appreciated!

@polinaeterna
Contributor

Hi @shaneacton! Could you please show your dataset structure?

@shaneacton

shaneacton commented Oct 6, 2022

Hi there @polinaeterna. My local CSV files are stored as follows:

binding
    |_ tune.csv
    |_ public_data
        |_ train.csv

self.list_shards(split) successfully finds the relevant data files.

@polinaeterna
Contributor

@shaneacton do you have a validation.csv/val.csv/valid.csv/dev.csv file in your data folder? I can't find one in the structure you provided.

@shaneacton

shaneacton commented Oct 6, 2022

@polinaeterna no, does the name of the split need to match the name of the file exactly?

But my train file is not actually named 'train.csv'; it's called 'XXXXXXXXX_train_XXXXXXXX.csv', and the code works fine for train but fails for validation.
Does the file name need to contain the split name?

@polinaeterna
Contributor

@shaneacton what files do you expect to be included in the "validation" split? Yes, you should somehow indicate that a file belongs to a certain split, either by including the split name in the filename or by putting the file into a folder named after the split; you can also check out this documentation page :)
By default, all the data goes to a single train split.
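
For example, both of these layouts should be picked up (a hedged sketch; the paths are illustrative, see the docs page above for the exact conventions):

# Illustrative only: two ways to signal splits for local CSV files.
from datasets import load_dataset

# (a) split name in the filename:
#     my_data/train.csv, my_data/validation.csv, my_data/test.csv
# (b) split name as the parent directory:
#     my_data/train/data.csv, my_data/validation/data.csv, my_data/test/data.csv
ds = load_dataset("csv", data_dir="my_data")
print(ds)  # expected to expose train/validation/test splits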

@shaneacton

@polinaeterna I have specified my train/test/tune files via the split_to_filepattern argument when initialising my FileDataSource class. This is how list_shards is able to find the right files.
After your last message, I tried renaming my data files to simply train.csv and validation.csv, but I am still getting the same error: Unknown split "validation". Should be one of ['train'].

@shaneacton

@polinaeterna I have solved the issue. The solution was to call:
load_dataset("csv", data_files={split: files}, split=split)

@mehran66

mehran66 commented Mar 26, 2024

For me it was resolved by adding the verification_mode param:

from datasets import load_dataset

imdb_ds = load_dataset(
    "imdb", verification_mode="no_checks"
)
imdb_ds
