
IterableDataset raises FileNotFoundError instead of retrying #7440

@bauwenst


Describe the bug

In #6843 it was noted that the streaming feature of datasets is highly susceptible to outages and doesn't back off for long (or even at all).

I was training a model while streaming SlimPajama when training crashed with a FileNotFoundError. I can only assume this was due to a momentary outage, since the file in question, train/chunk9/example_train_3889.jsonl.zst, exists just like all the other files in SlimPajama.

...
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 2226, in __iter__
    for key, example in ex_iterable:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1499, in __iter__
    for x in self.ex_iterable:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1067, in __iter__
    yield from self._iter()
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1231, in _iter
    for key, transformed_example in iter_outputs():
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1207, in iter_outputs
    for i, key_example in inputs_iterator:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1111, in iter_inputs
    for key, example in iterator:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 371, in __iter__
    for key, pa_table in self.generate_tables_fn(**gen_kwags):
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/packaged_modules/json/json.py", line 99, in _generate_tables
    for file_idx, file in enumerate(itertools.chain.from_iterable(files)):
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/utils/track.py", line 50, in __iter__
    for x in self.generator(*self.args):
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 1378, in _iter_from_urlpaths
    raise FileNotFoundError(urlpath)
FileNotFoundError: zstd://example_train_3889.jsonl::hf://datasets/cerebras/SlimPajama-627B@2d0accdd58c5d5511943ca1f5ff0e3eb5e293543/train/chunk9/example_train_3889.jsonl.zst

That final raise is at the bottom of the following snippet:

class FilesIterable(TrackedIterableFromGenerator):
    """An iterable of paths from a list of directories or files"""

    @classmethod
    def _iter_from_urlpaths(
        cls, urlpaths: Union[str, list[str]], download_config: Optional[DownloadConfig] = None
    ) -> Generator[str, None, None]:
        if not isinstance(urlpaths, list):
            urlpaths = [urlpaths]
        for urlpath in urlpaths:
            if xisfile(urlpath, download_config=download_config):
                yield urlpath
            elif xisdir(urlpath, download_config=download_config):
                for dirpath, dirnames, filenames in xwalk(urlpath, download_config=download_config):
                    # in-place modification to prune the search
                    dirnames[:] = sorted([dirname for dirname in dirnames if not dirname.startswith((".", "__"))])
                    if xbasename(dirpath).startswith((".", "__")):
                        # skipping hidden directories
                        continue
                    for filename in sorted(filenames):
                        if filename.startswith((".", "__")):
                            # skipping hidden files
                            continue
                        yield xjoin(dirpath, filename)
            else:
                raise FileNotFoundError(urlpath)

So clearly, something choked in xisfile.

Steps to reproduce the bug

This happens when streaming a dataset and iterating over it. In my case, that iteration is done in Trainer's inner_training_loop, but this is not relevant to the iterator.

  File "/miniconda3/envs/draft/lib/python3.11/site-packages/accelerate/data_loader.py", line 835, in __iter__
    next_batch, next_batch_info = self._fetch_batches(main_iterator)

Expected behavior

This bug and the linked issue have one thing in common: when streaming fails to retrieve an example, the entire program gives up and crashes. As users, we cannot even protect ourselves from this: when we are iterating over a dataset, we can't make datasets skip over a bad example or wait a little longer to retry the iteration, because when a Python generator/iterator raises an error, it loses all its context.
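That state loss can be demonstrated with a minimal sketch in plain Python, unrelated to datasets internals:

```python
def numbers():
    """Generator that fails partway through, like a streaming outage."""
    for i in range(5):
        if i == 2:
            raise FileNotFoundError("transient outage")
        yield i

gen = numbers()
seen = []
try:
    for x in gen:
        seen.append(x)
except FileNotFoundError:
    pass  # the error is caught, but the generator is already finalized

print(seen)       # [0, 1]
print(list(gen))  # [] -- iteration cannot be resumed where it left off
```

Once the exception propagates out of the generator, Python closes it, so the consumer cannot retry or skip ahead from the outside; only code inside the generator can recover.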

In other words: if you have something that looks like for b in a: for c in b: for d in c:, errors in the innermost loop can only be caught by a try ... except inside c.__iter__(). datasets should have such exception handling, with a configurable exponential back-off: first wait and retry after 1 minute, then 2 minutes, then 4, then 8, and so on; after a given number of retries, skip the bad example; and only after skipping a given number of examples, give up and crash. This was requested in #6843 too, since currently there is only linear back-off, and it is clearly not applied to xisfile.
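The requested retry schedule could look roughly like the wrapper below. This is only a sketch under stated assumptions: iterate_with_backoff and make_iterator are hypothetical names, not part of the datasets API, and because a raised generator cannot be resumed, this wrapper has to rebuild the iterator from a factory. A real fix inside datasets would instead retry around the failing file access itself (e.g. xisfile), so that no examples are lost or replayed.

```python
import time

def iterate_with_backoff(make_iterator, max_retries=8, base_delay=60.0, sleep=time.sleep):
    """Yield from the iterator produced by make_iterator(), retrying on
    FileNotFoundError with exponential back-off: wait base_delay, then
    2x, 4x, 8x, ... before each retry, and re-raise once max_retries
    is exhausted."""
    attempt = 0
    while True:
        try:
            yield from make_iterator()
            return
        except FileNotFoundError:
            if attempt >= max_retries:
                raise  # give up and crash only after all retries are spent
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With base_delay=60.0 this waits 1, 2, 4, 8, ... minutes between attempts, matching the schedule sketched above.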

Environment info

  • datasets version: 3.3.2 (the latest version)
  • Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.11.7
  • huggingface_hub version: 0.26.5
  • PyArrow version: 15.0.0
  • Pandas version: 2.2.0
  • fsspec version: 2024.10.0
