Description
Describe the bug
In #6843 it was noted that the streaming feature of datasets is highly susceptible to outages and doesn't back off for long (or even at all).
I was training a model while streaming SlimPajama, and training crashed with a FileNotFoundError. I can only assume this was due to a momentary outage, given that the file in question, train/chunk9/example_train_3889.jsonl.zst, exists just like all the other files in SlimPajama.
...
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 2226, in __iter__
for key, example in ex_iterable:
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1499, in __iter__
for x in self.ex_iterable:
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1067, in __iter__
yield from self._iter()
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1231, in _iter
for key, transformed_example in iter_outputs():
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1207, in iter_outputs
for i, key_example in inputs_iterator:
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1111, in iter_inputs
for key, example in iterator:
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 371, in __iter__
for key, pa_table in self.generate_tables_fn(**gen_kwags):
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/packaged_modules/json/json.py", line 99, in _generate_tables
for file_idx, file in enumerate(itertools.chain.from_iterable(files)):
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/utils/track.py", line 50, in __iter__
for x in self.generator(*self.args):
File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 1378, in _iter_from_urlpaths
raise FileNotFoundError(urlpath)
FileNotFoundError: zstd://example_train_3889.jsonl::hf://datasets/cerebras/SlimPajama-627B@2d0accdd58c5d5511943ca1f5ff0e3eb5e293543/train/chunk9/example_train_3889.jsonl.zst

That final raise is at the bottom of the following snippet:
datasets/src/datasets/utils/file_utils.py
Lines 1354 to 1379 in f693f4e
class FilesIterable(TrackedIterableFromGenerator):
    """An iterable of paths from a list of directories or files"""

    @classmethod
    def _iter_from_urlpaths(
        cls, urlpaths: Union[str, list[str]], download_config: Optional[DownloadConfig] = None
    ) -> Generator[str, None, None]:
        if not isinstance(urlpaths, list):
            urlpaths = [urlpaths]
        for urlpath in urlpaths:
            if xisfile(urlpath, download_config=download_config):
                yield urlpath
            elif xisdir(urlpath, download_config=download_config):
                for dirpath, dirnames, filenames in xwalk(urlpath, download_config=download_config):
                    # in-place modification to prune the search
                    dirnames[:] = sorted([dirname for dirname in dirnames if not dirname.startswith((".", "__"))])
                    if xbasename(dirpath).startswith((".", "__")):
                        # skipping hidden directories
                        continue
                    for filename in sorted(filenames):
                        if filename.startswith((".", "__")):
                            # skipping hidden files
                            continue
                        yield xjoin(dirpath, filename)
            else:
                raise FileNotFoundError(urlpath)
So evidently something choked up in xisfile: it must have returned False for an existing file (presumably because of a transient network error), so neither branch matched and execution fell through to the final else, raising FileNotFoundError.
Steps to reproduce the bug
This happens when streaming a dataset and iterating over it. In my case, that iteration happens inside Trainer's inner training loop, but this is not relevant to the iterator.
File "/miniconda3/envs/draft/lib/python3.11/site-packages/accelerate/data_loader.py", line 835, in __iter__
next_batch, next_batch_info = self._fetch_batches(main_iterator)

Expected behavior
This bug and the linked issue have one thing in common: when streaming fails to retrieve an example, the entire program gives up and crashes. As users, we cannot even protect ourselves from this: while iterating over a dataset, we cannot make datasets skip over a bad example or wait a little longer and retry, because once a Python generator/iterator raises an error, it loses all of its context and cannot be resumed.
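To make that point concrete, here is a minimal, self-contained illustration (not datasets code) of why catching the error at the outermost loop does not help, since the raising generator is already dead:

```python
def fetch_examples():
    """Stand-in for a streaming iterator that hits a transient outage."""
    yield 1
    raise FileNotFoundError("transient outage")  # simulates the failed fetch
    yield 2  # never reached: a generator is finished once it raises

it = fetch_examples()
assert next(it) == 1
try:
    next(it)
except FileNotFoundError:
    pass  # the caller can catch the error...
# ...but the generator cannot be resumed; it now only raises StopIteration:
try:
    next(it)
    resumed = True
except StopIteration:
    resumed = False
assert resumed is False  # example 2 is lost along with the rest of the stream
```

This is why the retry/skip logic has to live inside the iterator that performs the fetch, not in user code wrapping the loop.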
In other words: if you have nested iteration like for b in a: for c in b: for d in c: ..., errors in the innermost loop can only be caught by a try ... except inside c.__iter__(). There should be such exception handling in datasets, and it should have a configurable exponential back-off: first wait and retry after 1 minute, then 2 minutes, then 4 minutes, then 8 minutes, and so on; after a given number of retries, skip the bad example; and only after skipping a given number of examples, give up and crash. This was requested in #6843 too, since currently there is only linear back-off, and it is clearly not applied to xisfile.
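A minimal sketch of the retry-with-exponential-back-off behavior being requested, at the level of a single fetch. This is illustrative only: fetch, max_retries, and base_delay are hypothetical names, not actual datasets parameters, and a real fix would live inside the library's iteration code:

```python
import time


def fetch_with_backoff(fetch, key, max_retries=5, base_delay=60.0):
    """Call fetch(key), retrying on FileNotFoundError with exponential back-off.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Returns None once max_retries attempts have failed, signalling the caller
    that this example should be skipped rather than crashing the program.
    """
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return fetch(key)
        except FileNotFoundError:
            if attempt == max_retries - 1:
                return None  # give up on this example; the caller may skip it
            time.sleep(delay)
            delay *= 2  # exponential back-off: 1x, 2x, 4x, 8x, ...
    return None
```

The caller would then count skipped examples and raise only after some configurable number of them, which matches the escalation proposed above (retry, then skip, then crash).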
Environment info
- datasets version: 3.3.2 (the latest version)
- Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
- Python version: 3.11.7
- huggingface_hub version: 0.26.5
- PyArrow version: 15.0.0
- Pandas version: 2.2.0
- fsspec version: 2024.10.0