
IterableDataset raises FileNotFoundError instead of retrying #7440

@bauwenst


Describe the bug

In #6843 it was noted that the streaming feature of datasets is highly susceptible to outages and doesn't back off for long (or even at all).

I was training a model while streaming SlimPajama when training crashed with a FileNotFoundError. I can only assume this was due to a momentary outage, since the file in question, train/chunk9/example_train_3889.jsonl.zst, exists just like all the other files in SlimPajama.

...
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 2226, in __iter__
    for key, example in ex_iterable:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1499, in __iter__
    for x in self.ex_iterable:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1067, in __iter__
    yield from self._iter()
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1231, in _iter
    for key, transformed_example in iter_outputs():
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1207, in iter_outputs
    for i, key_example in inputs_iterator:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1111, in iter_inputs
    for key, example in iterator:
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 371, in __iter__
    for key, pa_table in self.generate_tables_fn(**gen_kwags):
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/packaged_modules/json/json.py", line 99, in _generate_tables
    for file_idx, file in enumerate(itertools.chain.from_iterable(files)):
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/utils/track.py", line 50, in __iter__
    for x in self.generator(*self.args):
  File "/miniconda3/envs/draft/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 1378, in _iter_from_urlpaths
    raise FileNotFoundError(urlpath)
FileNotFoundError: zstd://example_train_3889.jsonl::hf://datasets/cerebras/SlimPajama-627B@2d0accdd58c5d5511943ca1f5ff0e3eb5e293543/train/chunk9/example_train_3889.jsonl.zst

That final raise is at the bottom of the following snippet:

class FilesIterable(TrackedIterableFromGenerator):
    """An iterable of paths from a list of directories or files"""

    @classmethod
    def _iter_from_urlpaths(
        cls, urlpaths: Union[str, list[str]], download_config: Optional[DownloadConfig] = None
    ) -> Generator[str, None, None]:
        if not isinstance(urlpaths, list):
            urlpaths = [urlpaths]
        for urlpath in urlpaths:
            if xisfile(urlpath, download_config=download_config):
                yield urlpath
            elif xisdir(urlpath, download_config=download_config):
                for dirpath, dirnames, filenames in xwalk(urlpath, download_config=download_config):
                    # in-place modification to prune the search
                    dirnames[:] = sorted([dirname for dirname in dirnames if not dirname.startswith((".", "__"))])
                    if xbasename(dirpath).startswith((".", "__")):
                        # skipping hidden directories
                        continue
                    for filename in sorted(filenames):
                        if filename.startswith((".", "__")):
                            # skipping hidden files
                            continue
                        yield xjoin(dirpath, filename)
            else:
                raise FileNotFoundError(urlpath)

So clearly, something choked in xisfile.

Steps to reproduce the bug

This happens when streaming a dataset and iterating over it. In my case, that iteration is done in Trainer's inner_training_loop, but this is not relevant to the iterator.

  File "/miniconda3/envs/draft/lib/python3.11/site-packages/accelerate/data_loader.py", line 835, in __iter__
    next_batch, next_batch_info = self._fetch_batches(main_iterator)

Expected behavior

This bug and the linked issue have one thing in common: when streaming fails to retrieve an example, the entire program gives up and crashes. As users, we cannot even protect ourselves from this: when we are iterating over a dataset, we can't make datasets skip over a bad example or wait a little longer to retry the iteration, because when a Python generator/iterator raises an error, it loses all its context.
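That state loss can be demonstrated with a minimal sketch in plain Python, unrelated to datasets internals:

```python
def numbers():
    """Generator that fails partway through, like a streaming outage."""
    for i in range(5):
        if i == 2:
            raise FileNotFoundError("transient outage")
        yield i

gen = numbers()
seen = []
try:
    for x in gen:
        seen.append(x)
except FileNotFoundError:
    pass  # the error is caught, but the generator is already finalized

print(seen)       # [0, 1]
print(list(gen))  # [] -- iteration cannot be resumed where it left off
```

Once the exception propagates out of the generator, Python closes it, so the consumer cannot retry or skip ahead from the outside; only code inside the generator can recover.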

In other words: if you have something that looks like for b in a: for c in b: for d in c:, errors in the innermost loop can only be caught by a try ... except inside c.__iter__(). datasets should have such exception handling, with a configurable exponential back-off: first wait and retry after 1 minute, then 2 minutes, then 4, then 8, and so on; after a given number of retries, skip the bad example; and only after skipping a given number of examples, give up and crash. This was requested in #6843 too, since currently there is only linear back-off, and it is clearly not applied to xisfile.
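The requested retry schedule could look roughly like the wrapper below. This is only a sketch under stated assumptions: iterate_with_backoff and make_iterator are hypothetical names, not part of the datasets API, and because a raised generator cannot be resumed, this wrapper has to rebuild the iterator from a factory. A real fix inside datasets would instead retry around the failing file access itself (e.g. xisfile), so that no examples are lost or replayed.

```python
import time

def iterate_with_backoff(make_iterator, max_retries=8, base_delay=60.0, sleep=time.sleep):
    """Yield from the iterator produced by make_iterator(), retrying on
    FileNotFoundError with exponential back-off: wait base_delay, then
    2x, 4x, 8x, ... before each retry, and re-raise once max_retries
    is exhausted."""
    attempt = 0
    while True:
        try:
            yield from make_iterator()
            return
        except FileNotFoundError:
            if attempt >= max_retries:
                raise  # give up and crash only after all retries are spent
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With base_delay=60.0 this waits 1, 2, 4, 8, ... minutes between attempts, matching the schedule sketched above.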

Environment info

  • datasets version: 3.3.2 (the latest version)
  • Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.11.7
  • huggingface_hub version: 0.26.5
  • PyArrow version: 15.0.0
  • Pandas version: 2.2.0
  • fsspec version: 2024.10.0
