load_from_disk with large dataset from S3 runs into botocore.exceptions.ClientError #6611

Open
zotroneneis opened this issue Jan 23, 2024 · 0 comments

Describe the bug

When loading a large dataset (>1000GB) from S3 I run into the following error:

Traceback (most recent call last):
  File "/home/alp/.local/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File "/home/alp/.local/lib/python3.10/site-packages/aiobotocore/client.py", line 383, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (RequestTimeTooSkewed) when calling the GetObject operation: The difference between the request time and the current time is too large.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/alp/phoneme-classification.monorepo/aws_sagemaker/data_processing/inspect_final_dataset.py", line 13, in <module>
    dataset = load_from_disk("s3://speech-recognition-processed-data/whisper/de/train_data/", storage_options=storage_options)
  File "/home/alp/.local/lib/python3.10/site-packages/datasets/load.py", line 1902, in load_from_disk
    return Dataset.load_from_disk(dataset_path, keep_in_memory=keep_in_memory, storage_options=storage_options)
  File "/home/alp/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1686, in load_from_disk
    fs.download(src_dataset_path, dest_dataset_path.as_posix(), recursive=True)
  File "/home/alp/.local/lib/python3.10/site-packages/fsspec/spec.py", line 1480, in download
    return self.get(rpath, lpath, recursive=recursive, **kwargs)
  File "/home/alp/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/alp/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/home/alp/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/home/alp/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 604, in _get
    return await _run_coros_in_chunks(
  File "/home/alp/.local/lib/python3.10/site-packages/fsspec/asyn.py", line 257, in _run_coros_in_chunks
    await asyncio.gather(*chunk, return_exceptions=return_exceptions),
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/alp/.local/lib/python3.10/site-packages/s3fs/core.py", line 1193, in _get_file
    body, content_length = await _open_file(range=0)
  File "/home/alp/.local/lib/python3.10/site-packages/s3fs/core.py", line 1184, in _open_file
    resp = await self._call_s3(
  File "/home/alp/.local/lib/python3.10/site-packages/s3fs/core.py", line 348, in _call_s3
    return await _error_wrapper(
  File "/home/alp/.local/lib/python3.10/site-packages/s3fs/core.py", line 140, in _error_wrapper
    raise err
PermissionError: The difference between the request time and the current time is too large.

The usual cause of this error is that the clock on the local machine is out of sync with the actual time. However, that is not the case here: I checked the clock and even reset it, with no success. See resources here:

The error does not appear when loading a smaller dataset (e.g. our test set) from the same s3 path.
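For anyone who wants to rule out clock skew explicitly, here is a minimal sketch that compares the local clock with the `Date` header of an HTTP response. The function names and the default URL are assumptions made for illustration and are not part of the original report. As far as I know, S3 rejects requests whose timestamp is more than about 15 minutes off, which is what `MAX_SKEW_SECONDS` encodes.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.error import HTTPError
from urllib.request import urlopen

# Assumption: S3 tolerates roughly 15 minutes of skew before RequestTimeTooSkewed.
MAX_SKEW_SECONDS = 15 * 60


def skew_seconds(date_header, now=None):
    """Return local time minus server time in seconds (positive: local clock is ahead)."""
    server_time = parsedate_to_datetime(date_header)
    local_time = now or datetime.now(timezone.utc)
    return (local_time - server_time).total_seconds()


def fetch_server_date(url="https://s3.amazonaws.com"):
    """Read the Date header of an HTTP response (hypothetical helper).

    The URL is only an example endpoint; error responses carry a Date header too.
    """
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.headers["Date"]
    except HTTPError as err:
        return err.headers["Date"]
```

Calling `skew_seconds(fetch_server_date())` on the affected machine and comparing the absolute value against `MAX_SKEW_SECONDS` should show whether the system clock is plausibly the problem. A value of a few seconds would support the observation that the clock is fine.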

Steps to reproduce the bug

  1. Create a large dataset
  2. Try loading it from S3 using:
     dataset = load_from_disk("s3://...", storage_options=storage_options)
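Since the error only shows up for the large dataset, one experiment that might help narrow it down is to download the files in small batches instead of through a single recursive `fs.download` call, and then load the local copy. The sketch below is an untested workaround idea, not a confirmed fix. `download_in_batches` is a hypothetical helper, and `fs` is assumed to be an fsspec-style filesystem (for example `s3fs.S3FileSystem`) that provides `find` and `get`.

```python
import os
import posixpath


def download_in_batches(fs, remote_root, local_root, batch_size=64):
    """Copy every file under remote_root into local_root, batch_size files per call.

    Returns the number of files that were requested.
    """
    # Drop the protocol so the prefix matches the paths returned by fs.find().
    root = remote_root.split("://", 1)[-1].rstrip("/")
    remote_files = sorted(fs.find(root))

    for start in range(0, len(remote_files), batch_size):
        batch = remote_files[start:start + batch_size]
        # Mirror the remote layout below local_root.
        local_files = [
            os.path.join(local_root, posixpath.relpath(path, root)) for path in batch
        ]
        for local_file in local_files:
            os.makedirs(os.path.dirname(local_file), exist_ok=True)
        # Each call only has batch_size transfers in flight.
        fs.get(batch, local_files)

    return len(remote_files)
```

If the batched download completes, `load_from_disk` can then be pointed at `local_root`. If the same error still appears with a small `batch_size`, the number of queued requests is probably not the cause.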

Expected behavior

Load dataset without running into this error.

Environment info

  • datasets version: 2.13.1
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.19.3
  • PyArrow version: 12.0.1
  • Pandas version: 2.0.3