
Problems with downloading The Pile #5604

Closed
sentialx opened this issue Mar 3, 2023 · 7 comments

Comments


sentialx commented Mar 3, 2023

Describe the bug

The downloads in the screenshot seem to be interrupted after some time and the last download throws a "Read timed out" error.

[screenshot: interrupted download progress bars]

Here are the downloaded files:
[screenshot: partially downloaded files]

They should all be 14 GB, like here: https://the-eye.eu/public/AI/pile/train/.

Alternatively, can I somehow download the files myself and use the dataset preparation script?
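E.g., something along these lines (an untested sketch; it assumes shards fetched by hand from the mirror above, and that the zstandard package is installed so the generic json loader can decompress .jsonl.zst):

from datasets import load_dataset

# hypothetical: load manually downloaded Pile shards with the generic
# json loader instead of the the_pile builder script
shards = ['F:/pile/00.jsonl.zst', 'F:/pile/01.jsonl.zst']  # fetched by hand
dataset = load_dataset('json', data_files=shards, split='train')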

Steps to reproduce the bug

dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets')

Expected behavior

The files should be downloaded correctly.

Environment info

  • datasets version: 2.10.1
  • Platform: Windows-10-10.0.22623-SP0
  • Python version: 3.10.5
  • PyArrow version: 9.0.0
  • Pandas version: 1.4.2
@mariosasko
Collaborator

Hi!

You can specify download_config=DownloadConfig(resume_download=True) in load_dataset to resume the download when re-running the code after the timeout error:

from datasets import load_dataset, DownloadConfig
dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets', download_config=DownloadConfig(resume_download=True))
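If the connection keeps dropping mid-file, raising the retry count may also help (assuming your datasets version exposes max_retries on DownloadConfig):

from datasets import load_dataset, DownloadConfig

# retry each failed file download up to 5 times before giving up
config = DownloadConfig(resume_download=True, max_retries=5)
dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets', download_config=config)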

@karanveersingh5623

@mariosasko, I used your suggestion, but it's not saving anything; it just stops and then restarts from the same point.
Below is the script I use to download the dataset and save it to disk.

import json

from datasets import load_dataset, DownloadConfig

# load the Pile dataset from Hugging Face Datasets
# (split='train' already returns a Dataset, not a DatasetDict)
dataset = load_dataset('the_pile', split='train', cache_dir='datasets',
                       download_config=DownloadConfig(resume_download=True))

# save each example in the dataset to disk as JSON
for i, example in enumerate(dataset):
    filename = f'pile_file_{i}.json'
    with open(filename, 'w') as f:
        json.dump(example, f)

print("Finished saving Pile dataset files to disk.")
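Or should I be using save_to_disk instead? Something like this, going by the docs (untested):

from datasets import load_from_disk

# untested alternative: persist the prepared Arrow files in one call
# instead of writing one JSON file per example
dataset.save_to_disk('pile_on_disk')

# later, reload without re-running the builder
dataset = load_from_disk('pile_on_disk')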

@karanveersingh5623

@mariosasko, it shows nothing in the datasets folder:

 du -sh /mnt/nlp/hugging_face/*
20K     /mnt/nlp/hugging_face/datasets
4.0K    /mnt/nlp/hugging_face/download_pile.py

@karanveersingh5623

@mariosasko

root@d20f0ab8f4f8:/mnt/hugging_face# python3 download_pile.py
No config specified, defaulting to: the_pile/all
Downloading and preparing dataset the_pile/all to /mnt/hugging_face/datasets/the_pile/all/0.0.0/6fadc480ecb32470826cbf5900a9558b791ce55d5e9a0fdc8ad653e7b64bb349...
Downloading data files:   0%|                                                                                                           | 0/3 [00:00<?, ?it/s]
Downloading data:  70%|████████████████████████████████████████████████████████████████████▊                             | 10.7G/15.2G [12:09<11:53, 6.36MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [22:15<00:00, 7.25MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [46:17<00:00, 5.48MB/s]
Downloading data:  40%|██████████████████████████████████████▏                                                         | 6.07G/15.3G [50:49<1:17:02, 1.99MB/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 567, in read
    data = self._fp_read(amt) if not fp_closed else b""
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 525, in _fp_read
    data = self._fp.read(chunk_amt)
  File "/usr/lib/python3.8/http/client.py", line 459, in read
    n = self.readinto(b)
  File "/usr/lib/python3.8/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 628, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 593, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "download_pile.py", line 6, in <module>
    dataset = load_dataset('the_pile', split='train', cache_dir='datasets', download_config=DownloadConfig(resume_download=True))
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 945, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/the_pile/6fadc480ecb32470826cbf5900a9558b791ce55d5e9a0fdc8ad653e7b64bb349/the_pile.py", line 192, in _split_generators
    data_dir = dl_manager.download(_DATA_URLS[self.config.name])
  File "/usr/local/lib/python3.8/dist-packages/datasets/download/download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 443, in map_nested
    mapped = [
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 444, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 363, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 363, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/usr/local/lib/python3.8/dist-packages/datasets/download/download_manager.py", line 453, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 182, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 575, in get_from_cache
    http_get(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 379, in http_get
    for chunk in response.iter_content(chunk_size=1024):
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@sentialx
Author

Users with slow internet speeds (like my 4 MB/s) are doomed. The dataset only downloads fine at speeds of at least 10 MB/s.

Also, after the train splits were generated, I removed the downloads folder to save disk space, and it started redownloading the whole dataset. Is there any way to use the already generated splits instead?
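For now I'm considering something like this as a workaround, assuming load_from_disk doesn't need the original downloads (untested):

from datasets import load_dataset, load_from_disk

# after the split is generated, copy the Arrow files somewhere safe...
dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets')
dataset.save_to_disk(r'F:\pile_prepared')

# ...then the downloads folder can be deleted and the split reloaded
dataset = load_from_disk(r'F:\pile_prepared')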

@karanveersingh5623

@sentialx @mariosasko, any thoughts on my script above? Am I downloading and saving the dataset correctly? Please suggest :)


ghost commented Oct 14, 2023

@sentialx it's probably worth noting that resume_download=True doesn't directly save the dataset to disk; it just resumes the download after an interruption, as @mariosasko mentioned. Resuming after a crash is still an open issue at the moment.
