
Problems with downloading The Pile #5604

Closed
sentialx opened this issue Mar 3, 2023 · 7 comments

Comments


sentialx commented Mar 3, 2023

Describe the bug

The downloads in the screenshot seem to be interrupted after some time and the last download throws a "Read timed out" error.

[screenshot: interrupted download progress bars]

Here are the downloaded files:
[screenshot: partially downloaded files]

They should all be 14 GB, like here: https://the-eye.eu/public/AI/pile/train/.

Alternatively, can I somehow download the files myself and use the dataset preparation script?
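E.g., something along these lines (an untested sketch; it assumes shards fetched by hand from the mirror above, and that the zstandard package is installed so the generic json loader can decompress .jsonl.zst):

from datasets import load_dataset

# hypothetical: load manually downloaded Pile shards with the generic
# json loader instead of the the_pile builder script
shards = ['F:/pile/00.jsonl.zst', 'F:/pile/01.jsonl.zst']  # fetched by hand
dataset = load_dataset('json', data_files=shards, split='train')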

Steps to reproduce the bug

dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets')

Expected behavior

The files should be downloaded correctly.

Environment info

  • datasets version: 2.10.1
  • Platform: Windows-10-10.0.22623-SP0
  • Python version: 3.10.5
  • PyArrow version: 9.0.0
  • Pandas version: 1.4.2
@mariosasko
Collaborator

Hi!

You can specify download_config=DownloadConfig(resume_download=True) in load_dataset to resume the download when re-running the code after the timeout error:

from datasets import load_dataset, DownloadConfig
dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets', download_config=DownloadConfig(resume_download=True))
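If the connection keeps dropping mid-file, raising the retry count may also help (assuming your datasets version exposes max_retries on DownloadConfig):

from datasets import load_dataset, DownloadConfig

# retry each failed file download up to 5 times before giving up
config = DownloadConfig(resume_download=True, max_retries=5)
dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets', download_config=config)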

@karanveersingh5623

@mariosasko, I used your suggestion, but it's not saving anything; it just stops and then restarts from the same point.
Below is the script I use to download the dataset and save it to disk.

import json

from datasets import load_dataset, DownloadConfig

# load the Pile dataset from Hugging Face Datasets
# (split='train' already returns a Dataset, not a DatasetDict)
dataset = load_dataset('the_pile', split='train', cache_dir='datasets',
                       download_config=DownloadConfig(resume_download=True))

# save each example in the dataset to disk as JSON
for i, example in enumerate(dataset):
    filename = f'pile_file_{i}.json'
    with open(filename, 'w') as f:
        json.dump(example, f)

print("Finished saving Pile dataset files to disk.")
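Or should I be using save_to_disk instead? Something like this, going by the docs (untested):

from datasets import load_from_disk

# untested alternative: persist the prepared Arrow files in one call
# instead of writing one JSON file per example
dataset.save_to_disk('pile_on_disk')

# later, reload without re-running the builder
dataset = load_from_disk('pile_on_disk')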

@karanveersingh5623

@mariosasko, it shows nothing in the datasets folder:

 du -sh /mnt/nlp/hugging_face/*
20K     /mnt/nlp/hugging_face/datasets
4.0K    /mnt/nlp/hugging_face/download_pile.py

@karanveersingh5623

@mariosasko

root@d20f0ab8f4f8:/mnt/hugging_face# python3 download_pile.py
No config specified, defaulting to: the_pile/all
Downloading and preparing dataset the_pile/all to /mnt/hugging_face/datasets/the_pile/all/0.0.0/6fadc480ecb32470826cbf5900a9558b791ce55d5e9a0fdc8ad653e7b64bb349...
Downloading data files:   0%|                                                                                                           | 0/3 [00:00<?, ?it/s]
Downloading data:  70%|████████████████████████████████████████████████████████████████████▊                             | 10.7G/15.2G [12:09<11:53, 6.36MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [22:15<00:00, 7.25MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 15.2G/15.2G [46:17<00:00, 5.48MB/s]
Downloading data:  40%|██████████████████████████████████████▏                                                         | 6.07G/15.3G [50:49<1:17:02, 1.99MB/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 567, in read
    data = self._fp_read(amt) if not fp_closed else b""
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 525, in _fp_read
    data = self._fp.read(chunk_amt)
  File "/usr/lib/python3.8/http/client.py", line 459, in read
    n = self.readinto(b)
  File "/usr/lib/python3.8/http/client.py", line 503, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 628, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 593, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "download_pile.py", line 6, in <module>
    dataset = load_dataset('the_pile', split='train', cache_dir='datasets', download_config=DownloadConfig(resume_download=True))
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 945, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/the_pile/6fadc480ecb32470826cbf5900a9558b791ce55d5e9a0fdc8ad653e7b64bb349/the_pile.py", line 192, in _split_generators
    data_dir = dl_manager.download(_DATA_URLS[self.config.name])
  File "/usr/local/lib/python3.8/dist-packages/datasets/download/download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 443, in map_nested
    mapped = [
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 444, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 363, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 363, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/usr/local/lib/python3.8/dist-packages/datasets/download/download_manager.py", line 453, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 182, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 575, in get_from_cache
    http_get(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py", line 379, in http_get
    for chunk in response.iter_content(chunk_size=1024):
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@sentialx
Author

Users with slow internet speeds (like my 4 MB/s) are doomed. The dataset only downloads fine at speeds of at least 10 MB/s.

Also, after the train splits were generated, I removed the downloads folder to save disk space, and it started redownloading the whole dataset. Is there any way to use the already generated splits instead?
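For now I'm considering something like this as a workaround, assuming load_from_disk doesn't need the original downloads (untested):

from datasets import load_dataset, load_from_disk

# after the split is generated, copy the Arrow files somewhere safe...
dataset = load_dataset('the_pile', split='train', cache_dir=r'F:\datasets')
dataset.save_to_disk(r'F:\pile_prepared')

# ...then the downloads folder can be deleted and the split reloaded
dataset = load_from_disk(r'F:\pile_prepared')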

@karanveersingh5623

@sentialx @mariosasko, any thoughts on my script above? Am I downloading and saving the dataset correctly? Please suggest :)


ghost commented Oct 14, 2023

@sentialx it's probably worth noting that resume_download=True doesn't directly save the dataset to disk; it just resumes the download after an interruption, as @mariosasko mentioned. Resuming after a crash is still an open issue at the moment.
