
Connection error while half-downloading metadata #45

sedol1339 opened this issue Aug 13, 2023 · 3 comments

sedol1339 commented Aug 13, 2023

Hello! I'm running the command:

python download_upstream.py --scale medium --data_dir medium --skip_shards

After downloading some of the files, it is interrupted with the following error:

  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 94, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
    return hf_hub_download(
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

As you can see, the error message does not give many details. Could this be caused by some files missing on the server, or is it just a connection problem? If the latter, how can I resume the download? The --overwrite_metadata flag does not seem suitable because it removes all already-downloaded files.

sedol1339 (Author) commented

I've reduced the download code to the following:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mlfoundations/datacomp_medium",
    allow_patterns="*.parquet",
    local_dir="medium/metadata",
    cache_dir="medium/hf",
    local_dir_use_symlinks=False,
    repo_type="dataset",
    resume_download=True,
)

Even with resume_download=True, it keeps re-downloading the same files every time after an error.
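
One untested idea for resuming (just a sketch; the paths and patterns mirror the snapshot_download call above): list the parquet files with HfApi and download them one by one with hf_hub_download, skipping files that already exist locally and retrying on connection errors:

import fnmatch
import os
import time

from huggingface_hub import HfApi, hf_hub_download

repo_id = "mlfoundations/datacomp_medium"

# list all parquet files in the dataset repo
files = [
    f for f in HfApi().list_repo_files(repo_id, repo_type="dataset")
    if fnmatch.fnmatch(f, "*.parquet")
]

for name in files:
    target = os.path.join("medium/metadata", name)
    if os.path.exists(target):
        continue  # already downloaded in a previous run, skip
    for attempt in range(5):  # retry transient connection errors
        try:
            hf_hub_download(
                repo_id=repo_id,
                filename=name,
                repo_type="dataset",
                local_dir="medium/metadata",
                local_dir_use_symlinks=False,
                cache_dir="medium/hf",
            )
            break
        except Exception:
            time.sleep(5 * (attempt + 1))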

LengSicong commented

Same here, any solution?


simon-ging commented Mar 2, 2024

A temporary solution is to capture the URLs that would be downloaded and then download them manually.

Change download_upstream.py

# add at the beginning
from tqdm import tqdm

class QuietTqdm(tqdm):
    # disable the progress bar so only the printed URLs appear in the output
    def __init__(self, *a, **kw):
        kw["disable"] = True
        super().__init__(*a, **kw)

# change the hf_snapshot_args block
    hf_snapshot_args = dict(
        repo_id=hf_repo,
        allow_patterns="*.parquet",
        local_dir=metadata_dir,
        cache_dir=cache_dir,
        local_dir_use_symlinks=False,
        repo_type="dataset",
        max_workers=1,
        tqdm_class=QuietTqdm,
    )

# delete this line:  print(f"Downloading metadata to {metadata_dir}...")

Then edit the file site-packages/huggingface_hub/file_download.py: find line 1245 and add a print and an early return right after it.

# find this line (1245)
    url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
# add the print and return statement directly after it
    print(url)
    return "none"  # dummy return value so the file is not actually downloaded

Finally call the downloader

HF_HUB_DISABLE_PROGRESS_BARS=1 python download_upstream.py --scale xlarge --data_dir data/datacomp --skip_shards > urls.txt

This gives you a list of ~24K URLs to download manually. Now you just need a download utility that can batch-download URLs, and you have the metadata.
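
Any batch downloader should work for the last step (e.g. wget -i urls.txt or aria2c -i urls.txt). A rough, untested Python sketch (OUT_DIR is a placeholder; it assumes urls.txt contains one URL per line and that the filename can be taken from the URL path):

import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

OUT_DIR = "data/datacomp/metadata"  # hypothetical output directory, adjust as needed
os.makedirs(OUT_DIR, exist_ok=True)

def fetch(url: str) -> None:
    # derive the filename from the URL path and skip files that already exist
    name = os.path.basename(urlparse(url).path)
    target = os.path.join(OUT_DIR, name)
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)

with open("urls.txt") as f:
    # keep only the printed URLs, ignoring any other script output
    urls = [line.strip() for line in f if line.strip().startswith("http")]

with ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(fetch, urls))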
