Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp download to local dir process #2223

Merged
merged 43 commits into from
Apr 29, 2024
Merged

Conversation

Wauplin
Copy link
Contributor

@Wauplin Wauplin commented Apr 12, 2024

Implements #1738 (and especially #1738 (comment)) πŸ™ˆ

What does this PR do?

  • Downloading files with local_dir do not use the cache but rely on a .huggingface/ folder instead
  • Prevent .huggingface/ from being committed
  • Adapt CLI
  • Revampt all "local dir" tests
  • All downloads (to local dir or to cache) will not resume by default any incomplete downloads. This should prove useful on slow connections. IMO there is very little benefit to make this optional. Only counter-example: if hf_transfer is enabled, we do not resume download (not supported). One can use force_download to force a download from scratch.

How it works?

When downloading a file file.txt to local dir data/:

  1. Checks if:
    1. data/file.txt exists
    2. and data/.huggingface/download/file.txt.metadata exists
    3. and data/file.txt has been modified before the metadata file was saved (metadata contains a timestamp)
  2. If (1.) is true, then we consider the metadata as valid. Metadata contains commit_hash and etag. Otherwise we consider that we don't have any info on the local file.
  3. if revision == metadata.commit_hash, then the file is valid => return
  4. otherwise, we fetch remote metadata
    1. If remote etag == metadata.etag => update local metadata => return
    2. if remote tag is a sha256 and we don't have local metadata => we hash local file. If sha256 == remote etag => it's a valid LFS file => return
    3. Otherwise (etag mismatch and not LFS or LFS but sha256 mismatch) => download file

If force_download=True is passed, all of the above is skipped => we download the file no matter what.

What to review?

This is a large PR (+1,265 βˆ’760) as it touches in depth the download logic. However a lot of the changes are about moving parts of code into private helpers to avoid duplicating the logic between _hf_hub_download_local_dir and _hf_hub_download_cache_dir.

Important changes are:

  • logic inside _local_folder.py => handles metadata in the local folder (i.e. inside ./huggingface/)
  • file_download.py => where everything happens. Best to read the file instead of raw changes. Most important part is _hf_hub_download_local_dir while hf_hub_download and _hf_hub_download_cache_dir are iso-feature compared to before.
  • test_file_download.py: all the new test cases in HfHubDownloadToLocalDir

Doc changes:

  • cli.md, download.md, environment_variables.md
  • snapshot_download.py => only some docs + few tweaks, no real update
  • hf_api.py => only some docs + few tweaks, no real update

Less important:

  • logic to prevent from committing huggingface/ folder (_commit_api.py + test_commit_api.py + test_utils_paths.py)
  • test_cli.py => not relevant
  • command/download.py => deprecated --local-dir-use-symlinks in CLI
  • constants.py/hub_mixin.py/keras_mixin.py => some deprecation

Example

Download README.md and model.safetensors from gpt2 repo into ./data/gpt2 folder:

huggingface-cli download gpt2 README.md model.safetensors --local-dir=data/gpt2

Resulting tree:

# tree -alh data/
[4.0K]  data/
└── [4.0K]  gpt2
    β”œβ”€β”€ [4.0K]  .huggingface
    β”‚   β”œβ”€β”€ [4.0K]  download
    β”‚   β”‚   β”œβ”€β”€ [   0]  model.safetensors.lock
    β”‚   β”‚   β”œβ”€β”€ [ 182]  model.safetensors.metadata
    β”‚   β”‚   β”œβ”€β”€ [   0]  README.md.lock
    β”‚   β”‚   └── [ 158]  README.md.metadata
    β”‚   └── [   1]  .gitignore
    β”œβ”€β”€ [523M]  model.safetensors
    └── [7.9K]  README.md

3 directories, 7 files

How to try it?

pip install git+https://github.com/huggingface/huggingface_hub@1738-revampt-download-local-dir

TODO

  • add an explanation guide on "how .huggingface/ works?" (similar to the "Cache" guide?)
  • extensively test corner cases

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin Wauplin changed the title [wip ] Revampt download to local dir process Revampt download to local dir process Apr 24, 2024
@Wauplin Wauplin marked this pull request as ready for review April 24, 2024 16:25
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Super nice side effect of having resume_downloads on by default.
I played with it locally and we already discussed potential improvements.

The rest LGTM!

src/huggingface_hub/_local_folder.py Outdated Show resolved Hide resolved
src/huggingface_hub/commands/download.py Outdated Show resolved Hide resolved
src/huggingface_hub/commands/download.py Outdated Show resolved Hide resolved
src/huggingface_hub/file_download.py Outdated Show resolved Hide resolved
src/huggingface_hub/_snapshot_download.py Outdated Show resolved Hide resolved
docs/source/en/package_reference/environment_variables.md Outdated Show resolved Hide resolved
Wauplin and others added 2 commits April 25, 2024 14:29
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Lysandre Debut <hi@lysand.re>
@Wauplin
Copy link
Contributor Author

Wauplin commented Apr 25, 2024

Thanks for the review @LysandreJik! I've addressed all of your comments.

I've also pushed new tests specific to the .huggingface/ folder (see 0eacbc9). I will keep this PR open for a few days to gather feedback, and then merge it if everything's fine :)

Copy link
Member

@stevhliu stevhliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! πŸŽ‰

docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
docs/source/en/guides/download.md Outdated Show resolved Hide resolved
docs/source/en/guides/download.md Outdated Show resolved Hide resolved
Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! (from a distance :))

docs/source/en/guides/download.md Outdated Show resolved Hide resolved
src/huggingface_hub/commands/download.py Show resolved Hide resolved
src/huggingface_hub/file_download.py Show resolved Hide resolved
src/huggingface_hub/commands/download.py Outdated Show resolved Hide resolved
Wauplin and others added 3 commits April 29, 2024 08:04
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
@Wauplin
Copy link
Contributor Author

Wauplin commented Apr 29, 2024

Thanks @stevhliu @pcuenca for the review! I like it better like this :D

@Wauplin
Copy link
Contributor Author

Wauplin commented Apr 29, 2024

Addressed all comments and fixed the CI (failures were not only due to this PR but I fixed them here anyway).

Let's get this merged! πŸŽ‰

@Wauplin Wauplin merged commit 4df59b4 into main Apr 29, 2024
16 checks passed
@Wauplin Wauplin deleted the 1738-revampt-download-local-dir branch April 29, 2024 08:50
@Wauplin Wauplin changed the title Revampt download to local dir process Revamp download to local dir process May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants