Revampt workflow when downloading to local dir #1738

Closed
Wauplin opened this issue Oct 13, 2023 · 27 comments

@Wauplin
Contributor

Wauplin commented Oct 13, 2023

EDIT: the conversation in this issue drifted from the initial topic to a revamp of the download mechanism when downloading to a local folder. More details below.


(previous title: do not download when local_dir=True and file is identical)

When using hf_hub_download(..., local_dir=...) without symlinks (on Windows, or with local_dir_use_symlinks=False), the file is re-downloaded from the Hub even if it did not change, which can take a very long time. For LFS files only, we can compute the sha256 of the local file and download only if it changed. Computing the sha256 is not instant, but it is still faster than downloading.
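
A minimal sketch of that check (the helper name is illustrative, not an existing huggingface_hub API; the expected sha256 would come from the LFS metadata/etag returned by the Hub):

import hashlib
from pathlib import Path

def local_lfs_file_matches(path: Path, expected_sha256: str, chunk_size: int = 1024 * 1024) -> bool:
    # Return True if the local file exists and its sha256 matches the remote LFS hash.
    if not path.is_file():
        return False
    sha = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == expected_sha256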

@Wauplin Wauplin added this to the in next release? milestone Oct 13, 2023
@Wauplin Wauplin self-assigned this Oct 13, 2023
@julien-c
Member

not using symlinks is still a bit of an edge case, no?

@Wauplin
Contributor Author

Wauplin commented Oct 16, 2023

not using symlinks is still a bit of an edge case, no?

I'm afraid it's less and less of an edge case. It's also the case by default on Windows for non-admin, non-dev users (e.g. the "image generation" community). I'm not advocating for too many additional features for non-symlink users, but this one is pretty low-hanging fruit IMO.

(the current problem is that every LFS file is re-downloaded even if nothing changed on the repo)

@julien-c
Member

ok, fair enough (I still think we should discourage local_dir_use_symlinks=False as much as we can)

@rwightman

rwightman commented Jan 18, 2024

I just came here due to running into this issue (and also a user)...

So, from my perspective and for my main use case for huggingface-cli download, --local-dir-use-symlinks False would be a better default when --local-dir is specified.

Why? Because it's important for downloading large datasets: when a user is downloading a 1 TB dataset they need to know exactly where it's going, and the cache should be skipped.

  • cache is usually in home
  • home is rarely acceptable for large downloads
  • needing to override HF home isn't ideal, and it's global
  • many clusters and cloud users will usually want to organize diff datasets -> diff destinations (/mnt/largedata, /mnt/fastdata, /mnt/flufflyclouddata), not have them all in .cache

Other points:

  • when we download terabytes of data onto systems, often administered clusters, etc., we usually want to specify the end destination and have the downloads go directly there
  • symlinking to the cache is not desirable
  • downloading to a tempfile in the .cache dir and moving it to the destination (as appears to be the current --local-dir-use-symlinks False behaviour) is also not desirable, as it incurs a copy when the .cache is on a different filesystem from the final destination

Is there no way to obtain the hashes from LFS and check them locally in the destination before each get?

If huggingface-cli isn't going to serve this use case, what's the alternative? Can we get some sort of rsync-lfs type tool working?

EDIT: Also, I can't think of a use case in my arena where I'd want to specify a local dir and have the files symlinked, so I'm curious why it's preferred. I may be particular, but when I want to have something in a local dir, I want it there. I often just rm -rf .cache with impunity and I'd never want to wipe out several terabytes of symlinked data. You'd need hardlinks to avoid that situation.

@Wauplin
Contributor Author

Wauplin commented Jan 19, 2024

Hi @rwightman, thanks a lot for the long feedback!

  • cache is usually in home
  • home is rarely acceptable for large downloads
  • needing to override HF home isn't ideal, and it's global

Just to let you know, it's possible to specify the cache directory without setting the value globally:

# download gpt2 inside `/data/.cache` and symlink files to `/data/models`
huggingface-cli download gpt2 --local-dir /data/models --cache-dir /data/.cache

^ this is already doable to avoid copies between volumes, although not ideal. You can also do that with --local-dir-use-symlinks False if you don't want symlinks.
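
For reference, a rough Python equivalent of the same command (a sketch using the existing snapshot_download parameters; network access and a writable /data volume are assumed):

# download gpt2 to /data/models, keeping the blob cache on the same volume
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="gpt2",
    local_dir="/data/models",
    cache_dir="/data/.cache",
    local_dir_use_symlinks=False,  # or True/"auto" if symlinks to the cache are acceptable
)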


That being said, I agree the current solution is not satisfying for a CLI and we should update it. One solution, as you said, would be for --local-dir to completely disable the cache. No cache. No symlinks. No copies while downloading. This also means that if the user runs huggingface-cli download ... twice, all the files will be re-downloaded, even if they already exist. For context, we wanted to avoid this in scripts since Python scripts can call methods like hf_hub_download/snapshot_download multiple times without the user knowing about it. But CLI usage is slightly different, so we can adapt our policy.

Another solution we talked about with @julien-c was an in-between solution where a local cache would be used inside the --local-dir. So if you do huggingface-cli download gpt2 --local-dir /data/hub, you end up with:

/data/hub/
/data/hub/model.safetensors
/data/hub/README.md
/data/hub/onnx/
/data/hub/onnx/tokenizers.json
/data/hub/onnx/decoder_model.onnx
/data/hub/onnx/... etc.
/data/hub/... etc.

/data/hub/.cache/
/data/hub/.cache/snapshot/....
/data/hub/.cache/blobs/...
/data/hub/.cache/refs/...

This way you still have the "if you run twice, only new files are downloaded" at the cost of internal symlinks.
We can also have huggingface-cli download gpt2 --local-dir /data/hub --no-symlinks if symlinks are really a no-go even inside of local dir.
Otherwise we could build a semi-cache where we only cache the sha256 of the downloaded files (to avoid redownloading). Or we don't do cache at all for local dirs.


^ Note: All of the above would introduce a breaking change compared to the current logic. That's fine but let's all agree on what's the best compromise between usability and expectations from users. And in any case, we would not change the current behavior when --local-dir is not passed (i.e. huggingface-cli download gpt2 + AutoModel.from_pretrained("gpt2") should still work as now with cache enabled).

@Wauplin
Contributor Author

Wauplin commented Jan 19, 2024

(I'm linking this (internal) slack thread where we discussed this in December 2023 with @pcuenca @lhoestq and @osanseviero as well)

@julien-c
Member

supportive of the @Wauplin proposal

@rwightman

TLDR: is there a path to providing the desired "sync-like" behaviour, where we get just the files in the repo at the destination folder, no cache used, no symlinks, and we're able to checksum the destination and resume? As mentioned on Slack, that could be a huggingface-cli sync command, leaving the existing download alone.

My thoughts on proposals...

using --cache-dir instead of --local-dir works around the resume issue for now, but it does require the user to clean up afterwards if they only want the repo files at the specified location in the end

symlinks are still a concern: sysadmins of large clusters using Lustre and similar filesystems often push users to keep the inode count down, so using symlinks as a rule runs counter to that

also, for the datasets use case that's currently my primary concern (moving large sharded datasets efficiently), the symlink is a small performance concern. For training at scale we usually go to great lengths to keep IOPS low and reduce seeks; we do this by sharding data into chunks that should be read contiguously. The symlink adds an extra redirection that will not be in cache by the time you wrap around on a given dataset, so it adds a seek.

@pcuenca
Member

pcuenca commented Jan 19, 2024

In terms of user expectations, if --local-dir is specified then it sounds reasonable to me that the global cache is not required. Totally agree with keeping the current behaviour in the absence of --local-dir.

It'd be awesome if --local-dir could work just like rsync, so no symlinks and no extra downloads. I understand this is not easy because we need to check the refs. For the most part I'd be fine with symlinks to a local-dir cache location, but we've found situations where symlinks don't work.

Computing the sha of local files sounds like a good compromise. Another idea would be to use --local-dir in the way you last proposed (local-dir-internal cache location), and something else (--snapshot, let's say) to just download the files. But this comes at the cost of additional complexity and options for the user.

@Wauplin
Contributor Author

Wauplin commented Jan 19, 2024

Not opposed to "rsync"-like features. Computing the hash of existing files to compare to the remote ref comes at a cost, still lower than downloading but not instant (especially for large datasets). What we can do is store the sha256 of the files in path/to/local/dir/.cache so that we don't compute them on each run. If a sysadmin deletes it, we recompute it and that's fine. With this solution, there's no need for symlinks => you get a real snapshot as expected. If we keep this scope, it's fine to keep it under huggingface-cli download IMO. For context, we wanted to avoid the wording huggingface-cli sync as users might expect it to "sync" files like Dropbox does (e.g. download new files but also upload local changes). Hence the huggingface-cli download and huggingface-cli upload commands.

@rwightman

rwightman commented Jan 19, 2024

@Wauplin I think that'd be reasonable, files are downloaded to the specified dest w/o symlinks, for resume w/o recomputing hash, some bookkeeping in a .cache subfolder in the dest folder that can be easily removed with no adverse effects (other than needing to recompute) ...

@Wauplin Wauplin changed the title Do not download when local_dir=True and file is identical Revampt workflow when downloading to local dir Jan 24, 2024
@Wauplin
Contributor Author

Wauplin commented Jan 24, 2024

[Request For Comments]

I've made a first draft of how I envision the workflow for downloading files to a local directory (as discussed above) that would be self-contained (no use of symlinks), efficient (avoid (re-)downloading files whenever possible) and resilient (still works if the metadata cache is deleted or files are updated).

1. Requirements / scope

Here are the requirements I gathered / set for myself for the revamp:

  1. Default behavior of huggingface-cli download should not change
  2. Default behavior of hf_hub_download and snapshot_download should not change.
  3. snapshot_download/hf_hub_download and huggingface-cli download should be consistent
  4. if local_dir (in script) or --local-dir (in CLI) is provided:
    1. files are downloaded to the provided directory
    2. structure inside directory is exactly the same as in the repo
    3. the HF default cache is not used
    4. no symlinks
    5. if command is run twice, files should not be re-downloaded
    6. if command is run twice, execution time should be minimal
    7. it is ok to have a local .cache folder to keep some "state" information
    8. if the .cache folder is deleted, we should be able to recompute it locally

2. Folder structure

Here is the folder structure I'm thinking about. If local_dir="data/", files will be downloaded to it, preserving the nested structure if any. At the root of the folder, in data/.cache/, the same structure is kept to store metadata about the downloaded files. All files in data/.cache are suffixed with .metadata to avoid any confusion.

Caveat: what if the repo contains a .cache folder? I would suggest raising an exception ("please contact the repo owner") as we shouldn't support this. Or at least a warning + do not download the .cache folder. Note: if .cache is found while uploading a folder with huggingface-cli upload or upload_folder, let's ignore it (same as what we do by default with the .git folder).

[4.0K]  data
├── [4.0K]  .cache
│   ├── [  16]  file.parquet.metadata
│   ├── [  16]  file.txt.metadata
│   └── [4.0K]  folder
│       └── [  16]  file.parquet.metadata
│
├── [6.5G]  file.parquet
├── [1.5K]  file.txt
└── [4.0K]  folder
    └── [   16]  file.parquet

In the rest of this issue, let's use this naming:

  • metadata file => file inside .cache folder and ending with .metadata
  • downloaded file (or local file) => file inside local folder
  • remote file => file on the Hub

2.1 metadata file

The .metadata files contain:

  • a commit_hash => the revision of the downloaded file
  • an etag => the etag of the downloaded file on the Hub

We will also use the 'last modified' timestamp provided by the os to check the file validity (see below). This can be obtained with os.stat(path).st_mtime (float value).

Metadata file examples:

# file.txt.metadata
11c5a3d5811f50298f278a704980280950aedb10 a16a55fda99d2f2e7b69cce5cf93ff4ad3049930
# file.parquet.metadata
11c5a3d5811f50298f278a704980280950aedb10 7c5d3f4b8b76583b422fcb9189ad6c89d5d97a094541ce8932dce3ecabde1421

Metadata can be parsed with commit_hash, etag = metadata_path.read_text().split().

For the record, etag for an LFS file is its sha256 hash (can be computed from the file) while the etag of a regular file is a git hash (cannot be computed from the file).
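
A minimal sketch of how such a metadata file could be written, read and validated (the helper names are illustrative; the layout follows the proposal above):

import os
from pathlib import Path

def write_metadata(metadata_path: Path, commit_hash: str, etag: str) -> None:
    # one metadata file per downloaded file, containing "<commit_hash> <etag>"
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.write_text(f"{commit_hash} {etag}")

def read_valid_metadata(metadata_path: Path, downloaded_path: Path):
    # the metadata is only trusted if it was written after the downloaded file
    if not metadata_path.is_file():
        return None
    if os.stat(metadata_path).st_mtime < os.stat(downloaded_path).st_mtime:
        return None  # outdated: the file was modified after the metadata was written
    commit_hash, etag = metadata_path.read_text().split()
    return commit_hash, etag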

3. Workflow (when local_dir is passed)

3.1. snapshot_download

snapshot_download does not change:

  1. we make a call to hf.co to list repo files + get the commit_hash corresponding to the revision
  2. we filter which files should be downloaded (using allow_patterns/ignore_patterns)
  3. for each file to download, we call hf_hub_download with revision=commit_hash

3.2 hf_hub_download

  1. check if downloaded file exists
    • => if not, download remote file + create metadata file + return path
  2. check if metadata file exists and has been last modified after the downloaded file (we compare os.stat(path).st_mtime)
    • => if yes, we consider the metadata file as valid => we can trust its content
    • => if not, we consider the metadata file as outdated => same as if missing

3.a. if metadata file is valid, we parse it to get last_revision + etag.

  • if input revision is a commit hash and revision == last_revision => downloaded file is up to date => return path
  • if input revision is not a commit hash, make a HEAD call to remote file to retrieve (commit hash + file etag).
    • if commit hash matches the cached revision => downloaded file is up to date => return path
    • if the etag matches the cached etag => downloaded file is up to date but the metadata file is not => we update the commit hash in the metadata file + return the downloaded file path
  • if etag doesn't match, the downloaded file is outdated => download it + create correct metadata file + return newly downloaded file path

3.b if metadata file is outdated or missing, we make a HEAD call to remote file to retrieve (commit hash + file etag).

  • if etag is a sha256 hash, it means the file is tracked with LFS. Let's compute the sha256 of the local downloaded file. It might take some time but should be less than downloading the file again (in theory, it depends on bandwidth but let's make this assumption).
    • if local sha256 and etag are identical, we don't need to redownload the file. Let's update/create a metadata file with commit hash + etag and that's it.
    • if the local sha256 and the etag differ, let's download the remote file + create the metadata file + return
  • if etag is not a sha256 hash, it means the file is a regular file (not LFS-tracked). We can't know if the local file is up to date with remote so let's download it again. Since the file is <5MB, it's not a big deal anyway. Once downloaded, we create/update the metadata file to avoid redownloading next time.

Sum-up

  • if the file doesn't exist => download
  • if the file exists + valid metadata + same revision => no HEAD call, no download
  • if the file exists + valid metadata + different revision + same etag => 1 HEAD call, no download
  • if the file exists + valid metadata + different revision + different etag => download
  • if the file exists + missing metadata + LFS + same hash => 1 HEAD call, 1 sha256 compute, no download
  • if the file exists + missing metadata + LFS + different hash => 1 HEAD call, 1 sha256 compute, download
  • if the file exists + missing metadata + regular => 1 HEAD call, download
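
Putting the whole workflow together, here is a rough sketch of the decision logic (a sketch only: fetch_remote, compute_sha256 and download are placeholder callables supplied by the caller, not existing APIs):

from dataclasses import dataclass
from pathlib import Path

@dataclass
class RemoteInfo:
    commit_hash: str
    etag: str       # sha256 for LFS files, git blob hash for regular files
    is_lfs: bool

def resolve_local_file(file_path: Path, metadata, revision, fetch_remote, compute_sha256, download) -> Path:
    # metadata is a (commit_hash, etag) tuple or None if missing/outdated (see sketch above)
    if not file_path.is_file():
        return download()                    # no local file => download + write metadata
    if metadata is not None:
        cached_commit, cached_etag = metadata
        if revision == cached_commit:
            return file_path                 # revision is a commit hash and matches => no HEAD call
        remote = fetch_remote()              # HEAD call returning a RemoteInfo
        if remote.commit_hash == cached_commit or remote.etag == cached_etag:
            return file_path                 # up to date (refresh the commit hash in metadata if needed)
        return download()                    # etag changed => re-download
    remote = fetch_remote()                  # metadata missing/outdated => HEAD call
    if remote.is_lfs and compute_sha256(file_path) == remote.etag:
        return file_path                     # identical LFS file => only (re)create the metadata file
    return download()                        # regular file or mismatching LFS file => re-download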

Caveats / things to remember

  • breaking change 😕 How to deal with existing "local dir" that have symlinks in them. I would advocate for a duplicate local file + remove symlink.
  • caveat if .cache exists on remote repo

@Narsil
Contributor

Narsil commented Jan 24, 2024

How to deal with existing "local dir" that have symlinks in them. I would advocate for a duplicate local file + remove symlink.

Given the anticipated flow, isn't that just going to trigger 1 redownload (given lack of metadata and mismatching sha+etag), and just update everything ? It could be tested, but I think this is what is going to happen iiuc.

caveat if .cache exists on remote repo

Overall this .cache folder seems like a bad idea to me, nothing critical, but potentially causing more headaches than necessary.
You mention the .git folder, but we're never downloading it ourselves, are we? This .cache feels a bit redundant with a .git folder, where existing practices for etags/hashes and blobs already exist (not sure if we can leverage it entirely, but git pull seems to be doing a similar job).

Another idea would be to use the actual cache folder, ~/.cache/hub/local/DIR_HASH or something. This is a cache after all: if it's not writeable, then just don't write a cache. I think that's fine; the user asked for a download, and our ability to prevent that through a cache is entirely optional IMO (because of --local-dir). Also, it will only mean a few HEAD requests.

Sha hashing for large files should be a separate decision that the user has to opt into (given the actual speed cost).
If users ask for sha checking, you should never skip it: metadata, file creation times and so on can all be faked. If a user wants to check that a file is consistent with a remote, we should always hash it.

Also, why use N files instead of 1? I am not sure about this; there are probably tradeoffs depending on the access patterns, but creating many files with a single line each seems slightly off intuitively. We're currently not doing that in the git layout, are we?

@Wauplin
Contributor Author

Wauplin commented Jan 24, 2024

Thanks for reading through it and for the feedback @Narsil!

Given the anticipated flow, isn't that just going to trigger 1 redownload (given lack of metadata and mismatching sha+etag), and just update everything ? It could be tested, but I think this is what is going to happen iiuc.

Yes, this is what should happen. I was wondering if we could avoid re-downloading the files, but it's maybe not worth overthinking.

You mention the .git folder, but we're never downloading it ourselves, are we? This .cache feels a bit redundant with a .git folder, where existing practices for etags/hashes and blobs already exist (not sure if we can leverage it entirely, but git pull seems to be doing a similar job).

Kind of redundant in principle, yes, but with a few major differences compared to a .git/ folder:

  • local files are not duplicated. When using git lfs, the blobs are stored under .git/lfs/objects/ and copied to the main directory when necessary. It's possible to use git lfs prune to remove the objects from the .git folder but that's a separate command to run + LFS files from the current revision are still kept
  • with git pull you cannot choose to download only a subset of the repo. This is important in our case since a model repo can contain weights with different precisions and formats, store training logs, etc. We want to be able to download only some of the remote files.
  • using git is currently much slower than huggingface-cli download/upload (both with and without hf_transfer enabled). We could try to optimize this but, given we will always be limited by the point above (cannot download a subset), I don't think it's a good idea to invest in that. Also, IMO we don't need the full scope of git for most ML use cases.

I'm not against checking git/git lfs internals though. The main use case we should be optimizing for IMO is "I want to download some or all files from a repo and always pull updates from the same ref". Anything fancier (e.g. switching between revisions) can be performed with the default cache.

Another idea, would be to use the actual cache folder ~/.cache/hub/local/DIR_HASH or something.

The idea was to have everything self-contained to avoid filling a cache on a different volume. Especially useful on a cluster where users might have different caches but want to access data from a shared volume.

then just don't write a cache, I think it's fine, the user asked for a download, our ability to prevent that through a cache is entirely optional IMO (because of --local-dir). Also it will only mean a few HEAD requests.

Agree that if --local-dir is used, it's fine not to have the most optimal workflow. However, if we don't have some metadata cache at all, we need more than just a few HEAD requests. For regular files, there is no way to know if the local file is up to date with the remote file. Redownloading regular files is not so problematic anyway. For LFS files, we can compute the local sha256 but that comes at some extra cost.

Sha hashing for large files should be a separate decision that the user has to opt into (given the actual speed cost).
If users ask for sha checking, you should never skip it: metadata, file creation times and so on can all be faked. If a user wants to check that a file is consistent with a remote, we should always hash it.

I'm fine with having a separate flag to explicitly compute the sha256 of files without relying on the metadata. It's a bit similar to removing the whole .cache folder manually (though a flag would be more explicit/better documented).

Also, why use N files instead of 1?

Mostly because I wanted all the atomic downloads to be independent. I thought about having everything in a single index file + having a lock to read/write from it but it seemed more complex to me than relying on the OS to build the index (meaning N metadata files for N downloaded files).

@LysandreJik
Member

LysandreJik commented Jan 24, 2024

From my perspective, and having little experience with the situations that @rwightman mentions, I would agree with @Narsil that having the .cache centralized in one place, leaving the downloaded folder intact as a 1-to-1 copy of the remote repo, seems like the cleanest solution.

Having a local .cache seems dangerous to me in terms of third-party libraries that use the Hub, the assumptions we make about what to put in it, and the eventual bloating of that cache's content.

The rest of the assumptions (including the breaking changes) sound fair to me; sounds like a much needed refactor!

@severo
Contributor

severo commented Jan 24, 2024

Having a local .cache seems dangerous to me in terms of third-party libraries that use the Hub, the assumptions we make about what to put in it, and the eventual bloating of that cache's content.

unsolicited comment: if we ever do create such a folder, I think it should be named .huggingface/, meaning we own that namespace.

@rwightman

rwightman commented Jan 24, 2024

re @LysandreJik @Narsil and the central cache, that's a tough one. I see the perspective here, but remember that in a cluster / multi-user environment central caches are a PITA: those caches often default to $HOME, which is by default not accessible between users. Teams sometimes coordinate and try to assign a shared env var for their cache, but that's extra steps and easy to fudge.

  • alice downloads big dataset to /mnt/fastdata, signs off for the day and it fails
  • bob starts working, oh let's finish the download
  • cache was in alice's $HOME/.cache, oh crap!

I'm not sure it's helpful to call this "local .cache" a cache at all; it's metadata specific to the files in the containing download folder.

rsync's protocol for file skipping is pretty effective but not sure that's feasible to implement here...

@Narsil
Contributor

Narsil commented Jan 24, 2024

@rwightman Your use case would just mean:

  • bob has no cache, makes a bunch of HEAD requests to the hub to recreate the metadata (in whatever format), and picks up wherever alice left off.

It wouldn't mean any "desync" or forced redownloads, if the files are actually there.

@julien-c
Member

if it's only from the CLI and only when you specify a --local-dir, then for me it's cleaner if the .cache is in the local dir.

It's not even completely necessary, its only goal being to avoid re-downloading identical files (so a nice to have optimization)

@rwightman

rwightman commented Jan 24, 2024

@rwightman Your use case would just mean:

  • bob has no cache, makes a bunch of HEAD requests to the hub, and picks up wherever alice left off.

It wouldn't mean any "desync" or forced redownloads, if the files are actually there.

So in my scenario here, we're talking TBs of data. That would mean no cache and falling back to checksum calculation. Which is doable, but it should at least be implemented efficiently, i.e. not calculating all checksums up front but in parallel: checksum per file, skip if the same, otherwise download. rsync has a block checksum protocol, but that requires a specific client-server protocol.

EDIT: in one of the other scenarios, re pushing big datasets to the Hub, I ran into a resume gotcha: a big commit with thousands of files and > 1 TB of data that wasn't broken down into small commits would re-calculate the hashes of every file before starting to figure out what to sync. That's not a great UX because you sit in a hung state, not doing anything noticeable, for possibly an hour or more. Where possible, it's better to check/skip at a more granular level and overlap checksums with transfers than to calculate the full set of diffs up front.

@Wauplin
Contributor Author

Wauplin commented Jan 24, 2024

EDIT: in one of the other scenarios, re pushing big datasets to the Hub, I ran into a resume gotcha: a big commit with thousands of files and > 1 TB of data that wasn't broken down into small commits would re-calculate the hashes of every file before starting to figure out what to sync. That's not a great UX because you sit in a hung state, not doing anything noticeable, for possibly an hour or more. Where possible, it's better to check/skip at a more granular level and overlap checksums with transfers than to calculate the full set of diffs up front.

Yes, this is a different topic we have in mind 👍

@rwightman

@Wauplin yup, the upload case is different. I was just pointing out that for this download, if there is going to be a checksum fallback, or we end up deciding metadata is too messy and we always want to use checksums, then checksumming/skipping should be done during the download process (per file) rather than up front (checksum all files, then start downloading the different ones)...
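
A sketch of that per-file, overlapped approach with a thread pool (needs_download and download_file are placeholders for the checksum comparison and the actual transfer):

from concurrent.futures import ThreadPoolExecutor

def sync_files(files, needs_download, download_file, max_workers=8):
    # checksum/skip decisions and transfers both happen per file, so hashing
    # overlaps with ongoing downloads instead of blocking everything up front
    def sync_one(f):
        if needs_download(f):   # e.g. compare the local sha256 with the remote etag
            download_file(f)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(sync_one, files))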

@Wauplin
Contributor Author

Wauplin commented Jan 25, 2024

EDIT: not so sure relying on a distant cache + optionally --cache-dir is a good idea in the end... #1738 (comment)


Thanks everyone for the discussion :). I think we can find a good tradeoff that would solve most use cases:

  1. we put the metadata in the HF cache as mentioned by @Narsil and @LysandreJik. This way the local_dir will be exactly a copy of the remote repo without additional files.
  2. We support the --cache-dir flag (which is already the case) so that users can set a custom cache path without having to set environment variables. This would be useful to mitigate issues in a setup like the one you are describing, @rwightman. Teams won't have to coordinate to set their environment themselves; they'll just have to use the correct command line within the project.

For example to download the no_robots dataset, you would do:

huggingface-cli download HuggingFaceH4/no_robots --repo-type dataset --local-dir /path/to/data

which would download the files to /path/to/data. The metadata information related to this download would live in ~/.cache/huggingface/hub/.metadata/datasets--HuggingFaceH4--no_robots/<path-hash>/, where <path-hash> is the hash of /path/to/data (as absolute path). Most users won't care about this cached metadata and it should just work fine.

For use cases where you need the metadata to be cached together with the downloaded files, you would do:

huggingface-cli download HuggingFaceH4/no_robots --repo-type dataset --local-dir /path/to/data --cache-dir /path/to/data 

Metadata will therefore be stored in /path/to/data/.metadata/datasets--HuggingFaceH4--no_robots/<path-hash>/.


In this scenario, ~/.cache/huggingface/hub/.metadata is a new entry in the cache, completely separated from the canonical cache. It can be completely wiped out by the user without causing any issue (just some hashes to compute + HEAD calls to make). Also, we are not introducing a cache folder inside the destination folder by default, but we still support this case if needed. The double flag is a bit redundant, but at least users have full flexibility.
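
For illustration only, the <path-hash> part could be derived from the absolute destination path along these lines (a sketch, not the actual implementation; the truncation length is arbitrary):

import hashlib
from pathlib import Path

def path_hash(local_dir: str) -> str:
    # hash of the absolute destination path, used to namespace the cached metadata
    return hashlib.sha256(str(Path(local_dir).resolve()).encode()).hexdigest()[:16]

# e.g. ~/.cache/huggingface/hub/.metadata/datasets--HuggingFaceH4--no_robots/<path_hash("/path/to/data")>/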

@julien-c
Member

re-reading this whole thread, i still think the benefits of keeping the "cache" (to be clear, it's very small, it's just a mapping of filenames to hashes) in the local folder outweigh the drawbacks of having an unrelated folder in the same dir

Everything else, LGTM 🔥

@Wauplin
Contributor Author

Wauplin commented Jan 25, 2024

Re-discussing it, the proposition from #1738 (comment) feels clunky. Having metadata cached in a separate location makes it less robust to users moving/renaming their folder structure (also the extra --cache-dir flag is not so convenient).

Let's have a .huggingface/metadata folder inside the local dir so that everything is self-contained + it's explicit that it comes from Hugging Face. We can add a global .gitignore "*" in it to prevent it from being tracked by git by mistake (same as what .ruff_cache does, for instance).
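
A sketch of how that self-contained folder could be initialized (following this proposal; the exact layout may differ in the final implementation):

from pathlib import Path

def init_local_metadata_dir(local_dir: str) -> Path:
    # .huggingface/ lives inside the download directory; the catch-all .gitignore
    # prevents the metadata from being tracked by git by mistake
    root = Path(local_dir) / ".huggingface"
    metadata_dir = root / "metadata"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    (root / ".gitignore").write_text("*\n")
    return metadata_dir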

@Wauplin
Contributor Author

Wauplin commented Apr 24, 2024

I just opened a PR to address what we discussed in this issue! #2223 Would love some feedback if someone wants to try it.

Installation:

pip install git+https://github.com/huggingface/huggingface_hub@1738-revampt-download-local-dir

Usage:

huggingface-cli download gpt2 README.md model.safetensors --local-dir=data/gpt2

@Wauplin
Contributor Author

Wauplin commented Apr 29, 2024

Closed by #2223 🎉
