Revamp workflow when downloading to local dir #1738
Comments
not using symlinks is still a bit of an edge case, no?
I'm afraid less and less. Also it's the case by default on Windows for non-admin, non-dev users (e.g. the "image generation" community). I'm not advocating for too many additional features for non-symlink users, but this one is a pretty low-hanging fruit IMO. (Current problem: every LFS file is re-downloaded even if nothing changed on the repo.)
ok, fair enough (still think we should discourage not using symlinks in general)
I just came here due to running into this issue (and also as a user)... So, from my perspective and for my main use case, downloading to a local dir should mean just that. Why? Because it's important for downloading large datasets: when a user is downloading a 1TB dataset they need to know exactly where it's going, and the cache should be skipped.
Other points:
Is there no way to obtain the hashes from LFS and check them locally in the destination before each get? If huggingface-cli isn't going to serve this use case, what's the alternative? Can we get some sort of rsync-lfs type tool working? EDIT: Also, I can't think of a use case in my arena where I'd want to specify a local dir and have the files symlinked, so curious why it's preferred? I may be particular, but when I want to have something in a local dir, I want it there. I often just rm -rf .cache with impunity and I'd never want to wipe out several terabytes of symlinked data. You'd need hardlinks to avoid that situation.
Hi @rwightman, thanks a lot for the long feedback!
Just to let you know, it's possible to specify the cache directory without setting the value globally:
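For instance, a minimal sketch via the Python API (the repo id and paths below are made up for illustration):

```python
from huggingface_hub import hf_hub_download

# Point the cache at an explicit location for this call only,
# instead of the global default (~/.cache/huggingface).
path = hf_hub_download(
    repo_id="gpt2",
    filename="config.json",
    cache_dir="/mnt/shared/hf-cache",
)
print(path)  # resolved path inside the chosen cache
```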
^ This is already doable to avoid copies between volumes, although not ideal. You can also do it via an environment variable. That being said, I agree the current solution is not satisfying for a CLI and we should update it. One solution, as you said, would be that using a local dir means downloading the files directly there, with no cache and no symlinks. Another solution we talked about with @julien-c was an in-between solution where a local cache would be used inside the local directory itself.
This way you still have the "if you run twice, only new files are downloaded" behavior, at the cost of internal symlinks. ^ Note: all of the above would introduce a breaking change compared to the current logic. That's fine, but let's all agree on what's the best compromise between usability and expectations from users. And in any case, we would not change the current behavior when no local dir is specified (i.e. regular downloads to the cache).
(I'm linking this (internal) Slack thread where we discussed this in December 2023 with @pcuenca @lhoestq and @osanseviero as well)
supportive of the @Wauplin proposal
TL;DR: is there a path to providing the desired 'sync-like' behaviour? Where we get just the files in the repo at the destination folder, no cache used, no symlinks, able to checksum the destination and resume? As mentioned on Slack, that could be a separate command.

My thoughts on proposals...
- Using --cache-dir instead of --local-dir works around the resume issue for now, but does require the user to clean up after if they want only the repo files at the specified location in the end.
- Symlinks are still a concern: sysadmins in large clusters using Lustre, etc. filesystems are often pushing users to keep the inode count down, so using symlinks as a rule runs counter to that.
- Also, for the datasets use case that's currently my primary concern (moving large sharded datasets efficiently), having the symlink is a small performance concern. For training at scale we're usually going to great lengths to keep IOPS low and reduce seeks; we do this by sharding data into chunks that should be read contiguously, and the symlink adds an extra redirection that will not be in cache by the time you wrap around on a given dataset, so it adds a seek.
In terms of user expectations, if a local dir is specified, that's where the files should end up. It'd be awesome if re-running the command only downloads what actually changed. Computing the sha of local files sounds like a good compromise. Another idea would be to use …
Not opposed to "rsync"-like features. Computing the hash of existing files to compare to the remote ref comes at a cost. Still lower than downloading, but not instant (especially for large datasets). What we can do is store the sha256 of the files in a local metadata folder next to the downloaded files.
@Wauplin I think that'd be reasonable: files are downloaded to the specified dest w/o symlinks, and for resume w/o recomputing hashes, some bookkeeping in a .cache subfolder in the dest folder that can be easily removed with no adverse effects (other than needing to recompute)...
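As a rough illustration of that kind of bookkeeping (a hypothetical helper, not the actual huggingface_hub implementation), skipping a file whose recorded sha256 still matches the remote one could look like this:

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so terabyte-scale shards never need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_download(dest_dir: Path, rel_path: str, remote_sha256: str) -> bool:
    """Compare the remote hash with the one recorded in a .cache bookkeeping file."""
    local_file = dest_dir / rel_path
    meta_file = dest_dir / ".cache" / (rel_path + ".json")
    if not local_file.exists():
        return True
    if meta_file.exists():
        recorded = json.loads(meta_file.read_text()).get("sha256")
        if recorded == remote_sha256:
            return False  # bookkeeping says the file is up to date
    # Bookkeeping missing or stale: fall back to hashing the local file.
    return file_sha256(local_file) != remote_sha256
```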
[Request For Comments] I've made a first draft of how I envision a workflow to download files to a local directory (as discussed above) that would be self-contained (no use of symlinks), efficient (avoid (re-)downloading files whenever possible) and resilient (still work if the metadata cache is deleted or files are updated).

1. Requirements / scope

Here are the requirements I gathered / set for myself for the revamp:
2. Folder structure

Here is the folder structure I'm thinking about: downloaded files go directly into the local directory, and the bookkeeping metadata lives in a hidden subfolder inside it. Caveat: what if the repo itself contains a file at the same path as that metadata folder?
In the rest of this issue, let's use this naming:
2.1 Metadata file

The metadata file keeps track of what was downloaded for a given file: typically the commit hash and etag it was fetched from, so a later run can tell whether the local copy is still up to date.
We will also use the 'last modified' timestamp provided by the OS to check the file validity (see below). This can be obtained with os.path.getmtime, for instance. Metadata file examples:
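For illustration only, here is a hypothetical metadata entry and the timestamp check it could be validated against; the actual field names and format in huggingface_hub may differ:

```python
import os
from dataclasses import dataclass

@dataclass
class LocalFileMetadata:
    # Hypothetical fields: the remote reference the file was downloaded from,
    # plus the timestamp recorded right after the download completed.
    commit_hash: str
    etag: str
    timestamp: float

def is_up_to_date(file_path: str, meta: LocalFileMetadata, remote_etag: str) -> bool:
    """Trust the local file only if it matches the remote etag and has not been
    modified since the metadata was written."""
    if not os.path.exists(file_path):
        return False
    unchanged_locally = os.path.getmtime(file_path) <= meta.timestamp
    return unchanged_locally and meta.etag == remote_etag
```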
Metadata can be parsed with a small helper.

3. Workflow (when downloading to a local dir)
Given the anticipated flow, isn't that just going to trigger one re-download (given the lack of metadata and mismatching sha + etag), and just update everything? It could be tested, but I think this is what is going to happen, IIUC.
Overall this feels like re-implementing a second cache. Another idea would be to use the actual cache folder. Sha hashing for large files should be a separate decision that the user has to ask for (given the actual speed cost). Also, why use N metadata files instead of 1? I am not sure about this, there are probably tradeoffs depending on the access patterns, but creating many files with a single line each seems slightly off intuitively. We're currently not doing that in the git layout, are we?
Thanks for reading through it and for the feedback @Narsil!
Yes, this is what should happen. I was thinking about whether we can avoid re-downloading the files, but maybe it's not worth overthinking it.
Kind of redundant in principle, yes, but with a few major differences compared to the main cache:
I'm not against checking the existing main cache first when it is available.
The idea was to have everything self-contained to avoid filling a cache on a different volume. Especially useful on a cluster where users might have different caches but want to access data from a shared volume.
Agree that if
I'm fine with having a separate flag to explicitly compute the sha256 of files without relying on the metadata. It's a bit similar to removing the whole metadata folder before re-downloading.
Mostly because I wanted all the atomic downloads to be independent. I thought about having everything in a single index file + having a lock to read/write from it, but it seemed more complex to me than relying on the OS to build the index (meaning N metadata files for N downloaded files).
From my perspective, and having little experience with the situations that @rwightman mentions, I would agree with @Narsil that relying on the existing central cache would feel more natural. Having a local metadata folder inside the destination directory seems a bit surprising. The rest of the assumptions (including the breaking changes) sound fair to me; sounds like a much-needed refactor!
unsolicited comment: if we ever do create such a folder, I think it should be named .cache/huggingface
re @LysandreJik @Narsil and the central cache, that's a tough one. I see the perspective here, but also remember that in a cluster / multi-user environment central caches are a PITA: those caches often default to $HOME, which by default is not accessible between users. Teams sometimes coordinate and try to assign a shared env var for their cache, but that's extra steps and easy to fudge.
I'm not sure it's helpful to call this 'local .cache' a cache at all; it's metadata specific to the files in the containing download folder. rsync's protocol for file skipping is pretty effective, but not sure that's feasible to implement here...
@rwightman Your use case would just mean:
It wouldn't mean any "desync" or forced re-downloads if the files are actually there.
If it's only from the CLI and only when you specify a local dir, the extra folder doesn't seem like a big deal. It's not even completely necessary, its only goal being to avoid re-downloading identical files (so a nice-to-have optimization).
So in my scenario here, we're talking TB of data. So that'd mean no cache, and we have to fall back to checksum calculation. Which is doable, but should at least be implemented efficiently, i.e. no calculating all checksums up front, but in parallel: checksum per file, skip if the same, otherwise download. rsync has a block checksum protocol, but that requires a specific client-server protocol. EDIT: in one of the other scenarios, re pushing big datasets to the hub, I ran into a resume gotcha where a big commit with thousands of files, > 1TB of data, if it wasn't broken down into small commits, would re-calc hashes of every file before starting to figure out what to sync. That's not a great UX because you sit in a hung state not doing anything noticeable for possibly an hour or more. Where possible, it's better to check/skip at a more granular level and overlap checksums with transfers than calculate the full set of diffs up front.
Yes, this is a different topic we have in mind 👍 |
@Wauplin yup, the upload case is different, was just pointing out for this download case: if there is going to be a checksum fallback, or we end up deciding metadata is too messy and always want to use checksums, checksumming / skipping should be done during the download process (per-file) rather than up front (checksum all files, then start downloading the different ones)...
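A rough sketch of that per-file pattern under a thread pool, with made-up helpers and a made-up manifest (not an existing huggingface_hub API): each worker hashes one local file and downloads it only if it differs, so checksums overlap with transfers instead of being computed for the whole tree first.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Optional

# Hypothetical manifest of remote files: relative path -> expected sha256.
remote_manifest = {"shard-0000.tar": "ab" * 32, "shard-0001.tar": "cd" * 32}
dest_dir = Path("dataset")

def local_sha256(path: Path) -> Optional[str]:
    """Stream the file so terabyte-scale shards never need to fit in memory."""
    if not path.exists():
        return None
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def download_file(rel_path: str) -> None:
    # Placeholder for the actual transfer (e.g. an HTTP GET of the LFS object).
    target = dest_dir / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(b"...")

def check_and_fetch(rel_path: str, expected_sha256: str) -> str:
    # Hash-then-maybe-download for a single file.
    if local_sha256(dest_dir / rel_path) == expected_sha256:
        return f"skipped {rel_path}"
    download_file(rel_path)
    return f"downloaded {rel_path}"

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(check_and_fetch, p, s) for p, s in remote_manifest.items()]
    for future in as_completed(futures):
        print(future.result())
```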
EDIT: not so sure anymore that relying on a distant cache (plus an optional flag) is the way to go. Thanks everyone for the discussion :). I think we can find a good tradeoff that would solve most use cases:
For example, to download the no_robots dataset, you would do:
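For instance, a sketch of the equivalent call through the Python API (the exact CLI invocation, the repo id HuggingFaceH4/no_robots and the target path are assumptions here):

```python
from huggingface_hub import snapshot_download

# Download the whole dataset repo straight into a local folder.
snapshot_download(
    repo_id="HuggingFaceH4/no_robots",
    repo_type="dataset",
    local_dir="./no_robots",
)
```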
which would download the files to the specified local directory, with the metadata kept in the main cache. For use cases where you need the metadata to be cached together with the downloaded files, you would pass an extra option.
Metadata will therefore be stored in a hidden folder next to the downloaded files. In this scenario, the download folder is fully self-contained.
re-reading this whole thread, i still think the benefits of keeping the "cache" (to be clear, it's very small, it's just a mapping of filenames to hashes) in the local folder outweigh the drawbacks of having an unrelated folder in the same dir. Everything else, LGTM 🔥
Re-discussing it, the proposition from #1738 (comment) feels clunky. Having metadata cached in a separate location makes it less robust to users moving/renaming their folder structure (also, the extra option is not great). Let's have a .cache/huggingface/ folder inside the local dir to store the metadata instead.
I just opened a PR to address what we discussed in this issue: #2223. Would love some feedback if someone wants to try it. Installation: install huggingface_hub from the PR branch.
Usage:
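A minimal sketch of the intended behavior through the Python API (repo id, filename and path are illustrative):

```python
from huggingface_hub import hf_hub_download

# Download straight into a local folder: no symlinks to the central cache,
# and bookkeeping metadata is stored alongside the files so that a second
# run only fetches files that changed on the Hub.
hf_hub_download(repo_id="gpt2", filename="config.json", local_dir="./gpt2")
```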
Closed by #2223 🎉 |
EDIT: the conversation in this issue drifted from its initial topic to a revamp of the download mechanism when downloading to a local folder. More details below.

(previous title: do not download when local_dir=True and file is identical)

Especially useful for CLI downloads. When using hf_hub_download(..., local_dir=...) and not using symlinks (Windows or local_dir_use_symlinks=False), the file is re-downloaded from the Hub, which can take a very long time. This happens even if the file did not change. What we can do, for LFS files only, is to compute the sha256 of the local file and download only if it changed. Computing the sha256 is not instant but still faster than downloading.