Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Proposal for a way to cache files in downstream libraries #1088

Merged
merged 8 commits into from Oct 13, 2022

Conversation

Wauplin
Copy link
Contributor

@Wauplin Wauplin commented Sep 27, 2022

This is a proposal following discussions started with the datasets team (@lhoestq @albertvillanova).

The goal is to have a proper way to cache any kind of files from a downstream library and manage them (e.g.: scan and delete) from huggingface_hub. From hfh's perspective, there is not much work to do. We should have a canonical procedure to generate cache paths for a library. Then within a cache folder, the downstream library handles its files as it wants. Once this helper starts to be used, we can adapt the scan-cache and delete-cache commands.

I tried to document the cached_assets_path() helper to describe the way I see it. Any feedback is welcomed, this is really just a proposal. All the examples are very datasets-focused but I think this could benefit to other libraries as transformers (@sgugger @LysandreJik ), diffusers (@apolinario @patrickvonplaten) or skops (@adrinjalali @merveenoyan) to store any kind of intermediate files. IMO the difficulty mainly resides in making the feature used πŸ˜„.

EDIT: see generated documentation here.
EDIT 2: assets/ might be a better naming here (common naming in dev)

WDYT ?

(cc @julien-c @osanseviero as well)

Example:

>>> from huggingface_hub import cached_assets_path

>>> cached_assets_path(library_name="datasets", namespace="SQuAD", subfolder="download")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/SQuAD/download')

>>> cached_assets_path(library_name="datasets", namespace="SQuAD", subfolder="extracted")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/SQuAD/extracted')

>>> cached_assets_path(library_name="datasets", namespace="SQuAD")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/SQuAD/default')

>>> cached_assets_path(library_name="datasets", subfolder="modules")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/default/modules')

>>> cached_assets_path(library_name="datasets", cache_dir="/tmp/tmp123456")
PosixPath('/tmp/tmp123456/datasets/default/default')

And the generated tree:

    assets/
    β”œβ”€β”€ datasets/
    β”‚   β”œβ”€β”€ default/
    β”‚   β”‚   β”œβ”€β”€ modules/
    β”‚   β”œβ”€β”€ SQuAD/
    β”‚   β”‚   β”œβ”€β”€ downloaded/
    β”‚   β”‚   β”œβ”€β”€ extracted/
    β”‚   β”‚   └── processed/
    β”‚   β”œβ”€β”€ Helsinki-NLP--tatoeba_mt/
    β”‚       β”œβ”€β”€ downloaded/
    β”‚       β”œβ”€β”€ extracted/
    β”‚       └── processed/
    └── transformers/
        β”œβ”€β”€ default/
        β”‚   β”œβ”€β”€ something/
        β”œβ”€β”€ bert-base-cased/
        β”‚   β”œβ”€β”€ default/
        β”‚   └── training/
    hub/
    └── models--julien-c--EsperBERTo-small/
        β”œβ”€β”€ blobs/
        β”‚   β”œβ”€β”€ (...)
        β”‚   β”œβ”€β”€ (...)
        β”œβ”€β”€ refs/
        β”‚   └── (...)
        └── [ 128]  snapshots/
            β”œβ”€β”€ 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
            β”‚   β”œβ”€β”€ (...)
            └── bbc77c8132af1cc5cf678da3f1ddf2de43606d48/
                └── (...)

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Sep 27, 2022

The documentation is not available anymore as the PR was closed or merged.

@codecov
Copy link

codecov bot commented Sep 27, 2022

Codecov Report

Base: 84.91% // Head: 84.99% // Increases project coverage by +0.07% πŸŽ‰

Coverage data is based on head (1185dbe) compared to base (4bd9ebf).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1088      +/-   ##
==========================================
+ Coverage   84.91%   84.99%   +0.07%     
==========================================
  Files          38       39       +1     
  Lines        3931     3951      +20     
==========================================
+ Hits         3338     3358      +20     
  Misses        593      593              
Impacted Files Coverage Ξ”
src/huggingface_hub/constants.py 91.48% <100.00%> (+0.37%) ⬆️
src/huggingface_hub/utils/__init__.py 100.00% <100.00%> (ΓΈ)
src/huggingface_hub/utils/_cache_assets.py 100.00% <100.00%> (ΓΈ)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

β˜” View full report at Codecov.
πŸ“’ Do you have feedback about the report comment? Let us know in this issue.

@lhoestq
Copy link
Member

lhoestq commented Sep 27, 2022

Cool thanks !

Can it even be possible to have the "datasets" folder one level higher ?

    datasets/
    └── squad/
        β”œβ”€β”€ downloaded/
        β”œβ”€β”€ extracted/
        └── arrow/
    hub/
    └── models--julien-c--EsperBERTo-small/
        β”œβ”€β”€ blobs/
        β”œβ”€β”€ refs/
        └── [ 128]  snapshots/

Then there's "modules" (top level as well) which is the directory where we copy python scripts. IMO this one doesn't need to be part of this change (for now ?)

For context:

The "modules" directory is added to the python path in datasets and evaluate at run-time once. Inside "modules" we define "datasets_modules" and "evaluate_modules" that can be imported with importlib.import_module("datasets_modules") for example. And finally inside "datasets_modules" we have the datasets scripts that are imported as submodules of "datasets_modules".

I think we keep using one directory "modules" at the top level, this way any HF lib only need to add one directory in the python path. For now it can be excluded from the HFH cache since it only contains small files

@sgugger
Copy link
Contributor

sgugger commented Sep 27, 2022

We also use the modules folder in the cache to put a transformers_module containing the code of dynamic models/tokenizers/pipelines (that have their code on the Hub).

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 27, 2022

Thanks for the feedback :) To answer your questions 1 by 1:

Can it even be possible to have the "datasets" folder one level higher ?

What would be the goal of having it at the root of the cache ? The main advantage when I thought about it was that since it is a new folder in the cache, it would not conflict with any of the existing libraries (datasets, transformers and others). In any case, the downstream library shouldn't care too much where the cache it is only a path returned from hugginface_hub. But of course, I can be convinced here if it makes sense.

Then there's "modules" (top level as well) which is the directory where we copy python scripts.

I am fine with keeping the modules folder at root level. This folder will in any case be quite small, right ? (only python script so no data to delete to save space). But in next iterations, I don't see huge difference with having them under cache/extra/datasets/default/modules/ and cache/extra/transformers/default/modules/ (except not changing the existing code).

IMO this one doesn't need to be part of this change (for now ?)

In any case, as I see it, the huggingface_hub update would just propose a canonical path where to cache stuff. Since stuff is already implemented in datasets, transformers (and cie), we don't have to change everything at once. Sections of the cache can be iteratively moved to the extra/ cache but there is no rush in that.

The "modules" directory is added to the python path in datasets and evaluate at run-time

Does datasets and evaluate share the same python module folder ? If (and only if) we move them to the extra/ cache, we could point evaluate to import for the extra/datasets/modules/ folder since we maintain both librairies. In general I think it's best if each library uses only its own cache but modules is a quite specific use case.

I think we keep using one directory "modules" at the top level, this way any HF lib only need to add one directory in the python path
We also use the modules folder in the cache to put a transformers_module containing the code of dynamic models/tokenizers/pipelines

Yep, very specific use case (and already existing) so no need to change IMO.

@lhoestq
Copy link
Member

lhoestq commented Sep 27, 2022

What would be the goal of having it at the root of the cache ?

Users of datasets that played with the cache before would expect their data to be in .cache/huggingface/datasets as before I think - or at least in a directory at the same level with a clear name ("datasets-cache" or something like that).

Therefore if the main concern is to avoid conflicts I think we can just rename it "datasets-cache"

I am fine with keeping the modules folder at root level. This folder will in any case be quite small, right ?

Yup it's quite small (usually a few KB per dataset)

Does datasets and evaluate share the same python module folder ?

Both add .cache/huggingface/modules in the python path. But in practice they use different folders inside "modules": "datasets_modules" for datasets and "evaluate_modules" for evaluate. So their caches are different, and each lib handles its cache its own way. And we only need to add one directory to python path.

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 28, 2022

Therefore if the main concern is to avoid conflicts I think we can just rename it "datasets-cache"

I would also like to avoid conflicts with existing folders for the future scan-cache and delete-cache tools. If everything lives under the extra/ directory, I would not have to care about special folders that would not use the same structure like modules or doc_builder. For now I only have those 2 examples on my machine but idk if there are more. Having everything in a separate folder extra/ on which huggingface_hub is the only lib to deal with the structure (at least for the first 3 levels of depths, then under cache/extra/datasets/SQuAD/extracted you can put anything you want) would avoid edge cases.

Also, we it would allow to have a clear distinction between folders that can be scanned via huggingface-cli and the others. If we have datasets/ and datasets-cache/ together next to each other I'd find it more confusing actually. WDYT ?

Yup it's quite small (usually a few KB per dataset)
Both add .cache/huggingface/modules in the python path

Then let's not focus too much on that. It light-weight and already working so no need to touch it.

@Wauplin
Copy link
Contributor Author

Wauplin commented Sep 29, 2022

(nit) I just thought about it but renaming extra/ to assets/ sounds better I'd say (I would expect users/devs to understand it better). And cached_assets_folder as helper name.

@lhoestq
Copy link
Member

lhoestq commented Sep 29, 2022

Ok I see ! I like assets more as well. I'm ok with going for assets then (I just remembered we had a quite explicit message printed every time a dataset is loaded to show its cache location anyway, so it should do the job)

EDIT: hmm actually let me think more about this

@lhoestq
Copy link
Member

lhoestq commented Oct 6, 2022

Here is another idea we had with @polinaeterna

    datasets-cache/
    └── squad/
        β”œβ”€β”€ repository/
        β”‚   β”œβ”€β”€ blobs/
        β”‚   β”œβ”€β”€ refs/
        β”‚   └── [ 128]  snapshots/
        β”œβ”€β”€ downloaded/
        β”œβ”€β”€ extracted/
        β”œβ”€β”€ python/
        └── arrow/

In my opinion this is ideal for a datasets users: you can delete all your cache in one command rm -rf datasets-cache and delete everything related to one specific dataset in one command rm -rf datasets-cache/<name>

This is a bit different from what's currently done in huggingface_hub but I think it would be awesome

@polinaeterna
Copy link
Contributor

polinaeterna commented Oct 6, 2022

yes this structure is good for us (datasets) and it means that scripts are stored not in modules/datasets_modules but in a specific dataset folder (like squad/python)

but it would require either adding both datasets-cache and evaluate's modules cache folders to python path which might lead to collisions, or higher level ~/.cache/huggingface directory, but this way it might lead to collision with transformers, though the latter seems unlikely because it shouldn't contain __init__.py, as far as I understand, so tell me if I'm wrong.

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 11, 2022

@lhoestq @polinaeterna I'm sorry but I think this will be unlikely to happen exactly like this. Repos from the Hub are now cached under ~/.cache/huggingface/hub no matter which library is using it. Even though I understand to goal of having everything at the same place, including the repo clone (blobs, refs and snapshots), I think this would be too datasets-specific.

I see at least 2 reasons to keep the repository/ folder in the hub/ folder:

  1. It is consistent with what is done with models and spaces (and maybe more in the future ?). This is less confusing for users and implies less case-specific code to maintain. Also benefit from the scan/delete-cache CLI commands to delete only specific revisions of a dataset (is it as common for a dataset than for a model to have multiple revisions with big changes ?).
  2. Others scripts/libraries could also cache datasets from the Hub without using the datasets library. If so, should we download twice the repo (once in datasets-cache/ and once in hub/) ? Or use datasets-cache/ even though the script is not specific from the datasets library ? Both feels wrong to me. I think one of the main reason to start having a unified hub/ folder for any repo was to specifically avoid that (no more transformers-specific cache for example).

That been said, it is still possible to move the python/ folder inside the datasets cache. I also found it cleaner but I thought that was not possible because it requires to add the folder to the python path.

In the end, you can would end up with something like this:

.cache/huggingface/
β”œβ”€ assets/
β”‚  └─ datasets/
β”‚     └── squad/
β”‚          β”œβ”€β”€ downloaded/
β”‚          β”œβ”€β”€ extracted/
β”‚          β”œβ”€β”€ python/
β”‚          └── arrow/
└─ hub/
   └── datasets--squad/
       β”œβ”€β”€ blobs/
       β”œβ”€β”€ refs/
       └── snapshots/

rm -rf ./cache/assets/datasets/squad is still possible but would require to delete the squad cached repo separately.

Would that be ok ?

@albertvillanova
Copy link
Member

albertvillanova commented Oct 11, 2022

Hi @Wauplin,

Indeed yesterday we had a meeting of the datasets team and we discussed about this topic.
IMO, I totally understand your point: so that we keep current behavior of the hub cache and just add the required specificity by datasets with as less impact as possible to current implementation (in a different directory).

The request by @lhoestq and @polinaeterna of putting everything in a per-dataset directory, has an underlying need: to be able to easily find/delete all cached files of a specific dataset:

  • with their approach, this can easily be done by listing/removing the corresponding dataset directory
  • with your approach, that will be possible by using the scan-cache and delete-cache functions:

Therefore, if we agree on your proposed directory structure:

  • can we list/delete all cached files of a specific datasets, regardless of whether the files are in one directory (hub) or another (extra or assets)?

Additionally, for the "modules" directory, I think you were right (@lhoestq could you confirm this?): they need to be in a separate parent directory (that will be added to the import path), for all datasets: datasets_modules/. Therefore, not possible to have it in a per-dataset directory (squad/python/). Also note that these files are a copy of the previously cached files in your "hub/" directory (scripts are downloaded from the Hub).

Another remark: please note the specificity of the "downloaded" directory:

  • until now, we put there whatever file we downloaded, either from the Hub or from 3rd party servers
  • with the new implementation, the "downloaded" should only be used for 3rd party servers; files downloaded from the Hub will be in your "hub/" directory

I think this difference must be implemented downstream by datasets team.

@lhoestq
Copy link
Member

lhoestq commented Oct 12, 2022

I see, thanks @Wauplin and @albertvillanova :) sounds good to me

super excited about this !

Additionally, for the "modules" directory, I think you were right (@lhoestq could you confirm this?): they need to be in a separate parent directory (that will be added to the import path), for all datasets: datasets_modules/. Therefore, not possible to have it in a per-dataset directory (squad/python/). Also note that these files are a copy of the previously cached files in your "hub/" directory (scripts are downloaded from the Hub).

We could also add the assets directory to sys.path, and name its subdirectory datasets-cache to avoid a module name collision with datasets, but I don't have a strong opinion on this - we can keep the modules dir if we want

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 12, 2022

We could also add the assets directory to sys.path, and name its subdirectory datasets-cache to avoid a module name collision with datasets, but I don't have a strong opinion on this - we can keep the modules dir if we want

Not so much opinionated about this but if you do so I think you should align with the transformers team as well.
Worth noticing that assets/ can be managed by any downstream library (e.g. any lib will be able to have its own directory under assets/) so maybe it's not ideal to add everything to the python path.

@lhoestq
Copy link
Member

lhoestq commented Oct 12, 2022

I see - let's keep modules as it is then

@Wauplin Wauplin marked this pull request as ready for review October 12, 2022 15:15
@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 12, 2022

Ok everyone, thanks for the feedback and discussions. I think we reached a common agreement for this PR then.
When it comes to the tools to scan/delete those assets, we can discuss them in a separate issue. They will come once we start to integrate this assets folder with datasets and other downstream libraries :)

@LysandreJik @lhoestq I've ping you for the review but other contributors are welcomed to review if wanted. Thanks in advance !

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive PR, thanks @Wauplin!

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 13, 2022

Thanks for the review @LysandreJik , I'm merging :)

@Wauplin Wauplin merged commit 6a4538b into main Oct 13, 2022
@Wauplin Wauplin deleted the rfc-external-cache-folder branch October 13, 2022 08:32
@lhoestq
Copy link
Member

lhoestq commented Oct 13, 2022

Awesome thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants