Take namespace into account in caching #2938

Merged 3 commits on Sep 29, 2021

lhoestq (Member) commented Sep 17, 2021

Loading a dataset "username/dataset_name" hosted by a user on the Hub used to cache the dataset under the dataset name only, ignoring the username. Because of this, if a user later loaded "dataset_name" without specifying the username, the dataset was reloaded from the cache instead of failing.

I changed the dataset cache and module cache mechanism to include the username in the name of the cache directory that is used:

~/.cache/huggingface/datasets/username/dataset_name for the data
~/.cache/huggingface/modules/datasets_modules/datasets/username/dataset_name for the python files

EDIT: actually using three underscores:
~/.cache/huggingface/datasets/username___dataset_name for the data
~/.cache/huggingface/modules/datasets_modules/datasets/username___dataset_name for the python files
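
As a rough illustration of the naming scheme above (a sketch, not the code from this PR), the directory names could be derived as follows. The helper names `cache_dir_name`, `data_cache_dir`, and `module_cache_dir` are hypothetical; the base paths and the separator are the ones quoted in this comment.

```python
from pathlib import PurePosixPath

# Separator quoted in this comment (three underscores).
SEPARATOR = "___"

# Base directories quoted in this comment; "~" is kept literal here.
DATA_ROOT = PurePosixPath("~/.cache/huggingface/datasets")
MODULES_ROOT = PurePosixPath("~/.cache/huggingface/modules/datasets_modules/datasets")


def cache_dir_name(dataset_id: str) -> str:
    """Hypothetical helper: flatten "username/dataset_name" into one directory name."""
    return dataset_id.replace("/", SEPARATOR)


def data_cache_dir(dataset_id: str) -> PurePosixPath:
    """Directory that would hold the data for this dataset id."""
    return DATA_ROOT / cache_dir_name(dataset_id)


def module_cache_dir(dataset_id: str) -> PurePosixPath:
    """Directory that would hold the Python files for this dataset id."""
    return MODULES_ROOT / cache_dir_name(dataset_id)


print(data_cache_dir("username/dataset_name"))
print(module_cache_dir("username/dataset_name"))
# An id without a username is left unchanged, so it gets its own directory.
print(data_cache_dir("dataset_name"))
```

With this layout, "username/dataset_name" and "dataset_name" map to different directories, so loading the second no longer finds the cache of the first.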

This PR should fix issue #2842.

cc @stas00

severo (Contributor) commented Sep 17, 2021

We might have collisions if a username and a dataset_name are the same. Maybe instead serialize the dataset name by replacing / with some string, e.g. __SLASH__, that will hopefully never appear in a dataset or user name (it's what I did in https://github.com/huggingface/datasets-preview-backend/blob/master/benchmark/scripts/serialize.py). That way, all the datasets are one-level-deep directories.
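
One way to read the collision concern, sketched under assumptions: with nested directories, a dataset without a username and a user with the same name end up sharing a directory. The ids "foo" and "foo/bar", the root path, and the helper names below are made up for illustration; only the `__SLASH__` replacement comes from the comment.

```python
from pathlib import PurePosixPath

# Hypothetical cache root, for illustration only.
ROOT = PurePosixPath("cache/datasets")


def nested_dir(dataset_id: str) -> PurePosixPath:
    """Nested layout: one directory level per "/" in the id."""
    return ROOT / dataset_id


def flat_dir(dataset_id: str, sep: str = "__SLASH__") -> PurePosixPath:
    """Flat layout: replace "/" so every dataset is a one-level-deep directory."""
    return ROOT / dataset_id.replace("/", sep)


def overlaps(a: PurePosixPath, b: PurePosixPath) -> bool:
    """True if the two paths are equal or one lies inside the other."""
    return a == b or a in b.parents or b in a.parents


no_username = "foo"        # hypothetical dataset loaded without a username
with_username = "foo/bar"  # hypothetical dataset from a user also named "foo"

# Nested: cache/datasets/foo contains cache/datasets/foo/bar.
print(overlaps(nested_dir(no_username), nested_dir(with_username)))
# Flat: cache/datasets/foo and cache/datasets/foo__SLASH__bar are siblings.
print(overlaps(flat_dir(no_username), flat_dir(with_username)))
```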

julien-c (Member) commented Sep 18, 2021

IIRC we enforce that no repo id or username can contain ___ (exactly 3 underscores) specifically for this reason, so you can use that string (that we use in other projects)

cc @Pierrci

severo (Contributor) commented Sep 20, 2021

> IIRC we enforce that no repo id or username can contain ___ (exactly 3 underscores) specifically for this reason, so you can use that string (that we use in other projects)

out of curiosity: where is it enforced?

julien-c (Member) commented

> where is it enforced?

Nowhere yet, but we should :) Feel free to track it in the internal tracker and/or implement it, as this will be useful in the future.

lhoestq (Member, Author) commented Sep 21, 2021

Thanks for the trick, I'm doing the change :)
We can use
~/.cache/huggingface/datasets/username___dataset_name for the data
~/.cache/huggingface/modules/datasets_modules/datasets/username___dataset_name for the python files

@lhoestq lhoestq marked this pull request as ready for review September 27, 2021 09:17
lhoestq (Member, Author) commented Sep 29, 2021

Merging, though it will have to be integrated again into the refactor in #2986.

@lhoestq lhoestq merged commit 9675a5a into master Sep 29, 2021
@lhoestq lhoestq deleted the take-namespace-into-account-in-caching branch September 29, 2021 13:01
severo (Contributor) commented Dec 17, 2021

@lhoestq we changed the naming policy on the Hub a bit, and the substring '--' is now forbidden, which makes it available for serializing repo names (namespace/repo -> namespace--repo). See https://github.com/huggingface/moon-landing/pull/1657 and huggingface/huggingface_hub#545.
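
A minimal sketch of why a forbidden substring makes the mapping reversible. The function names are hypothetical. Beyond the '--' rule stated above, the sketch also assumes that a name never starts or ends with '-', since otherwise "a-/b" would serialize to the ambiguous "a---b".

```python
SEPARATOR = "--"


def check_name(name: str) -> None:
    """Reject names that would make the serialized form ambiguous (assumed rules)."""
    if not name:
        raise ValueError("name must not be empty")
    if SEPARATOR in name:
        raise ValueError(f"{name!r} contains the forbidden substring {SEPARATOR!r}")
    if name.startswith("-") or name.endswith("-"):
        raise ValueError(f"{name!r} must not start or end with '-'")


def serialize_repo_id(repo_id: str) -> str:
    """Map "namespace/repo" to "namespace--repo"; ids without a namespace are unchanged."""
    parts = repo_id.split("/")
    if len(parts) > 2:
        raise ValueError(f"{repo_id!r} has more than one '/'")
    for part in parts:
        check_name(part)
    return SEPARATOR.join(parts)


def deserialize_repo_id(serialized: str) -> str:
    """Inverse mapping; safe because valid names cannot produce '--' on their own."""
    return serialized.replace(SEPARATOR, "/")


print(serialize_repo_id("namespace/repo"))
print(deserialize_repo_id("namespace--repo"))
```

Single hyphens inside a name are untouched, so an id such as "my-user/my-repo" round-trips through "my-user--my-repo".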
