Take namespace into account in caching #2938
We might have collisions if a username and a dataset_name are the same. Maybe instead serialize the dataset name by replacing
IIRC we enforce that no repo id or username can contain cc @Pierrci
out of curiosity: where is it enforced?
Nowhere yet but we should :) feel free to track in internal tracker and/or implement, as this will be useful in the future
Thanks for the trick, I'm doing the change :)
Merging, though it will have to be integrated again in the refactor at #2986
@lhoestq we changed a bit the naming policy on the Hub, and the substring '--' is now forbidden, which makes it available for serializing the repo names (
Loading a dataset "username/dataset_name" hosted by a user on the hub used to cache the dataset taking into account only the dataset name and ignoring the username. Because of this, if a user later loaded "dataset_name" without specifying the username, the dataset would be reloaded from the cache instead of failing.
I changed the dataset cache and module cache mechanism to include the username in the name of the cache directory that is used:
- ~/.cache/huggingface/datasets/username/dataset_name for the data
- ~/.cache/huggingface/modules/datasets_modules/datasets/username/dataset_name for the python files

EDIT: actually using three underscores:
- ~/.cache/huggingface/datasets/username___dataset_name for the data
- ~/.cache/huggingface/modules/datasets_modules/datasets/username___dataset_name for the python files

This PR should fix the issue #2842
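
To illustrate the naming scheme described above, here is a minimal sketch of how a namespaced cache directory could be derived. It is not the code from this PR: `NAMESPACE_SEPARATOR`, `cache_dir_name` and `dataset_cache_dir` are hypothetical names used only for illustration, and the sketch assumes the repo id contains at most one "/".

```python
from pathlib import Path

# Hypothetical constant: the description above says three underscores are used.
NAMESPACE_SEPARATOR = "___"


def cache_dir_name(repo_id: str) -> str:
    """Map "username/dataset_name" to "username___dataset_name".

    A repo id without a namespace ("dataset_name") is returned unchanged.
    """
    return repo_id.replace("/", NAMESPACE_SEPARATOR)


def dataset_cache_dir(repo_id: str, cache_root: Path) -> Path:
    """Return the cache directory for a repo id under the given cache root."""
    return cache_root / cache_dir_name(repo_id)


if __name__ == "__main__":
    root = Path("~/.cache/huggingface/datasets").expanduser()
    # The two ids now resolve to different directories, so loading
    # "dataset_name" can no longer pick up the cache of "username/dataset_name".
    print(dataset_cache_dir("username/dataset_name", root))
    print(dataset_cache_dir("dataset_name", root))
```

Keeping the namespace in a single flat directory name (rather than a nested username/dataset_name directory) avoids the collision raised in the conversation, where a username and a dataset name could be identical.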
cc @stas00