Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hfh hf_hub_url function #5196

Merged
merged 26 commits into from
Nov 9, 2022

Conversation

albertvillanova
Copy link
Member

@albertvillanova albertvillanova commented Nov 3, 2022

Small refactoring to use hf_hub_url function from huggingface_hub.

This PR also creates the hub module that will contain all huggingface_hub functionalities relevant to datasets.

This is a necessary stage before implementing the use of the hfh caching system (which uses its hf_hub_url under the hood).

EDIT:
Finally, we use our config.HUB_DATASETS_URL when using hfh.hf_hub_url
There is a breaking change: the hfh hf_hub_url function uses

  • hfh HUGGINGFACE_CO_URL_TEMPLATE URL template, different from the datasets config.HUB_DATASETS_URL
  • also, hfh DEFAULT_REVISION, instead of datasets config.HUB_DEFAULT_VERSION

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

lhoestq
lhoestq previously approved these changes Nov 4, 2022
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks !

tests/test_streaming_download_manager.py Outdated Show resolved Hide resolved
tests/fixtures/hub.py Outdated Show resolved Hide resolved
tests/fixtures/hub.py Outdated Show resolved Hide resolved
Comment on lines 12 to 16
with temporary_assignment(
huggingface_hub.file_download,
"HUGGINGFACE_CO_URL_TEMPLATE",
config.HUB_DATASETS_URL.replace("{path}", "{filename}"),
):
Copy link
Member

@lhoestq lhoestq Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Wauplin is it how you'd expect datasets to overwrite HUGGINGFACE_CO_URL_TEMPLATE ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, what is the goal here ? Is it just to avoid passing repo_type="dataset" ?

For what I understand, HUB_DATASETS_URL from datasets and HUGGINGFACE_CO_URL_TEMPLATE from huggingface_hub are very similar, isn't it ?

HUGGINGFACE_CO_URL_TEMPLATE = ENDPOINT + "/{repo_id}/resolve/{revision}/{filename}"
HUB_DATASETS_URL = HF_ENDPOINT + "/datasets/{repo_id}/resolve/{revision}/{path}"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the goal is to overwrite the hfh one with the datasets one, but idk if we should

Copy link
Contributor

@Wauplin Wauplin Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really avoid to as the "/{repo_id}/resolve/{revision}/{filename}" part is not specific to datasets.

If it is a matter of customizing the endpoint, both are defined as HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co") so it doesn't add something compared to existing hfh behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that datasets users might have override the variable config.HUB_DATASETS_URL.

If we don't pass this to hfh, this is a breaking change and we should warn datasets users that their setting config.HUB_DATASETS_URL is ignored.

Indeed I struggled to have this working:

  • in datasets we import the module config each time we need to use a variable: then we can modify afterwards one of the variable in it and that will be updated wherever the config module is used; we support dynamic update of the config
  • in hfh only the config constant is imported wherever it is used and this creates a "copy" of it: whenever we modify afterwards the variable value in huggingface_hub.constants has no effect on huggingface_hub.file_download imported constants: they were already "copied" at import time

In relation with the ENDPOINT, we set it also dynamically in our CI, using fixtures. However, it is useless to modify it dynamically in hfh: once the repo URL is defined from the "old" ENDPOINT, changing a new value for the ENDPOINT will not have any effect in the repo URL: this is set only once at import time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, while tweaking this, I discovered our CI has some tests that request the production ENDPOINT: so the CI ENDPOINT cannot be set globally for all tests: https://github.com/huggingface/datasets/actions/runs/3386458817/jobs/5625907924

TokenizersDumpTest.test_hash_tokenizer

>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://hub-ci.huggingface.co/bert-base-uncased/resolve/main/tokenizer_config.json

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @albertvillanova for the broader context. Since how long do you have this HUB_DATASETS_URL env variable ? It might be worth to deprecate it if it's only used to overwrite HUGGINGFACE_CO_URL_TEMPLATE (if it has a larger scope, maybe it's good to keep if).

I imagine you need this, @albertvillanova : huggingface/huggingface_hub#1082

Agree that this should be the clean solution to do it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in hfh only the config constant is imported wherever it is used and this creates a "copy" of it:

Totally get your point here. Might be worth opening an issue in hfh to fix that once and for all :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally get your point here. Might be worth opening an issue in hfh to fix that once and for all :)

@Wauplin that fix would make testing much easier... 🤗

@lhoestq lhoestq dismissed their stale review November 8, 2022 11:47

discussing if we want to overwrite hfh constants

@albertvillanova
Copy link
Member Author

@lhoestq I think we should first agree if datasets can introduce the breaking change of ignoring config.HUB_DATASETS_URL: some users may have override this.

If so, I then would suggest to initiate a deprecation cycle.

@albertvillanova
Copy link
Member Author

albertvillanova commented Nov 8, 2022

After a discussion with the rest of the datasets team, we agreed we can introduce the breaking change of ignoring config.HUB_DATASETS_URL: this will have minimal impact, only for private Hubs. We will address eventual possible impacts in the future.

Additionally, we also ignore config.HUB_DEFAULT_VERSION.

See explanation in this PR description: #5196 (comment)

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, thanks :)

monkeypatch.setattr(
"huggingface_hub.file_download.HUGGINGFACE_CO_URL_TEMPLATE", CI_HFH_HUGGINGFACE_CO_URL_TEMPLATE
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wauplin, just please note that in our CI currently we need to do the patching in huggingface_hub.file_download because doing it in huggingface_hub.constants has no effect: at import time, the constant HUGGINGFACE_CO_URL_TEMPLATE was copied from the submodule constants to file_download.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, thanks for the reminder. I created an issue on huggingface_hub to keep track of this.
huggingface/huggingface_hub#1172

@severo
Copy link
Contributor

severo commented Nov 21, 2022

I'm trying to upgrade datasets to 2.7.0 in https://github.com/huggingface/datasets-server, and the tests fail due to this change. I think it's a breaking change (that was not listed in https://github.com/huggingface/datasets/releases/tag/2.7.0) since code that previously worked (by setting datasets.config.HUB_DATASETS_URL = CI_HUB_DATASETS_URL for example) does not work anymore.

I'm not sure what is the correct way to set up the tests; besides setting the env var "HF_ENDPOINT" before launching the tests (which, I think, is not a good way to do: the tests should not depend on the environment).

@severo
Copy link
Contributor

severo commented Nov 21, 2022

OK, I re-read this thread, and #5196 (comment) explicitely states that config.HUB_DATASETS_URL (as well as config.HUB_DEFAULT_VERSION) is now ignored. I was expecting the breaking changes to be listed in the release notes: https://github.com/huggingface/datasets/releases/tag/2.7.0.

@Wauplin
Copy link
Contributor

Wauplin commented Nov 21, 2022

I'm not sure what is the correct way to set up the tests; besides setting the env var "HF_ENDPOINT" before launching the tests (which, I think, is not a good way to do: the tests should not depend on the environment).

I think the current workaround of settings an env variable before launching the tests is "not so bad" when considering the fact that env variables are evaluated at import time in huggingface_hub (and most probable datasets as well). I think that when refactoring this in huggingface_hub (huggingface/huggingface_hub#1172) I'll opt for instantiating a Settings object (or Constants) that contains all the settings variables. This way it will not be possible to import attributes individually + tests would be easier. As I see it, it would be similar to what Pydantic does even though we most probably don't want Pydantic as a root dependency just for that.

@lhoestq
Copy link
Member

lhoestq commented Nov 21, 2022

You can use fixtures in your tests:

CI_HUB_ENDPOINT = "https://hub-ci.huggingface.co"
CI_HUB_DATASETS_URL = CI_HUB_ENDPOINT + "/datasets/{repo_id}/resolve/{revision}/{path}"
CI_HFH_HUGGINGFACE_CO_URL_TEMPLATE = CI_HUB_ENDPOINT + "/{repo_id}/resolve/{revision}/{filename}"

@pytest.fixture
def ci_hfh_hf_hub_url(monkeypatch):
    monkeypatch.setattr(
        "huggingface_hub.file_download.HUGGINGFACE_CO_URL_TEMPLATE", CI_HFH_HUGGINGFACE_CO_URL_TEMPLATE
    )

@pytest.fixture
def ci_hub_config(monkeypatch):
    monkeypatch.setattr("datasets.config.HF_ENDPOINT", CI_HUB_ENDPOINT)
    monkeypatch.setattr("datasets.config.HUB_DATASETS_URL", CI_HUB_DATASETS_URL)

and use @pytest.fixture(autouse=True) if you want to always use the CI endpoints.

And when huggingface-hub and datasets change the way we can set the endpoint, we'll just need to update the fixtures.
I think ultimately you'll only have to change the huggingface-hub endpoint settings

@severo
Copy link
Contributor

severo commented Nov 21, 2022

OK.

In fact, in datasets-server we set config.HUB_DATASETS_URL (https://github.com/huggingface/datasets-server/blob/35a30dbcd687b26db1f02502ea8305f70c064473/workers/splits/src/splits/config.py#L26) at config time, before starting the workers. It's not an issue with how to launch the tests, but with the app in itself.

I understand that for now, the only way to fix this is to setup HF_ENDPOINT in the env when launching the app (currently, we set the endpoint with COMMON_HF_ENDPOINT, a custom env var I set to be sure not to have side-effects)

@severo
Copy link
Contributor

severo commented Nov 21, 2022

You can use fixtures in your tests:

Thanks, used in huggingface/dataset-viewer#644.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants