
[RFC] Consider switching to httpx instead of requests #1368

Closed

Wauplin opened this issue Mar 1, 2023 · 3 comments

Wauplin (Contributor) commented Mar 1, 2023

1. requests.Session would be really cool

huggingface_hub is essentially a library for making calls to the Hub. For that we use requests, which is great but has its limitations. requests has the concept of a Session to keep a connection alive when making multiple requests in a row. I ran some experiments this morning and the gain is very substantial. The following snippet uses hfh to create a repo, upload 2 (almost empty) files and delete it. It's not perfect but quite representative of what users could do.

from huggingface_hub import create_repo, delete_repo, upload_file

repo_id = create_repo("tmp_repo_id_lucain").repo_id
upload_file(repo_id=repo_id, path_in_repo="text.txt", path_or_fileobj=b"content")
upload_file(repo_id=repo_id, path_in_repo="lfs.bin", path_or_fileobj=b"content")
delete_repo(repo_id=repo_id)

I tweaked huggingface_hub so that this snippet runs with and without requests.Session. Average execution time goes down from 9.3s to 5.5s (-40%). The gain comes from the fact that we don't open a new HTTPS connection on each call.
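For context, here is a minimal standalone sketch (not the actual huggingface_hub patch) of where the gain comes from; the URL and call count are just placeholders:

# Compare a fresh connection per call vs. reusing a single requests.Session.
import time
import requests

URL = "https://huggingface.co/api/models?limit=1"  # placeholder endpoint

def without_session(n: int = 5) -> float:
    start = time.perf_counter()
    for _ in range(n):
        requests.get(URL)  # new TLS handshake on every call
    return time.perf_counter() - start

def with_session(n: int = 5) -> float:
    start = time.perf_counter()
    with requests.Session() as session:
        for _ in range(n):
            session.get(URL)  # connection is kept alive and reused
    return time.perf_counter() - start

print(f"without session: {without_session():.2f}s")
print(f"with session:    {with_session():.2f}s")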

I also tried a real example: loading weights with diffusers when they are already cached. A discussion was started by @patrickvonplaten yesterday (internal link).

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)

=> It turns out I save around 1s (from 4.6s to 3.7s), which is again quite substantial. In this particular case diffusers could also be improved, but it still highlights that huggingface_hub can do better.

Note: besides significantly improving performance, being able to pass a Session object is also useful for users who need to configure their proxy (and headers/user-agent/cookies/...). At the moment, proxies can be configured in hf_hub_download and snapshot_download but not in HfApi.

2. But requests.Session is not thread-safe

I was about to push a draft PR adding an (optional) session parameter everywhere when I realized that doing so is not entirely thread-safe. The issue was raised in 2015 and referenced quite a lot, but it is still not solved. It seems that most problems have been fixed in the underlying urllib3 package, but that's not enough: see for example this issue reported in Dec 2022. I haven't investigated the why behind it, but I think it's quite a no-go for us. We use threads to upload/download files (though the Session is less needed in that case), and any user can also make concurrent calls using HfApi (for example, loop over repos and do something with each). Since it would be nice to still guarantee that HfApi is thread-safe, let's not use requests.Session.
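To make the concern concrete, this is roughly the usage pattern whose safety is not guaranteed (an illustrative sketch, not hfh code; the repo ids and endpoint are placeholders):

# A single requests.Session shared across worker threads, e.g. a user
# looping over repos concurrently while the library reuses one session.
from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()  # shared by all threads below

def fetch(repo_id: str) -> int:
    response = session.get(f"https://huggingface.co/api/models/{repo_id}")
    return response.status_code

repo_ids = ["gpt2", "bert-base-uncased", "distilbert-base-uncased"]
with ThreadPoolExecutor(max_workers=3) as pool:
    print(list(pool.map(fetch, repo_ids)))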

3. Use httpx instead

httpx describes itself as "a fully featured HTTP client for Python 3, which provides sync and async APIs, and support for both HTTP/1.1 and HTTP/2". What's interesting for us is:

  • it's very similar to requests. They have a compatibility guide and nothing looks too difficult to handle (see the sketch after this list).
  • it's guaranteed to be thread-safe
  • it aims to be as compliant as possible with HTTP standards
  • (not a priority) switching to httpx would also be a good start if we want to support some async stuff some day.
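To illustrate the first point, here is a rough sketch of what an httpx-based call could look like; this is not a proposed implementation, and the user-agent value and URL are placeholders:

# httpx.Client plays the role of requests.Session and is documented as thread-safe.
# follow_redirects=True is needed because httpx does not follow redirects by default.
import httpx

with httpx.Client(follow_redirects=True, headers={"user-agent": "huggingface_hub/<version>"}) as client:
    response = client.get("https://huggingface.co/api/models?limit=1")
    response.raise_for_status()  # raises an httpx.HTTPError subclass on 4xx/5xx
    print(len(response.json()), "model(s) returned")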

4. Caveats and workarounds

Switching to httpx still has some drawbacks:

  • The community might be less used to it. It also adds a new dependency.
  • httpx.HTTPError and requests.HTTPError are not the same objects (same for Response and Request). It's not much of an issue since huggingface_hub never returns a Request/Response directly. However, I think we should make HfHubHTTPError inherit from both httpx.HTTPError and requests.HTTPError. This way we don't break any try/except implemented by downstream libraries (see the sketch after this list).
  • httpx does not follow redirects by default, but we can enable it. To keep compatibility, I guess we should do so (at least to still handle hf.co => huggingface.co).
  • stream=True is used once in hfh and its behavior has changed => should be ok
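Regarding the error classes, a minimal sketch of the dual-inheritance idea (constructor handling is simplified; the point is only that downstream try/except blocks keep working):

# HfHubHTTPError would be caught by both `except requests.HTTPError`
# and `except httpx.HTTPError` blocks in downstream libraries.
import httpx
import requests

class HfHubHTTPError(httpx.HTTPError, requests.HTTPError):
    def __init__(self, message: str) -> None:
        super().__init__(message)

err = HfHubHTTPError("401 Client Error: Unauthorized")
assert isinstance(err, httpx.HTTPError)
assert isinstance(err, requests.HTTPError)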

5. What's next?

WDYT about it? Any strong feelings against doing it? My take is that it's a good idea to do it now, deal with the few issues that will come up, and then we'll be fine to build on top of it.

So basically the plan is:

  1. Make a PR to switch from requests to httpx without adding features. Keep requests as a dependency.
  2. Once we have httpx, use a session everywhere (called httpx.Client in httpx). Also deprecate the proxies parameters since this configuration can be handled by the user directly.
  3. (later) start to use async capabilities

tagging @sgugger @LysandreJik @patrickvonplaten @lhoestq @julien-c @osanseviero for feedback

patrickvonplaten (Contributor) commented

I don't know enough about httpx to have a good opinion here, but from your points above it seems very much worth switching to httpx if we're able to bring latencies down significantly when making multiple calls to the Hub in rapid succession.

julien-c (Member) commented Mar 7, 2023

maybe worth asking the community about this, too

Wauplin (Contributor, Author) commented Mar 27, 2023

As part of #1394, I used a requests.Session object to keep the connection alive between HTTP calls. To tackle the thread-safety issue, we now have a get_session() helper to get the currently opened session. One session per thread is created to avoid mixing them. Sessions can be globally configured using configure_http_backend.
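For reference, a short sketch of how a user can configure this (roughly following the documented API; the proxy URL is a placeholder):

# Every HTTP call in huggingface_hub now goes through get_session().
# configure_http_backend lets users customize the Session factory globally.
import requests
from huggingface_hub import configure_http_backend, get_session

def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies = {"https": "http://proxy.example.com:3128"}  # placeholder proxy
    return session

configure_http_backend(backend_factory=backend_factory)

# One Session per thread is created lazily from this factory:
session = get_session()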

So no need for httpx for now. It would have been cleaner to have a single httpx.Client, but adding a new dependency was too heavy compared to the implemented workaround. I'm closing this issue as "won't fix". Let's revisit httpx if we introduce async at some point.

Wauplin closed this as not planned on Mar 27, 2023.