apparent_encoding should be cached since chardet can be slow #6250
It turns out that `r.text` runs chardet each time it is accessed. That's not great, because chardet can be slow and use a lot of memory, particularly when checking PDFs. Instead of doing that, or checking whether things are PDFs all the time, simply use the binary content instead of the text. Fixes: #564 Relates to: psf/requests#6250
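As a rough illustration of that change (not the PR's actual diff; the URL and the check are hypothetical), inspecting `resp.content` works on the raw bytes and never triggers character detection:

```python
import requests

resp = requests.get("https://example.com/report.pdf")  # hypothetical URL

# resp.content is the undecoded byte string; unlike resp.text it never
# invokes chardet, so sniffing the payload type this way stays cheap.
if resp.content.startswith(b"%PDF"):
    print("binary PDF payload; skip text handling")
else:
    print("not a PDF; first bytes:", resp.content[:20])
```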
You can do this today by setting the encoding attribute yourself.
Yeah, or by overriding the … One other thought: I noticed that the …
No, I'm saying that while character detection can be slow, it usually isn't. If its speed is a problem, the documented way of avoiding it is to set the encoding attribute yourself. If you want it cached, set the encoding attribute yourself. It's easy and available today.
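A minimal sketch of that workaround, assuming a hypothetical URL: setting `Response.encoding` up front means `apparent_encoding` (and therefore chardet) is never consulted, and setting it from `apparent_encoding` once gives you detection exactly once per response.

```python
import requests

resp = requests.get("https://example.com/page")  # hypothetical URL

# If the charset is known, declaring it skips detection entirely:
#     resp.encoding = "utf-8"
# If you still want detection but only once, run it yourself and store it:
resp.encoding = resp.apparent_encoding  # chardet runs exactly once here

text = resp.text   # decoded with resp.encoding; no further detection
text2 = resp.text  # still no detection on repeated access
```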
We have some scraper code that sometimes gets back PDFs and other times gets back HTML. Today we learned that if you access `r.text` on a large-ish PDF (40 MB), chardet is called, which uses a lot of CPU (and a ton of memory).
That's more or less fine (best not to try to get the text of a PDF this way), but if you access `r.text` more than once, chardet gets run over and over. We have code like this that performs horribly:
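The snippet from the original report didn't survive here, but the pattern is roughly this (the URL and branching are hypothetical); every `resp.text` access below re-runs detection over the full 40 MB body:

```python
import requests

resp = requests.get("https://example.com/report.pdf")  # hypothetical URL

# No charset in the Content-Type header, so every resp.text access falls
# back to apparent_encoding, i.e. chardet over the whole 40 MB payload.
if resp.text.lstrip().startswith("%PDF"):        # chardet run #1
    print("got a PDF")
elif "<html" in resp.text.lower():               # chardet run #2
    print("got HTML,", len(resp.text), "chars")  # chardet run #3
```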
When you access `r.text`, it checks whether the encoding can be determined from the HTTP headers. If not, it runs the `apparent_encoding` property, which looks like this:
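(Reconstructed roughly from requests/models.py around the time of this report; newer releases may delegate to charset_normalizer rather than chardet, but the shape is the same: a plain, uncached property that re-detects on every access.)

```python
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library."""
    return chardet.detect(self.content)['encoding']
```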
I think that property should probably be cached, since it's slow, so that repeated calls to `r.text` don't hurt so badly.
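For what that caching might look like, a sketch (this is not an actual patch; the subclass name is hypothetical, and wiring it into a `Session` via a custom `HTTPAdapter` is left out):

```python
from functools import cached_property

import requests


class CachedEncodingResponse(requests.Response):
    """Hypothetical Response subclass: detect the encoding at most once."""

    @cached_property
    def apparent_encoding(self):
        # Delegate to the stock detector, but memoize the result so that
        # repeated .text accesses don't re-run chardet over the body.
        return super().apparent_encoding
```

With something like this in place, the first `.text` access pays for detection and later accesses reuse the memoized value.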
Expected Result
I expected the calls to the text property to only calculate the encoding once per request.
Actual Result
Each call to the text property re-calculates the encoding, which is slow and uses a lot of memory (the memory use is probably a bug in chardet, but right now it uses hundreds of MB on a 40 MB PDF).
System Information