We have some scraper code that sometimes gets back PDFs and other times gets back HTML. Today we learned that if you access r.text on a large-ish PDF response (40 MB), chardet is called, which uses a lot of CPU (and a ton of memory):
r = requests.get(some_url)
r.text
That's more or less fine (best not to try to get the text of a PDF this way), but if you access r.text more than once, chardet gets run over and over.
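On the caller's side, one way to avoid feeding a PDF to the detector at all (a sketch, not something requests does for you; the looks_textual helper is hypothetical) is to check the Content-Type header before touching r.text:

```python
# Hypothetical helper: skip text decoding for binary payloads such as PDFs.
def looks_textual(headers):
    """Return True when the Content-Type header suggests a text body."""
    ctype = headers.get("Content-Type", "").lower()
    return ctype.startswith("text/") or "json" in ctype or "xml" in ctype

# r.headers in requests behaves like a dict, so the same check applies there.
assert looks_textual({"Content-Type": "text/html; charset=utf-8"})
assert not looks_textual({"Content-Type": "application/pdf"})
```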
We have code like this that performs horribly:
for some_url in urls:
    r = requests.get(some_url)
    if bad_text in r.text:
        continue
    if other_bad_text in r.text:
        continue
    # ...many more tests...
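The obvious workaround on our side is to decode once and reuse the string, so detection runs at most one time per response. Sketched here with a stub object rather than a live request, since the access-count pattern is what matters:

```python
class StubResponse:
    """Stand-in for requests.Response; counts how often 'detection' runs."""

    def __init__(self, content):
        self._content = content
        self.detections = 0

    @property
    def text(self):
        self.detections += 1  # real requests re-runs chardet here each time
        return self._content.decode("utf-8")

r = StubResponse(b"page body")
body = r.text  # decode exactly once...
for bad in ("bad_text", "other_bad_text"):
    if bad in body:  # ...then run every substring test on the cached string
        break
assert r.detections == 1
```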
When you access r.text, it checks if the encoding can come from the HTTP headers. If not, it runs the apparent_encoding property, which looks like:
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the charset_normalizer or chardet libraries."""
    return chardet.detect(self.content)["encoding"]
I think that property should probably be cached since it's slow, so that repeated calls to r.text don't hurt so badly.
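A minimal sketch of that caching, assuming Python 3.8+ (requests supports older interpreters, so the real patch might need a hand-rolled cache instead): functools.cached_property computes the value on first access and stores it on the instance. The Response class below is a toy model, not the library's:

```python
from functools import cached_property

class Response:
    """Toy model of requests.Response with the proposed caching applied."""

    def __init__(self, content):
        self.content = content
        self.detect_calls = 0

    @cached_property
    def apparent_encoding(self):
        # Stands in for chardet.detect(self.content)["encoding"],
        # the expensive call in the real library.
        self.detect_calls += 1
        return "utf-8"

r = Response(b"%PDF-1.7 ...")
assert r.apparent_encoding == "utf-8"
assert r.apparent_encoding == "utf-8"  # second access hits the cache
assert r.detect_calls == 1
```

With this in place, every r.text access after the first reuses the stored encoding instead of re-scanning the full body.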
Expected Result
I expected the calls to the text property to only calculate the encoding once per request.
Actual Result
Each call to the text property re-calculates the encoding, which is slow and uses a lot of memory (this is probably a bug in chardet, but it uses hundreds of MB on a 40MB PDF right now).
System Information
$ python -m requests.help
{
  "chardet": {
    "version": "5.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.12"
  },
  "cryptography": {
    "version": "36.0.2"
  },
  "idna": {
    "version": "2.10"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.10.7"
  },
  "platform": {
    "release": "5.4.209-116.363.amzn2.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "101010ef",
    "version": "20.0.1"
  },
  "requests": {
    "version": "2.28.1"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "1.26.5"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}