apparent_encoding should be cached since chardet can be slow #6250

@mlissner

Description

We have some scraper code that sometimes gets back PDFs and other times gets back HTML. Today we learned that if you access r.text on a response whose body is a large-ish PDF (40MB), chardet is called, which uses a lot of CPU (and a ton of memory):

import requests

r = requests.get(some_url)
r.text  # runs chardet over the whole body when the headers give no charset

That's more or less fine (best not to try to get the text of a PDF this way), but if you access r.text more than once, chardet gets run over and over.
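
One way to avoid touching r.text on a PDF at all is to check what came back first. The sketch below is only an illustration: looks_like_pdf is a hypothetical helper, not part of requests, and it assumes that a Content-Type of application/pdf or a body starting with the %PDF- signature is a good enough test for a scraper.

```python
def looks_like_pdf(headers, content: bytes) -> bool:
    """Hypothetical helper: guess whether a response body is a PDF.

    `headers` is any mapping with a "Content-Type" key (for example
    r.headers), and `content` is the raw body bytes (for example r.content).
    """
    content_type = headers.get("Content-Type", "").lower()
    # PDF files begin with the "%PDF-" signature, so this also catches
    # servers that send a missing or wrong Content-Type.
    return content_type.startswith("application/pdf") or content.startswith(b"%PDF-")
```

With a real response this would be used as `if not looks_like_pdf(r.headers, r.content): text = r.text`, so encoding detection never runs on binary bodies.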

We have code like this that performs horribly:

r = requests.get(some_url)
if bad_text in r.text:
    continue
if other_bad_text in r.text:
    continue
# ...many more tests...
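
As a stopgap in calling code like this, the text can be read once and reused. The sketch below uses FakeResponse, a hypothetical stand-in for a response object that counts how often its text property is evaluated, so it runs without a network connection; is_bad is likewise an illustrative name.

```python
class FakeResponse:
    """Hypothetical stand-in for a response; counts accesses to .text."""

    def __init__(self, content: bytes):
        self.content = content
        self.text_calls = 0

    @property
    def text(self):
        # Mimics the uncached behaviour described above: every access
        # repeats the expensive work (here just a counter and a decode).
        self.text_calls += 1
        return self.content.decode("utf-8", errors="replace")


def is_bad(response, bad_markers) -> bool:
    """Return True if any marker occurs in the body, reading .text once."""
    text = response.text  # single access, reused for every test below
    return any(marker in text for marker in bad_markers)
```

However many markers are tested, the text property is only evaluated once per response.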

When you access r.text, it checks whether the encoding can be taken from the HTTP headers. If not, it falls back to the apparent_encoding property, which looks like:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the charset_normalizer or chardet libraries."""
        return chardet.detect(self.content)["encoding"]

I think that property should probably be cached since it's slow, so that repeated calls to r.text don't hurt so badly.
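
A minimal sketch of what caching could look like, using functools.cached_property from the standard library. This is not the requests implementation: CachedEncodingResponse is a hypothetical stand-in class, and slow_detect is a hypothetical replacement for chardet.detect that counts its calls so the effect is visible.

```python
from functools import cached_property


def slow_detect(data: bytes) -> dict:
    """Hypothetical stand-in for chardet.detect; counts how often it runs."""
    slow_detect.calls += 1
    return {"encoding": "utf-8"}


slow_detect.calls = 0


class CachedEncodingResponse:
    """Hypothetical stand-in for a response with a cached apparent_encoding."""

    def __init__(self, content: bytes):
        self.content = content

    @cached_property
    def apparent_encoding(self):
        # Computed on first access, then stored on the instance.
        return slow_detect(self.content)["encoding"]

    @property
    def text(self):
        # Assumes no charset came from the headers, so detection is needed.
        return self.content.decode(self.apparent_encoding, errors="replace")
```

One design caveat: a cached value goes stale if the content is replaced after the first access, so a real fix would have to decide whether and when to invalidate it.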

Expected Result

I expected the calls to the text property to only calculate the encoding once per request.

Actual Result

Each call to the text property re-calculates the encoding, which is slow and uses a lot of memory. On a 40MB PDF it currently uses hundreds of MB, which is probably a bug in chardet.

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "5.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.12"
  },
  "cryptography": {
    "version": "36.0.2"
  },
  "idna": {
    "version": "2.10"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.10.7"
  },
  "platform": {
    "release": "5.4.209-116.363.amzn2.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "101010ef",
    "version": "20.0.1"
  },
  "requests": {
    "version": "2.28.1"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "1.26.5"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}
