Charset detection in text() can be pathologically slow #2359

Closed · marcocova opened this issue Nov 27, 2014 · 10 comments

@marcocova commented Nov 27, 2014

When calling response.text() and no encoding was set or determined, requests relies on chardet to detect the encoding:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the chardet library"""
        return chardet.detect(self.content)['encoding']

Unfortunately, chardet can be pathologically slow and memory-hungry to do its job. For example, processing the text property of a response with the following content:

"a" * (1024 * 1024) + "\xa9"

causes python-requests to use 131 MB of memory and take 40+ seconds (for a 1 MB response!)
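
(For reference, the slowdown is reproducible with chardet alone, no requests involved. A minimal sketch; exact timings will vary by machine and chardet version:)

    import time

    import chardet

    payload = b"a" * (1024 * 1024) + b"\xa9"  # 1 MB of ASCII plus one stray high byte

    start = time.time()
    result = chardet.detect(payload)
    print(result, "in %.1f seconds" % (time.time() - start))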

@sigmavirus24 (Contributor) commented Nov 27, 2014

We're aware of this. Do you have a proposal to replace it with something better?

@rsnair2 commented Nov 28, 2014

How about using cChardet?

@sigmavirus24 (Contributor) commented Nov 28, 2014

We can't vendor anything that uses C extensions, so no, @rsnair2, we can't use cChardet.

@Terr commented Nov 28, 2014

Couldn't it be the solution in @marcocova's case, though? Instead of letting requests determine the charset, he could use cChardet himself and tell requests what charset to expect.

@sigmavirus24 (Contributor) commented Nov 29, 2014

@Terr the solution of setting the encoding manually is well documented, so I would hope that @marcocova had considered that.
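
(For reference, the documented workaround amounts to setting the encoding yourself before touching .text. A minimal sketch, with a placeholder URL:)

    import requests

    r = requests.get("https://example.com/page")  # placeholder URL
    r.encoding = "utf-8"  # set explicitly so accessing .text never invokes chardet
    text = r.text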

@marcocova (Author) commented Dec 2, 2014

@sigmavirus24, yes, the workarounds are well understood. I was just concerned that the default behavior leaves the user exposed to this issue (with no warning in the docs, at least none that I could find).

@sigmavirus24 (Contributor) commented Apr 7, 2015

Closing due to inactivity.

@simion commented Jul 1, 2015

@marcocova I had the same issue, then I found this: https://pypi.python.org/pypi/cchardet/
Detection went from 5 seconds (chardet) to 1 millisecond (cchardet) :)

@mlissner (Contributor) commented Jul 30, 2015

In case others are coming here, as I did, and wondering why cchardet isn't included in requests, I can provide an answer gleaned from talking to one of the maintainers on IRC. There was, at one time, a conditional import for cchardet, but it has since been removed. I asked why. Two reasons. First, chardet and cchardet are not fully compatible and have different strengths and accuracies, so a conditional import would mean that requests' behavior isn't deterministic. This is a very bad thing.

The second reason is that conditional imports have caused trouble in other areas of requests that the devs want to trim down. I don't know the details exactly, but there's a conditional import for simplejson that they say has caused trouble, so they're disinclined to add more.

So, if you want faster processing and you're comfortable with the fact that cchardet gives different responses than chardet, you can do this in your code just before the first time you access r.text:

import cchardet

if r.encoding is None:
    # Requests detects the encoding from the HTTP headers when the
    # response is fetched, and falls back to chardet when r.text is
    # accessed if no encoding has been set by that point. By setting
    # the encoding here, we ensure the detection is done by cchardet
    # instead of vanilla chardet. This is a big performance boon.
    r.encoding = cchardet.detect(r.content)['encoding']
stevenschlansker added a commit to opentable/nagios-mesos that referenced this issue Oct 8, 2015
mhils added a commit to mitmproxy/netlib that referenced this issue Nov 14, 2015
This commit improves the semantics of request and response contents.
We now have:

  - message.raw_content: The unmodified, raw content as seen on the wire
  - message.content: The HTTP Content-Type-decoded content. That is, gzip
    or deflate compression is removed.
  - message.text: The HTTP Content-Type-decoded and charset-decoded
    content. This also takes into account the text encoding and returns a
    unicode string.

Both .content and .text are engineered in a way that they shall not fail
under any circumstances. If there is an invalid HTTP Content-Type,
accessing `.content` will return the verbatim `.raw_content`.
For the text encoding, we first try the provided charset and then
immediately fall back to surrogate-escaped utf8. Using chardet would be an
alternative, but chardet is terribly slow in some cases
(psf/requests#2359).
While `surrogateescape` may cause issues at other places
(http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/),
it nicely preserves "everything weird" if we apply mitmproxy replacements -
this property probably justifies keeping it.
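
(For reference, the fallback strategy described in that commit message amounts to roughly the following sketch; decode_text is a hypothetical helper illustrating the approach, not netlib's actual code:)

    def decode_text(content, charset):
        # Try the charset declared in the message first, if any.
        if charset:
            try:
                return content.decode(charset)
            except (UnicodeDecodeError, LookupError):
                pass  # invalid data or unknown codec name
        # Fall back to surrogate-escaped UTF-8: this never raises, and the
        # original bytes round-trip via .encode("utf8", "surrogateescape").
        return content.decode("utf8", "surrogateescape")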
kostas0 pushed a commit to DataDog/dd-agent that referenced this issue Feb 11, 2016
Requests internally (and irritatingly imho) bundles `chardet`. It uses this library to figure out the encoding for a given response if there are no suitable HTTP headers detailing the character set of the response.

This library is pathologically slow (it's loop-heavy) on any python implementation _not_ running on top of a VM. There are better implementations of `chardet` for vanilla pythons, but the requests library authors wish to avoid bundling C extensions. psf/requests#2359

Taking a leaf from opentable's book, we now force the encoding of the Mesos response to always be UTF-8, which avoids requests pulling large Mesos payloads through `chardet` and prevents the Datadog agent from burning large amounts of CPU time to run the check.
JustAnotherArchivist added a commit to JustAnotherArchivist/qwarc that referenced this issue Jul 25, 2019
chardet can be very slow (chardet/chardet#29, psf/requests#2359), and the decoding may be unnecessary if the content is binary.
@matrey commented Nov 12, 2019

Following up on @sigmavirus24's and @mlissner's comments (advising to set r.encoding yourself before any call to .text is made):

Note that if the payload is binary (e.g. downloading a .gz file from S3), cchardet will quickly return an encoding of None. But setting r.encoding = None is a no-op, so you still have to refrain from calling .text or .apparent_encoding afterwards, or these would trigger a new, slow chardet detection.

I took the path of... monkey-patching apparent_encoding... ¯\_(ツ)_/¯
The following seems to work (Python 3.7, Requests 2.22.0, YMMV):

import requests
import cchardet

class ForceCchardet:
    @property
    def apparent_encoding(obj):
        # Same contract as requests' built-in property, but backed by cchardet.
        return cchardet.detect(obj.content)['encoding']

# Monkey-patch Response so .apparent_encoding (and therefore .text when no
# encoding is set) uses cchardet instead of the bundled chardet.
requests.Response.apparent_encoding = ForceCchardet.apparent_encoding
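
With the patch applied, even a binary payload resolves quickly (placeholder URL):

    r = requests.get("https://example.com/archive.gz")  # placeholder URL
    print(r.apparent_encoding)  # resolved by cchardet in milliseconds; may be None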