Charset detection in text() can be pathologically slow #2359
```python
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library"""
    return chardet.detect(self.content)['encoding']
```
Unfortunately, chardet can be pathologically slow and memory-hungry while doing its job. For example, accessing the `text` property of a response with the following content:
causes python-requests to use 131.00 MB of memory and take 40+ seconds (for a 1 MB response!).
In case others are coming here, as I did, and wondering why cchardet isn't included in requests: I can provide an answer, gleaned from talking to one of the maintainers on IRC. There was, at one time, a conditional import for cchardet, but it has since been removed. I asked why, and there are two reasons. First, chardet and cchardet are not fully compatible; they have different strengths and accuracies. A conditional import would therefore mean that requests' behavior isn't deterministic, which is a very bad thing.
The second reason is that conditional imports have caused trouble in other areas of requests that the devs want to trim down. I don't know the details here, exactly, but there's a conditional import of simplejson that they say has caused trouble, so they're disinclined to add more.
So, if you want faster processing and you're comfortable with the fact that cchardet can return different results than chardet, you can do this in your code just before the first time you access `.text`:
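To see why setting the encoding before the first `.text` access skips the detector entirely, here is a minimal stdlib-only sketch. `FakeResponse` is a hypothetical stand-in that models requests' lazy behavior (charset detection runs only when `.encoding` is still `None`); it is not requests' actual implementation.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response, modelling its
    lazy .text: charset detection runs only if .encoding is None."""

    def __init__(self, content):
        self.content = content      # raw bytes, like Response.content
        self.encoding = None
        self.detect_calls = 0

    @property
    def apparent_encoding(self):
        # Stands in for the (potentially very slow) chardet.detect()
        self.detect_calls += 1
        return "utf-8"

    @property
    def text(self):
        enc = self.encoding or self.apparent_encoding
        return self.content.decode(enc, errors="replace")


r = FakeResponse("héllo".encode("utf-8"))
# With cchardet installed, the real-world pattern would be e.g.
# r.encoding = cchardet.detect(r.content)["encoding"]; or simply:
r.encoding = "utf-8"
body = r.text
assert r.detect_calls == 0   # the slow detector was never invoked
```

The point is only the ordering: assign `.encoding` first and the expensive detection path is never taken.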
This commit improves the semantics of request and response contents. We now have:

- `message.raw_content`: the unmodified, raw content as seen on the wire.
- `message.content`: the HTTP Content-Encoding-decoded content; that is, gzip or deflate compression is removed.
- `message.text`: the Content-Encoding-decoded and charset-decoded content. This also takes the text encoding into account and returns a unicode string.

Both `.content` and `.text` are engineered so that they will not fail under any circumstances. If there is an invalid HTTP Content-Encoding, accessing `.content` will return the verbatim `.raw_content`. For the text encoding, we first try the provided charset and then immediately fall back to surrogate-escaped UTF-8. Using chardet would be an alternative, but chardet is terribly slow in some cases (psf/requests#2359). While `surrogateescape` may cause issues at other places (http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/), it nicely preserves "everything weird" when we apply mitmproxy replacements; this property probably justifies keeping it.
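The surrogate-escape fallback described in that commit message can be illustrated with the standard library alone. This is a sketch of the general mechanism, not mitmproxy's actual code:

```python
# Bytes that are not valid UTF-8: \xe9 is a bare latin-1 'é',
# and \xff can never appear in a UTF-8 stream.
raw = b"caf\xe9 \xff"

# Decoding never raises; each undecodable byte becomes a lone
# surrogate code point (0xDC00 + byte value).
text = raw.decode("utf-8", errors="surrogateescape")

# Crucially, re-encoding restores the original bytes exactly,
# so edits made on the text level stay lossless for "weird" input.
assert raw == text.encode("utf-8", errors="surrogateescape")
```

This round-trip property is what lets a proxy hand the user a string, apply replacements, and still emit byte-identical output for the untouched parts.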
Requests internally (and irritatingly, imho) bundles `chardet`. It uses this library to figure out the encoding for a given response when there are no suitable HTTP headers detailing the character set. The library is pathologically slow (it is loop-heavy) on any Python implementation _not_ running on top of a VM. There are better implementations of `chardet` for vanilla Pythons, but the requests library authors wish to avoid bundling C extensions (psf/requests#2359). Taking a leaf from opentable, we now force the encoding of the Mesos response to always be UTF-8, which avoids requests pulling large Mesos payloads through `chardet` and prevents the Datadog agent from spending large amounts of CPU time running the check.
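The fix described there amounts to a one-line assignment before `.text` is first read. A sketch of the pattern follows; the `Response` is built by hand via the private `_content` attribute purely so the example runs without a network, and the JSON body is invented for illustration:

```python
import requests

# Build a Response by hand (private _content attribute, illustration
# only) instead of actually querying a Mesos endpoint.
resp = requests.models.Response()
resp.status_code = 200
resp._content = '{"slaves": ["ünïcode"]}'.encode("utf-8")

# Force the charset *before* the first .text access, so requests
# never falls back to chardet on a large payload:
resp.encoding = "utf-8"
body = resp.text
```

With `resp.encoding` already set, `apparent_encoding` (and hence `chardet.detect`) is never consulted.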
Note that if the payload is binary (e.g. downloading a .gz file from S3),
I took the path of... monkey-patching