Make chardet/charset_normalizer optional dependencies #5875

Closed
akx wants to merge 4 commits

Conversation

@akx commented Jul 15, 2021

An implementation of #5871 for consideration.

Unlike my suggestion in #5871, this implementation prefers chardet / charset_normalizer when one is available, the idea being that those libraries should do a better job than simply trying to decode the payload with the two most likely encodings.

Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.
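
A rough sketch of the intended lookup order (my illustration of the idea, not the exact diff in this PR; `detect_encoding` is a hypothetical helper name):

```python
# A sketch only: prefer an installed detector, otherwise fall back to
# UTF-8 (a strict superset of ASCII) and fail with a helpful message.
def detect_encoding(content: bytes) -> str:  # hypothetical helper name
    try:
        from charset_normalizer import detect  # preferred when installed
    except ImportError:
        try:
            from chardet import detect
        except ImportError:
            detect = None

    if detect is not None:
        guess = detect(content).get("encoding")
        if guess:
            return guess

    try:
        content.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        raise ValueError(
            "Could not determine the response encoding; install chardet or "
            "charset_normalizer, or set response.encoding explicitly."
        )
```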

This is still missing documentation, changelogs, etc. – help appreciated.

Considering the switchover to charset_normalizer has broken some users' workflows anyway, I think it wouldn't be too much of an additional breaking change to do this without a major version bump.

@akx marked this pull request as ready for review July 26, 2021 14:36
@nateprewitt changed the base branch from master to main January 3, 2022 15:04
@nateprewitt (Member) commented:

I think we're going to close this out given the outcome of the tracking issue. The long-term goal is to remove character detection from Requests. I'm not sure this solution is robust enough, and it's likely to introduce further unnecessary breakage at this point. We'll track progress on the original issue. Thanks!

@akx (Author) commented May 11, 2022

@nateprewitt What is the tracking issue, please? #5871 has been "locked as off-topic", #5797 has been "locked as resolved".

I'm curious what the perceived non-robustness here is, because I do believe we could just drop the default chardet/charset_normalizer dependency and most of our users would be none the wiser.

Encodings on the web

I wrote a program to get the top-level pages of the domains in the TRANCO list to figure out the lay of the land encoding-wise. I filtered to the top 10000 or so domains because of a lack of patience, and the results are as follows:

Out of 5938 pages, there are 10 pages (0.17%) where an encoding wasn't (correctly) declared.
With binary responses filtered out, there are 7 such pages (0.12%):

| URL | Content-Type | Remarks |
| --- | --- | --- |
| http://micpn.com/ | (no content-type) | ASCII error page |
| https://grubhub.com/ | (no content-type) | ASCII error page telling me Grubhub doesn't do Finland |
| https://mob.com/ | (no content-type) | UTF-8 HTML |
| https://sberbank.ru/ | `text/html; charset=utf-8, text/html` | malformed Content-Type header; content is UTF-8 |
| https://sbrf.ru/ | `text/html; charset=utf-8, text/html` | malformed Content-Type header; content is UTF-8 |
| https://unicode.org/ | (no content-type) | (ironically enough) serves an ASCII redirection page to home.unicode.org |
| https://www.fuse-cloud.com/ | (no content-type) | returns "invalid request" |

For the rest, the following encodings were declared ("via fallback" meaning the encoding was gleaned through requests' internal fallback for text):

| Encoding | Count | % |
| --- | ---: | ---: |
| utf-8 | 4918 | 82.82% |
| iso-8859-1 (via fallback) | 938 | 15.80% |
| iso-8859-1 | 27 | 0.45% |
| windows-1251 | 18 | 0.30% |
| gbk | 9 | 0.15% |
| euc-jp | 6 | 0.10% |
| gb2312 | 4 | 0.07% |
| iso-8859-2 | 4 | 0.07% |
| iso-8859-15 | 3 | 0.05% |
| shift_jis | 3 | 0.05% |
| euc-kr | 2 | 0.03% |
| us-ascii | 1 | 0.02% |
| windows-1250 | 1 | 0.02% |
| windows-1256 | 1 | 0.02% |
| utf-8 (via fallback) | 1 | 0.02% |
| windows-31j | 1 | 0.02% |
| big-5 | 1 | 0.02% |

The top 3 encodings account for 99.07% of the pages analyzed.
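
For context on the "via fallback" rows above: requests assumes ISO-8859-1 for any text/* Content-Type without a charset parameter (the old RFC 2616 default), which its public utility function demonstrates:

```python
# requests' header fallback: text/* with no explicit charset parameter
# is assumed to be ISO-8859-1 (per the old RFC 2616 default).
from requests.utils import get_encoding_from_headers

print(get_encoding_from_headers({"content-type": "text/html; charset=utf-8"}))
# -> utf-8
print(get_encoding_from_headers({"content-type": "text/html"}))
# -> ISO-8859-1
```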

The key takeaway is that in this corpus, all text pages would be decodable by my proposed .text implementation, either by way of the explicit .encoding read from the headers, or by implied UTF-8. And again, should decoding fail, the error message guides the user to install either detector package and try again.
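
For reference, a sketch of how a survey like the above could be scripted (my reconstruction of the approach; the file name and filtering details are assumptions, not the author's actual program):

```python
# Fetch each front page from a local copy of the Tranco list (rank,domain)
# and record the charset declared in the Content-Type header, if any.
import csv

import requests

def declared_charset(response):
    """Return the charset= parameter of the Content-Type header, or None."""
    content_type = response.headers.get("Content-Type", "")
    for param in content_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key.lower() == "charset":
            return value.strip("\"'").lower()
    return None

with open("tranco.csv") as fh:  # assumed local copy of the Tranco list
    for rank, domain in csv.reader(fh):
        if int(rank) > 10000:  # top ~10000 domains, as in the text
            break
        try:
            response = requests.get(f"https://{domain}/", timeout=10)
        except requests.RequestException:
            continue
        print(domain, declared_charset(response) or "(no declared encoding)")
```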

UTF-8 misinterpretations

I also believe it's quite unlikely for the UTF-8 codec to successfully decode, yet misinterpret, a byte string produced by another encoding.

To that end, I wrote a small test bench that encodes random text sequences with random codecs and attempts to decode the results as UTF-8, recording how many misinterpretations occur.
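
A minimal sketch of such a bench (my reconstruction; the codec list, alphabet, and sequence length are illustrative assumptions, not the author's actual code):

```python
# Encode random text with a random codec and check whether the bytes also
# decode cleanly as UTF-8; if so, did UTF-8 recover the original text?
import random
import string

CODECS = ["utf-7", "utf-16", "utf-32", "iso-8859-1", "windows-1251",
          "shift_jis", "gbk", "euc-jp", "big5"]
SAMPLE = string.printable + "äöåéßЖ日本語"

misinterpretations = correct = 0
for _ in range(2_000_000):  # slow; shrink for a quick run
    text = "".join(random.choices(SAMPLE, k=32))
    codec = random.choice(CODECS)
    try:
        data = text.encode(codec)
    except UnicodeEncodeError:
        continue  # codec cannot represent this text; skip
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        continue  # UTF-8 rejected the bytes, as it usually should
    if decoded == text:
        correct += 1             # UTF-8 happened to be the right answer
    else:
        misinterpretations += 1  # decoded cleanly, but to the wrong text

print(misinterpretations, correct)
```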

After 2 million attempts, there were 22663 misinterpretations (and 41038 correct interpretations!), and by far the top "misbehaving" encoding was UTF-7 (20669 of the 22663 misinterpretations), which is esoteric in its own right, but also very rare on the web: 0.012%. The other top misbehavers were UTF-16 and UTF-32, which are in the same esotericity league on the web. (None of these codecs appear in the TRANCO corpus.)

@akx (Author) commented May 30, 2022

Hi @nateprewitt – sorry for the extra ping, but could you take a look at the above comment?

@nateprewitt (Member) commented:

Hi @akx,

The tracking issue is still #5871, as noted in the last comment, and the current state of the change is described in its first bullet point.

> Yes, it [character detection] should be optional. We want to remove character detection entirely. It doesn't belong in Requests. This won't happen until we major version bump.

You'll find the research you've done is very similar to what was performed in the original issue by Apache [1][2]. There are, however, issues with this approach.

The fact that you noticed a difference between chardet and charset-normalizer to begin with demonstrates that the sampling approach is insufficient. Extrapolating from the 10000 most used (and best-behaved) websites to behavior across billions is extremely lossy. It also fails to account for usage within private networks, where Requests is routinely used for internal tooling.

Our key takeaway from the charset-normalizer migration is that we cannot accurately gauge usage of this and there is no "correct" answer for how we handle it. That's how we arrived at the position that the functionality just doesn't belong in Requests as a default. The best path forward is deferring to the end user on how they'd like to handle decoding. text may eventually fall back to utf-8/ascii, but that's not a change we're looking to make at this time.
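
For illustration, deferring to the end user is already possible with requests' public API today (a usage sketch, not an official recommendation):

```python
# Handle decoding yourself instead of relying on detection.
import requests

resp = requests.get("https://example.com/")

# Either override the encoding before touching .text ...
resp.encoding = "utf-8"
text = resp.text

# ... or bypass .text and decode the raw bytes directly.
text = resp.content.decode("utf-8", errors="replace")
```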

> And again, should decoding fail, the error message guides the user to install either detector package and try again.

This is a hard no from us. We spent months doing impact analysis for the first change and very reluctantly made it due to the impact on the broader Apache ecosystem. Despite best efforts, it was still more visible than we'd have liked, which further ingrained that changes like these are not safe for a project of Requests' size. We won't intentionally make a more significant breaking change like this without a major version bump.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 31, 2023