Conversation

@alexdowad (Contributor)

FYA @nikic

@alexdowad alexdowad force-pushed the cleanup-mbstring-16 branch from dfe181e to 1a10e34 Compare October 17, 2021 18:46
@alexdowad (Contributor Author)

@nikic, one of the irritating problems which still remain after this PR is that HTML-ENTITIES appears before ASCII and UTF-8 in the output of mb_list_encodings(), and any ASCII string is 'valid' in HTML-ENTITIES, so plain ASCII input tends to be 'detected' as HTML-ENTITIES...

Any thoughts what to do about that one?
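
For illustration, a minimal sketch of the effect, assuming the mb_list_encodings() ordering described above (exact output may vary by build):

```php
<?php
// Candidate list straight from mb_list_encodings(), where HTML-ENTITIES
// precedes ASCII and UTF-8.
$detected = mb_detect_encoding("just plain ASCII", mb_list_encodings(), true);
var_dump($detected); // tends to be "HTML-ENTITIES" rather than "ASCII" or "UTF-8"
```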

@alexdowad (Contributor Author)

Travis CI says: "An error occurred while generating the build script". Lame.

@alexdowad alexdowad closed this Oct 17, 2021
@alexdowad alexdowad reopened this Oct 17, 2021
@alexdowad (Contributor Author)

From TravisCI:

$ git clone --depth=50 --quiet https://github.com/php/php-src.git php/php-src
fatal: unable to access 'https://github.com/php/php-src.git/': Operation timed out after 300046 milliseconds with 0 out of 0 bytes received
The command "eval git clone --depth=50 --quiet https://github.com/php/php-src.git php/php-src " failed. Retrying, 2 of 3.
fatal: unable to access 'https://github.com/php/php-src.git/': gnutls_handshake() failed: Error in the pull function.
The command "eval git clone --depth=50 --quiet https://github.com/php/php-src.git php/php-src " failed. Retrying, 3 of 3.
fatal: unable to access 'https://github.com/php/php-src.git/': Operation timed out after 300012 milliseconds with 0 out of 0 bytes received
The command "eval git clone --depth=50 --quiet https://github.com/php/php-src.git php/php-src " failed 3 times.

@alexdowad alexdowad force-pushed the cleanup-mbstring-16 branch from 1a10e34 to ba929ff Compare October 18, 2021 05:51
@alexdowad (Contributor Author)

Looks like Travis CI is healthy again.

@nikic (Member) left a comment

Where does the "common codepoints" data come from?

@alexdowad (Contributor Author)

Where does the "common codepoints" data come from?

I reviewed the Unicode standard and picked out the ranges of codepoints (between U+0000-U+FFFF) which I judged as 'likely' to appear in natural text in the world's major languages. For CJK ideographs (汉字/漢字), I knew that rejecting everything would make mb_detect_encoding useless on CJK text, but accepting everything would make it useless in general, since mis-detected strings often contain a lot of CJK characters. So I searched on the Internet and found some studies where people had calculated character frequency tables from modern Chinese corpora. Those provided the basis for choosing which Chinese characters to include. (I used tables from the Unicode consortium to make sure that the equivalent simplified and traditional characters were both included for each one chosen.) Because some 漢字 which are commonly used in Japanese are not common in Chinese, I added everything in the basic 常用漢字 list from the Japanese government.

Korean Hangul syllables cover about 1/6 of the range from U+0000-U+FFFF, so accepting all of them was out of the question. More Internet searching revealed Hangul frequency tables based on modern Korean corpora. So the top ~2000 of those were added as well. (I haven't done enough testing on Korean text to know if that is actually a good number to include. It may be higher than necessary.)

I built a test harness which could pick a bunch of random excerpts from a corpus of text, convert them to various text encodings using mb_convert_encoding, then run mb_detect_encoding on them and calculate statistics on its accuracy and which encodings were most commonly mistaken for which. To get the input corpus, I grabbed random text in various languages from the web. Then I started incrementally adjusting the "common codepoints" table and the weights assigned to various types of characters to get better and better performance under test.
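
A rough sketch of what such a harness can look like (the file list, excerpt size, and candidate list here are made up for illustration):

```php
<?php
$candidates  = ['UTF-8', 'UTF-16BE', 'UTF-16LE', 'SJIS', 'EUC-JP', 'GB18030', 'BIG-5'];
$corpusFiles = glob('corpus/*.txt'); // hypothetical UTF-8 text samples
$hits = $total = 0;

foreach ($corpusFiles as $file) {
    $text = file_get_contents($file);
    for ($i = 0; $i < 20; $i++) {
        // Pick a random excerpt, re-encode it, and see if detection recovers
        // the encoding we just converted to.
        $start   = random_int(0, max(0, mb_strlen($text, 'UTF-8') - 200));
        $excerpt = mb_substr($text, $start, 200, 'UTF-8');
        foreach ($candidates as $enc) {
            $converted = mb_convert_encoding($excerpt, $enc, 'UTF-8');
            $total++;
            if (mb_detect_encoding($converted, $candidates, true) === $enc) {
                $hits++;
            }
        }
    }
}
printf("Accuracy: %.1f%%\n", 100 * $hits / $total);
```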

Originally I had just included 3000 common Chinese characters, but testing revealed that was not enough to achieve reliable detection on Chinese text, so it was increased to 4000. I also discovered some obscure punctuation characters were used in my Chinese and Japanese samples, so those were added. For Hangul, I had originally included all of them, but testing revealed that caused poor performance.

Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
was only able to do this for a limited few of all the text encodings
which are officially supported by mbstring.

In 3e7acf9, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.
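
As an illustrative sketch only (the real table is built into the C code; this
PHP version just shows how one bit per codepoint from U+0000 to U+FFFF fits in
8KB and how lookups work):

```php
<?php
$table = str_repeat("\x00", 0x10000 / 8); // 65536 bits = 8192 bytes

function markCommon(string &$table, int $cp): void {
    // Set the bit for codepoint $cp: byte index = $cp >> 3, bit index = $cp & 7.
    $table[$cp >> 3] = chr(ord($table[$cp >> 3]) | (1 << ($cp & 7)));
}

function isCommon(string $table, int $cp): bool {
    return (bool) (ord($table[$cp >> 3]) & (1 << ($cp & 7)));
}

markCommon($table, 0x3042);          // e.g. HIRAGANA LETTER A
var_dump(isCommon($table, 0x3042));  // bool(true)
var_dump(isCommon($table, 0xFDD0));  // bool(false) -- never marked
```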

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from
`mb_detect_encoding`, it is important to order the list of candidate
encodings according to your prior belief of which are more likely to
be correct. When the function cannot tell any difference between two
candidates, it returns whichever appeared earlier in the array.
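
A small sketch of that last point (the sample text and candidate lists are
illustrative, and the exact results may vary):

```php
<?php
// Single-byte encodings are mutually indistinguishable to the detector,
// so whichever candidate is listed first is the likely winner.
$koi8Text = mb_convert_encoding("привет", "KOI8-R", "UTF-8");
var_dump(mb_detect_encoding($koi8Text, ['ISO-8859-1', 'KOI8-R'])); // likely "ISO-8859-1"
var_dump(mb_detect_encoding($koi8Text, ['KOI8-R', 'ISO-8859-1'])); // likely "KOI8-R"
```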
@alexdowad alexdowad force-pushed the cleanup-mbstring-16 branch from ba929ff to 9eaec4e Compare October 18, 2021 18:02
@alexdowad (Contributor Author)

OK, @nikic... axed the UUencode change. As you suggested, all the non-encodings are 'blacklisted' instead.

Have a look.

@alexdowad alexdowad force-pushed the cleanup-mbstring-16 branch from 9eaec4e to d708370 Compare October 18, 2021 18:58
@alexdowad (Contributor Author)

Please also let me know if the added deprecations look good or not.

@alexdowad alexdowad force-pushed the cleanup-mbstring-16 branch from d708370 to 3ed8e90 Compare October 18, 2021 19:46
@alexdowad (Contributor Author)

Just discovered more nonsense in mbstring, specifically in its implementation of HTML entities. Sigh. Details in the commit log messages on this PR.

@kamil-tekiela (Member)

I would deprecate all 3 of them right away. https://grep.app/search?q=HTML-ENTITIES&filter[lang][0]=PHP doesn't show many uses anyway.

@nikic (Member) commented Oct 18, 2021

Could you please split off the deprecations into a separate PR? We'll want to land the mb_detect_encoding changes in 8.1 but deprecations only in master.

@alexdowad (Contributor Author)

I would deprecate all 3 of them right away. https://grep.app/search?q=HTML-ENTITIES&filter[lang][0]=PHP doesn't show many uses anyway.

Thanks. I guess you did see my comments in the commit log?

@alexdowad (Contributor Author)

Could you please split off the deprecations into a separate PR? We'll want to land the mb_detect_encoding changes in 8.1 but deprecations only in master.

OK, will do.

@alexdowad alexdowad force-pushed the cleanup-mbstring-16 branch from 3ed8e90 to d22a505 Compare October 18, 2021 20:06
@alexdowad (Contributor Author)

Deprecations have been removed from this PR.

Among the text encodings supported by mbstring are several which are
not really 'text encodings'. These include Base64, QPrint, UUencode,
HTML entities, '7 bit', and '8 bit'.

Rather than providing an explicit list of text encodings which they are
interested in, users may pass the output of mb_list_encodings to
mb_detect_encoding. Since Base64, QPrint, and so on are included in
the output of mb_list_encodings, mb_detect_encoding can return one of
these as its 'detected encoding' (and in fact, this often happens).
Before mb_detect_encoding was enhanced so it could detect any of the
supported text encodings, this did not happen, and it is never desired.
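
For illustration, a possible userland workaround for callers hitting this today:
filter the non-encodings out before using mb_list_encodings() as the candidate
list. The exact names below are assumptions about how mbstring spells them.

```php
<?php
$input = "the string whose encoding we want to guess"; // placeholder input
$nonEncodings = ['BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'];
$candidates   = array_diff(mb_list_encodings(), $nonEncodings);
$detected     = mb_detect_encoding($input, $candidates, true);
```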
@alexdowad alexdowad force-pushed the cleanup-mbstring-16 branch from d22a505 to 1f873e4 Compare October 19, 2021 15:16
@alexdowad (Contributor Author)

Applied changes to PHP-8.1, then merged PHP-8.1 into master.

@alexdowad alexdowad closed this Oct 19, 2021
@alexdowad alexdowad deleted the cleanup-mbstring-16 branch October 19, 2021 16:13