Major overhaul of mbstring (part 16) #7587
Conversation
Force-pushed from dfe181e to 1a10e34.
@nikic, one of the irritating problems which still remains after this PR is that HTML-ENTITIES appears before ASCII and UTF-8 in the output of `mb_list_encodings`. Any thoughts on what to do about that one?
Travis CI says: "An error occurred while generating the build script". Lame.
Force-pushed from 1a10e34 to ba929ff.
Looks like Travis CI is healthy again.
Where does the "common codepoints" data come from?
I reviewed the Unicode standard and picked out the ranges of codepoints (between U+0000-U+FFFF) which I judged as 'likely' to appear in natural text in the world's major languages.

For CJK ideographs (汉字/漢字), I knew that rejecting everything would make detection of Chinese and Japanese text hopeless, so the most common characters were included based on frequency data. Korean Hangul syllables cover about 1/6 of the range from U+0000-U+FFFF, so accepting all of them was out of the question. More Internet searching revealed Hangul frequency tables based on modern Korean corpora, so the top ~2000 of those were added as well. (I haven't done enough testing on Korean text to know if that is actually a good number to include. It may be higher than necessary.)

I built a test harness which could pick a bunch of random excerpts from a corpus of text, convert them to all the different text encodings using `mb_convert_encoding`, and check whether `mb_detect_encoding` identified them correctly.

Originally I had just included 3000 common Chinese characters, but testing revealed that was not enough to achieve reliable detection on Chinese text, so it was increased to 4000. I also discovered some obscure punctuation characters were used in my Chinese and Japanese samples, so those were added. For Hangul, I had originally included all of them, but testing revealed that caused poor performance.
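A hypothetical sketch of the kind of harness described above (not the author's actual test code; `$corpus` and the candidate list here are stand-ins, and a real harness would pair each language's samples with appropriate encodings and far more data):

```php
<?php
// Hypothetical sketch: round-trip each sample into each target encoding,
// then check whether detection recovers the right answer.
$corpus = ["こんにちは、世界。", "这是一个简单的测试。"];
$encodings = ['UTF-8', 'UTF-16BE', 'UTF-16LE', 'SJIS', 'EUC-JP', 'BIG5', 'GB18030'];

$trials = $correct = 0;
foreach ($corpus as $sample) {
    foreach ($encodings as $encoding) {
        // Convert the UTF-8 sample into the target encoding...
        $bytes = mb_convert_encoding($sample, $encoding, 'UTF-8');
        // ...then see whether mb_detect_encoding identifies it correctly.
        $trials++;
        if (mb_detect_encoding($bytes, $encodings, true) === $encoding) {
            $correct++;
        }
    }
}
printf("Detection accuracy: %.1f%% (%d/%d)\n", 100 * $correct / $trials, $correct, $trials);
```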
Originally, `mb_detect_encoding` essentially just checked all candidate encodings to see which ones the input string was valid in. However, it could only do this for a small subset of the text encodings officially supported by mbstring. In 3e7acf9, I modified it so it could 'detect' any text encoding supported by mbstring. While this is arguably an improvement, if the only text encodings one is interested in are those which `mb_detect_encoding` could originally handle, the old `mb_detect_encoding` may have been preferable: because the new one has more possible encodings which it can guess, it also has more chances to get the answer wrong.

This commit adjusts the detection heuristics to provide accurate detection in a wider variety of scenarios. While the previous detection code would frequently confuse UTF-32BE with UTF-32LE, or UTF-16BE with UTF-16LE, the adjusted code is extremely accurate in those cases. Detection of Chinese text in Chinese encodings like GB18030 or BIG5, and of Japanese text in Japanese encodings like EUC-JP or SJIS, is greatly improved. Detection of UTF-7 is also greatly improved. An 8KB table, with one bit for each codepoint from U+0000 up to U+FFFF, is used to achieve this.

One significant constraint is that the heuristics are based entirely on looking at each codepoint in a string in isolation, treating some codepoints as 'likely' and others as 'unlikely'. It might still be possible to achieve great gains in detection accuracy by looking at sequences of codepoints rather than individual codepoints. However, this might require huge tables, and we might need a huge corpus of text in various languages to derive those tables.

Accuracy is still dismal when trying to distinguish single-byte encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is because the valid bytes in these encodings are basically all the same, and all valid bytes decode to 'likely' codepoints, so our method of detection (which rates codepoints as likely or unlikely) cannot tell any difference between the candidates at all. It just selects the first encoding in the provided list of candidates.

Speaking of which, if one wants to get good results from `mb_detect_encoding`, it is important to order the list of candidate encodings according to your prior belief of which are more likely to be correct. When the function cannot tell any difference between two candidates, it returns whichever appeared earlier in the array.
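A quick illustration of that tie-breaking rule (a hedged sketch; the byte string is an arbitrary example, and the expected results follow from the behavior described above):

```php
<?php
// "élève" encoded in ISO-8859-1. These bytes are equally valid in
// ISO-8859-2, and every byte decodes to a 'likely' codepoint, so the
// detector cannot distinguish the two; the first candidate listed wins.
$bytes = "\xE9l\xE8ve";

var_dump(mb_detect_encoding($bytes, ['ISO-8859-1', 'ISO-8859-2'], true)); // string(10) "ISO-8859-1"
var_dump(mb_detect_encoding($bytes, ['ISO-8859-2', 'ISO-8859-1'], true)); // string(10) "ISO-8859-2"
```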
Force-pushed from ba929ff to 9eaec4e.
OK, @nikic... axed the UUencode change. As you suggested, all the non-encodings are 'blacklisted' instead. Have a look.
Force-pushed from 9eaec4e to d708370.
Please also let me know if the added deprecations look good or not.
Force-pushed from d708370 to 3ed8e90.
Just discovered more nonsense in mbstring, specifically in its implementation of HTML entities. Sigh. Details in the commit log messages on this PR.
I would deprecate all 3 of them right away. https://grep.app/search?q=HTML-ENTITIES&filter[lang][0]=PHP doesn't show many uses anyway.
Could you please split off the deprecations into a separate PR? We'll want to land the mb_detect_encoding changes in 8.1, but deprecations only in master.
Thanks. I guess you did see my comments in the commit log?
OK, will do.
Force-pushed from 3ed8e90 to d22a505.
Deprecations have been removed from this PR.
Among the text encodings supported by mbstring are several which are not really 'text encodings'. These include Base64, QPrint, UUencode, HTML entities, '7 bit', and '8 bit'.

Rather than providing an explicit list of text encodings which they are interested in, users may pass the output of `mb_list_encodings` to `mb_detect_encoding`. Since Base64, QPrint, and so on are included in the output of `mb_list_encodings`, `mb_detect_encoding` can return one of these as its 'detected encoding' (and in fact, this often happens). Before `mb_detect_encoding` was enhanced so it could detect any of the supported text encodings, this did not happen, and it is never the desired result.
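A minimal sketch of the failure mode described here; the exact return value depends on the mbstring version and on the order of the `mb_list_encodings` output, so the result shown is illustrative only:

```php
<?php
// Illustrative only: passing every supported "encoding" as a candidate
// lets non-encodings win the tie-break, since ordinary strings are also
// 'valid' input for conversions like Base64, QPrint, or HTML-ENTITIES.
$candidates = mb_list_encodings(); // includes Base64, QPrint, HTML-ENTITIES, ...
var_dump(mb_detect_encoding("café", $candidates, true));
// Could report a non-encoding such as "HTML-ENTITIES" instead of "UTF-8".
```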
Force-pushed from d22a505 to 1f873e4.
Applied changes to
FYA @nikic