The return value of mb_list_encodings() doesn't contain 'SJIS-win' #8308
Comments
@sji-il, thank you for the report. This change actually dates back to 2020; see e245985. Before that commit, I think the first question here is: Was there actually some difference between If indeed there is no difference, do you still feel that Regarding the BC issue, I'm sorry that this was not documented as a backwards-incompatible change. I think this is actually a change from PHP 7.4 to 8.0, not from 8.0 to 8.1 (but please correct me if that is wrong). |
These libraries probably should also check |
https://3v4l.org/AVllX
Since there has already been some difference for 20+ years... there should be, unfortunately.
Even if there is no difference, I would recommend keeping both names on the list to avoid BC breaks, unless the maintenance burden is too high. If the maintenance burden is that problematic, changing it in a major release, with a deprecation beforehand, is reasonable. If the behavior were new or clearly disadvantageous to users, I feel it could simply be fixed as a bug in a minor release, though...
It's from PHP 8.1.
@sj-i, thank you for pointing out the difference in mappings between I would just like to ask: as far as you know, is U+203E the only codepoint which should be handled differently between |
(Commenting further on the sample code which @sj-i shared:) Hmm. This actually seems to be closely related to the same issue which you have commented on in #8281. (Whether SJIS 0x7E should be interpreted as U+007E or U+203E...) It does appear that the legacy behavior of Do you have any idea why this is the case?
If these encodings are different, and both are used by even a small number of users, then we definitely need to support both of them. That is not under question. Currently I am still trying to understand clearly what the difference is.
I can't say for sure, but this is the only one I'm aware of.
I'm not sure about this specific case.
@alexdowad I've also asked the original author, and he said that |
Understood. 👍🏻 So it looks like the solution to this issue here is to restore Will do that as soon as I have a bit of time to prepare a PR. |
Just did a first draft of the code to reintroduce I also still need to re-examine the mappings for CP932, confirm if there have been any changes from legacy I think it would be helpful if we can include some explanation of why users might choose to use @sj-i stated above:
I've not yet read through the references provided by @zonuexe on #8281, so maybe this is already covered there, but can you explain any reasons why one might recommend the use of |
Patch is ready for review, just waiting to see if @sj-i or any other interested parties can provide any additional background information which can go into code comments.
OK, I won't be available for a few more hours because I'm taking care of my 2-year-old daughter, but I'll try to summarize the information I have after she falls asleep (if no one else has done it by then).
Unfortunately, these encodings are not named according to standard specifications. Both Java and Ruby use a different set of encoding names for SJIS variants than PHP's mbstring does.
mbstring originates from the PHP3 漢字パッチ(Kanji patch), which is based on a library called libkcc or "streamable kanji code filter and converter". I haven't seen the full code at that time, but there is evidence that sjis-win was added earlier in the first commit of libmbfl. (Note: libkcc is not libkkc. Finding information about kcc on the web today is very difficult.) One of the original authors, @moriyoshi, said that |
@zonuexe, these are interesting points, though what I was trying to ask was a bit different. Let me try again. Let's imagine you are starting a new PHP-based software project right now. You need to ingest SJIS-encoded data provided by users, and convert to Unicode for processing. Maybe it's for CSV import. Would you use Have you ever developed such software projects in the past? If so, which of It may be that none of the programmers who are sharing in this discussion know why they should choose I'm still hoping that somebody who is following this discussion does have reasons for choosing one variant or another. |
One more small comment... the BC issue is understood and appreciated. I'm not saying much about it, because there is not much to say about it, but it does not mean it is being ignored.
OK, I don't know if this will fit nicely into the code comments, but let me explain what I found in my additional investigation. In my opinion, unfortunately, users often choose between the two encodings almost at random, by coincidence or for historical reasons, rather than out of strong opinions.

As @zonuexe said, SJIS-win existed at the time of the first commit of libmbfl. libmbfl's predecessor was mbfilter, a program created by a Japanese company called HappySize and released as OSS for use with PHP. Although the implementation history from the time SJIS-win was introduced can no longer be tracked, what is important to this story is the initial implementation of libmbfl used with PHP's mbstring.

On December 8, 2002, SJIS-win and CP932 were aliases. The filename of the filter was mbfilter_cp932.c. At this point, U+00A5 and U+203E were mapped to FULLWIDTH YEN SIGN (0x818F) and FULLWIDTH MACRON (0x8150), respectively. (Previously I said that the 8.0 behavior had existed for "20+ years"; that was wrong. The correct figure is about 10 years.)

On March 14, 2004, a statement was added to the PHP Manual
This phrase is still in the PHP manual, so the number of users who refer to the PHP manual would specify On the other hand, Japanese programmers often know that there are multiple variants of ShiftJIS, and that some characters can only be handled with an encoding called CP932 or Windows-31J. Some people may have "tried" one of them and it worked. At this point, these were just aliases, so the behavior was the same no matter which name they specified. They could use any name they liked.

On March 2, 2010, a couple of fixes were made to libmbfl that changed the conversion of U+00A5 and U+203E to 0x5c and 0x7e for CP932, respectively, and separated SJIS-win out as a compatible implementation.

For the 10 years between the release of PHP 5.3.3 on July 22, 2010 and the release of PHP 8.1.0, CP932 and SJIS-win had different conversion tables for the two characters. This difference is not specifically explained in the PHP manual. During this decade, many users may have unwittingly written programs that rely on a conversion table that they chose for no particular reason.

On October 30, 2011, another page in the PHP manual also specifically listed CP932 as a supported encoding. This is still in the manual, too. Depending on perspective, when users try to find out from the manual how to use CP932 with PHP, the PHP manual itself now supports two factions, CP932 and SJIS-win, and incites warfare between them ^^

Of course, there would be some programmers who noticed the difference between the two starting with PHP 5.3.3 and consciously chose the encoding. Yes, don't forget about this faction: ....... The more serious and careful programmers know that there is no single right answer to the code conversion rules and that they should be used differently for different purposes.
Which code point is intended to represent which character may vary depending on how the document was generated, for example on the conversion table used to convert it from a legacy encoding to Unicode. When such a document is further converted to the same or another legacy encoding, the conversion method required will depend on the intended use. There are conversion tables with different policies for handling U+00A5 and U+203E, so we can choose whichever one we imagine would work best for our use case. That's all we can say if we try to choose an encoding on reasonable grounds.
Even if the behavior was about 10 years old instead of over 20, I would still recommend keeping both conversion tables and improving the PHP manual so that it clearly states the differences and gives hints on how to choose one. Maybe @mumumu will write a good explanation :)
@alexdowad Above all, the following phrase is simply expressed.
I preferred
I don't think you're trying to ignore the effects of BC. Throughout the last few days of work, I understand that everyone is working hard to resolve these issues. Thank you!
I have just been searching a bit more to see if it is possible to identify which specification @moriyoshi was following 10 years ago when he adjusted the mappings for I did find that for handling text on the web, the W3C and WHATWG publish a standard on handling of various text encodings. They refer to CP932 as "Shift_JIS", and their definition of "Shift_JIS" seems to agree with |
@cmb69, I think you are a Windows man. I'm just wondering if mbstring's definition of CP932 might have been chosen to agree with the Win32 API. Presumably you have a Windows machine with VC++? Are you able to build and run this program? I don't have a Windows machine to do it on. (Please note, I wrote this code from the Win32 API docs, but can't test it, so it may not work... 😬)

    #include <stdio.h>
    #include <windows.h> /* declares WideCharToMultiByte, WCHAR and CHAR */

    /* Convert one UTF-16 code unit to code page 932 and print the bytes in hex. */
    void TryConversion(int wchar)
    {
        WCHAR instr[1];
        CHAR outstr[32];
        instr[0] = (WCHAR)wchar;
        /* Flags = 0, so Windows may apply its "best fit" mappings. */
        int num_chars = WideCharToMultiByte(932, 0, instr, 1, outstr, sizeof(outstr), NULL, NULL);
        if (num_chars == 0) {
            printf("ERROR\n");
            return;
        }
        for (int i = 0; i < num_chars; i++)
            printf("%x ", (unsigned)(outstr[i] & 0xFF));
        printf("\n");
    }

    int main(void)
    {
        TryConversion(0xA5);   /* U+00A5 YEN SIGN */
        TryConversion(0x203E); /* U+203E OVERLINE */
        return 0;
    }

A web search suggests that in Visual Studio, when creating a new project for this, you would want to choose "Windows Desktop", choose "Windows Desktop Wizard", then in the wizard, pick "Console Application (.exe)" as the "Application Type".
https://gist.github.com/sj-i/6bacf1bef255c096b3cbb78bfc29aa28
References:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
@sj-i Hmm, OK, thanks.
As for U+203E, there seems to be some other background... Your finding of the WHATWG encoding standard would be a hint. whatwg/encoding@be441c8 I'm currently searching for old implementations of browsers that handle encodings. BTW, glibc iconv translates U+203E to 0x7E when converting Unicode text to CP932. http://www2d.biglobe.ne.jp/~msyk/cgi-bin/charcode/373.html glibc iconv adopts UCS normalization, so even conversions between the SJIS family and the EUC-JP family are done through Unicode.
Leaving aside the historical truth, the fact that glibc iconv uses these one-way mappings for such practical reasons may in itself justify its implementation in PHP. This could fit nicely into the code comments^^ |
Speaking of |
I've also found the Twitter account of the original author of the patch that added support for CP932 to the glibc libiconv, and asked him about the intent behind U+203E. If I get a response, I will share it here.
@alexdowad Those one-way mappings in glibc iconv were added to allow practical conversion between legacy encodings.
@sj-i Thanks very much for this information. Just trying to process this...

The reply from Moriyama-san which you kindly linked to mentions SJIS → CP932, EUC-JP → CP932, and ISO-2022-JP → CP932 conversions. The implication is that in all 3 of the source encodings, something maps to U+203E. Then, since conversions from one legacy encoding to another go through Unicode, when converting Unicode → CP932, we convert U+203E to 0x7E so that the something mentioned above will map to CP932 0x7E.

At least in mbstring, SJIS 0x7E maps to U+203E. So mapping U+203E to CP932 0x7E means that SJIS 0x7E will convert to CP932 0x7E. However, in both mbstring and iconv, EUC-JP 0x7E and ISO-2022-JP 0x7E do not map to U+203E. So I am still trying to figure out why Moriyama-san said this mapping is related to EUC-JP → CP932 and ISO-2022-JP → CP932 conversions...
@alexdowad Moriyama-san was working in several OSS communities at the time to make each project's character code processing interoperable with the others. There is a practical problem when a character that Japanese users generally regard as "the same" is changed to another character when converted via Unicode. He proposed several additional one-way mappings for the then-current ISO 2022 JP and EUC-JP encodings, as well as for ShiftJIS and its derivatives. http://www.mysql.gr.jp/mysqlml/mysql/msg/12442 Although not all of these efforts bore fruit, he chose to include ISO-2022-JP and EUC-JP in his mention of the topic because similar efforts were being made at the time for code points other than U+203E and U+00A5.
@sj-i I am sorry I have delayed a bit in responding. Thanks for your exhaustive research. I think we are almost ready to update both documentation for core developers (i.e. code comments) and the PHP manual here.

To summarize @sj-i's findings, I think we can characterize
• It follows the "best fit" CP932 mappings from the Microsoft Windows API...

Before we put that information in the PHP manual, we need to confirm that those statements are true. For example, @sj-i kindly checked the Win32 mappings for U+203E and U+00A5, but we haven't checked all Win32 "best fit" mappings from Unicode → CP932 and CP932 → Unicode. I would also like to check all Shift_JIS mappings in the JavaScript Encoding API (i.e. WHATWG), to confirm whether we are truly compatible with them or not. I can do this when I have some time (probably in the next few days).

For core developers, we can add the following information:
• The 'extra mapping' is for U+203E.

Again, we need to check if the statement about NOW! We have come a long way here, but still have not 100% reached where we need to go. The remaining piece of critical information which needs to be documented, both for core developers and for users of PHP, is why we have the so-called That is all well and good, but do we really want to tell our users that we are currently keeping
I think we can summarize the recommended uses of these mapping tables as follows: • If you are converting SJIS text to CP932, use Does that sound right? |
@alexdowad So, as we found, there are parts that are currently consistent with the MS best-fit mapping, the WHATWG specification, and the behavior of other converters. But even if there are differences with them in the implementation, I don't think they should be "fixed" to eliminate the differences, at least in minor versions, unless there is an obvious error that is to the detriment of the user. The long-standing behavior of PHP's mbstring itself is now part of the customs that our users rely on. The advice we can surely give to future core developers, in general, is as follows
I don't think anybody should use legacy encodings in their 'greenfield' development ^^
"If you need to convert U+00A5 (YEN SIGN) and U+203E (OVERLINE) from Unicode to full-width, use SJIS-win." I think it is appropriate to explain the behavior just as it is.
@sj-i Regarding the specific issue under discussion here ( The remaining "question marks" are all related to the way forward from here. If I had known in 2020 that there was a difference in behavior between Normally, when a change like this is made unintentionally or due to a misunderstanding, the default response is to roll it back. And we might still go that way. But, there are some unusual things about this situation, which are making it hard for me to feel sure about what to do:
It is revealing that both gentlemen could experience many years of PHP development without ever finding it necessary to pay attention to the difference between
Another point: To me, the maintenance burden of keeping two different CP932 variants in mbstring is not a great concern. My concern is more for the users. I think it is a disservice to PHP users if we keep two different features which work almost the same, but just slightly differently, and nobody can provide any reasonable explanation of why things should be that way or how the users can choose which one to use. (The PHP platform may have had such inconsistencies in the past, but they are being gradually smoothed out with each successive release.)

Thought experiment: If we were not working on PHP here, but on Python, Ruby, JavaScript, or some other platform which only supports one variant of CP932 in its standard library, would you recommend that a second variant of CP932 should be added? If not, that suggests there might not be a logical or practical reason to have two variants.

If indeed there is no practical reason to have two different CP932 variants, it is possible that we could still roll the change back for PHP 8.1 (to avoid a BC break), and perform the simplification in a future release. Please note, I am just mentioning this as an idea and am still open to all possible outcomes here. At this point, it would be very helpful if other core developers could share their views.

Finally, I would like to say that the issue raised in #8281 may be more straightforward than this one, and it is possible we might be able to sort it out more quickly.
It might be reasonable to revert for now, and to merge these encodings for PHP 9. |
In e245985, I combined mbstring's "SJIS-win" text encoding into CP932. This was done after doing some testing which appeared to show that the mappings for "SJIS-win" were the same as those for "CP932".

Later, it was found that there was actually a small difference prior to e245985 when converting Unicode to CP932. The mappings for the following two codepoints were different:

             CP932   SJIS-win
    U+203E   0x7E    0x81 0x50
    U+00A5   0x5C    0x81 0x8F

As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and 'YEN SIGN' to the ASCII bytes which have conflicting uses in most legacy Japanese text encodings. "SJIS-win" mapped these to equivalent JIS X 0208 fullwidth characters.

Since e2459867af was not intended to cause any user-visible change in behavior, I am rolling back the merge of "CP932" and "SJIS-win".

It seems doubtful whether these two text encodings should be kept separate or merged in a future release. An extensive discussion of the related historical background and compatibility issues involved can be found in this GitHub thread: php#8308
This has been resolved.
Description
The following code:
https://3v4l.org/GH0Gj
Resulted in this output:
But I expected this output instead:
As of PHP 8.1, SJIS-win is an alias of CP932, thus the return value of mb_list_encodings() doesn't contain 'SJIS-win'.
This breaks BC.
Some libraries have functions which receive one or more encoding names as arguments, and they often validate the input by the return value of mb_list_encodings().
This should be reverted, or fixed, or at least documented as an incompatible change.
PHP Version
PHP 8.1.0
Operating System
No response