The return value of mb_list_encodings() doesn't contain 'SJIS-win' #8308

Closed
sji-il opened this issue Apr 6, 2022 · 34 comments

Comments

@sji-il

sji-il commented Apr 6, 2022

Description

The following code:

<?php
var_dump(mb_list_encodings());

https://3v4l.org/GH0Gj

Resulted in this output:

  • contains 'CP932' and 'SJIS', but no 'SJIS-win'

But I expected this output instead:

  • contains 'SJIS-win'

As of PHP 8.1, SJIS-win is an alias of CP932, thus the return value of mb_list_encodings() doesn't contain 'SJIS-win'.

This breaks BC.

Some libraries have functions which receive one or more encoding names as arguments, and they often validate the input by the return value of mb_list_encodings().
This should be reverted, or fixed, or at least documented as an incompatible change.

PHP Version

PHP 8.1.0

Operating System

No response

@alexdowad
Contributor

@sji-il, thank you for the report.

This change actually dates back to 2020; see e245985. Before that commit, SJIS-win was indeed treated as a separate encoding from CP932, but testing revealed that the behavior of SJIS-win and CP932 was exactly the same (i.e. all the same mappings).

I think the first question here is: Was there actually some difference between SJIS-win and CP932 which I missed? Perhaps more importantly, should there have been some difference?

If indeed there is no difference, do you still feel that mb_list_encodings should list the same legacy text encoding twice, under two different names? Or perhaps SJIS-win should be the primary name, and CP932 should be an alias?

Regarding the BC issue, I'm sorry that this was not documented as a backwards-incompatible change. I think this is actually a change from PHP 7.4 to 8.0, not from 8.0 to 8.1 (but please correct me if that is wrong).

@cmb69
Member

cmb69 commented Apr 6, 2022

SJIS-win is only an alias of CP932 as of PHP 8.1.0 (https://3v4l.org/bF8G4). I think we should document this change, but not revert it.

Some libraries have functions which receive one or more encoding names as arguments, and they often validate the input by the return value of mb_list_encodings().

These libraries probably should also check mb_encoding_aliases() (something like https://www.php.net/manual/en/function.mb-list-encodings.php#122266).
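
A minimal sketch of the alias-aware check suggested above, assuming a library only needs to know whether mbstring accepts a given name. The helper name `is_supported_encoding` is hypothetical; `mb_list_encodings()` and `mb_encoding_aliases()` are the mbstring functions already mentioned in this thread.

```php
<?php
// Hypothetical helper: accept a name if it is a canonical mbstring
// encoding name or one of that encoding's aliases (case-insensitive).
function is_supported_encoding(string $name): bool
{
    foreach (mb_list_encodings() as $encoding) {
        if (strcasecmp($encoding, $name) === 0) {
            return true;
        }
        // Fall back to an empty list if the alias lookup yields no array.
        $aliases = mb_encoding_aliases($encoding) ?: [];
        foreach ($aliases as $alias) {
            if (strcasecmp($alias, $name) === 0) {
                return true;
            }
        }
    }
    return false;
}

var_dump(is_supported_encoding('SJIS-win'));
```

With a check like this, 'SJIS-win' should be accepted whether it is listed as its own encoding or registered as an alias of CP932.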

@sj-i
Contributor

sj-i commented Apr 6, 2022

@alexdowad

Was there actually some difference between SJIS-win and CP932

https://3v4l.org/AVllX
This code behaves differently between SJIS-win and CP932 before 8.1.

should there have been some difference?

As there has been some difference for 20+ years already... there should be, unfortunately.
Yeah, I think I can hear the third sigh from you.
Legacy Japanese encodings should all be burned by fire and buried in the dark, though it's not realistic.
SJIS-win and CP932 are both used in Japan. Some articles recommend one, others recommend another. It is optimistic to say that keeping only one behavior is enough. Both could be depended on by a fair number of users.

mb_list_encodings should list the same legacy text encoding twice

Even if there is no difference, I would recommend keeping two names on the list to avoid BC breaks, unless the maintenance burden is too high. If the maintenance burden is that problematic, changing it in a major release with a deprecation beforehand is reasonable. If the behavior was newborn or clearly disadvantageous to users, I feel it could simply be fixed as a bug in a minor release though...

this is actually a change from PHP 7.4 to 8.0, not from 8.0 to 8.1

It's from PHP 8.1.
https://3v4l.org/glAfa

@alexdowad
Contributor

@sj-i, thank you for pointing out the difference in mappings between SJIS-win and CP932. I don't know where I messed up when testing mbstring's handling of these text encodings back in 2020, but in any case, the correction is appreciated. (As mentioned, when testing, it appeared that there was no difference in behavior.)

I would just like to ask: as far as you know, is U+203E the only codepoint which should be handled differently between SJIS-win and CP932? Or are there any others?

@alexdowad
Contributor

(Commenting further on the sample code which @sj-i shared:) Hmm. This actually seems to be closely related to the same issue which you have commented on in #8281. (Whether SJIS 0x7E should be interpreted as U+007E or U+203E...)

It does appear that the legacy behavior of CP932 in mbstring was a bit illogical, in that when converting Unicode → CP932, it would convert U+007E to 0x7E (implying that 0x7E is being used to represent a tilde, not an overbar), but then when converting CP932 → Unicode, it would convert U+203E to 0x7E (implying that 0x7E represents an overbar, not a tilde).

Do you have any idea why this is the case?

SJIS-win and CP932 are both used in Japan. Some articles recommend one, others recommend another. It is optimistic to say that keeping only one behavior is enough. Both could be depended on by a fair number of users.

If these encodings are different, and both are used by even a small number of users, then we definitely need to support both of them. That is not under question. Currently I am still trying to understand clearly what the difference is.

@sj-i
Contributor

sj-i commented Apr 6, 2022

@alexdowad

Or are there any others?

I can't say for sure, but this is the only one I'm aware of.

Do you have any idea why this is the case?

I'm not sure about this specific case.
Tilde (0x7E in ASCII) is often used in programming languages for special purposes. So historically there is confusion about how 0x7E in legacy encodings should be converted to Unicode, and how its roundtrip should be treated.

@sj-i
Contributor

sj-i commented Apr 7, 2022

@alexdowad
https://3v4l.org/lQpnG
I found that conversion from U+00A5 is also different between CP932 and SJIS-win via brute-forcing.

I've also asked the original author, and he said that SJIS-win is a compatibility implementation created for historical reasons to map U+00A5 and U+203E in the same way as eucJP-win, so that the differences with CP932 are intended and should be kept.
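
The brute-force comparison mentioned above can be sketched roughly as follows. This is not the script behind the 3v4l link; it is an illustrative reconstruction, and the function name `diff_conversions` is hypothetical. It converts every BMP code point from UTF-8 into two target encodings and reports where the resulting bytes differ.

```php
<?php
// Compare how two mbstring encodings convert each BMP code point from UTF-8.
// Returns [code point => [hex bytes in $a, hex bytes in $b]] for every difference.
function diff_conversions(string $a, string $b): array
{
    $diff = [];
    for ($cp = 0; $cp <= 0xFFFF; $cp++) {
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue; // surrogates are not valid Unicode scalar values
        }
        $char = mb_chr($cp, 'UTF-8');
        $inA = mb_convert_encoding($char, $a, 'UTF-8');
        $inB = mb_convert_encoding($char, $b, 'UTF-8');
        if ($inA !== $inB) {
            $diff[$cp] = [bin2hex($inA), bin2hex($inB)];
        }
    }
    return $diff;
}

foreach (diff_conversions('CP932', 'SJIS-win') as $cp => [$cp932, $sjisWin]) {
    printf("U+%04X: CP932=%s SJIS-win=%s\n", $cp, $cp932, $sjisWin);
}
```

On a PHP version where SJIS-win is merely an alias of CP932, the loop should print nothing; on versions where the two are separate encodings, the thread suggests that U+00A5 and U+203E are the code points to expect.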

@alexdowad
Contributor

I've also asked the original author, and he said that SJIS-win is a compatibility implementation created for historical reasons to map U+00A5 and U+203E in the same way as eucJP-win, so that the differences with CP932 are intended and should be kept.

Understood. 👍🏻

So it looks like the solution to this issue here is to restore SJIS-win. That will also make it 'reappear' in mb_list_encodings.

Will do that as soon as I have a bit of time to prepare a PR.

@alexdowad
Contributor

Just did a first draft of the code to reintroduce SJIS-win (it's just a few lines). Still need to adjust the unit tests.

I also still need to re-examine the mappings for CP932, confirm if there have been any changes from legacy mbstring, and if so, raise each one for discussion here.

I think it would be helpful if we can include some explanation of why users might choose to use CP932 or SJIS-win in the code comments; this would help future maintainers and developers to understand why the code is the way it is.

@sj-i stated above:

SJIS-win and CP932 are both used in Japan. Some articles recommend one, others recommend another.

I've not yet read through the references provided by @zonuexe on #8281, so maybe this is already covered there, but can you explain any reasons why one might recommend the use of CP932, or conversely, might recommend SJIS-win?

@alexdowad
Contributor

Patch is ready for review, just waiting to see if @sj-i or any other interested parties can provide any additional background information which can go into code comments.

@sj-i
Contributor

sj-i commented Apr 7, 2022

OK, I won't be available for a few more hours because I'm taking care of my 2-year-old daughter, but I'll try to summarize the information I have after she falls asleep (if no one else has done it by then).

@zonuexe

zonuexe commented Apr 7, 2022

@alexdowad

I've not yet read through the references provided by @zonuexe on #8281, so maybe this is already covered there, but can you explain any reasons why one might recommend the use of CP932, or conversely, might recommend SJIS-win?

Unfortunately, these encodings are not named according to standard specifications. Both Java and Ruby have a different set of encoding names for SJIS variants than PHP's mbstring.

mbstring originates from the PHP3 漢字パッチ(Kanji patch), which is based on a library called libkcc or "streamable kanji code filter and converter". I haven't seen the full code at that time, but there is evidence that sjis-win was added earlier in the first commit of libmbfl.

(Note: libkcc is not libkkc. Finding information about kcc on the web today is very difficult.)

One of the original authors, @moriyoshi, said that SJIS-win and CP932 needed to be separated in order to comply with "some specifications". They have forgotten which spec they followed, but the important fact is that for 20 years many users have relied on the name SJIS-win and its conversion map.

@alexdowad
Contributor

@zonuexe, these are interesting points, though what I was trying to ask was a bit different.

Let me try again. Let's imagine you are starting a new PHP-based software project right now. You need to ingest SJIS-encoded data provided by users, and convert to Unicode for processing. Maybe it's for CSV import. Would you use mb_convert_encoding with "SJIS", "SJIS-win", "CP932", or something else? Why?

Have you ever developed such software projects in the past? If so, which of mbstring's text encodings did you use? And most importantly, why? (I hope the answer will not be "I just flipped a coin". 😅 )

It may be that none of the programmers who are sharing in this discussion know why they should choose SJIS-win or CP932. It may be that almost none of the (presumably many) PHP users who have ever used mbstring for this purpose, over its entire history, have ever understood why they should choose SJIS-win over CP932, or vice versa. If so, well... it is what it is. But it would be helpful, and appreciated, if we can be clear about that.

I'm still hoping that somebody who is following this discussion does have reasons for choosing one variant or another.

@alexdowad
Contributor

One more small comment... the BC issue is understood and appreciated. I'm not saying much about it, because there is not much to say about it, but it does not mean it is being ignored.

@sj-i
Contributor

sj-i commented Apr 7, 2022

OK, I don't know if this will fit nicely into the code comments, but let me explain my findings in the additional investigation. In my opinion, unfortunately, the two encodings are often chosen almost at random among users by coincidence or historical reasons, rather than by strong opinions.

As @zonuexe said, SJIS-win existed at the time of the first commit of libmbfl. libmbfl's predecessor was mbfilter, a program created by a Japanese company called HappySize and released as OSS for use with PHP. Although the implementation history at the time of the introduction of SJIS-win cannot be tracked anymore, what is important to this story is the initial implementation of libmbfl used with PHP's mbstring.

On December 8, 2002, SJIS-win and CP932 were aliases. The filename of the filter was mbfilter_cp932.c. At this point, U+00A5 and U+203E were mapped to FULLWIDTH YEN SIGN (0x818F) and FULLWIDTH MACRON (0x8150), respectively.
moriyoshi/libmbfl@6c6ea25#diff-5eecf532cc3c80d6877f1356bbc4b084ccd049313ab175160a1f9370384d1362R59-R63
moriyoshi/libmbfl@6c6ea25#diff-5eecf532cc3c80d6877f1356bbc4b084ccd049313ab175160a1f9370384d1362R237-R240

(Previously I said that the 8.0 behavior had been in place for "20+ years"; that was wrong. It has actually been about 10 years.)

On March 14, 2004, a statement was added to the PHP manual:
php/doc-en@cee35d9#diff-d5b0e3b6642a9b9829d0fc664170f19ddb344fd869a5317934d22db6e8bf179cR346

For the CP932 codemap, use SJIS-WIN instead.

This phrase is still in the PHP manual, so a number of users who refer to the manual would specify 'SJIS-win' because of it.

On the other hand, Japanese programmers often know that there are multiple variants of ShiftJIS, and that some characters can only be handled with an encoding called CP932 or Windows-31J. Some people may have "tried" one of them and it worked. At this point, these were just aliases, so the behavior was the same no matter which name they specified. They could use any name they liked.

On March 2, 2010, a couple of fixes were made to libmbfl that changed the conversion of U+00A5 and U+203E to 0x5c and 0x7e for CP932, respectively, and separated SJIS-win as a compatibility implementation.
moriyoshi/libmbfl@cb257bc
moriyoshi/libmbfl@46a83aa

For the 10 years between the release of PHP 5.3.3 on July 22, 2010 and the release of PHP 8.1.0, CP932 and SJIS-win had different conversion tables for the two characters. And this difference is not specifically explained in the PHP manual. During this decade, many users may have unwittingly written programs that rely on a conversion table that they chose for no particular reason.

On October 30, 2011, another page in the PHP manual also specifically lists CP932 as a supported encoding.
php/doc-en@93c8f3c

This is still in the manual, too.

Depending on perspective, when users try to find out from the manual how to use CP932 with PHP, the PHP manual itself now supports two factions, CP932 and SJIS-win, and incites warfare between them ^^

Of course, there would be some programmers who noticed the difference between the two starting with PHP 5.3.3 and consciously chose an encoding. Yes, don't forget about this faction: ....... The more serious and careful programmers know that there is no single right answer to the code conversion rules and that they should be used differently for different purposes.

@sj-i
Contributor

sj-i commented Apr 7, 2022

Depending on how the document is generated, such as the conversion table used to convert a document from legacy encoding to Unicode, which code point is intended to represent which character may vary. When such a document is further converted to the same or another legacy encoding, the conversion method required will depend on the intended use. There are conversion tables with different policies for handling U+00A5 and U+203E, so we can choose which one we imagine would work best for our use case. That's all we can talk about if we try to choose an encoding reasonably.

@sj-i
Contributor

sj-i commented Apr 7, 2022

Even if the behavior was about 10 years old instead of over 20, I would still recommend keeping both conversion tables and improving the PHP manual so that it clearly states the differences and gives hints on how to choose one. Maybe @mumumu will write a good explanation :)

@zonuexe

zonuexe commented Apr 8, 2022

@alexdowad
The research by @sj-i clearly explained my vague understanding of the historical background.

Above all, the following phrase puts it simply.

the two encodings are often chosen almost at random among users by coincidence or historical reasons, rather than by strong opinions.

I preferred CP932 as a common encoding name compatible with other languages, but I didn't know the subtle differences in behavior and it's essentially just a coin toss. 😄

One more small comment... the BC issue is understood and appreciated. I'm not saying much about it, because there is not much to say about it, but it does not mean it is being ignored.

I don't think you're trying to ignore the effects of BC. Throughout the last few days of work, I understand that everyone is working hard to resolve these issues. Thank you!

@alexdowad
Contributor

I have just been searching a bit more to see if it is possible to identify which specification @moriyoshi was following 10 years ago when he adjusted the mappings for CP932. Coming up empty so far.

I did find that for handling text on the web, the W3C and WHATWG publish a standard on handling of various text encodings. They refer to CP932 as "Shift_JIS", and their definition of "Shift_JIS" seems to agree with mbstring's definition of CP932. However, the first version of that standard was published in 2014... so it looks like it's not the one.

@alexdowad
Contributor

alexdowad commented Apr 9, 2022

@cmb69, I think you are a Windows man. I'm just wondering if mbstring's definition of CP932 might have been chosen to agree with the Win32 API. Presumably you have a Windows machine with VC++?

Are you able to build and run this program? I don't have a Windows machine to do it on. (Please note, I wrote this code from the Win32 API docs, but can't test it, so it may not work... 😬)

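#include <stdio.h>
#include <windows.h> /* pulls in stringapiset.h together with the WCHAR/CHAR types it needs */

/* Convert one UTF-16 code unit to code page 932 and print the resulting bytes in hex. */
void TryConversion(int wchar)
{
    WCHAR instr[1];
    CHAR outstr[32];
    instr[0] = (WCHAR)wchar;
    /* Flags = 0: WC_NO_BEST_FIT_CHARS is not set, so "best fit" mappings stay enabled. */
    int num_chars = WideCharToMultiByte(932, 0, instr, 1, outstr, sizeof(outstr), NULL, NULL);
    if (num_chars == 0) {
        printf("ERROR\n");
        return;
    }
    for (int i = 0; i < num_chars; i++)
        printf("%x ", outstr[i] & 0xFF);
    printf("\n");
}

int main()
{
    TryConversion(0xA5);   /* U+00A5 YEN SIGN */
    TryConversion(0x203E); /* U+203E OVERLINE */
    return 0;
}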

A web search suggests that in Visual Studio, when creating a new project for this, you would want to choose "Windows Desktop", choose "Windows Desktop Wizard", then in the wizard, pick "Console Application (.exe)" as the "Application Type".

@sj-i
Contributor

sj-i commented Apr 9, 2022

https://gist.github.com/sj-i/6bacf1bef255c096b3cbb78bfc29aa28
I have experimented with this already. So the case about U+00A5 is justified by the implementation of Windows, via its 'best-fit' mapping.

References:

https://web.archive.org/web/20100212163537/http://blogs.msdn.com/michkap/archive/2004/12/14/284838.aspx#332622

Now through the magic of "best fit" mappings (which I will cover another day), U+00a5 also maps to 0x5c on cp 932. It's a one-way mapping, obviously. But it's there.

https://web.archive.org/web/20100212163537/http://blogs.msdn.com/michkap/archive/2004/12/14/284838.aspx#332622

Look at http://microsoft.com/globaldev/reference/dbcs/932.htm which has the official mapping for cp932. It only has the 0x5c to U+005c mapping, as the other (best fit) mapping is not documented; none of the best fit mappings are.

https://web.archive.org/web/20100130054656/http://blogs.msdn.com:80/michkap/archive/2005/09/17/469941.aspx

both U+00a5 and U+20a9 have one-way 'best fit' mappings to 0x5c on their respective code pages.

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

@alexdowad
Contributor

@sj-i Hmm, OK, thanks.

@sj-i
Contributor

sj-i commented Apr 9, 2022

As for U+203E, there should be a different background...

Your finding of the WHATWG encoding standard would be a hint.
https://encoding.spec.whatwg.org/#shift_jis

whatwg/encoding@be441c8
This was added in 2012. But since W3C and WHATWG are standardization organizations, there must have been at least one, probably two, actual implementations that behaved like this beforehand.

I'm currently searching for old implementations of browsers that handle encodings.

BTW, glibc iconv translates U+203E to 0x7E on converting Unicode texts to CP932.
https://gist.github.com/sj-i/c003796281253c41b70a1d9b92de2ea3
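
For readers who want to repeat this check from PHP rather than from the linked gist, here is a small sketch using PHP's iconv extension. The result depends on which iconv library PHP was built against (glibc, GNU libiconv, or another), so no particular output is assumed here. The helper name `iconv_hex` is hypothetical.

```php
<?php
// Convert one UTF-8 character with the iconv extension and return the
// resulting bytes as hex, or null when the iconv library cannot convert it.
function iconv_hex(string $char, string $to): ?string
{
    // @ suppresses the notice iconv() raises for unconvertible input.
    $bytes = @iconv('UTF-8', $to, $char);
    return $bytes === false ? null : bin2hex($bytes);
}

// U+00A5 YEN SIGN and U+203E OVERLINE, the two code points under discussion.
foreach (["\u{00A5}", "\u{203E}"] as $char) {
    var_dump(iconv_hex($char, 'CP932'));
}
```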

http://www2d.biglobe.ne.jp/~msyk/cgi-bin/charcode/373.html
Maybe this is the reason for the glibc implementation.
(Note: The author of this forum post is the man who wrote the actual glibc code for the CP932 mapping)

glibc iconv adopts UCS normalization, so even conversions between SJIS family and EUC-JP family are done through Unicode.
So additional one-way mappings are required for each encoding for convenience.
I assume a similar reason is there for browser implementations.

@sj-i
Contributor

sj-i commented Apr 9, 2022

Leaving aside the historical truth, the fact that glibc iconv uses these one-way mappings for such practical reasons may in itself justify its implementation in PHP. This could fit nicely into the code comments^^

@alexdowad
Contributor

Speaking of iconv, I reached out to the maintainer of GNU libiconv earlier today to see if he has any insight on the background of these extra one-way mappings from Unicode to CP932. Haven't heard back from him yet.

@sj-i
Contributor

sj-i commented Apr 9, 2022

I've also found the Twitter account of the original author of the patch that added support for CP932 to glibc iconv, and asked him about the intent behind the U+203E mapping. If I get a response, I will share it here.

@sj-i
Contributor

sj-i commented Apr 11, 2022

@alexdowad
https://twitter.com/tree3yama/status/1513172414460743683
I got the answer from Moriyama-san (who added support for CP932 to glibc iconv)!

Those one-way mappings in glibc iconv were added to allow practical conversion between legacy encodings.
And from what I can find, several other systems, such as Java, have similar mappings for the same reason.
I don't have time right now, but I can put together more detailed information if needed.

@alexdowad
Contributor

@sj-i Thanks very much for this information. Just trying to process this...

The reply from Moriyama-san which you kindly linked to mentions SJIS → CP932, EUC-JP → CP932, and ISO-2022-JP → CP932 conversions. The implication is that in all 3 of the source encodings, something maps to U+203E. Then, since conversions from one legacy encoding to another go through Unicode, when converting Unicode → CP932, we convert U+203E to 0x7E so that the something mentioned above will map to CP932 0x7E.

At least in mbstring, SJIS 0x7E maps to U+203E. So mapping U+203E to CP932 0x7E means that SJIS 0x7E will convert to CP932 0x7E.

However, in both mbstring and iconv, EUC-JP 0x7E and ISO-2022-JP 0x7E do not map to U+203E. So I am still trying to figure out why Moriyama-san said this mapping is related to EUC-JP → CP932 and ISO-2022-JP → CP932 conversions...
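
The pivot-through-Unicode conversion discussed in this comment can be traced with a short sketch. The helper name `convert_via_unicode` is hypothetical, and the intermediate code point and final bytes depend on the PHP version's mapping tables, so no output is claimed for the SJIS 0x7E case.

```php
<?php
// Trace a legacy-to-legacy conversion through its Unicode pivot:
// source bytes -> UTF-8 -> target bytes. Reports the first pivot
// code point and the final bytes as hex.
function convert_via_unicode(string $bytes, string $from, string $to): array
{
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $from);
    $out = mb_convert_encoding($utf8, $to, 'UTF-8');
    return [
        'pivot' => sprintf('U+%04X', mb_ord($utf8, 'UTF-8')),
        'bytes' => bin2hex($out),
    ];
}

// SJIS 0x7E is the byte whose interpretation (tilde or overline) is in question.
print_r(convert_via_unicode("\x7E", 'SJIS', 'CP932'));
```

If the pivot is U+203E and the target table has no mapping for it, the byte would not survive the round trip, which is the practical motivation for the extra one-way mapping.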

@sj-i
Contributor

sj-i commented Apr 11, 2022

@alexdowad
Ah, that point needs a little more context. To summarize, there is no direct relationship with the one-way mapping for this specific code point.

Moriyama-san was working in several OSS communities at the time to make each OSS's character code processing interoperable with each other.

There is a problem with practicality when a character that Japanese users generally regard as "the same" is changed to another character when converted via Unicode.

He proposed the addition of several additional one-way mappings for the then-current ISO 2022 JP and EUC-JP encodings, as well as for ShiftJIS and its derivatives.
Examples of such efforts can be found at the following URLs:

http://www.mysql.gr.jp/mysqlml/mysql/msg/12442
https://ja.osdn.net/projects/legacy-encoding/

Although not all of these efforts actually bore fruit, he chose to include ISO-2022-JP and EUC-JP in his mention of the topic because similar efforts were being made at the time for code points other than U+203E and U+00A5.

@alexdowad
Contributor

@sj-i I am sorry I have delayed a bit in responding. Thanks for your exhaustive research.

I think we are almost ready to update both documentation for core developers (i.e. code comments) and the PHP manual here. To summarize @sj-i's findings, I think we can characterize mbstring's CP932 implementation as follows:

• It follows the "best fit" CP932 mappings from the Microsoft Windows API...
• ...together with an extra mapping which improves the quality of conversions from SJIS to CP932.
• It is also compatible with the WHATWG standard for "Shift_JIS".

Before we put that information in the PHP manual, we need to confirm that those statements are true. For example, @sj-i kindly checked the Win32 mappings for U+203E and U+00A5, but we haven't checked all Win32 "best fit" mappings from Unicode → CP932 and CP932 → Unicode.

I would also like to check all Shift_JIS mappings in the JavaScript Encoding API (i.e. WHATWG), to confirm whether we are truly compatible with them or not. I can do this when I have some time (probably in the next few days).

For core developers, we can add the following information:

• The 'extra mapping' is for U+203E.
• mbstring's implementation of CP932 is also compatible with iconv's.
• We can also include a small amount of historical background information.

Again, we need to check if the statement about iconv is true or not. If it is true, it is good information for core developers. But for users of PHP, I don't think most of them really care about compatibility with iconv. Hence, I don't think that statement needs to go into the PHP manual. (If you disagree, please mention.)

NOW! We have come a long way here, but we still have not 100% reached where we need to go.

The remaining piece of critical information which needs to be documented, both for core developers and for users of PHP, is why we have the so-called SJIS-win encoding. We already know that this encoding matches the original implementation of CP932 which was included in libmbfl when it was integrated into PHP. Further, when CP932 was adjusted, @moriyoshi retained the previous implementation as SJIS-win, presumably for backwards compatibility.

That is all well and good, but do we really want to tell our users that we are currently keeping SJIS-win purely for BC? If so, that implies that nobody should be using it for new, 'greenfield' development. But I think it is useful for more than just BC.

SJIS-win maps U+203E and U+00A5 to fullwidth, JIS X 0208 characters. For people who are converting arbitrary Unicode text to SJIS/CP932, that could help to avoid situations where a source text contains ¥, but after conversion, the user sees \ instead.

I think we can summarize the recommended uses of these mapping tables as follows:

• If you are converting SJIS text to CP932, use CP932. (Should we mention this? Is it something that people are actually likely to do?)
• If you are exchanging CP932 data with Microsoft Windows software (such as Microsoft Office), especially if you receive data which was exported from MS Office and then emit data which will be imported back into MS Office, use CP932 for maximum compatibility.
• If you will be processing SJIS/CP932 data in a web browser using JavaScript, use CP932 for maximum compatibility between conversions performed in the browser and those performed in PHP.
• If you need to convert other, generic Unicode text to CP932, use SJIS-win.

Does that sound right?

@sj-i
Contributor

sj-i commented Apr 16, 2022

@alexdowad
You have probably understood almost all the details necessary about this case, but I would like to emphasize the following point to be sure.
It should be noted that what is important here is not conformance to any written standard or to a specific implementation in another product, but somewhat vague customs and BC.
Until 8.0, mbstring behaved differently from the current implementation. The reason behind this is not a formal standard, but an effort made for the convenience of users. Some code points have had the same mappings across multiple OSS projects as a long-standing custom. I've found yet another example in the JDK bug tracker.
https://bugs.openjdk.java.net/browse/JDK-4361835

So, as we found, there are parts that are currently consistent with the MS best-fit mapping, the WHATWG specification, and the behavior of other converters. But even if there are differences with them in the implementation, I don't think they should be "fixed" to eliminate the differences, at least in minor versions, unless there is an obvious error that is to the detriment of the user.

The long-standing behavior of PHP's mbstring itself is now part of the customs that our users rely on.

The advice we can surely give to future core developers, in general, is as follows

  • In many cases, there is no standard to follow for mapping between Japanese legacy encodings and Unicode.
  • In some cases, it is not clear which way of mapping is the "correct" one.
  • Mappings that have existed for a long time may have users relying on any part of them.
  • If the mapping is to be changed, there must be enough benefit to balance the possible BC breaks.
  • If it is not an obvious bug, it should not be fixed in a minor release.
  • See The return value of mb_list_encodings() doesn't contain 'SJIS-win' #8308 for more detail :)

nobody should be using it for new, 'greenfield' development.

I don't think anybody should use legacy encodings in their 'greenfield' development ^^
Joking aside, for the recommendations you listed, I think the following one should be fixed.

If you need to convert other, generic Unicode text to CP932, use SJIS-win.

"If you need to convert U+00A5 (YEN SIGN) and U+203E (OVERLINE) from Unicode to full-width, use SJIS-win."

I think it is appropriate to explain the behavior just as it is.

@alexdowad
Contributor

alexdowad commented Apr 17, 2022

@sj-i
These are good points and are appreciated. In general, I will certainly try to follow your advice.

Regarding the specific issue under discussion here (SJIS-win vs. CP932), I must honestly say that this discussion has left a lot of question marks in my mind. How we got to the present state of affairs is now fairly clear. The general principles which @sj-i has set forth regarding future development of mbstring, at least as regards legacy Japanese encodings, are clear (and I do agree with all of them). A lot of things have become clear, largely due to @sj-i's research.

The remaining "question marks" are all related to the way forward from here. If I had known in 2020 that there was a difference in behavior between SJIS-win and CP932, I would not have merged these encodings at that time. But I didn't know, and I did merge them, and so here we are in 2022, trying to decide what to do next.

Normally, when a change like this is made unintentionally or due to a misunderstanding, the default response is to roll it back. And we might still go that way. But, there are some unusual things about this situation, which are making it hard for me to feel sure about what to do:

  • PHP 8.1 was released 6 months ago, but at present, we do not have any evidence that any PHP application in production has been broken by the merge of SJIS-win and CP932. Certainly, it is possible that some application, somewhere, might have been broken, and perhaps the developers have already patched it, without reporting the breakage. I wouldn't want to ignore that possibility; but still, for now, we do not have any evidence of actual problems being caused.

  • @zonuexe and @sj-i are obviously both very experienced, knowledgeable, and accomplished Japanese PHP developers, but it seems that neither of these esteemed developers has any reason for preferring SJIS-win or CP932. Certainly, reasons could be invented ex post facto to justify the request to restore SJIS-win; but when this GH issue was opened, neither had a reason.

It is revealing that both gentlemen could experience many years of PHP development without ever finding it necessary to pay attention to the difference between SJIS-win and CP932. Of course, there may be other Japanese PHP developers out there who have practical reasons to care about the difference; but at the moment, we do not have any evidence of such.

  • From a (very limited, incomplete) survey of Japanese-language blog posts on the subject, it did not appear that any of the authors had clear reasons for preferring one of SJIS-win and CP932 over the other. It seems likely that most existing users have just chosen randomly, or by "superstition" (copying what others do, without understanding the reasons).

  • It doesn't seem that other software development platforms have found it necessary or desirable to support two different variants of CP932.

Another point: To me, the maintenance burden of keeping two different CP932 variants in mbstring is not a great concern. My concern is more for the users. I think it is a disservice to PHP users if we keep two different features which work almost the same, but slightly differently, when nobody can provide any reasonable explanation of why things should be that way or how users should choose between them. (The PHP platform may have had such inconsistencies in the past, but they are being gradually smoothed out with each successive release.)

Thought experiment: If we were not working on PHP here, but on Python, Ruby, JavaScript, or some other platform which only supports one variant of CP932 in its standard library, would you recommend that a second variant of CP932 should be added? If not, that suggests there might not be a logical or practical reason to have two variants.

If indeed there is no practical reason to have two different CP932 variants, it is possible that we could still roll the change back for PHP 8.1 (to avoid a BC break), and perform the simplification in a future release. Please note, I am just mentioning this as an idea and am still open to all possible outcomes here.

At this point, it would be very helpful if other core developers can share their views.

Finally, I would like to say that the issue raised in #8281 may be more straightforward than this one, and it is possible we might be able to sort it out more quickly.
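
For readers who validate user-supplied encoding names against mb_list_encodings() (the pattern described in the original report), the following is a minimal sketch of a check that also accepts aliases, so that it gives the same answer whether SJIS-win is a primary name or an alias of CP932. The helper name is_supported_encoding is hypothetical and is not part of mbstring; the sketch assumes PHP 8 with the mbstring extension loaded.

```php
<?php
// Hypothetical helper (not part of mbstring): returns true if $name is
// either a primary encoding name or one of its registered aliases.
function is_supported_encoding(string $name): bool
{
    foreach (mb_list_encodings() as $primary) {
        // Encoding names are compared case-insensitively.
        if (strcasecmp($primary, $name) === 0) {
            return true;
        }
        // mb_encoding_aliases() returns the aliases of a known encoding;
        // the `?: []` guards against a false return on older PHP versions.
        foreach (mb_encoding_aliases($primary) ?: [] as $alias) {
            if (strcasecmp($alias, $name) === 0) {
                return true;
            }
        }
    }
    return false;
}

var_dump(is_supported_encoding('SJIS-win'));
var_dump(is_supported_encoding('CP932'));
```

A check written this way does not depend on which of the two names mbstring treats as primary, which is the detail that changed in PHP 8.1.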

@cmb69
Member

cmb69 commented Apr 18, 2022

> If I had known in 2020 that there was a difference in behavior between SJIS-win and CP932, I would not have merged these encodings at that time. But I didn't know, and I did merge them, and so here we are in 2022, trying to decide what to do next.

It might be reasonable to revert for now, and to merge these encodings for PHP 9.

alexdowad added a commit to alexdowad/php-src that referenced this issue Aug 2, 2022
In e245985, I combined mbstring's "SJIS-win" text encoding
into CP932. This was done after doing some testing which appeared
to show that the mappings for "SJIS-win" were the same as those
for "CP932".

Later, it was found that there was actually a small difference
prior to e245985 when converting Unicode to CP932. The
mappings for the following two codepoints were different:

        CP932  SJIS-win
U+203E  0x7E   0x81 0x50
U+00A5  0x5C   0x81 0x8F

As shown, mbstring's "CP932" mapped Unicode's 'OVERLINE' and
'YEN SIGN' to the ASCII bytes which have conflicting uses in
most legacy Japanese text encodings. "SJIS-win" mapped these
to equivalent JIS X 0208 fullwidth characters.

Since e2459867af was not intended to cause any user-visible
change in behavior, I am rolling back the merge of "CP932"
and "SJIS-win".

It seems doubtful whether these two text encodings should
be kept separate or merged in a future release. An extensive
discussion of the related historical background and
compatibility issues involved can be found in this
GitHub thread:

php#8308
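
The two mappings in the table above can be inspected on any given PHP build with a short script. This is only a sketch: the helper name encoded_hex is hypothetical, and no expected output is shown for the two code points because the bytes produced depend on the PHP version (in particular, on whether SJIS-win and CP932 are merged in that build).

```php
<?php
// Hypothetical helper: convert a UTF-8 string to $encoding and
// return the resulting bytes as uppercase hex.
function encoded_hex(string $utf8, string $encoding): string
{
    return strtoupper(bin2hex(mb_convert_encoding($utf8, $encoding, 'UTF-8')));
}

// The two code points from the table: OVERLINE and YEN SIGN.
$codePoints = ['U+203E' => "\u{203E}", 'U+00A5' => "\u{00A5}"];

foreach (['CP932', 'SJIS-win'] as $encoding) {
    foreach ($codePoints as $label => $char) {
        printf("%-8s %s => %s\n", $encoding, $label, encoded_hex($char, $encoding));
    }
}
```

If both encodings print the same bytes for both code points, the build treats them as one encoding; if they differ as in the table, the two are separate.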
alexdowad added a commit that referenced this issue Aug 16, 2022
@alexdowad
Contributor

This has been resolved.
