Update SJIS-mac mappings to use Unicode codepoints which were added after Unicode 1.0 #10264

Closed
alexdowad wants to merge 2 commits

alexdowad
Contributor

I expect @youkidearitai will be particularly interested in this one?

Probably a lot of the other legacy text encodings supported by mbstring could benefit from similar updates.

@cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

The conversion tables used by mbstring were made a long time ago.
Since that time, a lot of characters have been added to Unicode. This
means that in some cases, we now have Unicode codepoints available
which are a better match for characters in other legacy charsets.

For MacJapanese, there are at least 7 mappings which can be updated
to make use of newer Unicode codepoints:

• 0x869E -> U+213B FACSIMILE SIGN (added in Unicode 4.0)

Previously, we used a special "Apple transcoding hint" codepoint
U+F861 to indicate that the following 3 codepoints should be treated
as a single character. Then we used the ASCII letters 'F', 'A', 'X'.

However, U+213B explicitly includes all three of the Latin letters 'FAX'
as a single character.

• 0x86D3 -> U+27A1 BLACK RIGHTWARDS ARROW (Unicode 1.1)
• 0x86D4 -> U+2B05 LEFTWARDS BLACK ARROW (added in Unicode 4.0)
• 0x86D5 -> U+2B06 UPWARDS BLACK ARROW (added in Unicode 4.0)
• 0x86D6 -> U+2B07 DOWNWARDS BLACK ARROW (added in Unicode 4.0)

Previously, we used a different special "Apple transcoding hint"
codepoint U+F87A to indicate that the color palette should be reversed
when rendering the preceding codepoint, to make white color into black
and black color into white. This was done because there were no
Unicode codepoints at the time which specifically represented
black-colored arrows. But... now there are.

• 0xEB6D -> U+FE47 PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET
(added in Unicode 4.0)
• 0xEB6E -> U+FE48 PRESENTATION FORM FOR VERTICAL RIGHT SQUARE BRACKET
(added in Unicode 4.0)

Previously, yet another special "Apple transcoding hint" codepoint
U+F87E was used to indicate that a 'vertical form' should be used
when rendering the preceding codepoint.

However, we now have codepoints which specifically represent vertical
forms for square brackets.

Although we are now making use of newer codepoints to reduce the number
of cases where we need to map one MacJapanese character to a sequence
of several codepoints, we still convert the old sequence of codepoints
back to the same MacJapanese character (for backwards compatibility).

There are likely other cases where we could make good use of newer
Unicode codepoints, both in MacJapanese and in other legacy text
encodings. As these come to my attention, I would like to continue
modernizing the mappings used by mbstring.
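
The backwards-compatibility rule described above can be illustrated with a small sketch. This is Python written for illustration only, not mbstring's actual C implementation; the names `DECODE`, `ENCODE`, `decode_char` and `encode_char` are hypothetical, and only the 0x869E mapping from this commit message is included.

```python
# Hypothetical sketch; mbstring's real conversion tables are written in C
# and cover the whole encoding, not a single character.

# Decoding: MacJapanese code -> Unicode text (the proposed newer codepoint).
DECODE = {
    0x869E: "\u213B",  # FACSIMILE SIGN
}

# Encoding accepts the newer codepoint...
ENCODE = {text: code for code, text in DECODE.items()}
# ...and, for backwards compatibility, the older sequence: transcoding hint
# U+F861 followed by the ASCII letters 'F', 'A', 'X'.
ENCODE["\uF861FAX"] = 0x869E


def decode_char(code: int) -> str:
    """Return the Unicode text for one MacJapanese code."""
    return DECODE[code]


def encode_char(text: str) -> int:
    """Return the MacJapanese code for a codepoint or a legacy sequence."""
    return ENCODE[text]


if __name__ == "__main__":
    print(hex(encode_char("\u213B")))     # newer single codepoint
    print(hex(encode_char("\uF861FAX")))  # older four-codepoint sequence
```

In this sketch both spellings encode to the same MacJapanese code, while decoding always produces the newer single codepoint.
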

Use another 7 newer Unicode codepoints which match legacy SJIS-mac
characters better than what we are currently mapping them to.

• 0x8591 -> U+1F100 DIGIT ZERO FULL STOP (added in Unicode 5.2)

We previously used an "Apple transcoding hint" to indicate that the
next two codepoints should be treated as a single character, then the
ASCII characters '0', '.'

• 0x8645 -> U+1F13C SQUARED LATIN CAPITAL LETTER M (added in Unicode 6.0)
• 0x864B -> U+1F136 SQUARED LATIN CAPITAL LETTER G (added in Unicode 6.0)

Legacy Japanese charsets include various symbols inside of squares;
Unicode 6.0 added all the letters of the Latin alphabet inside squares
specifically to allow lossless mapping to and from such charsets.

• 0x86CE -> U+21F5 DOWNWARDS ARROW LEFTWARDS OF UPWARDS ARROW
(added in Unicode 3.2)

We previously used an "Apple transcoding hint" to indicate that the
next two codepoints should be treated as a single character, then the
Unicode codepoints U+2193 DOWNWARDS ARROW, U+2191 UPWARDS ARROW.

• 0xEB41 -> U+FE11 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
• 0xEB42 -> U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
• 0xEB63 -> U+FE19 PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS
(all added in Unicode 4.1)

We previously used a codepoint such as U+3001 IDEOGRAPHIC COMMA,
followed by an "Apple transcoding hint" to indicate that a vertical
form should be used when rendering the preceding codepoint.

Now we have codepoints specifically for the vertical form of these
symbols.
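
The codepoints listed in this commit message can be cross-checked against a Unicode character database. The following is a minimal sketch using Python's standard `unicodedata` module; the `PROPOSED` table and the `describe` helper are illustrative names, and the sketch assumes a Python build whose bundled Unicode data includes these characters (any recent Python 3 should).

```python
import unicodedata

# Proposed mappings copied from the commit message above:
# MacJapanese code -> single Unicode codepoint.
PROPOSED = {
    0x8591: 0x1F100,
    0x8645: 0x1F13C,
    0x864B: 0x1F136,
    0x86CE: 0x21F5,
    0xEB41: 0xFE11,
    0xEB42: 0xFE12,
    0xEB63: 0xFE19,
}


def describe(codepoint: int) -> str:
    """Format a codepoint as 'U+XXXX NAME' using the bundled Unicode data."""
    return f"U+{codepoint:04X} {unicodedata.name(chr(codepoint))}"


if __name__ == "__main__":
    for code, codepoint in PROPOSED.items():
        print(f"0x{code:04X} -> {describe(codepoint)}")
```

Running the script prints one line per mapping, which makes it easy to compare the formal character names with the descriptions given here.
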
Member

@Girgias Girgias left a comment

Makes sense to me

@youkidearitai
Contributor

@alexdowad I have a question. Where does MacJapanese-SJIS.txt come from? On my Mac, this PR displays a different character.
For example, with this PR the output is ︙, but until now the output was ….
[Screenshot 2023-01-09 19 33 22]

Isn't it ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT?

@alexdowad
Contributor Author

> @alexdowad I have a question. Where does MacJapanese-SJIS.txt come from? On my Mac, this PR displays a different character. For example, with this PR the output is ︙, but until now the output was ….

Thanks for sharing that example.

I believe that in MacJapanese 0xEB63 is supposed to be a "vertical form" of the ellipsis, so from looking at the screenshot you kindly shared, I think the glyph which shows is actually more correct than the other one.

> Isn't it ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT?

Yes, that should be the one.

@youkidearitai
Contributor

I converted SJIS-mac 0xEB63 to UTF-16 using the TextEdit app, and the output was U+2026+U+F87E.

$ xxd -g 2 eb63_to_utf16.txt
00000000: fffe 7300 7400 7200 6900 6e00 6700 2800  ..s.t.r.i.n.g.(.
00000010: 3200 2900 2000 2200 2620 7ef8 2200 0a00  2.). .".& ~."...
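
The dump above can be decoded mechanically to see which codepoints it contains. This is a minimal Python sketch, assuming the bytes are exactly those shown in the xxd output; the variable names are illustrative only.

```python
# Bytes copied from the xxd dump above; bytes.fromhex() ignores the spaces.
DUMP = (
    "fffe 7300 7400 7200 6900 6e00 6700 2800 "
    "3200 2900 2000 2200 2620 7ef8 2200 0a00"
)

raw = bytes.fromhex(DUMP)
# The leading FF FE byte-order mark makes the "utf-16" codec read little-endian.
text = raw.decode("utf-16")
# The converted character(s) sit between the double quotes in the dumped text.
quoted = text.split('"')[1]
codepoints = [f"U+{ord(ch):04X}" for ch in quoted]
print(codepoints)
```

Under these assumptions the quoted part of the dump decodes to U+2026 followed by U+F87E.
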

For reference, the comment in https://unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT describes this entry as the vertical form for HORIZONTAL ELLIPSIS:

0xEB63	0x2026+0xF87E	# vertical form for HORIZONTAL ELLIPSIS
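
A line in this format can be split into its parts with a few lines of code. The sketch below is in Python and assumes every data line looks like the one quoted above (a hex code, a `+`-joined list of hex codepoints, then a `#` comment); the function name `parse_mapping_line` is hypothetical, and header or comment-only lines in the real file are not handled.

```python
def parse_mapping_line(line: str):
    """Split one mapping line into (code, codepoints, comment)."""
    data, _, comment = line.partition("#")
    fields = data.split()
    code = int(fields[0], 16)  # base 16 accepts the "0x" prefix
    codepoints = [int(part, 16) for part in fields[1].split("+")]
    return code, codepoints, comment.strip()


# The line quoted above, with tab-separated columns.
LINE = "0xEB63\t0x2026+0xF87E\t# vertical form for HORIZONTAL ELLIPSIS"

if __name__ == "__main__":
    code, codepoints, comment = parse_mapping_line(LINE)
    print(hex(code), [f"U+{cp:04X}" for cp in codepoints], comment)
```
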

However, it looks like U+2026+U+F87E does not mean the vertical form for HORIZONTAL ELLIPSIS.
As another example, converting with mac_japanese from Ruby also gives U+2026+U+F87E.

$ ruby -e 'require "mac_japanese"; x = "\xeb\x63"; x.force_encoding(Encoding::MacJapanese); p MacJapanese.to_utf8(x)'
"…"

[Screenshot 2023-01-09 22 03 01]

Therefore, based on these two examples, I think the result for 0xEB63 should be U+2026+U+F87E, and changing the SJIS-mac mappings is unnecessary.

@alexdowad
Contributor Author

> Therefore, based on these two examples, I think the result for 0xEB63 should be U+2026+U+F87E, and changing the SJIS-mac mappings is unnecessary.

Fair enough. It does raise the question of what 0xEB63 was intended to represent in MacJapanese (it obviously wasn't supposed to be "an ellipsis followed by a white rectangle" like …), but anyway, if there is uncertainty on this point, I can take it out of this PR.

What do you think about the other mappings?

@youkidearitai
Contributor

> What do you think about the other mappings?

I looked for differences between PHP 8.0 and this PR, and created a shell script to output all of the characters that differ.

It seems any meaning characters that all characters. With my macOS font (Monaco), no rectangle is displayed.
[Screenshot 2023-01-09 23 37 06]
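
A comparison of this kind can be sketched as follows. This is not the shell script mentioned above; it is a hypothetical Python illustration, with `OLD` and `NEW` holding only two of the changed entries discussed in this thread plus one entry (0x8140, the ideographic space) that is assumed to be unchanged.

```python
# Hypothetical sample tables: MacJapanese code -> Unicode text.
OLD = {
    0x8140: "\u3000",        # IDEOGRAPHIC SPACE (assumed unchanged)
    0x869E: "\uF861FAX",     # transcoding hint + 'F', 'A', 'X'
    0xEB63: "\u2026\uF87E",  # HORIZONTAL ELLIPSIS + vertical-form hint
}
NEW = {
    0x8140: "\u3000",
    0x869E: "\u213B",        # FACSIMILE SIGN
    0xEB63: "\uFE19",        # PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS
}


def fmt(text: str) -> str:
    """Render a string as 'U+XXXX+U+YYYY' for easy comparison."""
    return "+".join(f"U+{ord(ch):04X}" for ch in text)


def changed(old: dict, new: dict) -> dict:
    """Return {code: (old_text, new_text)} for codes whose mapping differs."""
    shared = sorted(old.keys() & new.keys())
    return {c: (old[c], new[c]) for c in shared if old[c] != new[c]}


if __name__ == "__main__":
    for code, (before, after) in changed(OLD, NEW).items():
        print(f"0x{code:04X}: {fmt(before)} -> {fmt(after)}")
```
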

@alexdowad
Contributor Author

Thanks for the testing.

> It seems any meaning characters that all characters.

Sorry, I'm not very sure what this means, but would like to understand more. Could you kindly explain a bit more? (If English is difficult, I can also read Japanese.)

@youkidearitai
Contributor

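@alexdowad [Translated from Japanese] Thank you. In that case, I will take you up on your kind offer and explain in Japanese.
(I'm sorry for writing in Japanese. Explaining the history of MacJapanese in English was very difficult.)

The Wikipedia article on MacJapanese may be a helpful reference. Because the Unicode codepoints used in this pull request are not compatible with the existing ones, text ends up being converted to different characters.

Looking into what MacJapanese is, I found the theory that, back when Macs were used in Japanese print shops, the shops assigned their own user-defined characters (gaiji), and Apple later incorporated them. The era when Macs were used in Japanese print shops would be roughly the 1990s to the 2000s.

MacJapanese can contain characters that are represented using multiple Unicode codepoints. I believe there are almost no fonts today that can reproduce and display these characters satisfactorily, and restoring them with different Unicode codepoints could break strings of some sort. That is why I wrote the sentence "It seems any meaning characters that all characters."

Therefore, since this could be a backward-compatibility break (BC break), it is desirable to keep the existing mappings as far as possible.

Reference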

@alexdowad
Contributor Author

@youkidearitai Thanks for your research.

Brief summary for others who are following this discussion in English: @youkidearitai did some research on the background of MacJapanese encoding. He says that it dates back to the time when classic Macs were used in Japanese printing companies, around the 1990s and early 2000s.

@youkidearitai further says that fonts which can faithfully recreate the same characters which were used on those classic Macs may not be available on modern platforms, and if we try to recreate them using other Unicode codepoints, it might cause breakage, so he would prefer that we avoid changing the mappings as much as possible.

@alexdowad
Contributor Author

My response is as follows.

I think there is a very simple way to determine whether any Unicode codepoints in the current Unicode standard can faithfully represent characters used on classic Macs or not: find someone who has a classic Mac running a Japanese build of MacOS, get screenshots showing how the characters in question appeared, and compare them to the glyphs which are available on modern personal computers.

I don't know if I will actually do that or not. If I have time, I might try.

@ranvis
Contributor

ranvis commented Apr 20, 2024

@alexdowad
There seem to be some points missing: some Unicode text that used to map to "known MacJapanese" can no longer be converted back to MacJapanese with the updated mbstring converter, which is a BC break. *1
Also, while I admit that the modified map is cool, labeling this invented mapping as MacJapanese is dubious. It almost creates another layer of the "people sometimes call this encoding Shift_JIS, but it is actually a variant" problem. This is a new incompatibility with other converter programs.

*1: Re @youkidearitai's comment "to break strings of some sort"

By the way, from a consistency standpoint, whether the glyph shapes are identical is not the main problem. The mappings for those glyphs were defined as such. It is unfortunate that many fonts today may not have such ligatures defined in them, which makes it impossible to display the sequences correctly even on macOS.

That being said, although this comment is late, I've published the SJIS glyph table as displayed on Mac OS 9: https://github.com/ranvis/sjis-pict/wiki/samples

@alexdowad
Contributor Author

@ranvis Thanks for your comments. If it wasn't clear, this PR was not merged and there are currently no plans to merge it.

Let me close it now.

@alexdowad alexdowad closed this Apr 20, 2024
4 participants