Kanbun marks (U+3190...U+319F) not differentiated from normal CJK characters #159

Closed
verdy-p opened this issue Jul 13, 2019 · 24 comments


@verdy-p

verdy-p commented Jul 13, 2019

The "Kanbun" sinographic marks (U+3190...U+319F) are rendered the same as plain CJK characters.

They should be rendered as superscripts (centered in the top half for the first one, the "tateten", or occupying the top-left quarter of the ideographic box for the 15 other "kaeriten") instead of reusing the standard CJK glyphs, which should then be rescaled to 50% to 60% of their width or height and repositioned, with slightly simplified strokes to improve readability.

But it seems that you have just mapped the same glyphs to both the full-square ideographs and the Kanbun marks.

Note: these are NOT "compatibility" characters but really they are annotation marks. Rendering them as standard characters just creates ambiguity/confusion.

This affects ALL existing CJK font families (Noto Sans CJK, Noto Serif CJK), in the four supported language groups (JP, KR, TC, SC), and in all styles and weights.

For reference, look at this chart:
https://www.unicode.org/charts/PDF/U3190.pdf

@kenlunde

Um, no. Hundreds and possibly thousands of existing Japanese fonts that are based on Adobe-Japan1-5 or higher implement these characters using glyphs that may appear to be full-size, but are expected to be reduced in size when used for their intended purpose.

This is no different than ruby whose glyphs are provided at full-size, but are expected to be shrunk to an appropriate size when used for their intended purpose.

If you've read JIS X 4051, which is referenced in Section 18.1 of the Core Specification, you'd understand this.

@verdy-p
Author

verdy-p commented Jul 13, 2019

Ruby is a markup system, a style. But these characters are encoded in Unicode as plain-text where they are described as being superscript. Their reference chart is also showing them only as superscript.
You don't need any style to insert these characters between two ideographs in a row and then see the two different sizes and the spacing gap left to the right of the kanbun mark.
I don't see that at all as being "ruby", they are just marks.
If you use the Kanbun "man" mark after the "man" sinographs, they should be clearly different.
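
As a rough illustration of the plain-text side of this point, the sketch below compares the Kanbun "man" mark with the "man" ideograph using Python's standard `unicodedata` module. The constant names and the `fold` helper are illustrative assumptions, not something from this thread, and the sketch says nothing about how a font draws either character.

```python
import unicodedata

# Hypothetical names chosen for this sketch.
MAN_MARK = "\u319F"       # IDEOGRAPHIC ANNOTATION MAN MARK
MAN_IDEOGRAPH = "\u4EBA"  # CJK unified ideograph for "man"


def fold(text: str, form: str) -> str:
    """Normalize text to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


# Distinct code points in plain text.
print(MAN_MARK == MAN_IDEOGRAPH)
# Canonical normalization keeps them distinct.
print(fold(MAN_MARK, "NFC") == MAN_IDEOGRAPH)
# Compatibility normalization folds the mark into the ideograph.
print(fold(MAN_MARK, "NFKC") == MAN_IDEOGRAPH)
```

The two characters only become identical under compatibility normalization (NFKC/NFKD); whether their glyphs should also look different is the question debated in the rest of this issue.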

@kenlunde

Still no. Unicode explicitly refers to JIS X 4051, which states that the glyphs for Kanbun are to be scaled to one-half size when used. You're conflating Ruby as a markup system with the additional need to scale its glyphs to one-half size, and it is the latter requirement that I am comparing to Kanbun.

Whether the glyphs for Kanbun are exactly the same as their corresponding ideographs depends entirely on the font.

@kenlunde

kenlunde commented Jul 13, 2019

I would also like to add that U+FE45 ﹅ SESAME DOT and U+FE46 ﹆ WHITE SESAME DOT in the CJK Compatibility Forms block require similar treatment according to JIS X 4051, meaning that they are reduced to half size. These are among a small number of 圏点 (kenten) characters.

EDITED TO STATE TO IGNORE: I am planning to submit feedback for UTC #160 that the "<super>" portion of their annotation be removed, along with providing an alternate font that more accurately reflects how these characters are implemented in virtually all fonts.

@verdy-p
Author

verdy-p commented Jul 13, 2019

Sorry, but only YOU are making this conflation. I've never spoken about Ruby. The glyphs are to be scaled to 1/2 size when used, INCLUDING in plain text, for which they were encoded in Unicode.
This means that the glyphs are already reduced to 1/2 size in fonts.
Ruby can be applied to any other character (not these Kanbun), which it will then reduce. But the Kanbun are already "ruby-like" and cannot be used in Ruby (well, they could, but then they would be reduced again).

These Kanbun are not much different from other encoded superscripts like "²", which is pre-reduced and where the superscript is also part of the identity, just as the superscript is part of the Kanbun identity.

Unicode would have never encoded these Kanbun characters if they were simple duplicates of the existing full-size characters.

And the Unicode charts correctly show them PRE-REDUCED, so the "<super>" portion in the UCD is accurate (and it is normative! You cannot remove it because that would violate the stability of normalization!)
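
For readers who want to see the "<super>" tag being discussed, here is a minimal sketch that reads it from the Unicode Character Database data bundled with Python. The helper name `compat_tag` is an assumption made for this example; only `unicodedata.decomposition` and `unicodedata.name` are standard library calls.

```python
import unicodedata


def compat_tag(ch: str) -> str:
    """Return the compatibility formatting tag of ch (e.g. '<super>'), or '' if none."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        # The tag is the first space-separated field of the decomposition mapping.
        return decomp.split(" ", 1)[0]
    return ""


# SUPERSCRIPT TWO and two Kanbun marks, printed with their raw UCD mapping.
for ch in ("\u00B2", "\u3192", "\u319F"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.decomposition(ch)}")
```

All three characters carry the same `<super>` tag in their decomposition mapping. The tag describes the mapping only; it does not by itself state how a font must size or position the glyph, which is the point under dispute here.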

@kenlunde

The characters in question date back to the very beginning of Unicode, so they were not really added, but were there from the beginning. There are plenty of characters in Unicode that were there from the beginning and for which not everything was known about how they should be implemented.

Consider U+3031 〱 VERTICAL KANA REPEAT MARK and U+3032 〲 VERTICAL KANA REPEAT WITH VOICED SOUND MARK whose "implemented as glyphs that are two-em tall" annotation was only recently added to the code charts (at my suggestion). These characters were in Unicode from the very beginning.

Anyway, good luck in convincing all Japanese type foundries to change the glyphs for these characters, which affects several hundred Japanese fonts.

@verdy-p
Author

verdy-p commented Jul 13, 2019

Good luck then with your intent to "remove" the "<super>" portion of the UCD; it will be consistently rejected, as this property is frozen. The charts published since the beginning by Unicode and ISO are also accurate, but they have not been correctly applied when creating Japanese fonts with Unicode mappings.
Note that these fonts, with their legacy JIS X mappings, don't have to be changed at all. What is wrong is the Unicode mapping in these fonts, which MUST NOT duplicate the same glyph ID for unrelated UCS characters where these glyphs MUST be pre-reduced to superscripts.
So these Japanese foundries (and others) were wrong. This includes other non-Noto fonts, including other fonts from Adobe and Microsoft. I think that this bug should be discussed on the Unicode mailing list, because if this is to be an "exception", then it should be documented so that renderers will do the correct thing (i.e. apply a reduction automatically to these characters, even if they are not styled as Ruby or superscript) when they use the Unicode mapping and not the legacy JIS X mappings.

@kenlunde

After looking at the UCD, I agree that changing the property is a non-starter. I should have checked before making such a suggestion.

BTW, these characters are not included in any legacy JIS standard, except for JIS X 0221 that is a clone of ISO/IEC 10646. Also, while the glyphs may look the same as their corresponding ideographs, their GIDs are unique. Some fonts treat them as generic, meaning that they do not vary in weight like their corresponding ideographs. Some fonts give them weight-specific treatment. Kozuka Mincho and Hiragino are examples of the former treatment, and Kozuka Gothic, Source Han, and Noto CJK are examples of the latter treatment. I feel that the former treatment is correct.

I will be at UTC #160 later this month, and will raise this as an issue. Fixing hundreds of Japanese fonts is a non-starter.

@macnmm

macnmm commented Jul 13, 2019

From a renderer’s perspective, pre-sizing the glyphs differently so they occupy some fraction of the full embox would make it more difficult to size them exactly as the typographer specs it, I would think. Superscript and subscript sizes are customizable and so choosing one (in the font) to fit all cases of kanbun notation could be problematic. The renderer is responsible for size and placement relative to the font embox, not relative to the body text embox (as it would be if ruby and kenten and kanbun and other odd-sized glyphs were pre-shrunk by the font designer).
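
The arithmetic behind this concern can be sketched as follows. This is only an illustration under stated assumptions: the function name `renderer_scale` is hypothetical, the 0.5 target comes from the one-half size mentioned earlier in this thread, and 0.6 is one of the pre-reduction values suggested in the original report.

```python
def renderer_scale(target_ratio: float, baked_in_ratio: float = 1.0) -> float:
    """Scale a renderer must apply so a mark ends up at target_ratio of the body em box.

    baked_in_ratio is the (hypothetical) reduction already drawn into the glyph:
    1.0 for a full-size glyph, less than 1.0 for a glyph pre-shrunk by the font designer.
    """
    if baked_in_ratio <= 0:
        raise ValueError("baked_in_ratio must be positive")
    return target_ratio / baked_in_ratio


# Full-size glyph: the renderer applies the typographer's ratio directly.
print(renderer_scale(0.5))
# Glyph pre-shrunk to 60%: the renderer must first know and undo that 60%.
print(renderer_scale(0.5, 0.6))
```

With a full-size glyph the renderer only needs the target ratio. With a pre-shrunk glyph it also needs the font's baked-in ratio, which is the extra dependency described above.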

@verdy-p
Author

verdy-p commented Jul 14, 2019

> BTW, these characters are not included in any legacy JIS standard, except for JIS X 0221 that is a clone of ISO/IEC 10646.

This means that there are only proprietary implementations by some font foundries (but outside Japan, the usage of these font positions is really rare, as the Japanese language is almost unknown elsewhere, because of the difficulty of the language and of its complex script, whose phonographic part is ignored and whose glyphs are also modified in many places, making the Kanji very difficult to use and learn, even for Japanese people themselves, much more than simplified or traditional Hanzi characters are for Chinese speakers; the Han characters were so difficult to use for Vietnamese, and so poorly adapted to their language, that they abandoned the old "Chữ Nôm" script).

But the real standard is what is set in Unicode and ISO/IEC 10646 (and their charts give them a strong identity). Some proprietary font vendors simply made errors when they started to adapt their fonts to the Unicode mapping: maybe this was good for their CID-keyed mapping, but the conversion of old CID keys (or private-use proprietary extensions in JIS) to Unicode, when these characters were finally encoded in the UCS, was really incorrect (and this was not so many years ago: fonts with Unicode mappings for these Kanbun are still very young).

This is no different from other pre-encoded superscripts like "²": fonts have to provide a superscript glyph for them (they can adopt the metrics they want, not necessarily "half-width" or fitting exactly in a quarter of the ideographic cell), but they must respect the identity.

No additional styling is needed: in plain text these Kanbun should still be superscripts. In a rich-text format, you would not even use these characters; you would apply superscript or ruby styling to the normal characters.

> I will be at UTC #160 later this month, and will raise this as an issue. Fixing hundreds of Japanese fonts is a non-starter.

@verdy-p
Author

verdy-p commented Jul 14, 2019

> From a renderer’s perspective, pre-sizing the glyphs differently so they occupy some fraction of the full embox would make it more difficult to size them exactly as the typographer specs it, I would think. Superscript and subscript sizes are customizable and so choosing one (in the font) to fit all cases of kanbun notation could be problematic. The renderer is responsible for size and placement relative to the font embox, not relative to the body text embox (as it would be if ruby and kenten and kanbun and other odd-sized glyphs were pre-shrunk by the font designer).

Incorrect! The font designer still defines the metrics to adopt for the Kanbun. It's not up to the renderer to define them, except to synthesize the Kanbun if the font does not provide glyphs for them.

And this will never happen when styling rich text with superscripts or ruby: such text will use the standard base characters, and then styles will apply to them (superscript/subscript variants of these base glyphs may be looked up in "OpenType features" of the same font, but if the feature is unimplemented or does not map these variants, the renderer will synthesize these styles using the glyphs mapped for the base characters in the "default feature" mapping, and will never need to use the Kanbun in the base mapping of the same font). So there's no metric issue at all.

@kenlunde

The point that @macnmm is making is that Kanbun layout is complex, which explains why Section 5 of JIS X 4051:2004 (reconfirmed on 2018-10-22) spans 10 pages (pp 35 through 44), and describes special layout requirements that go above and beyond plain text. Its use is also relatively rare, which probably explains why JLREQ didn't take it on.

In any case, I submitted public feedback via Unicode's Contact Form, which included a link to this discussion, so that it can be discussed during UTC #160 later this month.

@kojiishi

kojiishi commented Jul 15, 2019

I don't have an opinion on how to define them in TUS/UCD, but I agree with @macnmm on the renderer perspective. Pre-sizing in fonts will make things more difficult for renderers, at least until we have a good Kanbun spec for OpenType.

@kenlunde

@JPRidgeway Note the link directly above to the Source Han Sans issue that you needlessly opened, which means that there is no need to add verbatim what you wrote there.

@kenlunde

@marekjez86 or @davelab6: You can safely close this issue. The only action, which I have recorded in Source Han Sans Issue #205 and Source Han Serif Issue #36, is to make the glyphs for U+3191 ㆑ through U+319F ㆟, uni3191 through uni319F, generic in terms of weight.

The feedback that I submitted on 2019-07-15, which was included in L2/19-272, was discussed at length during last week's UTC meeting. The conclusion was that the superscript property does not require the glyphs to be preshrunk or positioned in a particular way. In retrospect, a property unique to these characters, perhaps called kanbun, would have been better. The property assignment was made before the implementation details were fully understood. Also, as a member of the Unicode Editorial Committee, I was given an Action Item to clarify this in the section of the Core Specification that describes the Kanbun block. This will be reflected in Unicode Version 13.0.

@verdy-p
Author

verdy-p commented Aug 3, 2019

@kenlunde
Note that the UTC document L2/19-272 only presents your own question and your comments, but nothing about any reply made by the UTC. So you gave your opinion, stating that you would just like not to vary the glyphs by weight (but other fonts have varied them by weight), and your intent not to vary them in terms of size or positioning.

We're still waiting for an opinion other than yours. And there's NOTHING in the current Unicode 13.0 alpha version, so this has most probably not been discussed or decided.

Closing this issue then seems VERY premature, as the "conclusion" you state is invisible for now.
Maybe we have to wait for another beta update of Unicode 13.0 (which will only be released in March 2020). Unicode 13.0 is still a work in progress that does not even meet the alpha stage, with most sections marked "TBD".

But maybe this issue can be tracked in the two alternate bugs now open, "adobe-fonts/source-han-serif#36" and "adobe-fonts/source-han-sans#205".

@kenlunde

kenlunde commented Aug 3, 2019

Sorry, but the issue was discussed at UTC #160, and the result was that I was assigned Action Item 160-A54 to clarify this in the "Kanbun" section of the Core Specification for Unicode 13.0, which will be published next March. The opinion other than mine that disagrees with you is the UTC. The attendee list at the end of the meeting minutes shows you who were in attendance, which included several "property" experts, such as Roozbeh Pournader, Mark Davis, and Ken Whistler. There are also opinions expressed earlier in this issue that disagree with you.

What we have yet to see is an opinion other than yours that agrees with you.

@verdy-p
Author

verdy-p commented Aug 3, 2019

And "Action Item 160-A54", under which you committed to work on a proposal, is still waiting for a text that the UTC will need to approve (with other experts expressing their opinions before deciding on it) before it is included in one of the "TBD" sections of a later Unicode 13.0 alpha (or some later version). For now this is still an open issue in Unicode, but thanks for reporting it to them and getting it into their working schedule.

@kenlunde

kenlunde commented Aug 3, 2019

As soon as I have drafted the additional text for the "Kanbun" section, and had it reviewed by the Unicode Editorial Committee, I will share it here. Property experts are also on that committee.

@kenlunde

For those interested in this issue, the following is the text for the Kanbun section of the Core Specification for Unicode Version 13.0 that the Unicode Editorial Committee finalized today:

This block contains a set of Kanbun marks that are used in Japanese literary texts to indicate the Japanese reading order of Classical Chinese poetry and prose. These marks, named for the Japanese word for Chinese writing (漢文), occur particularly in Japanese educational and scholastic texts. They are typically written in an annotation style, placed interlinearly at the left side of each line of vertically rendered original Chinese text. Typesetting Kanbun text is inherently complex, requiring some form of markup and special handling to achieve the desired layout results.

Fourteen of the Kanbun marks, in the range U+3192 ㆒ ideographic annotation one mark through U+319F ㆟ ideographic annotation man mark, have compatibility decompositions to a corresponding CJK unified ideograph. These marks are merely special-purpose variants of those CJK unified ideographs, used with a specialized meaning and layout rules in Kanbun text. The way the glyphs are shown in the code charts at reduced size and raised above the baseline is intended to mimic their appearance as formatted for use in annotations. This appearance is the reason the compatibility mappings have been assigned the tag <super>. The compatibility mappings do not imply that these characters are appropriate for use as superscript forms in ordinary Chinese text; the preferred means for that purpose are text styles or markup in rich text. (See Section 22.4, Superscript and Subscript Symbols for more information.) Common practice for existing Japanese fonts that support these characters is to provide their glyphs at full size, with the expectation that the layout engine will scale and position them accordingly, per the layout specification for Kanbun text in JIS X 4051.
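
The "fourteen of the Kanbun marks" statement can be checked against the character data shipped with Python. The following is a minimal sketch; `KANBUN`, `super_decompositions`, `mapped`, and `unmapped` are names invented for this example.

```python
import unicodedata

# The Kanbun block, U+3190 through U+319F.
KANBUN = [chr(cp) for cp in range(0x3190, 0x31A0)]


def super_decompositions(chars):
    """Map each character with a <super> compatibility decomposition to its target string."""
    result = {}
    for ch in chars:
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] == "<super>":
            # Remaining fields are hexadecimal code points of the decomposition target.
            result[ch] = "".join(chr(int(p, 16)) for p in parts[1:])
    return result


mapped = super_decompositions(KANBUN)
unmapped = [ch for ch in KANBUN if ch not in mapped]

print(len(mapped), "marks have a <super> decomposition")
print("no decomposition:", [f"U+{ord(ch):04X}" for ch in unmapped])
for mark, target in mapped.items():
    print(f"U+{ord(mark):04X} -> U+{ord(target):04X} {target}")
```

In this data the two marks without a decomposition are U+3190 and U+3191, which matches the range U+3192 through U+319F given in the text above.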

@verdy-p
Author

verdy-p commented Aug 26, 2019

This text still does not indicate how the characters will render in plain text (without any external layout).

As the "Kanbun" behavior depends entirely on the presentation by the external styling engine and not at all on the glyphs themselves, their default positioning and sizing (in plain text) should still differ from those of the normal plain CJK ideographs.

Otherwise, these characters are in fact only complete duplicates, TOTALLY equivalent to the default CJK characters, and the compatibility decomposition (which is normative) makes no sense at all.

So yes, I maintain that fonts should map the Kanbun characters at superscript size and position, and it will still be up to the specific rendering engine to resize/reposition them in a Kanbun layout where available (with two variants, one for vertical presentation, another for horizontal presentation). If needed, fonts may supply OpenType features to include metrics/sizing/positioning substitutions for Kanbun, but these features won't be enabled for plain-text rendering, which will continue to use the default superscript style.

I don't like this new text at all, but even if it is adopted, it does not prohibit fonts from implementing OpenType features for resizing/repositioning these superscript glyphs for vertical (or horizontal?) Kanbun interlinear layout (which, IMHO, is just equivalent to ruby text).

So far I have not seen any working implementation of Kanbun layout, except by using the standard CJK characters (not these Kanbun characters) along with standard ruby styling (e.g. in HTML/CSS).

@verdy-p
Author

verdy-p commented Aug 26, 2019

What all this means is that the "compatibility decomposition" was not based on compatibility at all. And the choice of the <super> tag was possibly wrong; it should then have been <ruby> (except that this tag was not documented; I don't think that changing only the tag in a compatibility decomposition mapping of the UCD affects the stability of normalization in any way; maybe it would only break some old processors of the UCD database that could not parse the new <ruby> tag and would mark the mapping as invalid, even though the target code point is in fact not changed at all and remains valid).

And when rendering texts containing these "Kanbun", I don't see at all how layout engines can do anything without explicit styling added in the document (i.e. using <ruby> text in HTML, or the deprecated interlinear controls around these characters; but then we don't even need these "Kanbun" characters at all when we can use the standard CJK characters with exactly the same styling added around the text but not to the plain text itself).

I don't see at all what is specific to the "Kanbun" layout. For me it's just a precomposition of standard ruby text, and if so, the correct rendering in plain text should also be interlinear like ruby text: if there's no way to fit the ruby within the current line height, e.g. in monospaced text, then the proper way to render it inline, because it cannot be placed elsewhere, is still as superscript, which is then the appropriate style to use for the default mapping in fonts in the absence of any other styling/layout engine.

@kenlunde

@verdy-p At this point, you need to take this up with the UTC, not by posting in this now-closed issue. The UTC already discussed this topic at length during UTC 160, so you're not likely to convince many people. Still, if you care to do so, please submit your feedback so that it can be discussed at UTC 161 in mid-October. Or, if you prefer to wait until the text shown above is published in the Unicode Version 13.0 Core Specification to submit feedback, that works, too.

@verdy-p
Author

verdy-p commented Aug 27, 2019

I cannot report to Unicode something they still have not published at all, and that has not even entered any beta review. What they discussed came from this bug that I submitted here, and I think that this talk is still part of their ongoing process (I just hope they still have a link to this bug, which may continue to be useful for the future beta to come later).
For now it's impossible to submit something to Unicode based on what is currently published; there's no bug except in these Noto fonts. But if Unicode wants to change the initial specification (which was NOT at all based on JIS X 4051:2004, which was published many years later!), it will create some havoc. In my opinion the initial encoding was not wrong, but the specification was later changed incoherently by JIS X 4051 and by this font, and is probably broken in all other implementations in incompatible ways.
The only thing that seems wrong in the initial Unicode specification (in the UCD) was the choice of the "super" leading tag for the compatibility decomposition, where it could have been a bit more specific: maybe "kanbun", but I don't see why it cannot simply be the more generic "ruby", whose specification is well known and not hidden in a cryptic Japanese "standard" invisible to almost everyone in the world. That standard has never been published openly like Unicode; the Japanese standards body did not have its new standard peer-reviewed against existing international standards; it was created "ad hoc", mostly like a private-use application. I've not seen any other reference to this JIS X standard from any other vendors, not even font makers, though maybe some work was done specifically by Adobe for Japan, and I've not seen any other Japanese product referencing it. For me it's just a paper given to the industry to test their interest, but it has not attracted any interest from third parties, not even Japanese ones, and as long as it remains proprietary and unpublished, it will likely not convince other vendors to change their behavior and deviate from the international ISO/IEC/Unicode standard and other related standards, notably the ruby specification made by the W3C.
Likewise, Unicode did not make any statement of conformance with a Japanese standard that did not exist at the time of publication (and I've not seen in the UTC or the ISO WG any comment made by Japan to reserve the behavior for a pending standard to come later).
So like it or not, the Kanbun characters have been standardized in Unicode from the beginning with their "superscript" size and positioning (even if what was intended was a "ruby", standardized later by the W3C but still before the new JIS X national "standard" that ignored all past work). The Unicode characters were not intended to be "compatibility" characters with any other prior standard; they just got a "compatibility decomposition" for NFKD/NFKC, which is normative and cannot be changed at all, just to support a reasonable rendering fallback: the charts display this reasonable fallback using superscript, and this should also be followed in the design of Noto. Japan will have to fix their standard to comply with the international ISO/IEC/Unicode standards that they endorsed long ago. The non-compliance bug is in "JIS X 4051:2004" and should not be reproduced "as is" in Noto.


5 participants