CJK ExtensionB, etc. #13

Open
oliviazhu opened this issue Jun 9, 2015 · 26 comments

Comments

@oliviazhu

What steps will reproduce the problem?

  1. Install CJK fonts
  2. Look at CJK text from Extension B
  3. See tofu.
  4. Sadness

What is the expected output? What do you see instead?
No tofu (it is right in the name of the font).

What version of the product are you using? On what operating system?
MacOS X 10.10.1 (Yosemite)
...
Er, uh, the same comment equally applies to Extensions C through E (Extension E
is being added to Unicode this year as Version 8.0). To play the devil's
advocate, I feel obliged to point out that it takes a non-trivial amount of
time to design glyphs for the tens of thousands of CJK Unified Ideographs that
are in Plane 2, so a great deal of patience is necessary.
...
Well, sure. Things do take time. At least we should probably document where
we are along the continuum from mo-tofu to noto-fu. I suggest a character set
coverage table or something. The project proposes to be a font for all of
Unicode, but if stuff like Extension B (which was in Unicode 3.1, and is 14
years old now) are always going to be on the nice to have list, then maybe we
need to call it "lessto", not "noto". I happen to care about Extension B
because my genealogy data includes given names which are out there in ExtB, so
they are (typically) tofu on my family tree website. It's also useful for
testing surrogate pair support with real text.
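To illustrate the surrogate-pair point: Extension B code points lie above U+FFFF, so UTF-16 stores each one as two 16-bit code units. The following is a minimal sketch, not part of the original report; `to_surrogates` is a hypothetical helper name, and U+20000 (the first Extension B code point) is used only as an example.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogates(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")  # U+20000 -> D840 DC00
```

Software that mishandles such text typically treats the two code units as two separate characters, which is why real Extension B text makes a useful test case.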

Moved from notofonts/noto-fonts#242

@dougfelt
Contributor

Copied assignee over from the original issue.

This is a request, essentially, for coverage of every single Unicode CJK code point, which as Ken points out would take some time, so could be considered a feature request. But the submitter points out that 'Noto' means 'no tofu', so arguably it's a defect that we don't cover all of CJK even as defined in Unicode 3.1.

@jungshik

This is working-as-intended for the current stage of the project.

We do cover a small subset (< 3,000) of Plane 2 characters (CJK Ext B, C, D), such as those used in Hong Kong or included in JIS X 0213. However, full coverage of Plane 2 CJK ideographs is beyond the scope of our current Pan-CJK fonts.

We're keenly aware of the lack of coverage for them. At this point, when and whether we'll cover them is still undecided.

@anyong

anyong commented Feb 24, 2016

There are 60 characters in the extension planes that are official Taiwanese (Southern Min Chinese) characters, as designated by the Taiwanese Ministry of Education and listed in its comprehensive dictionary. They are in Unicode but not in any font that I know of (except for Hanazono). It would be great if these were in Noto Sans: I do a lot of work on Taiwanese-language materials, and these characters always look different because they fall back to Hanazono, although I prefer to use Noto.

There are also 7 characters missing from the Unicode spec, but I'm not sure how one would go about getting those added to the spec before adding them to fonts.

Here are the lists:

In Unicode but missing from Noto (download the Hanazono font HanaMinB to see):

𫝛 𫟊 𪜶 𫞼 𬦰 𠯗 𧉅 𧉟 𠞭 𠢕
𢲸 𨂿 𨃟 𢻷 𢼌 𣮈 𩨑 𡳞 𩵱 𧜞
𪁎 𤖯 𩛩 𠕆 𥰔 𥕥 𥴊 𥽕 𢓜 𪁎
𥍉 𠲿 𤶃 𧮙 𩟗 𪐞 𤲍 𩏠 𦊓 𩸙
𧺤 𩚨 𢄧 𩜇 𣁳 𦟪 𤺅 𧿳 𧌄 𤺪
𧿬 𣍐 𢪱 𩸶 𦜆 𨂾 𫝏 𫝺 𫝻 𫟂

The character missing from HanaMinB, 𬦰, is 足百 put together. It exists in the Unicode spec but not in any fonts that I'm aware of.

Missing from Unicode spec (join radicals left to right as single character):

率刂
歹差
氐頁
氵雀戈
豖殳
疒哥
石匹

Would be awesome if Noto CJK could include these characters, or at least the 60 that already have designated code points!

I would be happy to make the vector images for these characters based on existing Noto CJK characters, if that would make it easier/faster to get them included in the next release.

@KrasnayaPloshchad

Is it possible to create additional fonts to give support for such characters? Hanazono Mincho did it.

@dougfelt
Contributor

dougfelt commented May 2, 2017

@jungshik, @kenlunde care to comment on the Min characters reportedly listed in the Taiwan MoE dictionary?

@kenlunde

kenlunde commented May 2, 2017

@dougfelt: Here's my response:

Apparently, @anyong didn't look hard enough for the ideographs that were reported as unencoded, because all of them, except one that is in the process of being encoded, were lurking in Extension B:

⿰率刂 → U+207A9 𠞩 (Extension B)
⿰歹差 → U+23A48 𣩈 (Extension B)
⿰氐頁 → U+2947E 𩑾 (Extension B)
⿲氵雀戈, ⿰氵𢧵, or ⿰𣼎戈 → U+24062 𤁢 (Extension B)
⿰豖殳 → U+27C35 𧰵 (Extension B)
⿸疒哥 → UTC-02663 (included in IRG Working Set 2015; aka Extension G)
⿰石匹 → U+25435 𥐵 (Extension B)

As to the list of 60 that were reported as missing from Noto Sans CJK, two of them are actually included (because they are within the scope of Hong Kong SCS-2008):

U+25565 𥕥 (Extension B)
U+280BE 𨂾 (Extension B)

51 (including the two above) are in Extension B, one is in Extension C, seven are in Extension D, and one is in Extension E.
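A tally like the one above can be reproduced by checking each character's code point against the CJK extension block ranges. This is a minimal sketch and not the method used in the comment; the `extension_of` and `tally` names are hypothetical, and the range table is an assumption taken from the current Unicode block definitions (Extension G did not exist yet when this comment was written).

```python
from collections import Counter

# Inclusive block ranges for the CJK Unified Ideographs extensions.
CJK_EXTENSION_BLOCKS = [
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
    ("Extension G", 0x30000, 0x3134F),
]


def extension_of(char: str) -> str:
    """Return the extension block containing char, or 'other'."""
    code_point = ord(char)
    for name, first, last in CJK_EXTENSION_BLOCKS:
        if first <= code_point <= last:
            return name
    return "other"


def tally(text: str) -> Counter:
    """Count non-whitespace characters in text per extension block."""
    return Counter(extension_of(ch) for ch in text if not ch.isspace())


# The two code points reported above as already covered.
sample = chr(0x25565) + " " + chr(0x280BE)
print(tally(sample))  # Counter({'Extension B': 2})
```

Passing the full list of 60 characters to `tally` would give a per-extension breakdown to compare against the figures quoted above.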

In any case, being included in a dictionary may certainly qualify for encoding a character, particularly if it is a head entry. However, these characters are outside the current scope of the character set. The highest-priority issue for the forthcoming Version 2.000 update is to support Hong Kong SCS-2016 both in terms of characters and appropriate HK glyphs, and the next highest-priority issue is to support the additional ideographs for Korean names in Issue #80. Anything beyond that will depend on whether there are available CIDs, and will need to be considered on a case-by-case basis. There also needs to be mutual agreement between Adobe and Google, because ultimately someone will need to pay $¥₩ to have additional glyphs designed.

@audreyt

audreyt commented May 28, 2017

⿲氵雀戈 is encoded as U+24062 𤁢, no?

ref: https://github.com/ethantw/moedict-idc/blob/master/cjk-ext.tsv#L103

@kenlunde

@audreyt: Yep. Will edit my response above, from unencoded to U+24062 𤁢.

@anyong

anyong commented Jan 13, 2019

Closing in on 2 years later, any possibility of getting these into Noto CJK? These characters not being available, especially on Android phones, is a major hindrance to writing Taiwanese in everyday conversation and thus further development of the sense of "normalcy" around it. Some characters listed above such as 𠢕 ("to be good at sth") have a very high frequency in Taiwanese, and are often replaced by similar sounding but meaningless-in-context characters such as 猴 (monkey).

@kenlunde

@anyong The short answer is: Don't hold your breath.

The long answer is: Extension B is the proverbial "800-pound 🦍" when it comes to CJK Unified Ideographs. In related news, if Taiwan's standards weren't such a mess, it'd be possible to determine which characters beyond CNS 11643 Planes 1 and 2 (aka Big Five) are frequently used. While we eventually plan to support Extension B and beyond, it will take years and possibly a decade or more to achieve that goal.

The image below puts Extension B into perspective, in terms of how much of Plane 2 (aka SIP) it occupies:

[Image: sip-table]

@anyong

anyong commented Jan 13, 2019

@kenlunde I can certainly appreciate the problems presented by the size of Ext. B. On the other hand, I'm asking about a few dozen characters in particular that are defined by the Taiwanese Ministry of Education as part of the Taiwanese orthography.

Certainly it should be possible to prioritize some small subsets over the entirety of Ext. B. Could you explain what you mean by Taiwan's standards being "such a mess"?

I am sure @audreyt could help resolve any issues to get a definitive answer on what should be prioritized. At the very least, the characters in the MOE's set of frequently used Taiwanese characters should be included as they are indeed frequently used.

@kenlunde

@anyong See Source Han Sans Issue #222. In other words, the dot-release that is currently targeted for mid-April will not include glyphs for these characters, but they will be considered for the next major release.

@anyong

anyong commented Jan 17, 2019

@kenlunde thanks! If there is anything I could do to help, I would be happy to.

@kenlunde

@anyong As long as the listing in that Source Han Sans issue is accurate, there's nothing further that you need to do.

@davelab6
Member

Thanks for discussing this @anyong - I'm keeping this issue open, as it seems it will eventually end up in a future release.

@kenlunde

@davelab6 The reasons I suggested closing this issue are that 1) supporting additional CJK Unified Ideographs is so blatantly obvious; and 2) supporting Extension B and beyond in their entirety will take many years.

@davelab6
Member

they will be considered for the next major release

Is this publicly documented anywhere? :) If not, let's chat about this next time we speak privately. I'd like to better understand any loose ideas about the roadmap.

@kenlunde

@davelab6 Affirmative. See Source Han Sans Issue 222.

The reason I suggest that this issue be closed is that it will take years or decades to fully support the CJK Unified Ideograph extensions. The graphic that I posted on 2019-01-12 illustrates this extraordinarily well.

@tamcy mentioned this issue Feb 19, 2020

@ghost

ghost commented Apr 16, 2021

According to @kenlunde's image, it's a bunch of work! 😮

When you have to render these characters, use other fonts (such as MS Gothic) for now.
Please see this.

The number of characters to support is huge. There are:

- 6592 characters in Extension A
- 42718 characters in Extension B (which is the biggest)
- 4149 characters in Extension C
- 222 characters in Extension D
- 5762 characters in Extension E
- 7473 characters in Extension F
- 542 characters in Compatibility Supplement
- 4939 characters in Extension G
(Source: en.wikipedia.org)

Altogether, there are 72397 characters to design!
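As a quick check of that total, the per-block counts listed above can be summed directly. This is a small sketch added for illustration; the `counts` dictionary simply restates the figures quoted in this comment, which may differ between Unicode versions.

```python
# Per-block character counts as quoted in the comment above.
counts = {
    "Extension A": 6592,
    "Extension B": 42718,
    "Extension C": 4149,
    "Extension D": 222,
    "Extension E": 5762,
    "Extension F": 7473,
    "Compatibility Supplement": 542,
    "Extension G": 4939,
}

total = sum(counts.values())
print(total)  # 72397
```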

That's not all: Unicode is growing year by year, and the numbers above will grow together with Unicode.
So, @kenlunde is right.

@twardoch

twardoch commented Apr 16, 2021

The numbers are even larger. I imagine many of these characters need separate traditional and simplified glyph forms (which is what the issue @kenlunde referred to suggests).

@twardoch

If you need a free font to fill the tofu blanks, you may consider https://github.com/cjkvi/HanaMinAFDKO/releases which is built from http://en.glyphwiki.org/wiki/GlyphWiki:MainPage — it can be used as a bearable fallback for Source Han Serif / Noto CJK Serif.

@ghost

ghost commented Apr 21, 2021

For a list of fonts to cover (almost) all of Unicode 13 + ConScript, go here.

@Jimw338

Jimw338 commented Jan 11, 2023

Where do the graphics shown on https://decodeunicode.org/en/u+200D3 come from? They obviously came from somewhere. Is there such a "generic complete CJK Unicode font" out there - and then software tools (OSX in my case) that can automatically substitute such "generic glyphs" if a particular glyph isn't available in whatever font is selected?

My problem is in learning/dabbling with Chinese, and seeing all the "hex-code-boxes" that show up on Wiktionary for unusual glyphs, or for a part of a glyph, for instance showing a radical or a character simplification pattern. Some are "standard radicals", some might be other "non-radical graphic patterns" (?). I don't know (obviously).

For that matter, how were the code-points (ALL of them) defined in the first place? Presumably when they were defined, there was SOME graphical representation used. So why isn't there a "complete CJK" font out there somewhere, even if it doesn't look particularly good.

This probably has an obvious answer.. somewhere.. The problem is how to find it..

@anyong

anyong commented Jan 12, 2023

Hanazono Mincho has almost all CJK glyphs (or about 110,000 of them, at least). That's too many for one font file, so make sure to install both HanaMinA and HanaMinB. The download link is on the release page. Look for hanazono-20170904.zip on that page.

@tamcy

tamcy commented Aug 9, 2023

Wow, next year is the 10th anniversary of Noto Sans CJK!

If Ext-B is the real blocker for increased coverage, is it possible to split Ext-B into various stages? Among those 42,720 characters in Extension B, 17,985 (~42%) of them are from the KangXi Dictionary and are not supported in Noto Sans CJK (probably a bit more unsupported in Serif CJK, but the figure should be close).

Due to the legacy of the KangXi Dictionary, I'd argue that tofu characters from this dictionary are encountered relatively frequently, and that supporting them should benefit the largest community, including regions that did not submit reference glyphs for these characters, i.e. JP and HK.

Of these 17,985 code points, 17,978 (~99%) have a G (CN) source, 17,338 (~96%) have a T (TW) source, and 83 (~0.5%) have a K (KR) source. None of them has a J (JP) or H (HK) source. I believe supporting the CN and TW sources is sufficient. My clueless speculation is that about 40-50% of the glyphs could be shared between CN and TW, probably a little more if subtle differences are ignored. For these less frequently used characters, being able to show a meaningful and recognizable glyph instead of a tofu on a device is more important than having a glyph whose strokes exactly match the one in the Unicode code chart.
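The shares quoted above follow from simple division. A short sketch for anyone who wants to recheck them; the variable names are illustrative, and the counts are the ones given in this comment rather than values recomputed from the Unihan database.

```python
EXT_B_TOTAL = 42_720        # Extension B size as quoted above
KANGXI_UNSUPPORTED = 17_985  # KangXi-sourced characters not yet in Noto Sans CJK

# Number of those code points that have each IRG source, as quoted above.
source_counts = {"G (CN)": 17_978, "T (TW)": 17_338, "K (KR)": 83}

kangxi_share = KANGXI_UNSUPPORTED / EXT_B_TOTAL
print(f"KangXi share of Extension B: {kangxi_share:.1%}")

source_shares = {
    name: count / KANGXI_UNSUPPORTED for name, count in source_counts.items()
}
for name, share in source_shares.items():
    print(f"{name}: {share:.2%}")
```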

And I wonder if generative AI can play a part in future iterations of the font... just a wild guess though.

@tommai4881

tommai4881 commented Aug 14, 2023

How about the characters from the following fonts (all sourced from Source Han Sans+Serif, aka Noto Sans+Serif CJK)?
Gothic Nguyên: https://github.com/TKYKmori/Gothic-Nguyen
Minh Nguyên: https://github.com/TKYKmori/Minh-Nguyen
