Rebuild ImFontAtlas::GetGlyphRangesJapanese offset table #3627

Closed

vaiorabbit
Contributor

Hello!

This PR makes ImFontAtlas::GetGlyphRangesJapanese() support more Japanese characters (kanji defined by the Government of Japan) out of the box.

  • 2136 Joyo (meaning "for regular use" or "for common use") characters
  • 863 Jinmeiyo (meaning "for personal name") characters

What this PR does

The commit 0e6b84c rebuilds the internal offset table in ImFontAtlas::GetGlyphRangesJapanese.
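
For readers unfamiliar with the idea of an offset table, here is a minimal sketch. It assumes the table stores each kanji as the distance from the previous code point, starting from a base of U+4E00; that is a reading of how ImGui packs these tables, not something stated in this PR. The helper names `EncodeAccumulativeOffsets` and `DecodeAccumulativeOffsets` are hypothetical and are not part of ImGui.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an ImGui function): turn sorted code points into
// accumulative offsets, where each entry is the distance from the previous
// code point and the first entry is the distance from `base`.
std::vector<short> EncodeAccumulativeOffsets(std::uint32_t base,
                                             const std::vector<std::uint32_t>& sorted_codepoints)
{
    std::vector<short> offsets;
    std::uint32_t previous = base;
    for (std::uint32_t cp : sorted_codepoints)
    {
        // Input must be ascending, and each gap must fit in a short.
        assert(cp >= previous && cp - previous <= 32767);
        offsets.push_back(static_cast<short>(cp - previous));
        previous = cp;
    }
    return offsets;
}

// Hypothetical helper: rebuild the code points by summing the offsets.
std::vector<std::uint32_t> DecodeAccumulativeOffsets(std::uint32_t base,
                                                     const std::vector<short>& offsets)
{
    std::vector<std::uint32_t> codepoints;
    std::uint32_t current = base;
    for (short offset : offsets)
    {
        current += static_cast<std::uint32_t>(offset);
        codepoints.push_back(current);
    }
    return codepoints;
}
```

Storing small gaps instead of full code points keeps the source table compact, which is presumably why the table has to be regenerated as a whole when characters are added.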

Source of the offset table

As a reliable source for this offset table, I chose the character information database of the Information-technology Promotion Agency (IPA, an administrative entity of Japan).

IPA provides a REST API for accessing its database: https://mojikiban.ipa.go.jp/mji/ .
The information acquired from the database is freely available under the terms of Creative Commons Attribution-ShareAlike 2.1 Japan (CC BY-SA 2.1 JP).

Supplemental scripts

I made a repository https://github.com/vaiorabbit/everyday_use_kanji that contains several Ruby scripts to

These scripts will be useful when we want to keep GetGlyphRangesJapanese() up-to-date in the future.

Motivation

The current GetGlyphRangesJapanese() implementation supports 1946 characters, but this is not enough to cover the 2136 Joyo (common-use) characters and 863 Jinmeiyo (for personal names) characters defined by the Government of Japan.

As a result, we often see garbled characters in relatively simple Japanese sentences and people's names
(shown as the replacement character ("?") in this screenshot).

GetGlyphRangesChineseFull is sometimes recommended as a replacement, but

  • GetGlyphRangesChineseFull() tends to produce a larger texture than GetGlyphRangesJapanese(). Although it depends on the configuration, GetGlyphRangesChineseFull() produces a 4096 x 4096 font texture internally, which is quite large compared with the 1024 x 2048 texture produced by GetGlyphRangesJapanese().
    • Font texture (GetGlyphRangesChineseFull) displayed in RenderDoc
    • Font texture (GetGlyphRangesJapanese) displayed in RenderDoc
  • it still fails to display several Joyo characters.

Another alternative, GetGlyphRangesChineseSimplifiedCommon, supports 2500 characters,

  • but it covers a different set of characters from those used in Japanese, which results in even more garbled characters.

I thought it would be easy and reasonable to rebuild the internal tables in GetGlyphRangesJapanese() to support Japanese characters defined by the government.

Limitations

What you are about to read is a topic that is difficult even for Japanese people, but I will try to explain it.

In short:

  • In the current Joyo kanji table, there is exactly one character whose code point cannot be represented in a 2-byte variable.
  • To alleviate the problem, I made a tweak that most Japanese readers would not notice.
  • For those who want to handle this character correctly, IMGUI_USE_WCHAR32 and ImFontGlyphRangesBuilder solve the problem easily.

Limitation and workaround due to the code point of "𠮟"

In a commit in the previous similar PR ( #1650 ), we can see a line that says:

// FIXME: We lost U+20B9F because it's out of range.

This means that the code point 0x20B9F (== 134047) exceeds the range of a 2-byte variable (short or ImWchar16), so the corresponding character cannot be displayed.

The actual character is "𠮟" (scold, rebuke or reprimand, etc.).

  • 𠮟 (code point 0x20b9f (== 134047))
    • encoded as F0 A0 AE 9F in UTF-8
    • was added to the Joyo kanji in 2010
    • is the only one of the 2136 Joyo characters whose code point needs more than 2 bytes
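
The numbers above can be checked with a small sketch. It tests whether a code point fits in 16 bits and encodes it as UTF-8. `FitsIn16Bits` and `EncodeUtf8` are hypothetical helper names, not ImGui functions, and the encoder is deliberately minimal (it does not reject surrogate code points).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: true if the code point can be stored in a 16-bit
// character type such as ImWchar16.
bool FitsIn16Bits(std::uint32_t cp)
{
    return cp <= 0xFFFF;
}

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string EncodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF); // outside the Unicode range otherwise
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With these helpers, `FitsIn16Bits(0x20B9F)` is false and `EncodeUtf8(0x20B9F)` yields the four bytes F0 A0 AE 9F.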

"𠮟" can still cause garbled text. When we try to enter "𠮟" on Windows, Microsoft's standard Japanese IME shows the warning "環境依存 (environment-dependent)", which means "this character may be garbled because some environments cannot handle this character code".

So, this character is often substituted by the variant character "叱" (U+53F1).

  • 叱 (code point 0x53f1 (== 21489))
    • encoded as E5 8F B1 in UTF-8
    • is the traditional form of 「𠮟」
    • means "scold, rebuke or reprimand", etc., so the only difference between the two kanji is in design
    • has a code point that fits in a 2-byte variable (short or ImWchar16)
    • was in use long before the modern form 「𠮟」 was added in 2010, and is still used

The actual history of this problem is a bit more complex, but in practical use these two characters can be treated as the same character, differing only in design.

So in this PR, I intentionally used "叱 (U+53F1)" everywhere "𠮟 (U+20B9F)" should appear but cannot be used.
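
The same substitution can be applied on the application side to text that may contain the modern form. The sketch below is an assumption about how an application might do that and is not part of this PR; `SubstituteModernShikaru` is a hypothetical name. It replaces the UTF-8 bytes of U+20B9F with those of U+53F1.

```cpp
#include <cassert>
#include <string>

// Hypothetical application-side helper: replace every occurrence of
// U+20B9F (modern form) with U+53F1 (traditional form) in a UTF-8 string.
// Byte-wise search is safe here because a 4-byte UTF-8 sequence can only
// match at a character boundary.
std::string SubstituteModernShikaru(std::string text)
{
    const std::string modern = "\xF0\xA0\xAE\x9F";  // U+20B9F
    const std::string traditional = "\xE5\x8F\xB1"; // U+53F1
    std::size_t pos = 0;
    while ((pos = text.find(modern, pos)) != std::string::npos)
    {
        text.replace(pos, modern.size(), traditional);
        pos += traditional.size();
    }
    assert(text.find(modern) == std::string::npos); // nothing left to replace
    return text;
}
```

This would only be useful when ImGui is built without IMGUI_USE_WCHAR32; with 32-bit ImWchar the modern form can be displayed directly.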

Even after this PR is merged, GetGlyphRangesJapanese() can display "叱" (U+53F1) but cannot display "𠮟" (U+20B9F).
Users who want to display "𠮟 (modern form)" should follow these steps:

  • Build ImGui with IMGUI_USE_WCHAR32 enabled

  • Prepare appropriate font (e.g. Google Noto Fonts)

  • Write code like:

    ImGuiIO& io = ImGui::GetIO();
    ImFontGlyphRangesBuilder builder;
    builder.AddRanges(io.Fonts->GetGlyphRangesJapanese()); // start from the built-in Japanese ranges
    #ifdef IMGUI_USE_WCHAR32
    builder.AddText(u8"𠮟"); // U+20B9F (== 134047) exceeds the range of ImWchar16; encoded as F0 A0 AE 9F in UTF-8
    #endif
    static ImVector<ImWchar> out_ranges; // the range data must stay alive until the font atlas is built
    builder.BuildRanges(&out_ranges);
    ImFont* font = io.Fonts->AddFontFromFileTTF("/font/NotoSansMonoCJKjp-Regular.otf", 20.0f, nullptr, out_ranges.Data);
  • References

Test and Performance

I made a small test code that tries to display all 2136 Joyo characters and 863 Jinmeiyo characters.
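
A coverage check of this kind can also be done without rendering. The sketch below is not the test code from this PR; it only assumes the usual ImGui glyph range layout (pairs of inclusive first and last code points, terminated by a zero). `GlyphWchar`, `IsCodepointInRanges`, and `FindMissingCodepoints` are hypothetical names, with `GlyphWchar` standing in for a 16-bit `ImWchar`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using GlyphWchar = std::uint16_t; // stand-in for ImWchar without IMGUI_USE_WCHAR32

// Hypothetical helper: true if `cp` lies inside one of the inclusive
// [first, last] pairs of a zero-terminated range list.
bool IsCodepointInRanges(const GlyphWchar* ranges, std::uint32_t cp)
{
    assert(ranges != nullptr);
    for (; ranges[0] != 0; ranges += 2)
    {
        if (cp >= ranges[0] && cp <= ranges[1])
            return true;
    }
    return false;
}

// Hypothetical helper: collect the wanted code points that the ranges miss.
std::vector<std::uint32_t> FindMissingCodepoints(const GlyphWchar* ranges,
                                                 const std::vector<std::uint32_t>& wanted)
{
    std::vector<std::uint32_t> missing;
    for (std::uint32_t cp : wanted)
    {
        if (!IsCodepointInRanges(ranges, cp))
            missing.push_back(cp);
    }
    return missing;
}
```

Feeding it the Joyo and Jinmeiyo code points together with the ranges from GetGlyphRangesJapanese() would list the characters that cannot be displayed; with 16-bit ranges, U+20B9F is always reported as missing.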

Screenshots

Screenshot (with current GetGlyphRangesJapanese(), IMGUI_USE_WCHAR32 disabled)

  • causes several garbled characters.

Screenshot (with new GetGlyphRangesJapanese(), IMGUI_USE_WCHAR32 disabled)

  • can display all 2999 characters, except for 「𠮟 (modern form)」

Screenshot (with new GetGlyphRangesJapanese(), IMGUI_USE_WCHAR32 enabled and ImFontGlyphRangesBuilder::AddText used)

Performance issue

Size of font texture

Although it depends on the configuration, both the current GetGlyphRangesJapanese() and the new implementation created a 1024x2048 font texture internally in the test code, so the additional glyphs did not require a larger texture.

  • GetGlyphRangesJapanese[Current]
  • GetGlyphRangesJapanese[New]

Memory consumption

The test code reports ImGui's memory consumption when the macro MEASURE_MEMORY_ALLOCATION is defined
(by using the allocator hooks provided by ImGui::SetAllocatorFunctions).
The increase in memory consumption due to the new implementation is less than 100 KB (about 72,000 to 76,000 bytes in the measurements below).

[Windows x64 / Visual Studio 2019 Version 16.7.4 / ImGui 1.80 WIP]
GetGlyphRangesJapanese[Current]
  (Debug, IMGUI_USE_WCHAR32 undefined)   -> GetAllocatedSize=27718242
  (Debug, IMGUI_USE_WCHAR32 defined)     -> GetAllocatedSize=28537544
  (Release, IMGUI_USE_WCHAR32 undefined) -> GetAllocatedSize=27730578
  (Release, IMGUI_USE_WCHAR32 defined)   -> GetAllocatedSize=28549880

GetGlyphRangesJapanese[New]
  (Debug, IMGUI_USE_WCHAR32 undefined)   -> GetAllocatedSize=27790566
  (Debug, IMGUI_USE_WCHAR32 defined)     -> GetAllocatedSize=28613312
  (Release, IMGUI_USE_WCHAR32 undefined) -> GetAllocatedSize=27802902
  (Release, IMGUI_USE_WCHAR32 defined)   -> GetAllocatedSize=28625648

GetGlyphRangesChineseFull
  (Debug, IMGUI_USE_WCHAR32 undefined)   -> GetAllocatedSize=102034930
  (Debug, IMGUI_USE_WCHAR32 defined)     -> GetAllocatedSize=102847380
  (Release, IMGUI_USE_WCHAR32 undefined) -> GetAllocatedSize=102034924
  (Release, IMGUI_USE_WCHAR32 defined)   -> GetAllocatedSize=102847374
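
A measurement like this can be taken with a pair of counting allocator hooks. The sketch below is not the test code from this PR; `AllocStats`, `CountingAlloc`, and `CountingFree` are hypothetical names. It assumes the hooks take a size or pointer plus a `user_data` pointer, which is the shape ImGui::SetAllocatorFunctions expects, and it stores each block's size in a small header so that frees can be subtracted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical bookkeeping structure passed to the hooks as user_data.
struct AllocStats
{
    std::size_t current_bytes = 0; // bytes currently allocated
    std::size_t peak_bytes = 0;    // highest value current_bytes has reached
};

// The header keeps the requested size and preserves maximum alignment.
constexpr std::size_t kHeaderSize = alignof(std::max_align_t);
static_assert(kHeaderSize >= sizeof(std::size_t), "header must hold a size_t");

void* CountingAlloc(std::size_t size, void* user_data)
{
    AllocStats* stats = static_cast<AllocStats*>(user_data);
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(size + kHeaderSize));
    if (raw == nullptr)
        return nullptr;
    std::memcpy(raw, &size, sizeof(size)); // remember the size for CountingFree
    stats->current_bytes += size;
    if (stats->current_bytes > stats->peak_bytes)
        stats->peak_bytes = stats->current_bytes;
    return raw + kHeaderSize;
}

void CountingFree(void* ptr, void* user_data)
{
    if (ptr == nullptr)
        return;
    AllocStats* stats = static_cast<AllocStats*>(user_data);
    unsigned char* raw = static_cast<unsigned char*>(ptr) - kHeaderSize;
    std::size_t size = 0;
    std::memcpy(&size, raw, sizeof(size));
    assert(stats->current_bytes >= size);
    stats->current_bytes -= size;
    std::free(raw);
}
```

In an ImGui application the hooks would be registered with `ImGui::SetAllocatorFunctions(CountingAlloc, CountingFree, &stats);` before the context is created, and `stats.current_bytes` read after the font atlas has been built.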

- GetGlyphRangesJapanese now supports
  - 2136 'Joyo (meaning "for regular use" or "for common use")' Kanji
  - 863 'Jinmeiyo" (meaning "for personal name")' Kanji
ocornut pushed a commit that referenced this pull request Dec 2, 2020
Owner

ocornut commented Dec 2, 2020

Thank you for this incredible amount of detail.
I took the liberty to add a line of comment under the comments for GetGlyphRangesJapanese(), which says:
"- Missing 1 Joyo Kanji: U+20B9F (Kun'yomi: Shikaru, On'yomi: Shitsu,shichi), see #3627 for details."

We'll later take inspiration from some of your tests to include in our test suite!
