Check if Traditional/Simplified errors were introduced by the update of Unihan_Variants.txt to “2021-12-01 Unicode 15.0.0 draft” #97

Closed
mike-fabian opened this issue Jan 20, 2022 · 13 comments

@mike-fabian
Owner

c1c39a3

Regenerate engine/chinese_variants.py for Unihan_Variants.txt from “2021-12-01 Unicode 15.0.0 draft”

Because of the new Unihan_Variants.txt, the following 37 characters
are added to the VARIANTS_TABLE in chinese_variants.py:
u'䓨': 1, u'沄': 1, u'潕': 2, u'澐': 2, u'罃': 2, u'鮗': 2, u'龻': 2,
u'鿟': 1, u'鿠': 2, u'鿰': 1, u'鿲': 1, u'鿳': 2, u'鿴': 1, u'鿵': 1,
u'鿶': 1, u'鿷': 1, u'鿸': 1, u'鿹': 1, u'鿺': 1, u'𣲘': 1, u'𤇾': 2,
u'𤪤': 2, u'𦥯': 2, u'𧰎': 2, u'𩷓': 2, u'𩷕': 2, u'𩹎': 2, u'𪄳': 2,
u'𪛞': 1, u'𫇦': 1, u'𬉧': 2, u'𬵨': 2, u'𰀡': 1, u'𰀢': 1, u'𰁜': 1,
u'𰃮': 1, u'𰯲': 2,

1 = simplified Chinese
2 = traditional Chinese
3 = used both in simplified and traditional Chinese
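
The category codes above can be read as a small lookup, sketched here under stated assumptions. The names `VARIANTS_TABLE_EXCERPT`, `classify` and `is_simplified_only` are hypothetical and are not taken from `chinese_variants.py`, the excerpt holds only four of the entries quoted above, and treating unlisted characters as "both" is an assumption made for this sketch.

```python
# Category codes as listed in the commit message above.
SIMPLIFIED = 1
TRADITIONAL = 2
BOTH = SIMPLIFIED | TRADITIONAL  # 1 | 2 == 3

# A tiny excerpt of the quoted entries; the real table is much larger.
VARIANTS_TABLE_EXCERPT = {
    '沄': 1,
    '澐': 2,
    '潕': 2,
    '罃': 2,
}


def classify(char, table=VARIANTS_TABLE_EXCERPT):
    """Return the category code for char.

    Characters missing from the table are treated as BOTH here;
    that default is an assumption of this sketch.
    """
    return table.get(char, BOTH)


def is_simplified_only(char, table=VARIANTS_TABLE_EXCERPT):
    """True if char is classified as simplified Chinese only."""
    return classify(char, table) == SIMPLIFIED


if __name__ == '__main__':
    for ch in VARIANTS_TABLE_EXCERPT:
        print(ch, classify(ch), is_simplified_only(ch))
```

With this reading, a character wrongly marked `1` would be excluded whenever only traditional characters are wanted, which is why the individual classifications are worth checking.
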
@ghost

ghost commented Jan 21, 2022

Oops, I left the above two comments on the wrong issue, it seems. So I deleted them.

@ghost

ghost commented Jan 21, 2022

u'沄': 1

This one is used in Trad:

https://dict.revised.moe.edu.tw/search.jsp?md=1&word=%E6%B2%84&qMd=0&qCol=1

@ghost

ghost commented Jan 21, 2022

The rest seem correct to me. However, there are a few characters that don't display for me.
Do I need to download some fonts for them to display? I have all the Hanazono fonts for displaying the CJK Extensions, as per this page: https://ctext.org/font-test-page

@ghost

ghost commented Jan 21, 2022

u'鿰': 1, u'鿲': 1, u'鿳': 2, u'鿴': 1, u'鿵': 1,
u'鿶': 1, u'鿷': 1, u'鿸': 1, u'鿹': 1, u'鿺': 1,
u'𪛞': 1, u'𰀡': 1, u'𰀢': 1, u'𰁜': 1,
u'𰃮': 1, u'𰯲': 2,

These are the characters that aren't displayed (i.e., I get "tofu").

@mike-fabian
Owner Author

> u'鿰': 1, u'鿲': 1, u'鿳': 2, u'鿴': 1, u'鿵': 1,
> u'鿶': 1, u'鿷': 1, u'鿸': 1, u'鿹': 1, u'鿺': 1,
> u'𪛞': 1, u'𰀡': 1, u'𰀢': 1, u'𰁜': 1,
> u'𰃮': 1, u'𰯲': 2,
>
> These are the characters that aren't displayed (i.e., I get "tofu").

None of these display for me either at the moment. I have not yet investigated why; maybe they are new.

@mike-fabian
Owner Author

> > u'鿰': 1, u'鿲': 1, u'鿳': 2, u'鿴': 1, u'鿵': 1,
> > u'鿶': 1, u'鿷': 1, u'鿸': 1, u'鿹': 1, u'鿺': 1,
> > u'𪛞': 1, u'𰀡': 1, u'𰀢': 1, u'𰁜': 1,
> > u'𰃮': 1, u'𰯲': 2,
> >
> > These are the characters that aren't displayed (i.e., I get "tofu").
>
> None of these display for me either at the moment. I have not yet investigated why; maybe they are new.

Many of those which display as Tofu are apparently not new, for example

'鿰' U+9FF0

https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E9%BF%B0

says:

“The Unicode Standard (Version 3.2)”

i.e. it has been there for a long time already.

The only source given at https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E9%BF%B0

is

kIRG_GSource GKJ-00201

which according to

https://www.unicode.org/reports/tr38/index.html#kIRG_GSource

is “GKJ Terms in Sciences and Technologies (科技用字) approved by the China National Committee for Terms in Sciences and Technologies (CNCTST)”

But if none of our fonts have glyphs for it, I guess it is something really obscure; maybe nobody really uses it.
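
When a character shows up as tofu, it can help to print its code point and build the lookup URL used above. The following is a minimal sketch: the helper name `describe` is hypothetical, and the URL prefix is simply copied from the link in this comment.

```python
from urllib.parse import quote

# URL prefix copied from the lookup link quoted in this comment.
UNIHAN_LOOKUP = 'https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint='


def describe(char):
    """Return ('U+XXXX', lookup URL) for a single character."""
    code_point = f'U+{ord(char):04X}'
    # quote() percent-encodes the UTF-8 bytes of the character.
    return code_point, UNIHAN_LOOKUP + quote(char)


if __name__ == '__main__':
    # '\u9ff0' is the character discussed above, written as an escape
    # so that it survives even when no font can display it.
    print(*describe('\u9ff0'))
```
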

@mike-fabian
Owner Author

> u'沄': 1
>
> This one is used in Trad:
>
> https://dict.revised.moe.edu.tw/search.jsp?md=1&word=%E6%B2%84&qMd=0&qCol=1

I think I’ll fix only that one and assume that the classification of all the other characters is correct.
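
One way such a single-character fix could be kept separate from the regenerated data is a small override map applied on top of the generated table. This is only a sketch: the thread does not say how the fix was actually made, the names `GENERATED_EXCERPT`, `MANUAL_OVERRIDES` and `apply_overrides` are hypothetical, and the new value 3 (used in both) is an assumption based on the dictionary link above.

```python
# Excerpt of the generated classification quoted in this issue.
GENERATED_EXCERPT = {'沄': 1, '澐': 2}

# Hand-maintained corrections. The value 3 (used in both simplified and
# traditional Chinese) is an assumption; the thread only says that the
# character is also used in traditional Chinese.
MANUAL_OVERRIDES = {'沄': 3}


def apply_overrides(generated, overrides):
    """Return a new table where overrides win over generated entries."""
    merged = dict(generated)  # copy, so the generated data stays untouched
    merged.update(overrides)
    return merged


if __name__ == '__main__':
    print(apply_overrides(GENERATED_EXCERPT, MANUAL_OVERRIDES))
```

Keeping the corrections in their own map means they are not lost the next time the table is regenerated from a newer Unihan_Variants.txt.
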

@ghost

ghost commented Jan 23, 2022

Sounds reasonable.

@ghost

ghost commented Jan 24, 2022

Here's another Trad/Simp character that's only marked as Simp:

@mike-fabian
Owner Author

> Here's another Trad/Simp character that's only marked as Simp:

Thank you, great!

@mike-fabian
Owner Author

I made a new issue with a list of all characters in the cangjie5 table which are currently classified as simplified only.
Most of these are classified correctly, but there seem to be a few errors:

#100
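
A list like the one in that issue could be produced roughly as follows. This is a sketch under assumptions: the function name `simplified_only` is hypothetical, the characters of the input table are assumed to be available as a plain iterable, and the variants excerpt contains only entries quoted earlier in this thread.

```python
def simplified_only(chars, variants):
    """Return the sorted, de-duplicated characters classified as 1.

    chars    -- iterable of single characters taken from an input table
    variants -- mapping of character to category code (1, 2 or 3)
    """
    return sorted({c for c in chars if variants.get(c) == 1})


if __name__ == '__main__':
    # Excerpt of the classification quoted earlier in this issue.
    variants_excerpt = {'沄': 1, '澐': 2, '潕': 2}
    # Stand-in for the characters of a real table such as cangjie5.
    table_chars = ['沄', '澐', '沄', 'a']
    print(simplified_only(table_chars, variants_excerpt))
```
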

Mike’s github project board automation moved this from In progress to Done Jan 24, 2022
@mike-fabian
Owner Author

Builds for Fedora are here: https://copr.fedorainfracloud.org/coprs/mfabian/ibus-table/

sudo dnf copr enable mfabian/ibus-table
sudo dnf update ibus-table
