Traditional/Simplified errors introduced by the update of Unihan_Variants.txt to “2021-08-06 Unicode 14.0.0 final” #96

Closed
mike-fabian opened this issue Jan 20, 2022 · 6 comments
@mike-fabian
Owner

4debaa5

Regenerate engine/chinese_variants.py for Unihan_Variants.txt from “2021-08-06 Unicode 14.0.0 final”

  • All our fixes which are now included upstream.

  • Because of the new Unihan_Variants.txt, the following 48 characters
    are added to the VARIANTS_TABLE in chinese_variants.py:
    u'䓖': 1, u'了': 1, u'伙': 1, u'借': 1, u'傢': 2, u'冬': 1, u'千': 1,
    u'卜': 1, u'卷': 1, u'吁': 1, u'合': 1, u'回': 1, u'夥': 2, u'姜': 1,
    u'家': 1, u'峃': 1, u'嶨': 2, u'庼': 1, u'廎': 2, u'懞': 2, u'才': 1,
    u'折': 1, u'捲': 2, u'摺': 2, u'旋': 1, u'朱': 1, u'濛': 2, u'灶': 1,
    u'瞭': 3, u'矇': 2, u'硃': 2, u'秋': 1, u'竈': 2, u'籲': 2, u'纔': 2,
    u'蒙': 1, u'蔑': 1, u'蔔': 2, u'薑': 2, u'藉': 3, u'藭': 2, u'衊': 2,
    u'迴': 2, u'霉': 1, u'鞦': 2, u'黴': 2, u'鼕': 2,

    1 = simplified Chinese
    2 = traditional Chinese
    3 = used both in simplified and traditional Chinese
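
The category codes above can be read as bit flags, since 1 | 2 == 3. The following is a minimal sketch of a lookup against such a table. The names `SAMPLE_VARIANTS_TABLE` and `describe` are made up for illustration and are not taken from chinese_variants.py; only the codes and the sample entries come from the commit message.

```python
SIMPLIFIED = 1
TRADITIONAL = 2
BOTH = SIMPLIFIED | TRADITIONAL  # equals 3

# A handful of entries copied from the commit message above, not the full table.
SAMPLE_VARIANTS_TABLE = {'䓖': 1, '了': 1, '傢': 2, '瞭': 3, '藉': 3}

LABELS = {
    SIMPLIFIED: 'simplified only',
    TRADITIONAL: 'traditional only',
    BOTH: 'simplified and traditional',
}


def describe(char, table=SAMPLE_VARIANTS_TABLE):
    """Return a readable label for a character, or None if it is not listed."""
    category = table.get(char)
    if category is None:
        return None
    return LABELS[category]
```

For example, `describe('傢')` gives the label for category 2, and a character missing from the sample table yields `None`. How the real engine treats unlisted characters is not covered by this sketch.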

@mike-fabian
Owner Author

@igor-kobzev kindly checked the above list of characters; see these comments:

#88 (comment)
#88 (comment)

@mike-fabian
Owner Author

I also checked the above characters using https://www.mdbg.net/.

What I found in https://www.mdbg.net/ agreed 100% with what @igor-kobzev commented.

But according to https://www.mdbg.net/, two characters which are marked as traditional only
seem to be used in simplified as well.

Here are the combined results:

u'䓖': 1, @igor-kobzev, https://www.mdbg.net/: OK
u'了': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'伙': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'借': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'傢': 2, https://www.mdbg.net/: OK
u'冬': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'千': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'卜': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'卷': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'吁': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'合': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'回': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'夥': 2, https://www.mdbg.net/: -> 3
u'姜': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'家': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'峃': 1, @igor-kobzev, https://www.mdbg.net/: OK
u'嶨': 2, https://www.mdbg.net/: OK
u'庼': 1, @igor-kobzev, https://www.mdbg.net/: OK
u'廎': 2, https://www.mdbg.net/: OK
u'懞': 2, https://www.mdbg.net/: OK
u'才': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'折': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'捲': 2, https://www.mdbg.net/: OK
u'摺': 2, https://www.mdbg.net/: -> 3
u'旋': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'朱': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'濛': 2, https://www.mdbg.net/: OK
u'灶': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'瞭': 3, https://www.mdbg.net/: OK
u'矇': 2, https://www.mdbg.net/: OK
u'硃': 2, https://www.mdbg.net/: OK
u'秋': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'竈': 2, https://www.mdbg.net/: OK
u'籲': 2, https://www.mdbg.net/: OK
u'纔': 2, https://www.mdbg.net/: OK
u'蒙': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'蔑': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'蔔': 2, https://www.mdbg.net/: OK
u'薑': 2, https://www.mdbg.net/: OK
u'藉': 3, https://www.mdbg.net/: OK
u'藭': 2, https://www.mdbg.net/: OK
u'衊': 2, https://www.mdbg.net/: OK
u'迴': 2, https://www.mdbg.net/: OK
u'霉': 1, @igor-kobzev, https://www.mdbg.net/: -> 3
u'鞦': 2, https://www.mdbg.net/: OK
u'黴': 2, https://www.mdbg.net/: OK
u'鼕': 2, https://www.mdbg.net/: OK

i.e. these two characters:

u'夥': 2, https://www.mdbg.net/: -> 3
u'摺': 2, https://www.mdbg.net/: -> 3

are also used in Simplified Chinese, not only in Traditional Chinese.
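
The review list above has a regular shape, so the corrected category for each character can be extracted mechanically. This is only a sketch: the function name `corrected_category` and both regular expressions are illustrative assumptions about the line format shown in this comment, not code from ibus-table.

```python
import re

# Start of a review line, e.g. "u'夥': 2, ..."
HEAD_RE = re.compile(r"^u'(.)': ([123]),")
# A trailing "-> N" means the category should be changed to N.
FIX_RE = re.compile(r"-> ([123])\s*$")


def corrected_category(line):
    """Return (char, category) after applying the reviewers' verdict.

    Lines ending in "OK" keep their original category; lines ending in
    "-> N" get category N. Returns None for lines that do not match.
    """
    head = HEAD_RE.match(line)
    if head is None:
        return None
    char, old = head.group(1), int(head.group(2))
    fix = FIX_RE.search(line)
    new = int(fix.group(1)) if fix else old
    return char, new


review = [
    "u'䓖': 1, @igor-kobzev, https://www.mdbg.net/: OK",
    "u'夥': 2, https://www.mdbg.net/: -> 3",
    "u'摺': 2, https://www.mdbg.net/: -> 3",
]
corrections = dict(filter(None, map(corrected_category, review)))
```

Here `corrections` maps each reviewed character to the category it should have after the review.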

@ghost

ghost commented Jan 21, 2022

> i.e. these two characters:
>
> u'夥': 2, https://www.mdbg.net/: -> 3
> u'摺': 2, https://www.mdbg.net/: -> 3
>
> are also used in Simplified Chinese, not only in Traditional Chinese.

I checked with another dictionary, and these two are indeed used in Simplified Chinese.

@mike-fabian
Owner Author

I reported the problem in Unihan_Variants.txt using the form at:

https://corp.unicode.org/reporting/error.html

Your Contact E-mail Address: 	mfabian@redhat.com
Your Name: 	Mike Fabian
Request Type: 	Error Report
Document or File with Error: 	Unihan_Variants.txt

Errors in kTraditionalVariant/kSimplifiedVariant in Unihan_Variants.txt

栗 U+6817 is used in Traditional Chinese as well.

https://en.wikipedia.org/wiki/Miaoli 苗栗市
https://en.wikipedia.org/wiki/Miaoli_County 苗栗縣

See also: https://github.com/mike-fabian/ibus-table/issues/95

Unihan_Variants.txt “2021-08-06 Unicode 14.0.0 final” also introduced
some errors in kTraditionalVariant/kSimplifiedVariant,
see the discussion in:
 
https://github.com/mike-fabian/ibus-table/issues/96

I fixed the problems like this:

https://github.com/mike-fabian/ibus-table/commit/c80597d1d35bb7bf3302ed0c7ffa3fea37a3c67d
https://github.com/mike-fabian/ibus-table/commit/9ee6edbe23ed1ad39199d916ec740775d69b58f0

I.e. I applied the following changes to Unihan_Variants.txt:

-U+6817	kTraditionalVariant	U+6144
+U+6817	kTraditionalVariant	U+6144 U+6817

-U+4E86	kTraditionalVariant	U+77AD
+U+4E86	kTraditionalVariant	U+4E86 U+77AD

-U+4F19	kTraditionalVariant	U+5925
+U+4F19	kTraditionalVariant	U+4F19 U+5925

-U+501F	kTraditionalVariant	U+85C9
+U+501F	kTraditionalVariant	U+501F U+85C9

-U+51AC	kTraditionalVariant	U+9F15
+U+51AC	kTraditionalVariant	U+51AC U+9F15

-U+5343	kTraditionalVariant	U+97C6
+U+5343	kTraditionalVariant	U+5343 U+97C6

-U+535C	kTraditionalVariant	U+8514
+U+535C	kTraditionalVariant	U+535C U+8514

-U+5377	kTraditionalVariant	U+6372
+U+5377	kTraditionalVariant	U+5377 U+6372

-U+5401	kTraditionalVariant	U+7C72
+U+5401	kTraditionalVariant	U+5401 U+7C72

-U+5408	kTraditionalVariant	U+95A4
+U+5408	kTraditionalVariant	U+5408 U+95A4

-U+56DE	kTraditionalVariant	U+8FF4
+U+56DE	kTraditionalVariant	U+56DE U+8FF4

-U+5925	kSimplifiedVariant	U+4F19
+U+5925	kSimplifiedVariant	U+4F19 U+5925

-U+59DC	kTraditionalVariant	U+8591
+U+59DC	kTraditionalVariant	U+59DC U+8591

-U+5BB6	kTraditionalVariant	U+50A2
+U+5BB6	kTraditionalVariant	U+50A2 U+5BB6

-U+624D	kTraditionalVariant	U+7E94
+U+624D	kTraditionalVariant	U+624D U+7E94

-U+6298	kTraditionalVariant	U+647A
+U+6298	kTraditionalVariant	U+6298 U+647A

-U+647A	kSimplifiedVariant	U+6298
+U+647A	kSimplifiedVariant	U+6298 U+647A

-U+65CB	kTraditionalVariant	U+93C7
+U+65CB	kTraditionalVariant	U+65CB U+93C7

-U+6731	kTraditionalVariant	U+7843
+U+6731	kTraditionalVariant	U+6731 U+7843

-U+7076	kTraditionalVariant	U+7AC8
+U+7076	kTraditionalVariant	U+7076 U+7AC8

-U+79CB	kTraditionalVariant	U+97A6
+U+79CB	kTraditionalVariant	U+79CB U+97A6

-U+8499	kTraditionalVariant	U+61DE U+6FDB U+77C7
+U+8499	kTraditionalVariant	U+61DE U+6FDB U+77C7 U+8499

-U+8511	kTraditionalVariant	U+884A
+U+8511	kTraditionalVariant	U+8511 U+884A

-U+9709	kTraditionalVariant	U+9EF4
+U+9709	kTraditionalVariant	U+9709 U+9EF4
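
Every change in the diff above adds a character to its own variant list. Below is a minimal sketch of how a category table could be derived from such lines, assuming that a character which lists itself as its own variant is treated as used in both scripts. The function name `categorize` and this derivation rule are assumptions made for illustration; the actual generator script in ibus-table may work differently.

```python
SIMPLIFIED = 1
TRADITIONAL = 2


def to_char(code_point):
    """Convert a 'U+XXXX' string to the character it names."""
    return chr(int(code_point[2:], 16))


def categorize(lines):
    """Build {char: category} from Unihan_Variants.txt style lines."""
    table = {}
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and comments
        source, prop, targets = line.split(None, 2)
        if prop not in ('kTraditionalVariant', 'kSimplifiedVariant'):
            continue
        char = to_char(source)
        variants = {to_char(t) for t in targets.split()}
        if prop == 'kTraditionalVariant':
            # Having a traditional counterpart implies simplified use.
            flags = SIMPLIFIED
            if char in variants:
                flags |= TRADITIONAL
        else:
            # Having a simplified counterpart implies traditional use.
            flags = TRADITIONAL
            if char in variants:
                flags |= SIMPLIFIED
        table[char] = table.get(char, 0) | flags
    return table


before = categorize(["U+6817\tkTraditionalVariant\tU+6144"])
after = categorize(["U+6817\tkTraditionalVariant\tU+6144 U+6817"])
```

Under this assumed rule, 栗 (U+6817) is category 1 in `before` and category 3 in `after`, which matches the intent of the report.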
