Tone color incorrect in numerals (digits) after converting from CEDict u8 format to the Stardict format dictionary #328

Closed
camerooncameroon opened this issue Aug 20, 2021 · 14 comments

@camerooncameroon

Hello,
Found a little error when converting to stardict from the Cedict u8 source:
tone colors are wrong when the headword contains numerals (digits).
For example, if we have '21' in the headword, it's pronounced as 'ershiyi' (3 syllables), while '21' is only two characters.

PyGlossary takes the tones (fourth, second, first) from the pinyin and maps them onto '21' one character at a time, so the '2' gets the fourth tone (and hence its color), the '1' gets the second tone, AND... the leftover first tone goes to the next character.

That character in turn passes its own tone on to its neighbor, and so on. As a result, all the tones can end up misplaced (shifted one position to the right).
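
The shift can be reproduced in a few lines. This is only an illustration of pairing characters with syllables by position, assuming numbered pinyin as in the CEDICT source; it is not PyGlossary's actual code, and the variable names are made up for the example.

```python
# Illustration only: position-by-position pairing of characters and tones.
headword = "21三体综合症"  # 7 characters
pinyin = "er4 shi2 yi1 san1 ti3 zong1 he2 zheng4".split()  # 8 syllables

# In numbered pinyin the tone is the trailing digit of each syllable.
tones = [int(syllable[-1]) for syllable in pinyin]

# Naive pairing: the i-th character receives the i-th tone.
# zip() stops at the shorter sequence, so the last tone is silently dropped.
pairs = list(zip(headword, tones))

for char, tone in pairs:
    print(char, tone)
```

With this pairing 体 (ti3) receives tone 1 from san1, 综 (zong1) receives tone 3 from ti3, and so on to the end of the headword.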

Possible solution: not to assign tone colors to the numerals shown as digits, because their pronunciation is unpredictable (in the example above it could be also spelled as 'eryi', two syllables).
Cheers,
Alex

@camerooncameroon
Author

To see how it works take a look at the Cedict dictionary entry 21三体综合症

@ilius
Owner

ilius commented Aug 26, 2021

Hi
This seems to be outside the scope of PyGlossary.
Because we take the tones from CC-CEDICT source file, then just visualize it as html.
For example:

2019冠狀病毒病 2019冠状病毒病 [er4 ling2 yi1 jiu3 guan1 zhuang4 bing4 du2 bing4] /COVID-19, the coronavirus disease identified in 2019/

21三體綜合症 21三体综合症 [er4 shi2 yi1 san1 ti3 zong1 he2 zheng4] /trisomy/Down's syndrome/

If tones are wrong, you should either edit it on https://cc-cedict.org/editor/editor.php
Or report it to the website.
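
For reference, the two source lines quoted above can be split into their fields with a small parser. This is a minimal sketch that assumes the layout visible in those lines (traditional, simplified, bracketed pinyin, slash-separated definitions); `parse_cedict_line` is a hypothetical helper, not a PyGlossary function.

```python
import re

# Assumed layout, taken from the quoted lines: TRAD SIMP [pin1 yin1] /def 1/def 2/
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Return (traditional, simplified, syllables, definitions) or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, defs = match.groups()
    return trad, simp, pinyin.split(), defs.split("/")


entry = parse_cedict_line(
    "21三體綜合症 21三体综合症 "
    "[er4 shi2 yi1 san1 ti3 zong1 he2 zheng4] /trisomy/Down's syndrome/"
)
print(entry)
```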

@ilius ilius closed this as completed Aug 26, 2021
@camerooncameroon
Author

camerooncameroon commented Aug 26, 2021

Hi!
Nope, the tones are correct in the source cedict file.
The thing is, '21' is pronounced er shi yi (two ten one) in Chinese. That makes three syllables, i.e. three colors to apply to the characters, or, in this case, numerals. But there are only two of them! So the third, excess tone color is applied to the next character, and so on, which makes the coloring of the whole phrase incorrect.
I found only one such case and corrected it for myself. But the issue lies in PyGlossary, not in the CEDICT source.
Maybe I didn't explain the situation well. Simply put, I suggest not colorizing numerals (digits 0-9) in the PyGlossary code, because of possible ambiguities (for instance, '21' can be read as either 'er shi yi' or 'er yi' in some cases, which makes it a pain to tell which pattern applies where).
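
One conservative way to detect the ambiguous cases is to compare the number of characters in the headword with the number of pinyin syllables, and to colorize only when they agree. The sketch below is an assumption about how such a check could look, not a description of PyGlossary's code; `counts_match` is a hypothetical name.

```python
def counts_match(headword: str, pinyin: str) -> bool:
    """True when every character can be paired with exactly one syllable."""
    return len(headword) == len(pinyin.split())


# Digits read one by one: 9 characters, 9 syllables, safe to colorize.
print(counts_match("2019冠状病毒病",
                   "er4 ling2 yi1 jiu3 guan1 zhuang4 bing4 du2 bing4"))

# '21' read as 'er shi yi': 7 characters, 8 syllables, ambiguous.
print(counts_match("21三体综合症",
                   "er4 shi2 yi1 san1 ti3 zong1 he2 zheng4"))
```

A check like this does not need a special case for digits: it also catches single hanzi that are read as two syllables.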

@ilius
Owner

ilius commented Aug 27, 2021

I get it now.
Thanks.

@ilius
Owner

ilius commented Aug 27, 2021

Please try again.

@camerooncameroon
Author

camerooncameroon commented Aug 27, 2021

Thanks, it worked great.
It's new to me that some single Chinese characters are read as two syllables, so digits were not the only possible cause of tone ambiguity.
The only questionable item left is 粨, which CEDICT renders as the two-syllable 'bai3mi3', while some other dictionaries insist it's the single syllable 'bai3'.
But that's purely CEDICT's issue (or point of view, rather), not PyGlossary's.
Great thanks again!

@ilius ilius closed this as completed Aug 27, 2021
@camerooncameroon
Author

If it doesn't require deep changes to the code: could these 'problem' items, which are now left uncolored, still be given color tags, but in black? Then a user could later edit these two items (美国51区 and 21三体综合症) by hand and put in the correct colors. If the color is a hex code, the StarDict format won't be broken, because the number of characters stays the same.

Of course that's quite a minor thing already now ))

@camerooncameroon
Author

camerooncameroon commented Aug 27, 2021

Never mind. I think it would be easier to convert with the old conv.py (to keep the placeholders for color tags), use the new conv.py to check which words have extra tones, and then fix these exceptions manually, giving them the correct colors or just black where no color is applicable.
Slightly more than ten cases in total, a snap to fix in a minute - or five ))

@ilius
Owner

ilius commented Aug 27, 2021

In order to fix the color with the new code, you just need to know what each Chinese character sounds like.
Because pinyin is always colored.
Am I right?

@camerooncameroon
Author

camerooncameroon commented Aug 27, 2021

Exactly. Or you just know them by heart (it's easy to remember the tones for only 2 phrases).
However, with the new code, the hanzi whose tones are in question aren't colorized. If we colorize them manually by editing the .dict file, it won't work properly: once the file is created, the number of characters can't be changed by editing the .dict file directly (perhaps special editors can handle this, though I'm not aware of any).
With the old code, the 'problem' hanzi (the ones whose issues are now fixed) still had colorization tags. If the color is in hex format (I set my own hex colors in conv.py), you can change it to whatever you want without risk of corrupting the .dict file: the length in characters stays the same, since a hex code is always 7 characters.
To conclude: if you find it useful, it might be a good idea to still colorize the entries where the number of pinyin syllables doesn't match the number of hanzi (the new code leaves them uncolored), but with black (#000000). That would allow them to be edited manually afterwards.
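
The proposal above could be sketched roughly as follows. Everything here is an assumption made for illustration: the palette values, the `<font color>` markup, and the `colorize` helper are hypothetical and are not taken from PyGlossary's conv.py.

```python
# Hypothetical palette: every value is a 7-character hex string, so any
# color can later be swapped for another without changing the entry length.
TONE_COLORS = {
    1: "#ff0000",
    2: "#ffaa00",
    3: "#00aa00",
    4: "#0000ff",
    5: "#808080",  # neutral tone
}
PLACEHOLDER = "#000000"  # black, used when tones cannot be assigned safely


def colorize(headword, syllables):
    """Wrap each character in a color tag; use black when counts differ."""
    if len(headword) == len(syllables):
        colors = []
        for syllable in syllables:
            # Trailing digit is the tone; anything else is treated as neutral.
            tone = int(syllable[-1]) if syllable[-1] in "12345" else 5
            colors.append(TONE_COLORS[tone])
    else:
        colors = [PLACEHOLDER] * len(headword)
    return "".join(
        f'<font color="{color}">{char}</font>'
        for char, color in zip(headword, colors)
    )


print(colorize("你好", ["ni3", "hao3"]))
print(colorize("21三体综合症",
               "er4 shi2 yi1 san1 ti3 zong1 he2 zheng4".split()))
```

Because the placeholder has the same length as a real tone color, a mismatched entry takes up exactly as much room as it would after being colored by hand.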

@ilius
Owner

ilius commented Aug 27, 2021

Please try again.
I added <font color=""> so that it works with dark theme as well.

@camerooncameroon
Author

Then we'll need 7 more characters inside it as a placeholder. If we add characters later, each entry will be longer than it was initially, and the StarDict format will be broken.
I tried underscores, spaces, periods and # signs; GoldenDict treats them all as black, so they look poor on dark themes (( Light themes are OK. It's always possible to manually substitute, say, #...... with #000000, but that's an ugly way of doing things; you can't do it every time you just want to switch themes...
As a remedy, perhaps it's possible not to colorize digits at all.
As for single hanzi read as two syllables (which don't have a single tone, since there are two of them in one hanzi), some neutral color that is clearly visible on both black and white backgrounds, like grey, might be used.
But grey is already reserved for the 5th tone almost everywhere tones are colorized, so users might get confused. Maybe medium blue? But in your color scheme it's also reserved (I don't use blue for tones, because Pleco uses blue at the system level for some hanzi regardless of their tones, so I substituted purple for blue, which is rather informative, and I still use blue wherever a neutral, non-tone color is required). So you might consider purple for the rare two-syllable single-hanzi entries. Users like me can substitute it later manually according to their individual color schemes.

@ilius
Owner

ilius commented Aug 28, 2021

StarDict format is not meant for editing.
You should convert MDX to an editable format, like Tabfile (.txt) or CSV, then edit and convert it to StarDict.

@camerooncameroon
Author

camerooncameroon commented Aug 28, 2021

Yes, I know. Still, as long as an edit doesn't change the size of a dictionary entry in characters, it's possible even directly, with no side effects. For example, I mass-substitute colors in hex format, or convert Trad<>Simp and vice versa, etc. It all works fine.
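
A length-preserving substitution of this kind can be sketched as below. The StarDict .idx file records each entry's offset and size within the .dict file, which is why the replacement has to keep the same length in bytes. `swap_color` is a hypothetical helper, and the sample bytes stand in for a fragment of an uncompressed .dict file; a compressed .dict.dz cannot be edited this way.

```python
def swap_color(data: bytes, old: str, new: str) -> bytes:
    """Replace one hex color with another of identical byte length."""
    old_bytes, new_bytes = old.encode("ascii"), new.encode("ascii")
    if len(old_bytes) != len(new_bytes):
        # A different length would shift every later entry's offset.
        raise ValueError("replacement must have the same byte length")
    return data.replace(old_bytes, new_bytes)


# Stand-in for a fragment of an uncompressed .dict file.
sample = '<font color="#000000">症</font>'.encode("utf-8")
patched = swap_color(sample, "#000000", "#0000ff")

print(len(sample) == len(patched))
print(patched.decode("utf-8"))
```

Note that the length is measured in bytes rather than characters, so a character-for-character swap is only safe when both characters have the same UTF-8 byte length.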

Thanks to you, I now have a full solution to this issue, with some minimal manual adjustment still needed (besides, there's no need to update CEDICT daily). I'm not sure it matters enough to other users (only 2 incorrect entries and ~10 exotic hanzi abbreviations aren't really an issue) to justify the effort of a general workaround, whose outcome is uncertain given so many language and software quirks and limitations.

The current version of PyGlossary produces output without the previous incorrect colorings, and that's OK.
