[Solved] Messages are sometimes cropped and rendered unusable with Chinese fonts. #6498
Replies: 1 comment
-
Posted at 2023-10-28 by ccchan
In the 1st screenshot, the "9" in the long line would be the 6th digit of the code, which renders the message quite useless. Thanks.
Original msg: 【知乎】你的验证码是 698449,此验证码用于登录知乎或重置密码。10 分钟内有效。
Currently displayed as:
PS: in gadgetbridge.js, I have to tick "use bitmap if font not available".
Setup info:
Posted at 2023-10-28 by ccchan
Even clicking into the msg doesn't help, and I tried all of the several message apps in the store. (This messagedebug.log is a working one for the #1 message.)
Attachments:
Posted at 2023-10-30 by @gfwilliams
Thanks for those logs, that's great - I'm not entirely sure what's up with the larger font sizes not working, but the issue with wrapping is that when deciding what to convert to an image, Gadgetbridge uses the following logic:
The issue here is that Gadgetbridge thinks it is just one giant word. I'm afraid I don't know enough about Chinese, but is there some 'separator' character that we should be looking out for and just aren't? Or maybe Gadgetbridge will just have to split any bitmap of a 'word' that's greater than 100px wide into multiple bitmaps...
Posted at 2023-10-30 by ccchan
hi,
because in this msg, if GB.js understood those 2 symbols,
I hope this could happen. Maybe at least do it only for Chinese messages, i.e. spare other languages? Thanks.
Or maybe, if there is an SMS in Chinese, GB.js could simply fill a line with it and, when the line is full, start a new line? In Chinese, each character is indeed already a word by itself. Say, the above line 此验证码用于登录知乎或重置密码. is the same as
Even if you cut it like this, everyone still understands it:
So you can cut it however you like; just keep the sequence and don't add spaces or punctuation. Thx.
Posted at 2023-10-30 by ccchan
This is overkill, but just for reference:
Posted at 2023-10-31 by houshou_m
The typical Mandarin written word consists of either one or two graphs, with two being more common. For reference, the notification text can be broken down into the following "dictionary words": 你 的 验证码 是 698449,此 验证码 用 于 登录 知乎 或 重置 密码。10 分钟 内 有效。
The simplest solution might be to simply preserve groupings of two. Unlike the Latin script, Chinese characters are all meant to be written such that they take up the same amount of space (like monospace fonts do), so perhaps that wouldn't be too difficult a task?
Posted at 2023-10-31 by ccchan
hi,
But without AI, GB.js won't know where to break up such a string of 20 Chinese characters/words. I am sure it would need a dictionary just to GUESS the breakpoints, and I am sure it won't perform as well as a human brain. So, for this msg from zhihu.com, I just wish it to fill the screen space and, when a line is full, start a new line.
【知乎】你的验证码是 698
ANYONE who knows Chinese can interpret it as long as the sequence is unchanged and nothing like spaces or punctuation is added.
6 分
Breaking it up into phrases of 2-3 characters introduces spaces, which could imply an opposite meaning, so I completely disagree with that.
thanks
Posted at 2023-10-31 by ccchan
Pls don't add spaces or punctuation; it could ruin the message. Thx.
Posted at 2023-11-02 by @gfwilliams
Thanks - well, that's promising. I'd propose:
I could split every two characters which would be a bit more efficient, but it's just added complexity and it sounds like in some cases it might mess things up?
Posted at 2023-11-02 by ccchan
hi, please DO NOT add spaces or split anything if you are unsure! This is a joke from a famous Chinese movie.
Original, NO spaces:
The good guy reads this as:
The bad guy reads this as:
which has a completely opposite meaning! Thank you.
Posted at 2023-11-03 by @gfwilliams
That's a bit of a problem then. What do you suggest we do instead? Is there some simple rule, like 'if char code is within this range we can put a newline after if needed'?
Posted at 2023-11-03 by ccchan
I mean that adding a new line (wrapping to the screen) is usually OK. Could you please try your method(s) on the above 2 msgs and let me see what happens?
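To make the "if char code is within this range" idea concrete, here is a minimal Python sketch. It is not Gadgetbridge's (Java) or Bangle.js's (JavaScript) actual code; the names `BREAK_FRIENDLY_RANGES`, `can_break_after`, `cell_width`, `tokens` and `wrap` are hypothetical, and the choice of ranges is an assumption. A line break is allowed after any character in the listed ranges or after a space, Latin/digit runs stay intact, and nothing is added to or removed from the message.

```python
import unicodedata

# Assumption: these three Unicode blocks are the ones treated as "safe to
# break after". A real implementation might need more (or fewer) ranges.
BREAK_FRIENDLY_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def can_break_after(ch):
    """True if a newline may be placed after this character."""
    cp = ord(ch)
    return ch == " " or any(lo <= cp <= hi for lo, hi in BREAK_FRIENDLY_RANGES)


def cell_width(ch):
    """Wide/fullwidth characters take two cells, everything else one."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def tokens(text):
    """Split text into unbreakable pieces: single break-friendly
    characters, and runs of everything else (e.g. '698449,')."""
    out, run = [], ""
    for ch in text:
        if can_break_after(ch):
            if run:
                out.append(run)
                run = ""
            out.append(ch)
        else:
            run += ch
    if run:
        out.append(run)
    return out


def wrap(text, max_cells):
    """Greedily fill each line; start a new line only when it is full."""
    lines, current, used = [], "", 0
    for tok in tokens(text):
        w = sum(cell_width(c) for c in tok)
        if current and used + w > max_cells:
            lines.append(current)
            current, used = "", 0
        current += tok
        used += w
    if current:
        lines.append(current)
    return lines


msg = "【知乎】你的验证码是 698449,此验证码用于登录知乎或重置密码。10 分钟内有效。"
for line in wrap(msg, 16):
    print(line)
```

Joining the returned lines gives back the original message unchanged, which matches the request that the sequence be kept and no spaces or punctuation be introduced.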
if msg 1 now become: if msg2 become: (the following is suppose bangle shows 18 chinese characater on 1 line) thanks Posted at 2023-11-04 by houshou_m Hello, Gordon, As with all languages, it's certainly the case that if you unnaturally break things up, there's the chance of introducing ambiguity. However, it's also the case that context will resolve the matter most of the time. And in the worst case scenario, it's not as if the watch is the only means by which we can read the messages. We can think about the well known joke in English about the importance of commas for a comparable example: "Let's eat grandpa!" Unfortunately, although I am familiar with the linguistic and orthographic side of this problem, I am not familiar with the coding side. But a method that could theoretically be used to handle run on English sentences such as "Thequickbrownfoxjumpsoverthelazydog" would be our best starting point. One solution I can imagine would be to cut the line once it approaches the side margin of the screen at a point where the preceding characters form a block divisible by two and then starting a new line; the software could go on like this until it both reaches the bottom margin and cannot reasonably shrink the bitmaps down any further, at which point it must necessarily elide any remaining text. So applying this proposal for how to cut up lines to article 1 of the Universal Declaration of Human Rights, we would get the following if we assume a character limit of 11 per line (characters bolded to show where words get split up): 人人生而自由﹐在尊嚴 (10 characters) As you can see from the above, only two words get split up (thanks in part to punctuation in Mandarin being full-width, hence taking up the same space as a "normal" character). These splits do not result in any ambiguity in this example either. Posted at 2023-11-04 by ccchan
what do you guys mean by every two characters? a chinese word = a chinese character. in the past, , it take up double the byte of that of an ASCII character. for the number of bytes you store "AB", thanks Posted at 2023-11-04 by houshou_m While in Classical Chinese it was certainly the case that one character most often wrote one word, in modern Mandarin, it is more often the case that a word consists of two syllables, represented by two characters. This change can actually start to be seen as early as Xunzi, who uses a noticeably larger amount of disyllabic words than Mencius. Let's look again at the example sentence from the first article from the Universal Declaration of Human Rights, this time adding spaces between words*: 人人 生 而 自由﹐在 尊嚴 和 權利 上 一律 平等。他們 賦有 理性 和 良心﹐並 應 以 兄弟 關係 的 精神 互相 對待。 Sum of monosyllabic words (one graph, one word): 10 *Note that for the purposes of this conversation, a word should be understood as something one would encounter in daily life and also something that one would find in a dictionary of modern Mandarin. Posted at 2023-11-04 by ccchan i am not sure you natively speak chinese or not. you will be correct that some are "two words" , a short phase, like 人人, 自由 , 尊嚴, 權利, 一律, 平等。他們 , 理性 , 良心﹐, 兄弟 ,關係 , 精神, 互相, 對待。 but some are "single words" like 在, 和 , 和 ,並 , 的. for a bangle.js v2, with that 512KB ram, there wont be a dictionary, there wont be an AI, i am pretty sure that is NO WAY for the bangle to understand the chinese. you may think this looks better, by adding empty spaces for easier reading: but i am 100.000000% sure that, the lines looks 100.000000% same FUNCTIONALLY without the spaces. and the bangle watch dont have the luxury to add them. just A. wrap the msg untouched into new lines will be ok. pls only do C. to introduce empty spaces only if you got AI, you got a dictionary to check for. 
ps: the author of the message will already consider whether the message is understandable, so you really DONT need to introduce empty spaces. this is a simple logic. 人人生而自由﹐在尊嚴和權利上一律平等。他們賦有理性和良心﹐並應以兄弟關係 的精神互相對待。 as seen in #1, the 2 messages i received dont have the empty spaces. Posted at 2023-11-04 by ccchan to be simple: Same as a string of number 8374018374018374018237401837083, thanks Posted at 2023-11-06 by @gfwilliams @user156881 thanks for the info! From what you can say, it feels to me like wrapping (if needed!) every 2 chars is probably best. Because the alternative to not wrapping every two chars is that the text goes off the screen, in which case it is definitely unreadable. But for now, I have just added conversion of Posted at 2023-11-06 by ccchan hi, do you mind clarify what do you mean by that? because currently i can only think of as i told before that for that long chinese sentence, some phrase consist of and honestly practically non-human could interpret another set of phrases, which is expected as even english/other languages sometimes got some ambigious issue due to historical reasons. but if you mean by 1 "block"=2 char, and wrap it, 6 分 thanks Posted at 2023-11-06 by ccchan
so in this example, i 'll say in the author 's orginal line, he/she have already considered that others will understand his/her line without spaces 【知乎】你的验证码是 698449,此验证码用于登录知乎或重置密码。10 分钟内有效。 i just afraid if you rely on the android phone or bangle to choose the length of the phrases, it may hurt, because sometimes the phrases in the dictionary are overlapped, pre-GPT4 AI dont have the ability to do so. And even GPT4 is not good in chinese too. (but chinese made their own AI which process chinese better). Posted at 2023-11-07 by houshou_m @gfwilliams I'm happy to contribute to the development of this device. As for your suggested change, I think that's a fantastic start to tackling this matter. It really might be enough! I will test this out for a few days before reporting back to you. I will also see how it fairs with Japanese. Additionally, I want to clarify my suggestion in the space below, since it seems that I was not able to communicate what I meant well. I apologize for that. I was writing late at night and wasn't as careful about my words as I should have been. I am assuming that there is a character limit to each line of text which the device observes with the default messaging app. If that is the case, then should the punctuation fix not be enough, it would be ideal to be able to wrap the lines in such a way as to preserve a number of characters which is divisible by two. So if there is a limit of, say, 13 characters per line, we wrap at the 12th. Would that be possible? To give an example of what I mean using ccchan's original message, let's say hypothetically that the Bangle.js does have a 13 character limit with the default messaging app. The message should be split up like so: 【知乎】你的验证码是 6 (12) Each line of text is wrapped perfectly in this example. No Chinese is split up in a way that splits a two syllable word across two lines. 
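The even-count proposal above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `wrap_even` is a hypothetical name, every character (full- or half-width, including spaces) is counted as one, and the limit of 13 is the hypothetical figure from the post, not a measured Bangle.js value.

```python
def wrap_even(text, limit):
    """Cut text into lines whose length is the largest even number
    that does not exceed the per-line character limit."""
    per_line = limit - (limit % 2)  # e.g. limit 13 -> 12 characters per line
    if per_line < 2:
        raise ValueError("limit must be at least 2")
    return [text[i:i + per_line] for i in range(0, len(text), per_line)]


msg = "【知乎】你的验证码是 698449,此验证码用于登录知乎或重置密码。10 分钟内有效。"
for line in wrap_even(msg, 13):
    print(line, f"({len(line)})")
```

Because this counts characters rather than pixels, it would not by itself handle lines that mix full- and half-width characters of different rendered widths.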
I had the second to last line be 13 characters on the assumption that Gadgetbridge would be able to handle whitespace normally, even within a message consisting of both full- and half-width characters. @ccchan I hope the above explanation clarifies what I mean. We are only concerned with the number of characters per line. There is no need for any kind of advanced programming to deal with splitting lines of Chinese text this way. Gadgetbridge only needs to be able to count characters and be aware of character limits. Again, I apologize for not being clear. Posted at 2023-11-07 by ccchan i still could not understand what you guys mean by "wrapping (if needed!) every 2 chars is probably best." because your example looks like it's wrapping by even numbers, not two. anyway may be you guys could proceed ahead, just please try leave me an option to choose the versions so i can still use a working one while it's improving, thanks Posted at 2023-11-07 by @gfwilliams @houshou_m great! Yes, I think we're on the same page there. I'm not expecting to wrap every 2 characters, but merely to as you say: "wrap the lines in such a way as to preserve a number of characters which is divisible by two" Also, I just wanted to check - are Posted at 2023-11-08 by houshou_m Yes, for all intents and purposes, those characters are equivalent. Chinese also has the optional I did some tests with the nightly build of Bangle.js Gadgetbridge (commit 1aadc04fd) today, and unfortunately nothing seems to have changed. Chinese text was still running off the screen, and no line wrapping was being done. Also, I don't know if I was just experiencing some exceptionally well-timed bad luck, but I was not always being notified on my watch when a message containing Chinese was sent to me --- this despite the fact that I never encounter this issue with messages using ASCII characters in the body. 
The messages would simply be completely absent from my watch, even when there was a lull of 10 seconds between the messages. I do not have Gadgetbridge set to limit messages if multiple messages come in too quickly, and I of course had the screen of my phone off while performing these tests, which I did using two different instant messaging platforms. I also tested to see how Japanese would be handled. Although I received the Japanese messages on my watch, the device strangely would not buzz to alert me to them. Furthermore, the messages were all blank. Perhaps these are two related issues? These issues occurred with messages that had both English and Japanese, as well as only Japanese. I took pictures to show the issues I have described and have attached them to this message. I hope this helps. The Chinese message sent in full is supposed to be: 孟子見梁惠王,王立於沼上,顧鴻鴈麋鹿,曰:「賢者亦樂此乎?」 Attachments: Posted at 2023-11-08 by ccchan
hi, do you need to pay money to send this test message?
Posted at 2023-11-08 by ccchan
Yeah, quite often my Bangle turns into this and I don't know what to do. I guess some messages with errors make it this way. Thanks.
Attachments:
Posted at 2023-11-08 by @gfwilliams
You could try looking at the Gadgetbridge log to see if any errors were reported by Bangle.js when the screen got displayed like that? If it's reproducible you could ensure that in the
Posted at 2023-11-08 by ccchan
Okay, will do that next time,
Posted at 2023-11-09 by houshou_m
Gordon, I sent you a direct message with my watch's log and a description of the testing I did. Please let me know if there is anything else I can do.
@ccchan You can get your friends to send you quotes from Mencius if you just ask. I'm sure they'll do it for free. Mine did :)
Posted at 2023-11-12 by ccchan
PS: as I installed both the default "message UI" and the "message list", today I found that some Chinese SMS were wrapped correctly in the "message list". Yet the blank-message problem still occurs with it, so I'll try to record the logs for it later.
Posted at 2023-11-13 by @gfwilliams
Thanks for the update. I believe I have now fixed this - there were a bunch of issues:
So you need to update Gadgetbridge to nightlies, as well as the Bangle.js firmware, but hopefully it's sorted now!
Posted at 2023-11-14 by ccchan
@houshou_m do you have experience using the GB nightly and ... updating the Bangle.js firmware? It will take me some time to figure it out, thanks. Or in a few days' time I could try it myself.
===========extra, msg list also not working=======
Anyway, I'll screenshot it and copy it here first, for reference, with software version numbers.
Summary: both are still not wrapping correctly. Better to try out Gordon's modifications later, thanks.
The 1st attachment is "msg UI", the 2nd one is "msg list".
Attachments:
Posted at 2023-11-14 by @thyttan
You'll find a link to Gadgetbridge 'nightly' builds on this page: https://www.espruino.com/Gadgetbridge#how-to-set-up
Here's the bangle.js firmware updater on the app loader (there will be one file that's the 'cutting edge' one):
Posted at 2023-11-15 by ccchan
hi,
PS: previously I was using SMS; currently I am using WhatsApp to test the text.
Attachments:
Posted at 2023-11-15 by ccchan
OK, thanks for the instructions.
Posted at 2023-11-15 by ccchan
When I compare the screenshots, you can view the screenshots in the thread, thanks.
Posted at 2023-11-16 by @gfwilliams
Ok, so I think you're on the latest Gadgetbridge, that's good, and should mean that you no longer have messages that won't show on the Bangle. Now you need to update your Bangle.js firmware to a cutting edge build. As an aside, I've been looking into storing the full Chinese fonts on the Bangle using the PBF font format that's been supported for a few versions now. However it seems the font format itself, while designed for Unicode, wasn't designed for Chinese unicode usage. It can only store around 10,000 characters maximum inside it before overflowing the indices in the file format. https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode would seem to imply that there are 90,000 characters in full, with around 20,000 that we have to support (even though that also supports Japanese and Korean in the same set of chars). Do you believe that really is the case? If it's possible to get down under 10,000 chars then I can use the current PBF file format with current firmware, otherwise I'll have to make some custom changes to the file format and update the Bangle.js firmware to get it working Posted at 2023-11-16 by ccchan isn't i am already on the "edge" version of 2v19? Posted at 2023-11-16 by ccchan hi, HOWEVER there are 1. simplified chinese characters, invented by china-mainland after communist took place in 1949 (now used by the 1.4 billion china-mainland users), and 2. those tradional chinese characters, used in china-taiwan, china-hongkong, china-macau (21 million+7 million+1 million users.). yeah i knew why it say 20,000 then. it would be a honor to have chinese fonts stored onto the bangle, but will it use up a lot of resources? also, althought chinese is 1 of 6 languages used in United Nations, what if other languages users also want to have their fonts stored on the watch? 
thanks =========================User ChatGPT In general, a literate person in a Chinese-speaking environment is usually familiar with at least 3,000 to 4,000 characters, which are enough for everyday reading and writing. However, for more advanced literacy, such as reading newspapers, literature, or academic texts, knowledge of 6,000 to 8,000 characters is often necessary. The education system in Hong Kong, like in many Chinese-speaking regions, emphasizes the learning of a large number of characters from an early age. However, the number of characters used in daily life can be much lower, as many characters are used infrequently. For daily practical purposes, such as texting, social media, casual reading, and signage, a smaller subset of these characters is often sufficient. It's also worth noting that Hong Kong uses traditional Chinese characters, which are more complex than the simplified characters used in Mainland China. To sum up, while a literate person in Hong Kong might know several thousand Chinese characters, the number actively used in daily life is likely lower, focusing on a core set of commonly used characters. Posted at 2023-11-16 by ccchan https://eriksen.com/language/simplified-vs-traditional-chinese/ it's quite complex.... Posted at 2023-11-16 by @gfwilliams
No - that's just the last release. Go to the firmware update app and use 'cutting edge' - it should be something like 2v19.60 or later
Well it's heartening to see that maybe 8000 chars would do it - now can you find out what their character codes are? :)
Adding all 20,000 chars uses about 850kB of flash. It takes a while to upload, but the Bangle has 8MB and usually uses maybe only 2% of that - so there is plenty to go around.
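As a rough check of those figures, the arithmetic can be written out in Python. The constants come from the post; treating "kB" as 1000 bytes and "MB" as 1024 * 1024 bytes, and assuming font size scales linearly with glyph count, are assumptions of this sketch, and `estimate_font_bytes` is a hypothetical helper.

```python
FONT_BYTES = 850 * 1000         # "about 850kB" for the full font (decimal kB assumed)
GLYPHS = 20_000                 # "all 20,000 chars"
FLASH_BYTES = 8 * 1024 * 1024   # "8MB" of flash (binary MB assumed)

bytes_per_glyph = FONT_BYTES / GLYPHS   # average storage per character
flash_share = FONT_BYTES / FLASH_BYTES  # fraction of flash the full font takes


def estimate_font_bytes(glyph_count):
    """Linear estimate of font size for a reduced character set."""
    return glyph_count * bytes_per_glyph


print(f"{bytes_per_glyph:.1f} bytes per glyph")
print(f"{flash_share:.1%} of flash for all {GLYPHS} glyphs")
print(f"{estimate_font_bytes(8000) / 1000:.0f} kB for 8000 glyphs")
```

Under these assumptions the full font averages 42.5 bytes per glyph and takes roughly a tenth of the flash, and an 8000-character subset would be around 340 kB.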
Well, most of the UN could be handled by under 1000 chars which can easily be part of the main font. It's nothing compared to 20,000. But my plan was to have separate font apps for different Locales, but it would seem that splitting the characters that way isn't actually very straightforward. If someone had a list of characters for each locale it'd make it really easy.
Posted at 2023-11-16 by ccchan
hi,
gpt4 result: in unicode, there are just about 10000 daily use traditional chinese characters and some 10000 daily used simplified chinese characters. do you know their codes? the code ranges? ChatGPT: In Unicode, Traditional and Simplified Chinese characters are not separated into distinct blocks based solely on their writing system. Instead, they are mostly included together in several blocks, primarily in the "CJK Unified Ideographs" block. The range of these characters in Unicode is quite extensive, encompassing multiple blocks. Here's an overview: CJK Unified Ideographs: This is the largest block of Chinese characters. It includes both Traditional and Simplified forms. The range is U+4E00 to U+9FFF. This block contains the majority of characters used in daily Chinese text. CJK Unified Ideographs Extension A: This block extends the basic set and includes more obscure and less frequently used characters. The range is U+3400 to U+4DBF. CJK Unified Ideographs Extension B to Extension G: These blocks contain even more rarely used, historic, or dialect-specific characters. The ranges are: Extension B: U+20000 to U+2A6DF CJK Compatibility Ideographs: This block includes characters that are variants of those in the main CJK block and are used for compatibility purposes with other character sets. The range is U+F900 to U+FAFF. CJK Compatibility Ideographs Supplement: This is an extension of the compatibility ideographs. The range is U+2F800 to U+2FA1F. These ranges cover both Traditional and Simplified characters, and the distinction between the two is often based on context and regional usage rather than Unicode encoding. Unicode primarily focuses on the unification of characters that have the same etymological root, regardless of regional script variations. You: for the above ranges, can you also tell how many characters are within them? 
thanks ChatGPT: Here is the number of characters in each Unicode block relevant to Chinese characters: CJK Unified Ideographs: 20,992 characters These counts reflect the number of individual characters encoded in each respective block. Posted at 2023-11-16 by ccchan let me try upload the 2 screenshot for the SAME msg later. thanks Posted at 2023-11-16 by ccchan ps, so gpt4 say trad. chinese and simpl. chinese have union, and each have unique parts. and these 3 parts are all within the 1st block of 20000 characters. gpt4 record: You: i dont understand, for the english "old", tell me the code for the chinese word 舊and 旧 then. ChatGPT: The Chinese characters "舊" (Traditional) and "旧" (Simplified), both meaning "old" in English, have different Unicode code points. Here are their respective codes: 舊 (Traditional Chinese): The Unicode code for this character is U+820A. 旧 (Simplified Chinese): The Unicode code for this character is U+65E7. These codes are hexadecimal representations of their respective positions in the Unicode character set. You can use these codes in various digital systems to represent these specific characters. You: so, the 1st block of 20000 you mentioned will include both the traditional and simplified daily use chinese characters? ChatGPT: Yes, the first block, "CJK Unified Ideographs" (U+4E00 to U+9FFF), includes both Traditional and Simplified Chinese characters commonly used in daily life. This block is designed to encompass a wide range of frequently used Chinese characters, irrespective of whether they are in Traditional or Simplified form. The Unicode system aims to unify characters with the same etymological roots, so many characters that have both Traditional and Simplified forms are encoded only once in this block. The distinction between Traditional and Simplified in Unicode is often a matter of font rendering and context rather than separate encoding. 
For specific characters that have different forms in Traditional and Simplified Chinese, like "舊" (U+820A) and "旧" (U+65E7), they are encoded separately, as they have distinct graphical representations. Such characters may appear in the same block (as is the case with these examples) or in different blocks, depending on their Unicode allocation. You: will some trad. or simpl. chinese have the same coding then? ChatGPT: Yes, many Traditional and Simplified Chinese characters share the same Unicode code point. This is because Unicode is designed to unify characters that are essentially the same, regardless of regional script variations. The distinction between Traditional and Simplified Chinese is often a matter of how the character is rendered (i.e., displayed or printed), rather than how it is encoded in Unicode. For example, a character that has the same form in both Traditional and Simplified Chinese scripts will have only one Unicode code point. It's the font and language settings of the user's system that determine how the character is displayed. In cases where the Traditional and Simplified forms of a character are graphically different, they are assigned separate Unicode code points, as seen with "舊" (U+820A) and "旧" (U+65E7). Therefore, while there are distinct Traditional and Simplified forms for some characters, many others are encoded just once in Unicode, covering both scripts. This approach reduces redundancy and simplifies text processing across different languages that use Chinese characters. You: show me one that is same in trad. and simpl chinese. ChatGPT: A common example of a character that is the same in both Traditional and Simplified Chinese is the character for "water," which is written as 水 in both scripts. In Unicode, this character has the code point U+6C34. This character exemplifies many instances where the Traditional and Simplified forms of a Chinese character are identical, and therefore, they share the same Unicode encoding. 
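The code points quoted in this exchange can be checked directly rather than taken on trust from a chatbot. The short Python sketch below prints the code point of each example character and whether it falls in the U+4E00 to U+9FFF block; `CJK_UNIFIED` is just a local name for that range, and its length counts code points in the block, which need not equal the number of characters in everyday use.

```python
# The "CJK Unified Ideographs" block, U+4E00 to U+9FFF inclusive.
CJK_UNIFIED = range(0x4E00, 0x9FFF + 1)

for ch in "舊旧水":
    cp = ord(ch)
    print(ch, f"U+{cp:04X}", "in main CJK block:", cp in CJK_UNIFIED)

# Number of code points in the block (not the number of common characters).
print(len(CJK_UNIFIED))
```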
The character 水 (U+6C34) is used in a wide range of contexts and is a fundamental character in both Traditional and Simplified Chinese writing systems. Posted at 2023-11-17 by @gfwilliams So I guess that means 'no', you can't find a list of common characters? If we were to go for it, perhaps the only other option is to get a big corpus of representative Chinese text, and to check what character codes are in it. ... but I did fix the PBF file format now, so as long as you're on the latest cutting edge firmware it is possible to load a font file with all 20,000 glyphs onto the Bangle. Posted at 2023-11-17 by ccchan
Incorrect wrapping is the major problem, as sometimes the important info is off the screen, thanks.
Posted at 2023-11-17 by ccchan
hi
When the msg appears by itself it looks like "it-show-up"; if I manually open the msg app and click to view the msg, it appears differently, as in "press-into".
Original msg: 【知乎】你的验证码是 488379,此验证码用于登录知乎或重置密码。10 分钟内有效。
I would suggest just adding a new line after a punctuation mark.
Attachments:
Posted at 2023-11-17 by @gfwilliams
It has a lot to do with the wrapping - what version does it say on the about page of the Bangle?
Yes, the font size is different? Otherwise it appears to wrap at the same points? But I'm not sure what's going on with your firmware because I just tried the exact text you pasted on a fresh install of latest Gadgetbridge and Bangle.js firmware and I see this - which seems a lot better?
Attachments:
Posted at 2023-11-17 by @thyttan
Looking at the commits on Gadgetbridge master branch you seem to have a recent enough Gadgetbridge nightly (https://codeberg.org/Freeyourgadget/Gadgetbridge/commits/branch/master). But judging by that Bangle.js screenshot of the about app the firmware is not updated to cutting edge version there. So update that via the app loader on https://banglejs.com/apps/?q=fwupdate (I guess you already know this step. But something seems to have made it so it didn't update for you).
Posted at 2023-11-18 by houshou_m
There's a new cutting edge update --- 2v19.67. This update might resolve the issue of Chinese text displaying improperly within the messaging app itself, which is where you end up if you tap on the message when it is displayed. This is a separate issue to the matter of text not displaying properly when first shown on the watch. I have been working with 2v19.60 and can confirm that Chinese text now displays properly outside of the app --- i.e. when one views the message without interacting with it.
@gfwilliams I will get back to you regarding a list of common Chinese characters. I am in fact aware of one that isn't sourced from an unthinking algorithm, but I would like to see if there are alternatives.
Posted at 2023-11-18 by ccchan
hi, if you people can kindly teach me how to identify the latest firmware, I may be able to try it myself. Thanks.
BTW, it's simply clicking the cutting edge ROM link in GB.js and then clicking the "upload" button, correct? I did see the v67 as the cutting edge ROM. Or did I miss some steps?
Posted at 2023-11-18 by ccchan
I tried again in GB.js, clicking the .67 ROM and upload.
Posted at 2023-11-18 by ccchan
OK, it scared me; I worried I did something wrong when it said flashing...
Posted at 2023-11-18 by ccchan
(I can't test at the moment; the msg didn't pop up on the Bangle.)
Posted at 2023-11-18 by ccchan
It worked somehow, thanks.
PS: I needed to re-install the msg lib and msg UI to make the msg pop up. The whole msg is within the screen now, so no info is lost. Thank you.
BTW, I'll test with "msg list" later.
Posted at 2023-11-18 by houshou_m
This site has a download to a text file containing every character in the Big5 encoding scheme for Traditional Chinese characters, sorted by frequency of use. Although it's from the 90s, the corpus it pulls from is huge, having over a million tokens. It was compiled by an academic source. For Simplified Chinese, one can consult the PRC's Table of General Standard Chinese Characters (Tongyong Guifan Hanzi Biao 通用规范汉字表). It consists of over 8000 characters and is divided into three tiers from most to least frequent, per the government's analysis. You can download a text file of it here at Wikisource. Once downloaded, you will see that each tier is enclosed in curly brackets. Per the document, tiers one and two consist of 3500 & 3000 characters respectively and meet the needs of the sectors for education and publication. Tier three consists of an additional 1605 characters and includes characters you'll see in names, technical jargon, and idioms from classical literature (i.e. archaic words).
The characters within each tier are not sorted by frequency (the first person pronoun wo 我 appears in spot 761, for example), so this source is more a way of providing batches of characters one is likely to encounter. Between the two of them, the Traditional Chinese source has the better frequency sorting, but in the grand scheme of things, both should fulfill the need of knowing which characters are essential. Now, there's no definite answer to the question of how many characters you should take from each. Due to the poor organization of the PRC's character list, I wouldn't take less than 3500 so that the essentials can be covered. In the Traditional character list, that would cover characters up to the 99.68 percentile in usage.
EDIT: I have talked it over with a colleague, and I think using 2500-3000 of the most common characters would be sufficient if we're hurting for space. This site sorts Simplified characters by frequency, but they don't offer a text file of it. I don't really know what you'd need to make the locales, so do let me know. Depending on how you go about things, you don't need a separate list for Traditional characters, for example.
EDIT 2: This would make an excellent reference for characters to have. Use unicode, of course, though.
Posted at 2023-11-18 by ccchan
so for the latest GB.js + latest firmware,
Posted at 2023-11-20 by houshou_m
I'm pleased to say that I can also report that the issue seems to be resolved now. No problems with either the initial notification or opening the message in the messaging app.
Posted at 2023-11-20 by @gfwilliams
That's great news! Nice to get this properly sorted. @houshou_m thanks for all your research on this! Actually even 8000 characters from the first 3 bits of the PRC list would probably fit just fine into the font file.
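The idea raised earlier in the thread, of taking a representative body of Chinese text and checking which character codes occur in it, could be sketched in Python as follows. This is an illustration only: `top_cjk_chars` and `to_code_points` are hypothetical helpers, the single notification used as a "corpus" is far too small to be representative, and restricting the count to U+4E00 to U+9FFF is an assumption.

```python
from collections import Counter


def top_cjk_chars(corpus, n):
    """Return the n most frequent characters in the main CJK block."""
    counts = Counter(ch for ch in corpus if 0x4E00 <= ord(ch) <= 0x9FFF)
    return [ch for ch, _ in counts.most_common(n)]


def to_code_points(chars):
    """Sorted list of code points, e.g. as input for a font-building step."""
    return sorted(ord(ch) for ch in chars)


sample = "【知乎】你的验证码是 698449,此验证码用于登录知乎或重置密码。10 分钟内有效。"
common = top_cjk_chars(sample, 5)
print(common)
print([f"U+{cp:04X}" for cp in to_code_points(common)])
```

With a frequency-sorted list like the ones linked above, the counting step would not be needed; only the conversion to code points would remain.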
I did manage to fix the issues with the font file format so we can have all chars, but it needs the cutting edge firmware (or 2v20 when released) so isn't ideal - it also takes ages to upload, so having a cut down file would be great! I just need to have a think about how best to package it up so apps can use it (I'd rather not have to have a Chinese fork of the message app to get it working!)
Posted at 2023-11-21 by ccchan
Currently it works well; I have used it for several days and all msgs now read meaningfully,
Posted at 2023-11-22 by houshou_m
I think it would be a good idea to edit the title of this thread to include "[Solved]" in it without the quotes so that others can see this issue has been taken care of.
-
Posted at 2023-10-28 by ccchan
hi,
I am using Chinese.
Assume I can tolerate using font12.
But when the message is too long, only the middle part is shown,
rendering it quite useless.
Is it possible to provide an option to wrap the msg?
The attachments below include a working short Google message,
the longer not-useful message, the msg itself, and the content of the log file.
I think I should have removed my personal info from them.
Thanks.
PS: the log file is at the bottom, as the previous one with (storagefile) had a problem, thx.
In the message app, the setting is "font min: small".
Attachments: