Simplified 8-dot worldwide unified mapping for the hexadecimal value of Unicode characters #688
I'm pretty sure @dkager has an opinion :)
I'm not a braille reader, so I cannot judge. I understand the need to make the display of emojis short. However, your system sounds pretty complicated. How does this solution fare with respect to existing braille tables and conventions?
@egli: It doesn't change the behaviour of any existing braille table – as long as the four mentioned prefix braille characters aren't already used. Furthermore, this reduction doesn't only affect emojis. And there is a huge benefit regarding newly introduced Unicode characters: they don't have to be defined separately, because they can already be read with the system described above. Okay, that's not really new, and depending on the selected braille table they can already be read right now. My system only saves space on a braille display, nothing more. But this is needed; the fewer cells I have to read on a braille display, the faster I can recognize the character behind them. And please don't compare the solution described above with "classic" braille tables, because its functionality is completely different – and easy to understand. You only need to know the Unicode hexadecimal code for those characters you really need. And if you don't know the character behind such a shrunken braille combination, it is really easy to convert the braille dots back into the hexadecimal value. You only need to know in which order they are written in braille. That's all.
I recently updated the issue title for clarification. My system doesn't replace any characters which are already defined in a braille table.
The question that comes up for me is how one would know what e.g. U+2840 refers to. In other words, your code shortens the text displayed in braille, but leaves out a mapping from a number to an emoji or other Unicode character.
@dkager: The relation between the hexadecimal value of a Unicode character and its name can be found by a quick web search. By the way, it is easier to find out the name of the character behind the dots this way than with classic 8-dot tables like "de-de-comp8.ctb", where in most cases only the hexadecimal values and the dots are noted down. So if you can read the hexadecimal values directly, which is the only way to make these tables (see also #689) usable for all languages in the same way, you save one step. Well, and if you are offline, you can also take a look at the outdated, unused table "unicodedefs.cti" to find out the name of a Unicode character. And don't forget that my solution has one more big benefit compared with other tables: there are no multiple meanings for one combination of three, four or five characters (see also #689). One combination is always related to a single Unicode character. So you don't have to learn a lot of new rules for these new tables, which will only be used for undefined Unicode characters.
I created a test file for 6-dot and one for 8-dot to demonstrate how my idea would look with emoticons (U+1F600 to U+1F64F, UTF-16 encoding). You can include these text files in another table and test them with NVDA. Little hint: It would be very nice if the Liblouis manual said something about the file encoding (UTF-8 without BOM) and the type of line ending (LF). The latter was clear to me, but the BOM killed NVDA again and again (no braille output after restarting). I always thought that I had made a mistake with the include syntax. But no, after approximately one hour (or maybe just half an hour) I found the issue. Well, it doesn't matter now. 😉 [Update 2019-01-28 17:41 CET] The 6-dot table was added. [/Update]
You might get more feedback if you post this idea on the mailing list. It's an interesting idea. But you have to help me understand the use case. It's clear that if you use this code for all text, you get an increase in the number of braille cells needed. So to make sense, you can only use it for uncommon characters. Is this what you have in mind? I'm not a braille reader, but to me it would seem more logical to encode uncommon characters with a method that is maybe longer, but easier to memorize. Don't get me wrong, I see how your method is better than Liblouis' default method for undefined characters. However, the improvement is only marginal assuming it is used for uncommon characters only. The biggest issue with the current method, and also your method, is that it takes time to look up the Unicode numbers (unless you can memorize them).
@bertfrees: Please read all my previous comments on this issue first. I guess it is absolutely clear that "⣥⣺⢽⣥⣺⡟⣥⣺⠵⣥⣺⠋⣥⣺⠋⣥⣺⠋⣥⣺⡋" instead of "DrSooom" makes no sense. Furthermore, searching the web for the hexadecimal value of a Unicode character is even faster than searching for the braille character ⣼ (dots 345678) – or even for a combination of braille characters, which are often used in 6-dot tables. I guess it's quite hard to find the rules for a specific table when they are written in a different language than the table itself (e.g. full Japanese documentation for German Grade 2). But the "marginal improvement" of my solution is to save five cells on a braille display, which is quite a lot of space. With NVDA 2018.x only five emoticons (U+1F600 to U+1F64F) can be shown at the same time on an 80-cell braille display (because of UTF-16/surrogate splitting, see nvaccess/nvda#9044 for more details). But with my solution only 30 cells (and maybe later only 15 cells) are required for presenting exactly the same values. In other words: you can read more at the same time without having to scroll the braille display all the time. And in the end this always means that you are able to recognize the hexadecimal values much faster, because you only have to read three cells instead of eight.
I'm not a braille reader so I'm probably not qualified to say anything, but this strikes a chord, so I have to voice my concern. The thing about braille IMHO is that we have to make it accessible to people. Contractions might make it smaller on a display or on paper. But they do not make it more accessible. In fact they achieve the opposite: they make braille harder to produce, harder to learn and, I would claim, also harder to read. So personally, while I see the space issue with emoticons, I am not into very intricate schemes that contract the braille to squeeze out a few braille cells at the cost of teaching this scheme to people. Could you not instead just show the textual representation of an emoticon, such as the CLDR Short Name for example? I would expect this to make much more sense to an unsuspecting user. It seems much better than the complicated untangling of a contraction scheme for a hex number which would then have to be googled to finally find out what it really means. To get back to your proposed solution: I'm not opposed to including such a table. I'm just not sure we should make it the standard.
@DrSooom I have read all your previous comments, but not everybody will make the effort to do so, so I wanted to give you an opportunity to summarize what the use cases are that you have in mind. It is essential to understand this because every problem has a solution that works best for that specific case. Also it may be that a certain solution is intrinsically better, but that the effort it takes to develop a new braille code and to make it be adopted outweighs the advantages. I'm still not sure whether this is theoretical talking, or whether there is a concrete use case:
For the emoji use case, I was also thinking along the lines of what Christian suggested, i.e. a textual representation. I'm not saying that this would be a better solution, just trying to prove my point that different use cases may have different solutions. I'd suggest sharing this idea on the mailing list. The mailing list is where the people are that know more about braille and the development of braille codes. GitHub is more where the technical people are. Technically I don't see anything wrong with your method; that is, it indeed does a good job of minimizing the number of braille cells needed, it can be decoded relatively easily (with the help of the web) and it is unambiguous. These are essentially the things that make up a good braille code. However, even if technically sound, there may be practical considerations that we technical people are forgetting. In any case, I don't think it is our job to implement something that is not supported by a community or official entity, so the first step is to sell it to them. It's probably good to start with the BSKDL, and who knows, maybe more will follow?
No, it's for all tables where it is possible to include them – always as an optional, shorter mapping for undefined Unicode characters.
All Unicode characters from U+0000 to U+10FFFF. The emoticons were just for testing and demonstration, because I only had to define 81 characters. Maybe I will extend my 8-dot test table to all 2048 surrogates (U+D800 to U+DFFF), because the current mishmash in this Unicode area on my system is beginning to get on my nerves. 😉
Yes and no. If you want to learn a new language, would you also want to learn a new braille table? Sometimes it isn't necessary and sometimes you have to. It depends on the language combination. Reading English and French with the German 8-dot table isn't a problem, but try Japanese.
Yes.
Yep, in the end I just want to provide a better solution compared with '\xhhhh' and so on. The BSKDL has already been informed about this issue and #689. If desired, I could give a talk at the SightCity Forum 2020 in Frankfurt/Main (Germany) about this. The call for papers for the SightCity Forum 2019 has already ended. But if desired, I could try to ask for a 30-minute talk even for 2019, although the chances are bad. And regarding the emojis: they can be misunderstood by their names if you are chatting in different languages (system is set to German, but the conversation is in French). So if you change their names into the different languages, you also have to learn them together with the other language's table. Read this article for more details. That's why, in my opinion, one braille combination per single Unicode character fits better here. In older board software like phpBB or PunBB
OK, thanks. This confirms more or less what I thought. Forgive me for saying, but it all sounds rather theoretical at this point. I don't mean anything negative by this. I just didn't hear many concrete use cases. Real use cases (such as the musical symbols ♭, ♮ and ♯) would generally benefit from proper support in braille codes I think. So the main practical use of your proposed code would be for symbols that are either too uncommon to support in the native braille code, or as a transitional measure for more common symbols until they have a proper representation.
But how many of these do you actually encounter in the wild that have no braille representation yet? Regarding learning new languages, I would really like to hear some stories from blind people who learned a new language but couldn't use their own braille system, and for whom learning the new braille system was not an effective way to do it. By the way, it is funny that you mention Japanese, because especially for braille systems like Japanese and Chinese, which are mainly phonetically based, your method seems very ineffective. It was probably just a bad example though :)
I think that is very ambitious, but you have my support. I don't quite get what you are saying about emojis. Can't we assume that the system language is set to the language that corresponds with your braille system of choice?
Ah, cool. I've never been there but maybe I should go once.
Not only for more common symbols, for all Unicode characters at once.
Thousands upon thousands of characters. I have quite a lot of C-, Canto-, Mando- and J-Pop and hundreds of soundtracks from Japan on my hard drive. Displaying their titles needs quite a lot of space on a braille display. And setting dot 0 for all undefined characters, as in "fr-bfu-comp8.utb", is no solution for me either, because I want to select a title without using TTS. I must be able to identify them on the braille display alone. So shortening them all fits best for me.
Well, too little knowledge on my side. 😉 But also read my previous comment regarding music titles written in Mandarin, Cantonese or Japanese. We can also use the 1071 Egyptian Hieroglyphs (U+13000 to U+1342F) as an additional example. I don't think that creating a braille table only for these characters in the different languages makes any sense. So shortening the hexadecimal value from 9 to 3 cells would be the only suitable way here – always as an option for the end user. In UTF-16 you will only need 6 instead of 16 cells to present the same information – and fully independent of the currently used language.
If you want to replace them with descriptive text like "Kissing Face with Smiling Eyes", then no – especially if you want to print them too. How emojis and all other Unicode characters are displayed depends on the OS and on the (smartphone) application. The Unicode code point helps here to make it easier to transfer text between different OSes and applications. If there are no replacements (graphical or text-based) in the end application, the user should be able to read the hexadecimal value of the Unicode character itself. Well, and I'm going to visit SightCity 2019 – my 7th time, by the way. So I'm already able to talk there with some companies and organisations from around the globe about this issue and #689.
I understand. But I'm talking about practical use. To matter in practice the characters need to be common enough, or you need to encounter a lot of them at once. Your example of Chinese/Japanese track names was a very enlightening one for me. Thanks. Even if you can't understand Chinese, it is perfectly possible to recognize the track names based on the dot patterns, or in my case, by recognizing the sequence of glyphs without knowing their meaning. So in this case it doesn't really make sense to learn the Chinese braille system, just like it doesn't really make sense to learn how the tracks are pronounced. And of course when you encounter some Chinese text, it is not an isolated symbol, as is the case with an emoticon for instance, so you will indeed get the "scroll and scroll and scroll" issue if the code isn't short enough.
As my mappings only depend on the hexadecimal values, how common a character is doesn't matter at all. In the end I just define all 65536 Unicode characters (U+0000 to U+FFFF), which would be enough for UTF-16. For UTF-32 the next two blocks of 65536 Unicode characters are required (for the upcoming ten years). That's why I asked for a better solution for displaying them in my introductory comment. But with the table solution I still have to figure out what happens if a character is defined twice. If Liblouis (or NVDA) always chooses the first or always the last definition, everything is good. Otherwise we have a problem. Update 2019-01-30 21:40 CET: The first definition seems to be the decisive one. So adding my test tables with
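A table covering a whole 65536-character range could be generated mechanically rather than written by hand. Here is a small sketch of such a generator. It assumes liblouis' `sign` opcode and its `\xhhhh`/`\yhhhhh` character escapes (the `\y` form for 5-digit code points is the one mentioned later in this thread regarding "unicodedefs.cti"); the dot pattern `12345678` is a placeholder – a real HUC table would compute the per-character dot patterns instead:

```python
# Hypothetical generator for a liblouis table fragment defining a range
# of Unicode characters. The dots here (all eight, for every character)
# are a PLACEHOLDER, not the HUC dot assignment.

def escape(cp: int) -> str:
    # liblouis character escapes: \xhhhh for 4 hex digits,
    # \yhhhhh for 5 hex digits (code points beyond the BMP)
    return f"\\x{cp:04X}" if cp <= 0xFFFF else f"\\y{cp:05X}"

def table_lines(start: int, end: int):
    # One `sign` definition per code point in the range.
    for cp in range(start, end + 1):
        yield f"sign {escape(cp)} 12345678"

# Example: the first few Egyptian Hieroglyphs mentioned in this thread.
lines = list(table_lines(0x13000, 0x13002))
assert lines[0] == "sign \\y13000 12345678"
assert escape(0x0041) == "\\x0041"
```

Since the thread notes that the first definition wins, such a generated table would have to be included after the primary table so that it never overrides already-defined characters.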
After approximately an hour and a half, the 8-dot test file for all 2048 surrogates (U+D800 to U+DFFF) was created – and of course successfully tested with NVDA 2018.1, which uses UTF-16. That went significantly faster than I thought. 😀
Hi @DrSooom, can you turn this into a pull request?
@dkager, to come back to your comment: I think if there were a speech dictionary for those undefined characters, it would make it quite easy for users to find out which character it is. You just need to listen to the voice. But in this case we should really also look at the performance, especially in big documents with many undefined Unicode characters.
@Adriani90: Such a file already exists on your hard drive – it's called "unicodedefs.cti". And it's part of Liblouis, but outdated, as I already mentioned here. And there is also a significant mistake in this file: "\x1D11E" (and so on) must be replaced with "\y1D11E". But in the end that's a different issue and has nothing to do with #688 and #689. And no, I'm not going to add the names of the Unicode characters to my tables, because they are already big enough (~1.6 MiB for the first 65536 characters in the 8-dot table; the 6-dot one is even bigger). By the way: in the meantime I found the note regarding "UTF-8 without BOM" in this wiki article (chapter "How do I edit the YAML files?").
I'm going to create all 8- and 6-dot tables for UTF-16 and UTF-32. I will add a new comment here, after I've finished everything. |
As these days every single good idea needs its own website, I just created one for the Hexadecimal Unicode Characters Braille Tables. And here is the official announcement. @egli and @bertfrees: If you want to have the HUC8 Braille Tables already in Liblouis 3.9, feel free to copy the 19 files from the 7z archive into the "tables" folder in Liblouis, because I'm not going to open a PR regarding this in the upcoming days. |
Nice! Thank you. Whether we will include this in 3.9 depends on how much time we'll have over the weekend. I'm not very hopeful myself.
As mentioned in #730, the tables should either be distributed independently (outside of liblouis) or a new mode should be implemented that emits HUC Braille for unknown Unicode characters. |
@egli: Could you explain why you also closed issues #688 and #689 just now, as PR #730 only showed one way to fix this? Please read the section "Technical solution" in the issue description of #688 as well as this comment in issue #689 and nvaccess/nvda#8702, which should be the goal. Therefore I ask you to re-open issues #688 and #689, as both aren't fixed yet and will now be fixed by another PR in the future.
Thanks. Could you still remove the label "needs test" for this issue here? Then it would be identical to issue #689. PS: This comment in issue #664 was a little bit bad. Now I know how to improve the HUC Braille Tables regarding UTF-16 surrogate pairs. But before I'm going to do this, I have to finalize the FAQ section in the documentation first. |
For info, I challenged myself to make a POC of this mode (in Python for now, probably in C later). Available here -> https://github.com/Andre9642/HUC-braille-converter |
This issue is still open because we're keeping the possibility open to implement HUC via a new mode or a new opcode or other new feature, anything that doesn't result in a huge table. To improve the chance that this gets picked up, it would be nice to have some kind of YAML test that explains the requirement in a simple way. |
Introduction:
Several months ago I read issue #489 here, and I also opened nvaccess/nvda#8702 on September 1, 2018. In the last few weeks I was thinking about how to shrink the number of braille cells needed to display an emoji. I found a possible "solution" for it, which already included ⣑ (U+28D1, dots 1578) as an announcing/introducing prefix character. But I had to give up because I couldn't figure out a solution for 6-dot as well. I planned to use almost the same braille dots for 8-dot and for 6-dot, like ⣑⡤⣺, which should be equal to ⠿⠤⠳⠽. But in the end there were too many issues with "blank" cells when converting from 8-dot into 6-dot. So that's why the following solution is only designed for 8-dot and not for 6-dot.
[Update 2019-01-25 09:48 CET] See issue #689 for the 6-dot solution. [/Update]
After I had opened #685 here on January 12, 2019, I also asked the head of the Brailleschriftkomitee der Deutschsprachigen Länder how Unicode characters should be implemented in the German 8-dot braille table. Well, they had already known about this problem for many years, but never found a suitable solution until now. As he asked me for such a solution, I sent him a draft of the following solution via e-mail on January 24, 2019, at 06:43 CET (this morning). My brainstorming wasn't finished at that time. But a few hours later, and after even more research, I guess I have now found the final solution for displaying the first 196608 Unicode characters with only three braille characters. So here it is. And no, it's a completely different idea from the one described in #489.
Definition:
Prefix braille characters:
U+28E5, dots 13678; Defines the first 65536 Unicode characters.
The prefix character is a combination of the letters u and c.
U+28ED, dots 134678; Defines the second 65536 Unicode characters.
The prefix character is a combination of the letters u and c and the digit 1.
U+28FD, dots 1345678; Defines the third 65536 Unicode characters.
The prefix character is a combination of the letters u and c and the digit 2.
U+28F5, dots 135678; Defines the other 917504 Unicode characters.
The prefix character is a combination of the letters u, e and c.
Here, three braille characters (in addition to the prefix) are needed to define a Unicode character correctly.
U+30000 will be changed to U+030000 before converting into braille.
At the moment only 337 characters are defined in these blocks.
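The prefix selection defined above can be sketched in code: the prefix cell depends only on which 65536-character block a code point falls into, followed by two data cells (or three for code points beyond U+2FFFF). Note that this is a hypothetical illustration – the prefix choice follows the definitions above, but the dot assignment of the data cells shown here is a simple byte split invented for the sketch, not the actual HUC dot ordering:

```python
# Illustrative sketch of the HUC prefix selection described above.
# The prefix cell is real (per the definitions in this issue); the
# data-cell dot assignment below is an INVENTED byte split, used only
# to make the cell counts concrete.

PREFIXES = {
    0: "\u28E5",  # dots 13678:   U+0000..U+FFFF
    1: "\u28ED",  # dots 134678:  U+10000..U+1FFFF
    2: "\u28FD",  # dots 1345678: U+20000..U+2FFFF
}
PREFIX_REST = "\u28F5"  # dots 135678: U+30000..U+10FFFF (3 data cells)

def huc_cells(cp: int) -> str:
    """Return prefix + data cells for a code point (illustrative dot order)."""
    block, offset = cp >> 16, cp & 0xFFFF
    if block in PREFIXES:
        # first 196608 characters: prefix + two 8-dot data cells
        return (PREFIXES[block]
                + chr(0x2800 + (offset >> 8))
                + chr(0x2800 + (offset & 0xFF)))
    # remaining 917504 characters: prefix + three data cells
    # (the code point is padded to six hex digits first)
    return PREFIX_REST + "".join(chr(0x2800 + b) for b in cp.to_bytes(3, "big"))

assert huc_cells(0x1F600)[0] == "\u28ED"  # 😀 gets the second-block prefix ⣭
assert len(huc_cells(0x0041)) == 3        # BMP character: 3 cells in total
assert len(huc_cells(0x30000)) == 4       # beyond U+2FFFF: 4 cells in total
```

This makes the cell counts from the issue explicit: three cells for the first 196608 code points, four cells for everything above.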
Converting hexadecimal values into braille:
Combining hexadecimal values:
Examples:
But: the two-character string "00" = U+0030 U+0030 = '\x0030''\x0030' = ⣥⣺⣩⣥⣺⣩
Reducing it to ⣥⣺⣩⣺⣩ isn't allowed.
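The no-reduction rule above can be illustrated in a few lines: every character in a string keeps its own complete prefix sequence, so "00" always encodes to six cells, never five. The `encode_char` helper below is hypothetical and uses an invented byte-split dot assignment (the real HUC dot order differs); only the structure – one full prefix sequence per character – is the point:

```python
# Illustrative only: prefixes are never shared between adjacent
# characters. The dot assignment is a hypothetical byte split.

def encode_char(ch: str) -> str:
    cp = ord(ch)
    prefix = "\u28E5"  # ⣥, first 65536 characters (enough for this demo)
    return prefix + chr(0x2800 + (cp >> 8)) + chr(0x2800 + (cp & 0xFF))

def encode(text: str) -> str:
    # Concatenate complete per-character sequences; no prefix elision.
    return "".join(encode_char(ch) for ch in text)

assert len(encode("0")) == 3   # one full 3-cell sequence
assert len(encode("00")) == 6  # two full sequences – never reduced to 5
```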
Obsolete mappings: ⣥⡁⠅⠐ and ⣥⣆⢭⡂⢁ (These were just part of my brainstorming.)
😀 = U+1F600 = '\xd83d''\xde00' = ⣭⡤⣺

Technical solution:
Please read nvaccess/nvda#8702 first, where I already explained a solution for reducing the number of cells on a braille display for undefined characters. The two apostrophes, the backslash and the small x are now replaced with ⣥, ⣭, ⣽ or ⣵, but only for undefined characters – if the user wants this. As it is now possible to save 13 braille cells for one single undefined Unicode character like the emoji shown above, I suggest making this the default, but always with the option to show '\x0000' and so on, as this could still be helpful for others.
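The surrogate pair in the example above ('\xd83d''\xde00' for U+1F600) follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into a high and a low surrogate. A quick check:

```python
# UTF-16 surrogate pair computation for 😀 (U+1F600), matching the
# '\xd83d''\xde00' notation used in this issue.
cp = 0x1F600
v = cp - 0x10000                 # 20-bit offset into the supplementary planes
high = 0xD800 + (v >> 10)        # high surrogate (upper 10 bits)
low = 0xDC00 + (v & 0x3FF)       # low surrogate (lower 10 bits)

assert (high, low) == (0xD83D, 0xDE00)
# Python's own UTF-16 encoder agrees:
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```

This is why an undefined emoji costs 16 cells in '\xhhhh' notation (two surrogates at 8 cells each) but only 6 cells in HUC (two surrogates at 3 cells each) – hence the 13-cell saving once both halves are counted per character.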
Sadly, as I'm not so familiar with programming, I'm not able to suggest how this could be implemented in Liblouis or in other software like screen readers and braille printer software. As my solution is nothing more than a replacement, I guess the conversion shouldn't be such a big problem. Creating a huge braille table with all 1114112 Unicode characters as a second-priority table, which doesn't overwrite characters already defined in the primary braille table, makes absolutely no sense in my opinion.
Every thought and suggestion from the community is highly welcome. Maybe I have overlooked something.
Additional sources: