Simplified 6-dot worldwide unified mapping for the hexadecimal value of Unicode characters #689

Open
DrSooom opened this issue Jan 25, 2019 · 14 comments


@DrSooom DrSooom commented Jan 25, 2019


Introduction:

Yesterday I opened issue #688, where I described how to display the hexadecimal value of Unicode characters with just three or four 8-dot braille characters. After that I couldn't stop thinking about a suitable way to do the same for 6-dot. And well, I guess I have now solved the puzzle. The following solution for 6-dot is just a little more complex than the one for 8-dot, but the syntax of both is still quite easy to learn and to understand.
And to clarify it right at the beginning: The following 6-dot solution will not replace or influence any existing 6-dot braille table, unless the prefix braille character ⠿ (U+283F, dots 123456) is already used for another character. Sadly there are hundreds of rules for 6-dot braille tables around the globe, so finding a braille character that works the same way everywhere isn't easy. I have chosen ⠿ for one specific reason: On paper it marks "deleted" characters when the wrong dots can't be pressed back into the paper.
And please don't forget: My solution here is nothing more than an alternative way to display hexadecimal values in 6-dot braille. It doesn't redefine any Unicode character; it just converts its hexadecimal value into the shortest possible form in 6-dot.

Definition:

Prefix and suffix braille characters:

  • The prefix character must stand in front of every single Unicode character. To avoid confusion, grouping two or more Unicode characters behind a single prefix isn't allowed.
  • The very last hexadecimal digit always occupies dots 1245 of the fourth braille character (counting the prefix as the first). Dots 3 and 6 of that same braille character form the suffix. In other words: The fourth braille character is always a combination of a hexadecimal digit and the suffix.
  • The two dashes ("-") below, between the prefix and the suffix, are placeholders for the other three hexadecimal digits. The fourth-last digit occupies dots 1245 of the first placeholder character. The third-last digit is split between the two placeholder characters: its upper row (dots 1 and 4) moves to dots 36 of the first one and its lower row (dots 2 and 5) moves to dots 14 of the second one. The second-last digit occupies dots 2356 of the second placeholder character. The last two hexadecimal digits of a Unicode character are the most important ones. That's why they aren't split between two 6-dot braille characters.
  • And to avoid misunderstanding, ⠀ (U+2800, dot 0) isn't allowed as a suffix.
  • ⠿--⠄ = characters between U+0000 and U+FFFF
    U+283F and U+2804, dots 123456 and 3; Defines the first 65536 Unicode characters.
  • ⠿--⠠ = characters between U+10000 and U+1FFFF
    U+283F and U+2820, dots 123456 and 6; Defines the second 65536 Unicode characters.
  • ⠿--⠤⠇ = characters between U+20000 and U+2FFFF
    U+283F, U+2824 and U+2807, dots 123456, 36 and 123; Defines the third 65536 Unicode characters.
    From here on, four braille characters are needed after the prefix to define a Unicode character correctly.
    At the moment 60859 characters are defined in the block U+2xxxx and 337 more in the blocks higher than U+30000.
    The first hexadecimal digit of U+2xxxx must follow the braille character that carries the suffix, in dots 1245 of an additional braille character, to define the correct Unicode block. For the Unicode characters from U+100000 to U+10FFFF, the braille character ⠥ (U+2825, dots 136) is used to define this Unicode block.
    Here is the full list from U+20000 to U+10FFFF:
    • ⠿--⠤⠇ = characters between U+20000 and U+2FFFF
    • ⠿--⠤⠍ = characters between U+30000 and U+3FFFF
    • ⠿--⠤⠝ = characters between U+40000 and U+4FFFF
    • ⠿--⠤⠕ = characters between U+50000 and U+5FFFF
    • ⠿--⠤⠏ = characters between U+60000 and U+6FFFF
    • ⠿--⠤⠟ = characters between U+70000 and U+7FFFF
    • ⠿--⠤⠗ = characters between U+80000 and U+8FFFF
    • ⠿--⠤⠎ = characters between U+90000 and U+9FFFF
    • ⠿--⠤⠌ = characters between U+A0000 and U+AFFFF
    • ⠿--⠤⠜ = characters between U+B0000 and U+BFFFF
    • ⠿--⠤⠖ = characters between U+C0000 and U+CFFFF
    • ⠿--⠤⠆ = characters between U+D0000 and U+DFFFF
    • ⠿--⠤⠔ = characters between U+E0000 and U+EFFFF
    • ⠿--⠤⠄ = characters between U+F0000 and U+FFFFF
    • ⠿--⠤⠥ = characters between U+100000 and U+10FFFF
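
Editor's sketch (not part of the original proposal): the suffix rules above can be written as a small lookup, assuming Python. The names `PLANE_CELLS` and `suffix_for` are hypothetical; the plane cells are copied from the list above, and the function only returns the suffix dots and the additional cell, not the full braille output.

```python
# Plane cells copied from the list above, for U+2xxxx (plane 2) up to U+10xxxx (plane 16).
PLANE_CELLS = dict(zip(range(2, 0x11), "⠇⠍⠝⠕⠏⠟⠗⠎⠌⠜⠖⠆⠔⠄⠥"))


def suffix_for(code_point):
    """Return (suffix dots, additional plane cell) for a Unicode code point.

    The suffix dots are added to the cell that holds the last hexadecimal
    digit; the plane cell is an empty string below U+20000.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    plane = code_point >> 16          # 0 for U+0000..U+FFFF, 1 for U+1xxxx, ...
    if plane == 0:
        return (3,), ""               # suffix dot 3
    if plane == 1:
        return (6,), ""               # suffix dot 6
    return (3, 6), PLANE_CELLS[plane]  # suffix dots 36 plus one more cell
```

For example, `suffix_for(0x266F)` gives dot 3 and no extra cell, while `suffix_for(0x20000)` gives dots 36 followed by ⠇.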

Converting hexadecimal values into braille:

  • 0 = ⠚, 1 = ⠁, 2 = ⠃, 3 = ⠉, 4 = ⠙, 5 = ⠑, 6 = ⠋, 7 = ⠛
  • 8 = ⠓, 9 = ⠊, A = ⠈, B = ⠘, C = ⠒, D = ⠂, E = ⠐, F = ⠀
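
Editor's sketch: the digit table above expressed as dot numbers, assuming Python. `HEX_DOTS`, `dots_to_cell` and `hex_digit_cell` are hypothetical names. The only fact used beyond the table is that the Unicode braille block starts at U+2800 and stores dot n in bit n-1.

```python
# Dots (within the area 1245) for each hexadecimal digit, as listed above.
HEX_DOTS = {
    "0": (2, 4, 5), "1": (1,), "2": (1, 2), "3": (1, 4),
    "4": (1, 4, 5), "5": (1, 5), "6": (1, 2, 4), "7": (1, 2, 4, 5),
    "8": (1, 2, 5), "9": (2, 4), "A": (4,), "B": (4, 5),
    "C": (2, 5), "D": (2,), "E": (5,), "F": (),
}


def dots_to_cell(dots):
    """Return the Unicode braille pattern for a collection of dot numbers."""
    code = 0x2800
    for dot in dots:
        code |= 1 << (dot - 1)   # dot 1 is bit 0, dot 2 is bit 1, ...
    return chr(code)


def hex_digit_cell(digit):
    """Return the braille cell for a single hexadecimal digit."""
    return dots_to_cell(HEX_DOTS[digit.upper()])
```

The digits 1 to 9 and 0 reuse the dot patterns of the letters a to j, so only A to F need to be learned separately.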

Combining hexadecimal values:

  • 0000 = ⠺⠽⠚, 0001 = ⠺⠽⠁, 0010 = ⠺⠋⠚, 0100 = ⠞⠴⠚
  • 1000 = ⠡⠽⠚, FFEF = ⠀⠠⠀, FFFE = ⠀⠀⠐, FFFF = ⠀⠀⠀
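
Editor's sketch: how four hexadecimal digits end up in three cells, assuming Python. The function name `combine` and the helper tables are hypothetical; the dot movements were derived from the combinations listed above (without prefix and suffix).

```python
HEX_DOTS = {
    "0": (2, 4, 5), "1": (1,), "2": (1, 2), "3": (1, 4),
    "4": (1, 4, 5), "5": (1, 5), "6": (1, 2, 4), "7": (1, 2, 4, 5),
    "8": (1, 2, 5), "9": (2, 4), "A": (4,), "B": (4, 5),
    "C": (2, 5), "D": (2,), "E": (5,), "F": (),
}


def dots_to_cell(dots):
    """Return the Unicode braille pattern for a collection of dot numbers."""
    return chr(0x2800 + sum(1 << (dot - 1) for dot in set(dots)))


def combine(hex4):
    """Pack four hexadecimal digits into three cells (no prefix, no suffix)."""
    fourth_last, third_last, second_last, last = (HEX_DOTS[c] for c in hex4.upper())
    # Fourth-last digit: unchanged, dots 1245 of the first cell.
    cell1 = set(fourth_last)
    cell2 = set()
    # Third-last digit: upper row (dots 1, 4) goes to dots 3, 6 of the first
    # cell, lower row (dots 2, 5) goes to dots 1, 4 of the second cell.
    for dot in third_last:
        if dot in (1, 4):
            cell1.add({1: 3, 4: 6}[dot])
        else:
            cell2.add({2: 1, 5: 4}[dot])
    # Second-last digit: moved one row down into dots 2356 of the second cell.
    cell2.update({1: 2, 2: 3, 4: 5, 5: 6}[dot] for dot in second_last)
    # Last digit: unchanged, dots 1245 of the third cell.
    cell3 = set(last)
    return "".join(dots_to_cell(c) for c in (cell1, cell2, cell3))
```

With this, `combine("0010")` yields ⠺⠋⠚ and `combine("0100")` yields ⠞⠴⠚, matching the list above.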

Examples:

  • Digit Zero = 0 = U+0030 = '\x0030' = ⠿⠺⠛⠞
    But: Two Digit Zero = 00 = U+0030U+0030 = '\x0030''\x0030' = ⠿⠺⠛⠞⠿⠺⠛⠞
    Reducing it to ⠿⠺⠛⠞⠺⠛⠞ isn't allowed.
  • Music Sharp Sign = ♯ = U+266F = '\x266f' = ⠿⠧⠗⠄
  • Braille Pattern Dots-12 = ⠃ = U+2803 = '\x2803' = ⠿⠇⠽⠍
  • Musical Symbol G Clef = 𝄞 = U+1D11E = '\xd834''\xdd1e' = ⠿⠆⠂⠰
  • Grinning Face = 😀 = U+1F600 = '\xd83d''\xde00' = ⠿⠤⠵⠺
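
Editor's sketch: a compact encoder that reproduces the examples above, assuming Python. The names `huc6` and `huc6_text` are hypothetical and this is not the implementation behind the published tables; it only follows the rules as described in this issue, which the author later calls obsolete in favour of the HUC website.

```python
PREFIX = "\u283f"  # ⠿, dots 123456

HEX_DOTS = {
    "0": (2, 4, 5), "1": (1,), "2": (1, 2), "3": (1, 4),
    "4": (1, 4, 5), "5": (1, 5), "6": (1, 2, 4), "7": (1, 2, 4, 5),
    "8": (1, 2, 5), "9": (2, 4), "A": (4,), "B": (4, 5),
    "C": (2, 5), "D": (2,), "E": (5,), "F": (),
}


def cell(dots):
    """Return the Unicode braille pattern for a collection of dot numbers."""
    return chr(0x2800 + sum(1 << (d - 1) for d in set(dots)))


def huc6(code_point):
    """Encode one code point as prefix + three cells (+ one plane cell)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    plane, low = divmod(code_point, 0x10000)
    a, b, c, d = (HEX_DOTS[h] for h in f"{low:04X}")
    upper = {1: 3, 4: 6}             # upper row of b -> dots 36 of cell 1
    lower = {2: 1, 5: 4}             # lower row of b -> dots 14 of cell 2
    down = {1: 2, 2: 3, 4: 5, 5: 6}  # c moves into dots 2356 of cell 2
    cell1 = list(a) + [upper[x] for x in b if x in upper]
    cell2 = [lower[x] for x in b if x in lower] + [down[x] for x in c]
    if plane == 0:
        suffix, extra = [3], ""
    elif plane == 1:
        suffix, extra = [6], ""
    elif plane == 0x10:
        suffix, extra = [3, 6], cell([1, 3, 6])
    else:
        # Plane digit in dots 1245 plus dot 3, as in the list of plane cells.
        suffix, extra = [3, 6], cell(list(HEX_DOTS[f"{plane:X}"]) + [3])
    cell3 = list(d) + suffix
    return PREFIX + cell(cell1) + cell(cell2) + cell(cell3) + extra


def huc6_text(text):
    """Encode every character separately; each one gets its own prefix."""
    return "".join(huc6(ord(ch)) for ch in text)
```

`huc6_text("00")` repeats the prefix for the second digit, which is the behaviour required in the first example.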

Technical solution:

Please read the same section in issue #688.
Every thought and suggestion from the community is highly welcome. Maybe I have overlooked something here too.

Additional sources:


@DrSooom DrSooom commented Jan 28, 2019

I created a test file for 6-dot and for 8-dot to demonstrate what my idea would look like with emoticons (U+1F600 to U+1F64F, UTF-16 encoding). You can include these text files in another table and test them with NVDA.
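
Editor's sketch: since the test files use UTF-16, each emoticon above U+FFFF is written as a surrogate pair, as in the examples of the issue description ('\xd83d''\xde00' for U+1F600). A minimal Python illustration of that arithmetic follows; the function name `utf16_units` is hypothetical and the layout of the actual test files is not shown here.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units for a code point as a tuple of ints."""
    if code_point < 0x10000:
        return (code_point,)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return (high, low)


# Surrogate pairs for the emoticon range mentioned above.
emoticon_pairs = {cp: utf16_units(cp) for cp in range(0x1F600, 0x1F650)}
```

All 80 code points in this range share the high surrogate 0xD83D; only the low surrogate changes.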


@DrSooom DrSooom commented Feb 7, 2019

I'm going to create all 8- and 6-dot tables for UTF-16 and UTF-32. I will add a new comment here, after I've finished everything.


@DrSooom DrSooom commented Mar 22, 2019

Currently 25 % of the first HUC6 Braille Table is finished. I guess I will still need approximately four more weeks to finalize them all.


@DrSooom DrSooom commented Apr 6, 2019

Currently 50 % of the first HUC6 Braille Table is finished. I'm planning to release the finalized HUC6 Braille Tables on May 1, 2019.


@DrSooom DrSooom commented May 1, 2019

Today at 01:00 (CEST) the HUC6 Braille Tables were released and here is the official announcement. The offline version of the documentation will be released after I have fully translated the FAQ section into German. This will take a while, because I need a break now.
@egli and @bertfrees: As I now have to prepare for SightCity 2019, a PR will follow after that event. But feel free to add the full content of both 7z archives (HUC8 and HUC6 Braille Tables) to the "tables" folder right now, if you don't want to wait. SHA-256 checksums are available here (1.77 KB, txt) and here (1.77 KB, txt).


@DrSooom DrSooom commented May 1, 2019

As I couldn't wait, I recently wrote down the text for the PR description. Therefore I'm going to try to open one for them tomorrow. Then it's done.

@bertfrees bertfrees added this to the 3.10 milestone May 16, 2019
@egli egli self-assigned this May 21, 2019

@egli egli commented May 28, 2019

Hi @DrSooom I finally had time to look at this.

If I understand correctly this is a clever scheme to pack a unicode codepoint into the 2^6 bits available in a 6-dot braille cell. It shortens the braille display of a unicode codepoint from 6 cells to 4 cells.

While I think the scheme is quite ingenious I have some issues with it:

  1. it is not a standard. You seem to be addressing that with your HUC website
  2. I'm not convinced that we should even present unicode codepoints to the user. I guess we do that because we have no braille mapping for those codepoints. But is that really in the user's best interest? A more clever compression scheme doesn't solve that fundamental problem.
  3. Your proposed mapping is very regular. It could easily be implemented in a few dozen lines of code. If we ship this functionality in liblouis then it should be shipped as code, not in the form of megabytes of liblouis tables.

So in short I think this is an interesting proposal, but I think we need to have some more discussion before we can bring this into a release.

@egli egli removed this from the 3.10 milestone May 28, 2019
@egli egli removed their assignment May 28, 2019

@DrSooom DrSooom commented May 28, 2019

If I understand correctly this is a clever scheme to pack a unicode codepoint into the 2^6 bits available in a 6-dot braille cell. It shortens the braille display of a unicode codepoint from 6 cells to 4 cells.

That's not fully correct. It is better to read the definition for 8- and for 6-dot at the HUC Braille Tables website instead of the issue descriptions in issues #688 and #689, as they have been obsolete since March 2019.

And regarding the other points:

  1. Yes, that's correct. But on the other hand: Who would be the right person/organization/consortium to declare this as a standard? The Unicode Consortium "only" assigns and allocates Unicode characters to code points. It would be news to me if they were responsible for braille as well. But I have already contacted them too.
  2. Please read the section Definition for 8- and 6-dot in plain language. There are limits on how many characters can be shown in braille. Maybe you should read the whole section Usage as well. And never forget: It can be used by the end user, but it doesn't have to be. In the end the end user must always have the option to decide how undefined Unicode characters should be displayed on a braille display or on paper. See also nvaccess/nvda#8702 for further details, as this issue describes the goal.
  3. Please read the section "Technical solution" in issue #688. A better solution is of course welcome.

@bertfrees bertfrees commented May 28, 2019

Your proposed mapping is very regular. It could easily be implemented in a few dozen lines of code. If we ship this functionality in liblouis then it should be shipped as code, not in the form of megabytes of liblouis tables.

I don't fully agree. The nice thing about it being implemented in a table is that... well, it doesn't need to be implemented in C code. If we implemented it in C code we would need a new opcode for it. But firstly, we don't usually add an opcode for something so specific. There are some UEB specific opcodes, but it would be the first time such complex behavior is contained in a single opcode. And secondly, we don't usually add an opcode for something so obscure. A feature should at least be somewhat standard, or have gained some popularity before we can dedicate an opcode to it.

This is why I thought a table was a nice compromise. Anyone can download it from the HUC website and use it, with a standard version of Liblouis. And we could even include it in the default Liblouis distribution to help this thing get picked up by people. The only thing I didn't think about was that it would become such a huge file.

I haven't looked very much into the HUC specification, but maybe it is possible to implement it with some advanced Liblouis rules and thereby greatly reduce the file size. Or maybe the table format needs to become more powerful in order to support it, I don't know. In my view, the ultimate goal of the Liblouis table format is to be able to do these kinds of things.


@DrSooom DrSooom commented May 28, 2019

Please also read question No. ⡡. I have no problem with it if only the first three tbi files are included into the two tbl files by default (due to possible performance issues; I couldn't test this). But if this is the case, the end user must be informed about this. UTF-16 end applications aren't affected by this anyway.

@bertfrees: I already wrote on January 31, 2019 – before I really started the creation process – the following as a comment in issue #688:

And no, I'm not going to add the naming of a Unicode character in my tables too, because they are already big enough (~1.6 MiB for the first 65536 characters in the 8-dot table; the 6-dot one is even bigger).


@aaclause aaclause commented May 29, 2019

Hi,
I agree with #689 (comment).
Indeed, I think the best solution would be to improve the ’undefined’ opcode by implementing this method. Thus, a simple ’undefined [HUC|hex|<dots pattern>]’ rule in any table would suffice.
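
Editor's sketch: one way to picture the three modes of such a rule, written in Python rather than as liblouis code. Everything here is hypothetical, including the function name `render_undefined`, the mode strings and the `huc_encoder` parameter; it does not describe an existing liblouis API. The hex form follows the '\x0030' notation used in the issue description and ignores UTF-16 surrogate splitting.

```python
def render_undefined(code_point, mode, huc_encoder=None):
    """Render an undefined character according to a hypothetical mode.

    mode "hex": hexadecimal notation such as '\\x0030'
    mode "HUC": delegate to an encoder function supplied by the caller
    otherwise:  treat mode as dot numbers of one fixed cell, e.g. "123456"
    """
    if mode == "hex":
        return f"'\\x{code_point:04x}'"
    if mode == "HUC":
        if huc_encoder is None:
            raise ValueError("HUC mode needs an encoder function")
        return huc_encoder(code_point)
    # Fixed replacement cell: dot n sets bit n-1 above U+2800.
    return chr(0x2800 + sum(1 << (int(d) - 1) for d in set(mode)))
```

Calling `render_undefined(0x30, "123456")` would show the same full cell for every undefined character, which corresponds to the simplest of the options discussed in nvaccess/nvda#8702.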

However in my opinion, it’s a very interesting proposal, congrats @DrSooom for the idea!


@bertfrees bertfrees commented May 29, 2019

I already wrote on January 31, 2019

I guess I missed that. Apologies!


@DrSooom DrSooom commented May 30, 2019

@bertfrees: You don't have to apologize.

@egli and @Andre9642: I was thinking about this as well, and I agree with you that adding such functionality directly in code would ultimately be the best way to realize the HUC Braille Tables. But with the creation of the 38 tables I have already reached my primary goal: to make them all accessible and usable by everybody right now and free of charge. Therefore now is the best time to think about how to get there in a suitable way.

And please, never forget nvaccess/nvda#8702. If you (I don't have the skills to do this) are going to put the HUC Braille Tables into, let's say, a 100 KiB file, you also have to think about the previously mentioned issue. In my opinion that functionality shouldn't be limited to NVDA itself. Everybody must have the choice of how undefined Unicode characters are displayed. In that issue I wrote down eight possible options (incl. the current braille table behavior), which should cover everybody's needs. Nerds like me want to be able to read and recognize every Unicode character on the braille display, while other users only want to read one special braille character for all undefined Unicode characters, because the hexadecimal value would be overkill for them. It is not up to us to decide for them; the end user must be able to choose whatever he wants. Therefore all eight possible options must be offered to them.

So here is my idea of how the conversion process should work (in my head):

  1. Figure out if the Unicode character already exists in the currently used braille table.
    • If so, proceed with the current conversion process.
    • If not, follow the new conversion process below.
  2. Read out the code point of that undefined Unicode character and save its hexadecimal value in a variable.
  3. Now split that variable as follows:
    • HUC8: abcd » ab and cd; 0001abcd » 0001, ab and cd
    • HUC6: abcd » abc and d; 0001abcd » 0001, abc and d
  4. Convert the split variables into their corresponding dots as follows:
    • HUC8: 4 prefix, 256 main and 14 additional (U+30000 to U+10FFFF)
      Note: The main part exists twice, once for ab and once for cd.
    • HUC6: 1 prefix, 4096 main, 48 main-suffix and 15 additional (U+20000 to U+10FFFF)
      Note: As the middle hexadecimal value is split between two braille characters, 4096 definitions are required. Maybe there is a better solution for that, but then the whole thing would get more complex and harder to understand.
  5. The converted dots have to be stored in new variables.
  6. Those new dots variables now have to be merged together in the correct order.
    • HUC8: prefix, additional (if it exists) and main (two times)
    • HUC6: prefix, main, main-suffix and additional (if it exists)
  7. After that merged variable has been stored, it can be sent to the end application.
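
Editor's sketch: the lookup-then-fallback and splitting steps of the process above, assuming Python. `split_hex` and `translate_char` are hypothetical names, the braille table is modelled as a plain dictionary, and the conversion of the parts into dots is left to a function passed in by the caller.

```python
def split_hex(code_point, scheme):
    """Split a code point's hexadecimal value as described for HUC8 and HUC6.

    abcd     -> ["AB", "CD"]         (HUC8) or ["ABC", "D"]         (HUC6)
    0001abcd -> ["0001", "AB", "CD"] (HUC8) or ["0001", "ABC", "D"] (HUC6)
    """
    digits = f"{code_point:08X}"
    high, low = digits[:4], digits[4:]
    if scheme == "HUC8":
        parts = [low[:2], low[2:]]
    elif scheme == "HUC6":
        parts = [low[:3], low[3:]]
    else:
        raise ValueError("scheme must be 'HUC8' or 'HUC6'")
    if high != "0000":
        parts.insert(0, high)   # keep the upper 16 bits as a separate part
    return parts


def translate_char(ch, table, encode_undefined):
    """Use the braille table if it defines ch, otherwise the fallback encoder."""
    if ch in table:
        return table[ch]                 # existing conversion process
    return encode_undefined(ord(ch))     # new conversion process
```

Computing the cells this way trades the memory needed for the table files against a few extra operations per undefined character.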

So far, so clear in my head. Now it's your turn again. But please note that this conversion process will use more CPU than the 38 files. On the other hand, those 38 files require more RAM if you really want to include all 17 tbi files in the two UTF-32 tbl files, which isn't really necessary yet in my opinion.

And finally, what do you think about adding only the UTF-16 variants of the HUC Braille Tables and the complete packages as 7z archives directly to the "tables" folder right now, as @bertfrees already suggested here in PR #730? The developers should be able to get them to work, if they want. Then we wouldn't have such a storage issue for UTF-32 end applications at the moment. Shall I open a new PR for this?


@egli egli commented Aug 14, 2019

As mentioned in #730, the tables should either be distributed independently (outside of liblouis) or a new mode should be implemented that emits HUC Braille for unknown Unicode characters.
