Compile liblouis with 32 bit widechars and add textUtils module to deal with py2/py3 unicode differences #9044

Closed. Wants to merge 15 commits into base: master.

leonardder commented Dec 10, 2018

Link to issue number:

Closes #9034
Closes #6695

Summary of the issue:

Liblouis currently uses a 2 byte encoding to process braille. This is pretty annoying when displaying emoji, as they are 32 bit unicode characters. For example, 😉 is usually printed as '\xd83d''\xde09'.
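
To illustrate where the two values in the example come from, here is a minimal Python 3 sketch that splits the winking face emoji (U+1F609) into its UTF-16 code units. The helper name `utf16_code_units` is made up for this illustration and is not part of NVDA or liblouis.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    # "surrogatepass" lets lone surrogates through instead of raising.
    raw = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


wink = "\U0001F609"  # the winking face emoji from the example above
units = utf16_code_units(wink)

# One code point in Python 3, but two 16 bit code units (a surrogate pair).
print(len(wink), [hex(unit) for unit in units])
```

With 16 bit wide characters, liblouis sees each of these two code units as a separate undefined character, hence the two replacement patterns.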

Description of how this pull request fixes the issue:

This pr does the following.

  1. Compiles liblouis with 32 bit wide characters instead of 16. This means only one replacement pattern is printed for 32 bit characters instead of two.
  2. Furthermore, it adds a textUtils module that deals with the differences between Python 2 and 3 unicode strings. In Python 2, unicode strings are internally stored with a two byte encoding. Therefore, 32 bit unicode characters take up two indexes or offsets in a string. In Python 3, one index/offset corresponds with one code point. Liblouis 2 byte wide characters played pretty nicely with Python 2 unicode strings, but with 32 bit wide characters, the rawToBraillePos and brailleToRawPos mappings no longer match, as liblouis reads 😉 as one character whereas Python 2 reads it as two. Python 3 would instantly fix this, but causes the same problem the other way around with several offset based TextInfos. For example, uniscribe uses 2 byte wide characters, and therefore 😉 is treated as two characters by uniscribe whereas Python 3 treats it as one.
    This is where textUtils.EncodingAwareString comes into view. This new subclass of unicode in Python 2/str in Python 3 keeps the decoded and encoded form of a string in one object. Indexing/slicing/length checking is based on the smallest width (the code unit size) of the encoding used. For example, a UTF-16 code unit is two bytes, so every index in the string corresponds with two bytes. UTF-32 uses four bytes for one character, so every index corresponds with four bytes in the string. This basically means that we have strings that behave like Python 3 strings on Python 2, and strings that behave like Python 2 strings on Python 3.
  3. Since #9034 describes a possible regression that would be caused by switching liblouis to UCS-4, this possible regression has been neutralized. When fetching character offsets in IA2 and the implementation returns a surrogate character, the implementation falls back to the OffsetsTextInfos character offset mechanism, which takes 32 bit unicode characters based on surrogates into account.
  4. The braille module is the first module to utilize the unicode_literals feature, i.e. all string literals without the u prefix are nevertheless treated as unicode strings. This makes the braille module mimic Python 3 behavior more closely.
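
The offset mismatch described in item 2 can be sketched as follows. This is not the textUtils implementation from this pr, only a rough Python 3 illustration of converting between code point offsets (what liblouis with 32 bit wide characters and Python 3 use) and UTF-16 code unit offsets (what uniscribe and Python 2 use). Both function names are hypothetical.

```python
def code_point_to_utf16_offset(text, offset):
    """Convert a code point offset in text to a UTF-16 code unit offset."""
    encoded = text[:offset].encode("utf-16-le", "surrogatepass")
    return len(encoded) // 2


def utf16_to_code_point_offset(text, offset):
    """Convert a UTF-16 code unit offset in text to a code point offset.

    An offset that falls inside a surrogate pair maps to the code point
    that the pair encodes.
    """
    units = 0
    for index, char in enumerate(text):
        # Characters outside the BMP need two UTF-16 code units.
        width = 2 if ord(char) > 0xFFFF else 1
        if units + width > offset:
            return index
        units += width
    return len(text)


text = "a\U0001F609b"  # "a", winking face emoji, "b"
# "b" sits at code point offset 2, but at UTF-16 offset 3.
print(code_point_to_utf16_offset(text, 2))
print(utf16_to_code_point_offset(text, 3))
```

A string class along the lines of item 2 would presumably do this kind of conversion internally, so that callers can index with whichever offset type their API uses.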

Testing performed:

Tested a try build with braille, routing worked well, no unexpected errors in the log so far.

Known issues with pull request:

No unit tests, and we really want them, I'm sure. However, I first want to make sure that we agree on the approach of the textUtils module.

Change log entry:

  • Bug fixes
    • Emoji and other 32 bit unicode characters now take less space on a braille display when they are undefined in a braille translation table. (#6695)
    • Cursor routing no longer moves the cursor one or more characters ahead when used on a line containing emoji. (#9034)

leonardder commented Dec 10, 2018

@feerrenrut @michaelDCurran @jcsteh: I requested review from the three of you, because especially with the textUtils approach, I really want to take the right step. Note that I'm still open for feedback or massive changes. I just wanted to deliver something that actually seems to demonstrate the idea behind the module pretty well.


leonardder commented Jan 21, 2019

I'm going to close this for now:

  1. This prototype implementation aims at both python 2 and 3 compatibility. Since the official Python 3 transition work will be started not long after today, it makes much more sense to couple the UTF32 switch to the Python 3 switch.
  2. The textUtils module won't be very useful within the scope of this pr when the pr is targeted at python 3 only. It will be very useful with regard to textInfos though. Therefore, it makes sense to file a separate request for textUtils on its own, or provide it as part of a textInfos py3 compatibility pr.