Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiling Liblouis with 32-bit Unicode support #6695

Closed
Andre9642 opened this issue Jan 4, 2017 · 28 comments
Assignees
Labels
p2
Milestone

Comments

@Andre9642
Copy link
Contributor

@Andre9642 Andre9642 commented Jan 4, 2017

In my opinion, the Liblouis DLL included in NVDA should be compiled with 32-bit Unicode.
Currently, we improve the French table and we'd like to add any 32-bit Unicode such as \y1D4D0, \y1D49C, \y1D4D2, \y1D49E, etc.
What do you think?

@josephsl

This comment has been minimized.

Copy link
Collaborator

@josephsl josephsl commented Jan 4, 2017

@Andre9642

This comment has been minimized.

Copy link
Contributor Author

@Andre9642 Andre9642 commented Jan 4, 2017

@josephsl yes, it is possible. See https://github.com/liblouis/liblouis/blob/master/README.windows

Edit the file configure.mk If you want 32-bit unicode change the 2 in the line UCS=2 to a 4.

@josephsl

This comment has been minimized.

Copy link
Collaborator

@josephsl josephsl commented Jan 4, 2017

@dkager

This comment has been minimized.

Copy link
Collaborator

@dkager dkager commented Jan 4, 2017

AFAIK that needs an update to the Python wrapper similar to what is done in Java.
Another use case for this feature is support for emoji, e.g. U+1F170.

@jcsteh

This comment has been minimized.

Copy link
Contributor

@jcsteh jcsteh commented Jan 4, 2017

  1. We don't build using the liblouis build stuff for Windows; it's integrated into NVDA's own build system. However, it's easy enough to switch that to 32 bit Unicode.
  2. As @dkager notes, the bigger issue is that the Python bindings are currently tied to the OS Python Unicode width, which is 16 bits on Windows. We'd need to somehow make this optional in the upstream Python bindings based on how liblouis was compiled. That's going to be a bit tricky because on Windows, Python Unicode still uses 16 bits, so the bindings will have to figure out when and what to convert. We also need to make sure it doesn't break Unix. :)
@jcsteh

This comment has been minimized.

Copy link
Contributor

@jcsteh jcsteh commented Jan 4, 2017

@Andre9642, how important is this for French? Does the table fail to work with NVDA if you include these characters or do they just get ignored? Are these characters frequently used in French?

@Andre9642

This comment has been minimized.

Copy link
Contributor Author

@Andre9642 Andre9642 commented Jan 5, 2017

@jcsteh:

Yes, the table fail to work with NVDA if I include these characters. Moreover Lou_checktable indicate that the table is invalid:

<table>:<line>: error: liblouis has not been compiled for 32-bit Unicode

No, these characters aren't frequently used in French. They are used in mathematics. In fact, our additions relate in part to Unicode Math Symbols. Extract:

uplow \x0391\x03b1 46-45-1,45-1 Α,α alpha
uplow \x0392\x03b2 46-45-12,45-12 Β,β beta
uplow \x0393\x03b3 46-45-1245,45-1245 Γ,γ gamma
...
#sign \y1D4D2 46-5-14 # MATHEMATICAL BOLD SCRIPT CAPITAL C
...
#sign \y1D49F 46-5-145 # MATHEMATICAL SCRIPT CAPITAL D
...

@jcsteh jcsteh added the p2 label Jan 5, 2017
@jcsteh

This comment has been minimized.

Copy link
Contributor

@jcsteh jcsteh commented Jan 5, 2017

P2 because this prevents tables from even working in NVDA if they contain 32 bit characters.

@bhavyashah

This comment has been minimized.

Copy link

@bhavyashah bhavyashah commented Aug 6, 2017

Just a gentle ping to bring the attention of NVDA code contributors to this P2 issue in case anyone desires to work on this.

@dkager

This comment has been minimized.

Copy link
Collaborator

@dkager dkager commented Aug 6, 2017

As noted in #6695 (comment) this needs some upstream work first. (In general, I'm not too fond of how liblouis deals with Unicode.)

@josephsl

This comment has been minimized.

Copy link
Collaborator

@josephsl josephsl commented Aug 6, 2017

@jcsteh

This comment has been minimized.

Copy link
Contributor

@jcsteh jcsteh commented Aug 6, 2017

@josephsl commented on 7 Aug 2017, 03:31 GMT+10:

Hi, we need this for Python 3 (#7105(.

No, we don't. Python 3 Unicode does handle 32 bit characters (actually, internally, i believe it's variable width now). However, Windows is still UTF-16 and thus ctypes still treats Unicode in C functions as UTF-16, since we're running on Windows. So, this is not required for Python 3.

@jcsteh

This comment has been minimized.

Copy link
Contributor

@jcsteh jcsteh commented Aug 6, 2017

For anyone that does want to try to implement this into the liblouis Python bindings, here are some rough thoughts:

  1. The Python bindings need to know the number of bits used for widechar (16 or 32) by the compiled liblouis. That'll need to be an autotools macro expansion, just as is done for ###LIBLOUIS_SONAME###. (NVDA's build system will also need to handle this expansion as it does for ###LIBLOUIS_SONAME###.)
  2. This determines the encoding to use when passing strings to and from liblouis. If it's 16, we must use UTF-16. If it's 32, we must use UTF-32.
  3. It's probably simplest to tell ctypes to use c_char_p instead of c_wchar_p so that we get raw bytes.
  4. Wherever ctypes.sizeof(ctypes.c_wchar) is used, instead use the number of bits from 1) divided by 8 for bytes.
  5. When passing unicode strings to liblouis, first encode them using the encoding from 2); e.g. inbuf.encode("UTF-32")
  6. After obtaining unicode strings from liblouis, decode them using the encoding from 2) before returning them; e.g. outbuf.value.decode("UTF-32")

Note that this proposed implementation eliminates the need to consider the ctypes unicode width altogether. However, it may be slightly less optimal when the ctypes unicode width does match the liblouis unicode width because it always uses a Python byte string as an intermediary, where perhaps ctypes is more efficient internally; I'm not sure. You could get around this by checking the ctypes unicode width with ctypes.sizeof(ctypes.c_wchar) and only using the proposed new behaviour if it differs from the liblouis width. However, this will make the code somewhat more complicated (especially with regard to testing on different configurations) and is probably premature optimisation unless a significant speed difference can be proved.

@dkager

This comment has been minimized.

Copy link
Collaborator

@dkager dkager commented Aug 7, 2017

There is also a function to get the widechar size, though that would require a library call from the wrapper:

/**
 * Return the size of widechar
 */
LIBLOUIS_API
int EXPORT_CALL lou_charSize(void);

A much more ambitious solution would be to make liblouis work with UTF-16 as opposed to UCS-2, i.e. add support for surrogate characters.

@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Aug 30, 2018

I'm going to give this a try.

@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Aug 30, 2018

@jcsteh commented on 7 aug. 2017 01:04 CEST:

  1. The Python bindings need to know the number of bits used for widechar (16 or 32) by the compiled liblouis. That'll need to be an autotools macro expansion, just as is done for ###LIBLOUIS_SONAME###. (NVDA's build system will also need to handle this expansion as it does for ###LIBLOUIS_SONAME###.)

Shouldn't we get rid of autotools macro expansions in python bindings if possible?

@jcsteh

This comment has been minimized.

Copy link
Contributor

@jcsteh jcsteh commented Aug 30, 2018

@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Aug 31, 2018

@jcsteh commented on 30 aug. 2018 18:58 CEST:

One way you could work around that is to
have some liblouis function which returns this information, but I'm not
sure whether that'd be acceptable to the maintainers.

There is such a function as @dkager pointed out in #6695 (comment)

@jcsteh commented on 7 aug. 2017 01:04 CEST:

  1. This determines the encoding to use when passing strings to and from liblouis. If it's 16, we must use UTF-16. If it's 32, we must use UTF-32.

For reference, we should explicitly use the little endian versions of utf-16 and utf-32.

  1. After obtaining unicode strings from liblouis, decode them using the encoding from 2) before returning them; e.g. outbuf.value.decode("UTF-32")

Now, that fails, since for c_char_p, the value is truncated as soon as a null character is encountered. Even when we would use c_wchar_p on Windows, the value will be truncated for cases like 'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'

@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Aug 31, 2018

Just filed liblouis/liblouis#633. I did not yet bother to create a try build due to several issues, including liblouis 3.7 refusing to build when building with NVDA's build system. Feedback is welcome though.

@dkager

This comment has been minimized.

Copy link
Collaborator

@dkager dkager commented Aug 31, 2018

Does their own NMake-based build still work? That seems to break every other release too.

@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Sep 4, 2018

@jcsteh commented on 7 aug. 2017 00:25 CEST:

@josephsl commented on 7 Aug 2017, 03:31 GMT+10:

Hi, we need this for Python 3 (#7105(.

No, we don't.

Actually, I'm afraid @josephsl has a point.

>>> import louis
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\Development\NVDA\source repo\source\louis\__init__.py", line 358, in <module>
    raise Exception("mismatch in Unicode width: liblouis is 2 and python isn't")
Exception: mismatch in Unicode width: liblouis is 2 and python isn't
@jcsteh

This comment has been minimized.

Copy link
Contributor

@jcsteh jcsteh commented Sep 4, 2018

@leonardder commented on Sep 4, 2018, 6:44 PM GMT+10:

  File "d:\Development\NVDA\source repo\source\louis\__init__.py", line 358, in <module>
    raise Exception("mismatch in Unicode width: liblouis is 2 and python isn't")

Sure, but that's because the bindings don't yet support both bit widths (regardless of Python bit width). Once that's fixed, this check can be removed from the bindings. That should be done as part of liblouis/liblouis#633.

@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Sep 4, 2018

That's already covered in liblouis/liblouis#633 indeed.

@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Dec 4, 2018

Ugh, there's one major thing I didn't think about yet.
When compiling liblouis with UCS-4, cursor position breaks in a major way. This is because NVDA provides the cursor position based on the internal python 2 unicode representation, which has two offsets for characters involving surrogates in UTF-16, such as 😊. A switch to Python 3 would automatically fix this I believe, since Python 3 unicode strings behave similar to 4 byte widechar strings (i.e. one 4 byte liblouis widechar corresponds with one offset in a python 3 string).
We could work around this by working with an internal UTF-32 representation of the text, but that feels a bit hacky,

@zstanecic

This comment has been minimized.

Copy link
Contributor

@zstanecic zstanecic commented Dec 4, 2018

@leonardder leonardder self-assigned this Dec 8, 2018
@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Dec 8, 2018

I've found a way to do this anyway. A pull request will be filed shortly.

@Adriani90

This comment has been minimized.

Copy link
Collaborator

@Adriani90 Adriani90 commented Jun 18, 2019

@leonardder Should this issue be closed now that #9544 has been merged?

@leonardder leonardder added this to To do in Update NVDA to Python 3 via automation Jun 28, 2019
@leonardder

This comment has been minimized.

Copy link
Collaborator

@leonardder leonardder commented Jun 28, 2019

Fixed in #9544

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
9 participants
You can’t perform that action at this time.