Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiling Liblouis with 32-bit Unicode support #6695

Closed
aaclause opened this issue Jan 4, 2017 · 28 comments
Closed

Compiling Liblouis with 32-bit Unicode support #6695

aaclause opened this issue Jan 4, 2017 · 28 comments
Assignees
Labels
p3 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority
Milestone

Comments

@aaclause
Copy link
Contributor

aaclause commented Jan 4, 2017

In my opinion, the Liblouis DLL included in NVDA should be compiled with 32-bit Unicode.
Currently, we improve the French table and we'd like to add any 32-bit Unicode such as \y1D4D0, \y1D49C, \y1D4D2, \y1D49E, etc.
What do you think?

@josephsl
Copy link
Collaborator

josephsl commented Jan 4, 2017 via email

@aaclause
Copy link
Contributor Author

aaclause commented Jan 4, 2017

@josephsl yes, it is possible. See https://github.com/liblouis/liblouis/blob/master/README.windows

Edit the file configure.mk If you want 32-bit unicode change the 2 in the line UCS=2 to a 4.

@josephsl
Copy link
Collaborator

josephsl commented Jan 4, 2017 via email

@dkager
Copy link
Collaborator

dkager commented Jan 4, 2017

AFAIK that needs an update to the Python wrapper similar to what is done in Java.
Another use case for this feature is support for emoji, e.g. U+1F170.

@jcsteh
Copy link
Contributor

jcsteh commented Jan 4, 2017

  1. We don't build using the liblouis build stuff for Windows; it's integrated into NVDA's own build system. However, it's easy enough to switch that to 32 bit Unicode.
  2. As @dkager notes, the bigger issue is that the Python bindings are currently tied to the OS Python Unicode width, which is 16 bits on Windows. We'd need to somehow make this optional in the upstream Python bindings based on how liblouis was compiled. That's going to be a bit tricky because on Windows, Python Unicode still uses 16 bits, so the bindings will have to figure out when and what to convert. We also need to make sure it doesn't break Unix. :)

@jcsteh
Copy link
Contributor

jcsteh commented Jan 4, 2017

@Andre9642, how important is this for French? Does the table fail to work with NVDA if you include these characters or do they just get ignored? Are these characters frequently used in French?

@aaclause
Copy link
Contributor Author

aaclause commented Jan 5, 2017

@jcsteh:

Yes, the table fail to work with NVDA if I include these characters. Moreover Lou_checktable indicate that the table is invalid:

<table>:<line>: error: liblouis has not been compiled for 32-bit Unicode

No, these characters aren't frequently used in French. They are used in mathematics. In fact, our additions relate in part to Unicode Math Symbols. Extract:

uplow \x0391\x03b1 46-45-1,45-1 Α,α alpha
uplow \x0392\x03b2 46-45-12,45-12 Β,β beta
uplow \x0393\x03b3 46-45-1245,45-1245 Γ,γ gamma
...
#sign \y1D4D2 46-5-14 # MATHEMATICAL BOLD SCRIPT CAPITAL C
...
#sign \y1D49F 46-5-145 # MATHEMATICAL SCRIPT CAPITAL D
...

@jcsteh jcsteh added the p3 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority label Jan 5, 2017
@jcsteh
Copy link
Contributor

jcsteh commented Jan 5, 2017

P2 because this prevents tables from even working in NVDA if they contain 32 bit characters.

@bhavyashah
Copy link

Just a gentle ping to bring the attention of NVDA code contributors to this P2 issue in case anyone desires to work on this.

@dkager
Copy link
Collaborator

dkager commented Aug 6, 2017

As noted in #6695 (comment) this needs some upstream work first. (In general, I'm not too fond of how liblouis deals with Unicode.)

@josephsl
Copy link
Collaborator

josephsl commented Aug 6, 2017 via email

@jcsteh
Copy link
Contributor

jcsteh commented Aug 6, 2017

@josephsl commented on 7 Aug 2017, 03:31 GMT+10:

Hi, we need this for Python 3 (#7105(.

No, we don't. Python 3 Unicode does handle 32 bit characters (actually, internally, i believe it's variable width now). However, Windows is still UTF-16 and thus ctypes still treats Unicode in C functions as UTF-16, since we're running on Windows. So, this is not required for Python 3.

@jcsteh
Copy link
Contributor

jcsteh commented Aug 6, 2017

For anyone that does want to try to implement this into the liblouis Python bindings, here are some rough thoughts:

  1. The Python bindings need to know the number of bits used for widechar (16 or 32) by the compiled liblouis. That'll need to be an autotools macro expansion, just as is done for ###LIBLOUIS_SONAME###. (NVDA's build system will also need to handle this expansion as it does for ###LIBLOUIS_SONAME###.)
  2. This determines the encoding to use when passing strings to and from liblouis. If it's 16, we must use UTF-16. If it's 32, we must use UTF-32.
  3. It's probably simplest to tell ctypes to use c_char_p instead of c_wchar_p so that we get raw bytes.
  4. Wherever ctypes.sizeof(ctypes.c_wchar) is used, instead use the number of bits from 1) divided by 8 for bytes.
  5. When passing unicode strings to liblouis, first encode them using the encoding from 2); e.g. inbuf.encode("UTF-32")
  6. After obtaining unicode strings from liblouis, decode them using the encoding from 2) before returning them; e.g. outbuf.value.decode("UTF-32")

Note that this proposed implementation eliminates the need to consider the ctypes unicode width altogether. However, it may be slightly less optimal when the ctypes unicode width does match the liblouis unicode width because it always uses a Python byte string as an intermediary, where perhaps ctypes is more efficient internally; I'm not sure. You could get around this by checking the ctypes unicode width with ctypes.sizeof(ctypes.c_wchar) and only using the proposed new behaviour if it differs from the liblouis width. However, this will make the code somewhat more complicated (especially with regard to testing on different configurations) and is probably premature optimisation unless a significant speed difference can be proved.

@dkager
Copy link
Collaborator

dkager commented Aug 7, 2017

There is also a function to get the widechar size, though that would require a library call from the wrapper:

/**
 * Return the size of widechar
 */
LIBLOUIS_API
int EXPORT_CALL lou_charSize(void);

A much more ambitious solution would be to make liblouis work with UTF-16 as opposed to UCS-2, i.e. add support for surrogate characters.

@LeonarddeR
Copy link
Collaborator

I'm going to give this a try.

@LeonarddeR
Copy link
Collaborator

@jcsteh commented on 7 aug. 2017 01:04 CEST:

  1. The Python bindings need to know the number of bits used for widechar (16 or 32) by the compiled liblouis. That'll need to be an autotools macro expansion, just as is done for ###LIBLOUIS_SONAME###. (NVDA's build system will also need to handle this expansion as it does for ###LIBLOUIS_SONAME###.)

Shouldn't we get rid of autotools macro expansions in python bindings if possible?

@jcsteh
Copy link
Contributor

jcsteh commented Aug 30, 2018 via email

@LeonarddeR
Copy link
Collaborator

@jcsteh commented on 30 aug. 2018 18:58 CEST:

One way you could work around that is to
have some liblouis function which returns this information, but I'm not
sure whether that'd be acceptable to the maintainers.

There is such a function as @dkager pointed out in #6695 (comment)

@jcsteh commented on 7 aug. 2017 01:04 CEST:

  1. This determines the encoding to use when passing strings to and from liblouis. If it's 16, we must use UTF-16. If it's 32, we must use UTF-32.

For reference, we should explicitly use the little endian versions of utf-16 and utf-32.

  1. After obtaining unicode strings from liblouis, decode them using the encoding from 2) before returning them; e.g. outbuf.value.decode("UTF-32")

Now, that fails, since for c_char_p, the value is truncated as soon as a null character is encountered. Even when we would use c_wchar_p on Windows, the value will be truncated for cases like 'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'

@LeonarddeR
Copy link
Collaborator

Just filed liblouis/liblouis#633. I did not yet bother to create a try build due to several issues, including liblouis 3.7 refusing to build when building with NVDA's build system. Feedback is welcome though.

@dkager
Copy link
Collaborator

dkager commented Aug 31, 2018

Does their own NMake-based build still work? That seems to break every other release too.

@LeonarddeR
Copy link
Collaborator

@jcsteh commented on 7 aug. 2017 00:25 CEST:

@josephsl commented on 7 Aug 2017, 03:31 GMT+10:

Hi, we need this for Python 3 (#7105(.

No, we don't.

Actually, I'm afraid @josephsl has a point.

>>> import louis
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\Development\NVDA\source repo\source\louis\__init__.py", line 358, in <module>
    raise Exception("mismatch in Unicode width: liblouis is 2 and python isn't")
Exception: mismatch in Unicode width: liblouis is 2 and python isn't

@jcsteh
Copy link
Contributor

jcsteh commented Sep 4, 2018

@leonardder commented on Sep 4, 2018, 6:44 PM GMT+10:

  File "d:\Development\NVDA\source repo\source\louis\__init__.py", line 358, in <module>
    raise Exception("mismatch in Unicode width: liblouis is 2 and python isn't")

Sure, but that's because the bindings don't yet support both bit widths (regardless of Python bit width). Once that's fixed, this check can be removed from the bindings. That should be done as part of liblouis/liblouis#633.

@LeonarddeR
Copy link
Collaborator

LeonarddeR commented Sep 4, 2018

That's already covered in liblouis/liblouis#633 indeed.

@LeonarddeR
Copy link
Collaborator

LeonarddeR commented Dec 4, 2018

Ugh, there's one major thing I didn't think about yet.
When compiling liblouis with UCS-4, cursor position breaks in a major way. This is because NVDA provides the cursor position based on the internal python 2 unicode representation, which has two offsets for characters involving surrogates in UTF-16, such as 😊. A switch to Python 3 would automatically fix this I believe, since Python 3 unicode strings behave similar to 4 byte widechar strings (i.e. one 4 byte liblouis widechar corresponds with one offset in a python 3 string).
We could work around this by working with an internal UTF-32 representation of the text, but that feels a bit hacky,

@zstanecic
Copy link
Contributor

zstanecic commented Dec 4, 2018 via email

@LeonarddeR LeonarddeR self-assigned this Dec 8, 2018
@LeonarddeR
Copy link
Collaborator

I've found a way to do this anyway. A pull request will be filed shortly.

@Adriani90
Copy link
Collaborator

@LeonarddeR Should this issue be closed now that #9544 has been merged?

@LeonarddeR
Copy link
Collaborator

Fixed in #9544

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p3 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority
Projects
None yet
9 participants