textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters #9545
Conversation
@feerrenrut: I wrote this code around half a year ago to make it possible to use 32-bit wide characters with liblouis, see #9044. However, as noted, it might also be of use when converting the textInfos module to Python 3, especially when it comes to offset-based textInfos. I guess we can revisit this as soon as we walk into this issue with textInfos, which will probably be pretty soon.
@feerrenrut and I had a very quick discussion about this. As there is currently no Python 3 code available to test this with, I'll try to give a quick example of where this code is supposed to assist, using a list of steps. A good example is NVDAObjects.window.edit.EditTextInfo. Edit controls use UTF-16 for their internal text encoding.
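To make the offset mismatch concrete, here is a small self-contained sketch (not NVDA code; the edit control is simulated by encoding to UTF-16):

```python
# A hypothetical edit control reports offsets in UTF-16 code units.
text = "x\U0001f926y"            # what Python 3 sees: 3 code points
wide = text.encode("utf-16-le")  # what the control stores: 4 code units
# The control says "y" starts at wide offset 3, but in Python 3 it is index 2:
assert wide[3 * 2:].decode("utf-16-le") == "y"
assert text[2] == "y"
assert text[3:] == ""  # using the wide offset directly misses the text
```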
note that on EditTextInfo
From this it follows that:
This looks very thorough, but:
@michaelDCurran commented on 31 May 2019, 09:01 CEST:
I think the only difference with regular strings is related to indexing, slicing and length. Other methods are overridden to make sure that concatenation and multiplication use the same class.
I agree that this makes the use of this class much more explicit, which is a good thing IMO.
I don't think it is that much work. Would be good to know what @feerrenrut thinks about this.
I think it is a valid concern.
Yes. I tend to agree that inheriting from str could cause some nasty problems. I think it would be better if our code explicitly opted to use this approach, and ensure that it can't leak elsewhere. I would also like to see unit tests for this code.
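As an illustration of the "leaking" concern: a hypothetical str subclass (named WideStr here purely for the example, not the class in this PR) stops behaving like itself the moment any str operation touches it:

```python
# Hypothetical subclass redefining length in UTF-16 code units:
class WideStr(str):
    def __len__(self) -> int:
        return len(self.encode("utf-16-le")) // 2

s = WideStr("\U0001f926")        # one emoji
assert len(s) == 2               # surprising wherever plain str semantics are assumed
assert type(s + "x") is str      # concatenation silently drops the subclass
assert type(s.upper()) is str    # as do all other str methods
```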
I just pushed this to a prototype that actually works in Notepad. I'd like to gather some feedback on the prototype code before continuing. The unit testing code is actually pretty outdated, so please ignore that for now.
I think I'm happy with this approach. It's worth putting some thought into how to clarify the intent of the usage of WideStringOffsetConverter. Perhaps write a few unit tests based on the current usage in textInfos/offsets.py and see if the interface can be clarified.
I'd like there to be unit tests for the additions to textUtils, and some more documentation.
	else:
		encoding=locale.getlocale()[1]
-		return buf.value.decode(encoding,errors="replace")
+		return buf.value.decode(self.encoding,errors="replace")
I think it would be good to be aware of and log errors in debug mode.
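One way to do that, sketched as a hypothetical helper (decodeWithLogging is not part of the PR; it just illustrates logging before falling back to replacement):

```python
import logging

log = logging.getLogger(__name__)

def decodeWithLogging(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, logging a debug message when replacement is needed."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        log.debug("Invalid %s byte sequence in %r", encoding, raw)
        return raw.decode(encoding, errors="replace")
```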
def strLength(self) -> int:
	return len(self.decoded)

def strToWideOffsets(
Can you add some documentation for this? What does strict do? Perhaps rename it to raiseOnError.
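For illustration, the clamp-or-raise behaviour under discussion might look like this (a hypothetical standalone helper; the PR's actual method operates on the converter's own offsets):

```python
def clampOffset(offset: int, length: int, strict: bool = False) -> int:
    """Clamp offset into [0, length]; raise IndexError instead when strict is True."""
    if offset > length:
        if strict:
            raise IndexError("str offset out of range")
        return length
    return offset
```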
source/textUtils.py
	strEnd: int,
	strict: bool = False
) -> Tuple[int, int]:
	if strStart > self.strLength:
Should this be "greater or equal" (>=)?
source/textUtils.py
if strict:
	raise IndexError("str start index out of range")
strStart = min(strStart, self.strLength)
if strEnd > self.strLength:
Also, this should perhaps be "greater or equal". Although, if the end offset is excluded from the range, this is OK... I'm not sure about it. Please add unit tests for these cases.
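The distinction hinges on Python's half-open range convention, which can be checked directly:

```python
text = "abc"
# An end offset equal to the length is valid for a half-open (exclusive-end) range:
assert text[0:len(text)] == "abc"
# A start offset equal to the length selects nothing, so >= is the right check
# for a *start* offset but too strict for an *end* offset:
assert text[len(text):] == ""
```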
source/textUtils.py
	precedingStr: str = ""
	strStart: int = 0
else:
	precedingStr = self.encoded[:bytesStart].decode(self._encoding, errors="surrogatepass")
surrogatepass is handy; I also saw the surrogateescape option, which could also be useful: https://docs.python.org/3/library/codecs.html#error-handlers
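The two error handlers solve different problems, which a quick snippet makes clear:

```python
# surrogatepass round-trips lone surrogate *code points* through UTF-16:
lone = "\ud83e"  # unpaired high surrogate
raw = lone.encode("utf-16-le", errors="surrogatepass")
assert raw == b"\x3e\xd8"
assert raw.decode("utf-16-le", errors="surrogatepass") == lone
# surrogateescape instead smuggles undecodable *bytes* through as lone surrogates:
assert b"\xff".decode("utf-8", errors="surrogateescape") == "\udcff"
assert "\udcff".encode("utf-8", errors="surrogateescape") == b"\xff"
```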
source/textInfos/offsets.py
# Fall back to the older word offsets detection that only breaks on non alphanumeric
if self.encoding == self._UNISCRIBE_ENCODING:
I'm a bit confused about this: line 317 has if self.useUniscribe; how does this block relate to that one?
source/textInfos/offsets.py
if self.encoding == self._UNISCRIBE_ENCODING:
	offsetConverter = textUtils.WideStringOffsetConverter(self._getStoryText())
	strStart, strEnd = offsetConverter.wideToStrOffsets(offset, offset + 1)
	return offsetConverter.strToWideOffsets(strStart, strEnd)
Why does this do wideToStr, then strToWide?
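My reading (an assumption; the thread doesn't spell it out): the round trip snaps wide offsets to whole-character boundaries, so an offset landing inside a surrogate pair expands to cover the full character. A toy version of the two conversions shows the effect:

```python
def wideToStr(text: str, wideOffset: int) -> int:
    """Map a UTF-16 code-unit offset to the code-point index it falls in (toy version)."""
    units = 0
    for i, ch in enumerate(text):
        width = 2 if ord(ch) > 0xFFFF else 1
        if units + width > wideOffset:
            return i  # offset inside this character: snap to its start
        units += width
    return len(text)

def strToWide(text: str, strOffset: int) -> int:
    """Map a code-point index back to a UTF-16 code-unit offset (toy version)."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:strOffset])

text = "\U0001f926"  # one emoji = two UTF-16 code units
start, end = wideToStr(text, 1), wideToStr(text, 1 + 1)
# The wide range (1, 2) pointed into the surrogate pair; after the round trip
# it covers the whole character:
assert (strToWide(text, start), strToWide(text, end)) == (0, 2)
```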
@feerrenrut: I think I've applied all your requests.
@michaelDCurran: As UIA consoles now also rely on Uniscribe, I will need to revisit this after a merge of master into threshold, and then in threshold_py3_staging.
source/NVDAObjects/window/edit.py
# ANSI strings are terminated with one NULL character.
# As pointed out above, numChars contains the number of characters without the null terminator.
# Therefore, if the string we got isn't null terminated at numChars or numChars + 1,
# the buffer definitely contains multibyte characters.
I think there is something missing in the explanation, and I don't think this actually does what we expect. If I am understanding right, it relies on the value of numChars being half of the number of bytes in a wide character string, and it being impossible to have a null character (represented by two null bytes) at the mid-point of the wide character string. If it is an ANSI string, there MUST be at least one null character at numChars+1 to indicate the end of the string. However, note that it does not handle an odd number of chars well: for wide character strings it will be considering one byte from one character and one byte from the next.
Also worth pointing out, this relies on the memory being zeroed by the VirtualAllocEx function, otherwise:
numChars = 2  # copy logical "hi" into buffer
# imagine a string copied to a buffer with "garbage" in it. The string copied is "hi"
# wide string memory "0h0i" copied over garbage
bW = "0h0i00garbage"  # ok, bW[numChars] == 0 but bW[numChars+1] == 'i' so it is interpreted as unicode
# ANSI string memory "hi" copied over garbage
bA = "hi0garbage"  # not ok, bA[numChars] == 0 but bA[numChars+1] == 'g' so it is interpreted as unicode
I think there is a broader issue here: odd numbers of characters are not handled correctly. Consider a wide string with numChars = 3:
buf = b"\x00\x04\x41\x00\x00\x01\x00\x00"  # = "ЀAĀ" followed by two null chars.
buf[numChars] != 0    # False, it is zero
buf[numChars+1] != 0  # False, it is zero: buf interpreted as ANSI
Would it be clearer to do something like:
if (
	# The window is unicode, the text range contains multi byte characters.
	self.obj.isWindowUnicode
	# VirtualAllocEx zeroes out memory, so for an ANSI string there should ONLY be null bytes
	# after the first numChars.
	# For wide character strings, the number of bytes will be twice the number of characters,
	# meaning numChars points to mid-string, so there will be non-zero bytes after it.
	or any(c != 0 for c in buf[numChars:bufLen])
):
	text = ctypes.cast(buf, ctypes.c_wchar_p).value
else:
	encoding = locale.getlocale()[1]
Also worth considering a single char.
buf = b"\x41\x00\x00\x00" # = "A" followed by two null chars. This is valid as a wide string or ANSI
buf = b"\x00\x45\x00\x00" # ??
There are enough cases it would be worth extracting this and creating a unit test.
Although for perf considerations, perhaps we just adjust the byte alignment and check two bytes:
wCharAlignedIndex = numChars + (numChars % 2)
or not (buf[wCharAlignedIndex] == 0 and buf[wCharAlignedIndex + 1] == 0)
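That alignment idea, made runnable as a sketch (looksLikeWideString is a hypothetical name; the buffers below simulate the zeroed memory VirtualAllocEx would provide):

```python
def looksLikeWideString(buf: bytes, numChars: int) -> bool:
    """Heuristic from the discussion: after aligning to a wide-char boundary,
    a wide string of numChars characters still has non-null bytes there,
    while an ANSI string in a zeroed buffer does not."""
    wCharAlignedIndex = numChars + (numChars % 2)
    return not (buf[wCharAlignedIndex] == 0 and buf[wCharAlignedIndex + 1] == 0)

# Wide "hi" (numChars=2): bytes 2..3 hold 'i\x00', so it reads as wide:
assert looksLikeWideString(b"h\x00i\x00\x00\x00", 2)
# ANSI "hi" (numChars=2): bytes 2..3 are the zeroed terminator:
assert not looksLikeWideString(b"hi\x00\x00\x00\x00", 2)
# Odd numChars: wide "abc" aligns to index 4, which hits 'c':
assert looksLikeWideString(b"a\x00b\x00c\x00\x00\x00", 3)
assert not looksLikeWideString(b"abc\x00\x00\x00", 3)
```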
We discussed this by voice, and I think I've been able to come up with working implementations that proved to work in Wordpad and Notepad++. Note that in both cases, I also tried to fetch a range that was actually longer than the text in the control, so these cases are also covered.
While I really see your point about unit tests, note that we can't test this against either Wordpad or Notepad++. I think we should also split this logic out into a new function in textUtils that takes a buffer, a number of bytes and an encoding. The more I think about that, the more sensible I consider it, so that's what I'll do.
tests/unit/test_textUtils.py
self.assertEqual(converter.strToWideOffsets(3, 3), (3, 3))

def test_surrogatePairs(self):
	converter = WideStringOffsetConverter(text=u"\U0001f926\U0001f60a\U0001f44d")  # 🤦😊👍
Since there are several characters being combined in different ways, it might be clearer to specify them as constants and join them together in their various combinations for each test:
facePalm = u"\U0001f926"  # 🤦
smile = u"\U0001f60a"  # 😊
thumbsUp = u"\U0001f44d"  # 👍
...
converter = WideStringOffsetConverter(text=u"".join([facePalm, smile, thumbsUp]))
I think this would make things a lot more readable here. Rather than using join as I suggested before, just add them together; the difference in performance will be negligible. Unless there is something you are trying to convey by having the unicode values there? I get that there is some benefit to seeing the number of bytes for each character; perhaps there is another way to achieve that. At the moment it's very hard to tell that it is the same value being used throughout.
facePalm = u"\U0001f926"  # 🤦
smile = u"\U0001f60a"  # 😊
thumbsUp = u"\U0001f44d"  # 👍
...
converter = WideStringOffsetConverter(text=facePalm + smile + thumbsUp)
…here is enough space in our buffer
I found some issues which I have fixed in my last few commits.
Hi, by the way, unit tests are failing because the textUtils module has no _WCHAR_ENCODING. Thanks.
@feerrenrut: I just looked into the change of textInfos.offsets.Offsets to a dataclass, and noticed that Offsets instances are now unhashable. Since Offsets is a mutable object, I'm tempted to leave this as is. #9757 defined a hash method for every object where we defined eq, but I think this might have been a bit too aggressive; it also defined hash on objects for which we do not rely on hashability.
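The unhashability is standard dataclass behaviour: eq=True (the default) without frozen=True sets __hash__ to None. A minimal reproduction (an illustrative stand-in, not NVDA's actual Offsets definition):

```python
from dataclasses import dataclass

@dataclass
class Offsets:  # stand-in: two of the real class's fields
    startOffset: int
    endOffset: int

# eq=True without frozen=True sets __hash__ to None, so instances are unhashable:
assert Offsets.__hash__ is None
try:
    hash(Offsets(0, 1))
    raise AssertionError("expected TypeError")
except TypeError:
    pass
```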
You've made a lot of formatting changes in this PR. I've managed to review it, but they make it much harder to spot the substantive changes. If these changes weren't intentional, please check your editor settings.
Generally, just a few small changes and I think this is good to go. Thanks for your work on this!
What's new entry suggested by Leonard de Ruijter (Babbage).
Link to issue number:
Closes #8981
Related to #8981
Follow up of #9044
Summary of the issue:
On Windows, wide characters are two bytes in size. This was also the case for unicode strings in Python 2. This is best explained with an example:
In Python 3, however, strings are stored using a variable byte size, based on the number of bytes needed to store the highest code point in the string. One index always corresponds to one code point.
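For example (plain Python 3, with the Windows wide-character view simulated via UTF-16):

```python
s = "a\U0001f926"  # "a🤦"
# Python 3: one index per code point, regardless of internal storage width:
assert len(s) == 2
assert s[1] == "\U0001f926"
# Windows wide strings (and Python 2 narrow builds) count UTF-16 code units,
# so the same text is 3 characters long there:
assert len(s.encode("utf-16-le")) // 2 == 3
```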
A much more detailed description of the problem can be found in #8981.
Description of how this pull request fixes the issue:
This PR currently only introduces a new textUtils module that intends to mitigate issues introduced by the Python 3 transition. Most offset-based TextInfos are based on a two-byte wide character string representation. For example, Uniscribe uses 2-byte wide characters, and therefore 😉 is treated as two characters by Uniscribe whereas Python 3 treats it as one.
This is where textUtils.WideStringOffsetConverter comes into view. This new class keeps the decoded and encoded form of a string in one object. This object can be used to convert string offsets between two implementations, namely the Python 3 one-offset-per-code-point implementation, and the Windows wide character implementation with surrogate offsets.
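To sketch the idea (a toy stand-in, not the PR's actual implementation; the method and attribute names follow those quoted in the review, but the real signatures may differ):

```python
class MiniWideConverter:
    """Toy analogue of textUtils.WideStringOffsetConverter: keeps the decoded
    string and its UTF-16 encoded form together in one object."""

    def __init__(self, text: str):
        self.decoded = text
        self.encoded = text.encode("utf-16-le")

    @property
    def wideStringLength(self) -> int:
        return len(self.encoded) // 2

    def strToWideOffsets(self, strStart: int, strEnd: int):
        def toWide(offset: int) -> int:
            return len(self.decoded[:offset].encode("utf-16-le")) // 2
        return toWide(strStart), toWide(strEnd)

converter = MiniWideConverter("a\U0001f926b")
assert converter.wideStringLength == 4             # the emoji takes two code units
assert converter.strToWideOffsets(1, 2) == (1, 3)  # so it spans wide offsets 1..3
```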
The initial version of this PR implemented a broader approach with a class that inherited from str. After discussion with @feerrenrut and @michaelDCurran, it was decided not to do this.
The initial PR also contained support for other encodings, particularly UTF-8. However, investigation revealed that this is far from ideal to store in one class. Furthermore, the only offsets implementation that uses UTF-8 for its internal encoding is Scintilla/Notepad++, and that implementation offers everything we need for text range fetching.
Testing performed:
Tested using the Python 3 interpreter, running this branch from source with the fixes from the try-py3_ignoreSomeIssues branch applied. Also, most importantly, we have working unit tests. Note that both the test_textUtils and test_textInfos modules apply here.
Known issues with pull request:
None
Change log entry:
Maybe we need to add some information to the changes for developers section, particularly about the new encoding property on OffsetsTextInfo.