Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters #9545

Merged
merged 23 commits into from Jul 2, 2019

Conversation

@leonardder
Copy link
Collaborator

leonardder commented May 7, 2019

Link to issue number:

Closes #8981
Related to #8981
Follow up of #9044

Summary of the issue:

On Windows, wide characters are two bytes in size. This is also the case in python 2. This is best explained with an example:

>>> len(u"馃槈")
2

In Python 3 however, strings are saved using a variable byte size, based on the number of bytes that is needed to store the highest code point in the string. One index always corresponds with one code point.

A much more detailed description of the problem can be found in #8981.

Description of how this pull request fixes the issue:

This pr currently only introduces a new textUtils module that intends to mitigate issues introduced with the Python 3 transition. Most offset based TextInfos are based on a two bytes wide character string representation. For example, uniscribe uses 2 byte wide characters, and therefore 馃槈 is treated as two characters by uniscribe whereas Python 3 treats it as one.

This is where textUtils.WideStringOffsetConverter comes into view. This new class keeps the decoded and encoded form of a string in one object. This object can be used to convert string offsets between two implementations, namely the Python 3 one offset per code poitn implementation, and the Windows wide character implementation with surrogate offsets.

The initial version of this pr implemented a broader approach with a class that inherrited from str. After discussion with @feerrenrut and @michaelDCurran, it has been decided not to do this.
The initial pr also contained support for other encodings, particularly UTF-8. However, investigation revealed that this is far from ideal to store in one class. Furthermore, the only offsets implementation that uses UTF-8 for its internal encoding is Scintilla/Notepad++, and that implementation offers everything we need for text range ffetching.

Testing performed:

Tested using the python 3 interpreter, running this branch from source with the fixes from the try-py3_ignoreSomeIssues branch applied. Also, most importantly, we have working unitTests. note that both the test_textUtils and test_textInfos modules apply here.

Known issues with pull request:

None

Change log entry:

May be we need to file some information in the changes for developers section, particularly about the new encoding property on OffsetsTextInfo

@leonardder leonardder requested a review from feerrenrut May 7, 2019
@leonardder

This comment has been minimized.

Copy link
Collaborator Author

leonardder commented May 7, 2019

@feerrenrut: I wrote this code around half a year ago to make it possible to use 32 bit wide characters with liblouis, see #9044. However as noted, it might also be of use when converting the textInfos module to Python 3, especially when it comes to offset based textInfos. I guess we can revisit this as soon as we are going to walk into this issue with textInfos, which is probably pretty soon.

@feerrenrut feerrenrut added this to In progress in Update NVDA to Python 3 via automation May 9, 2019
@leonardder

This comment has been minimized.

Copy link
Collaborator Author

leonardder commented May 21, 2019

@feerrenrut and I had a very quick discussion about this. As there is currently no python 3 code available to test this with, I'll try to give a quick example of where this code is supposed to assist in, using a list of steps. A good example is NVDAObjects.window.edit.EditTextInfo. Edit controls are using utf-16 for their internal text encoding.

  1. Open notepad
  2. Paste these two emoji: 馃槈馃槉
  3. Using the Python console, get the text, its length and several ranges in the text:
>>> info=nav.makeTextInfo("all")
>>> info.text
u'\U0001f609\U0001f60a'
>>> # SHouldn't the result of this be 4 instad of 5?
>>> info._getStoryLength()
5L
>>> info._startOffset
0
>>> info._endOffset
5L
>>> info._getTextRange(0,1)
u'\ud83d'
>>> info._getTextRange(0,2)
u'\U0001f609'
>>> info._getTextRange(2,3)
u'\ud83d'
>>> info._getTextRange(2,4)
u'\U0001f60a'

note that on EditTextInfo

  • _get_text is defined on its superclass and calls _getTextRange
  • _getTextRange uses string indexing on _getStoryText

From this follows that:

  • Getting the text for a particular range uses python string slicing
  • Python 3 only uses one index for every code point. The string containing the two emoji is therefore two characters long in Python 3
  • Selecting the first emoji (offset 0 and 1 in the text control) translates to getting self._getStoryText()[0:2] which in Python 3, results in the two emoji being returned
  • Selecting the second emoji (offset 2 and 3 in the text control) translates to getting self._getStoryText()[2:4] which in Python 3, results in an empty string being returned
@leonardder leonardder added the Python 3 label May 29, 2019
@michaelDCurran

This comment has been minimized.

Copy link
Contributor

michaelDCurran commented May 31, 2019

This looks very thorough, but:
I'm rather worried about this class pretending that it is an str (I.e. it inherits from it) although the majority of its methods and indexing behave differently.
I am wondering if we can instead have a class that inherits from str, but includes just one extra member, which is this object with the magical indexing etc. Then our offset code should always check for this member on strs and use that for indexing instead of the main object.
Perhaps this would be way too much work and be more confusing than what you currently have. But I am very worried about these objects leaking out to other parts of NVDA or in deed other libries thinking that it is a normal str.
Perhaps I am too worried. Your thoughts?

@leonardder

This comment has been minimized.

Copy link
Collaborator Author

leonardder commented May 31, 2019

@michaelDCurran commented on 31 May 2019, 09:01 CEST:

This looks very thorough, but:
I'm rather worried about this class pretending that it is an str (I.e. it inherits from it) although the majority of its methods and indexing behave differently.

I think the only difference with regular strings is related to indexing, slicing and length. Other methods are overriden to make sure that concatenation and multiplication use the same class.

I am wondering if we can instead have a class that inherits from str, but includes just one extra member, which is this object with the magical indexing etc. Then our offset code should always check for this member on strs and use that for indexing instead of the main object.

I agree that this makes the use of this class much more explicit, which is a good thing IMO.

Perhaps this would be way too much work and be more confusing than what you currently have.

I don't think it is that much work. Would be good to know what @feerrenrut thinks about this.

But I am very worried about these objects leaking out to other parts of NVDA or in deed other libries thinking that it is a normal str.

I think it is a valid concern.

Update NVDA to Python 3 automation moved this from In progress to Needs review Jun 5, 2019
Copy link
Contributor

feerrenrut left a comment

Yes. I tend to agree that inheriting from str could cause some nasty problems. I think it would be better if our code explicitly opted to use this approach, and ensure that it can't leak elsewhere. I would also like to see unit tests for this code.

@leonardder leonardder changed the base branch from threshold to threshold_py3_staging Jun 11, 2019
@leonardder leonardder force-pushed the BabbageCom:textUtils-py3 branch from b71d30c to f603404 Jun 11, 2019
@leonardder leonardder force-pushed the BabbageCom:textUtils-py3 branch from 223990f to aa2cbac Jun 15, 2019
@leonardder

This comment has been minimized.

Copy link
Collaborator Author

leonardder commented Jun 15, 2019

I just pushed this to a prototype that actually works in notepad. I'd like to gather some feedback on the prototype code before continuing. The unit testing code is actually pretty outdate, so please ignore that for now.

@leonardder leonardder changed the title textUtils module to deal with different internal text encodings textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters Jun 15, 2019
@leonardder leonardder requested a review from feerrenrut Jun 17, 2019
Copy link
Contributor

feerrenrut left a comment

I think I'm happy with this approach, I think it's worth putting some thought into how to clarify the intent of the usage of WideStringOffsetConvert. Perhaps write a few unit tests based on the current usage in textInfos/offsets.py and see if the interface can be clarified.

I'd like there to be unit tests for the additions to textUtils, and some more documentation.

source/NVDAObjects/window/edit.py Show resolved Hide resolved
source/NVDAObjects/window/edit.py Outdated Show resolved Hide resolved
else:
encoding=locale.getlocale()[1]
return buf.value.decode(encoding,errors="replace")
return buf.value.decode(self.encoding,errors="replace")

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 17, 2019

Contributor

I think it would be good to be aware of and log errors in debug mode.

source/textUtils.py Outdated Show resolved Hide resolved
def strLength(self) -> int:
return len(self.decoded)

def strToWideOffsets(

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 17, 2019

Contributor

Can you add some documentation for this? What does strict do? Perhaps rename it to raiseOnError

strEnd: int,
strict: bool =False
) -> Tuple[int, int]:
if strStart > self.strLength:

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 17, 2019

Contributor

Should this be "greater or equal" (>=).

if strict:
raise IndexError("str start index out of range")
strStart = min(strStart, self.strLength)
if strEnd > self.strLength:

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 17, 2019

Contributor

Also, perhaps should be "greater or equal". Although, if the end offset is excluded from the range this is ok... I'm not sure about it. Please add unit tests for these cases.

precedingStr: str = ""
strStart: int = 0
else:
precedingStr= self.encoded[:bytesStart].decode(self._encoding, errors="surrogatepass")

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 17, 2019

Contributor

surrogatepass is handy, I also saw surrogateescape option which could be handy: https://docs.python.org/3/library/codecs.html#error-handlers

#Fall back to the older word offsets detection that only breaks on non alphanumeric
if self.encoding == self._UNISCRIBE_ENCODING:

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 17, 2019

Contributor

I'm a bit confused about this, line 317 has if self.useUniscribe how does this block relate to that one?

if self.encoding == self._UNISCRIBE_ENCODING:
offsetConverter = textUtils.WideStringOffsetConverter(self._getStoryText())
strStart, strEnd = offsetConverter.wideToStrOffsets(offset, offset + 1)
return offsetConverter.strToWideOffsets(strStart, strEnd)

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 17, 2019

Contributor

Why does this do wideToStr, then strToWide?

Leonard de Ruijter added 2 commits Jun 19, 2019
@leonardder leonardder force-pushed the BabbageCom:textUtils-py3 branch from 9d4fe9b to 2a554f0 Jun 19, 2019
@leonardder leonardder marked this pull request as ready for review Jun 19, 2019
@leonardder leonardder requested a review from feerrenrut Jun 19, 2019
@leonardder

This comment has been minimized.

Copy link
Collaborator Author

leonardder commented Jun 19, 2019

@feerrenrut: I think I've applied all your requests.

# ANSI strings are terminated with one NULL terminated character.
# As pointed out above, numChars contains the number of characters without the null terminator.
# Therefore, if the string we got isn't null terminated at numCHars or numChars + 1,
# the buffer definitely contains multibyte characters.

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 21, 2019

Contributor

I think there is something missing in the explanation, and I dont think this actually does what we expect.

If I am understanding right, it relies on the value of numChars being half of the number of bytes in a wide character string and it being impossible to have a null character (represented by two null bytes) at the mid-point of the wide character string. If it is an ANSI string, there MUST be at least one null character at numChars+1 to indicate the end of the string. However note that it does not handle an odd number of chars well. For wide character strings it will be considering one byte from one character, and one byte from the next.

Also worth point out this relies on the memory being zeroed by the VirtualAllocEx function, otherwise:

numChars = 2 #  copy logical "hi" into buffer
# imaging a string copied to a buffer with "garbage" in it. The string copied is "hi"
# wide string memory "0h0i" copied over garbage
bW = "0h0i00garbage" # ok, bW[numChars] == 0 but bW[numChars+1] == 'i' so it is interpreted as unicode
# ANSI string memory "hi" copied over garbage
bA = "hi0garbage" # not ok, bW[numChars] == 0 but bW[numChars+1] == 'g' so it is interpreted as unicode

I think there is a broader issue here that odd numbers of characters are not handled correctly.

Consider where a wide string with numChars = 3:

buf = b"\x00\x04\x41\x00\x00\x01\x00\x00" # = "衻A膧" followed by two null chars.
buf[numChars] != 0 # False, it is zero
buf[numChars+1 != 0 # False, it is zero: buf interpreted as ANSI

Would it be clearer to do something like:

if (
	# The window is unicode, the text range contains multi byte characters.
	self.obj.isWindowUnicode
	# VirtualAllocEx zeroes out memory, so for an ANSI string there should ONLY be null bytes 
	# after the first numchars . 
	# For wide character strings, the number of bytes will be twice the number of characters, meaning 
	# numchars points to mid-string, there will be non-zero bytes after it.
	or any(c != 0 for buf[numChars:bufLen])
):
	text=ctypes.cast(buf,ctypes.c_wchar_p).value
else:
	encoding=locale.getlocale()[1]

Also worth considering a single char.

buf = b"\x41\x00\x00\x00" # = "A" followed by two null chars. This is valid as a wide string or ANSI
buf = b"\x00\x45\x00\x00" # ??

There are enough cases it would be worth extracting this and creating a unit test.

Although for perf considerations perhaps we just consider adjusting the byte alignment and checking two bytes

wCharAlignedIndex = numChars+(numChars%2)
	or not (buf[wCharAlignedIndex ] == 0 and buf[wCharAlignedIndex + 1] == 0) 

This comment has been minimized.

Copy link
@leonardder

leonardder Jun 21, 2019

Author Collaborator

We discussed this by voice, and I think I've been able to come up with working implementations that prove to be working in Wordpad and Notepad++. Note that in both situations, I also tried to fetch a range that was actually longer than the text in the control, so these cases are also covered.

While I really see your point about unit tests, note that we can't test this against neither Wordpad nor Notepad++. I think we should also split this logic out to a new function in textUtils that takes a buffer, a number of bytes and an encoding. The more I think about that, the more I consider it sensible, so that's what I'll do.

self.assertEqual(converter.strToWideOffsets(3, 3), (3, 3))

def test_surrogatePairs(self):
converter = WideStringOffsetConverter(text=u"\U0001f926\U0001f60a\U0001f44d") # 馃う馃槉馃憤

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 21, 2019

Contributor

Since there are several characters being combined in different ways, it might be clearer to specify them as constants and join them together in their various combinations for each test:

facePalm = u"\U0001f926"  # 馃う
smile = u"U0001f60a" # 馃槉
thumbsUp = u"\U0001f44d" # 馃憤

...
converter = WideStringOffsetConverter(text=u"".join([facePalm, smile, thumbsUp]))

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 28, 2019

Contributor

I think this would make things a lot more readable here. Rather than using join as I suggested before, just add them together, the difference in performance will be negligible. Unless there is something you are trying to convey by having the unicode values there? I get that there is some benefit to seeing the number of bytes for each character, perhaps there is another way to achieve that? At the moment its very hard to tell that it is the same value being used throughout.

facePalm = u"\U0001f926"  # 馃う
smile = u"U0001f60a" # 馃槉
thumbsUp = u"\U0001f44d" # 馃憤

...
converter = WideStringOffsetConverter(text=facePalm + smile + thumbsUp)
@leonardder leonardder force-pushed the BabbageCom:textUtils-py3 branch from 900e24f to 1653e1f Jun 21, 2019
@leonardder

This comment has been minimized.

Copy link
Collaborator Author

leonardder commented Jun 22, 2019

I found some issues which I have fixed in my last few commits.
I also discovered an issue that emoji aren't read when moving by characters in virtual buffers. This is a bit strange, since they end up in the text range just fine. I rather think that there might be something wrong with expanding to character, but I will investigate that later today or on Tuesday.

@josephsl

This comment has been minimized.

Copy link
Collaborator

josephsl commented Jun 23, 2019

Hi,

By the way, unittests are failing because textUtils module has no _WCHAR_ENCODING.

Thanks.

Leonard de Ruijter added 3 commits Jun 26, 2019
@leonardder

This comment has been minimized.

Copy link
Collaborator Author

leonardder commented Jun 28, 2019

@feerrenrut: I just looked into the change of textInfos.offsets.Offsets to a dataclass, and noticed that Offsets instances are now unhashable. Since Offsets is a mutable object, I tempted to leave this as is. #9757 defined a hash method for every object where we defined eq, but I think this might have been a bit too aggressive. It also defined hash on objects for which we do not rely on its hashability.

Copy link
Contributor

feerrenrut left a comment

You've made a lot of formatting changes in this PR. I've managed to review it, but these changes make it much harder to spot the changes, if these changes weren't intentional, please check your editor settings.

Generally, just a few small changes and I think this is good to go. Thanks for your work on this!

source/textInfos/offsets.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
self.assertEqual(converter.strToWideOffsets(3, 3), (3, 3))

def test_surrogatePairs(self):
converter = WideStringOffsetConverter(text=u"\U0001f926\U0001f60a\U0001f44d") # 馃う馃槉馃憤

This comment has been minimized.

Copy link
@feerrenrut

feerrenrut Jun 28, 2019

Contributor

I think this would make things a lot more readable here. Rather than using join as I suggested before, just add them together, the difference in performance will be negligible. Unless there is something you are trying to convey by having the unicode values there? I get that there is some benefit to seeing the number of bytes for each character, perhaps there is another way to achieve that? At the moment its very hard to tell that it is the same value being used throughout.

facePalm = u"\U0001f926"  # 馃う
smile = u"U0001f60a" # 馃槉
thumbsUp = u"\U0001f44d" # 馃憤

...
converter = WideStringOffsetConverter(text=facePalm + smile + thumbsUp)
Leonard de Ruijter
Update NVDA to Python 3 automation moved this from Needs review to Reviewer approved Jul 2, 2019
@feerrenrut feerrenrut merged commit 15d8374 into nvaccess:threshold_py3_staging Jul 2, 2019
1 check passed
1 check passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
Update NVDA to Python 3 automation moved this from Reviewer approved to Done Jul 2, 2019
@nvaccessAuto nvaccessAuto added this to the 2019.3 milestone Jul 2, 2019
@leonardder leonardder referenced this pull request Jul 18, 2019
107 of 107 tasks complete
josephsl added a commit to josephsl/nvda that referenced this pull request Jul 18, 2019
What's new entry suggested by Leonard de Ruijter (Babbage).
josephsl added a commit to josephsl/nvda that referenced this pull request Jul 27, 2019
What's new entry suggested by Leonard de Ruijter (Babbage).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
5 participants
You can鈥檛 perform that action at this time.