Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters #9545

Merged
merged 23 commits into from
Jul 2, 2019

Conversation

LeonarddeR
Copy link
Collaborator

@LeonarddeR LeonarddeR commented May 7, 2019

Link to issue number:

Closes #8981
Related to #8981
Follow up of #9044

Summary of the issue:

On Windows, wide characters are two bytes in size. This is also the case in python 2. This is best explained with an example:

>>> len(u"😉")
2

In Python 3 however, strings are saved using a variable byte size, based on the number of bytes that is needed to store the highest code point in the string. One index always corresponds with one code point.

A much more detailed description of the problem can be found in #8981.

Description of how this pull request fixes the issue:

This pr currently only introduces a new textUtils module that intends to mitigate issues introduced with the Python 3 transition. Most offset based TextInfos are based on a two bytes wide character string representation. For example, uniscribe uses 2 byte wide characters, and therefore 😉 is treated as two characters by uniscribe whereas Python 3 treats it as one.

This is where textUtils.WideStringOffsetConverter comes into view. This new class keeps the decoded and encoded form of a string in one object. This object can be used to convert string offsets between two implementations, namely the Python 3 one offset per code poitn implementation, and the Windows wide character implementation with surrogate offsets.

The initial version of this pr implemented a broader approach with a class that inherrited from str. After discussion with @feerrenrut and @michaelDCurran, it has been decided not to do this.
The initial pr also contained support for other encodings, particularly UTF-8. However, investigation revealed that this is far from ideal to store in one class. Furthermore, the only offsets implementation that uses UTF-8 for its internal encoding is Scintilla/Notepad++, and that implementation offers everything we need for text range ffetching.

Testing performed:

Tested using the python 3 interpreter, running this branch from source with the fixes from the try-py3_ignoreSomeIssues branch applied. Also, most importantly, we have working unitTests. note that both the test_textUtils and test_textInfos modules apply here.

Known issues with pull request:

None

Change log entry:

May be we need to file some information in the changes for developers section, particularly about the new encoding property on OffsetsTextInfo

@LeonarddeR LeonarddeR requested a review from feerrenrut May 7, 2019 17:06
@LeonarddeR
Copy link
Collaborator Author

@feerrenrut: I wrote this code around half a year ago to make it possible to use 32 bit wide characters with liblouis, see #9044. However as noted, it might also be of use when converting the textInfos module to Python 3, especially when it comes to offset based textInfos. I guess we can revisit this as soon as we are going to walk into this issue with textInfos, which is probably pretty soon.

@LeonarddeR
Copy link
Collaborator Author

@feerrenrut and I had a very quick discussion about this. As there is currently no python 3 code available to test this with, I'll try to give a quick example of where this code is supposed to assist in, using a list of steps. A good example is NVDAObjects.window.edit.EditTextInfo. Edit controls are using utf-16 for their internal text encoding.

  1. Open notepad
  2. Paste these two emoji: 😉😊
  3. Using the Python console, get the text, its length and several ranges in the text:
>>> info=nav.makeTextInfo("all")
>>> info.text
u'\U0001f609\U0001f60a'
>>> # SHouldn't the result of this be 4 instad of 5?
>>> info._getStoryLength()
5L
>>> info._startOffset
0
>>> info._endOffset
5L
>>> info._getTextRange(0,1)
u'\ud83d'
>>> info._getTextRange(0,2)
u'\U0001f609'
>>> info._getTextRange(2,3)
u'\ud83d'
>>> info._getTextRange(2,4)
u'\U0001f60a'

note that on EditTextInfo

  • _get_text is defined on its superclass and calls _getTextRange
  • _getTextRange uses string indexing on _getStoryText

From this follows that:

  • Getting the text for a particular range uses python string slicing
  • Python 3 only uses one index for every code point. The string containing the two emoji is therefore two characters long in Python 3
  • Selecting the first emoji (offset 0 and 1 in the text control) translates to getting self._getStoryText()[0:2] which in Python 3, results in the two emoji being returned
  • Selecting the second emoji (offset 2 and 3 in the text control) translates to getting self._getStoryText()[2:4] which in Python 3, results in an empty string being returned

@LeonarddeR LeonarddeR added the z Python 3 transition (archived) Python 3 transition label May 29, 2019
@michaelDCurran
Copy link
Member

This looks very thorough, but:
I'm rather worried about this class pretending that it is an str (I.e. it inherits from it) although the majority of its methods and indexing behave differently.
I am wondering if we can instead have a class that inherits from str, but includes just one extra member, which is this object with the magical indexing etc. Then our offset code should always check for this member on strs and use that for indexing instead of the main object.
Perhaps this would be way too much work and be more confusing than what you currently have. But I am very worried about these objects leaking out to other parts of NVDA or in deed other libries thinking that it is a normal str.
Perhaps I am too worried. Your thoughts?

@LeonarddeR
Copy link
Collaborator Author

@michaelDCurran commented on 31 May 2019, 09:01 CEST:

This looks very thorough, but:
I'm rather worried about this class pretending that it is an str (I.e. it inherits from it) although the majority of its methods and indexing behave differently.

I think the only difference with regular strings is related to indexing, slicing and length. Other methods are overriden to make sure that concatenation and multiplication use the same class.

I am wondering if we can instead have a class that inherits from str, but includes just one extra member, which is this object with the magical indexing etc. Then our offset code should always check for this member on strs and use that for indexing instead of the main object.

I agree that this makes the use of this class much more explicit, which is a good thing IMO.

Perhaps this would be way too much work and be more confusing than what you currently have.

I don't think it is that much work. Would be good to know what @feerrenrut thinks about this.

But I am very worried about these objects leaking out to other parts of NVDA or in deed other libries thinking that it is a normal str.

I think it is a valid concern.

Copy link
Contributor

@feerrenrut feerrenrut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I tend to agree that inheriting from str could cause some nasty problems. I think it would be better if our code explicitly opted to use this approach, and ensure that it can't leak elsewhere. I would also like to see unit tests for this code.

@LeonarddeR LeonarddeR changed the base branch from threshold to threshold_py3_staging June 11, 2019 07:16
@LeonarddeR
Copy link
Collaborator Author

I just pushed this to a prototype that actually works in notepad. I'd like to gather some feedback on the prototype code before continuing. The unit testing code is actually pretty outdate, so please ignore that for now.

@LeonarddeR LeonarddeR changed the title textUtils module to deal with different internal text encodings textUtils module to deal with offset differences between Python 3 strings and Windows wide character strings with surrogate characters Jun 15, 2019
Copy link
Contributor

@feerrenrut feerrenrut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm happy with this approach, I think it's worth putting some thought into how to clarify the intent of the usage of WideStringOffsetConvert. Perhaps write a few unit tests based on the current usage in textInfos/offsets.py and see if the interface can be clarified.

I'd like there to be unit tests for the additions to textUtils, and some more documentation.

source/NVDAObjects/window/edit.py Show resolved Hide resolved
source/NVDAObjects/window/edit.py Outdated Show resolved Hide resolved
else:
encoding=locale.getlocale()[1]
return buf.value.decode(encoding,errors="replace")
return buf.value.decode(self.encoding,errors="replace")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to be aware of and log errors in debug mode.

source/textUtils.py Outdated Show resolved Hide resolved
def strLength(self) -> int:
return len(self.decoded)

def strToWideOffsets(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some documentation for this? What does strict do? Perhaps rename it to raiseOnError

strEnd: int,
strict: bool =False
) -> Tuple[int, int]:
if strStart > self.strLength:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "greater or equal" (>=).

if strict:
raise IndexError("str start index out of range")
strStart = min(strStart, self.strLength)
if strEnd > self.strLength:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, perhaps should be "greater or equal". Although, if the end offset is excluded from the range this is ok... I'm not sure about it. Please add unit tests for these cases.

precedingStr: str = ""
strStart: int = 0
else:
precedingStr= self.encoded[:bytesStart].decode(self._encoding, errors="surrogatepass")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surrogatepass is handy, I also saw surrogateescape option which could be handy: https://docs.python.org/3/library/codecs.html#error-handlers

#Fall back to the older word offsets detection that only breaks on non alphanumeric
if self.encoding == self._UNISCRIBE_ENCODING:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about this, line 317 has if self.useUniscribe how does this block relate to that one?

if self.encoding == self._UNISCRIBE_ENCODING:
offsetConverter = textUtils.WideStringOffsetConverter(self._getStoryText())
strStart, strEnd = offsetConverter.wideToStrOffsets(offset, offset + 1)
return offsetConverter.strToWideOffsets(strStart, strEnd)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this do wideToStr, then strToWide?

@LeonarddeR LeonarddeR marked this pull request as ready for review June 19, 2019 14:11
@LeonarddeR
Copy link
Collaborator Author

@feerrenrut: I think I've applied all your requests.

@LeonarddeR
Copy link
Collaborator Author

@michaelDCurran: As UIA consoles now also rely on uniscribe, I will need to revisit this after a merge of master into threshold and then in threshold_py3_staging.

Co-Authored-By: Reef Turner <feerrenrut@users.noreply.github.com>
# ANSI strings are terminated with one NULL terminated character.
# As pointed out above, numChars contains the number of characters without the null terminator.
# Therefore, if the string we got isn't null terminated at numCHars or numChars + 1,
# the buffer definitely contains multibyte characters.
Copy link
Contributor

@feerrenrut feerrenrut Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is something missing in the explanation, and I dont think this actually does what we expect.

If I am understanding right, it relies on the value of numChars being half of the number of bytes in a wide character string and it being impossible to have a null character (represented by two null bytes) at the mid-point of the wide character string. If it is an ANSI string, there MUST be at least one null character at numChars+1 to indicate the end of the string. However note that it does not handle an odd number of chars well. For wide character strings it will be considering one byte from one character, and one byte from the next.

Also worth point out this relies on the memory being zeroed by the VirtualAllocEx function, otherwise:

numChars = 2 #  copy logical "hi" into buffer
# imaging a string copied to a buffer with "garbage" in it. The string copied is "hi"
# wide string memory "0h0i" copied over garbage
bW = "0h0i00garbage" # ok, bW[numChars] == 0 but bW[numChars+1] == 'i' so it is interpreted as unicode
# ANSI string memory "hi" copied over garbage
bA = "hi0garbage" # not ok, bW[numChars] == 0 but bW[numChars+1] == 'g' so it is interpreted as unicode

I think there is a broader issue here that odd numbers of characters are not handled correctly.

Consider where a wide string with numChars = 3:

buf = b"\x00\x04\x41\x00\x00\x01\x00\x00" # = "ЀAĀ" followed by two null chars.
buf[numChars] != 0 # False, it is zero
buf[numChars+1 != 0 # False, it is zero: buf interpreted as ANSI

Would it be clearer to do something like:

if (
	# The window is unicode, the text range contains multi byte characters.
	self.obj.isWindowUnicode
	# VirtualAllocEx zeroes out memory, so for an ANSI string there should ONLY be null bytes 
	# after the first numchars . 
	# For wide character strings, the number of bytes will be twice the number of characters, meaning 
	# numchars points to mid-string, there will be non-zero bytes after it.
	or any(c != 0 for buf[numChars:bufLen])
):
	text=ctypes.cast(buf,ctypes.c_wchar_p).value
else:
	encoding=locale.getlocale()[1]

Also worth considering a single char.

buf = b"\x41\x00\x00\x00" # = "A" followed by two null chars. This is valid as a wide string or ANSI
buf = b"\x00\x45\x00\x00" # ??

There are enough cases it would be worth extracting this and creating a unit test.

Although for perf considerations perhaps we just consider adjusting the byte alignment and checking two bytes

wCharAlignedIndex = numChars+(numChars%2)
	or not (buf[wCharAlignedIndex ] == 0 and buf[wCharAlignedIndex + 1] == 0) 

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this by voice, and I think I've been able to come up with working implementations that prove to be working in Wordpad and Notepad++. Note that in both situations, I also tried to fetch a range that was actually longer than the text in the control, so these cases are also covered.

While I really see your point about unit tests, note that we can't test this against neither Wordpad nor Notepad++. I think we should also split this logic out to a new function in textUtils that takes a buffer, a number of bytes and an encoding. The more I think about that, the more I consider it sensible, so that's what I'll do.

self.assertEqual(converter.strToWideOffsets(3, 3), (3, 3))

def test_surrogatePairs(self):
converter = WideStringOffsetConverter(text=u"\U0001f926\U0001f60a\U0001f44d") # 🤦😊👍
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are several characters being combined in different ways, it might be clearer to specify them as constants and join them together in their various combinations for each test:

facePalm = u"\U0001f926"  # 🤦
smile = u"U0001f60a" # 😊
thumbsUp = u"\U0001f44d" # 👍

...
converter = WideStringOffsetConverter(text=u"".join([facePalm, smile, thumbsUp]))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would make things a lot more readable here. Rather than using join as I suggested before, just add them together, the difference in performance will be negligible. Unless there is something you are trying to convey by having the unicode values there? I get that there is some benefit to seeing the number of bytes for each character, perhaps there is another way to achieve that? At the moment its very hard to tell that it is the same value being used throughout.

facePalm = u"\U0001f926"  # 🤦
smile = u"U0001f60a" # 😊
thumbsUp = u"\U0001f44d" # 👍

...
converter = WideStringOffsetConverter(text=facePalm + smile + thumbsUp)

@LeonarddeR
Copy link
Collaborator Author

I found some issues which I have fixed in my last few commits.
I also discovered an issue that emoji aren't read when moving by characters in virtual buffers. This is a bit strange, since they end up in the text range just fine. I rather think that there might be something wrong with expanding to character, but I will investigate that later today or on Tuesday.

@josephsl
Copy link
Collaborator

Hi,

By the way, unittests are failing because textUtils module has no _WCHAR_ENCODING.

Thanks.

@LeonarddeR
Copy link
Collaborator Author

@feerrenrut: I just looked into the change of textInfos.offsets.Offsets to a dataclass, and noticed that Offsets instances are now unhashable. Since Offsets is a mutable object, I tempted to leave this as is. #9757 defined a hash method for every object where we defined eq, but I think this might have been a bit too aggressive. It also defined hash on objects for which we do not rely on its hashability.

Copy link
Contributor

@feerrenrut feerrenrut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've made a lot of formatting changes in this PR. I've managed to review it, but these changes make it much harder to spot the changes, if these changes weren't intentional, please check your editor settings.

Generally, just a few small changes and I think this is good to go. Thanks for your work on this!

source/textInfos/offsets.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
source/textUtils.py Outdated Show resolved Hide resolved
self.assertEqual(converter.strToWideOffsets(3, 3), (3, 3))

def test_surrogatePairs(self):
converter = WideStringOffsetConverter(text=u"\U0001f926\U0001f60a\U0001f44d") # 🤦😊👍
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would make things a lot more readable here. Rather than using join as I suggested before, just add them together, the difference in performance will be negligible. Unless there is something you are trying to convey by having the unicode values there? I get that there is some benefit to seeing the number of bytes for each character, perhaps there is another way to achieve that? At the moment its very hard to tell that it is the same value being used throughout.

facePalm = u"\U0001f926"  # 🤦
smile = u"U0001f60a" # 😊
thumbsUp = u"\U0001f44d" # 👍

...
converter = WideStringOffsetConverter(text=facePalm + smile + thumbsUp)

@feerrenrut feerrenrut merged commit 15d8374 into nvaccess:threshold_py3_staging Jul 2, 2019
@nvaccessAuto nvaccessAuto added this to the 2019.3 milestone Jul 2, 2019
josephsl added a commit to josephsl/nvda that referenced this pull request Jul 18, 2019
What's new entry suggested by Leonard de Ruijter (Babbage).
josephsl added a commit to josephsl/nvda that referenced this pull request Jul 27, 2019
What's new entry suggested by Leonard de Ruijter (Babbage).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants