Use uniscribe to calculate character offsets where allowed #10550
Conversation
… calculate the bounds for a character. This allows us to treat something like e-acute as one character.
…cterOffsets, allowing unit tests to pass again.
I recall trying this before for something, but then decided not to go this way because it was slower. It might help to do some performance tests, e.g. moving 4000 characters forward at once. I'm pretty sure the placeMarkers add-on uses that logic.
|
Some quick benchmarking:
With the World War I Wikipedia article loaded in Firefox,
and with the review cursor at the top of the document in browse mode:
```
# Run from the NVDA Python console, where "review" and "textInfos"
# are assumed to already be available.
import time
r = review.copy()
t = time.time()
r.move(textInfos.UNIT_CHARACTER, 4000)
time.time() - t  # elapsed seconds, printed by the console
```
Runs seem to take between 0.9 and 1.5 seconds both with and without this change. In other words, both the new and the old code are affected quite significantly by other things in the environment (which is not surprising, as this is a loop that runs 4000 times), and as far as I can tell the added usage of uniscribe does not slow things down.
It is also worth noting that _getCharacterOffsets always fetched the text for the current line; the only difference is the actual uniscribe call.
I accept that there is usage in the wild, such as the placeMarkers add-on, that calls move with a large number. However, with any other text API (UIA, other object models, etc.) this call would probably be much worse.
Still, if we do notice a performance decrease in real usage, we should of course take this into consideration.
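For what it's worth, environmental noise like this can be reduced by taking the minimum and mean over several timed runs. Below is a generic sketch that does not use any NVDA APIs; the lambda workload is a hypothetical stand-in for the `TextInfo.move` call from the snippet above:

```python
import time

def benchmark(operation, runs=5):
    """Time `operation` several times; return (fastest, mean) in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    return min(timings), sum(timings) / len(timings)

# Stand-in workload; in NVDA this would be something like
# lambda: review.copy().move(textInfos.UNIT_CHARACTER, 4000)
fastest, average = benchmark(lambda: sum(range(100000)))
print(f"fastest: {fastest:.4f}s, average: {average:.4f}s")
```

The minimum is usually the most stable number to compare across code versions, since background activity can only make a run slower, never faster.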
|
Thanks for these benchmarks; they show that this really is a valuable change after all. I will review the code later today.
|
I did a second test, once my machine had stopped installing a Windows
update in the background :p
This time comparing with and without the change, both doing a move of
20000 characters:
Without the change: 5.2 seconds
With the change: 5.7 seconds
That is an increase of roughly 1.1 times (about 10%).
So yes, if the move is very large (like 20000) then the difference is
noticeable.
|
… code in both calculateWordOffsets and calculateCharacterOffsets.
@leonardder I believe I have addressed all your review comments. When abstracting _calculateUniscribeOffsets in textUtils.cpp, I still copied the two basic for loops that walk the offsets; dynamically switching between fWordStop and fCharStop based on the unit would otherwise have made the code very hard to read.
|
I'm afraid that this pull request introduces off-by-n errors in braille.TextInfoRegion.getTextInfoForBraillePos. However, this is pretty difficult to avoid. We should somehow be able to decouple braille positions from characters.
|
This is of course not the only place where characterOffsets is overridden by an API that can return offset bounds wider than 1. Should I revert this for now, though, since we don't want to make this worse?
|
No, no need to revert this. At least for uniscribe, we can disable it at the TextInfo object level before moving by character.
|
Link to issue number:
None.
Summary of the issue:
When moving by character in an NVDA virtualBuffer, each and every Unicode code point is treated as its own character, even if it is visually combined with another code point to create one composite character. Examples are:
Similarly, when moving by character in Notepad with the arrow keys, NVDA only reads the first code point of composite characters.
Description of how this pull request fixes the issue:
Just like how we use the Windows uniscribe library to calculate word offsets in some places, this change uses it to also calculate character offsets.
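To illustrate the underlying idea, here is a simplified Python sketch of character-offset calculation. This is not the actual uniscribe call; it only folds combining marks into the preceding base character (a rough approximation of uniscribe's fCharStop boundaries), and `character_offsets` is a hypothetical helper, not an NVDA function:

```python
import unicodedata

def character_offsets(text, offset):
    """Return (start, end) offsets of the composite character containing
    `offset`, treating combining marks as part of the preceding base
    character (roughly what uniscribe's fCharStop flags provide)."""
    # A position is a character boundary if the code point there
    # is not a combining mark.
    is_stop = [unicodedata.combining(ch) == 0 for ch in text]
    start = offset
    while start > 0 and not is_stop[start]:
        start -= 1
    end = offset + 1
    while end < len(text) and not is_stop[end]:
        end += 1
    return start, end

# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as e-acute.
print(character_offsets("e\u0301clair", 0))  # (0, 2): one character, two code points
```

With offsets like these, moving by one character skips over the whole (start, end) span rather than landing inside the combining sequence.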
This involved:
Testing performed:
Known issues with pull request:
Although NVDA now matches the behaviour of Notepad and other standard edit controls, which treat acutes, variation selectors and some other modifiers as part of the previous symbol, complex emoji sequences that join multiple Unicode characters with the zero width joiner (U+200D) are still not treated as one single composite character. However, if we did treat them that way, we would differ from the behaviour of Windows' own standard edit controls.
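The distinction can be checked with plain Python string lengths (the family emoji below is three person emoji joined by U+200D):

```python
# e-acute as base letter plus combining mark: two code points that this
# pull request now treats as one character.
e_acute = "e\u0301"
print(len(e_acute))  # 2

# Family emoji: man, woman and girl joined by ZERO WIDTH JOINER (U+200D).
# These five code points are still not treated as one composite character.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))  # 5
assert "\u200D" in family
```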
Change log entry:
Bug fixes: