Skip to content

Commit 92f345e

Browse files
authored
Fix issues in uniscribe code (#18744)
Closes #18722 Reverts 021bcd8 Summary of the issue: The code that split a string at character boundaries discarded characters like \u0301 when such a character was the only character in a string. Description of user facing changes: When unicode normalization is on, navigating by character will again correctly announce characters like acute (\u0301) Description of developer facing changes: NVDAHelper local calculateCharacterBoundaries now needs an offsets array that is textLength + 1 in size, rather than only the text length. This is to store the end offset of the last character. This reverts 021bcd8, therefore we don't pass additional extraneous alphanumeric characters to uniscribe to make it happy. Description of development approach: While debugging this issue, I found several issues in NVDAHelper local textUtils _getLogAttrArray function that is the base of the several calculation functions: The buffer passed to ScriptItemize was too small when a string only contained one character. The buffer should have space for at least two SCRIPT_ITEM structures in size. This is my hypothesis about the necessity of 021bcd8, namely the addition of two alphanumeric characters passed to the uniscribe methods. While there are no exactly known str to reproduce the issue behind 021bcd8, I tried several combinations of shorter and longer strings without alpha numeric characters, and I couldn't reproduce any of the behavior described in it after my changes to the c++ code. Therefore I reverted 021bcd8, as there is enough time to test this in Alpha thoroughly. Calculation of character offsets would never set the end offset when requesting the offsets of the last character in the string, since the end offset calculation starts at offset + 1 and offset + 1 is equal to textLength. Failed calls of ScriptItemize and ScriptBreak were never logged. ScriptItemize definitely failed when passing a string with only one character in it because the expected buffer was to small.
1 parent 9f23ecf commit 92f345e

File tree

4 files changed

+28
-20
lines changed

4 files changed

+28
-20
lines changed

nvdaHelper/local/textUtils.cpp

Lines changed: 21 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -28,20 +28,34 @@ vector<SCRIPT_LOGATTR> _getLogAttrArray(const wchar_t* text, int textLength) {
2828
if (textLength <= 0 || !text) {
2929
return {};
3030
}
31-
vector<SCRIPT_ITEM> items(textLength + 1);
31+
// It is invalid to call ScriptItemize with a buffer to hold less than two SCRIPT_ITEM structures.
32+
auto cMaxItems = textLength + 1;
33+
// The buffer should be (cMaxItems + 1)
34+
vector<SCRIPT_ITEM> items(cMaxItems + 1);
3235
int numItems = 0;
33-
if (ScriptItemize(text, textLength, textLength, nullptr, nullptr, items.data(), &numItems) != S_OK || numItems == 0) {
36+
HRESULT hr;
37+
if ((hr = ScriptItemize(text, textLength, cMaxItems, nullptr, nullptr, items.data(), &numItems)) != S_OK || numItems == 0) {
38+
LOG_ERROR(L"ScriptItemize failed for text '" << text << L"'; hr=" << hr);
3439
return {};
3540
}
3641

3742
vector<SCRIPT_LOGATTR> logAttrArray(textLength);
38-
int nextICharPos = textLength;
43+
// The function always adds a terminal item to the item analysis array.
44+
// numItems contains the number of actually processed items excluding the terminating item.
45+
int nextICharPos = textLength; // should be equal to items[numItems].iCharPos
3946
for (int itemIndex = numItems - 1; itemIndex >= 0; --itemIndex) {
4047
int iCharPos = items[itemIndex].iCharPos;
4148
int iCharLength = nextICharPos - iCharPos;
42-
if (ScriptBreak(text + iCharPos, iCharLength, &(items[itemIndex].a), logAttrArray.data() + iCharPos) != S_OK) {
49+
if ((hr = ScriptBreak(text + iCharPos, iCharLength, &(items[itemIndex].a), logAttrArray.data() + iCharPos)) != S_OK) {
50+
LOG_ERROR(L"ScriptBreak failed for text '" << text << L"' at run " << itemIndex << L"; hr=" << hr);
4351
return {};
4452
}
53+
// Note, ideally we'd set nextICharPos to iCharPos, so that the
54+
// next iteration of the loop will only call ScriptBreak for the text that belongs to the current item.
55+
// Now that we don't do this, every call of ScriptBreak refills logAttrArray
56+
// for the characters after this item based on the SCRIPT_ANALYSIS for the current item,
57+
// effectively treating all the characters as belonging to the script at itemIndex = 0.
58+
// However, resetting nextICharPos causes word segmentation to differ from the one used in notepad.
4559
}
4660
return logAttrArray;
4761
}
@@ -55,8 +69,8 @@ bool calculateCharacterBoundaries(const wchar_t* text, int textLength, int* offs
5569
return false;
5670
}
5771
int count = 0;
58-
for (int i = 0; i < textLength; ++i) {
59-
if (logAttrArray[i].fCharStop) {
72+
for (int i = 0; i <= textLength; ++i) {
73+
if (i == textLength || logAttrArray[i].fCharStop) {
6074
offsets[count++] = i;
6175
}
6276
}
@@ -88,6 +102,7 @@ bool _calculateUniscribeOffsets(enum UNIT unit, wchar_t* text, int textLength, i
88102
break;
89103
}
90104
}
105+
*endOffset = textLength;
91106
for(int i=offset+1;i<textLength;++i) {
92107
if (logAttrArray[i].fCharStop) {
93108
*endOffset=i;

source/textInfos/offsets.py

Lines changed: 2 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -344,20 +344,15 @@ def _calculateUniscribeOffsets(
344344
raise NotImplementedError(f"Unit: {unit}")
345345
relStart = ctypes.c_int()
346346
relEnd = ctypes.c_int()
347-
# uniscribe does some strange things
348-
# when you give it a string with not more than two alphanumeric chars in a row.
349-
# Inject two alphanumeric characters at the end to fix this
350-
uniscribeLineText = lineText + "xx"
351347
# We can't rely on len(lineText) to calculate the length of the line.
352348
offsetConverter = textUtils.WideStringOffsetConverter(lineText)
353349
lineLength = offsetConverter.encodedStringLength
354350
if self.encoding != textUtils.WCHAR_ENCODING:
355351
# We need to convert the str based line offsets to wide string offsets.
356352
relOffset = offsetConverter.strToEncodedOffsets(relOffset, relOffset)[0]
357-
uniscribeLineLength = lineLength + 2
358353
if helperFunc(
359-
uniscribeLineText,
360-
uniscribeLineLength,
354+
lineText,
355+
lineLength,
361356
relOffset,
362357
ctypes.byref(relStart),
363358
ctypes.byref(relEnd),

source/textUtils/uniscribe.py

Lines changed: 3 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -20,14 +20,10 @@ def splitAtCharacterBoundaries(text: str) -> Generator[str, None, None]:
2020
raise RuntimeError("NVDAHelper not initialized")
2121
if not text:
2222
return
23-
# uniscribe does some strange things
24-
# when you give it a string with not more than two alphanumeric chars in a row.
25-
# Inject two alphanumeric characters at the end to fix this
26-
uniscribeText = text + "xx"
27-
buffer = ctypes.create_unicode_buffer(uniscribeText)
23+
buffer = ctypes.create_unicode_buffer(text)
2824
textLength = len(buffer) - 1 # Length without terminating NULL character
2925
offsetsCount = ctypes.c_int()
30-
offsets = (ctypes.c_int * textLength)()
26+
offsets = (ctypes.c_int * (textLength + 1))()
3127
if not NVDAHelper.localLib.calculateCharacterBoundaries(
3228
buffer,
3329
textLength,
@@ -36,7 +32,7 @@ def splitAtCharacterBoundaries(text: str) -> Generator[str, None, None]:
3632
):
3733
raise RuntimeError("NVDAHelper calculateCharacterBoundaries failed")
3834
# Get the end offsets of the characters we need.
39-
calculatedOffsets = offsets[1 : (offsetsCount.value - 1)]
35+
calculatedOffsets = offsets[1 : (offsetsCount.value + 1)]
4036
start = 0
4137
for end in calculatedOffsets:
4238
yield buffer[start:end]

user_docs/en/changes.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,8 @@ This can be enabled using the "Report when lists support multiple selection" set
2020

2121
### Bug Fixes
2222

23+
* When unicode normalization is enabled for speech, navigating by character will again correctly announce combining diacritic characters like acute ( &#x0301; ). (#18722, @LeonarddeR)
24+
2325
### Changes for Developers
2426

2527
Please refer to [the developer guide](https://download.nvaccess.org/documentation/developerGuide.html#API) for information on NVDA's API deprecation and removal process.

0 commit comments

Comments
 (0)