Camel case word splitting may fail for words containing non-ASCII characters #13174

CyrilleB79 · 2021-12-17T14:56:10Z

Issue: the rules in the builtin.dic file that split CamelCaseText only consider text written with ASCII characters.

Steps to reproduce:

Select the OneCore Microsoft Hortense French voice (or an IBMTTS French voice)
Read the following lines:
JEANÉdouard
JEAN Édouard

Actual behavior:

The two lines are not pronounced the same way. More specifically, "JEAN" is spelled out letter by letter, which is the normal behaviour of OneCore French voices when a mix of upper and lower case is encountered.

Expected behavior:

NVDA has rules in the builtin.dic file that should split each part of a camelCaseText. Thus the CamelCase text should be split by these rules, and the two lines should be pronounced the same way.

Additional examples

For English examples, you could use OneCore Zira voice and compare how the following lines are read:
DJÖtzi
DJ Ötzi

Or, even more obviously, still with Zira:
StÉtienne
St Étienne

Notes

Other examples can be reproduced with IBMTTS.
I did not manage to produce examples with eSpeak. Maybe it has its own internal CamelCase processing?
I opened this issue after having a look at the builtin.dic file. However, I am not impacted by it in my daily work, and the examples above are not real-life examples but were built on purpose to demonstrate the issue.
Maybe there are languages where this issue is more significant: Greek, Russian? If so, feel free to comment here with examples to illustrate the issue.

System configuration

NVDA installed/portable/running from source:

installed

NVDA version:

2021.3.1rc1

Windows version:

Windows 10 20H2 (64-bit) build 19042.1348

Name and version of other software in use when reproducing the issue:

N/A

Other information about your system:

Technical

To solve this issue, the rules should be modified to include non-ASCII characters in the uppercase/lowercase character classes.
Here is some information as a starting point: https://stackoverflow.com/questions/36187349/python-regex-for-unicode-capitalized-words
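
To illustrate the idea, here is a minimal Python sketch, not the actual content of builtin.dic. It assumes the current rules are roughly of the form "lowercase followed by uppercase" and "uppercase followed by an uppercase+lowercase pair", written with `[a-z]` and `[A-Z]`. The names `ascii_only_split` and `split_camel_case` are hypothetical. Since the standard `re` module has no `\p{Lu}` / `\p{Ll}` shorthand, the sketch builds the character classes from `unicodedata` instead.

```python
import re
import sys
import unicodedata

# Collect every code point in the Unicode categories Lu (uppercase letter)
# and Ll (lowercase letter). Letters have no special meaning inside a
# regex character class, so no escaping is needed.
_upper, _lower = [], []
for _cp in range(sys.maxunicode + 1):
    _cat = unicodedata.category(chr(_cp))
    if _cat == "Lu":
        _upper.append(chr(_cp))
    elif _cat == "Ll":
        _lower.append(chr(_cp))
UPPER = "".join(_upper)
LOWER = "".join(_lower)

# Assumed shape of the existing ASCII-only rules (an assumption, see above).
_ASCII_RULES = [
    re.compile(r"([a-z])([A-Z])"),
    re.compile(r"([A-Z])([A-Z][a-z])"),
]

# Same two rules, with Unicode-aware character classes.
_UNICODE_RULES = [
    re.compile(f"([{LOWER}])([{UPPER}])"),
    re.compile(f"([{UPPER}])([{UPPER}][{LOWER}])"),
]


def _apply(rules, text):
    # Insert a space between the two captured groups of each rule, in order.
    for rule in rules:
        text = rule.sub(r"\1 \2", text)
    return text


def ascii_only_split(text):
    return _apply(_ASCII_RULES, text)


def split_camel_case(text):
    return _apply(_UNICODE_RULES, text)


if __name__ == "__main__":
    for sample in ("JEANÉdouard", "DJÖtzi", "StÉtienne"):
        print(ascii_only_split(sample), "|", split_camel_case(sample))
```

With the examples from this issue, the ASCII-only variant leaves all three words unchanged, while the Unicode-aware variant yields "JEAN Édouard", "DJ Ötzi" and "St Étienne". Building the classes scans the whole Unicode range once, so in practice the resulting patterns would be computed once and cached, or the third-party `regex` module (which supports `\p{Lu}`) could be used instead.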
