Camel case word splitting may fail for words containing non-ASCII characters #13174

Open
CyrilleB79 opened this issue Dec 17, 2021 · 0 comments

Comments

@CyrilleB79
Collaborator

Issue: the rules in the builtin.dic file that split CamelCaseText only consider text written with ASCII characters.

Steps to reproduce:

  • Select the OneCore Microsoft Hortense French voice (or an IBMTTS French voice)
  • Read the following lines:
    JEANÉdouard
    JEAN Édouard

Actual behavior:

The two lines are not pronounced the same way. More specifically, "JEAN" is spelled out, which is the normal behaviour of OneCore French voices when a mix of upper and lower case is encountered.

Expected behavior:

NVDA has rules in the builtin.dic file that should split each part of a camelCaseText. Thus the CamelCase text should be split by these rules, and the two lines should be pronounced the same way.

Additional examples

For English examples, you could use OneCore Zira voice and compare how the following lines are read:
DJÖtzi
DJ Ötzi

Or, even more obviously, still with Zira:
StÉtienne
St Étienne

Notes

  • Other examples can be reproduced with IBMTTS.
  • I did not succeed in producing examples with eSpeak. Maybe it has an internal CamelCase processing?
  • I opened this issue after having a look at the builtin.dic file. However, I am not impacted by it in my daily work, and the examples above are not real-life examples but were built on purpose to demonstrate the issue.
  • Maybe there are languages where this issue is more significant: Greek, Russian? If yes, feel free to comment here with examples to illustrate the issue.

System configuration

NVDA installed/portable/running from source:

installed

NVDA version:

2021.3.1rc1

Windows version:

Windows 10 20H2 (64-bit) build 19042.1348

Name and version of other software in use when reproducing the issue:

N/A

Other information about your system:

Other questions

Does the issue still occur after restarting your computer?

Yes

Have you tried any other versions of NVDA? If so, please report their behaviors.

No, but it should be the same.

If NVDA add-ons are disabled, is your problem still occurring?

Yes

Does the issue still occur after you run the COM Registration Fixing Tool in NVDA's tools menu?

Did not test. But should not have an impact.
In any case the rc1 release has been installed recently, so the tool has been run during the installation recently.

Technical

To solve this issue, the rules should be modified to include non-ASCII characters in the uppercase/lowercase character classes.
Here is some information as a starting point: https://stackoverflow.com/questions/36187349/python-regex-for-unicode-capitalized-words
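
To illustrate the idea, here is a minimal Python sketch. It is not the actual content of builtin.dic: the two ASCII patterns are only assumed to resemble the existing rules, and the names `split_camel`, `ASCII_RULES` and `UNICODE_RULES` are hypothetical. Since Python's standard `re` module has no `\p{Lu}` / `\p{Ll}` support, the sketch builds the uppercase and lowercase character classes from `unicodedata`.

```python
import re
import sys
import unicodedata

# Collect every code point in the Unicode categories Lu (uppercase letter)
# and Ll (lowercase letter) in a single pass over the code space.
_upper, _lower = [], []
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    cat = unicodedata.category(ch)
    if cat == "Lu":
        _upper.append(ch)
    elif cat == "Ll":
        _lower.append(ch)
UPPER = "".join(_upper)
LOWER = "".join(_lower)


def _make_rules(lower, upper):
    """Build the two splitting patterns from the given character class bodies."""
    return [
        # lowercase followed by uppercase: "camelCase" -> "camel Case"
        re.compile(f"([{lower}])([{upper}])"),
        # uppercase followed by a capitalized word: "DJOtzi" -> "DJ Otzi"
        re.compile(f"([{upper}])([{upper}][{lower}])"),
    ]


# ASSUMPTION: ASCII-only rules, similar in spirit to the current behaviour.
ASCII_RULES = _make_rules("a-z", "A-Z")
# Proposed variant: same rules, with Unicode-aware character classes.
UNICODE_RULES = _make_rules(LOWER, UPPER)


def split_camel(text, rules=UNICODE_RULES):
    """Insert a space at each camel case boundary matched by the rules."""
    for rule in rules:
        text = rule.sub(r"\1 \2", text)
    return text


for sample in ("JEANÉdouard", "DJÖtzi", "StÉtienne"):
    print(sample, "|", split_camel(sample, ASCII_RULES), "|", split_camel(sample))
# JEANÉdouard | JEANÉdouard | JEAN Édouard
# DJÖtzi | DJÖtzi | DJ Ötzi
# StÉtienne | StÉtienne | St Étienne
```

The generated classes contain a few thousand characters each, so pasting them into a dictionary file may not be practical. Narrower ranges covering the scripts of interest, or a regex engine that supports `\p{Lu}` (such as the third-party `regex` module), may be better options; whether either fits NVDA's dictionary processing is not verified here.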
