Use Unicode CLDR to create speech symbol dictionaries with emojis #8758

LeonarddeR · 2018-09-18T12:48:21Z

Link to issue number:

Summary of the issue:

NVDA has no built-in mechanism to read emoji descriptions. Currently, it relies on the dictionaries that ship with speech synthesizers, such as Windows OneCore and eSpeak. However, synthesizers like Vocalizer do not have this data and therefore can't speak emojis.

Description of how this pull request fixes the issue:

This PR includes the emoji descriptions from the Unicode Common Locale Data Repository (CLDR). The emoji descriptions are built with NVDA and added to locale-specific speech symbol dictionaries using scons, making it very easy to update the emoji sources whenever the CLDR is updated. For this, we use a nice GitHub repository hosted and maintained by @fujiwarat.
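As a rough illustration of the build step described above (not the actual scons code in this PR), the sketch below turns CLDR-style annotation XML into tab-separated symbol lines. The function name `cldr_to_symbol_lines`, the inline sample data and the exact output layout are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the shape of a CLDR annotations file (assumed input).
SAMPLE_XML = """<ldml>
  <annotations>
    <annotation cp="😋">delicious | face | savouring</annotation>
    <annotation cp="😋" type="tts">face savoring food</annotation>
    <annotation cp="®">R | registered</annotation>
    <annotation cp="®" type="tts">registered</annotation>
  </annotations>
</ldml>"""


def cldr_to_symbol_lines(xml_text, level="some"):
    """Return one tab-separated line per character that has a tts name.

    The "symbol<TAB>description<TAB>level" layout is a hypothetical format
    chosen for this sketch.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter("annotation"):
        # Entries with type="tts" carry the single spoken description;
        # the untyped entries are search keywords and are skipped here.
        if node.get("type") != "tts":
            continue
        lines.append("\t".join((node.get("cp"), node.text.strip(), level)))
    return lines


for line in cldr_to_symbol_lines(SAMPLE_XML):
    print(line)
```

Because the annotation source is just data, regenerating the dictionaries after a CLDR update would only require rerunning a step like this.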

To NVDA's speech settings, I added the option "Include Unicode Consortium data when processing characters and symbols" which makes it easy to disable the inclusion of these databases.

I also had to add some functionality to the config manager to allow dumping the configuration in Python dictionary format (i.e. a deep copy, as configobj does). These changes include a new prevConf argument that is passed to config.post_configProfileSwitch, allowing handlers to compare the current configuration against the previous one and decide what to do. This is used to clear the caches of the character processing framework.

Testing performed

Made emojis read within the Windows 10 emoji panel. When switching from Dutch to English, the emoji descriptions were accurately switched from Dutch to English.
Switched CLDR data on and off in NVDA's speech settings, the data was accurately included and excluded when processing symbols.
Switched CLDR data on and off by means of a configuration profile.

Known issues

I agree that the wording of the new GUI option is somewhat vague, but this is because the annotations database doesn't only contain emoji, but it also contains descriptions for characters like ® (trademark). Therefore, talking about emoji doesn't cover the whole spectrum of what this database adds to the list of speech symbols.
Other languages than Dutch and English have to be tested for accuracy. I'm quite confident that the Unicode data has decent quality, but you'll never know.
The user guide will be updated once we agree on the wording of the GUI option.
There is currently no braille support, as basically, there is no braille symbol processing mechanism yet. It felt a bit too far-fetched to implement that as part of this pr.

Change log entries

New features
- NVDA is now able to read descriptions for emoji as well as other characters that are part of the Unicode Common Locale Data Repository. (nvda can read emojies #6523)
Changes for developers
- The config.post_configProfileSwitch action now takes the optional prevConf keyword argument, allowing handlers to take action based on differences between configuration before and after the profile switch.

LeonarddeR · 2018-09-18T12:48:47Z

@derekriemer: Also requested review from you because I recall you're pretty good with scons.

LeonarddeR · 2018-09-18T12:57:16Z

@jcsteh: I did not ask review from you specifically, however since the implementation is roughly based on your research and proposal, you might be interested to have a look.

jcsteh · 2018-09-18T18:14:16Z

I took a quick look at the code. It looks good to me! 👍 That GUI option name really worries me, though. I understand the reasoning behind it, but almost no user is going to understand what it means. I'd almost rather call it "Speak Emoji" and just document that it also includes some other symbols, even despite the fact that this is a bit misleading. Do we know what the criteria is for inclusion in this annotations database? For example, why trademark and not mathematical symbols? That might be worth looking into. Alternatively, we could filter out anything but Emoji when generating the dictionary in scons (by testing for specific Unicode ranges). It does seem sad to miss out on translated names for other symbols, though. Finally, we could consider naming the option something like "Include Unicode Consortium data (including Emoji) when processing characters and symbols". That's ugly, but at least it gives the user some idea what it's talking about.
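The range-based filtering alternative mentioned in this comment could look roughly like the sketch below. The ranges listed are a small, deliberately incomplete selection of Unicode blocks, and `EMOJI_RANGES` and `is_emoji` are hypothetical names; a real filter would need a properly maintained list.

```python
# Illustrative, incomplete selection of Unicode blocks that contain emoji.
EMOJI_RANGES = (
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
    (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
)


def is_emoji(symbol):
    """Check whether the first code point of symbol falls in one of the ranges."""
    code_point = ord(symbol[0])
    return any(low <= code_point <= high for low, high in EMOJI_RANGES)


print(is_emoji("🌮"), is_emoji("®"))
```

With a filter like this, the registered sign would be dropped from the generated dictionary, which is exactly the loss of translated names for other symbols that the comment regrets.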

michaelDCurran

Wonderful. And even working much more accurately than eSpeak's own support at the moment.

michaelDCurran · 2018-09-18T20:29:50Z

sconstruct

@@ -174,7 +174,7 @@ env64['projectResFile'] = resFile
 #Fill sourceDir with anything provided for it by miscDeps
 env.recursiveCopy(sourceDir,Dir('miscdeps/source'))

-env.SConscript('source/comInterfaces_sconscript',exports=['env'])
+env.SConscript('source/comInterfaces_sconscript',exports=['env', 'sourceDir'])


Was this change meant to be here?

josephsl · 2018-09-18T22:42:22Z

Hi, one thing to keep in mind: if we move to Python 3, unicodedata may provide an interesting solution, as Python 3.7 uses Unicode 11.0 which does include emoji chars. Thanks.


jcsteh · 2018-09-19T00:40:27Z

We should consider removing the eSpeak Emoji data (if that's possible) in favour of this solution, rather than doubling up on the data. Python 3 unicodedata is only useful if it includes CLDR annotation data for multiple languages. I'm not sure that it does.

LeonarddeR · 2018-09-19T04:42:17Z

@michaelDCurran, what do you think about @jcsteh's concerns regarding the wording of the new GUI option?

michaelDCurran · 2018-09-19T04:49:12Z

@LeonarddeR: I'm really not too bothered, but I'd be fine with @jcsteh's final suggestion.

LeonarddeR · 2018-09-19T07:50:26Z

The espeak emoji dictionaries are now deleted, basically using a copy of #7810.

LeonarddeR · 2018-09-19T11:48:07Z

> Python 3 unicodedata is only useful if it includes CLDR annotation data for multiple languages. I'm not sure that it does.

I can't find anything about this in the python docs, so I'm afraid that it does not.
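For reference, a quick check of what Python 3's unicodedata module returns. `unicodedata.name` takes no locale argument and yields the formal English character name from the Unicode Character Database, which is not the same thing as a CLDR annotation.

```python
import unicodedata

# Formal Unicode character name for the taco emoji (U+1F32E).
taco_name = unicodedata.name("\U0001F32E")
print(taco_name)
```

The result is an upper-case English name only, so it would not cover the Dutch descriptions tested in this PR.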

jage9 · 2018-09-25T18:45:28Z

This works great when reading text. It doesn't currently function when arrowing by character.

Open Notepad
Insert an Emoji such as 🌮
Arrow over the emoji by character.

You can also arrow over the taco emoji above to observe the same results. Using the up and down arrow keys reads it correctly.

I'm not sure if some emojis are represented by two characters, especially in browse mode. I can create a new issue for this if desired.
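On the question of two characters: code points above U+FFFF, which includes the taco emoji, are encoded as two UTF-16 code units (a surrogate pair). Whether that is the cause of the behaviour described here is an assumption, but the encoding itself can be shown with a short sketch; `utf16_units` is a hypothetical helper name.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# The taco emoji takes two code units; a plain letter takes one.
print([hex(unit) for unit in utf16_units("🌮")])
print(len(utf16_units("a")))
```

Code that moves by a single UTF-16 unit would land on one half of the pair, which has no description of its own.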

LeonarddeR · 2018-09-25T19:15:31Z

Feel free to create a new issue for this please.

tmthywynn8 · 2018-11-20T17:59:31Z

Is there an explanation for the rationale behind choosing some for the Level field rather than none? After all, if this is able to be toggled anyway, you'd have to know to change your punctuation level to some if you manually enabled it dependent on situation, i.e., through a configuration profile.

LeonarddeR · 2018-11-20T18:49:24Z

Yes. The "some" level is NVDA's default level, and we want emoji to be spoken at that level. On the other hand, emoji are symbols, and if a user chooses not to hear symbols at all, we want emoji to be left unspoken as well. What would you suggest otherwise?
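The level rule discussed here can be sketched as a simple threshold comparison. The numeric values in `LEVELS` are assumptions for illustration (only their ordering matters), and `should_speak` is a hypothetical name.

```python
# Assumed numeric ordering of the symbol levels; only the order matters here.
LEVELS = {"none": 0, "some": 100, "most": 200, "all": 300}


def should_speak(symbol_level, user_level):
    """A symbol is spoken when the user's level is at least the symbol's level."""
    return LEVELS[symbol_level] <= LEVELS[user_level]


print(should_speak("some", "some"))  # emoji at "some", default user level
print(should_speak("some", "none"))  # emoji at "some", user level "none"
print(should_speak("none", "none"))  # a symbol defined at "none"
```

Under this rule, defining emoji at "none" instead would make them speak at every punctuation level, which is the alternative raised in the question above.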

tmthywynn8 · 2018-11-21T18:14:11Z

I'm not sure what the solution should be, though I understand the rationale now. It just didn't seem intuitive, as if I had the box checked, then I'd expect the Emojis to read across the board regardless of punctuation level; if I didn't want it read, instead of changing punctuation to a lower level, I'd just uncheck the box and I'd be set. Granted, I'm not exactly the common case here, as my punctuation level is at none for most scenarios, and the box is unchecked until I encounter an Emoji, which, depending on synthesizer, can either alert you of its existence or not. If the OneCore voices are the default though, then having the box unchecked will have the Emoji read, if known, regardless of punctuation level anyway so there's a consistency issue to consider.

LeonarddeR · 2018-11-22T07:52:19Z

> If the OneCore voices are the default though, then having the box unchecked will have the Emoji read, if known, regardless of punctuation level anyway so there's a consistency issue to consider.

This is certainly a valid point!

tmthywynn8 · 2018-11-29T00:48:15Z

Looks like we've an unexpected error if you change the Unicode Consortium data via the Punctuation/symbol pronunciation dialog.

Steps to reproduce:

Under the Speech category of NVDA's preferences, check "Include Unicode Consortium data (including emoji) when processing characters and symbols".
Make a change to one of the Emojis, e.g., change the level of the "face savoring food" Emoji (😋) to none using the Punctuation/symbol pronunciation dialog.
Under the Speech category of NVDA's preferences, uncheck "Include Unicode Consortium data (including emoji) when processing characters and symbols".
Find the Emoji somewhere, e.g., through the cldr.dic file, and read it.

Expected: The synthesizer processes the line pre-Unicode Consortium data functionality.

Actual: An error noise, synthesizer staying silent, with the following in the log:

Replacement not defined in locale en for symbol: 😋
ERROR - eventHandler.executeEvent (HH:mm:ss.SSS"):
error executing event: gainFocus on <baseObject.Dynamic_IAccessibleEditWindowNVDAObject object at 0x05078AB0> with extra args of {}
Traceback (most recent call last):
  File "eventHandler.pyo", line 155, in executeEvent
  File "eventHandler.pyo", line 92, in __init__
  File "eventHandler.pyo", line 100, in next
  File "NVDAObjects\behaviors.pyo", line 169, in event_gainFocus
  File "NVDAObjects\__init__.pyo", line 961, in event_gainFocus
  File "NVDAObjects\__init__.pyo", line 849, in reportFocus
  File "speech.pyo", line 418, in speakObject
  File "speech.pyo", line 1029, in speakTextInfo
  File "speech.pyo", line 568, in speak
  File "speech.pyo", line 80, in processText
  File "characterProcessing.pyo", line 635, in processSpeechSymbols
  File "characterProcessing.pyo", line 550, in processText
  File "characterProcessing.pyo", line 534, in _regexpRepl
KeyError: u'\U0001f60b'

Possible solutions:

Don't allow users to change the symbol definitions for the Unicode Consortium data.
Suppress the error somehow, passing through the original text with the symbol not defined.
Have a separate cldr-en.dic (or whatever locale) instead of putting the definition in symbols-en.dic. That way, you can check if NVDA is using the Unicode Consortium data or not, and if it is, then load in the custom cldr entries, otherwise don't load the entries from that file.
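The second option, passing the original text through when a symbol has no replacement, might look like the following sketch. This is not the code from characterProcessing or from any fix; `replace_symbols` and its parameters are hypothetical.

```python
import re


def replace_symbols(text, replacements, known_symbols):
    """Replace known symbols, leaving a symbol untouched if it has no replacement."""
    if not known_symbols:
        return text
    pattern = re.compile("|".join(re.escape(symbol) for symbol in known_symbols))

    def repl(match):
        symbol = match.group(0)
        # dict.get with the symbol as default avoids the KeyError seen in the
        # traceback above: an undefined replacement passes the symbol through.
        return replacements.get(symbol, symbol)

    return pattern.sub(repl, text)


# The symbol is known (a user entry exists) but no replacement is defined.
print(replace_symbols("yum 😋", {}, ["😋"]))
```

The synthesizer then receives the raw character and can handle it as it did before the Unicode Consortium data existed.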

I don't know how complicated any of the proposed solutions would be as I'm not a coder, but hopefully a solution can be devised before 2018.4 is released.

LeonarddeR · 2018-11-29T07:35:35Z

> Looks like we've an unexpected error if you change the Unicode Consortium data via the Punctuation/symbol pronunciation dialog.

There is a pr for this: #8932

Leonard de Ruijter added 11 commits September 17, 2018 17:02

Add emoji dictionaries as a git subnmodule

65f7919

Scons implementation

ecc5a40

Load emoji dictionaries in characterProcessing

7e835c8

Add speak emoji descriptions entry to settings

726c868

Support multiple emoji sources per locale

aef6575

Update emojiDict_sconscript

51fe9df

Rename emoji dictionaries to cldr

72f6c22

Cache locales for which no symmbols are available

920e9d8

Clear CLDR data when saving the option is changed

cb260d0

Handle config profile switches properly for CLDR data

61493d3

Update copyright

6f590d1

LeonarddeR added the component/speech label Sep 18, 2018

LeonarddeR requested review from derekriemer and michaelDCurran September 18, 2018 12:48

Revert unnecessary change to sconstruct

eff6ecd

michaelDCurran previously approved these changes Sep 18, 2018

User guide

b26ca56

LeonarddeR dismissed michaelDCurran’s stale review via b26ca56 September 19, 2018 05:22

Remove all eSpeak emoji dictsource files before compiling eSpeak

5598427

LeonarddeR force-pushed the i6523 branch from 2d0ba69 to 5598427 Compare September 19, 2018 08:41

michaelDCurran previously approved these changes Sep 25, 2018

michaelDCurran added 2 commits September 25, 2018 11:12

Merge branch 'master' into HEAD

fdc8e07

Update what's new

0a8f4ad

michaelDCurran dismissed their stale review via 0a8f4ad September 25, 2018 01:16

michaelDCurran merged commit 21065fa into nvaccess:master Sep 25, 2018

nvaccessAuto added this to the 2018.4 milestone Sep 25, 2018

jage9 mentioned this pull request Sep 25, 2018

Emojis Do Not Speak when Arrowing by Character #8782

Closed

michaelDCurran mentioned this pull request Sep 26, 2018

Gecko VBufBackend: greatly speed up subtree re-renders when part of a document has changed #8678

Merged

LeonarddeR mentioned this pull request Nov 29, 2018

Make sure that user symbols without a replacement are properly ignored #8932

Merged

DrSooom mentioned this pull request Feb 18, 2019

Cannot read Braille unicode Characters. #6341

Closed

feerrenrut referenced this pull request in nvaccess/nvda-cldr Sep 14, 2022

Produce locale folder

a0489e9
