New text symbol processing framework #332

Closed
nvaccessAuto opened this Issue Jan 1, 2010 · 30 comments

2 participants

@nvaccessAuto

Reported by aleksey_s on 2009-06-15 19:20

Rationale

Currently, punctuation processing in NVDA is very simple and quite limited. The limitations of the current system are:

  1. No punctuation levels. Users need the ability to change the punctuation level to suit the task at hand (e.g. "none" when reading books, "some" for things like smilies when browsing, chatting and reading mail, "all" for programming).
  2. Changing the label for a punctuation mark or adding a new one is difficult. Currently, to add a new punctuation mark or edit the label of an existing one, the user needs to edit the NVDA sources or the language file.
  3. The set of punctuation marks and their labels is hard-coded and there is no way to switch between profiles. This would be useful for some languages (especially Russian and Ukrainian) whose default punctuation labels are too long. That is no problem for an inexperienced user who rarely deals with punctuation, but it is very uncomfortable for people whose work is based on textual data (e.g. programming). There could be two profiles for this reason: "default" and "brief".
  4. Switching the language of punctuation labels isn't possible. The user can easily switch synthesizers for different languages, but reading is at least uncomfortable when text and punctuation are in different languages, and even unacceptable when the languages use different alphabets, so that the synthesizer doesn't know how to handle the text NVDA replaces punctuation marks with. This one is related to runtime language switching in general, which can be a different issue.

So a new punctuation handling system needs to be developed to eliminate the existing barriers.

expected features

  • gui to manage user-defined punctuation
  • punctuation levels and gui to manage them
  • in addition to the default labels, "brief" punctuation labels, at least for Russian

implementation

template
The current speechDictionaries system could be extended into a textProcessing module which will handle speech dicts, punctuation, indentation, char repetition and other stuff
...more detailed info about implementation

need to answer

  1. p 2.
    • Does the user really need to edit the punctuation labels bundled with NVDA by default?
    • If not, then the user may only be able to add custom punctuation marks to process, not edit existing ones, which simplifies things a bit
  2. p 4.
    • Is it required for all languages?
    • Are more than two profiles required?
    • If not, we can generalize this for all languages and just have "default" and "brief" label for all punctuation marks.
  3. what about translators?
    • we don't want to change things for them where possible
    • so having punctuation labels in an external file (as with buildin.dic) is a bad idea
    • at minimum, "brief" labels will be added to translation queue
    • how to handle two punctuation labels within gettext?
    • a sort of _("point") _("short_point")
    • then need translation also for english
    • i feel gettext is not for such things

Blocking #43, #271, #454, #919

@nvaccessAuto

Comment 2 by jteh on 2009-07-24 05:21
Despite my original thoughts about merging speech dicts and punctuation handling, I think they will be separate after all. Therefore, I'm narrowing the scope of this ticket just to punctuation handling.

Another question: how do we deal with punctuation symbols which need to be preserved in the output (e.g. "dot.", "comma,") to provide proper intonation? There need to be special rules for this; e.g. "..." should not become "dot. dot. dot.", otherwise you will hear pauses after each dot. The current code knows how to handle this for English, but I know that some other languages have their own punctuation systems. We can have a field which specifies that the symbol should be preserved in the output, but we also need to somehow cover the exceptions.
Changes:
Changed title from "New text processing system design document proposal" to "User configurable punctuation labels and/or profiles"

@nvaccessAuto

Comment 4 by aleksey_s on 2010-08-01 21:27
Finally, I think that profiles are redundant. They add problems with translation, and I can't think of a case where I need to switch quickly between brief and default. I need brief punctuation or no punctuation at all. So, my proposal is to allow users to edit punctuation labels and save only the changed labels to the user config. Also, if the user changes the level of an existing mark, that will be saved too. That means one can customize labels and levels once and keep the settings even if we add a new punctuation mark to the NVDA source.
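
A minimal sketch of the "save only changed labels" idea, in Python. The symbol table, the `(label, level)` tuple layout and the function names `changed_only` and `effective` are all hypothetical; nothing here reflects actual NVDA code.

```python
# Hypothetical built-in defaults: symbol -> (label, level).
DEFAULTS = {
    "(": ("left paren", "most"),
    ")": ("right paren", "most"),
    "!": ("bang", "all"),
}


def changed_only(defaults, edited):
    """Return only the entries that differ from the defaults."""
    return {sym: val for sym, val in edited.items() if defaults.get(sym) != val}


def effective(defaults, overrides):
    """Overlay saved user overrides on whatever the current defaults are."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# The user shortens one label; only that entry would go to the user config.
edited = dict(DEFAULTS)
edited["("] = ("le par", "most")
overrides = changed_only(DEFAULTS, edited)

# A later release adds a new symbol; the saved override still applies.
newer_defaults = dict(DEFAULTS)
newer_defaults["©"] = ("copyright", "some")
symbols = effective(newer_defaults, overrides)
```

Because only the difference is stored, new default symbols show up automatically while the user's edits are kept.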

Problems that come to mind:

  • How to handle custom punctuation and languages? Do we need different punctuation config for different languages/synths/voices?
  • How to handle the damned exceptions? I personally don't need different punctuation labels/levels for different synths, but I feel it is important to have different labels for different languages. If there are no other examples like "..." then we may hardcode it and forget about it. After all, special cases aren't special enough to break the rules, are they?
@nvaccessAuto

Comment 5 by aleksey_s on 2010-08-02 07:27
A few more questions to think about:

  • Do we consider symbols like © as punctuation characters? Right now, a lot of unicode characters are left to the synth, so behavior is different with each synth.
  • What about removing from the text characters that are below the current punctuation level and for which the "remain in text for proper intonation" flag is not set? This way we guarantee that the synth will not speak them. Currently we can't control the synth punctuation in any other way. Such removal will be useful, say, for svox pico, as it seems to handle a lot of punctuation in its own manner, which many users find too verbose. If anybody can give an example where removal of punctuation that is below the current level breaks anything, I'll perhaps make a checkbox for it.
  • Is punctuation level really related to the voice, as it is currently? I feel that punctuation level and current voice are unrelated. I switch punctuation depending on the task, not on the voice.
  • How many punctuation levels do we want? I think none, some, most and all will fit everyone. Or do you think none, some and all will be enough?
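
The level and removal questions above can be sketched as follows. This is only an illustration under assumptions: the `Level` enum, the `SYMBOLS` table with its `(label, level, keep_for_intonation)` layout, and `process` are hypothetical names, and replacing a removed symbol with a space is one possible choice.

```python
from enum import IntEnum


class Level(IntEnum):
    NONE = 0
    SOME = 1
    MOST = 2
    ALL = 3


# Hypothetical table: symbol -> (label, level at which it is spoken,
# whether it stays in the text for intonation when not spoken).
SYMBOLS = {
    ".": ("dot", Level.ALL, True),
    ",": ("comma", Level.ALL, True),
    "#": ("number", Level.SOME, False),
    "*": ("star", Level.MOST, False),
}


def process(text, level):
    """Speak symbols at or below the current level; drop or keep the rest."""
    out = []
    for ch in text:
        entry = SYMBOLS.get(ch)
        if entry is None:
            out.append(ch)  # not a known symbol, pass through
            continue
        label, symbol_level, keep = entry
        if level >= symbol_level:
            out.append(f" {label} ")  # spoken as its label
        elif keep:
            out.append(ch)  # left for the synth to shape intonation
        else:
            out.append(" ")  # removed so the synth cannot speak it
    return "".join(out)
```

For example, `process("a*b, c", Level.NONE)` gives `"a b, c"`: the star is removed, while the comma stays so the synth can still pause.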
@nvaccessAuto

Comment 6 by jteh on 2010-08-13 03:20
A few comments. I haven't had a chance to think about this in much detail yet.

  • Regarding brief punctuation labels, why not just use brief labels by default? After all, this is what we do in English. For example, rather than saying apostrophe, we say tick; rather than saying exclamation mark, we say bang. This is unpopular with some people, but we did it precisely because it is brief, which I think is important for screen reader users who rely on this stuff all the time.
  • I think four punctuation levels (none, some, most, all) is good. I personally don't see a need for levels at all (toggle works fine for me), but most users seem to be requesting four levels.
  • Because only the synth can know what language it's actually speaking, I still wonder whether we should just push all punctuation handling to the synth and drop punctuation handling in NVDA altogether.
    • This really is a job for the synth. We don't do word or number processing, so why punctuation?
    • I guess a lot of synths may not support punctuation, but that's technically a bug in the synth.
    • Unfortunately, there are probably far too many synths that this will break, so it's probably unrealistic to force this.
    • Part of my reason for mentioning this is that there are just far too many symbols for us to keep track of; e.g. copyright, bullets, quotes and dashes, just to list a few.
  • If we are going to do this, I think we should try to handle common symbols such as copyright, bullets, quotes and dashes.
    • However, this could get ridiculous, with people wanting all sorts of nonsense added to the default punctuation dictionaries. We'll need to "draw a line" somewhere.
  • Should punctuation data be tied to NVDA's language setting?
    • If you want to support multiple reading languages, no.
    • However, this raises the question of how to determine which language to load. Or does the user always have to load them manually?
    • We might need to revisit the idea of adding a standard "language" property on synths to help with this.
  • I don't think that punctuation is tied to the voice, though I do think it is tied to the language.
    • Unfortunately, one often changes voices when changing languages, so the two might have to be linked unless we can find a way to reliably determine the language. Related to my previous point.
  • I think we need something more flexible than just defining characters which must be included to preserve intonation.
    • I don't like hard coded exceptions for things like "...". There are bound to be more of them.
    • However, we might be able to divide this into three groups: sentence endings, only before spaces and anywhere.
    • The problem with sentence endings is that English (and I assume other languages) has particular rules about when they are actually considered endings. For example, if followed by a quote, "." is a sentence ending, but not if followed by a letter, comma, etc.
  • Your idea of removing characters below the current punctuation level is good. We often get people complaining that they turned off punctuation but some still gets spoken.
    • Unfortunately, how well this works depends on how many characters we catch. We're bound to miss some. :)
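
A rough sketch of the sentence-ending rule mentioned above, assuming a "." counts as an ending when it is followed by optional closing quotes or brackets and then whitespace or the end of the text. The pattern and the helper `sentence_end_positions` are illustrative assumptions for English only, not existing NVDA code.

```python
import re

# Assumption: "." ends a sentence if followed by optional closing quotes or
# brackets and then whitespace or end of text. A following letter, digit or
# comma means it is not an ending.
SENTENCE_END = re.compile(r"""\.(?=["')\]]*(?:\s|$))""")


def sentence_end_positions(text):
    """Return the indexes of the dots treated as sentence endings."""
    return [m.start() for m in SENTENCE_END.finditer(text)]


text = 'He said "stop." Then v1.2 shipped.'
positions = sentence_end_positions(text)
```

Here the dot inside `v1.2` is skipped, while the dot before the closing quote and the final dot are both reported.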
@nvaccessAuto

Comment 7 by aleksey_s (in reply to comment 6) on 2010-08-13 05:19
Replying to jteh:

  • Regarding brief punctuation labels, why not just use brief labels by default? After all, this is what we do in English. For example, rather than saying apostrophe, we say tick; rather than saying exclamation mark, we say bang. This is unpopular with some people, but we did it precisely because it is brief, which I think is important for screen reader users who rely on this stuff all the time.

It will make NVDA very unfriendly to newbies or, better said, to non-geeks, at least in Russian. To give an example: Currently, NVDA says something like "levaya krooglaya skobka" for "(" (left paren). It is how we used to name this symbol in school etc. I changed it to "le sko" (first letters, something like "le par"). It is how things were written on telegraphs. While I know that some users want or can put up with such abbreviations for punctuation, I also know well that there are a lot of newbies who will hate NVDA for such geeky stuff. It would be cool if brief labels were built into NVDA, but as I said in a previous comment, adding profiles will increase complexity, and, after all, the user can go to NVDA-community.org and obtain a punctuation file with brief labels if he/she wants.

  • I think four punctuation levels (none, some, most, all) is good. I personally don't see a need for levels at all (toggle works fine for me), but most users seem to be requesting four levels.

OK, let it be 4.

  • Because only the synth can know what language it's actually speaking, I still wonder whether we should just push all punctuation handling to the synth and drop punctuation handling in NVDA altogether.

Sounds like science fiction. As you said, this is the real world, and almost all synthesizers do not support it. Anyway, if you are too concerned, we can make an option "let synthesizer process the punctuation", and add support for turning on punctuation handling in synthDrivers for synths which support it, like espeak.

  • Unfortunately, there are probably far too many synths that this will break, so it's probably unrealistic to force this.

Indeed.

  • Part of my reason for mentioning this is that there are just far too many symbols for us to keep track of; e.g. copyright, bullets, quotes and dashes, just to list a few.

    • If we are going to do this, I think we should try to handle common symbols such as copyright, bullets, quotes and dashes.

Sure.

  • However, this could get ridiculous, with people wanting all sorts of nonsense added to the default punctuation dictionaries. We'll need to "draw a line" somewhere.

Why do you think this is a problem? It doesn't seem like a big deal to me. The more punctuation the better. However, we can implement an ability to load language-specific punctuation independently for each language, as well as the default. Then language-specific punctuation symbols will be in separate files. (What about gettext translation then? Or maybe language-specific marks can stay untranslated.)

  • Should punctuation data be tied to NVDA's language setting?

I think yes, at least for the near future, while we do not have support for a language setting in synths.

  • If you want to support multiple reading languages, no.

Doesn't matter. Anyway, we don't support changing languages on the fly.

  • However, this raises the question of how to determine which language to load. Or does the user always have to load them manually?

This breaks gettext usage.

  • We might need to revisit the idea of adding a standard "language" property on synths to help with this.

I read a lot of multilingual texts here. So loading a new synth each time is a bad idea. However, I have created a virtual synthesizer called "multilang", which can determine language of text and speak it with desired synth. It keeps all needed synths loaded simultaneously, so there is no lag when switching synths.

  • I think we need something more flexible than just defining characters which must be included to preserve intonation.

I still believe that we need a flag on punctuation symbol that it must be preserved in the text after an insertion of actual label.

  • I don't like hard coded exceptions for things like "...". There are bound to be more of them.

OK, let's go with regular expressions then. BTW, we will be able to handle differently "." when it is between numbers (one point zero two), when at end of sentence (a very long sentence full stop) and when in other context (www dot NVDA dash project dot org).
One problem that came to mind: We need to handle priority in some way, e.g. handle "..." before ".", and stuff like that. We probably should mark somehow that this part of the string is already processed and all punctuation expanded. Need to think about that.
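
A minimal sketch of how regular expressions could cover both the context and the priority question, assuming Python's `re` module. Alternatives in one pattern are tried left to right, so "..." can be listed before "."; the group names, labels and the `expand` helper are made up for illustration.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come
# first. All names and labels here are hypothetical.
PATTERN = re.compile(
    r"(?P<ellipsis>\.{3})"               # "..." before any single "."
    r"|(?P<decimal>(?<=\d)\.(?=\d))"     # "." between digits
    r"|(?P<sentence_end>\.(?=\s|$))"     # "." before whitespace or end
    r"|(?P<dot>\.)"                      # any other "."
    r"|(?P<dash>-)"
)

LABELS = {
    "ellipsis": " dot dot dot",
    "decimal": " point ",
    "sentence_end": " full stop.",
    "dot": " dot ",
    "dash": " dash ",
}


def expand(text):
    """Replace each match with the label of the alternative that matched."""
    return PATTERN.sub(lambda m: LABELS[m.lastgroup], text)
```

Since `re.sub` scans the original text once and does not rescan what it inserted, the "." kept in " full stop." is not expanded a second time, so no "already processed" marker is needed in this form.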

  • However, we might be able to divide this into three groups: sentence endings, only before spaces and anywhere.

What about languages which have no concept of sentence endings at all? I remember you or Mick said something about that some time ago.

  • The problem with sentence endings is that English (and I assume other languages) has particular rules about when they are actually considered endings. For example, if followed by a quote, "." is a sentence ending, but not if followed by a letter, comma, etc.

Same here for Ukrainian and Russian.

  • Unfortunately, how well this works depends on how many characters we catch. We're bound to miss some. :)

We always can go and add missing ones, when users complain.

@nvaccessAuto

Comment 8 by briang1 on 2010-10-04 09:06
I have been thinking about this purely from my own perspective. Nobody has mentioned application specific settings as yet, but I kind of envisaged a table of parameters which could be edited for each symbol (when it's to be spoken, which scheme, which language), and that app modules would have a flag for default scheme.
This way, those using, say, an editor required for programming could use their max verbose list for that application even in doc read.

Of course, all of this being user configurable could end up in an unholy mess, so some default settings would be needed for each scheme, and presumably language. It might be a good time to try to tidy up the way symbols are spoken so that globally they are the same in all contexts, and to stop using # as the delimiter for comments.
All of this will break existing dictionaries however, and I do appreciate that what might look simple to a user, could be a nightmare for programming!

I'd also suggest, as has been touched upon here, that geek based systems should be avoided. Thus, I think many will and indeed do find regular expressions difficult to grasp. Obviously not used Dos, but this should be defaulted off and at least some explanation given in the user guide.
Unexpected results occur for me, and I thought at least I had some appreciation of it!
With regard to what the default spoken words are, this presumably is to some extent locale based. In the UK # is seemingly always hash, though some foreign items that talk say pound, number or even hatch or chequeboard! I can live with bang, as exclamation, but not sure about tick, as to me a tick is what you get in tickboxes, not the apostrophe. As long as it's spoken only when cursoring and typing, then I'm actually quite happy with apostrophe. Its start is not confusable with the letter T which is often what you hear when truncating tick.

It's my vote for four levels of punctuation too, but someone needs to sort out when it's not spoken, i.e. it would be nice to hear punctuation when reading a line at a time, but not in say all, etc.

These are my few comments for what they are worth.

@nvaccessAuto

Comment 9 by BugHunter on 2010-10-17 10:02
If i read this right, then issue 2 in the original report is the same as described in ticket 271, and if so 271 could probably be closed.

@nvaccessAuto

Comment 11 by jteh on 2010-12-02 03:51
Changes:
Changed title from "User configurable punctuation labels and/or profiles" to "New text symbol processing framework"
Milestone changed from None to 2011.1

@nvaccessAuto

Comment by jteh on 2010-12-28 23:46
(In #149) The current idea is to use the new symbol framework to determine sentence endings. Text will be pulled in by line, but it will be buffered somewhere until the end of a sentence is reached.

However, I've just realised that this will cause problems with regard to indexing. We're moving by line, so the indexes need to be for each line. Because sentences may cross multiple lines, this means that the indexes need to be inserted in the middle of an utterance. While synths supporting markup do allow this, NVDA doesn't currently support speech markup.

Unfortunately, this means we probably won't be able to implement better say all until we implement speech markup. There may also be some synths that don't support markup (eSpeak, sapi4 and sapi5 do, but I'm not sure about newfon and audiologic for example). If this is the case, say all by sentence won't be possible for these synths.

@nvaccessAuto

Comment 14 by jteh on 2011-01-17 00:21
Sorry people. I know it's been a long wait, but we've decided to defer this to 2011.2, as we want to release 2011.1 sooner rather than later and this will take some time to get right, especially for translators. Rest assured that work is definitely underway on this.
Changes:
Milestone changed from 2011.1 to 2011.2

@nvaccessAuto

Comment by mdcurran on 2011-01-20 23:47
(In #635) Firstly thanks very much for the code. Work on a new symbol framework for all languages is under way. Once this is completed the code in this ticket will certainly be used in some form or another, even if it is only as a reference.

@nvaccessAuto

Comment by jteh on 2011-03-31 06:49
(In #55) We've determined that this will be done separately from symbol/punctuation handling, though probably in the same module (characterProcessing).

@nvaccessAuto

Comment 19 by jteh on 2011-04-11 01:00
My current thinking is that locales will be able to include locale specific symbol definitions. However, users will only be able to customise symbols globally. Otherwise, we have to implement the ability for users to do locale specific configuration, which gets ridiculously complex for both user experience and implementation. Lex, is this going to be a major problem for you?

@nvaccessAuto

Comment 20 by aleksey_s (in reply to comment 19) on 2011-04-12 19:44
Replying to jteh:

However, users will only be able to customise symbols globally. Otherwise, we have to implement the ability for users to do locale specific configuration, which gets ridiculously complex for both user experience and implementation. Lex, is this going to be a major problem for you?

Yes. One of requirements for the framework was an ability for user to customise punctuation labels. But if user customizes labels globally (say, shortens Russian labels) and changes synthesizer to another language (say, English), the synthesizer will be unable to handle punctuation labels in Russian. So changing punctuation labels globally is a bad idea in my view.

@nvaccessAuto

Comment 21 by jteh on 2011-04-12 21:54
Okay. In that case, the initial implementation won't allow user customisation at all. The more immediate priority is for other locales to be able to customise symbols. You can work around the lack of user configuration temporarily by just overwriting the Russian localisation in your local copy.

@nvaccessAuto

Comment 22 by jteh on 2011-04-12 21:57
One other option is to allow the user to customise symbols only for NVDA's configured locale. In your case, if you have NVDA set to Russian, your user configured symbols will only speak when you are using a Russian voice. If you switch to an English synthesiser, the user configured symbols would be ignored. This makes a certain amount of sense, since NVDA's configured locale is the user's "primary" locale and any other languages are secondary.

Will this work for you? Otherwise, we'll just delay user configuration to a later stage as I said above.

@nvaccessAuto

Comment 23 by jteh on 2011-04-13 05:02
http://bzr.nvaccess.org/nvda/symbols/

@nvaccessAuto

Comment 24 by aleksey_s (in reply to comment 22) on 2011-04-13 05:38
Replying to jteh:

Will this work for you? Otherwise, we'll just delay user configuration to a later stage as I said above.

This will work, however I don't understand why there is such a restrictive limitation. What was so ridiculously complex about designing a system where the user can make locale-dependent changes? Could you please write up your thoughts? I want to collaborate on this.

What about a system similar to the one used for the global gesture map? E.g. there is one file where user changes are saved. It is divided into sections, and each section contains info for a specific locale.

@nvaccessAuto

Comment 25 by jteh (in reply to comment 24) on 2011-04-13 06:36
Symbol information and processing is handled separately for each locale. If we want users to be able to customise for each locale, we must load user symbol info separately for each locale as well.

I originally thought locale specific user customisation would make the code far more complex. However, I ended up having to rethink the design anyway, so the new code should handle this more easily in future.

I definitely can't implement all of the customisations in one file, as the maps are parsed separately for each locale and I'd rather not maintain two separate file formats. However, we could have symbols-en.dic, symbols-ru.dic, etc.

This does present a challenge as far as GUI is concerned. How will the user specify which locale they're configuring? Too many options gets needlessly complex. Imo, they should only be able to configure a single locale at a time. The question is which locale the GUI will choose.

Anyway, the initial implementation won't handle user customisation. I'll address that separately. One thing at a time. :)

@nvaccessAuto

Comment by jteh on 2011-04-14 03:19
(In #43) This is already fixed in the new code for #332. On further reflection, I think indentation needs to be handled separately.

@nvaccessAuto

Comment by jteh on 2011-04-14 07:30
(In #919) This should be addressed using the new symbols code once it is merged.

@nvaccessAuto

Comment by jteh on 2011-04-14 07:34
(In #149) This probably won't use the symbol code after all.

@nvaccessAuto

Comment 29 by jteh on 2011-04-14 10:28
Arrrg! Looks like Python's re module can only handle 100 named groups:
AssertionError: sorry, but this version only supports 100 named groups
That means only 100 symbols can be defined. It looks like this is going to be a show stopper for some languages, which means this whole thing needs to be rethought from scratch. Terrific.

@nvaccessAuto

Comment 30 by jteh on 2011-04-14 10:37
The solution is to match all simple symbols in a single group and only use named groups for complex symbols. This means that there can be a maximum of 99 complex symbols, but that should be fine. I just need to rework the code which builds the pattern and possibly split simple and complex symbols more in the code. Urg.
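
A rough sketch of that idea, with all simple (single character) symbols sharing one group and only complex symbols getting their own named group. The `SIMPLE` and `COMPLEX` tables and the helper names are hypothetical and only illustrate the pattern-building approach; the sketch also assumes `SIMPLE` is not empty.

```python
import re

# Hypothetical data, not the real symbol tables.
SIMPLE = {"!": "bang", "#": "number", "@": "at"}
COMPLEX = {
    "ellipsis": (r"\.{3}", "dot dot dot"),
    "decimal_point": (r"(?<=\d)\.(?=\d)", "point"),
}


def build_pattern(simple, complex_):
    """One named group per complex symbol, one shared group for the rest."""
    parts = [f"(?P<{name}>{pat})" for name, (pat, _) in complex_.items()]
    char_class = "".join(re.escape(ch) for ch in simple)
    parts.append(f"(?P<simple>[{char_class}])")
    return re.compile("|".join(parts))


PATTERN = build_pattern(SIMPLE, COMPLEX)


def replace(match):
    """Callback: look up the label by matched text or by group name."""
    if match.lastgroup == "simple":
        return f" {SIMPLE[match.group()]} "
    return f" {COMPLEX[match.lastgroup][1]} "


def process(text):
    return PATTERN.sub(replace, text)
```

However many simple symbols are defined, the pattern only grows by one group, so the group count depends on the number of complex symbols alone.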

@nvaccessAuto

Comment 31 by aleksey_s on 2011-04-14 14:06
I don't see a problem with using a third-party regular expression library, say, the one from Google. It is claimed to be extremely fast.

Or what about rethinking the pattern-building code? E.g. process 100 symbols at a time, not the whole thing.

@nvaccessAuto

Comment 32 by jteh on 2011-04-14 22:35
The library needs to be able to call back into a Python function when handling replacements.

I thought about splitting the expression. However, the problem is that you don't want overlapping matches. For example, if the ". sentence ending" pattern preserves the ".", if you apply the normal "." pattern later, it will try to convert it into "dot.", so you will end up with something awful like "dot. dot." for a single sentence ending. You can work around this by using special marker characters to mark sections that are already processed, but this makes the expressions much more complicated.
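
The overlap problem can be reproduced with a small sketch. The two patterns and the " dot." replacement are simplified stand-ins chosen for illustration, not the real expressions.

```python
import re

SENTENCE_END = re.compile(r"\.(?=\s|$)")
ANY_DOT = re.compile(r"\.")
COMBINED = re.compile(r"(?P<end>\.(?=\s|$))|(?P<dot>\.)")


def two_passes(text):
    """Split expressions: the second pass sees the preserved '.' again."""
    text = SENTENCE_END.sub(" dot.", text)
    return ANY_DOT.sub(" dot.", text)


def one_pass(text):
    """Single expression: each '.' in the input is handled exactly once."""
    return COMBINED.sub(" dot.", text)
```

With the input `"Done."`, `two_passes` doubles the label while `one_pass` produces it once, which is why the patterns are kept in a single expression.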

I don't think 90 or so (it's actually less than 99) complex symbols is going to be a problem. If it is, I guess we can cross that bridge later and switch libraries.

@nvaccessAuto

Comment 33 by jteh on 2011-04-15 07:23
This issue has been fixed.

The only thing left now for this part is documentation.

Any translators should note that info will be inherited from English by default, so you don't need to specify anything that's already specified in English. See the French symbols.dic for an example... or wait for the documentation if you don't understand. :)
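
As an illustration of the inheritance described above, here is a minimal sketch using `collections.ChainMap`. The tables and labels are invented, and the real symbols.dic format and loading code are not shown in this thread, so treat this purely as a model of the lookup order.

```python
from collections import ChainMap

# Hypothetical data: English is the base, a locale lists only what differs.
EN = {"!": "bang", "#": "number", "(": "left paren"}
FR = {"!": "point d'exclamation"}


def symbols_for(locale_data, english=EN):
    """Look a symbol up in the locale first, then fall back to English."""
    return ChainMap(locale_data, english)


fr = symbols_for(FR)
```

A lookup of `"!"` finds the French entry, while `"#"` falls through to the English one.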

@nvaccessAuto

Comment 34 by jteh on 2011-04-15 12:18
Looks like we need a third option for the preserve field. Currently, we have always and never. However, there are some symbols that need to be preserved only if they're not being replaced; i.e. they need to be preserved if the level is lower than the level for replacement. The best example of this is the apostrophe between letters; e.g. "can't". I propose we call this "norep"; i.e. only preserve if not replacing. Mick, thoughts on the name?
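
A small sketch of how the three preserve values could behave. The `render` function, its parameters and the level names are assumptions made for illustration, and treating "never" as "replace with a space when not spoken" is also an assumption.

```python
LEVELS = {"none": 0, "some": 1, "most": 2, "all": 3}


def render(symbol, label, symbol_level, current_level, preserve):
    """Return the text sent to the synth for one symbol.

    preserve is "always", "never" or "norep" (only if not replacing).
    """
    replacing = LEVELS[current_level] >= LEVELS[symbol_level]
    if replacing:
        if preserve == "always":
            return f" {label}{symbol}"  # label plus the symbol itself
        return f" {label} "  # "never" and "norep": label only
    if preserve in ("always", "norep"):
        return symbol  # not spoken, but kept for the synth
    return " "  # not spoken and not kept
```

For the apostrophe in "can't" with "norep", the symbol is passed through unchanged at a low level and replaced by its label alone once the level is high enough.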

@nvaccessAuto

Comment 35 by mdcurran on 2011-04-16 03:39
The name norep sounds fine to me.

@nvaccessAuto

Comment 36 by jteh on 2011-04-18 03:07
Merged in 862d836. Please file new tickets for features and bugs.
Changes:
State: closed

@jcsteh jcsteh was assigned by nvaccessAuto Nov 10, 2015
@nvaccessAuto nvaccessAuto added this to the 2011.2 milestone Nov 10, 2015