Refactor speech to support callbacks, speech commands in control/format fields, priority output, etc. #4877

Open
nvaccessAuto opened this Issue Feb 3, 2015 · 16 comments


Reported by jteh on 2015-02-03 00:56
A lot of potentially nice functionality is not possible (or is at least ridiculously painful) with our current speech framework. Speech needs a pretty big refactor to allow for such things. It needs to support:

  • Callbacks which are called at a requested point in the speech output
  • Callbacks which are called when a given synth finishes speaking and when overall speech has stopped
  • Speech commands in control/format field speech so that control and formatting info can be indicated by things other than just text
  • Priority output so that important messages can interrupt what is being spoken and/or be spoken after the next utterance without losing lower priority utterances already sent

Unfortunately, some of this is going to break backwards compatibility, but I think it's worth it in this case.
Blocking #905, #3188, #3286, #3736, #4089, #4233, #4433, #4874, #4966, #5026, #5104
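To make the requirements above concrete, here is a rough sketch of what speech commands with callbacks and priorities might look like. These class and constant names are invented for illustration; they are not NVDA's actual API.

```python
# Hypothetical sketch of the command types the refactor calls for.
# All names here are illustrative, not NVDA's real classes.

class SpeechCommand:
    """Base class for non-text items in a speech sequence."""

class CallbackCommand(SpeechCommand):
    """Runs a callback when speech output reaches this point."""
    def __init__(self, callback):
        self.callback = callback

class BeepCommand(SpeechCommand):
    """Indicates control/formatting info with something other than text."""
    def __init__(self, hz, length_ms):
        self.hz = hz
        self.length_ms = length_ms

# A speech sequence freely mixes text and commands:
sequence = [
    "heading level 2",
    BeepCommand(880, 50),
    "Introduction",
    CallbackCommand(lambda: print("reached end of heading")),
]

# Priority values would let important messages interrupt or follow the
# current utterance without discarding queued lower-priority speech:
SPRI_NORMAL, SPRI_NEXT, SPRI_NOW = 0, 1, 2
```

The key design point is that the sequence is a list of objects rather than a flat string, so drivers and filters can act on structure instead of parsing text.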

Comment 2 by camlorn on 2015-02-04 17:37
+infinity. Glad to see this is at least a ticket now. I'll give input where and as I can; at the moment this is too ill-defined for me to really comment beyond suggesting you get a time machine so we can have this yesterday. If only.

Comment 3 by leonarddr on 2015-02-06 16:35
I wonder, is #914 something which could be involved in this ticket?

Comment 4 by camlorn on 2015-02-24 17:53
I've been thinking about this some.
I think we need to refactor how mixing works. There's two cases in which I would want to play a sound in parallel with speech.
The first is when NVDA says a fixed string, so basically anywhere that's not say all. In this case, I need to be allowed to potentially lengthen the buffer. If the speech takes less time than the sounds, simple addition will not work. Preparing the buffer as one chunk and passing it through the add-on with tags for specific sample ranges should be sufficient for this case but may cause processing issues for larger chunks.
The second is say all or other "streaming" situations, and this is the harder one. Ideally, something like Unspoken can work in parallel with say all. But when you're applying filters, they have tails of a few hundred samples. Not to mention the preceding situation. In this case, I don't want to lengthen the buffer for the simple reason that it is not semantically separated into logical chunks.
And the problem with playing in parallel: variable latencies mean I won't actually be tightly aligned with the speech anymore.
I'm not sure how to fix these. I think we need to decide what kinds of manipulation we want to allow and disallow. I've got enough knowledge to talk about what we can potentially do to NVWave or something that sits a level above it, and in the worst case we allow add-ons to monkeypatch through a blessed interface or something. Obviously I want Unspoken to work in say all and other places where the object is said but not focused, but beyond that I'm open. Nevertheless, I do know most of the algorithms at this point, and I think we should start pinning down the capabilities. Ideally we can get input from more than just me, but I'm not sure anyone else is working on add-ons similar to Unspoken.
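The "lengthen the buffer" case from the first scenario above reduces to padding the shorter buffer with silence before sample-wise addition. A toy illustration (plain Python lists standing in for real sample buffers; not NVDA code):

```python
# Toy sketch of mixing a fixed speech buffer with a parallel sound.
# If the speech is shorter than the sound, simple addition fails
# unless the speech buffer is first padded out with silence.

def mix(speech, sound):
    """Mix two mono sample buffers, padding the shorter with silence."""
    n = max(len(speech), len(sound))
    speech = speech + [0.0] * (n - len(speech))
    sound = sound + [0.0] * (n - len(sound))
    return [s + t for s, t in zip(speech, sound)]

# The result is as long as the longer input:
mixed = mix([0.1, 0.2], [0.05, 0.05, 0.05, 0.05])
```

The streaming (say all) case is exactly where this breaks down: there is no single buffer with a known end to pad, which is why the comment treats it separately.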

Comment 5 by jteh (in reply to comment 4) on 2015-02-24 22:51
Replying to camlorn:

I think we need to refactor how mixing works. There's two cases in which I would want to play a sound in parallel with speech.

What you're suggesting (sample range tagging, etc.) requires that the speech framework has intimate knowledge of audio output. The biggest problem with this is that not all synths output audio through NVDA, so this actually isn't possible.

One approach which should cover at least some of your needs is that we allow a callback to specify that it wants to suspend further utterances until it is complete, at which point it will request that speech resume. One major problem with this is that a broken add-on can very easily break speech quite badly (more easily than it can already), so maybe we need to allow a maximum timeout for this or something.
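The suspend/resume idea with a watchdog timeout might look roughly like this. This is a hypothetical sketch, not NVDA's implementation; the class and method names are invented.

```python
# Sketch of a suspend/resume mechanism with a watchdog timeout, so a
# broken add-on that never resumes can't silence speech forever.
# Hypothetical API; names are illustrative only.

import threading

class SpeechQueue:
    def __init__(self, timeout=5.0):
        self._resume = threading.Event()
        self._resume.set()  # speech flows freely by default
        self._timeout = timeout

    def suspend(self):
        """Called by a callback that wants to delay further utterances."""
        self._resume.clear()
        # Watchdog: force resumption if the add-on never calls resume().
        threading.Timer(self._timeout, self._resume.set).start()

    def resume(self):
        """Called by the add-on when its work (e.g. a sound) is done."""
        self._resume.set()

    def wait_until_speakable(self):
        """The speech thread blocks here before sending the next utterance."""
        self._resume.wait(self._timeout + 1.0)
```

The timeout is the safety valve discussed above: even if `resume()` is never called, speech restarts after `timeout` seconds.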

Comment 6 by camlorn on 2015-02-25 03:06
How common is not playing through NVDA? Personally, I'd have no problem saying "either you give me the audio and semantic sample tagging or my add-on doesn't work" if that means I'm getting very accurate and not drifting playback during say all, and I'm still not entirely convinced that applying filters directly to speech is a bad idea when possible. Bump volume for bold, underline is chorus, I don't know. I wish I could experiment with this now so that we could know if this is super valuable or just me being me. I do not feel confident enough in the quality of eSpeak's code to hack eSpeak into having extra filters that you can toggle without it probably becoming a pretty hefty project, unfortunately.
And doesn't going through other things break NVDA's device selection? Also, what can't we get audio out of? Because this breaks backward compatibility anyway, and if there's not actually anything you can't request samples from reasonably, why not break it in that way?
The callback would help with some, I think. In the common case, sounds are short. But the add-on will still need to know if it's a say all situation, and possibly if it needs to abort the sound instead. If the utterance is because I pressed something, it needs to bail gracefully. I think that timeouts should not be allowed here or at least set to a significantly large value; if you are an add-on developer using this feature, then you need to be aware that it is dangerous and treat it accordingly. It is worth noting that if I'm playing in parallel because there's no choice, then playing things that overlap slightly is also pretty trivial. If Libaudioverse is the backend (it is for the unreleased version of Unspoken), it's always playing anyway.

Comment 7 by jteh on 2015-02-25 05:34
We still use direct output from SAPI4, SAPI5 and Audiologic. I believe the Festival and Acapela drivers do also. And no, this doesn't break device selection; the drivers just handle the initialisation themselves. There are also some who still want external synths. We can get samples from SAPI4 and SAPI5 and probably will do so in future, but the point is that we aren't going to drop support for this.

Aside from that, the stuff I'm working on relates to how the speech framework processes utterances, passes them to synths and calls callbacks. The synths still generate the samples after they receive the utterance. Therefore, I don't see how you could do sample tagging at this level anyway. You can have the synth fire callbacks when it outputs an index, which is what I plan to do.
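The index-callback approach described here could be sketched as follows. The shape is hypothetical (not NVDA's real driver interface): the framework hands out numbered index marks, the synth driver reports when its output passes each one, and the framework fires the registered callback.

```python
# Sketch of index-based callbacks. Hypothetical names, not NVDA's API.
# The framework embeds numbered index marks in the utterance; the synth
# driver calls on_index_reached() as playback passes each mark.

class IndexDispatcher:
    def __init__(self):
        self._callbacks = {}
        self._next_index = 0

    def add_callback(self, callback):
        """Register a callback; returns the index mark to embed in speech."""
        index = self._next_index
        self._next_index += 1
        self._callbacks[index] = callback
        return index

    def on_index_reached(self, index):
        """Called by the synth driver when its output passes an index."""
        callback = self._callbacks.pop(index, None)
        if callback is not None:
            callback()
```

This keeps the framework synth-agnostic: drivers only need to report index positions, whether they produce audio through NVDA or not.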

Comment 8 by camlorn on 2015-02-26 22:08
I get it, I'm just not thrilled. Things can be done without sample-accurate playback and, thinking about it more, I think that a callback to delay speech won't help much either. But it just seems like we could do a lot here, especially as we move forward and external synths finally, finally begin to finish dying.
If you use the synth's callbacks, doesn't it usually give you sample indexes? You can still get this information even if it doesn't by gathering samples in, say, chunks of 64. If a callback triggers, tag that chunk. I get that we can't have this, at least without custom synth drivers or something, but it's certainly possible. I'm personally not averse to having features of add-ons that depend on synth drivers supporting certain things, but maybe I'm being too ambitious.
Also, if the synth's latency is too high, there's no way to sync at all. SAPI is a pretty bad offender for this, at least from the few times I've used it.
How high level will new speech commands be? Am I still monkeypatching stuff, or do we have semantic stuff like "is saying object" or something? Higher level, hookable speech commands would be nice: this would let me move my add-on into the synth itself if I wanted, for example.

Comment 11 by camlorn on 2015-03-18 13:48
I had a thought on the syncing issue that might actually be workable. This might also be overly complicated, and maybe it's something we can do after this refactor if it proves necessary. It's also got some unaddressed details and such, but I've been thinking about it for a few days and I can't see specific downsides that would make it unworkable.
First, include the Speex resampler. This may have other benefits, especially if integrated into NVWave. Investigating the parts that don't apply to this ticket is on my to-do list; waveout uses a linear resampler and those aren't exactly good for large jumps. It's a couple of C source files that can be integrated pretty easily. This gives synths which wish to support the next thing a convenient way to upsample their samples to 44.1 kHz.
Second, either include a "play stereo samples" command or allow for callbacks in whatever architecture exists to return samples they wish played. I like the former because it's the most common thing I think people are going to want to do and there's no guarantee that the synth has to process it in realtime.
Third, implement a background thread that can play these samples when passed over a queue. This isn't exactly as hard as it sounds, though it might involve two threads, depending. I can give this code if we go here. Then, all synths that don't want to or can't support highly accurate synchronization can delegate to this thread in the same way we would if we were getting callbacks. It might even be possible to map this into a callback command for those synths that don't want to deal with it.
Finally, add the ability to splice the audio directly into the speech stream before playing for whatever synths we can inside NVDA itself. This involves an upsample to 44.1 kHz or thereabouts; 44.1 kHz is the most common rate and is sufficiently high for anything we'd want to do.
I figure this is at least a starting point, though I will admit that knowing how much of a problem differences in latency is will have to wait until we have the ability to play sound during say all and the like.
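The background playback thread from the third step above could be as simple as a queue consumer. A rough sketch, where `play_buffer` is a placeholder for whatever NVWave-level call actually outputs audio (all names here are invented for illustration):

```python
# Sketch of a background thread that plays sample buffers handed to it
# over a queue, as proposed in the third step. play_buffer() stands in
# for the real audio output call.

import queue
import threading

played = []

def play_buffer(samples):
    # Placeholder: a real implementation would hand samples to NVWave.
    played.append(samples)

class PlaybackThread(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.jobs = queue.Queue()

    def run(self):
        while True:
            samples = self.jobs.get()
            if samples is None:  # sentinel: shut down cleanly
                break
            play_buffer(samples)

player = PlaybackThread()
player.start()
player.jobs.put([0.0, 0.1, 0.2])  # a synth driver delegates a buffer
player.jobs.put(None)
player.join(timeout=2.0)
```

Because buffers are queued, a synth driver that can't do sample-accurate synchronization can still delegate playback here at roughly the right moment.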

Comment 12 by Q (in reply to comment 4) on 2015-04-03 19:41
Replying to camlorn:

I've been thinking about this some. […]

@nvaccessAuto nvaccessAuto added this to the next milestone Nov 10, 2015

camlorn commented Apr 23, 2016

I thought of another possible use case, though I'm not sure how useful it would be: implement an option to monitor things in the background. I'm thinking at least aria live regions and the controller API, but it might also be useful to tag terminals for watching.
As an example, I'm currently programming something wherein I need to have two terminals open, one running a client and the other running a server. If I could tag them both with separate voices and monitor even when it doesn't have focus, that might be very useful. It could also be horribly confusing, mind you, and it would probably help if this also included pan. But it's an interesting thought and maybe something to put on the "We want an add-on to be able to..." list at least. I don't think you could use it instantly, but I think you could train to it if you tried a bit.
There is an old demo somewhere of a prototype system that did something like this for other stuff as well, speaking labels out one speaker and control types out the other in parallel. I wish I had a link to it. I'll see if I can find it, but don't even know what to search for.
But we're still collecting use cases, so I figured I'd throw this out there anyway. I'm sure it has problems I haven't thought of, etc, but it's something to put on the list.

@jcsteh jcsteh removed this from the next milestone Jun 24, 2016

@jcsteh jcsteh added the p3 label Jul 1, 2016

camlorn commented Jul 21, 2016

Got one more.

On the web, we currently don't read the title attribute when moving by arrows. I think this also applies to aria-label and maybe some other things. With this, it would be possible to indicate these attributes efficiently, so that a user could know to check it in one way or another.

As it stands, we would have to say something like "has title" or just read it every time. Since this can be used on things like abbreviations, this is possibly less than useful.

There's already an issue about them not reading, but I don't remember which one at the moment. Just that it has a weird title and that I'm on the thread.

I posted on bugzilla about an issue with NVDA (and JAWS) on Firefox where the 'Remember Password' popup interrupts other aria alerts on the page (https://bugzilla.mozilla.org/show_bug.cgi?id=1323070). It was determined that this was a problem with the screen readers not being able to handle multiple alerts at the same time. Would the changes proposed here fix this issue?

camlorn commented Dec 29, 2016

It wouldn't, I don't think. It would allow cool experimental things like saying them all at the same time in different voices and maybe panning them across the sound field or something, but the core issue is probably about queuing things to be spoken, and we should already be able to do that.

jcsteh commented Jan 2, 2017

leonardder commented May 29, 2017

An additional use case I'd like to suggest is a UI to reorder speech output. E.g.

  1. Save configuration on exit check box checked Alt+s
  2. checked Save configuration on exit check box Alt+s
  3. checked check box Save configuration on exit Alt+s
  4. check box checked Save configuration on exit Alt+s

Such a UI could have a list containing the several speech parts, along with move up and move down buttons. This is especially helpful in cases where one wants to reorder output for table cells in Excel (e.g. speak column and row headers before the cell contents).
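The reordering idea above amounts to treating a control's announcement as named fields spoken in a user-configurable order. A minimal sketch (the field names and function are invented for illustration, not an NVDA API):

```python
# Toy sketch of reorderable speech output: the parts of a control's
# announcement are named fields, spoken in a configurable order.
# Names here are illustrative only.

DEFAULT_ORDER = ["label", "state", "role", "shortcut"]

def build_utterance(parts, order=DEFAULT_ORDER):
    """Assemble the spoken string from parts, honouring the chosen order."""
    return " ".join(parts[name] for name in order if name in parts)

parts = {
    "label": "Save configuration on exit",
    "state": "checked",
    "role": "check box",
    "shortcut": "Alt+s",
}
```

With this shape, the proposed list-with-move-buttons UI would simply edit the `order` list, and per-context orders (e.g. Excel table cells) become per-context order lists.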

jcsteh commented May 29, 2017
