Speech Abstraction Layer #35

Open
3 tasks
albertotirla opened this issue Nov 6, 2022 · 14 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@albertotirla
Member

albertotirla commented Nov 6, 2022

After a thread on the audiogames.net forum about speech dispatcher not working, leaving the user without any way to use their computer in graphical mode, I concluded that, regardless of how good speech dispatcher may be from a Unix philosophy standpoint, it's made of many moving parts, too many in my opinion, any of which can bring the entire system down simply by failing. In a world where speech is the only way for visually impaired people to use a computer, and therefore as critical as the GPU is for sighted users, having it fail on us is simply unacceptable. Dear readers with some kind of sight: would you like your screen to go black simply because, for example, the shader cache was full? Also, graphics are integrated everywhere in the stack while speech isn't, but that's another discussion for another time. So then, since I'm sure you wouldn't like that, why should we have to deal with it?

I took this idea from screen readers like NVDA, and fenrir for the tty, where the speech abstraction lives inside the screen reader itself, not in speech dispatcher. Different backends facilitate speech, and some of them may eventually hand PCM wave data back to the screen reader, where it gets processed by internal systems, for example through direct interaction with pipewire. In our case, we would use a Rust trait to abstract away the concrete implementation of the backend speech provider, and the screen reader would probably use Box<dyn SpeechProvider> as the interface through which to deliver speech to the user.

For now, this is the draft I want to propose. Feel free to modify it and suggest improvements, as this one is probably here to stay once it's implemented.

pub trait SpeechProvider: Sized {
    type Configurator;
    type Error;
    type Buffer;
    fn init_speaker(cfg: Self::Configurator) -> Result<Self, Self::Error>;
    fn speak<T>(&self, text: T) -> Result<Self::Buffer, Self::Error>
    where
        T: AsRef<str>;
    fn pause(&self) -> Result<(), Self::Error>;
    fn stop(&self) -> Result<(), Self::Error>;
    // Configuration-specific methods like set_volume, get_volume, set_pitch, etc. are not
    // required, because the configurator is backend specific and loads its values in the init
    // method; the method below is meant to be called each time the configuration changes.
    fn reload_configuration(self, cfg: Self::Configurator) -> Result<Self, Self::Error>;
}

For now, here are a few things the current implementation doesn't explain:

  • loading the right speech provider based on configuration and populating the middleware with it. We need to define a mechanism that fits inside Rust's type system, such as enum dispatch or dynamic dispatch (see the sketch after this list)
    • loading the right provider specific configuration
      • where is that located?
      • should it be standardised, or anywhere the config crate can find it?
      • how are changes to it tracked?
      • do we allow implementations to define their own format, or do we enforce toml everywhere?
  • buffer handling
    • do we introduce Synthizer now?
    • how do we pass the buffer over? Rust isn't a dynamically typed language, so how do we interact with the buffer type? Is an associated type even a good mechanism, given what we want to do?
      • as to the functionality of the buffer itself, do we treat it like a Vec, or do we make an additional buffer trait with a method that returns an iterator of float values?
    • or maybe we should implement pipewire access directly
    • do we allow custom sample rates, or define a static one?
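
To make the first bullet a bit more concrete, here is a minimal sketch of what enum dispatch could look like, using a trimmed-down stand-in trait (just speak, with a plain String error) and two placeholder backends; none of these names are settled, it's only the shape of the dispatch.

```rust
// Stand-in trait: only `speak`, with a plain String error, for illustration.
trait Speak {
    fn speak(&self, text: &str) -> Result<(), String>;
}

// Placeholder backends; the real ones would wrap speech dispatcher / libespeak.
struct SpdProvider;
struct EspeakProvider;

impl Speak for SpdProvider {
    fn speak(&self, text: &str) -> Result<(), String> {
        println!("speech-dispatcher backend: {text}");
        Ok(())
    }
}

impl Speak for EspeakProvider {
    fn speak(&self, text: &str) -> Result<(), String> {
        println!("espeak-ng backend: {text}");
        Ok(())
    }
}

// As written, the draft trait's `Sized` bound, associated types and
// `init_speaker` returning `Self` would keep it from being used as a trait
// object, so an enum wrapper is one route to keep the rest of the screen
// reader backend-agnostic.
enum AnyProvider {
    SpeechDispatcher(SpdProvider),
    Espeak(EspeakProvider),
}

impl AnyProvider {
    fn speak(&self, text: &str) -> Result<(), String> {
        match self {
            AnyProvider::SpeechDispatcher(p) => p.speak(text),
            AnyProvider::Espeak(p) => p.speak(text),
        }
    }
}
```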
@TTWNO
Member

TTWNO commented Nov 8, 2022

I think this method severely limits what we can do with speech-dispatcher, given that we are (slowly) moving to an SSIP-based library. It would restrict a lot of the functionality which I expect to become standard in most screen readers.
I'm fine with an espeak-ng backend, but we don't need generic speech capabilities... that's what speech-dispatcher is for.

So having a backup where, without speech-dispatcher support, we just speak directly to espeak with few extra modifications is fine: perhaps an enum in State containing one of the two speech backends, so that "change punctuation", "change rate", "change pitch", etc. can happen on either backend. To me this is sufficient as a backup strategy, if you want to implement it.

@TTWNO TTWNO closed this as completed Nov 11, 2022
@albertotirla
Member Author

I don't think this issue should be closed just yet, since not everything is ready for implementation; there are things to iron out before I can start implementing.
First, what do you mean by it being limiting? Again, this isn't final until I put it in the code, but what else would the trait need to do to be compliant with how screen readers should use speech, in your opinion? I thought of methods to register functions for when speech events happen, on_pause, on_start, on_speech_break and so on (see the sketch below); that's why I put this up as an issue, to make it better than it is now, or replace it entirely if it's busted, but I need more feedback than what you've currently provided.
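
To make that idea concrete, here's a rough sketch of what the event-hook methods could look like; the callback alias and the method names are placeholders, not part of the draft trait above.

```rust
// Rough sketch of the callback-registration idea; nothing here is settled.
type SpeechCallback = Box<dyn Fn() + Send + Sync>;

pub trait SpeechEvents {
    // Register a function to be run when the backend reports the given event.
    fn on_start(&mut self, callback: SpeechCallback);
    fn on_pause(&mut self, callback: SpeechCallback);
    fn on_speech_break(&mut self, callback: SpeechCallback);
}
```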
Second, why limit it to espeak-ng or speech dispatcher? What if someone wants to make something very specific that can't be made with spd, such as working with a synth that doesn't have a module and only ships a dynamic library, without a command line program? Furthermore, spd is known to bug out with pipewire most of the time, and it delivers audio in a very inefficient way because it uses the pa_simple library, which does everything synchronously; that's also why priority issues in most cases end up causing one program to steal speech from everything else, and so on. There's a stream mixing problem, a threading and synchrony problem, and a non-existent pipewire support problem at play here. Eventually there should probably be another speech server made to address these issues, but I thought that in the meantime we could roll with this, and have a different backend with espeak-ng where we manipulate samples directly, add pipewire support, async stream support and so on; that's why I designed this interface.
Next, what do you mean by the enum of speech dispatcher and espeak-ng as backends? If we do that, won't spd and espeak configuration kind of mingle in unpleasant ways? Wouldn't it be better for the speak function to accept either backend, instead of us defining functions for each one, with or without a type? We want to support the same set of primitives for both backends, so why not incorporate that in a trait and generalise it, with each backend having its own configurator type where it can specify what values it needs for configuration, including punctuation, volume, rate and all that, even if we aren't going to make it that general? I think a better idea would probably be to perfect this trait; however, I would like it if you could explain the thing with the enum and configurations better, including how we would call backends to speak, how we would configure each, preferably separately from one another, how we would abstract them in the code that actually speaks, etc. Maybe we're actually thinking about the same thing, but I don't understand what you mean very well. Perhaps you could type up a blueprint, so to speak, of how you envision it looking, like what I did above with the trait?
Another thing: because of real life issues, and because I haven't found a libespeak crate on crates.io yet, I probably won't be able to implement the espeak backend for some time. However, once the trait is polished enough that we won't have to revisit it for a while, I could implement it for speech dispatcher, prepare the configuration to instantiate the right type based on the values found in the file and on what the configurator type declared by the backend expects, and then come in later and just slot in espeak: make its configuration type loadable with the mechanism designed before, handle the pipewire stuff, and so on. In that case, I wouldn't have to implement espeak in the same couple of days, risk falling far behind with the branch because I couldn't finish espeak, or risk delivering an unusable screen reader to main. Either I didn't understand the enum approach well, or it would force me to design the espeak-ng backend as well before anything would compile. Also, how do we handle errors? How do we know something crashed, so that we can switch to espeak-ng, if the interface to the two backends isn't well defined, with good Result return types?

@albertotirla albertotirla reopened this Nov 11, 2022
@TTWNO
Member

TTWNO commented Nov 18, 2022

With the hopeful rise of something like AccessKit, which should enable Rust programs to talk to native accessibility APIs from different platforms, perhaps the idea of a generic speech interface is not out of the question.

That said, the answer to this is (probably) another speech server which implements SSIP, not a direct linking with libespeak and making our entire speech system abstract.

I believe the correct solution for this is still to have libespeak as a backup, but also as the only backup. Interfacing with different speech systems shouldn't be necessary as long as they implement at least a subset of the SSIP protocol.

So, a generic speech struct is fine, as long as it only supports two enums. I don't want Odilia trying to support Eiffel and pico speech engines as well. This falls fairly outside the scope of this project; perhaps an additional crate could be created to deal with this... but again, PRs to upstream speech engines to implement SSIP would still be preferred.

So I like the idea, I just think it may fall almost entirely outside of Odilia's purview. If anybody happens to know how Orca handles this, I'd be interested.

You're right, I shouldn't have closed the issue.

@DataTriny
Contributor

If I may add something here: the thing that annoys me the most with speech-dispatcher is that you can't tweak settings that are specific to a particular synthesizer. When using espeak-ng for instance, you can't set the rate multiplier setting (I don't exactly remember the name), and so even with the speech rate set to 100% I still find it painfully slow.
With that being said, I totally agree with @TTWNO. Sure, being able to pick amongst multiple speech backends is nice, but if the standard option is good enough, then it's just redundancy. Honestly I have never encountered a major issue with speech-dispatcher, but I tend to avoid doing crazy stuff with my system.
I'd say it's the same argument for AccessKit: when Odilia needs a GUI, will you implement your own UI framework and write your own AT-SPI provider? Probably not.
speech-dispatcher already exists, use it! If there are bugs, let's work together to fix them. If gradually replacing some parts with ones written in Rust really brings advantages, let's convince people of that. I don't think this piece of software has major flaws that could not be addressed. I am pretty sure most of the sound issues come from the speech synthesizers themselves.

@albertotirla
Member Author

With the hopeful rise of something like AccessKit, which should enable Rust programs to talk to native accessibility APIs from different platforms, perhaps the idea of a generic speech interface is not out of the question.

Indeed; however, if we're talking about other platforms and not only covering Linux fragmentation, we have tts-rs, which abstracts what we want well enough. The only problem is that on Linux it doesn't integrate with anything but speech dispatcher.

That said, the answer to this is (probably) another speech server which implements SSIP, not a direct linking with libespeak and making our entire speech system abstract.

Well, if we want espeak to be the backup, we probably have to either link with libespeak (the ng variant these days, I suppose) or statically include it in the binary. As of now we can't require many things to implement SSIP on their own; we should have a very fault tolerant speech server instead. This falls very neatly in line with my comment about remaking the speech server and making it right this time, with asynchronous streams, compatibility with newer audio backends and so on, see above for context.

I believe the correct solution for this is still to have libespeak as a backup, but also as the only backup. Interfacing with different speech systems shouldn't be necessary as long as they implement (at least a subset) of the SSIP protocol.

The problem here is that you can't require all speech engines to be a speech server and implement SSIP; that'd basically be Wayland all over again. Plus, if there are multiple speech sockets, how would Odilia see them? How would it connect to them, and to which should it connect? The ideas behind speech dispatcher are good and needed, however a screen reader should always have an emergency fallback mode, which can be activated either when the engine reports an error as speak is called, or simply when the user presses a panic key command that basically tells the screen reader "help! I have no speech!" (a rough sketch of that fallback flow follows below).
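
A rough, self-contained sketch of that fallback flow; the error type and the closure-shaped backends are stand-ins only, not a real API.

```rust
// Stand-in error type for illustration.
struct SpeakError(String);

fn speak_with_fallback(
    primary: impl Fn(&str) -> Result<(), SpeakError>,
    emergency: impl Fn(&str) -> Result<(), SpeakError>,
    text: &str,
) -> Result<(), SpeakError> {
    // Try the configured backend first; if it reports an error (or the user
    // hits the hypothetical panic key, not modelled here), drop to the backup.
    primary(text).or_else(|_| emergency(text))
}
```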

So, a generic speech struct is fine, as long as it only supports two enums...

Can you mock out an implementation of that? I don't understand what you described well enough to implement it, sorry. Ideally it should also show how the flow of changing implementations would go, whether that's configurable in the config file, and so on.

So I like the idea, I just think it may fall almost entirely outside of Odilia's purview. If anybody happens to know how Orca handles this, I'd be interested.

Well, simple: it relies entirely and solely on speech dispatcher. In comparison, tdsr, fenrir, or both, have an abstraction that allows them to rely on spd, espeak, or anything that can be called through the command line, via a wrapper. While there might still be various soundcard issues and whatever else preventing speech, spd is no longer the only source of truth, which means that if there's anything wrong with it and it alone, the backup engine(s) should provide speech, allowing the user to fix the issue if it's fixable. Also, I think Odilia shouldn't only work on the desktop, which means it should also be able to scale down to a command line/terminal screen reader mode if atspi errors out. Isn't speech dispatcher too heavy for a tty-only environment like, say, BSD images or the Arch live ISO shell?

@DataTriny
Contributor

It's true that Orca solely relies on speech-dispatcher. It has an abstraction layer though.

@TTWNO
Member

TTWNO commented Nov 18, 2022

@albertotirla

trait SpeechBackend {
  ...
}
impl SpeechBackend for SSIPBackend { ... }
impl SpeechBackend for ESpeakBackend { ... }
enum SpeechBackendType {
  SSIPCompatible(SSIPBackend),
  EspeakNG(ESpeakBackend),
}

So, only two variants will be supported. One which supports SSIP (for now only speech dispatcher), the other one which directly interfaces with libespeak for redundancy.

The syntax obviously won't compile, but you get the idea.

@albertotirla
Member Author

@albertotirla

trait SpeechBackend {
  ...
}
impl SpeechBackend for SSIPBackend { ... }
impl SpeechBackend for ESpeakBackend { ... }
enum SpeechBackendType {
  SSIPCompatible(SSIPBackend),
  EspeakNG(ESpeakBackend),
}

So, only two variants will be supported. One which supports SSIP (for now only speech dispatcher), the other one which directly interfaces with libespeak for redundancy.

The syntax obviously won't compile, but you get the idea.

Why is the speech backend only a marker trait, so to speak? We need some way to unify those backends, so that we can place values for them in config files, speak through the trait itself, etc. Or, if that behaviour is not desired, then why have a trait at all? Also, how would we represent backend specific configuration in the config file? Could we derive Serialize and Deserialize on the enum and its types, put the config in the types, and have it work? (A rough sketch of that option is below.) What about backend agnostic settings? Or is espeak considered strictly a fallback backend that isn't meant to be very configurable, since it's for emergencies only?
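
For the configuration question, here is a rough sketch of the derive-Serialize/Deserialize route, assuming serde and a toml config file; every struct, field and variant name here is illustrative only, not an agreed design.

```rust
use serde::{Deserialize, Serialize};

// Illustrative per-backend configuration types.
#[derive(Serialize, Deserialize)]
struct SsipConfig {
    socket_path: Option<String>,
    rate: i8,
    punctuation: String,
}

#[derive(Serialize, Deserialize)]
struct EspeakConfig {
    voice: String,
    rate: u32,
}

// The internal tag selects the active backend, so a single [speech] table in
// the config file can carry either variant's settings.
#[derive(Serialize, Deserialize)]
#[serde(tag = "backend", rename_all = "kebab-case")]
enum SpeechBackendConfig {
    SsipCompatible(SsipConfig),
    EspeakNg(EspeakConfig),
}
```

In toml that would look roughly like `backend = "espeak-ng"` followed by that variant's fields in the same table, and backend-agnostic settings could live in a separate struct next to it.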

@albertotirla
Member Author

If I may add something here: the thing that annoys me the most with speech-dispatcher is that you can't tweak settings that are specific to a particular synthesizer. When using espeak-ng for instance, you can't set the rate multiplier setting (I don't exactly remember the name), and so even with the speech rate set to 100% I still find it painfully slow.

Indeed, ideally some module-specific configuration should be allowed, since some synth parameters are specific to that synth, e.g. rate boost in espeak. That could probably be fixed in spd with some kind of configuration architecture overhaul, modules needing updating and so on, but it's doable.

With that being said, I totally agree with @TTWNO. Sure, being able to pick amongst multiple speech backends is nice, but if the standard option is good enough, then it's just redundancy. Honestly I have never encountered a major issue with speech-dispatcher, but I tend to avoid doing crazy stuff with my system.

Have you used pipewire with spd recently? Maybe it's only in my case, I don't know, but it introduces a lot of stuttering on my system. At the start everything is good, but gradually, as spd keeps running, the xruns it usually causes, which the pipewire team had to account for specially after some bugs were filed in the past, affect running apps to the point that voice calls on Matrix or whatever VoIP solution I'm using, even videos or music, become so choppy I can't understand anything. Usually, restarting pipewire, turning off Orca, then turning it back on after enough seconds have passed for spd to shut itself down fixes it for a while. I asked in the pipewire Matrix room and they told me other people are having issues with spd as well; they also say it's because audio is processed synchronously. I would add the fact that it creates a stream per module, which isn't necessarily good, and on top of that its priority system is just: cut the previously speaking stream, give the other one highest priority, then allow for the possibility that the first app, even if it's a screen reader, never gets the priority to speak again; I've seen that happen. Then, there's the incident discussed on the audiogames.net forum, where a person installed Arch and everything worked except speech dispatcher (not sound in general, just speech dispatcher).

I'd say it's the same argument for AccessKit: when Odilia needs a GUI, will you implement your own UI framework and write your own AT-SPI provider? Probably not. speech-dispatcher already exists, use it! If there are bugs, let's work together to fix them. If gradually replacing some parts with ones written in Rust really brings advantages, let's convince people of that. I don't think this piece of software has major flaws that could not be addressed. I am pretty sure most of the sound issues come from the speech synthesizers themselves.

First, I agree with what you said about the user interface part; we'd probably use GTK, though I'm dreading that day, because it's going to be interesting to make the UI in such a way that it's generated from and depends on what's in the config files, but that's another discussion for another day. About contributing to speech dispatcher: first, it's written in C, which I don't think I know well enough; second, it uses autotools, which is kind of hard to get right even for a Unix master, due to the intricacies and complexities of that system; plus most of the architecture would probably have to be rewritten if we want asynchronous streams, per-module mixing, sane priority systems and all that goodness, though we can always take it incrementally. About it being only the synths: in the modules section of the spd reference, it says modules are encouraged to send spoken text back to spd as samples, which means spd then plays them. For a quick and dirty test, try speaking "hello world!" with espeak-ng, voice en-us+max, versus doing it through spd-say. The first version sounds like normal espeak, like it sounds on Windows and Android, while through spd it sounds compressed for some reason, so spd is surely doing something with the data before sending it to... pulse, I guess.

@TTWNO
Member

TTWNO commented Nov 18, 2022

or is espeak considered strictly a fallback backend and it's not to be much configurable since it's for emergency issues only?

Exactly.

@TTWNO TTWNO added the enhancement and help wanted labels Nov 23, 2022
@TTWNO TTWNO changed the title speech engine abstraction needed Speech Abstraction Layer Nov 28, 2022
@francois-caddet
Contributor

If we already have a good crate (tts-rs) abstracting the speech server in Rust, why not contribute to it? If we want a direct espeak backend, why don't we add it to tts-rs?
I already tried to make a few contributions to tts-rs when I was learning Rust. The author was really open to ideas for improvement.

@TTWNO
Member

TTWNO commented Feb 26, 2023

This would be great!

I don't know enough about tts-rs to know if we can use it with SSML, but generally I like contributing upstream, and this seems pretty reasonable.

@francois-caddet
Contributor

I'm not sure, but I think it was at least in their plans. We would have to see the current state, and whether it's a feature they are interested in if it's not already supported. As for the backends they use, I'm almost sure that at least the desktop ones all have this feature.
I can investigate in this direction if you want and come back here when I know more.

@albertotirla
Member Author

albertotirla commented Feb 26, 2023

The problem with tts-rs is that all its functions are synchronous, which means we'd have to use spawn_blocking on tokio in order to do speech quasi-asynchronously. We tried that some time ago, and that's why we moved away from the tts-rs creator's speech-dispatcher binding: it presented problems when used outside the simplest "take this piece of text and speak it" scenario. For example, there was an issue with being unable to stop speech after pressing the ctrl key. This is because, as you know, Odilia's architecture, while not really hardcore well defined, is mostly based around the "everything is an async stream" philosophy. While that allowed us to use the systems we have more efficiently, the problem is that the codebase as it currently stands isn't prepared for synchronous code, especially not in larger quantities or at the points where systems meet, which in this case happened to be the speech layer.

So what happened is that the speech function either blocked until speech was finished, or spawned a blocking task; because the screen reader was able to continue processing events, I expect the latter. In any case, a thread somewhere in the application was blocked until the speak function returned, i.e. until libspeechd returned, i.e. until the speech dispatcher server sent the "speech finished"/OK message. So when the ctrl key was pressed, the screen reader got it, but it couldn't stop the speech. Aborting the task did nothing, since the message was still in flight. Sending a cancel request, which is what we ended up doing, didn't work either, probably because libspeechd has its own queue per connection, I think, so the cancel message we sent stayed in that queue until the speak message got its reply, but by then it was already too late, since the server would probably have replied with "nothing to cancel" or something like that.

With async, the speak message is sent via normal means, but then the future returns Poll::Pending and uses epoll_wait to register a waker that signals interest in the socket becoming readable, in other words in the server having replied with something. While that task is parked, one can just send another message over the same socket, which reaches the server without conflicting with the previous one, simply because spd messages are rarely made to conflict. In any case, if the second message, in this case cancel, is received before the first one has finished speaking, the server gets it, cancels, then sends an OK response, at which point the second future is woken from parking and probably immediately returns Poll::Ready(T). Now, since the speech is cancelled, the server would probably answer the previous request as well, even if only with a "speech cancelled" message. The first future would wake at last, let's suppose by just returning Poll::Ready(Err(SpeechCanceledError)), which would then bubble up until it reaches some logging instruction or another, etc. No, our implementation doesn't make a new socket connection for every request, but the server currently answers every speak message, even when they are interleaved with other messages as described above; that's why it works.

I understand why the creator of tts-rs doesn't want to introduce async, which is why I'm not just opening issues left and right. As I'm sure you know, async requires an executor, and many async crates won't work across executors because the ecosystem is fragmented, as if an additional big runtime blob and many compile-time dependencies weren't bad enough.
If you don't have to do anything with async in the rest of your application, of course that's going to feel like bloat, especially when it comes to speech, and even more so because Linux is the one platform where async makes the most sense; on most other platforms the calls are synchronous after a fashion anyway, since it's all FFI. So yes, from that point of view, not including async makes a lot of sense, so unfortunately we had to build our own systems. Also, speech-dispatcher-rs has an issue in my view: it depends on linking with a C library, which implies all kinds of pain when trying to compile cross-architecture, so the more pure Rust this is, the better for everyone involved. So yeah, currently everything, to my knowledge, is pure Rust; that will probably change when it comes to linking with libespeak, which is probably not going to be very pleasant when we get there. Async will be more of a problem there too, but we'll see what we can do then.
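
To illustrate the interleaving described above, here is a small self-contained sketch where the SSIP server is mocked with a channel; everything in it (the request enum, the client shape, the mock protocol) is a stand-in used only to show how a cancel can interleave with a pending speak on one connection, not our real client.

```rust
use std::time::Duration;
use tokio::sync::{mpsc, oneshot};

// Stand-in for SSIP messages on one connection.
enum Request {
    Speak(String, oneshot::Sender<Result<(), &'static str>>),
    Cancel,
}

#[derive(Clone)]
struct Client {
    tx: mpsc::Sender<Request>,
}

impl Client {
    // Send SPEAK, then park until the server replies; no thread is blocked.
    async fn speak(&self, text: &str) -> Result<(), &'static str> {
        let (reply_tx, reply_rx) = oneshot::channel();
        if self.tx.send(Request::Speak(text.into(), reply_tx)).await.is_err() {
            return Err("server connection closed");
        }
        reply_rx.await.unwrap_or(Err("server dropped the request"))
    }

    // Send CANCEL on the same connection, even while a speak is pending.
    async fn cancel(&self) {
        let _ = self.tx.send(Request::Cancel).await;
    }
}

// Mock "server": keeps the pending speak around and resolves it as cancelled
// when a Cancel request arrives on the same connection.
async fn mock_server(mut rx: mpsc::Receiver<Request>) {
    let mut pending: Option<oneshot::Sender<Result<(), &'static str>>> = None;
    while let Some(req) = rx.recv().await {
        match req {
            Request::Speak(_, reply) => pending = Some(reply),
            Request::Cancel => {
                if let Some(reply) = pending.take() {
                    let _ = reply.send(Err("speech cancelled"));
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(8);
    tokio::spawn(mock_server(rx));
    let client = Client { tx };

    let speak = client.speak("a long utterance");
    let cancel = async {
        // Simulate the user pressing ctrl shortly after speech started.
        tokio::time::sleep(Duration::from_millis(50)).await;
        client.cancel().await;
    };
    // Both futures share the one connection; the cancel resolves the parked speak.
    let (spoke, ()) = tokio::join!(speak, cancel);
    println!("speak result: {spoke:?}");
}
```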
