Generic TTS audio streaming feature #4706
Sorry, I misread some of your text before and went off on a tangent! One thing I would like to suggest, if you can process ahead, is that in low VRAM situations voice generation is slow. It would be great to have the option to use the CPU and system RAM for generation. In some very low VRAM scenarios I've found CPU generation is about on a par with GPU generation, sometimes even a little faster (obviously people's mileage will vary depending on their CPU). By low VRAM I mean that loading a 13B model on a 12GB card uses between 11.4GB and 11.7GB before you even start thinking about doing TTS. It does still work, but GPU generation in that kind of scenario goes from 10-20 seconds per generation up to 60-120 seconds. TTS on CPU (8 cores / 16 threads) seems to come in around the 50-120 second mark, so if processed alongside the text generation it would shorten the wait before you hear something.
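For what it's worth, the kind of toggle I'm thinking of is roughly this (just a sketch using Coqui's Python API; the `prefer_cpu` flag is made up for illustration, not an existing text-generation-webui setting):

```python
import torch
from TTS.api import TTS  # Coqui TTS package

def load_tts(prefer_cpu: bool = False) -> TTS:
    # Fall back to CPU + system RAM when the LLM has already used up the VRAM.
    device = "cpu" if prefer_cpu or not torch.cuda.is_available() else "cuda"
    return TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
```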
I have a few other thoughts on this.
I have, however, possibly hit on a decent performance gain for people in low VRAM situations: #4712
Average processing time for the whole paragraph: 42.5 seconds (before you hear anything spoken).
But as long as the RTF is under 1, it doesn't matter by what percentage the "chunk inference" is slower than the "normal" inference, as playback will take longer anyway. With a GPU and DeepSpeed the TTS RTF was 0.34 in my test with Coqui XTTS (Silero or Piper are even faster), so there is plenty of free processing time available to use. I do understand that it is not the same for everyone, but so is everything regarding AI: some have the hardware to take advantage of all the options and bigger models and some don't. As with almost all features, it should be optional for sure, so it won't affect users who can't use it and everyone else gets a much better experience.
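To make the RTF argument concrete, here's a rough sketch of the calculation (the numbers are only an illustration of an RTF of 0.34, not a benchmark):

```python
# RTF (real-time factor) = synthesis time / duration of the generated audio.
# If RTF stays under 1, the next chunk finishes synthesizing before the
# current chunk finishes playing, so chunked inference never stalls playback.
def rtf(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# e.g. 8.5 s to synthesize 25 s of audio at 24 kHz:
print(rtf(8.5, 25 * 24_000, 24_000))  # 0.34
```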
100% completely agree! I just thought it was worth adding a few thoughts around it so that if anything gets developed, everyone's needs can be covered and we can choose which flavour works for us personally.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
In short: it's about TTS generation and audio playback during text generation instead of waiting until the whole response is generated by the LLM.
"In longer": I'd love to see a feature which makes it possible to support audio streaming in text-generation-webui, as it would increase response time especially for longer answers and if it would be a generic solution, it could potentially solve the problem with tts engines not being able to handle long text inputs. However I think there needs to be some essential changes to the way streaming works in text-generatio-webui, as well as how the returned audio is handled. I think the changes required could potentially be utilized by all tts engines. The following things are important in my opinion and from a very high level view (as I'm not a dev):
1. streaming mode:
text-generation-webui's streaming mode has nothing to do with the TTS engine being able to stream the response audio. In fact, text-generation-webui's streaming mode is incompatible with all TTS engines as far as I can tell, which is why it is necessary to disable streaming mode when using TTS. The reason is that you can't feed the TTS engines individual words while they are being produced by the LLM, so we need to wait at least until the first sentence is finished. For very short sentences like "Ok!" it is most likely better to accumulate more sentences until the input text reaches a predefined amount of tokens (a rough sketch of this chunking is below).
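Something along these lines is what I have in mind (a minimal sketch, assuming the LLM output arrives as a token stream; the regex sentence split and the `min_chars` threshold are just stand-ins for a real tokenizer-based cutoff):

```python
import re
from typing import Iterable, Iterator

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentence_chunks(token_stream: Iterable[str], min_chars: int = 60) -> Iterator[str]:
    """Accumulate streamed tokens and yield sentence-sized chunks for the TTS engine."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        if len(parts) > 1:
            # Everything except the last part is made of complete sentences.
            complete, buffer = " ".join(parts[:-1]), parts[-1]
            if len(complete) >= min_chars:
                yield complete
            else:
                # Too short (e.g. "Ok!"): keep accumulating before calling TTS.
                buffer = complete + " " + buffer
    if buffer.strip():
        yield buffer.strip()
```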
2. calling the TTS engine / handling the returned audio:
There are at least two ways to hand the input text over to the TTS engine, depending on whether the engine supports streaming or not. If it does not (which would be the generic approach), text-generation-webui just calls the TTS engine once per chunk of sentences. This yields multiple audio files that need to be played back in succession AND the playback control needs to support that, as the user should still be able to pause/play/download the wav file (I don't think it's a good idea to add a separate playback control for each returned audio). There also needs to be some queuing before the TTS engine is called and after the audio is returned, so sentence chunks and audio are processed in the correct order. Also, if the audio responses are to be concatenated in text-generation-webui, it must be made sure to write the wav header only once and append the audio chunks without further headers, to get one continuous audio response (a sketch of this is below). If the TTS engine does support streaming, text-generation-webui needs to support byte streaming, which has its own challenges, for example buffering to make sure playback doesn't get ahead of the inference stream, and also playback control for an audio stream in the UI etc. (gradio must support that: gradio-app/gradio#5160)
As an example, the coqui space on huggingface does it more or less like that: https://huggingface.co/spaces/coqui/voice-chat-with-mistral
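For the concatenation part, this is roughly what I mean by "one header, then only frames", using Python's standard wave module (a sketch, assuming every chunk comes back with the same sample rate, sample width and channel count):

```python
import wave

def concatenate_wavs(chunk_paths: list[str], out_path: str) -> None:
    # Write the wav header once (from the first chunk), then only append raw frames,
    # so the result plays as one continuous audio response.
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(chunk_paths):
            with wave.open(path, "rb") as chunk:
                if i == 0:
                    out.setparams(chunk.getparams())
                out.writeframes(chunk.readframes(chunk.getnframes()))
```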