Feature request: Support Piper voices in the custom voice backend #375
There are many open-source AI TTS models now, and they're all capable of very high-quality output. I hope someone will soon package one of these models into a readily distributable form that end users can download and install. They'll need machines with beefy GPUs, but it's the future. It's on my R&D todo list, but I haven't had time to dig in at the moment. Riva is a good start.
I'm excited about Piper because it sounds better than Riva's English and Karim's CoquiTTS Persian model while seemingly taking negligible resources; it's not even using my GPU. The models are in the ONNX format, which I think may have something to do with it. The Docker image for Riva is huge and not easily shippable, requires CUDA, and loads 10GB into VRAM! Piper, on the other hand, actually sounds better and presents as a plain Python program with an ONNX file alongside it. I haven't dug into the details, but that future is here; it just needs a user-friendly way to ship. It may already exist or be in the works, I don't know. Should I find myself working on this I'll let you know. I expect to need it soon, and I'll try my best to make a sensible and documented design.
You're right, the Piper models are quite small, which means they don't need much computing power at all. This is pretty cool. It would be nice if there were a Windows executable of Piper. Then we could write a Windows app that automatically downloads voices as they're needed and runs a local HTTP server to service TTS requests, similar to how our Riva implementation works right now. Figuring out how to compile the Piper code on Windows would be a first step. We could run their Linux executable via WSL, but to make it readily accessible to the average user, it would have to be a Windows program that they can click and install.
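To make the idea concrete, a minimal local HTTP server that shells out to the piper executable might look like the sketch below. This is only a sketch of the same design shape as the Riva wrapper, not actual Read Aloud code; the `--model`/`--output_file` flags and the voice filename are assumptions to verify against your Piper build.

```python
# Sketch: local HTTP TTS server wrapping the piper CLI.
# Assumed piper flags: --model, --output_file (check `piper --help`).
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL = "en_US-voice.onnx"  # hypothetical voice file name

def piper_command(model, out_path):
    # piper reads the text to synthesize from stdin
    return ["piper", "--model", model, "--output_file", out_path]

class TtsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Body of the POST request is the plain text to speak
        text = self.rfile.read(int(self.headers["Content-Length"]))
        with tempfile.NamedTemporaryFile(suffix=".wav") as wav:
            subprocess.run(piper_command(MODEL, wav.name),
                           input=text, check=True)
            audio = wav.read()
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.end_headers()
        self.wfile.write(audio)

def serve(port=8912):
    # Bind to localhost only, like the Riva wrapper does
    HTTPServer(("127.0.0.1", port), TtsHandler).serve_forever()
```

The extension would then POST text to `http://127.0.0.1:8912` and play back the returned WAV, the same way the Riva integration talks to its local server.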
Yeah, good points. Following this section https://github.com/rhasspy/piper?tab=readme-ov-file#people-using-piper leads (through NVDA, the NonVisual Desktop Access screen reader) to a Rust implementation of Piper, https://github.com/mush42/sonata, which sounds like it may be a good path to a Windows executable for Read Aloud users. That said, shipping a locally bundled Python the way apps like Sabnzbd do is a well-trodden path, complete with cross-platform niceties we'd want such as a built-in tray icon and a local HTTP server. I should also point out that Piper is building Windows binaries here https://github.com/rhasspy/piper/releases/tag/2023.11.14-2, which the Persian voice project instructs users to run via the command line.
Well, the Windows build of Piper works great: it takes about 5 seconds to generate 60 seconds of audio on my mini-PC with an i5 and no GPU. So now we can build a wrapper app around it, provide an app manifest so extensions can talk to it via native messaging, and implement voice selection, automatic model download, and inference. It's all possible.

More than half of Read Aloud users, however, use ChromeOS. The best thing we can do, if possible, is to run inference right in the browser. Since ONNX has a JavaScript runtime, we simply need to implement the Piper processing pipeline in JS. From Piper's code, it appears a model accepts either phonemes or codepoints as input. For phonemes the pipeline is: (1) convert text to phonemes using the eSpeak phonemizer, (2) translate phonemes to IDs using a lookup table, (3) pass the phoneme IDs to the model for inference. For codepoints, simply break the text into codepoints and pass them to the model. Either that, or we compile Piper's C++ code into WebAssembly.

Once that's done, Piper can be distributed as part of our extension and will run on any modern browser/OS. Our extension lets the user select a voice, downloads the voice model, and stores it on the user's computer using IndexedDB or OPFS. It then loads the model, uses it to generate PCM audio data, and wraps the audio data in a WAV blob for playback. It's a lot of work, but it will make AI TTS accessible to every end user.
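Step (2) of the phoneme pipeline can be sketched as a pure function. The id map below is a toy; a real Piper voice ships its own map in the JSON config next to the .onnx file, and the convention of interleaving a pad symbol between phonemes (with BOS/EOS markers) is my reading of Piper's code, so verify it against the source:

```python
# Sketch of pipeline step (2): phonemes -> model input IDs.
# Symbols assumed from Piper's conventions: "_" pad, "^" BOS, "$" EOS.
PAD, BOS, EOS = "_", "^", "$"

def phonemes_to_ids(phonemes, id_map):
    """Map a phoneme sequence to IDs, interleaving the pad symbol
    between phonemes and bracketing with BOS/EOS (assumed convention)."""
    ids = [id_map[BOS]]
    for p in phonemes:
        ids.append(id_map[p])
        ids.append(id_map[PAD])
    ids.append(id_map[EOS])
    return ids

# Toy id map for illustration only; real values come from the voice config.
toy_map = {"_": 0, "^": 1, "$": 2, "h": 10, "ə": 11, "l": 12, "oʊ": 13}
print(phonemes_to_ids(["h", "ə", "l", "oʊ"], toy_map))
# → [1, 10, 0, 11, 0, 12, 0, 13, 0, 2]
```

Porting this to JS is trivial; the hard parts are step (1), the eSpeak phonemizer (which would itself need a WASM build), and step (3), running the ONNX model via onnxruntime-web.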
@ken107 very cool proposal! I'm curious to see how it will play out if you decide to do it. For the time being, on my Linux machine, I am able to use Piper TTS with Read Aloud by customizing the speech-dispatcher configuration. It's not a very difficult process:
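For anyone wanting to try the same, a speech-dispatcher generic-module setup along these lines should work. The paths, the voice filename, and the piper flags (notably `--output-raw`) are assumptions to check against your own install and `piper --help`:

```
# ~/.config/speech-dispatcher/speechd.conf  (excerpt)
AddModule "piper-generic" "sd_generic" "piper-generic.conf"
DefaultModule piper-generic

# ~/.config/speech-dispatcher/modules/piper-generic.conf
# GenericExecuteSynth pipes the text ($DATA) through piper and plays
# the raw PCM stream (sample rate depends on the voice model).
GenericExecuteSynth "printf %s '$DATA' | piper --model ~/piper/en_US-voice.onnx --output-raw | aplay -r 22050 -f S16_LE -t raw -c 1"
AddVoice "en" "MALE1" "en_US"
```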
I also submitted a PR that would make it really nice to use Piper TTS with the Read Aloud extension: #376. Currently, there's a bit of a problem where the beginning of each paragraph takes a long time to start speaking, but that PR adds an option to essentially merge all paragraphs into one, and Piper TTS is much better at handling that.
That PR is not necessary, as Read Aloud has the ability to prefetch the speech for the next chunk while it's reading the current chunk. In the tts-engine.js file, you can see that all the cloud-based engines perform prefetching. Your Piper voices are provided by the OS, so Read Aloud uses the BrowserTtsEngine, which doesn't prefetch. To do it correctly, we'd need to add a tts-engine implementation for Piper.
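The prefetch pattern described here is simple to illustrate. This is not Read Aloud's actual tts-engine.js code, just a Python sketch of the idea: while chunk N is playing, chunk N+1 is already being synthesized in the background.

```python
# Sketch of the prefetch pattern: synthesize the next chunk while the
# current one is being played, so playback never waits on synthesis.
from concurrent.futures import ThreadPoolExecutor

def play_all(chunks, synthesize, play):
    """synthesize(text) -> audio; play(audio) blocks until done."""
    if not chunks:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_audio = pool.submit(synthesize, chunks[0])
        for i in range(len(chunks)):
            audio = next_audio.result()  # wait for the prefetched chunk
            if i + 1 < len(chunks):
                # kick off synthesis of the next chunk before playing
                next_audio = pool.submit(synthesize, chunks[i + 1])
            play(audio)
```

A Piper tts-engine implementation would do the same thing: fire the synthesis request for the next utterance as soon as the current one starts playing.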
@kfatehi It's been some intense work weeks, but it's finally here. I'll publish Read Aloud 2.9.0 today with Piper voices integrated. I'll also publish a separate "Piper TTS" extension that makes Piper voices available to the browser and other extensions. The full source code will be released soon as well. The Piper project is pretty awesome. Thanks for bringing it to my attention.
I searched the repo for Piper and did not see any mention of it. I just discovered Piper, which has very natural-sounding local voices in many languages, including the challenging Persian.
Here is a link to their samples demo page https://rhasspy.github.io/piper-samples/
I would love to see these voices supported within Read Aloud. Time permitting, I should be able to perform this lift within the existing custom Python backend through which local NVIDIA Riva voice support was merged some time ago, especially to bring support for multiple languages and the backend implementations thereof. For example, the Persian voice is a custom-trained Piper voice that exists as a separate Python application with the model loaded. I would need to understand how best to make this plug generically into such a system, and that will take a fair amount of time.
This particular lift would probably close #217 and kfatehi/persian-tts-server#1, but it will be tricky to document. I am already somewhat confused by my own docs, having come back to this to use Riva after many months without using TTS...