
Feature request: Support Piper voices in the custom voice backend #375

Closed
kfatehi opened this issue Feb 8, 2024 · 9 comments

@kfatehi
Contributor

kfatehi commented Feb 8, 2024

I searched the repo for Piper and did not see any mention of it. I just discovered Piper, which has very natural-sounding local voices in many languages, including the challenging Persian.

Here is a link to their samples demo page https://rhasspy.github.io/piper-samples/

I would love to see these voices supported within Read Aloud. Time permitting, I should be able to perform this lift within the existing custom Python backend (through which local Nvidia Riva voice support was merged some time ago), especially to bring support for multiple languages and backend implementations thereof. For example, the Persian voice is a custom-trained Piper voice that exists as a separate Python application with the model loaded. I would need to understand how best to make this plug generically into such a system, and it will take a bunch of time to understand.

This particular lift would probably close #217 and kfatehi/persian-tts-server#1, but it will be tricky to document. I'm already somewhat confused by my own docs, having come back to this to use Riva after many months without using TTS.

@ken107
Owner

ken107 commented Feb 8, 2024

There are many open-source AI TTS models now, and they're all capable of very high-quality output. I hope someone will soon package one of these models into something readily distributable that end users can download and install. They'll need machines with beefy GPUs, but it's the future. It's on my R&D to-do list, but I haven't had time to dig in at the moment. Riva is a good start.

@kfatehi
Contributor Author

kfatehi commented Feb 8, 2024

I'm excited about Piper because it sounds better than Riva's English and Karim's CoquiTTS Persian model while seemingly taking negligible resources; it isn't even using my GPU. The models are in the ONNX format, which I think may have something to do with it. The Docker image for Riva is huge and not easily shippable, requires CUDA, and loads 10 GB into VRAM! Piper, on the other hand, actually sounds better and presents as a plain Python program with an ONNX file alongside it. I haven't dug into the details, but that future is here; it just needs a user-friendly way to ship. That may already exist or be in the works, I don't know. Should I find myself working on this I'll let you know (I expect to need it soon), and I'll try my best to make a sensible and documented design.

@kfatehi
Contributor Author

kfatehi commented Feb 8, 2024

[screenshot: GPU utilization]
Nothing at all while playing the Persian voice through Piper on my small server with an Nvidia card in it.

Meanwhile, I need to use my gaming machine for Riva, which is taking up 9 GB; not to mention that installation is a nightmare and requires an Nvidia API account:

[screenshot: Riva VRAM usage]

@ken107
Owner

ken107 commented Feb 8, 2024

You're right, the Piper models are quite small, which means they don't need much computing power at all. This is pretty cool. It would be nice if there were a Windows executable of Piper. Then we could write a Windows app that automatically downloads voices as they're needed and runs a local HTTP server to service TTS requests, similar to how our Riva implementation works right now. Figuring out how to compile the Piper code on Windows would be a first step. We could run their Linux executable via WSL, but to make it readily accessible to the average user, it would have to be a Windows program that they can click and install.

@kfatehi
Contributor Author

kfatehi commented Feb 8, 2024

Yeah, good points. Following this section https://github.com/rhasspy/piper?tab=readme-ov-file#people-using-piper leads (through NVDA, NonVisual Desktop Access) to the Rust implementation of Piper, https://github.com/mush42/sonata, which sounds like it may be a good path to a Windows executable for Read Aloud users. That said, shipping a locally included Python, as apps like SABnzbd do, is a well-trodden path, complete with the cross-platform niceties we'd want like a built-in tray icon and a local HTTP server.

I should point out that Piper is building Windows binaries here, https://github.com/rhasspy/piper/releases/tag/2023.11.14-2, which the Persian voice project instructs users to run via the command line.

@ken107
Owner

ken107 commented Feb 9, 2024

Well, the Windows build of Piper works great; it takes about 5 seconds to generate 60 s of audio on my mini-PC with an i5 and no GPU. So now we can make a wrapper app around it, provide an app manifest so extensions can talk to it via native messaging, and implement voice selection, automatic model download, and inference. It's all possible.
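The native messaging link mentioned above has a simple wire format: each message is a 4-byte length prefix (native byte order, little-endian on typical platforms) followed by UTF-8 JSON. A minimal sketch of that framing in Python; the message fields (`method`, `text`, `voice`) are made-up examples, not an actual Read Aloud protocol:

```python
import io
import json
import struct

def write_message(stream, obj):
    """Frame a JSON message for native messaging:
    4-byte little-endian length prefix, then UTF-8 JSON."""
    data = json.dumps(obj).encode("utf-8")
    stream.write(struct.pack("<I", len(data)))
    stream.write(data)
    stream.flush()

def read_message(stream):
    """Read one length-prefixed JSON message; returns None on EOF."""
    raw_len = stream.read(4)
    if len(raw_len) < 4:
        return None
    (length,) = struct.unpack("<I", raw_len)
    return json.loads(stream.read(length).decode("utf-8"))

# Round-trip through an in-memory buffer (a real host would use stdin/stdout):
buf = io.BytesIO()
write_message(buf, {"method": "speak", "text": "hello", "voice": "en_US-amy-medium"})
buf.seek(0)
msg = read_message(buf)
```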

More than half of Read Aloud users, however, use ChromeOS. The best thing we can do, if possible, is to do inference right in the browser. Since ONNX has a JavaScript runtime, we simply need to implement the Piper processing pipeline in JS. It appears from Piper's code that a model accepts either phonemes or codepoints as input. For phonemes, the pipeline is: (1) convert text to phonemes using the eSpeak phonemizer, (2) translate phonemes to IDs using a lookup table, (3) pass the phoneme IDs to the model for inference. For codepoints, simply break the text into codepoints and pass them to the model.
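Step (2) of that pipeline, the phoneme-to-ID lookup, can be sketched as below. The ID values and the pad/BOS/EOS interleaving are illustrative assumptions, not Piper's actual tables; the real map ships in each voice's JSON config alongside the .onnx file:

```python
# Toy phoneme-to-ID step. The ids and the interleaving convention here are
# assumptions for illustration; a real voice's config defines its own map.
PAD, BOS, EOS = 0, 1, 2
PHONEME_ID_MAP = {"h": 20, "ə": 59, "l": 24, "oʊ": 83}

def phonemes_to_ids(phonemes):
    """Wrap the sequence with BOS/EOS and interleave a pad id between
    phoneme ids, producing the flat int sequence a VITS-style model takes."""
    ids = [BOS, PAD]
    for p in phonemes:
        ids.append(PHONEME_ID_MAP[p])
        ids.append(PAD)
    ids.append(EOS)
    return ids

ids = phonemes_to_ids(["h", "ə", "l", "oʊ"])
```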

Either that, or we compile Piper's C++ code into WebAssembly.

Once that's done, Piper can be distributed as part of our extension and will run on any modern browser/OS. Our extension lets the user select a voice, downloads the voice model, and stores it on the user's computer using IndexedDB or OPFS. It then loads the model, uses it to generate PCM audio data, and wraps the audio data into a WAV blob for playback.
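The WAV-wrapping step is just prepending a RIFF header to the raw PCM. A sketch in Python using the stdlib wave module (a browser version would write the same header into an ArrayBuffer); Piper's --output_raw emits 16-bit mono PCM, typically at 22050 Hz:

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate=22050, channels=1, sample_width=2):
    """Wrap raw 16-bit mono PCM in a WAV container for playback."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()

wav_bytes = pcm_to_wav(b"\x00\x00" * 100)  # 100 silent 16-bit samples
```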

It's a lot of work. But it will make AI TTS accessible to every end-user.

@guitarino

guitarino commented Feb 11, 2024

@ken107 very cool proposal! I'm curious to see how it will play out if you decide to do it

For the time being, on my Linux machine, I am able to use Piper TTS with Read Aloud by customizing the speech-dispatcher configuration. It's not a very difficult process:

  1. Install Piper TTS. Personally, I prefer installation via pip if you already have Python: pip install piper-tts
  2. Make sure you either already have a ~/.config/speech-dispatcher/speechd.conf file or create the user-specific speech-dispatcher configuration via spd-conf -uc
  3. You may need to restart your computer at this point
  4. Make sure you have sox installed by checking sox --version, and, if not, install it via your normal process, e.g. on Ubuntu / debian it's sudo apt install sox
  5. Download voices from the Piper voices repo and put them somewhere (make a note of where). If you don't download the entire repo, just make sure that the voices still follow the same folder structure, e.g. en/en_GB/jenny_dioco/medium/en_GB-jenny_dioco-medium.onnx
  6. Just make sure that at this point you can actually run Piper TTS from the command line
  7. Figure out where your Piper executable is by running which piper. Make a note of where it is
  8. If all goes well, now let's configure speech-dispatcher
  9. Modify ~/.config/speech-dispatcher/speechd.conf and add a few lines at the end that look like this:
    AddModule "piper" "sd_generic" "piper.conf"
    DefaultModule piper
    LanguageDefaultModule "en" "piper"
    LanguageDefaultModule "ru" "piper"
    
  10. In this speechd.conf, change the added LanguageDefaultModule lines to specify the languages you want to use; Piper TTS will then be used by default for those languages
  11. Add a file ~/.config/speech-dispatcher/modules/piper.conf that looks a bit like this:
DefaultVoice "en/en_GB/alan/medium/en_GB-alan-medium.onnx"

# Specifying a rarely used symbol & big limit so that speech-dispatcher doesn't cut text into chunks:
GenericDelimiters "˨"
GenericMaxChunkLength 1000000

# These lines are important to specify for every language you'll use, otherwise some characters will not work:
GenericLanguage "en" "en-us" "utf-8"
GenericLanguage "en" "en-gb" "utf-8"
GenericLanguage "ru" "ru" "utf-8"

GenericCmdDependency "sox"
GenericCmdDependency "aplay"

GenericExecuteSynth \
"echo '$DATA' | PATH_TO_PIPER_EXECUTABLE --model 'PATH_TO_VOICES_FOLDER/$VOICE' --output_raw | sox -r 22050 -c 1 -b 16 -e signed-integer -t raw - -t wav - tempo $RATE pitch $PITCH norm | aplay -r 22050 -f S16_LE -t raw -"

GenericRateAdd 1
GenericPitchAdd 1
GenericVolumeAdd 1
GenericRateMultiply 1
GenericPitchMultiply 1000

# Adding all voices we want:
AddVoice "en" "FEMALE1" "en/en_GB/jenny_dioco/medium/en_GB-jenny_dioco-medium.onnx"
AddVoice "en" "MALE1" "en/en_GB/alan/medium/en_GB-alan-medium.onnx"
AddVoice "en" "FEMALE1" "en/en_GB/semaine/medium/en_GB-semaine-medium.onnx"
AddVoice "en" "FEMALE1" "en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx"
AddVoice "en" "FEMALE1" "en/en_GB/alba/medium/en_GB-alba-medium.onnx"
AddVoice "en" "FEMALE1" "en/en_US/amy/medium/en_US-amy-medium.onnx"
AddVoice "ru" "MALE1" "ru/ru_RU/dmitri/medium/ru_RU-dmitri-medium.onnx"
  12. Now, in this piper.conf, replace PATH_TO_PIPER_EXECUTABLE with the path to the Piper TTS executable you noted, and PATH_TO_VOICES_FOLDER with the path to the voices folder you noted
  13. Change piper.conf and adapt it to the voices you downloaded and the languages you'll need: remove the AddVoice and GenericLanguage lines above and replace them with your own
  14. Restart speech-dispatcher: (1) kill any process named speech-dispatcher, and (2) run sudo systemctl restart speech-dispatcherd.service
  15. All done! Install the Read Aloud plugin, and the voices should be there
    [Screenshot from 2024-02-11: the Piper voices appearing in Read Aloud's voice list]
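For anyone curious what actually runs: speech-dispatcher's generic module expands the GenericExecuteSynth template by substituting $DATA, $VOICE, $RATE, and $PITCH. A rough Python illustration of that expansion, with made-up paths standing in for PATH_TO_PIPER_EXECUTABLE and PATH_TO_VOICES_FOLDER (the real module also escapes quotes in $DATA and derives $RATE/$PITCH from the Add/Multiply settings):

```python
# Illustration only; not speech-dispatcher's actual code. Paths are examples.
TEMPLATE = (
    "echo '$DATA' | /usr/local/bin/piper --model '/opt/voices/$VOICE' "
    "--output_raw | sox -r 22050 -c 1 -b 16 -e signed-integer -t raw - "
    "-t wav - tempo $RATE pitch $PITCH norm | aplay -r 22050 -f S16_LE -t raw -"
)

def expand(template, **params):
    """Substitute each $NAME placeholder with its value."""
    for name, value in params.items():
        template = template.replace("$" + name.upper(), str(value))
    return template

cmd = expand(
    TEMPLATE,
    data="Hello world",
    voice="en/en_GB/alan/medium/en_GB-alan-medium.onnx",
    rate=1.0,
    pitch=0,
)
```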

I also submitted a PR that would make it really nice to use Piper TTS with the Read Aloud plugin: #376. Currently, there's a bit of a problem where the beginning of each paragraph takes a long time to start speaking; that PR adds an option to essentially merge all paragraphs into one, which Piper TTS handles much better.

@ken107
Owner

ken107 commented Feb 11, 2024

That PR is not necessary, as Read Aloud has the ability to prefetch the speech for the next chunk while it's reading the current chunk. In the tts-engine.js file, you can see that all the cloud-based engines perform prefetching. Your Piper voices are provided by the OS, so Read Aloud uses the BrowserTtsEngine, which doesn't prefetch. To do it correctly, we need to add a tts-engine implementation for Piper.
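The prefetch pattern described here is: while chunk N is playing, chunk N+1 is already being synthesized, so playback never waits on synthesis except for the first chunk. A toy Python sketch of the pattern (synthesize is a stand-in, not Read Aloud's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(chunk):
    """Stand-in for a TTS call (e.g. a Piper invocation)."""
    return f"<audio for {chunk!r}>"

def play_with_prefetch(chunks, synth=synthesize):
    """Kick off synthesis of chunk N+1 before 'playing' chunk N,
    mirroring the prefetch done by the cloud-based tts-engines."""
    if not chunks:
        return []
    played = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(synth, chunks[0])
        for i in range(len(chunks)):
            audio = future.result()                          # wait for current chunk
            if i + 1 < len(chunks):
                future = pool.submit(synth, chunks[i + 1])   # prefetch next chunk
            played.append(audio)                             # "playback" happens here
    return played

out = play_with_prefetch(["First paragraph.", "Second paragraph."])
```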

ken107 added a commit that referenced this issue Mar 23, 2024
@ken107
Owner

ken107 commented Mar 23, 2024

@kfatehi It's been some intense work weeks, but it's finally here. I'll publish Read Aloud 2.9.0 today with Piper voices integrated. I'll also publish a separate "Piper TTS" extension that makes Piper voices available to the browser and to other extensions. The full source code will be released soon as well. The Piper project is pretty awesome; thanks for bringing it to my attention.
