Is there any way to synthesize in real-time? #15
Based on my experiments, Tacotron isn't able to synthesize in real-time (it takes more than 1 second to synthesize 1 second of audio). Is there any solution or modification to resolve this problem? |
The slowest part at the moment is the Griffin-Lim reconstruction that converts spectrograms to waveforms. This currently runs on the CPU and uses librosa's stft and istft functions (sketched below). There are a number of options for improving this, ordered roughly from best to worst:
1. Implement Griffin-Lim in TensorFlow so it can run on the GPU.
2. Use multiprocessing or threading to parallelize the librosa calls.
3. Reduce the number of Griffin-Lim iterations (at some cost in quality).
Pull requests for (1) or (2) would be very welcome! |
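For reference, here is roughly what that CPU path looks like (a minimal sketch modeled on the approach described above; the STFT parameters are illustrative, not the repo's actual hparams):

```python
import numpy as np
import librosa

def griffin_lim_cpu(S, n_iters=60, n_fft=2048, hop_length=256, win_length=1024):
    # S: magnitude spectrogram, shape (1 + n_fft // 2, frames).
    S = np.abs(S).astype(np.complex64)
    # Start from random phase, then iteratively re-estimate it.
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
    y = librosa.istft(S * angles, hop_length=hop_length, win_length=win_length)
    for _ in range(n_iters):
        # Keep the target magnitudes; replace only the phase estimate.
        angles = np.exp(1j * np.angle(
            librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length)))
        y = librosa.istft(S * angles, hop_length=hop_length, win_length=win_length)
    return y
```

Every iteration pays for a full STFT/ISTFT pair on the CPU, which is why this step dominates synthesis time.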
@keithito thanks for the solid answer, I've also been wondering about this. It seems a GPU implementation would be most beneficial. You could then make it optional to use the GPU or vanilla implementation. |
@keithito Hi, thank you for your experiments. I am a little confused: what's the major difference between yours and Kyubyong's? I have experimented with Kyubyong's for some days but still cannot get results like yours. |
@keithito I think another option would be implementing an algorithm other than Griffin-Lim to do the inversion. The one presented in the paper "Real-time Iterative Spectrum Inversion with Look-ahead" seems to be a good choice. Unfortunately, I did not find code for it anywhere online. |
@ElevenGameStudios Unfathomably, someone (Kyubyong/tacotron#81) got Kyubyong's version to perform better than any other, when it used to be the other way around.
@ElevenGameStudios So much so that the only thing left to do would be to implement Baidu's version of Tacotron, which is multi-speaker and includes a better algorithm than Griffin-Lim. |
@keithito - Thanks for your awesome implementation! :) After training (and babysitting) from scratch, with CMU enabled, for 111K steps (2 days), here are my results. For those evals, I set Griffin-Lim iters to 12 (and my max_iters is 325 because of my dataset).
Hope someone can share a tf GPU implementation of griffin-lim, along with multiprocessing / threading implementations for librosa functions. :)
|
@MXGray Your results sound really good! 👍 Would you mind sharing audio files with 60 Griffin-Lim iterations and would it be okay if I linked to them from the top-level README file? I'm also hoping someone shares a TF implementation of Griffin-Lim, but if nobody does, I might give it a shot next weekend. |
@keithito >> No problem - yes, feel free to link these samples. Here are my 111K and 140K results at different Griffin-Lim iters (12 and 60) and the same max_iters (325), using the same dataset (Nancy). /edit/ correct ZIP files updated. Can't wait to get my hands on a TF GPU Griffin-Lim strategy! :) |
@MXGray Those are some really nice results, well done! With max_iters that high, I am sure it can synthesize pretty long sentences. @keithito I haven't gotten around to implementing any possible Griffin-Lim improvements either. But I think https://github.com/lonce/SPSI_Python (Single Pass Spectrogram Inversion, mentioned in the post you refer to in your Griffin-Lim code comment) could still be a good idea as an initialization for Griffin-Lim. Then one should need far fewer Griffin-Lim steps, and if those could be done multithreaded or on the GPU, synthesis should be much faster; see the sketch after this comment. |
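A sketch of that initialization idea; the `spsi` import and its signature are assumptions based on the linked repo, not verified API:

```python
import numpy as np
import librosa
from spsi import spsi  # hypothetical import of the SPSI_Python code linked above

def fast_invert(S, n_iters=10, n_fft=2048, hop_length=256):
    # One cheap single-pass estimate gives a usable waveform up front...
    y = spsi(S, fftsize=n_fft, hop_length=hop_length)
    # ...so only a few Griffin-Lim refinement iterations should be needed on top,
    # instead of the 30-60 typically used from a random-phase start.
    for _ in range(n_iters):
        angles = np.exp(1j * np.angle(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)))
        y = librosa.istft(S * angles, hop_length=hop_length)
    return y
```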
Kyubyong implemented Griffin-Lim in TF: https://github.com/Kyubyong/tensorflow-exercises/blob/master/Audio_Processing.ipynb |
@tmulc18 Thanks for sharing. @Kyubyong's Griffin-Lim implementation works very well. Based on this code, I made a modified version: https://github.com/candlewill/Griffin_lim . This could be integrated easily with the current Tacotron implementation. |
@candlewill
|
@qclu In your integration method, two graphs are executed sequentially in two sessions. That might be time-consuming, as the data flowing between the two sessions goes through the CPU (not the GPU). I think it's better to extend the Tacotron network with Griffin-Lim; the extended Tacotron would output samples directly in just one session. |
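A rough sketch of that single-graph idea. This is only a sketch: it uses `tf.signal` (which lived in `tf.contrib.signal` in the TF 1.x versions current at the time), and the frame parameters are illustrative:

```python
import tensorflow as tf

def griffin_lim_tf(S, n_iters=60, n_fft=2048, hop=256, win=1024):
    """Griffin-Lim expressed as graph ops, so it can run on the GPU in the
    same session.run as the rest of the model.
    S: predicted magnitudes, shape [batch, frames, 1 + n_fft // 2]."""
    S = tf.cast(S, tf.complex64)
    y = tf.signal.inverse_stft(S, win, hop, n_fft)
    for _ in range(n_iters):
        est = tf.signal.stft(y, win, hop, n_fft)
        # Keep target magnitudes, take only the phase of the estimate.
        phase = est / tf.cast(tf.maximum(tf.abs(est), 1e-8), tf.complex64)
        y = tf.signal.inverse_stft(S * phase, win, hop, n_fft)
    return y  # [batch, samples]
```

Because the loop is unrolled into graph ops, the spectrogram never leaves the GPU between iterations.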
@candlewill
Have you tried it and gotten a better result? |
In @keithito's latest commit, there is a tf-griffin-lim branch: https://github.com/keithito/tacotron/tree/tf-griffin-lim |
I just merged PR #41 which adds a TensorFlow implementation of Griffin-Lim, based on the code in Kyubyong's notebook. This speeds things up considerably. You can synthesize 6 sec of audio in about 0.8 sec on a GTX 1080 Ti. There's still room for improvement. Everything runs on one example at a time right now, but both the model and the Griffin-Lim implementation work fine on batches of data, so it should be pretty straightforward to synthesize multiple (e.g. 32) examples in parallel. Thanks to @Kyubyong and @candlewill! |
@qclu If you're not seeing a speedup, one thing to check is whether you're using a version of TensorFlow with GPU support. The first time I tried upgrading to TensorFlow 1.3, I ran into exactly this and ended up with the CPU-only build. |
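One quick sanity check (a sketch; note that in TF 1.x, GPU support requires the separate `tensorflow-gpu` package):

```python
import tensorflow as tf

# Prints True only if TensorFlow was built with CUDA and can see a GPU device.
print(tf.test.is_gpu_available())
```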
I'll re-open this for a few more days in case there are issues. |
In my own implementation, from text -> linear spectrogram:
I think the key problems for synthesis speed are:
|
@candlewill Yes, thanks so much. |
@keithito |
In case this helps, my tests so far point to the following things:
P.S. I'm continuing to separately train the newest release using LJS, Nancy and a Tagalog (Filipino) dataset; |
@qclu That's a bit surprising. The training pipeline isn't using the GPU version of Griffin-Lim. Also, a Tesla P40 has 24GB of memory, which is a lot more than should be necessary. What dataset are you training on? Does it have really long audio clips? |
Hi, could you please share your trained model checkpoints on the Nancy dataset? Also, if you want to try Single Pass Spectrogram Inversion, there is a repo available.
@saxenauts |
BTW, another thing is making the duration of the WAV output adapt to the input length ...
@mertyildiran |
@MXGray It would take some time to process the trailing silence of the WAV. Is it still real-time?
@zuoxiang95 |
@keithito |
Another option for trimming the silence is to run it through `librosa.effects.trim`:

```python
trimmed_wav, _ = librosa.effects.trim(wav)
```

This seems to perform pretty well. For example:

```python
import time
import librosa

start = time.time()
trimmed_wav, _ = librosa.effects.trim(wav)
print('Trimmed from %.2f sec to %.2f sec in %.3f sec' % (
    len(wav) / sample_rate, len(trimmed_wav) / sample_rate, (time.time() - start)))
```
|
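If the default is too aggressive or too lenient on a given dataset, `librosa.effects.trim` also takes a `top_db` threshold (the value below is just an illustration):

```python
# Treat anything more than 30 dB below peak as silence (the default is 60).
trimmed_wav, _ = librosa.effects.trim(wav, top_db=30)
```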
Hi, thank you for sharing your model and training logs, but the Google Drive link seems dead now. Could you please reshare it for all of us? |
@nucleiis |
@MXGray |
Dear all, I am new to the speech field, and I still can't understand this algorithm after going through the original paper or googling it, so if this is a stupid question, sorry about that ><
@MXGray |
Answering the question posted in this topic:
And there is a PyTorch Tacotron 2 implementation with FP16 and multi-GPU support.
When normalizing the Nancy dataset, did you normalize to peak, RMS, broadcast standard (LUFS), or something else? Cheers! |
With LPCNet, at least the vocoder is real-time; in fact it takes almost 0.8 s for 1 s of audio.
@saxenauts Thank you very much! |
Hi there, the link is broken. Can you please share again? |