
Is there any way to synthesize in real-time? #15

Closed
msobhan69 opened this issue Aug 7, 2017 · 44 comments

Comments

@msobhan69

Based on my experiments, Tacotron isn't able to synthesize in real time (i.e., it takes more than 1 second of computation to synthesize 1 second of audio).
Is there any solution or modification to resolve this problem?

@keithito
Owner

keithito commented Aug 7, 2017

The slowest part at the moment is the Griffin-Lim reconstruction to convert spectrograms to waveforms. This currently runs on the CPU and uses librosa's stft and istft functions. There are a number of options for improving this, ordered from best to worst:

  1. Write a Tensorflow Griffin-Lim implementation so it can run on the GPU.
  2. Write a parallel CPU implementation (I haven't looked closely at librosa's stft and istft, but I believe they're single-threaded). Even something simple like slicing the spectrogram into several pieces, running Griffin-Lim on each piece on a different core, and then stitching the wavs together might work; a rough sketch of this idea follows below.
  3. Decrease the number of Griffin-Lim iterations by running eval/demo_server with --hparams="griffin_lim_iters=30". This is easy to do, but will decrease audio quality. You can play with the number to trade off speed and quality.

Pull requests for (1) or (2) would be very welcome!
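
For reference, here is a minimal sketch of option (2): split the spectrogram along the time axis, reconstruct each slice with librosa's Griffin-Lim loop in a separate process, and concatenate the waveforms. The seams between slices can produce audible artifacts (overlapping and cross-fading the slices would mitigate that), and the FFT parameters and worker count here are illustrative assumptions, not this repo's hparams:

  import numpy as np
  import librosa
  from concurrent.futures import ProcessPoolExecutor

  N_FFT, HOP, GL_ITERS = 2048, 275, 60  # illustrative values

  def _griffin_lim(S):
      """Reconstruct a waveform from one magnitude-spectrogram slice."""
      angles = np.exp(2j * np.pi * np.random.rand(*S.shape))  # random initial phase
      y = librosa.istft(S * angles, hop_length=HOP)
      for _ in range(GL_ITERS):
          angles = np.exp(1j * np.angle(librosa.stft(y, n_fft=N_FFT, hop_length=HOP)))
          y = librosa.istft(S * angles, hop_length=HOP)
      return y

  def parallel_griffin_lim(S, n_workers=4):
      """S: [1 + N_FFT // 2, frames] linear magnitude spectrogram."""
      slices = np.array_split(S, n_workers, axis=1)  # split along the time axis
      with ProcessPoolExecutor(max_workers=n_workers) as pool:
          wavs = list(pool.map(_griffin_lim, slices))
      return np.concatenate(wavs)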

@MaxwellRebo

MaxwellRebo commented Aug 7, 2017

@keithito Thanks for the solid answer; I've also been wondering about this. It seems a GPU implementation would be most beneficial. You could then make using the GPU or the vanilla implementation optional.

@jpdz

jpdz commented Aug 9, 2017

@keithito Hi, thank you for your experiments. I'm a little confused about the major difference between your implementation and Kyubyong's. I have experimented with Kyubyong's for some days but still can't get results like yours.
Thank you so much!

@ElevenGameStudios

@keithito I think another option would be implementing an algorithm other than Griffin-Lim to do the inversion. The one presented in the paper "Real-time Iterative Spectrum Inversion with Look-ahead" seems to be a good choice. Unfortunately, I did not find the code for it anywhere online.

@GunpowderGuy

GunpowderGuy commented Aug 17, 2017

@ElevenGameStudios Unfathomably, someone (Kyubyong/tacotron#81) got Kyubyong's version to perform better than any other, when it used to be the other way around.

@GunpowderGuy

GunpowderGuy commented Aug 17, 2017

@ElevenGameStudios So much so that the only thing left to do would be to implement Baidu's version of Tacotron, which is multi-speaker and includes a better algorithm than Griffin-Lim.

@MXGray

MXGray commented Sep 4, 2017

@keithito - Thanks for your awesome implementation! :)
@everyone - Thanks for sharing your ideas and experiments! :)

After training (and babysitting) from scratch, with CMU enabled, for 111K steps (2 days), here are my results:
ResultsAfter111KSteps.zip

For those evals, I set griffin_lim_iters to 12 (and my max_iters is 325 because of my dataset).

  • I did this so I can get my user-input statements synthesized in less than 8 seconds or so.
  • Quality is much higher when I set it to 30 or so, though it takes around 15 seconds at 30 and 22 seconds at 60 ...

Hope someone can share a TF GPU implementation of Griffin-Lim, along with multiprocessing/threading implementations of the librosa functions. :)
BTW, I used the Nancy dataset up at http://data.cstr.ed.ac.uk

  • I made some improvements to the WAV files and also normalized everything; and
  • I also preprocessed the text data to better match the formatting of LJ Speech's text data.

@keithito
Owner

keithito commented Sep 5, 2017

@MXGray Your results sound really good! 👍

Would you mind sharing audio files with 60 Griffin-Lim iterations and would it be okay if I linked to them from the top-level README file?

I'm also hoping someone shares a TF implementation of Griffin-Lim, but if nobody does, I might give it a shot next weekend.

@MXGray

MXGray commented Sep 5, 2017

@keithito >> No problem - yes, feel free to link these samples. Here are my 111K and 140K results at different griffin_lim_iters (12 and 60) and the same max_iters (325), using the same dataset (Nancy). (Edit: correct ZIP files updated.)
111KStepsAt60GfnLm.zip
140KStepsAt12GfnLm.zip
140KStepsAt60GfnLm.zip

Can't wait to get my hands on a TF GPU Griffin-Lim implementation! :)

@ElevenGameStudios

@MXGray Those are some really nice results, well done! With max_iters that high, I'm sure it can synthesize pretty long sentences.

@keithito I haven't gotten around to implementing any possible Griffin-Lim improvements either. But I think https://github.com/lonce/SPSI_Python (Single Pass Spectrogram Inversion, mentioned in the post you refer to in your Griffin-Lim code comment) could still be a good idea as an initialization for Griffin-Lim. Then one should need far fewer Griffin-Lim iterations, and if those could be done multithreaded or on the GPU, synthesis should be much faster.
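
A rough sketch of that idea: use SPSI for the initial phase estimate, then run only a few Griffin-Lim refinement iterations. The spsi import and its signature are assumptions based on the linked repo, and the FFT parameters are illustrative:

  import numpy as np
  import librosa
  from spsi import spsi  # from the linked SPSI_Python repo; signature assumed

  N_FFT, HOP = 2048, 275  # illustrative values

  def griffin_lim_with_spsi(S, iters=10):
      """S: [1 + N_FFT // 2, frames] linear magnitude spectrogram."""
      y = spsi(S, fftsize=N_FFT, hop_length=HOP)  # single-pass phase estimate
      for _ in range(iters):  # a few refinements instead of the usual 50-60
          phase = np.exp(1j * np.angle(librosa.stft(y, n_fft=N_FFT, hop_length=HOP)))
          y = librosa.istft(S * phase, hop_length=HOP)
      return y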

@tmulc18

tmulc18 commented Sep 7, 2017

Kyubyong implemented Griffin-Lim in TF: https://github.com/Kyubyong/tensorflow-exercises/blob/master/Audio_Processing.ipynb

@candlewill
Contributor

candlewill commented Sep 8, 2017

@tmulc18 Thanks for sharing. @Kyubyong's Griffin-Lim implementation works very well. Based on this code, I added _denormalize, _db_to_amp, and _inv_preemphasis functions. All APIs stay the same as in @keithito's implementation.

Here is my modified version: https://github.com/candlewill/Griffin_lim

This could be integrated easily with the current Tacotron implementation.

@qclu

qclu commented Sep 8, 2017

@candlewill
Great!
I tried your code with eval.py in the synthesizer, but I did not get any improvement in the time consumed by Griffin-Lim.
Here is my implementation (in synthesizer.py):

  timestart = time.time()
  spec = self.session.run(self.model.linear_outputs[0], feed_dict=feed_dict)
  out = io.BytesIO()
  # with tf.device('/gpu:2'):  # I once tried to specify a GPU here, but got errors
  sample = inv_spectrogram(spec)
  with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True)) as sess:
      sample = sess.run(sample)
  audio.save_wav(sample, out)
  print('tf time consumed: ', time.time() - timestart)

Time consumed (60 iterations):
GPU time consumed: 29.525731086730957
CPU time consumed: 7.406686067581177

Did I make some errors in my implementation?
Thanks

@candlewill
Contributor

candlewill commented Sep 8, 2017

@qclu In your integration method, two graphs are executed sequentially in two sessions. That is likely where the time goes, since the data flowing between the two sessions passes through the CPU (not the GPU).

I think it's better to extend the Tacotron network with Griffin-Lim. The extended Tacotron would output waveform samples directly in just one session.
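
For illustration, a minimal sketch of what that graph extension could look like, assuming tf.signal is available (TF >= 1.12; older versions had these ops under tf.contrib.signal). The FFT parameters are illustrative, not this repo's actual hparams:

  import tensorflow as tf

  N_FFT, HOP, WIN = 2048, 275, 1100  # illustrative values

  def tf_griffin_lim(magnitudes, iters=60):
      """magnitudes: [frames, N_FFT // 2 + 1] linear magnitude spectrogram."""
      S = tf.cast(tf.expand_dims(magnitudes, 0), tf.complex64)  # [1, frames, bins]
      y = tf.signal.inverse_stft(S, WIN, HOP, N_FFT)            # zero-phase start
      for _ in range(iters):
          est = tf.signal.stft(y, WIN, HOP, N_FFT)
          phase = est / tf.cast(tf.maximum(tf.abs(est), 1e-8), tf.complex64)
          y = tf.signal.inverse_stft(S * phase, WIN, HOP, N_FFT)
      return tf.squeeze(y, 0)

  # Appended to the synthesis graph, one session.run() goes all the way to audio:
  #   wav_op = tf_griffin_lim(model.linear_outputs[0])
  #   wav = session.run(wav_op, feed_dict=feed_dict)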

@qclu

qclu commented Sep 13, 2017

@candlewill
I have extended the Tacotron network with Griffin-Lim so it outputs samples directly in one session.
Here is the time consumption:

  1. Original implementation:
    time consumed: 8.203073501586914 s
  2. With the GL algorithm embedded in Tacotron:
    time consumed: 10.650420665740967 s

Have you tried it and gotten a better result?

@candlewill
Contributor

candlewill commented Sep 13, 2017

In @keithito's latest commit, there is a tf-griffin-lim branch: https://github.com/keithito/tacotron/tree/tf-griffin-lim

@keithito
Owner

I just merged PR #41 which adds a TensorFlow implementation of Griffin-Lim, based on the code in Kyubyong's notebook. This speeds things up considerably. You can synthesize 6 sec of audio in about 0.8 sec on a GTX 1080 Ti.

There's still room for improvement. Everything runs on one example at a time right now, but both the model and the Griffin-Lim implementation work fine on batches of data, so it should be pretty straightforward to synthesize multiple (e.g. 32) examples in parallel; a rough sketch follows below.

Thanks to @Kyubyong and @candlewill!
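
For anyone who wants to attempt the batched version, here is what it could look like. The attribute names (model.inputs, model.input_lengths, model.linear_outputs) follow this repo's synthesizer.py, but treat the details as untested assumptions:

  import numpy as np

  def synthesize_batch(session, model, seqs):
      """seqs: a list of integer sequences from text_to_sequence()."""
      max_len = max(len(s) for s in seqs)
      inputs = np.zeros((len(seqs), max_len), dtype=np.int32)  # zero-padded batch
      for i, s in enumerate(seqs):
          inputs[i, :len(s)] = s
      lengths = np.asarray([len(s) for s in seqs], dtype=np.int32)
      feed = {model.inputs: inputs, model.input_lengths: lengths}
      return session.run(model.linear_outputs, feed_dict=feed)  # [batch, frames, bins]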

@keithito
Owner

keithito commented Sep 13, 2017

@qclu If you're not seeing a speedup, one thing to check is if you're using a version of TensorFlow with GPU support. The first time I tried upgrading to TensorFlow 1.3, I ran pip install --upgrade tensorflow, which installed the non-GPU version. To fix this, I needed to uninstall TF, then run pip install tensorflow-gpu.

@keithito
Owner

I'll re-open this for a few more days in case there are issues.

@keithito keithito reopened this Sep 13, 2017
@lifeiteng

lifeiteng commented Sep 13, 2017

In my own implementation, going from text to linear spectrogram, synthesizing 10 utterances (batch of 10) totaling 303 characters takes 0.9374 seconds on CPU.
The max decoder timestep is 80 (the reduction factor is 5, so 80 × 5 × 12.5 ms = 5 seconds of audio output). The texts are:

Lisa They study outside.
Lisa He didn't get enough sleep.
Lisa Who enjoys gardening?
Lisa The Earth's axis is perpendicular to the Sun.
Lisa to encourage it.
Lisa Angela is staying in bed because she's sick.
Lisa Who are sitting in a circle?
Lisa Dan is American.
Lisa Harry and his friend took a train from Beijing to Shanghai.
Lisa There would be more seasons.

I think the key factors in synthesis speed are:

  1. balancing text lengths within one batch (a sketch of this follows below)
  2. stopping decoding upon reaching the end of the text (related to 1; I haven't implemented this yet)
  3. a faster Griffin-Lim implementation
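
For example, a minimal sketch of point (1), grouping texts of similar length into the same batch so padding (and wasted decoder steps) is minimized; this is an illustration, not code from any particular implementation:

  def make_batches(texts, batch_size=10):
      """Group texts of similar length to minimize padding within each batch."""
      order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
      # Keep `order` around if outputs must be restored to the original ordering.
      return [[texts[i] for i in order[j:j + batch_size]]
              for j in range(0, len(order), batch_size)]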

@qclu

qclu commented Sep 13, 2017

@candlewill Yes, thanks so much.
@keithito Yes, I compared with your latest release; the time consumptions are quite close.
Great work!

@qclu

qclu commented Sep 14, 2017

@keithito
I tried training with the latest release, v0.2.0. In my previous training, the batch size was 32 and it worked well, but with v0.2.0 I had to reduce the batch size to 8 to avoid OOM errors. My GPU is a Tesla P40.
I think this is caused by the Griffin-Lim implementation. Maybe we need to optimize it further.

@saxenauts

@MXGray @keithito
Training on the Nancy dataset seems to produce more natural-sounding results than training on the LJ Speech dataset.
Is it because the latter consists of clips from audiobooks?
If so, how can I overcome this?

Thanks

@MXGray

MXGray commented Sep 14, 2017

@saxenauts
@keithito

In case this helps, my tests so far suggest the following:

  1. The voice you get tends to be what you feed it - synthesis voice quality is largely dependent on dataset voice quality;

  2. Prediction quality tends to depend largely on the volume and diversity of what you feed it - bigger data volume helps in optimizing prediction of unseen text; and

  3. Similar to number 2 - longer dataset clips optimize prediction of longer unseen text, though they cause some issues when ending synthesis of shorter unseen text, and vice versa (shorter dataset clips cause some issues when synthesizing longer unseen text) ...

P.S. I'm continuing to separately train the newest release using LJS, Nancy, and a Tagalog (Filipino) dataset.
After manual and automatic preprocessing, I have 26,200 NPY files (26.4 GB) in the LJS training dir, while Nancy has 24,190 (19.8 GB) and our Tagalog dataset has only 988 NPY files (1.34 GB).
In all 3 cases, I got reasonably audible and comprehensible synthesis of unseen text after 120K global steps on average.

@keithito
Owner

@qclu That's a bit surprising. The training pipeline isn't using the GPU version of Griffin-Lim. Also, a Tesla P40 has 24GB of memory, which is a lot more than should be necessary. What dataset are you training on? Does it have really long audio clips?

@saxenauts

saxenauts commented Sep 21, 2017

@MXGray

Hi, could you please share your trained model checkpoints for the Nancy dataset?
I intend to try a different spectrogram inversion technique that supposedly produces faster results.
The Griffin-Lim algorithm makes multiple passes (of STFT and inverse STFT) over the entire sequence, but there is research showing faster spectrogram inverters; two such techniques are real-time inversion and single-pass inversion.
I want to test Tacotron's speed with these, and you can experiment with this too.

If you want to try single-pass inversion, there is a repo available:
https://github.com/lonce/SPSI_Python

@MXGray

MXGray commented Sep 22, 2017

@saxenauts
No problem - here you go: https://drive.google.com/file/d/0B1HeTSnLaWOSVGtMZEN4X1FDbWc/view?usp=sharing
Also, can anybody point out the names of the input and output nodes needed to freeze the model and weights into a PB graph?
I want to experiment with porting this to C++, following the guide here: https://medium.com/@hamedmp/exporting-trained-tensorflow-models-to-c-the-right-way-cf24b609d183
I would greatly appreciate your help. I tried checking out some things with write_graph, but I'm finding it hard to identify the necessary input and output node names. :(
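
In case it helps, a hedged sketch of the TF 1.x freezing workflow. The checkpoint path and the output node name below are hypothetical placeholders; the loop shows one way to hunt for the real node names:

  import tensorflow as tf

  CKPT = 'logs-tacotron/model.ckpt-120000'  # hypothetical checkpoint path

  saver = tf.train.import_meta_graph(CKPT + '.meta')
  with tf.Session() as sess:
      saver.restore(sess, CKPT)
      # Print candidate op names to locate the input/output nodes:
      for op in sess.graph.get_operations():
          if 'inputs' in op.name or 'outputs' in op.name:
              print(op.name)
      # 'model/inference/output_node' is a placeholder, not the real name:
      frozen = tf.graph_util.convert_variables_to_constants(
          sess, sess.graph.as_graph_def(), ['model/inference/output_node'])
      tf.train.write_graph(frozen, '.', 'tacotron_frozen.pb', as_text=False)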

@MXGray

MXGray commented Sep 22, 2017

BTW, another thing is making the duration of the WAV output adapt to the input length ...
@mertyildiran
I tried this, among other things, but I don't see any effect. Has anybody else tested this yet? >> #43

@MXGray

MXGray commented Sep 24, 2017

@mertyildiran
The hack I implemented for this (adapting WAV duration to text input length) is to use PyDub to detect and remove the trailing silence of the generated WAV before playing it back. It works.
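
A minimal sketch of that hack: detect nonsilent regions with PyDub and cut everything after the last one. The silence threshold and minimum silence length are assumptions you'd tune per voice:

  from pydub import AudioSegment
  from pydub.silence import detect_nonsilent

  def trim_trailing_silence(wav_path, out_path):
      audio = AudioSegment.from_wav(wav_path)
      # [start_ms, end_ms] pairs of nonsilent stretches; thresholds are guesses
      nonsilent = detect_nonsilent(audio, min_silence_len=200, silence_thresh=-40)
      if nonsilent:
          audio = audio[:nonsilent[-1][1]]  # keep up to the last nonsilent chunk
      audio.export(out_path, format='wav')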

@zuoxiang95

@MXGray It would take some time to process the trailing silence of the wav. Is it still real-time?

@MXGray

MXGray commented Sep 24, 2017

@zuoxiang95
0.29 seconds for each 1 second of audio. Using my Windows 10 laptop with a 4 GB GPU (GTX 1050), it took 1.82 seconds to synthesize, detect and remove silence, and then play back the 6.2-second audio output for this sample sentence:
[screenshot: processing time]
P.S. Pardon me if the image above isn't resized properly - I'm completely blind.

@qclu

qclu commented Sep 25, 2017

@keithito
I made a mistake in my datasets. Thank you.

@keithito
Owner

keithito commented Sep 25, 2017

Another option for trimming the silence is to run it through librosa.effects.trim:

trimmed_wav, _ = librosa.effects.trim(wav)

This seems to perform pretty well. For example:

  start = time.time()
  trimmed_wav, _ = librosa.effects.trim(wav)
  print('Trimmed from %.2f sec to %.2f sec in %.3f sec' % (
    len(wav) / sample_rate, len(trimmed_wav) / sample_rate, (time.time() - start)))

Trimmed from 6.24 sec to 4.37 sec in 0.003 sec

@keithito keithito closed this as completed Oct 4, 2017
@nucleiis

nucleiis commented Nov 3, 2017

@MXGray

Hi, thank you for sharing your model and training logs, but the Google Drive link seems dead now. Could you please re-share it for all of us?

@MXGray

MXGray commented Nov 7, 2017

@nucleiis
Just got back from a long vacation. Well, here it is:
https://drive.google.com/file/d/1AtKUeUPp95NCdve2uwXb4JYTVJb1iAt0/view?usp=sharing

@nucleiis

nucleiis commented Nov 9, 2017

@MXGray
Thank you so much! Your model will be a tremendous help for me.

@r03922123

Dear all,
Is there any intuition behind the Griffin-Lim algorithm?

I am new to the speech field and still can't understand it after going through the original paper and googling the algorithm, so if this is a stupid question, sorry about that ><

@15857541616

@MXGray
I have downloaded the Nancy speech dataset. How can I use it? What should the directory tree of my tacotron look like? Something like this:

  tacotron
  |- LJSpeech-1.1
     |- metadata.csv
     |- wavs

Could you give some details so that I can train the model myself, such as other needed files or tips? If I could have your repo, even better. :)
Thanks

@rafaelvalle

Answering the question posted in this topic: it is possible to synthesize in real time by using nv-wavenet:
https://github.com/NVIDIA/nv-wavenet

@rafaelvalle

And a PyTorch Tacotron 2 implementation with FP16 and multi-GPU support
https://github.com/NVIDIA/tacotron2

@fatchord

fatchord commented May 4, 2018

@MXGray

When normalizing the Nancy dataset, did you normalize to peak, RMS, broadcast standard (LUFS), or something else?

Cheers!

@alokprasad

With LPCNet, at least the vocoder is real-time; in fact it takes about 0.8 s to generate 1 s of audio.

@Hanwun

Hanwun commented Dec 16, 2019

@saxenauts
Currently, I want to speed up spectrogram inversion. Can you share how to implement single-pass inversion or real-time inversion in place of the Griffin-Lim algorithm in Tacotron?

Thank you very much!

@darkzbaron

darkzbaron commented Nov 15, 2020

@nucleiis @MXGray

> Just got back from a long vacation. Well, here it is:
> https://drive.google.com/file/d/1AtKUeUPp95NCdve2uwXb4JYTVJb1iAt0/view?usp=sharing

Hi there, the link is broken. Can you please share again?
