
Is there any way to synthesize in real-time? #15

Closed
msobhan69 opened this issue Aug 7, 2017 · 44 comments

Comments

@msobhan69

Based on my experiments, Tacotron isn't able to synthesize in real time (i.e., it takes more than 1 second of computation to synthesize 1 second of audio).
Is there any solution or modification to resolve this problem?

@keithito
Owner

keithito commented Aug 7, 2017

The slowest part at the moment is the Griffin-Lim reconstruction to convert spectrograms to waveforms. This currently runs on the CPU and uses librosa's stft and istft functions. There are a number of options for improving this, ordered from best to worst:

  1. Write a Tensorflow Griffin-Lim implementation so it can run on the GPU.
  2. Write a parallel CPU implementation (I haven't looked closely at librosa's stft and istft, but I believe they're single-threaded). Even something simple like slicing the spectrogram into several pieces, running Griffin-Lim on each piece on a different core, and then stitching the wavs together might work; a rough sketch of this idea follows below.
  3. Decrease the number of Griffin-Lim iterations by running eval/demo_server with --hparams="griffin_lim_iters=30". This is easy to do, but will decrease audio quality. You can play with the number to trade off speed and quality.

Pull requests for (1) or (2) would be very welcome!
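
For reference, here is a minimal sketch of option (2): split the spectrogram along the time axis, reconstruct each slice with librosa's Griffin-Lim loop in a separate process, and concatenate the waveforms. The seams between slices can produce audible artifacts (overlapping and cross-fading the slices would mitigate that), and the FFT parameters and worker count here are illustrative assumptions, not this repo's hparams:

  import numpy as np
  import librosa
  from concurrent.futures import ProcessPoolExecutor

  N_FFT, HOP, GL_ITERS = 2048, 275, 60  # illustrative values

  def _griffin_lim(S):
      """Reconstruct a waveform from one magnitude-spectrogram slice."""
      angles = np.exp(2j * np.pi * np.random.rand(*S.shape))  # random initial phase
      y = librosa.istft(S * angles, hop_length=HOP)
      for _ in range(GL_ITERS):
          angles = np.exp(1j * np.angle(librosa.stft(y, n_fft=N_FFT, hop_length=HOP)))
          y = librosa.istft(S * angles, hop_length=HOP)
      return y

  def parallel_griffin_lim(S, n_workers=4):
      """S: [1 + N_FFT // 2, frames] linear magnitude spectrogram."""
      slices = np.array_split(S, n_workers, axis=1)  # split along the time axis
      with ProcessPoolExecutor(max_workers=n_workers) as pool:
          wavs = list(pool.map(_griffin_lim, slices))
      return np.concatenate(wavs)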

@MaxwellRebo

MaxwellRebo commented Aug 7, 2017

@keithito Thanks for the solid answer; I've also been wondering about this. It seems a GPU implementation would be most beneficial. You could then make using the GPU or the vanilla implementation optional.

@jpdz

jpdz commented Aug 9, 2017

@keithito Hi, thank you for your experiments. I'm a little confused about the major difference between your implementation and Kyubyong's. I have experimented with Kyubyong's for some days but still can't get results like yours.
Thank you so much!

@ElevenGameStudios

@keithito I think another option would be implementing an algorithm other than Griffin-Lim to do the inversion. The one presented in the paper "Real-time Iterative Spectrum Inversion with Look-ahead" seems to be a good choice. Unfortunately, I did not find the code for it anywhere online.

@GunpowderGuy

GunpowderGuy commented Aug 17, 2017

@ElevenGameStudios Unfathomably, someone (Kyubyong/tacotron#81) got Kyubyong's version to perform better than any other, when it used to be the other way around.

@GunpowderGuy

GunpowderGuy commented Aug 17, 2017

@ElevenGameStudios So much so that the only thing left to do would be to implement Baidu's version of Tacotron, which is multi-speaker and includes a better algorithm than Griffin-Lim.

@MXGray

MXGray commented Sep 4, 2017

@keithito - Thanks for your awesome implementation! :)
@everyone - Thanks for sharing your ideas and experiments! :)

After training (and babysitting) from scratch, with CMU enabled, for 111K steps (2 days), here are my results:
ResultsAfter111KSteps.zip

For those evals, I set griffin_lim_iters to 12 (and my max_iters is 325 because of my dataset).

  • I did this so I can get my user-input statements synthesized in less than 8 seconds or so.
  • Quality is much higher when I set it to 30 or so, though it takes around 15 seconds at 30 and 22 seconds at 60 ...

Hope someone can share a TF GPU implementation of Griffin-Lim, along with multiprocessing/threading implementations of the librosa functions. :)
BTW, I used the Nancy dataset up at http://data.cstr.ed.ac.uk

  • I made some improvements to the WAV files and also normalized everything; and
  • I also preprocessed the text data to better match the formatting of LJ Speech's text data.

@keithito
Owner

keithito commented Sep 5, 2017

@MXGray Your results sound really good! 👍

Would you mind sharing audio files with 60 Griffin-Lim iterations and would it be okay if I linked to them from the top-level README file?

I'm also hoping someone shares a TF implementation of Griffin-Lim, but if nobody does, I might give it a shot next weekend.

@MXGray

MXGray commented Sep 5, 2017

@keithito >> No problem - yes, feel free to link these samples. Here are my 111K and 140K results at different griffin_lim_iters (12 and 60) and the same max_iters (325), using the same dataset (Nancy). (Edit: correct ZIP files updated.)
111KStepsAt60GfnLm.zip
140KStepsAt12GfnLm.zip
140KStepsAt60GfnLm.zip

Can't wait to get my hands on a TF GPU Griffin-Lim implementation! :)

@ElevenGameStudios

@MXGray Those are some really nice results, well done! With max_iters that high, I'm sure it can synthesize pretty long sentences.

@keithito I haven't gotten around to implementing any possible Griffin-Lim improvements either. But I think https://github.com/lonce/SPSI_Python (Single Pass Spectrogram Inversion, mentioned in the post you refer to in your Griffin-Lim code comment) could still be a good idea as an initialization for Griffin-Lim. Then one should need far fewer Griffin-Lim iterations, and if those could be done multithreaded or on the GPU, synthesis should be much faster.
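
A rough sketch of that idea: use SPSI for the initial phase estimate, then run only a few Griffin-Lim refinement iterations. The spsi import and its signature are assumptions based on the linked repo, and the FFT parameters are illustrative:

  import numpy as np
  import librosa
  from spsi import spsi  # from the linked SPSI_Python repo; signature assumed

  N_FFT, HOP = 2048, 275  # illustrative values

  def griffin_lim_with_spsi(S, iters=10):
      """S: [1 + N_FFT // 2, frames] linear magnitude spectrogram."""
      y = spsi(S, fftsize=N_FFT, hop_length=HOP)  # single-pass phase estimate
      for _ in range(iters):  # a few refinements instead of the usual 50-60
          phase = np.exp(1j * np.angle(librosa.stft(y, n_fft=N_FFT, hop_length=HOP)))
          y = librosa.istft(S * phase, hop_length=HOP)
      return y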

@tmulc18

tmulc18 commented Sep 7, 2017

Kyubyong implemented Griffin-Lim in TF: https://github.com/Kyubyong/tensorflow-exercises/blob/master/Audio_Processing.ipynb

@candlewill
Contributor

candlewill commented Sep 8, 2017

@tmulc18 Thanks for sharing. @Kyubyong's Griffin-Lim implementation works very well. Based on this code, I added _denormalize, _db_to_amp, and _inv_preemphasis functions. All APIs stay the same as in @keithito's implementation.

Here is my modified version: https://github.com/candlewill/Griffin_lim

This could be integrated easily with the current Tacotron implementation.

@qclu

qclu commented Sep 8, 2017

@candlewill
Great!
I tried your code with eval.py in the synthesizer, but I did not get any improvement in the time consumed by Griffin-Lim.
Here is my implementation (in synthesizer.py):

  timestart = time.time()
  spec = self.session.run(self.model.linear_outputs[0], feed_dict=feed_dict)
  out = io.BytesIO()
  # with tf.device('/gpu:2'):  # I once tried to specify a GPU here, but got errors
  sample = inv_spectrogram(spec)
  with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True)) as sess:
      sample = sess.run(sample)
  audio.save_wav(sample, out)
  print('tf time consumed: ', time.time() - timestart)

Time consumed (60 iterations):
GPU time consumed: 29.525731086730957
CPU time consumed: 7.406686067581177

Did I make some errors in my implementation?
Thanks

@candlewill
Contributor

candlewill commented Sep 8, 2017

@qclu In your integration method, two graphs are executed sequentially in two sessions. That is likely where the time goes, since the data flowing between the two sessions passes through the CPU (not the GPU).

I think it's better to extend the Tacotron network with Griffin-Lim. The extended Tacotron would output waveform samples directly in just one session.
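
For illustration, a minimal sketch of what that graph extension could look like, assuming tf.signal is available (TF >= 1.12; older versions had these ops under tf.contrib.signal). The FFT parameters are illustrative, not this repo's actual hparams:

  import tensorflow as tf

  N_FFT, HOP, WIN = 2048, 275, 1100  # illustrative values

  def tf_griffin_lim(magnitudes, iters=60):
      """magnitudes: [frames, N_FFT // 2 + 1] linear magnitude spectrogram."""
      S = tf.cast(tf.expand_dims(magnitudes, 0), tf.complex64)  # [1, frames, bins]
      y = tf.signal.inverse_stft(S, WIN, HOP, N_FFT)            # zero-phase start
      for _ in range(iters):
          est = tf.signal.stft(y, WIN, HOP, N_FFT)
          phase = est / tf.cast(tf.maximum(tf.abs(est), 1e-8), tf.complex64)
          y = tf.signal.inverse_stft(S * phase, WIN, HOP, N_FFT)
      return tf.squeeze(y, 0)

  # Appended to the synthesis graph, one session.run() goes all the way to audio:
  #   wav_op = tf_griffin_lim(model.linear_outputs[0])
  #   wav = session.run(wav_op, feed_dict=feed_dict)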

@qclu

qclu commented Sep 13, 2017

@candlewill
I have extended the Tacotron network with Griffin-Lim so it outputs samples directly in one session.
Here is the time consumption:

  1. Original implementation:
    time consumed: 8.203073501586914 s
  2. With the GL algorithm embedded in Tacotron:
    time consumed: 10.650420665740967 s

Have you tried it and gotten a better result?

@candlewill
Contributor

candlewill commented Sep 13, 2017

In @keithito's latest commit, there is a tf-griffin-lim branch: https://github.com/keithito/tacotron/tree/tf-griffin-lim

@keithito
Owner

I just merged PR #41 which adds a TensorFlow implementation of Griffin-Lim, based on the code in Kyubyong's notebook. This speeds things up considerably. You can synthesize 6 sec of audio in about 0.8 sec on a GTX 1080 Ti.

There's still room for improvement. Everything runs on one example at a time right now, but both the model and the Griffin-Lim implementation work fine on batches of data, so it should be pretty straightforward to synthesize multiple (e.g. 32) examples in parallel; a rough sketch follows below.

Thanks to @Kyubyong and @candlewill!
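
For anyone who wants to attempt the batched version, here is what it could look like. The attribute names (model.inputs, model.input_lengths, model.linear_outputs) follow this repo's synthesizer.py, but treat the details as untested assumptions:

  import numpy as np

  def synthesize_batch(session, model, seqs):
      """seqs: a list of integer sequences from text_to_sequence()."""
      max_len = max(len(s) for s in seqs)
      inputs = np.zeros((len(seqs), max_len), dtype=np.int32)  # zero-padded batch
      for i, s in enumerate(seqs):
          inputs[i, :len(s)] = s
      lengths = np.asarray([len(s) for s in seqs], dtype=np.int32)
      feed = {model.inputs: inputs, model.input_lengths: lengths}
      return session.run(model.linear_outputs, feed_dict=feed)  # [batch, frames, bins]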

@keithito
Owner

keithito commented Sep 13, 2017

@qclu If you're not seeing a speedup, one thing to check is if you're using a version of TensorFlow with GPU support. The first time I tried upgrading to TensorFlow 1.3, I ran pip install --upgrade tensorflow, which installed the non-GPU version. To fix this, I needed to uninstall TF, then run pip install tensorflow-gpu.

@keithito
Owner

I'll re-open this for a few more days in case there are issues.

@keithito keithito reopened this Sep 13, 2017
@lifeiteng

lifeiteng commented Sep 13, 2017

In my own implementation, going from text to linear spectrogram, synthesizing 10 utterances (batch of 10) totaling 303 characters takes 0.9374 seconds on CPU.
The max decoder timestep is 80 (the reduction factor is 5, so 80 × 5 × 12.5 ms = 5 seconds of audio output). The texts are:

Lisa They study outside.
Lisa He didn't get enough sleep.
Lisa Who enjoys gardening?
Lisa The Earth's axis is perpendicular to the Sun.
Lisa to encourage it.
Lisa Angela is staying in bed because she's sick.
Lisa Who are sitting in a circle?
Lisa Dan is American.
Lisa Harry and his friend took a train from Beijing to Shanghai.
Lisa There would be more seasons.

I think the key factors in synthesis speed are:

  1. balancing text lengths within one batch (a sketch of this follows below)
  2. stopping decoding upon reaching the end of the text (related to 1; I haven't implemented this yet)
  3. a faster Griffin-Lim implementation
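
For example, a minimal sketch of point (1), grouping texts of similar length into the same batch so padding (and wasted decoder steps) is minimized; this is an illustration, not code from any particular implementation:

  def make_batches(texts, batch_size=10):
      """Group texts of similar length to minimize padding within each batch."""
      order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
      # Keep `order` around if outputs must be restored to the original ordering.
      return [[texts[i] for i in order[j:j + batch_size]]
              for j in range(0, len(order), batch_size)]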

@qclu

qclu commented Sep 13, 2017

@candlewill Yes, thanks so much.
@keithito Yes, I compared with your latest release; the time consumptions are quite close.
Great work!

@qclu

qclu commented Sep 14, 2017

@keithito
I tried training with the latest release, v0.2.0. In my previous training, the batch size was 32 and it worked well, but with v0.2.0 I had to reduce the batch size to 8 to avoid OOM errors. My GPU is a Tesla P40.
I think this is caused by the Griffin-Lim implementation. Maybe we need to optimize it further.

@saxenauts

@MXGray @keithito
Training on the Nancy dataset seems to produce more natural-sounding results than training on the LJ Speech dataset.
Is it because the latter consists of clips from audiobooks?
If so, how can I overcome this?

Thanks

@MXGray

MXGray commented Sep 14, 2017

@saxenauts
@keithito

In case this helps, my tests so far suggest the following:

  1. The voice you get tends to be what you feed it - synthesis voice quality is largely dependent on dataset voice quality;

  2. Prediction quality tends to depend largely on the volume and diversity of what you feed it - bigger data volume helps in optimizing prediction of unseen text; and

  3. Similar to number 2 - longer dataset clips optimize prediction of longer unseen text, though they cause some issues when ending synthesis of shorter unseen text, and vice versa (shorter dataset clips cause some issues when synthesizing longer unseen text) ...

P.S. I'm continuing to separately train the newest release using LJS, Nancy, and a Tagalog (Filipino) dataset.
After manual and automatic preprocessing, I have 26,200 NPY files (26.4 GB) in the LJS training dir, while Nancy has 24,190 (19.8 GB) and our Tagalog dataset has only 988 NPY files (1.34 GB).
In all 3 cases, I got reasonably audible and comprehensible synthesis of unseen text after 120K global steps on average.

@keithito
Owner

@qclu That's a bit surprising. The training pipeline isn't using the GPU version of Griffin-Lim. Also, a Tesla P40 has 24GB of memory, which is a lot more than should be necessary. What dataset are you training on? Does it have really long audio clips?

@saxenauts

saxenauts commented Sep 21, 2017

@MXGray

Hi, could you please share your trained model checkpoints for the Nancy dataset?
I intend to try a different spectrogram inversion technique that supposedly produces faster results.
The Griffin-Lim algorithm makes multiple passes (of STFT and inverse STFT) over the entire sequence, but there is research showing faster spectrogram inverters; two such techniques are real-time inversion and single-pass inversion.
I want to test Tacotron's speed with these, and you can experiment with this too.

If you want to try single-pass inversion, there is a repo available:
https://github.com/lonce/SPSI_Python

@MXGray

MXGray commented Sep 22, 2017

@saxenauts
No problem - here you go: https://drive.google.com/file/d/0B1HeTSnLaWOSVGtMZEN4X1FDbWc/view?usp=sharing
Also, can anybody point out the names of the input and output nodes needed to freeze the model and weights into a PB graph?
I want to experiment with porting this to C++, following the guide here: https://medium.com/@hamedmp/exporting-trained-tensorflow-models-to-c-the-right-way-cf24b609d183
I would greatly appreciate your help. I tried checking out some things with write_graph, but I'm finding it hard to identify the necessary input and output node names. :(
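
In case it helps, a hedged sketch of the TF 1.x freezing workflow. The checkpoint path and the output node name below are hypothetical placeholders; the loop shows one way to hunt for the real node names:

  import tensorflow as tf

  CKPT = 'logs-tacotron/model.ckpt-120000'  # hypothetical checkpoint path

  saver = tf.train.import_meta_graph(CKPT + '.meta')
  with tf.Session() as sess:
      saver.restore(sess, CKPT)
      # Print candidate op names to locate the input/output nodes:
      for op in sess.graph.get_operations():
          if 'inputs' in op.name or 'outputs' in op.name:
              print(op.name)
      # 'model/inference/output_node' is a placeholder, not the real name:
      frozen = tf.graph_util.convert_variables_to_constants(
          sess, sess.graph.as_graph_def(), ['model/inference/output_node'])
      tf.train.write_graph(frozen, '.', 'tacotron_frozen.pb', as_text=False)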

@MXGray

MXGray commented Sep 22, 2017

BTW, another thing is making the duration of the WAV output adapt to the input length ...
@mertyildiran
I tried this, among other things, but I don't see any effect. Has anybody else tested this yet? >> #43

@MXGray

MXGray commented Sep 24, 2017

@mertyildiran
The hack I implemented for this (adapting WAV duration to text input length) is to use PyDub to detect and remove the trailing silence of the generated WAV before playing it back. It works.
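
A minimal sketch of that hack: detect nonsilent regions with PyDub and cut everything after the last one. The silence threshold and minimum silence length are assumptions you'd tune per voice:

  from pydub import AudioSegment
  from pydub.silence import detect_nonsilent

  def trim_trailing_silence(wav_path, out_path):
      audio = AudioSegment.from_wav(wav_path)
      # [start_ms, end_ms] pairs of nonsilent stretches; thresholds are guesses
      nonsilent = detect_nonsilent(audio, min_silence_len=200, silence_thresh=-40)
      if nonsilent:
          audio = audio[:nonsilent[-1][1]]  # keep up to the last nonsilent chunk
      audio.export(out_path, format='wav')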

@zuoxiang95

@MXGray It would take some time to process the trailing silence of the wav. Is it still real-time?

@MXGray

MXGray commented Sep 24, 2017

@zuoxiang95
0.29 seconds for each 1 second of audio. Using my Windows 10 laptop with a 4 GB GPU (GTX 1050), it took 1.82 seconds to synthesize, detect and remove silence, and then play back the 6.2-second audio output for this sample sentence:
[screenshot: processing time]
P.S. Pardon me if the image above isn't resized properly - I'm completely blind.

@qclu

qclu commented Sep 25, 2017

@keithito
I made a mistake in my datasets. Thank you.

@keithito
Owner

keithito commented Sep 25, 2017

Another option for trimming the silence is to run it through librosa.effects.trim:

trimmed_wav, _ = librosa.effects.trim(wav)

This seems to perform pretty well. For example:

  start = time.time()
  trimmed_wav, _ = librosa.effects.trim(wav)
  print('Trimmed from %.2f sec to %.2f sec in %.3f sec' % (
    len(wav) / sample_rate, len(trimmed_wav) / sample_rate, (time.time() - start)))

Trimmed from 6.24 sec to 4.37 sec in 0.003 sec

@keithito keithito closed this as completed Oct 4, 2017
@nucleiis

nucleiis commented Nov 3, 2017

@MXGray

Hi, thank you for sharing your model and training logs, but the Google Drive link seems dead now. Could you please re-share it for all of us?

@MXGray

MXGray commented Nov 7, 2017

@nucleiis
Just got back from a long vacation. Well, here it is:
https://drive.google.com/file/d/1AtKUeUPp95NCdve2uwXb4JYTVJb1iAt0/view?usp=sharing

@nucleiis

nucleiis commented Nov 9, 2017

@MXGray
Thank you so much! Your model will be a tremendous help for me.

@r03922123

Dear all,
Is there any intuition behind the Griffin-Lim algorithm?

I am new to the speech field and still can't understand it after going through the original paper and googling the algorithm, so if this is a stupid question, sorry about that ><

@15857541616

@MXGray
I have downloaded the Nancy speech dataset. How can I use it? What should the directory tree of my tacotron look like? Something like this:

  tacotron
  |- LJSpeech-1.1
     |- metadata.csv
     |- wavs

Could you give some details so that I can train the model myself, such as other needed files or tips? If I could have your repo, even better. :)
Thanks

@rafaelvalle

Answering the question posted in this topic: it is possible to synthesize in real time by using nv-wavenet:
https://github.com/NVIDIA/nv-wavenet

@rafaelvalle

And a PyTorch Tacotron 2 implementation with FP16 and multi-GPU support
https://github.com/NVIDIA/tacotron2

@fatchord

fatchord commented May 4, 2018

@MXGray

When normalizing the Nancy dataset, did you normalize to peak, RMS, broadcast standard (LUFS), or something else?

Cheers!

@alokprasad

With LPCNet, at least the vocoder is real-time; in fact it takes about 0.8 s to generate 1 s of audio.

@Hanwun

Hanwun commented Dec 16, 2019

@saxenauts
Currently, I want to speed up spectrogram inversion. Can you share how to implement single-pass inversion or real-time inversion in place of the Griffin-Lim algorithm in Tacotron?

Thank you very much!

@darkzbaron

darkzbaron commented Nov 15, 2020

@nucleiis @MXGray

> Just got back from a long vacation. Well, here it is:
> https://drive.google.com/file/d/1AtKUeUPp95NCdve2uwXb4JYTVJb1iAt0/view?usp=sharing

Hi there, the link is broken. Can you please share again?
