
Tacotron (2?)-based models appear to be limited to rather short input #739

Open
deliciouslytyped opened this issue Dec 26, 2021 · 10 comments

Comments


Running tts --text on some meaningful sentences results in the following output:

$ tts --text "An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future."                                                           
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.
 > Text splitted to sentences.
['An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4).', 'The rescheduling calculation is done once per second.', 'The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.']
   > Decoder stopped with `max_decoder_steps` 500
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 52.66666388511658
 > Real-time factor: 3.1740607061125763
 > Saving output to tts_output.wav

The audio file is truncated with respect to the text.
If I hack TTS/tts/configs/tacotron_config.py to use a larger max_decoder_steps value, the output does get longer, but I'm not sure how safe this is.

Are there any better solutions? Should I use a different model?
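For context, a back-of-envelope calculation (my own sketch, assuming the usual LJSpeech settings of a 22050 Hz sample rate and a hop length of 256 samples, with reduction rate r=1 so each decoder step emits one mel frame; your config may differ) suggests why 500 steps is so limiting:

```python
# Rough estimate of how much audio a given max_decoder_steps allows.
SAMPLE_RATE = 22050  # Hz (assumption for tts_models/en/ljspeech/*)
HOP_LENGTH = 256     # samples per mel frame (assumption)

def max_audio_seconds(max_decoder_steps: int, r: int = 1) -> float:
    """Upper bound on audio length the decoder can emit before it stops."""
    frames = max_decoder_steps * r
    return frames * HOP_LENGTH / SAMPLE_RATE

def steps_needed(seconds: float, r: int = 1) -> int:
    """Decoder steps needed to cover `seconds` of audio (ceiling division)."""
    frames = seconds * SAMPLE_RATE / HOP_LENGTH
    return int(-(-frames // r))

print(f"{max_audio_seconds(500):.1f}")  # default 500 steps -> 5.8 seconds
print(steps_needed(60))                 # one minute needs 5168 steps
```

By this estimate the default of 500 steps caps output at roughly 5.8 seconds of audio, which would match the truncation above.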


deliciouslytyped commented Dec 26, 2021

I'm confused because sometimes this works and other times it doesn't.
Using "test" as the test string, I first got a synthesis with an extended end and malformed audio;
then it worked, and I couldn't reproduce the problem anymore. I don't think I changed anything, but I'm not sure.

Now, I accidentally reproduced the bad sample:
badoutput.zip
(a wav file, zipped due to GitHub's upload restrictions)

Instead of just "test", you can hear something like "test-t-t-t-t-t-t-t....".

All I changed was max_decoder_steps, raising it to 1000.


jeaye commented Jan 20, 2022

I get the same thing. If sentences are past a certain length, they are cut off in the produced wav. Here's a simple example:

tts --text "This sentence, being as long as it is, most unfortunately, will not be fully stated." --out_path test.wav
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: This sentence, being as long as it is, most unfortunately, will not be fully stated.
 > Text splitted to sentences.
['This sentence, being as long as it is, most unfortunately, will not be fully stated.']
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 3.1818737983703613
 > Real-time factor: 0.49914852912682467
 > Saving output to test.wav

In this example, the speaker is cut off before saying "stated".

How can we synthesize arbitrarily long sentences?
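Until the decoder limit is addressed, one workaround is to pre-split overlong sentences at clause boundaries, synthesize each chunk separately, and concatenate the resulting audio. This is my own sketch, not a TTS feature; the character limit is a crude stand-in for the real constraint (decoder steps), and the actual synthesis call is left out:

```python
import re

# Greedily pack comma-separated clauses into chunks that each stay under a
# character budget. MAX_CHARS is an assumed proxy for "fits within
# max_decoder_steps"; tune it for your model.
MAX_CHARS = 120

def split_long_sentence(sentence: str, max_chars: int = MAX_CHARS) -> list:
    """Split a sentence at commas into chunks of at most max_chars characters."""
    clauses = re.split(r"(?<=,)\s+", sentence)
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("This sentence, being as long as it is, most unfortunately, "
        "will not be fully stated.")
for chunk in split_long_sentence(text, max_chars=40):
    print(chunk)  # each chunk would then be synthesized separately
```

Concatenating per-chunk wavs does introduce slightly unnatural pauses at the joins, but it avoids the hard truncation.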


stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix This will not be worked on label Apr 16, 2022

jeaye commented Apr 16, 2022

It may be stale, but this issue is not fixed. It's easy to reproduce and a blocker for any serious work with TTS.

@stale stale bot removed the wontfix This will not be worked on label Apr 16, 2022

stale bot commented Jul 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix This will not be worked on label Jul 31, 2022

jeaye commented Jul 31, 2022

It may be stale, but this issue is not fixed. It's easy to reproduce and a blocker for any serious work with TTS.

@stale stale bot removed the wontfix This will not be worked on label Jul 31, 2022

ethindp commented Aug 24, 2022

I'm suffering from this issue too, and I'm unsure how to resolve it. I'll try other models to see what happens, I suppose.


stale bot commented Nov 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix This will not be worked on label Nov 2, 2022

jeaye commented Nov 2, 2022

It may be stale, but this issue is not fixed. It's easy to reproduce and a blocker for any serious work with TTS.

@stale stale bot removed the wontfix This will not be worked on label Nov 2, 2022

asiletto commented Dec 21, 2022

I have the same problem here: long sentences get truncated.

It seems to be just a configuration issue, as they say in thorstenMueller/Thorsten-Voice#22.

Setting "max_decoder_steps": 10000 in the model's config.json solved the problem.
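For reference, that edit can be scripted instead of done by hand (a minimal sketch; the cache path below is where Coqui TTS stored the model on my Linux machine and is an assumption, so adjust it for your setup):

```python
import json
from pathlib import Path

# Assumed model cache location; verify where your TTS install downloads models.
config_path = Path.home() / ".local/share/tts/tts_models--en--ljspeech--tacotron2-DDC/config.json"

def bump_max_decoder_steps(path: Path, steps: int = 10000) -> None:
    """Rewrite the model's config.json with a larger max_decoder_steps."""
    config = json.loads(path.read_text())
    config["max_decoder_steps"] = steps
    path.write_text(json.dumps(config, indent=4))

# bump_max_decoder_steps(config_path)  # uncomment to apply
```

This only patches the cached copy; re-downloading the model will restore the default of 500.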
