
Tacotron (2?)-based models appear to be limited to rather short input #739

Open
deliciouslytyped opened this issue Dec 26, 2021 · 10 comments

Comments


Running tts --text on some meaningful sentences results in the following output:

$ tts --text "An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future."                                                           
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.
 > Text splitted to sentences.
['An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4).', 'The rescheduling calculation is done once per second.', 'The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.']
   > Decoder stopped with `max_decoder_steps` 500
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 52.66666388511658
 > Real-time factor: 3.1740607061125763
 > Saving output to tts_output.wav

The audio file is truncated with respect to the text.
If I hack TTS/tts/configs/tacotron_config.py to use a larger max_decoder_steps value, the output does get longer, but I'm not sure how safe this is.

Are there any better solutions? Should I use a different model?
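For context, a back-of-envelope calculation (my own sketch, assuming the usual LJSpeech settings of a 22050 Hz sample rate and a hop length of 256 samples, with reduction rate r=1 so each decoder step emits one mel frame; your config may differ) suggests why 500 steps is so limiting:

```python
# Rough estimate of how much audio a given max_decoder_steps allows.
SAMPLE_RATE = 22050  # Hz (assumption for tts_models/en/ljspeech/*)
HOP_LENGTH = 256     # samples per mel frame (assumption)

def max_audio_seconds(max_decoder_steps: int, r: int = 1) -> float:
    """Upper bound on audio length the decoder can emit before it stops."""
    frames = max_decoder_steps * r
    return frames * HOP_LENGTH / SAMPLE_RATE

def steps_needed(seconds: float, r: int = 1) -> int:
    """Decoder steps needed to cover `seconds` of audio (ceiling division)."""
    frames = seconds * SAMPLE_RATE / HOP_LENGTH
    return int(-(-frames // r))

print(f"{max_audio_seconds(500):.1f}")  # default 500 steps -> 5.8 seconds
print(steps_needed(60))                 # one minute needs 5168 steps
```

By this estimate the default of 500 steps caps output at roughly 5.8 seconds of audio, which would match the truncation above.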


deliciouslytyped commented Dec 26, 2021

I'm confused because sometimes this works and other times it doesn't.
Using "test" as the test string, I first got a synthesis with an extended end and malformed audio;
then it worked, and I couldn't reproduce the problem anymore. I don't think I changed anything, but I'm not sure.

Now, I accidentally reproduced the bad sample:
badoutput.zip
(a wav file, zipped due to GitHub's upload restrictions)

Instead of just "test", you can hear something like "test-t-t-t-t-t-t-t....".

All I changed was max_decoder_steps, raising it to 1000.


jeaye commented Jan 20, 2022

I get the same thing. If sentences are past a certain length, they are cut off in the produced wav. Here's a simple example:

tts --text "This sentence, being as long as it is, most unfortunately, will not be fully stated." --out_path test.wav
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: This sentence, being as long as it is, most unfortunately, will not be fully stated.
 > Text splitted to sentences.
['This sentence, being as long as it is, most unfortunately, will not be fully stated.']
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 3.1818737983703613
 > Real-time factor: 0.49914852912682467
 > Saving output to test.wav

In this example, the speaker is cut off before saying "stated".

How can we synthesize arbitrarily long sentences?
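Until the decoder limit is addressed, one workaround is to pre-split overlong sentences at clause boundaries, synthesize each chunk separately, and concatenate the resulting audio. This is my own sketch, not a TTS feature; the character limit is a crude stand-in for the real constraint (decoder steps), and the actual synthesis call is left out:

```python
import re

# Greedily pack comma-separated clauses into chunks that each stay under a
# character budget. MAX_CHARS is an assumed proxy for "fits within
# max_decoder_steps"; tune it for your model.
MAX_CHARS = 120

def split_long_sentence(sentence: str, max_chars: int = MAX_CHARS) -> list:
    """Split a sentence at commas into chunks of at most max_chars characters."""
    clauses = re.split(r"(?<=,)\s+", sentence)
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("This sentence, being as long as it is, most unfortunately, "
        "will not be fully stated.")
for chunk in split_long_sentence(text, max_chars=40):
    print(chunk)  # each chunk would then be synthesized separately
```

Concatenating per-chunk wavs does introduce slightly unnatural pauses at the joins, but it avoids the hard truncation.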


stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix This will not be worked on label Apr 16, 2022

jeaye commented Apr 16, 2022

It may be stale, but this issue is not fixed. It's easy to reproduce and a blocker for any serious work with TTS.

@stale stale bot removed the wontfix This will not be worked on label Apr 16, 2022

stale bot commented Jul 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix This will not be worked on label Jul 31, 2022

jeaye commented Jul 31, 2022

It may be stale, but this issue is not fixed. It's easy to reproduce and a blocker for any serious work with TTS.

@stale stale bot removed the wontfix This will not be worked on label Jul 31, 2022

ethindp commented Aug 24, 2022

I'm suffering from this issue too, and I'm unsure how to resolve it. I'll try other models to see what happens, I suppose.


stale bot commented Nov 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix This will not be worked on label Nov 2, 2022

jeaye commented Nov 2, 2022

It may be stale, but this issue is not fixed. It's easy to reproduce and a blocker for any serious work with TTS.

@stale stale bot removed the wontfix This will not be worked on label Nov 2, 2022

asiletto commented Dec 21, 2022

I have the same problem here: long sentences get truncated.

It seems to be just a configuration issue, as they say in thorstenMueller/Thorsten-Voice#22.

Setting "max_decoder_steps": 10000 in the model's config.json solved the problem.
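For reference, that edit can be scripted instead of done by hand (a minimal sketch; the cache path below is where Coqui TTS stored the model on my Linux machine and is an assumption, so adjust it for your setup):

```python
import json
from pathlib import Path

# Assumed model cache location; verify where your TTS install downloads models.
config_path = Path.home() / ".local/share/tts/tts_models--en--ljspeech--tacotron2-DDC/config.json"

def bump_max_decoder_steps(path: Path, steps: int = 10000) -> None:
    """Rewrite the model's config.json with a larger max_decoder_steps."""
    config = json.loads(path.read_text())
    config["max_decoder_steps"] = steps
    path.write_text(json.dumps(config, indent=4))

# bump_max_decoder_steps(config_path)  # uncomment to apply
```

This only patches the cached copy; re-downloading the model will restore the default of 500.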
