
Tacotron 2 #9

Closed
kikirizki opened this issue Jun 6, 2021 · 4 comments

@kikirizki

Hi, thank you for your great job. I am wondering: should I retrain Tacotron 2 with the same sample rate if I want to feed the output from Tacotron 2 to this project?

@TonyWangX
Member

TonyWangX commented Jun 7, 2021

Hello,

Do you mean the sampling rate of the waveform from which you extract the target acoustic features for Tacotron2 and the input acoustic features for the neural waveform model?
Or do you mean the frame rate of the acoustic features produced by Tacotron2, which will be the input to the neural waveform model? Those acoustic features will be replicated (up-sampled).

==== For waveform sampling rate:
During training, the procedure looks like

Text input -> Tacotron2 <- acoustic-feature of waveform(1)
acoustic-feature of waveform(1) -> neural waveform model <- waveform(2)

  1. Tacotron2 and the neural waveform model should use the same acoustic features, extracted from waveform(1).
  2. The neural waveform model's target, waveform(2), may have a lower sampling rate than waveform(1).
  3. For example, I may use waveform(1) at 48 kHz while waveform(2) is at 24 kHz.
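As a quick sanity check (my own illustrative sketch, not code from the repository): the number of analysis frames depends only on the utterance duration and the frame shift in seconds, not on the sampling rate. That is why features extracted from waveform(1) at 48 kHz still line up with a 24 kHz target waveform(2); only the hop size in samples differs.

```python
# Hypothetical sketch: a fixed frame shift (in seconds) gives the same
# frame count at any sampling rate; only the hop size in samples changes.
# Numbers below (1 s utterance, 12.5 ms shift) are illustrative.

def num_frames(duration_s: float, frame_shift_s: float) -> int:
    """Number of analysis frames for a given duration and frame shift."""
    return round(duration_s / frame_shift_s)

duration_s = 1.0        # the same utterance, recorded once
frame_shift_s = 0.0125  # 12.5 ms, as mentioned in this thread

# Identical frame count for waveform(1) at 48 kHz and waveform(2) at 24 kHz:
frames = num_frames(duration_s, frame_shift_s)

# The hop size in samples differs per sampling rate:
hop_48k = round(48000 * frame_shift_s)  # samples per frame at 48 kHz
hop_24k = round(24000 * frame_shift_s)  # samples per frame at 24 kHz
print(frames, hop_48k, hop_24k)
```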

==== For the frame rate of acoustic feature:
Related to the topic above: the frame rate, also called the frame shift or window hop:

  1. Tacotron2 and the neural waveform model should use the same acoustic features from waveform(1), so the frame rate of the acoustic features must be the same.
  2. For example, we normally use 12.5 ms as the frame shift of the acoustic features for both Tacotron2 and the neural waveform model.
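The replication (up-sampling) mentioned above can be sketched as follows. This is a hypothetical illustration with made-up shapes, assuming a 24 kHz target waveform and an 80-dim feature, not the project's actual up-sampling code:

```python
import numpy as np

# Hypothetical sketch: a neural waveform model consumes frame-level
# features by replicating each frame up to the waveform sampling rate.
frame_shift_s = 0.0125           # 12.5 ms frame shift, as in the thread
sr = 24000                       # sampling rate of waveform(2)
hop = round(sr * frame_shift_s)  # samples covered by one feature frame

mel = np.random.randn(10, 80)            # 10 frames of 80-dim features
upsampled = np.repeat(mel, hop, axis=0)  # one feature row per sample
print(upsampled.shape)
```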

In all:
It is OK if there is a mismatch between the sampling rates of waveform(1) and waveform(2).
But Tacotron2 and the neural waveform model should be trained using the same acoustic features from waveform(1).

If there is a mismatch, you can retrain either Tacotron2 or the neural waveform model, depending on your specific configuration. For example, it may be difficult to train Tacotron2 with a frame shift of 5 ms, so in that case retraining the neural waveform model with the same frame shift as Tacotron2 will be easier.

Hope it clarifies.

@kikirizki
Author

kikirizki commented Jun 7, 2021

Hi, thank you for your response; now I understand. One more question, please: I notice that the NSF input tensor size is [1, 81, n], while the output of Tacotron is [1, 80, n]. Is the extra channel due to the pitch sequence?

@TonyWangX
Member

TonyWangX commented Jun 7, 2021

Yes, exactly.

NSF requires pitch input (as the source signal)
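The shape relation can be illustrated with a small sketch (my own example with made-up data, not the project's feature pipeline): the 80-bin mel-spectrogram from Tacotron2 is stacked with a 1-dim frame-level F0 (pitch) track to form the 81-channel NSF input.

```python
import numpy as np

# Hypothetical sketch: building a (1, 81, n) NSF input from an 80-bin
# mel-spectrogram plus a 1-dim pitch (F0) sequence.
n = 100                                # number of frames (illustrative)
mel = np.random.randn(1, 80, n)        # Tacotron2 output: (1, 80, n)
f0 = np.abs(np.random.randn(1, 1, n))  # frame-level pitch: (1, 1, n)

nsf_input = np.concatenate([mel, f0], axis=1)  # stack along channel axis
print(nsf_input.shape)
```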

@kikirizki
Author

Oh, thank you for your response. I will close this issue.
