
Tacotron 2 #9

Closed
kikirizki opened this issue Jun 6, 2021 · 4 comments

@kikirizki

Hi, thank you for your great job. I am wondering: should I retrain Tacotron 2 with the same sample rate if I want to feed the output from Tacotron 2 to this project?

@TonyWangX
Member

TonyWangX commented Jun 7, 2021

Hello,

Do you mean the sampling rate of the waveform from which you extract the target acoustic features for Tacotron2 and the input acoustic features for the neural waveform model?
Or do you mean the frame rate of the acoustic features produced by Tacotron2, which will be the input to the neural waveform model? Those acoustic features will be replicated (up-sampled).

==== For waveform sampling rate:
During training, the procedure looks like

Text input -> Tacotron2 <- acoustic-feature of waveform(1)
acoustic-feature of waveform(1) -> neural waveform model <- waveform(2)

  1. Tacotron2 and the neural waveform model should use the same acoustic features, extracted from waveform(1).
  2. The neural waveform model's target, waveform(2), may have a lower sampling rate than waveform(1).
  3. For example, I may use waveform(1) at 48 kHz while waveform(2) is at 24 kHz.
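As a quick sanity check (my own illustrative sketch, not code from the repository): the number of analysis frames depends only on the utterance duration and the frame shift in seconds, not on the sampling rate. That is why features extracted from waveform(1) at 48 kHz still line up with a 24 kHz target waveform(2); only the hop size in samples differs.

```python
# Hypothetical sketch: a fixed frame shift (in seconds) gives the same
# frame count at any sampling rate; only the hop size in samples changes.
# Numbers below (1 s utterance, 12.5 ms shift) are illustrative.

def num_frames(duration_s: float, frame_shift_s: float) -> int:
    """Number of analysis frames for a given duration and frame shift."""
    return round(duration_s / frame_shift_s)

duration_s = 1.0        # the same utterance, recorded once
frame_shift_s = 0.0125  # 12.5 ms, as mentioned in this thread

# Identical frame count for waveform(1) at 48 kHz and waveform(2) at 24 kHz:
frames = num_frames(duration_s, frame_shift_s)

# The hop size in samples differs per sampling rate:
hop_48k = round(48000 * frame_shift_s)  # samples per frame at 48 kHz
hop_24k = round(24000 * frame_shift_s)  # samples per frame at 24 kHz
print(frames, hop_48k, hop_24k)
```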

==== For the frame rate of acoustic feature:
Related to the topic above: the frame rate, also called the frame shift or window hop:

  1. Tacotron2 and the neural waveform model should use the same acoustic features from waveform(1), so the frame rate of the acoustic features must be the same.
  2. For example, we normally use 12.5 ms as the frame shift of the acoustic features for both Tacotron2 and the neural waveform model.
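The replication (up-sampling) mentioned above can be sketched as follows. This is a hypothetical illustration with made-up shapes, assuming a 24 kHz target waveform and an 80-dim feature, not the project's actual up-sampling code:

```python
import numpy as np

# Hypothetical sketch: a neural waveform model consumes frame-level
# features by replicating each frame up to the waveform sampling rate.
frame_shift_s = 0.0125           # 12.5 ms frame shift, as in the thread
sr = 24000                       # sampling rate of waveform(2)
hop = round(sr * frame_shift_s)  # samples covered by one feature frame

mel = np.random.randn(10, 80)            # 10 frames of 80-dim features
upsampled = np.repeat(mel, hop, axis=0)  # one feature row per sample
print(upsampled.shape)
```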

In all:
It is OK if there is a mismatch between the sampling rates of waveform(1) and waveform(2).
But Tacotron2 and the neural waveform model should be trained using the same acoustic features from waveform(1).

If there is a mismatch, you can retrain either Tacotron2 or the neural waveform model, depending on your specific configuration. For example, it may be difficult to train Tacotron2 with a frame shift of 5 ms, so in that case retraining the neural waveform model with the same frame shift as Tacotron2 will be easier.

Hope it clarifies.

@kikirizki
Author

kikirizki commented Jun 7, 2021

Hi, thank you for your response; now I understand. One more question, please: I notice that the NSF input tensor size is [1, 81, n], while the output of Tacotron is [1, 80, n]. Is the extra channel due to the pitch sequence?

@TonyWangX
Member

TonyWangX commented Jun 7, 2021

Yes, exactly.

NSF requires pitch input (as the source signal)
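The shape relation can be illustrated with a small sketch (my own example with made-up data, not the project's feature pipeline): the 80-bin mel-spectrogram from Tacotron2 is stacked with a 1-dim frame-level F0 (pitch) track to form the 81-channel NSF input.

```python
import numpy as np

# Hypothetical sketch: building a (1, 81, n) NSF input from an 80-bin
# mel-spectrogram plus a 1-dim pitch (F0) sequence.
n = 100                                # number of frames (illustrative)
mel = np.random.randn(1, 80, n)        # Tacotron2 output: (1, 80, n)
f0 = np.abs(np.random.randn(1, 1, n))  # frame-level pitch: (1, 1, n)

nsf_input = np.concatenate([mel, f0], axis=1)  # stack along channel axis
print(nsf_input.shape)
```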

@kikirizki
Author

Oh, thank you for your response. I will close this issue.
