Do we need to pad audio when training HiFi-GAN on Tacotron outputs? #63
@CookiePPP I would be very grateful if you could help me with this not-so-simple matter.
https://github.com/jik876/hifi-gan/blob/master/meldataset.py#L61
y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
This repo pads the audio by (n_fft - hop_size) / 2 samples on each side (reflect padding). If you didn't update the repos, the bias would be constant, so the fine-tuned HiFi-GAN model would learn the offset just fine, but you might run into cases where the spectrogram lengths don't match, causing an exception somewhere else.
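As a minimal illustration (not from the original comment), assuming typical HiFi-GAN config values n_fft=1024 and hop_size=256, the pad amount works out like this:

```python
import torch
import torch.nn.functional as F

# Assumed typical HiFi-GAN config values (not stated in this comment).
n_fft, hop_size = 1024, 256

y = torch.randn(1, 8192)  # dummy mono waveform, shape (batch, samples)

# Same padding as meldataset.py#L61: reflect-pad the waveform by
# (n_fft - hop_size) / 2 samples on each side before the STFT.
pad = int((n_fft - hop_size) / 2)  # 384 samples per side with these values
y_padded = F.pad(y.unsqueeze(1), (pad, pad), mode='reflect').squeeze(1)

print(y.shape[-1], y_padded.shape[-1])  # 8192 -> 8960
```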
CORRECTION: https://pytorch.org/docs/stable/generated/torch.stft.html
Looks like torch.stft (used inside this repo) also performs its own padding when center=True (the default). I'm guessing that's why the implementation in this repo passes center=False and does the padding manually. You'll have to test what this does; I don't use the original spectrogram code from this repo in my stuff, so I have no idea what the audio->spectrogram code here actually outputs.
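One way to test this (a sketch under assumed config values, not code from either repo) is to compare torch.stft with its default center=True padding against manual reflect padding plus center=False, and look at the resulting frame counts:

```python
import torch
import torch.nn.functional as F

n_fft, hop_size, win_size = 1024, 256, 1024  # assumed config values
window = torch.hann_window(win_size)
y = torch.randn(8192)  # dummy waveform

# Variant A: torch.stft performs its own centering/padding (center=True default).
spec_centered = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                           window=window, center=True, pad_mode='reflect',
                           return_complex=True)

# Variant B: manual reflect padding by (n_fft - hop_size) / 2, then center=False.
pad = int((n_fft - hop_size) / 2)
y_manual = F.pad(y.view(1, 1, -1), (pad, pad), mode='reflect').view(-1)
spec_manual = torch.stft(y_manual, n_fft, hop_length=hop_size, win_length=win_size,
                         window=window, center=False, return_complex=True)

# Differing frame counts here are exactly the kind of length mismatch
# that can raise an exception further down the training pipeline.
print(spec_centered.shape[-1], spec_manual.shape[-1])  # 33 vs 32 with these values
```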
Hello.
In the NVIDIA Tacotron 2 repository, the function that returns mel spectrograms of the audio first pads it on both sides. This means the Tacotron is trained to generate mel spectrograms not for the original audio, but for audio padded with zeros on the left and right. When generating mels in teacher-forcing mode for HiFi-GAN, we therefore also get spectrograms that assume padding at the beginning of the audio. If we try to align the original audio (without padding) with the generated mel spectrograms, the correspondence between them will be shifted by the amount of padding applied to the original audio during Tacotron training. Does this mean that before training the vocoder, we must pad the audio (in particular at the beginning, to avoid an offset)?
In the case of the NVIDIA Tacotron 2, these are the relevant lines:
https://github.com/NVIDIA/tacotron2/blame/185cd24e046cc1304b4f8e564734d2498c6e2e6f/stft.py#L85-L88
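For illustration only (assuming the default Tacotron 2 hparams filter_length=1024 and hop_length=256, which are not stated in the issue; note that the linked lines use reflect rather than zero padding), the padding applied in stft.py and one way to apply the equivalent padding to the raw audio before fine-tuning the vocoder could look like this:

```python
import torch
import torch.nn.functional as F

# Assumed default Tacotron 2 hparams (not stated in the issue).
filter_length, hop_length = 1024, 256

audio = torch.randn(1, 8192)  # raw waveform, shape (batch, samples)

# Equivalent of stft.py#L85-L88: reflect-pad by filter_length / 2 on each
# side, so that frame t of the mel spectrogram is centred on sample
# t * hop_length of the original audio.
pad = int(filter_length / 2)
audio_padded = F.pad(audio.unsqueeze(1), (pad, pad), mode='reflect').squeeze(1)

# If the vocoder's data loader pads differently (e.g. by (n_fft - hop_size) / 2),
# padding the training audio the same way Tacotron did is one way to keep the
# teacher-forced mel frames and the waveform aligned, avoiding a constant offset.
print(audio.shape[-1], audio_padded.shape[-1])  # 8192 -> 9216
```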