
Do we need to pad audio when training HiFi-GAN on Tacotron outputs? #63

Closed
Alexey322 opened this issue Mar 9, 2021 · 3 comments


@Alexey322

Hello.

In the NVIDIA Tacotron 2 repository, the function that returns the mel spectrogram of an audio clip first pads it with zeros on both sides. This means Tacotron 2 is trained to generate mel spectrograms not for the original audio, but for audio padded on the left and right. When generating mels in teacher-forcing mode for HiFi-GAN, we therefore also get spectrograms that assume padding at the beginning of the audio. If we try to align the original (unpadded) audio with the generated mel spectrograms, the correspondence between them will be shifted by the amount of padding applied during Tacotron training. Does this mean that before training the vocoder we must pad the audio (in particular at the beginning, to avoid an offset)?

In the case of NVIDIA's Tacotron 2, this happens here:
https://github.com/NVIDIA/tacotron2/blame/185cd24e046cc1304b4f8e564734d2498c6e2e6f/stft.py#L85-L88
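The offset described above can be sketched with simple index arithmetic. This is a minimal illustration, assuming typical Tacotron 2 hyperparameters (n_fft=1024, hop=256); the linked stft.py reflect-pads the waveform by n_fft//2 on each side before framing, so frame t is centered at sample t*hop of the original audio rather than starting there.

```python
# Sketch (assumed hparams: n_fft=1024, hop=256).
# Centered framing (Tacotron 2 style) shifts each frame's window
# n_fft//2 samples to the left relative to a naive, uncentered STFT.

def frame_window(t, n_fft=1024, hop=256, centered=True):
    """Return (start, end) sample indices of frame t in the original audio."""
    if centered:
        start = t * hop - n_fft // 2  # padding shifts the window left
    else:
        start = t * hop
    return start, start + n_fft

# Frame 0 of a centered STFT reaches n_fft//2 samples *before* sample 0;
# those samples come from the padding added around the waveform.
print(frame_window(0))                  # (-512, 512)
print(frame_window(0, centered=False))  # (0, 1024)
```

Under these assumptions, every frame of the teacher-forced spectrogram is offset by n_fft//2 samples against the unpadded audio, which is the bias the question is asking about.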

@Alexey322
Author

@CookiePPP I would be very grateful if you could help me with this tricky matter.

@CookiePPP

CookiePPP commented Mar 11, 2021

https://github.com/jik876/hifi-gan/blob/master/meldataset.py#L61

    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')

This repo pads the audio by (n_fft-hop_size)//2 on both sides, while tacotron2 pads by n_fft//2 on both sides.
This should be very easy to fix: just remove -hop_size from this repo, or add -self.hop_length in the tacotron2 repo.

If you don't update either repo, the bias is constant, so the fine-tuned HiFi-GAN model would learn the offset just fine, but you might run into cases where the spectrogram lengths don't match, causing an exception somewhere else.
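The length mismatch can be checked with frame-count arithmetic. A minimal sketch, assuming n_fft=1024, hop=256, a non-centered STFT after manual padding, and a segment length divisible by hop (as in HiFi-GAN training):

```python
# Sketch (assumed values n_fft=1024, hop=256): number of STFT frames
# produced for audio of length L under the two padding schemes.

def n_frames(L, pad_each_side, n_fft=1024, hop=256):
    """Frames of a non-centered STFT after padding L by pad_each_side twice."""
    return 1 + (L + 2 * pad_each_side - n_fft) // hop

L = 8192  # segment length divisible by hop, as in HiFi-GAN training

hifigan  = n_frames(L, (1024 - 256) // 2)  # this repo: (n_fft-hop)//2
tacotron = n_frames(L, 1024 // 2)          # tacotron2:  n_fft//2

print(hifigan, tacotron)  # 32 33
```

Under these assumptions, this repo's padding yields exactly L/hop frames (32), while Tacotron 2's yields one extra frame (33), which is the kind of spectrogram-length mismatch mentioned above.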

@CookiePPP

CookiePPP commented Mar 11, 2021

CORRECTION


https://pytorch.org/docs/stable/generated/torch.stft.html

If center is True (default), input will be padded on both sides so that the t-th frame is centered at time t × hop_length. Otherwise, the t-th frame begins at time t × hop_length.

Looks like torch.stft (used inside this repo) also performs its own padding. I'm guessing that's why the implementation in this repo has -hop_size added.
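One way to sanity-check this guess is pure arithmetic, with no torch required. Per the docs quoted above, torch.stft with center=True reflect-pads by n_fft//2 on both sides internally. A sketch under the same assumed values (n_fft=1024, hop=256, L=8192):

```python
# Sketch: frame counts implied by torch.stft's `center` flag, computed
# arithmetically. center=True adds n_fft//2 of internal padding per side.

def stft_frames(L, n_fft=1024, hop=256, center=True):
    if center:
        L = L + 2 * (n_fft // 2)  # padding torch.stft adds by itself
    return 1 + (L - n_fft) // hop

L = 8192
print(stft_frames(L, center=False))  # 29: no padding at all
print(stft_frames(L, center=True))   # 33: torch.stft's own padding alone
print(stft_frames(L + 2 * (1024 - 256) // 2, center=False))  # 32: manual
                                     # (n_fft-hop)//2 pad, no center pad
```

Under these assumptions, the manual (n_fft-hop)//2 pad with an uncentered STFT is the only combination that lands on exactly L/hop frames; combining a manual pad with center=True would double-pad, so which scheme the repo's code actually hits depends on how it calls torch.stft, as the next comment notes.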

You'll have to test what this does, I don't use the original spectrogram code from this repo in my stuff so I have no idea what the audio->spectrogram code here actually outputs.
