Do we need to pad audio when training HiFi-GAN on Tacotron outputs? #63
@CookiePPP I would be very grateful if you could help me with this not-so-simple matter.
https://github.com/jik876/hifi-gan/blob/master/meldataset.py#L61
y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
This repo pads the audio by (n_fft - hop_size) / 2 samples on each side (reflect padding). If you didn't update the repos, the bias would be constant, so the fine-tuned HiFi-GAN model would learn the offset just fine, but you might run into cases where the spectrogram lengths don't match, causing an exception somewhere else.
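As a minimal illustration (not from the original comment), assuming typical HiFi-GAN config values n_fft=1024 and hop_size=256, the pad amount works out like this:

```python
import torch
import torch.nn.functional as F

# Assumed typical HiFi-GAN config values (not stated in this comment).
n_fft, hop_size = 1024, 256

y = torch.randn(1, 8192)  # dummy mono waveform, shape (batch, samples)

# Same padding as meldataset.py#L61: reflect-pad the waveform by
# (n_fft - hop_size) / 2 samples on each side before the STFT.
pad = int((n_fft - hop_size) / 2)  # 384 samples per side with these values
y_padded = F.pad(y.unsqueeze(1), (pad, pad), mode='reflect').squeeze(1)

print(y.shape[-1], y_padded.shape[-1])  # 8192 -> 8960
```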
CORRECTION: https://pytorch.org/docs/stable/generated/torch.stft.html
Looks like torch.stft (used inside this repo) also performs its own padding when center=True (the default). I'm guessing that's why the implementation in this repo passes center=False and does the padding manually. You'll have to test what this does; I don't use the original spectrogram code from this repo in my stuff, so I have no idea what the audio->spectrogram code here actually outputs.
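One way to test this (a sketch under assumed config values, not code from either repo) is to compare torch.stft with its default center=True padding against manual reflect padding plus center=False, and look at the resulting frame counts:

```python
import torch
import torch.nn.functional as F

n_fft, hop_size, win_size = 1024, 256, 1024  # assumed config values
window = torch.hann_window(win_size)
y = torch.randn(8192)  # dummy waveform

# Variant A: torch.stft performs its own centering/padding (center=True default).
spec_centered = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                           window=window, center=True, pad_mode='reflect',
                           return_complex=True)

# Variant B: manual reflect padding by (n_fft - hop_size) / 2, then center=False.
pad = int((n_fft - hop_size) / 2)
y_manual = F.pad(y.view(1, 1, -1), (pad, pad), mode='reflect').view(-1)
spec_manual = torch.stft(y_manual, n_fft, hop_length=hop_size, win_length=win_size,
                         window=window, center=False, return_complex=True)

# Differing frame counts here are exactly the kind of length mismatch
# that can raise an exception further down the training pipeline.
print(spec_centered.shape[-1], spec_manual.shape[-1])  # 33 vs 32 with these values
```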
Hello.
In the NVIDIA Tacotron 2 repository, the function that returns mel spectrograms of the audio first pads it on both sides. This means the Tacotron is trained to generate mel spectrograms not for the original audio, but for audio padded with zeros on the left and right. When generating mels in teacher-forcing mode for HiFi-GAN, we therefore also get spectrograms that assume padding at the beginning of the audio. If we try to align the original audio (without padding) with the generated mel spectrograms, the correspondence between them will be shifted by the amount of padding applied to the original audio during Tacotron training. Does this mean that before training the vocoder, we must pad the audio (in particular at the beginning, to avoid an offset)?
In the case of the NVIDIA Tacotron 2, these are the relevant lines:
https://github.com/NVIDIA/tacotron2/blame/185cd24e046cc1304b4f8e564734d2498c6e2e6f/stft.py#L85-L88
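For illustration only (assuming the default Tacotron 2 hparams filter_length=1024 and hop_length=256, which are not stated in the issue; note that the linked lines use reflect rather than zero padding), the padding applied in stft.py and one way to apply the equivalent padding to the raw audio before fine-tuning the vocoder could look like this:

```python
import torch
import torch.nn.functional as F

# Assumed default Tacotron 2 hparams (not stated in the issue).
filter_length, hop_length = 1024, 256

audio = torch.randn(1, 8192)  # raw waveform, shape (batch, samples)

# Equivalent of stft.py#L85-L88: reflect-pad by filter_length / 2 on each
# side, so that frame t of the mel spectrogram is centred on sample
# t * hop_length of the original audio.
pad = int(filter_length / 2)
audio_padded = F.pad(audio.unsqueeze(1), (pad, pad), mode='reflect').squeeze(1)

# If the vocoder's data loader pads differently (e.g. by (n_fft - hop_size) / 2),
# padding the training audio the same way Tacotron did is one way to keep the
# teacher-forced mel frames and the waveform aligned, avoiding a constant offset.
print(audio.shape[-1], audio_padded.shape[-1])  # 8192 -> 9216
```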