
Add option to interpolate mel between TTS and vocoder to increase vocoder compatibility #520

Closed · WeberJulian opened this issue Sep 17, 2020 · 11 comments

@WeberJulian (Contributor) commented Sep 17, 2020

Hello,

As of right now, if you train your TTS model with a certain sample rate, you can't use it with a pretrained vocoder trained on another sample rate. (The same holds for other audio parameters, like hop_size.)

As demonstrated in this Colab notebook, by simply interpolating the mel spectrogram to the right size, we can use a pretrained vocoder regardless of the sample rate it was initially trained on.
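
A minimal sketch of the idea (illustrative values, not from the notebook; assumes a mel of shape (num_mels, T) and PyTorch):

    import torch
    import torch.nn.functional as F

    # hypothetical setup: a TTS model trained at 16 kHz, a vocoder trained at 22.05 kHz
    mel = torch.rand(80, 250)                # (num_mels, T)
    scale = 22050 / 16000                    # ratio of the two sample rates
    mel_interp = F.interpolate(
        mel.unsqueeze(0).unsqueeze(0),       # bilinear mode needs a 4D (N, C, H, W) tensor
        scale_factor=(1, scale),             # keep num_mels, stretch the time axis
        mode="bilinear",
        align_corners=False,
    ).squeeze(0).squeeze(0)                  # back to (num_mels, T')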

Now we have to see what changes are necessary to make this work in mozilla/TTS.

@erogol (Contributor) commented Sep 18, 2020

I think we can add interpolation to audio.py and call it from there as necessary.

@WeberJulian (Contributor, Author)

Ok thanks. I'm gonna look into it today.

@WeberJulian (Contributor, Author)

Also, do you know a way to compute the shape of the target mel spectrogram without having to initialize two different audio processors?

@erogol (Contributor) commented Sep 18, 2020

You can take the ratio of the two: config_TTS.audio['sample_rate'] / config_vocoder.audio['sample_rate'].

Would that be ok?

@WeberJulian (Contributor, Author)

> You can take the ratio of the two: config_TTS.audio['sample_rate'] / config_vocoder.audio['sample_rate'].

Thing is, the sample rate is not the only audio parameter that affects the shape of the mel spectrogram. I know fft_size and num_mels affect it as well, and I suspect others do too. Although I don't know if there is a need to support those other differences...
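
For a rough sense of the shapes involved (illustrative numbers, not from the thread):

    import numpy as np
    import librosa

    # 1 second of audio at 22050 Hz, hop_length=256, num_mels=80
    y = np.zeros(22050)
    mel = librosa.feature.melspectrogram(y=y, sr=22050, n_fft=1024,
                                         hop_length=256, n_mels=80)
    print(mel.shape)  # (80, 87): num_mels rows, 1 + len(y) // hop_length frames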

@WeberJulian (Contributor, Author)

Also, the function that interpolates the mel will need both the vocoder's and the TTS model's audio configs. I was thinking that if we implement it in audio.py, the audio processor would need to take both configs in its constructor, or we could pass the TTS config to the function every time we want to interpolate. Do you see a better way to implement this?

@erogol (Contributor) commented Sep 18, 2020

I was thinking of just having a function in audio.py that would take the target sample rate and then interpolate relative to its own sample rate:

def interpolate(self, mel, target_sr):
    # stretch the time axis by the ratio of the two sample rates
    scale_factor = target_sr / self.sample_rate
    return F.interpolate(mel, scale_factor=scale_factor, ...)

Then it is up to the user to use it wherever they like. I don't think there is a straightforward way to automate all of this without more code spread around, which I'd prefer not to do for the sake of simplicity.

What do you think?

@WeberJulian (Contributor, Author)

Yeah, I guess we can start with that. It was @george-roussos's comment that made me realise it would be nice to support other kinds of audio processing tweaks (like the fft_size he uses).

To support all of this, we could just have a function like this:

def interpolate(self, mel, target_sr=None, scale_factors=None):
    if not scale_factors:
        scale_factors = (1, target_sr / self.sample_rate)
    return F.interpolate(mel, scale_factor=scale_factors, ...)
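
For illustration, usage might then look like this (hypothetical ap instance, assumed to be the TTS audio processor; values are made up):

    # scale only the time axis, e.g. from a 16 kHz TTS model to a 22.05 kHz vocoder
    mel_out = ap.interpolate(mel, target_sr=22050)
    # or pass explicit factors when other parameters differ as well, e.g. 96 -> 80 mel bands
    mel_out = ap.interpolate(mel, scale_factors=(80 / 96, 22050 / 16000))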

@ghost commented Sep 29, 2020

> You can take the ratio of the two: config_TTS.audio['sample_rate'] / config_vocoder.audio['sample_rate'].

> Thing is, the sample rate is not the only audio parameter that affects the shape of the mel spectrogram. I know fft_size and num_mels affect it as well, and I suspect others do too. Although I don't know if there is a need to support those other differences...

Hello @WeberJulian, I've looked into interpolation for vocoder compatibility too, but never had to write the code because I've always trained my synthesizer with the intent of using a specific vocoder. The relevant parameters are the following (a small compatibility-check sketch follows the list):

1. frame_shift = hop_length / sample_rate

This determines the time duration of each mel frame. A frame shift of 12.5ms is typical.

2. num_mels, fmin, fmax

These 3 parameters determine the frequency range of each mel band (typically num_mels=80 of them). Since it's a mel scale, the spacing is logarithmic, with more bands allocated to the lower frequencies.

3. max_norm, symmetric_norm

If these are different, then the spectrogram data needs to be adjusted for the output volume to be correct.
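
A minimal sketch of how those checks might look (a hypothetical helper, not part of mozilla/TTS; the config keys are assumed to match the audio section of its config files):

    # Hypothetical helper: compare the parameters listed above to decide what
    # kind of adjustment a TTS/vocoder pair needs. `tts` and `voc` are the
    # audio config dicts of the two models.
    def mel_compatibility(tts, voc):
        same_frame_shift = (tts["hop_length"] / tts["sample_rate"]
                            == voc["hop_length"] / voc["sample_rate"])
        same_bands = all(tts[k] == voc[k] for k in ("num_mels", "mel_fmin", "mel_fmax"))
        same_norm = all(tts[k] == voc[k] for k in ("max_norm", "symmetric_norm"))
        return same_frame_shift, same_bands, same_norm

If all three are True, the mels are drop-in compatible; a frame-shift or band mismatch calls for interpolation, while a normalization mismatch only needs the values rescaled.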

@WeberJulian (Contributor, Author)

Hi @blue-fish, thanks for the info.
Knowing which parameters influence the size of the mel spec is the easier half of the battle. The hard part is computing the exact transformation from all of those parameters. For now I think the best solution would be to let people choose between scaling by the sample-rate ratio (90% of use cases), specifying the scale_factors themselves, or computing the factors automatically by passing the TTS's audio processor as an argument.

Or we could have a separate function that computes the factors just once:

def compute_scale_factors(self, ap_tts):
    y_TTS = torch.rand(ap_tts.sample_rate)       # random 1-second wav at the TTS sample rate
    y_vocoder = torch.rand(self.sample_rate)     # same, at the vocoder sample rate
    mel_TTS = ap_tts.melspectrogram(y_TTS)
    mel_vocoder = self.melspectrogram(y_vocoder)
    # ratio of the two mel shapes along each axis -> scale_factors
    return (mel_vocoder.shape[0] / mel_TTS.shape[0],
            mel_vocoder.shape[1] / mel_TTS.shape[1])

and this for interpolation:

def interpolate(self, mel, target_sr=None, scale_factors=None):
    if not scale_factors:
        scale_factors = (1, target_sr / self.sample_rate)
    # bilinear mode needs a 4D (N, C, H, W) tensor
    return F.interpolate(mel.unsqueeze(0).unsqueeze(0), scale_factor=scale_factors,
                         mode='bilinear').squeeze(0).squeeze(0)
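
For illustration, the two sketches would be used together roughly like this (hypothetical AudioProcessor instances ap_tts and ap_vocoder; wav is a waveform at the TTS sample rate):

    # hypothetical glue code, assuming both methods live on the vocoder's AudioProcessor
    scale_factors = ap_vocoder.compute_scale_factors(ap_tts)
    mel = torch.from_numpy(ap_tts.melspectrogram(wav)).float()    # (num_mels, T)
    mel_for_vocoder = ap_vocoder.interpolate(mel, scale_factors=scale_factors)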

What do you think @erogol ?

stale bot commented Nov 28, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check our Discourse page for further help: https://discourse.mozilla.org/c/tts

stale bot added the wontfix (This will not be worked on) label on Nov 28, 2020
erogol removed the wontfix label on Nov 29, 2020
erogol closed this as completed on Nov 29, 2020
Mic92 pushed a commit to Mic92/TTS that referenced this issue on Oct 27, 2021: "fix taco2 speaker-embeddings dimension during inference"