# Universal MelGAN

Synthesize a mel spectrogram into a waveform. These models can synthesize multiple speakers.

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/universal-melgan](https://github.com/huseinzol05/malaya-speech/tree/master/example/universal-melgan).
    
</div>

<div class="alert alert-info">

This module is language independent, so it is safe to use on different languages.
    
</div>

### Vocoder description

1. These vocoder models can only convert mel spectrograms generated by malaya-speech TTS models.
2. They only accept a mel feature size of 80.
3. They generate waveforms at a 22050 Hz sample rate.
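
The constraints above can be checked before a mel spectrogram is passed to a vocoder. The sketch below is only an illustration: `check_mel` is a hypothetical helper, not part of malaya-speech, and it assumes the array is laid out as `(frames, 80)`, so verify that layout against your own features.

```python
import numpy as np

N_MELS = 80          # mel feature size accepted by the vocoders
SAMPLE_RATE = 22050  # sample rate of the generated waveform


def check_mel(mel):
    """Hypothetical helper: validate a mel spectrogram before vocoding.

    Assumes a 2-D array shaped (frames, N_MELS).
    """
    mel = np.asarray(mel, dtype=np.float32)
    if mel.ndim != 2 or mel.shape[1] != N_MELS:
        raise ValueError(f'expected shape (frames, {N_MELS}), got {mel.shape}')
    if not np.isfinite(mel).all():
        raise ValueError('mel contains NaN or infinite values')
    return mel


# A silent placeholder input with 120 frames passes the check.
mel_ok = check_mel(np.zeros((120, N_MELS)))
print(mel_ok.shape)
```

Failing early with a clear message is usually easier to debug than a shape error raised from inside the model graph.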

### Explanation

If you use the MelGAN vocoders from https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html, each speaker has its own MelGAN vocoder.

Universal MelGAN, https://arxiv.org/abs/2011.09631, solves this problem: a single model can synthesize any mel spectrogram into a waveform.

In [1]:
import malaya_speech
import numpy as np

### List available MelGAN

In [2]:
malaya_speech.vocoder.available_melgan()

| Model | Size (MB) | Quantized Size (MB) | Mel loss |
|---|---|---|---|
| male | 17.3 | 4.53 | 0.4443 |
| female | 17.3 | 4.53 | 0.4434 |
| husein | 17.3 | 4.53 | 0.4442 |
| haqkiem | 17.3 | 4.53 | 0.4819 |
| universal | 309.0 | 77.5 | 0.4463 |
| universal-1024 | 78.4 | 19.9 | 0.4591 |


`universal` uses the original parameters from the paper, while `universal-1024` is a smaller variant that uses 1024 filters.
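
To make the size trade-off concrete, the short calculation below derives compression ratios from the sizes listed above. It uses only the published numbers; the dictionary name `sizes_mb` is introduced here purely for illustration.

```python
# Sizes in MB copied from the listing above: (full precision, quantized).
sizes_mb = {
    'male': (17.3, 4.53),
    'universal': (309.0, 77.5),
    'universal-1024': (78.4, 19.9),
}

# Ratio between the full-precision and quantized checkpoints.
quantized_ratio = {name: full / quant for name, (full, quant) in sizes_mb.items()}
for name, value in quantized_ratio.items():
    print(f'{name}: {value:.2f}x smaller when quantized')

# How much smaller universal-1024 is than universal at full precision.
ratio = sizes_mb['universal'][0] / sizes_mb['universal-1024'][0]
print(f'universal / universal-1024: {ratio:.2f}x')
```

Both quantization and the 1024-filter variant shrink the model by roughly a factor of four, so the quantized `universal-1024` is the smallest universal option, at the cost of a slightly higher mel loss.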

### Load MelGAN model

```python
def melgan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load MelGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='universal-1024')
        Model architecture supported. Allowed values:

        * ``'female'`` - MelGAN trained on female voice.
        * ``'male'`` - MelGAN trained on male voice.
        * ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
        * ``'universal'`` - Universal MelGAN trained on multiple speakers.
        * ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.
        
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
```

In [3]:
melgan = malaya_speech.vocoder.melgan(model = 'universal')
quantized_melgan = malaya_speech.vocoder.melgan(model = 'universal', quantized = True)

In [4]:
melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024')
quantized_melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024', quantized = True)



### Load some examples

We used specific STFT parameters and steps to convert waveforms to mel spectrograms during training; the universal MelGAN models will not work on features prepared any other way. Our steps:

1. Convert the waveform into a mel spectrogram.
2. Take the log10 of that mel spectrogram.
3. Normalize using the global mean and std.

The models should also be trainable without global normalization.

To reuse the same steps, use the `malaya_speech.featurization.universal_mel` function.
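
Steps 2 and 3 can be sketched in NumPy as follows. This is a minimal illustration, not the library's implementation: `GLOBAL_MEAN`, `GLOBAL_STD` and `FLOOR` are placeholder values, `log_normalize` and `denormalize` are hypothetical names, and the input is random data standing in for a real mel spectrogram. For actual inference, use `malaya_speech.featurization.universal_mel`, which applies the parameters used in training.

```python
import numpy as np

# Placeholder statistics for illustration only; the real global mean and
# std used by malaya-speech are not reproduced here.
GLOBAL_MEAN = -1.0
GLOBAL_STD = 1.5
FLOOR = 1e-10  # assumed lower bound so log10 never receives zero


def log_normalize(mel, mean=GLOBAL_MEAN, std=GLOBAL_STD):
    """Apply log10 to a mel spectrogram, then normalize it globally."""
    log_mel = np.log10(np.maximum(mel, FLOOR))
    return (log_mel - mean) / std


def denormalize(norm_mel, mean=GLOBAL_MEAN, std=GLOBAL_STD):
    """Invert log_normalize back to the linear mel scale."""
    return 10.0 ** (norm_mel * std + mean)


# Fake non-negative mel spectrogram shaped (frames, 80).
rng = np.random.default_rng(0)
mel = rng.uniform(0.01, 5.0, size=(100, 80))

normalized = log_normalize(mel)
restored = denormalize(normalized)
print(normalized.shape, np.allclose(restored, mel))
```

Because the mean and std are global constants rather than per-utterance statistics, the same transform applies to every speaker and clip.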

In [5]:
y, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav', sr = 22050)
mel = malaya_speech.featurization.universal_mel(y)

In [6]:
import IPython.display as ipd

ipd.Audio(y, rate = 22050)

In [7]:
%%time

y_ = melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 24.1 s, sys: 2.8 s, total: 26.9 s
Wall time: 5.77 s


In [8]:
%%time

y_ = quantized_melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 23.2 s, sys: 2.37 s, total: 25.6 s
Wall time: 4.87 s


In [9]:
%%time

y_ = melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 7 s, sys: 931 ms, total: 7.93 s
Wall time: 1.87 s


In [10]:
%%time

y_ = quantized_melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 6.8 s, sys: 873 ms, total: 7.67 s
Wall time: 1.74 s


In [33]:
# try english audio
y, sr = malaya_speech.load('speech/44k/test-2.wav', sr = 22050)
y = y[:sr * 4]
mel = malaya_speech.featurization.universal_mel(y)
ipd.Audio(y, rate = 22050)

In [34]:
%%time

y_ = melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 22.7 s, sys: 2.04 s, total: 24.7 s
Wall time: 3.81 s


In [35]:
%%time

y_ = melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 6.44 s, sys: 659 ms, total: 7.1 s
Wall time: 1.31 s
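
One way to read these timings is the real-time factor (RTF), synthesis time divided by audio duration. The sketch below uses the wall times reported above and assumes the source file is at least 4 seconds long, so that `y[:sr * 4]` yields exactly 4 seconds of audio. These are single runs on one machine, so treat the figures as rough indications only; `real_time_factor` is a hypothetical helper.

```python
def real_time_factor(wall_seconds, audio_seconds):
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    return wall_seconds / audio_seconds


audio_seconds = 4.0  # the clip was truncated with y[:sr * 4]

# Wall times copied from the two cells above (single runs).
wall_times = {'universal': 3.81, 'universal-1024': 1.31}

rtf = {name: real_time_factor(t, audio_seconds) for name, t in wall_times.items()}
for name, value in rtf.items():
    print(f'{name}: RTF = {value:.2f}')
```

On these numbers `universal` runs only slightly faster than real time on CPU, while `universal-1024` needs about a third of the audio duration.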


### Combine with FastSpeech2 TTS

In [11]:
female_v2 = malaya_speech.tts.fastspeech2(model = 'female-v2')
haqkiem = malaya_speech.tts.fastspeech2(model = 'haqkiem')

In [12]:
string = 'husein busuk masam ketiak pun masam tapi nasib baik comel'

In [13]:
%%time

r_female_v2 = female_v2.predict(string)

CPU times: user 676 ms, sys: 181 ms, total: 858 ms
Wall time: 611 ms


In [14]:
%%time

r_haqkiem = haqkiem.predict(string)

CPU times: user 1.16 s, sys: 349 ms, total: 1.51 s
Wall time: 1.11 s


In [15]:
y_ = melgan(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [16]:
y_ = melgan_1024(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [17]:
y_ = melgan(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [18]:
y_ = melgan_1024(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [19]:
string = 'kau ni apehal bodoh? nak gaduh ke siaaaal'

In [20]:
%%time

r_female_v2 = female_v2.predict(string)

CPU times: user 179 ms, sys: 31.3 ms, total: 211 ms
Wall time: 52.3 ms


In [21]:
%%time

r_haqkiem = haqkiem.predict(string)

CPU times: user 345 ms, sys: 52.8 ms, total: 397 ms
Wall time: 81.7 ms


In [22]:
y_ = melgan(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [23]:
y_ = melgan_1024(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [24]:
y_ = melgan(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [25]:
y_ = melgan_1024(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)