# Universal MelGAN

Synthesize a waveform from a mel-spectrogram. These models are able to synthesize multiple speakers.

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/universal-melgan](https://github.com/huseinzol05/malaya-speech/tree/master/example/universal-melgan).
    
</div>

<div class="alert alert-info">

This module is language independent, so it is safe to use on different languages.
    
</div>

### Vocoder description

1. Only accepts mel features of size 80.
2. Generates waveforms at a 22050 Hz sample rate.
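
The two constraints above can be checked before calling a vocoder. The sketch below uses hypothetical helpers that are not part of `malaya_speech`. It assumes the mel-spectrogram is laid out as `(frames, 80)`, which is an assumption about the array layout rather than something stated here.

```python
import numpy as np

N_MELS = 80          # mel feature size the vocoders accept (from the list above)
SAMPLE_RATE = 22050  # sample rate of the generated waveform (from the list above)


def check_mel(mel):
    """Hypothetical helper: verify a mel-spectrogram is shaped (frames, 80)."""
    mel = np.asarray(mel, dtype=np.float32)
    if mel.ndim != 2 or mel.shape[1] != N_MELS:
        raise ValueError(f'expected shape (frames, {N_MELS}), got {mel.shape}')
    return mel


def duration_seconds(waveform):
    """Duration in seconds of a generated waveform at 22050 Hz."""
    return len(waveform) / SAMPLE_RATE
```

Calling `check_mel` on a transposed array, for example one shaped `(80, frames)`, raises a `ValueError` instead of passing a malformed input to the model.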

### Explanation

If you use the MelGAN vocoders from https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html, each speaker needs its own MelGAN vocoder.

Universal MelGAN, https://arxiv.org/abs/2011.09631, solves this problem: a single model is able to synthesize a waveform from any mel-spectrogram.

In [1]:
import malaya_speech
import numpy as np

### List available MelGAN

In [2]:
malaya_speech.vocoder.available_melgan()

| Model | Size (MB) | Quantized Size (MB) | Mel loss |
|---|---|---|---|
| male | 17.3 | 4.53 | 0.4443 |
| female | 17.3 | 4.53 | 0.4434 |
| husein | 17.3 | 4.53 | 0.4442 |
| haqkiem | 17.3 | 4.53 | 0.4819 |
| universal | 309.0 | 77.5 | 0.4463 |
| universal-1024 | 78.4 | 19.9 | 0.4591 |


`universal` uses the original parameters from the paper, while `universal-1024` is a smaller variant that uses 1024 filters.
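
The reported sizes can be compared with a little arithmetic. The sketch below only reuses the numbers from the `available_melgan()` output; the variable names are illustrative.

```python
# Sizes in MB copied from the available_melgan() output: (full size, quantized size).
sizes = {
    'male': (17.3, 4.53),
    'female': (17.3, 4.53),
    'husein': (17.3, 4.53),
    'haqkiem': (17.3, 4.53),
    'universal': (309.0, 77.5),
    'universal-1024': (78.4, 19.9),
}

# How much smaller each quantized model is than its full-size counterpart.
quantization_ratio = {
    name: round(full / quantized, 2) for name, (full, quantized) in sizes.items()
}

# How much smaller `universal-1024` is than `universal` at full size.
universal_ratio = round(sizes['universal'][0] / sizes['universal-1024'][0], 2)

print(quantization_ratio)
print(universal_ratio)
```

Every quantized model comes out a little under 4 times smaller than its full-size version, and `universal-1024` is also a little under 4 times smaller than `universal`.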

### Load MelGAN model

```python
def melgan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load MelGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - MelGAN trained on female voice.
        * ``'male'`` - MelGAN trained on male voice.
        * ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
        * ``'universal'`` - Universal MelGAN trained on multiple speakers.
        * ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.
        
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
```

In [3]:
melgan = malaya_speech.vocoder.melgan(model = 'universal')
quantized_melgan = malaya_speech.vocoder.melgan(model = 'universal', quantized = True)



In [4]:
melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024')
quantized_melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024', quantized = True)



### Load some examples

We used specific STFT parameters and steps to convert waveforms to mel-spectrograms during training; inputs prepared any other way will not work with these universal MelGAN models. Our steps:

1. Convert the waveform into a mel-spectrogram.
2. Take log10 of that mel-spectrogram.
3. Normalize using the global mean and std.

The models should also be trainable without global normalization.
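
Steps 2 and 3 can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the library's implementation: `log_and_normalize`, the `eps` floor, and the statistics are hypothetical, and the toy statistics are computed from random data rather than from the training set.

```python
import numpy as np


def log_and_normalize(mel, mean, std, eps=1e-10):
    """Hypothetical sketch of steps 2 and 3: log10, then global normalization.

    `mean` and `std` may be scalars or per-bin arrays that broadcast over `mel`.
    """
    log_mel = np.log10(np.maximum(mel, eps))  # floor the input to avoid log10(0)
    return (log_mel - mean) / std


# Toy input standing in for a real (frames, 80) mel-spectrogram.
rng = np.random.default_rng(0)
toy_mel = rng.uniform(1e-4, 1.0, size=(5, 80))

# Placeholder statistics taken from the toy data itself.
toy_log = np.log10(toy_mel)
normalized = log_and_normalize(toy_mel, toy_log.mean(), toy_log.std())
```

Because the vocoders were trained on features normalized with one fixed set of statistics, the same statistics have to be applied at inference time.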

To reuse the same steps, use the `malaya_speech.featurization.universal_mel` function.

In [5]:
y, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav', sr = 22050)
mel = malaya_speech.featurization.universal_mel(y)

In [6]:
import IPython.display as ipd

ipd.Audio(y, rate = 22050)

In [7]:
%%time

y_ = melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 23.6 s, sys: 3.06 s, total: 26.7 s
Wall time: 6.58 s


In [8]:
%%time

y_ = quantized_melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 23.1 s, sys: 2.61 s, total: 25.7 s
Wall time: 5.56 s


In [9]:
%%time

y_ = melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 6.84 s, sys: 952 ms, total: 7.79 s
Wall time: 2.05 s


In [10]:
%%time

y_ = quantized_melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 6.7 s, sys: 972 ms, total: 7.68 s
Wall time: 1.88 s
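
A single `%%time` run is noisy, and, as the docstring notes, whether a quantized model is faster depends on the machine. The following is a minimal sketch of a repeatable comparison. `best_wall_time` is a hypothetical helper, and the stand-in workload would be replaced by a call such as `melgan.predict` with `[mel]`.

```python
import time


def best_wall_time(fn, *args, repeat=3):
    """Run fn(*args) `repeat` times and return the fastest wall-clock time in seconds."""
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the sketch runs without any model loaded.
elapsed = best_wall_time(sum, range(100_000))
print(f'{elapsed:.4f} s')
```

Taking the fastest of several runs reduces the influence of one-off costs such as the first call to a freshly loaded model.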


In [11]:
# try English audio
y, sr = malaya_speech.load('speech/44k/test-2.wav', sr = 22050)
y = y[:sr * 4]
mel = malaya_speech.featurization.universal_mel(y)
ipd.Audio(y, rate = 22050)

In [12]:
%%time

y_ = melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 24.3 s, sys: 2.42 s, total: 26.8 s
Wall time: 4.62 s


In [13]:
%%time

y_ = melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)

CPU times: user 6.57 s, sys: 798 ms, total: 7.37 s
Wall time: 1.41 s
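
Since the English clip was trimmed to `sr * 4` samples, the wall times above can be turned into a rough real-time factor (processing time divided by audio duration). The sketch assumes the source file is at least 4 seconds long, and it treats the reported wall time, which also includes building the audio widget, as synthesis time.

```python
clip_seconds = 4.0  # y was trimmed to sr * 4 samples

# Wall times in seconds reported for the English clip.
wall_seconds = {'universal': 4.62, 'universal-1024': 1.41}

# Real-time factor: values below 1 mean faster than real time.
real_time_factor = {
    name: seconds / clip_seconds for name, seconds in wall_seconds.items()
}

for name, rtf in real_time_factor.items():
    print(f'{name}: {rtf:.2f}')
```

On this machine `universal` runs slightly slower than real time, while `universal-1024` needs roughly a third of the clip's duration.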


### Combine with FastSpeech2 TTS

In [15]:
female_v2 = malaya_speech.tts.fastspeech2(model = 'female-v2')
haqkiem = malaya_speech.tts.fastspeech2(model = 'haqkiem')

In [16]:
string = 'husein busuk masam ketiak pun masam tapi nasib baik comel'

In [17]:
%%time

r_female_v2 = female_v2.predict(string)

CPU times: user 682 ms, sys: 198 ms, total: 881 ms
Wall time: 652 ms


In [18]:
%%time

r_haqkiem = haqkiem.predict(string)

CPU times: user 1.2 s, sys: 358 ms, total: 1.56 s
Wall time: 1.19 s


In [19]:
y_ = melgan(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [20]:
y_ = melgan_1024(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [21]:
y_ = melgan(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [22]:
y_ = melgan_1024(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [23]:
string = 'kau ni apehal bodoh? nak gaduh ke siaaaal'

In [24]:
%%time

r_female_v2 = female_v2.predict(string)

CPU times: user 195 ms, sys: 34.3 ms, total: 229 ms
Wall time: 69.3 ms


In [25]:
%%time

r_haqkiem = haqkiem.predict(string)

CPU times: user 324 ms, sys: 54.3 ms, total: 378 ms
Wall time: 80.4 ms


In [26]:
y_ = melgan(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [27]:
y_ = melgan_1024(r_female_v2['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [28]:
y_ = melgan(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)

In [29]:
y_ = melgan_1024(r_haqkiem['postnet-output'])
ipd.Audio(y_, rate = 22050)
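
To keep a synthesized waveform instead of only playing it in the notebook, it can be written to a WAV file. The sketch below uses only NumPy and the standard-library `wave` module. `save_wav` is a hypothetical helper, and it assumes the vocoder output is a float array in the range [-1, 1], which is not stated here.

```python
import wave

import numpy as np


def save_wav(path, waveform, sample_rate=22050):
    """Hypothetical helper: write a float waveform in [-1, 1] as 16-bit mono PCM."""
    clipped = np.clip(np.asarray(waveform, dtype=np.float32), -1.0, 1.0)
    pcm = (clipped * 32767).astype('<i2')  # little-endian signed 16-bit samples
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# Toy 1-second sine wave standing in for a vocoder output such as `y_`.
t = np.arange(22050) / 22050
save_wav('universal-melgan-demo.wav', 0.5 * np.sin(2 * np.pi * 440 * t))
```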