# Voice Conversion

Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/voice-conversion](https://github.com/huseinzol05/malaya-speech/tree/master/example/voice-conversion).
    
</div>

<div class="alert alert-info">

This module is language independent, so it is safe to use on different languages.
    
</div>

### Explanation

We created a fast Voice Conversion model called FastVC (Faster and Accurate Voice Conversion using Transformer). No paper has been published for it.

The steps to reproduce it are at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/voice-conversion

In [1]:
import malaya_speech
import numpy as np

### List available Voice Conversion

In [2]:
malaya_speech.voice_conversion.available_deep_conversion()

| Model | Size (MB) | Quantized Size (MB) | Total loss |
|---|---|---|---|
| fastvc-32 | 190.0 | 54.1 | 0.2918 |
| fastvc-64 | 194.0 | 55.7 | 0.2764 |
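
As a quick arithmetic check on the table above, the sketch below computes how much smaller each quantized file is than its full-precision counterpart. The sizes are copied from the table; the helper name `compression_ratio` is made up for this illustration and is not part of malaya-speech.

```python
# Sizes in MB copied from the table above: (full precision, quantized).
sizes_mb = {
    'fastvc-32': (190.0, 54.1),
    'fastvc-64': (194.0, 55.7),
}


def compression_ratio(full_mb, quantized_mb):
    """Return how many times smaller the quantized file is (hypothetical helper)."""
    return full_mb / quantized_mb


ratios = {name: compression_ratio(*pair) for name, pair in sizes_mb.items()}
for name, ratio in ratios.items():
    print(f'{name}: {ratio:.2f}x smaller when quantized')
```

Both models shrink by roughly the same factor, so the choice between them is mostly about the loss column, not download size.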


### Load Deep Conversion

```python
def deep_conversion(
    model: str = 'fastvc-32', quantized: bool = False, **kwargs
):
    """
    Load Voice Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastvc-32')
        Model architecture supported. Allowed values:

        * ``'fastvc-32'`` - FastVC with bottleneck size 32.
        * ``'fastvc-64'`` - FastVC with bottleneck size 64.
        
    quantized : bool, optional (default=False)
        If True, load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.voice_conversion.load function
    """
```

In [3]:
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32')
quantized_model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32', quantized = True)








### Predict

```python
def predict(self, original_audio, target_audio):
    """
    Change original voice audio to follow targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame

    Returns
    -------
    result: Dict[decoder-output, postnet-output]
    """
```

**`original_audio` and `target_audio` must be sampled at 22050 Hz.**
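
Before loading a file, it can help to know its native sample rate. The following is a minimal sketch using only the standard-library `wave` module, so it covers WAV files only (not FLAC). The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of malaya-speech, and the demo file is generated on the fly, not taken from the tutorial's data.

```python
import os
import tempfile
import wave

REQUIRED_SR = 22050  # sample rate required by this tutorial


def wav_sample_rate(path):
    """Read the sample rate from a WAV header without decoding the audio."""
    with wave.open(path, 'rb') as f:
        return f.getframerate()


def needs_resampling(path, required_sr=REQUIRED_SR):
    """Return True if the file's native rate differs from the required one."""
    return wav_sample_rate(path) != required_sr


# Demo: write a short silent 16-bit mono WAV at 44100 Hz.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)                  # 2 bytes per sample = 16-bit PCM
    f.setframerate(44100)
    f.writeframes(b'\x00\x00' * 4410)  # 0.1 s of silence

native_sr = wav_sample_rate(demo_path)
print(native_sr, needs_resampling(demo_path))
```

In this tutorial the target rate is passed as `sr = sr` to `malaya_speech.load`, so a check like this is only a sanity test for files loaded some other way.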

In [25]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]

In [16]:
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)

In [17]:
ipd.Audio(target_audio[:sr * 2], rate = sr)

In [11]:
%%time
r = model.predict(original_audio, target_audio)
r

CPU times: user 8.16 s, sys: 1.12 s, total: 9.28 s
Wall time: 1.71 s


{'decoder-output': array([[ 0.36787206,  0.39518872,  0.27016976, ...,  1.928651  ,
          1.9636147 ,  1.9293905 ],
        [ 0.30282757,  0.3588042 ,  0.2197719 , ...,  2.1164858 ,
          2.1588402 ,  2.119205  ],
        [ 0.16019775,  0.2600227 ,  0.13983072, ...,  2.2608795 ,
          2.3051713 ,  2.2935262 ],
        ...,
        [-0.6276471 , -0.47405264, -0.3977524 , ..., -1.4190919 ,
         -1.4347526 , -1.4413854 ],
        [-0.69217056, -0.5743884 , -0.5522437 , ..., -1.5944515 ,
         -1.6032431 , -1.5926315 ],
        [-0.7784737 , -0.6400775 , -0.63156724, ..., -1.7398036 ,
         -1.7352657 , -1.7042713 ]], dtype=float32),
 'postnet-output': array([[ 0.36787206,  0.39518872,  0.27016976, ...,  1.928651  ,
          1.9636147 ,  1.9293905 ],
        [ 0.30282757,  0.3588042 ,  0.2197719 , ...,  2.1164858 ,
          2.1588402 ,  2.119205  ],
        [ 0.16019775,  0.2600227 ,  0.13983072, ...,  2.2608795 ,
          2.3051713 ,  2.2935262 ],
        ...,
   

In [22]:
%%time
quantized_r = quantized_model.predict(original_audio, target_audio)
quantized_r

CPU times: user 8.03 s, sys: 1.06 s, total: 9.09 s
Wall time: 1.84 s


{'decoder-output': array([[ 0.35794327,  0.39413252,  0.26602975, ...,  1.9586537 ,
          1.9828256 ,  1.9081111 ],
        [ 0.3567759 ,  0.40634462,  0.22993924, ...,  2.1293466 ,
          2.1610477 ,  2.0889947 ],
        [ 0.21532963,  0.37776726,  0.2349888 , ...,  2.2313044 ,
          2.2567654 ,  2.2152293 ],
        ...,
        [-0.59956986, -0.47636947, -0.4257096 , ..., -1.2851118 ,
         -1.2892275 , -1.3192046 ],
        [-0.650754  , -0.576388  , -0.56404495, ..., -1.5505072 ,
         -1.5410377 , -1.5254403 ],
        [-0.76077545, -0.66785324, -0.68400216, ..., -1.7248758 ,
         -1.7060162 , -1.6601568 ]], dtype=float32),
 'postnet-output': array([[ 0.35794327,  0.39413252,  0.26602975, ...,  1.9586537 ,
          1.9828256 ,  1.9081111 ],
        [ 0.3567759 ,  0.40634462,  0.22993924, ...,  2.1293466 ,
          2.1610477 ,  2.0889947 ],
        [ 0.21532963,  0.37776726,  0.2349888 , ...,  2.2313044 ,
          2.2567654 ,  2.2152293 ],
        ...,
   

### Voice Conversion output

1. The model returns mel features of size 80.
2. These mel features can only be synthesized with a universal vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
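
A small sanity check on the output shape can catch mistakes before the mel is handed to a vocoder. This sketch assumes the array is laid out as (frames, 80), which is an assumption and not something stated above; `check_mel` is a hypothetical helper, and a zero array stands in for a real `r['postnet-output']`.

```python
import numpy as np

N_MELS = 80  # mel feature size stated above


def check_mel(mel, n_mels=N_MELS):
    """Validate an assumed (frames, n_mels) array and return its frame count."""
    mel = np.asarray(mel)
    if mel.ndim != 2:
        raise ValueError(f'expected a 2-D array, got {mel.ndim} dimensions')
    if mel.shape[1] != n_mels:
        raise ValueError(f'expected {n_mels} mel bins, got {mel.shape[1]}')
    return mel.shape[0]


# Stand-in for r['postnet-output']; real values come from model.predict.
fake_mel = np.zeros((120, N_MELS), dtype=np.float32)
n_frames = check_mel(fake_mel)
print(n_frames)
```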

### Load Universal MelGAN

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

In [13]:
quantized_melgan = malaya_speech.vocoder.melgan(model = 'universal-1024', quantized = True)



In [20]:
%%time

y_ = quantized_melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 14.2 s, sys: 1.88 s, total: 16.1 s
Wall time: 2.73 s


In [23]:
%%time

y_ = quantized_melgan.predict([quantized_r['postnet-output']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 14.5 s, sys: 1.99 s, total: 16.5 s
Wall time: 2.79 s


Pretty good!
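
Listening is the real test, but the gap between the full-precision and quantized outputs can also be put into a number. The sketch below is one possible way to do that with a mean absolute difference; `mean_abs_diff` is a hypothetical helper, and the sample values are the first three entries of the first row printed for each model above, not the full arrays.

```python
import numpy as np


def mean_abs_diff(a, b):
    """Mean absolute difference over the frames both arrays share."""
    a, b = np.asarray(a), np.asarray(b)
    n = min(len(a), len(b))  # guard against a differing frame count
    return float(np.mean(np.abs(a[:n] - b[:n])))


# First three values of the first row printed above for each model.
full = np.array([[0.36787206, 0.39518872, 0.27016976]], dtype=np.float32)
quant = np.array([[0.35794327, 0.39413252, 0.26602975]], dtype=np.float32)

diff = mean_abs_diff(full, quant)
print(diff)
```

On the real outputs the call would be `mean_abs_diff(r['postnet-output'], quantized_r['postnet-output'])`.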

### More examples

This time the original voice is in English, and the target voices are in Malay and English.

In [26]:
original_audio = malaya_speech.load('speech/44k/test-2.wav', sr = sr)[0]
ipd.Audio(original_audio, rate = sr)

In [29]:
target_audio = malaya_speech.load('speech/vctk/p300_298_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)

In [28]:
y_ = quantized_melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)

In [30]:
target_audio = malaya_speech.load('speech/vctk/p323_158_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)

In [31]:
y_ = quantized_melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)

In [34]:
target_audio = malaya_speech.load('speech/vctk/p360_292_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)

In [35]:
y_ = quantized_melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)

In [36]:
target_audio = malaya_speech.load('speech/vctk/p361_077_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)

In [37]:
y_ = quantized_melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)

In [38]:
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)

In [39]:
y_ = quantized_melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)

In [51]:
target_audio = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav', sr = sr)[0]
ipd.Audio(target_audio, rate = sr)

If the audio is low quality, you can clean it up with speech enhancement, https://malaya-speech.readthedocs.io/en/latest/load-speech-enhancement.html

In [52]:
enhancer = malaya_speech.speech_enhancement.deep_enhance(model = 'unet-enhance-24')

In [53]:
logits = enhancer.predict(target_audio)
ipd.Audio(logits, rate = sr)

In [54]:
r = model.predict(original_audio, logits)

In [55]:
y_ = quantized_melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)