### Introduction

OpenVoice is a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. OpenVoice also achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set.

This notebook provides an example of converting the original OpenVoice model (https://github.com/myshell-ai/OpenVoice) to the OpenVINO IR format for faster inference.

Clone the repository and install all dependencies.

In [1]:
from pathlib import Path

repo_dir = Path("OpenVoice")

if not repo_dir.exists():
    !git clone https://github.com/myshell-ai/OpenVoice

# cd into the cloned repo so that its relative data paths and imports keep working
%cd $repo_dir

/home/epavel/devel/openvino_notebooks/notebooks/280-openvoice/OpenVoice


In [2]:
%pip install "openvino>=2023.3" \
"librosa>=0.9.1" \
"wavmark>=0.0.3" \
"faster-whisper>=0.9.0" \
"pydub>=0.25.1" \
"whisper-timestamped>=1.14.2" \
"tqdm" \
"inflect>=7.0.0" \
"unidecode>=1.3.7" \
"eng_to_ipa>=0.0.2" \
"pypinyin>=0.50.0" \
"cn2an>=0.5.22" \
"jieba>=0.42.1" \
"langid>=1.1.6" \
"gradio>=4.15" \
"ipywebrtc"

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import torch
import openvino as ov
core = ov.Core()

from api import BaseSpeakerTTS, ToneColorConverter, OpenVoiceBaseClass
import se_extractor
from notebook_utils import download_file

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.



Download all resources from the Hugging Face Hub.

In [4]:
base_url = 'https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/'

CKPT_BASE_PATH = '../checkpoints/'
en_ckpt_path = f'{CKPT_BASE_PATH}/base_speakers/EN/'
zh_ckpt_path = f'{CKPT_BASE_PATH}/base_speakers/ZH/'
converter_path = f'{CKPT_BASE_PATH}/converter/'

enable_chineese_lang = True

In [5]:
download_file(base_url + 'converter/checkpoint.pth', directory=converter_path)
download_file(base_url + 'converter/config.json', directory=converter_path)

download_file(base_url + 'base_speakers/EN/checkpoint.pth', directory=en_ckpt_path)
download_file(base_url + 'base_speakers/EN/config.json', directory=en_ckpt_path)

if enable_chineese_lang:
    download_file(base_url + 'base_speakers/ZH/checkpoint.pth', directory=zh_ckpt_path)
    download_file(base_url + 'base_speakers/ZH/config.json', directory=zh_ckpt_path)

download_file(base_url + 'base_speakers/EN/en_default_se.pth', directory=en_ckpt_path)
download_file(base_url + 'base_speakers/EN/en_style_se.pth', directory=en_ckpt_path)
download_file(base_url + 'base_speakers/ZH/zh_default_se.pth', directory=zh_ckpt_path)

'../checkpoints/converter/checkpoint.pth' already exists.
'../checkpoints/converter/config.json' already exists.
'../checkpoints/base_speakers/EN/checkpoint.pth' already exists.
'../checkpoints/base_speakers/EN/config.json' already exists.
'../checkpoints/base_speakers/ZH/checkpoint.pth' already exists.
'../checkpoints/base_speakers/ZH/config.json' already exists.
'../checkpoints/base_speakers/EN/en_default_se.pth' already exists.
'../checkpoints/base_speakers/EN/en_style_se.pth' already exists.
'../checkpoints/base_speakers/ZH/zh_default_se.pth' already exists.


PosixPath('/home/epavel/devel/openvino_notebooks/notebooks/280-openvoice/checkpoints/base_speakers/ZH/zh_default_se.pth')
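
The download cell issues one call per file. A more compact variant, sketched here on the assumption that every file sits under the same `checkpoints/` prefix, first builds a list of (URL, target directory) pairs. `build_manifest` is a hypothetical helper, not part of the notebook utilities.

```python
from pathlib import PurePosixPath

BASE_URL = "https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/"


def build_manifest(base_url, ckpt_root, enable_zh=True):
    """Return (url, target_directory) pairs for every checkpoint file."""
    files = [
        "converter/checkpoint.pth",
        "converter/config.json",
        "base_speakers/EN/checkpoint.pth",
        "base_speakers/EN/config.json",
        "base_speakers/EN/en_default_se.pth",
        "base_speakers/EN/en_style_se.pth",
        # Loaded unconditionally later in the notebook, so always fetched.
        "base_speakers/ZH/zh_default_se.pth",
    ]
    if enable_zh:
        files += [
            "base_speakers/ZH/checkpoint.pth",
            "base_speakers/ZH/config.json",
        ]
    root = PurePosixPath(ckpt_root)
    # Joining path objects avoids the doubled slashes that string
    # concatenation produces when ckpt_root ends with "/".
    return [(base_url + rel, str(root / PurePosixPath(rel).parent)) for rel in files]


manifest = build_manifest(BASE_URL, "../checkpoints/")
```

Each pair can then be passed to the notebook's own helper as `download_file(url, directory=directory)`.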

In [6]:
Path(f'{zh_ckpt_path}/config.json').absolute()

PosixPath('/home/epavel/devel/openvino_notebooks/notebooks/280-openvoice/OpenVoice/../checkpoints/base_speakers/ZH/config.json')

In [7]:
pt_device = "cpu"

en_base_speaker_tts = BaseSpeakerTTS(f'{en_ckpt_path}/config.json', device=pt_device)
en_base_speaker_tts.load_ckpt(f'{en_ckpt_path}/checkpoint.pth')

if enable_chineese_lang:
    zh_base_speaker_tts = BaseSpeakerTTS(f'{zh_ckpt_path}/config.json', device=pt_device)
    zh_base_speaker_tts.load_ckpt(f'{zh_ckpt_path}/checkpoint.pth')

tone_color_converter = ToneColorConverter(f'{converter_path}/config.json', device=pt_device)
tone_color_converter.load_ckpt(f'{converter_path}/checkpoint.pth')

Loaded checkpoint '../checkpoints//base_speakers/EN//checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint '../checkpoints//base_speakers/ZH//checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint '../checkpoints//converter//checkpoint.pth'
missing/unexpected keys: [] []


### Convert models to OpenVINO

To convert a model to the OpenVINO IR format, we first need a suitable PyTorch torch.nn.Module object.

Neither BaseSpeakerTTS nor ToneColorConverter uses self.forward as its main entry point; their underlying models expose the custom methods infer and voice_conversion respectively. We therefore wrap each of them in a custom class that inherits from torch.nn.Module and calls the custom method from forward.

<!---
# One more reason to make a wrapper is also that these functions use float arguments while only torch.Tensor and tuple of torch.Tensors are acceptable 
# todo: check if it works when kwargs are moved to example inputs.
-->

In [8]:
tts_kwargs = dict(noise_scale = 0.667, noise_scale_w = 0.6, speed = 1.0, sdp_ratio = 0.2)
voice_convert_kwargs = dict(tau=0.3)

class OVOpenVoiceBase(torch.nn.Module):
    def __init__(self, voice_model: OpenVoiceBaseClass, **kwargs):
        super().__init__()
        self.voice_model = voice_model
        self.default_kwargs = kwargs
        for par in voice_model.model.parameters():
            par.requires_grad = False
    
class OVOpenVoiceTTS(OVOpenVoiceBase):
    def get_example_input(self):
        stn_tst = self.voice_model.get_text('this is original text', self.voice_model.hps, False)
        x_tst = stn_tst.unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
        speaker_id = torch.LongTensor([1])
        return (x_tst, x_tst_lengths, speaker_id)

    def forward(self, x, x_lengths, sid):
        return self.voice_model.model.infer(x, x_lengths, sid, **self.default_kwargs)
    
class OVOpenVoiceConverter(OVOpenVoiceBase):
    def get_example_input(self):
        y = torch.randn([1, 513, 238], dtype=torch.float32)
        y_lengths = torch.LongTensor([y.size(-1)])
        target_se = torch.randn(*(1, 256, 1))
        source_se = torch.randn(*(1, 256, 1))
        return (y, y_lengths, source_se, target_se)
    
    def forward(self, y, y_lengths, sid_src, sid_tgt):
        return self.voice_model.model.voice_conversion(y, y_lengths, sid_src, sid_tgt, **self.default_kwargs)
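
The adapter idea used by these wrappers can be shown in isolation, without torch. In this minimal sketch a class exposes `forward` and delegates to a differently named method, with the scalar keyword arguments fixed at construction time. `ToyVoiceModel` and `ForwardAdapter` are hypothetical names used only for illustration; in the notebook the equivalent classes derive from torch.nn.Module.

```python
class ToyVoiceModel:
    """Stand-in for a model whose entry point is `infer`, not `forward`."""

    def infer(self, x, scale=1.0):
        return [v * scale for v in x]


class ForwardAdapter:
    """Expose `forward` by delegating to a named method with fixed kwargs."""

    def __init__(self, model, method_name, **fixed_kwargs):
        self._method = getattr(model, method_name)
        # Scalars stored here play the role of constants frozen into the graph.
        self._fixed_kwargs = fixed_kwargs

    def forward(self, *inputs):
        # Only the positional (tensor-like) inputs vary between calls.
        return self._method(*inputs, **self._fixed_kwargs)


adapter = ForwardAdapter(ToyVoiceModel(), "infer", scale=0.5)
result = adapter.forward([2.0, 4.0])
```

Because the keyword arguments are bound once, a different `scale` requires building a new adapter. The same holds for a converted model: changing a frozen value means converting again.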

Convert the models to OpenVINO IR and save them to the IRS_PATH folder for future use. If the IR files already exist, skip the conversion and read them directly.

In [9]:
IRS_PATH = '../openvino_irs/'
EN_TTS_IR = f'{IRS_PATH}/openvoice_en_tts.xml'
ZH_TTS_IR = f'{IRS_PATH}/openvoice_zh_tts.xml'

VOICE_CONVERTER_IR = f'{IRS_PATH}/openvoice_tone_conversion.xml'

paths = [EN_TTS_IR, VOICE_CONVERTER_IR]
models = [OVOpenVoiceTTS(en_base_speaker_tts), OVOpenVoiceConverter(tone_color_converter)]
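if enable_chineese_lang:
    models.append(OVOpenVoiceTTS(zh_base_speaker_tts))
    # keep paths aligned with models; otherwise zip() silently drops the ZH model
    paths.append(ZH_TTS_IR)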

ov_models = []
for model, path in zip(models, paths):
    if not os.path.exists(path):
        ov_model = ov.convert_model(model, example_input=model.get_example_input())
        ov.save_model(ov_model, path)
    else:
        ov_model = core.read_model(path)
    ov_models.append(ov_model)

ov_en_tts, ov_voice_conversion = ov_models[:2]
if enable_chineese_lang:
    ov_zh_tts = ov_models[-1]
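
The convert-or-read logic of this cell can be factored into a small reusable helper. The sketch below is an illustration under assumptions, not the notebook's code: `load_or_build` is a hypothetical name, and plain text files stand in for IR files so that the example runs without OpenVINO.

```python
import tempfile
from pathlib import Path


def load_or_build(path, build, save, load):
    """Return load(path) if the file exists; otherwise build, save, and return."""
    path = Path(path)
    if path.exists():
        return load(path)
    artifact = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    save(artifact, path)
    return artifact


calls = []


def build():
    # Records every invocation so the caching effect is observable.
    calls.append("build")
    return "model-bytes"


def save(artifact, path):
    path.write_text(artifact)


def load(path):
    return path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "irs" / "toy.xml"
    first = load_or_build(target, build, save, load)   # builds and saves
    second = load_or_build(target, build, save, load)  # reads the cached file
```

In the notebook's setting, `build` would wrap `ov.convert_model(model, example_input=model.get_example_input())`, `save` would be `ov.save_model`, and `load` would be `core.read_model`.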

In [10]:
import ipywidgets as widgets

core = ov.Core()
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='GPU',
    description='Device:',
    disabled=False,
)
device

Dropdown(description='Device:', index=1, options=('CPU', 'GPU', 'AUTO'), value='GPU')
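
The cell above hard-codes `value='GPU'`. A dropdown generally rejects a value that is not among its options, so on a machine without a GPU the cell would likely fail. A minimal sketch of a safer default follows; `pick_device` is a hypothetical helper.

```python
def pick_device(available, preferred="GPU", fallback="AUTO"):
    """Return `preferred` if it is offered, otherwise `fallback`."""
    return preferred if preferred in available else fallback


cpu_only_default = pick_device(["CPU"])
gpu_default = pick_device(["CPU", "GPU"])
```

The result could be passed to the widget as `value=pick_device(core.available_devices)`.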

First, select the reference voice whose tone color the generated speech will be converted to. You can choose one of the existing samples or record your own by selecting 'record_manually'.

In [11]:
reference_speakers = [
    'resources/example_reference.mp3',
    'resources/demo_speaker0.mp3',
    'resources/demo_speaker1.mp3',
    'resources/demo_speaker2.mp3',
    'record_manually',
]

ref_speaker = widgets.Dropdown(
    options=reference_speakers,
    value=reference_speakers[0],
    description="reference voice from which tone color will be copied",
    disabled=False,
)

display(ref_speaker)

Dropdown(description='reference voice from which tone color will be copied', options=('resources/example_refer…

In [12]:
output_dir = '../outputs/'
os.makedirs(output_dir, exist_ok=True)

In [13]:
ref_speaker_path = ref_speaker.value

if ref_speaker.value == 'record_manually':
    ref_speaker_path = f'{output_dir}/custom_example_sample.webm'
    from ipywebrtc import AudioRecorder, CameraStream
    camera = CameraStream(constraints={'audio': True,'video':False})
    recorder = AudioRecorder(stream=camera, filename=ref_speaker_path, autosave=True)
    display(recorder)

In [14]:
from IPython.display import Audio
Audio(ref_speaker_path)

In [15]:
# load speaker embeddings
en_source_default_se = torch.load(f'{en_ckpt_path}/en_default_se.pth')
en_source_style_se = torch.load(f'{en_ckpt_path}/en_style_se.pth')
zh_source_se = torch.load(f'{zh_ckpt_path}/zh_default_se.pth')

# ffmpeg must be installed for the extractor to decode the reference audio
target_se, audio_name = se_extractor.get_se(ref_speaker_path, tone_color_converter, target_dir='processed', vad=True)
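
Since the embedding extraction depends on ffmpeg being installed, it can help to check for the executable up front and fail with a clear message. The following is a minimal sketch; `require_tool` is a hypothetical helper built on the standard library's `shutil.which`.

```python
import shutil


def require_tool(name):
    """Return the full path of an executable, or raise if it is not on PATH."""
    location = shutil.which(name)
    if location is None:
        raise RuntimeError(
            f"'{name}' was not found on PATH; install it before extracting speaker embeddings."
        )
    return location
```

Calling `require_tool("ffmpeg")` before `se_extractor.get_se(...)` surfaces a missing installation immediately instead of partway through audio decoding.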



### Inference

Some pre- and post-processing steps are not traceable and cannot be offloaded to OpenVINO. Instead of reimplementing them, we rely on the existing implementations. We only replace the infer and voice_conversion functions of the underlying models, so that the most computationally expensive part runs in OpenVINO.

In [16]:
def get_pathched_infer(ov_model: ov.Model, device_name: str = device.value) -> callable:
    compiled_model = core.compile_model(ov_model, device_name)
    
    def infer_impl(x, x_lengths, sid=None, noise_scale=1, length_scale=1, noise_scale_w=1., sdp_ratio=0.2, max_len=None):
        # todo: assert that other params match to compiled ones
        ov_output = compiled_model((x, x_lengths, sid))
        return (torch.tensor(ov_output[0]), )
    return infer_impl

def get_patched_voice_conversion(ov_model: ov.Model, device_name: str = device.value) -> callable:
    compiled_model = core.compile_model(ov_model, device_name)

    def voice_conversion_impl(y, y_lengths, sid_src, sid_tgt, tau=1.0):
        # todo: assert that tau matches to compiled ones
        ov_output = compiled_model((y, y_lengths, sid_src, sid_tgt))
        return (torch.tensor(ov_output[0]), )
    return voice_conversion_impl

en_base_speaker_tts.model.infer = get_pathched_infer(ov_en_tts)
tone_color_converter.model.voice_conversion = get_patched_voice_conversion(ov_voice_conversion)
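
The two `todo` comments above point at a real pitfall: values such as the noise scale or tau are frozen into the converted model, so arguments passed at call time are silently ignored. One possible guard is sketched below on a toy model with no OpenVINO dependency. `make_patched_infer`, `ToyModel`, and `baked_kwargs` are hypothetical names.

```python
class ToyModel:
    def infer(self, x, noise_scale=1.0):
        return [v + noise_scale for v in x]


def make_patched_infer(fast_backend, baked_kwargs):
    """Build a replacement for `infer` that forwards to `fast_backend`.

    `baked_kwargs` holds the values frozen into the backend at conversion
    time. A call that requests a different value is rejected instead of
    being silently ignored; arguments that were never baked in are skipped.
    """
    def infer_impl(x, **kwargs):
        for name, value in kwargs.items():
            if name in baked_kwargs and baked_kwargs[name] != value:
                raise ValueError(
                    f"{name}={value!r} differs from baked-in {baked_kwargs[name]!r}"
                )
        return fast_backend(x)
    return infer_impl


baked = {"noise_scale": 0.5}


def backend(x):
    # Stand-in for a compiled model: the constant is already part of the graph.
    return [v + baked["noise_scale"] for v in x]


model = ToyModel()
model.infer = make_patched_infer(backend, baked)  # instance-level patch

matching = model.infer([1.0], noise_scale=0.5)
```

Surrounding code that calls `model.infer(...)` keeps working unchanged, which is why the existing pre- and post-processing can be reused as is.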

In [17]:
save_path = f'{output_dir}/output_en_default.wav'

text = """
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve 
a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, 
recommendation systems, and many others.
"""

src_path = f'{output_dir}/tmp.wav'
en_base_speaker_tts.tts(text, src_path, speaker='default', language='English', speed=1.0)
# src_path = '/home/epavel/my_base_voice.m4a'

tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=en_source_default_se, 
    tgt_se=target_se, 
    output_path=save_path, 
    message="@MyShell")

 > Text splitted to sentences.
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision,
automatic speech recognition, natural language processing, recommendation systems, and many others.
ˈoʊpən vino* toolkit* ɪz ə ˌkɑmpɹiˈhɛnsɪv toolkit* fəɹ kˈwɪkli dɪˈvɛləpɪŋ ˌæpləˈkeɪʃənz ənd səˈluʃənz ðət sɑɫv ə vəɹˈaɪəti əv tæsks ˌɪnˈkludɪŋ ˌɛmjəˈleɪʃən əv ˈjumən ˈvɪʒən,
 length:173
 length:173
ˌɔtəˈmætɪk spitʃ ˌɹɛkɪgˈnɪʃən, ˈnætʃəɹəɫ ˈlæŋgwɪdʒ ˈpɹɑsɛsɪŋ, ˌɹɛkəmənˈdeɪʃən ˈsɪstəmz, ənd ˈmɛni ˈəðəɹz.
 length:105
 length:105


In [18]:
Audio(src_path)

In [19]:
Audio(save_path)

### Run OpenVoice Gradio online app
We can also use a [Gradio](https://www.gradio.app/) app to run TTS and voice tone conversion online.

In [18]:
from openvoice_gradio import get_demo

demo = get_demo(output_dir, tone_color_converter, en_base_speaker_tts, zh_base_speaker_tts, en_source_default_se, en_source_style_se, zh_source_se)
demo.queue(max_size=2)
demo.launch(server_name="0.0.0.0", server_port=7860)



Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.




In [19]:
# run this cell to stop the Gradio interface
demo.close()

# clean up 
# import shutil
# shutil.rmtree(CKPT_BASE_PATH)
# shutil.rmtree(IRS_PATH)

Closing server running on port: 7860
