### Introduction

OpenVoice is a versatile instant voice cloning approach that transfers voice tone color and generates speech in multiple languages from just a brief audio snippet of the reference speaker. OpenVoice has three main features: (i) high-quality tone color replication across multiple languages and accents; (ii) fine-grained control over voice styles, including emotions and accents, as well as other parameters such as rhythm, pauses, and intonation; (iii) zero-shot cross-lingual voice cloning, eliminating the need for the generated speech and the reference speech to be part of a massive-speaker multilingual training dataset.

![OpenVoice scheme](openvoice_scheme.png)

More details about the model can be found on the [project web page](https://research.myshell.ai/open-voice), in the [paper](https://arxiv.org/abs/2312.01479), and in the official [repository](https://github.com/myshell-ai/OpenVoice).

This notebook shows how to convert the original [OpenVoice](https://github.com/myshell-ai/OpenVoice) model to OpenVINO IR format for faster inference, and how to run the converted model with OpenVINO.

#### Table of contents:
- [Clone repository and install requirements](#Clone-repository-and-install-requirements)
- [Download checkpoints and load PyTorch model](#Download-checkpoints-and-load-PyTorch-model)
- [Convert models to OpenVINO IR](#Convert-models-to-OpenVINO-IR)
- [Inference](#Inference)
    - [Select inference device](#Select-inference-device)
    - [Select reference tone](#Select-reference-tone)
    - [Run inference](#Run-inference)
- [Run OpenVoice Gradio online app](#Run-OpenVoice-Gradio-online-app)
- [Cleanup](#Cleanup)

## Clone repository and install requirements
[back to top ⬆️](#Table-of-contents:)

In [1]:
from pathlib import Path

repo_dir = Path("OpenVoice")

if not repo_dir.exists():
    !git clone https://github.com/myshell-ai/OpenVoice

# cd to the original repo to save original data paths and imports
%cd $repo_dir

%pip install "openvino>=2023.3" \
"librosa>=0.8.1" \
"wavmark>=0.0.3" \
"faster-whisper>=0.9.0" \
"pydub>=0.25.1" \
"whisper-timestamped>=1.14.2" \
"tqdm" \
"inflect>=7.0.0" \
"unidecode>=1.3.7" \
"eng_to_ipa>=0.0.2" \
"pypinyin>=0.50.0" \
"cn2an>=0.5.22" \
"jieba>=0.42.1" \
"langid>=1.1.6" \
"gradio>=4.15" \
"ipywebrtc"

/home/epavel/devel/openvino_notebooks/notebooks/284-openvoice/OpenVoice
Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install ffmpeg-downloader
!ffdl install -y

Note: you may need to restart the kernel to use updated packages.


## Download checkpoints and load PyTorch model
[back to top ⬆️](#Table-of-contents:)

In [3]:
import os
import torch
import openvino as ov
core = ov.Core()

from api import BaseSpeakerTTS, ToneColorConverter, OpenVoiceBaseClass
import se_extractor

# Fetch `notebook_utils` module
import urllib.request
urllib.request.urlretrieve(
    url='https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/main/notebooks/utils/notebook_utils.py',
    filename='notebook_utils.py'
)
from notebook_utils import download_file



Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.



In [4]:
base_url = 'https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/'

CKPT_BASE_PATH = '../checkpoints/'

en_suffix = 'base_speakers/EN'
zh_suffix = 'base_speakers/ZH'
converter_suffix = 'converter'
en_ckpt_path = f'{CKPT_BASE_PATH}/{en_suffix}'
zh_ckpt_path = f'{CKPT_BASE_PATH}/{zh_suffix}'
converter_path = f'{CKPT_BASE_PATH}/{converter_suffix}'
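
Because `CKPT_BASE_PATH` already ends with a slash, the f-strings above produce paths with a doubled separator (`../checkpoints//base_speakers/EN`), which is what the checkpoint-loading log further down shows. This is harmless on common filesystems, but a `pathlib`-based join avoids it. A minimal sketch reusing the same names; `fstring_path` and `en_ckpt_dir` are illustrative and not part of the notebook:

```python
from pathlib import PurePosixPath

CKPT_BASE_PATH = '../checkpoints/'
en_suffix = 'base_speakers/EN'

# f-string concatenation keeps the trailing slash of the base path
fstring_path = f'{CKPT_BASE_PATH}/{en_suffix}'

# PurePosixPath collapses the redundant separator when joining
en_ckpt_dir = PurePosixPath(CKPT_BASE_PATH) / en_suffix

print(fstring_path)   # ../checkpoints//base_speakers/EN
print(en_ckpt_dir)    # ../checkpoints/base_speakers/EN
```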

The model for Chinese speech is optional. To keep the notebook lightweight, set the `enable_chinese_lang` flag to `False` to skip downloading and converting it.

In [5]:
enable_chinese_lang = True

In [6]:
download_file(base_url + f'{converter_suffix}/checkpoint.pth', directory=converter_path)
download_file(base_url + f'{converter_suffix}/config.json', directory=converter_path)
download_file(base_url + f'{en_suffix}/checkpoint.pth', directory=en_ckpt_path)
download_file(base_url + f'{en_suffix}/config.json', directory=en_ckpt_path)

download_file(base_url + f'{en_suffix}/en_default_se.pth', directory=en_ckpt_path)
download_file(base_url + f'{en_suffix}/en_style_se.pth', directory=en_ckpt_path)

if enable_chinese_lang:
    download_file(base_url + f'{zh_suffix}/checkpoint.pth', directory=zh_ckpt_path)
    download_file(base_url + f'{zh_suffix}/config.json', directory=zh_ckpt_path)
    download_file(base_url + f'{zh_suffix}/zh_default_se.pth', directory=zh_ckpt_path)

'../checkpoints/converter/checkpoint.pth' already exists.
'../checkpoints/converter/config.json' already exists.
'../checkpoints/base_speakers/EN/checkpoint.pth' already exists.
'../checkpoints/base_speakers/EN/config.json' already exists.
'../checkpoints/base_speakers/EN/en_default_se.pth' already exists.
'../checkpoints/base_speakers/EN/en_style_se.pth' already exists.
'../checkpoints/base_speakers/ZH/checkpoint.pth' already exists.
'../checkpoints/base_speakers/ZH/config.json' already exists.
'../checkpoints/base_speakers/ZH/zh_default_se.pth' already exists.
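
The repeated `download_file` calls above follow one pattern: a file name under a checkpoint sub-folder, mirrored into a local directory with the same sub-folder. A data-driven sketch of that pattern is shown below. It only builds the `(url, directory)` pairs and downloads nothing; `files_by_suffix` and `build_download_plan` are hypothetical names introduced for illustration.

```python
base_url = 'https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/'
CKPT_BASE_PATH = '../checkpoints'

# Files needed per checkpoint folder, as listed in the cell above
files_by_suffix = {
    'converter': ['checkpoint.pth', 'config.json'],
    'base_speakers/EN': ['checkpoint.pth', 'config.json', 'en_default_se.pth', 'en_style_se.pth'],
}
enable_chinese_lang = True
if enable_chinese_lang:
    files_by_suffix['base_speakers/ZH'] = ['checkpoint.pth', 'config.json', 'zh_default_se.pth']


def build_download_plan(files_by_suffix):
    """Return (url, target_directory) pairs; hypothetical helper, nothing is downloaded."""
    return [
        (f'{base_url}{suffix}/{name}', f'{CKPT_BASE_PATH}/{suffix}')
        for suffix, names in files_by_suffix.items()
        for name in names
    ]


plan = build_download_plan(files_by_suffix)
for url, directory in plan:
    # In the notebook this would be: download_file(url, directory=directory)
    print(directory, '<-', url)
```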


In [7]:
pt_device = "cpu"  # todo: check if torch.device("cuda" if torch.cuda.is_available() else "cpu") is indeed needed

en_base_speaker_tts = BaseSpeakerTTS(f'{en_ckpt_path}/config.json', device=pt_device)
en_base_speaker_tts.load_ckpt(f'{en_ckpt_path}/checkpoint.pth')

tone_color_converter = ToneColorConverter(f'{converter_path}/config.json', device=pt_device)
tone_color_converter.load_ckpt(f'{converter_path}/checkpoint.pth')

if enable_chinese_lang:
    zh_base_speaker_tts = BaseSpeakerTTS(f'{zh_ckpt_path}/config.json', device=pt_device)
    zh_base_speaker_tts.load_ckpt(f'{zh_ckpt_path}/checkpoint.pth')



Loaded checkpoint '../checkpoints//base_speakers/EN/checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint '../checkpoints//converter/checkpoint.pth'
missing/unexpected keys: [] []




Loaded checkpoint '../checkpoints//base_speakers/ZH/checkpoint.pth'
missing/unexpected keys: [] []


## Convert models to OpenVINO IR
[back to top ⬆️](#Table-of-contents:)

To convert the models to OpenVINO IR format, we first need a suitable `torch.nn.Module` object. Instead of using `self.forward` as the main entry point, the models inside `BaseSpeakerTTS` and `ToneColorConverter` use custom `infer` and `voice_conversion` methods respectively, so we need to wrap them in a custom class that inherits from `torch.nn.Module`.

<!---
# One more reason to make a wrapper is also that these functions use float arguments while only torch.Tensor and tuple of torch.Tensors are acceptable 
# todo: check if it works when kwargs are moved to example inputs.
-->

In [8]:
# tts_kwargs = dict(noise_scale = 0.667, noise_scale_w = 0.6, speed = 1.0, sdp_ratio = 0.2)

voice_convert_kwargs = dict(tau=0.3)
tts_kwargs = dict(noise_scale=0.667, noise_scale_w=0.6, length_scale=1.0)  # length_scale = 1.0 / speed - here we set speed = 1 as default

class OVOpenVoiceBase(torch.nn.Module):
    def __init__(self, voice_model: OpenVoiceBaseClass, kwargs):
        super().__init__()
        self.voice_model = voice_model
        self.default_kwargs = kwargs
        for par in voice_model.model.parameters():
            par.requires_grad = False
    
class OVOpenVoiceTTS(OVOpenVoiceBase):
    def get_example_input(self):
        stn_tst = self.voice_model.get_text('this is original text', self.voice_model.hps, False)
        x_tst = stn_tst.unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
        speaker_id = torch.LongTensor([1])
        return (x_tst, x_tst_lengths, speaker_id)

    def forward(self, x, x_lengths, sid):
        return self.voice_model.model.infer(x, x_lengths, sid, **self.default_kwargs)
    
class OVOpenVoiceConverter(OVOpenVoiceBase):
    def get_example_input(self):
        y = torch.randn([1, 513, 238], dtype=torch.float32)
        y_lengths = torch.LongTensor([y.size(-1)])
        target_se = torch.randn(*(1, 256, 1))
        source_se = torch.randn(*(1, 256, 1))
        return (y, y_lengths, source_se, target_se)
    
    def forward(self, y, y_lengths, sid_src, sid_tgt):
        return self.voice_model.model.voice_conversion(y, y_lengths, sid_src, sid_tgt, **self.default_kwargs)
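
The `length_scale` comment above encodes a simple relation: the duration multiplier is the reciprocal of the requested speed. Because the value is passed as a plain Python float at conversion time, it is presumably stored in the traced graph as a constant, which would explain why the notebook later checks that inference is called with the same values. A minimal sketch of the arithmetic; `length_scale_from_speed` is a hypothetical helper, not part of OpenVoice:

```python
def length_scale_from_speed(speed: float) -> float:
    """Convert a speaking-speed factor into the length_scale used above."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1.0 / speed


# speed = 1.0 reproduces the default used for conversion
print(length_scale_from_speed(1.0))   # 1.0
# faster speech means shorter durations
print(length_scale_from_speed(2.0))   # 0.5
```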

Convert the models to OpenVINO IR and save them to the `IRS_PATH` folder for future use. If the IRs already exist, skip conversion and read them directly.

In [9]:
IRS_PATH = '../openvino_irs/'
EN_TTS_IR = f'{IRS_PATH}/openvoice_en_tts.xml'
ZH_TTS_IR = f'{IRS_PATH}/openvoice_zh_tts.xml'
VOICE_CONVERTER_IR = f'{IRS_PATH}/openvoice_tone_conversion.xml'

paths = [EN_TTS_IR, VOICE_CONVERTER_IR]
models = [OVOpenVoiceTTS(en_base_speaker_tts, tts_kwargs), OVOpenVoiceConverter(tone_color_converter, voice_convert_kwargs)]
if enable_chinese_lang:
    # keep `paths` and `models` aligned: zip() below stops at the shorter list
    paths.append(ZH_TTS_IR)
    models.append(OVOpenVoiceTTS(zh_base_speaker_tts, tts_kwargs))
ov_models = []

for model, path in zip(models, paths):
    if not os.path.exists(path):
        ov_model = ov.convert_model(model, example_input=model.get_example_input())
        ov.save_model(ov_model, path)
    else:
        ov_model = core.read_model(path)
    ov_models.append(ov_model)

ov_en_tts, ov_voice_conversion = ov_models[:2]
if enable_chinese_lang:
    ov_zh_tts = ov_models[-1]
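
The loop above pairs each model with its IR path via `zip`, which silently stops at the shorter sequence, so the two lists need the same length and order. The convert-or-read step itself is an ordinary on-disk cache. Below is a dependency-free sketch of both ideas. `load_or_build` and the toy builder are hypothetical stand-ins for `ov.convert_model`, `ov.save_model`, and `core.read_model`, and plain text files stand in for IR files.

```python
import tempfile
from pathlib import Path


def load_or_build(path: Path, build):
    """Return (text, was_cached): read `path` if it exists, else build and save it."""
    if path.exists():
        return path.read_text(), True
    text = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return text, False


names = ['en_tts', 'tone_conversion', 'zh_tts']
with tempfile.TemporaryDirectory() as tmp:
    paths = [Path(tmp) / f'openvoice_{name}.txt' for name in names]
    # zip would silently drop unmatched items, so check the lengths first
    assert len(names) == len(paths)
    first_run = [load_or_build(p, lambda n=n: f'model:{n}') for n, p in zip(names, paths)]
    second_run = [load_or_build(p, lambda n=n: f'model:{n}') for n, p in zip(names, paths)]

print(first_run[0])    # ('model:en_tts', False)
print(second_run[0])   # ('model:en_tts', True)
print(len(list(zip([1, 2, 3], ['a', 'b']))))   # 2: the third item is dropped
```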

## Inference

### Select inference device
[back to top ⬆️](#Table-of-contents:)

In [10]:
import ipywidgets as widgets

core = ov.Core()
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value='AUTO',
    description='Device:',
    disabled=False,
)
device

Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')

### Select reference tone
[back to top ⬆️](#Table-of-contents:)

First, select the reference voice whose tone the generated speech will be converted to. You can choose one of the existing samples or record your own by selecting `record_manually`.

In [11]:
reference_speakers = [
    'resources/example_reference.mp3',
    'resources/demo_speaker0.mp3',
    'resources/demo_speaker1.mp3',
    'resources/demo_speaker2.mp3',
    'record_manually',
]

ref_speaker = widgets.Dropdown(
    options=reference_speakers,
    value=reference_speakers[0],
    description="reference voice from which tone color will be copied",
    disabled=False,
)

display(ref_speaker)

Dropdown(description='reference voice from which tone color will be copied', options=('resources/example_refer…

In [12]:
OUTPUT_DIR = '../outputs/'
os.makedirs(OUTPUT_DIR, exist_ok=True)

In [13]:
ref_speaker_path = ref_speaker.value

if ref_speaker.value == 'record_manually':
    ref_speaker_path = f'{OUTPUT_DIR}/custom_example_sample.webm'
    from ipywebrtc import AudioRecorder, CameraStream
    camera = CameraStream(constraints={'audio': True,'video':False})
    recorder = AudioRecorder(stream=camera, filename=ref_speaker_path, autosave=True)
    display(recorder)

Play the reference voice sample before cloning its tone onto other speech.

In [14]:
from IPython.display import Audio
Audio(ref_speaker_path)

Load speaker embeddings

In [15]:
import ffmpeg_downloader as ffdl
import sys
delimiter = ':' if sys.platform != 'win32' else ';'
os.environ['PATH'] = os.environ['PATH'] + f"{delimiter}{ffdl.ffmpeg_dir}"
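
The cell above picks the `PATH` separator by hand. The standard library exposes the same information as `os.pathsep`, so the platform check can be avoided. A small sketch, assuming a hypothetical helper `append_to_path` and a made-up directory used only for illustration:

```python
import os


def append_to_path(path_value: str, new_dir: str) -> str:
    """Append `new_dir` to a PATH-style string using the platform separator."""
    if not path_value:
        return new_dir
    return f"{path_value}{os.pathsep}{new_dir}"


# Hypothetical directory, used only for illustration
example = append_to_path(os.environ.get('PATH', ''), '/opt/ffmpeg/bin')
print(example.split(os.pathsep)[-1])   # /opt/ffmpeg/bin
```

In the notebook, the second argument would be `ffdl.ffmpeg_dir`.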

In [16]:
en_source_default_se = torch.load(f'{en_ckpt_path}/en_default_se.pth')
en_source_style_se = torch.load(f'{en_ckpt_path}/en_style_se.pth')
zh_source_se = torch.load(f'{zh_ckpt_path}/zh_default_se.pth') if enable_chinese_lang else None

target_se, audio_name = se_extractor.get_se(ref_speaker_path, tone_color_converter, target_dir='processed', vad=True)  # ffmpeg must be installed

Replace the original inference methods of the `OpenVoiceBaseClass` models with optimized OpenVINO inference.

Some pre- and post-processing steps are not traceable and cannot be offloaded to OpenVINO. Instead of rewriting that processing ourselves, we rely on the existing implementation and replace only the `infer` and `voice_conversion` functions, so that the most computationally expensive part runs in OpenVINO.

In [17]:
def assert_kwargs_are_same(kwargs: dict, orig_kwargs: dict):
    for k, v in kwargs.items():
        assert v == orig_kwargs[k], f"Model was converted to IR with {k}: '{orig_kwargs[k]}', " \
            f"but you are trying to infer with {k} = '{v}'. " \
            f"Please use original value or rerun ov.convert_model with the desirable value of '{k}'"

def get_patched_infer(ov_model: ov.Model, device: str, orig_kwargs: dict = None) -> callable:
    # compile once; the returned closure reuses the compiled model on every call
    compiled_model = core.compile_model(ov_model, device)
    
    def infer_impl(x, x_lengths, sid=None, **kwargs):
        assert_kwargs_are_same(kwargs, orig_kwargs)
        ov_output = compiled_model((x, x_lengths, sid))
        return (torch.tensor(ov_output[0]), )
    return infer_impl

def get_patched_voice_conversion(ov_model: ov.Model, device: str, orig_kwargs: dict = None) -> callable:
    compiled_model = core.compile_model(ov_model, device)

    def voice_conversion_impl(y, y_lengths, sid_src, sid_tgt, **kwargs):
        assert_kwargs_are_same(kwargs, orig_kwargs)
        ov_output = compiled_model((y, y_lengths, sid_src, sid_tgt))
        return (torch.tensor(ov_output[0]), )
    return voice_conversion_impl


en_base_speaker_tts.model.infer = get_patched_infer(ov_en_tts, device.value, orig_kwargs=tts_kwargs)
tone_color_converter.model.voice_conversion = get_patched_voice_conversion(ov_voice_conversion, device.value, orig_kwargs=voice_convert_kwargs)
if enable_chinese_lang:
    zh_base_speaker_tts.model.infer = get_patched_infer(ov_zh_tts, device.value, orig_kwargs=tts_kwargs)
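
The patching pattern above can be reduced to a toy example with no OpenVoice or OpenVINO dependency. The surrounding pipeline keeps its Python pre- and post-processing, and only the heavy method on the inner model object is swapped for a closure that rejects keyword values different from the ones fixed in advance. All class and function names below are hypothetical.

```python
class ToyModel:
    def infer(self, x, scale=1.0):
        # Stand-in for the slow reference implementation
        return [v * scale for v in x]


class ToyPipeline:
    """Pre- and post-processing stay in Python; only model.infer is swapped."""

    def __init__(self):
        self.model = ToyModel()

    def run(self, text, scale=1.0):
        x = [float(ord(c)) for c in text]       # pre-processing
        y = self.model.infer(x, scale=scale)    # heavy step
        return sum(y)                           # post-processing


def make_patched_infer(frozen_kwargs):
    """Return a replacement for infer whose keyword values are fixed in advance."""
    def infer_impl(x, **kwargs):
        for key, value in kwargs.items():
            if frozen_kwargs.get(key) != value:
                raise ValueError(f"{key}={value!r} differs from frozen {frozen_kwargs.get(key)!r}")
        # A compiled model would be called here instead
        return [v * frozen_kwargs['scale'] for v in x]
    return infer_impl


pipeline = ToyPipeline()
expected = pipeline.run('ab')
pipeline.model.infer = make_patched_infer({'scale': 1.0})   # instance-level patch
print(pipeline.run('ab') == expected)   # True
```

Calling `pipeline.run('ab', scale=2.0)` after the patch raises `ValueError`, mirroring how `assert_kwargs_are_same` guards against values that differ from those used at conversion time.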

### Run inference
[back to top ⬆️](#Table-of-contents:)

In [18]:
save_path = f'{OUTPUT_DIR}/output_en_default.wav'

text = """
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve 
a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, 
recommendation systems, and many others.
"""

src_path = f'{OUTPUT_DIR}/tmp.wav'
en_base_speaker_tts.tts(text, src_path, speaker='default', language='English', speed=1.0)

tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=en_source_default_se, 
    tgt_se=target_se, 
    output_path=save_path, 
    message="@MyShell")

 > Text splitted to sentences.
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision,
automatic speech recognition, natural language processing, recommendation systems, and many others.
ˈoʊpən vino* toolkit* ɪz ə ˌkɑmpɹiˈhɛnsɪv toolkit* fəɹ kˈwɪkli dɪˈvɛləpɪŋ ˌæpləˈkeɪʃənz ənd səˈluʃənz ðət sɑɫv ə vəɹˈaɪəti əv tæsks ˌɪnˈkludɪŋ ˌɛmjəˈleɪʃən əv ˈjumən ˈvɪʒən,
 length:173
 length:173
ˌɔtəˈmætɪk spitʃ ˌɹɛkɪgˈnɪʃən, ˈnætʃəɹəɫ ˈlæŋgwɪdʒ ˈpɹɑsɛsɪŋ, ˌɹɛkəmənˈdeɪʃən ˈsɪstəmz, ənd ˈmɛni ˈəðəɹz.
 length:105
 length:105


Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]


In [19]:
# play the original voice
Audio(src_path)

In [20]:
# play the speech with tone voice copied from the reference
Audio(save_path)

## Run OpenVoice Gradio online app
[back to top ⬆️](#Table-of-contents:)

We can also use a [Gradio](https://www.gradio.app/) app to run TTS and voice tone conversion online.

In [18]:
from openvoice_gradio import get_demo

demo = get_demo(OUTPUT_DIR, tone_color_converter, en_base_speaker_tts, zh_base_speaker_tts, en_source_default_se, en_source_style_se, zh_source_se)
demo.queue(max_size=2)
demo.launch(server_name="0.0.0.0", server_port=7860)



Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.




Detected language:en
[(0.0, 8.178), (9.326, 12.914), (13.262, 16.402), (16.654, 29.49225)]
after vad: dur = 27.743990929705216
 > Text splitted to sentences.
He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick,
peppered, flour-fattened sauce.
hi hoʊpt ðɛɹ wʊd bi stu fəɹ ˈdɪnəɹ, ˈtəɹnəps ənd ˈkɛɹəts ənd bɹuzd pəˈteɪtoʊz ənd fæt ˈmətən ˈpisɪz tɪ bi ˈleɪdəɫd aʊt ɪn θɪk,
 length:126
 length:126


Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]


ˈpɛpəɹd, flouɹ-fattened* sɔs.
 length:29
 length:29


## Cleanup
[back to top ⬆️](#Table-of-contents:)

In [None]:
# run this cell to stop the Gradio interface
demo.close()

# clean up 
# import shutil
# shutil.rmtree(CKPT_BASE_PATH)
# shutil.rmtree(IRS_PATH)
# shutil.rmtree(OUTPUT_DIR)