### Introduction

OpenVoice is a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. OpenVoice also achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set.

This notebook provides an example of converting the original OpenVoice model (https://github.com/myshell-ai/OpenVoice) to the OpenVINO IR format for faster inference.

Clone the repository and install all dependencies.

In [1]:
!git clone https://github.com/myshell-ai/OpenVoice

In [2]:
%pip install openvino==2023.2
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu  "torch>=2.1.0" "torchaudio>=2.1.0"
# note: wavmark also installs torch

# todo: try to unfreeze dependencies
%pip install librosa==0.9.1 \
faster-whisper==0.9.0 \
pydub==0.25.1 \
whisper-timestamped==1.14.2 \
tqdm \
inflect==7.0.0 \
unidecode==1.3.7 \
eng_to_ipa==0.0.2 \
wavmark==0.0.2 \
pypinyin==0.50.0 \
cn2an==0.5.22 \
jieba==0.42.1 \
langid==1.1.6 \
gradio==4.15 \
ipywebrtc \
ipywidgets

In [3]:
import os
import torch
from openvoice_utils import OVOpenVoiceTTS, OVOpenVoiceConvert

In [4]:
# cd to the original repo to save original data paths and imports
%cd OpenVoice

/home/epavel/devel/openvino_notebooks/notebooks/408-openvoice/OpenVoice



Download all resources from the Hugging Face Hub.

In [5]:
!mkdir -p checkpoints/converter/
!mkdir -p checkpoints/base_speakers/EN/
!mkdir -p checkpoints/base_speakers/ZH/

!wget https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/converter/checkpoint.pth -O checkpoints/converter/checkpoint.pth
!wget https://huggingface.co/myshell-ai/OpenVoice/raw/main/checkpoints/converter/config.json -O checkpoints/converter/config.json

!wget https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/base_speakers/EN/checkpoint.pth -O checkpoints/base_speakers/EN/checkpoint.pth
!wget https://huggingface.co/myshell-ai/OpenVoice/raw/main/checkpoints/base_speakers/EN/config.json -O checkpoints/base_speakers/EN/config.json

!wget https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/base_speakers/ZH/checkpoint.pth -O checkpoints/base_speakers/ZH/checkpoint.pth
!wget https://huggingface.co/myshell-ai/OpenVoice/raw/main/checkpoints/base_speakers/ZH/config.json -O checkpoints/base_speakers/ZH/config.json

!wget https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/base_speakers/EN/en_default_se.pth -O checkpoints/base_speakers/EN/en_default_se.pth
!wget https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/base_speakers/EN/en_style_se.pth -O checkpoints/base_speakers/EN/en_style_se.pth
!wget https://huggingface.co/myshell-ai/OpenVoice/resolve/main/checkpoints/base_speakers/ZH/zh_default_se.pth -O checkpoints/base_speakers/ZH/zh_default_se.pth
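
On systems without `wget`, the same files could be fetched from Python. The following is a minimal sketch, not part of the original notebook: the helper names `build_download_plan` and `download_missing` are hypothetical, and the URLs simply mirror the `wget` commands above (`resolve/main` for weights, `raw/main` for configs). Files already on disk are skipped.

```python
from pathlib import Path
from urllib.request import urlretrieve

HF_REPO = "https://huggingface.co/myshell-ai/OpenVoice"

# Relative paths copied from the wget commands above.
WEIGHT_FILES = [
    "checkpoints/converter/checkpoint.pth",
    "checkpoints/base_speakers/EN/checkpoint.pth",
    "checkpoints/base_speakers/ZH/checkpoint.pth",
    "checkpoints/base_speakers/EN/en_default_se.pth",
    "checkpoints/base_speakers/EN/en_style_se.pth",
    "checkpoints/base_speakers/ZH/zh_default_se.pth",
]
CONFIG_FILES = [
    "checkpoints/converter/config.json",
    "checkpoints/base_speakers/EN/config.json",
    "checkpoints/base_speakers/ZH/config.json",
]


def build_download_plan():
    """Return (url, relative_path) pairs mirroring the wget commands."""
    plan = [(f"{HF_REPO}/resolve/main/{p}", Path(p)) for p in WEIGHT_FILES]
    plan += [(f"{HF_REPO}/raw/main/{p}", Path(p)) for p in CONFIG_FILES]
    return plan


def download_missing(plan, root=Path(".")):
    """Download only the files not yet on disk; return the paths fetched."""
    fetched = []
    for url, rel_path in plan:
        target = root / rel_path
        if target.exists():
            continue  # already downloaded in an earlier run
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, target)
        fetched.append(target)
    return fetched
```

Calling `download_missing(build_download_plan())` from inside the `OpenVoice` directory would then populate `checkpoints/` in the same layout as the cell above.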

In [6]:
from api import BaseSpeakerTTS, ToneColorConverter

In [7]:
en_ckpt_base = 'checkpoints/base_speakers/EN'
zh_ckpt_base = 'checkpoints/base_speakers/ZH'

device="cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs'

en_base_speaker_tts = BaseSpeakerTTS(f'{en_ckpt_base}/config.json', device=device)
en_base_speaker_tts.load_ckpt(f'{en_ckpt_base}/checkpoint.pth')

zh_base_speaker_tts = BaseSpeakerTTS(f'{zh_ckpt_base}/config.json', device=device)
zh_base_speaker_tts.load_ckpt(f'{zh_ckpt_base}/checkpoint.pth')

ckpt_converter = 'checkpoints/converter'
tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

Loaded checkpoint 'checkpoints/base_speakers/EN/checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint 'checkpoints/base_speakers/ZH/checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint 'checkpoints/converter/checkpoint.pth'
missing/unexpected keys: [] []


### Convert models to OpenVINO

In [8]:
import ipywidgets as widgets

devices = ['CPU', 'GPU', 'AUTO']
device = widgets.Dropdown(options=devices, value=devices[0], disabled=False)
display(device)

Dropdown(options=('CPU', 'GPU', 'AUTO'), value='CPU')

In [9]:
en_tts_model = OVOpenVoiceTTS(en_base_speaker_tts, ir_path='en_openvoice_tts.xml')
zh_tts_model = OVOpenVoiceTTS(zh_base_speaker_tts, ir_path='zh_openvoice_tts.xml')
color_convert_model = OVOpenVoiceConvert(tone_color_converter, ir_path='openvoice_converter.xml')

en_tts_model.compile(device.value)
zh_tts_model.compile(device.value)
color_convert_model.compile(device.value)
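
An OpenVINO IR model is stored as a pair of files: an `.xml` topology file and a `.bin` weights file next to it. Whether the `OVOpenVoiceTTS` and `OVOpenVoiceConvert` wrappers reuse existing IR files is not shown in this notebook, so the sketch below is only an illustration of how one might check for a previously saved IR before converting again. The helper name `ir_exists` is hypothetical.

```python
from pathlib import Path


def ir_exists(ir_path):
    """Return True if both the IR topology (.xml) and weights (.bin) files exist."""
    xml_path = Path(ir_path)
    bin_path = xml_path.with_suffix(".bin")
    return xml_path.suffix == ".xml" and xml_path.is_file() and bin_path.is_file()
```

For example, `ir_exists('en_openvoice_tts.xml')` would report whether the English TTS model has already been written to disk in the current directory.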

In [10]:
# load speaker embeddings
en_source_default_se = torch.load(f'{en_ckpt_base}/en_default_se.pth')
en_source_style_se = torch.load(f'{en_ckpt_base}/en_style_se.pth')
zh_source_se = torch.load(f'{zh_ckpt_base}/zh_default_se.pth')

First, select the reference voice whose tone color the generated speech will be converted to. You can select one of the existing samples or record your own by selecting 'record_manually'.

In [11]:
reference_speakers = [
    'resources/example_reference.mp3',
    'resources/demo_speaker0.mp3',
    'resources/demo_speaker1.mp3',
    'resources/demo_speaker2.mp3',
    'record_manually',
]

ref_speaker = widgets.Dropdown(
    options=reference_speakers,
    value=reference_speakers[0],
    description="reference voice from which tone color will be copied",
    disabled=False,
)

display(ref_speaker)

Dropdown(description='reference voice from which tone color will be copied', options=('resources/example_refer…

In [12]:
ref_speaker_path = ref_speaker.value

if ref_speaker.value == 'record_manually':
    ref_speaker_path = f'{output_dir}/custom_example_sample.webm'
    from ipywebrtc import AudioRecorder, CameraStream
    camera = CameraStream(constraints={'audio': True,'video':False})
    recorder = AudioRecorder(stream=camera, filename=ref_speaker_path, autosave=True)
    display(recorder)

In [13]:
from IPython.display import Audio
Audio(ref_speaker_path)

In [14]:
import se_extractor

# Note: ffmpeg must be installed for this step.
target_se, audio_name = se_extractor.get_se(ref_speaker_path, tone_color_converter, target_dir='processed', vad=True)

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.
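
Because the speaker-embedding extraction step needs ffmpeg, it can help to check for it before running the cell. This is a minimal sketch using only the standard library; the helper name `tool_available` is hypothetical and the check only looks for an executable on `PATH`.

```python
import shutil


def tool_available(name="ffmpeg"):
    """Return True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


if not tool_available("ffmpeg"):
    print("ffmpeg was not found on PATH; install it before calling se_extractor.get_se")
```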



### Inference

In [15]:
save_path = f'{output_dir}/output_en_default.wav'

text = """
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve 
a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, 
recommendation systems, and many others.
"""

src_path = f'{output_dir}/tmp.wav'
en_tts_model.tts(text, src_path, speaker='default', language='English', speed=1.0)

color_convert_model.convert(
    audio_src_path=src_path, 
    src_se=en_source_default_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message="@MyShell")

 > Text splitted to sentences.
OpenVINO toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision,
automatic speech recognition, natural language processing, recommendation systems, and many others.
ˈoʊpən vino* toolkit* ɪz ə ˌkɑmpɹiˈhɛnsɪv toolkit* fəɹ kˈwɪkli dɪˈvɛləpɪŋ ˌæpləˈkeɪʃənz ənd səˈluʃənz ðət sɑɫv ə vəɹˈaɪəti əv tæsks ˌɪnˈkludɪŋ ˌɛmjəˈleɪʃən əv ˈjumən ˈvɪʒən,
 length:173
 length:173
ˌɔtəˈmætɪk spitʃ ˌɹɛkɪgˈnɪʃən, ˈnætʃəɹəɫ ˈlæŋgwɪdʒ ˈpɹɑsɛsɪŋ, ˌɹɛkəmənˈdeɪʃən ˈsɪstəmz, ənd ˈmɛni ˈəðəɹz.
 length:105
 length:105




In [16]:
Audio(src_path)

In [17]:
Audio(save_path)
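
Besides listening to the results, a quick sanity check on the generated files can be useful. The sketch below reads a WAV file's duration with the standard-library `wave` module. It assumes the file is PCM-encoded, which is the only encoding `wave` reads; that assumption is not confirmed by this notebook. The helper name `wav_duration_seconds` is hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

Under that assumption, `wav_duration_seconds(src_path)` and `wav_duration_seconds(save_path)` should be close to each other, since tone color conversion changes the voice and not the content.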

### Run OpenVoice Gradio online app
We can also use a [Gradio](https://www.gradio.app/) app to run TTS and voice tone conversion interactively through a web interface.

In [18]:
from openvoice_gradio import get_demo

demo = get_demo(output_dir, color_convert_model, en_tts_model, zh_tts_model, en_source_default_se, en_source_style_se, zh_source_se)
demo.queue(max_size=2)
demo.launch(server_name="0.0.0.0", server_port=7860)



Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
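
The launch call above binds to a fixed port (7860). If another process already listens there, the launch may fail, so one could probe the port first. This is a minimal sketch with the standard library; the helper name `port_is_free` is hypothetical, and it only tests whether something accepts connections on the given host and port.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if no process accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success, i.e. when something is listening.
        return sock.connect_ex((host, port)) != 0
```

Running `port_is_free(7860)` before `demo.launch(...)` gives an early hint that a different `server_port` may be needed.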




In [19]:
# Run this cell to stop the Gradio interface
demo.close()

Closing server running on port: 7860
