## Multi-Accent and Multi-Lingual Voice Clone Demo with MeloTTS

In [7]:
import os
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

### Initialization

In this example, we use the checkpoints from OpenVoiceV2. OpenVoiceV2 is trained with more aggressive augmentations and thus demonstrates better robustness in some cases.

In [8]:
ckpt_converter = 'checkpoints/converter'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs_v2'

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

Loaded checkpoint 'checkpoints/converter/checkpoint.pth'
missing/unexpected keys: [] []
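
Loading fails with a less obvious error when the checkpoint folder is incomplete, so it can help to check for the two files the cell above reads before constructing the converter. The following is a minimal sketch; the helper name `missing_checkpoint_files` is hypothetical and not part of OpenVoice.

```python
from pathlib import Path


def missing_checkpoint_files(ckpt_dir, required=("config.json", "checkpoint.pth")):
    """Return the required file names that are absent from ckpt_dir.

    Hypothetical helper, not part of OpenVoice. The default file names
    are the two that the initialization cell reads.
    """
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]


# Same folder as ckpt_converter in the initialization cell
missing = missing_checkpoint_files("checkpoints/converter")
if missing:
    print(f"Missing converter files: {missing}")
```

An empty list means both files are present and the converter can be loaded.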


### Obtain Tone Color Embedding
We extract a tone color embedding only for the target speaker. The source tone color embeddings can be loaded directly from the `checkpoints_v2/base_speakers/ses` folder.

In [11]:

reference_speaker = 'resources/demo_speaker2.mp3' # This is the voice you want to clone
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, vad=False)

OpenVoice version: v1
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Estimating duration from bitrate, this may be inaccurate
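
The `huggingface/tokenizers` warning in the output above names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it runs at the top of the notebook before any library that uses `tokenizers` is imported:

```python
import os

# Set this before importing libraries that use Hugging Face tokenizers,
# so the forked process no longer has to disable parallelism itself.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

The warning is harmless either way; setting the variable only silences it.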


#### Use MeloTTS as Base Speakers

MeloTTS is a high-quality multi-lingual text-to-speech library by @MyShell.ai. It supports English (American, British, Indian, Australian, and Default accents), Spanish, French, Chinese, Japanese, and Korean. In the following example, we use the MeloTTS models as the base speakers.

In [12]:
from melo.api import TTS

# torch, device, and output_dir are reused from the initialization cell above

text = "Did you ever hear a folk tale about a giant turtle?"
src_path = f"{output_dir}/tmp.wav"
speed = 1.0

model = TTS(language="EN", device=device)
speaker_ids = model.hps.data.spk2id

# Print available speaker keys
print("Available speaker keys:", list(speaker_ids.keys()))

# Try to find an appropriate English speaker
english_speakers = [key for key in speaker_ids.keys() if key.startswith("EN")]

if not english_speakers:
    raise ValueError(
        "No English speaker found. Available speakers: " + ", ".join(speaker_ids.keys())
    )

# Use the first available English speaker
speaker_key = english_speakers[0]
speaker_id = speaker_ids[speaker_key]

print(f"Using speaker: {speaker_key}")

source_se = torch.load(
    f'checkpoints_v2/base_speakers/ses/{speaker_key.lower().replace("_", "-")}.pth',
    map_location=device,
)
model.tts_to_file(text, speaker_id, src_path, speed=speed)
save_path = f'{output_dir}/output_v2_{speaker_key.lower().replace("_", "-")}.wav'

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=save_path,
    message=encode_message,
)

Available speaker keys: ['EN-US', 'EN-BR', 'EN_INDIA', 'EN-AU', 'EN-Default']
Using speaker: EN-US
 > Text split to sentences.
Did you ever hear a folk tale about a giant turtle?


100%|██████████| 1/1 [00:01<00:00,  1.86s/it]
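
The cell above converts only the first English speaker. To cover every accent, the same file-naming convention (lowercase the speaker key and replace `_` with `-`) can be applied to each key in turn. The sketch below only builds the paths, so it runs without MeloTTS installed; the helper name `conversion_jobs` and the speaker IDs in `example_ids` are illustrative assumptions, not values taken from the library.

```python
def conversion_jobs(speaker_ids, output_dir="outputs_v2",
                    ses_dir="checkpoints_v2/base_speakers/ses", prefix="EN"):
    """Yield (speaker_key, speaker_id, source_se_path, save_path) per matching speaker.

    Hypothetical helper. It accesses speaker_ids only through keys() and
    indexing, as the cell above does.
    """
    for key in speaker_ids.keys():
        if not key.startswith(prefix):
            continue
        # Same normalization as the cell above: "EN_INDIA" -> "en-india"
        name = key.lower().replace("_", "-")
        yield (key, speaker_ids[key],
               f"{ses_dir}/{name}.pth",
               f"{output_dir}/output_v2_{name}.wav")


# Illustrative mapping with made-up IDs; in the notebook, pass model.hps.data.spk2id
example_ids = {"EN-US": 0, "EN-BR": 1, "EN_INDIA": 2, "EN-AU": 3, "EN-Default": 4}
jobs = list(conversion_jobs(example_ids))
for job in jobs:
    print(job)
```

In the notebook, each job would then go through the same three steps as above: `torch.load` on the source embedding path, `model.tts_to_file` with the speaker ID, and `tone_color_converter.convert` with the save path as `output_path`.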
