# Speech Synthesis

Stress placement and phonetic transcription can be useful for speech synthesis. This notebook shows how to run an [XTTS](https://github.com/coqui-ai/TTS) model trained on [IPA transcriptions](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) of Russian text. The model was trained on the [`RUSLAN`](https://ruslan-corpus.github.io/) and [Common Voice](https://commonvoice.mozilla.org/ru) corpora. Model weights can be downloaded from [Hugging Face](https://huggingface.co/omogr/XTTS-ru-ipa).

In [1]:
# @title Installing XTTS

# Alternative XTTS installation options (each has its own issues):

# !git clone -b dev https://github.com/coqui-ai/TTS
# !pip install -e TTS
# !pip install git+https://github.com/coqui-ai/TTS

!pip install coqpit
!pip install trainer
!pip install pypinyin
!pip install hangul_romanize
!pip install num2words
!pip install TTS==0.22.0 --no-deps

!mkdir model
print('Loading XTTS weights from huggingface...')
!git clone https://huggingface.co/omogr/XTTS-ru-ipa model
!pip install git+https://github.com/omogr/omogre.git


Collecting coqpit
  Downloading coqpit-0.0.17-py3-none-any.whl.metadata (11 kB)
Downloading coqpit-0.0.17-py3-none-any.whl (13 kB)
Installing collected packages: coqpit
Successfully installed coqpit-0.0.17
Collecting trainer
  Downloading trainer-0.0.36-py3-none-any.whl.metadata (8.1 kB)
Downloading trainer-0.0.36-py3-none-any.whl (51 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.2/51.2 kB 2.7 MB/s eta 0:00:00
Installing collected packages: trainer
Successfully installed trainer-0.0.36
Collecting pypinyin
  Downloading pypinyin-0.52.0-py2.py3-none-any.whl.metadata (12 kB)
Downloading pypinyin-0.52.0-py2.py3-none-any.whl (833 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 833.7/833.7 kB 27.9 MB/s eta 0:00:00
Installing collected packages: pypinyin
Successfully installed pypinyin-0.52.0
Collecting hangul_romanize
  Downloading hangul_romanize-0.1.0-py3-none-any.whl.metadata (1.2 kB)

Download the XTTS model weights from [Hugging Face](https://huggingface.co/omogr/XTTS-ru-ipa), then install the [transcriptor](https://github.com/omogr/omogre).
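Before loading the model, it can help to confirm that the clone actually produced usable files. The sketch below is not part of the original notebook. It assumes the weights were cloned into `model` and checks for the four file names that the loading code in this notebook reads (`model.pth`, `config.json`, `vocab.json`, `reference_audio.wav`). The helper names `missing_model_files` and `is_lfs_pointer` are hypothetical. The second helper detects the case where Git LFS was not available and `model.pth` is only a small text pointer rather than the real checkpoint.

```python
from pathlib import Path

# File names taken from the loading code in this notebook.
REQUIRED_FILES = ("model.pth", "config.json", "vocab.json", "reference_audio.wav")

# Git LFS pointer files are small text files that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def missing_model_files(model_dir="model"):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

A possible use: call `missing_model_files('model')` after the clone and stop if the list is non-empty. If `is_lfs_pointer('model/model.pth')` returns `True`, the checkpoint was not fetched and the clone needs to be repeated with Git LFS installed.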


In [5]:
# @title Download transcriptor model weights. Initialize XTTS and transcriptor.

import os
import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

from omogre import Transcriptor
import IPython.display as ipd

model_dir = 'model'

def clear_gpu_cache():
    """Clear the GPU cache if CUDA is available."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

XTTS_MODEL = None

def load_model(xtts_model_path='model'):
    """
    Load the XTTS model.

    Parameters:
    - xtts_model_path (str): Path to the XTTS model directory.
    """
    global XTTS_MODEL
    clear_gpu_cache()

    if not xtts_model_path:
        raise ValueError("Model path must be provided.")

    xtts_checkpoint = os.path.join(xtts_model_path, "model.pth")
    xtts_config = os.path.join(xtts_model_path, "config.json")
    xtts_vocab = os.path.join(xtts_model_path, "vocab.json")

    config = XttsConfig()
    config.load_json(xtts_config)
    XTTS_MODEL = Xtts.init_from_config(config)
    print("XTTS initialization ...")
    XTTS_MODEL.load_checkpoint(config, checkpoint_path=xtts_checkpoint,
                               vocab_path=xtts_vocab, use_deepspeed=False, speaker_file_path='-')
    if torch.cuda.is_available():
        XTTS_MODEL.cuda()
    print(" ... done")


class XttsInference:
    def __init__(self, transcriptor_data_path='omogre_data',
                 xtts_model_path='model'):
        """
        Initialize the transcriptor and load the XTTS model.

        Parameters:
        - transcriptor_data_path (str): Path where transcriptor data will be downloaded.
        - xtts_model_path (str): Path to the XTTS model directory.
        """
        clear_gpu_cache()
        self.transcriptor = Transcriptor(data_path=transcriptor_data_path)
        load_model(xtts_model_path=xtts_model_path)
        reference_audio = os.path.join(xtts_model_path, "reference_audio.wav")

        self.gpt_cond_latent, self.speaker_embedding = XTTS_MODEL.get_conditioning_latents(
            audio_path=reference_audio,
            gpt_cond_len=XTTS_MODEL.config.gpt_cond_len,
            max_ref_length=XTTS_MODEL.config.max_ref_len,
            sound_norm_refs=XTTS_MODEL.config.sound_norm_refs
        )

    def __call__(self, src_text):
        """
        Generate synthesized speech from input text.

        Parameters:
        - src_text (str): Source text to synthesize.

        Returns:
        - tuple: Transcribed text and audio waveform tensor.
        """
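        # Transcribe the source text; the transcriptor takes a list of strings
        # and the transcribed pieces are joined into a single string.
        tts_text = ' '.join(self.transcriptor([src_text]))
        # Run the XTTS model to synthesize speech from the transcription.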
        out = XTTS_MODEL.inference(
            text=tts_text,
            language='ru',
            gpt_cond_latent=self.gpt_cond_latent,
            speaker_embedding=self.speaker_embedding,
            temperature=XTTS_MODEL.config.temperature,
            length_penalty=XTTS_MODEL.config.length_penalty,
            repetition_penalty=XTTS_MODEL.config.repetition_penalty,
            top_k=XTTS_MODEL.config.top_k,
            top_p=XTTS_MODEL.config.top_p,
        )
        audio = torch.tensor(out["wav"]).unsqueeze(0)
        return tts_text, audio


xtts_inference = XttsInference()


XTTS initialization ...


In [6]:
# @title Example of generating audio for a single phrase
src_text = 'МИД Турции официально заявил, что Турция заинтересована во вступлении в БРИКС.' # @param {type:"string"}
output_file = 'audio.wav' # @param {type:"string"}

In [7]:
# @title Transcribe and generate audio
tts_text, audio = xtts_inference(src_text)
print('transcription:', tts_text)

# Save the result (XTTS produces 24 kHz audio)
torchaudio.save(output_file, audio, sample_rate=24000)
ipd.display(ipd.Audio(audio.to('cpu').detach(), rate=24000))
print('output_file:', output_file)

transcription: mʲ`it t`urtsɨɪ ɐfʲɪtsɨ`alʲnə zəjɪvʲ`iɫ, ʂt`o t`urtsɨjə zəɪnʲtʲɪrʲɪs`ovənə v`o fstʊplʲ`enʲɪɪ v brʲ`iks.


output_file: audio.wav
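The example above synthesizes a single phrase. For longer texts, one option is to split the input into sentences and synthesize them one at a time. The sketch below is an illustration, not part of the original notebook. `split_sentences` and `synthesize_sentences` are hypothetical helpers, the regular-expression split is deliberately naive (it will also break on abbreviations), and `infer` stands for any callable with the same interface as `xtts_inference`, that is, one that takes a string and returns a `(transcription, audio)` pair.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


def synthesize_sentences(infer, text):
    """Call infer on each sentence and collect the (transcription, audio) pairs."""
    return [infer(sentence) for sentence in split_sentences(text)]
```

With the objects defined in this notebook, `results = synthesize_sentences(xtts_inference, long_text)` would return one pair per sentence. Since each audio tensor has shape `(1, num_samples)`, the pieces could then be joined with `torch.cat([audio for _, audio in results], dim=1)` before saving.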
