# Text-to-Speech synthesis using Llasa and OpenVINO

Llasa is a text-to-speech (TTS) system that extends the text-based LLaMA language model (1B, 3B, and 8B) by incorporating speech tokens from the [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2) codebook, which contains 65,536 tokens. The model can generate speech either from input text alone or by using a given speech prompt.
The method is fully compatible with the Llama framework, which makes training a TTS model similar to training an LLM: audio is converted into single-codebook tokens and treated as a special language. This makes it possible to apply existing LLM methods for compression, acceleration, and fine-tuning.
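
To make the "audio as a special language" idea concrete, here is a minimal sketch of how codec indices can be written as special-token strings and parsed back. It uses the `<|s_N|>` token naming that appears in the inference code later in this notebook; the helper names `codes_to_text` and `text_to_codes` are hypothetical and exist only for this illustration.

```python
import re

CODEBOOK_SIZE = 65_536  # number of XCodec2 codes, as stated above

_SPEECH_TOKEN = re.compile(r"<\|s_(\d+)\|>")


def codes_to_text(codes):
    """Serialize codec indices as special-token strings, one token per code."""
    for code in codes:
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"code {code} is outside the codebook")
    return "".join(f"<|s_{code}|>" for code in codes)


def text_to_codes(text):
    """Recover the integer codec indices from a serialized token string."""
    return [int(number) for number in _SPEECH_TOKEN.findall(text)]


codes = [23456, 0, 65535]
serialized = codes_to_text(codes)
print(serialized)  # <|s_23456|><|s_0|><|s_65535|>
assert text_to_codes(serialized) == codes
```

Because each audio frame becomes one vocabulary item, the language model only needs extra vocabulary entries and no architectural changes.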

More details about the model can be found in the [paper](https://arxiv.org/abs/2502.04128), [repository](https://github.com/zhenye234/LLaSA_training) and [model card](https://huggingface.co/HKUSTAudio/Llasa-3B).

In this tutorial, we show how to run the Llasa pipeline using OpenVINO.

#### Table of contents:

- [Prerequisites](#Prerequisites)
- [Select model](#Select-model)
- [Convert model](#Convert-model)
- [Run model inference](#Run-model-inference)
    - [Select inference device](#Select-inference-device)
- [Interactive demo](#Interactive-demo)


### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).



## Prerequisites
[back to top ⬆️](#Table-of-contents:)

In [1]:
import requests
from pathlib import Path

utility_files = ["cmd_helper.py", "notebook_utils.py", "pip_helper.py"]
base_utility_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/"

for utility_file in utility_files:
    if not Path(utility_file).exists():
        r = requests.get(base_utility_url + utility_file)
        with Path(utility_file).open("w") as f:
            f.write(r.text)


helper_files = ["gradio_helper.py"]
# The trailing slash is required because the file name is appended directly
base_helper_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/llasa-speech-synthesis/"

for helper_file in helper_files:
    if not Path(helper_file).exists():
        r = requests.get(base_helper_url + helper_file)
        with Path(helper_file).open("w") as f:
            f.write(r.text)

In [None]:
import platform
from pip_helper import pip_install

pip_install(
    "-q",
    "torch>=2.1",
    "torchaudio",
    "transformers>=4.46.1",
    "einops",
    "torchao",
    "torchtune>=0.3.1",
    "vector-quantize-pytorch>=1.17.8",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
)

pip_install("-q", "--no-deps", "xcodec2")
pip_install(
    "-q",
    "gradio>=4.19",
    "openvino>=2025.0.0",
    "tqdm",
    "librosa",
    "soundfile",
    "nncf>=2.15",
)
pip_install("-q", "git+https://github.com/huggingface/optimum-intel.git", "--extra-index-url", "https://download.pytorch.org/whl/cpu")

if platform.system() == "Darwin":
    pip_install("-q", "numpy<2.0.0")

## Select model
[back to top ⬆️](#Table-of-contents:)

In [3]:
import ipywidgets as widgets

# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry

collect_telemetry("llasa-speech-synthesis.ipynb")

model_ids = ["HKUSTAudio/Llasa-1B", "HKUSTAudio/Llasa-1B-Multilingual", "HKUSTAudio/Llasa-3B", "HKUSTAudio/Llasa-8B"]

model_selector = widgets.Dropdown(options=model_ids, value=model_ids[0], description="Model:")

model_selector

Dropdown(description='Model:', options=('HKUSTAudio/Llasa-1B', 'HKUSTAudio/Llasa-1B-Multilingual', 'HKUSTAudio…

## Convert model
[back to top ⬆️](#Table-of-contents:)


OpenVINO supports PyTorch models via conversion to the OpenVINO Intermediate Representation (IR) format. For convenience, we will use the OpenVINO integration with Hugging Face Optimum. 🤗 [Optimum Intel](https://huggingface.co/docs/optimum/intel/index) is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.

Among other use cases, Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO IR format, and run inference using OpenVINO Runtime. `optimum-cli` provides a command-line interface for model conversion and optimization.

General command format:

```bash
optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>
```

where `<task>` is the task to export the model for; if it is not specified, the task is inferred automatically from the model. You can find a mapping between tasks and model classes in the Optimum TaskManager [documentation](https://huggingface.co/docs/optimum/exporters/task_manager). Additionally, you can specify weight compression with the `--weight-format` argument, which accepts one of the following options: `fp32`, `fp16`, `int8`, or `int4`. For `int8` and `int4`, [NNCF](https://github.com/openvinotoolkit/nncf) is used for weight compression. More details about model export are provided in the [Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/openvino/export#export-your-model).

Because Llasa uses a pure language modeling approach, the conversion process is the same as for LLaMA-family models used for text generation.
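
As a small illustration of the command format above, the following sketch assembles the `optimum-cli` argument list in Python. The helper name `build_export_command` is hypothetical (the notebook itself uses the `optimum_cli` helper from `cmd_helper.py`); only the flags described above are used.

```python
import shlex

ALLOWED_WEIGHT_FORMATS = {"fp32", "fp16", "int8", "int4"}


def build_export_command(model_id, output_dir, task=None, weight_format=None):
    """Build the argument list for `optimum-cli export openvino`."""
    cmd = ["optimum-cli", "export", "openvino", "--model", model_id]
    if task is not None:
        cmd += ["--task", task]
    if weight_format is not None:
        if weight_format not in ALLOWED_WEIGHT_FORMATS:
            raise ValueError(f"unsupported weight format: {weight_format}")
        cmd += ["--weight-format", weight_format]
    cmd.append(str(output_dir))  # the output directory is the positional argument
    return cmd


cmd = build_export_command(
    "HKUSTAudio/Llasa-1B",
    "Llasa-1B/INT4",
    task="text-generation-with-past",
    weight_format="int4",
)
print(shlex.join(cmd))
# To actually run the export: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when a model path contains spaces.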

In [4]:
model_id = model_selector.value
base_model_path = Path(model_id.split("/")[-1])
print(f"Selected {model_id}")

Selected HKUSTAudio/Llasa-1B


In [5]:
to_compress = widgets.Checkbox(value=True, description="Compress weights")
to_compress

Checkbox(value=True, description='Compress weights')

In [7]:
from cmd_helper import optimum_cli

model_path = base_model_path / ("INT4" if to_compress.value else "FP16")
additional_args = (
    {"task": "text-generation-with-past", "weight-format": "int4"} if to_compress.value else {"task": "text-generation-with-past", "weight-format": "fp16"}
)

if not model_path.exists():
    optimum_cli(model_id, model_path, additional_args=additional_args)

**Export command:**

`optimum-cli export openvino --model HKUSTAudio/Llasa-1B Llasa-1B/INT4 --task text-generation-with-past --weight-format int4`

## Run model inference
[back to top ⬆️](#Table-of-contents:)


The OpenVINO integration with Optimum Intel provides a ready-to-use API for model inference that integrates smoothly with Transformers-based solutions. To load the model, we will use the `OVModelForCausalLM` class, which has an interface compatible with the Transformers LLaMA implementation, and its `from_pretrained` method. The method accepts a path to the model directory or a model ID from the Hugging Face Hub (if the model has not been converted to OpenVINO format, conversion is triggered automatically). Additionally, we can provide an inference device, a quantization config (if the model has not been quantized yet), and a device-specific OpenVINO Runtime configuration. More details about model inference with Optimum Intel can be found in the [documentation](https://huggingface.co/docs/optimum/intel/openvino/inference). We will use `OVModelForCausalLM` as a replacement for the original `AutoModelForCausalLM`. It remains compatible with the original model's codec and tokenizer.
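
The sketch below shows one way to pass a device and a runtime configuration to `from_pretrained`. The helper names `build_load_kwargs` and `load_ov_model`, the `model_cache` directory, and the chosen property values are assumptions made for illustration; `PERFORMANCE_HINT` and `CACHE_DIR` are OpenVINO Runtime properties, and whether they help depends on your device.

```python
from pathlib import Path


def build_load_kwargs(device="CPU", cache_dir=None):
    """Collect keyword arguments for OVModelForCausalLM.from_pretrained."""
    # Runtime properties passed through to OpenVINO; values here are examples only
    ov_config = {"PERFORMANCE_HINT": "LATENCY"}
    if cache_dir is not None:
        # Caching compiled models can shorten later load times
        ov_config["CACHE_DIR"] = str(Path(cache_dir))
    return {"device": device, "ov_config": ov_config}


def load_ov_model(model_dir, device="CPU", cache_dir=None):
    """Load a converted model; optimum-intel is imported only when called."""
    from optimum.intel.openvino import OVModelForCausalLM

    return OVModelForCausalLM.from_pretrained(model_dir, **build_load_kwargs(device, cache_dir))


print(build_load_kwargs("CPU", cache_dir="model_cache"))
```

In this notebook, the equivalent call would be `load_ov_model(model_path, device.value)` once the model has been converted.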

In [None]:
from transformers import AutoTokenizer
from xcodec2.modeling_xcodec2 import XCodec2Model

xcodec_path = "HKUST-Audio/xcodec2"

codec_model = XCodec2Model.from_pretrained(xcodec_path)

tokenizer = AutoTokenizer.from_pretrained(model_path)

### Select inference device
[back to top ⬆️](#Table-of-contents:)

In [9]:
from notebook_utils import device_widget

device = device_widget(exclude=["NPU"])

device



Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

In [10]:
from optimum.intel.openvino import OVModelForCausalLM

ov_model = OVModelForCausalLM.from_pretrained(model_path, device=device.value)

In [None]:
import torch

input_text = "Hello, I'm working!"


def ids_to_speech_tokens(speech_ids):
    """Convert integer codec ids to token strings such as <|s_23456|>."""
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]


def extract_speech_ids(speech_tokens_str):
    """Parse token strings such as <|s_23456|> back to integer codec ids."""
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith("<|s_") and token_str.endswith("|>"):
            # Strip the "<|s_" prefix and the "|>" suffix to get the number
            speech_ids.append(int(token_str[4:-2]))
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids


formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

# Tokenize the text
chat = [{"role": "user", "content": "Convert the text to speech:" + formatted_text}, {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}]

input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt", continue_final_message=True)
speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

outputs = ov_model.generate(
    input_ids,
    max_length=2048,
    eos_token_id=speech_end_id,
    do_sample=True,
    top_p=1,  # Adjusts the diversity of generated content
    temperature=0.8,  # Controls randomness in output
)
# Extract the speech tokens
generated_ids = outputs[0][input_ids.shape[1] : -1]

speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

# Convert token <|s_23456|> to int 23456
speech_tokens = extract_speech_ids(speech_tokens)

speech_tokens = torch.tensor(speech_tokens).unsqueeze(0).unsqueeze(0)

# Decode the speech tokens to speech waveform
gen_wav = codec_model.decode_code(speech_tokens)
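
The `generate` call above samples with `temperature=0.8` and `top_p=1`. As a rough illustration of what these two parameters do, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) sampling. It is a textbook version of the technique, not the implementation used inside `generate`; the function name `sample_top_p` and the toy logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=1.0, rng=random):
    """Draw one token index using temperature scaling and nucleus sampling."""
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the smallest set of most likely tokens whose total mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [0.0, 5.0, 1.0]
print(sample_top_p(toy_logits, temperature=0.8, top_p=1.0, rng=random.Random(0)))
```

With `top_p=1`, as in this notebook, no tokens are cut off, so only the temperature reshapes the distribution.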

In [12]:
import IPython.display as ipd


def play(data, rate=None):
    ipd.display(ipd.Audio(data, rate=rate))


play(gen_wav[0, 0, :].cpu().numpy(), 16000)
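
To keep the result outside the notebook, the waveform can also be written to a WAV file. The following sketch uses only the Python standard library and a synthetic tone so that it runs on its own. The helper name `save_wav` and the output file name are hypothetical, and the code assumes mono float samples in the range [-1, 1] at the 16 kHz rate used for playback above.

```python
import math
import struct
import wave


def save_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# One second of a 440 Hz tone as stand-in data
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
save_wav("example_tone.wav", tone)
```

For the generated speech, a call such as `save_wav("llasa_output.wav", gen_wav[0, 0, :].cpu().numpy())` should work under the same assumptions. The `soundfile` package installed in the prerequisites is an alternative for other formats.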

## Interactive demo
[back to top ⬆️](#Table-of-contents:)

In [None]:
from gradio_helper import make_demo

demo = make_demo(ov_model, tokenizer, codec_model)

try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/