# VibeVoice Gradio Inference - Google Colab

This is a standalone notebook for running VibeVoice TTS with a Gradio interface.

**⚠️ IMPORTANT: GPU Required**
Before running this notebook, make sure to enable GPU:
1. Go to `Runtime` → `Change runtime type`
2. Select `Hardware accelerator`: **GPU**
3. Select `GPU type`: **T4** (recommended)
4. Click `Save`
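
To confirm that the runtime change took effect before installing anything, a quick check like the one below can help. It is a minimal sketch, not part of the original notebook: it assumes the `nvidia-smi` tool is on the PATH of GPU runtimes, and the helper name `gpu_visible` is made up for illustration.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    # Assumption: GPU runtimes ship the nvidia-smi command-line tool.
    if shutil.which("nvidia-smi") is None:
        return False
    # "nvidia-smi -L" prints one line per detected GPU.
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, repeat the runtime steps above before continuing.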

**Model**: FabioSarracino/VibeVoice-Large-Q8 (8-bit quantized)

**Features**:
- Single speaker and multi-speaker support
- Voice cloning with audio upload
- Interactive Gradio interface
- Pause tags: `[pause]` (1 second by default) or `[pause:1500]` (duration in milliseconds)
- Automatic chunking for long texts (single-speaker mode)

---

## Examples

### Single Speaker Example
Upload a voice for Speaker 1 only, or leave it empty to use a synthetic placeholder voice, then enter:
```
Hello, this is a test. [pause:500] Let me continue speaking. [pause] And here's the final part.
```

### Multi Speaker Example
Upload different voices for Speaker 1 and Speaker 2, then use:
```
[1]: Hello, how are you today?
[2]: I'm doing great, thanks for asking!
[1]: That's wonderful to hear. What are your plans for the weekend?
[2]: I'm planning to relax and maybe watch some movies.
```

### Notes
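- Upload a voice file for each speaker you want to clone; speakers without an upload fall back to a synthetic placeholder voice
- If at most one voice is uploaded, Single Speaker mode is used, and only the Speaker 1 upload is applied
- If two or more speakers have voices uploaded, Multi Speaker mode is used
- Use the `[1]:`, `[2]:`, `[3]:`, `[4]:` format for multi-speaker text
- Use `[pause]` (1000 ms by default) or `[pause:milliseconds]` for pauses; in this notebook's code these tags are parsed only in Single Speaker mode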
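
The pause syntax can be illustrated with a condensed, standalone version of the notebook's `parse_pause_keywords` helper (defined in section 4). This is a sketch for experimenting outside the notebook; the names `split_on_pauses`, `PAUSE_TAG`, and `DEFAULT_PAUSE_MS` are illustrative and do not appear in the original code.

```python
import re

# Matches "[pause]" and "[pause:NNN]"; group 1 holds the optional duration.
PAUSE_TAG = re.compile(r"\[pause(?::(\d+))?\]")
DEFAULT_PAUSE_MS = 1000  # same default as the notebook helper


def split_on_pauses(text):
    """Return ('text', str) and ('pause', int_ms) tuples in reading order."""
    segments, last_end = [], 0
    for match in PAUSE_TAG.finditer(text):
        spoken = text[last_end:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        ms = int(match.group(1)) if match.group(1) else DEFAULT_PAUSE_MS
        segments.append(("pause", ms))
        last_end = match.end()
    tail = text[last_end:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = ("Hello, this is a test. [pause:500] Let me continue speaking. "
           "[pause] And here's the final part.")
for kind, value in split_on_pauses(example):
    print(kind, repr(value))
```

For the single-speaker example above, this yields three text segments separated by a 500 ms pause and a 1000 ms pause.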

## 1. Setup and Install Dependencies

In [1]:
# Install dependencies (version specifiers are quoted so the shell does not treat ">" as a redirect)
!pip install -q "accelerate>=1.6.0" "transformers>=4.51.3" diffusers tqdm scipy ml-collections
!pip install -q "torch>=2.0.0" "torchaudio>=2.0.0" "numpy>=1.20.0" "librosa>=0.9.0" "soundfile>=0.12.0"
!pip install -q "peft>=0.17.0" "huggingface_hub>=0.25.1" bitsandbytes
!pip install -q gradio

print("Dependencies installed successfully!")

Dependencies installed successfully!
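
Because `pip install -q` hides most output, it can be useful to check which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `PACKAGES` list are illustrative choices, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the packages installed above; extend as needed.
PACKAGES = ["torch", "transformers", "accelerate", "bitsandbytes", "peft", "gradio"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```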


## 2. Clone Repository and Setup Environment

In [2]:
import os
import sys

# Clone the repository from GitHub if it does not exist yet
if not os.path.exists('/content/VibeVoice-Large-Q8-Colab'):
    !git clone https://github.com/nurimator/VibeVoice-Large-Q8-Colab.git /content/VibeVoice-Large-Q8-Colab
    print("Repository cloned from GitHub!")
else:
    print("Repository already exists!")

# Add vvembed to path
vvembed_path = '/content/VibeVoice-Large-Q8-Colab/vvembed'
if vvembed_path not in sys.path:
    sys.path.insert(0, vvembed_path)
    print(f"Added {vvembed_path} to Python path")

Cloning into '/content/VibeVoice-Large-Q8-Colab'...
remote: Enumerating objects: 518, done.
remote: Counting objects: 100% (175/175), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 518 (delta 149), reused 121 (delta 119), pack-reused 343 (from 2)
Receiving objects: 100% (518/518), 312.60 KiB | 1.75 MiB/s, done.
Resolving deltas: 100% (330/330), done.
Repository cloned from GitHub!
Added /content/VibeVoice-Large-Q8-Colab/vvembed to Python path


## 3. Import Libraries

In [3]:
import torch
import numpy as np
import re
import librosa
import soundfile as sf
from typing import List, Tuple, Optional
import gradio as gr

# Import VibeVoice components
from modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from processor.vibevoice_processor import VibeVoiceProcessor
from transformers import BitsAndBytesConfig

print("All libraries imported successfully!")



All libraries imported successfully!


## 4. Define Helper Functions

In [4]:
# @title
def create_synthetic_voice_sample(speaker_idx: int, sample_rate: int = 24000) -> np.ndarray:
    """Create synthetic voice sample for a specific speaker"""
    duration = 1.0
    samples = int(sample_rate * duration)
    t = np.linspace(0, duration, samples, False)

    # Create realistic voice-like characteristics for each speaker
    base_frequencies = [120, 180, 140, 200]  # Mix of male/female-like frequencies
    base_freq = base_frequencies[speaker_idx % len(base_frequencies)]

    # Create vowel-like formants (like "ah" sound) - unique per speaker
    formant1 = 800 + speaker_idx * 100  # First formant
    formant2 = 1200 + speaker_idx * 150  # Second formant

    # Generate more voice-like waveform
    voice_sample = (
        # Fundamental with harmonics (voice-like)
        0.6 * np.sin(2 * np.pi * base_freq * t) +
        0.25 * np.sin(2 * np.pi * base_freq * 2 * t) +
        0.15 * np.sin(2 * np.pi * base_freq * 3 * t) +

        # Formant resonances (vowel-like characteristics)
        0.1 * np.sin(2 * np.pi * formant1 * t) * np.exp(-t * 2) +
        0.05 * np.sin(2 * np.pi * formant2 * t) * np.exp(-t * 3) +

        # Natural breath noise (reduced)
        0.02 * np.random.normal(0, 1, len(t))
    )

    # Add natural envelope (like human speech pattern)
    vibrato_freq = 4 + speaker_idx * 0.3  # Slightly different vibrato per speaker
    envelope = (np.exp(-t * 0.3) * (1 + 0.1 * np.sin(2 * np.pi * vibrato_freq * t)))
    voice_sample *= envelope * 0.08  # Lower volume

    return voice_sample.astype(np.float32)


def load_audio_file(file_path: str, target_sr: int = 24000) -> np.ndarray:
    """Load audio file and resample to target sample rate"""
    audio, sr = librosa.load(file_path, sr=None)

    # Convert to mono if stereo
    if audio.ndim > 1:
        audio = librosa.to_mono(audio)

    # Resample if needed
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)

    # Scale down only if the peak exceeds 1.0; quieter audio keeps its original level
    audio_max = np.abs(audio).max()
    if audio_max > 0:
        audio = audio / max(audio_max, 1.0)

    return audio.astype(np.float32)


def parse_pause_keywords(text: str) -> List[Tuple[str, object]]:
    """Parse [pause] and [pause:ms] keywords into ('text', str) and ('pause', int ms) segments"""
    segments = []
    pattern = r'\[pause(?::(\d+))?\]'

    last_end = 0
    for match in re.finditer(pattern, text):
        # Add text segment before pause (if any)
        if match.start() > last_end:
            text_segment = text[last_end:match.start()].strip()
            if text_segment:
                segments.append(('text', text_segment))

        # Add pause segment with duration (default 1000ms = 1 second)
        duration_ms = int(match.group(1)) if match.group(1) else 1000
        segments.append(('pause', duration_ms))
        last_end = match.end()

    # Add remaining text after last pause (if any)
    if last_end < len(text):
        remaining_text = text[last_end:].strip()
        if remaining_text:
            segments.append(('text', remaining_text))

    # If no pauses found, return original text as single segment
    if not segments:
        segments.append(('text', text))

    return segments


def generate_silence(duration_ms: int, sample_rate: int = 24000) -> np.ndarray:
    """Generate silence for specified duration"""
    num_samples = int(sample_rate * duration_ms / 1000.0)
    return np.zeros(num_samples, dtype=np.float32)


def split_text_into_chunks(text: str, max_words: int = 250) -> List[str]:
    """Split long text into manageable chunks at sentence boundaries"""
    sentence_pattern = r'(?<=[.!?])\s+(?=[A-Z])'
    sentences = re.split(sentence_pattern, text)

    if len(sentences) == 1 and len(text.split()) > max_words:
        sentences = text.replace('. ', '.|').split('|')
        sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    current_chunk = []
    current_word_count = 0

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        sentence_words = sentence.split()
        sentence_word_count = len(sentence_words)

        if sentence_word_count > max_words:
            sub_parts = re.split(r'[,;]', sentence)
            for part in sub_parts:
                part = part.strip()
                if not part:
                    continue
                part_words = part.split()
                part_word_count = len(part_words)

                if current_word_count + part_word_count > max_words and current_chunk:
                    chunks.append(' '.join(current_chunk))
                    current_chunk = [part]
                    current_word_count = part_word_count
                else:
                    current_chunk.append(part)
                    current_word_count += part_word_count
        else:
            if current_word_count + sentence_word_count > max_words and current_chunk:
                chunks.append(' '.join(current_chunk))
                current_chunk = [sentence]
                current_word_count = sentence_word_count
            else:
                current_chunk.append(sentence)
                current_word_count += sentence_word_count

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    if not chunks:
        chunks = [text]

    print(f"Split text into {len(chunks)} chunks (max {max_words} words each)")
    return chunks

print("Helper functions defined!")

Helper functions defined!
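
To see how chunking and pause lengths behave on a small input, here is a simplified, standalone sketch. It assumes the same sentence-boundary regex and 24 kHz sample rate as the helpers above, but it omits the comma/semicolon fallback that `split_text_into_chunks` applies to overlong sentences. The names `pack_sentences` and `silence_samples` are illustrative.

```python
import re


def pack_sentences(text, max_words):
    """Greedily group sentences so each chunk stays within max_words when possible."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    sentences = [s.strip() for s in pieces if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def silence_samples(duration_ms, sample_rate=24000):
    """Number of zero samples needed for a pause of duration_ms."""
    return int(sample_rate * duration_ms / 1000.0)


text = "One two three. Four five six seven. Eight nine. Ten eleven twelve thirteen."
print(pack_sentences(text, max_words=6))
print(silence_samples(1500))  # 24000 samples/s * 1.5 s = 36000
```

With a 6-word budget, the second and third sentences share a chunk (4 + 2 words), while the first and last each stand alone.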


## 5. Load Model

In [5]:
# @title
# Configuration
MODEL_NAME = 'Quant-8Bit'
MODEL_PATH = 'FabioSarracino/VibeVoice-Large-Q8'
MODEL_URL = 'https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8'

print(f"Loading model: {MODEL_NAME}")
print(f"Model path: {MODEL_PATH}")
print(f"Model URL: {MODEL_URL}")
print("\nThis may take a few minutes on first run...\n")

# Check CUDA availability
if not torch.cuda.is_available():
    raise RuntimeError("Quantized models require a CUDA GPU. Please enable GPU in Colab Runtime settings.")

print(f"CUDA available: {torch.cuda.get_device_name(0)}")

# Configure 8-bit quantization (BitsAndBytesConfig has no 8-bit compute-dtype option;
# the dtype is set through torch_dtype when loading the model)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# Load model with 8-bit quantization
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

print("Model loaded successfully!")

# Load processor
processor = VibeVoiceProcessor.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

print("Processor loaded successfully!")
print("\nSetup complete! Ready to generate speech.")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading model: Quant-8Bit
Model path: FabioSarracino/VibeVoice-Large-Q8
Model URL: https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8

This may take a few minutes on first run...

CUDA available: Tesla T4


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.68G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Model loaded successfully!


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'. 
The class this function is called from is 'VibeVoiceTextTokenizerFast'.


Processor loaded successfully!

Setup complete! Ready to generate speech.
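
As a rough sanity check on why a T4 is suggested, the shard sizes from the download output can be summed. This is a back-of-the-envelope sketch: the on-disk size is only a rough indication of GPU memory use, and `ASSUMED_GPU_GB` is an assumed figure for usable T4 memory, not a value reported by this notebook.

```python
# Shard sizes in GB, copied from the download progress output above.
SHARD_SIZES_GB = [4.99, 4.96, 1.68]

# Assumption: roughly 15 GB of usable memory on a Colab T4.
ASSUMED_GPU_GB = 15.0


def total_checkpoint_gb(sizes):
    """Sum the shard sizes, rounded to two decimals."""
    return round(sum(sizes), 2)


total = total_checkpoint_gb(SHARD_SIZES_GB)
headroom = round(ASSUMED_GPU_GB - total, 2)
print(f"Checkpoint size: {total} GB")
print(f"Approximate headroom under the assumed limit: {headroom} GB")
```

The remaining headroom has to cover activations and the voice samples, which is one reason long inputs are chunked.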


## 6. Generate Speech Functions

In [6]:
# @title
def generate_audio_segment(
    formatted_text: str,
    voice_samples: List[np.ndarray],
    cfg_scale: float,
    seed: int,
    use_sampling: bool,
    temperature: float,
    top_p: float
) -> np.ndarray:
    """Generate audio for a single text segment"""

    # Prepare inputs
    inputs = processor(
        [formatted_text],
        voice_samples=[voice_samples],
        return_tensors="pt",
        return_attention_mask=True
    )

    # Move to device
    device = next(model.parameters()).device
    inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

    # Generate
    with torch.no_grad():
        if use_sampling:
            output = model.generate(
                **inputs,
                tokenizer=processor.tokenizer,
                cfg_scale=cfg_scale,
                max_new_tokens=None,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
            )
        else:
            output = model.generate(
                **inputs,
                tokenizer=processor.tokenizer,
                cfg_scale=cfg_scale,
                max_new_tokens=None,
                do_sample=False,
            )

    # Extract audio
    if hasattr(output, 'speech_outputs') and output.speech_outputs:
        speech_tensors = output.speech_outputs

        if isinstance(speech_tensors, list) and len(speech_tensors) > 0:
            audio_tensor = torch.cat(speech_tensors, dim=-1)
        else:
            audio_tensor = speech_tensors

        # Convert to numpy
        audio_np = audio_tensor.cpu().float().numpy()

        # Flatten if needed
        if audio_np.ndim > 1:
            audio_np = audio_np.flatten()

        return audio_np
    else:
        raise RuntimeError("VibeVoice failed to generate audio")


def generate_speech(
    text: str,
    speaker_1_audio,
    speaker_2_audio,
    speaker_3_audio,
    speaker_4_audio,
    diffusion_steps: int,
    cfg_scale: float,
    seed: int,
    use_sampling: bool,
    temperature: float,
    top_p: float,
    max_words_per_chunk: int,
    progress=gr.Progress()
):
    """Main function to generate speech from text"""

    try:
        progress(0, desc="Preparing...")

        # Set seeds for reproducibility
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        np.random.seed(seed)

        # Set diffusion steps
        model.set_ddpm_inference_steps(diffusion_steps)

        # Load uploaded voice samples
        uploaded_voices = {}
        if speaker_1_audio is not None:
            uploaded_voices[1] = load_audio_file(speaker_1_audio)
        if speaker_2_audio is not None:
            uploaded_voices[2] = load_audio_file(speaker_2_audio)
        if speaker_3_audio is not None:
            uploaded_voices[3] = load_audio_file(speaker_3_audio)
        if speaker_4_audio is not None:
            uploaded_voices[4] = load_audio_file(speaker_4_audio)

        # Determine mode
        active_speakers = list(uploaded_voices.keys())
        if len(active_speakers) <= 1:
            mode = "Single Speaker"
        else:
            mode = "Multi Speaker"

        progress(0.1, desc=f"Mode: {mode}")

        # Parse text for pause keywords
        segments = parse_pause_keywords(text)
        all_audio_segments = []
        sample_rate = 24000

        if mode == 'Single Speaker':
            voice_samples = None
            total_segments = len([s for s in segments if s[0] == 'text'])
            current_segment = 0

            for seg_type, seg_content in segments:
                if seg_type == 'pause':
                    silence = generate_silence(seg_content, sample_rate)
                    all_audio_segments.append(silence)
                else:
                    word_count = len(seg_content.split())

                    if word_count > max_words_per_chunk:
                        chunks = split_text_into_chunks(seg_content, max_words_per_chunk)

                        for chunk_idx, chunk in enumerate(chunks):
                            progress(0.1 + (0.8 * (current_segment + chunk_idx/len(chunks)) / total_segments),
                                   desc=f"Generating chunk {chunk_idx+1}/{len(chunks)}")

                            formatted_text = f"Speaker 1: {chunk}"

                            if voice_samples is None:
                                if 1 in uploaded_voices:
                                    voice_samples = [uploaded_voices[1]]
                                else:
                                    voice_samples = [create_synthetic_voice_sample(0)]

                            chunk_audio = generate_audio_segment(
                                formatted_text, voice_samples, cfg_scale,
                                seed, use_sampling, temperature, top_p
                            )
                            all_audio_segments.append(chunk_audio)
                    else:
                        progress(0.1 + (0.8 * current_segment / total_segments),
                               desc=f"Generating speech ({word_count} words)")

                        formatted_text = f"Speaker 1: {seg_content}"

                        if voice_samples is None:
                            if 1 in uploaded_voices:
                                voice_samples = [uploaded_voices[1]]
                            else:
                                voice_samples = [create_synthetic_voice_sample(0)]

                        segment_audio = generate_audio_segment(
                            formatted_text, voice_samples, cfg_scale,
                            seed, use_sampling, temperature, top_p
                        )
                        all_audio_segments.append(segment_audio)

                    current_segment += 1
        else:
            # Multi speaker mode
            progress(0.2, desc="Processing multi-speaker text")

            bracket_pattern = r'\[(\d+)\]\s*:'
            speaker_numbers = sorted(list(set([int(m) for m in re.findall(bracket_pattern, text)])))

            if not speaker_numbers:
                speaker_numbers = [1]

            # Prepare voice samples
            voice_samples = []
            for speaker_num in speaker_numbers:
                if speaker_num in uploaded_voices:
                    voice_samples.append(uploaded_voices[speaker_num])
                else:
                    voice_samples.append(create_synthetic_voice_sample(speaker_num - 1))

            # Convert [N]: format to Speaker (N-1): format
            converted_text = text
            for speaker_num in sorted(speaker_numbers, reverse=True):
                pattern = f'\\[{speaker_num}\\]\\s*:'
                replacement = f'Speaker {speaker_num - 1}:'
                converted_text = re.sub(pattern, replacement, converted_text)

            converted_text = converted_text.replace('\n', ' ').replace('\r', ' ')
            converted_text = ' '.join(converted_text.split())

            progress(0.5, desc=f"Generating multi-speaker speech ({len(speaker_numbers)} speakers)")

            audio = generate_audio_segment(
                converted_text, voice_samples, cfg_scale,
                seed, use_sampling, temperature, top_p
            )
            all_audio_segments.append(audio)

        # Concatenate all audio segments
        progress(0.9, desc="Finalizing audio")

        if all_audio_segments:
            final_audio = np.concatenate(all_audio_segments)

            # Save to file
            output_path = "/tmp/vibevoice_output.wav"
            sf.write(output_path, final_audio, sample_rate)

            progress(1.0, desc="Complete!")

            duration = len(final_audio) / sample_rate
            info = f"Mode: {mode}\nDuration: {duration:.2f}s\nSample Rate: {sample_rate} Hz"

            return output_path, info
        else:
            return None, "Error: No audio segments generated"

    except Exception as e:
        import traceback
        error_msg = f"Error: {str(e)}\n\n{traceback.format_exc()}"
        return None, error_msg

print("Generation functions defined!")

Generation functions defined!
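
The multi-speaker branch of `generate_speech` rewrites `[N]:` tags as zero-based `Speaker N-1:` labels and flattens the script to a single line. The sketch below reproduces that transformation in one regex pass so it can be tried in isolation; it is an equivalent alternative to the per-speaker loop above, and the names `SPEAKER_TAG`, `speakers_in`, and `to_model_format` are illustrative.

```python
import re

# Same pattern as bracket_pattern in generate_speech.
SPEAKER_TAG = re.compile(r"\[(\d+)\]\s*:")


def speakers_in(script):
    """Sorted list of distinct speaker numbers used in the script."""
    return sorted({int(n) for n in SPEAKER_TAG.findall(script)})


def to_model_format(script):
    """Rewrite '[N]:' tags as 'Speaker N-1:' and collapse the script to one line."""
    converted = SPEAKER_TAG.sub(
        lambda m: f"Speaker {int(m.group(1)) - 1}:", script
    )
    return " ".join(converted.split())


script = "[1]: Hello, how are you today?\n[2]: I'm doing great, thanks for asking!"
print(speakers_in(script))
print(to_model_format(script))
```

For the two-line script above, this prints `[1, 2]` followed by `Speaker 0: Hello, how are you today? Speaker 1: I'm doing great, thanks for asking!`.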


## 7. Launch Gradio Interface

In [None]:
# @title
# Create Gradio interface
with gr.Blocks(title="VibeVoice TTS") as demo:
    gr.Markdown("""
    # VibeVoice Text-to-Speech

    Upload voice samples for up to 4 speakers and generate speech with voice cloning.

    **For Single Speaker**: Upload a voice for Speaker 1 only and enter plain text.

    **For Multi Speaker**: Upload voices for multiple speakers and use format: `[1]: text` `[2]: text`
    """)

    # Voice Samples - 4 columns in one row
    gr.Markdown("### Voice Samples")
    with gr.Row():
        speaker_1_audio = gr.Audio(label="Speaker 1", type="filepath", scale=1)
        speaker_2_audio = gr.Audio(label="Speaker 2", type="filepath", scale=1)
        speaker_3_audio = gr.Audio(label="Speaker 3", type="filepath", scale=1)
        speaker_4_audio = gr.Audio(label="Speaker 4", type="filepath", scale=1)

    # Generation Settings - 2 columns
    with gr.Row():
        # Left column - All sliders
        with gr.Column(scale=1):
            gr.Markdown("### Generation Parameters")
            diffusion_steps = gr.Slider(5, 100, value=20, step=1, label="Diffusion Steps")
            cfg_scale = gr.Slider(0.5, 3.5, value=1.3, step=0.05, label="CFG Scale")
            seed = gr.Number(value=42, label="Seed", precision=0)

            gr.Markdown("### Sampling Settings")
            use_sampling = gr.Checkbox(label="Use Sampling", value=False)
            temperature = gr.Slider(0.1, 2.0, value=0.95, step=0.05, label="Temperature")
            top_p = gr.Slider(0.1, 1.0, value=0.95, step=0.05, label="Top P")

            max_words_per_chunk = gr.Slider(100, 500, value=250, step=50, label="Max Words per Chunk")

        # Right column - Text input, button, and output
        with gr.Column(scale=1):
            gr.Markdown("### Text Input")
            text_input = gr.Textbox(
                label="Text to synthesize",
                placeholder="Enter text here...\n\nFor multi-speaker use:\n[1]: Hello\n[2]: Hi there!",
                lines=8,
                value="Hello, this is a test of the VibeVoice text-to-speech system."
            )

            generate_btn = gr.Button("Generate Speech", variant="primary", size="lg")

            gr.Markdown("### Output")
            audio_output = gr.Audio(label="Generated Speech", type="filepath")
            info_output = gr.Textbox(label="Generation Info", lines=3)

    # Examples
    gr.Markdown("""
    ### Examples

    **Single Speaker with pause:**
    ```
    Hello, this is a test. [pause:500] Let me continue speaking. [pause] And here's the final part.
    ```

    **Multi Speaker:**
    ```
    [1]: Hello, how are you today?
    [2]: I'm doing great, thanks for asking!
    [1]: That's wonderful to hear.
    ```
    """)

    # Connect the generate button
    generate_btn.click(
        fn=generate_speech,
        inputs=[
            text_input,
            speaker_1_audio,
            speaker_2_audio,
            speaker_3_audio,
            speaker_4_audio,
            diffusion_steps,
            cfg_scale,
            seed,
            use_sampling,
            temperature,
            top_p,
            max_words_per_chunk
        ],
        outputs=[audio_output, info_output]
    )

# Launch the interface
demo.launch(share=True, debug=True)