# VibeVoice Standalone Inference - Google Colab

This standalone notebook runs VibeVoice TTS through an ipywidgets interface.

**⚠️ IMPORTANT: GPU Required**
Before running this notebook, make sure to enable GPU:
1. Go to `Runtime` → `Change runtime type`
2. Select `Hardware accelerator`: **GPU**
3. Select `GPU type`: **T4** (recommended)
4. Click `Save`

**Model**: FabioSarracino/VibeVoice-Large-Q8 (8-bit quantized)

**Features**:
- Single speaker and multi-speaker support
- Voice cloning with audio upload
- Interactive ipywidgets interface
- Support for pause tags: `[pause]` or `[pause:1500]`
- Automatic chunking for long texts

---

## Examples

### Single Speaker Example
Upload a voice for Speaker 1 only (or upload nothing, in which case a synthetic placeholder voice is used), then enter:
```
Hello, this is a test. [pause:500] Let me continue speaking. [pause] And here's the final part.
```

### Multi Speaker Example
Upload different voices for Speaker 1 and Speaker 2, then use:
```
[1]: Hello, how are you today?
[2]: I'm doing great, thanks for asking!
[1]: That's wonderful to hear. [pause:1000] What are your plans for the weekend?
[2]: I'm planning to relax and maybe watch some movies.
```

### Notes
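- Upload a voice file for each speaker you want to use
- If at most one speaker has an uploaded voice, the notebook uses Single Speaker mode
- If two or more speakers have uploaded voices, the notebook uses Multi Speaker mode
- Use the `[1]:`, `[2]:`, `[3]:`, `[4]:` format to mark speakers in multi-speaker text
- Use `[pause]` (1000 ms by default) or `[pause:milliseconds]` to insert pauses
- In Multi Speaker mode the text is sent to the model in a single call, and pause tags are not converted to silence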
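
The mode rule and the speaker-tag format described in these notes can be sketched in a few lines. This is only an illustration: `choose_mode` and `speakers_in_script` are hypothetical helper names, not functions defined in the notebook, and the dictionary mirrors the shape of the upload table used later.

```python
import re
from typing import Dict, List, Optional

# Matches speaker tags such as [1]:, [2]:, [3]:, [4]:
SPEAKER_TAG = re.compile(r"\[(\d+)\]\s*:")


def choose_mode(uploaded: Dict[int, Optional[str]]) -> str:
    """Return the mode implied by how many speakers have an uploaded voice."""
    active = [num for num, path in uploaded.items() if path is not None]
    return "Multi Speaker" if len(active) >= 2 else "Single Speaker"


def speakers_in_script(text: str) -> List[int]:
    """Return the sorted, de-duplicated speaker numbers tagged in the text."""
    return sorted({int(num) for num in SPEAKER_TAG.findall(text)})


script = "[1]: Hello, how are you today?\n[2]: I'm doing great, thanks for asking!"
print(choose_mode({1: "a.wav", 2: "b.wav", 3: None, 4: None}))
print(speakers_in_script(script))
```

With two uploaded voices the first call reports Multi Speaker mode, and the second call lists speakers 1 and 2.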

## 1. Setup and Install Dependencies

In [None]:
# Install dependencies.
# Version specifiers are quoted so the shell does not treat ">" as an output redirect.
!pip install -q "accelerate>=1.6.0" "transformers>=4.51.3" diffusers tqdm scipy ml-collections
!pip install -q "torch>=2.0.0" "torchaudio>=2.0.0" "numpy>=1.20.0" "librosa>=0.9.0" "soundfile>=0.12.0"
!pip install -q "peft>=0.17.0" "huggingface_hub>=0.25.1" bitsandbytes
!pip install -q ipywidgets

print("Dependencies installed successfully!")
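
Because `pip install -q` prints very little, it can help to confirm what was actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` and the choice of packages to check are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


report = installed_versions(["torch", "transformers", "accelerate", "bitsandbytes", "peft"])
for name, ver in report.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` points to an install step that did not complete.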

## 2. Clone Repository and Setup Environment

In [None]:
import os
import sys

# Clone the repository from GitHub if it does not exist yet
if not os.path.exists('/content/VibeVoice-Large-Q8-Colab'):
    !git clone https://github.com/nurimator/VibeVoice-Large-Q8-Colab.git /content/VibeVoice-Large-Q8-Colab
    print("Repository cloned from GitHub!")
else:
    print("Repository already exists!")

# Add vvembed to path
vvembed_path = '/content/VibeVoice-Large-Q8-Colab/vvembed'
if vvembed_path not in sys.path:
    sys.path.insert(0, vvembed_path)
    print(f"Added {vvembed_path} to Python path")

## 3. Import Libraries

In [None]:
import torch
import numpy as np
import re
import librosa
import soundfile as sf
from typing import List, Tuple, Optional
import ipywidgets as widgets
from IPython.display import display, Audio, HTML, clear_output
from google.colab import files

# Import VibeVoice components
from modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from processor.vibevoice_processor import VibeVoiceProcessor
from transformers import BitsAndBytesConfig

print("All libraries imported successfully!")

## 4. Define Helper Functions

In [None]:
def create_synthetic_voice_sample(speaker_idx: int, sample_rate: int = 24000) -> np.ndarray:
    """Create synthetic voice sample for a specific speaker"""
    duration = 1.0
    samples = int(sample_rate * duration)
    t = np.linspace(0, duration, samples, False)

    # Create realistic voice-like characteristics for each speaker
    base_frequencies = [120, 180, 140, 200]  # Mix of male/female-like frequencies
    base_freq = base_frequencies[speaker_idx % len(base_frequencies)]

    # Create vowel-like formants (like "ah" sound) - unique per speaker
    formant1 = 800 + speaker_idx * 100  # First formant
    formant2 = 1200 + speaker_idx * 150  # Second formant

    # Generate more voice-like waveform
    voice_sample = (
        # Fundamental with harmonics (voice-like)
        0.6 * np.sin(2 * np.pi * base_freq * t) +
        0.25 * np.sin(2 * np.pi * base_freq * 2 * t) +
        0.15 * np.sin(2 * np.pi * base_freq * 3 * t) +

        # Formant resonances (vowel-like characteristics)
        0.1 * np.sin(2 * np.pi * formant1 * t) * np.exp(-t * 2) +
        0.05 * np.sin(2 * np.pi * formant2 * t) * np.exp(-t * 3) +

        # Natural breath noise (reduced)
        0.02 * np.random.normal(0, 1, len(t))
    )

    # Add natural envelope (like human speech pattern)
    vibrato_freq = 4 + speaker_idx * 0.3  # Slightly different vibrato per speaker
    envelope = (np.exp(-t * 0.3) * (1 + 0.1 * np.sin(2 * np.pi * vibrato_freq * t)))
    voice_sample *= envelope * 0.08  # Lower volume

    return voice_sample.astype(np.float32)


def load_audio_file(file_path: str, target_sr: int = 24000) -> np.ndarray:
    """Load audio file and resample to target sample rate"""
    audio, sr = librosa.load(file_path, sr=None)

    # Convert to mono if stereo
    if audio.ndim > 1:
        audio = librosa.to_mono(audio)

    # Resample if needed
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)

    # Scale down only when the peak exceeds 1.0 (prevents clipping; quieter audio keeps its level)
    audio_max = np.abs(audio).max()
    if audio_max > 1.0:
        audio = audio / audio_max

    return audio.astype(np.float32)


def parse_pause_keywords(text: str) -> List[Tuple[str, object]]:
    """Parse [pause] and [pause:ms] keywords from text"""
    segments = []
    pattern = r'\[pause(?::(\d+))?\]'

    last_end = 0
    for match in re.finditer(pattern, text):
        # Add text segment before pause (if any)
        if match.start() > last_end:
            text_segment = text[last_end:match.start()].strip()
            if text_segment:
                segments.append(('text', text_segment))

        # Add pause segment with duration (default 1000ms = 1 second)
        duration_ms = int(match.group(1)) if match.group(1) else 1000
        segments.append(('pause', duration_ms))
        last_end = match.end()

    # Add remaining text after last pause (if any)
    if last_end < len(text):
        remaining_text = text[last_end:].strip()
        if remaining_text:
            segments.append(('text', remaining_text))

    # If no pauses found, return original text as single segment
    if not segments:
        segments.append(('text', text))

    return segments


def generate_silence(duration_ms: int, sample_rate: int = 24000) -> np.ndarray:
    """Generate silence for specified duration"""
    num_samples = int(sample_rate * duration_ms / 1000.0)
    return np.zeros(num_samples, dtype=np.float32)


def split_text_into_chunks(text: str, max_words: int = 250) -> List[str]:
    """Split long text into manageable chunks at sentence boundaries"""
    sentence_pattern = r'(?<=[.!?])\s+(?=[A-Z])'
    sentences = re.split(sentence_pattern, text)

    if len(sentences) == 1 and len(text.split()) > max_words:
        sentences = text.replace('. ', '.|').split('|')
        sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    current_chunk = []
    current_word_count = 0

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        sentence_words = sentence.split()
        sentence_word_count = len(sentence_words)

        if sentence_word_count > max_words:
            sub_parts = re.split(r'[,;]', sentence)
            for part in sub_parts:
                part = part.strip()
                if not part:
                    continue
                part_words = part.split()
                part_word_count = len(part_words)

                if current_word_count + part_word_count > max_words and current_chunk:
                    chunks.append(' '.join(current_chunk))
                    current_chunk = [part]
                    current_word_count = part_word_count
                else:
                    current_chunk.append(part)
                    current_word_count += part_word_count
        else:
            if current_word_count + sentence_word_count > max_words and current_chunk:
                chunks.append(' '.join(current_chunk))
                current_chunk = [sentence]
                current_word_count = sentence_word_count
            else:
                current_chunk.append(sentence)
                current_word_count += sentence_word_count

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    if not chunks:
        chunks = [text]

    print(f"Split text into {len(chunks)} chunks (max {max_words} words each)")
    return chunks

print("Helper functions defined!")
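
Two details of the helpers above are easy to miss: how a pause length becomes a sample count, and when the sentence splitter actually splits. The sketch below reuses the same arithmetic and the same regular expression in isolation; `silence_samples` is a hypothetical name introduced only for this illustration.

```python
import re

SAMPLE_RATE = 24000
# Same pattern as in split_text_into_chunks: split on whitespace that follows
# ., ! or ? and precedes an uppercase ASCII letter.
SENTENCE_SPLIT = r'(?<=[.!?])\s+(?=[A-Z])'


def silence_samples(duration_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of zero samples needed for a pause of duration_ms milliseconds."""
    return int(sample_rate * duration_ms / 1000.0)


print(silence_samples(500))   # 12000
print(silence_samples(1000))  # 24000

text = "First sentence. second part stays attached! Third sentence?"
print(re.split(SENTENCE_SPLIT, text))
# ['First sentence. second part stays attached!', 'Third sentence?']
```

A sentence that starts with a lowercase letter, a digit, or a quotation mark is therefore kept with the previous sentence, which affects where chunk boundaries can fall.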

## 5. Load Model

In [None]:
# Configuration
MODEL_NAME = 'Quant-8Bit'
MODEL_PATH = 'FabioSarracino/VibeVoice-Large-Q8'
MODEL_URL = 'https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8'

print(f"Loading model: {MODEL_NAME}")
print(f"Model path: {MODEL_PATH}")
print(f"Model URL: {MODEL_URL}")
print("\nThis may take a few minutes on first run...\n")

# Check CUDA availability
if not torch.cuda.is_available():
    raise RuntimeError("Quantized models require a CUDA GPU. Please enable GPU in Colab Runtime settings.")

print(f"CUDA available: {torch.cuda.get_device_name(0)}")

# Configure 8-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype=torch.bfloat16
)

# Load model with 8-bit quantization
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

print("Model loaded successfully!")

# Load processor
processor = VibeVoiceProcessor.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)

print("Processor loaded successfully!")
print("\nSetup complete! Ready to generate speech.")

## 6. Generate Speech Functions

In [None]:
def generate_audio_segment(
    formatted_text: str,
    voice_samples: List[np.ndarray],
    cfg_scale: float,
    seed: int,
    use_sampling: bool,
    temperature: float,
    top_p: float
) -> np.ndarray:
    """Generate audio for a single text segment"""

    # Prepare inputs
    inputs = processor(
        [formatted_text],
        voice_samples=[voice_samples],
        return_tensors="pt",
        return_attention_mask=True
    )

    # Move to device
    device = next(model.parameters()).device
    inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

    # Generate
    with torch.no_grad():
        if use_sampling:
            output = model.generate(
                **inputs,
                tokenizer=processor.tokenizer,
                cfg_scale=cfg_scale,
                max_new_tokens=None,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
            )
        else:
            output = model.generate(
                **inputs,
                tokenizer=processor.tokenizer,
                cfg_scale=cfg_scale,
                max_new_tokens=None,
                do_sample=False,
            )

    # Extract audio
    if hasattr(output, 'speech_outputs') and output.speech_outputs:
        speech_tensors = output.speech_outputs

        if isinstance(speech_tensors, list) and len(speech_tensors) > 0:
            audio_tensor = torch.cat(speech_tensors, dim=-1)
        else:
            audio_tensor = speech_tensors

        # Convert to numpy
        audio_np = audio_tensor.cpu().float().numpy()

        # Flatten if needed
        if audio_np.ndim > 1:
            audio_np = audio_np.flatten()

        return audio_np
    else:
        raise RuntimeError("VibeVoice failed to generate audio")

print("Generation functions defined!")
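
The two `model.generate` calls above differ only in their sampling arguments. One possible way to express that difference is to build the keyword arguments once. This is a sketch, not part of the notebook: `build_generation_kwargs` is a hypothetical helper, and it only assembles a dictionary without calling the model.

```python
from typing import Any, Dict


def build_generation_kwargs(
    cfg_scale: float,
    use_sampling: bool,
    temperature: float,
    top_p: float,
) -> Dict[str, Any]:
    """Assemble the keyword arguments that vary between the two generate branches."""
    kwargs: Dict[str, Any] = {
        "cfg_scale": cfg_scale,
        "max_new_tokens": None,
        "do_sample": use_sampling,
    }
    # temperature and top_p are passed only when sampling is enabled,
    # matching the original if/else branches.
    if use_sampling:
        kwargs["temperature"] = temperature
        kwargs["top_p"] = top_p
    return kwargs


print(build_generation_kwargs(1.3, False, 0.95, 0.95))
print(build_generation_kwargs(1.3, True, 0.95, 0.95))
```

Under this assumption, the dictionary would be unpacked into the existing call alongside `**inputs` and `tokenizer=processor.tokenizer`.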

## 7. Create Interactive Interface with ipywidgets

In [None]:
# Global variables to store uploaded audio files
uploaded_audio_files = {
    1: None,
    2: None,
    3: None,
    4: None
}

# Create widgets
output_area = widgets.Output()

# Text input
text_input = widgets.Textarea(
    value='Hello, this is a test of the VibeVoice text-to-speech system.',
    placeholder='Enter text here...',
    description='Text:',
    disabled=False,
    layout=widgets.Layout(width='95%', height='120px')
)

# Upload buttons for voice samples
upload_btn_1 = widgets.Button(description='Upload Speaker 1 Voice', button_style='info')
upload_btn_2 = widgets.Button(description='Upload Speaker 2 Voice', button_style='info')
upload_btn_3 = widgets.Button(description='Upload Speaker 3 Voice', button_style='info')
upload_btn_4 = widgets.Button(description='Upload Speaker 4 Voice', button_style='info')

# Status labels for uploads
status_label_1 = widgets.Label(value='No file uploaded')
status_label_2 = widgets.Label(value='No file uploaded')
status_label_3 = widgets.Label(value='No file uploaded')
status_label_4 = widgets.Label(value='No file uploaded')

# Generation parameters
diffusion_steps = widgets.IntSlider(value=20, min=5, max=100, step=1, description='Diffusion Steps:')
cfg_scale = widgets.FloatSlider(value=1.3, min=0.5, max=3.5, step=0.05, description='CFG Scale:')
seed_input = widgets.IntText(value=42, description='Seed:')

# Sampling settings
use_sampling = widgets.Checkbox(value=False, description='Use Sampling')
temperature = widgets.FloatSlider(value=0.95, min=0.1, max=2.0, step=0.05, description='Temperature:')
top_p = widgets.FloatSlider(value=0.95, min=0.1, max=1.0, step=0.05, description='Top P:')
max_words = widgets.IntSlider(value=250, min=100, max=500, step=50, description='Max Words/Chunk:')

# Generate button
generate_btn = widgets.Button(description='Generate Speech', button_style='success', icon='play')

# Progress bar
progress_bar = widgets.IntProgress(
    value=0,
    min=0,
    max=100,
    description='Progress:',
    bar_style='info',
    orientation='horizontal',
    layout=widgets.Layout(width='95%')
)

# Status text
status_text = widgets.Label(value='Ready')

# Upload handlers
def upload_audio(speaker_num):
    def handler(b):
        uploaded = files.upload()
        if uploaded:
            filename = list(uploaded.keys())[0]
            # Save file temporarily
            with open(f'/tmp/speaker_{speaker_num}.wav', 'wb') as f:
                f.write(uploaded[filename])
            uploaded_audio_files[speaker_num] = f'/tmp/speaker_{speaker_num}.wav'
            
            # Update status label
            status_labels = {1: status_label_1, 2: status_label_2, 3: status_label_3, 4: status_label_4}
            status_labels[speaker_num].value = f'✓ {filename}'
    return handler

upload_btn_1.on_click(upload_audio(1))
upload_btn_2.on_click(upload_audio(2))
upload_btn_3.on_click(upload_audio(3))
upload_btn_4.on_click(upload_audio(4))

# Generate speech handler
def on_generate_clicked(b):
    with output_area:
        clear_output(wait=True)
        
        try:
            progress_bar.value = 0
            status_text.value = 'Preparing...'
            
            text = text_input.value
            if not text.strip():
                print("❌ Error: Please enter some text")
                return
            
            # Set seeds
            seed_val = seed_input.value
            torch.manual_seed(seed_val)
            torch.cuda.manual_seed(seed_val)
            torch.cuda.manual_seed_all(seed_val)
            np.random.seed(seed_val)
            
            # Set diffusion steps
            model.set_ddpm_inference_steps(diffusion_steps.value)
            
            # Load uploaded voices
            uploaded_voices = {}
            for speaker_num, file_path in uploaded_audio_files.items():
                if file_path is not None and os.path.exists(file_path):
                    uploaded_voices[speaker_num] = load_audio_file(file_path)
            
            # Determine mode
            active_speakers = list(uploaded_voices.keys())
            if len(active_speakers) <= 1:
                mode = "Single Speaker"
            else:
                mode = "Multi Speaker"
            
            progress_bar.value = 10
            status_text.value = f'Mode: {mode}'
            print(f"🎤 Mode: {mode}")
            
            # Parse text for pauses
            segments = parse_pause_keywords(text)
            all_audio_segments = []
            sample_rate = 24000
            
            if mode == 'Single Speaker':
                voice_samples = None
                total_segments = len([s for s in segments if s[0] == 'text'])
                current_segment = 0
                
                for seg_type, seg_content in segments:
                    if seg_type == 'pause':
                        silence = generate_silence(seg_content, sample_rate)
                        all_audio_segments.append(silence)
                    else:
                        word_count = len(seg_content.split())
                        
                        if word_count > max_words.value:
                            chunks = split_text_into_chunks(seg_content, max_words.value)
                            
                            for chunk_idx, chunk in enumerate(chunks):
                                progress_val = 10 + int(80 * (current_segment + chunk_idx/len(chunks)) / total_segments)
                                progress_bar.value = progress_val
                                status_text.value = f'Generating chunk {chunk_idx+1}/{len(chunks)}'
                                
                                formatted_text = f"Speaker 1: {chunk}"
                                
                                if voice_samples is None:
                                    if 1 in uploaded_voices:
                                        voice_samples = [uploaded_voices[1]]
                                    else:
                                        voice_samples = [create_synthetic_voice_sample(0)]
                                
                                chunk_audio = generate_audio_segment(
                                    formatted_text, voice_samples, cfg_scale.value,
                                    seed_val, use_sampling.value, temperature.value, top_p.value
                                )
                                all_audio_segments.append(chunk_audio)
                        else:
                            progress_val = 10 + int(80 * current_segment / total_segments)
                            progress_bar.value = progress_val
                            status_text.value = f'Generating speech ({word_count} words)'
                            
                            formatted_text = f"Speaker 1: {seg_content}"
                            
                            if voice_samples is None:
                                if 1 in uploaded_voices:
                                    voice_samples = [uploaded_voices[1]]
                                else:
                                    voice_samples = [create_synthetic_voice_sample(0)]
                            
                            segment_audio = generate_audio_segment(
                                formatted_text, voice_samples, cfg_scale.value,
                                seed_val, use_sampling.value, temperature.value, top_p.value
                            )
                            all_audio_segments.append(segment_audio)
                        
                        current_segment += 1
            else:
                # Multi speaker mode
                progress_bar.value = 20
                status_text.value = 'Processing multi-speaker text'
                
                bracket_pattern = r'\[(\d+)\]\s*:'
                speaker_numbers = sorted(list(set([int(m) for m in re.findall(bracket_pattern, text)])))
                
                if not speaker_numbers:
                    speaker_numbers = [1]
                
                # Prepare voice samples
                voice_samples = []
                for speaker_num in speaker_numbers:
                    if speaker_num in uploaded_voices:
                        voice_samples.append(uploaded_voices[speaker_num])
                    else:
                        voice_samples.append(create_synthetic_voice_sample(speaker_num - 1))
                
                # Convert [N]: format to Speaker (N-1): format
                converted_text = text
                for speaker_num in sorted(speaker_numbers, reverse=True):
                    pattern = f'\\[{speaker_num}\\]\\s*:'
                    replacement = f'Speaker {speaker_num - 1}:'
                    converted_text = re.sub(pattern, replacement, converted_text)
                
                converted_text = converted_text.replace('\n', ' ').replace('\r', ' ')
                converted_text = ' '.join(converted_text.split())
                
                progress_bar.value = 50
                status_text.value = f'Generating multi-speaker speech ({len(speaker_numbers)} speakers)'
                
                audio = generate_audio_segment(
                    converted_text, voice_samples, cfg_scale.value,
                    seed_val, use_sampling.value, temperature.value, top_p.value
                )
                all_audio_segments.append(audio)
            
            # Concatenate all audio segments
            progress_bar.value = 90
            status_text.value = 'Finalizing audio'
            
            if all_audio_segments:
                final_audio = np.concatenate(all_audio_segments)
                
                # Save to file
                output_path = "/tmp/vibevoice_output.wav"
                sf.write(output_path, final_audio, sample_rate)
                
                progress_bar.value = 100
                status_text.value = 'Complete!'
                
                duration = len(final_audio) / sample_rate
                print(f"\n✅ Generation Complete!")
                print(f"Mode: {mode}")
                print(f"Duration: {duration:.2f}s")
                print(f"Sample Rate: {sample_rate} Hz")
                print(f"\n🔊 Playing audio...")
                
                # Display audio player
                display(Audio(final_audio, rate=sample_rate))
            else:
                print("❌ Error: No audio segments generated")
                
        except Exception as e:
            import traceback
            print(f"❌ Error: {str(e)}")
            print(traceback.format_exc())
            status_text.value = 'Error occurred'

generate_btn.on_click(on_generate_clicked)

# Layout
print("Creating interface...")
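
In the multi-speaker branch, the handler rewrites `[N]:` tags as zero-based `Speaker N-1:` labels and collapses the text onto one line before generation. The sketch below shows the same transformation in isolation. It uses a single `re.sub` pass with a callback rather than the per-speaker loop in the handler, and `to_model_labels` is a hypothetical name.

```python
import re


def to_model_labels(text: str) -> str:
    """Rewrite [N]: tags as zero-based 'Speaker N-1:' labels and collapse whitespace."""
    converted = re.sub(
        r"\[(\d+)\]\s*:",
        lambda m: f"Speaker {int(m.group(1)) - 1}:",
        text,
    )
    # Newlines and repeated spaces become single spaces, as in the handler.
    return " ".join(converted.split())


script = "[1]: Hello, how are you today?\n[2]: I'm doing great!"
print(to_model_labels(script))
# Speaker 0: Hello, how are you today? Speaker 1: I'm doing great!
```

Pause tags are left untouched by this conversion, so in this mode they reach the model as literal text.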

## 8. Display Interface

In [None]:
# Display the interface
display(HTML('<h2>🎙️ VibeVoice Text-to-Speech Interface</h2>'))
display(HTML('<p>Upload voice samples for up to 4 speakers and generate speech with voice cloning.</p>'))

display(HTML('<h3>Voice Samples</h3>'))
display(widgets.HBox([upload_btn_1, status_label_1]))
display(widgets.HBox([upload_btn_2, status_label_2]))
display(widgets.HBox([upload_btn_3, status_label_3]))
display(widgets.HBox([upload_btn_4, status_label_4]))

display(HTML('<h3>Text Input</h3>'))
display(text_input)

display(HTML('<h3>Generation Parameters</h3>'))
display(widgets.VBox([
    diffusion_steps,
    cfg_scale,
    seed_input,
    widgets.HTML('<b>Sampling Settings</b>'),
    use_sampling,
    temperature,
    top_p,
    max_words
]))

display(HTML('<h3>Generate</h3>'))
display(generate_btn)
display(progress_bar)
display(status_text)

display(HTML('<h3>Output</h3>'))
display(output_area)

display(HTML('''
<h3>Examples</h3>
<p><b>Single Speaker with pause:</b></p>
<pre>Hello, this is a test. [pause:500] Let me continue speaking. [pause] And here's the final part.</pre>

<p><b>Multi Speaker:</b></p>
<pre>[1]: Hello, how are you today?
[2]: I'm doing great, thanks for asking!
[1]: That's wonderful to hear.</pre>
'''))