# 🎧 Chatterbox-TTS-Extended on Google Colab

## Advanced Text-to-Speech with Artifact Reduction

This notebook allows you to run Chatterbox-TTS-Extended on Google Colab, leveraging free GPU resources for high-quality speech synthesis with built-in artifact reduction.

### Features:
- 🎤 High-quality voice cloning and TTS
- 🔧 Advanced artifact reduction with RNNoise
- 🎯 Whisper-based quality validation
- 🎨 Voice conversion capabilities
- 📦 Multiple export formats (WAV, MP3, FLAC)

### Requirements:
- Google Colab account (free tier works!)
- GPU runtime (recommended: T4 or better)

---

## 📚 Quick Links & Resources

**Documentation:**
- 📖 [Full Colab Guide](https://github.com/m-marie1/Chatterbox-TTS-Extended/blob/main/COLAB_GUIDE.md) - Comprehensive guide
- ⚡ [Quick Reference](https://github.com/m-marie1/Chatterbox-TTS-Extended/blob/main/COLAB_QUICKREF.md) - Cheat sheet
- 📋 [README](https://github.com/m-marie1/Chatterbox-TTS-Extended/blob/main/README.md) - Feature documentation

**Support:**
- 🐛 [Report Issues](https://github.com/m-marie1/Chatterbox-TTS-Extended/issues)
- 💬 [Discussions](https://github.com/m-marie1/Chatterbox-TTS-Extended/discussions)

**Tips:**
- ⚡ For fastest start: Run all cells and accept default settings
- 🎯 For best quality: Enable RNNoise + use 3 candidates + Whisper validation
- 💾 Remember to download files before session ends!

---

## 📋 Step 1: Environment Setup

**IMPORTANT**: Make sure you have enabled GPU runtime:
1. Go to `Runtime` → `Change runtime type`
2. Select `GPU` as Hardware accelerator
3. Choose `T4` GPU (or better if available)
4. Click `Save`

This cell will:
- Check GPU availability
- Verify Python version
- Display system information

In [None]:
# Check GPU and environment
import sys
import subprocess

print("🔍 Checking environment...\n")
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Check GPU
try:
    gpu_info = subprocess.check_output(['nvidia-smi'], encoding='utf-8')
    print("\n✅ GPU detected:")
    print(gpu_info)
except (OSError, subprocess.CalledProcessError):
    # nvidia-smi is missing or exited with an error, so no usable GPU is attached
    print("\n⚠️  WARNING: No GPU detected. This will be VERY slow!")
    print("Please enable GPU: Runtime → Change runtime type → GPU")

## 📦 Step 2: Install System Dependencies

Installing FFmpeg and other system-level tools required for audio processing.

In [None]:
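# Install FFmpeg (required for audio processing).
# apt output is discarded so that only the version check below is shown.
!apt-get update -qq > /dev/null
!apt-get install -y -qq ffmpeg > /dev/null

# Verify FFmpeg installation (prints the version line)
!ffmpeg -version | head -n 1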

## 📥 Step 3: Clone Repository

Cloning the Chatterbox-TTS-Extended repository.

In [None]:
import os

# Remove existing directory if present
if os.path.exists('Chatterbox-TTS-Extended'):
    print("📁 Removing existing directory...")
    !rm -rf Chatterbox-TTS-Extended

# Clone the repository
print("📥 Cloning Chatterbox-TTS-Extended repository...")
!git clone https://github.com/m-marie1/Chatterbox-TTS-Extended.git

# Change to repository directory
%cd Chatterbox-TTS-Extended

print("\n✅ Repository cloned successfully!")

## 🐍 Step 4: Install Python Dependencies

Installing all required Python packages. This may take 3-5 minutes.

**Note**: We're using Colab-optimized versions to avoid conflicts with pre-installed packages.

In [None]:
# Install PyTorch with CUDA support (Colab uses CUDA 12.x)
print("🔧 Installing PyTorch with CUDA support...")
!pip install -q torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
print("\n📦 Installing core dependencies...")
!pip install -q gradio numpy faster-whisper openai-whisper ffmpeg-python
!pip install -q resampy==0.4.3 librosa==0.10.0 soundfile nltk

# Install auto-editor for audio cleanup
print("\n🎬 Installing auto-editor...")
!pip install -q auto-editor==27.1.1

# Install Hugging Face and model dependencies
print("\n🤗 Installing Hugging Face dependencies...")
!pip install -q transformers==4.46.3 diffusers==0.29.0 omegaconf==2.3.0

# Install specific model dependencies
print("\n🎯 Installing model-specific dependencies...")
!pip install -q resemble-perth==1.0.1 silero-vad==5.1.2 conformer==0.3.2

# Install pyrnnoise for artifact reduction
print("\n🔇 Installing pyrnnoise for noise reduction...")
!pip install -q pyrnnoise==0.3.8

# Install s3tokenizer
print("\n🔤 Installing s3tokenizer...")
!pip install -q s3tokenizer

# Install spaces (for Gradio compatibility)
print("\n🚀 Installing spaces...")
!pip install -q spaces

print("\n✅ All dependencies installed successfully!")

# Verify key installations
print("\n📊 Verifying installations...")
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

## 📚 Step 5: Download NLTK Data

Downloading required NLTK tokenizer data for text processing.

In [None]:
import nltk
print("📚 Downloading NLTK data...")
nltk.download('punkt_tab', quiet=True)
print("✅ NLTK data downloaded!")

## 🚀 Step 6: Launch Chatterbox-TTS-Extended

This will start the Gradio interface. The model will be loaded on first use.

**Features available:**
- **TTS Tab**: Text-to-Speech with advanced options
  - Multiple candidate generation for best quality
  - Whisper validation to reduce artifacts
  - RNNoise denoising for clean audio
  - Auto-editor for silence removal
  - Batch processing support
- **Voice Conversion Tab**: Convert voice to match a reference

**Tips for Colab:**
- First generation will take longer as models load
- Use smaller Whisper models (tiny/base) to save VRAM
- Enable "Use faster-whisper" for better performance
- Start with 1-2 candidates per chunk to avoid OOM errors
- If you get CUDA out of memory, restart runtime and reduce settings

In [None]:
# Launch the Gradio interface
print("🚀 Launching Chatterbox-TTS-Extended...\n")
print("⏳ First generation will take longer as models download and load.")
print("📊 Monitor the output below for progress updates.\n")

# Run with public sharing enabled for Colab
!python Chatter.py --share

## ⚙️ Recommended Settings for Colab

### For Free Tier (T4 GPU, ~15GB VRAM):
```
Whisper Model: tiny or base
Use faster-whisper: ✅ Enabled
Candidates per chunk: 2-3
Parallel workers: 2-3
Enable RNNoise: ✅ Enabled (removes artifacts!)
```

### For Pro/Pro+ (A100/V100, more VRAM):
```
Whisper Model: small or medium
Use faster-whisper: ✅ Enabled
Candidates per chunk: 3-5
Parallel workers: 4-6
Enable RNNoise: ✅ Enabled
```
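
The two tables above can be turned into a small lookup keyed on GPU memory. The sketch below is illustrative only: `suggest_settings` and `detect_vram_gb` are hypothetical helpers (not part of the repository), the 20 GB cut-off between the two tiers is an assumption, and each value is simply the low end of the range listed above.

```python
def suggest_settings(vram_gb):
    """Map total GPU memory (GB) to the setting ranges listed above."""
    # Assumption: anything under ~20 GB (or unknown) is treated like a free-tier T4.
    if vram_gb is None or vram_gb < 20:
        return {"whisper_model": "base", "candidates_per_chunk": 2, "parallel_workers": 2}
    return {"whisper_model": "small", "candidates_per_chunk": 3, "parallel_workers": 4}


def detect_vram_gb():
    """Return the total memory of GPU 0 in GB, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


print(suggest_settings(detect_vram_gb()))
```

The returned values still have to be entered in the UI by hand. If you hit out-of-memory errors, go lower than the suggestion.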

### To Reduce Artifacts (Main Goal!):
1. **Enable RNNoise denoising** - This is the key feature!
2. Use **3+ candidates per chunk** with Whisper validation
3. Enable **Auto-Editor** for cleanup
4. Use **faster-whisper** for efficient validation
5. Set **Max Attempts to 3** to retry failed chunks
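
To make the idea behind "multiple candidates with Whisper validation" concrete, here is a minimal sketch of one way such a selection could work: transcribe each candidate, compare the transcript to the target text, keep the closest match, and retry if even the best one falls below a threshold. The helper names, the word-level `difflib` similarity, and the 0.85 threshold are all assumptions for illustration. The scoring used by the repository may differ.

```python
import difflib
import re


def normalize_text(text):
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def similarity(target, transcript):
    """Word-level similarity in [0, 1] between the target text and a transcript."""
    matcher = difflib.SequenceMatcher(None, normalize_text(target), normalize_text(transcript))
    return matcher.ratio()


def pick_best(target, transcripts, threshold=0.85):
    """Return (index, score, passed) for the transcript closest to the target.

    `transcripts` must contain at least one entry. `passed` is False when even
    the best candidate is below the (assumed) threshold, i.e. a retry is due.
    """
    scores = [similarity(target, t) for t in transcripts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best], scores[best] >= threshold


# Transcripts would normally come from Whisper; these are made-up strings.
print(pick_best("Hello Markus, nice to see you!",
                ["hello marcus nice to see", "Hello Markus nice to see you", "yellow"]))
```

More candidates give the selector more chances to find a clean take, which is why raising the candidate count trades generation time for fewer artifacts.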

---

## 🔧 Troubleshooting

### Common Issues and Solutions:

#### 1. **CUDA Out of Memory Error**
```python
# Solution: Restart runtime and reduce settings
# Runtime → Restart runtime
# Then use these settings:
# - Whisper model: tiny
# - Candidates: 1-2
# - Parallel workers: 1
```

#### 2. **Slow Performance**
```python
# Make sure GPU is enabled:
import torch
print(f"GPU available: {torch.cuda.is_available()}")
# If False, go to Runtime → Change runtime type → GPU
```

#### 3. **Model Download Failures**
```python
# Retry the cell or check your internet connection
# Models are downloaded from Hugging Face on first use
```

#### 4. **Audio Has Noise/Artifacts**
```python
# Enable these features in the UI:
# ✅ Denoise with RNNoise (pyrnnoise)
# ✅ Post-process with Auto-Editor
# ✅ Use faster-whisper validation
# Increase candidates per chunk to 3-5
```

#### 5. **Session Timeout**
```python
# Colab free tier has time limits
# Save your audio files regularly
# Consider upgrading to Colab Pro for longer sessions
```

#### 6. **FFmpeg Errors**
```python
# Reinstall FFmpeg:
!apt-get install --reinstall -y ffmpeg
```
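
Before reinstalling, it can help to confirm whether FFmpeg is actually missing from the `PATH`. A minimal sketch using only the standard library follows. `ffmpeg_version_line` is a hypothetical helper name.

```python
import shutil
import subprocess


def ffmpeg_version_line():
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is not on PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print(ffmpeg_version_line() or "FFmpeg not found on PATH")
```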

#### 7. **Import Errors**
```python
# Restart runtime and run all cells in order
# Runtime → Restart runtime
```

---

## 🧪 Quick Test (Optional)

Verify the installation by importing the core packages. Run this cell before launching the UI in Step 6, since the launch cell keeps running until you stop it.

In [None]:
# Quick test to verify everything is working
print("🧪 Testing installation...\n")

try:
    # Test imports
    import torch
    import torchaudio
    import gradio as gr
    from chatterbox.src.chatterbox.tts import ChatterboxTTS

    print("✅ Core imports successful")
    print(f"✅ PyTorch: {torch.__version__}")
    print(f"✅ CUDA available: {torch.cuda.is_available()}")
    print(f"✅ Gradio: {gr.__version__}")

    # Test optional imports (failures here are warnings, not errors)
    try:
        import pyrnnoise
        print("✅ pyrnnoise (RNNoise) available - artifact reduction enabled!")
    except Exception:
        print("⚠️  pyrnnoise not available - denoising will be skipped")

    try:
        from faster_whisper import WhisperModel
        print("✅ faster-whisper available")
    except Exception:
        print("⚠️  faster-whisper not available")

    print("\n✅ Installation test passed! Ready to use.")

except Exception as e:
    print(f"\n❌ Installation test failed: {e}")
    print("Please run the installation cells again.")

## 💡 Usage Tips

### Getting the Best Results:

1. **Reference Audio**: Upload a clean 3-10 second sample of the target voice
2. **Text Preprocessing**: Enable all text cleanup options
3. **Quality Settings**: 
   - Use 3-5 candidates per chunk
   - Enable Whisper validation
   - Enable RNNoise denoising
4. **Export**: Choose FLAC for best quality or MP3 for smaller files
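
The 3-10 second guideline for reference audio is easy to check before uploading. The sketch below reads the duration from the WAV header with the standard-library `wave` module. It assumes an uncompressed PCM WAV file (other formats would need a library such as `soundfile`), and the helper names are hypothetical. It only checks length, not how clean the recording is.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds, computed from its header."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_length_ok(path, min_sec=3.0, max_sec=10.0):
    """True if the clip falls inside the 3-10 second window suggested above."""
    return min_sec <= wav_duration_seconds(path) <= max_sec
```

For example, `reference_length_ok("/content/anna_ref.wav")` would return `False` for a one-second clip (the path is a placeholder).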

### Saving Your Work:

Generated audio files are saved in the `output/` directory. Download them before your session ends:

```python
# List generated files
!ls -lh output/

# Download all output files
from google.colab import files
import os

for file in os.listdir('output'):
    if file.endswith(('.wav', '.mp3', '.flac')):
        files.download(f'output/{file}')
```
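
When there are many files, bundling them into a single archive means one download instead of one per file. A minimal sketch, assuming the `output/` directory described above; `zip_outputs` and the archive name are hypothetical:

```python
import os
import shutil


def zip_outputs(output_dir="output", archive_base="chatterbox_output"):
    """Pack output_dir into <archive_base>.zip; return its path, or None if the directory is missing."""
    if not os.path.isdir(output_dir):
        return None
    return shutil.make_archive(archive_base, "zip", root_dir=output_dir)


archive = zip_outputs()
if archive:
    try:
        from google.colab import files  # only importable inside Colab
        files.download(archive)
    except ImportError:
        print(f"Archive written to {archive}")
```

Note that this packs everything in the directory, not only `.wav`, `.mp3`, and `.flac` files.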

### Managing Memory:

```python
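# Clear GPU memory if needed
import torch
import gc

gc.collect()              # release unreferenced Python objects (and their tensors) first
torch.cuda.empty_cache()  # then return cached, unused blocks to the GPU
print("GPU memory cleared")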
```

---

## 📥 Download Generated Audio Files

Use this cell to download all generated audio files to your local machine.

In [None]:
from google.colab import files
import os

print("📁 Available output files:\n")

if os.path.exists('output'):
    output_files = [f for f in os.listdir('output') if f.endswith(('.wav', '.mp3', '.flac'))]
    
    if output_files:
        for file in output_files:
            print(f"  - {file}")
        
        print("\n📥 Downloading files...")
        for file in output_files:
            try:
                files.download(f'output/{file}')
                print(f"✅ Downloaded: {file}")
            except Exception as e:
                print(f"❌ Failed to download {file}: {e}")
    else:
        print("No audio files found. Generate some audio first!")
else:
    print("Output directory not found. Generate some audio first!")

## 🧹 Clear GPU Memory

Run this cell if you encounter memory issues or want to free up GPU memory.

In [None]:
import torch
import gc

print("🧹 Clearing GPU memory...\n")

# Collect garbage first so unreferenced tensors are released before the cache is emptied
gc.collect()

if torch.cuda.is_available():
    # Return cached, unused blocks to the GPU
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

    # Report what is still in use after cleanup
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3

    print(f"GPU Memory Allocated: {allocated:.2f} GB")
    print(f"GPU Memory Reserved: {reserved:.2f} GB")

print("\n✅ Memory cleared!")

## 🌍 Language Learning Content Generation (Multilingual)

Generate language learning content with **proper multilingual pronunciation**! This example demonstrates using the Chatterbox Multilingual model for authentic language learning materials.

This example creates German-English dialogue content with:
- **German-only version** (each line repeated twice with authentic German pronunciation)
- **German-English paired version** (DE→EN→DE→EN for each line)

### ✨ What's New?

- Uses **real multilingual model** from ResembleAI/chatterbox for proper pronunciation
- Supports **23 languages**: ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh
- Compatible with all artifact reduction features (RNNoise, Whisper validation, Auto-Editor)
- Uses **MTLTokenizer** with language-specific preprocessing

### 📝 Usage Notes:

1. The multilingual model will be downloaded on first use (~2-3GB)
2. You can use reference audio files for voice cloning in any supported language
3. The same multilingual support is also available in the main UI by enabling the "Multilingual Model" checkbox

**Tip**: Upload your own reference audio files and update the `ref_map` paths for personalized voices!
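
A mistyped path in `ref_map` is easy to miss, so it can be worth checking the entries before a long generation run. The sketch below uses a hypothetical helper, `resolve_refs`, that replaces any path that does not exist with `None`, so that speaker falls back to generation without a reference clip:

```python
import os


def resolve_refs(ref_map):
    """Return a copy of ref_map where paths that do not exist are replaced by None."""
    resolved = {}
    for speaker, path in ref_map.items():
        if path is not None and not os.path.isfile(path):
            print(f"Reference for {speaker} not found at {path}; continuing without it")
            path = None
        resolved[speaker] = path
    return resolved


# Placeholder paths for illustration only
print(resolve_refs({"Anna": None, "Markus": "/content/markus_ref.wav"}))
```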

In [None]:
# Language Learning Content Generator with Multilingual Support
# This cell demonstrates how to use the multilingual TTS model for language learning

import os
import numpy as np
import soundfile as sf
import librosa
import torch

from chatterbox.src.chatterbox.tts import ChatterboxTTS

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

# --- Load MULTILINGUAL model ---
print("Loading ChatterboxTTS MULTILINGUAL model...")
print("⚠️  Note: The multilingual model may take longer to download on first use.")
print("   It will download from ResembleAI/chatterbox (multilingual model files).")

model = ChatterboxTTS.from_pretrained(device=device, use_multilingual=True)
sr = model.sr
print("✅ Multilingual model loaded! Sample rate:", sr)
print("   Supports 23 languages: de, fr, es, it, pt, ru, nl, pl, tr, ar, zh, ja, ko, hi, and more!")

# -------------------- Helpers --------------------

def make_silence(duration_sec, sr):
    return np.zeros(int(duration_sec * sr), dtype=np.float32)

def trim_clip(wav, top_db=35):
    """Trim leading/trailing audio that is more than top_db below the peak"""
    trimmed, _ = librosa.effects.trim(wav, top_db=top_db)
    return trimmed.astype(np.float32)

def normalize(wav):
    """Peak-normalize to 0.97 of full scale (silent clips are left unchanged)"""
    peak = np.max(np.abs(wav)) if wav.size else 0.0
    if peak > 0:
        wav = wav / peak * 0.97
    return wav.astype(np.float32)

def safe_generate(model, text, language_id="en", audio_prompt_path=None,
                  cfg_weight=0.5, exaggeration=0.5):
    """
    Generate speech with language support.
    
    Args:
        model: ChatterboxTTS model instance
        text: Text to synthesize
        language_id: Language code (en, de, fr, es, it, pt, etc.)
        audio_prompt_path: Optional reference audio for voice cloning
        cfg_weight: CFG weight (0.0-1.0)
        exaggeration: Emotion exaggeration (0.0-2.0)
    """
    wav = model.generate(
        text,
        audio_prompt_path=audio_prompt_path,
        language_id=language_id,
        cfg_weight=cfg_weight,
        exaggeration=exaggeration
    )
    if isinstance(wav, torch.Tensor):
        wav = wav.detach().cpu().numpy()
    wav = np.asarray(wav).reshape(-1).astype(np.float32)
    return normalize(trim_clip(wav))

# -------------------- Dialogue --------------------

turns = [
    {"speaker":"Anna",
     "german":"Hallo Markus, schön dich zu sehen! Wie geht's dir?",
     "english":"Hello Markus, nice to see you! How are you?"},
    {"speaker":"Markus",
     "german":"Mir geht's gut, danke! Und dir?",
     "english":"I'm doing well, thanks! And you?"},
    {"speaker":"Anna",
     "german":"Auch gut! Hast du Lust, einen Kaffee zu trinken?",
     "english":"I'm good too! Do you feel like having a coffee?"},
    {"speaker":"Markus",
     "german":"Sehr gerne. Ich nehme einen Cappuccino.",
     "english":"Gladly. I'll have a cappuccino."},
    {"speaker":"Anna",
     "german":"Perfekt, ich nehme einen Latte Macchiato. Und danach können wir einen Spaziergang machen.",
     "english":"Perfect, I'll take a latte macchiato. And after that we can go for a walk."},
    {"speaker":"Markus",
     "german":"Das klingt wunderbar. Ich freue mich schon!",
     "english":"That sounds wonderful. I'm looking forward to it!"},
]

# Optional: reference voices (upload your own clean reference files)
# If you have reference files, upload them to Colab and update these paths:
ref_map = {
    "Anna": None,    # e.g., "/content/anna_ref.wav" if you upload one
    "Markus": None   # e.g., "/content/markus_ref.wav" if you upload one
}

# Create output directory if it doesn't exist
os.makedirs("output", exist_ok=True)

# -------------------- 1) German-only --------------------

german_clips = []
print("\n🎙️  Generating: German-only (each turn x2)")
for t in turns:
    speaker, text_de = t["speaker"], t["german"]
    ref = ref_map.get(speaker)
    for _ in range(2):
        print(f"{speaker} (DE): {text_de}")
        german_clips.append(safe_generate(model, text_de, language_id="de", audio_prompt_path=ref))
        german_clips.append(make_silence(0.4, sr))  # short pause between repeats
    german_clips.append(make_silence(0.8, sr))      # pause between speakers

german_full = np.concatenate(german_clips)
sf.write("output/german_only_full.wav", german_full, sr)
print("🎧 Saved: output/german_only_full.wav")

# -------------------- 2) German+English pairs --------------------

pair_clips = []
print("\n🎙️  Generating: German+English (each turn repeated DE→EN→DE→EN)")
for t in turns:
    speaker = t["speaker"]
    ref = ref_map.get(speaker)
    seq = [(t["german"], "de"), (t["english"], "en")] * 2
    for s_text, s_lang in seq:
        lang_name = "German" if s_lang == "de" else "English"
        print(f"{speaker} ({lang_name}): {s_text}")
        pair_clips.append(safe_generate(model, s_text, language_id=s_lang, audio_prompt_path=ref))
        pair_clips.append(make_silence(0.35, sr))  # pause inside pair
    pair_clips.append(make_silence(0.9, sr))      # pause between speakers

pair_full = np.concatenate(pair_clips)
sf.write("output/german_with_translation.wav", pair_full, sr)
print("🎧 Saved: output/german_with_translation.wav")

print("\n✅ Done! Files saved to output/ directory")
print("📥 Download them using the file browser or the download cell above.")

print("\n💡 Tips:")
print("   • The multilingual model uses authentic language-specific pronunciation")
print("   • Supported languages: ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh")
print("   • Upload reference audio files for better voice consistency")

---

## 📚 Additional Resources

- **GitHub Repository**: [Chatterbox-TTS-Extended](https://github.com/m-marie1/Chatterbox-TTS-Extended)
- **Original Chatterbox**: [Resemble AI Chatterbox](https://github.com/resemble-ai/chatterbox)
- **Report Issues**: [GitHub Issues](https://github.com/m-marie1/Chatterbox-TTS-Extended/issues)

## 🤝 Credits

- **Chatterbox-TTS-Extended**: Extended version with artifact reduction
- **Original Chatterbox**: Resemble AI
- **RNNoise**: Xiph.Org Foundation
- **Whisper**: OpenAI

---

**Enjoy high-quality, artifact-free speech synthesis! 🎉**