Emotion-Aware • Character-Consistent • Professional Quality
A complete AI-powered video dubbing solution featuring emotion detection, character voice consistency, and multiple state-of-the-art TTS engines.
- Emotion-Aware TTS - Uses Bark AI to preserve emotional tone (happy, sad, angry, etc.)
- Character Voice Consistency - Maintains unique voice profiles for each speaker
- Multi-Speaker Detection - Automatic speaker diarization using Pyannote
- Prosody Transfer - Matches pitch, rhythm, and energy from original speech
- Music-Aware Processing - Uses AudioCraft for background score preservation
- Voice Profile Management - Save and reuse character voices across projects
- Advanced Audio Mixing - Dynamic EQ, compression, and spatial audio
- Lip Sync Support - Optional Wav2Lip integration for tight synchronization
- Smart Chunking - Handles videos of any length automatically
- Multi-Language - Supports 9+ languages with native TTS engines
- Emotion Detection - Analyzes emotional content using wav2vec2 models
- High-Quality Separation - Demucs-based vocal/background isolation
- Fallback System - Multiple TTS engines (Bark → XTTS → Edge TTS)
- Real-Time Progress - Live updates during processing
- Efficient Caching - Reuses voice profiles and models
```bash
# Clone or download the files
cd dubbing_pipeline_v3

# Install dependencies
pip install -r requirements.txt

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install ffmpeg sox libsox-dev libsndfile1

# For speaker diarization (optional but recommended)
# Get a HuggingFace token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token_here"
```

**Option 1: Jupyter Notebook**

```bash
jupyter notebook dubbing_pipeline_v3_enhanced.ipynb
```

**Option 2: Standalone Python Script**

```bash
python dubbing_app_v3.py
```

**Option 3: Command Line (coming soon)**

```bash
python dubbing_cli.py input_video.mp4 --lang es --emotion
```

- Upload Video - Drag and drop or select your video file
- Choose Language - Select target language from dropdown
- Configure Options:
  - Emotion-Aware TTS - Preserves emotional tone (recommended)
  - Lip Sync - Applies lip synchronization (slower, requires a good GPU)
- Start Dubbing - Click the button and wait for processing
- Download Result - Download your dubbed video
| Language | Code | TTS Quality | Emotion Support |
|---|---|---|---|
| Spanish | es | ⭐⭐⭐⭐⭐ | ✅ |
| French | fr | ⭐⭐⭐⭐⭐ | ✅ |
| German | de | ⭐⭐⭐⭐⭐ | ✅ |
| Italian | it | ⭐⭐⭐⭐ | ✅ |
| Japanese | ja | ⭐⭐⭐⭐ | ✅ |
| Korean | ko | ⭐⭐⭐⭐ | ✅ |
| Chinese | zh | ⭐⭐⭐⭐ | ✅ |
| Hindi | hi | ⭐⭐⭐⭐ | ✅ |
| English | en | ⭐⭐⭐⭐⭐ | ✅ |
```
Input Video
    ↓
[1] Audio Separation (Demucs)
    ├── Vocals
    └── Background Music
    ↓
[2] Transcription (Whisper)
    └── Timestamped text segments
    ↓
[3] Speaker Diarization (Pyannote)
    └── Identify who speaks when
    ↓
[4] Emotion Detection (wav2vec2)
    └── Detect emotional tone per segment
    ↓
[5] Voice Profile Creation
    └── Extract embeddings for each speaker
    ↓
[6] Translation (Google Translator)
    └── Translate to target language
    ↓
[7] TTS Synthesis (Bark/XTTS/Edge)
    └── Generate emotional, character-consistent speech
    ↓
[8] Prosody Transfer
    └── Match pitch and rhythm
    ↓
[9] Audio Mixing
    └── Combine vocals + background
    ↓
[10] Video Merging (optional lip sync)
    └── Final dubbed video
```
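In outline, the ten stages above run as a simple sequential driver that threads each stage's output into the next. The sketch below is illustrative only: the stage functions are placeholders, not the project's real Demucs/Whisper/Pyannote wrappers.

```python
# Minimal sketch of the staged flow: each stage reads the running
# "context" dict and contributes its own artifact under its name.
# Stage bodies here are stand-ins for the real model calls.
def run_pipeline(video_path, stages):
    ctx = {"video": video_path}
    for name, stage in stages:
        ctx[name] = stage(ctx)
    return ctx

ctx = run_pipeline("input.mp4", [
    ("separation", lambda c: {"vocals": "vocals.wav", "music": "music.wav"}),
    ("transcription", lambda c: [{"start": 0.0, "end": 2.1, "text": "Hello"}]),
])
print(sorted(ctx))  # -> ['separation', 'transcription', 'video']
```

Keeping each stage a pure function of the shared context makes it easy to cache or skip stages (e.g. reuse a saved separation result) between runs.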
The system intelligently selects the best TTS engine based on requirements:
- **Bark** (Priority 1) - When emotion is important
  - ✅ Best emotional expression
  - ✅ Natural prosody
  - ❌ Slower generation
  - ❌ English-focused (but works for others)
- **XTTS** (Priority 2) - When voice cloning is important
  - ✅ Excellent voice cloning
  - ✅ Multi-language support
  - ✅ Fast generation
  - ❌ Less emotional variety
- **Edge TTS** (Priority 3) - Reliable fallback
  - ✅ Always available
  - ✅ Very fast
  - ✅ High quality
  - ❌ No voice cloning
  - ❌ Limited emotion control
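The fallback chain behaves roughly like the sketch below: try each engine in priority order and fall through on failure. The engine functions are stand-ins, not the project's actual wrapper names.

```python
# Sketch of the engine-fallback idea: attempt each (name, synth_fn)
# pair in priority order; return the first successful result.
def synthesize_with_fallback(text, engines):
    errors = {}
    for name, synth in engines:
        try:
            return name, synth(text)
        except Exception as exc:  # the real pipeline would catch narrower errors
            errors[name] = exc
    raise RuntimeError(f"All TTS engines failed: {errors}")

# Example with dummy engines: the first raises, the second succeeds.
def bark_stub(text):
    raise MemoryError("CUDA out of memory")

def xtts_stub(text):
    return b"\x00" * 16  # placeholder audio bytes

engine_used, audio = synthesize_with_fallback(
    "Hola", [("bark", bark_stub), ("xtts", xtts_stub)]
)
print(engine_used)  # -> xtts
```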
The system automatically:
- Identifies different speakers using diarization
- Creates unique voice profiles with embeddings
- Maintains consistent voice characteristics per speaker
- Saves profiles for reuse in future projects
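One way to implement the profile step (a sketch, not the project's actual code): average each speaker's segment embeddings and persist the result as JSON so later projects can reuse the same voice.

```python
# Hypothetical voice-profile builder: average per-speaker embeddings
# and save them for reuse. Embeddings are plain lists of floats here;
# the real pipeline would produce them with a speaker-encoder model.
import json
import pathlib

def build_profiles(segments):
    """segments: list of (speaker_id, embedding) pairs."""
    sums, counts = {}, {}
    for spk, emb in segments:
        acc = sums.setdefault(spk, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[spk] = counts.get(spk, 0) + 1
    return {spk: [v / counts[spk] for v in acc] for spk, acc in sums.items()}

def save_profiles(profiles, path):
    pathlib.Path(path).write_text(json.dumps(profiles))

profiles = build_profiles([
    ("SPEAKER_00", [0.0, 1.0]),
    ("SPEAKER_00", [0.5, 0.5]),
    ("SPEAKER_01", [1.0, 0.0]),
])
print(profiles["SPEAKER_00"])  # -> [0.25, 0.75]
```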
Detected emotions include:
- Happy - Cheerful, upbeat tone
- Sad - Somber, melancholic tone
- Angry - Intense, forceful tone
- Surprised - Excited, elevated pitch
- Fear - Tense, uncertain tone
- Neutral - Standard conversational tone
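A detected label has to be carried into synthesis somehow; one hypothetical approach is to map each emotion to a style hint prepended to the TTS prompt. The hint tokens below are purely illustrative, not official Bark control tokens.

```python
# Hypothetical emotion-to-prompt mapping (hint tokens are illustrative).
EMOTION_HINTS = {
    "happy": "[cheerful] ",
    "sad": "[somber] ",
    "angry": "[forceful] ",
    "surprised": "[excited] ",
    "fear": "[tense] ",
    "neutral": "",
}

def styled_prompt(text, emotion):
    # Unknown labels fall back to the plain text.
    return EMOTION_HINTS.get(emotion, "") + text

print(styled_prompt("I can't believe it!", "surprised"))  # -> [excited] I can't believe it!
```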
The system analyzes and transfers:
- Pitch Contour - F0 trajectory over time
- Energy Levels - Volume and intensity
- Speaking Rate - Tempo and rhythm
- Pauses - Natural breaks in speech
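Two of these features are cheap to compute directly, as the sketch below shows: per-frame RMS energy and overall speaking rate. Pitch (F0) tracking would normally use a library such as librosa and is omitted here.

```python
# Sketch of two prosody features: framewise RMS energy and words/sec.
import math

def frame_rms(samples, frame_len):
    """RMS energy per non-overlapping frame of `frame_len` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speaking_rate(text, duration_sec):
    """Very rough tempo proxy: words per second."""
    return len(text.split()) / duration_sec

rms = frame_rms([0.0, 1.0, 0.0, -1.0], 2)
rate = speaking_rate("hello there world", 1.5)
print(len(rms), rate)  # -> 2 2.0
```

Matching these statistics between the source segment and the synthesized audio (e.g. time-stretching to match rate, gain-scaling to match energy) is the core of the transfer step.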
Using AudioCraft (optional):
- Preserves background music quality
- Maintains spatial audio characteristics
- Applies intelligent EQ to separate voices from music
- Dynamic compression for balanced output
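The gain side of this mixing can be sketched as simple "ducking": attenuate the background wherever the dubbed vocals are loud, then sum the tracks. Real mixing here also applies EQ and compression; this shows only the gain logic, with samples as plain floats.

```python
# Sketch of sample-wise ducking: lower the background gain when the
# vocal sample exceeds a threshold, then mix the two tracks.
def duck_and_mix(vocals, background, threshold=0.1, duck_gain=0.3):
    mixed = []
    for v, b in zip(vocals, background):
        gain = duck_gain if abs(v) > threshold else 1.0
        mixed.append(v + gain * b)
    return mixed

print(duck_and_mix([0.0, 0.5], [0.2, 0.2]))  # -> approximately [0.2, 0.56]
```

A production mixer would smooth the gain changes (attack/release envelopes) to avoid audible pumping; this sketch switches gain instantly for clarity.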
```bash
# Required for speaker diarization
export HF_TOKEN="your_huggingface_token"

# Optional: CUDA configuration
export CUDA_VISIBLE_DEVICES=0

# Optional: Custom working directory
export DUBBING_WORKSPACE="/path/to/workspace"
```

```python
class Config:
    CHUNK_DURATION_SEC = 180  # Chunk videos into 3-minute segments
    SAMPLE_RATE = 22050       # Audio sample rate
    WHISPER_MODEL = "medium"  # or "large-v3" for better quality
    TTS_PRIORITY = ["bark", "xtts", "edge"]  # Engine priority order
```

| Video Length | Without Lip Sync | With Lip Sync | Hardware |
|---|---|---|---|
| 1 minute | ~2-3 minutes | ~5-10 minutes | RTX 3080 |
| 5 minutes | ~10-15 minutes | ~25-50 minutes | RTX 3080 |
| 30 minutes | ~1-1.5 hours | ~2.5-5 hours | RTX 3080 |
| 1 hour | ~2-3 hours | ~5-10 hours | RTX 3080 |
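Long videos stay tractable because of the chunking controlled by `CHUNK_DURATION_SEC`; the splitting itself amounts to the sketch below (illustrative only, not the project's exact code).

```python
# Sketch of time-based chunking: split a duration into (start, end)
# windows of at most `chunk_sec` seconds, processed independently.
def chunk_spans(duration_sec, chunk_sec=180):
    spans = []
    start = 0.0
    while start < duration_sec:
        spans.append((start, min(start + chunk_sec, duration_sec)))
        start += chunk_sec
    return spans

print(chunk_spans(400.0, 180))  # -> [(0.0, 180.0), (180.0, 360.0), (360.0, 400.0)]
```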
Minimum (CPU only):
- CPU: 4+ cores
- RAM: 8GB
- Storage: 10GB free
- Time: ~10x video length
Recommended (GPU):
- GPU: NVIDIA RTX 2060 or better (6GB+ VRAM)
- CPU: 6+ cores
- RAM: 16GB
- Storage: 20GB free
- Time: ~2-3x video length
Optimal (High-end GPU):
- GPU: NVIDIA RTX 3080/4080 (10GB+ VRAM)
- CPU: 8+ cores
- RAM: 32GB
- Storage: 50GB free
- Time: ~1-2x video length
1. "CUDA out of memory"

```python
# Solution: Reduce chunk duration
Config.CHUNK_DURATION_SEC = 120  # Reduce from 180 to 120 seconds
```

2. "Pyannote models not found"

```bash
# Solution: Set HuggingFace token
export HF_TOKEN="your_token"
# Or disable diarization (single speaker voice)
```

3. "Bark model slow/crashing"

```python
# Solution: Prioritize faster engines
Config.TTS_PRIORITY = ["xtts", "edge", "bark"]
```

4. "FFmpeg not found"

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from: https://ffmpeg.org/download.html
```

5. "Lip sync quality poor"

- Ensure input video has clear, frontal faces
- Try a lower resolution input video
- Increase `--resize_factor` in Wav2Lip settings
- Consider audio-only dubbing (faster and more reliable)
Contributions welcome! Areas for improvement:
- Support for more languages
- Better emotion classification models
- Real-time processing
- Web API/REST interface
- Docker containerization
- Improved lip sync quality
- Voice gender detection
- Subtitle generation
- Batch processing UI
This project uses multiple open-source models and libraries:
- Whisper - MIT License (OpenAI)
- Bark - MIT License (Suno AI)
- XTTS - MPL 2.0 (Coqui)
- Demucs - MIT License (Meta)
- Pyannote - MIT License
- Wav2Lip - See original repository
Please review individual licenses before commercial use.
Built with amazing open-source projects:
- OpenAI Whisper - Speech recognition
- Suno Bark - Emotional TTS
- Coqui TTS - Voice cloning
- Meta Demucs - Source separation
- Pyannote Audio - Speaker diarization
- Wav2Lip - Lip synchronization
- Meta AudioCraft - Audio generation
For issues, questions, or feature requests:
- Open an issue on GitHub
- Check existing documentation
- Review troubleshooting section
- Real-time emotion adjustment UI
- Voice profile marketplace
- Batch processing queue
- REST API
- Custom TTS model training
- Advanced lip sync (SadTalker integration)
- Multi-track audio mixing
- Subtitle sync
- Real-time dubbing (low-latency)
- Live streaming support
- Mobile app
- Cloud processing
Made with ❤️ for the open-source community
Version: 3.0.0
Last Updated: February 2026
Status: Active Development