Emotion-Aware • Character-Consistent • Professional Quality
A complete AI-powered video dubbing solution featuring emotion detection, character voice consistency, and multiple state-of-the-art TTS engines.
- Emotion-Aware TTS - Uses Bark AI to preserve emotional tone (happy, sad, angry, etc.)
- Character Voice Consistency - Maintains unique voice profiles for each speaker
- Multi-Speaker Detection - Automatic speaker diarization using Pyannote
- Prosody Transfer - Matches pitch, rhythm, and energy from original speech
- Music-Aware Processing - Uses AudioCraft for background score preservation
- Voice Profile Management - Save and reuse character voices across projects
- Advanced Audio Mixing - Dynamic EQ, compression, and spatial audio
- Lip Sync Support - Optional Wav2Lip integration for tight synchronization
- Smart Chunking - Handles videos of any length automatically
- Multi-Language - Supports 9+ languages with native TTS engines
- Emotion Detection - Analyzes emotional content using wav2vec2 models
- High-Quality Separation - Demucs-based vocal/background isolation
- Fallback System - Multiple TTS engines (Bark → XTTS → Edge TTS)
- Real-Time Progress - Live updates during processing
- Efficient Caching - Reuses voice profiles and models
```bash
# Clone or download the files
cd dubbing_pipeline_v3

# Install dependencies
pip install -r requirements.txt

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install ffmpeg sox libsox-dev libsndfile1

# For speaker diarization (optional but recommended)
# Get a HuggingFace token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token_here"
```

**Option 1: Jupyter Notebook**

```bash
jupyter notebook dubbing_pipeline_v3_enhanced.ipynb
```

**Option 2: Standalone Python Script**

```bash
python dubbing_app_v3.py
```

**Option 3: Command Line (coming soon)**

```bash
python dubbing_cli.py input_video.mp4 --lang es --emotion
```

- Upload Video - Drag and drop or select your video file
- Choose Language - Select target language from dropdown
- Configure Options:
  - Emotion-Aware TTS - Preserves emotional tone (recommended)
  - Lip Sync - Applies lip synchronization (slower, requires a good GPU)
- Start Dubbing - Click the button and wait for processing
- Download Result - Download your dubbed video
| Language | Code | TTS Quality | Emotion Support |
|---|---|---|---|
| Spanish | es | ⭐⭐⭐⭐⭐ | ✅ |
| French | fr | ⭐⭐⭐⭐⭐ | ✅ |
| German | de | ⭐⭐⭐⭐⭐ | ✅ |
| Italian | it | ⭐⭐⭐⭐ | ✅ |
| Japanese | ja | ⭐⭐⭐⭐ | ✅ |
| Korean | ko | ⭐⭐⭐⭐ | ✅ |
| Chinese | zh | ⭐⭐⭐⭐ | ✅ |
| Hindi | hi | ⭐⭐⭐⭐ | ✅ |
| English | en | ⭐⭐⭐⭐⭐ | ✅ |
```
Input Video
    ↓
[1] Audio Separation (Demucs)
    ├── Vocals
    └── Background Music
    ↓
[2] Transcription (Whisper)
    └── Timestamped text segments
    ↓
[3] Speaker Diarization (Pyannote)
    └── Identify who speaks when
    ↓
[4] Emotion Detection (wav2vec2)
    └── Detect emotional tone per segment
    ↓
[5] Voice Profile Creation
    └── Extract embeddings for each speaker
    ↓
[6] Translation (Google Translator)
    └── Translate to target language
    ↓
[7] TTS Synthesis (Bark/XTTS/Edge)
    └── Generate emotional, character-consistent speech
    ↓
[8] Prosody Transfer
    └── Match pitch and rhythm
    ↓
[9] Audio Mixing
    └── Combine vocals + background
    ↓
[10] Video Merging (optional lip sync)
    └── Final dubbed video
```
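In outline, the ten stages above run as a simple sequential driver that threads each stage's output into the next. The sketch below is illustrative only: the stage functions are placeholders, not the project's real Demucs/Whisper/Pyannote wrappers.

```python
# Minimal sketch of the staged flow: each stage reads the running
# "context" dict and contributes its own artifact under its name.
# Stage bodies here are stand-ins for the real model calls.
def run_pipeline(video_path, stages):
    ctx = {"video": video_path}
    for name, stage in stages:
        ctx[name] = stage(ctx)
    return ctx

ctx = run_pipeline("input.mp4", [
    ("separation", lambda c: {"vocals": "vocals.wav", "music": "music.wav"}),
    ("transcription", lambda c: [{"start": 0.0, "end": 2.1, "text": "Hello"}]),
])
print(sorted(ctx))  # -> ['separation', 'transcription', 'video']
```

Keeping each stage a pure function of the shared context makes it easy to cache or skip stages (e.g. reuse a saved separation result) between runs.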
The system intelligently selects the best TTS engine based on requirements:
- **Bark** (Priority 1) - When emotion is important
  - ✅ Best emotional expression
  - ✅ Natural prosody
  - ❌ Slower generation
  - ❌ English-focused (but works for others)
- **XTTS** (Priority 2) - When voice cloning is important
  - ✅ Excellent voice cloning
  - ✅ Multi-language support
  - ✅ Fast generation
  - ❌ Less emotional variety
- **Edge TTS** (Priority 3) - Reliable fallback
  - ✅ Always available
  - ✅ Very fast
  - ✅ High quality
  - ❌ No voice cloning
  - ❌ Limited emotion control
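The fallback chain behaves roughly like the sketch below: try each engine in priority order and fall through on failure. The engine functions are stand-ins, not the project's actual wrapper names.

```python
# Sketch of the engine-fallback idea: attempt each (name, synth_fn)
# pair in priority order; return the first successful result.
def synthesize_with_fallback(text, engines):
    errors = {}
    for name, synth in engines:
        try:
            return name, synth(text)
        except Exception as exc:  # the real pipeline would catch narrower errors
            errors[name] = exc
    raise RuntimeError(f"All TTS engines failed: {errors}")

# Example with dummy engines: the first raises, the second succeeds.
def bark_stub(text):
    raise MemoryError("CUDA out of memory")

def xtts_stub(text):
    return b"\x00" * 16  # placeholder audio bytes

engine_used, audio = synthesize_with_fallback(
    "Hola", [("bark", bark_stub), ("xtts", xtts_stub)]
)
print(engine_used)  # -> xtts
```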
The system automatically:
- Identifies different speakers using diarization
- Creates unique voice profiles with embeddings
- Maintains consistent voice characteristics per speaker
- Saves profiles for reuse in future projects
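One way to implement the profile step (a sketch, not the project's actual code): average each speaker's segment embeddings and persist the result as JSON so later projects can reuse the same voice.

```python
# Hypothetical voice-profile builder: average per-speaker embeddings
# and save them for reuse. Embeddings are plain lists of floats here;
# the real pipeline would produce them with a speaker-encoder model.
import json
import pathlib

def build_profiles(segments):
    """segments: list of (speaker_id, embedding) pairs."""
    sums, counts = {}, {}
    for spk, emb in segments:
        acc = sums.setdefault(spk, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[spk] = counts.get(spk, 0) + 1
    return {spk: [v / counts[spk] for v in acc] for spk, acc in sums.items()}

def save_profiles(profiles, path):
    pathlib.Path(path).write_text(json.dumps(profiles))

profiles = build_profiles([
    ("SPEAKER_00", [0.0, 1.0]),
    ("SPEAKER_00", [0.5, 0.5]),
    ("SPEAKER_01", [1.0, 0.0]),
])
print(profiles["SPEAKER_00"])  # -> [0.25, 0.75]
```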
Detected emotions include:
- Happy - Cheerful, upbeat tone
- Sad - Somber, melancholic tone
- Angry - Intense, forceful tone
- Surprised - Excited, elevated pitch
- Fear - Tense, uncertain tone
- Neutral - Standard conversational tone
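A detected label has to be carried into synthesis somehow; one hypothetical approach is to map each emotion to a style hint prepended to the TTS prompt. The hint tokens below are purely illustrative, not official Bark control tokens.

```python
# Hypothetical emotion-to-prompt mapping (hint tokens are illustrative).
EMOTION_HINTS = {
    "happy": "[cheerful] ",
    "sad": "[somber] ",
    "angry": "[forceful] ",
    "surprised": "[excited] ",
    "fear": "[tense] ",
    "neutral": "",
}

def styled_prompt(text, emotion):
    # Unknown labels fall back to the plain text.
    return EMOTION_HINTS.get(emotion, "") + text

print(styled_prompt("I can't believe it!", "surprised"))  # -> [excited] I can't believe it!
```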
The system analyzes and transfers:
- Pitch Contour - F0 trajectory over time
- Energy Levels - Volume and intensity
- Speaking Rate - Tempo and rhythm
- Pauses - Natural breaks in speech
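Two of these features are cheap to compute directly, as the sketch below shows: per-frame RMS energy and overall speaking rate. Pitch (F0) tracking would normally use a library such as librosa and is omitted here.

```python
# Sketch of two prosody features: framewise RMS energy and words/sec.
import math

def frame_rms(samples, frame_len):
    """RMS energy per non-overlapping frame of `frame_len` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speaking_rate(text, duration_sec):
    """Very rough tempo proxy: words per second."""
    return len(text.split()) / duration_sec

rms = frame_rms([0.0, 1.0, 0.0, -1.0], 2)
rate = speaking_rate("hello there world", 1.5)
print(len(rms), rate)  # -> 2 2.0
```

Matching these statistics between the source segment and the synthesized audio (e.g. time-stretching to match rate, gain-scaling to match energy) is the core of the transfer step.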
Using AudioCraft (optional):
- Preserves background music quality
- Maintains spatial audio characteristics
- Applies intelligent EQ to separate voices from music
- Dynamic compression for balanced output
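The gain side of this mixing can be sketched as simple "ducking": attenuate the background wherever the dubbed vocals are loud, then sum the tracks. Real mixing here also applies EQ and compression; this shows only the gain logic, with samples as plain floats.

```python
# Sketch of sample-wise ducking: lower the background gain when the
# vocal sample exceeds a threshold, then mix the two tracks.
def duck_and_mix(vocals, background, threshold=0.1, duck_gain=0.3):
    mixed = []
    for v, b in zip(vocals, background):
        gain = duck_gain if abs(v) > threshold else 1.0
        mixed.append(v + gain * b)
    return mixed

print(duck_and_mix([0.0, 0.5], [0.2, 0.2]))  # -> approximately [0.2, 0.56]
```

A production mixer would smooth the gain changes (attack/release envelopes) to avoid audible pumping; this sketch switches gain instantly for clarity.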
```bash
# Required for speaker diarization
export HF_TOKEN="your_huggingface_token"

# Optional: CUDA configuration
export CUDA_VISIBLE_DEVICES=0

# Optional: Custom working directory
export DUBBING_WORKSPACE="/path/to/workspace"
```

```python
class Config:
    CHUNK_DURATION_SEC = 180  # Chunk videos into 3-minute segments
    SAMPLE_RATE = 22050       # Audio sample rate
    WHISPER_MODEL = "medium"  # or "large-v3" for better quality
    TTS_PRIORITY = ["bark", "xtts", "edge"]  # Engine priority order
```

| Video Length | Without Lip Sync | With Lip Sync | Hardware |
|---|---|---|---|
| 1 minute | ~2-3 minutes | ~5-10 minutes | RTX 3080 |
| 5 minutes | ~10-15 minutes | ~25-50 minutes | RTX 3080 |
| 30 minutes | ~1-1.5 hours | ~2.5-5 hours | RTX 3080 |
| 1 hour | ~2-3 hours | ~5-10 hours | RTX 3080 |
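Long videos stay tractable because of the chunking controlled by `CHUNK_DURATION_SEC`; the splitting itself amounts to the sketch below (illustrative only, not the project's exact code).

```python
# Sketch of time-based chunking: split a duration into (start, end)
# windows of at most `chunk_sec` seconds, processed independently.
def chunk_spans(duration_sec, chunk_sec=180):
    spans = []
    start = 0.0
    while start < duration_sec:
        spans.append((start, min(start + chunk_sec, duration_sec)))
        start += chunk_sec
    return spans

print(chunk_spans(400.0, 180))  # -> [(0.0, 180.0), (180.0, 360.0), (360.0, 400.0)]
```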
Minimum (CPU only):
- CPU: 4+ cores
- RAM: 8GB
- Storage: 10GB free
- Time: ~10x video length
Recommended (GPU):
- GPU: NVIDIA RTX 2060 or better (6GB+ VRAM)
- CPU: 6+ cores
- RAM: 16GB
- Storage: 20GB free
- Time: ~2-3x video length
Optimal (High-end GPU):
- GPU: NVIDIA RTX 3080/4080 (10GB+ VRAM)
- CPU: 8+ cores
- RAM: 32GB
- Storage: 50GB free
- Time: ~1-2x video length
1. "CUDA out of memory"

```python
# Solution: Reduce chunk duration
Config.CHUNK_DURATION_SEC = 120  # Reduce from 180 to 120 seconds
```

2. "Pyannote models not found"

```bash
# Solution: Set HuggingFace token
export HF_TOKEN="your_token"
# Or disable diarization (single speaker voice)
```

3. "Bark model slow/crashing"

```python
# Solution: Prioritize faster engines
Config.TTS_PRIORITY = ["xtts", "edge", "bark"]
```

4. "FFmpeg not found"

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from: https://ffmpeg.org/download.html
```

5. "Lip sync quality poor"

- Ensure input video has clear, frontal faces
- Try a lower resolution input video
- Increase `--resize_factor` in Wav2Lip settings
- Consider audio-only dubbing (faster and more reliable)
Contributions welcome! Areas for improvement:
- Support for more languages
- Better emotion classification models
- Real-time processing
- Web API/REST interface
- Docker containerization
- Improved lip sync quality
- Voice gender detection
- Subtitle generation
- Batch processing UI
This project uses multiple open-source models and libraries:
- Whisper - MIT License (OpenAI)
- Bark - MIT License (Suno AI)
- XTTS - MPL 2.0 (Coqui)
- Demucs - MIT License (Meta)
- Pyannote - MIT License
- Wav2Lip - See original repository
Please review individual licenses before commercial use.
Built with amazing open-source projects:
- OpenAI Whisper - Speech recognition
- Suno Bark - Emotional TTS
- Coqui TTS - Voice cloning
- Meta Demucs - Source separation
- Pyannote Audio - Speaker diarization
- Wav2Lip - Lip synchronization
- Meta AudioCraft - Audio generation
For issues, questions, or feature requests:
- Open an issue on GitHub
- Check existing documentation
- Review troubleshooting section
- Real-time emotion adjustment UI
- Voice profile marketplace
- Batch processing queue
- REST API
- Custom TTS model training
- Advanced lip sync (SadTalker integration)
- Multi-track audio mixing
- Subtitle sync
- Real-time dubbing (low-latency)
- Live streaming support
- Mobile app
- Cloud processing
Made with ❤️ for the open-source community
Version: 3.0.0
Last Updated: February 2026
Status: Active Development