harshal0704/DubCore
🎭 AI Video Dubbing Pipeline v3.0

Emotion-Aware • Character-Consistent • Professional Quality

A complete AI-powered video dubbing solution featuring emotion detection, character voice consistency, and multiple state-of-the-art TTS engines.


🌟 Key Features

Core Capabilities

  • 🎵 Emotion-Aware TTS - Uses Bark AI to preserve emotional tone (happy, sad, angry, etc.)
  • 🎭 Character Voice Consistency - Maintains a unique voice profile for each speaker
  • 🗣️ Multi-Speaker Detection - Automatic speaker diarization using Pyannote
  • 🎨 Prosody Transfer - Matches pitch, rhythm, and energy from the original speech
  • 🎺 Music-Aware Processing - Uses AudioCraft to preserve the background score

Quality Enhancements

  • 📊 Voice Profile Management - Save and reuse character voices across projects
  • 🔊 Advanced Audio Mixing - Dynamic EQ, compression, and spatial audio
  • 👄 Lip Sync Support - Optional Wav2Lip integration for tight audio-video synchronization
  • ⚡ Smart Chunking - Handles videos of any length automatically
  • 🌍 Multi-Language - Supports 9+ languages with native TTS engines

Technical Features

  • 🧠 Emotion Detection - Analyzes emotional content using wav2vec2 models
  • 🎤 High-Quality Separation - Demucs-based vocal/background isolation
  • 🔄 Fallback System - Multiple TTS engines (Bark → XTTS → Edge TTS)
  • 📈 Real-Time Progress - Live updates during processing
  • 💾 Efficient Caching - Reuses voice profiles and models

🚀 Quick Start

Installation

```bash
# Clone or download the files
cd dubbing_pipeline_v3

# Install dependencies
pip install -r requirements.txt

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install ffmpeg sox libsox-dev libsndfile1

# For speaker diarization (optional but recommended)
# Get a HuggingFace token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token_here"
```
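Before a long run, it can help to confirm the system tools and Python packages above are actually reachable. This is a minimal sanity-check sketch, not part of the repository; the tool and package names are the ones listed in the install steps.

```python
import importlib.util
import shutil

def missing_dependencies():
    """Return the names of required tools/packages that are not installed."""
    missing = []
    # System binaries installed via apt/brew
    for tool in ("ffmpeg", "sox"):
        if shutil.which(tool) is None:
            missing.append(tool)
    # Python packages from requirements.txt (import names are assumptions)
    for pkg in ("torch", "whisper"):
        if importlib.util.find_spec(pkg) is None:
            missing.append(pkg)
    return missing

if __name__ == "__main__":
    problems = missing_dependencies()
    print("Missing:", problems if problems else "none")
```

Running this once after `pip install` catches a missing `ffmpeg` before the first dubbing job fails midway.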

Run the Application

Option 1: Jupyter Notebook

```bash
jupyter notebook dubbing_pipeline_v3_enhanced.ipynb
```

Option 2: Standalone Python Script

```bash
python dubbing_app_v3.py
```

Option 3: Command Line (coming soon)

```bash
python dubbing_cli.py input_video.mp4 --lang es --emotion
```

📖 Usage Guide

Web Interface

  1. Upload Video - Drag and drop or select your video file
  2. Choose Language - Select the target language from the dropdown
  3. Configure Options:
    • ✅ Emotion-Aware TTS - Preserves emotional tone (recommended)
    • Lip Sync - Applies lip synchronization (slower; requires a capable GPU)
  4. Start Dubbing - Click the button and wait for processing
  5. Download Result - Download your dubbed video

Supported Languages

| Language | Code | TTS Quality | Emotion Support |
|----------|------|-------------|-----------------|
| Spanish  | es   | ⭐⭐⭐⭐⭐ | ✅ |
| French   | fr   | ⭐⭐⭐⭐⭐ | ✅ |
| German   | de   | ⭐⭐⭐⭐⭐ | ✅ |
| Italian  | it   | ⭐⭐⭐⭐   | ✅ |
| Japanese | ja   | ⭐⭐⭐⭐   | ✅ |
| Korean   | ko   | ⭐⭐⭐⭐   | ✅ |
| Chinese  | zh   | ⭐⭐⭐⭐   | ✅ |
| Hindi    | hi   | ⭐⭐⭐⭐   | ✅ |
| English  | en   | ⭐⭐⭐⭐⭐ | ✅ |

🧠 How It Works

Pipeline Overview

Input Video
    ↓
[1] Audio Separation (Demucs)
    ├─→ Vocals
    └─→ Background Music
    ↓
[2] Transcription (Whisper)
    └─→ Timestamped text segments
    ↓
[3] Speaker Diarization (Pyannote)
    └─→ Identify who speaks when
    ↓
[4] Emotion Detection (wav2vec2)
    └─→ Detect emotional tone per segment
    ↓
[5] Voice Profile Creation
    └─→ Extract embeddings for each speaker
    ↓
[6] Translation (Google Translator)
    └─→ Translate to target language
    ↓
[7] TTS Synthesis (Bark/XTTS/Edge)
    └─→ Generate emotional, character-consistent speech
    ↓
[8] Prosody Transfer
    └─→ Match pitch and rhythm
    ↓
[9] Audio Mixing
    └─→ Combine vocals + background
    ↓
[10] Video Merging (optional lip sync)
    └─→ Final dubbed video
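The flow above can be sketched as a stage list where each stage reads and extends a shared context dict. The stage functions below are stubs for illustration only; the real implementations (Demucs, Whisper, etc.) live in the notebook and script.

```python
def separate_audio(ctx):   # [1] Demucs in the real pipeline
    ctx["vocals"], ctx["background"] = "vocals.wav", "background.wav"

def transcribe(ctx):       # [2] Whisper: timestamped segments
    ctx["segments"] = [{"start": 0.0, "end": 2.5, "text": "Hello there"}]

def translate(ctx):        # [6] Translation to the target language
    for seg in ctx["segments"]:
        seg["translated"] = f'[{ctx["target_lang"]}] {seg["text"]}'

STAGES = [separate_audio, transcribe, translate]   # ...plus the remaining stages

def run_pipeline(video_path, target_lang="es"):
    ctx = {"video": video_path, "target_lang": target_lang}
    for stage in STAGES:
        stage(ctx)         # each stage enriches the shared context
    return ctx

result = run_pipeline("input_video.mp4")
```

Keeping every stage as a function over one context dict is what makes the fallback and caching features easy to slot in per stage.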

TTS Engine Selection Strategy

The system intelligently selects the best TTS engine based on requirements:

  1. Bark (Priority 1) - When emotion is important

    • ✅ Best emotional expression
    • ✅ Natural prosody
    • ❌ Slower generation
    • ❌ English-focused (but works for other languages)
  2. XTTS (Priority 2) - When voice cloning is important

    • ✅ Excellent voice cloning
    • ✅ Multi-language support
    • ✅ Fast generation
    • ❌ Less emotional variety
  3. Edge TTS (Priority 3) - Reliable fallback

    • ✅ Always available
    • ✅ Very fast
    • ✅ High quality
    • ❌ No voice cloning
    • ❌ Limited emotion control

🎨 Advanced Features

Character Voice Consistency

The system automatically:

  1. Identifies different speakers using diarization
  2. Creates unique voice profiles with embeddings
  3. Maintains consistent voice characteristics per speaker
  4. Saves profiles for reuse in future projects
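A voice profile can be as little as a speaker label plus an embedding vector, cached so later segments from the same speaker reuse the same voice. A minimal sketch with illustrative field names (the real pipeline would also persist profiles to disk):

```python
from dataclasses import dataclass

@dataclass
class VoiceProfile:
    speaker: str
    embedding: list            # speaker embedding from the diarization/encoder step
    emotion_baseline: str = "neutral"

class ProfileStore:
    """In-memory cache keyed by diarization speaker label."""
    def __init__(self):
        self._profiles = {}

    def get_or_create(self, speaker, embedding):
        if speaker not in self._profiles:
            self._profiles[speaker] = VoiceProfile(speaker, embedding)
        return self._profiles[speaker]

store = ProfileStore()
p1 = store.get_or_create("SPEAKER_00", [0.1, 0.9])
p2 = store.get_or_create("SPEAKER_00", [0.2, 0.8])   # same speaker → same profile
assert p1 is p2
```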

Emotion Preservation

Detected emotions include:

  • 😊 Happy - Cheerful, upbeat tone
  • 😢 Sad - Somber, melancholic tone
  • 😠 Angry - Intense, forceful tone
  • 😮 Surprised - Excited, elevated pitch
  • 😨 Fear - Tense, uncertain tone
  • 😐 Neutral - Standard conversational tone

Prosody Transfer

The system analyzes and transfers:

  • Pitch Contour - F0 trajectory over time
  • Energy Levels - Volume and intensity
  • Speaking Rate - Tempo and rhythm
  • Pauses - Natural breaks in speech
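Energy transfer, for instance, reduces to scaling the synthesized samples so their RMS level matches the original segment. A pure-Python sketch (real code would operate on numpy arrays at the configured sample rate):

```python
import math

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def match_energy(synth, reference):
    """Scale synthesized samples so their RMS matches the reference segment."""
    synth_rms = rms(synth)
    if synth_rms == 0:
        return list(synth)          # silence stays silence
    gain = rms(reference) / synth_rms
    return [s * gain for s in synth]

loud = [0.5, -0.5, 0.5, -0.5]       # original segment
quiet = [0.1, -0.1, 0.1, -0.1]      # synthesized output, too quiet
matched = match_energy(quiet, loud)
# rms(matched) now equals rms(loud)
```

Pitch contour and speaking rate follow the same pattern: measure the feature on both signals, then warp the synthesized audio toward the original.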

Music & Background Handling

Using AudioCraft (optional):

  • Preserves background music quality
  • Maintains spatial audio characteristics
  • Applies intelligent EQ to separate voices from music
  • Dynamic compression for balanced output
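Even without AudioCraft, a basic mix can duck the background under the dubbed vocals. A minimal per-sample sketch; the gain and threshold values are illustrative.

```python
def mix(vocals, background, bg_gain=0.4, duck_gain=0.15):
    """Sum vocals and background, attenuating background where vocals are active."""
    out = []
    for v, b in zip(vocals, background):
        g = duck_gain if abs(v) > 0.01 else bg_gain   # duck the music under speech
        out.append(max(-1.0, min(1.0, v + b * g)))    # clip to the [-1, 1] range
    return out

mixed = mix([0.5, 0.0, -0.5], [0.2, 0.2, 0.2])
```

A real implementation would smooth the gain over time (attack/release) instead of switching it per sample, which is what the dynamic compression mentioned above provides.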

⚙️ Configuration

Environment Variables

```bash
# Required for speaker diarization
export HF_TOKEN="your_huggingface_token"

# Optional: CUDA configuration
export CUDA_VISIBLE_DEVICES=0

# Optional: Custom working directory
export DUBBING_WORKSPACE="/path/to/workspace"
```

Config Options (in code)

```python
class Config:
    CHUNK_DURATION_SEC = 180      # Chunk videos into 3-minute segments
    SAMPLE_RATE = 22050           # Audio sample rate
    WHISPER_MODEL = "medium"      # or "large-v3" for better quality
    TTS_PRIORITY = ["bark", "xtts", "edge"]  # Engine priority order
```
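The chunking setting translates into a simple split of the video timeline. A sketch of how a duration maps to chunk boundaries under `CHUNK_DURATION_SEC`:

```python
import math

CHUNK_DURATION_SEC = 180   # mirrors Config.CHUNK_DURATION_SEC above

def chunk_bounds(duration_sec, chunk_sec=CHUNK_DURATION_SEC):
    """Return (start, end) second offsets covering the whole video."""
    n = math.ceil(duration_sec / chunk_sec)
    return [(i * chunk_sec, min((i + 1) * chunk_sec, duration_sec))
            for i in range(n)]

chunk_bounds(400)  # → [(0, 180), (180, 360), (360, 400)]
```

Lowering `CHUNK_DURATION_SEC` produces more, smaller chunks, which is exactly the lever the troubleshooting section pulls for out-of-memory errors.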

📊 Performance

Processing Time

| Video Length | Without Lip Sync | With Lip Sync | Hardware |
|--------------|------------------|---------------|----------|
| 1 minute     | ~2-3 minutes     | ~5-10 minutes  | RTX 3080 |
| 5 minutes    | ~10-15 minutes   | ~25-50 minutes | RTX 3080 |
| 30 minutes   | ~1-1.5 hours     | ~2.5-5 hours   | RTX 3080 |
| 1 hour       | ~2-3 hours       | ~5-10 hours    | RTX 3080 |

Hardware Requirements

Minimum (CPU only):

  • CPU: 4+ cores
  • RAM: 8GB
  • Storage: 10GB free
  • Time: ~10x video length

Recommended (GPU):

  • GPU: NVIDIA RTX 2060 or better (6GB+ VRAM)
  • CPU: 6+ cores
  • RAM: 16GB
  • Storage: 20GB free
  • Time: ~2-3x video length

Optimal (High-end GPU):

  • GPU: NVIDIA RTX 3080/4080 (10GB+ VRAM)
  • CPU: 8+ cores
  • RAM: 32GB
  • Storage: 50GB free
  • Time: ~1-2x video length
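The tiers above amount to a multiplier on video length. A small estimator sketch, using the upper-bound multipliers from the lists above (the tier names and the lip-sync factor are assumptions for illustration):

```python
# Rough per-tier multipliers (upper bounds, without lip sync)
TIME_FACTOR = {"cpu": 10.0, "gpu": 3.0, "high_end_gpu": 2.0}

def estimate_minutes(video_minutes, tier="gpu", lip_sync=False):
    """Rough processing-time estimate in minutes for a given hardware tier."""
    factor = TIME_FACTOR[tier]
    if lip_sync:
        factor *= 3   # pessimistic; lip sync multiplies cost in the table above
    return video_minutes * factor

estimate_minutes(5, "gpu")   # ~15 minutes, matching the performance table
```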

🔧 Troubleshooting

Common Issues

1. "CUDA out of memory"

```python
# Solution: Reduce chunk duration
Config.CHUNK_DURATION_SEC = 120  # Reduce from 180 to 120 seconds
```

2. "Pyannote models not found"

```bash
# Solution: Set your HuggingFace token
export HF_TOKEN="your_token"
# Or disable diarization (single speaker voice)
```

3. "Bark model slow/crashing"

```python
# Solution: Prioritize faster engines
Config.TTS_PRIORITY = ["xtts", "edge", "bark"]
```

4. "FFmpeg not found"

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from: https://ffmpeg.org/download.html
```

5. "Lip sync quality poor"

  • Ensure the input video has clear, frontal faces
  • Try a lower-resolution input video
  • Increase --resize_factor in the Wav2Lip settings
  • Consider audio-only dubbing (faster and more reliable)

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Support for more languages
  • Better emotion classification models
  • Real-time processing
  • Web API/REST interface
  • Docker containerization
  • Improved lip sync quality
  • Voice gender detection
  • Subtitle generation
  • Batch processing UI

📝 License

This project uses multiple open-source models and libraries:

  • Whisper - MIT License (OpenAI)
  • Bark - MIT License (Suno AI)
  • XTTS - MPL 2.0 (Coqui)
  • Demucs - MIT License (Meta)
  • Pyannote - MIT License
  • Wav2Lip - See original repository

Please review individual licenses before commercial use.


🙏 Acknowledgments

Built on the open-source projects listed under License: Whisper, Bark, XTTS, Demucs, Pyannote, and Wav2Lip.


📧 Support

For issues, questions, or feature requests:

  • Open an issue on GitHub
  • Check existing documentation
  • Review troubleshooting section

🗺️ Roadmap

v3.1 (Next Release)

  • Real-time emotion adjustment UI
  • Voice profile marketplace
  • Batch processing queue
  • REST API

v3.2

  • Custom TTS model training
  • Advanced lip sync (SadTalker integration)
  • Multi-track audio mixing
  • Subtitle sync

v4.0

  • Real-time dubbing (low-latency)
  • Live streaming support
  • Mobile app
  • Cloud processing

Made with ❤️ for the open-source community

Version: 3.0.0
Last Updated: February 2026
Status: Active Development

About

Cinema is a universal language, but dialogue shouldn't be a hurdle. Our tool turns 'reading a movie' back into 'watching a masterpiece.' No barriers. Just film.
