Non-realtime dubbing - high-fidelity, open-access dubbing from English (En) into non-English languages.
Below is a quick demo of dubpls (work in progress :). Note how dubpls tries to preserve the original speaker's voice and emotion.
| Original (Undubbed) | Dubbed (French) |
|---|---|
| *(demo video)* | *(demo video)* |
Video Credit: Marvel Studios (Deadpool vs. Wolverine)
A. Demixing:
- De-mixes the input audio into dialog + M&E, i.e., Music & Effects.
- Choice: TIGER / Demucs v4 (see the Demucs sketch below).
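For illustration, here is a minimal sketch of the demixing step using the Demucs v4 (htdemucs) Python API; the file names and the mono-to-stereo handling are assumptions, not the project's actual code:

```python
import torch
import torchaudio
from demucs.pretrained import get_model
from demucs.apply import apply_model

# Load the pretrained Hybrid Transformer Demucs (v4) model.
model = get_model("htdemucs")
model.eval()

wav, sr = torchaudio.load("input_movie_audio.wav")              # (channels, time)
wav = torchaudio.functional.resample(wav, sr, model.samplerate)
if wav.shape[0] == 1:                                           # the model expects stereo
    wav = wav.repeat(2, 1)

with torch.no_grad():
    # apply_model returns (batch, sources, channels, time)
    sources = apply_model(model, wav[None], device="cpu")[0]

stems = dict(zip(model.sources, sources))                       # drums / bass / other / vocals
dialog = stems["vocals"]
m_and_e = sum(s for name, s in stems.items() if name != "vocals")

torchaudio.save("dialog.wav", dialog, model.samplerate)
torchaudio.save("m_and_e.wav", m_and_e, model.samplerate)
```

Summing the non-vocal stems back into a single M&E track leaves the music and effects untouched for remixing after TTS.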
B. Speaker Diarization & Identification:
- Maps each speech segment to a new or previously registered speaker ID.
- Audio-only diarization: we use audio alone to attribute each segment to the correct speaker. This mainly helps zero-shot voice cloning, so that we don't clone a female speaker's voice from a male speaker's earlier audio, etc. WhisperX helps with correct timestamp mapping (more on this below; see the sketch after this list).
- Choice: pyannote.audio + WhisperX
- (Optional) audio-visual diarization: uses the video stream to ensure we translate only the foreground persons and not some background noise. Choice: TalkNet, Dolphin.
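For reference, a minimal sketch of the transcription → alignment → diarization flow, following the public WhisperX README (the API varies across versions; the model size, file name, and token variable are assumptions):

```python
import whisperx

device = "cuda"
hf_token = "hf_..."  # HF read-access token, see the setup section below

audio = whisperx.load_audio("dialog.wav")

# 1. Transcribe with batched Whisper (VAD-gated, which curbs hallucinations).
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment: attach a timestamp to every word.
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. Diarize, then attribute each aligned word to a speaker.
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(result["segments"])  # segments now carry word timings and speaker labels
```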
C. Transcription & Translation:
- For multi-speaker conversations, especially those lacking a clear gap between consecutive speakers, we need to know exactly when each word was spoken (a.k.a. Forced Alignment). This is done by WhisperX, which uses VAD (to handle hallucinations) and maps a timestamp to each word in the transcribed text (see the sketch above).
- For translation, we use IndicTrans2, as it beats most general EN-HI translators.
- Isomorphic Translation: this keeps the syllable count of the output text close to that of the input (e.g., within ~10% of it). Choice: any local LLM, such as a quantized Llama-3-8B-Instruct (see the prompt sketch after this list). This can also help add consistency checks, e.g., Prompt: "Ensure the honorifics (Aap/Tum) remain consistent for Speaker A across this dialogue conversation."
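One way to run the isomorphic-translation prompt locally is llama-cpp-python with a GGUF quant; the model path, prompt wording, and the 10% tolerance phrasing are assumptions for illustration:

```python
from llama_cpp import Llama

# Any local instruct model works; a Q4 quant of Llama-3-8B-Instruct is assumed here.
llm = Llama(model_path="models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", n_ctx=4096)

def isomorphic_translate(text: str, target_lang: str = "French") -> str:
    """Translate a dialogue line, asking for a syllable count within ~10% of the source."""
    out = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": (f"You are a dubbing translator. Translate English dialogue into "
                         f"{target_lang}. Keep the syllable count within 10% of the source, "
                         f"and keep honorifics consistent for each speaker. "
                         f"Return only the translation.")},
            {"role": "user", "content": text},
        ],
        temperature=0.3,
    )
    return out["choices"][0]["message"]["content"].strip()

print(isomorphic_translate("Wait, you owe me an apology."))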
D. TTS & Zero-shot voice & emotion cloning:
- To clone voices, we prefer CosyVoice 3.0 over F5-TTS.
- We can extract emotion using Speech Emotion Recognition (SER). Choice: wav2vec2-large-robust-12-ft-emotion-msp-dim. This should predict the emotion and add the corresponding control token (e.g., <|angry|>) to the CosyVoice input (see the sketch after this list).
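A rough sketch of the cloning step, following the CosyVoice2-style Python API from the official repo (the 3.0 interface may differ; the paths, prompt text, and emotion-token placement are assumptions):

```python
import sys
sys.path.append("third_party/Matcha-TTS")  # required by the CosyVoice repo layout

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2("pretrained_models/CosyVoice2-0.5B")

# Reference clip of the original speaker (from the diarization step) + its transcript.
prompt_speech_16k = load_wav("speaker_A_ref.wav", 16000)
prompt_text = "Wait, you owe me an apology."

# Emotion label from SER, mapped to a control token prepended to the target text.
tts_text = "<|angry|>" + "Attends, tu me dois des excuses."

for i, out in enumerate(cosyvoice.inference_zero_shot(tts_text, prompt_text,
                                                      prompt_speech_16k, stream=False)):
    torchaudio.save(f"dub_segment_{i}.wav", out["tts_speech"], cosyvoice.sample_rate)
```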
E. (Optional) Synchronization:
- Aims for perfect audio-lip synchronization for audio-video inputs.
- Approaches like IndexTTS-2 or WSOLA time-stretching can modulate the HI output audio to fit the EN speech length (see the sketch after this list).
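A minimal sketch of WSOLA time-stretching with the audiotsm package, fitting a dubbed segment into the original segment's time slot (the file names and duration bookkeeping are assumptions):

```python
import wave

from audiotsm import wsola
from audiotsm.io.wav import WavReader, WavWriter

def fit_to_duration(in_path: str, out_path: str, target_seconds: float) -> None:
    """Time-stretch a dubbed segment so it lasts roughly target_seconds."""
    with wave.open(in_path, "rb") as w:
        in_seconds = w.getnframes() / w.getframerate()
    speed = in_seconds / target_seconds  # >1 shortens the audio, <1 lengthens it
    with WavReader(in_path) as reader:
        with WavWriter(out_path, reader.channels, reader.samplerate) as writer:
            wsola(reader.channels, speed=speed).run(reader, writer)

# e.g., squeeze a dubbed line into the original 2.4 s English slot
fit_to_duration("dub_segment_0.wav", "dub_segment_0_fit.wav", target_seconds=2.4)
```

Unlike plain resampling, WSOLA changes duration without shifting pitch, which keeps the cloned voice natural.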
- StreamSpeech - supports only Fr, En, Es, De.
- CosyVoice-3.0 TTS + zero-shot voice cloning - works for Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian, plus 18+ Chinese dialects.
- Other En-Zh models: these can also be used as references for other languages in the future.
- NVIDIA Nemotron (NIM) for a cascaded system.
- Separation of voice, music, and effects from a single audio track - really cool, e.g., from movies.
- Speaker Diarization/Separation with visual cues
- TODO LLM-based TTS models
- TODO Making NeuTTS 200x realtime
- Video-dubbing
- NLLB demo for any2any language translation
- Voice cloning using Coqui
- Fine-tune indic TTS
- this fine-tune of neutts-air
- Comparative study on Prosody
driver: 580.95.05
PyTorch==2.8.0+cu128
a. Install WhisperX using official instructions.
b. Clone the repo and install dependencies:
```bash
git clone --recursive https://github.com/pra-dan/dubpls.git
cd dubpls
pip install -r external/TIGER/requirements.txt
pip install -r requirements.txt
```
c. Install CosyVoice using the official instructions, then download the weights and install ttsfrd for CosyVoice 3.
d. Download the translation model weights and move them to the `models` directory:
| language | weights / quants |
|---|---|
| Hindi | mradermacher's quant of Sarvam |
| French | mradermacher's quant of TowerInstruct-Mistral-7B |
e. Get an HF read-access token and add it to the env. From the official WhisperX docs:
> To enable Speaker Diarization, include your Hugging Face access token (read) that you can generate from Here after the `--hf_token` argument and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1 ...
Save the token to a `.env` file, e.g.,
```bash
HF_READ_TOKEN=hf_LxPdD...
```
f. Choose the language in `config.yaml` and run:
```bash
python3 main.py
```
- Improve LLM/prompting for better tone determination.
- Add support for Hindi.
- Reduce multiple environments to one or independent services.
a. Explicitly add the lib path to avoid dependency issues:
```bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH
```
b. If you use torch>2.6, WhisperX will likely hit another issue (PyTorch 2.6 made `torch.load` weights-only by default). The suggested workaround is:
```bash
export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=true
```