
dubpls

Non-realtime, high-fidelity, open-access dubbing from English (EN) to non-English languages.

Demo

Below is a quick demo of dubpls (work in progress :). Note how dubpls tries to preserve the original speaker's voice and emotion.

[Demo videos: Original (undubbed) | Dubbed (French)]

Video credit: Marvel Studios (Deadpool & Wolverine)

Plan

A. Demixing:

  • De-mixes the source audio into a dialogue stem and an M&E (Music & Effects) stem.
  • Choice: TIGER / Demucs v4 (a minimal Demucs sketch follows this list).
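
For the Demucs v4 option, a minimal separation sketch (the paths and output directory are illustrative; TIGER would instead use its own inference script under external/TIGER):

import demucs.separate

# "vocals" becomes the dialogue stem, "no_vocals" the M&E stem
demucs.separate.main([
    "--two-stems", "vocals",
    "-n", "htdemucs",          # Demucs v4 hybrid transformer model
    "-o", "separated",         # output directory (illustrative)
    "input_movie_audio.wav",   # illustrative input path
])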

B. Speaker Diarization & Identification:

  • Maps each speech segment to a new or previously registered speaker ID.
  • Audio-only diarization: we use only the audio to assign each segment to the correct speaker. This mainly helps zero-shot voice cloning, so that we do not, for example, clone a female speaker's line using a male speaker's earlier audio. WhisperX helps with correct timestamp mapping (more on this below).
  • Choice: pyannote.audio + WhisperX (a minimal sketch follows this list).
  • (Optional) audio-visual diarization: uses the video stream to ensure we translate only the foreground speakers and not background noise. Choice: TalkNet, Dolphin.
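
A minimal audio-only diarization sketch based on the WhisperX README (exact module paths can vary between WhisperX versions, and the file path is illustrative). Its output is later merged into the word-aligned transcript from step C via whisperx.assign_word_speakers:

import os
import whisperx

device = "cuda"
audio = whisperx.load_audio("dialogue_stem.wav")   # dialogue stem from step A (illustrative path)

# pyannote-based diarization pipeline; needs the HF read token from the Setup section
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_READ_TOKEN"], device=device)
diarize_segments = diarize_model(audio)            # DataFrame with start, end, speaker columns

for _, seg in diarize_segments.iterrows():
    print(seg["speaker"], seg["start"], seg["end"])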

C. Transcription & Translation:

  • For multi-speaker conversations, especially those without a clear gap between consecutive speakers, we need to know exactly when each word was spoken (a.k.a. forced alignment). This is done by WhisperX, which uses VAD (to handle hallucinations) and maps a timestamp to every word in the transcribed text (a minimal sketch follows this list).
  • For translation, we use IndicTrans2, as it outperforms most general-purpose EN-HI translators.
  • Isomorphic translation: this keeps the syllable count of the output text close to that of the input (e.g. within 10%). Choice: any local LLM, such as a quantized Llama-3-8B-Instruct. This is also useful for consistency checks, e.g. the prompt: "Ensure the honorifics (Aap/Tum) remain consistent for Speaker A across this dialogue conversation."
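
A minimal transcription + forced-alignment sketch following the WhisperX README (model size, batch size, and paths are illustrative):

import whisperx

device = "cuda"
audio = whisperx.load_audio("dialogue_stem.wav")

# 1. Transcribe with batched Whisper; WhisperX applies VAD to cut hallucinations
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment: attach start/end timestamps to every word
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device, return_char_alignments=False)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["word"], word.get("start"), word.get("end"))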

D. TTS & Zero-shot voice & emotion cloning:

  • To clone voices, we prefer CosyVoice 3.0 over F5-TTS.
  • We can extract emotion using Speech Emotion Recognition (SER). Choice: wav2vec2-large-robust-12-ft-emotion-msp-dim. The detected emotion is then passed to CosyVoice as a control token (e.g. <|angry|>); a minimal cloning sketch follows this list.
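
A minimal zero-shot cloning sketch based on the CosyVoice 2 Python API from the CosyVoice repo; CosyVoice 3.0 usage may differ, the paths are illustrative, and prepending an SER-derived token such as <|angry|> to the synthesis text is an assumption about how emotion would be injected:

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2("pretrained_models/CosyVoice2-0.5B")        # illustrative checkpoint path
prompt_speech_16k = load_wav("speaker_A_reference.wav", 16000)     # reference clip picked via diarization

prompt_text = "You really thought you could replace me?"           # transcript of the reference clip
tts_text = "<|angry|> Tu pensais vraiment pouvoir me remplacer ?"  # SER token + translated line

for i, out in enumerate(cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save(f"dub_segment_{i}.wav", out["tts_speech"], cosyvoice.sample_rate)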

E. (Optional) Synchronization:

  • Aims for perfect audio-lip synchronization for audio-video inputs.
  • Models like IndexTTS-2, or time-scale modification algorithms like WSOLA, can stretch or compress the dubbed (e.g. HI) output audio to fit the original EN speech length (a minimal sketch follows this list).
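
A minimal WSOLA time-stretch sketch using the audiotsm package (audiotsm is not referenced by this repo; it is just one readily available WSOLA implementation, and the durations below are illustrative):

from audiotsm import wsola
from audiotsm.io.wav import WavReader, WavWriter

english_len = 2.4                 # seconds of original English speech (illustrative)
dubbed_len = 3.0                  # seconds of synthesized dubbed speech (illustrative)
speed = dubbed_len / english_len  # >1 compresses the dub, <1 stretches it

with WavReader("dub_segment_0.wav") as reader:
    with WavWriter("dub_segment_0_fit.wav", reader.channels, reader.samplerate) as writer:
        wsola(reader.channels, speed=speed).run(reader, writer)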

Resources


Setup

Tested environment:

NVIDIA driver: 580.95.05
PyTorch==2.8.0+cu128

a. Install WhisperX using the official instructions.

b. Clone the repo and install dependencies:

git clone --recursive https://github.com/pra-dan/dubpls.git
pip install -r external/TIGER/requirements.txt
pip install -r requirements.txt

c. Install CosyVoice using the official instructions, download the weights, and install ttsfrd for CosyVoice 3.

d. Download the translation model weights and move them to the "models" directory:

Language   Weights / quants
Hindi      mradermacher's quant of Sarvam
French     mradermacher's quant of TowerInstruct-Mistral-7B

e. Get a Hugging Face read-access token and add it to your environment. From the official WhisperX docs:

To enable Speaker Diarization, include your Hugging Face access token (read) that you can generate from Here after the --hf_token argument and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1 ...

Save the token to a .env file, e.g.:

HF_READ_TOKEN=hf_LxPdD...
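
At runtime the token can then be read like this (a minimal sketch, assuming python-dotenv; the repo's actual loading code may differ):

import os
from dotenv import load_dotenv

load_dotenv()                           # reads the .env file in the working directory
hf_token = os.environ["HF_READ_TOKEN"]  # pass to the WhisperX/pyannote diarization pipeline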

Run pipeline

Choose the target language in config.yaml, then run the pipeline:

python3 main.py
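
A hypothetical config.yaml layout for illustration only; the key names below are assumptions and not taken from the repo:

target_language: fr     # e.g. fr (French) or hi (Hindi)
input_video: input.mp4
output_dir: outputs/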

TODO

  • Improve LLM/prompting for better tone determination.
  • Add support for Hindi.
  • Reduce multiple environments to one or independent services.

Additional fixes

a. Explicitly add the cuDNN library path to avoid dependency issues:

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH

b. If you use torch>2.6, WhisperX will likely hit a checkpoint-loading error, because newer torch versions default torch.load to weights_only=True. The suggested workaround is:

export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=true 
