
dubpls

Non-realtime, high-fidelity, open-access dubbing from English (EN) to non-English languages.

Demo

Below is a quick demo of dubpls (work in progress :). Note how dubpls tries to preserve the original speaker's voice and emotion.

[Demo videos: Original (undubbed) | Dubbed (French)]

Video credit: Marvel Studios (Deadpool & Wolverine)

Plan

A. Demixing:

  • De-mixes the source audio into a dialogue stem and an M&E (Music & Effects) stem.
  • Choice: TIGER / Demucs v4 (a minimal Demucs sketch follows this list).
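
For the Demucs v4 option, a minimal separation sketch (the paths and output directory are illustrative; TIGER would instead use its own inference script under external/TIGER):

import demucs.separate

# "vocals" becomes the dialogue stem, "no_vocals" the M&E stem
demucs.separate.main([
    "--two-stems", "vocals",
    "-n", "htdemucs",          # Demucs v4 hybrid transformer model
    "-o", "separated",         # output directory (illustrative)
    "input_movie_audio.wav",   # illustrative input path
])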

B. Speaker Diarization & Identification:

  • Maps each speech segment to a new or previously registered speaker ID.
  • Audio-only diarization: we use only the audio to assign each segment to the correct speaker. This mainly helps zero-shot voice cloning, so that we do not, for example, clone a female speaker's line using a male speaker's earlier audio. WhisperX helps with correct timestamp mapping (more on this below).
  • Choice: pyannote.audio + WhisperX (a minimal sketch follows this list).
  • (Optional) audio-visual diarization: uses the video stream to ensure we translate only the foreground speakers and not background noise. Choice: TalkNet, Dolphin.
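
A minimal audio-only diarization sketch based on the WhisperX README (exact module paths can vary between WhisperX versions, and the file path is illustrative). Its output is later merged into the word-aligned transcript from step C via whisperx.assign_word_speakers:

import os
import whisperx

device = "cuda"
audio = whisperx.load_audio("dialogue_stem.wav")   # dialogue stem from step A (illustrative path)

# pyannote-based diarization pipeline; needs the HF read token from the Setup section
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_READ_TOKEN"], device=device)
diarize_segments = diarize_model(audio)            # DataFrame with start, end, speaker columns

for _, seg in diarize_segments.iterrows():
    print(seg["speaker"], seg["start"], seg["end"])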

C. Transcription & Translation:

  • For multi-speaker conversations, especially those without a clear gap between consecutive speakers, we need to know exactly when each word was spoken (a.k.a. forced alignment). This is done by WhisperX, which uses VAD (to handle hallucinations) and maps a timestamp to every word in the transcribed text (a minimal sketch follows this list).
  • For translation, we use IndicTrans2, as it outperforms most general-purpose EN-HI translators.
  • Isomorphic translation: this keeps the syllable count of the output text close to that of the input (e.g. within 10%). Choice: any local LLM, such as a quantized Llama-3-8B-Instruct. This is also useful for consistency checks, e.g. the prompt: "Ensure the honorifics (Aap/Tum) remain consistent for Speaker A across this dialogue conversation."
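
A minimal transcription + forced-alignment sketch following the WhisperX README (model size, batch size, and paths are illustrative):

import whisperx

device = "cuda"
audio = whisperx.load_audio("dialogue_stem.wav")

# 1. Transcribe with batched Whisper; WhisperX applies VAD to cut hallucinations
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment: attach start/end timestamps to every word
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device, return_char_alignments=False)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["word"], word.get("start"), word.get("end"))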

D. TTS & Zero-shot voice & emotion cloning:

  • To clone voices, we prefer CosyVoice 3.0 over F5-TTS.
  • We can extract emotion using Speech Emotion Recognition (SER). Choice: wav2vec2-large-robust-12-ft-emotion-msp-dim. The detected emotion is then passed to CosyVoice as a control token (e.g. <|angry|>); a minimal cloning sketch follows this list.
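
A minimal zero-shot cloning sketch based on the CosyVoice 2 Python API from the CosyVoice repo; CosyVoice 3.0 usage may differ, the paths are illustrative, and prepending an SER-derived token such as <|angry|> to the synthesis text is an assumption about how emotion would be injected:

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2("pretrained_models/CosyVoice2-0.5B")        # illustrative checkpoint path
prompt_speech_16k = load_wav("speaker_A_reference.wav", 16000)     # reference clip picked via diarization

prompt_text = "You really thought you could replace me?"           # transcript of the reference clip
tts_text = "<|angry|> Tu pensais vraiment pouvoir me remplacer ?"  # SER token + translated line

for i, out in enumerate(cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save(f"dub_segment_{i}.wav", out["tts_speech"], cosyvoice.sample_rate)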

E. (Optional) Synchronization:

  • Aims for perfect audio-lip synchronization for audio-video inputs.
  • Models like IndexTTS-2, or time-scale modification algorithms like WSOLA, can stretch or compress the dubbed (e.g. HI) output audio to fit the original EN speech length (a minimal sketch follows this list).
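
A minimal WSOLA time-stretch sketch using the audiotsm package (audiotsm is not referenced by this repo; it is just one readily available WSOLA implementation, and the durations below are illustrative):

from audiotsm import wsola
from audiotsm.io.wav import WavReader, WavWriter

english_len = 2.4                 # seconds of original English speech (illustrative)
dubbed_len = 3.0                  # seconds of synthesized dubbed speech (illustrative)
speed = dubbed_len / english_len  # >1 compresses the dub, <1 stretches it

with WavReader("dub_segment_0.wav") as reader:
    with WavWriter("dub_segment_0_fit.wav", reader.channels, reader.samplerate) as writer:
        wsola(reader.channels, speed=speed).run(reader, writer)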

Resources


Setup

Tested environment:

NVIDIA driver: 580.95.05
PyTorch==2.8.0+cu128

a. Install WhisperX using the official instructions.

b. Clone the repo and install dependencies:

git clone --recursive https://github.com/pra-dan/dubpls.git
pip install -r external/TIGER/requirements.txt
pip install -r requirements.txt

c. Install CosyVoice using the official instructions, download the weights, and install ttsfrd for CosyVoice 3.

d. Download the translation model weights and move them to the "models" directory:

Language   Weights / quants
Hindi      mradermacher's quant of Sarvam
French     mradermacher's quant of TowerInstruct-Mistral-7B

e. Get a Hugging Face read-access token and add it to your environment. From the official WhisperX docs:

To enable Speaker Diarization, include your Hugging Face access token (read) that you can generate from Here after the --hf_token argument and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1 ...

Save the token to a .env file, e.g.:

HF_READ_TOKEN=hf_LxPdD...
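
At runtime the token can then be read like this (a minimal sketch, assuming python-dotenv; the repo's actual loading code may differ):

import os
from dotenv import load_dotenv

load_dotenv()                           # reads the .env file in the working directory
hf_token = os.environ["HF_READ_TOKEN"]  # pass to the WhisperX/pyannote diarization pipeline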

Run pipeline

Choose the target language in config.yaml, then run the pipeline:

python3 main.py
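
A hypothetical config.yaml layout for illustration only; the key names below are assumptions and not taken from the repo:

target_language: fr     # e.g. fr (French) or hi (Hindi)
input_video: input.mp4
output_dir: outputs/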

TODO

  • Improve LLM/prompting for better tone determination.
  • Add support for Hindi.
  • Reduce multiple environments to one or independent services.

Additional fixes

a. Explicitly add the cuDNN library path to avoid dependency issues:

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH

b. If you use torch>2.6, WhisperX will likely hit a checkpoint-loading error, because newer torch versions default torch.load to weights_only=True. The suggested workaround is:

export TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=true 
