RADMMM

A PyTorch Lightning (PTL) repository for RADMMM and VANI: TTS models that enable a speaker to speak new languages.

Installation

Please use the Dockerfile inside docker/ to build a Docker image, or install the dependencies directly:

pip install -r requirements.txt

Getting started

To get started, please follow the steps below:

  1. Download the data. Use the links in the table below to download the open-source datasets, or use this link to download our version of them. Place the data under a directory named:
multilingual-dataset/
| Language | Train Prompts | Val Prompts | Dataset Link | Speaker Name |
|---|---|---|---|---|
| English (US) | 10000 | 10 | https://keithito.com/LJ-Speech-Dataset | LJ Speech |
| German (DE) | 10000 | 10 | https://opendata.iisys.de/opendata/Datasets/HUI-Audio-Corpus-German/dataset_full/Bernd_Ungerer.zip | Bernd Ungerer |
| French (FR) | 10000 | 10 | https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset | Nadine Eckert |
| Spanish (ES) | 10000 | 10 | https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset | Tux |
| Hindi (HI) | 8237 | 10 | https://aclanthology.org/2020.lrec-1.789.pdf | Indic TTS |
| Portuguese (BR) | 3085 | 10 | https://github.com/Edresson/TTS-Portuguese-Corpus | Edresson Casanova |
| Spanish (LATN) | 7376 | 10 | https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset | Karen Savage |
| Total | 58698 | 70 | | |
  2. Filelists are already present in datasets/.

  3. Place the vocoders in the vocoders/hifigan_vocoder directory.

  4. Preprocess the dataset to phonemize the data:

python3 scripts/phonemize_text.py -c configs/RADMMM_opensource_data_config_phonemizerless.yaml
  5. Train the model by following the steps in Training, or download the pretrained checkpoints for RADMMM (mel-spectrogram generator) as well as HiFi-GAN (vocoder), as explained in Pretrained Checkpoints.

  6. Run inference by following the steps in Inference. Inference requires the text to be phonemized, and we suggest using Phonemizer (GPL license) for best results. We also provide an alternative to Phonemizer based on language dictionaries, which can be downloaded from this link. Place these under the assets/ directory and refer to their usage in Inference; a minimal sketch of the dictionary-based approach follows below.
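
As a rough illustration of the dictionary-based alternative, the sketch below loads one of the word-to-IPA maps and phonemizes text by simple lookup. It assumes each assets/*_word_ipa_map.txt file stores one whitespace-separated word/IPA pair per line; that format, and the pass-through behavior for unknown words, are assumptions rather than details taken from this repository.

from pathlib import Path

def load_ipa_map(path: str) -> dict:
    # Build a word -> IPA dictionary from a two-column text file.
    # Assumed format: "word ipa_transcription" per line.
    ipa_map = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            ipa_map[parts[0].lower()] = parts[1]
    return ipa_map

def phonemize(text: str, ipa_map: dict) -> str:
    # Look up each word; words missing from the dictionary pass through
    # unchanged, which is where a full G2P tool like Phonemizer is stronger.
    return " ".join(ipa_map.get(word.lower(), word) for word in text.split())

# Hypothetical usage:
# en_map = load_ipa_map("assets/en_US_word_ipa_map.txt")
# print(phonemize("hello world", en_map))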

Training

Train the decoder and the attribute prediction modules using the commands below:

  1. Train the decoder (for a single GPU, use trainer.devices=1; for multi-GPU, use trainer.devices=nGPUs):
python3 tts_main.py fit -c configs/RADMMM_train_config.yaml -c configs/RADMMM_opensource_data_config_phonemizerless.yaml -c configs/RADMMM_model_config.yaml --trainer.num_nodes=1 --trainer.devices=1
  2. Train the attribute prediction modules (for a single GPU, use trainer.devices=1; for multi-GPU, use trainer.devices=nGPUs):
python3 tts_main.py fit -c configs/RADMMM_f0model_config.yaml -c configs/RADMMM_energymodel_config.yaml -c configs/RADMMM_durationmodel_config.yaml -c configs/RADMMM_vpredmodel_config.yaml -c configs/RADMMM_train_config.yaml -c configs/RADMMM_opensource_data_config_phonemizerless.yaml -c configs/RADMMM_model_config.yaml --trainer.num_nodes=1 --trainer.devices=1

Inference

Inference can be performed as shown below. A separate INPUT_FILEPATH is required; sample input files are provided in the model_inputs/ directory.

The provided example transcript demonstrates how to specify tts-mode transcripts. Note that it is possible to mix and match speaker identities for the individual attribute predictors using the keys decoder_spk_id, duration_spk_id, f0_spk_id, and energy_spk_id; any unspecified speaker IDs default to the value given for spk_id. For implementation details, please see the dataset class TextOnlyData in data.py. A hypothetical entry is sketched below.
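
For a concrete picture, this snippet writes a single hypothetical tts-mode transcript entry that mixes speakers across the attribute predictors. Only the *_spk_id keys come from the description above; the other field names and the ID values are guesses, so consult the samples in model_inputs/ and TextOnlyData in data.py for the exact schema.

import json

entry = {
    "text": "Hello world.",    # field name assumed, not taken from the repo
    "language": "en_US",       # field name assumed
    "spk_id": 0,               # fallback for any unspecified *_spk_id key
    "decoder_spk_id": 0,       # voice identity used by the decoder
    "duration_spk_id": 1,      # speaker whose durations are predicted
    "f0_spk_id": 2,            # speaker whose pitch contour is predicted
    "energy_spk_id": 0,        # speaker whose energy contour is predicted
}

with open("my_transcript.json", "w", encoding="utf-8") as f:
    json.dump([entry], f, ensure_ascii=False, indent=2)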

NOTE: the speaker-ID mapping is determined by the dictionary constructed from the training dataset. If the training dataset is modified or unavailable during inference, be sure to manually specify the original dictionary used during training as self.speaker_ids in the setup method of datamodules.py (sketched below). Similarly, for speaker statistics, use the load_speaker_stats(speaker) method to get the stats for a given speaker.
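
A minimal sketch of that override, assuming a simplified datamodule shape; the class name and the dictionary contents are illustrative stand-ins for whatever mapping your training run actually produced:

class TTSDataModule:  # stand-in name for the real datamodule in datamodules.py
    def setup(self, stage=None):
        # Pin the mapping built at training time so inference does not
        # depend on the training filelists being present. The values
        # below are placeholders; use your original training dictionary.
        self.speaker_ids = {
            "LJ Speech": 0,
            "Bernd Ungerer": 1,
            "Nadine Eckert": 2,
            # ... remaining speakers, in the original training order
        }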

# setup the model paths and configs (see Pretrained Checkpoints for details)
MODEL_PATH=<model_path>
CONFIG_PATH=<config_path>

# setup the vocoder paths and configs (see Pretrained Checkpoints for details)
VOCODER_PATH=<vocoder_path>
VOCODER_CONFIG_PATH=<vocoder_config_path>

INPUT_FILEPATH=model_inputs/resynthesis_prompts.json
#INPUT_FILEPATH=model_inputs/language_transfer_prompts.json

python tts_main.py predict -c $CONFIG_PATH --ckpt_path=$MODEL_PATH --model.predict_mode="tts" --data.inference_transcript=$INPUT_FILEPATH --model.prediction_output_dir=outdir --trainer.devices=1 --data.batch_size=1 --model.vocoder_checkpoint_path=$VOCODER_PATH --model.vocoder_config_path=$VOCODER_CONFIG_PATH --data.phonemizer_cfg='{"en_US": "assets/en_US_word_ipa_map.txt","de_DE": "assets/de_DE_word_ipa_map.txt","en_UK": "assets/en_UK_word_ipa_map.txt","es_CO": "assets/es_CO_word_ipa_map.txt","es_ES": "assets/es_ES_word_ipa_map.txt","fr_FR": "assets/fr_FR_word_ipa_map.txt","hi_HI": "assets/hi_HI_word_ipa_map.txt","pt_BR": "assets/pt_BR_word_ipa_map.txt","te_TE": "assets/te_TE_word_ipa_map.txt", "es_MX": "assets/es_ES_word_ipa_map.txt"}' 

Pretrained Checkpoints

RADMMM (mel-spectrogram synthesizer)

RADMMM checkpoint
RADMMM config

Vocoder (HiFi-GAN - waveform synthesizer)

HiFi-GAN checkpoint. HiFi-GAN config.

Reference samples using pretrained model

Link

Reference Samples and more information

Please visit this page for samples from models trained on the open-source dataset.

Support

rbadlani@nvidia.com

Contributing

Please create a new branch for your feature or bug fix. Create a PR to the main branch describing your contributions.

Authors and acknowledgment

Authors: Rohan Badlani, Rafael Valle and Kevin J. Shih.

The authors would like to thank Akshit Arora, Subhankar Ghosh, Siddharth Gururani, João Felipe Santos, Boris Ginsburg and Bryan Catanzaro for their support and guidance.

Phonemizer (GPL license) is used separately, without modifications, to convert graphemes to IPA phonemes.
Praat is used for speech manipulation.

The symbol set used in RADMMM draws heavily on Wikipedia's IPA chart.

The code in this repository is heavily inspired by or makes use of source code from the following works:

  1. RADTTS
  2. Tacotron's implementation by Keith Ito
  3. STFT code from Prem Seetharaman
  4. Masked Autoregressive Flows
  5. Flowtron
  6. neural spline functions used in this work: https://github.com/ndeutschmann/zunis
  7. Original Source for neural spline functions: https://github.com/bayesiains/nsf
  8. Bipartite Architecture based on code from WaveGlow
  9. HiFi-GAN
  10. Glow-TTS
  11. WaveGlow

Relevant Papers

Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro. RAD-MMM: Multi-lingual Multi-accented Multi-speaker Text to Speech. Interspeech 2023.

Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro. VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation. ICASSP 2023.

Rafael Valle, João Felipe Santos, Kevin J. Shih, Rohan Badlani, Bryan Catanzaro. High-Acoustic Fidelity Text To Speech Synthesis With Fine-Grained Control Of Speech Attributes. ICASSP 2023.

Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro. One TTS Alignment to Rule Them All. ICASSP 2022.

Kevin J. Shih, Rafael Valle, Rohan Badlani, Adrian Łańcucki, Wei Ping, Bryan Catanzaro. RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis. ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.

Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro. Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows.

License

MIT
