A Vietnamese TTS

Duration model + Acoustic model + HiFiGAN vocoder for Vietnamese text-to-speech applications.

Online demo at https://huggingface.co/spaces/ntt123/vietTTS.

A synthesized audio clip: clip.wav. A Colab notebook: notebook.

🔔Check out the experimental multi-speaker branch (git checkout multi-speaker) for multi-speaker support.🔔

Install

git clone https://github.com/NTT123/vietTTS.git
cd vietTTS 
pip3 install -e .

Quick start using pretrained models

bash ./scripts/quick_start.sh

Download InfoRe dataset

python ./scripts/download_aligned_infore_dataset.py

Note: this is a denoised and aligned version of the original dataset, which was donated by the InfoRe Technology company (see here). You can download the original dataset (InfoRe Technology 1) here.

See notebooks/denoise_infore_dataset.ipynb for instructions on how to denoise the dataset. We use the Montreal Forced Aligner (MFA) to align transcripts with speech, producing TextGrid files. See notebooks/align_text_audio_infore_mfa.ipynb for instructions on how to create the TextGrid files.

Train duration model

python -m vietTTS.nat.duration_trainer

Train acoustic model

python -m vietTTS.nat.acoustic_trainer

Train HiFiGAN vocoder

We use the original implementation from the HiFiGAN authors at https://github.com/jik876/hifi-gan. Use the config file at assets/hifigan/config.json to train your model.

git clone https://github.com/jik876/hifi-gan.git

# create dataset in hifi-gan format
ln -sf "$(pwd)/train_data" hifi-gan/data
cd hifi-gan/data
ls -1 *.TextGrid | sed -e 's/\.TextGrid$//' > files.txt
cd ..
head -n 100 data/files.txt > val_files.txt
tail -n +101 data/files.txt > train_files.txt
rm data/files.txt

# training
python train.py \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
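
The head/tail split in the shell commands above can also be sketched in Python, which may be convenient on systems without sed. This is only an illustration: the function names `split_file_list` and `write_file_lists` are hypothetical and not part of vietTTS or hifi-gan, and the sort order may differ slightly from what `ls` produces under some locales.

```python
from pathlib import Path


def split_file_list(data_dir, n_val=100):
    """Return (val, train) lists of file names without the .TextGrid suffix.

    Mirrors the shell recipe: the first n_val names go to validation,
    the remaining names go to training.
    """
    # Sort so that the split is reproducible between runs.
    stems = sorted(p.stem for p in Path(data_dir).glob("*.TextGrid"))
    return stems[:n_val], stems[n_val:]


def write_file_lists(data_dir, out_dir, n_val=100):
    """Write val_files.txt and train_files.txt into out_dir (hypothetical helper)."""
    val, train = split_file_list(data_dir, n_val)
    out = Path(out_dir)
    (out / "val_files.txt").write_text(
        "".join(name + "\n" for name in val), encoding="utf-8"
    )
    (out / "train_files.txt").write_text(
        "".join(name + "\n" for name in train), encoding="utf-8"
    )
    return len(val), len(train)
```

Under these assumptions, calling `write_file_lists("hifi-gan/data", "hifi-gan")` from the vietTTS directory would write the two list files next to train.py.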

Fine-tune on ground-truth-aligned (GTA) mel-spectrograms:

cd /path/to/vietTTS # go to vietTTS directory
python -m vietTTS.nat.zero_silence_segments -o train_data # zero all [sil, sp, spn] segments
python -m vietTTS.nat.gta -o /path/to/hifi-gan/ft_dataset  # create gta melspectrograms at hifi-gan/ft_dataset directory

# turn on finetune
cd /path/to/hifi-gan
python train.py \
  --fine_tuning True \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
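
To picture what zeroing the [sil, sp, spn] segments means, here is a minimal sketch. It is not the code in vietTTS.nat.zero_silence_segments. It assumes the alignment is available as (start_sec, end_sec, label) tuples and the audio as a plain list of samples; the function name `zero_silence` is hypothetical.

```python
# Labels that the alignment uses for silence or non-speech (from the comment above).
SILENCE_LABELS = {"sil", "sp", "spn"}


def zero_silence(samples, intervals, sample_rate):
    """Return a copy of `samples` with every silence-labelled interval set to 0.

    samples:     sequence of audio samples
    intervals:   iterable of (start_sec, end_sec, label) tuples
    sample_rate: samples per second
    """
    out = list(samples)
    for start, end, label in intervals:
        if label not in SILENCE_LABELS:
            continue
        # Convert times to sample indices and clamp them to the clip length.
        lo = max(0, int(round(start * sample_rate)))
        hi = min(len(out), int(round(end * sample_rate)))
        for i in range(lo, hi):
            out[i] = 0
    return out
```

With silent regions exactly zero, the vocoder sees consistent targets for pauses instead of residual background noise.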

After fine-tuning, convert the PyTorch model to Haiku format with the following command:

cd ..
python -m vietTTS.hifigan.convert_torch_model_to_haiku \
  --config-file=assets/hifigan/config.json \
  --checkpoint-file=hifi-gan/cp_hifigan/g_[latest_checkpoint]

Synthesize speech

python -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
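
Conceptually, the duration model decides how long each phoneme lasts, and the acoustic model then produces that many mel-spectrogram frames for the vocoder. The sketch below illustrates only this bookkeeping under loud assumptions: the sample rate of 16000 Hz and hop length of 256 are placeholder values (the real ones live in the project's config), simple repetition stands in for whatever upsampling the acoustic model actually uses, and both function names are hypothetical.

```python
def durations_to_frames(durations_sec, sample_rate=16000, hop_length=256):
    """Convert per-phoneme durations in seconds to mel-frame counts.

    sample_rate and hop_length are assumed placeholder values.
    """
    frames_per_second = sample_rate / hop_length
    return [int(round(d * frames_per_second)) for d in durations_sec]


def expand_tokens(tokens, frame_counts):
    """Repeat each token once per frame (a simple repeat-based upsampling)."""
    expanded = []
    for token, n in zip(tokens, frame_counts):
        expanded.extend([token] * n)
    return expanded


# Example: two phonemes lasting 100 ms and 50 ms.
counts = durations_to_frames([0.1, 0.05])
frame_level_tokens = expand_tokens(["a", "b"], counts)
```

The frame-level sequence has one entry per mel frame, which is the time resolution at which the acoustic model and the vocoder operate.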
