Multi-speaker VITS & Hi-Fi TTS dataset structure #131

Merged
merged 16 commits into from
Feb 23, 2024

@zyingt zyingt commented Feb 4, 2024

✨ Description

This PR adds multi-speaker support to the current VITS model, so speech can be synthesized in multiple voices and users can choose the speaker's voice they prefer. To test this PR, follow the guidelines in the latest egs/tts/VITS/README.md.

🚧 Related Issues

None

👨‍💻 Changes Proposed

[1] Enabling multi-speaker VITS support:

  • Updated egs/tts/VITS/run.sh, exp_config.json, and README.md with the arguments and instructions needed for multi-speaker training and inference in VITS.
  • Added an intersperse function to utils/data_utils.py that inserts blanks (0) between consecutive phone IDs to regulate speaking speed.
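
The blank-insertion step in the second bullet can be sketched as follows. This is a minimal illustration based on the description above, not necessarily the exact code in utils/data_utils.py. The blank ID of 0 comes from the bullet; padding both ends of the sequence as well is an assumption borrowed from the convention in the original VITS code, and the sample phone IDs are made up.

```python
def intersperse(sequence, item=0):
    """Return a new list with `item` before, between, and after the elements of `sequence`."""
    # Start from a list of blanks that is long enough to hold every element
    # plus one blank on each side of it.
    result = [item] * (len(sequence) * 2 + 1)
    # Drop the original elements into the odd positions.
    result[1::2] = sequence
    return result


phone_ids = [5, 12, 7]  # hypothetical phone IDs, for illustration only
print(intersperse(phone_ids))  # [0, 5, 0, 12, 0, 7, 0]
```

A sequence of n phone IDs becomes 2n + 1 tokens, so the text side of the model sees roughly twice as many positions per utterance.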

[2] Streamlined Hi-Fi TTS dataset preprocessing:

  • Introduced the Hi-Fi TTS dataset structure in egs/datasets/README.md
  • Updated preprocessors/processor.py to accommodate the Hi-Fi TTS preprocessor

[3] Changes to the VITS dataset loader:

  • Added a metadata filter in models/tts/vits/vits_dataset.py that excludes very short segments, i.e. those with frame_len < self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size.
  • Moved the declaration of processed_data_dir in class VITSTestDataset(TTSTestDataset) (models/tts/vits/vits_dataset.py) out of the if block, because it was referenced within elif cfg.preprocess.use_phone: (line 88, latest) without a prior declaration.
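
The short-segment filter in the first bullet amounts to a threshold check on each utterance's frame count. A minimal sketch, assuming metadata is a list of dicts that carry a "frame_len" field; the function name, the "Uid" key, and the numeric values are hypothetical and only stand in for whatever vits_dataset.py actually uses.

```python
def filter_short_segments(metadata, segment_size, hop_size):
    """Keep only the items that are long enough to cut one training segment from."""
    # Minimum number of frames a training segment spans.
    min_frames = segment_size // hop_size
    # Drop items for which frame_len < segment_size // hop_size.
    return [item for item in metadata if item["frame_len"] >= min_frames]


# Hypothetical metadata entries and config values, for illustration only.
metadata = [
    {"Uid": "utt_a", "frame_len": 20},
    {"Uid": "utt_b", "frame_len": 120},
]
kept = filter_short_segments(metadata, segment_size=8192, hop_size=256)
print([item["Uid"] for item in kept])  # ['utt_b']
```

With these example values the threshold is 8192 // 256 = 32 frames, so the 20-frame utterance is excluded.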

[4] Enhanced model compatibility across accelerate versions:

  • The Hi-Fi TTS VITS checkpoint was trained with accelerate v0.25, so the resulting model file is model.safetensors instead of pytorch_model.bin. So that users can load the checkpoint successfully, models/tts/base/tts_inferece.py now adds an alternative model-loading path for users whose accelerate version is <0.25.
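
One way to picture the fallback is to decide which weight file to read based on what the checkpoint directory contains. The sketch below only resolves the file path and uses the standard library; the function name and the preference order are assumptions, and the actual logic in tts_inferece.py may differ.

```python
import tempfile
from pathlib import Path


def resolve_checkpoint_file(checkpoint_dir):
    """Return the weight file found in `checkpoint_dir`, preferring model.safetensors."""
    checkpoint_dir = Path(checkpoint_dir)
    # The two file names come from the PR description; the order is an assumption.
    for name in ("model.safetensors", "pytorch_model.bin"):
        candidate = checkpoint_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No model weights found in {checkpoint_dir}")


# Demonstration with a temporary directory standing in for a checkpoint folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.safetensors").touch()
    chosen = resolve_checkpoint_file(tmp)
    print(chosen.name)  # model.safetensors
```

Depending on the file that is found, the weights would then be read with a safetensors loader or with torch.load before being passed to the model.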

[5] Black formatting

🧑‍🤝‍🧑 Who Can Review?

@lmxue @RMSnow

πŸ›  TODO

  • Test multi-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on Hi-Fi TTS (Done)
  • Test single-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on LJSpeech (Done)

✅ Checklist

  • Code has been reviewed
  • Code complies with the project's code standards and best practices
  • Code has passed all tests
  • Code does not affect the normal use of existing features
  • Code has been commented properly
  • Documentation has been updated (if applicable)
  • Demo/checkpoint has been attached (if applicable)

@lmxue lmxue requested review from RMSnow and lmxue February 4, 2024 09:03
@RMSnow RMSnow left a comment

Use black to format the code.

@lmxue lmxue requested a review from RMSnow February 23, 2024 09:09
Fix typos and revise the explanation for `n_speaker`
@lmxue lmxue left a comment

It looks good now.

@RMSnow RMSnow merged commit 6e9d34f into open-mmlab:main Feb 23, 2024
1 check passed
ArkhamImp pushed a commit to ArkhamImp/Amphion that referenced this pull request Apr 17, 2024
Support Multi-speaker VITS & Hi-Fi TTS dataset preprocessing

3 participants