Multi-speaker VITS & Hi-Fi TTS dataset structure #131

zyingt · 2024-02-04T08:49:58Z

✨ Description

This PR adds multi-speaker support to the current VITS model. Speech can now be synthesized in multiple voices, and users can choose the speaker voice they prefer. To test this PR, follow the guidelines in the latest egs/tts/VITS/README.md.

🚧 Related Issues

None

👨‍💻 Changes Proposed

[1] Enabling multi-speaker VITS support:

Updated egs/tts/VITS/run.sh, exp_config.json, and README.md to include the arguments and instructions needed for multi-speaker training and inference in VITS.
Added an intersperse function to utils/data_utils.py that inserts a blank token (ID 0) between consecutive phone IDs to regulate speaking speed.
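
For reviewers unfamiliar with the blank-insertion step, here is a minimal sketch of what an intersperse helper of this kind usually looks like. It follows the pattern of the same-named helper in the original VITS code base; the exact signature in utils/data_utils.py is an assumption, and the phone IDs are made up for illustration.

```python
def intersperse(sequence, item=0):
    """Return a new list with `item` placed before, between, and after the elements."""
    # Pre-fill a list of blanks, then drop the original IDs into the odd positions.
    result = [item] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result


phone_ids = [5, 12, 7]  # made-up phone IDs, for illustration only
print(intersperse(phone_ids))  # [0, 5, 0, 12, 0, 7, 0]
```

A sequence of n phone IDs becomes 2n + 1 tokens, so the model sees a blank slot around every phone.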

[2] Streamlined Hi-Fi TTS dataset preprocessing:

Introduced the Hi-Fi TTS dataset structure in egs/datasets/README.md
Updated preprocessors/processor.py to accommodate the Hi-Fi TTS preprocessor

[3] Changes to the VITS dataset loader:

Added a metadata filter in models/tts/vits/vits_dataset.py that excludes very short segments, i.e. those with frame_len < self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size.
Moved the declaration of processed_data_dir in class VITSTestDataset(TTSTestDataset) (models/tts/vits/vits_dataset.py) out of the if condition, because it was referenced inside elif cfg.preprocess.use_phone: (line 88, latest) without a prior declaration.
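
The short-segment filter can be sketched roughly as below. This is not the code from vits_dataset.py: the helper name, the metadata layout (a list of dicts with a frame_len key), and the segment_size and hop_size values are all assumptions chosen for illustration. Only the threshold expression comes from the description above.

```python
def filter_short_segments(metadata, segment_size, hop_size):
    """Keep utterances that have at least segment_size // hop_size frames."""
    min_frames = segment_size // hop_size
    # Drop every entry for which frame_len < min_frames.
    return [utt for utt in metadata if utt["frame_len"] >= min_frames]


# Hypothetical metadata entries; real entries carry more fields.
metadata = [
    {"Uid": "utt_a", "frame_len": 20},
    {"Uid": "utt_b", "frame_len": 32},
    {"Uid": "utt_c", "frame_len": 150},
]

# Assumed config values: 8192 // 256 gives a minimum of 32 frames.
kept = filter_short_segments(metadata, segment_size=8192, hop_size=256)
print([utt["Uid"] for utt in kept])  # ['utt_b', 'utt_c']
```

Utterances shorter than one training segment cannot be sliced into a full segment, which is why they are dropped before training.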

[4] Enhanced model-loading compatibility across accelerate versions:

The Hi-Fi TTS VITS checkpoint was trained with accelerate v0.25, so the resulting model file is model.safetensors instead of pytorch_model.bin. So that users can load the checkpoint successfully, models/tts/base/tts_inferece.py now includes an additional loading path for users whose accelerate version is <0.25.
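
One way to picture the fallback is to detect which weight file a checkpoint directory actually contains before deciding how to load it. The sketch below is not the logic in tts_inferece.py; the helper name is hypothetical, and only the two file names come from the description above.

```python
import tempfile
from pathlib import Path

# File names taken from the PR description; the helper itself is hypothetical.
CANDIDATES = ("model.safetensors", "pytorch_model.bin")


def find_model_file(checkpoint_dir):
    """Return the first known weight file found in checkpoint_dir."""
    checkpoint_dir = Path(checkpoint_dir)
    for name in CANDIDATES:
        candidate = checkpoint_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No model file found in {checkpoint_dir}")


# Demonstration with a temporary directory standing in for a checkpoint folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.safetensors").touch()
    print(find_model_file(tmp).name)  # model.safetensors
```

Once the file is located, a .safetensors file would typically be read with safetensors.torch.load_file and a .bin file with torch.load, independent of which accelerate version wrote it.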

[5] Black formatting

🧑‍🤝‍🧑 Who Can Review?

@lmxue @RMSnow

🛠 TODO

Test multi-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on Hi-Fi TTS (Done)
Test single-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on LJSpeech (Done)

✅ Checklist

Code has been reviewed
Code complies with the project's code standards and best practices
Code has passed all tests
Code does not affect the normal use of existing features
Code has been commented properly
Documentation has been updated (if applicable)
Demo/checkpoint has been attached (if applicable)

RMSnow

Use black to format the code.

lmxue

It looks good now.

Support Multi-speaker VITS & Hi-Fi TTS dataset preprocessing

zyingt added 4 commits January 29, 2024 21:35

Added multi-speaker support to VITS

d04d5fd

Added multispeaker support to VITS

c8820a6

Multi-speaker VITS support

ec033d3

Multi-speaker VITS support

b4a1d3d

lmxue requested review from RMSnow and lmxue February 4, 2024 09:03

RMSnow requested changes Feb 5, 2024

egs/tts/VITS/README.md (resolved)

utils/data_utils.py (resolved)

zyingt added 4 commits February 8, 2024 14:01

Merge README.md, added function comment

8ef137c

Revert "Merge README.md, added function comment"

8455a5f

This reverts commit 8ef137c.

Merged README.md, updated comments

dd796ec

Fixed typos

8c910d7

lmxue requested changes Feb 12, 2024

egs/tts/VITS/README.md (resolved)

models/tts/base/tts_dataset.py (outdated, resolved)

models/tts/vits/vits_inference.py (outdated, resolved)

zyingt and others added 3 commits February 13, 2024 13:08

Merge branch 'open-mmlab:main' into multispeaker-vits

eb8b7f8

Enabling intersperse function for single-speaker VITS, added example usage, black format

a273505

Merge branch 'open-mmlab:main' into multispeaker-vits

288e739

yuantuo666 requested review from RMSnow and lmxue February 19, 2024 16:34

zyingt and others added 2 commits February 21, 2024 15:14

enhance model loading compatibility

ab84579

Merge branch 'open-mmlab:main' into multispeaker-vits

4c62032

RMSnow requested changes Feb 22, 2024

black format

9a743be

RMSnow requested changes Feb 22, 2024

egs/tts/VITS/README.md (resolved)

lmxue requested a review from RMSnow February 23, 2024 09:09

lmxue added 2 commits February 23, 2024 21:11

Update exp_config.json

d6c857c

Update README.md

6046ea8

Fix typos and revise the explanation for `n_speaker`

lmxue approved these changes Feb 23, 2024

RMSnow approved these changes Feb 23, 2024

RMSnow merged commit 6e9d34f into open-mmlab:main Feb 23, 2024
1 check passed

ArkhamImp pushed a commit to ArkhamImp/Amphion that referenced this pull request Apr 17, 2024

Support Multi-speaker VITS (open-mmlab#131)

6c44baa

Support Multi-speaker VITS & Hi-Fi TTS dataset preprocessing
