
Add NaturalSpeech2 #35

Merged
merged 14 commits into open-mmlab:main on Dec 18, 2023

Conversation

@HeCheng0625 (Collaborator) commented Dec 16, 2023

Add NaturalSpeech2 models, training, and inference. NS2 predicts the latent of Encodec and uses the Encodec decoder to generate the waveform. We also offer a pretrained checkpoint (trained on LibriTTS) for users to run inference.

Paper: https://arxiv.org/abs/2304.09116
CKPT: https://huggingface.co/amphion/naturalspeech2_libritts
Demo: https://huggingface.co/spaces/amphion/NaturalSpeech2
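
As described above, NS2 predicts Encodec latents and then uses the Encodec decoder to produce the waveform. The sketch below illustrates that two-stage flow only; every name, dimension, and function here is a hypothetical stand-in, not the PR's actual API (the real code lives under the NS2 model directory in this PR):

```python
import numpy as np

# Illustrative sketch of the NS2 inference flow: a diffusion model predicts
# the continuous Encodec latent sequence, and the (frozen) Encodec decoder
# turns those latents into a waveform. All names and sizes are assumptions.

LATENT_DIM = 128     # assumed Encodec latent channels
HOP = 320            # assumed audio samples per latent frame

def predict_latents(phones: list, n_frames: int) -> np.ndarray:
    """Stand-in for the NS2 diffusion model: returns (n_frames, LATENT_DIM)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, LATENT_DIM))

def encodec_decode(latents: np.ndarray) -> np.ndarray:
    """Stand-in for the Encodec decoder: upsamples frames to audio samples."""
    return np.repeat(latents.mean(axis=1), HOP)  # shape (n_frames * HOP,)

latents = predict_latents(["HH", "AH", "L", "OW"], n_frames=50)
wav = encodec_decode(latents)
print(latents.shape, wav.shape)  # (50, 128) (16000,)
```

The point of the split is that the waveform stage is handled entirely by the pretrained Encodec decoder, so NS2 itself only has to model the latent sequence.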

@RMSnow (Collaborator) commented Dec 17, 2023

Where is the pretrained checkpoint? And are there any generated samples?

Collaborator

Please merge run_inference.sh and run_train.sh into a single file, i.e., run.sh, and provide a recipe for NaturalSpeech2.

Collaborator

> Please merge run_inference.sh and run_train.sh into a single file, i.e., run.sh, and provide a recipe for NaturalSpeech2.

Same suggestions.
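
The requested consolidation could look roughly like the sketch below: a single run.sh that dispatches on a stage argument. The python entry points in the comments are assumptions for illustration, not the actual paths in this PR:

```shell
#!/bin/bash
# Hypothetical sketch of a unified run.sh for a NaturalSpeech2 recipe.
# The commented-out python commands are illustrative placeholders.
run() {
  case "$1" in
    train)
      echo "stage: train"
      # python bins/tts/train.py --config egs/tts/NaturalSpeech2/exp_config.json
      ;;
    inference)
      echo "stage: inference"
      # python bins/tts/inference.py --config egs/tts/NaturalSpeech2/exp_config.json
      ;;
    *)
      echo "usage: run.sh {train|inference}" >&2
      return 1
      ;;
  esac
}
run "${1:-train}"
```

A single dispatching script keeps the recipe self-documenting: users discover both stages from one usage message instead of hunting for separate scripts.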

Collaborator

What's the difference between this trainer and the TTS trainer in models/tts/base/tts_trainer.py?

Collaborator Author

Some of the initialization there is useless for NS2, so I don't want to inherit the TTS trainer.

Collaborator

Why doesn't the NS2 trainer directly inherit TTS trainer (defined in models/tts/base/tts_trainer.py) but instead inherit a newly defined trainer that is similar to the TTS trainer?

Collaborator

> Why doesn't the NS2 trainer directly inherit TTS trainer (defined in models/tts/base/tts_trainer.py) but instead inherit a newly defined trainer that is similar to the TTS trainer?

Same question
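
One common way to resolve this kind of disagreement is to inherit the shared trainer and override only the unwanted initialization, rather than duplicating the whole class. The sketch below is purely illustrative: this TTSTrainer is a stand-in, not the actual class in models/tts/base/tts_trainer.py:

```python
# Illustrative sketch: inherit a shared TTS trainer and override only the
# pieces NS2 does not need, instead of defining a near-identical new trainer.
# Both classes here are hypothetical stand-ins, not Amphion's real code.

class TTSTrainer:
    def __init__(self, cfg):
        self.cfg = cfg
        self.vocoder = self._build_vocoder()   # assume NS2 does not need this
        self.model = self._build_model()

    def _build_vocoder(self):
        return "heavy vocoder init"

    def _build_model(self):
        return "base tts model"

class NS2Trainer(TTSTrainer):
    def _build_vocoder(self):
        # NS2 decodes audio with Encodec, so skip the vocoder setup entirely.
        return None

    def _build_model(self):
        return "ns2 diffusion model"

trainer = NS2Trainer(cfg={})
print(trainer.vocoder, trainer.model)  # None ns2 diffusion model
```

With hook methods like these, "some initialization is useless" becomes an override returning None, and the rest of the base trainer is reused unchanged.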

@lmxue (Collaborator) commented Dec 18, 2023

It would be better to move this module to modules/encoder or modules/naturalspeech2.

Collaborator Author

Why? The prior encoder and diffusion are also parts of the NS2 model.

Collaborator

I'm not sure about your discussion. As general advice: if prior_encoder.py can be used by models other than NS2, then it should be moved into modules.

Collaborator Author

I think prior_encoder is designed specifically for NS2 (at least for now), so I will keep it under models/tts/ns2.

Collaborator

I think the models folder should contain only the models themselves (e.g., fastspeech2, vits, valle); the related modules should be placed in the modules folder, especially since you have already created modules/naturalspeech2.

Collaborator

Future improvement: merge this wavenet with Amphion wavenet vocoder (https://github.com/open-mmlab/Amphion/blob/main/models/vocoders/autoregressive/wavenet/wavenet.py)

Collaborator

> Future improvement: merge this wavenet with Amphion wavenet vocoder (https://github.com/open-mmlab/Amphion/blob/main/models/vocoders/autoregressive/wavenet/wavenet.py)

I approve of the "merge" idea. The name wavenet.py is somewhat confusing: it is not a vocoder. I think it is more like a diffusion wavenet, which already exists in Amphion:

class BiDilConv(nn.Module):

@HeCheng0625 You can merge this wavenet.py with the existing one.

Collaborator Author

I will do it in the future. For now, this wavenet is designed only for NS2, and its inputs differ considerably from BiDilConv's.
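
For context on this thread: both modules are WaveNet-style stacks of dilated 1-D convolutions, and the dilation schedule determines the receptive field, which is one of the properties a merged implementation would need to preserve. A small illustrative helper (not code from either file):

```python
# Receptive field of a stack of dilated 1-D convolutions, as used in
# WaveNet-style modules such as the NS2 wavenet or Amphion's BiDilConv.
# Formula: rf = 1 + sum((kernel_size - 1) * d) over each layer's dilation d.

def receptive_field(kernel_size: int, dilations: list) -> int:
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# e.g. kernel size 3 with dilations doubling over 4 layers:
print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

Exponentially growing dilations are what let such stacks cover long contexts with few layers, which is why both modules end up looking so similar despite different inputs.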

@RMSnow (Collaborator) left a review comment:

Add the copyright information for all the newly added files, including .py and .sh files.


from models.base.base_sampler import build_samplers


class TTSTrainer:
Collaborator

There is no inheritance here, although line 27 says "it inherits...".

Collaborator

Add copyright information for all the newly added files.

Collaborator Author

Will fix it.




@HeCheng0625 (Collaborator Author):

> Where is the pretrained checkpoint? And are there any generated samples?

Paper: https://arxiv.org/abs/2304.09116
CKPT: https://huggingface.co/amphion/naturalspeech2_libritts
Demo: https://huggingface.co/spaces/amphion/NaturalSpeech2

@RMSnow merged commit cc620a3 into open-mmlab:main on Dec 18, 2023
1 check passed