
ppg training #4

Closed
tts-nlp opened this issue Jun 15, 2021 · 3 comments

Comments


tts-nlp commented Jun 15, 2021

Hi @liusongxiang, how do I train the PPG model?

liusongxiang (Owner) commented

Please refer to ESPnet for the training process.


Lukelluke commented Jul 14, 2021

> Please refer to ESPnet for the training process.

Hi Dr. Liu,

Your work is impressive: using the encoder of an ASR model instead of the traditional Kaldi pipeline is very inspiring!

We have explored ESPnet, but still have some questions we would like your help with:

Q1: You provided the files under "/conformer_ppg_model/*". If we want to make them runnable like the "espnet/egs/librispeech/asr1/" example in ESPnet, what preparation steps are needed? For instance: (1) the data preparation, (2) the file organization, and (3) how to write the corresponding "run.sh" script?

Q2: As described in your paper, the bottleneck features from the encoder output of the ASR task are extracted as speaker-independent information. Can these features be regarded as equivalent to traditional PPG features? Furthermore, can researchers in the voice conversion field use this approach to extract PPG-like features instead of the traditional Kaldi pipeline?

Best wishes!
Luke

liusongxiang (Owner) commented

Thank you for the questions.
For Q1:
I adapted ESPnet quite a lot: it seems that ESPnet ASR models always downsample the encoder input along the temporal axis by more than 4x and do not support phonemes as output symbols, so the source code must be modified accordingly for VC applications. But the basic training steps are very similar to those in the ESPnet ASR recipes, including the data preparation and file organization. The run.sh needs only minor modifications, e.g., the language-model stage can be skipped. Sufficient familiarity with the ESPnet source code is necessary if you want to train a content encoder on your own data.
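To make the data-preparation step above concrete, here is a minimal sketch of writing the Kaldi-style files (wav.scp, text, utt2spk) that ESPnet recipes expect in each data directory. The paths, utterance IDs, and transcript below are made up for illustration; the exact set of files your recipe needs may differ.

```python
from pathlib import Path

def prepare_data_dir(out_dir, utts):
    """Write a Kaldi-style data directory.

    utts: list of (utt_id, speaker_id, wav_path, transcript) tuples.
    For PPG training the transcript would be a phoneme sequence.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "wav.scp", "w") as wav_scp, \
         open(out / "text", "w") as text, \
         open(out / "utt2spk", "w") as utt2spk:
        for utt_id, spk, wav, trans in sorted(utts):
            wav_scp.write(f"{utt_id} {wav}\n")    # utterance -> audio file
            text.write(f"{utt_id} {trans}\n")     # utterance -> transcript
            utt2spk.write(f"{utt_id} {spk}\n")    # utterance -> speaker

# Hypothetical example entry:
prepare_data_dir("data/train", [
    ("spk1-0001", "spk1", "/corpus/spk1/0001.wav", "AH B IY S IY"),
])
```

After this, a recipe's run.sh would typically be invoked stage by stage (feature extraction, dictionary/token preparation, training), with the language-model stage skipped as noted above.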

For Q2:
Please refer to this paper for your questions: TTS Skins: Speaker Conversion via ASR.
Good VC performance validates the speaker-independence of the bottleneck features obtained this way. The paper above reports that BNFs work better than PPG features, but this could well be a model-selection effect.
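The distinction between BNFs and classic PPGs can be sketched in a few lines: a PPG is the per-frame posterior distribution over phoneme classes (the output of the classification head), while the BNF is the dense hidden representation just before that head. The toy NumPy "encoder" below is purely illustrative; the dimensions and layer structure are assumptions, not ESPnet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 50 frames, 80-dim mel input,
# 144-dim bottleneck, 72 phoneme classes.
T, D_MEL, D_BNF, N_PHONES = 50, 80, 144, 72

W_enc = rng.normal(size=(D_MEL, D_BNF))     # stand-in for the trained encoder
W_out = rng.normal(size=(D_BNF, N_PHONES))  # stand-in for the phoneme head

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

mel = rng.normal(size=(T, D_MEL))           # input mel-spectrogram frames

bnf = np.tanh(mel @ W_enc)                  # BNF: hidden features before the head
ppg = softmax(bnf @ W_out)                  # PPG: per-frame phoneme posteriors

print(bnf.shape)  # (50, 144), dense unnormalized content features
print(ppg.shape)  # (50, 72), each row sums to 1 (a posteriorgram)
```

For voice conversion, either sequence can be fed to the synthesis model as the content representation; the choice between them is exactly the model-selection question raised above.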

Hope this can help.
Songxiang Liu
