
Training the model for a different language #36

Open
ivancarapinha opened this issue Apr 21, 2020 · 1 comment
ivancarapinha commented Apr 21, 2020

Hello @jxzhanggg,
First of all, thank you for your helpful replies to the previous issues I posted.
I would like to adapt this voice conversion model to European Portuguese. The thing is, I do not have a data set as large as VCTK in terms of nr. of utterances per speaker. I do have enough training data for at least 5-6 speakers (more than 500 utterances per speaker), sampled at 16 kHz. I tried several configurations, with batch sizes 8, 16 and 32 for pre-training but never managed to generate intelligible speech (decoder alignments did not converge). I changed the phonemizer backend in extract_features.py from Festival to Espeak, so that I could obtain phoneme transcriptions in Portuguese. I noticed that the total number of different phonemes increased substantially, from 41 (in English) to 66 (in Portuguese). I assume this makes the decoding task more difficult. Also, I experimented with the fine-tune model and the results improved a little bit (sometimes one or two words are intelligible, but still unintelligible utterances overall).

My questions are the following:

  • Should I try to use the pre-train model, even with only 5-6 speakers, or should I use only the fine-tune model instead?
  • What would you suggest in order to solve the decoder alignment problem?

Thank you very much

jxzhanggg (Owner) commented Apr 23, 2020

Should I try to use the pre-train model, even with only 5-6 speakers, or should I use only the fine-tune model instead?

I think more data is always favorable for training this model, so it should be useful to pre-train on as much data as you have.

What would you suggest in order to solve the decoder alignment problem?

I found that alignment convergence can be tricky; here is some of my experience:

  1. Try to use shorter utterances; if possible, cut long utterances into smaller pieces.
  2. You can also gradually increase the maximum utterance length, which is a kind of curriculum learning. At the beginning, train only on short utterances so the alignment is easy to learn.
  3. If the alignment collapses during training, try decreasing the learning rate or enlarging the batch size.
