Hello @jxzhanggg,
First of all, thank you for your helpful replies to the previous issues I posted.
I would like to adapt this voice conversion model to European Portuguese. The problem is that I do not have a dataset as large as VCTK in terms of the number of utterances per speaker. I do have enough training data for at least 5-6 speakers (more than 500 utterances per speaker), sampled at 16 kHz. I tried several configurations, with batch sizes of 8, 16, and 32 for pre-training, but never managed to generate intelligible speech (the decoder alignments did not converge). I changed the phonemizer backend in extract_features.py from Festival to eSpeak so that I could obtain phoneme transcriptions in Portuguese. I noticed that the total number of distinct phonemes increased substantially, from 41 (in English) to 66 (in Portuguese); I assume this makes the decoding task more difficult. I also experimented with the fine-tune model, and the results improved a little (sometimes one or two words are intelligible), but the utterances are still unintelligible overall.
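As a side note, a quick sanity check on the phoneme inventory size after switching backends can be done directly on the extracted transcriptions. This is only a sketch: it assumes each transcription is a string of space-separated phoneme symbols (as produced by an eSpeak-style backend with separators), and the example symbols are illustrative, not real eSpeak output.

```python
def phoneme_inventory(transcriptions):
    """Collect the set of distinct phoneme symbols across all utterances.

    Assumes each transcription is a string of space-separated phoneme
    symbols. Returns the sorted inventory.
    """
    inventory = set()
    for utt in transcriptions:
        inventory.update(utt.split())
    return sorted(inventory)

# Hypothetical Portuguese-like transcriptions (symbols for illustration only)
utts = ["o l a", "m u~ d u", "b o~ d i a"]
print(len(phoneme_inventory(utts)))  # → 10
```

Comparing this count between the English and Portuguese runs confirms how much larger the symbol set the decoder must handle has become.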
My questions are the following:
Should I try to use the pre-train model, even with only 5-6 speakers, or should I use only the fine-tune model instead?
What would you suggest in order to solve the decoder alignment problem?
Thank you very much
Should I try to use the pre-train model, even with only 5-6 speakers, or should I use only the fine-tune model instead?
I think more data is always favorable for training the model, so pre-training on more data should still be useful, even with only 5-6 speakers.
What would you suggest in order to solve the decoder alignment problem?
I found that alignment convergence can be tricky; here are some tips from my experience:
Try to use shorter utterances; you can cut long utterances into smaller pieces if possible.
You can also gradually increase the maximum utterance length, somewhat like curriculum learning: at the beginning, train only on short utterances so that the alignment is easy to learn.
If the alignment collapses during training, try decreasing the learning rate or enlarging the batch size.
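The length-curriculum idea above can be sketched as a simple filter whose length cap grows with the epoch. The function name, the frame-count field, and the schedule parameters below are all illustrative assumptions, not code from this repository:

```python
def curriculum_filter(utterances, epoch, start_len=50, step=25, max_len=300):
    """Keep only utterances whose frame count is under the current cap.

    The cap starts at `start_len` frames and grows by `step` each epoch,
    so early epochs see only short utterances (easy alignments) and the
    full dataset is reached once the cap hits `max_len`.
    """
    cap = min(start_len + epoch * step, max_len)
    return [u for u in utterances if u["n_frames"] <= cap]

# Illustrative usage with dummy utterance records
data = [{"id": i, "n_frames": n} for i, n in enumerate([40, 80, 120, 260, 400])]
print(len(curriculum_filter(data, epoch=0)))  # cap=50  → 1 utterance
print(len(curriculum_filter(data, epoch=3)))  # cap=125 → 3 utterances
```

In practice you would apply such a filter when building the training file list (or a sampler) at the start of each epoch, so the model sees progressively longer utterances as training stabilizes.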