Different timbres from the same singer. seperated into unique speakers, all sound identical in a multi-speaker model #158
Comments
The vocoder has nothing to do with the timbre. Do different timbres from the same singer sound the same on TensorBoard as well? How different do the timbres sound from each other? Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit it; do not edit any pre-existing files, and do not derive from base.yaml directly, as introduced in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.
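For reference, a minimal sketch of what such a derived acoustic configuration might look like. The key names below (`base_config`, `augmentation_args`, `pl_trainer_precision`, `max_batch_size`) follow the openvpi DiffSinger template configs, but treat the exact keys and values as assumptions to verify against the current documentation:

```yaml
# Hypothetical sketch of a derived acoustic config; verify all key names
# and values against the template configs shipped with the repository.
base_config: configs/acoustic.yaml   # derive from the template, not base.yaml

augmentation_args:
  random_pitch_shifting:
    enabled: true          # enable augmentation as recommended
    range: [-5., 5.]
    scale: 0.75
  random_time_stretching:
    enabled: true
    range: [0.5, 2.]
    scale: 0.75

pl_trainer_precision: 16-mixed   # AMP (mixed-precision training)
max_batch_size: 48               # raise the batch size as far as VRAM allows
```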
I only mentioned the vocoder to cover all of the differences. No, the different timbres sound very similar to the ground truth samples on TensorBoard (so long as training is far enough along).
Okay, I'll readjust the configuration using your recommendations and see if I get better results! In what cases would you recommend using fine-tuning?
So you mean the timbres are distinct from each other on TensorBoard but very similar in OpenUTAU? The only possibility I can imagine is that if you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model, the timbres could get mixed up. But in your configuration I saw that you did not enable these two parameters.
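If the variance model were retrained with energy and breathiness enabled, the relevant switches might look like the following. This is only a sketch; the `predict_energy`/`predict_breathiness` key names are assumed from the openvpi template configs and should be double-checked against the documentation:

```yaml
# Hypothetical variance config fragment; verify key names against the
# template configs before use.
base_config: configs/variance.yaml
predict_energy: true       # train an energy predictor alongside the model
predict_breathiness: true  # train a breathiness predictor alongside the model
```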
Currently the only recommended use case is training the aux decoder and the diffusion decoder separately when shallow diffusion is enabled. Fine-tuning is not that helpful in regular cases. If you fine-tune a model, it will not save many training steps if you want to totally wash out the timbres in the pre-trained model; if you train enough steps, it will cause catastrophic forgetting. If you discard some layers or embeddings before fine-tuning, it may perform even worse than starting from scratch. Meanwhile, fine-tuning requires careful adjustment of the training-related hyperparameters to get the best results. In a word, do not use fine-tuning unless guided by the documentation, or unless you are an expert and are clearly aware of what you are doing. Especially for people who own enough high-quality, well-labeled data: please train from scratch.
Yes, they sound distinct on TensorBoard but almost identical in OpenUTAU. There are slight differences in the waveforms, but generally all of the unique timbre gets removed and they all sound as if they had been trained together rather than as separate speakers. I generally wasn't happy with the results I got with energy and breathiness before, so I decided not to train with those parameters. Do you think that might have something to do with this issue?

Thank you for better explaining the use of fine-tuning! I'll be sure to stick to training from scratch going forward.

The only other thing I can think of that might be causing this issue is the amount of data I'm using and the number of speakers. I never had this issue when I was training on smaller amounts of data (~2 hrs, 2 different vocalists, 6 different "voice modes"/speakers in DiffSinger), and now my dataset is ~6 hrs, 6 different vocalists, and up to 23 different speakers in DiffSinger. That's about all my GPU can handle (I train locally). I ran another test last night training on only about 2 hrs of data across 3 vocalists and 10 speakers in the config, and still got the same issue after 200 epochs / 12k steps of acoustic training. I can confirm the data is high quality and tagged well (all done by hand, by me). Thanks so much for all of your help!
I mean, if the timbres are distinct from each other on TensorBoard, they are expected to be distinct from each other in OpenUTAU as well, because the conditions are the same. If you do believe the TensorBoard samples really sound as you expected, then something else must be wrong; otherwise, they would already have been in trouble on TensorBoard. A possible way of debugging is to export the DS files from OpenUTAU and run inference directly from the checkpoint. I personally train with 5 or 6 vocalists and ~9 timbres in total in every experiment, and I have never encountered any issue with the differences between timbres. Some people in the community train on larger datasets than mine, with multi-timbre singers in them, and they have no problem either.
According to my experience and that of other people in our community in China, the variance parameters do not cause deterioration of quality, but improve stability and controllability. However, if you do not train them well, they can cause some problems, and there are some interesting findings in our recent research about the mutual influence between variance modules. These have been added to the documentation, and a minor release will also be published to notify users about them.
Thanks for the tip on debugging by inferring directly from the checkpoint. It turns out it's either an OpenUTAU issue or a deployment issue, because direct inference from the checkpoint via the command line actually gave me the proper output with separate timbres. I'll have to keep messing around with the OpenUTAU library file structure to figure out why it's getting the embeds confused, which is my guess. Do they have to be in the OpenUTAU configs in a certain order that you're aware of?
When exporting to ONNX, you should use the export scripts provided in the repository. So you might have mixed up your embeds somehow, or it's probably just a personal mistake in the usage of OpenUTAU. You should first check your embeds to see if they are really different, then your configs, and then OpenUTAU itself (for example, use a clean install or reset all the preferences in case there is some misconfiguration in the expression settings).
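One way to check whether the exported embeds are actually different is to load them and compare them directly. The sketch below assumes each `.emb` file is a raw float32 vector on disk; that layout is an assumption, so adapt the loading code to whatever format your export actually produces:

```python
import numpy as np

def load_embed(path: str) -> np.ndarray:
    """Load a speaker embed, assuming a raw float32 vector on disk."""
    return np.fromfile(path, dtype=np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeds; values near 1.0 mean
    the embeds are nearly identical (which would explain identical timbre)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage (hypothetical file names):
#   soft = load_embed("soft.emb")
#   power = load_embed("power.emb")
#   print(cosine_similarity(soft, power))
```

If every pair of embeds scores close to 1.0, the problem is upstream of OpenUTAU; if they score well apart, the mix-up is happening at load time.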
First of all, thank you so much for all of your dedication in helping me solve this issue; I've learned a ton! Second, I discovered that the issue IS OpenUTAU. Apparently, if the embed files are not in the same directory as the character.yaml file, you have to specify where they are. I thought it pulled from the list of speakers, so it was totally my misunderstanding.
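For anyone hitting the same problem, the fix is to list the embed paths explicitly in the voicebank's DiffSinger config. Something like the following sketch; the file name and key names here (`dsconfig.yaml`, `acoustic`, `speakers`) are assumptions to verify against the OpenUTAU DiffSinger voicebank documentation:

```yaml
# Hypothetical dsconfig.yaml fragment; check key names against the
# OpenUTAU DiffSinger voicebank documentation.
acoustic: acoustic.onnx
speakers:
  - embeds/soft.emb     # paths relative to the config file; required
  - embeds/power.emb    # when the embeds live in a subdirectory
```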
Hi, I've been having this issue for quite some time and have tried a ton of different things to resolve it, with no luck. I've been training some English multi-speaker models; there are about 3 characters and 12 speakers total in the model. Each character sounds distinct from the others, but each separate tone ends up sounding exactly the same (example: soft and power have the exact same synthesized timbre despite the source recordings sounding unique from each other). I've tried rewriting the configuration, setting up the repo for training from scratch, removing about 3 hrs of audio from the dataset, and nothing has changed. I'll include my acoustic and base configurations below. Any help is appreciated :)
Note: I'm using a custom fine-tuned vocoder for validation
main acoustic config (tgm_acoustic_leif.yaml)
base config (base.yaml)