The "Potluck Vocoder" is a PC-NSF-HIFIGAN model for DiffSinger. Most of its data was contributed by the DiffSinger community, and the project would not have been possible without such generosity, hence the name "Potluck".
The goal of this project was to create a vocoder whose weights are not bound to a non-commercial license, allowing unlimited commercial usage of DiffSinger models (under the assumption that the DS model weights are also commercial-friendly).
This model has a generous, commercial-friendly license, but do note that unlimited commercial usage is ONLY allowed within DiffSinger. The model may be used in other systems, but it may currently only be monetized with DiffSinger. Please make sure you read the license before using, training, and/or distributing your own Potluck FT model. Please view the License here.
The Potluck Vocoder has been released as a fine-tuning checkpoint and a "lite" model for general usage. I was not able to achieve a model that fits all voices, so I highly recommend fine-tuning the vocoder on your dataset, including as much data as you can, to achieve the best result possible.
To learn how to fine-tune your own model, please read this guide here.
Don't. Seriously, I don't recommend it. It took me weeks of tinkering and about 75 kWh of energy, and I'm still not sure whether I went about it in the most effective way. I thought it was impossible, so I'm elated that I was able to do it at all, but if you really want to know exactly what I did, here is a rough outline:
- Have at least 100 hours of data after silence is removed. I used about 115 hours for phase 1 and 135 hours for each fine-tuning phase.
- This guide assumes you have a GPU with 24 GB of VRAM.
- I used Muon_AdamW as the optimizer, which is included in the SingingVocoders repo. It can be adjusted with the following in each configuration:
```yaml
discriminate_optimizer_args:
  optimizer_cls: modules.optimizer.muon.Muon_AdamW
  lr: 0.0002
  muon_args:
    weight_decay: 0.1
  adamw_args:
    weight_decay: 0.0
  verbose: false
generater_optimizer_args:
  optimizer_cls: modules.optimizer.muon.Muon_AdamW
  lr: 0.0002
  muon_args:
    weight_decay: 0.03
  adamw_args:
    weight_decay: 0.0
  verbose: false
```

- Training Phase 1: `base_hifi.yaml` with `pc_aug: false` & `key_aug: true`, and train for 100k steps (200k iterations).
- Fine-Tune Phase 1: `base_hifi_ft.yaml` with `pc_aug: true` & `key_aug: false`, and train for 50k steps (100k iterations).
- Fine-Tune Phase 2: Same setup as before, and train for 50k steps (100k iterations).
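If you want to sanity-check the "hours after silence is removed" figure for your own dataset, a rough estimate can be scripted. This is a minimal sketch using a simple frame-RMS threshold, not the actual tool used for this project; `voiced_seconds` and its parameters are made up for illustration:

```python
import numpy as np

def voiced_seconds(samples: np.ndarray, sr: int,
                   frame_len: int = 2048, db_floor: float = -40.0) -> float:
    """Estimate non-silent duration by dropping frames whose RMS level is
    more than `db_floor` dB below the loudest frame. A crude stand-in for
    a real silence remover."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames.astype(np.float64) ** 2).mean(axis=1)) + 1e-12
    db = 20.0 * np.log10(rms / rms.max())
    return float((db > db_floor).sum() * frame_len / sr)

# Demo: 2 s of noise followed by 2 s of silence at 44.1 kHz.
sr = 44100
rng = np.random.default_rng(0)
audio = np.concatenate([rng.standard_normal(2 * sr), np.zeros(2 * sr)])
print(round(voiced_seconds(audio, sr), 1))  # → 2.0
```

Summing this over every file in a dataset (divided by 3600) gives an approximate voiced-hours total to compare against the 100-hour target.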
You will then want to fine-tune on the singers you actually intend to use the vocoder with. I followed my own Fine-Tuning guide for that.
NOTE: This is not the most efficient method, and it may well be wrong in places, but it worked very well for me and I'm very happy with my results!
NOTE 2: Training iterations are double the number of steps because the vocoder trains two models at once: the generator and the discriminator.
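In code form, the step-to-iteration relationship above is just a doubling; the helper name here is illustrative, not part of SingingVocoders:

```python
def iterations(steps: int) -> int:
    # GAN training updates both the generator and the discriminator each
    # step, so the logged iteration count is twice the step count.
    return 2 * steps

# Totals for the schedule above (Phase 1 plus two fine-tune phases):
total_steps = 100_000 + 50_000 + 50_000
print(total_steps, iterations(total_steps))  # → 200000 400000
```

So the full recipe works out to 200k steps, or 400k logged iterations.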