The "Potluck Vocoder" is a PC-NSF-HIFIGAN model for DiffSinger. Most of its data was contributed by the DiffSinger community, and the project would not have been possible without such generosity, hence the name "Potluck".
The goal of this project was to create a vocoder whose weights are not bound to a non-commercial license, allowing unlimited commercial usage of DiffSinger models (under the assumption that the DS model weights are also commercial-friendly).
This model has a generous, commercial-friendly license, but do note that unlimited commercial usage is ONLY allowed within DiffSinger. The model may be used in other systems, but it may currently only be monetized with DiffSinger. Please make sure you read the license before using, training, and/or distributing your own Potluck FT model. Please view the License here.
The Potluck Vocoder has been released as a fine-tuning checkpoint and a "lite" model for general usage. I was not able to achieve a model that fits all voices, so I highly recommend fine-tuning the vocoder on your dataset, including as much data as you can, to achieve the best result possible.
To learn how to fine-tune your own model, please read this guide here.
Don't. Seriously, I don't recommend it. It took me weeks of tinkering and about 75 kWh of energy, and I'm still not sure whether I went about it in the most effective way. I thought it was impossible, so I'm elated that I was able to do it at all, but if you really want to know exactly what I did, here is a rough outline:
- Have at least 100 hours of data after silence is removed. I used about 115 hours for phase 1 and 135 hours for each fine-tuning phase.
- This guide assumes you have a GPU with 24 GB of VRAM.
- I used Muon_AdamW as the optimizer, which is included in the SingingVocoders repo. It can be adjusted with the following in each configuration:
```yaml
discriminate_optimizer_args:
  optimizer_cls: modules.optimizer.muon.Muon_AdamW
  lr: 0.0002
  muon_args:
    weight_decay: 0.1
  adamw_args:
    weight_decay: 0.0
  verbose: false
generater_optimizer_args:
  optimizer_cls: modules.optimizer.muon.Muon_AdamW
  lr: 0.0002
  muon_args:
    weight_decay: 0.03
  adamw_args:
    weight_decay: 0.0
  verbose: false
```

- Training Phase 1: `base_hifi.yaml` with `pc_aug: false` & `key_aug: true`, and train for 100k steps (200k iterations).
- Fine-Tune Phase 1: `base_hifi_ft.yaml` with `pc_aug: true` & `key_aug: false`, and train for 50k steps (100k iterations).
- Fine-Tune Phase 2: Same setup as before, and train for 50k steps (100k iterations).
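If you want to sanity-check the "hours after silence is removed" figure for your own dataset, a rough estimate can be scripted. This is a minimal sketch using a simple frame-RMS threshold, not the actual tool used for this project; `voiced_seconds` and its parameters are made up for illustration:

```python
import numpy as np

def voiced_seconds(samples: np.ndarray, sr: int,
                   frame_len: int = 2048, db_floor: float = -40.0) -> float:
    """Estimate non-silent duration by dropping frames whose RMS level is
    more than `db_floor` dB below the loudest frame. A crude stand-in for
    a real silence remover."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames.astype(np.float64) ** 2).mean(axis=1)) + 1e-12
    db = 20.0 * np.log10(rms / rms.max())
    return float((db > db_floor).sum() * frame_len / sr)

# Demo: 2 s of noise followed by 2 s of silence at 44.1 kHz.
sr = 44100
rng = np.random.default_rng(0)
audio = np.concatenate([rng.standard_normal(2 * sr), np.zeros(2 * sr)])
print(round(voiced_seconds(audio, sr), 1))  # → 2.0
```

Summing this over every file in a dataset (divided by 3600) gives an approximate voiced-hours total to compare against the 100-hour target.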
You will then want to fine-tune on the singers you actually intend to use the vocoder with. I followed my own Fine-Tuning guide for that.
NOTE: This is not the most efficient method, and it may well be wrong in places, but it worked very well for me and I'm very happy with my results!
NOTE 2: Training iterations are double the number of steps because the vocoder trains two models at once: the generator and the discriminator.
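In code form, the step-to-iteration relationship above is just a doubling; the helper name here is illustrative, not part of SingingVocoders:

```python
def iterations(steps: int) -> int:
    # GAN training updates both the generator and the discriminator each
    # step, so the logged iteration count is twice the step count.
    return 2 * steps

# Totals for the schedule above (Phase 1 plus two fine-tune phases):
total_steps = 100_000 + 50_000 + 50_000
print(total_steps, iterations(total_steps))  # → 200000 400000
```

So the full recipe works out to 200k steps, or 400k logged iterations.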