Potluck Vocoder - Commercial Friendly PC-NSF-HIFIGAN Model for DiffSinger Models

Potluck Vocoder ๐Ÿฒ๐Ÿ—๐ŸŒฎ๐Ÿป๐Ÿœ

The "Potluck Vocoder" is a PC-NSF-HIFIGAN model for DiffSinger. Most of the data was contributed by the DiffSinger community, and the project would not have been possible without that generosity; hence the name "Potluck".

The goal of this project was to create a vocoder whose weights are not bound to a non-commercial license, allowing unlimited commercial usage of DiffSinger models (under the assumption that the DS model weights are also commercial-friendly).

License

This model has a generous, commercial-friendly license, but note that unlimited commercial usage is allowed ONLY when used with DiffSinger. The model may be used in other systems, but it may currently only be monetized with DiffSinger. Please make sure you read the license before using, training, and/or distributing your own Potluck FT model. The License can be viewed here.

Information

The Potluck Vocoder has been released as a fine-tuning checkpoint and as a "lite" model for general usage. I was not able to achieve a model that fits all voices, so I highly recommend fine-tuning the vocoder on your own dataset, including as much data as you can, to achieve the best possible result.

To learn how to fine-tune your own model, please read this guide here.

How to train your own Vocoder from scratch

Don't. Seriously, I don't recommend it. It took me weeks of tinkering and about 75 kWh of energy, and I'm still not sure I did it in the most effective way. I thought it was impossible, so I'm elated that I was able to do it at all, but if you really want to know exactly what I did, here's a rough outline:

  • Have at least 100 hours of data after silence is removed. I used about 115 for phase 1 and 135 for each fine-tuning phase.
  • This guide assumes you have a 24gb VRAM GPU.
  • I used Muon_AdamW as the optimizer, which is included in the SingingVocoders repo. It can be adjusted with the following in each configuration:
discriminate_optimizer_args:
  optimizer_cls: modules.optimizer.muon.Muon_AdamW
  lr: 0.0002
  muon_args:
    weight_decay: 0.1
  adamw_args:
    weight_decay: 0.0
  verbose: false

generater_optimizer_args:
  optimizer_cls: modules.optimizer.muon.Muon_AdamW
  lr: 0.0002
  muon_args:
    weight_decay: 0.03
  adamw_args:
    weight_decay: 0.0
  verbose: false
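
The config above gives Muon and AdamW separate weight_decay values because hybrid Muon/AdamW optimizers typically route different parameters to each: Muon is designed for 2-D weight matrices, while everything else (biases, norm gains, and so on) falls back to AdamW. As a rough sketch of that idea (the parameter names and shapes are made up, and this is not the actual SingingVocoders implementation):

```python
# Hypothetical sketch of how a hybrid Muon/AdamW optimizer might partition
# parameters. In Muon's usual design, 2-D weight matrices go to Muon and all
# other tensors (biases, norm gains) go to AdamW. Example names/shapes are
# invented for illustration only.

def split_param_groups(named_shapes):
    """Partition parameters by dimensionality: 2-D -> Muon, rest -> AdamW."""
    muon, adamw = [], []
    for name, shape in named_shapes:
        (muon if len(shape) == 2 else adamw).append(name)
    return muon, adamw

params = [
    ("proj.weight", (512, 256)),  # 2-D matrix -> Muon
    ("proj.bias", (512,)),        # 1-D -> AdamW
    ("norm.gamma", (256,)),       # 1-D -> AdamW
]
muon_params, adamw_params = split_param_groups(params)
print(muon_params)   # ["proj.weight"]
print(adamw_params)  # ["proj.bias", "norm.gamma"]
```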

Training pipeline

  • Training Phase 1: base_hifi.yaml with pc_aug: false & key_aug: true, and train for 100k steps (200k iterations).
  • Fine-Tune Phase 1: base_hifi_ft.yaml with pc_aug: true & key_aug: false, and train for 50k steps (100k iterations).
  • Fine-Tune Phase 2: Same setup as before, and train for 50k steps (100k iterations).
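
Concretely, the augmentation flags for each phase would look something like this (a sketch based on the flag names in the steps above, not the full contents of those config files):

```yaml
# Training Phase 1 (base_hifi.yaml)
pc_aug: false
key_aug: true

# Fine-Tune Phases 1 & 2 (base_hifi_ft.yaml)
pc_aug: true
key_aug: false
```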

You will then want to fine-tune on the singers you actually intend to use the vocoder with. I followed my own Fine-Tuning guide for that.

NOTE: This is probably not the most efficient method, and parts of it may well be wrong, but it worked very well for me and I'm very happy with my results!

NOTE 2: The iteration count is double the step count because the vocoder trains two models at once: the generator and the discriminator.
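
In other words, each step performs one discriminator update and one generator update, so the iteration counter advances twice per step. A toy illustration of that bookkeeping (not the actual training loop):

```python
# Toy illustration: one "step" = one discriminator update + one generator
# update, so the iteration counter runs at twice the step counter.

def iterations_for(steps):
    iterations = 0
    for _ in range(steps):
        iterations += 1  # discriminator update
        iterations += 1  # generator update
    return iterations

print(iterations_for(100_000))  # 200000 iterations for Phase 1's 100k steps
print(iterations_for(50_000))   # 100000 iterations for each fine-tune phase
```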

Credits

  • Training code & model created by OpenVPI.
  • "split.py" script from HiFiPLN.
  • Hours of data contributed by the DiffSinger community, credit listed here.
