
Different timbres from the same singer, separated into unique speakers, all sound identical in a multi-speaker model #158

Closed
spicytigermeat opened this issue Dec 3, 2023 · 8 comments


@spicytigermeat

Hi, I've been having this issue for quite some time and have tried a ton of different things to resolve it, with no luck. I've been training some English multi-speaker models; there are about 3 characters and 12 speakers total in the model. Each character sounds distinct from the others, but each separate tone ends up sounding exactly the same (example: soft/power have the exact same synthesized timbre despite the source recordings sounding distinct from each other). I've tried rewriting the configuration, setting up the repo for training from scratch, removing about 3 hrs of audio from the dataset, and nothing has changed. I'll include my acoustic and base configurations below. Any help is appreciated :)

Note: I'm using a custom fine-tuned vocoder for validation

main acoustic config (tgm_acoustic_leif.yaml)
base_config:
  - configs/base.yaml

task_cls: training.acoustic_task.AcousticTask

num_spk: 13
speakers:
  # commented numbers are the index of the speaker
  - tiger_fresh #0
  - triton_gale #1
  - canary_core #2
  - leif_blossom_e #3
  - leif_blossom_j #4
  - leif_lush_e #5
  - leif_lush_j #6
  - leif_uprooted_e #7
  - leif_uprooted_j #8
  - leif_petal_e #9
  - leif_petal_j #10
  - tiger_disco #11
  - tiger_electric #12
  #- ritsu #13
raw_data_dir:
  - data/training_data/tiger_fresh #0
  - data/training_data/triton_gale #1
  - data/training_data/canary_core #2
  - data/training_data/leif_blossom_e #4
  - data/training_data/leif_blossom_j #5
  - data/training_data/leif_lush_e #6
  - data/training_data/leif_lush_j #7
  - data/training_data/leif_uprooted_e #8
  - data/training_data/leif_uprooted_j #9
  - data/training_data/leif_petal_e #10
  - data/training_data/leif_petal_j #11
  - data/training_data/tiger_disco #13
  - data/training_data/tiger_electric #14
  #- data/training_data/ritsu #22
spk_ids: []
test_prefixes: 
  # tiger_fresh
  - 0:familiar_seg016
  - 0:golden_hour_seg000
  - 0:rougenodengon_seg006
  - 0:sungoesdown_seg007
  - 0:videogames_seg002
  # triton_gale
  - 1:natalie_dont_seg000
  - 1:housewife_seg000
  - 1:blinding_lights_seg002
  - 1:surround_me_seg002
  - 1:your_power_seg000
  # canary_core
  - 2:intergalactia_seg009
  - 2:still_alive_seg014
  - 2:cyber_angel_seg003
  - 2:canary_t2_02_seg003
  - 2:canary_t2_02_seg000
  # leif_blossom_e
  - 3:leif_blossom_06_seg000
  - 3:leif_blossom_17_seg000
  - 3:leif_blossom_11_seg001
  - 3:leif_blossom_24_seg000
  - 3:leif_blossom_36_seg002
  # leif_blossom_j
  - 4:leif_blossom_j_05_seg000
  - 4:leif_blossom_j_12_seg004
  - 4:leif_blossom_j_10_seg000
  - 4:leif_blossom_j_13_seg003
  - 4:leif_blossom_j_15_seg001
  # leif_lush_e
  - 5:leif_lush_04_seg000
  - 5:leif_lush_10_seg000
  - 5:leif_lush_11_seg001
  - 5:leif_lush_24_seg002
  - 5:leif_lush_33_seg000
  # leif_lush_j
  - 6:leif_lush_j_02_seg001
  - 6:leif_lush_j_07_seg000
  - 6:leif_lush_j_06_seg002
  - 6:leif_lush_j_13_seg002
  - 6:leif_lush_j_18_seg001
  # leif_uprooted_e
  - 7:leif_uprooted_01_seg001
  - 7:leif_uprooted_05_seg000
  - 7:leif_uprooted_10_seg000
  - 7:leif_uprooted_17_seg003
  - 7:leif_uprooted_20_seg000
  # leif_uprooted_j
  - 8:leif_uprooted_j_08_seg003
  - 8:leif_uprooted_j_13_seg000
  - 8:leif_uprooted_j_03_seg001
  - 8:leif_uprooted_j_08_seg000
  - 8:leif_uprooted_j_12_seg003
  # leif_petal_e
  - 9:leif_petal_08_seg001
  - 9:leif_petal_11_seg002
  - 9:leif_petal_01_seg001
  - 9:leif_petal_10_seg002
  - 9:leif_petal_20_seg004
  # leif_petal_j
  - 10:leif_petal_j_04_seg001
  - 10:leif_petal_j_07_seg003
  - 10:leif_petal_j_04_seg003
  - 10:leif_petal_j_13_seg003
  - 10:leif_petal_j_15_seg003
  # tiger_disco
  - 11:afternoon_in_heaven_seg008
  - 11:i_still_wanna_know_seg009
  - 11:dreamsweet_seg009
  - 11:fireflies_seg008
  - 11:so_cold_seg003
  # tiger_electric
  - 12:funky_again_seg008
  - 12:independant_together_seg013
  - 12:rightround_seg013
  - 12:you_and_i_9_seg000
  - 12:still_feel_seg011
  # ritsu
  #- 13:sakura_seg011
  #- 13:tsutsuuraura_seg004
  #- 13:traumerei_seg009
  #- 13:WAVE_seg001
  #- 13:Worlds_End_Celebrate_seg003
  # stock_data
  #- 21:Blank_S_1_seg001
  #- 21:Spotless_M_4_seg000

vocoder: NsfHifiGAN
vocoder_ckpt: checkpoints/tgm_hifigan/generator.ckpt
audio_sample_rate: 44100
audio_num_mel_bins: 128
hop_size: 512            # Hop size.
fft_size: 2048           # FFT size.
win_size: 2048           # Window size.
fmin: 40
fmax: 16000

binarization_args:
  shuffle: true
  num_workers: 0 #default: 0
augmentation_args:
  random_pitch_shifting:
    enabled: false
    range: [-5., 5.]
    scale: 1.0
  fixed_pitch_shifting:
    enabled: false
    targets: [-5., 5.]
    scale: 0.75
  random_time_stretching:
    enabled: false
    range: [0.5, 2.]
    domain: log  # or linear
    scale: 1.0

binary_data_dir: data/binary/tgm_b04/acoustic_binary_4
binarizer_cls: preprocessing.acoustic_binarizer.AcousticBinarizer
dictionary: dictionaries/tgm_dictionary_norx.txt
num_pad_tokens: 1
spec_min: [-5]
spec_max: [0]
mel_vmin: -6. #-6.
mel_vmax: 1.5
interp_uv: true
energy_smooth_width: 0.12
breathiness_smooth_width: 0.12

use_spk_id: true
f0_embed_type: discrete
use_energy_embed: false
use_breathiness_embed: false
use_key_shift_embed: false
use_speed_embed: false

timesteps: 1000
max_beta: 0.02
rel_pos: true
diff_accelerator: ddim
pndm_speedup: 10
hidden_size: 256
residual_layers: 20
residual_channels: 512
dilation_cycle_length: 4  # *
diff_decoder_type: 'wavenet'
diff_loss_type: l1
schedule_type: 'linear'

# shallow diffusion
use_shallow_diffusion: true
K_step: 400
K_step_infer: 400

shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: true
  val_gt_start: true
  aux_decoder_arch: convnext
  aux_decoder_args:
    num_channels: 512
    num_layers: 6
    kernel_size: 7
    dropout_rate: 0.1
  aux_decoder_grad: 0.1

lambda_aux_mel_loss: 0.2

# train and eval
num_sanity_val_steps: 1
optimizer_args:
  optimizer_cls: torch.optim.AdamW
  lr: 0.0004
  beta1: 0.9
  beta2: 0.98
  weight_decay: 0
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  warmup_steps: 10000
  step_size: 15000
  gamma: 0.5
max_batch_frames: 80000
max_batch_size: 16
dataset_size_key: 'lengths'
val_with_vocoder: true
val_check_interval: 2000
num_valid_plots: 10
max_updates: 320000
num_ckpt_keep: 20
permanent_ckpt_start: 200000
permanent_ckpt_interval: 40000


finetune_enabled: true
finetune_ckpt_path: checkpoints/tgm_acou_b04-2/ARCHIVE_CKPT/model_ckpt_steps_47500.ckpt

finetune_ignored_params:
  - model.fs2.encoder.embed_tokens
  - model.fs2.txt_embed
  - model.fs2.spk_embed
finetune_strict_shapes: true

freezing_enabled: false
frozen_params: []

use_melody_encoder: true
use_glide_embed: false

base config (base.yaml)
# task
task_cls: ''
seed: 1234
save_codes:
  - configs
  - modules
  - training
  - utils

#############
# dataset
#############
sort_by_len: true
raw_data_dir: ''
binary_data_dir: ''
binarizer_cls: ''
binarization_args:
  shuffle: false
  num_workers: 0

audio_num_mel_bins: 128
audio_sample_rate: 44100
hop_size: 512  # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
win_size: 2048  # For 22050Hz, 1100 ~= 50 ms (If None, win_size: fft_size) (0.05 * sample_rate)
fmin: 40  # Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
fmax: 16000  # To be increased/reduced depending on data.
fft_size: 2048  # Extra window size is filled with 0 paddings to match this parameter
mel_vmin: -6
mel_vmax: 1.5
sampler_frame_count_grid: 6
ds_workers: 4
dataloader_prefetch_factor: 2

#########
# model
#########
hidden_size: 256
dropout: 0.1
use_pos_embed: true
enc_layers: 4
num_heads: 2
enc_ffn_kernel_size: 9
ffn_act: gelu
ffn_padding: 'SAME'
use_spk_id: false

###########
# optimization
###########
optimizer_args:
  optimizer_cls: torch.optim.AdamW
  lr: 0.0004
  beta1: 0.9
  beta2: 0.98
  weight_decay: 0
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  step_size: 50000
  gamma: 0.5
clip_grad_norm: 1

###########
# train and eval
###########
num_ckpt_keep: 5
accumulate_grad_batches: 1
log_interval: 100
num_sanity_val_steps: 1  # steps of validation at the beginning
val_check_interval: 2000
max_updates: 120000
max_batch_frames: 32000
max_batch_size: 100000
max_val_batch_frames: 60000
max_val_batch_size: 1
train_set_name: 'train'
valid_set_name: 'valid'
pe: 'rmvpe'
pe_ckpt: checkpoints/rmvpe/model.pt
vocoder: ''
vocoder_ckpt: ''
num_valid_plots: 10

###########
# pytorch lightning
# Read https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api for possible values
###########
pl_trainer_accelerator: 'auto'
pl_trainer_devices: 'auto'
pl_trainer_precision: '16-mixed'
pl_trainer_num_nodes: 1
pl_trainer_strategy: 
  name: auto
  process_group_backend: nccl
  find_unused_parameters: false
nccl_p2p: true

###########
# finetune
###########

finetune_enabled: false
finetune_ckpt_path: null
finetune_ignored_params: []


finetune_strict_shapes: true

freezing_enabled: false
frozen_params: []
@yqzhishen
Member

The vocoder has nothing to do with the timbre. Do the different timbres from the same singer also sound the same on TensorBoard? How different do the timbres sound from each other?

Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit it; do not edit any pre-existing files, and do not derive from base.yaml directly, as described in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.
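
For illustration, here is roughly what those recommendations could look like as overrides in the acoustic config, reusing keys that already appear in the configs quoted above; the specific values are assumptions and would need to be tuned to the dataset and GPU:

augmentation_args:
  random_pitch_shifting:
    enabled: true             # illustrative: enable augmentation, as suggested
    range: [-5., 5.]
    scale: 1.0
  random_time_stretching:
    enabled: true
    range: [0.5, 2.]
    domain: log
    scale: 1.0
use_key_shift_embed: true     # assumed: usually paired with random pitch shifting
use_speed_embed: true         # assumed: usually paired with random time stretching
pl_trainer_precision: '16-mixed'  # AMP; already the default in base.yaml above
max_batch_frames: 80000
max_batch_size: 48            # illustrative "larger batch size", bounded by GPU memory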

@spicytigermeat
Author

The vocoder has nothing to do with the timbre. Do the different timbres from the same singer also sound the same on TensorBoard? How different do the timbres sound from each other?

I only mentioned the vocoder to cover all of the differences. No, the different timbres sound very similar to the ground truth samples in TensorBoard (so long as training is far enough along).

Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit it; do not edit any pre-existing files, and do not derive from base.yaml directly, as described in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.

Okay, I'll readjust the configuration using your recommendations and see if I get better results! In what cases would you recommend using fine-tuning?

@yqzhishen
Member

yqzhishen commented Dec 6, 2023

No, the different timbres sound very similar to the ground truth samples in TensorBoard (so long as training is far enough along).

So you mean the timbres are distinct from each other on TensorBoard but very similar in OpenUTAU? The only possibility I can imagine is that if you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model, the timbres could get mixed up. But in your configuration I see you did not enable these two parameters.
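
As a point of reference, a variance config trained alongside this acoustic model would presumably need to mirror the same speaker setup; below is a minimal sketch that reuses the keys and speaker names from the acoustic config above, with everything else omitted:

# hypothetical variance config fragment; mirrors the acoustic speaker setup
num_spk: 13
use_spk_id: true
speakers:                     # same names, same order, same indices as the acoustic config
  - tiger_fresh               #0
  - triton_gale               #1
  - canary_core               #2
  # ... remaining entries identical to the acoustic config
raw_data_dir:                 # same per-speaker directories, in the same order
  - data/training_data/tiger_fresh
  - data/training_data/triton_gale
  - data/training_data/canary_core
  # ...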

In what cases would you recommend using fine-tuning?

Currently the only recommended use case is training the aux decoder and the diffusion decoder separately when enabling shallow diffusion. Fine-tuning is not that helpful in regular cases. If you fine-tune a model, it will not save many training steps if you want to totally wash out the timbres in the pre-trained model. If you train enough steps, it will cause catastrophic forgetting. If you discard some layers or embeddings before fine-tuning, it may perform even worse than starting from scratch. Meanwhile, fine-tuning requires careful adjustment of the training-related hyperparameters to get the best results. In short, do not use fine-tuning unless guided by the documentation, or unless you are an expert and are clearly aware of what you are doing. And especially, for people who own enough high-quality, well-labeled data: please train from scratch.

@spicytigermeat
Author

So you mean the timbres are distinct from each other on TensorBoard but very similar in OpenUTAU? The only possibility I can imagine is that if you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model, the timbres could get mixed up. But in your configuration I see you did not enable these two parameters.

Yes, they sound distinct in TensorBoard but almost identical in OpenUTAU. There are slight differences in the waveforms, but generally all of the unique timbre gets removed and they all sound like they've been trained together rather than as separate speakers. I generally wasn't happy with the results I got with energy and breathiness before, so I decided not to train with those parameters. Do you think that might have something to do with this issue?

Thank you for better explaining the use of fine-tuning! I'll be sure to stick to training from scratch going forward.

The only other thing I can think of that might be causing this issue is the amount of data I'm using and the number of speakers. I never had this issue when I was training on smaller amounts of data (~2 hrs, 2 different vocalists, 6 different "voice modes"/speakers in DiffSinger), and now my dataset is ~6 hrs, 6 different vocalists, and up to 23 different speakers in DiffSinger. That's about all my GPU can handle (I train locally).

I ran another test last night training only about 2 hrs of data across 3 vocalists and 10 speakers in the config, and still got the same issue after 200 epochs / 12k steps of acoustic training. I can confirm the data is high quality and tagged well (all done by me by hand). Thanks so much for all of your help!

@yqzhishen
Member

yqzhishen commented Dec 7, 2023

I mean, if the timbres are distinct from each other on TensorBoard, they are expected to be distinct from each other in OpenUTAU as well, because the conditions are the same. If you do believe the TensorBoard samples really sound as you expected, then there must be something wrong elsewhere; otherwise, the problem would already have shown up on TensorBoard.

A possible way to debug is to export the DS files from OpenUTAU and use python scripts/infer.py acoustic your_project.ds --spk your_spk to verify whether the model is really trained correctly.

I personally train with 5 or 6 vocalists and ~9 timbres in total in every one of my experiments, and I have never encountered any issue with the differences between timbres. Some people in the community train larger datasets than mine, with multi-timbre singers in them, and they have no problem either.

I generally wasn't happy with the results I got with energy and breathiness before, so I decided not to train with those parameters.

In my experience and that of other people in our community in China, the variance parameters do not cause any deterioration in quality, but rather improve stability and controllability. However, if you do not train them well, they can cause some problems, and there are some interesting findings in our recent research about the mutual influence between variance modules. These have been added to the documentation, and a minor release will also be published to notify users about it.

@spicytigermeat
Author

Thanks for the tip on debugging by inferring directly from the checkpoint. It turns out it's either an OpenUTAU issue or a deployment issue, because direct inference from the checkpoint via the command line actually gave me the proper output with separate timbres. I'll have to keep messing around with the OpenUTAU library file structure to figure out why it's getting the embeds confused, which is my guess. Do they have to be in the OpenUTAU configs in a certain order, as far as you're aware?

@yqzhishen
Member

When exporting to ONNX, you should use --export_spk spk1 --export_spk spk2 ... to export all your desired embeds; if this option is unset, the exporter exports all the embeds. Then you should write them down in the OpenUTAU config as its wiki says, and yes, they should be in an ordered list, but in any order you would like.

So you might have mixed up your embeds somehow, or it may just be a mistake in how you are using OpenUTAU. You should first check your embeds to see whether they are really different, then your configs, and then OpenUTAU itself (for example, use a clean install or reset all the preferences in case there is some misconfiguration in the expression settings).

@spicytigermeat
Author

spicytigermeat commented Dec 9, 2023

First of all, thank you so much for all of your dedication in helping me solve this issue; I've learned a ton!

Second of all, I discovered that the issue IS OpenUTAU. Apparently, if the embed files are not in the same directory as the character.yaml file, you have to specify where they are. I thought it pulled from the list of speakers, so it was totally my misunderstanding.
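
For anyone who runs into the same thing: below is a minimal sketch of what the speaker list in the OpenUTAU voicebank config could look like when the exported embeds sit in a subfolder next to character.yaml. The speakers key is the ordered list mentioned earlier, but the folder and file names here are assumptions, so check the OpenUTAU DiffSinger wiki for the authoritative layout.

# hypothetical fragment of the OpenUTAU DiffSinger config;
# paths are written relative to the folder containing character.yaml
speakers:                     # ordered list of exported speaker embeds
  - dsmain/tiger_fresh.emb    # include the subfolder, not just the file name
  - dsmain/triton_gale.emb
  # ... one entry per exported speaker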
