Tricks to prepare the training dataset #5

adagio715 opened this issue Sep 17, 2022 · 13 comments

@adagio715

Hello,
I'm very interested in your great work! I have three questions; would you mind helping me with them?

  1. I was wondering whether your network could be extended to other kinds of audio data, such as music. To this end, I tested it on different instrument datasets. In test 1, I had 33 instruments (like the "speaker" in your case), each containing only about 3 minutes of audio data. In test 2, I had 11 instruments, each containing about 1 hour of audio data. So overall, the audio data in test 1 is shorter than in test 2. However, when I ran the experiments on the same machine, test 2 ran ~2-3 times faster than test 1. Does the training speed depend more on the number of "speakers" than on the total duration of the training data?
  2. What duration would you recommend for each .wav file? In your dataset, each piece of training data is rather short (~2-3 seconds). Would your network also work for long files, such as 1-2 minutes?
  3. For inference, you set the infer_step as 8 with a specific infer_schedule. Is 8 the best value in your experiments? If we want to test a different infer_step, how should we set the infer_schedule?

Thank you very much for your help in advance!

@eloimoliner

Hi,
I'm not an author of this paper, but I'm also working on bandwidth extension of music, so I'm very interested to hear whether you manage to apply this method to music signals. I've also tried the same (for piano only), but quite unsuccessfully.

I can try to guess some answers to your queries:

  1. I guess that depends on how the data loading is designed; perhaps there is something suboptimal in how the audio files are loaded. It is strange, though: are you sure you are training under the same conditions? Just out of curiosity, for how long are you training?

Anyway, it seems a bit too ambitious to me to train this model on a multi-instrument dataset. Considering that this model is quite small (around 1M params), it would be hard for it to generate such diverse music. Did you get any interesting results?

  2. It would be possible to extend the length of the segments, but 1-2 minutes is too much; you would run out of memory. What you should do, I guess, is process small chunks and then concatenate them somehow, for example with overlap-add (a rough sketch follows below this list).

  3. 8 seems like a very small number; this fast inference schedule was probably fine-tuned for the speech dataset they trained on. If you train on a different dataset, this schedule would likely not work well; you will probably have to increase the number of steps or define another schedule yourself.
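To illustrate what I mean by overlap-add, here is a minimal sketch. It is not from the NU-Wave 2 code; the chunk length, overlap, and the `upsample` callable are placeholders for your own model call.

```python
import torch

def process_in_chunks(x, upsample, chunk_len=96000, overlap=4800):
    """Overlap-add processing of a long 1-D signal.

    `upsample` stands in for your bandwidth-extension model and is assumed
    to return a tensor with the same length as its input chunk.
    """
    hop = chunk_len - overlap
    out = torch.zeros_like(x)
    norm = torch.zeros_like(x)
    # Linear cross-fade window applied at the chunk boundaries.
    win = torch.ones(chunk_len, dtype=x.dtype, device=x.device)
    win[:overlap] = torch.linspace(0.0, 1.0, overlap, dtype=x.dtype, device=x.device)
    win[-overlap:] = torch.linspace(1.0, 0.0, overlap, dtype=x.dtype, device=x.device)

    for start in range(0, x.shape[-1], hop):
        end = min(start + chunk_len, x.shape[-1])
        y = upsample(x[start:end])        # one chunk through the model
        w = win[: end - start]
        out[start:end] += y * w           # accumulate the faded chunk
        norm[start:end] += w              # accumulate the window weights
    return out / norm.clamp(min=1e-8)
```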

@psp0001060

@eloimoliner
Hi,

  1. You mentioned "I've also tried the same (for piano only), but quite unsuccessfully." Is the generated music of poor quality?

  2. You also said "It would be possible to extend the length of the segments, but 1-2 minutes is too much; you would run out of memory. What you should do, I guess, is process small chunks and then concatenate them somehow, for example with overlap-add." I think so too, but I haven't done the experiment yet. Have you done this experiment?

@Seungwoo0326
Collaborator

@adagio715
Hi, I'm sorry for the late response.

  1. We haven't tried it on other kinds of audio, such as music, but we expect that our model is not limited to speech. You may try training it on your dataset and testing it. As for your question: we train on randomly selected fixed-length chunks of each audio sample in the dataset, so the training speed does not depend on the total duration of your dataset; it depends on the number of audio samples in it. Additionally, we split the dataset into training and validation sets at a fixed ratio based on the speaker, so if you want to control this you can adjust cv_ratio in the config file. In short, the training speed depends on the number of audio samples from the speakers used for training, which in turn depends on cv_ratio in the config file.

  2. As answered above, we train on randomly selected chunks of each audio sample, so the duration of each wav file is not that important. However, we load the whole wav file first and then segment it, so very long files can make data loading inefficient (a rough illustration of this chunking is sketched after this list).

  3. I set the infer_step as 8 just for comparison with nu-wave under the same conditions. To change the number of inference steps, you can simply pass args.steps as input during inference. But it will only work well for large step counts, above about 30, because it then uses a uniform noise schedule based on the log-SNR which is not hand-tuned. If you want to test a different noise schedule you can change the code below. And as @eloimoliner answered (thank you for answering!), you may need to define another noise schedule for other datasets.
    https://github.com/mindslab-ai/nuwave2/blob/c05a0b6135ce1c46b9a5e57f132f18d3e3091745/inference.py#L66-L70
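Going back to the chunked training mentioned in points 1 and 2: for illustration only, such random fixed-length chunking looks roughly like the sketch below. This is not the actual NU-Wave 2 dataloader; the chunk length and the use of torchaudio are assumptions.

```python
import random
import torch
import torchaudio
from torch.utils.data import Dataset

class RandomChunkDataset(Dataset):
    """Loads a whole wav file, then returns one random fixed-length chunk.
    Per-epoch cost therefore scales with the number of files, not with the
    total duration of the audio."""

    def __init__(self, wav_paths, chunk_len=32768):
        self.wav_paths = wav_paths   # list of wav file paths (placeholder)
        self.chunk_len = chunk_len   # fixed training length in samples

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        wav, _ = torchaudio.load(self.wav_paths[idx])   # load the whole file
        wav = wav.mean(dim=0)                           # mix down to mono
        if wav.shape[-1] <= self.chunk_len:
            # Pad short files up to the fixed chunk length.
            wav = torch.nn.functional.pad(wav, (0, self.chunk_len - wav.shape[-1]))
        else:
            # Random crop of a fixed-length chunk.
            start = random.randint(0, wav.shape[-1] - self.chunk_len)
            wav = wav[start:start + self.chunk_len]
        return wav
```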

@adagio715
Author

@Seungwoo0326
Thank you very much for your explanation! Regarding your answer to our last question: if we try another number of inference steps, how should we define the infer_schedule accordingly? More specifically, for 8 steps, hparameters.yaml sets the infer_schedule as infer_schedule: "torch.tensor([-2.6, -0.8, 2.0, 6.4, 9.8, 12.9, 14.4, 17.2])". We were wondering how these 8 specific numbers were determined.
By the way, we are not sure if we did something wrong, but we found the provided official checkpoint a bit strange. We used your source code and tried the official checkpoint on the original low-resolution audio samples shown on your demo website. However, the results look much more like the NUWave model than the NUWave2 model; specifically, the super-resolved frequency bands are very weak in energy. Could you be so kind as to double-check that the provided checkpoint is correct? Or, if you have heard of similar issues, could you point out what we might be doing wrong?

@eloimoliner @psp0001060
Thank you both very much for sharing your thoughts! So far, we have tried the model on the URMP dataset, but the results are not very good yet; the super-resolved part has too much noise. We see two possible reasons for now: 1. the inference steps and schedule are improper; 2. the training dataset is not large enough.
If you manage to make it work well on your music dataset, it would be great to hear about your experience!

@eloimoliner

eloimoliner commented Oct 4, 2022

Hi,
I tried training it on MAESTRO, which consists of piano music, and the result was not quite what I expected. Here you can see some spectrograms:

[Image: nuwavepiano spectrograms]

and here is the wandb report, where you can also listen to the audio examples.

https://wandb.ai/eloimoliner/diffusion_june_2022/reports/Training-NUWave2-on-MAESTRO--VmlldzoyNzM4MTE2?accessToken=bicl6e6c9c01oqm5hljlqkxsq8imfdmdfvpf8qvzmnm9q7h8efjwga1zm00itvqi

NUWave2 seemed to learn where to add energy, but it seemed to fail at generating harmonics. I must say this is not a completely fair comparison, as I was not using the same diffusion parameterization as in the paper; my diffusion model is based on https://arxiv.org/abs/2206.00364, which I know works quite well for music generation (I will publish the results in a couple of weeks). My experiment was only about the NUWave2 architecture (and the conditioning). I want to believe that I did something wrong, but I cannot find any mistake. Anyway, it is a pity I can't include this model as a baseline.

Also, I was working at a 22050 Hz sampling rate.
@Seungwoo0326,
By any chance, did you find similar problems, and do you have any intuition about why this may happen?

@Seungwoo0326
Collaborator

@adagio715
To try another noise schedule, just remove the code linked above (lines 66-70 of inference.py) and define the args.steps and noise_schedule you want.
For example,
args.steps = 10
noise_schedule = torch.tensor([-2.6, -0.8, 0.0, 2.0, 4.0, 6.4, 9.8, 12.9, 14.4, 17.2])
The noise schedule we used was handcrafted, so if you want to try another schedule you will need some trial and error. Or, as I said above, just use a large number of iterations with a uniformly spaced noise schedule. We didn't study reducing the number of iterations in depth; you may want to read papers on reducing the number of diffusion iterations and adapt them.
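As an illustration of the "large number of iterations with a uniform schedule" option, one could use something like the sketch below. The endpoints here simply reuse the first and last values of the handcrafted 8-step schedule and are only an assumption; the bounds used by the repository's own uniform schedule may differ.

```python
import torch

def uniform_logsnr_schedule(steps: int = 32,
                            logsnr_min: float = -2.6,
                            logsnr_max: float = 17.2) -> torch.Tensor:
    """Uniformly spaced log-SNR noise schedule (illustrative stand-in for the
    handcrafted 8-step infer_schedule, which spans the same range)."""
    return torch.linspace(logsnr_min, logsnr_max, steps)

# e.g. in place of the linked block:
# args.steps = 32
# noise_schedule = uniform_logsnr_schedule(args.steps)
```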

I don't think our checkpoint is wrong. Could you share the command you used for inference? You may have made a mistake in the inference command.

@Seungwoo0326
Collaborator

Seungwoo0326 commented Oct 5, 2022

@eloimoliner
Thank you for sharing the result!
But I think you need to train longer. We trained our model for about 1M iterations, but I can see that you only trained for about 200k.
We also observed spectrograms looking similar to the ones you shared in the middle of our training process.
Your diffusion process is different from ours, so the behaviour can differ; still, I think you need to train longer.

I would be glad if you continue to share your results so we can keep discussing them together!

@adagio715
Author

adagio715 commented Oct 5, 2022

@Seungwoo0326
Thank you for the guidance. We will try different infer steps and noise schedules.
Regarding our inference experiment using the provided checkpoint: We used the exact inference.py code in the repo with the following arguments:

--checkpoint /PATH/TO/nuwave2_02_16_13_epoch=629.ckpt
--wav /PATH/TO/THE/AUDIO/SAMPLES/sample_1.wav   ## As said before, we used the low-resolution samples as provided on your demo webpage
--sr 48000

The other arguments were left at their defaults as in the inference.py code. We also used the exact hparameter.yaml from the repo without changing any of the provided parameters.
Any ideas?

@Seungwoo0326
Collaborator

Seungwoo0326 commented Oct 5, 2022

@adagio715
I think my explanation in the README may be confusing.
The '--sr' argument is the sampling rate of the "downsampled" input. So if you want to upsample from a 16kHz audio sample, '--sr' should be 16000.
I understand why you set '--sr' to 48000, but my intention was for it to give the bandwidth of the input audio as a condition.
And if the sampling rate of your input is 48kHz, you also need to pass the '--gt' flag, because with '--gt' we low-pass filter the input before it goes into the model.

--checkpoint /PATH/TO/nuwave2_02_16_13_epoch=629.ckpt
--wav /PATH/TO/THE/AUDIO/SAMPLES/sample_1.wav
--sr 16000  # just for example
--gt  # needed if your input is a 48kHz file. If the sampling rate of your input is 16kHz, the same as '--sr', you don't need this flag.

I know this can still be confusing, but I hope this explanation helps. I will correct the README soon.
Please try it!

The case of 16kHz -> 48kHz:
input bandwidth 16kHz, sampling rate 16kHz -> --sr 16000
input bandwidth 16kHz, sampling rate 48kHz -> --sr 16000, --gt
input bandwidth 48kHz, sampling rate 48kHz -> --sr 16000, --gt
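For instance, assuming a 48kHz demo file whose content is band-limited to 16kHz, the full call would look roughly like this (paths are placeholders, as above):

python inference.py \
  --checkpoint /PATH/TO/nuwave2_02_16_13_epoch=629.ckpt \
  --wav /PATH/TO/THE/AUDIO/SAMPLES/sample_1.wav \
  --sr 16000 \
  --gt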

@adagio715
Author

@Seungwoo0326
Thanks for the explanation and the examples!
Just to clarify, if you don't mind :) According to your examples on the demo webpage, for super-resolved audio samples at 48kHz, the "valid" frequency content goes up to 24kHz (by the Nyquist-Shannon sampling theorem). Now, if my input audio sample has a sampling rate of 48kHz but its "valid" frequency content only goes up to 8kHz, should the parameters be --sr 16000, --gt or --sr 8000, --gt?

@Seungwoo0326
Collaborator

@adagio715
For the case you said, it should be --sr 16000, --gt.
I originally imagined the situation with downsampled audio at a sampling rate below 48 kHz as the input, and added the '--gt' flag as an optional extra.
But I see that this is confusing; I will correct it as soon as possible. Thank you for asking.

@ARTUROSING

Hi, the other day I tried experimenting with the model and training.

  1. The generated music may have poor quality if the segment length is too short, as it may not capture the full musical structure. However, it can still be useful for research purposes.
  2. Yes, I have experimented with processing small chunks and concatenating them with overlap-add. It is a viable method for extending the length of the segments.

@ARTUROSING


Regarding @adagio715's questions above:

The inference schedule in hparameters.yaml was set based on the authors' experience and the characteristics of the model; the specific numbers are a starting point and can be adjusted for your use case.
It's possible that the provided checkpoint is simply not well suited to the URMP dataset. It's always worth trying different checkpoints to see which works best for your task; the quality of the results also depends on the training data, so improving the training dataset is worth trying.
The authors' demo page provides low-resolution audio samples; the super-resolution results may look different when you feed in high-resolution audio, so it's worth checking with the official demo samples to see whether the results differ.
