
Different sampling rates #5

Open
EmreOzkose opened this issue Jun 28, 2021 · 26 comments

@EmreOzkose

Hi!

Did you observe trainings with different sampling rates, such as 8k->16k, 8k->22k, 16k->22k, etc.?
(different from the demo page)

And what changes should we make to train with such data? (maybe hop length, n_fft, noise_schedule, pos_emb_scale, etc.)

@junjun3518
Contributor

Hi! Thank you for the great question!
We only tested x2 and x3, but I think this work is also applicable to non-integer upsampling ratios.
You only need to change the linear interpolation ratio and the downsampling method.
The STFT is only needed for our in-phase downsampling, so in this case I recommend applying librosa.resample instead of our downsampling functions in the dataloader.
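For illustration, here is a minimal sketch of that swap, assuming a dataloader that receives a full-rate waveform tensor; make_lowres, the 16k->22k rates, and the interpolation back to the target length are assumptions, not the repository's actual code.

import librosa
import torch

def make_lowres(wav: torch.Tensor, sr_high: int, sr_low: int) -> torch.Tensor:
    # Downsample with librosa, then linearly interpolate back to the
    # original length to form the conditioning signal.
    y_low = librosa.resample(wav.numpy(), orig_sr=sr_high, target_sr=sr_low)
    return torch.nn.functional.interpolate(
        torch.from_numpy(y_low)[None, None, :],
        size=wav.shape[-1], mode='linear', align_corners=False,
    )[0, 0]

# e.g. for a 16k -> 22.05k experiment, condition on:
# cond = make_lowres(wav_22k, sr_high=22050, sr_low=16000)

For integer ratios (x2, x3) the repository's in-phase downsampling can stay as-is; the swap only matters when the ratio is non-integer.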

@EmreOzkose
Author

Thank you for the quick reply :). I will try this experiment as soon as possible and report back here.

@junjun3518
Contributor

Thank you for trying the additional experiment! Please let me know if you need any help!

@EmreOzkose
Author

EmreOzkose commented Jul 9, 2021

I did some experiments with 16k samples. I used 4 hours of 16k data, and the default model has been training for 9 days. So far, everything is okay. I am sharing the tensorboard log.

[Screenshot: tensorboard training log, 2021-07-09]

It seems that the model is able to boost the quality of the sound well qualitatively. I also want to observe the difference between conventional upsampling and neural upsampling. I have a 16k test set, which I downsampled for testing. I also have an acoustic model (trained on 16k audio). The results are here:

data                              | word error rate
16k original sounds               | 4.8
re-sampled with sox               | 7.0
re-sampled with neural upsampling | 8.3

So, neural upsampling is worse than sox :(.
When I listen to the upsampled sound, I hear some extra noise. How can I remove these unwanted noises? Do you think I should change something in the noise-adding part?

@junjun3518
Contributor

Interesting work!
I expected that you would need to train more. Since the denoising score matching process is trained with random Gaussian noise and is complicated, we also ran into similar problems. We trained our model on two A100 or V100 GPUs for over 2 weeks.

@junjun3518
Contributor

junjun3518 commented Jul 10, 2021

In addition, if you have an STT model, you could apply the conditional score generation suggested by Yang Song (https://arxiv.org/pdf/2011.13456.pdf, Section 5 and Appendix I).
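For context, a minimal sketch of that idea (classifier guidance from Song et al., Section 5): the unconditional score is augmented with the gradient of a recognizer's log-likelihood. score_model, classifier, and their call signatures are hypothetical stand-ins, not this repository's API.

import torch

def conditional_score(score_model, classifier, x, t, y, scale=1.0):
    # s_theta(x, t) + scale * grad_x log p(y | x, t)
    x = x.detach().requires_grad_(True)
    uncond = score_model(x, t)
    log_probs = classifier(x, t).log_softmax(dim=-1)
    selected = log_probs.gather(-1, y.unsqueeze(-1)).sum()
    grad = torch.autograd.grad(selected, x)[0]
    return uncond + scale * grad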

@EmreOzkose
Author

Thank you for the advice. I think the problem is limited training data and computational power. Training is continuing :). I am also going deeper and adapting the model to my case. I will report back if the problem is solved.

Yang Song's work is also interesting. I will check whether I can apply it. Thank you :).

@EmreOzkose
Author

Hi :).

I did some experiments on the same dataset with a different noise level. The paper mentions a different noise schedule (noise_schedule: "torch.linspace(1e-4, 0.005, hparams.ddpm.max_step)"), and I used those levels. After 6.8 days of training, there is some extra noise again, as in the previous experiment. I will also run the same evaluation (against the sox result) and report here.

Do you think that noise level is too high for 8k->16k? Or is it okay for that setup?
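(For reference, a small sketch comparing the two linear schedules mentioned above by their cumulative signal coefficient sqrt(alpha_bar_t) = sqrt(prod_{s<=t}(1 - beta_s)); treating the noise_schedule values as DDPM betas is an assumption about the convention here.)

import torch

max_step = 1000
paper_schedule = torch.linspace(1e-4, 0.005, max_step)
repo_default = torch.linspace(1e-6, 0.006, max_step)

for name, beta in [("paper", paper_schedule), ("repo default", repo_default)]:
    alpha_bar = torch.cumprod(1.0 - beta, dim=0)
    # How much of the clean signal survives at the final diffusion step.
    print(name, alpha_bar[-1].sqrt().item())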

@junjun3518
Contributor

Hmm, I think you need to modify the inference schedule instead of the training schedule. Since the 8-iteration schedule's values are fit to our setup, they may not be optimal for 8k->16k.
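(For reference: infer_schedule in hparameter.yaml is an 8-value tensor of noise levels, shown below as it appears in the configs later in this thread; the variant is purely illustrative, not a recommended setting.)

import torch

# default 8-step inference schedule from hparameter.yaml
infer_schedule = torch.tensor([1e-6, 2e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 9e-1])

# an illustrative variant one could sweep for 8k->16k (assumption only)
candidate_schedule = torch.tensor([1e-6, 5e-6, 5e-5, 5e-4, 5e-3, 5e-2, 1e-1, 5e-1])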

@EmreOzkose
Author

I think you are right; I didn't change the inference part. I will check and report here. Thank you so much :).

@EmreOzkose
Author

EmreOzkose commented Jul 27, 2021

I want to share some more observations. Here are spectrograms of a 16k test sound and of its upsampled version (after being downsampled to 8k).

[Screenshot: spectrograms of the original and neural-upsampled sound, 2021-07-27]

The predicted region looks very good. Here is the upsampled version of the same sound with sox (after being downsampled to 8k).
[Spectrogram: 61-70968-0000, upsampled with sox]

However, as we observed previously, some noise is added to the 0kHz-4kHz band. The sound quality certainly improves, but my acoustic model performs worse on the neural-upsampled test set than on the sox-upsampled one. I think the primary reason is the 0kHz-4kHz band.

The figure below shows the following waveforms:

  1. the downsampled example (same as above)
  2. its neural-upsampled version

[Screenshot: waveform comparison, 2021-07-27]

and a zoomed version:
[Screenshot: zoomed waveform comparison, 2021-07-27]
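(For anyone reproducing this comparison, a sketch of how such spectrograms can be plotted; the filenames and STFT settings are illustrative.)

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

for title, path in [("original 16k", "orig.wav"),
                    ("neural-upsampled", "nuwave.wav"),
                    ("sox-upsampled", "sox.wav")]:
    y, sr = librosa.load(path, sr=16000)
    S = librosa.amplitude_to_db(
        np.abs(librosa.stft(y, n_fft=1024, hop_length=256)), ref=np.max)
    plt.figure()
    librosa.display.specshow(S, sr=sr, hop_length=256, x_axis="time", y_axis="hz")
    plt.title(title)
    plt.colorbar(format="%+2.0f dB")
plt.show()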

@junjun3518
Contributor

junjun3518 commented Jul 27, 2021

(Now I can see it.)
In my opinion, neural upsampling and the accuracy of ASR/STT could be unrelated, OR your dataset is too small to train a general model.
If the problem is the amount of data, I recommend adding an open dataset, such as downsampled VCTK or LibriSpeech, to your dataset (see the sketch below).
For us, we similarly suffered from high noise above 10kHz, but not noise at low frequencies.
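(A minimal sketch of preparing such a downsampled copy of an open corpus; the paths and target rate are illustrative.)

import librosa
import soundfile as sf
from pathlib import Path

SRC, DST, TARGET_SR = Path("VCTK-Corpus/wav48"), Path("vctk_16k"), 16000

for wav_path in SRC.rglob("*.wav"):
    y, _ = librosa.load(wav_path, sr=TARGET_SR)  # load and resample in one step
    out = DST / wav_path.relative_to(SRC)
    out.parent.mkdir(parents=True, exist_ok=True)
    sf.write(out, y, TARGET_SR)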

@junjun3518
Contributor

In addition, I am curious about your hyperparameters. Please let me know your batch size, audio length, and any other differences from our hparameter.yaml file.

@EmreOzkose
Author

EmreOzkose commented Jul 27, 2021

My hparameter.yaml:

train:
  batch_size: 6
  lr: 0.00003
  weight_decay: 0.00
  num_workers: 32
  gpus: 1 #ddp
  opt_eps: 1e-9
  beta1: 0.5
  beta2: 0.999

data:
  dir: "../neural_upsampling/data"
  format: '*.pt'
  cv_ratio: (1./2., 1./2., 0.00) #train/val/test

audio:
  sr: 16000
  nfft: 1024
  hop: 256
  ratio: 2 #upscale_ratio
  length: 32768 #32*1024 ~ 1sec

arch:
  residual_layers: 30 #
  residual_channels: 64
  dilation_cycle_length: 10
  pos_emb_dim: 512 

ddpm:
  max_step: 1000
  noise_schedule: "torch.linspace(1e-6, 0.006, hparams.ddpm.max_step)"
  pos_emb_scale: 50000
  pos_emb_channels: 128 
  infer_step: 8
  infer_schedule: "torch.tensor([1e-6,2e-6,1e-5,1e-4,1e-3,1e-2,1e-1,9e-1])"

log:
  name: 'nuwave_x2'
  checkpoint_dir: 'checkpoint'
  tensorboard_dir: 'tensorboard'
  test_result_dir: 'test_sample/results'

My GPU is a Tesla P100-PCIE-16GB.

@EmreOzkose
Author

I am increasing the data now. I will start a training run and report here as soon as possible.

@anhnv125

anhnv125 commented Sep 23, 2021

Hi!

I am training a nu-wave model to upsample from 8k to 16k. So far I have trained for over 64k iterations, but the model doesn't seem good (spectrogram and loss curve attached below).
[Attached: spectrogram and loss curve]

Here is my config. I also changed the downsampling/upsampling in the dataloader to librosa.resample.

train:
  batch_size: 18
  lr: 0.00003
  weight_decay: 0.00
  num_workers: 8
  gpus: 2 #ddp
  opt_eps: 1e-9
  beta1: 0.5
  beta2: 0.999

data:
  dir: 'vctk/VCTK-Corpus/wav48' #dir/spk/format
  format: '*.pt'
  cv_ratio: (100./108., 8./108., 0.00) #train/val/test

audio:
  sr: 16000
  nfft: 1024
  hop: 256
  ratio: 2 #upscale_ratio
  length: 32768 #32*1024 ~ 1sec

arch:
  residual_layers: 30 #
  residual_channels: 64
  dilation_cycle_length: 10
  pos_emb_dim: 512 

ddpm:
  max_step: 1000
  noise_schedule: "torch.linspace(1e-6, 0.006, hparams.ddpm.max_step)"
  pos_emb_scale: 50000
  pos_emb_channels: 128 
  infer_step: 8
  infer_schedule: "torch.tensor([1e-6,2e-6,1e-5,1e-4,1e-3,1e-2,1e-1,9e-1])"

log:
  name: 'nuwave_x2'
  checkpoint_dir: 'checkpoint'
  tensorboard_dir: 'tensorboard'
  test_result_dir: 'test_sample/result'

Am I doing this correctly? Or should I wait longer for the training? Please give me some suggestions; I would appreciate it.
Thanks a lot.

@junjun3518
Contributor

Hello Viet Anh!
As already mentioned, since the diffusion model is complicated, it needs a lot of time for training. Our results targeting 48k were trained over 240k epochs during 2 weeks on 2 V100 or A100 GPUs.
For now, I think you need to wait for more training.
If it is well trained, most of the time we can observe a clear spectrogram at y_recon.

Thank you for considering our model as a reference, and I will be waiting for your upcoming paper!

@anhnv125

Thank you for your response.
That means I have configured the model correctly; that's great. I was thinking it needed more training time too. But the loss curve I attached seems to have converged while the spectrogram has not. So would further training make a big difference in this case?
Also, you mentioned 240k epochs; I think it should be 240k iterations, as the curve in the README indicates, shouldn't it?

@junjun3518
Contributor

Oh, sorry for the misinformation. Yes, I meant 240k iterations instead of epochs.

@anhnv125

This is the result I got after over 260k iterations. There is still white noise, and the high frequencies were not reconstructed properly. What should I do now?
[Attached: spectrogram and loss curve]

@junjun3518
Contributor

Interesting! It is still noisy after 8 iterations? We only trained and tested on the 48k target, so I am not very sure about 8k->16k setups! Reporting your results will be helpful for our work too!

@junjun3518
Contributor

Please run for_test.py or lightning_test.py for numerical results.

@anhnv125

Here are the results:

DATALOADER:0 TEST RESULTS
{'base_lsd': 6.641576766967773,
 'base_lsd^2': 47.209537506103516,
 'base_snr': 10.364435195922852,
 'base_snr^2': 114.63585662841797,
 'lsd': 5.008662700653076,
 'lsd^2': 27.40099334716797,
 'snr': 7.459704875946045,
 'snr^2': 57.45094299316406}

Not really good, right? I only changed audio.sr to 16000 and audio.ratio to 2.
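(For reference, a sketch of how these two metrics are commonly computed; the repository's exact implementation in for_test.py may differ.)

import torch

def snr_db(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Signal-to-noise ratio in dB against the reference waveform.
    return 10 * torch.log10(target.pow(2).sum() / (target - pred).pow(2).sum())

def lsd(pred: torch.Tensor, target: torch.Tensor, n_fft: int = 2048, hop: int = 512) -> torch.Tensor:
    # Log-spectral distance: RMS over frequency of the log-power
    # difference, averaged over frames.
    window = torch.hann_window(n_fft)
    sp = torch.stft(pred, n_fft, hop, window=window, return_complex=True).abs().pow(2)
    st = torch.stft(target, n_fft, hop, window=window, return_complex=True).abs().pow(2)
    diff = torch.log10(sp + 1e-10) - torch.log10(st + 1e-10)
    return diff.pow(2).mean(dim=-2).sqrt().mean()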

@junjun3518
Contributor

I think it is a similar problem: our 48k model is also not good at generating harmonics. Our 48k model generates frequency content above 12k, which does not contain many harmonics, whereas 8k->16k is almost entirely harmonic generation.
Thank you for reporting your results.
We will adjust our model to be more robust at low frequencies.

@anhnv125

Thank you for your explanation; I have noted that. Looking forward to your adjustments.

@junjun3518
Contributor

I found that a recent work from ByteDance (https://arxiv.org/pdf/2109.13731.pdf) cites our work, and their results are not good either. I recommend you read it!
