
very bad reconstruction if I change frame_rate and example_secs #127

Closed · james20141606 opened this issue Jun 9, 2020 · 23 comments

@james20141606

Hi, I have been using the DDSP autoencoder for audio-to-audio reconstruction and it works really well. But when I change the frame_rate from 250 to 50/100/200 and example_secs to 2/4, the reconstruction makes no sense at all. It's very strange to me; I would have expected DDSP to be robust to frame_rate and example_secs. I can see the loss decrease, but the spectrogram and audio reconstructions are not good. Previously, with a 250 frame rate and 4 seconds, I could get a correlation coefficient over 0.8; now the correlation between the ground-truth and reconstructed spectrograms is 0.
I attached the reconstructed result: https://drive.google.com/file/d/1E3IMQHnQQpYT_uuRQGgdEjvbptXmCEBF/view?usp=sharing
It would be great if you could give me some hint as to why the frame rate and example_secs are so important. Or maybe there are some other parameters I omitted that are supposed to be changed?

--gin_param="TFRecordProvider.frame_rate = 200" \
--gin_param='TFRecordProvider.example_secs = 2' \
--gin_param='DefaultPreprocessor.time_steps = 400' \
--gin_param='Additive.n_samples = 32000' \
--gin_param='FilteredNoise.n_samples = 32000' \
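(For reference, these overrides have to stay mutually consistent with each other and with the dataset's sample rate; a minimal sanity check in Python, assuming the default 16 kHz audio:)

    sample_rate = 16000   # assumed; the DDSP default for these datasets
    frame_rate = 200      # TFRecordProvider.frame_rate
    example_secs = 2      # TFRecordProvider.example_secs

    # DefaultPreprocessor.time_steps and the synths' n_samples must match these:
    assert frame_rate * example_secs == 400      # time_steps
    assert sample_rate * example_secs == 32000   # Additive / FilteredNoise n_samples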
@james20141606
Author

The reason I would like to change the frame rate and example_secs is to "give more challenges to the decoder". The DDSP decoder does a really good job, and I am using the pretrained decoder for another signal-to-audio task: a framework first encodes the signal into latent f0, loudness and z, and the pretrained decoder then generates audio. I use losses to regularize the latent space produced from the signal to be similar to the latent space produced from audio by the pretrained DDSP encoder. But I found that it is really hard to encode my signal into that latent space, so I was hoping that if the decoder could accept a latent space with fewer time steps, the encoding task would be easier. That's when I found that even audio-to-audio reconstruction fails when I change the frame rate and example_secs.

@jesseengel
Contributor

It's hard to tell from the included audio (because it's playing back at 44kHz in the linked webpage), but it looks like maybe your f0_hz features are not accurate, which would cause the model not to learn (the audio sounds a lot like initialization values, iirc).

Are you regenerating a new dataset for the different frame rate etc.? Is the audio at 16kHz? If you listen to a simple sine wave with that f0, does it track the audio well?
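(One quick way to check is to resynthesize a sinusoid from the stored f0_hz feature and listen to it next to the original clip; a minimal numpy sketch, assuming a frame-rate f0 track and 16 kHz audio:)

    import numpy as np

    def sine_from_f0(f0_hz, frame_rate=250, sample_rate=16000):
        """Resynthesize a unit-amplitude sine wave from a frame-rate f0 track."""
        n_samples = int(len(f0_hz) / frame_rate * sample_rate)
        # Upsample f0 from frame rate to audio rate.
        f0_audio = np.interp(np.linspace(0, len(f0_hz) - 1, n_samples),
                             np.arange(len(f0_hz)), f0_hz)
        # Integrate instantaneous frequency to get phase.
        phase = 2 * np.pi * np.cumsum(f0_audio) / sample_rate
        return np.sin(phase)

If the result doesn't follow the melody of the original audio, the f0 labels are off.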

@james20141606
Author

Hey, I just found that the problem might be that the model restore is failing. I will solve this problem first and then retry the reconstruction. It is actually very tricky: I have found that sometimes model.restore does not work (I printed out some of the decoder layers' weights and found they were still at their initial values like 0 or 1). Even with the exact same settings, training twice, one time the model restores and another time it fails to restore, which is really strange. I also found that if I want to reuse only the decoder part, it is very tricky to restore the weights. Currently the only solution that works for me is to manually load the weights layer by layer; all the other ways failed to load the decoder's weights:

model.get_layer(name='z_rnn_fc_decoder').set_weights(
    [tf.train.load_variable(latest_checkpoint, key) for key in decoder_used_keys]
)
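(For what it's worth, one way to build a list like decoder_used_keys is to enumerate the checkpoint's variable names and filter them; a rough sketch, assuming the decoder variables contain 'decoder' in their checkpoint paths, which may differ between DDSP versions:)

    import tensorflow as tf

    latest_checkpoint = tf.train.latest_checkpoint('/path/to/checkpoint_dir')
    # [(variable_name, shape), ...] for everything stored in the checkpoint.
    all_variables = tf.train.list_variables(latest_checkpoint)
    decoder_used_keys = [name for name, _ in all_variables if 'decoder' in name]
    # Note: set_weights expects the same ordering as model.get_layer(...).weights.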

I also notice that (under fr=250 and secs=4), if I use the resnet f0 encoder and use z, f0_hz sits around 12 and varies over a really small range (around 0.1), while if I use the crepe model for f0 and no z, the f0 range is around 300. Both cases give me good reconstructions, but I am not sure why there is such a big difference.

By the way, what do you mean by

If you listen to a simple sine wave with that f0, does it track the audio well?

@james20141606
Author

when I start the training, there are some warnings like:
[screenshot: training warnings, 2020-06-10]

I am not sure whether this has anything to do with the later weight-restore failure.

@jesseengel
Contributor

jesseengel commented Jun 19, 2020

Hi, sorry for the delay,

I also found that if I want to reuse only decoder part, it is also very tricky to restore

We have some functionality for that in Trainer. You should be able to just do --gin_param="Trainer.restore_keys=['decoder']" I believe.

I also notice that (under fr=250 and secs=4) if I use resnet f0 encoder for the f0 and use z, the f0_hz will have a range around 12, and the range is really small (around 0.1), and if I use crepe model for f0 and without z, the f0 range will be around 300. Both cases will give me good reconstruction, but I am not sure why there are such big difference.

This is a problem with the original model when trained with just a spectrogram reconstruction loss: it fails to learn the correct f0 but can still reconstruct the audio. We call this "f0 collapse": the model chooses a low frequency that gives closely spaced harmonics, and then only uses a few of them in the harmonic distribution (very sparse).
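(A toy illustration of why this works, with made-up numbers rather than anything from the paper: with a collapsed f0 near 12 Hz the harmonics are only ~12 Hz apart, so a sparse harmonic distribution can still put energy close to any target partial.)

    f0_collapsed = 12.0     # Hz, roughly the collapsed value reported above
    target = 440.0          # Hz, a partial the model wants to reproduce

    k = round(target / f0_collapsed)   # nearest harmonic index
    print(k, k * f0_collapsed)         # 37 444.0 -> within ~4 Hz of the target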

We just submitted a paper that shows a way around this using self-supervision (https://openreview.net/forum?id=RlVTYWhsky7) but that code isn't in open source quite yet. It's an open research problem, but we're making progress on it.

I am not sure this has anything to do with the later weight restore error

That looks like your forward pass is not actually using those parameters, or you are calling your forward pass in a different gradient scope. The trainer usually handles the scoping properly, but I think that message can arise from not calling restore before build.

If you listen to a simple sine wave with that f0, does it track the audio well?

I was just encouraging you to check if your f0 labels / predictions were correct. It sounds from your other info that they weren't for the reasons listed above.

@james20141606
Author

thanks so much for your detailed reply! I read your new paper and it is really impressive; I really enjoy your work on disentangling pitch and timbre information. You said that previously the network suffered from the f0-collapse problem and it was more like the model learning the STFT of the audio. My question is: if we just consider the reconstruction of the audio and ignore interpretability, which latent can be learned more easily? For example, in a TTS task the DDSP-inv model would require us to learn f0 and sinusoids as latent features, but the current DDSP model only requires us to learn f0. If the reconstructed audio differs very little, it seems that only learning f0 is simpler, even if it is not interpretable.
(Although I found that the current collapsed f0 is also hard to learn when I feed in some other signal as input and force the model to generate the corresponding f0.)

@james20141606
Author

After reading your new paper again I think I still have some more questions:

  • Since you can encode sinusoid frequencies and get good reconstruction, why bother generating harmonic frequencies? Because you still want f0 to make it more interpretable?
  • How can you make sure that the harmonic encoder / previous DDSP decoder gives you harmonic frequencies? Since they are just some fully connected/RNN layers, I am curious whether the "harmonic frequencies" are really f1, f2, f3...
  • Have you tried decoding fewer harmonic dimensions, for example only f1-f4? I believe that in some vocoder models these frequencies are enough to construct intelligible voices (if we ignore timbre). I wonder what would happen if I just set the number of harmonic dimensions in DDSP to 5 or less. Will the model fail to reconstruct the audio?

@jesseengel
Contributor

Hi,

since you could encode sinusoid frequencies and get good reconstruction, why bother generating harmonic frequencies? Because you still want f0 to make it more interpretable?

This is exactly right. An easy way to get good reconstructions is just to copy the input audio to the output :). Our goal instead is to learn useful representations that are easy for people and models to control and produce nice sounding outputs. Thus we're working to build a hierarchical representation from sinusoids, to pitches, to midi notes, and back down.

how can you make sure that harmonic encoder/previous DDSP's decoder could give you harmonic frequencies?

We say in the papers that, for the moment, we force them to be harmonic: f_k = k * f_0 for k in {1, 2, 3, ..., n_harmonics}.
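(In code that constraint is just the fundamental scaled by integer harmonic numbers; a minimal numpy sketch of the idea — the get_harmonic_frequencies helper discussed below does essentially this:)

    import numpy as np

    def harmonic_frequencies(f0_hz, n_harmonics):
        """f_k = k * f0 for k = 1..n_harmonics; f0_hz has shape [batch, time, 1]."""
        harmonic_numbers = np.arange(1, n_harmonics + 1)   # [n_harmonics]
        return f0_hz * harmonic_numbers                    # [batch, time, n_harmonics]

    f0 = np.full((1, 1000, 1), 220.0)       # constant 220 Hz fundamental
    freqs = harmonic_frequencies(f0, 40)    # 220, 440, ..., 8800 Hz per frame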

Have you tried to decode less harmonic dimensions? for example, only decode f1-f4?

We haven't tried this, we just let the model use as many as it needs.
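(If you do want to try it, the harmonic dimensionality comes from the decoder's output_splits, so the experiment would look something like the override below; the split names and sizes here are assumptions based on the default configs and may differ between DDSP versions:)

    --gin_param="RnnFcDecoder.output_splits = (('amps', 1), ('harmonic_distribution', 5), ('noise_magnitudes', 65))" \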

@james20141606
Author

  1. I really appreciate your work on generating a meaningful and interpretable latent space; it's a really exciting breakthrough. I mostly care about downstream applications of DDSP-inv, like generating audio from other signals. With the previous DDSP, I could try to map the signals to f0 and loudness, and that's enough. But for the DDSP-inv model, it seems that I should map the input signal to the sinusoids? If so, at each timestep I would have to map to many more dimensions (compared to mapping to only the one-dimensional f0 at each timestep). It seems like a harder task.
  2. I also still have a question about the harmonic distribution. You assume that $f_k = k * f_0$, but from the code I could not see very clearly which part enforces this harmonic property. It seems that in your RnnFcDecoder class the dense_out layer generates the harmonic distribution? If so, how can you force it to be harmonic? Do you postprocess the output of the decoder's dense layer to make it harmonic, or do you just generate the harmonics using functions like get_harmonic_frequencies?

@jesseengel
Contributor

  1. The working hypothesis is that breaking the problem into smaller chunks will make it easier to solve as we move on to harder problems (say, estimating multiple fundamental frequencies at once).

  2. In this case, yes, the harmonic frequencies are generated with that function, but the amplitudes are determined by the decoder outputs.

@james20141606
Author

1. I agree this working hypothesis is the right way to go. I am just a little concerned that, when I use the current DDSP model, I find it hard to map other signals to the latent f0. Since you said your new work is about estimating the correct f0, I am not sure whether the mapping problem comes from f0 estimation error or just because the latent is not easy to map to. I think it might be f0-estimation related, since if I estimate f0 with the Resnet, f0 is small and varies over a small range, and if I use CREPE, the estimation is also not very good. I really look forward to the DDSP-inv code (and hope it won't introduce other problems, since there are more latent variables to map to in DDSP-inv).
2. You mean that in RnnFcDecoder, although you specify output_splits=(('amps', 1), ('harmonic_distribution', 40)), the decoder does not actually generate the harmonic distribution? I found it a little confusing, since the decoder diagram in the DDSP paper, this RnnFcDecoder class, and the outputs when I call model.decoder all have a harmonic_distribution key, which makes me wonder where the harmonic distribution is generated.

@jesseengel
Contributor

To be clear, the harmonic distribution is the distribution of amplitudes of the harmonics. If you look at the code/paper, you can see that the frequencies (for the moment) are just integer multiples of the fundamental.
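(In other words, the per-harmonic amplitude is the overall amplitude times a normalized distribution over harmonics; a rough numpy sketch of that relationship — the shapes and the normalization step are assumptions, not copied from the library:)

    import numpy as np

    amps = np.ones((1, 1000, 1))                           # overall amplitude [batch, time, 1]
    harmonic_distribution = np.random.rand(1, 1000, 40)    # decoder output [batch, time, n_harmonics]

    # Normalize over harmonics so it behaves like a distribution, then scale by amps.
    harmonic_distribution /= harmonic_distribution.sum(axis=-1, keepdims=True)
    harmonic_amplitudes = amps * harmonic_distribution     # amplitude of each f_k = k * f0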

@Forevian

It is not hard to create your own alternative additive synth using core.oscillator_bank, where you can use arbitrary frequencies and thus make an attempt to teach the network to tune the individual components, etc. I am experimenting with using a bunch of sines to add the "transient oomph" to percussive sounds, where the original is not really harmonic but there is still considerable phase coherence between the spectral components, so noise synthesis is not enough.
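(A minimal sketch of that kind of free-frequency synth, assuming ddsp.core.oscillator_bank with sample-rate frequency/amplitude envelopes of shape [batch, n_samples, n_sinusoids]; in practice a network would predict the envelopes rather than them being constants:)

    import numpy as np
    import ddsp

    sample_rate, n_samples, n_sinusoids = 16000, 64000, 16

    # Arbitrary, non-harmonic frequencies held constant over the example.
    freqs = np.random.uniform(50.0, 2000.0, (1, 1, n_sinusoids)).astype(np.float32)
    freqs = np.tile(freqs, (1, n_samples, 1))
    amps = np.full((1, n_samples, n_sinusoids), 1.0 / n_sinusoids, np.float32)

    audio = ddsp.core.oscillator_bank(freqs, amps, sample_rate=sample_rate)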

@james20141606
Author

It is not hard to create your own alternative additive synth using core.oscillator_bank, where you can use arbitrary frequencies and thus make an attempt to teach the network to tune the individual components, etc. I am experimenting with using a bunch of sines to add the "transient oomph" to percussive sounds, where the original is not really harmonic but there is still considerable phase coherence between the spectral components, so noise synthesis is not enough.

thank you for your reply! Sounds very interesting! So you still use f0, loudness (and z) as latent features? Do you still use the harmonic assumption, or do you try to generate a bunch of sines entirely from encoders as the latent? I feel that would make the model harder to optimize. And according to their latest research on DDSP-inv, it seems we should adopt their way of generating meaningful sine waves.

@Forevian

I am not using f0 and harmonicity because they are not really relevant for my data; I try to get the sines from the encoders. But maybe this is the wrong approach: I managed to resynthesize some complex samples (i.e. 48 kHz snare hits), but failed to generalize well enough. I am also looking forward to seeing DDSP-inv in practice!

@james20141606
Author

I am not using f0 and harmonicity because they are not really relevant for my data; I try to get the sines from the encoders. But maybe this is the wrong approach: I managed to resynthesize some complex samples (i.e. 48 kHz snare hits), but failed to generalize well enough. I am also looking forward to seeing DDSP-inv in practice!

It seems like DDSP-inv uses a resnet on the spectrogram to generate the sinusoids, and the consistency loss and self-supervision are essential to generating good sinusoids.

I am still confused about the consistency loss: why does DDSP-inv try to match f_sin and g_harm? Since f_sin are free sinusoids and g_harm are strictly harmonic frequencies, won't this loss restrict the freedom of f_sin? @jesseengel
I am also curious how DDSP-inv manages to estimate the correct f0. Is it because of the self-supervised training? I cannot see why the hierarchical model would improve the estimation of f0. If, in the end, we want the generated sinusoids to be similar to harmonic frequencies, could we just use self-supervision to train the previous DDSP to produce a better f0 and use harmonic synthesis to generate the audio?

@jesseengel
Contributor

Yup, the self-supervised training is key to avoiding octave errors and the "f0 collapse" we described. It's a way of enforcing a prior on the representation; you could enforce priors in other ways (with losses etc.), but this has just worked the best for us so far. You could estimate the harmonic components directly as well, but in general the hierarchy is key to our approach: gradually making the problem more focused at each level, which I think will be helpful for harder problems such as polyphony.

Also what @Forevian says is all true in terms of the oscillator bank. We'll hopefully open source all that code soon so people can play with it.

@james20141606
Author

thanks for the explanation! One more question about ddsp-inv: in your new model, have you tried using only f0 and loudness as the latent, without z? Or have you only experimented with latents that include z? I found that sometimes, without z, the model produces a voice which is very hoarse; does that mean f0 and loudness are not enough?

@jesseengel
Contributor

The new model decomposes audio into f0, harmonic distribution, and amplitude directly, so the loudness and z conditioning doesn't apply.

@james20141606
Author

I am still not sure whether f0, harmonic distribution, and amplitude (or, previously, f0 and loudness) are enough to construct different voices. In the original ddsp paper you said z contains timbre information; where is this information now? For the human voice, I am curious what controls the accent information. Do you think f0, harmonic distribution, and amplitude are enough?

@jesseengel
Contributor

jesseengel commented Jul 16, 2020

I'm sorry, I think I see the misunderstanding. Conditioning is the input to a neural network used to estimate synthesizer parameters. In the original paper, loudness was conditioning, while f0 was both conditioning and a synthesizer parameter. The loudness and f0 were used to estimate the other synthesizer parameters (harmonic distribution, amplitude, noise magnitudes).

In the current ICML paper, the synthesizer parameters (f0, harmonic distribution, amplitude, noise magnitudes) are estimated directly from the input audio (using self-supervision).

In both cases, the same synthesis algorithm is used (harmonic + filtered noise). It does a pretty good job for many sound sources, but it would definitely require some further components to produce very realistic speech. A speaker's accent is produced from a combination of the pitch of their voice (f0), the shaping of their words (harmonic distribution), the loudness of their speaking (amplitude), as well as the consonants and plosives (noise magnitudes), so no one aspect of the synthesis would account for it.
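(A rough way to summarize the two setups; the names below are illustrative, not the library's actual dictionary keys:)

    # Original DDSP autoencoder: a network maps conditioning -> synth params.
    conditioning = ['f0_hz', 'loudness_db']          # optionally plus 'z'
    synth_params = ['f0_hz', 'amplitude', 'harmonic_distribution', 'noise_magnitudes']

    # DDSP-inv (ICML paper): the synth params are estimated directly from the
    # input audio with self-supervision, so no separate loudness/z conditioning.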

@james20141606
Author

Thanks for the explanation! It's clearer now. It's interesting that noise magnitudes can model the consonants and plosives; are there any theories supporting that? If so, I think the combination of these components (f0, harmonic distribution, amplitude, noise magnitudes) is surely enough to model the human voice.

@jesseengel
Contributor

Cool, glad it was helpful. Yes, this is just one approach to modeling audio but it's general enough that it was incorporated into the MPEG4 Codec. If it's okay I'm going to close out this issue now, thanks!
