
Feeding raw audio waveform into the first layer #83

Open
ibab opened this issue Sep 24, 2016 · 12 comments

@ibab
Owner

ibab commented Sep 24, 2016

We've discussed the fact that one-hot encoding the input to the network is kind of weird, and that it would be more natural to use the waveform as a single-channel floating point tensor instead.
Does anyone have experience with running our implementation in this way?
Should we switch to this method?
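
For concreteness, here is a rough sketch of the two representations (the mu-law companding to 256 levels follows the WaveNet paper; the function names are just illustrative, not the repo's API):

```python
import numpy as np

QUANT_LEVELS = 256  # mu-law quantization channels, as in the WaveNet paper

def one_hot_input(audio):
    """Current approach: mu-law compand, quantize, one-hot encode."""
    # audio: 1-D float array in [-1, 1]
    mu = QUANT_LEVELS - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    bins = ((companded + 1) / 2 * mu + 0.5).astype(np.int32)
    return np.eye(QUANT_LEVELS, dtype=np.float32)[bins]   # shape [time, 256]

def scalar_input(audio):
    """Proposed alternative: the raw waveform as a single float channel."""
    return audio.astype(np.float32)[:, np.newaxis]        # shape [time, 1]
```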

@tomlepaine
Contributor

@ibab the demo in fast-wavenet trains on a single sample using a scalar encoding. Maybe we could test the two methods using that simple setup?

Train on a single sample using:

  • one-hot encoding
  • scalar encoding

And see if there is any notable difference in the number of steps required to reach a given accuracy.

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

I think we should ultimately switch. I think @ghenter's comments over on the "generating good audio samples" thread were lucid and compelling on the topic. I was planning on experimenting more with this. You and I, having run my branch and your pull request, are the only ones I'm aware of who have tried it. I don't think either of us saw any advantage as it was implemented.

I have a suspicion that may be because we only used a filter width of 2 for the initial causal convolution. 32 channels, or whatever residual_channels gets set to, is overkill for the number of second-order FIR filters you need to form a basis for an arbitrary audio waveform. For comparison, 32 would be a typical number of channels for the initial 5x5 or 3x3 convolution stage in your garden-variety image convnet (or convolutional image autoencoder), and we find that those filters learn all manner of little oriented edge detectors across 3 color channels. With audio, by contrast, we have only 1 conv dimension (time, instead of 2 spatial) and 1 "color" (AKA amplitude).

So I think that to justify all those channels, we need to make the initial convolution filter width wider than 2. Perhaps 3, 4 or even 5. So add an "initial_conv_channels" to the wavenet_params.json. Then that initial filter could learn interesting little wavelets to form a basis for our audio. Perhaps we should also add a separately configurable number of channels for the initial convolution, so that the number of initial filters we have doesn't have to be exactly the same as residual_channels.
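
A rough sketch of what such a first layer could look like, in TF 1.x style (`initial_filter_width` and `initial_channels` are hypothetical names for the proposed parameters, not anything in the repo yet):

```python
import tensorflow as tf

def initial_causal_conv(scalar_audio, initial_filter_width=5, initial_channels=32):
    # scalar_audio: float32 tensor of shape [batch, time, 1]
    filt = tf.Variable(
        tf.truncated_normal([initial_filter_width, 1, initial_channels], stddev=0.05),
        name='initial_filter')
    # Left-pad by (width - 1) so the convolution stays causal: the output at
    # time t only sees samples at times <= t.
    padded = tf.pad(scalar_audio, [[0, 0], [initial_filter_width - 1, 0], [0, 0]])
    return tf.nn.conv1d(padded, filt, stride=1, padding='VALID')
```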

To those who say, "No! We have too many hyperparameters!", I say:

  • As simple as possible, but no simpler.
  • Hyperparameters are what make nets fun for humans. The weights and biases are the knobs that the machine gets to play with. Hyperparameters are the knobs you and I get to play with. If you don't think they're fun then maybe you're in the wrong business. Alternatively, you could just let us find reasonable defaults for them, and never touch them if you don't want to.
  • We're nowhere near an unmanageable number of hyperparameters. We only have a handful. Two more isn't going to kill us. It's going to be fun!

If it's something you'd be willing to merge into master, then we could always make scalar input a configurable option and default it to one-hot input.
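
Concretely, the option could look something like this in wavenet_params.json (the new key names are just suggestions, not the current schema):

```json
{
    "filter_width": 2,
    "residual_channels": 32,
    "scalar_input": false,
    "initial_filter_width": 32,
    "initial_channels": 32
}
```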

@ibab
Owner Author

ibab commented Sep 25, 2016

@jyegerlehner: I'm thinking of switching to the scalar input completely, possibly without even leaving the one-hot input as an option. I like your thoughts on how the scalar input could be made more flexible. initial_conv_channels != residual_channels is also something I've been thinking about.

@jyegerlehner
Contributor

@ibab: OK, but however certain we may be that it's good for theoretical reasons, I think we'll want to verify that we can still get models that produce audio samples as good as the current ones before we push it to master.

Note: I said "channels" above once when I meant to say "filter width". I think we would want the initial conv to have two params: separately-configurable filter width and number of channels.

@ibab
Owner Author

ibab commented Sep 25, 2016

@jyegerlehner: I've tried using the scalar input approach again just now, and it has problems converging in your unit tests (with the 3 sine waves).
Even if I get the loss to drop, it takes more than 10x as many steps as with the one-hot approach.

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

@ibab: Right, that's what we saw last time. Bumping the iterations up to 400 and the learning rate down to 0.005 allowed the test to pass back when we tried it with a filter width of 2. What filter width are you using, the same as the dilated filter widths?

I'd like to compare performance on the basis of real data. And like I said above, run with a large filter width; maybe go bigger, like 64. The optimizer can always push filter coefficients to zero if it wants. I'm not asking you to do all this work, just describing the things I thought I would try when I get around to it.

ibab added the strategy label on Sep 25, 2016
@ibab
Owner Author

ibab commented Sep 25, 2016

I tried increasing the filter width to 32, but it didn't have a large effect. :/

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

I think the criterion for whether there's a benefit from scalar input vs. one-hot vector input is how it performs on real data. It could be that the sine waves are such a toy problem that all the net has to do is memorize which value comes next given the history, which is different from real audio. Seen as a pure regression problem, it has enough degrees of freedom to overfit, and it does so.

Or maybe one-hot quantized inputs are just as good a way to encode the input as scalar, and that's what DeepMind should have done, but didn't, because they have the same signal-processing prejudices I have. I don't know. I was describing how I'd try to figure it out. I haven't been super keen to switch over since I got fairly good results with the one-hot encoding we have now. I'm planning to look into it in the future.

So, let me try to say something useful. If you're asking for suggestions, perhaps you can do the comparison of the two approaches on real data. Alternatively, you could wait until I or someone else does the comparison, if you've got other things to worry about.

Speaking of which, I've got to go attend to some other things I've been putting off. I'll come back eventually.

@ibab
Owner Author

ibab commented Sep 25, 2016

@jyegerlehner: Don't worry, the project isn't going to run away :)
I'll give the comparison a try if I get around to it this week.

@nakosung
Contributor

What about the frequency domain? A spectrogram can be read as an image (not a one-hot encoded vector), so a convnet could pick up some useful information easily.
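
For instance, a magnitude spectrogram could serve as the image-like input. A minimal sketch with scipy (the window parameters are arbitrary):

```python
import numpy as np
from scipy import signal

def magnitude_spectrogram(audio, sample_rate=16000):
    # audio: 1-D float array in [-1, 1]
    _, _, stft = signal.stft(audio, fs=sample_rate, nperseg=256)
    # Taking the magnitude discards the phase of each frequency bin.
    return np.abs(stft)   # shape [freq_bins, time_frames], image-like
```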

@csiki

csiki commented Sep 30, 2018

@nakosung You mean feeding a spectrogram to a convnet? It doesn't work well: you lose phase information, and that happens to be important for sound discrimination. As an analogy, the WaveGAN paper reached a supporting conclusion.

@nakosung
Contributor

nakosung commented Jun 8, 2019

Please refer to MelNet
