
Feeding raw audio waveform into the first layer #83

Open
ibab opened this issue Sep 24, 2016 · 12 comments

@ibab
Owner

ibab commented Sep 24, 2016

We've discussed the fact that one-hot encoding the input to the network is kind of weird, and that it would be more natural to use the waveform as a single-channel floating point tensor instead.
Does anyone have experience with running our implementation in this way?
Should we switch to this method?
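
For concreteness, here is a rough sketch of the two representations (the mu-law companding to 256 levels follows the WaveNet paper; the function names are just illustrative, not the repo's API):

```python
import numpy as np

QUANT_LEVELS = 256  # mu-law quantization channels, as in the WaveNet paper

def one_hot_input(audio):
    """Current approach: mu-law compand, quantize, one-hot encode."""
    # audio: 1-D float array in [-1, 1]
    mu = QUANT_LEVELS - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    bins = ((companded + 1) / 2 * mu + 0.5).astype(np.int32)
    return np.eye(QUANT_LEVELS, dtype=np.float32)[bins]   # shape [time, 256]

def scalar_input(audio):
    """Proposed alternative: the raw waveform as a single float channel."""
    return audio.astype(np.float32)[:, np.newaxis]        # shape [time, 1]
```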

@tomlepaine
Contributor

@ibab the demo in fast-wavenet trains on a single sample using a scalar encoding. Maybe we could test the two methods using that simple setup?

Train on a single sample using:

  • one-hot encoding
  • scalar encoding

And see if there is any notable difference in the number of steps required to reach a given accuracy.

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

I think we should ultimately switch. I think @ghenter's comments over on the "generating good audio samples" thread were lucid and compelling on the topic. I was planning on experimenting more with this. You and I, having run my branch and your pull request, are the only ones I'm aware of who have tried it. I don't think either of us saw any advantage as it was implemented.

I have a suspicion that may be because we only used a filter width of 2 for the initial causal convolution. 32 channels, or whatever residual_channels gets set to, is overkill for the number of second-order FIR filters you need to form a basis for an arbitrary audio waveform. For comparison, 32 would be a typical number of channels for the initial 5x5 or 3x3 convolution stage in your garden-variety image convnet (or convolutional image autoencoder), and we find that those filters learn all manner of little oriented edge detectors across 3 color channels. With audio, by contrast, we have only 1 conv dimension (time, instead of 2 spatial) and 1 "color" (AKA amplitude).

So I think that to justify all those channels, we need to make the initial convolution filter width wider than 2. Perhaps 3, 4 or even 5. So add an "initial_conv_channels" to the wavenet_params.json. Then that initial filter could learn interesting little wavelets to form a basis for our audio. Perhaps we should also add a separately configurable number of channels for the initial convolution, so that the number of initial filters we have doesn't have to be exactly the same as residual_channels.
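
A rough sketch of what such a first layer could look like, in TF 1.x style (`initial_filter_width` and `initial_channels` are hypothetical names for the proposed parameters, not anything in the repo yet):

```python
import tensorflow as tf

def initial_causal_conv(scalar_audio, initial_filter_width=5, initial_channels=32):
    # scalar_audio: float32 tensor of shape [batch, time, 1]
    filt = tf.Variable(
        tf.truncated_normal([initial_filter_width, 1, initial_channels], stddev=0.05),
        name='initial_filter')
    # Left-pad by (width - 1) so the convolution stays causal: the output at
    # time t only sees samples at times <= t.
    padded = tf.pad(scalar_audio, [[0, 0], [initial_filter_width - 1, 0], [0, 0]])
    return tf.nn.conv1d(padded, filt, stride=1, padding='VALID')
```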

To those who say, "No! We have too many hyperparameters!", I say:

  • As simple as possible, but no simpler.
  • Hyperparameters are what make nets fun for humans. The weights and biases are the knobs that the machine gets to play with. Hyperparameters are the knobs you and I get to play with. If you don't think they're fun then maybe you're in the wrong business. Alternatively, you could just let us find reasonable defaults for them, and never touch them if you don't want to.
  • We're nowhere near an unmanageable number of hyperparameters. We only have a handful. Two more isn't going to kill us. It's going to be fun!

If it's something you'd be willing to merge into master, then we could always make scalar input a configurable option and default it to one-hot input.
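
Concretely, the option could look something like this in wavenet_params.json (the new key names are just suggestions, not the current schema):

```json
{
    "filter_width": 2,
    "residual_channels": 32,
    "scalar_input": false,
    "initial_filter_width": 32,
    "initial_channels": 32
}
```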

@ibab
Owner Author

ibab commented Sep 25, 2016

@jyegerlehner: I'm thinking of switching to the scalar input completely, possibly without even leaving the one-hot input as an option. I like your thoughts on how the scalar input could be made more flexible. initial_conv_channels != residual_channels is also something I've been thinking about.

@jyegerlehner
Contributor

@ibab: OK, but however certain we may be that it's good for theoretical reasons, I think we'll want to verify that we can still get models that produce audio samples as good as the current ones before we push it to master.

Note: I said "channels" above once when I meant to say "filter width". I think we would want the initial conv to have two params: separately-configurable filter width and number of channels.

@ibab
Owner Author

ibab commented Sep 25, 2016

@jyegerlehner: I've tried using the scalar input approach again just now, and it has problems converging in your unit tests (with the 3 sine waves).
Even if I get the loss to drop, it takes more than 10x as many steps as with the one-hot approach.

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

@ibab: Right, that's what we saw last time. Bumping the iterations up to 400 and the learning rate down to 0.005 allowed the test to pass back when we tried it with a filter width of 2. What filter width are you using, the same as the dilated filter widths?

I'd like to compare performance on the basis of real data. And like I said above, run with a large filter width; maybe go bigger, like 64. The optimizer can always push filter coefficients to zero if it wants. I'm not asking you to do all this work, just describing the things I thought I would try when I get around to it.

ibab added the strategy label on Sep 25, 2016
@ibab
Owner Author

ibab commented Sep 25, 2016

I tried increasing the filter width to 32, but it didn't have a large effect. :/

@jyegerlehner
Contributor

jyegerlehner commented Sep 25, 2016

I think the criterion for whether there's a benefit from scalar input vs. one-hot vector input is how it performs on real data. It could be that the sine waves are such a toy problem that all the net has to do is memorize which value comes next given the history, which is different from real audio. Seen as a pure regression problem, it has enough degrees of freedom to overfit, and it does so.

Or maybe one-hot quantized inputs are just as good a way to encode the input as scalar, and that's what DeepMind should have done, but didn't, because they have the same signal-processing prejudices I have. I don't know. I was describing how I'd try to figure it out. I haven't been super keen to switch over since I got fairly good results with the one-hot encoding we have now. I'm planning to look into it in the future.

So, let me try to say something useful. If you're asking for suggestions, perhaps you can do the comparison of the two approaches on real data. Alternatively, you could wait until I or someone else does the comparison, if you've got other things to worry about.

Speaking of which, I've got to go attend to some other things I've been putting off. I'll come back eventually.

@ibab
Owner Author

ibab commented Sep 25, 2016

@jyegerlehner: Don't worry, the project isn't going to run away :)
I'll give the comparison a try if I get around to it this week.

@nakosung
Contributor

What about the frequency domain? A spectrogram can be read as an image (not a one-hot encoded vector), so a convnet could pick up some useful information easily.
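
For instance, a magnitude spectrogram could serve as the image-like input. A minimal sketch with scipy (the window parameters are arbitrary):

```python
import numpy as np
from scipy import signal

def magnitude_spectrogram(audio, sample_rate=16000):
    # audio: 1-D float array in [-1, 1]
    _, _, stft = signal.stft(audio, fs=sample_rate, nperseg=256)
    # Taking the magnitude discards the phase of each frequency bin.
    return np.abs(stft)   # shape [freq_bins, time_frames], image-like
```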

@csiki

csiki commented Sep 30, 2018

@nakosung You mean feeding a spectrogram to a convnet? It doesn't work well: you lose phase information, and that happens to be important for sound discrimination. As an analogy, the WaveGAN paper reached a supporting conclusion.

@nakosung
Contributor

nakosung commented Jun 8, 2019

Please refer to MelNet
