Feeding raw audio waveform into the first layer #83
We've discussed the fact that one-hot encoding the input to the network is kind of weird, and that it would be more natural to use the waveform as a single-channel floating-point tensor instead.
Does anyone have experience with running our implementation in this way?
Should we switch to this method?
@ibab: Train on a single sample, as in the demo, and see whether there is any notable difference in the number of steps required to reach a given accuracy.
I think we should ultimately switch. I think @ghenter's comments over on the "generating good audio samples" thread were lucid and compelling on the topic. I was planning on experimenting more with this.

As far as I know, you and I are the only ones who have run it (my branch and your pull request), and I don't think either of us saw any advantage as it was implemented. I have a suspicion that may be because we only used a filter width of 2 for the initial causal convolution. 32 channels, or whatever residual_channels gets set to, is overkill for the number of two-tap FIR filters you need to form a basis for an arbitrary audio waveform. For comparison, 32 would be a typical number of channels for the initial 5x5 or 3x3 convolution stage in your garden-variety image convnet (or convolutional image autoencoder), and we find that those filters learn all manner of little oriented edge detectors across 3 color channels. By contrast, with audio we have only 1 conv dimension (time, instead of 2 spatial) and 1 "color" (AKA amplitude). So I think to justify all those channels, we need to make the initial convolution filter width wider than 2. Perhaps 3, 4 or even 5. So add an "initial_conv_channels" to the wavenet_params.json. Then that initial filter could learn interesting little wavelets to form a basis for our audio; a sketch follows below.

Perhaps we should also add a separately configurable number of channels for the initial convolution, so that the number of initial filters we have doesn't have to be exactly the same as the number of residual_channels. To those who say "No! We have too many hyperparameters!" To them I say:
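A minimal sketch of what that wider initial causal convolution could look like, in the TF1 style the repo uses. The parameter names initial_filter_width and initial_conv_channels are hypothetical wavenet_params.json keys, not existing ones:

```python
import tensorflow as tf

def initial_causal_conv(waveform, initial_filter_width=5,
                        initial_conv_channels=32):
    """Causal convolution over a scalar waveform.

    waveform: float32 tensor of shape [batch, time, 1] (raw amplitudes).
    Both keyword arguments stand in for hypothetical wavenet_params.json
    entries; they are not part of the current config.
    """
    filters = tf.Variable(
        tf.truncated_normal(
            [initial_filter_width, 1, initial_conv_channels], stddev=0.05),
        name='initial_filter')
    # Left-pad by (width - 1) samples so no output depends on the future.
    padded = tf.pad(waveform, [[0, 0], [initial_filter_width - 1, 0], [0, 0]])
    return tf.nn.conv1d(padded, filters, stride=1, padding='VALID')
```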
If it's something you'd be willing to merge into master, then we could always make scalar-input a configurable option, and default it to a one-hot input.
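A sketch of how such a switch might look, assuming a hypothetical scalar_input flag; mu_law_encode and quantization_channels follow the repo's existing naming in ops.py:

```python
import tensorflow as tf

from wavenet.ops import mu_law_encode  # the repo's existing mu-law op

def encode_input(input_batch, batch_size, scalar_input,
                 quantization_channels):
    """Select between scalar and one-hot input encodings.

    input_batch: float waveform in [-1, 1], shape [batch, time].
    scalar_input: hypothetical config flag, not a current repo option.
    """
    quantized = mu_law_encode(input_batch, quantization_channels)
    if scalar_input:
        # One float channel in [-1, 1] instead of a one-hot vector.
        scaled = tf.cast(quantized, tf.float32) / quantization_channels
        return tf.reshape(2.0 * scaled - 1.0, [batch_size, -1, 1])
    # Current behaviour: one-hot encode the quantized amplitudes.
    return tf.one_hot(quantized, depth=quantization_channels)
```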
@jyegerlehner: I'm thinking of switching to the scalar input completely, possibly without even leaving the one-hot input as an option. I like your thoughts on how the scalar input could be made more flexible.
@ibab: OK, but however certain we may be that it's good for theoretical reasons, I think we'll want to verify that we can get models that produce audio samples as good as the current ones before we push it to master. Note: I said "channels" above once when I meant to say "filter width". I think we would want the initial conv to have two params: a separately configurable filter width and number of channels.
@jyegerlehner: I've tried using the scalar input approach again just now, and it has problems converging in your unit tests (with the 3 sine waves).
@ibab: Right, that's what we saw last time. Bumping iterations up to 400 and the learning rate down to 0.005 allowed the test to pass back when we tried it with filter width 2. What filter width are you using, the same as the dilated filter widths? I'd like to compare performance on the basis of real data. And like I said above, run with a large filter width; maybe go bigger, like 64. The optimizer can always push filter coefficients to zero if it wants. I'm not asking you to do all this work, just describing the things I thought I would try when I get around to it.
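For reference, a sketch of the kind of three-sine-wave test signal the unit test trains on, plus the convergence tweak described above. The frequencies and sample rate here are illustrative, not necessarily the repo's actual test values:

```python
import numpy as np

SAMPLE_RATE = 16000  # illustrative; the test's actual rate may differ

def make_three_sine_signal(seconds=1.0):
    """Toy training signal: a normalized sum of three sine waves."""
    t = np.arange(int(SAMPLE_RATE * seconds)) / SAMPLE_RATE
    signal = (np.sin(2 * np.pi * 155.0 * t) +
              np.sin(2 * np.pi * 495.0 * t) +
              np.sin(2 * np.pi * 999.0 * t)) / 3.0
    return signal.astype(np.float32)

# The tweak that let the scalar-input test pass: more steps, smaller steps.
TRAIN_ITERATIONS = 400
LEARNING_RATE = 0.005
```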
I tried increasing the filter width to 32, but it didn't have a large effect. :/ |
I think the criterion for whether there's a benefit from scalar input vs. one-hot vector input is how it performs on real data. It could be that the sine waves are such a toy problem that all the net has to do is memorize which value comes next given the history, which is different from real audio. Seen as a pure regression problem, it has enough degrees of freedom to overfit, and does so. Or maybe one-hot quantized inputs are just as good a way to encode the input as scalar, and that's what DeepMind should have done but didn't, because they have the same signal-processing prejudices I have. I don't know; I was describing how I'd try to figure it out.

I haven't been super keen to switch over, since I got fairly good results with the one-hot encoding we have now, but I'm planning to look into it in the future. So, let me try to say something useful: if you're asking for suggestions, perhaps you can do the comparison of the two approaches on real data. Alternatively, you could wait until I or someone else does the comparison, if you've got other things to worry about. Speaking of which, I've got to go attend to some other things I've been putting off. I'll come back eventually.
@jyegerlehner: Don't worry, the project isn't going to run away :)
What about the frequency domain? A spectrogram can be read as an image (not a one-hot encoded vector), so a convnet could pick up some useful information easily.
@nakosung You mean feeding a spectrogram to a convnet? It doesn't work well: you lose phase information, and that happens to be important for sound discrimination. As an analogy, the WaveGAN paper reached a supporting conclusion.
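A small illustration of the phase-loss point, using nothing beyond NumPy: a waveform and a time-shifted copy of it have identical magnitude spectra, so a magnitude spectrogram cannot tell them apart.

```python
import numpy as np

# A tone and a circularly time-shifted copy of it.
t = np.arange(1024) / 16000.0
a = np.sin(2 * np.pi * 500.0 * t)
b = np.roll(a, 100)  # a shift in time is a linear phase change in frequency

# Magnitude spectra are identical; the shift lives entirely in the phase.
print(np.allclose(np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))))    # True
print(np.allclose(np.angle(np.fft.rfft(a)), np.angle(np.fft.rfft(b))))  # False
```

Recovering a waveform from magnitudes alone requires phase reconstruction (e.g. Griffin-Lim), which is lossy.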
Please refer to MelNet |