Change padding to VALID in causal convolutions. #191

Merged
merged 11 commits into ibab:master on Jan 30, 2017

Conversation

@koz4k
Contributor

koz4k commented Nov 27, 2016

This fixes #186 and #98. The network learns fine; here is a sample from a model trained on speaker p280 from VCTK for 34K steps, with sample_size 10K and the rest of the parameters at their defaults: https://soundcloud.com/piotr-kozakowski-189475386/wavenet-vctk-p280.

I think the main improvement is that the calculated loss is now accurate regardless of sample_size, so we can train on smaller batches, which lowers the memory requirements and allows larger receptive fields within the same amount of memory. Also, smaller batches mean less time spent on each step, so more steps can be done in the same amount of time, which might speed up convergence.

@@ -103,7 +107,8 @@ def check_waveform(assertion, generated_waveform):
     # Expect most of the power to be at the 3 frequencies we trained
     # on.
-    assertion(expected_power, 0.9 * power_sum)
+    assertion(expected_power, 0.75 * power_sum)


@koz4k

koz4k Nov 27, 2016

Contributor

I lowered this threshold from 0.9 to 0.75 because I couldn't get the new model to pass this test. I know this seems bad, but even the training data wasn't passing it. To verify this, add the following line:
check_waveform(self.assertGreater, audio)
after line 132 of test_model.py on the current master; the test will fail. It's strange that the original network passes this test.


@jyegerlehner

jyegerlehner Nov 30, 2016

Contributor

I think the failure to pass the test may be a symptom of a deeper problem. Please see my comment on the receptive field size.


@koz4k

koz4k Nov 30, 2016

Contributor

As I commented there, I think my formula for the receptive field is correct. And even if it weren't, I still think this test is wrong: it should check whether the model can overfit the training data, so if the training data itself doesn't pass the test, how can we expect the output of a network that should exactly reproduce the training data to pass it?


@jyegerlehner

jyegerlehner Nov 30, 2016

Contributor

OK, I finally get your point now about the "training data not passing the test." I'll have to think about that. I remember that when I first wrote it, I looked at the power spectrum of the training data and there were three very sharp peaks at the three frequencies; it's hard to see how it couldn't pass. Maybe bleeding into adjacent bins increased when we decreased the sample rate of the test to 2 kHz (to make it run faster). Not sure.

@fehiepsi


Contributor

fehiepsi commented Nov 28, 2016

By smaller batches, I guess you mean a smaller sample size? My review of the main changes:

  • First, in my opinion, using VALID or SAME will give the same result as long as we compute the corresponding loss correctly (for the latter, we have to remove the first receptive_field samples from the output).
  • In terms of feeding overlapping samples (the overlap size is the receptive field), it is a very good idea to capture all the information from the raw audio files, as you did!
  • I wonder about this: is a smaller sample size better (in terms of stability of the gradient descent updates)?
  • In addition, I am with you on the way we feed data into the queue (not using the buffer!), not only for intuition (we should not concatenate files), but also to make feeding the global condition into the network easier.
  • In case you agree with the above comments, is the following solution simpler: still use SAME padding but remove the first receptive_field samples when calculating the loss, remove the buffer, and feed overlapping samples (a rough sketch follows below). This way we only have to change a few lines of code.
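
For concreteness, here is a minimal sketch of that alternative (the function name, tensor shapes, and the quantization_channels argument are illustrative assumptions, not code from this repository): keep the SAME convolutions and simply drop the first receptive_field output samples before computing the loss.

import tensorflow as tf

def loss_without_padded_prefix(raw_output, encoded_target,
                               receptive_field, quantization_channels):
    # With padding=SAME the first `receptive_field` outputs are computed
    # from zero-padded context, so exclude them (and the matching targets)
    # from the cross-entropy instead of changing the convolution type.
    valid_output = tf.reshape(raw_output[:, receptive_field:, :],
                              [-1, quantization_channels])
    valid_target = tf.reshape(encoded_target[:, receptive_field:, :],
                              [-1, quantization_channels])
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
        logits=valid_output, labels=valid_target)
    return tf.reduce_mean(cross_entropy)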
@koz4k


Contributor

koz4k commented Nov 28, 2016

@fehiepsi, thank you for the review. I address your comments below:

  • You're right, SAME padding with slicing off the receptive field is functionally equivalent, but I think it wastes some time and memory to compute and store activations we won't use in the final output (I don't know whether TensorFlow optimizes that away). Also, I think the code is clearer with VALID convolutions, because it's immediately obvious what's going on, and dropping the unused output seems hacky, but that's only my personal opinion (see the sketch after this list).
  • As for stability, you're right that a smaller sample size makes the gradient descent less stable, but it might lead to faster convergence anyway. Compare it to batch gradient descent, where the gradient is computed on the whole dataset at each step: it's very stable, but also very slow. Averaging the gradient over smaller batches approximates the global gradient and makes convergence much faster. Of course that assumes the samples in each minibatch are drawn randomly from the dataset, which is not the case here, because we take consecutive samples, but that's another matter and I don't see how it could be done efficiently. The point is, I don't claim that smaller batches lead to faster convergence in general, but now we can check whether they do.
  • As for removing the buffer, my intuition is that concatenating files works well with VCTK, but may not work well with other datasets, for example music datasets where each audio file corresponds to a different song.
  • As I said above, I don't think that padding = SAME is a better solution, but we can discuss it further if you want :)
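
As a rough illustration of the VALID variant (this is not the code in this PR; tf.nn.convolution with a dilation_rate argument is assumed to be available, and the names are made up), a dilated convolution with no padding simply produces (filter_width - 1) * dilation fewer samples than its input:

import tensorflow as tf

def dilated_conv_valid(value, filter_, dilation):
    # value:   [batch, time, in_channels]
    # filter_: [filter_width, in_channels, out_channels]
    # padding='VALID' adds no padding, so the output is shorter than the
    # input by (filter_width - 1) * dilation samples; causality comes from
    # how the targets are aligned, so that every prediction only ever sees
    # samples that precede it.
    return tf.nn.convolution(value, filter_, padding='VALID',
                             dilation_rate=[dilation])

The missing samples are exactly what the audio reader compensates for by feeding pieces that overlap by receptive_field samples.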
@fehiepsi


Contributor

fehiepsi commented Nov 29, 2016

@koz4k: I agree with your approach! My only concern is the performance. ^^ I need to test it, but right now I don't have much time. I will this weekend.

            receptive_field += self.initial_filter_width - 1
        else:
            receptive_field += self.filter_width - 1
        return receptive_field


@jyegerlehner

jyegerlehner Nov 30, 2016

Contributor

I don't think this is correct. I find that when:

dilations=[1,2,
           1,2]
scalar_input=false
filter_width=2

it yields receptive field = 8, whereas I think the correct value is 7:
[image: receptive_field]

I think this was a more correct implementation:
https://github.com/ibab/tensorflow-wavenet/pull/118/files#diff-1c4931de87cb94aa46c82f06ad35df45R53
though I think it preceded scalar inputs.


@koz4k

koz4k Nov 30, 2016

Contributor

I think you're forgetting about the initial causal convolution (line 313 in model.py):
current_layer = self._create_causal_layer(current_layer)
it has dilation 1 and width equal to filter_width when scalar_input=False, so in your example it contributes 1 to the receptive field.
Also, there is a runtime check here (line 525 in model.py):
loss = tf.nn.softmax_cross_entropy_with_logits(prediction, target_output)
this line will fail with a dimension mismatch if the receptive field is calculated incorrectly.
Your implementation might also be correct, but it assumes a specific form of the dilation stack (1, 2, 4, ..., 2^n, 1, 2, 4, ..., 2^n, ...) and mine doesn't.
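
For reference, the full formula that the snippet above is part of can be sketched as follows (reconstructed from the diff and this discussion, so treat it as illustrative rather than a verbatim copy of model.py), together with the worked numbers for the example above:

def calculate_receptive_field(filter_width, dilations, scalar_input,
                              initial_filter_width):
    # Each dilated layer widens the receptive field by
    # (filter_width - 1) * dilation samples.
    receptive_field = (filter_width - 1) * sum(dilations) + 1
    # The initial causal convolution in front of the dilation stack adds
    # its own (width - 1) samples of context.
    if scalar_input:
        receptive_field += initial_filter_width - 1
    else:
        receptive_field += filter_width - 1
    return receptive_field

# dilations=[1, 2, 1, 2], filter_width=2, scalar_input=False:
# (2 - 1) * (1 + 2 + 1 + 2) + 1 = 7 from the dilation stack,
# plus (2 - 1) = 1 from the initial causal layer, giving 8.
print(calculate_receptive_field(2, [1, 2, 1, 2], False, 32))  # prints 8
# (initial_filter_width is unused here because scalar_input is False)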


@jyegerlehner

jyegerlehner Nov 30, 2016

Contributor

Good points. Thanks for the explanation.

@jyegerlehner


Contributor

jyegerlehner commented Nov 30, 2016

> You're right, SAME padding with slicing off the receptive field is functionally equivalent

Yeah, I think I was labouring under a misconception in my initial enthusiasm.

And if we follow fehiepsi's suggestion to leave the convolution SAME, I think we end up with what we had in PR #118. In that case the point ibab made in that thread applies: we never train the net to begin generating sound from silence (unless we turn off silence-trimming, or prepend every wave file's audio with receptive_field samples of silence).

@koz4k


Contributor

koz4k commented Nov 30, 2016

Prepending every audio file with receptive_field samples of silence is exactly what I do in this PR (line 106 in audio_reader.py):
audio = np.pad(audio, [[self.receptive_field, 0], [0, 0]], 'constant')
and I think it makes perfect sense, because what is there before someone says something or before a piece of music is played? Silence :). With this modification, generating sound from silence works fine (for a sample, see the description of this PR).
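
Roughly, the reader logic being described here (pad with silence, then cut overlapping pieces) looks like the following sketch; the function name and the generator framing are mine, not audio_reader.py:

import numpy as np

def cut_overlapping_pieces(audio, receptive_field, sample_size):
    # audio: float array of shape [time, 1].
    # Prepend receptive_field samples of silence so the network also learns
    # to start generating from silence, then cut pieces that overlap by
    # receptive_field samples so every target keeps its full context.
    audio = np.pad(audio, [[receptive_field, 0], [0, 0]], 'constant')
    while len(audio) > receptive_field:
        yield audio[:receptive_field + sample_size, :]
        audio = audio[sample_size:, :]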

I got the impression that you didn't merge #118 because it complicated the code without obvious improvements. I see one big improvement from merging either this PR or #118, which I already mentioned before: after this change we'll be able to train with lower sample_size values and the computed loss will still be accurate. It might not make such a big difference if you have a lot of RAM on your graphics card, but there are people who don't have access to such advanced hardware (like me) and would like to have a shot at experimenting with WaveNets even though they have only 2 GB of GPU memory.

And even if it wasn't the case, I would still think that training a model with an incorrect loss function, when correcting it is so easy, is not good.

@nakosung


Contributor

nakosung commented Nov 30, 2016

> And even if it wasn't the case, I would still think that training a model with an incorrect loss function, when correcting it is so easy, is not good.

@koz4k 👍

@nakosung


Contributor

nakosung commented Nov 30, 2016

I modified target_output to support batch_size>1. nakosung@b0c0f1e

@koz4k


Contributor

koz4k commented Nov 30, 2016

Thanks, I missed that :)

@akademi4eg

@koz4k @ibab @lemonzi @jyegerlehner
I've resolved the conflicts (in a local branch) with the recent merges. Is there any way to update this PR from the command line? So far I've only found instructions on merging...
Now I'm training a model with these changes and global conditioning to make sure the results are still good.
Are there any arguments against merging this PR?

@akademi4eg


Collaborator

akademi4eg commented Jan 26, 2017

I think I've found a way: I pushed the changes to a new branch and created a PR against a branch in @koz4k's repository. In theory, merging that PR would update this one.

Merge pull request #1 from ibab/koz4k-padding-valid
Resolving conflicts for PR 191 in ibab/tensorflow-wavenet repo
@koz4k


Contributor

koz4k commented Jan 26, 2017

@akademi4eg I merged your PR into this PR, but now the build fails :/

@akademi4eg


Collaborator

akademi4eg commented Jan 26, 2017

@koz4k It failed for Python 3, yet passed for Python 2.7. I think there might be a problem in the test. Moreover, I'm pretty sure that simply rerunning might make all the tests pass.

@koz4k koz4k closed this Jan 26, 2017

@koz4k koz4k reopened this Jan 26, 2017

@jyegerlehner


Contributor

jyegerlehner commented Jan 26, 2017

Yes, I think I've seen this test fail sporadically. We set the random seed and init the variables, so one would expect it to do the same thing every time. Perhaps there's some state in the bowels of TF that doesn't get initialized by 'tf.initialize_all_variables()'. Maybe throw a 'tf.reset_default_graph()' at the beginning of 'setUp' for each test?
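
Something along these lines, perhaps (a guess at the shape of the fix rather than a tested patch against test_model.py; the class name and seed are placeholders):

import tensorflow as tf
import unittest

class TestNet(unittest.TestCase):
    def setUp(self):
        # Discard any graph (and hidden state) left over from a previous
        # test case before this test builds a fresh model.
        tf.reset_default_graph()
        tf.set_random_seed(42)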

I have a vague intention to come back and look at this again sometime, but it's not near the top of my list ATM. Sorry. We could just disable the test as an expedient in the meantime until someone gets around to it.

One other thing: at one point @koz4k pointed out (somewhere; I can't find the thread at the moment) that in one of these tests the training data fails the test assertion. Thus it's a flawed test, because you can't expect the trained model to do any better than the training data. I think the reason that assertion would fail is that the training data has a period of leading silence before it starts in on the sine wave, and the transition from silence to sine wave creates spectral power over a continuum of frequencies, as it's not periodic. The test gets around this by removing the initial silence from the generated audio (or tries to) before the assertions. I'm no signal processing expert, but that was what I was thinking when I wrote it. That could also be related to the intermittent failures, if anyone wants to look at it.

@koz4k koz4k closed this Jan 26, 2017

@koz4k koz4k reopened this Jan 26, 2017

@koz4k koz4k closed this Jan 27, 2017

@koz4k koz4k reopened this Jan 27, 2017

@akademi4eg


Collaborator

akademi4eg commented Jan 27, 2017

@koz4k @jyegerlehner I think I've managed to make the tests pass. I had to update the testing environment to TensorFlow 0.12.1. Not sure why the older version failed while the newer one worked OK.

Merge pull request #2 from ibab/koz4k-padding-valid
Tests fail in python 3.5. Possible fix.
@koz4k


Contributor

koz4k commented Jan 27, 2017

@akademi4eg it looks like your fix worked :)

@akademi4eg akademi4eg merged commit 1557fa5 into ibab:master Jan 30, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed