add regularization, dropout and batch norm? #65

Closed
r-zemblys opened this issue Sep 21, 2016 · 14 comments

@r-zemblys

Has anybody got a loss lower than ~2? I've tried a couple of configurations (the default, plus 3 and 4 stacks of 10 dilation layers), but the loss does not get any lower, suggesting the network is not learning anymore.

Also, here is what happened after ~30k steps:
[image: training]
I believe this is the same problem as reported in #30. Here is what happens with the weights:
[image: weights]

Now running the same network with l2 norm regularization added.
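
For reference, a minimal sketch of what adding an L2 penalty to the loss could look like in TensorFlow (an illustration with assumed names, not the exact change used here; the loss argument stands for the model's existing cross-entropy loss):

import tensorflow as tf

def add_l2_regularization(loss, coeff=1e-4):
    # Sum the L2 norms of all trainable weights (biases excluded)
    # and add them to the existing loss, scaled by a small coefficient.
    l2 = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()
                   if 'bias' not in v.name.lower()])
    return loss + coeff * l2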

And one more note: training just stops after 44256 steps (this has already happened twice) without any warnings or errors, despite num_steps=50000.

@dnuffer

dnuffer commented Sep 21, 2016

I've observed the same things. I looked at the code to see what might be hanging and didn't find any red flags. I thought the hang might be related to my setup: CUDA 8.0rc (required for Pascal support), cuDNN 5.1, and tensorflow built from source (git master from 9/20)

@ibab
Owner

ibab commented Sep 21, 2016

The hanging is probably caused by the background audio processing crashing (especially if the CPU/GPU are idle once it stops).
Usually there should be a backtrace that can help us find the reason it crashed.
Which commit did you observe the problem with?
There was a bug where we simply stopped processing audio once we'd seen every file once.
It might be that you're on an older commit that had this problem.

I've been trying to find a solution to the gradient jumping to large values at large step numbers, but don't have any amazing solutions at the moment.
It seems to be related to the ReLU activations in the last few layers of the network.
I've tried clipping the gradients, which didn't have an effect on this problem.
Replacing the ReLU activations with Tanh seems to fix it completely, but the network doesn't converge quite as quickly as with ReLU.
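
For anyone trying to reproduce these experiments, here is a rough sketch of the kind of gradient clipping being described (assumed names and values, not this repository's actual training code); swapping the activation simply means replacing tf.nn.relu with tf.tanh in the last few layers:

import tensorflow as tf

def build_clipped_train_op(loss, learning_rate=0.001, clip_norm=5.0):
    # Clip the global gradient norm before applying the updates.
    optimizer = tf.train.AdamOptimizer(learning_rate)
    grads, variables = zip(*optimizer.compute_gradients(loss))
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(list(zip(clipped, variables)))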

@lelayf

lelayf commented Sep 21, 2016

@ibab I'm experiencing the stalling with the latest commit.
@r-zemblys if you resume training from the checkpoint right before the gradient implosion with a lower learning rate, does it still behave the same?

@r-zemblys
Author

@lelayf I've used a learning rate of 0.01 to get the loss curve above. The train saver only stores the last 5 checkpoints, so I'm not able to try lowering the learning rate right before the gradient implosion.
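
(Side note: the number of retained checkpoints is just the saver's max_to_keep argument; a minimal sketch, where the checkpoint path is only an example:)

import tensorflow as tf

# Keep more checkpoints so one from before the implosion is still around.
saver = tf.train.Saver(max_to_keep=50)

# Later, restore a specific step (example path) and restart with a lower learning rate:
# saver.restore(sess, 'logdir/train/model.ckpt-30000')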

@ibab I was indeed using an older commit. The latest one does not have the stalling problem.

Here is the loss curve with L2 regularization added; orange: learning rate 0.01 (~20k steps), blue: 0.001 (~60k steps).
[image: l2norm]

The gradient implosion problem is gone, but it seems the network is not learning anymore after the first epoch. I will try to generate some audio later today.

@lelayf

lelayf commented Sep 22, 2016

@r-zemblys are you training on GPU or CPU?

@r-zemblys
Author

Here are 80k generated samples, primed with an 8k-sample audio clip from another database.
generated_l2_primed.wav.zip

The waveform looks reasonably OK (green: generated audio).
[image: soundwave]

Notes:

  • used af4c58e
  • trained for ~20k steps with a learning rate of 0.01, then continued for ~60k steps with 0.001
  • @lelayf I'm using a Titan X GPU
  • used L2 regularization
  • disabled silence trimming because of #59
  • there was a bug in WaveNet.decode which resulted in all-zeros output; I think the bug is still there in fc5417d

@ibab
Owner

ibab commented Sep 22, 2016

@r-zemblys: Excellent, did you use the default wavenet_params.json?
I've also linked some of my results in #47.

@r-zemblys
Author

Forgot to add: this is the configuration I've used:


{
    "filter_width": 2,
    "quantization_steps": 256,
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
    "residual_channels": 32,
    "dilation_channels": 16,
    "use_biases": false
}

But as I mentioned at the beginning, there is no difference (at least in the loss curve) when using the default configuration.

@ibab
Owner

ibab commented Sep 22, 2016

@r-zemblys: Did you train on the entire dataset, or a specific speaker?

@r-zemblys
Author

@ibab: the entire VCTK corpus. I then primed generation with a recording from the LibriSpeech ASR corpus.

@ibab
Owner

ibab commented Sep 22, 2016

That's very cool. I think mixing together all the different speakers explains the voice difference between your sample and mine.
Would you be interested in contributing the l2 regularization in a pull request?

@hoonyoung

I'm using Python 2.7 and, as @r-zemblys mentioned above ("...there was a bug in WaveNet.decode, which resulted in all-zeros output"), I obtained a generated.wav file that was all zeros.

After fixing the last line of wavenet_ops.py as shown below, I am now getting speech-like waveform output.

magnitude = (1 / mu) * ((1 + mu)**abs(signal) - 1)
--> magnitude = (1. / mu) * ((1. + mu)**abs(signal) - 1)

I hope someone reflects this in the code if necessary.
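
The underlying cause is most likely Python 2 integer division: with an integer mu, 1 / mu evaluates to 0, which zeroes every magnitude. For clarity, here is a small self-contained sketch of mu-law expansion with the float fix (plain NumPy, not the exact code in wavenet_ops.py):

import numpy as np

def mu_law_expand(signal, mu=255):
    # Invert mu-law companding; `signal` is expected to lie in [-1, 1].
    signal = np.asarray(signal, dtype=np.float64)
    # Float literals avoid Python 2 integer division (1 / 255 == 0).
    magnitude = (1. / mu) * ((1. + mu) ** np.abs(signal) - 1.)
    return np.sign(signal) * magnitude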

@ibab
Owner

ibab commented Sep 22, 2016

@hoonyoung: This should be fixed on master now. I've also enabled Travis to run the tests with Python 2.

@lelayf

lelayf commented Sep 23, 2016

I commented out silence trimming and now training does not stall anymore, using 88e77bf.
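
For anyone wondering what this step removes: a generic amplitude-threshold sketch of silence trimming (an illustration only, not this repository's implementation):

import numpy as np

def trim_silence(audio, threshold=0.01):
    # Drop leading and trailing samples whose absolute amplitude is below the threshold.
    idx = np.where(np.abs(audio) > threshold)[0]
    return audio[idx[0]:idx[-1] + 1] if idx.size else audio[:0]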
