
Unknown Adadelta trainer error/issues #87

Closed
merceyz opened this issue Aug 12, 2016 · 123 comments

@merceyz
Contributor

merceyz commented Aug 12, 2016

Hello,

I was looking at the different trainers and reading some documents on them when I noticed a value called "epsilon".
This value is nowhere to be seen in the API documentation, and thus I assume it's missing. (Unless it's the "anneal" option, which would be awkward for me.)

@merceyz changed the title from 'Missing "epsilon" in the adadelta trainer' to 'Missing "epsilon" in the adadelta trainer?' on Aug 12, 2016
@merceyz
Contributor Author

merceyz commented Aug 12, 2016

Also, this seems to be a common "theme": it does really well for a while and then the loss shoots up to infinity.
[screenshots of training output showing the loss diverging to infinity]

(Ignore the sample difference; I'm just testing the trainers.)

@hughperkins
Owner

The loaders are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main program is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp . Where are you seeing epsilon?

As far as the training NaNs go... training NaNs are a perennial problem with neural nets. There are a few possible sources, none of which are mutually exclusive; it could be a bit of all of them :-P :

  • learning rate a bit high => causes learning to diverge. Since the gradient surface is not planar or linear, but twisted and convoluted, this might not be apparent to start with, and then, bam!
  • a bug. It's possible...
  • all is perfectly in order, but for optimal learning you need some kind of gradient truncation, normalization, or regularization

There's no hard and fast rule or check to know which is which... I suppose what I would do is:

  • look around for sensible learning rates for the model and data I'm using, and slot those in. Does it work?
  • if it's a novel model, first try a less novel model, and then see step 1 above; i.e. try something simple first, get that working, then try the original research bit
  • if you're using a standard-ish learning rate in a standard-ish model, and that model doesn't rely on any kind of additional regularization, normalization, gradient truncation etc. that you don't have, then someone would need to dig a bit more. 'Someone' probably meaning: you :-P

As far as 'digging a bit more' goes, you'll almost certainly need to roll up your sleeves and get stuck into the code, so I would try the first two steps first. I think that to 'get stuck into the code', at minimum, you'd probably want to do something like:

  • arrange for the weights to be saved each iteration, to a different file each time
  • as soon as it starts diverging, hitting nans, abort, note which iteration it was
  • now you can load those weights directly, examine the weights, their magnitude etc. what if you drop the learning rate? are some weights heading for infinity? will weight regularization help? etc...

If it was me, I'd currently probably do this using Python. In the past, I would have done it directly in the C++.
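As a rough illustration of the 'save the weights each iteration and check them' idea, here is a minimal C++ sketch. The accessors used on the net and layer objects (getNumLayers(), getLayer(), getWeights(), getWeightsSize()) are assumptions for illustration only, not verified against the DeepCL source.

    // Sketch only: dump every layer's weights to a per-epoch file and abort as
    // soon as a non-finite value appears. Assumes the DeepCL headers are
    // included; the accessor names are hypothetical.
    #include <cmath>
    #include <fstream>
    #include <stdexcept>
    #include <string>
    void dumpAndCheckWeights(NeuralNet *net, int epoch) {
        std::ofstream out("weights_epoch_" + std::to_string(epoch) + ".txt");
        for (int layerId = 0; layerId < net->getNumLayers(); layerId++) {
            const float *weights = net->getLayer(layerId)->getWeights();
            int n = net->getLayer(layerId)->getWeightsSize();
            for (int i = 0; i < n; i++) {
                out << weights[i] << " ";
                if (!std::isfinite(weights[i])) {
                    throw std::runtime_error("non-finite weight in layer " +
                        std::to_string(layerId) + " at epoch " + std::to_string(epoch));
                }
            }
            out << "\n";
        }
    }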

PS: one thing you could try is to assume it's a GPU kernel bug, and make everything run on the CPU, by modifying https://github.com/hughperkins/DeepCL/blob/master/src/conv/BackpropWeights.cpp#L51 to return true only for kernel 0 (i.e. the CPU kernel), and ditto for Forward and Backward. If this doesn't create NaNs, there might be a bug in one or more of the GPU kernels, for the specific geometry you are using.
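For concreteness, the change described in the PS would be along these lines. The function name and signature below are hypothetical placeholders; the real chooser lives around the linked line of BackpropWeights.cpp and will look different:

    // Hypothetical sketch: make the implementation chooser accept only
    // index 0, i.e. the CPU kernel. The real function in BackpropWeights.cpp
    // has a different name and signature; this only illustrates the idea.
    bool isImplementationAllowed(int index) {
        return index == 0;  // 0 == the CPU kernel
    }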

@hughperkins
Owner

(added a 'PS')

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

The loaders are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp Where are you seeing epsilon?

See the eps (epsilon) parameter
https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html

It can also be found here, just to name a few:
https://keras.io/optimizers/

all is perfectly in order, but for optimal learning you need some kind of gradient truncation, normalization, or regularization

If I'm not mistaken, DeepCL already adds a normalization layer after the input layer automatically, right?
Anyway, this is the config I'm currently running:
deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
The values are gathered from the second link (on where to find the "epsilon" value); I assumed "eps" was the same as "anneal".

@hughperkins
Owner

Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor/fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.

If i'm not mistaken deepcl already adds a normalization layer after the input layer automatically right

This normalizes incoming images. But I mean normalizing: weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.
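(For illustration, 'truncating gradients onto the unit ball' just means rescaling the gradient vector whenever its L2 norm exceeds 1. A generic sketch of that idea, not DeepCL code:)

    // Generic sketch of truncating a gradient vector onto the unit ball;
    // not part of the DeepCL API.
    #include <cmath>
    void clipToUnitBall(float *grad, int n) {
        float sumSq = 0.0f;
        for (int i = 0; i < n; i++) {
            sumSq += grad[i] * grad[i];
        }
        float norm = std::sqrt(sumSq);
        if (norm > 1.0f) {
            for (int i = 0; i < n; i++) {
                grad[i] /= norm;  // rescale so the gradient has L2 norm 1
            }
        }
    }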

deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 learningrate=1.0 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.

The network architecture looks reasonably standard. I'd be tempted to use tanh after the fc layers, rather than relu. You might want only two fc layers, like maybe -150n-tanh-2n perhaps?

@hughperkins
Owner

Oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure if it will make any difference at all for the DeepCL kernels, but as a general rule, GPU kernels tend to be more optimized for powers of 2 in the number of feature planes.

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor/fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.

Alright, that means the anneal value is not useful in this instance, right?

learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.

I just took the default value from the second link, but as I only now saw (facepalm), the Stanford page uses a lower, perhaps more normal, number.

This normalizes incoming images. But I mean normalizing: weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.

How would I achieve this kind of normalizing?

Oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure if it will make any difference at all for the DeepCL kernels, but as a general rule, GPU kernels tend to be more optimized for powers of 2 in the number of feature planes.

The only difference that made was, in one instance, crashing my display driver with 1003 MB of RAM left, and in another, a BSOD.

deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 batchsize=128 numepochs=4000 netdef=4*(64c3z-relu-mp2)-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

@hughperkins
Owner

Alright, that means the anneal value is not useful in this instance right?

Well, it's not related to the adadelta fudge factor, yeah. anneal basically slowly reduces the learning rate over time. It's a bit tricky to use, though. On the whole I think a standard approach is:

I just took the default value from the second link, but as I only now saw (facepalm), the Stanford page uses a lower, perhaps more normal, number.

ok

How would i achieve this kind of normalizing?

SGD has weight decay: https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 . It would have to be added to the other trainers on a case-by-case basis.
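(For reference, the usual way L2 weight decay enters an SGD step is sketched below. This is a generic illustration of the idea, not the code at the link:)

    // Generic sketch of L2 weight decay folded into a plain SGD update;
    // DeepCL's SGD.cpp organizes this differently, this just shows the idea.
    void sgdStepWithDecay(float *weights, const float *grad, int n,
                          float learningRate, float weightDecay) {
        for (int i = 0; i < n; i++) {
            // the usual gradient step, plus subtracting a fraction
            // (learningRate * weightDecay) of the current weight
            weights[i] -= learningRate * (grad[i] + weightDecay * weights[i]);
        }
    }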

Only difference that made was in one instance crashing my display driver with 1003 MB left of ram and in another BSOD

Ah. Well... OK. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:

16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2)

I haven't tried this, and I'm not sure if it's a good architecture; it's just demonstrating the concept of increasing the number of planes after each -mp2.

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

SGD has weight decay https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 It would have to be added into specific trainers, on a case by case basis.

That would be the "l2_decay" specified on the Stanford page, correct?

I just took the default value from the second link, but as i saw first now (facepalm) on the Stanford page it was a lower, perhaps normal, number

Actually I read it wrong; the Stanford page uses a learning rate of 1.0 for the adadelta trainer.

Ah. Well... OK. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:

16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2)
I haven't tried this, and I'm not sure if it's a good architecture; it's just demonstrating the concept of increasing the number of planes after each -mp2.

It didn't crash my system, which is a good start, and it seems to be doing rather well.

@hughperkins
Owner

Only difference that made was in one instance crashing my display driver with 1003 MB left of ram and in another BSOD

L2 decay is what you need. I'm 70% sure that what I linked to is L2, but I'd want to double-check somewhere to be sure it's not L1. (I think it's L2, because the derivative of x squared is simply x, so that's why we simply subtract some fraction of the current weight here.)
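(In symbols, and stated as standard facts rather than as a reading of the linked code: the L2 penalty is $\frac{\lambda}{2}\sum_i w_i^2$, whose derivative with respect to $w_i$ is $\lambda w_i$, which is why the update just subtracts a fraction of the current weight; an L1 penalty $\lambda \sum_i |w_i|$ would instead contribute $\lambda\,\mathrm{sign}(w_i)$ to the gradient.)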

Actually i read it wrong, the Stanford page uses a learning rate of 1.0 for the adadelta trainer

Ok

It didn't crash my system which is a good start, it seems to be doing rather good

cool :-)

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

I just ran it like this:
deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

And yet again at epoch 8 it goes south

However, running this
deepcl_train learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
seems to be fine; the loss goes up a bit at epochs 11 and 12, then back down at 13.

@hughperkins
Owner

Ok. You mean, using SGD trainer instead of adadelta trainer?

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

As SGD is the default trainer, yes. So there might be a bug somewhere in the adadelta trainer?

@hughperkins
Owner

Could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp Feel free to scrutinize/tweak it. Personally, I've mostly used SGD, and haven't used adadelta too much, so it's not impossible some buglet remains somewhere.

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

I tried the adagrad trainer, which is now at epoch 27 and constantly getting better and better: 99.5794% and a loss of 370.28.

deepcl_train trainer=adagrad rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

So I'll assume something is wrong in the adadelta trainer.

Could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp Feel free to scrutinize/tweak it. Personally, I've mostly used SGD, and haven't used adadelta too much, so it's not impossible some buglet remains somewhere.

I sadly don't know how it's supposed to be implemented, so I can't really "proofread" it.

@merceyz changed the title from 'Missing "epsilon" in the adadelta trainer?' to 'Unknown Adadelta trainer error/issues' on Aug 12, 2016
@hughperkins
Owner

hughperkins commented Aug 13, 2016

So i'll assume something is wrong in the adadelta trainer.

OK, that's good info. I will file a bug.

Edit: oh, the title of this issue, here, this thread, is the adadelta error. So... good :-)

(Note: the adadelta paper is here: www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf ; the update rule is in 'Algorithm 1':
[screenshot of Algorithm 1 from the adadelta paper]

We'd need to compare this equation with what is written in https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp#L55-L74 Unfortunately, it's a bit illegible, though better than it was originally. The bug could be either in this list of operations (the first thing to check), or in the underlying vector arithmetic implementations (though there are unit tests for those). On the whole, I guess it's most likely that the bug is in this chunk of code, linked from this paragraph, though that's mostly a guess.
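(For reference while reading the next comment, Algorithm 1 of the paper amounts to the following per-iteration updates, written out here in LaTeX; $\rho$ is the decay rate and $\epsilon$ a small constant:)

    E[g^2]_t        = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
    \Delta x_t      = - \frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
    E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, \Delta x_t^2
    x_{t+1}         = x_t + \Delta x_t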

@hughperkins
Owner

hughperkins commented Aug 13, 2016

What the code says is:

    clWorking = clGradWeights;
    // copy the clGradWeights vector into the clWorking vector
    clWorking.squared();
    // take the per-element square of each element in clWorking, so they equal the gradient squared
    // this is probably the gt2 terms in the equation above (gt is clGradWeights)
    // (writing gt squared as gt2)
    clWorking *= (1 - decay);
    // (1 - decay) is probably (1 - p)  (writing rho as p)
    // so clWorking now holds (1-p)gt2
    clSumGradSquared *= decay;
    // by comparison with equation 8 (see below),
    // it looks like clSumGradSquared holds the running average of the g2 elements over time,
    // ie E[g2]
    // so, from the code, we now have:
    // clSumGradSquared is:  p * E[g2]
    clSumGradSquared += clWorking;
    // now, clSumGradSquared is: p * E[g2] + (1-p)gt2
    // ie, looks like step 4 in the algorithm screenshot above

Edit: ah, this bit is equation 8:
[screenshot of equation 8 from the paper: the E[g^2] accumulation]

Edit 2: the next bit of code:

    clWorking = clSumGradSquared;
    // copy p * E[g2] + (1-p)gt2 into clWorking
    // so, clWorking is: p * E[g2] + (1-p)gt2
    clWorking.inv();
    // calculate 1 / clWorking, for each element, so now each element of clWorking is:
    // 1 / (p * E[g2] + (1-p)gt2)
    clWorking *= clSumUpdateSquared;
    // I guess that `update` is delta x in the equation in the screenshot, which we can
    // write maybe AX (since A looks a bit like the delta symbol, the triangle)
    // I guess that ... hmmm.... seems like we are calculating equation 9 and step 5
    // equation 9 is:

[screenshot of equation 9 from the paper]

but ... in equation 9, there is an epsilon, which is what you mentioned above
... and in the code ... no epsilon :-P Maybe this is the bug.

@hughperkins
Owner

hughperkins commented Aug 13, 2016

Maybe the code should be modified to insert the following line in between line 62 and line 63:

clWorking  += 0.0000001f;

(where 0.0000001f is epsilon)
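(As a cross-check of where the epsilon belongs, here is a scalar sketch of the full Adadelta step, following Algorithm 1 of the paper. It is only illustrative: the real trainer applies the corresponding operations to whole GPU buffers via clWorking, clSumGradSquared and clSumUpdateSquared.)

    // Scalar sketch of one Adadelta update with epsilon included, per
    // Algorithm 1 of the paper; illustrative only.
    #include <cmath>
    void adadeltaStep(float &w, float grad, float &sumGradSquared,
                      float &sumUpdateSquared, float rho, float eps) {
        sumGradSquared = rho * sumGradSquared + (1 - rho) * grad * grad;
        float update = -std::sqrt(sumUpdateSquared + eps)
                       / std::sqrt(sumGradSquared + eps) * grad;
        sumUpdateSquared = rho * sumUpdateSquared + (1 - rho) * update * update;
        w += update;
    }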

@hughperkins
Owner

Added in 07eb0d6. We'll see if that helps. I should really trigger a build.

@hughperkins
Owner

When you get a moment, http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip has an updated Adadelta. We can try this version, see if it helps.

@merceyz
Contributor Author

merceyz commented Aug 13, 2016

http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip

I downloaded it and moved over my Data folder (images and manifest) and it is sadly getting stuck.

[screenshot of the console output where training gets stuck]

If I wait long enough and hit Ctrl+C (cancel), it outputs "Empty input file".

Its RAM usage is 3.14 GB and its CPU usage is 0.00%.

@merceyz
Contributor Author

merceyz commented Aug 20, 2016

Any updates regarding the issue(s)?

@hughperkins
Owner

I'm not sure. I think the issue was fixed. There seems to be some problem with the build process in general (I just migrated to msvc2015, a bit gratuitously), and I don't know how to diagnose/fix that. It sounds like a lot of work... I need to pluck up courage, roll up my sleeves, and look into why the new build is not working... Can you try to check whether other stuff is working on this build? Or is everything entirely broken on this build?

@merceyz
Contributor Author

merceyz commented Aug 21, 2016

The unit tests seem to run fine.

@hughperkins
Owner

Alright. What about running simple MNIST training? Just normal SGD and so on? Is it only adadelta that is broken? Or are things more broken generally? If it's just adadelta that's broken, that simplifies things, since then it's not a build issue, just some logic issue in adadelta, which should be not too hard for me to fix, hopefully, probably...

@hughperkins
Owner

hughperkins commented Jun 24, 2017

Hmmm, ELU forward does this:
https://github.com/hughperkins/DeepCL/blob/master/src/activate/ActivationForwardGpuNaive.cpp#L85

    "#elif defined ELU\n"
    "    #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)\n"

I suppose if output were large enough (around 89 or more, for 32-bit floats), exp(output) would overflow to infinity (but not nan). I think it's physically impossible for exp to give nan, given non-nan input. Some examples of calling exp on different numbers:

~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(10)"
 = 22026.5
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(100)"
 = 2.68812e+43
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(200)"
 = 7.22597e+86
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(300)"
 = 1.94243e+130
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(400)"
 = 5.22147e+173
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(500)"
 = 1.40359e+217
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(1000)"
 = Infinity
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(100000)"
 = Infinity
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(-10000)"
~= 0
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(-100000000000)"
~= 0
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(0)"
 = 1

... so that's odd. Do you happen to know exactly which input value is generating a nan output value in the ELU forward?

@merceyz
Contributor Author

merceyz commented Jun 24, 2017

Code that checks and dumps during forward:

if (layers[layerId]->hasOutputWrapper() && layers[layerId]->getOutputWrapper()->isOnDevice() == true)
{
    #pragma region Find NaN
    layers[layerId]->getOutputWrapper()->copyToHost();
    float * result = (float*)layers[layerId]->getOutputWrapper()->getHostArray();
    int items = layers[layerId]->getOutputNumElements();
    for (int i = 0; i < items; i++)
    {
        if (std::isfinite(result[i]) == false)
        {
            cout << "Found error at layer " << layerId << " " << layers[layerId]->asString() << endl;

            ofstream dumpData;
            dumpData.open("ForwardDump.txt");

            layers[layerId - 1]->getOutputWrapper()->copyToHost();
            float* input = (float*)layers[layerId - 1]->getOutputWrapper()->getHostArray();
            int inputItems = layers[layerId - 1]->getOutputNumElements();

            dumpData << "Input data of layer " << layers[layerId]->asString() << endl;
            for (int inputIndex = 0; inputIndex < inputItems; inputIndex++)
            {
                dumpData << input[inputIndex] << "  ";
            }

            dumpData << endl << "Output data of layer " << layers[layerId]->asString() << endl;
            for (int outputIndex = 0; outputIndex < items; outputIndex++)
            {
                dumpData << result[outputIndex] << "  ";
            }
            dumpData.close();

            throw runtime_error("Non-finite number found");
        }
    }
    #pragma endregion
}

Assuming input[0] -> output[0], it gives me these input values producing NaN, which makes no sense:

0.190712
0.181808
0.181803
0.196433
0.26465
0.246923
0.219679
0.260867
0.204367
0.238279
0.191556
0.165134
0.180202
0.190625
0.193821
0.21916
0.253312
0.285008
0.246093
0.220917
0.233793
0.351242
0.41973
0.39212
0.332233
0.291702
0.226313
0.200484
0.238277
0.266027
0.406134

@hughperkins
Owner

Yes, that doesn't make any sense. Are you sure you've transferred the data from GPU to CPU before printing it and seeing the nans? Make sure you put a clFinish() before and after the transfer, just to be sure.

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

Alright, after a lot of digging I now know why it happens.
As soon as it predicts a 100% chance of a label it starts to go "insane"

I'll attach the log so that you can see for yourself.
Entries and where they come from:

"Forward" https://github.com/hughperkins/DeepCL/blob/master/src/net/NeuralNet.cpp#L187
"Backwards" https://github.com/hughperkins/DeepCL/blob/master/src/net/NeuralNet.cpp#L224

"Input to loss" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L87
Value is: `output[imageOffset + label]`
"Output from loss" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L87
Value is: `-log(output[imageOffset + label])`
"LOSS IS CURRENTLY" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L90
Value is: `loss`

"Grad out" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L151
Value is: `output[imageOffset + plane]`
"GradOut final" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L158

TrainingLog.txt

@hughperkins
Owner

Alright, after a lot of digging I now know why it happens.
As soon as it predicts a 100% chance of a label it starts to go "insane"

Wow, that's impressive work Chris. Very nice. I'm very impressed :-)

OK, so what's happening, I think, is:

  • if the input to the softmaxloss layer is 0, then log(0) evaluates to -Inf, and then it's over, really

But eg log(0 + 1e-6) is no longer infinity, but -13. Can you try adding + 1e-6 into each of the log terms, in the forward direction, and see what happens? For example, this line:

https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L76

                loss += - log(output[ imageOffset + label ]);

would become:

                loss += - log(output[ imageOffset + label ] + 1e-6);

(you can also try +1e-8. I think both will fix the issue, probably. They might give ever so slightly different results, I'm not sure which set of results will be 'better')

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

It didn't help. I think the problem is with calcGradInputFromLabels.

calcLossFromLabels -> when its input is 1, its output is 0
calcGradInputFromLabels -> when its input is now 0, its output is -1; in the event its input is 1, its output is now 0. Assuming that output (0) is multiplied with anything in the network, we have everything eventually turning into 0 and eventually NaN.

Also, are these two lines correct?
https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L151
https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L158
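(For reference when checking those two lines: for a softmax layer followed by negative log-likelihood, the standard combined gradient with respect to the pre-softmax input is $\partial L / \partial z_j = p_j - \mathbb{1}[j = \text{label}]$, i.e. the softmax output minus 1 at the label plane, and the softmax output unchanged elsewhere. That matches the "output is -1 when the input is 0, and 0 when the input is 1" behaviour described above, so that part at least is consistent with the usual formula.)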

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I'm assuming something gets multiplied by 0, which then goes into a function that doesn't accept 0 and returns NaN.
The same goes for -1, if it ends up in a function that doesn't accept -1 (log, for example).
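(A tiny standalone check of which of these operations actually produce NaN versus infinity; this is just standard IEEE 754 float behaviour, not DeepCL code:)

    // Standalone demo of how inf and nan arise from ordinary float maths.
    #include <cmath>
    #include <cstdio>
    int main() {
        float inf = INFINITY;
        std::printf("log(0)    = %f\n", std::log(0.0f));   // -inf
        std::printf("log(-1)   = %f\n", std::log(-1.0f));  // nan
        std::printf("0 * inf   = %f\n", 0.0f * inf);       // nan
        std::printf("inf - inf = %f\n", inf - inf);        // nan
        std::printf("0 * 5     = %f\n", 0.0f * 5.0f);      // 0, not nan
        return 0;
    }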

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I prevented the output of the softmax layer from ever becoming 1 or 0:

    output[imageOffset + plane] = max(min(exp(input[imageOffset + plane] - maxValue) / denominator, 0.99999f), 0.00001f);

The loss doesn't become NaN, but the result during forward of some of the layers does, because both the weights and the bias contain A LOT of NaNs.

So the issue is during the backward pass.
Where should I go from here?

@hughperkins
Owner

Sounds like you are heading in the right direction. Awesome :-)

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I'm completely stuck.

I added checks before and after backpropWeightsImpl->calcGradWeights and backwardImpl->backward in ConvolutionalLayer::backward(), but the first instance of NaN is still during forward, where the weights and bias are NaN.
It's almost as if the weights and bias for the current layer change at some point other than ConvolutionalLayer::backward().

Code for checking:

VIRTUAL void ConvolutionalLayer::checkWrapperData(CLWrapper* input, std::string message)
{
    if (input->isOnDevice())
    {
        cl->finish();
        input->copyToHost();
        cl->finish();

        float* data = (float*)input->getHostArray();
        int size = input->size();

        for (int i = 0; i < size; i++)
        {
            if (std::isfinite(data[i]) == false)
            {
                cout << "Found non-finite number: " << message << endl;
                throw runtime_error("Found non-finite number at " + asString());
            }
        }
    }
}

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I forgot the trainer updates them (facepalm).
Checking there now

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

both the weights and bias contains A LOT of NaNs.

There was a bug in my code for checking them (I did int countNan; instead of int countNan = 0;).

So with no NaN in the weights or bias, the first instance is during forward on a conv layer.
What can produce +-inf on that layer?

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I forced forwardauto to use the CPU kernel, coded the CPU kernel to stop once it gets a non-finite result and print what it just did math on, and got this:

    Bias: -98174265514624848756736.000000
    Input: 892913097361408654370144256.000000; weight: 85576382305778379259904.000000
    Sum: inf

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

So the source of the issue is the SoftMaxLayer.

The above result comes from (preventing it from ever predicting 1 or 0):
https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L293

    output[imageOffset + plane] = max(min(exp(input[imageOffset + plane] - maxValue) / denominator, 0.99999f), 0.00001f)

Not doing that gives me a loss of inf due to log(0), which spreads fast.
Thoughts?

Some resources that might help
https://stackoverflow.com/a/9906882/2923845
https://cs231n.github.io/linear-classify/#softmax
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp
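(The trick in the first two links is the usual numerically stable softmax: since $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$ is unchanged when a constant is subtracted from every $x_i$, one computes $e^{x_i - \max_j x_j} / \sum_j e^{x_j - \max_j x_j}$ so the exponentials never overflow. That is what the existing `exp(input[...] - maxValue)` already does; the remaining hazard is taking the log of a probability that has underflowed to 0.)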

@hughperkins
Owner

hughperkins commented Jun 26, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

It prevents a loss of +-inf, but the bias and weights become so large that the sum of the calculation on them becomes +-inf.

@hughperkins
Owner

hughperkins commented Jun 26, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

From the command line that's weightdecay=X, correct?

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

Also do you want me to do that with or without the max and min?

@hughperkins
Owner

hughperkins commented Jun 26, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

I had a look at the Caffe implementation of the softmax layer and used the log(max(x, FLT_MIN)) from https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L106
here: https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L117

That prevents a loss of +-inf.
With a weight decay of 0.1 it seems to be working.

I noticed that they also normalize the loss, which is probably something we should be doing as well:
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L111
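(Putting the two ideas together, here is a sketch of what the loss calculation looks like with the FLT_MIN clamp and the per-batch normalization. The function and variable names are illustrative only; the actual SoftMaxLayer code is organized differently.)

    // Sketch of a clamped, batch-normalized softmax loss; illustrative only.
    #include <algorithm>
    #include <cfloat>
    #include <cmath>
    float softmaxLoss(const float *probs, const int *labels,
                      int batchSize, int numPlanes) {
        float loss = 0.0f;
        for (int n = 0; n < batchSize; n++) {
            float p = probs[n * numPlanes + labels[n]];
            loss += -std::log(std::max(p, FLT_MIN));  // never take log(0)
        }
        return loss / batchSize;  // normalize by the number of labels
    }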

@hughperkins
Owner

Ah, looking at other implementations is a good idea :-) Cool :-)

With a weight decay of 0.1 it seems to be working.

Great! :-)

I noticed that they also normalize the loss, which is probably something we should be doing as well:
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L111

OK. What calculation are they doing for this?

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

ok. what calculation are they doing for this?

I'm not quite sure
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L59
EDIT:

the loss is normalized by the number of (nonignored) labels present; otherwise the loss is simply summed over spatial locations.

With a weight decay of 0.1 it seems to be working.

I take that back; after running for a while, the numbers in the weights and bias grow too large, producing -nan.

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

Alright, finally done with this.

The loss becoming nan/+-inf is solved by preventing the log of a number smaller than FLT_MIN (x <= 0).

The bias and weights becoming so large that the result of math on them becomes nan/+-inf is because of a learning rate that is too high. The fix for the loss hides this, so I've added a check for it.

See #120
