
Unknown Adadelta trainer error/issues #87

Closed
merceyz opened this issue Aug 12, 2016 · 123 comments

@merceyz
Contributor

merceyz commented Aug 12, 2016

Hello,

I was looking at the different trainers and reading some documents on them when I noticed a value called "epsilon".
This value is nowhere to be seen in the API documentation, and thus I assume it's missing. (Unless it's the "anneal" option, which would be awkward for me.)

@merceyz changed the title from 'Missing "epsilon" in the adadelta trainer' to 'Missing "epsilon" in the adadelta trainer?' on Aug 12, 2016
@merceyz
Contributor Author

merceyz commented Aug 12, 2016

Also, this seems to be a common "theme": it does really well for a while and then the loss shoots up to infinity.
[screenshots of training output showing the loss diverging to infinity]

(Ignore the sample difference; I'm just testing the trainers.)

@hughperkins
Owner

The loaders are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main program is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp . Where are you seeing epsilon?

As far as the training NaNs go... training NaNs are a perennial problem with neural nets. There are a few possible sources, none of which are mutually exclusive; it could be a bit of all of them :-P :

  • learning rate a bit high => causes learning to diverge. Since the gradient surface is not planar or linear, but twisted and convoluted, this might not be apparent to start with, and then, bam!
  • a bug. It's possible...
  • all is perfectly in order, but for optimal learning you need some kind of gradient truncation, normalization, or regularization

There's no hard and fast rule or check to know which is which... I suppose what I would do is:

  • look around for sensible learning rates for the model and data I'm using, and slot those in. Does it work?
  • if it's a novel model, first try a less novel model, and then see step 1 above; i.e. try something simple first, get that working, then try the original research bit
  • if you're using a standard-ish learning rate in a standard-ish model, and that model doesn't rely on any kind of additional regularization, normalization, gradient truncation etc. that you don't have, then someone would need to dig a bit more. 'Someone' probably meaning: you :-P

As far as 'digging a bit more' goes, you'll almost certainly need to roll up your sleeves and get stuck into the code, so I would try the first two steps first. I think that to 'get stuck into the code', at minimum, you'd probably want to do something like:

  • arrange for the weights to be saved each iteration, to a different file each time
  • as soon as it starts diverging, hitting nans, abort, note which iteration it was
  • now you can load those weights directly, examine the weights, their magnitude etc. what if you drop the learning rate? are some weights heading for infinity? will weight regularization help? etc...

If it was me, I'd currently probably do this using Python. In the past, I would have done it directly in the C++.
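As a rough illustration of the 'save the weights each iteration and check them' idea, here is a minimal C++ sketch. The accessors used on the net and layer objects (getNumLayers(), getLayer(), getWeights(), getWeightsSize()) are assumptions for illustration only, not verified against the DeepCL source.

    // Sketch only: dump every layer's weights to a per-epoch file and abort as
    // soon as a non-finite value appears. Assumes the DeepCL headers are
    // included; the accessor names are hypothetical.
    #include <cmath>
    #include <fstream>
    #include <stdexcept>
    #include <string>
    void dumpAndCheckWeights(NeuralNet *net, int epoch) {
        std::ofstream out("weights_epoch_" + std::to_string(epoch) + ".txt");
        for (int layerId = 0; layerId < net->getNumLayers(); layerId++) {
            const float *weights = net->getLayer(layerId)->getWeights();
            int n = net->getLayer(layerId)->getWeightsSize();
            for (int i = 0; i < n; i++) {
                out << weights[i] << " ";
                if (!std::isfinite(weights[i])) {
                    throw std::runtime_error("non-finite weight in layer " +
                        std::to_string(layerId) + " at epoch " + std::to_string(epoch));
                }
            }
            out << "\n";
        }
    }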

PS: one thing you could try is to assume it's a GPU kernel bug, and make everything run on the CPU, by modifying https://github.com/hughperkins/DeepCL/blob/master/src/conv/BackpropWeights.cpp#L51 to return true only for kernel 0 (i.e. the CPU kernel), and ditto for Forward and Backward. If this doesn't create NaNs, there might be a bug in one or more of the GPU kernels, for the specific geometry you are using.
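For concreteness, the change described in the PS would be along these lines. The function name and signature below are hypothetical placeholders; the real chooser lives around the linked line of BackpropWeights.cpp and will look different:

    // Hypothetical sketch: make the implementation chooser accept only
    // index 0, i.e. the CPU kernel. The real function in BackpropWeights.cpp
    // has a different name and signature; this only illustrates the idea.
    bool isImplementationAllowed(int index) {
        return index == 0;  // 0 == the CPU kernel
    }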

@hughperkins
Owner

(added a 'PS')

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

The loaders are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp Where are you seeing epsilon?

See the eps (epsilon) parameter
https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html

It can also be found here, just to name a few:
https://keras.io/optimizers/

all is perfectly in order, but for optimal learning you need some kind of gradient truncation, normalization, or regularization

If I'm not mistaken, DeepCL already adds a normalization layer after the input layer automatically, right?
Anyway, this is the config I'm currently running:
deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
The values are gathered from the second link (on where to find the "epsilon" value); I assumed "eps" was the same as "anneal".

@hughperkins
Owner

Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor/fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.

If i'm not mistaken deepcl already adds a normalization layer after the input layer automatically right

This normalizes incoming images. But I mean normalizing: weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.
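(For illustration, 'truncating gradients onto the unit ball' just means rescaling the gradient vector whenever its L2 norm exceeds 1. A generic sketch of that idea, not DeepCL code:)

    // Generic sketch of truncating a gradient vector onto the unit ball;
    // not part of the DeepCL API.
    #include <cmath>
    void clipToUnitBall(float *grad, int n) {
        float sumSq = 0.0f;
        for (int i = 0; i < n; i++) {
            sumSq += grad[i] * grad[i];
        }
        float norm = std::sqrt(sumSq);
        if (norm > 1.0f) {
            for (int i = 0; i < n; i++) {
                grad[i] /= norm;  // rescale so the gradient has L2 norm 1
            }
        }
    }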

deepcl_train trainer=adadelta anneal=1e-08 rho=0.95 learningrate=1.0 batchsize=128 numepochs=4000 netdef=4*(60c3z-relu-mp2)-150n-relu-150n-relu-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.

The network architecture looks reasonably standard. I'd be tempted to use tanh after the fc layers, rather than relu. You might want only two fc layers, like maybe -150n-tanh-2n perhaps?

@hughperkins
Owner

Oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure if it will make any difference at all for the DeepCL kernels, but as a general rule, GPU kernels tend to be more optimized for powers of 2 in the number of feature planes.

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor/fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.

Alright, that means the anneal value is not useful in this instance, right?

learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001.

I just took the default value from the second link, but as I only now saw (facepalm), the Stanford page uses a lower, perhaps more normal, number.

This normalizes incoming images. But I mean normalizing: weights. Or gradients. Or both. Or at least truncating gradients onto the unit ball. Weight decay is fairly standard.

How would I achieve this kind of normalizing?

Oh, better to make the number of feature planes a power of 2, on the whole, e.g. 64c3z. I'm not sure if it will make any difference at all for the DeepCL kernels, but as a general rule, GPU kernels tend to be more optimized for powers of 2 in the number of feature planes.

The only difference that made was, in one instance, crashing my display driver with 1003 MB of RAM left, and in another, a BSOD.

deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 batchsize=128 numepochs=4000 netdef=4*(64c3z-relu-mp2)-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

@hughperkins
Owner

Alright, that means the anneal value is not useful in this instance right?

Well, it's not related to the adadelta fudge factor, yeah. anneal basically slowly reduces the learning rate over time. It's a bit tricky to use, though. On the whole I think a standard approach is:

I just took the default value from the second link, but as I only now saw (facepalm), the Stanford page uses a lower, perhaps more normal, number.

ok

How would i achieve this kind of normalizing?

SGD has weight decay: https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 . It would have to be added to the other trainers on a case-by-case basis.
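(For reference, the usual way L2 weight decay enters an SGD step is sketched below. This is a generic illustration of the idea, not the code at the link:)

    // Generic sketch of L2 weight decay folded into a plain SGD update;
    // DeepCL's SGD.cpp organizes this differently, this just shows the idea.
    void sgdStepWithDecay(float *weights, const float *grad, int n,
                          float learningRate, float weightDecay) {
        for (int i = 0; i < n; i++) {
            // the usual gradient step, plus subtracting a fraction
            // (learningRate * weightDecay) of the current weight
            weights[i] -= learningRate * (grad[i] + weightDecay * weights[i]);
        }
    }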

Only difference that made was in one instance crashing my display driver with 1003 MB left of ram and in another BSOD

Ah. Well... OK. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:

16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2)

I haven't tried this, and I'm not sure if it's a good architecture; it's just demonstrating the concept of increasing the number of planes after each -mp2.

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

SGD has weight decay https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 It would have to be added into specific trainers, on a case by case basis.

That would be the "l2_decay" specified on the Stanford page, correct?

I just took the default value from the second link, but as i saw first now (facepalm) on the Stanford page it was a lower, perhaps normal, number

Actually I read it wrong; the Stanford page uses a learning rate of 1.0 for the adadelta trainer.

Ah. Well... OK. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:

16c3z-relu-mp2-32c3z-relu-mp2-2*(64c3z-relu-mp2)
I haven't tried this, and I'm not sure if it's a good architecture; it's just demonstrating the concept of increasing the number of planes after each -mp2.

It didn't crash my system, which is a good start, and it seems to be doing rather well.

@hughperkins
Owner

Only difference that made was in one instance crashing my display driver with 1003 MB left of ram and in another BSOD

L2 decay is what you need. I'm 70% sure that what I linked to is L2, but I'd want to double-check somewhere to be sure it's not L1. (I think it's L2, because the derivative of x squared is simply x, so that's why we simply subtract some fraction of the current weight here.)
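(In symbols, and stated as standard facts rather than as a reading of the linked code: the L2 penalty is $\frac{\lambda}{2}\sum_i w_i^2$, whose derivative with respect to $w_i$ is $\lambda w_i$, which is why the update just subtracts a fraction of the current weight; an L1 penalty $\lambda \sum_i |w_i|$ would instead contribute $\lambda\,\mathrm{sign}(w_i)$ to the gradient.)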

Actually i read it wrong, the Stanford page uses a learning rate of 1.0 for the adadelta trainer

Ok

It didn't crash my system which is a good start, it seems to be doing rather good

cool :-)

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

I just ran it like this:
deepcl_train trainer=adadelta rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

And yet again at epoch 8 it goes south

However, running this
deepcl_train learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt
seems to be fine; the loss goes up a bit at epochs 11 and 12, then back down at 13.

@hughperkins
Owner

Ok. You mean, using SGD trainer instead of adadelta trainer?

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

As SGD is the default trainer, yes. So there might be a bug somewhere in the adadelta trainer?

@hughperkins
Owner

Could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp Feel free to scrutinize/tweak it. Personally, I've mostly used SGD, and haven't used adadelta too much, so it's not impossible some buglet remains somewhere.

@merceyz
Contributor Author

merceyz commented Aug 12, 2016

I tried the adagrad trainer, which is now at epoch 27 and constantly getting better and better: 99.5794% and a loss of 370.28.

deepcl_train trainer=adagrad rho=0.95 learningrate=0.001 weightdecay=0.001 batchsize=128 numepochs=4000 netdef=16c3z-relu-mp2-32c3z-relu-mp2-64c3z-relu-mp2-128c3z-relu-mp2-150n-tanh-2n datadir=Data trainfile=Train.txt validatefile=Validation.txt

So I'll assume something is wrong in the adadelta trainer.

Could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp Feel free to scrutinize/tweak it. Personally, I've mostly used SGD, and haven't used adadelta too much, so it's not impossible some buglet remains somewhere.

I sadly don't know how it's supposed to be implemented, so I can't really "proofread" it.

@merceyz changed the title from 'Missing "epsilon" in the adadelta trainer?' to 'Unknown Adadelta trainer error/issues' on Aug 12, 2016
@hughperkins
Owner

hughperkins commented Aug 13, 2016

So i'll assume something is wrong in the adadelta trainer.

OK, that's good info. I will file a bug.

Edit: oh, the title of this issue, here, this thread, is the adadelta error. So... good :-)

(Note: the adadelta paper is here: www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf ; the update rule is in 'Algorithm 1':
[screenshot of Algorithm 1 from the adadelta paper]

We'd need to compare this equation with what is written in https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp#L55-L74 Unfortunately, it's a bit illegible, though better than it was originally. The bug could be either in this list of operations (the first thing to check), or in the underlying vector arithmetic implementations (though there are unit tests for those). On the whole, I guess it's most likely that the bug is in this chunk of code, linked from this paragraph, though that's mostly a guess.
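(For reference while reading the next comment, Algorithm 1 of the paper amounts to the following per-iteration updates, written out here in LaTeX; $\rho$ is the decay rate and $\epsilon$ a small constant:)

    E[g^2]_t        = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
    \Delta x_t      = - \frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
    E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, \Delta x_t^2
    x_{t+1}         = x_t + \Delta x_t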

@hughperkins
Owner

hughperkins commented Aug 13, 2016

What the code says is:

    clWorking = clGradWeights;
    // copy the clGradWeights vector into the clWorking vector
    clWorking.squared();
    // take the per-element square of each element in clWorking, so they equal the gradient squared
    // this is probably the gt2 terms in the equation above (gt is clGradWeights)
    // (writing gt squared as gt2)
    clWorking *= (1 - decay);
    // (1 - decay) is probably (1 - p)  (writing rho as p)
    // so clWorking now holds (1-p)gt2
    clSumGradSquared *= decay;
    // by comparison with equation 8 (see below),
    // it looks like clSumGradSquared holds the running average of the g2 elements over time,
    // ie E[g2]
    // so, from the code, we now have:
    // clSumGradSquared is:  p * E[g2]
    clSumGradSquared += clWorking;
    // now, clSumGradSquared is: p * E[g2] + (1-p)gt2
    // ie, looks like step 4 in the algorithm screenshot above

Edit: ah, this bit is equation 8:
[screenshot of equation 8 from the paper: the E[g^2] accumulation]

Edit 2: the next bit of code:

    clWorking = clSumGradSquared;
    // copy p * E[g2] + (1-p)gt2 into clWorking
    // so, clWorking is: p * E[g2] + (1-p)gt2
    clWorking.inv();
    // calculate 1 / clWorking, for each element, so now each element of clWorking is:
    // 1 / (p * E[g2] + (1-p)gt2)
    clWorking *= clSumUpdateSquared;
    // I guess that `update` is delta x in the equation in the screenshot, which we can
    // write maybe AX (since A looks a bit like the delta symbol, the triangle)
    // I guess that ... hmmm.... seems like we are calculating equation 9 and step 5
    // equation 9 is:

[screenshot of equation 9 from the paper]

but ... in equation 9, there is an epsilon, which is what you mentioned above
... and in the code ... no epsilon :-P Maybe this is the bug.

@hughperkins
Owner

hughperkins commented Aug 13, 2016

Maybe the code should be modified to insert the following line in between line 62 and line 63:

clWorking  += 0.0000001f;

(where 0.0000001f is epsilon)
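(As a cross-check of where the epsilon belongs, here is a scalar sketch of the full Adadelta step, following Algorithm 1 of the paper. It is only illustrative: the real trainer applies the corresponding operations to whole GPU buffers via clWorking, clSumGradSquared and clSumUpdateSquared.)

    // Scalar sketch of one Adadelta update with epsilon included, per
    // Algorithm 1 of the paper; illustrative only.
    #include <cmath>
    void adadeltaStep(float &w, float grad, float &sumGradSquared,
                      float &sumUpdateSquared, float rho, float eps) {
        sumGradSquared = rho * sumGradSquared + (1 - rho) * grad * grad;
        float update = -std::sqrt(sumUpdateSquared + eps)
                       / std::sqrt(sumGradSquared + eps) * grad;
        sumUpdateSquared = rho * sumUpdateSquared + (1 - rho) * update * update;
        w += update;
    }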

@hughperkins
Owner

Added in 07eb0d6. We'll see if that helps. I should really trigger a build.

@hughperkins
Owner

When you get a moment, http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip has an updated Adadelta. We can try this version, see if it helps.

@merceyz
Contributor Author

merceyz commented Aug 13, 2016

http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip

I downloaded it and moved over my Data folder (images and manifest) and it is sadly getting stuck.

[screenshot of the console output where training gets stuck]

If I wait long enough and hit Ctrl+C (cancel), it outputs "Empty input file".

Its RAM usage is 3.14 GB and its CPU usage is 0.00%.

@merceyz
Contributor Author

merceyz commented Aug 20, 2016

Any updates regarding the issue(s)?

@hughperkins
Owner

I'm not sure. I think the issue was fixed. There seems to be some problem with the build process in general (I just migrated to msvc2015, a bit gratuitously), and I don't know how to diagnose/fix that. It sounds like a lot of work... I need to pluck up courage, roll up my sleeves, and look into why the new build is not working... Can you try to check whether other stuff is working on this build? Or is everything entirely broken on this build?

@merceyz
Contributor Author

merceyz commented Aug 21, 2016

The unit tests seem to run fine.

@hughperkins
Owner

Alright. What about running simple MNIST training? Just normal SGD and so on? Is it only adadelta that is broken? Or are things more broken generally? If it's just adadelta that's broken, that simplifies things, since then it's not a build issue, just some logic issue in adadelta, which should be not too hard for me to fix, hopefully, probably...

@hughperkins
Owner

hughperkins commented Jun 24, 2017

Hmmm, ELU forward does this:
https://github.com/hughperkins/DeepCL/blob/master/src/activate/ActivationForwardGpuNaive.cpp#L85

    "#elif defined ELU\n"
    "    #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)\n"

I suppose if output were large enough (around 89 or more, for 32-bit floats), exp(output) would overflow to infinity (but not nan). I think it's physically impossible for exp to give nan, given non-nan input. Some examples of calling exp on different numbers:

~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(10)"
 = 22026.5
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(100)"
 = 2.68812e+43
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(200)"
 = 7.22597e+86
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(300)"
 = 1.94243e+130
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(400)"
 = 5.22147e+173
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(500)"
 = 1.40359e+217
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(1000)"
 = Infinity
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(100000)"
 = Infinity
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(-10000)"
~= 0
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(-100000000000)"
~= 0
~/git-local/DeepCL/src/conv (master|✔) $ wcalc "exp(0)"
 = 1

... so that's odd. Do you happen to know exactly which input value is generating a nan output value in the ELU forward?

@merceyz
Contributor Author

merceyz commented Jun 24, 2017

Code that checks and dumps during forward:

if (layers[layerId]->hasOutputWrapper() && layers[layerId]->getOutputWrapper()->isOnDevice() == true)
{
    #pragma region Find NaN
    layers[layerId]->getOutputWrapper()->copyToHost();
    float * result = (float*)layers[layerId]->getOutputWrapper()->getHostArray();
    int items = layers[layerId]->getOutputNumElements();
    for (int i = 0; i < items; i++)
    {
        if (std::isfinite(result[i]) == false)
        {
            cout << "Found error at layer " << layerId << " " << layers[layerId]->asString() << endl;

            ofstream dumpData;
            dumpData.open("ForwardDump.txt");

            layers[layerId - 1]->getOutputWrapper()->copyToHost();
            float* input = (float*)layers[layerId - 1]->getOutputWrapper()->getHostArray();
            int inputItems = layers[layerId - 1]->getOutputNumElements();

            dumpData << "Input data of layer " << layers[layerId]->asString() << endl;
            for (int inputIndex = 0; inputIndex < inputItems; inputIndex++)
            {
                dumpData << input[inputIndex] << "  ";
            }

            dumpData << endl << "Output data of layer " << layers[layerId]->asString() << endl;
            for (int outputIndex = 0; outputIndex < items; outputIndex++)
            {
                dumpData << result[outputIndex] << "  ";
            }
            dumpData.close();

            throw runtime_error("Non-finite number found");
        }
    }
    #pragma endregion
}

Assuming input[0] -> output[0], it gives me these input values producing NaN, which makes no sense:

0.190712
0.181808
0.181803
0.196433
0.26465
0.246923
0.219679
0.260867
0.204367
0.238279
0.191556
0.165134
0.180202
0.190625
0.193821
0.21916
0.253312
0.285008
0.246093
0.220917
0.233793
0.351242
0.41973
0.39212
0.332233
0.291702
0.226313
0.200484
0.238277
0.266027
0.406134

@hughperkins
Owner

Yes, that doesn't make any sense. Are you sure you've transferred the data from GPU to CPU before printing it and seeing the nans? Make sure you put a clFinish() before and after the transfer, just to be sure.

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

Alright, after a lot of digging I now know why it happens.
As soon as it predicts a 100% chance of a label it starts to go "insane"

I'll attach the log so that you can see for yourself.
Entries and where they come from:

"Forward" https://github.com/hughperkins/DeepCL/blob/master/src/net/NeuralNet.cpp#L187
"Backwards" https://github.com/hughperkins/DeepCL/blob/master/src/net/NeuralNet.cpp#L224

"Input to loss" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L87
Value is: `output[imageOffset + label]`
"Output from loss" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L87
Value is: `-log(output[imageOffset + label])`
"LOSS IS CURRENTLY" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L90
Value is: `loss`

"Grad out" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L151
Value is: `output[imageOffset + plane]`
"GradOut final" https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L158

TrainingLog.txt

@hughperkins
Owner

Alright, after a lot of digging I now know why it happens.
As soon as it predicts a 100% chance of a label it starts to go "insane"

Wow, that's impressive work Chris. Very nice. I'm very impressed :-)

OK, so what's happening, I think, is:

  • if the input to the softmaxloss layer is 0, then log(0) evaluates to -Inf, and then it's over, really

But eg log(0 + 1e-6) is no longer infinity, but -13. Can you try adding + 1e-6 into each of the log terms, in the forward direction, and see what happens? For example, this line:

https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L76

                loss += - log(output[ imageOffset + label ]);

would become:

                loss += - log(output[ imageOffset + label ] + 1e-6);

(you can also try +1e-8. I think both will fix the issue, probably. They might give ever so slightly different results, I'm not sure which set of results will be 'better')

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

It didn't help. I think the problem is with calcGradInputFromLabels.

calcLossFromLabels -> when its input is 1, its output is 0
calcGradInputFromLabels -> when its input is now 0, its output is -1; in the event its input is 1, its output is now 0. Assuming that output (0) is multiplied with anything in the network, we have everything eventually turning into 0 and eventually NaN.

Also, are these two lines correct?
https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L151
https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L158
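(For reference when checking those two lines: for a softmax layer followed by negative log-likelihood, the standard combined gradient with respect to the pre-softmax input is $\partial L / \partial z_j = p_j - \mathbb{1}[j = \text{label}]$, i.e. the softmax output minus 1 at the label plane, and the softmax output unchanged elsewhere. That matches the "output is -1 when the input is 0, and 0 when the input is 1" behaviour described above, so that part at least is consistent with the usual formula.)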

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I'm assuming something gets multiplied by 0, which then goes into a function that doesn't accept 0 and returns NaN.
The same goes for -1, if it ends up in a function that doesn't accept -1 (log, for example).
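(A tiny standalone check of which of these operations actually produce NaN versus infinity; this is just standard IEEE 754 float behaviour, not DeepCL code:)

    // Standalone demo of how inf and nan arise from ordinary float maths.
    #include <cmath>
    #include <cstdio>
    int main() {
        float inf = INFINITY;
        std::printf("log(0)    = %f\n", std::log(0.0f));   // -inf
        std::printf("log(-1)   = %f\n", std::log(-1.0f));  // nan
        std::printf("0 * inf   = %f\n", 0.0f * inf);       // nan
        std::printf("inf - inf = %f\n", inf - inf);        // nan
        std::printf("0 * 5     = %f\n", 0.0f * 5.0f);      // 0, not nan
        return 0;
    }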

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I prevented the output of the softmax layer from ever becoming 1 or 0:

    output[imageOffset + plane] = max(min(exp(input[imageOffset + plane] - maxValue) / denominator, 0.99999f), 0.00001f);

The loss doesn't become NaN, but the result during forward of some of the layers does, because both the weights and the bias contain A LOT of NaNs.

So the issue is during the backward pass.
Where should I go from here?

@hughperkins
Owner

Sounds like you are heading in the right direction. Awesome :-)

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I'm completely stuck.

I added checks before and after backpropWeightsImpl->calcGradWeights and backwardImpl->backward in ConvolutionalLayer::backward(), but the first instance of NaN is still during forward, where the weights and bias are NaN.
It's almost as if the weights and bias for the current layer change at some point other than ConvolutionalLayer::backward().

Code for checking:

VIRTUAL void ConvolutionalLayer::checkWrapperData(CLWrapper* input, std::string message)
{
    if (input->isOnDevice())
    {
        cl->finish();
        input->copyToHost();
        cl->finish();

        float* data = (float*)input->getHostArray();
        int size = input->size();

        for (int i = 0; i < size; i++)
        {
            if (std::isfinite(data[i]) == false)
            {
                cout << "Found non-finite number: " << message << endl;
                throw runtime_error("Found non-finite number at " + asString());
            }
        }
    }
}

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I forgot the trainer updates them (facepalm).
Checking there now

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

both the weights and bias contains A LOT of NaNs.

There was a bug in my code for checking them (I did int countNan; instead of int countNan = 0;).

So with no NaN in the weights or bias, the first instance is during forward on a conv layer.
What can produce +-inf on that layer?

@hughperkins
Owner

hughperkins commented Jun 25, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

I forced forwardauto to use the CPU kernel, coded the CPU kernel to stop once it gets a non-finite result and print what it just did math on, and got this:

    Bias: -98174265514624848756736.000000
    Input: 892913097361408654370144256.000000; weight: 85576382305778379259904.000000
    Sum: inf

@merceyz
Contributor Author

merceyz commented Jun 25, 2017

So the source of the issue is the SoftMaxLayer.

The above result comes from (preventing it from ever predicting 1 or 0):
https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L293

    output[imageOffset + plane] = max(min(exp(input[imageOffset + plane] - maxValue) / denominator, 0.99999f), 0.00001f)

Not doing that gives me a loss of inf due to log(0), which spreads fast.
Thoughts?

Some resources that might help
https://stackoverflow.com/a/9906882/2923845
https://cs231n.github.io/linear-classify/#softmax
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp
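(The trick in the first two links is the usual numerically stable softmax: since $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$ is unchanged when a constant is subtracted from every $x_i$, one computes $e^{x_i - \max_j x_j} / \sum_j e^{x_j - \max_j x_j}$ so the exponentials never overflow. That is what the existing `exp(input[...] - maxValue)` already does; the remaining hazard is taking the log of a probability that has underflowed to 0.)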

@hughperkins
Owner

hughperkins commented Jun 26, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

It prevents a loss of +-inf, but the bias and weights become so large that the sum of the calculation on them becomes +-inf.

@hughperkins
Owner

hughperkins commented Jun 26, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

From the command line that's weightdecay=X, correct?

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

Also do you want me to do that with or without the max and min?

@hughperkins
Owner

hughperkins commented Jun 26, 2017 via email

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

I had a look at the Caffe implementation of the softmax layer and used the log(max(x, FLT_MIN)) from https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L106
here: https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L117

That prevents a loss of +-inf.
With a weight decay of 0.1 it seems to be working.

I noticed that they also normalize the loss, which is probably something we should be doing as well:
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L111
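(Putting the two ideas together, here is a sketch of what the loss calculation looks like with the FLT_MIN clamp and the per-batch normalization. The function and variable names are illustrative only; the actual SoftMaxLayer code is organized differently.)

    // Sketch of a clamped, batch-normalized softmax loss; illustrative only.
    #include <algorithm>
    #include <cfloat>
    #include <cmath>
    float softmaxLoss(const float *probs, const int *labels,
                      int batchSize, int numPlanes) {
        float loss = 0.0f;
        for (int n = 0; n < batchSize; n++) {
            float p = probs[n * numPlanes + labels[n]];
            loss += -std::log(std::max(p, FLT_MIN));  // never take log(0)
        }
        return loss / batchSize;  // normalize by the number of labels
    }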

@hughperkins
Owner

Ah, looking at other implementations is a good idea :-) Cool :-)

With a weight decay of 0.1 it seems to be working.

Great! :-)

I noticed that they also normalize the loss, which is probably something we should be doing as well:
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L111

OK. What calculation are they doing for this?

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

ok. what calculation are they doing for this?

I'm not quite sure
https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L59
EDIT:

the loss is normalized by the number of (nonignored) labels present; otherwise the loss is simply summed over spatial locations.

With a weight decay of 0.1 it seems to be working.

I take that back; after running for a while, the numbers in the weights and bias grow too large, producing -nan.

@merceyz
Contributor Author

merceyz commented Jun 26, 2017

Alright, finally done with this.

The loss becoming nan/+-inf is solved by preventing the log of a number smaller than FLT_MIN (x <= 0).

The bias and weights becoming so large that the result of math on them becomes nan/+-inf is because of a learning rate that is too high. The fix for the loss hides this, so I've added a check for it.

See #120
