Unknown Adadelta trainer error/issues #87
The trainers are at https://github.com/hughperkins/DeepCL/tree/master/src/trainers , and the main is at https://github.com/hughperkins/DeepCL/blob/master/src/main/train.cpp . Where are you seeing epsilon? As far as the training NaNs... NaNs during training are a perennial problem with neural nets. There are a few possible sources, none of which are mutually exclusive; it could be a bit of all of them :-P :
There's no hard and fast rule or check to know which is which... I suppose what I would do is:
As far as 'digging a bit more', you'll almost certainly need to roll up your sleeves and get stuck into the code, so I would try the first two steps first. I think that to 'get stuck into the code', at minimum, you'd probably want to do something like:
If it were me, I'd currently probably do this using Python. In the past, I would have done it directly in the C++. PS: one thing you could try is to assume it's a GPU kernel bug, and make everything run on the CPU, by modifying https://github.com/hughperkins/DeepCL/blob/master/src/conv/BackpropWeights.cpp#L51 to return
(added a 'PS')
See the eps (epsilon) parameter. It can also be found here, just to name a few.
If I'm not mistaken, DeepCL already adds a normalization layer after the input layer automatically, right?
Ah. I think that epsilon is probably e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/AdadeltaState.cpp#L32 , i.e. the fuzz factor/fudge factor. It's hardcoded for now. It shouldn't affect very much, I think.
That normalizes incoming images. But I mean normalizing the weights. Or the gradients. Or both. Or at least truncating the gradients onto the unit ball. Weight decay is fairly standard.
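For illustration, a minimal sketch of truncating a gradient vector onto the unit ball (clipping its L2 norm to at most 1); this is generic code, not something DeepCL currently exposes:

```
#include <cmath>
#include <vector>

// Rescale the gradient so its L2 norm is at most maxNorm (1.0f for the unit
// ball). Gradients already inside the ball are left unchanged.
void clipGradientNorm(std::vector<float> &grad, float maxNorm) {
    float sumSquares = 0.0f;
    for (float g : grad) {
        sumSquares += g * g;
    }
    float norm = std::sqrt(sumSquares);
    if (norm > maxNorm && norm > 0.0f) {
        float scale = maxNorm / norm;
        for (float &g : grad) {
            g *= scale;
        }
    }
}
```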
learningrate=1.0 is pretty ambitious I think, without using batch normalization, which DeepCL doesn't support. I think you will have more success with a learning rate more like 0.001 or 0.0001. The network architecture looks reasonably standard. I'd be tempted to use
Oh, and it's better to make the number of feature planes a power of 2, on the whole, e.g.
Alright, that means the anneal value is not useful in this instance, right?
I just took the default value from the second link, but as I only now saw (facepalm), on the Stanford page it was a lower, perhaps more normal, number.
How would I achieve this kind of normalizing?
The only difference that made was, in one instance, crashing my display driver with 1003 MB of RAM left, and in another, a BSOD.
Well, it's not related to the Adadelta fudge factor, yeah. anneal basically slowly reduces the learning rate over time. It's a bit tricky to use though. On the whole I think a standard approach is:
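As a general illustration of learning-rate annealing (one common scheme; not necessarily the exact formula DeepCL's anneal option implements, so check src/main/train.cpp for that):

```
#include <cmath>

// Geometric annealing: multiply the base learning rate by anneal^epoch.
// anneal = 1.0 keeps the rate constant; e.g. anneal = 0.95 shrinks it by
// 5% every epoch.
float annealedLearningRate(float baseLearningRate, float anneal, int epoch) {
    return baseLearningRate * std::pow(anneal, (float)epoch);
}
```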
ok
SGD has weight decay: https://github.com/hughperkins/DeepCL/blob/master/src/trainers/SGD.cpp#L65 . It would have to be added into specific trainers, on a case-by-case basis.
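As a rough sketch of what weight decay does inside an SGD-style update (illustrative only; see SGD.cpp linked above for DeepCL's actual implementation):

```
// Plain SGD step with L2 weight decay: besides the gradient step, each weight
// is pulled slightly toward zero, proportionally to its current value.
void sgdStepWithWeightDecay(float *weights, const float *gradWeights,
                            int numWeights, float learningRate, float weightDecay) {
    for (int i = 0; i < numWeights; i++) {
        weights[i] -= learningRate * (gradWeights[i] + weightDecay * weights[i]);
    }
}
```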
Ah. Well... ok. I would guess most of the memory is going into the first few layers (since they're bigger, no mp2's yet), so you could gradually increase the number of planes, something like:
I haven't tried this, and I'm not sure if it's a good architecture; it's just demonstrating the concept of increasing the number of planes after each
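For instance, something along these lines (purely illustrative: this exact netdef string hasn't been checked against the DeepCL netdef documentation, the plane counts just show doubling after each pooling stage, and the final 10n assumes 10 output classes):

```
deepcl_train netdef=32c5z-relu-mp2-64c5z-relu-mp2-128c5z-relu-mp2-256n-tanh-10n learningrate=0.001
```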
That would be the "l2_decay" specified on the Stanford page, correct?
Actually I read it wrong; the Stanford page uses a learning rate of 1.0 for the Adadelta trainer.
It didn't crash my system, which is a good start, and it seems to be doing rather well.
L2 decay is what you need. I'm 70% sure that what I linked to is L2, but I'd want to double-check somewhere to be sure it's not L1. (I think it's L2, because the derivative of the squared-weight penalty is proportional to the weight itself, which is why we simply subtract some fraction of the current weight here.)
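Spelled out with generic symbols (a standard textbook form, not DeepCL's variable names): an L2 penalty with coefficient λ adds λ·w to each weight's gradient, which the update then subtracts as a fraction of the current weight.

```
E_{\text{reg}}(w) = E(w) + \tfrac{\lambda}{2} \sum_i w_i^2,
\qquad
\frac{\partial E_{\text{reg}}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i,
\qquad
w_i \leftarrow w_i - \eta \left( \frac{\partial E}{\partial w_i} + \lambda w_i \right)
```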
Ok
cool :-)
I just ran it like this. And yet again, at epoch 8 it goes south. However, running this
Ok. You mean, using the SGD trainer instead of the Adadelta trainer?
As SGD is the default trainer, yes. So there might be a bug somewhere in the Adadelta trainer?
Could be... the code is at https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp . Feel free to scrutinize/tweak it. Personally, I've mostly used SGD and haven't used Adadelta too much, so it's not impossible some buglette remains somewhere.
I tried with the Adagrad trainer, which is now at epoch 27 and is constantly getting better and better: 99.5794% and a loss of 370.28.
So I'll assume something is wrong in the Adadelta trainer.
I sadly don't know how it's supposed to be implemented, so I can't really "proofread" it.
Ok, that's good info. I will file a bug. Edit: oh, the title of this issue, here, this thread, is the Adadelta error. So... good :-) (Note: the Adadelta paper is here: www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf ; the update rule is in 'Algorithm 1'.) We'd need to compare that equation with what is written in https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp#L55-L74 . Unfortunately, it's a bit illegible, though better than it was originally. The bug could be either in this list of operations (first thing to check), or in the underlying vector arithmetic implementations (though there are unit tests for those). On the whole, I guess it's most likely that the bug is in this chunk of code, linked from this paragraph, though it's mostly a guess.
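For reference, the per-step update from Algorithm 1 of the Adadelta paper, with decay rate ρ and the small constant ε (the fudge factor discussed earlier), is:

```
E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2
\Delta x_t = - \frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, (\Delta x_t)^2
x_{t+1} = x_t + \Delta x_t
```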
Maybe the code should be modified to insert the following line in between line 62 and line 63:
(where
Added in 07eb0d6. We'll see if that helps. I should trigger a build, really.
When you get a moment: http://deepcl.hughperkins.com/Downloads/deepcl-win64-v11.2.2alpha1.zip has an updated Adadelta. We can try this version, see if it helps.
I downloaded it and moved over my Data folder (images and manifest), and it is sadly getting stuck. If I wait long enough and hit Ctrl+C (cancel), it outputs "Empty input file". Its RAM usage is at 3.14 GB and CPU usage is 0.00%.
Any updates regarding the issue(s)?
I'm not sure. I think the issue was fixed. There seems to be some problem with the build process in general (I just migrated to msvc2015, a bit gratuitously), and I don't know how to diagnose/fix that. It sounds like a lot of work... I need to pluck up courage, roll up my sleeves, and look into why the new build is not working... Can you try to check if other stuff is working on this build? Or is everything entirely broken on this build?
The unit tests seem to run fine.
Alright. What about running simple MNIST training? Just normal SGD and so on? Is it only Adadelta that is broken? Or are things more broken generally? If it's just Adadelta that's broken, that simplifies things, since then it's not a build issue, just some logic issue in Adadelta, which should not be too hard for me to fix, hopefully, probably...
Hmmm, ELU forward does this:
I suppose if
... so that's odd. Do you happen to know exactly which input value is generating a NaN?
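(The DeepCL kernel isn't reproduced above; for reference, a typical ELU forward, sketched below with alpha = 1, only produces NaN when its input is already NaN, which is why a clean input making it NaN here would be odd.)

```
#include <cmath>

// Typical ELU forward with alpha = 1: identity for positive inputs,
// exp(x) - 1 otherwise. For any finite input the result is finite, so a NaN
// output essentially means the input was already NaN.
float eluForward(float x) {
    return x > 0.0f ? x : std::expm1(x);  // expm1(x) == exp(x) - 1
}
```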
Code that checks and dumps during forward:
Assuming input[0] -> output[0], this gives me these input values producing NaN, which makes no sense.
Yes, that doesn't make any sense. Are you sure you've transferred the data from GPU to CPU before printing it and seeing the NaNs? Make sure you put a
Alright, after a lot of digging I now know why it happens. I'll attach the log so that you can see for yourself.
Wow, that's impressive work, Chris. Very nice. I'm very impressed :-) Ok, so what's happening, I think, is:
But eg https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L76
would become:
(you can also try
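The exact replacement line isn't preserved above, but the general idea under discussion is to keep the argument of the log away from zero. A generic sketch, assuming an epsilon of 1e-6 (the value mentioned later in this thread):

```
#include <cmath>

// Guard a cross-entropy style loss term against log(0) = -inf by adding a
// small epsilon to the softmax probability before taking the log.
float safeNegLogProb(float probability) {
    const float epsilon = 1e-6f;
    return -std::log(probability + epsilon);
}
```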
It didn't help. I think the problem is with calcGradInputFromLabels. calcLossFromLabels -> when the input is 1, its output is 0. calcGradInputFromLabels -> its input is now 0, so its output is -1; in the event its input is 1, its output is now 0. Assuming the output from this (0) is multiplied with anything in the network, we have everything eventually turning into 0 and eventually NaN. Also, are these two lines correct? https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L151 https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L158
Hmmm, gradients of zero are fairly normal, and shouldn't break anything. I *think*. Where/how is the zero gradient getting converted into `nan`?
I'm assuming something gets multiplied by 0, which then goes into a function that doesn't accept 0 and returns NaN. The same goes for -1, if it ends up in a function that doesn't accept -1 (log, for example).
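A tiny, self-contained illustration of that kind of propagation in floating point (generic C++, nothing DeepCL-specific):

```
#include <cmath>
#include <cstdio>

int main() {
    float logOfZero = std::log(0.0f);       // -inf
    float zeroTimesInf = 0.0f * logOfZero;  // 0 * -inf is NaN
    float logOfMinusOne = std::log(-1.0f);  // NaN: log is undefined at -1
    std::printf("%f %f %f\n", logOfZero, zeroTimesInf, logOfMinusOne);
    return 0;
}
```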
That does sound accurate. But I think we should figure out which exact function. As you say, generally I'd expect it to be a `log` somewhere. I'm anticipating the fix to be to add an epsilon term, i.e. 1e-6, to some log somewhere, but which log is an open question.
I prevented the output of the softmax layer from ever becoming 1 or 0. So the issue is during backwards.
Sounds like you are heading in the right direction. Awesome :-)
I'm completely stuck. I added checks before and after backpropWeightsImpl->calcGradWeights and backwardImpl->backward in ConvolutionalLayer::backward(), but the first instance of NaN is still during forward, where the weights and bias are NaN. It's almost as if the weights and bias change at some point other than ConvolutionalLayer::backward(). Code for checking (quoted in the reply below):
I forgot the trainer updates them (facepalm). Checking there now.
For the weights, `calcGradWeights` won't actually update the weights, but the `gradWeights`. The weights are updated a bit later, like:
weights += learningRate * gradWeights
As to where this happens, let's see.... oh, the trainers handle this, e.g. https://github.com/hughperkins/DeepCL/blob/master/src/trainers/Adadelta.cpp#L99
```
updateWeights(layer->getWeightsWrapper(), layer->getGradWeightsWrapper(),
    dynamic_cast< AdadeltaState * >(layer->getTrainerState()) );
```
…On Sun, Jun 25, 2017 at 8:58 PM, Chris wrote:
> I'm completely stuck. I added checks before and after backpropWeightsImpl->calcGradWeights and backwardImpl->backward in ConvolutionalLayer::backward(), but the first instance of NaN is still during forward, where the weights and bias are NaN. It's almost as if the weights and bias change at some point other than ConvolutionalLayer::backward().
> Code for checking:
```
VIRTUAL void ConvolutionalLayer::checkWrapperData(CLWrapper* input, std::string message)
{
    if (input->isOnDevice())
    {
        cl->finish();
        input->copyToHost();
        cl->finish();
        float* data = (float*)input->getHostArray();
        int size = input->size();
        for (int i = 0; i < size; i++)
        {
            if (std::isfinite(data[i]) == false)
            {
                cout << "Found non-finite number: " << message << endl;
                throw runtime_error("Found non-finite number at " + asString());
            }
        }
    }
}
```
:-)
> both the weights and bias contains A LOT of NaNs.
There was a bug in my code for checking them (I did `int countNan;` instead of `int countNan = 0`). So with no weights nor bias as NaN, the first instance is during forward on a conv layer. What can produce +-inf on that layer?
Missing a clFinish. Copying in NaNs. A size mismatch, e.g. copying too many or too few values.
I forced forwardauto to use the CPU kernel, coded the CPU kernel to stop once it gets a non-finite result and print what it just did math on, and got this: Bias: -98174265514624848756736.000000
So the source of the issue is the SoftMaxLayer. https://github.com/hughperkins/DeepCL/blob/master/src/loss/SoftMaxLayer.cpp#L293 Preventing it from ever predicting 1 and 0, i.e. `output[imageOffset + plane] = max(min(exp(input[imageOffset + plane] - maxValue) / denominator, 0.99999f), 0.00001f)`, gives me the above result. Not doing that gives me a loss of inf due to log(0), which spreads fast. Some resources that might help
Ok. You're saying that putting in the max and min, in the softmax layer, fixes the issue?
It prevents a loss of +-inf, but the bias and weights become so large that the sum of the calculation on them becomes +-inf.
Ah. One could consider that no longer a code issue in the library, but a symptom of needing weight regularization and/or gradient clipping. I think L1 and L2 weight regularization are implemented, right? Can you try adding some L2 weight regularization?
From the command line that's weightdecay=X, correct?
Also, do you want me to do that with or without the max and min?
> From command line that's weightdecay=X correct?
I think so, but you might need to double-check this point (or try it with large values, see if it changes anything, before switching to smaller values). It looks like the implementation is trainer-specific, so you might need to implement it for any specific trainer you want to use it with.
> Also do you want me to do that with or without the max and min?
I think the max/min seems like a fairly ok-ish change. It essentially boils down to adding/subtracting a small value (epsilon), which is fairly standard. So, yes, with the max/min.
I had a look at the Caffe implementation of the softmax loss layer and used log(max(x, FLT_MIN)) ( https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_loss_layer.cpp#L106 ), which prevents a loss of +-inf. I noticed that they also normalize the loss, which is probably something we should be doing as well.
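A rough sketch of those two ideas combined (generic code, not the actual Caffe or DeepCL implementation; the normalization shown is simply dividing the summed loss by the batch size, one of the normalization modes Caffe supports):

```
#include <algorithm>
#include <cfloat>
#include <cmath>

// Cross-entropy loss over a batch: clamp the predicted probability of the
// correct class to at least FLT_MIN before the log (so log(0) can't happen),
// then normalize the summed loss by the batch size.
float softmaxLoss(const float *probabilities, const int *labels,
                  int batchSize, int numClasses) {
    float lossSum = 0.0f;
    for (int n = 0; n < batchSize; n++) {
        float p = probabilities[n * numClasses + labels[n]];
        lossSum -= std::log(std::max(p, FLT_MIN));
    }
    return lossSum / batchSize;
}
```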
Ah, looking at other implementations is a good idea :-) Cool :-)
Great! :-)
Ok, what calculation are they doing for this?
I'm not quite sure.
I take that back: after running for a while, the numbers in the weights and bias grow too large, producing -nan.
Alright, finally done with this. The loss becoming nan/+-inf is solved by preventing the log of a number smaller than FLT_MIN (x <= 0). The bias and weights becoming so large that the result of math on them becomes nan/+-inf is because of a learning rate that is too high; the fix for the loss hides this, so I've added a check for it. See #120.
Hello,
I was looking at the different trainers and reading some documents on them when I noticed a value called "epsilon".
This value is nowhere to be seen in the API documentation, and thus I assume it's missing. (Unless it's the "anneal" option, which would be awkward for me.)