MacBook Pro 13" 2018: NaN loss in mnist_cnn.py example #168
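(For context, the scripts discussed in this thread select the PlaidML Keras backend before the first Keras import. A minimal sketch of that setup, assuming plaidml and plaidml-keras are installed via pip; the tiny dense model and random data below are stand-ins for the actual mnist_cnn.py example:)

```python
# Minimal sketch of the kind of script discussed in this thread.
# install_backend() must run before the first `import keras`.
import plaidml.keras
plaidml.keras.install_backend()

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Random stand-in data instead of MNIST, to keep the sketch small.
x = np.random.rand(512, 784).astype("float32")
y = to_categorical(np.random.randint(0, 10, size=512), 10)

model = Sequential([
    Dense(128, activation="relu", input_shape=(784,)),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, batch_size=128, epochs=1)  # reporters see `loss: nan` here
```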
I can confirm this observation on my system (macOS 10.13.6, Radeon Pro 555, Intel HD 630), with plaidml and plaidml-keras both installed via pip. Interestingly, the Metal implementation for the AMD GPU appears to work:

- produces NaNs for loss
- produces NaNs for loss
- crashes completely with:

However, the Metal shaders appear to work on the AMD GPU (333 µs/step), so you might want to try using them.

Also, the CPU backends may have problems: the llvm_cpu backend segfaults, while the opencl_cpu backend works fine at 6 ms/step.
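(For anyone comparing backends the way this comment does: PlaidML normally uses the device chosen by plaidml-setup, but the choice can be overridden per run. A sketch, assuming the PLAIDML_EXPERIMENTAL / PLAIDML_DEVICE_IDS environment variables that plaidml's settings honor; the device id string is taken from a log later in this thread and will differ per machine:)

```python
# Sketch: force a specific PlaidML device for one run instead of the
# choice saved by plaidml-setup. Set the variables before plaidml loads.
import os
os.environ["PLAIDML_EXPERIMENTAL"] = "1"           # allow experimental devices
os.environ["PLAIDML_DEVICE_IDS"] = "opencl_cpu.0"  # id from a log below

import plaidml.keras
plaidml.keras.install_backend()
import keras  # Keras now runs on the forced device
```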
Yes, we've noticed some serious problems with OpenCL on MacBooks. AFAIK Apple has stopped working on it altogether, and we'll probably pull OpenCL support on MacBooks. Metal is definitely the way to go, and we'll be moving it out of the 'experimental' config next release. LLVM is still experimental, but hopefully we'll have some updates there soon.
I also confirm the issue with an Intel Iris 650: OpenCL produces NaNs for loss, while Metal crashes completely with the same error. Unfortunately, I don't have a dedicated GPU to test.
It appears there was a release on GitHub two months ago that moved Metal out of experimental. I think this was not the 0.3.5 release but a later one? Do we know whether this issue would be resolved? I cannot run any of the example code from Chollet's Deep Learning with Python (Keras) with the 0.3.5 release, and I am wondering if the later one would help somehow (though I'm not sure how to build it).
We've made several fixes in the latest version, which we sadly haven't been able to release yet. I'm very optimistic we'll get it out this week, hopefully tomorrow. Apologies.
0.5.0 is out, thanks for your patience. From here on out, releases will be regular and frequent. If you can retest with 0.5.0 and let us know if the issue is resolved, we'd really appreciate it!
I am having the same problem with 0.5.0. I have tested with mnist_cnn and with one other example; I can reproduce this issue with both Keras 2.2 and 2.4. Any suggestions are much appreciated!
I have the same problem with an Iris Plus 655 under Metal (or even llvm_cpu) with the new 0.5.0 version. Any example code computes NaN for losses and really terrible accuracy. Chollet's Keras deep-learning examples do this. Even the plaidml/keras test codes in the plaidml git repository do this, for example:

$ python mnist_mlp_test.py

Layer (type)                 Output Shape    Param #
dense_1 (Dense)              (None, 128)     100480
activation_1 (Activation)    (None, 128)     0
dense_2 (Dense)              (None, 128)     16512
activation_2 (Activation)    (None, 128)     0
dense_3 (Dense)              (None, 10)      1290
activation_3 (Activation)    (None, 10)      0

Total params: 118,282
Epoch 1/1
Hey people, I tried compiling PlaidML from source at the very latest commit 7623a07, and with my Intel Iris 650 running Metal it works like a charm (MacBook Pro 2017). Plaidbench successfully passes the test, and a small autoencoder net on images works perfectly. EDIT: I haven't tried v0.5.0 from pip.
@JacoGasp I will have to try building off master then. I see there are actually many newer files in master (two months vs. four months old) than in the supposedly new 0.5.0 branch, so I wonder if 0.5.0 doesn't actually include all the latest fixes.
Yes, we've had some issues getting the 0.5.0 binaries released, and in the process we wrote a lot of code. The binaries on PyPI are based on the code in the 0.5.0 PR that we'll land today. We have some work to do to make Travis happy (we use Buildkite internally). The pips do include pretty much anything that would fix this issue. What version of macOS are you running, @chrisbarrera? In addition to OpenCL being broken on macOS, Metal was quite broken pre-Mojave (for Intel GPUs).
@brianretford I am running Mojave 10.14.2 on a MacBook Pro 13" 2018 with an Iris Plus 655. I am currently running 0.5.0 installed from pip. Edit: it's worth noting that llvm on the CPU gives the same results, as does OpenCL.
Ah, we also have one of those laptops in any case. But if it repros on the LLVM CPU backend, it sounds like there is something more fundamental. We'll try to dig into this, but we are really pushing to get the Stripe-based version out because it's just that much better.
Encountered this bug on a 2015 13" MBP (internal Intel Iris 610 GPU) and a 2018 15" MBP (external RX 570), both working through Metal, using Keras 2.2.4 and PlaidML-Keras 0.5.0 freshly installed from pip. It still trains in both cases, so I have a question: does it just not display the loss, or does it fail to calculate it completely, thus failing to train the model?
Loss being NaN is usually a bug in our code or in the network itself. Does it work if you don't use PlaidML? Can you provide some code to help with a repro?
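(One way to make a repro fail fast rather than printing nan for a whole epoch: Keras ships a TerminateOnNaN callback that stops training on the first NaN batch loss. A self-contained sketch with random stand-in data, not the reporter's actual network:)

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import TerminateOnNaN
from keras.utils import to_categorical

x = np.random.rand(512, 784).astype("float32")
y = to_categorical(np.random.randint(0, 10, size=512), 10)

model = Sequential([
    Dense(128, activation="relu", input_shape=(784,)),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Training halts on the first batch whose loss is NaN or inf.
model.fit(x, y, batch_size=128, epochs=12, callbacks=[TerminateOnNaN()])
```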
Yes, it works at least on the 2015 13" MBP running macOS 10.14.14 with vanilla Keras 2.2.4 on Python 3.6.5; there, it appears to correctly show the loss. Here is the code I was trying to run: https://gist.github.com/shpigunov/8b3221a74519834ae37b88f6f7607e21
So, just for fun, on the latest master with Stripe I copied "stripe_config": "gpu/intel_gen9" into the Iris config of macos_experimental.json and recompiled. With USE_STRIPE=1 set, the NaNs disappear. Mostly, that is: if I train for too many epochs, the decreasing loss eventually becomes NaN and blows up the accuracy value, but up to that point the accuracy increases. I don't know if Stripe is safe to use with Iris at this point (there is a reason y'all left it out, right? :)), but it's a data point...
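(A sketch of enabling the Stripe path this comment describes, assuming USE_STRIPE is read from the environment as the comment implies; the macos_experimental.json edit and rebuild from the comment still have to happen first:)

```python
# Sketch: turn on the Stripe backend for a run. USE_STRIPE must be set
# before the plaidml backend is loaded.
import os
os.environ["USE_STRIPE"] = "1"

import plaidml.keras
plaidml.keras.install_backend()
import keras  # subsequent training goes through the Stripe path
```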
It should be safe to use; the only reason it was left out is that we're focusing on exactly one platform initially, and then we'll branch out and add official support as we get the appropriate test equipment. The 0.6.0 release should be even faster. It's doing a lot more, and we still have more to do.
MacBook 15" 2014, macOS 10.14.5, PlaidML 0.6.0 is getting the same errors with all backends except one. One example output:
I just installed 0.6.3 and this problem went away, but only if USE_STRIPE=1. I don't get any NaNs whatsoever, whereas on earlier releases, or with USE_STRIPE=0, they appeared either all the time or regularly after a few epochs. This is a major improvement even over the 0.5.x and earlier 0.6.x releases.
In mnist, curiously, I don't get any NaNs if batch_size == 129 or 257 on a MacBook Pro (Retina, 13-inch, Mid 2014; macOS 10.13.6; Intel Iris 1536 MB; PlaidML 0.6.4 or 0.5.0 working through Metal), although I do get NaNs if batch_size == 128 or 256.
I'm in a similar situation to @aalskdlk, using mnist_mlp.py, which is much simpler (no convolutions).
Setting batch_size to 129 works for me. Weird!
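(The off-by-one pattern reported above, where 128 and 256 fail but 129 and 257 pass, is easy to sweep mechanically. A hedged sketch with random stand-in data rather than real MNIST:)

```python
# Sketch: train one epoch at each batch size and report whether the
# final loss is NaN, to map out which sizes trigger the bug.
import math
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

x = np.random.rand(2048, 784).astype("float32")
y = to_categorical(np.random.randint(0, 10, size=2048), 10)

for batch_size in (128, 129, 256, 257):
    model = Sequential([
        Dense(128, activation="relu", input_shape=(784,)),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    hist = model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    final_loss = hist.history["loss"][-1]
    status = "NaN" if math.isnan(final_loss) else "%.4f" % final_loss
    print("batch_size=%d -> loss %s" % (batch_size, status))
```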
It seems that the loss explodes in the first epoch when the number of trainable parameters is too large (greater than roughly 111,217) with PlaidML + OpenCL on Mac. There is no problem with the Metal implementation. The OpenCL implementation didn't work at all on the CPU. Tested environment:
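(For what it's worth, the mnist_mlp model whose summary was pasted earlier in this thread is already above that threshold; model.count_params() is the stock Keras way to check. A sketch rebuilding that architecture, with the activations folded into the Dense layers since that doesn't change the parameter counts:)

```python
# Sketch: rebuild the three-layer MLP from the summary pasted above and
# check its trainable-parameter count against the ~111,217 threshold.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(128, activation="relu", input_shape=(784,)),  # 784*128+128 = 100480
    Dense(128, activation="relu"),                      # 128*128+128 = 16512
    Dense(10, activation="softmax"),                    # 128*10+10   = 1290
])
print(model.count_params())  # 118282, above the reported threshold
```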
Seeing this issue with my custom CNN. After I reduced my batch size by half, the NaN loss went away.
I have the same issue with an architecture containing both Conv1D and LSTM layers.
Same thing, but when I set batch_size to 129, the accuracy is about 0.47. It should be way higher.
I'm using RStudio and PlaidML to work through the book "Deep Learning with R". In chapter 2, the first exercise gives me problems similar to what others have highlighted: if I choose a batch_size of 129 instead of 128, I get a non-NaN value for loss, but when calculating the metrics, they are pretty terrible compared to the results in the book (I get an accuracy of 0.47 vs. 0.97 in the text). My current setup:
I have the same issue when trying to run very simple MNIST examples (two Conv2D layers). I tried batch sizes of 32, 64, 128, 129, 256, and 257, but none of them worked. My current setup is:
Firstly, I tried some solutions from previous issue posts, but they failed. I tried to run on the Iris GPU with OpenCL, but it didn't work:
(plaidml) Thanhs-MacBook-Pro:Desktop thanh$ python plaidmltest.py
Using plaidml.keras.backend backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
INFO:plaidml:Opening device "opencl_intel_intel(r)_iris(tm)_plus_graphics_655.0"
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 62s 1ms/step - loss: nan - acc: 0.0000e+00 - val_loss: nan - val_acc: 0.0000e+00
Epoch 2/12
31616/60000 [==============>...............] - ETA: 27s - loss: nan - acc: 0.0000e+00
However, when I used OpenCL on the CPU, it worked:
(plaidml) Thanhs-MacBook-Pro:Desktop thanh$ python plaidmltest.py
Using plaidml.keras.backend backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
INFO:plaidml:Opening device "opencl_cpu.0"
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
INFO:plaidml:Analyzing Ops: 85 of 285 operations complete
21504/60000 [=========>....................] - ETA: 3:23 - loss: 0.4775 - acc: 0.8510