Macbook Pro 13 2018: get nan loss in MNIST_cnn.py example #168

Open · PotatoThanh opened this issue Sep 3, 2018 · 33 comments

@PotatoThanh
First, I have tried some of the solutions from previous issue posts, but they failed.

I tried to run on the Intel Iris GPU with OpenCL, but it didn't work:

(plaidml) Thanhs-MacBook-Pro:Desktop thanh$ python plaidmltest.py
Using plaidml.keras.backend backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
INFO:plaidml:Opening device "opencl_intel_intel(r)_iris(tm)_plus_graphics_655.0"
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 62s 1ms/step - loss: nan - acc: 0.0000e+00 - val_loss: nan - val_acc: 0.0000e+00
Epoch 2/12
31616/60000 [==============>...............] - ETA: 27s - loss: nan - acc: 0.0000e+00

However, when I used OpenCL on the CPU, it worked:

(plaidml) Thanhs-MacBook-Pro:Desktop thanh$ python plaidmltest.py
Using plaidml.keras.backend backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
INFO:plaidml:Opening device "opencl_cpu.0"
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
INFO:plaidml:Analyzing Ops: 85 of 285 operations complete
21504/60000 [=========>....................] - ETA: 3:23 - loss: 0.4775 - acc: 0.8510
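
For reference, a minimal sketch of the kind of script being run here. The actual plaidmltest.py is not shown in this thread, so this is an assumption: it mirrors the standard Keras mnist_cnn.py example with the PlaidML backend selected before Keras is imported.

# Sketch of an mnist_cnn-style run on the PlaidML backend (assumed script;
# the real plaidmltest.py may differ in details).
import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten

# Load and normalize MNIST, channels-last.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",
              optimizer=keras.optimizers.Adadelta(),
              metrics=["accuracy"])

# With the OpenCL Iris device this produces loss: nan; with opencl_cpu it trains.
model.fit(x_train, y_train, batch_size=128, epochs=12,
          validation_data=(x_test, y_test))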

@reichardtj

I can confirm this observation on my system. Interestingly, the Metal implementation for the AMD GPU appears to work.

macOS 10.13.6, Radeon Pro 555, Intel HD 630

keras.__version__ = 2.2.2
plaidml.__version__ = 0.3.5

Both installed via pip.

INFO:plaidml:Opening device "opencl_amd_amd_radeon_pro_555_compute_engine.0"

produces NaNs for loss

INFO:plaidml:Opening device "opencl_intel_intel(r)_hd_graphics_630.0"

produces NaNs for loss

INFO:plaidml:Opening device "metal_intel(r)_hd_graphics_unknown.0"

crashes completely with:

    ERROR:plaidml:Compiler::Build> Compilation failure:
Compilation failed:

program_source:1540:11: warning: unused variable 'x0'
      int x0 = ((64 * x0_lid) + x0_tid);
          ^
program_source:5822:7: warning: unused variable 'tid'
  int tid = _tid;
      ^
program_source:6157:12: error: cannot initialize a variable of type 'int2' (aka 'vector_int2') with an rvalue of type 'vec<char, 2>' (vector of 2 'char' values)
      int2 LX_T315 = select((char2)((float2)0), (char2)((float2)1), ((bool2)LX_T314));
           ^         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Traceback (most recent call last):
  File "mnist_cnn.py", line 66, in <module>
    validation_data=(x_test, y_test))
  File "/anaconda/lib/python3.6/site-packages/keras/engine/training.py", line 1037, in fit
    validation_steps=validation_steps)
  File "/anaconda/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/anaconda/lib/python3.6/site-packages/plaidml/keras/backend.py", line 165, in __call__
    self._invoker.invoke()
  File "/anaconda/lib/python3.6/site-packages/plaidml/__init__.py", line 1432, in invoke
    return Invocation(self._ctx, self)
  File "/anaconda/lib/python3.6/site-packages/plaidml/__init__.py", line 1438, in __init__
    self._as_parameter_ = _lib().plaidml_schedule_invocation(ctx, invoker)
  File "/anaconda/lib/python3.6/site-packages/plaidml/__init__.py", line 716, in _check_err
    self.raise_last_status()
  File "/anaconda/lib/python3.6/site-packages/plaidml/library.py", line 131, in raise_last_status
    raise self.last_status()


plaidml.exceptions.Internal: The exception chain appears to be corrupt

However, the Metal shaders appear to work on the AMD GPU, so you might want to try them:

INFO:plaidml:Opening device "metal_amd_radeon_pro_460.0"

works at 333 µs/step, giving:

Test loss: 0.02899222058057785
Test accuracy: 0.9912

@reichardtj

Also, the CPU backends may produce problems. The llvm_cpu backend segfaults:

INFO:plaidml:Opening device "llvm_cpu.0"
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
INFO:plaidml:Analyzing Ops: 131 of 285 operations complete
  640/60000 [..............................] - ETA: 7:12 - loss: nan - acc: 0.0000e+00Segmentation fault: 11

while the opencl_cpu backend works:

INFO:plaidml:Opening device "opencl_cpu.0"

It works fine at 6 ms/step.

@brianretford

Yes, we've noticed some serious problems with OpenCL on MacBooks; AFAIK Apple has stopped working on it altogether, and we'll probably pull OpenCL support on MacBooks. Metal is definitely the way to go, and we'll be moving it out of the 'experimental' config next release.

LLVM is still experimental but hopefully we'll have some updates there soon.
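
For anyone switching over: the device can be changed with plaidml-setup, or, assuming the PLAIDML_* settings are also honored as environment overrides, selected directly before the backend is imported. A sketch (the device name is taken from the log above):

# Sketch: run Keras on the Metal device instead of OpenCL. Assumes
# PLAIDML_EXPERIMENTAL / PLAIDML_DEVICE_IDS are picked up from the
# environment; running `plaidml-setup` interactively achieves the same.
import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"
os.environ["PLAIDML_EXPERIMENTAL"] = "1"  # Metal is still in the experimental config here
os.environ["PLAIDML_DEVICE_IDS"] = "metal_amd_radeon_pro_460.0"

import keras  # any model built after this point runs on the Metal device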

@jacogasp

jacogasp commented Oct 9, 2018

(Quoting @reichardtj's report above in full.)

I also confirm the issue with an Intel Iris 650: OpenCL produces NaNs for the loss, while Metal crashes completely with the same error.

Unfortunately I don't have a dedicated GPU to test.

@chrisbarrera

It appears there was a release on GitHub two months ago that moved Metal out of experimental. I think this was not the 0.3.5 release but a later one? Do we know whether this issue would be resolved by it? I cannot run any of the example code from Chollet's Deep Learning with Python (Keras) with the 0.3.5 release, and I am wondering whether the later one would help somehow (though I'm not sure how to build it).

@brianretford

We've made several fixes in the latest version, which we sadly haven't been able to release yet. I'm very optimistic we'll get it out this week, hopefully tomorrow. Apologies.

@brianretford

brianretford commented Feb 1, 2019

0.5.0 is out; thanks for your patience. From here on out, releases will be regular and frequent. If you can retest with 0.5.0 and let us know whether the issue is resolved, we'd really appreciate it!

@JTDean123

I am having the same problem with 0.5.0. I have tested with mnist_cnn.py and with the example found at https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py, with both the Metal and OpenCL implementations.

I can reproduce this issue with both Keras 2.2 and 2.4. Any suggestions are much appreciated!

INFO:plaidml:Opening device "metal_intel(r)_iris(tm)_plus_graphics_650.0"
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/100
INFO:plaidml:Analyzing Ops: 127 of 321 operations complete
  46/1563 [..............................] - ETA: 7:17 - loss: nan - acc: 0.0944

@chrisbarrera

chrisbarrera commented Feb 3, 2019

I have the same problem with an Iris Plus 655 under Metal (or even llvm_cpu) with the new 0.5.0 version. Every example I try computes NaN for the loss and yields really terrible accuracy; Chollet's Keras deep learning examples all do this.

Even the plaidml/keras test scripts in the PlaidML git repository do this, for example:

$ python mnist_mlp_test.py
20000 train samples
10000 test samples
INFO:plaidml:Opening device "metal_intel(r)_iris(tm)_plus_graphics_655.0"


_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 128)               100480
_________________________________________________________________
activation_1 (Activation)    (None, 128L)              0
_________________________________________________________________
dense_2 (Dense)              (None, 128)               16512
_________________________________________________________________
activation_2 (Activation)    (None, 128L)              0
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290
_________________________________________________________________
activation_3 (Activation)    (None, 10L)               0
=================================================================
Total params: 118,282
Trainable params: 118,282
Non-trainable params: 0
_________________________________________________________________


Epoch 1/1
20000/20000 [==============================] - 1s 64us/step - loss: nan - acc: 0.0990
10000/10000 [==============================] - 1s 97us/step
Test score: nan
Test accuracy: 0.098

@jacogasp

jacogasp commented Feb 4, 2019

Hey people,

I tried compiling PlaidML from source at the very latest commit 7623a07, and with my Intel Iris 650 running Metal it works like a charm. MacBook Pro 2017.

Plaidbench successfully passes its tests, and a small autoencoder net on images works perfectly.

EDIT: I have not tried v0.5.0 from pip.

@chrisbarrera

@jacogasp I will have to try building off master then. I see there are actually many newer files in master (2 months vs. 4 months old) than in the supposedly new 0.5.0 branch, so I wonder whether 0.5.0 actually includes all the latest fixes.

@brianretford

brianretford commented Feb 5, 2019

Yes, we've had some issues getting the 0.5.0 binaries released, and in the process we wrote a lot of code. The binaries on PyPI are based on the code in the 0.5.0 PR that we'll land today. We have some work to do to make Travis happy (we use Buildkite internally). The pips do include pretty much everything that would fix this issue.

What version of macOS are you running, @chrisbarrera? In addition to OpenCL being broken on macOS, Metal was quite broken pre-Mojave (for Intel GPUs).

@chrisbarrera

chrisbarrera commented Feb 5, 2019

@brianretford I am running Mojave 10.14.2 on a MacBook Pro 13" 2018 with an Iris Plus 655. I am currently running 0.5.0 installed from pip. Edit: It's worth noting that llvm_cpu gives the same results, as does OpenCL.

@brianretford

Ah, and we have one of those laptops in any case. But if it repros on the LLVM CPU backend, it sounds like something more fundamental.

We'll try to dig into this, but we are really pushing to get the Stripe-based version out, because it's just that much better.

@shpigunov

Encountered this bug on a 2015 13" MBP (internal Intel Iris 610 GPU) and a 2018 15" MBP (external RX 570), both running through Metal.

Using Keras 2.2.4 and plaidml-keras 0.5.0 freshly installed from pip. It still trains in both cases, so I have a question: does it just not display the loss, or does it fail to calculate it completely and therefore fail to train the model?

[screenshot]

@brianretford

Loss being NaN is usually a bug in our code or in the network itself. Does it work if you don't use PlaidML? Can you provide some code to help with a repro?

@shpigunov

Yes, it works at least on the 2015 13" MBP running macOS 10.14.14 with vanilla Keras 2.2.4 on Python 3.6.5. There, it appears to show the loss correctly.

Here is the code I was trying to run: https://gist.github.com/shpigunov/8b3221a74519834ae37b88f6f7607e21

@shpigunov

Just to show that the loss displays correctly when using Keras with the TensorFlow backend, without PlaidML:

[screenshot]

@shpigunov

shpigunov commented May 7, 2019

And here is the specific device setup I used for PlaidML that produced the errors:

[screenshot]

I will also try the OpenCL backend and report back shortly.

@shpigunov

shpigunov commented May 7, 2019

Same behavior using the OpenCL backend for the Intel GPU (number 2 in the setup options above):

[screenshot]

@chrisbarrera

chrisbarrera commented May 9, 2019

So, just for fun, in the latest master with Stripe I copied "stripe_config": "gpu/intel_gen9" into the Iris config of macos_experimental.json and recompiled. With USE_STRIPE=1 set, the NaNs mostly disappear: it seems that if I train for too many epochs, the decreasing loss eventually becomes NaN and blows up the accuracy value, but up to that point the accuracy increases. I don't know if Stripe is safe to use with Iris at this point (there is a reason y'all left it out, right? :) but it's a data point...
EDIT: FYI, each step with Stripe takes ~78 µs vs. ~124 µs without Stripe, so whatever it's doing, it's doing it faster :)

@brianretford

It should be safe to use; the only reason it was left out is that we're focusing on exactly one platform initially, and then we'll branch out and add official support as we get appropriate test equipment.

The 0.6.0 release should be even faster.

It's doing a lot more and we still have more to do.

brianretford pushed a commit that referenced this issue Jun 10, 2019
@xiahongze

xiahongze commented Jun 21, 2019

A MacBook 15" 2014 on OS X 10.14.5 with PlaidML 0.6.0 is getting the same errors with all backends except opencl_nvidia_geforce_gt_750m.0. Here is the complete list of options available to me:

   1 : llvm_cpu.0
   2 : opencl_intel_iris_pro.0
   3 : opencl_cpu.0
   4 : opencl_nvidia_geforce_gt_750m.0
   5 : metal_nvidia_geforce_gt_750m.0
   6 : metal_intel_iris_pro_graphics.0

One example output:

Running 1024 examples with mobilenet, batch size 16, on backend plaid
Loading CIFAR data
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170500096/170498071 [==============================] - 150s 1us/step
INFO:plaidml:Opening device "metal_intel_iris_pro_graphics.0"
Compiling network...Epoch 1/1
INFO:plaidml:Analyzing Ops: 1249 of 2263 operations complete
16/16 [==============================] - 14s 895ms/step - loss: 4.3532 - acc: 0.0000e+00
 Warming up...Epoch 1/1
32/32 [==============================] - 3s 103ms/step - loss: nan - acc: 0.0625
 Running...
Epoch 1/1
 432/1024 [===========>..................] - ETA: 1:01 - loss: nan - acc: 0.0949

anandj91 pushed a commit to anandj91/plaidml that referenced this issue Jul 24, 2019
@chrisbarrera

I just installed 0.6.3 and this problem went away, but only if USE_STRIPE=1. I don't get any NaNs whatsoever, whereas on earlier releases, or with USE_STRIPE=0, they either happened all the time or regularly after a few epochs. A major improvement even over the 0.5.x and earlier 0.6.x releases.

@aalskdlk

In mnist, curiously, I don't get any NaNs with batch_size==129 or 257 on a MacBook Pro (Retina, 13-inch, Mid 2014, macOS 10.13.6, Intel Iris 1536 MB, PlaidML 0.6.4 or 0.5.0 running through Metal), although I do get NaNs with batch_size==128 or 256.

@royalstream

royalstream commented Aug 28, 2019

I'm in a similar situation to @aalskdlk, using mnist_mlp.py, which is much simpler (no convolutions).
I'm running an upgraded MacPro5,1 with an NVIDIA GeForce GTX 970 under High Sierra, Anaconda Python 3.7 64-bit.
OpenCL seems to work fine for the GPU, although there's a noticeable delay before each epoch (data transfer?).
Metal is much faster but gives me NaN losses... unless I change the batch size to something else like 129; then model.fit works perfectly. However, I still get a NaN from model.evaluate.
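
For concreteness, a minimal sketch of the kind of run where only the batch size decides whether the loss goes NaN. The model below is illustrative rather than my exact script:

# Sketch of the batch-size workaround on an mnist_mlp-style model.
# On Metal here, batch_size=128 or 256 produced NaN losses; 129 or 257 did not.
import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_test = x_test.reshape(-1, 784).astype("float32") / 255
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

model = Sequential([
    Dense(128, activation="relu", input_shape=(784,)),
    Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=129, epochs=1,   # 129 instead of 128
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, batch_size=129)  # evaluate still returned NaN for me
print("Test loss:", score[0], "Test accuracy:", score[1])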

@xiahongze

Setting batch_size to 129 works for me. Weird!

@li-li-github

It seems that the loss explodes in the first epoch when the number of trainable parameters is too large (> ~111,217) with PlaidML + OpenCL on Mac. There is no problem with the Metal implementation. The OpenCL implementation didn't work at all on the CPU.

Tested environment.
OS: macOS 10.14.6
CPU: 2.9GHz Core i9
GPU: Radeon Pro Vega 20 4GB
PlaidML: 0.6.4
Keras: 2.2.4

@ZxMYS

ZxMYS commented Nov 19, 2019

Seeing this issue with my custom CNN. After I reduced my batch size by half, the NaN loss went away.

@AlexVaith

AlexVaith commented Nov 21, 2019

I have the same issue with an architecture containing both Conv1D and LSTM layers.
The model trains on 10000 samples and outputs NaN for the loss starting in the middle of the third epoch.
I am using the Metal configuration of the Radeon Pro 560X. My PlaidML build is 0.6.4.
I would like to test the USE_STRIPE solution, but I don't know where to assign it.
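
If it is just an environment variable, as the earlier comments suggest, then something like the following should be enough; this is an untested guess on my part:

# Untested guess: export USE_STRIPE before the PlaidML Keras backend is
# imported, so the backend sees it at startup.
import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"
os.environ["USE_STRIPE"] = "1"

import keras  # build and train the Conv1D/LSTM model as usual after this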

@ptr-uhrin

(Quoting @aalskdlk's batch_size observation above.)

Same thing. But when I set batch_size to 129, the accuracy is only about 0.47. It should be way higher.

@nikdata

nikdata commented Feb 29, 2020

I'm using RStudio and PlaidML to work through the book "Deep Learning with R". In chapter 2, the first exercise is giving me similar problems to what others have highlighted. In the example, if I choose a batch_size of 129 instead of 128, I get a non-NaN value for the loss. However, the resulting metrics are pretty terrible compared to the results in the book (I get an accuracy of 0.47 vs. 0.97 in the text).

My current setup:

  • Mid-2012 MacBook Pro Retina (2.3 GHz i7 Quad-Core)
  • Nvidia Geforce GT 650M
  • Mac OS X Catalina (10.15.3)
  • plaidml v. 0.7.0 using experimental config as the non-experimental config does not see the Nvidia GPU
  • keras v. 2.2.4

@hannabros

hannabros commented Mar 11, 2021

I have the same issue when trying very simple mnist examples (2 Conv2D layers). I tried batch sizes of 32, 64, 128, 129, 256, and 257, but none of them worked. My current setup is:

  • MacBook (Retina, 12-inch, 2017)
  • 1.3 GHz dual core Intel Core i5
  • macOS Catalina (10.15)
  • metal_intel(r)_hd_graphics_615.0
  • keras v 2.2.4
  • plaidml v 0.7.0 (no experimental config)
