JVM crash on GC with current maven snapshot #420

Closed
twitwi opened this issue Jul 1, 2013 · 14 comments

@twitwi
Contributor

twitwi commented Jul 1, 2013

Hi @ochafik

I have a (now) very simple program that crashes with the latest Maven snapshots but works with javacl-1.0.0-RC3.jar.

public static void main(String[] args) {
    CLContext context = JavaCL.createBestContext();
    CLDevice[] devices = context.getDevices();
    for (int i = 0; i < devices.length; i++) {
        System.err.println(i+": "+devices[i]);
    }
    System.err.println("Now GC'ing");
    System.gc(); // crash here
    System.err.println("GC'ed");
}

I run it with optirun under Linux Mint 15 (64-bit), on an "NVS 5400M (NVIDIA CUDA)" device. The rest of the program (complicated OpenCL kernel work) runs fine, but the GC crashes the VM with:

0: NVS 5400M (NVIDIA CUDA)
Now GC'ing
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8e46073bcc, pid=15591, tid=140248881133312
#
# JRE version: 7.0_21-b02
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x75cbcc]  PSRootsClosure<false>::do_oop(oopDesc**)+0xc

It might be a temporary issue (or specific to my device) but I prefer to report it.

Rémi

@twitwi
Contributor Author

twitwi commented Jul 2, 2013

I was able to try on another machine (64-bit Linux, Xubuntu 12.10), and it does not crash on GC there.
So the problem seems to depend on the environment (driver? optirun? …)

0: GeForce GTX 560 Ti (NVIDIA CUDA)
Now GC'ing
GC'ed

The GPU used is a secondary card (not used for any display).

@ochafik
Member

ochafik commented Jul 2, 2013

Hi @twitwi ,

Thank you so much for investigating and providing such a narrowed down test case!
By any chance, have you tried installing AMD Stream? (CPU-only OpenCL implementation)

Also, have you made sure the exact same version of Java is being used on both setups? (and have you tried turning compressed oops on/off, just in case?)

Could you also try calling CLAbstractEntity.release() on each context prior to GC'ing?
And could you put your test in a loop (as done in BridJ's MemoryTest) to give Xubuntu more chances to fail as well?
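
The loop suggested here could be sketched as below. This is a minimal sketch, not BridJ's MemoryTest: GcStress, its run method and the pauseMillis parameter are hypothetical names, and the JavaCL calls from the report are left as a comment so the harness itself runs without JavaCL on the classpath.

```java
import java.util.function.IntConsumer;

// Hypothetical harness (not part of JavaCL or BridJ): repeats a test body and
// requests a GC after every pass, so an intermittent crash gets more chances to show up.
final class GcStress {

    /** Runs body `iterations` times and returns the number of passes that completed. */
    static int run(int iterations, long pauseMillis, IntConsumer body) {
        int completed = 0;
        for (int pass = 0; pass < iterations; pass++) {
            body.accept(pass);
            System.gc(); // only a request; the JVM may ignore it
            try {
                Thread.sleep(pauseMillis); // leave the finalizer thread some time to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        int done = run(100, 50L, pass -> {
            // Assumed body: paste the JavaCL calls from the report here, e.g.
            //   CLContext context = JavaCL.createBestContext();
            //   context.getDevices();
            //   context.release();
            byte[] placeholder = new byte[1 << 16]; // stand-in garbage so the sketch runs without JavaCL
        });
        System.err.println("Completed " + done + " passes");
    }
}
```

If the JVM crashes during a pass, the final message is never printed, so the last line of output indicates whether all passes survived.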

Finally, a fuller native stack trace might be useful; please don't be afraid of spamming this issue with a larger log :-)

Cheers

@twitwi
Contributor Author

twitwi commented Jul 3, 2013

I tried (on the failing machine) with two versions of Java (6 and 7), varying UseCompressedOops and bridj.protected.
I also tried with AMD APP, since I already had it installed.
The test program is quite simple:

public static void main(String[] args) {
    CLContext context = JavaCL.createBestContext();
    CLDevice[] devices = context.getDevices();
    for (int i = 0; i < devices.length; i++) {
        System.err.println(i+": "+devices[i]);
    }
    System.err.println("Releasing context");
    context.release();
    System.err.println("Now GC'ing");
    System.gc();
    System.err.println("GC'ed");
}

Results incoming…

@twitwi
Contributor Author

twitwi commented Jul 3, 2013

Overall, with optirun, the only variable that matters is the javacl-core version (RC3 works, SNAPSHOT crashes).

With AMD APP (CPU only, no ATI card), bridj.protected=true causes the CL platform not to be found; otherwise RC3 works and SNAPSHOT fails with the same error.

The Java version, UseCompressedOops, and releasing the context seem to have no impact.

Details:
http://dl.heeere.com/withoptirun.zip
http://dl.heeere.com/withamdapp.zip

Script that produced it:

pre= #optirun
for dep in dependency dependency-RC3 ; do
    for java in java /usr/lib/jvm/java-6-openjdk-amd64/bin/java ; do
        echo "JAVA: $java"
        echo
        $java -version
        echo
        for opt in {,-Dbridj.protected=true}" "{,-XX:+UseCompressedOops,-XX:-UseCompressedOops} ; do
            echo "RUNNING: $pre $java $opt -cp target/DPGMMJavaCL-1.0-SNAPSHOT.jar:target/$dep/* com.heeere.dpgmm.javacl.TestGC"
            $pre $java $opt -cp target/DPGMMJavaCL-1.0-SNAPSHOT.jar:target/$dep/* com.heeere.dpgmm.javacl.TestGC
            echo;echo;echo
        done
    done
done
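
The nested loops in the script above amount to a 2 × 2 × 6 matrix (dependency set × Java binary × option combination), so 24 runs in total. A minimal Java sketch that enumerates the same command lines, for example to feed them to ProcessBuilder, might look like this. TestMatrix is a hypothetical helper; the paths, option strings and class name are copied from the script, and the optional optirun prefix is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper mirroring the shell script's nested loops.
final class TestMatrix {

    /** Returns one argument list per (dependency dir, java binary, option combination). */
    static List<List<String>> commands(List<String> deps, List<String> javas) {
        // Same expansion order as {,-Dbridj.protected=true}" "{,-XX:+UseCompressedOops,-XX:-UseCompressedOops}
        String[] protectedOpts = {"", "-Dbridj.protected=true"};
        String[] oopsOpts = {"", "-XX:+UseCompressedOops", "-XX:-UseCompressedOops"};

        List<List<String>> out = new ArrayList<>();
        for (String dep : deps) {
            for (String java : javas) {
                for (String p : protectedOpts) {
                    for (String o : oopsOpts) {
                        List<String> cmd = new ArrayList<>();
                        cmd.add(java);
                        if (!p.isEmpty()) cmd.add(p); // empty string means "flag not set"
                        if (!o.isEmpty()) cmd.add(o);
                        cmd.add("-cp");
                        cmd.add("target/DPGMMJavaCL-1.0-SNAPSHOT.jar:target/" + dep + "/*");
                        cmd.add("com.heeere.dpgmm.javacl.TestGC");
                        out.add(cmd);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> all = commands(
                Arrays.asList("dependency", "dependency-RC3"),
                Arrays.asList("java", "/usr/lib/jvm/java-6-openjdk-amd64/bin/java"));
        for (List<String> cmd : all) {
            System.out.println("RUNNING: " + String.join(" ", cmd));
        }
    }
}
```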

Maybe I should bisect it if it is not reproducible elsewhere.

@twitwi
Contributor Author

twitwi commented Jul 4, 2013

I just read part of the JavaCL code and I have a note to add: my AMD APP is not installed in /opt/AMDAPP/lib (it is a custom install), in case it matters.

@ochafik
Member

ochafik commented Jul 4, 2013

Hi @twitwi ,

Thanks for taking the time to investigate, much appreciated!
Bisecting might help but might be non-trivial, since the issue might come from BridJ as well (you'd have to recompile both libraries/BridJ and libraries/OpenCL at each step).
Could you please add a last check with BRIDJ_DIRECT=0 if you have time? (Direct mode is also disabled with BRIDJ_PROTECTED=1, but that mysteriously made the platform disappear, which is an issue in its own right, maybe even a related one.)
As for the lib path, I doubt it could cause the issue, although it might be good to see which library BridJ picks, which should be somewhere in the verbose or debug logs.
(if it's not the right lib, providing the full path with -Dbridj.OpenCL.library=/some/path/amdocl64.so could help)
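
For a repeatable check, the two settings mentioned above can be applied to a child JVM from Java. The sketch below only builds the launch and does not start it. BridjLaunch and its parameters are hypothetical names; BRIDJ_DIRECT and bridj.OpenCL.library are taken from this comment, and the library path is the same placeholder used above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher sketch: prepares a child JVM with direct mode disabled
// and, optionally, an explicit OpenCL library path for BridJ.
final class BridjLaunch {

    /** Builds (but does not start) the child process. Pass null to let BridJ pick the library. */
    static ProcessBuilder build(String javaBin, String classpath, String mainClass,
                                String openClLibraryPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        if (openClLibraryPath != null) {
            cmd.add("-Dbridj.OpenCL.library=" + openClLibraryPath);
        }
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);

        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("BRIDJ_DIRECT", "0"); // environment variable named in the comment above
        pb.inheritIO(); // show the child's output, including any crash log, in this console
        return pb;
    }

    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = build("java",
                "target/DPGMMJavaCL-1.0-SNAPSHOT.jar:target/dependency/*",
                "com.heeere.dpgmm.javacl.TestGC",
                "/some/path/amdocl64.so"); // placeholder path, replace with the real library
        int exit = pb.start().waitFor();
        System.err.println("Child JVM exited with code " + exit);
    }
}
```

A non-zero exit code from the child is a quick way to tell a crashed run from a clean one when comparing configurations.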

I'm now trying to install Linux Mint :-)

Cheers

@ochafik
Member

ochafik commented Jul 4, 2013

Just to check: are you using the Ubuntu-based Linux Mint (the default one), or Linux Mint Debian Edition?

@twitwi
Contributor Author

twitwi commented Jul 4, 2013

default one

@twitwi
Contributor Author

twitwi commented Jul 4, 2013

(Ubuntu-based, not Kubuntu either)

ochafik added a commit that referenced this issue Jul 29, 2013
…new -Dbridj.debug.pointer.releases=true). Could be the root cause of #420 and/or #405.
@ochafik
Member

ochafik commented Jul 30, 2013

Hi @twitwi ,

Could you please try again with the latest 1.0-SNAPSHOT? There's a magic one-line fix that might help...

Cheers

@twitwi
Contributor Author

twitwi commented Jul 30, 2013

Hi and thanks for the patch,
I'm away for a few weeks. I'll try when I come back.

@ochafik
Member

ochafik commented Nov 4, 2013

Hi Rémi,

Friendly ping :-)

Cheers

@ochrons

ochrons commented Mar 6, 2014

I seem to be seeing the same error with RC3:

When garbage collector runs, there is an Access Violation error due to CLDevice cleanup (log attached with stack trace etc.)

Using JavaCL 1.0.0-RC3

Environment: Windows 8.1, two OpenCL platforms
Number of devices in platform NVIDIA CUDA: 1
Number of devices in platform Intel(R) OpenCL: 1
--- Info for device Quadro 2000M: ---
CL_DEVICE_NAME: Quadro 2000M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 331.65
--- Info for device Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz: ---
CL_DEVICE_NAME: Intel(R) Core(TM) i7-2860QM CPU @ 2.50GHz
CL_DEVICE_VENDOR: Intel(R) Corporation
CL_DRIVER_VERSION: 3.0.1.15216

To replicate (simple Scala app):

object Start extends App {
  override def main(args: Array[String]) = {
    val oclPlatforms: Array[CLPlatform] = JavaCL.listPlatforms()
    // list the NVIDIA devices
    oclPlatforms(0).listAllDevices(true)
    System.gc()
    Thread.sleep(1000)
  }
}

Same problem when calling JavaCL.getBestDevice()

Getting the CPU devices, on the other hand, works fine:

object Start extends App {
  override def main(args: Array[String]) = {
    val oclPlatforms: Array[CLPlatform] = JavaCL.listPlatforms()
    // list the CPU devices
    oclPlatforms(1).listAllDevices(true)
    System.gc()
    Thread.sleep(1000)
  }
}

Relevant part of the log dump:

Stack: [0x000000000af20000,0x000000000b020000],  sp=0x000000000b01e918,  free space=1018k
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.nativelibs4java.opencl.library.OpenCLLibrary.clReleaseDevice(J)I+0
j  com.nativelibs4java.opencl.CLDevice.clear()V+7
j  com.nativelibs4java.opencl.CLAbstractEntity.doRelease()V+10
j  com.nativelibs4java.opencl.CLAbstractEntity.finalize()V+1
v  ~StubRoutines::call_stub
j  java.lang.ref.Finalizer.invokeFinalizeMethod(Ljava/lang/Object;)V+0
j  java.lang.ref.Finalizer.runFinalizer()V+45
j  java.lang.ref.Finalizer.access$100(Ljava/lang/ref/Finalizer;)V+1
j  java.lang.ref.Finalizer$FinalizerThread.run()V+24
v  ~StubRoutines::call_stub

@ochafik
Member

ochafik commented Mar 18, 2015

This issue was moved to nativelibs4java/JavaCL#18

@ochafik ochafik closed this as completed Mar 18, 2015