
Persistent communication request provokes segfault in Java bindings. #369

Closed

osvegis opened this issue Feb 1, 2015 · 57 comments

@osvegis
Contributor

osvegis commented Feb 1, 2015

MPI corrupts Java's memory space.
The following example provokes a segfault in the Java bindings.
Please see the comments in the example.

import mpi.*;                                                                         
import java.nio.*;                                                                    

public class CrashTest
{
    private static final int STEPS = 1000000000,
        SIZE  = 4096;

    public static void main(String...args) throws MPIException
    {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();
        StringBuilder s = new StringBuilder();

        if(MPI.COMM_WORLD.getSize() != 2)
            throw new MPIException("I need exactly 2 processes.");

        // Only one buffer is needed per rank (rank 0 only sends, rank 1 only receives),
        // but the test works OK (no crash) if you allocate only one.
        ByteBuffer sendBuf = MPI.newByteBuffer(SIZE),
            recvBuf = MPI.newByteBuffer(SIZE);

        Prequest req = MPI.COMM_WORLD.recvInit(recvBuf, SIZE, MPI.BYTE, 0, 0);

        for(int i = 1; i <= STEPS; i++)
            {
                // Allocate memory to provoke GC activity and the crash.
                // If you comment out the following line, the test works OK.
                (s = new StringBuilder(SIZE).append(i)).trimToSize();

                if(rank == 0)
                    {
                        if(i % 100000 == 0)
                            {
                                s.setLength(0);
                                System.out.println(i + s.toString());
                            }

                        MPI.COMM_WORLD.send(sendBuf, SIZE, MPI.BYTE, 1, 0);
                    }
                else
                    {
                        req.start();
                        req.waitFor();
                    }
            }

        MPI.Finalize();
    }

} // CrashTest
@jsquyres
Member

jsquyres commented Feb 7, 2015

@osvegis I'm unable to get this test to fail for me.

Should I be seeing the memory usage of the process increase as the process runs?

I increased SIZE to 1048576, and after running 2 procs of this on a single 128GB server, "top" still only shows 5.1% memory usage for each java process (it's been 5.1% the whole time):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                 
19701 jsquyres  20   0 18.7g 3.2g  13m S 100.0  5.1   5:34.79 java                    
19702 jsquyres  20   0 18.7g 3.2g  13m S 100.0  5.1   5:34.47 java

@jsquyres
Member

jsquyres commented Feb 7, 2015

Forgot to mention that I tried both master and the v1.8 branch -- so I'm guessing I'm not triggering the GC, and therefore not triggering the problem.

@osvegis
Contributor Author

osvegis commented Feb 7, 2015

I also tried on master and the test crashes. This problem may have existed since the very beginning of the Java bindings integration: ompi-1.7.5 also fails.
I think the problem is not related to the buffer size.
Regarding memory usage, the process size may increase, but not necessarily.

@jsquyres
Member

jsquyres commented Feb 9, 2015

Hmm. Ok. How do I get this test to reproduce, then? Is there a way that I can know for sure that the GC has fired?

@shurickdaryin

@jsquyres You could insert an explicit call to System.gc() into the for loop (maybe guarded with if (i % 1024 == 0) so that it is not called on every iteration). Alternatively, you could use the -Xmx java parameter to set the maximum heap size (see http://docs.oracle.com/cd/E19900-01/819-4742/abeik/index.html).
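For instance, the loop in the CrashTest example above could be adapted as in the following sketch (the 1024 interval is an arbitrary choice):

for(int i = 1; i <= STEPS; i++)
{
    (s = new StringBuilder(SIZE).append(i)).trimToSize();

    // Explicitly request a collection every 1024 iterations so the GC
    // is guaranteed to run even on machines with plenty of free memory.
    if(i % 1024 == 0)
        System.gc();

    // ... the rest of the send/recv loop body stays unchanged ...
}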

@osvegis
Contributor Author

osvegis commented Feb 9, 2015

Don't worry about whether the GC fires. The GC runs when necessary: we allocate memory on every iteration, so if the GC never fired we would get an out-of-memory error.
In my tests, sometimes Java crashes and sometimes an MPI call crashes.
Maybe you can provoke the error on a more modest machine.

@jsquyres
Member

jsquyres commented Feb 9, 2015

:-(

I'm still totally unable to reproduce this bug -- even on my OS X laptop (with only 16GB RAM).

I don't doubt that there is a real issue here, but I'm somewhat stymied until I can reproduce it reliably...

@osvegis
Contributor Author

osvegis commented Feb 9, 2015

My machine has a 3570T CPU and only 8GB of RAM.
Linux shuttle 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u1 x86_64 GNU/Linux

@jsquyres
Member

jsquyres commented Feb 9, 2015

My laptop is the only machine I have access to with only 16GB -- I don't have access to anything with less memory than that. :-(

FWIW, I even put in the call to System.gc(), but that didn't trigger the issue, either.

Any suggestions?

@osvegis
Contributor Author

osvegis commented Feb 9, 2015

What do you think about trying a virtual machine?

@jsquyres
Member

jsquyres commented Feb 9, 2015

Interesting idea. Let me see what I can fire up around here...


@goodell
Member

goodell commented Feb 9, 2015

@osvegis is there any additional debugging you could do on your end? Could you send along the console output and backtrace(s) from the SEGV? I think with HotSpot there is a way to get a more verbose error log: http://www.oracle.com/technetwork/java/javase/felog-138657.html

You might also try running your test program with -Xcheck:jni (http://www.oracle.com/technetwork/java/javase/clopts-139448.html#gbmtq)

There are some other general troubleshooting suggestions here: http://www.oracle.com/technetwork/java/javase/crashes-137240.html

Jeff, you might also have trouble reproducing if you are running a different version of the JVM or are running on a different platform. It might be very difficult for you to directly reproduce this problem yourself.

@osvegis
Contributor Author

osvegis commented Feb 9, 2015

The following execution generated an error file: hs_err_pid####.log

$ mpirun -np 2 java -Xcheck:jni -cp build/classes/ CrashTest
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
1200000
*** glibc detected *** java: free(): corrupted unsorted chunks: 
0x0000000002556420 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fdf07437a16]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7fdf0743c7bc]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x2b5d80)[0x7fdf06671d80]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8b2ee0)[0x7fdf06c6eee0]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8dc4d8)[0x7fdf06c984d8]
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x7b04d2)[0x7fdf06b6c4d2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7fdf07b65b50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fdf0749d70d]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00600000-00601000 r--p 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00601000-00602000 rw-p 00001000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
02097000-02673000 rw-p 00000000 00:00 0                                  
[heap]
77a200000-77b700000 rw-p 00000000 00:00 0
77b700000-784800000 rw-p 00000000 00:00 0
784800000-789a00000 rw-p 00000000 00:00 0
789a00000-7d6d00000 rw-p 00000000 00:00 0
7d6d00000-7fff80000 rw-p 00000000 00:00 0
7fff80000-800000000 ---p 00000000 00:00 0
7fdee75d1000-7fdee75dd000 r-xp 00000000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fdee75dd000-7fdee77dd000 ---p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fdee77dd000-7fdee77de000 rw-p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fdee77de000-7fdeef7df000 rw-s 00000000 08:05 15597816 
/tmp/openmpi-sessions-oscar@shuttle_0/53059/1/shared_mem_pool.shuttle 
(deleted)
7fdeef7df000-7fdeefbe0000 rw-s 00000000 08:05 15597823 
/tmp/openmpi-sessions-oscar@shuttle_0/53059/1/0/vader_segment.shuttle.0
7fdeefbe0000-7fdeeffe1000 rw-s 00000000 08:05 15597819 
/tmp/openmpi-sessions-oscar@shuttle_0/53059/1/1/vader_segment.shuttle.1
7fdeeffe1000-7fdeeffe8000 r-xp 00000000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fdeeffe8000-7fdef01e8000 ---p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fdef01e8000-7fdef01ea000 rw-p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fdef01ea000-7fdef0207000 r-xp 00000000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fdef0207000-7fdef0407000 ---p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fdef0407000-7fdef0409000 rw-p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fdef0409000-7fdef0434000 r-xp 00000000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fdef0434000-7fdef0634000 ---p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fdef0634000-7fdef0635000 rw-p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fdef0635000-7fdef0636000 rw-p 00000000 00:00 0
7fdef0636000-7fdef063e000 r-xp 00000000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fdef063e000-7fdef083d000 ---p 00008000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fdef083d000-7fdef083e000 rw-p 00007000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fdef083e000-7fdef0841000 r-xp 00000000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fdef0841000-7fdef0a40000 ---p 00003000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fdef0a40000-7fdef0a41000 rw-p 00002000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fdef0a41000-7fdef0a63000 r-xp 00000000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fdef0a63000-7fdef0c63000 ---p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fdef0c63000-7fdef0c64000 rw-p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fdef0c64000-7fdef0c75000 r-xp 00000000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fdef0c75000-7fdef0e75000 ---p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fdef0e75000-7fdef0e76000 rw-p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fdef1082000-7fdef10a7000 r-xp 00000000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fdef10a7000-7fdef12a7000 ---p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fdef12a7000-7fdef12a9000 rw-p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fdef12b1000-7fdef12b6000 r-xp 00000000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fdef12b6000-7fdef14b5000 ---p 00005000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fdef14b5000-7fdef14b6000 rw-p 00004000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fdef14e0000-7fdef14e4000 r-xp 00000000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fdef14e4000-7fdef16e4000 ---p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fdef16e4000-7fdef16e5000 rw-p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fdef16e5000-7fdef16f1000 r-xp 00000000 08:05 22151992 
/home/oscar/ompi-install/lib/openmpi/mca_btl_vader.so
7fdef16f1000-7fdef18f0000 ---p 0000c000 08:05 22151992 
/home/oscar/ompi-install/lib/openmpi/mca_btl_vader.so
7fdef18f0000-7fdef18f3000 rw-p 0000b000 08:05 22151992 
/home/oscar/ompi-install/lib/openmpi/mca_btl_vader.so
[shuttle:06363] *** Process received signal ***
[shuttle:06363] Signal: Aborted (6)
[shuttle:06363] Signal code:  (-6)
[shuttle:06363] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fdf07b6e0a0]
[shuttle:06363] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fdf073f3165]
[shuttle:06363] [ 2] 
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7fdf073f63e0]
[shuttle:06363] [ 3] 
/lib/x86_64-linux-gnu/libc.so.6(+0x6d1cb)[0x7fdf0742e1cb]
[shuttle:06363] [ 4] 
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fdf07437a16]
[shuttle:06363] [ 5] 
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7fdf0743c7bc]
[shuttle:06363] [ 6] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x2b5d80)[0x7fdf06671d80]
[shuttle:06363] [ 7] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8b2ee0)[0x7fdf06c6eee0]
[shuttle:06363] [ 8] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8dc4d8)[0x7fdf06c984d8]
[shuttle:06363] [ 9] 
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x7b04d2)[0x7fdf06b6c4d2]
[shuttle:06363] [10] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7fdf07b65b50]
[shuttle:06363] [11] 
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fdf0749d70d]
[shuttle:06363] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node shuttle exited on 
signal 6 (Aborted).
--------------------------------------------------------------------------

Sometimes the crash is different. A SIGSEGV error appears, but the test continues!

$ mpirun -np 2 java -Xcheck:jni -cp build/classes/ CrashTest
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f0986a8424e, pid=6789, tid=139678682806016
#
# JRE version: OpenJDK Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13)
# Java VM: OpenJDK 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 
compressed oops)
# Derivative: IcedTea 2.5.4
# Distribution: Debian GNU/Linux 7.6 (wheezy), package 7u75-2.5.4-1~deb7u1
# Problematic frame:
# C  [libc.so.6+0x7924e]
#
# Failed to write core dump. Core dumps have been disabled. To enable 
core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/oscar/NetBeansProjects/mpi-pruebas/hs_err_pid6789.log
1100000
1200000
1300000
1400000
1500000
1600000
...

@goodell
Member

goodell commented Feb 9, 2015

You might be able to find the problem with a malloc debugger (like DUMA, a fork of Electric Fence: http://duma.sourceforge.net/).

@osvegis
Contributor Author

osvegis commented Feb 10, 2015

I get the following error:

$ export LD_PRELOAD=libduma.so.0.0.0

$ mpirun -np 2 duma java -cp ~/ompi-install/lib/mpi.jar:. CrashTest

DUMA 2.5.15 (shared library, NO_LEAKDETECTION)
Copyright (C) 2006 Michael Eddington meddington@gmail.com
Copyright (C) 2002-2008 Hayati Ayguen h_ayguen@web.de, Procitec GmbH
Copyright (C) 1987-1999 Bruce Perens bruce@perens.com

DUMA 2.5.15 (shared library, NO_LEAKDETECTION)
Copyright (C) 2006 Michael Eddington meddington@gmail.com
Copyright (C) 2002-2008 Hayati Ayguen h_ayguen@web.de, Procitec GmbH
Copyright (C) 1987-1999 Bruce Perens bruce@perens.com

DUMA Aborting: malloc() is not bound to duma.
DUMA Aborting: Preload lib with 'LD_PRELOAD=libduma.so '.

DUMA 2.5.15 (shared library, NO_LEAKDETECTION)
Copyright (C) 2006 Michael Eddington meddington@gmail.com
Copyright (C) 2002-2008 Hayati Ayguen h_ayguen@web.de, Procitec GmbH
Copyright (C) 1987-1999 Bruce Perens bruce@perens.com

DUMA Aborting: malloc() is not bound to duma.
DUMA Aborting: Preload lib with 'LD_PRELOAD=libduma.so '.

@goodell
Member

goodell commented Feb 10, 2015

I think OMPI's malloc hooks for memory registration are getting in the way. Try also setting export OMPI_MCA_memory_linux_disable=1 in your environment. This MCA parameter must be set as an environment variable; it will not work if set by some other mechanism.

@osvegis
Contributor Author

osvegis commented Feb 10, 2015

I get the same error.
Maybe Java uses its own malloc, so DUMA cannot interpose.

@jsquyres
Member

Hmm. How about trying the same thing with a pure Java program (i.e., non-MPI)? That would tell us if Java is interposing its own malloc.

@osvegis
Contributor Author

osvegis commented Feb 10, 2015 via email

@jsquyres
Member

Is there a way to tell if this same error happens with multiple different versions of the JVM?

@osvegis
Contributor Author

osvegis commented Feb 10, 2015 via email

@jsquyres
Member

On my OSX machine:

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

On my Linux machine:

$ java -version
java version "1.6.0_32"
OpenJDK Runtime Environment (IcedTea6 1.13.4) (rhel-6.1.13.4.el6_5-x86_64)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

@jsquyres
Member

Just curious: does the same problem happen if you only use the TCP BTL (i.e., not the shared memory BTL)?

$ mpirun --mca btl tcp,self ...

@osvegis
Contributor Author

osvegis commented Feb 10, 2015

I thought it was working OK, but after 4 attempts it crashed:

$ mpirun -np 2 --mca btl tcp,self java -cp build/classes/ CrashTest
100000
200000
300000
400000
*** glibc detected *** java: corrupted double-linked list: 
0x0000000002585b00 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fa7a55a1a16]
/lib/x86_64-linux-gnu/libc.so.6(+0x76e8d)[0x7fa7a55a1e8d]
/lib/x86_64-linux-gnu/libc.so.6(+0x79174)[0x7fa7a55a4174]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7fa7a55a68a0]
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_malloc+0x5e)[0x7fa79835ec79]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_seg_alloc+0x2f)[0x7fa7937208da]
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc+0x114)[0x7fa794c2b5b3]
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_wrapper+0x36)[0x7fa794c2b1aa]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x11ea9)[0x7fa793729ea9]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x12c33)[0x7fa79372ac33]
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x1b8)[0x7fa79372a11b]
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so(+0xa4e6)[0x7fa793d854e6]
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9cd3e)[0x7fa798372d3e]
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9ce4d)[0x7fa798372e4d]
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9d11c)[0x7fa79837311c]
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x2ab)[0x7fa798373783]
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_progress+0x88)[0x7fa79830cea1]
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50b5b)[0x7fa79894db5b]
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50c00)[0x7fa79894dc00]
/home/oscar/ompi-install/lib/libmpi.so.0(ompi_request_default_wait+0x27)[0x7fa79894dc50]
/home/oscar/ompi-install/lib/libmpi.so.0(PMPI_Wait+0x130)[0x7fa7989a4627]
/home/oscar/ompi-install/lib/libmpi_java.so.0.0.0(Java_mpi_Request_waitFor+0x2d)[0x7fa798c6f367]
[0x7fa79fd6d088]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00600000-00601000 r--p 00000000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
00601000-00602000 rw-p 00001000 08:05 790211 
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
02101000-026f3000 rw-p 00000000 00:00 0                                  
[heap]
77a200000-77b700000 rw-p 00000000 00:00 0
77b700000-784800000 rw-p 00000000 00:00 0
784800000-789a00000 rw-p 00000000 00:00 0
789a00000-7d6d00000 rw-p 00000000 00:00 0
7d6d00000-7f0100000 rw-p 00000000 00:00 0
7f0100000-7f0280000 ---p 00000000 00:00 0
7f0280000-800000000 rw-p 00000000 00:00 0
7fa78c000000-7fa78c021000 rw-p 00000000 00:00 0
7fa78c021000-7fa790000000 ---p 00000000 00:00 0
7fa79246a000-7fa792476000 r-xp 00000000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fa792476000-7fa792676000 ---p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fa792676000-7fa792677000 rw-p 0000c000 08:05 22154771 
/home/oscar/ompi-install/lib/openmpi/mca_dpm_orte.so
7fa792677000-7fa79267e000 r-xp 00000000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fa79267e000-7fa79287e000 ---p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fa79287e000-7fa792880000 rw-p 00007000 08:05 22154789 
/home/oscar/ompi-install/lib/openmpi/mca_osc_sm.so
7fa792880000-7fa79289d000 r-xp 00000000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fa79289d000-7fa792a9d000 ---p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fa792a9d000-7fa792a9f000 rw-p 0001d000 08:05 22154791 
/home/oscar/ompi-install/lib/openmpi/mca_osc_pt2pt.so
7fa792a9f000-7fa792aca000 r-xp 00000000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fa792aca000-7fa792cca000 ---p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fa792cca000-7fa792ccb000 rw-p 0002b000 08:05 22154757 
/home/oscar/ompi-install/lib/openmpi/mca_coll_tuned.so
7fa792ccb000-7fa792ccc000 rw-p 00000000 00:00 0
7fa792ccc000-7fa792cd4000 r-xp 00000000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fa792cd4000-7fa792ed3000 ---p 00008000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fa792ed3000-7fa792ed4000 rw-p 00007000 08:05 22154767 
/home/oscar/ompi-install/lib/openmpi/mca_coll_sm.so
7fa792ed4000-7fa792ed7000 r-xp 00000000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fa792ed7000-7fa7930d6000 ---p 00003000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fa7930d6000-7fa7930d7000 rw-p 00002000 08:05 22154769 
/home/oscar/ompi-install/lib/openmpi/mca_coll_self.so
7fa7930d7000-7fa7930f9000 r-xp 00000000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fa7930f9000-7fa7932f9000 ---p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fa7932f9000-7fa7932fa000 rw-p 00022000 08:05 22154761 
/home/oscar/ompi-install/lib/openmpi/mca_coll_libnbc.so
7fa7932fa000-7fa79330b000 r-xp 00000000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fa79330b000-7fa79350b000 ---p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fa79350b000-7fa79350c000 rw-p 00011000 08:05 22154759 
/home/oscar/ompi-install/lib/openmpi/mca_coll_basic.so
7fa793718000-7fa79373d000 r-xp 00000000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fa79373d000-7fa79393d000 ---p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fa79393d000-7fa79393f000 rw-p 00025000 08:05 22154795 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so
7fa793947000-7fa79394c000 r-xp 00000000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fa79394c000-7fa793b4b000 ---p 00005000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fa793b4b000-7fa793b4c000 rw-p 00004000 08:05 22154763 
/home/oscar/ompi-install/lib/openmpi/mca_coll_inter.so
7fa793b76000-7fa793b7a000 r-xp 00000000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fa793b7a000-7fa793d7a000 ---p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fa793d7a000-7fa793d7b000 rw-p 00004000 08:05 22154799 
/home/oscar/ompi-install/lib/openmpi/mca_pubsub_orte.so
7fa793d7b000-7fa793d8d000 r-xp 00000000 08:05 22152437 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so
7fa793d8d000-7fa793f8c000 ---p 00012000 08:05 22152437 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so
7fa793f8c000-7fa793f8e000 rw-p 00011000 08:05 22152437 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so
7fa793f8e000-7fa79400e000 rw-p 00000000 00:00 0
7fa79400e000-7fa794012000 r-xp 00000000 08:05 22151949 
/home/oscar/ompi-install/lib/openmpi/mca_btl_self.so
7fa794012000-7fa794212000 ---p 00004000 08:05 22151949 
/home/oscar/ompi-install/lib/openmpi/mca_btl_self.so
7fa794212000-7fa794213000 rw-p 00004000 08:05 22151949 
/home/oscar/ompi-install/lib/openmpi/mca_btl_self.so
7fa794213000-7fa794218000 r-xp 00000000 08:05 22154755 
/home/oscar/ompi-install/lib/openmpi/mca_bml_r2.so
7fa794218000-7fa794418000 ---p 00005000 08:05 22154755 
/home/oscar/ompi-install/lib/openmpi/mca_bml_r2.so
7fa794418000-7fa794419000 rw-p 00005000 08:05 22154755 
/home/oscar/ompi-install/lib/openmpi/mca_bml_r2.so
7fa794419000-7fa79441b000 r-xp 00000000 08:05 21634332 
/home/oscar/ompi-install/lib/libmca_common_sm.so.0.0.0
7fa79441b000-7fa79461a000 ---p 00002000 08:05 21634332 
/home/oscar/ompi-install/lib/libmca_common_sm.so.0.0.0
7fa79461a000-7fa79461b000 rw-p 00001000 08:05 21634332 
/home/oscar/ompi-install/lib/libmca_common_sm.so.0.0.0
7fa79461b000-7fa79461d000 r-xp 00000000 08:05 22153277 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_sm.so
7fa79461d000-7fa79481d000 ---p 00002000 08:05 22153277 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_sm.so
7fa79481d000-7fa79481e000 rw-p 00002000 08:05 22153277 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_sm.so
7fa79481e000-7fa794823000 r-xp 00000000 08:05 22153263 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_grdma.so
7fa794823000-7fa794a23000 ---p 00005000 08:05 22153263 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_grdma.so
7fa794a23000-7fa794a24000 rw-p 00005000 08:05 22153263 
/home/oscar/ompi-install/lib/openmpi/mca_mpool_grdma.so
7fa794a24000-7fa794a2a000 r-xp 00000000 08:05 22153303 
/home/oscar/ompi-install/lib/openmpi/mca_rcache_vma.so
7fa794a2a000-7fa794c29000 ---p 00006000 08:05 22153303 
/home/oscar/ompi-install/lib/openmpi/mca_rcache_vma.so
7fa794c29000-7fa794c2a000 rw-p 00005000 08:05 22153303 
/home/oscar/ompi-install/lib/openmpi/mca_rcache_vma.so
7fa794c2a000-7fa794c2d000 r-xp 00000000 08:05 22151834 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so
7fa794c2d000-7fa794e2c000 ---p 00003000 08:05 22151834 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so
7fa794e2c000-7fa794e2d000 rw-p 00002000 08:05 22151834 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so
7fa794e2d000-7fa794e33000 r-xp 00000000 08:05 22154224 
/home/oscar/ompi-install/lib/openmpi/mca_routed_radix.so
7fa794e33000-7fa795032000 ---p 00006000 08:05 22154224 
/home/oscar/ompi-install/lib/openmpi/mca_routed_radix.so
7fa795032000-7fa795033000 rw-p 00005000 08:05 22154224 
/home/oscar/ompi-install/lib/openmpi/mca_routed_radix.so
7fa795033000-7fa795037000 r-xp 00000000 08:05 22154129 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_rcd.so
7fa795037000-7fa795237000 ---p 00004000 08:05 22154129 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_rcd.so
7fa795237000-7fa795238000 rw-p 00004000 08:05 22154129 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_rcd.so
7fa795238000-7fa79523d000 r-xp 00000000 08:05 22154205 
/home/oscar/ompi-install/lib/openmpi/mca_rml_oob.so
7fa79523d000-7fa79543d000 ---p 00005000 08:05 22154205 
/home/oscar/ompi-install/lib/openmpi/mca_rml_oob.so
7fa79543d000-7fa79543e000 rw-p 00005000 08:05 22154205 
/home/oscar/ompi-install/lib/openmpi/mca_rml_oob.so
7fa79543e000-7fa79544f000 r-xp 00000000 08:05 22154150 
/home/oscar/ompi-install/lib/openmpi/mca_oob_usock.so
7fa79544f000-7fa79564f000 ---p 00011000 08:05 22154150 
/home/oscar/ompi-install/lib/openmpi/mca_oob_usock.so
7fa79564f000-7fa795650000 rw-p 00011000 08:05 22154150 
/home/oscar/ompi-install/lib/openmpi/mca_oob_usock.so
7fa795650000-7fa795669000 r-xp 00000000 08:05 22154148 
/home/oscar/ompi-install/lib/openmpi/mca_oob_tcp.so
7fa795669000-7fa795869000 ---p 00019000 08:05 22154148 
/home/oscar/ompi-install/lib/openmpi/mca_oob_tcp.so
7fa795869000-7fa79586a000 rw-p 00019000 08:05 22154148 
/home/oscar/ompi-install/lib/openmpi/mca_oob_tcp.so
7fa79586b000-7fa79586e000 r-xp 00000000 08:05 22151640 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_basic.so
7fa79586e000-7fa795a6d000 ---p 00003000 08:05 22151640 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_basic.so
7fa795a6d000-7fa795a6e000 rw-p 00002000 08:05 22151640 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_basic.so
7fa795a6e000-7fa795a74000 r-xp 00000000 08:05 22154127 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_direct.so
7fa795a74000-7fa795c73000 ---p 00006000 08:05 22154127 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_direct.so
7fa795c73000-7fa795c74000 rw-p 00005000 08:05 22154127 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_direct.so
7fa795c74000-7fa795c78000 r-xp 00000000 08:05 22154125 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_brks.so
7fa795c78000-7fa795e77000 ---p 00004000 08:05 22154125 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_brks.so
7fa795e77000-7fa795e78000 rw-p 00003000 08:05 22154125 
/home/oscar/ompi-install/lib/openmpi/mca_grpcomm_brks.so
7fa795e78000-7fa795e7a000 r-xp 00000000 08:05 22154094 
/home/oscar/ompi-install/lib/openmpi/mca_errmgr_default_app.so
7fa795e7a000-7fa79607a000 ---p 00002000 08:05 22154094 
/home/oscar/ompi-install/lib/openmpi/mca_errmgr_default_app.so
7fa79607a000-7fa79607b000 rw-p 00002000 08:05 22154094 
/home/oscar/ompi-install/lib/openmpi/mca_errmgr_default_app.so
7fa79607b000-7fa79607d000 r-xp 00000000 08:05 22154279 
/home/oscar/ompi-install/lib/openmpi/mca_state_app.so
7fa79607d000-7fa79627c000 ---p 00002000 08:05 22154279 
/home/oscar/ompi-install/lib/openmpi/mca_state_app.so
7fa79627c000-7fa79627d000 rw-p 00001000 08:05 22154279 
/home/oscar/ompi-install/lib/openmpi/mca_state_app.so
7fa79627d000-7fa79627e000 ---p 00000000 00:00 0
7fa79627e000-7fa796a7e000 rw-p 00000000 00:00 0
7fa796a7e000-7fa796a7f000 ---p 00000000 00:00 0
7fa796a7f000-7fa79727f000 rw-p 00000000 00:00 0
7fa79727f000-7fa797294000 r-xp 00000000 08:05 22153279 
/home/oscar/ompi-install/lib/openmpi/mca_pmix_native.so
7fa797294000-7fa797494000 ---p 00015000 08:05 22153279 
/home/oscar/ompi-install/lib/openmpi/mca_pmix_native.so
7fa797494000-7fa797495000 rw-p 00015000 08:05 22153279 
/home/oscar/ompi-install/lib/openmpi/mca_pmix_native.so[shuttle:04178] 
*** Process received signal ***
[shuttle:04178] Signal: Aborted (6)
[shuttle:04178] Signal code:  (-6)
[shuttle:04178] [ 0] 
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fa7a5cd80a0]
[shuttle:04178] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fa7a555d165]
[shuttle:04178] [ 2] 
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7fa7a55603e0]
[shuttle:04178] [ 3] 
/lib/x86_64-linux-gnu/libc.so.6(+0x6d1cb)[0x7fa7a55981cb]
[shuttle:04178] [ 4] 
/lib/x86_64-linux-gnu/libc.so.6(+0x76a16)[0x7fa7a55a1a16]
[shuttle:04178] [ 5] 
/lib/x86_64-linux-gnu/libc.so.6(+0x76e8d)[0x7fa7a55a1e8d]
[shuttle:04178] [ 6] 
/lib/x86_64-linux-gnu/libc.so.6(+0x79174)[0x7fa7a55a4174]
[shuttle:04178] [ 7] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7fa7a55a68a0]
[shuttle:04178] [ 8] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_malloc+0x5e)[0x7fa79835ec79]
[shuttle:04178] [ 9] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_seg_alloc+0x2f)[0x7fa7937208da]
[shuttle:04178] [10] 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc+0x114)[0x7fa794c2b5b3]
[shuttle:04178] [11] 
/home/oscar/ompi-install/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_wrapper+0x36)[0x7fa794c2b1aa]
[shuttle:04178] [12] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x11ea9)[0x7fa793729ea9]
[shuttle:04178] [13] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(+0x12c33)[0x7fa79372ac33]
[shuttle:04178] [14] 
/home/oscar/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x1b8)[0x7fa79372a11b]
[shuttle:04178] [15] 
/home/oscar/ompi-install/lib/openmpi/mca_btl_tcp.so(+0xa4e6)[0x7fa793d854e6]
[shuttle:04178] [16] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9cd3e)[0x7fa798372d3e]
[shuttle:04178] [17] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9ce4d)[0x7fa798372e4d]
[shuttle:04178] [18] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(+0x9d11c)[0x7fa79837311c]
[shuttle:04178] [19] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x2ab)[0x7fa798373783]
[shuttle:04178] [20] 
/home/oscar/ompi-install/lib/libopen-pal.so.0(opal_progress+0x88)[0x7fa79830cea1]
[shuttle:04178] [21] 
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50b5b)[0x7fa79894db5b]
[shuttle:04178] [22] 
/home/oscar/ompi-install/lib/libmpi.so.0(+0x50c00)[0x7fa79894dc00]
[shuttle:04178] [23] 
/home/oscar/ompi-install/lib/libmpi.so.0(ompi_request_default_wait+0x27)[0x7fa79894dc50]
[shuttle:04178] [24] 
/home/oscar/ompi-install/lib/libmpi.so.0(PMPI_Wait+0x130)[0x7fa7989a4627]
[shuttle:04178] [25] 
/home/oscar/ompi-install/lib/libmpi_java.so.0.0.0(Java_mpi_Request_waitFor+0x2d)[0x7fa798c6f367]
[shuttle:04178] [26] [0x7fa79fd6d088]
[shuttle:04178] *** End of error message ***
[shuttle][[51156,1],0][btl_tcp_frag.c:228:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[shuttle:04177] pml_ob1_sendreq.c:187 FATAL
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node shuttle exited on 
signal 6 (Aborted).
--------------------------------------------------------------------------

@jsquyres
Member

I'm kinda running out of ideas here -- how do we track this down?

@goodell
Member

goodell commented Feb 11, 2015

I'd still proceed by trying to get some memory debugger like DUMA, Electric Fence, NJAMD, etc. that will add red zones to all allocations and trap accesses with mprotect() or similar. I'd probably poke at DUMA first, using LD_DEBUG to figure out why the LD_PRELOAD isn't having the expected effect. Something like LD_DEBUG=bindings,reloc,symbols,libs will give you a lot of output, but it is likely to show what is happening.

You could also try running Valgrind on the JVM, but it will probably be tricky unless you have a bit of experience with Valgrind. There's a SO post with some suggestions here: http://stackoverflow.com/questions/9216815/valgrind-and-java

@ggouaillardet
Contributor

@hppritcha I think you understood correctly what I meant.

Per the Java documentation:

The contents of direct buffers may reside outside of the normal garbage-collected heap

so unless I am misreading the English (which happens quite often ...), this is equivalent to:

The contents of direct buffers may reside inside of the normal garbage-collected heap,
and might be freed by the garbage collector when there are no more references to the
direct buffer

I made PR #815 and it solves the issue for me, as long as I run with --mca mtl ^psm, but that is a different story.
Strictly speaking, we could consider this a bug on the user side, but since it is very tricky to debug, I'd rather have OMPI handle this case transparently from the end-user point of view.

Generally speaking, I am wondering whether this fix is enough.
For example, could the garbage collector free a buffer after a call to MPI_Isend and before the message is sent? The same goes for MPI_Irecv (even if it is generally dumb to receive a message and never check its content ...).

Bottom line: I think we might have to keep a pointer to the buffer in the Request class instead of the Prequest class.

Any ideas?

In the meantime, can you confirm the test runs just fine on hopper with this PR?
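To make that concern concrete, here is a hypothetical sketch (not code from the bindings; it assumes an iSend method with the same argument order as the send call in the test above) of how a direct buffer can lose its last Java reference while MPI still holds its native address:

import mpi.*;
import java.nio.ByteBuffer;

class DanglingBuffer
{
    static Request fireAndForget() throws MPIException
    {
        // The only Java reference to this direct buffer is a local variable.
        ByteBuffer buf = MPI.newByteBuffer(4096);

        // MPI now holds the raw native address of 'buf', but (before a fix)
        // the returned Request keeps no Java reference to the buffer.
        Request req = MPI.COMM_WORLD.iSend(buf, 4096, MPI.BYTE, 1, 0);

        // Once this method returns, 'buf' is unreachable from Java, so the
        // GC may reclaim it before the send has actually completed.
        return req;
    }
}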

@osvegis
Contributor Author

osvegis commented Aug 18, 2015

Thanks Gilles.
I tested PR #815 and it also solves the issue for me.
I think it is better to hold a reference to the buffer in the Request class.

@jsquyres
Member

@ggouaillardet your explanation makes sense to me, but as @osvegis can attest, I am far from a Java expert. 😄

@hppritcha
Member

I think what we will probably do is rewrite the direct buffer allocator methods in the MPI class to actually use a native method to create the buffer object. That, plus using the NewGlobalRef method, should prevent the GC from moving/deleting any of the buffers allocated for MPI calls. That hopefully will also solve some of the problems we have been seeing trying to reproduce @osvegis's results in the OMPI Java paper.

@osvegis
Contributor Author

osvegis commented Aug 18, 2015

Direct buffers are never moved. They reside outside the Java heap, but when there are no longer any references to them, they are destroyed by the GC. If we keep a reference in the Request class, the buffer won't be destroyed.
I think @ggouaillardet's suggestion is better.

@nrgraham23
Contributor

I am going to work on a solution that solves the problem by allocating the buffers on the C side, as @hppritcha suggested, to see if this also fixes the poor performance we have been seeing. I'll create a PR with the changes when I complete them.

@osvegis
Contributor Author

osvegis commented Aug 18, 2015

@hppritcha and @nrgraham23,
You are wrong. The only difference you'll get is that users must deallocate direct buffers manually.
Java already allocates buffers natively as you want.

@hppritcha
Member

Okay, I don't want to use the patch from Gilles as-is. It is only a band-aid for one test. There are many more places -- basically all the non-blocking pt2pt and collectives -- where the same problem effectively exists. Also, for some of the functions, like iGatherv, there are two buffers associated with the request. I think we've just been getting lucky with the way the Java bindings have been used that we've not seen this problem elsewhere.

@nrgraham23
Contributor

Howard and I discussed a more robust solution that uses the idea Gilles suggested. We plan to use an array list to store the buffers, so we can store a variable number of them, and we will move it to the Request class so the non-blocking pt2pt and collective operations can store their buffers as well. Additionally, instead of modifying or adding constructors, we will add a single method that must be called to add the buffers to the array list.

I'll try to get this PR up tonight so it can be discussed.

nrgraham23 pushed a commit to nrgraham23/ompi that referenced this issue Aug 18, 2015
This pull request adds an arraylist of type Buffer to
the Request class.  Whenever a request object is created
that has associated buffers, the buffers should be added
to this array list so the java garbage collector does
not dispose of the buffers prematurely.

This is a more robust expansion on the idea first proposed by
@ggouaillardet

Fixes open-mpi#369

Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
@nrgraham23
Contributor

How does this look?

#820
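In rough terms, the change is along these lines (a simplified, illustrative sketch only -- the field and method names follow the discussion below, and the actual diff in the PR may differ):

import java.nio.Buffer;
import java.util.ArrayList;

class Request
{
    // Keep Java-visible references to any direct buffers handed to a
    // non-blocking or persistent operation, so the garbage collector
    // cannot reclaim them while MPI still holds their native addresses.
    private final ArrayList<Buffer> buffers = new ArrayList<Buffer>();

    // Called by the bindings whenever a request is created with
    // associated buffers (send and/or receive).
    protected void addBufRef(Buffer buf)
    {
        if(buf != null)
            buffers.add(buf);
    }
}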

@ggouaillardet
Contributor

@nrgraham23 I did not test this yet, but at first glance it looks good to me.
Are there any cases in which a request can reference more than two buffers?
If not, then from a performance point of view I'd rather replace
ArrayList<Buffer> buffers; and addBufRef() with
Buffer sendBuf, recvBuf;, addSendBufRef() and addRecvBufRef().

@nrgraham23
Contributor

At the moment there are not (as far as I am aware). We decided to use an array list in case there comes a time when there are more than two buffers. I think I agree with you on the suggested change, but I will wait for others to weigh in as well. I'll also see if I can find some more information about the performance hit associated with each option.

@ggouaillardet
Contributor

@nrgraham23 in order to make garbage collection more efficient, should the references to the buffers be removed when a non-persistent request completes?

@nrgraham23
Contributor

@ggouaillardet We could do something like that, but there would be performance costs associated with it. We could potentially set the references to null in the waitFor method, but we would have to verify that the request is not persistent. There could also be extra cost in other methods, like testStatus, which may be called multiple times before the request actually completes. In cases like that we would have to check both that the reference is not null and that the request is not persistent.

I do not think it would be worthwhile to remove the references.

@ggouaillardet
Contributor

@nrgraham23 the cost of setting a pointer to null is very small compared to allocating an object.
For example, at the end of the waitFor method it is possible to add a call to this.removeBufRefs():
in the Request class, removeBufRefs() removes the buffer references,
and the Prequest class overrides removeBufRefs() to simply return.
As you pointed out, removeBufRefs() should also be invoked by testStatus and friends when a request completes.
It might not be worth it, but IMHO the performance concern is not a valid reason not to do it.
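A simplified sketch of that scheme (illustrative only, reusing the hypothetical buffers field from the sketch above):

class Request
{
    private final java.util.ArrayList<java.nio.Buffer> buffers =
        new java.util.ArrayList<java.nio.Buffer>();

    // Drop the buffer references once a non-persistent request completes,
    // so the GC can reclaim the buffers sooner.  waitFor(), testStatus()
    // and friends would call this when they observe completion.
    protected void removeBufRefs()
    {
        buffers.clear();
    }
}

class Prequest extends Request
{
    // Persistent requests are restarted with the same buffers,
    // so keep the references: override with a no-op.
    @Override
    protected void removeBufRefs()
    {
    }
}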

@nrgraham23
Contributor

@ggouaillardet Obviously setting a pointer to null costs very little; I was referring more to the additional checks we would also have to do. Your suggestion for removing the buffers would cause a useless method call in the case of Prequests. It is not a bad idea, but since the Request object is likely to go out of scope shortly after the request completes anyway, I do not think the extra work is worthwhile.

nrgraham23 pushed a commit to nrgraham23/ompi that referenced this issue Aug 19, 2015
tkordenbrock pushed a commit to tkordenbrock/ompi that referenced this issue Aug 28, 2015
hjelmn pushed a commit to hjelmn/ompi that referenced this issue Sep 8, 2015
bosilca pushed a commit to bosilca/ompi that referenced this issue Sep 15, 2015
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 8, 2015
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Nov 10, 2015
…out_of_credit_fix

btl/openib: queue pending fragments once only when running out of credit
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Nov 10, 2015
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Sep 19, 2016
This pull request adds two Buffer references to the
Request class. Whenever a request object is created
that has associated buffers, the buffers should be added
to these references so the java garbage collector does
not dispose of the buffers prematurely.

This is a more robust expansion on the idea first proposed by
@ggouaillardet, and further discussed in open-mpi#369.

Fixes open-mpi#369

Manual cherry-pick from d363b5d

Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>