
Segmentation fault in simple program using MPI_Comm_accept()/connect() #4153

Closed
awlauria opened this issue Aug 30, 2017 · 10 comments


awlauria commented Aug 30, 2017

Updated, new info 9/1/17

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.0.0rc4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

Red Hat 7.3
Power and X86

Details of the problem

I have a simple test that hits a segmentation fault intermittently in all v3.0 rc's.
The test seems to pass on the master branch, though it's not apparent where the bug got 'fixed'. I attached a sample test case that fails when run on a single node with two tasks. It may not fail every time, but if you run it in a loop you will hit the segmentation fault after some runs. The location of the segmentation fault changes from run to run, so below is an example stack trace.

I can only get it to crash using an optimized build.

Running with valgrind doesn't show any heap corruption, even in the failing case, so it seems to be stack related, unless valgrind is missing something.

#0  0x000010000050eb18 in raise () from /lib64/libc.so.6
#1  0x0000100000510c9c in abort () from /lib64/libc.so.6
#2  0x0000100000555784 in __libc_message () from /lib64/libc.so.6
#3  0x000010000055f800 in malloc_consolidate () from /lib64/libc.so.6
#4  0x0000100000561be4 in _int_malloc () from /lib64/libc.so.6
#5  0x00001000005645ec in malloc () from /lib64/libc.so.6
#6  0x000010000059fc58 in __alloc_dir () from /lib64/libc.so.6
#7  0x000010000059fdcc in __opendirat () from /lib64/libc.so.6
#8  0x000010000059fe30 in opendir () from /lib64/libc.so.6
#9  0x00001000001f1004 in opal_os_dirpath_is_empty () from /smpi_dev/awlauria/ompi/exports/lib/libopen-pal.so.40
#10 0x00001000000a8d10 in orte_session_dir_cleanup () from /smpi_dev/awlauria/ompi/exports/lib/libopen-rte.so.40
#11 0x00000000100018f4 in ?? ()
#12 0x0000000010001060 in main (argc=-1, argv=0x0) at allgather_inter.c:43

sample run:

`mpirun -np 2 ./simple_test`

simple_test.zip
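
Since the attachment isn't inlined here, a minimal sketch of the kind of accept/connect test described above might look like the following. This is an assumption of the reproducer's shape, not the attached source; in particular, the use of MPI_Publish_name/MPI_Lookup_name to exchange the port string is hypothetical.

```c
/* Hypothetical two-rank accept/connect reproducer; the actual test is
 * in the attached simple_test.zip. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("simple_test", MPI_INFO_NULL, port);
        MPI_Barrier(MPI_COMM_WORLD);   /* publish before rank 1 looks up */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("simple_test", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Lookup_name("simple_test", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

If the port string were exchanged via publish/lookup as sketched, those calls would go through mpirun's ORTE data server.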

@awlauria awlauria changed the title Segmentation fault in Intercommunicator collectives Segmentation fault in when using intercommunicators Aug 30, 2017
@awlauria awlauria changed the title Segmentation fault in when using intercommunicators Segmentation fault when using intercommunicators Aug 30, 2017
@jjhursey

It's strange that a very similar test case is passing in MTT.

The stack looks like data corruption. How did you configure Open MPI?


awlauria commented Aug 31, 2017

The only configure option I used for these tests was --with-ucx=no.

@awlauria

I can't seem to get it to fail with --enable-debug and --enable-memchecker.

Valgrind runs on these show no apparent issues.

@gpaulsen

IBM is enabling MTT testing of optimized builds of the v3.0.x branch tonight. I'll send an email to devel-core.

@awlauria awlauria reopened this Aug 31, 2017
@awlauria

- I still cannot get this to fail on a debug build.
- Since valgrind is giving nothing useful, it seems to be stack related.
- I can reproduce on Power and x86 with 3.0.0rc4.

@awlauria awlauria changed the title Segmentation fault when using intercommunicators Segmentation fault in simple program using MPI_Comm_accept()/connect() Sep 1, 2017

awlauria commented Sep 1, 2017

Updated 9/1 to reflect new info.

The failure doesn't seem to be intercommunicator related. I attached a new simple reproducer.

@ggouaillardet ggouaillardet self-assigned this Sep 4, 2017
@ggouaillardet

@awlauria thanks for the report, I'll take a crack at it.
So far, I have been able to reproduce the issue.

This is a very odd memory corruption in orte_data_server().
The error can be observed by running valgrind on mpirun:

valgrind mpirun -np 2 ./test_simple

Unfortunately, the error does not always occur.

I say this is odd because, at first, I was able to trace at line 355 (IIRC):

opal_pointer_array_set_item(&orte_data_server_store, data->index, NULL);
OBJ_RELEASE(data);

but then, at line 453 (IIRC),

data = (orte_data_object_t*)opal_pointer_array_get_item(&orte_data_server_store, k);

with the same index returns the previously OBJ_RELEASE'd data instead of NULL.

I will keep digging tomorrow.
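
To make that invariant concrete, here is a minimal sketch of the failing check, assuming the OPAL pointer-array API from opal/class/opal_pointer_array.h (the helper name is hypothetical):

```c
/* Sketch only: after clearing a slot, reading the same slot back must
 * yield NULL. On the broken v3.0 rc builds, the stale, already
 * OBJ_RELEASE'd pointer came back instead. */
#include <assert.h>
#include "opal/class/opal_pointer_array.h"

static void check_slot_cleared(opal_pointer_array_t *store, int index)
{
    opal_pointer_array_set_item(store, index, NULL);
    assert(NULL == opal_pointer_array_get_item(store, index));
}
```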

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Sep 5, 2017
correctly balance some parenthesis ...

Fixes open-mpi#4153

Thanks Austen Lauria for the report

This is a one-off commit for the v3.0.x branch, master was fixed as part of a larger commit,
and the v2 branches are unaffected.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet

@awlauria this is fixed in #4167

FWIW, opal_pointer_array_set_item() was incorrectly defined in a macro (the parentheses were not correctly balanced), so the bug was hidden when Open MPI is built with --enable-debug.
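
To illustrate the class of bug (a generic sketch, not the actual Open MPI macro): when a function-like macro is not fully parenthesized, textual substitution can group an expression differently than the equivalent function call would, so the macro path silently computes a different value than the debug (function) path:

```c
/* Generic illustration of a missing-parenthesis macro bug; the names
 * are hypothetical and this is not the Open MPI source. */
#include <assert.h>

#define NEXT_SLOT_BAD(i)  i + 1        /* body not parenthesized */
#define NEXT_SLOT_OK(i)   ((i) + 1)    /* correctly parenthesized */

int main(void)
{
    int k = 2;
    /* NEXT_SLOT_BAD(k) * 2 expands to k + 1 * 2 == 4, while the
     * intended value is (k + 1) * 2 == 6; the macro build would
     * quietly index a different slot than the function build. */
    assert(4 == NEXT_SLOT_BAD(k) * 2);
    assert(6 == NEXT_SLOT_OK(k) * 2);
    return 0;
}
```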


awlauria commented Sep 5, 2017

Great, thank you @ggouaillardet for looking at it.


awlauria commented Sep 8, 2017

I confirmed it does not happen on master or on the new v3.0.0rc5 build.

Fixed by #4167.

@awlauria awlauria closed this as completed Sep 8, 2017