
Segmentation fault in simple program using MPI_Comm_accept()/connect() #4153

Closed
awlauria opened this issue Aug 30, 2017 · 10 comments


awlauria commented Aug 30, 2017

Updated, new info 9/1/17

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.0.0rc4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

Red Hat 7.3
Power and X86

Details of the problem

I have a simple test that hits a segmentation fault intermittently in all v3.0 rc's.
The test seems to pass on the master branch, though it's not apparent where the bug got 'fixed'. I attached a sample test case that fails when run on a single node with two tasks. It may not fail every time, but if you run it in a loop you will hit the segmentation fault after some runs. The location of the segmentation fault changes from run to run, so below is an example stack trace.

I can only get it to crash using an optimized build.

Running with valgrind doesn't show any heap corruption, even in the failing case, so it seems to be stack related, unless valgrind is missing something.

#0  0x000010000050eb18 in raise () from /lib64/libc.so.6
#1  0x0000100000510c9c in abort () from /lib64/libc.so.6
#2  0x0000100000555784 in __libc_message () from /lib64/libc.so.6
#3  0x000010000055f800 in malloc_consolidate () from /lib64/libc.so.6
#4  0x0000100000561be4 in _int_malloc () from /lib64/libc.so.6
#5  0x00001000005645ec in malloc () from /lib64/libc.so.6
#6  0x000010000059fc58 in __alloc_dir () from /lib64/libc.so.6
#7  0x000010000059fdcc in __opendirat () from /lib64/libc.so.6
#8  0x000010000059fe30 in opendir () from /lib64/libc.so.6
#9  0x00001000001f1004 in opal_os_dirpath_is_empty () from /smpi_dev/awlauria/ompi/exports/lib/libopen-pal.so.40
#10 0x00001000000a8d10 in orte_session_dir_cleanup () from /smpi_dev/awlauria/ompi/exports/lib/libopen-rte.so.40
#11 0x00000000100018f4 in ?? ()
#12 0x0000000010001060 in main (argc=-1, argv=0x0) at allgather_inter.c:43

sample run:

`mpirun -np 2 ./simple_test`

simple_test.zip
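
Since the attachment isn't inlined here, a minimal sketch of the kind of accept/connect test described above might look like the following. This is an assumption of the reproducer's shape, not the attached source; in particular, the use of MPI_Publish_name/MPI_Lookup_name to exchange the port string is hypothetical.

```c
/* Hypothetical two-rank accept/connect reproducer; the actual test is
 * in the attached simple_test.zip. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("simple_test", MPI_INFO_NULL, port);
        MPI_Barrier(MPI_COMM_WORLD);   /* publish before rank 1 looks up */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("simple_test", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Lookup_name("simple_test", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

If the port string were exchanged via publish/lookup as sketched, those calls would go through mpirun's ORTE data server.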

@awlauria awlauria changed the title Segmentation fault in Intercommunicator collectives Segmentation fault in when using intercommunicators Aug 30, 2017
@awlauria awlauria changed the title Segmentation fault in when using intercommunicators Segmentation fault when using intercommunicators Aug 30, 2017
@jjhursey

It's strange that a very similar test case is passing in MTT.

The stack looks like data corruption. How did you configure Open MPI?


awlauria commented Aug 31, 2017

The only configure option I used for these tests was --with-ucx=no.

@awlauria

I can't seem to get it to fail with --enable-debug and --enable-memchecker.

Valgrind runs on these show no apparent issues.

@gpaulsen

IBM is enabling MTT testing of optimized builds of the v3.0.x branch tonight. I'll send an email to devel-core.

@awlauria awlauria reopened this Aug 31, 2017
@awlauria

- I still cannot get this to fail on a debug build.
- Since valgrind is giving nothing useful, it seems to be stack related.
- I can reproduce on Power and x86 with 3.0.0rc4.

@awlauria awlauria changed the title Segmentation fault when using intercommunicators Segmentation fault in simple program using MPI_Comm_accept()/connect() Sep 1, 2017

awlauria commented Sep 1, 2017

Updated 9/1 to reflect new info.

The failure doesn't seem to be intercommunicator related. I attached a new simple reproducer.

@ggouaillardet ggouaillardet self-assigned this Sep 4, 2017
@ggouaillardet

@awlauria thanks for the report, I'll take a crack at it.
So far, I have been able to reproduce the issue.

This is a very odd memory corruption in orte_data_server().
The error can be observed by running valgrind on mpirun:

valgrind mpirun -np 2 ./test_simple

Unfortunately, the error does not always occur.

I say this is odd because, at first, I was able to trace at line 355 (IIRC):

opal_pointer_array_set_item(&orte_data_server_store, data->index, NULL);
OBJ_RELEASE(data);

but then, at line 453 (IIRC),

data = (orte_data_object_t*)opal_pointer_array_get_item(&orte_data_server_store, k);

with the same index returns the previously OBJ_RELEASE'd data instead of NULL.

I will keep digging tomorrow.
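
To make that invariant concrete, here is a minimal sketch of the failing check, assuming the OPAL pointer-array API from opal/class/opal_pointer_array.h (the helper name is hypothetical):

```c
/* Sketch only: after clearing a slot, reading the same slot back must
 * yield NULL. On the broken v3.0 rc builds, the stale, already
 * OBJ_RELEASE'd pointer came back instead. */
#include <assert.h>
#include "opal/class/opal_pointer_array.h"

static void check_slot_cleared(opal_pointer_array_t *store, int index)
{
    opal_pointer_array_set_item(store, index, NULL);
    assert(NULL == opal_pointer_array_get_item(store, index));
}
```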

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Sep 5, 2017
correctly balance some parenthesis ...

Fixes open-mpi#4153

Thanks Austen Lauria for the report

This is a one-off commit for the v3.0.x branch, master was fixed as part of a larger commit,
and the v2 branches are unaffected.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet

@awlauria this is fixed in #4167

FWIW, opal_pointer_array_set_item() was incorrectly defined in a macro (the parentheses were not correctly balanced), so the bug was hidden when Open MPI is built with --enable-debug.
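
To illustrate the class of bug (a generic sketch, not the actual Open MPI macro): when a function-like macro is not fully parenthesized, textual substitution can group an expression differently than the equivalent function call would, so the macro path silently computes a different value than the debug (function) path:

```c
/* Generic illustration of a missing-parenthesis macro bug; the names
 * are hypothetical and this is not the Open MPI source. */
#include <assert.h>

#define NEXT_SLOT_BAD(i)  i + 1        /* body not parenthesized */
#define NEXT_SLOT_OK(i)   ((i) + 1)    /* correctly parenthesized */

int main(void)
{
    int k = 2;
    /* NEXT_SLOT_BAD(k) * 2 expands to k + 1 * 2 == 4, while the
     * intended value is (k + 1) * 2 == 6; the macro build would
     * quietly index a different slot than the function build. */
    assert(4 == NEXT_SLOT_BAD(k) * 2);
    assert(6 == NEXT_SLOT_OK(k) * 2);
    return 0;
}
```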


awlauria commented Sep 5, 2017

Great, thank you @ggouaillardet for looking at it.


awlauria commented Sep 8, 2017

I confirmed it does not happen on master or on the new v3.0.0rc5 build.

Fixed by #4167.

@awlauria awlauria closed this as completed Sep 8, 2017