Invalid memory access in vader / opal_free_list_destruct #6648

Closed
q-p opened this issue May 10, 2019 · 3 comments
q-p commented May 10, 2019

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Binary installation from homebrew (open-mpi/4.0.1_1)

Please describe the system on which you are running

  • Operating system/version: macOS Mojave 10.14.4
  • Computer hardware: 2013 Mac Pro
  • Network type: local only (using vader, it seems)

Details of the problem

There seems to be an invalid memory access during MPI_Finalize() when using vader with more than one process. The following simple example

#include <mpi.h>

int main (int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

when run with libgmalloc (man libgmalloc on a Mac for more info) as

mpirun -np 2 -x DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib ./a.out

leads to the following segmentation fault (on both processes):

[1,0]<stderr>:GuardMalloc[a.out-15800]: Allocations will be placed on 16 byte boundaries.
[1,0]<stderr>:GuardMalloc[a.out-15800]:  - Some buffer overruns may not be noticed.
[1,0]<stderr>:GuardMalloc[a.out-15800]:  - Applications using vector instructions (e.g., SSE) should work.
[1,0]<stderr>:GuardMalloc[a.out-15800]: version 109
[1,0]<stderr>:[Seerose:15800] *** Process received signal ***
[1,0]<stderr>:[Seerose:15800] Signal: Segmentation fault: 11 (11)
[1,0]<stderr>:[Seerose:15800] Signal code: Address not mapped (1)
[1,0]<stderr>:[Seerose:15800] Failing at address: 0x10d953f50
[1,0]<stderr>:[Seerose:15800] [ 0] 0   libsystem_platform.dylib            0x00007fff76551b5d _sigtramp + 29
[1,0]<stderr>:[Seerose:15800] [ 1] 0   ???                                 0x000000010a498b1c 0x0 + 4467559196
[1,0]<stderr>:[Seerose:15800] [ 2] 0   libopen-pal.40.dylib                0x0000000103a8295a opal_free_list_destruct + 231
[1,0]<stderr>:[Seerose:15800] [ 3] 0   mca_btl_vader.so                    0x000000010ad33a95 mca_btl_vader_component_close + 42
[1,0]<stderr>:[Seerose:15800] [ 4] 0   libopen-pal.40.dylib                0x0000000103aa2f1b mca_base_component_close + 27
[1,0]<stderr>:[Seerose:15800] [ 5] 0   libopen-pal.40.dylib                0x0000000103aa2fbe mca_base_components_close + 94
[1,0]<stderr>:[Seerose:15800] [ 6] 0   libopen-pal.40.dylib                0x0000000103aa2f5c mca_base_framework_components_close + 24
[1,0]<stderr>:[Seerose:15800] [ 7] 0   libopen-pal.40.dylib                0x0000000103abcb35 mca_btl_base_close + 115
[1,0]<stderr>:[Seerose:15800] [ 8] 0   libopen-pal.40.dylib                0x0000000103aab56d mca_base_framework_close + 254
[1,0]<stderr>:[Seerose:15800] [ 9] 0   libopen-pal.40.dylib                0x0000000103aab56d mca_base_framework_close + 254
[1,0]<stderr>:[Seerose:15800] [10] 0   libmpi.40.dylib                     0x00000001038cfed4 ompi_mpi_finalize + 2170
[1,0]<stderr>:[Seerose:15800] [11] 0   a.out                               0x000000010389cf7b main + 43
[1,0]<stderr>:[Seerose:15800] [12] 0   libdyld.dylib                       0x00007fff7636c3d5 start + 1
[1,0]<stderr>:[Seerose:15800] *** End of error message ***
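
For readers unfamiliar with this failure mode: the trace places the fault inside opal_free_list_destruct, reached from mca_btl_vader_component_close, but it does not show which access was invalid. As a generic illustration only, here is a minimal sketch of a list teardown that stays valid under a guard allocator by reading each node's next pointer before releasing the node. All names (node_t, list_push, list_destruct) are hypothetical. This is not Open MPI's opal_free_list code, and it is not a claim about the actual root cause.

```c
#include <stdlib.h>

/* Hypothetical singly linked list node; NOT Open MPI's opal_free_list_t. */
typedef struct node {
    struct node *next;
    int payload;
} node_t;

/* Push a new node onto the list.
 * Returns the new head, or the unchanged head if allocation fails. */
static node_t *list_push(node_t *head, int payload)
{
    node_t *n = malloc(sizeof *n);
    if (n == NULL) {
        return head;
    }
    n->payload = payload;
    n->next = head;
    return n;
}

/* Release every node and return how many were freed.
 * The next pointer is copied out BEFORE free(); reading head->next after
 * free(head) would be a use-after-free, which a guard allocator turns
 * into an immediate "address not mapped" fault. */
static int list_destruct(node_t *head)
{
    int released = 0;
    while (head != NULL) {
        node_t *next = head->next;
        free(head);
        head = next;
        released++;
    }
    return released;
}
```

Building three nodes with list_push and passing the head to list_destruct frees all three. Swapping the order of the read and the free inside the loop is the kind of mistake that usually goes unnoticed with the default allocator and faults immediately under libgmalloc.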

bosilca (Member) commented May 11, 2019

b5f79c2 needs to be cherry-picked on the release branches.

ggouaillardet (Contributor) commented

@bosilca I made the PR for v4.0.x. The commit you referred to is already there, and I cherry-picked the latest one. Note that I also removed some dead code that is only relevant to master and only causes warnings in the release branch.
I will open PRs against the other release branches once #6651 is approved.

jsquyres (Member) commented

After looking at v3.0.x and v3.1.x, I think that the fixes @ggouaillardet made in #6651 are not applicable to v3.0.x and v3.1.x. Since #6651 has been merged to v4.0.x, I think we can close this issue.
