
Segfault in MPI_Init #13520

@noproblemwiththat

Description


Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.0.7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Debian Trixie package libopenmpi40

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A (installed from a distribution package, not a git clone).

Please describe the system on which you are running

  • Operating system/version: Debian Trixie
  • Computer hardware: Docker container (GitLab CI)
  • Network type:

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I have a package, vfplot, which reads files in the Gerris GFS format (fluid dynamics) using the Gerris library. It calls gfs_init here

int gfs_csv(gfs_csv_t *opt)
{
  int err = 0;
  gfs_init(&err, NULL);
  /* ... */

which in turn calls MPI_Init here

int argc1 = 1;
char ** argv1;
argv1 = g_malloc (sizeof (char *));
argv1[0] = g_strdup ("gfs_init");
MPI_Init (&argc1, &argv1);
g_free (argv1[0]); g_free (argv1);
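
As an aside, since the issue template asks for a small reproducer: a standalone program mirroring this pattern, minus Gerris and glib, would look something like the sketch below. I have not confirmed that it crashes the same way, and the file name and build command are just placeholders.

/* repro.c -- sketch mirroring the gfs_init pattern above; build with
   something like: mpicc repro.c -o repro.
   Note that argv1 holds exactly one element with no NULL sentinel
   after it, unlike the argv the C runtime hands to main (where
   argv[argc] is guaranteed to be NULL). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  int argc1 = 1;
  char **argv1 = malloc(sizeof(char *));   /* one slot, no NULL terminator */
  argv1[0] = strdup("repro");
  MPI_Init(&argc1, &argv1);
  free(argv1[0]);
  free(argv1);
  MPI_Finalize();
  return 0;
}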

This is mature code with a decent test suite and CI. I recently tried updating the CI to Debian Trixie and found that it fails with segfaults, something I have never seen before. In the first case, an acceptance test, the backtrace is

# (in test file ./gfs-csv.bats, line 77)
#   `[ $status -eq 0 ]' failed
# Last output:
# This is gfs-csv (version 2.0.2)
# [runner-t3kwblnv-project-6939729-concurrent-4:12596:0:12596] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x21)
# ==== backtrace (tid:  12596) ====
#  0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2bc) [0x7f60e694a64c]
#  1  /lib/x86_64-linux-gnu/libucs.so.0(+0x3182f) [0x7f60e694a82f]
#  2  /lib/x86_64-linux-gnu/libucs.so.0(+0x319fa) [0x7f60e694a9fa]
#  3  /lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x7f60e7221df0]
#  4  /lib/x86_64-linux-gnu/libc.so.6(+0x16fc59) [0x7f60e7351c59]
#  5  /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_argv_join+0x45) [0x7f60e6824cc5]
#  6  /lib/x86_64-linux-gnu/libmpi.so.40(ompi_rte_init+0x892) [0x7f60e6e96c92]
#  7  /lib/x86_64-linux-gnu/libmpi.so.40(+0x9b42a) [0x7f60e6e9b42a]
#  8  /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_instance_init+0x68) [0x7f60e6e9c188]
#  9  /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x80) [0x7f60e6e93720]
# 10  /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6f) [0x7f60e6ec449f]
# 11  /lib/x86_64-linux-gnu/libgfs2D-1.3.so.2(gfs_init+0x14f) [0x7f60e76db30f]
# 12  ../../gfs-csv/gfs-csv(gfs_csv+0x1d) [0x557f6722b52d]
# 13  ../../gfs-csv/gfs-csv(main+0x130) [0x557f67229450]
# 14  /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f60e720bca8]
# 15  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f60e720bd65]
# 16  ../../gfs-csv/gfs-csv(_start+0x21) [0x557f672295e1]

Peculiarities

  • always happens with gcc-12, 13, and 14; never with clang-17, 18, and 19
  • no segfault with -fsanitize=address
  • no segfault under valgrind
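
For what it's worth, these peculiarities would be consistent with an out-of-bounds heap read whose outcome depends on allocator layout, which is exactly what changing the compiler or adding instrumentation perturbs; that is a guess on my part. Both backtraces fault under opal_argv_join, and if Open MPI's runtime walks the argv array looking for a terminating NULL (an assumption I have not checked against the source), the one-element argv1 above has no sentinel to stop it. A NULL-terminated variant to test would be

/* sketch, untested: the same fake argv as in gfs_init, but allocated
   with room for a NULL sentinel, in case the runtime walks argv to a
   terminating NULL */
int argc1 = 1;
char ** argv1;
argv1 = g_malloc (2 * sizeof (char *));
argv1[0] = g_strdup ("gfs_init");
argv1[1] = NULL;
MPI_Init (&argc1, &argv1);
g_free (argv1[0]); g_free (argv1);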

The second failure is a unit test linked against Electric Fence, which complains about a malloc(0) and errors out. But malloc(0) is well-defined (it returns either NULL or a unique pointer that can be freed), so I add EF_ALLOW_MALLOC_0=1 to the environment for the test. The result is the same(ish) segfault

 0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2bc) [0x7f622a24164c]
 1  /lib/x86_64-linux-gnu/libucs.so.0(+0x3182f) [0x7f622a24182f]
 2  /lib/x86_64-linux-gnu/libucs.so.0(+0x319fa) [0x7f622a2419fa]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x7f622b25adf0]
 4  /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_argv_join+0x45) [0x7f622a11bcc5]
 5  /lib/x86_64-linux-gnu/libmpi.so.40(ompi_rte_init+0x892) [0x7f622ae96c92]
 6  /lib/x86_64-linux-gnu/libmpi.so.40(+0x9b42a) [0x7f622ae9b42a]
 7  /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_instance_init+0x68) [0x7f622ae9c188]
 8  /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x80) [0x7f622ae93720]
 9  /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6f) [0x7f622aec449f]
10  /lib/x86_64-linux-gnu/libgfs2D-1.3.so.2(gfs_init+0x14f) [0x7f622b93b30f]
11  ./unit-vf(field_read_gfs+0x21) [0x5561ac2c0fc1]
12  ./unit-vf(field_read+0x544) [0x5561ac2c0564]
13  ./unit-vf(test_field_read_gfs_absent+0x2c) [0x5561ac2be9ec]
14  /lib/x86_64-linux-gnu/libcunit.so.1(+0x4a83) [0x7f622b9c3a83]
15  /lib/x86_64-linux-gnu/libcunit.so.1(+0x4cd8) [0x7f622b9c3cd8]
16  /lib/x86_64-linux-gnu/libcunit.so.1(CU_run_all_tests+0x58) [0x7f622b9c4138]
17  ./unit-vf(main+0x3f) [0x5561ac2be53f]
18  /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f622b244ca8]
19  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f622b244d65]
20  ./unit-vf(_start+0x21) [0x5561ac2be5e1]

I'm well used to tracing down segfaults, but these have me baffled. So my question: has anyone seen this sort of thing before, or have an idea of where the issue could be?

Thanks in advance.
