Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
5.0.7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Debian Trixie package libopenmpi40
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
- Operating system/version: Debian Trixie
- Computer hardware: Docker container (GitLab CI)
- Network type:
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
I have a package, vfplot, which reads files in the Gerris GFS format (fluid dynamics) using the Gerris library. It calls gfs_init here:
int gfs_csv(gfs_csv_t *opt)
{
  int err = 0;
  gfs_init(&err, NULL);
which in turn calls MPI_Init here:
  int argc1 = 1;
  char ** argv1;
  argv1 = g_malloc (sizeof (char *));
  argv1[0] = g_strdup ("gfs_init");
  MPI_Init (&argc1, &argv1);
  g_free (argv1[0]); g_free (argv1);

This is mature code with a decent test suite and CI. I recently tried updating the CI to Debian Trixie and found that it now fails with segfaults, something I have never seen before. In the first case, an acceptance test, the backtrace is:
# (in test file ./gfs-csv.bats, line 77)
# `[ $status -eq 0 ]' failed
# Last output:
# This is gfs-csv (version 2.0.2)
# [runner-t3kwblnv-project-6939729-concurrent-4:12596:0:12596] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x21)
# ==== backtrace (tid: 12596) ====
# 0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2bc) [0x7f60e694a64c]
# 1 /lib/x86_64-linux-gnu/libucs.so.0(+0x3182f) [0x7f60e694a82f]
# 2 /lib/x86_64-linux-gnu/libucs.so.0(+0x319fa) [0x7f60e694a9fa]
# 3 /lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x7f60e7221df0]
# 4 /lib/x86_64-linux-gnu/libc.so.6(+0x16fc59) [0x7f60e7351c59]
# 5 /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_argv_join+0x45) [0x7f60e6824cc5]
# 6 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_rte_init+0x892) [0x7f60e6e96c92]
# 7 /lib/x86_64-linux-gnu/libmpi.so.40(+0x9b42a) [0x7f60e6e9b42a]
# 8 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_instance_init+0x68) [0x7f60e6e9c188]
# 9 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x80) [0x7f60e6e93720]
# 10 /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6f) [0x7f60e6ec449f]
# 11 /lib/x86_64-linux-gnu/libgfs2D-1.3.so.2(gfs_init+0x14f) [0x7f60e76db30f]
# 12 ../../gfs-csv/gfs-csv(gfs_csv+0x1d) [0x557f6722b52d]
# 13 ../../gfs-csv/gfs-csv(main+0x130) [0x557f67229450]
# 14 /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f60e720bca8]
# 15 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f60e720bd65]
# 16 ../../gfs-csv/gfs-csv(_start+0x21) [0x557f672295e1]
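In case it helps, here is a minimal standalone program that mimics the same call pattern. It's a sketch reconstructed from the Gerris snippets above (plain malloc/strdup standing in for g_malloc/g_strdup), not the exact library code:

#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(void)
{
  /* Synthesize a one-element argv, as gfs_init does; note that, like
     the original, there is no terminating NULL element after argv1[0]. */
  int argc1 = 1;
  char **argv1 = malloc(sizeof(char *));
  argv1[0] = strdup("gfs_init");

  MPI_Init(&argc1, &argv1);

  free(argv1[0]);
  free(argv1);

  MPI_Finalize();
  return 0;
}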
Peculiarities
- always happens with gcc-12, 13, 14; never happens with clang-17, 18, 19
- no segfault with -fsanitize=address
- no segfault under valgrind
The second failure is a unit test linked with Electric Fence, which complains about a malloc(0) and errors out; but that's entirely well-defined (and returns NULL), so I added EF_ALLOW_MALLOC_0=1 to the environment for that test. The result is the same(ish) segfault:
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2bc) [0x7f622a24164c]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x3182f) [0x7f622a24182f]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x319fa) [0x7f622a2419fa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x3fdf0) [0x7f622b25adf0]
4 /lib/x86_64-linux-gnu/libopen-pal.so.80(opal_argv_join+0x45) [0x7f622a11bcc5]
5 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_rte_init+0x892) [0x7f622ae96c92]
6 /lib/x86_64-linux-gnu/libmpi.so.40(+0x9b42a) [0x7f622ae9b42a]
7 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_instance_init+0x68) [0x7f622ae9c188]
8 /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x80) [0x7f622ae93720]
9 /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6f) [0x7f622aec449f]
10 /lib/x86_64-linux-gnu/libgfs2D-1.3.so.2(gfs_init+0x14f) [0x7f622b93b30f]
11 ./unit-vf(field_read_gfs+0x21) [0x5561ac2c0fc1]
12 ./unit-vf(field_read+0x544) [0x5561ac2c0564]
13 ./unit-vf(test_field_read_gfs_absent+0x2c) [0x5561ac2be9ec]
14 /lib/x86_64-linux-gnu/libcunit.so.1(+0x4a83) [0x7f622b9c3a83]
15 /lib/x86_64-linux-gnu/libcunit.so.1(+0x4cd8) [0x7f622b9c3cd8]
16 /lib/x86_64-linux-gnu/libcunit.so.1(CU_run_all_tests+0x58) [0x7f622b9c4138]
17 ./unit-vf(main+0x3f) [0x5561ac2be53f]
18 /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f622b244ca8]
19 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f622b244d65]
20 ./unit-vf(_start+0x21) [0x5561ac2be5e1]
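As an aside on the malloc(0) point: the C standard permits malloc(0) to return either NULL or a unique pointer that can safely be passed to free(), so the Electric Fence complaint is a false alarm here. A trivial sketch, nothing to do with MPI, purely to illustrate:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  void *p = malloc(0);  /* implementation-defined: NULL or a unique, freeable pointer */
  printf("malloc(0) returned %p\n", p);
  free(p);              /* free(NULL) is a no-op, so this is safe either way */
  return 0;
}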
I'm well used to tracking down segfaults, but these have me baffled. So my question: has anyone seen this sort of thing before, or have an idea of where the issue could be?
Thanks in advance.