Running Spindle with an application executable built with the BIND_NOW option causes a segmentation fault. I saw the fault on an x86 cluster and an aarch64 cluster.
Reproduce steps
I confirmed the following reproduction steps on the x86 cluster.
The glibc (dynamic linker) version on the x86 cluster:
$ LC_ALL=C ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
$ gcc -Wl,-z,now -o hello_bind_now hello.c
$ SPINDLE_DEBUG=3 TMPDIR='/tmp' spindle --location='/tmp' mpiexec -np 1 spindlemarker $(pwd)/hello_bind_now
<Aug 31 16:19:45> <Launchmon> (INFO): The RM process has just been forked and exec'ed.
<Aug 31 16:19:45> <Launchmon> (INFO): Just continued the RM process out of the first trap
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 247311 RUNNING AT 10.xx.yy.zz
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
Expected results
Without the BIND_NOW option, the application runs correctly with Spindle.
$ gcc -o hello hello.c
$ SPINDLE_DEBUG=3 TMPDIR='/tmp' spindle --location='/tmp' mpiexec -np 1 spindlemarker $(pwd)/hello
<Aug 31 16:20:26> <Launchmon> (INFO): The RM process has just been forked and exec'ed.
<Aug 31 16:20:26> <Launchmon> (INFO): Just continued the RM process out of the first trap
Hello world!
Detail
In the debug output, the Spindle client appears to stop after the following log lines.
[Client.0.252100@auditclient_common.c:92] la_objopen - la_objopen(): loading /lib64/libc.so.6, link_map = 0x2b60c23859c8, lmid = LM_ID_BASE, cookie = 0x2b60c2385e30
[Client.0.252100@auditclient_common.c:116] la_activity - la_activity(): cookie = 0x2b60c25685c0; flag = LA_ACT_CONSISTENT
[Client.0.252100@rogot.c:30] remove_lib_rogot - Checking whether /lib64/libc.so.6 has R GOT
[Client.0.252100@rogot.c:41] remove_lib_rogot - Changing /lib64/libc.so.6 R GOT to RW GOT from 2b60c2b40000 to 2b60c2b44000
[Client.0.252100@rogot.c:30] remove_lib_rogot - Checking whether /lib64/ld-linux-x86-64.so.2 has R GOT
[Client.0.252100@rogot.c:41] remove_lib_rogot - Changing /lib64/ld-linux-x86-64.so.2 R GOT to RW GOT from 2b60c2566000 to 2b60c2567000
[Client.0.252100@auditclient.c:39] spindle_la_activity - la_activity(): cookie = 0x2b60c25685c0; flag = LA_ACT_CONSISTENT
[Server.252113@ldcs_api_listen.c:174] ldcs_listen - Select returned data. Calling callback for fd 14 id=0
[Server.252113@ldcs_audit_server_client_cb.c:61] _ldcs_client_CB - Receiving message from client 0 on fd 14
[Server.252113@ldcs_api_pipe.c:387] _ldcs_read_pipe - before read from fifo 14, bytes_to_read = 8
[Server.252113@ldcs_api_pipe.c:398] _ldcs_read_pipe - read from fifo: 0 bytes ...
[Server.252113@ldcs_api_pipe.c:338] ldcs_recv_msg_static_pipe - Client disconnected. Returning END message
I believe this issue is related to a glibc bug I recently learned about, in which LD_BIND_NOW breaks the LD_AUDIT interface that Spindle relies on. It can be worked around by running Spindle with its '--audit-type=subaudit' option.
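For reference, a sketch of what the workaround might look like when applied to the invocation from the report. It assumes that '--audit-type=subaudit' can simply be added alongside the other options used there, and that hello_bind_now is in the current directory; neither has been verified against a particular Spindle release.

```shell
# Assumption: hello_bind_now is the BIND_NOW build from the report,
# located in the current directory.
APP="$(pwd)/hello_bind_now"

if command -v spindle >/dev/null 2>&1; then
    # Same invocation as in the report, plus the suggested option.
    SPINDLE_DEBUG=3 TMPDIR='/tmp' spindle --audit-type=subaudit --location='/tmp' \
        mpiexec -np 1 spindlemarker "$APP"
else
    echo "spindle not found in PATH; nothing was run" >&2
fi
```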
Reproduce steps
I downloaded v0.12 from https://github.com/hpc/Spindle/releases/tag/v0.12 and built it.
Prepare a simple application built with BIND_NOW and run it with Spindle as follows.
$ cat hello.c
Appendix
The output of readelf -d for each application binary: