Segmentation fault with BIND_NOW executable #37

Closed
fukai-t opened this issue Aug 31, 2020 · 1 comment

fukai-t commented Aug 31, 2020

Overview

Running Spindle with an application executable built with the BIND_NOW option causes a segmentation fault. I saw the fault on both an x86 cluster and an aarch64 cluster.

Steps to reproduce

I confirmed the following steps on the x86 cluster.

The glibc version (as reported by ldd) on the x86 cluster:

$ LC_ALL=C ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

  1. I downloaded v0.12 from https://github.com/hpc/Spindle/releases/tag/v0.12 and built it.

  2. I prepared a simple application built with BIND_NOW and ran it with Spindle as follows.

$ cat hello.c

#include <stdio.h>

int main (int argc, char* argv[])
{
  printf ("Hello world!\n");
  return 0;
}
$ gcc -Wl,-z,now -o hello_bind_now hello.c
$ SPINDLE_DEBUG=3 TMPDIR='/tmp' spindle --location='/tmp' mpiexec -np 1 spindlemarker $(pwd)/hello_bind_now
<Aug 31 16:19:45> <Launchmon> (INFO): The RM process has just been forked and exec'ed.
<Aug 31 16:19:45> <Launchmon> (INFO): Just continued the RM process out of the first trap

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 247311 RUNNING AT 10.xx.yy.zz
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

Expected results

Without the BIND_NOW option, the application runs correctly with Spindle.

$ gcc  -o hello hello.c
$ SPINDLE_DEBUG=3 TMPDIR='/tmp' spindle --location='/tmp' mpiexec -np 1 spindlemarker $(pwd)/hello
<Aug 31 16:20:26> <Launchmon> (INFO): The RM process has just been forked and exec'ed.
<Aug 31 16:20:26> <Launchmon> (INFO): Just continued the RM process out of the first trap
Hello world!

Detail

In the debug output, the Spindle client appears to stop after the following log messages.

[Client.0.252100@auditclient_common.c:92] la_objopen - la_objopen(): loading /lib64/libc.so.6, link_map = 0x2b60c23859c8, lmid = LM_ID_BASE, cookie = 0x2b60c2385e30
[Client.0.252100@auditclient_common.c:116] la_activity - la_activity(): cookie = 0x2b60c25685c0; flag = LA_ACT_CONSISTENT
[Client.0.252100@rogot.c:30] remove_lib_rogot - Checking whether /lib64/libc.so.6 has R GOT
[Client.0.252100@rogot.c:41] remove_lib_rogot - Changing /lib64/libc.so.6 R GOT to RW GOT from 2b60c2b40000 to 2b60c2b44000
[Client.0.252100@rogot.c:30] remove_lib_rogot - Checking whether /lib64/ld-linux-x86-64.so.2 has R GOT
[Client.0.252100@rogot.c:41] remove_lib_rogot - Changing /lib64/ld-linux-x86-64.so.2 R GOT to RW GOT from 2b60c2566000 to 2b60c2567000
[Client.0.252100@auditclient.c:39] spindle_la_activity - la_activity(): cookie = 0x2b60c25685c0; flag = LA_ACT_CONSISTENT
[Server.252113@ldcs_api_listen.c:174] ldcs_listen - Select returned data.  Calling callback for fd 14 id=0
[Server.252113@ldcs_audit_server_client_cb.c:61] _ldcs_client_CB - Receiving message from client 0 on fd 14
[Server.252113@ldcs_api_pipe.c:387] _ldcs_read_pipe - before read from fifo 14, bytes_to_read = 8
[Server.252113@ldcs_api_pipe.c:398] _ldcs_read_pipe - read from fifo: 0 bytes ...
[Server.252113@ldcs_api_pipe.c:338] ldcs_recv_msg_static_pipe - Client disconnected.  Returning END message

Appendix

The readelf -d output for each application binary:

$ LC_ALL=C readelf -d hello_bind_now

Dynamic section at offset 0xdd8 contains 26 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x4003e0
 0x000000000000000d (FINI)               0x4005c4
 0x0000000000000019 (INIT_ARRAY)         0x600dc0
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x600dc8
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x400298
 0x0000000000000005 (STRTAB)             0x400318
 0x0000000000000006 (SYMTAB)             0x4002b8
 0x000000000000000a (STRSZ)              61 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x600fc8
 0x0000000000000002 (PLTRELSZ)           72 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x400398
 0x0000000000000007 (RELA)               0x400380
 0x0000000000000008 (RELASZ)             24 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x0000000000000018 (BIND_NOW)
 0x000000006ffffffb (FLAGS_1)            Flags: NOW
 0x000000006ffffffe (VERNEED)            0x400360
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x400356
 0x0000000000000000 (NULL)               0x0
$

$ LC_ALL=C readelf -d hello

Dynamic section at offset 0xe28 contains 24 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x4003e0
 0x000000000000000d (FINI)               0x4005c4
 0x0000000000000019 (INIT_ARRAY)         0x600e10
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x600e18
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x400298
 0x0000000000000005 (STRTAB)             0x400318
 0x0000000000000006 (SYMTAB)             0x4002b8
 0x000000000000000a (STRSZ)              61 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x601000
 0x0000000000000002 (PLTRELSZ)           72 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x400398
 0x0000000000000007 (RELA)               0x400380
 0x0000000000000008 (RELASZ)             24 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0x400360
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x400356
 0x0000000000000000 (NULL)               0x0
$
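
The two listings above differ only in the BIND_NOW tag and the NOW flag in FLAGS_1 (plus the resulting offsets). As a rough sketch, that difference can be checked from the shell. The function names `has_bind_now` and `is_bind_now_listing` are made up for illustration and are not part of Spindle or binutils; the pattern assumes the `readelf -d` output format shown above.

```shell
# Same check on `readelf -d` output supplied on stdin (handy for saved listings).
# Matches the (BIND_NOW) tag, or a (FLAGS)/(FLAGS_1) line that mentions NOW.
is_bind_now_listing() {
  grep -Eq '\(BIND_NOW\)|\(FLAGS(_1)?\).*NOW'
}

# Report (via exit status) whether an ELF binary requests immediate binding.
has_bind_now() {
  LC_ALL=C readelf -d "$1" 2>/dev/null | is_bind_now_listing
}
```

For example, `has_bind_now ./hello_bind_now && echo "built with -z now"` should print the message for the failing binary and nothing for `./hello`.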

mplegendre (Member) commented

I believe this issue is related to a glibc bug I recently learned about where LD_BIND_NOW breaks the LD_AUDIT interface that spindle relies on. It can be worked around by running spindle with its '--audit-type=subaudit' option.
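
A minimal sketch of applying that workaround to the launch line from the report. The only change is the added `--audit-type=subaudit` flag; the wrapper function `spindle_subaudit` is a made-up name for illustration, and this thread does not confirm that the flag resolves this particular failure.

```shell
# Launch line from the report plus the suggested workaround flag.
# Prints the command it would run; executes it only if spindle is on PATH.
spindle_subaudit() {
  app=$1
  set -- spindle --audit-type=subaudit --location=/tmp mpiexec -np 1 spindlemarker "$app"
  echo "$*"
  if command -v spindle >/dev/null 2>&1; then
    SPINDLE_DEBUG=3 TMPDIR=/tmp "$@"
  fi
}
```

Usage: `spindle_subaudit "$(pwd)/hello_bind_now"`.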
