Skip to content

Conversation

@jsquyres
Copy link
Member

@jsquyres jsquyres commented Jun 14, 2016

PSM/PSM2: Disable signal handler hijacking by default

Per discussion on #1767 (and some subsequent phone calls and off-issue email discussions), the PSM library is hijacking signal handlers by default. Specifically: unless the environment variables IPATH_NO_BACKTRACE=1 (for PSM / Intel TrueScale) is set, the library constructor for this library will hijack various signal handlers for the purpose of invoking its own error reporting mechanisms.

This may be a bit surprising, but is not a problem, per se. The real problem is that older versions of at least the PSM library do not unregister these signal handlers upon being unloaded from memory. Hence, a segv can actually result in a double segv (i.e., the original segv and then another segv when the now-non-existent signal handler is invoked).

This PSM signal hijacking subverts Open MPI's own signal reporting mechanism, which may be a bit surprising for some users (particularly those who do not have Intel TrueScale). As such, we disable it by default so that Open MPI's own error-reporting mechanisms are used.

Additionally, there is a typo in the library destructor for the PSM2 library that may cause problems in the unloading of its signal handlers. This problem can be avoided by setting HFI_NO_BACKTRACE=1 (for PSM2 / Intel OmniPath).

This is further compounded by the fact that the PSM / PSM2 libraries can be loaded by the OFI MTL and the usNIC BTL (because they are loaded by libfabric), even when there is no Intel networking hardware present. Having the PSM/PSM2 libraries behave this way when no Intel hardware is present is clearly undesirable (and is likely to be fixed in future releases of the PSM/PSM2 libraries).

This commit sets the following two environment variables to disable this behavior from the PSM/PSM2 libraries (if they are not already set):

  • IPATH_NO_BACKTRACE=1
  • HFI_NO_BACKTRACE=1

If the user has set these variables before invoking Open MPI, we will not override their values (i.e., their preferences will be honored).

Signed-off-by: Jeff Squyres jsquyres@cisco.com

@rhc54 @matcabral @yburette @ggouaillardet Please review. Thanks.

Per discussion on open-mpi#1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default.  Specifically: unless
the environment variables `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.

This may be a bit *surprising*, but is not a *problem*, per se.  The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).

This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale).  As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.

Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers.  This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).

This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present.  Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).

This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):

* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1

If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres jsquyres force-pushed the pr/disable-psm-psm2-signal-hijacking branch from f6f03a4 to 5071602 Compare June 14, 2016 18:49
@rhc54
Copy link
Contributor

rhc54 commented Jun 14, 2016

this looks fine - will merge once cleared

@matcabral
Copy link
Contributor

the change looks according to what was agreed.

@jsquyres
Copy link
Member Author

Great; thanks @matcabral @rhc54. I'll merge.

@jsquyres jsquyres merged commit c2185bb into open-mpi:master Jun 14, 2016
@jsquyres jsquyres deleted the pr/disable-psm-psm2-signal-hijacking branch June 14, 2016 19:33
opal_setenv("IPATH_NO_BACKTRACE", "1", true, &environ);
}
if (NULL == getenv("HFI_NO_BACKTRACE")) {
opal_setenv("HFI_NO_BACKTRACE", "1", true, &environ);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a bit late, however, I believe this is unnecessary. Even though PSM2 library calls into the fini() code and runs sigaction() the resulting calls use a static struct that is initialized to all ZEROs (I double checked all the structs are in the .bss section). The resulting calls into sigaction with this parameter set, results in a simple query of the signal handlers, but no modification. So this work around really doesn't fix anything, other than avoiding some CPU cycles during dlclose(). I only mention this as near the end of the month this will be patched and then the HFI_NO_BACKTRACE will not exist inside libpsm2, so it will be wasted code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We opted for the "extra" protection solely because we couldn't be certain of the behavior, and the old libraries are going to be around on systems for a very long time (or at least so history says). So even though it technically isn't required, it doesn't hurt and maybe saves us from a little user angst.

@ggouaillardet
Copy link
Contributor

PSM2 changed its logic,
it used to be

if (!getenv("HFI_NO_BACKTRACE")) {
/* permanent, although probably
 undocumented way to disable backtraces. 
*/
  [set signal handler]
}

and it was changed in cornelisnetworks/opa-psm2@e951cf3 to

if (getenv("HFI_BACKTRACE")) {
 /* permanent, although probably
    undocumented way to disable backtraces.
*/
[set signal handler]

so after everyone updates its PSM2 library, that will become wasted code.
but in the meantime, we have to assume not everyone has an up-to-date PSM2 library, and hence handle that defensively in Open MPI.

as a side note, the comment was and is now

/* permanent, although probably
    undocumented way to disable backtraces.
*/
  • permanent was wrong in the first place
  • the comment could be updated to
/* hopefully permanent, although probably
   undocumented way to enable backtraces.
*/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants