- 
                Notifications
    You must be signed in to change notification settings 
- Fork 929
PSM/PSM2: Disable signal handler hijacking by default #1781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PSM/PSM2: Disable signal handler hijacking by default #1781
Conversation
Per discussion on open-mpi#1767 (and some subsequent phone calls and off-issue email discussions), the PSM library is hijacking signal handlers by default. Specifically: unless the environment variables `IPATH_NO_BACKTRACE=1` (for PSM / Intel TrueScale) is set, the library constructor for this library will hijack various signal handlers for the purpose of invoking its own error reporting mechanisms. This may be a bit *surprising*, but is not a *problem*, per se. The real problem is that older versions of at least the PSM library do not unregister these signal handlers upon being unloaded from memory. Hence, a segv can actually result in a double segv (i.e., the original segv and then another segv when the now-non-existent signal handler is invoked). This PSM signal hijacking subverts Open MPI's own signal reporting mechanism, which may be a bit surprising for some users (particularly those who do not have Intel TrueScale). As such, we disable it by default so that Open MPI's own error-reporting mechanisms are used. Additionally, there is a typo in the library destructor for the PSM2 library that may cause problems in the unloading of its signal handlers. This problem can be avoided by setting `HFI_NO_BACKTRACE=1` (for PSM2 / Intel OmniPath). This is further compounded by the fact that the PSM / PSM2 libraries can be loaded by the OFI MTL and the usNIC BTL (because they are loaded by libfabric), even when there is no Intel networking hardware present. Having the PSM/PSM2 libraries behave this way when no Intel hardware is present is clearly undesirable (and is likely to be fixed in future releases of the PSM/PSM2 libraries). This commit sets the following two environment variables to disable this behavior from the PSM/PSM2 libraries (if they are not already set): * IPATH_NO_BACKTRACE=1 * HFI_NO_BACKTRACE=1 If the user has set these variables before invoking Open MPI, we will not override their values (i.e., their preferences will be honored). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
f6f03a4    to
    5071602      
    Compare
  
    | this looks fine - will merge once cleared | 
| the change looks according to what was agreed. | 
| Great; thanks @matcabral @rhc54. I'll merge. | 
| opal_setenv("IPATH_NO_BACKTRACE", "1", true, &environ); | ||
| } | ||
| if (NULL == getenv("HFI_NO_BACKTRACE")) { | ||
| opal_setenv("HFI_NO_BACKTRACE", "1", true, &environ); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a bit late, however, I believe this is unnecessary. Even though PSM2 library calls into the fini() code and runs sigaction() the resulting calls use a static struct that is initialized to all ZEROs (I double checked all the structs are in the .bss section). The resulting calls into sigaction with this parameter set, results in a simple query of the signal handlers, but no modification. So this work around really doesn't fix anything, other than avoiding some CPU cycles during dlclose(). I only mention this as near the end of the month this will be patched and then the HFI_NO_BACKTRACE will not exist inside libpsm2, so it will be wasted code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We opted for the "extra" protection solely because we couldn't be certain of the behavior, and the old libraries are going to be around on systems for a very long time (or at least so history says). So even though it technically isn't required, it doesn't hurt and maybe saves us from a little user angst.
| PSM2 changed its logic, if (!getenv("HFI_NO_BACKTRACE")) {
/* permanent, although probably
 undocumented way to disable backtraces. 
*/
  [set signal handler]
}and it was changed in cornelisnetworks/opa-psm2@e951cf3 to if (getenv("HFI_BACKTRACE")) {
 /* permanent, although probably
    undocumented way to disable backtraces.
*/
[set signal handler]so after everyone updates its PSM2 library, that will become wasted code. as a side note, the comment was and is now 
  | 
PSM/PSM2: Disable signal handler hijacking by default
Per discussion on #1767 (and some subsequent phone calls and off-issue email discussions), the PSM library is hijacking signal handlers by default. Specifically: unless the environment variables
IPATH_NO_BACKTRACE=1(for PSM / Intel TrueScale) is set, the library constructor for this library will hijack various signal handlers for the purpose of invoking its own error reporting mechanisms.This may be a bit surprising, but is not a problem, per se. The real problem is that older versions of at least the PSM library do not unregister these signal handlers upon being unloaded from memory. Hence, a segv can actually result in a double segv (i.e., the original segv and then another segv when the now-non-existent signal handler is invoked).
This PSM signal hijacking subverts Open MPI's own signal reporting mechanism, which may be a bit surprising for some users (particularly those who do not have Intel TrueScale). As such, we disable it by default so that Open MPI's own error-reporting mechanisms are used.
Additionally, there is a typo in the library destructor for the PSM2 library that may cause problems in the unloading of its signal handlers. This problem can be avoided by setting
HFI_NO_BACKTRACE=1(for PSM2 / Intel OmniPath).This is further compounded by the fact that the PSM / PSM2 libraries can be loaded by the OFI MTL and the usNIC BTL (because they are loaded by libfabric), even when there is no Intel networking hardware present. Having the PSM/PSM2 libraries behave this way when no Intel hardware is present is clearly undesirable (and is likely to be fixed in future releases of the PSM/PSM2 libraries).
This commit sets the following two environment variables to disable this behavior from the PSM/PSM2 libraries (if they are not already set):
IPATH_NO_BACKTRACE=1HFI_NO_BACKTRACE=1If the user has set these variables before invoking Open MPI, we will not override their values (i.e., their preferences will be honored).
Signed-off-by: Jeff Squyres jsquyres@cisco.com
@rhc54 @matcabral @yburette @ggouaillardet Please review. Thanks.