@jladd-mlnx
Member

This commit fixes a rare race condition that occurs when a process
calling exit(-1) has a delay between fd cleanup and its actual
OS-level exit. This may happen if the process has work to do in
callbacks registered via on_exit().
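
A rough sketch of how an application could end up in this situation, assuming glibc's on_exit() is available. The names slow_exit_cfg_t, slow_cleanup and arm_slow_exit are hypothetical and are not taken from the application that triggered the bug:

```c
#define _DEFAULT_SOURCE   /* on_exit() is a glibc extension, not POSIX */
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical settings for the exit handler. */
typedef struct {
    int      fds[3];      /* descriptors to close early, e.g. 0, 1, 2 */
    unsigned delay_sec;   /* how long the "cleanup" keeps the process alive */
} slow_exit_cfg_t;

/* Exit handler: the descriptors go away first, the process itself
 * disappears only after the delay. */
static void slow_cleanup(int status, void *arg)
{
    const slow_exit_cfg_t *cfg = arg;
    (void)status;                     /* exit status is not needed here */
    for (int i = 0; i < 3; i++) {
        close(cfg->fds[i]);           /* the parent sees the disconnect now */
    }
    sleep(cfg->delay_sec);            /* waitpid() in the parent fires later */
}

/* Register the handler; cfg must stay valid until the process exits.
 * Returns 0 on success, like on_exit(). */
static int arm_slow_exit(slow_exit_cfg_t *cfg)
{
    return on_exit(slow_cleanup, cfg);
}
```

An application would call arm_slow_exit() on a statically allocated config such as { {0, 1, 2}, 5 } and then call exit(-1). Whether this reproduces the hang on a given system is not something the description confirms.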

Problem description:
Consider an application process that has called exit(nonzero). Its
fds have been closed,
but its actual termination at the OS level is delayed by some cleanup (e.g.
in callbacks registered via on_exit()).
The observed sequence of events was the following:

  • orted gets the stdio disconnection and activates the IOF COMPLETE state.
  • A parallel OOB disconnection causes the COMMUNICATION FAILURE state to be
    activated.
  • During COMMUNICATION FAILURE processing, odls_base_default_wait_local_proc
    is called even though the real waitpid hasn't been called yet (the code mentions
    that waitpid might not be called for an unspecified reason). Because of that,
    the real exit code is unknown and is set to 0. The
    odls_base_default_wait_local_proc callback sees the IOF COMPLETE flag and,
    in conjunction with the 0 exit code, activates the WAITPID FIRED state.
  • Processing of WAITPID FIRED leads to NORMALLY TERMINATED being
    activated.
  • The NORMALLY TERMINATED state, in particular, causes the ORTE_PROC_FLAG_ALIVE
    flag for this proc to be dropped.
  • When the application process finally exits, wait_signal_callback is
    launched. It sets the real exit code and calls odls_base_default_wait_local_proc
    again, but this time, since the process has the ORTE_PROC_FLAG_ALIVE flag
    dropped, the WAITPID FIRED state is activated (instead of EXITED WITH NON-ZERO),
    leading to the hang that was observed.
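
The sequence above can be replayed in a small toy model. This is only a sketch of the decision logic as described in the list, not the ORTE code: every toy_* name is hypothetical, and the branch order inside toy_wait_local_proc is an assumption made to match the observed behaviour.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the states named in the description. */
typedef enum {
    TOY_STATE_NONE,
    TOY_STATE_WAITPID_FIRED,
    TOY_STATE_EXITED_NONZERO
} toy_state_t;

typedef struct {
    bool alive;         /* stands in for ORTE_PROC_FLAG_ALIVE */
    bool iof_complete;  /* stands in for the IOF COMPLETE flag */
    int  exit_code;     /* stays 0 until the real exit code is known */
} toy_proc_t;

/* Toy version of the decision attributed to
 * odls_base_default_wait_local_proc (assumed ordering). */
static toy_state_t toy_wait_local_proc(const toy_proc_t *p)
{
    if (!p->alive) {
        /* ALIVE already dropped: the exit code is never consulted. */
        return TOY_STATE_WAITPID_FIRED;
    }
    if (p->exit_code != 0) {
        return TOY_STATE_EXITED_NONZERO;
    }
    return p->iof_complete ? TOY_STATE_WAITPID_FIRED : TOY_STATE_NONE;
}

/* Replays the observed sequence and returns the state activated by the
 * second, "real" call made once the process has finally exited. */
static toy_state_t toy_replay_race(int real_exit_code)
{
    toy_proc_t p = { .alive = true, .iof_complete = false, .exit_code = 0 };

    p.iof_complete = true;            /* stdio disconnect: IOF COMPLETE */

    /* COMMUNICATION FAILURE: called before waitpid, exit code still 0. */
    if (toy_wait_local_proc(&p) == TOY_STATE_WAITPID_FIRED) {
        p.alive = false;              /* NORMALLY TERMINATED drops ALIVE */
    }

    p.exit_code = real_exit_code;     /* wait_signal_callback sets the code */
    return toy_wait_local_proc(&p);   /* second call, ALIVE is gone */
}
```

In this model, toy_replay_race(255) ends in TOY_STATE_WAITPID_FIRED rather than TOY_STATE_EXITED_NONZERO, which mirrors the misclassification described in the last step of the list.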

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 3eb6c98)

@jladd-mlnx jladd-mlnx added this to the v2.1.0 milestone Jan 9, 2017
@jladd-mlnx jladd-mlnx added the bug label Jan 9, 2017
Contributor

@rhc54 rhc54 left a comment

No observed issues

@hppritcha
Member

@jsquyres good to go.

@jsquyres jsquyres merged commit 6b26a4d into open-mpi:v2.x Jan 11, 2017