orte/odls: Fix ORTE state machine for the non-zero exit case #2694
This commit fixes a rare race condition that occurs when a process
calling `exit(-1)` has a delay between its fd cleanup and its actual
OS-level exit. This may happen if the process has work to do in
callbacks registered via `on_exit()`.

Problem description:
Consider an application process that has called `exit(nonzero)`: its
fds have been closed, but its actual termination at the OS level is
delayed by some cleanup (e.g. in callbacks registered via `on_exit()`).

The observed sequence of events was the following:
1. The process reaches the `IOF COMPLETE` state.
2. The `COMMUNICATION FAILURE` state is then activated.
3. During `COMMUNICATION FAILURE` processing, `odls_base_default_wait_local_proc`
   is called even though the real waitpid has not yet been called (the code
   mentions that waitpid might not be called for an unspecified reason).
   Because of that, the real exit code is unknown and is set to 0.
4. The `odls_base_default_wait_local_proc` callback sees the `IOF COMPLETE`
   flag and, in conjunction with the 0 exit code, activates the
   `WAITPID FIRED` state.
5. `WAITPID FIRED` leads to `NORMALLY TERMINATED` being activated.
6. The `NORMALLY TERMINATED` state, in particular, causes the
   `ORTE_PROC_FLAG_ALIVE` flag for this proc to be dropped.
7. `wait_signal_callback` is launched. It sets the real exit code and calls
   `odls_base_default_wait_local_proc` again, but this time, since the process
   has its `ORTE_PROC_FLAG_ALIVE` flag dropped, the `WAITPID FIRED` state is
   activated (instead of `EXITED WITH NON-ZERO`), leading to the hang that
   was observed.
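
The ordering problem described above can be replayed with a small standalone model. This is only a sketch and not ORTE code: `proc_model_t`, `next_state_t`, `wait_local_proc_racy`, `wait_local_proc_guarded` and `replay` are hypothetical names invented for illustration, and the guard shown (ignoring the exit code until waitpid has really fired) is one assumed way to avoid the misclassification, not necessarily what this commit does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the states named in the description. */
typedef enum { ST_NONE, ST_WAITPID_FIRED, ST_EXITED_NONZERO } next_state_t;

/* Hypothetical per-process bookkeeping (not the real orte_proc_t). */
typedef struct {
    bool alive;         /* models ORTE_PROC_FLAG_ALIVE */
    bool iof_complete;  /* models the IOF COMPLETE flag */
    bool waitpid_done;  /* true once the real waitpid result is known */
    int  exit_code;     /* 0 until the real exit code is recorded */
} proc_model_t;

/* Models the problematic decision: a placeholder exit code of 0 is trusted. */
static next_state_t wait_local_proc_racy(proc_model_t *p)
{
    assert(p != NULL);
    if (!p->alive) {
        return ST_WAITPID_FIRED;      /* ALIVE already dropped: non-zero code is ignored */
    }
    if (p->exit_code != 0) {
        return ST_EXITED_NONZERO;
    }
    if (p->iof_complete) {
        /* WAITPID FIRED -> NORMALLY TERMINATED, which drops the ALIVE flag */
        p->alive = false;
        return ST_WAITPID_FIRED;
    }
    return ST_NONE;
}

/* Assumed guard: do nothing until the real exit code is available. */
static next_state_t wait_local_proc_guarded(proc_model_t *p)
{
    assert(p != NULL);
    if (!p->waitpid_done) {
        return ST_NONE;
    }
    return wait_local_proc_racy(p);
}

/* Replays the observed sequence and returns the state chosen by the second call. */
static next_state_t replay(next_state_t (*decide)(proc_model_t *), int real_exit_code)
{
    /* fds are closed (IOF complete) but the process has not been reaped yet */
    proc_model_t p = { true, true, false, 0 };

    decide(&p);                       /* call made during COMMUNICATION FAILURE processing */

    p.waitpid_done = true;            /* wait_signal_callback: real exit code arrives */
    p.exit_code = real_exit_code;
    return decide(&p);                /* second call */
}
```

Calling `replay(wait_local_proc_racy, 1)` yields `ST_WAITPID_FIRED`, which corresponds to the misclassified exit described above, while `replay(wait_local_proc_guarded, 1)` yields `ST_EXITED_NONZERO`.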
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 3eb6c98)