Multiple commits #1931
Merged
Fix testing of suicide for daemons
We don't support a command-line option for this, as it isn't
something a user should ever do. Instead, we use two
MCA params to specify it:
prte_daemon_fail <N> - specifies the rank of the daemon that
should commit suicide
prte_daemon_fail_delay <N> - time in seconds the target
rank should wait before dying. A value of zero means
no delay: just die after calling init. This is the
default value.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 618dd0a)
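The selection logic described by the two MCA params can be sketched roughly as follows. This is an illustration only, not the PRRTE code: the `daemon_fail_cfg_t` struct and both function names are hypothetical, and treating a negative rank as "no daemon selected" is an assumption that the commit message does not state.

```c
#include <stdbool.h>

/* Hypothetical holder for the two MCA parameter values (not a PRRTE type). */
typedef struct {
    int fail_rank;   /* prte_daemon_fail: rank of the daemon that should die;
                        negative is ASSUMED here to mean "no daemon selected" */
    int fail_delay;  /* prte_daemon_fail_delay: seconds to wait; 0 is the default */
} daemon_fail_cfg_t;

/* Return true if the daemon with rank my_rank is the one selected to fail. */
bool daemon_should_fail(const daemon_fail_cfg_t *cfg, int my_rank)
{
    return cfg->fail_rank >= 0 && cfg->fail_rank == my_rank;
}

/* Seconds the selected daemon waits after init before dying.
 * Zero means "die right after init"; negative input is clamped to zero
 * (an assumption of this sketch). */
int daemon_fail_wait(const daemon_fail_cfg_t *cfg)
{
    return cfg->fail_delay > 0 ? cfg->fail_delay : 0;
}
```

A daemon would check `daemon_should_fail()` once after init and, if selected, wait `daemon_fail_wait()` seconds before exiting.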
Fix daemon suicide and preserve output files
Correctly set parent rank so that the OOB can
correctly identify its lifeline and cause the
daemon to abort when the lifeline dies. Fix the
`--debug-daemons-file` flag so it works, and preserve
the resulting output file from cleanup.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit a87d172)
Remove unused MCA param
Session directories now always include the PID of the daemon
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit c4d5f81)
Only trigger job failed to start once
Trigger the "job failed to start" state only when the
first process to do so reports. This avoids a "bounce"
effect that causes the job object to be multiply
released.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit a386514)
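The "trigger once" behavior amounts to a first-reporter guard. A minimal sketch, assuming a simplified job record: `sketch_job_t` and `report_failed_to_start` are hypothetical names for illustration and are not the real `prte_job_t` or any PRRTE state-machine call.

```c
#include <stdbool.h>

/* Hypothetical, simplified job record; NOT the real prte_job_t. */
typedef struct {
    bool failed_to_start_reported; /* set once the first report is handled */
    int  state_activations;        /* how many times the failure state fired */
} sketch_job_t;

/* Called for every process that reports it failed to start.
 * Only the first report activates the "job failed to start" state;
 * later reports are ignored, so the job object is not released
 * more than once. Returns true only for the first report. */
bool report_failed_to_start(sketch_job_t *job)
{
    if (job->failed_to_start_reported) {
        return false;               /* already handled: no "bounce" */
    }
    job->failed_to_start_reported = true;
    job->state_activations++;
    return true;
}
```

However many processes report, the activation counter in this sketch never exceeds one.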
Add "close stale issues" actions
Ported from open-mpi/ompi#12329
Thanks to @jsquyres!
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 31c948f)
oac: strengthen Sphinx check
Update oac submodule pointer to pick up a stronger test for
Sphinx. Also add (new) optional 3rd param to OAC_SETUP_SPHINX.
Signed-off-by: Jeff Squyres <jeff@squyres.com>
(cherry picked from commit d3171cc)
Revamp the session directory system
We now have multiple tools (e.g., psched, prte, and even
multiple prte instances) running on the same node. Keeping
all those session directory trees under a single root is
problematic and leads to inadvertent deletion of contact
files. So simplify things and put each instance under its
own session directory tree root.
Add the pid and uid to the session directory root name. Prefix
the root name with the argv[0] of the tool so we know what
generated it.
Fix an error in PRRTE that assumed the job-level session was
a global name. It is not - it is different for each job, so
we need to track it by job. Have the prte_job_t destructor
call the session_dir_destroy function to remove it when
the job is complete.
Fix refcounts so the job object destructor gets called upon
job completion.
Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 14dd818)
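As an illustration of the per-instance root naming described above, the sketch below builds a name from the tool's argv[0], the pid, and the uid. The function name `session_root_name`, the `.` separator, and the field order are assumptions made for this example; the commit message does not specify the exact layout PRRTE uses.

```c
#include <stdio.h>
#include <string.h>

/* Build a per-instance session directory root name of the ASSUMED form
 * "<tool>.<pid>.<uid>", where <tool> is the basename of argv[0].
 * Returns 0 on success, -1 if the name does not fit in buf. */
int session_root_name(char *buf, size_t len, const char *argv0,
                      unsigned long pid, unsigned long uid)
{
    /* Strip any leading path so only the tool name prefixes the root. */
    const char *slash = strrchr(argv0, '/');
    const char *tool = (NULL != slash) ? slash + 1 : argv0;

    int n = snprintf(buf, len, "%s.%lu.%lu", tool, pid, uid);
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

A caller would pass `(unsigned long) getpid()` and `(unsigned long) getuid()`. Because the pid is part of the name, two prte instances on the same node get distinct roots, so cleaning up one tree cannot remove another instance's contact files.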
guard against possible segfault in prted
as it exits by removing unneeded activity
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
PR feedback
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 025d5ab)