Multiple commits #1931
Commits on Feb 25, 2024
-
Fix testing of suicide for daemons
We don't support a cmd line option for this as it isn't something a user should ever do. Instead, we use two MCA params to specify it:

    prte_daemon_fail <N> - specifies the daemon rank that should commit suicide
    prte_daemon_fail_delay <N> - time in seconds the target rank should wait before dying. A value of zero means no delay, just die after calling init. This is the default value.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 618dd0a)
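As a rough illustration of how a test harness might set the two params named above, here is a minimal Python sketch. It assumes that MCA params can be supplied through environment variables named `PRTE_MCA_<param>`; that prefix and the helper `daemon_fail_env` are assumptions made for this example and are not stated in the commit.

```python
import os


def daemon_fail_env(rank, delay=0, base_env=None):
    """Return a copy of the environment with the two fault-injection params set.

    Assumption: MCA params are picked up from PRTE_MCA_<name> variables.
    The helper itself is hypothetical and not part of PRRTE.
    """
    if delay < 0:
        raise ValueError("delay must be zero or a positive number of seconds")
    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["PRTE_MCA_prte_daemon_fail"] = str(rank)
    env["PRTE_MCA_prte_daemon_fail_delay"] = str(delay)
    return env
```

The returned mapping could then be passed as `env=` to `subprocess.run` when launching the job under test. A delay of zero matches the default described in the commit message: the target daemon dies right after calling init.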
Commit e2cff33
-
Fix daemon suicide and preserve output files
Correctly set parent rank so that the OOB can correctly identify its lifeline and cause the daemon to abort when it dies.

Fix the `--debug-daemons-file` flag so it works, and preserve the resulting output file from cleanup.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit a87d172)
Commit fd088cf
-
Session directories now always include the PID of the daemon

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit c4d5f81)
Commit f1a4222
-
Only trigger job failed to start once
Trigger the "job failed to start" state only when the first process to do so reports. This avoids a "bounce" effect that causes the job object to be multiply released.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit a386514)
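The "first report only" behavior described above can be sketched with a simple guard flag. This is an illustration in Python, not the PRRTE implementation; the `Job` class, its fields, and `report_failed_to_start` are hypothetical names chosen for the example.

```python
class Job:
    """Hypothetical stand-in for a job object tracked by the runtime."""

    def __init__(self, name):
        self.name = name
        self.failed_to_start = False  # set by the first failure report
        self.activations = 0          # how many times the state was triggered


def report_failed_to_start(job):
    """Trigger the failure state for the first report only.

    Returns True if this call triggered the state, False if an earlier
    report had already done so.
    """
    if job.failed_to_start:
        return False  # later reports are ignored, so nothing is released twice
    job.failed_to_start = True
    job.activations += 1
    return True
```

However many processes report the failure, the state fires once, so any cleanup tied to it (such as releasing the job object) also runs once.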
Commit 7b80594
-
Add "close stale issues" actions
Ported from open-mpi/ompi#12329

Thanks to @jsquyres!

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 31c948f)
Commit 7714e04
-
Update oac submodule pointer to pick up a stronger test for Sphinx. Also add (new) optional 3rd param to OAC_SETUP_SPHINX.

Signed-off-by: Jeff Squyres <jeff@squyres.com>
(cherry picked from commit d3171cc)
Commit aa2df0e
-
Revamp the session directory system
We now have multiple tools (e.g., psched, prte, and even multiple prte instances) running on the same node. Keeping all those session directory trees under a single root is problematic and leading to inadvertent deletion of contact files. So simplify things and put each instance under its own session directory tree root.

Add the pid and uid to the session directory root name. Prefix the root name with the argv[0] of the tool so we know what generated it.

Fix an error in PRRTE that assumed the job-level session was a global name. It is not - it is different for each job, so we need to track it by job. Have the prte_job_t destructor call the session_dir_destroy function to remove it when the job is complete. Fix refcounts so the job object destructor gets called upon job completion.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 14dd818)
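To make the naming scheme above concrete, here is a small Python sketch of a per-instance root name built from the tool name, uid, and pid, plus a per-job directory beneath it. The commit does not give the exact layout, so the separator, the field order, and both helper names are assumptions made for illustration.

```python
import os


def session_root_name(argv0, uid, pid):
    """Build a per-instance root name: tool name first, then uid and pid.

    Hypothetical format; the real layout is not given in the commit message.
    """
    tool = os.path.basename(argv0)  # keep only the executable name
    return f"{tool}.{uid}.{pid}"


def job_session_dir(root, jobid):
    """Return the job-level directory under an instance root.

    Each job gets its own path, so it is tracked per job rather than as a
    single global name.
    """
    return os.path.join(root, str(jobid))
```

Because the pid is part of the root name, two instances of the same tool run by the same user on one node get distinct trees, and removing one instance's tree cannot delete another's contact files.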
Commit 9d54eda
-
guard against possible segfault in prted
as it exits by removing unneeded activity

Signed-off-by: Howard Pritchard <howardp@lanl.gov>

pr feedback

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 025d5ab)
Commit e22cf80