Multiple commits #1931

Merged 8 commits on Feb 26, 2024

Commits on Feb 25, 2024

  1. Fix testing of suicide for daemons

    We don't support a cmd line option for this as it isn't
    something a user should ever do. Instead, we use two
    MCA params to specify it:
    
    prte_daemon_fail <N> - specifies the daemon rank that
    should commit suicide
    
    prte_daemon_fail_delay <N> - time in seconds the target
    rank should wait before dying. A value of zero means
    no delay, just die after calling init. This is the
    default value.
    
    Signed-off-by: Ralph Castain <rhc@pmix.org>
    (cherry picked from commit 618dd0a)
    rhc54 committed Feb 25, 2024
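
The decision described above (which daemon dies, and after how long) can be sketched as follows. This is an illustrative sketch only, not the PRRTE implementation: the struct, the function names, and the convention that a negative rank disables the test are all assumptions made for the example. Only the two parameter names and the "zero means no delay" default come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical holder for the two MCA parameter values. */
typedef struct {
    int fail_rank;  /* prte_daemon_fail: rank that should die
                       (assumption: a negative value disables the test) */
    int fail_delay; /* prte_daemon_fail_delay: seconds to wait before dying;
                       0 means die right after init (the default) */
} daemon_fail_params_t;

/* True if the daemon with rank my_rank is the designated target. */
bool daemon_should_fail(const daemon_fail_params_t *p, int my_rank)
{
    assert(p != NULL);
    return p->fail_rank >= 0 && p->fail_rank == my_rank;
}

/* Seconds the daemon should wait before dying, or -1 if it should
   keep running normally. */
int daemon_fail_wait(const daemon_fail_params_t *p, int my_rank)
{
    if (!daemon_should_fail(p, my_rank)) {
        return -1;
    }
    return p->fail_delay > 0 ? p->fail_delay : 0;
}
```

A caller would check `daemon_fail_wait()` once after init and, on a non-negative result, arm a timer (or exit immediately for zero).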
    Commit e2cff33
  2. Fix daemon suicide and preserve output files

    Correctly set the parent rank so that the OOB can
    identify its lifeline and cause the daemon to abort
    when that lifeline dies. Fix the
    `--debug-daemons-file` flag so it works, and
    preserve the resulting output file from cleanup.
    
    Signed-off-by: Ralph Castain <rhc@pmix.org>
    (cherry picked from commit a87d172)
    rhc54 committed Feb 25, 2024
    Commit fd088cf
  3. Remove unused MCA param

    Session directories now always include the PID of the daemon
    
    Signed-off-by: Ralph Castain <rhc@pmix.org>
    (cherry picked from commit c4d5f81)
    rhc54 committed Feb 25, 2024
    Commit f1a4222
  4. Only trigger job failed to start once

    Trigger the "job failed to start" state only on the
    first report of a process that failed to start. This
    avoids a "bounce" effect that causes the job object to
    be released multiple times.
    
    Signed-off-by: Ralph Castain <rhc@pmix.org>
    (cherry picked from commit a386514)
    rhc54 committed Feb 25, 2024
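
A minimal sketch of the "trigger once" guard described above, assuming a per-job flag. The type and function names here are hypothetical and do not correspond to PRRTE's actual job object or state machine. The point is only that later reports become no-ops, so the state (and any release it performs) fires exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a job object. */
typedef struct {
    bool failed_to_start_reported; /* set on the first failure report */
    int activations;               /* how many times the state was triggered */
} job_sketch_t;

/* Record a "failed to start" report for the job.
   Returns true only for the first report, which is the one that
   triggers the state; every later report is ignored. */
bool report_failed_to_start(job_sketch_t *job)
{
    assert(job != NULL);
    if (job->failed_to_start_reported) {
        return false; /* already triggered: avoid the "bounce" */
    }
    job->failed_to_start_reported = true;
    job->activations++;
    return true;
}
```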
    Commit 7b80594
  5. Add "close stale issues" actions

    Ported from open-mpi/ompi#12329
    
    Thanks to @jsquyres!
    
    Signed-off-by: Ralph Castain <rhc@pmix.org>
    (cherry picked from commit 31c948f)
    rhc54 committed Feb 25, 2024
    Commit 7714e04
  6. oac: strengthen Sphinx check

    Update oac submodule pointer to pick up a stronger test for
    Sphinx. Also add (new) optional 3rd param to OAC_SETUP_SPHINX.
    
    Signed-off-by: Jeff Squyres <jeff@squyres.com>
    (cherry picked from commit d3171cc)
    jsquyres authored and rhc54 committed Feb 25, 2024
    Commit aa2df0e
  7. Revamp the session directory system

    We now have multiple tools (e.g., psched, prte, and even
    multiple prte instances) running on the same node. Keeping
    all those session directory trees under a single root is
    problematic and leads to inadvertent deletion of contact
    files. So simplify things and put each instance under its
    own session directory tree root.
    
    Add the pid and uid to the session directory root name. Prefix
    the root name with the argv[0] of the tool so we know what
    generated it.
    
    Fix an error in PRRTE that assumed the job-level session was
    a global name. It is not - it is different for each job, so
    we need to track it by job. Have the prte_job_t destructor
    call the session_dir_destroy function to remove it when
    the job is complete.
    
    Fix refcounts so the job object destructor gets called upon
    job completion.
    
    Signed-off-by: Ralph Castain <rhc@pmix.org>
    (cherry picked from commit 14dd818)
    rhc54 committed Feb 25, 2024
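
To illustrate the naming idea above (tool name prefix plus uid and pid), here is a rough sketch of building a per-instance root name. The `tool.uid.pid` layout, the function name, and stripping the directory part of argv[0] are assumptions for the example; the commit message does not state the exact format PRRTE uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a per-instance session directory root name into buf.
   Assumed layout: "<tool>.<uid>.<pid>", where <tool> is the last path
   component of argv0. Returns 0 on success, -1 if buf is too small. */
int session_root_name(char *buf, size_t len, const char *argv0,
                      long uid, long pid)
{
    assert(buf != NULL && argv0 != NULL);

    /* Keep only the part of argv0 after the final '/', if any. */
    const char *base = strrchr(argv0, '/');
    base = (base != NULL) ? base + 1 : argv0;

    int n = snprintf(buf, len, "%s.%ld.%ld", base, uid, pid);
    return (n >= 0 && (size_t) n < len) ? 0 : -1;
}
```

Because the pid is part of the name, two prte instances started by the same user on one node get distinct roots, so cleaning up one tree cannot remove the other's contact files.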
    Commit 9d54eda
  8. guard against possible segfault in prted

    as it exits by removing unneeded activity
    
    Signed-off-by: Howard Pritchard <howardp@lanl.gov>
    
    pr feedback
    
    Signed-off-by: Howard Pritchard <howardp@lanl.gov>
    (cherry picked from commit 025d5ab)
    hppritcha authored and rhc54 committed Feb 25, 2024
    Commit e22cf80