Multiple commits #1931

Merged
merged 8 commits into from
Feb 26, 2024

rhc54
Contributor

@rhc54 rhc54 commented Feb 25, 2024

Fix testing of suicide for daemons

We don't support a cmd line option for this as it isn't
something a user should ever do. Instead, we use two
MCA params to specify it:

prte_daemon_fail <N> - specifies the daemon rank that
should commit suicide

prte_daemon_fail_delay <N> - time in seconds the target
rank should wait before dying. A value of zero means
no delay, just die after calling init. This is the
default value.
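
The semantics of the two parameters can be sketched roughly as follows. This is an illustration only, not the PRRTE code: the function name `daemon_fail_wait` and the convention that a negative target rank means "no daemon should fail" are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical helper, not a PRRTE function.
 *   my_rank:    rank of the calling daemon
 *   fail_rank:  value of prte_daemon_fail (assumed: negative means unset)
 *   fail_delay: value of prte_daemon_fail_delay, in seconds
 * Returns the number of seconds to wait before dying, or -1 if this
 * daemon is not the target and should keep running. */
static int daemon_fail_wait(int my_rank, int fail_rank, int fail_delay)
{
    assert(my_rank >= 0);
    if (fail_rank < 0 || my_rank != fail_rank) {
        return -1;                          /* not the target rank */
    }
    return fail_delay > 0 ? fail_delay : 0; /* zero: die right after init */
}
```

With the default delay of zero, the target rank gets `0` back and dies immediately after init; every other rank gets `-1` and continues normally.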

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 618dd0a)

Fix daemon suicide and preserve output files

Correctly set parent rank so that the OOB can
correctly identify its lifeline and cause the
daemon to abort when it dies. Fix the
--debug-daemons-file flag so it works, and
preserve the resulting output file from cleanup.
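
To see why the parent rank matters here, consider a minimal sketch of a lifeline check. The struct and function names are hypothetical and do not reflect the actual OOB code; the sketch only assumes that the lifeline is the daemon's parent, as the commit message describes.

```c
#include <stdbool.h>

/* Hypothetical daemon state; field names are illustrative only. */
typedef struct {
    int my_rank;
    int parent_rank;    /* the lifeline: the daemon this one depends on */
} demo_daemon_t;

/* Called when the connection to peer_rank is lost.
 * Returns true if the daemon should abort because its lifeline died. */
static bool demo_lost_connection(const demo_daemon_t *d, int peer_rank)
{
    return peer_rank == d->parent_rank;
}
```

If `parent_rank` is set incorrectly, the comparison never matches the peer that actually died, and the daemon keeps running after losing its lifeline.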

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit a87d172)

Remove unused MCA param

Session directories now always include the PID of the daemon

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit c4d5f81)

Only trigger job failed to start once

Trigger the "job failed to start" state only when the
first process that fails to start reports it. This avoids
a "bounce" effect that causes the job object to be
released multiple times.
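
A once-only trigger of this kind can be sketched with a simple guard flag. The type and function below are hypothetical stand-ins, not the real `prte_job_t` or the PRRTE state machine; the counter merely stands in for activating the state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of a job object. */
typedef struct {
    bool failed_to_start_reported;  /* set once the state has been triggered */
    int  trigger_count;             /* how many times the state was activated */
} demo_job_state_t;

/* Called whenever a process reports that it failed to start.
 * Returns true only for the first report on a given job. */
static bool demo_report_failed_to_start(demo_job_state_t *job)
{
    if (job->failed_to_start_reported) {
        return false;               /* already handled: ignore later reports */
    }
    job->failed_to_start_reported = true;
    job->trigger_count++;           /* stands in for activating the job state */
    return true;
}
```

Later reports for the same job return `false`, so whatever cleanup the state triggers (including releasing the job object) happens once.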

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit a386514)

Add "close stale issues" actions

Ported from open-mpi/ompi#12329

Thanks to @jsquyres!

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 31c948f)

oac: strengthen Sphinx check

Update oac submodule pointer to pick up a stronger test for
Sphinx. Also add (new) optional 3rd param to OAC_SETUP_SPHINX.

Signed-off-by: Jeff Squyres <jeff@squyres.com>
(cherry picked from commit d3171cc)

Revamp the session directory system

We now have multiple tools (e.g., psched, prte, and even
multiple prte instances) running on the same node. Keeping
all those session directory trees under a single root is
problematic and leads to inadvertent deletion of contact
files. So simplify things and put each instance under its
own session directory tree root.

Add the pid and uid to the session directory root name. Prefix
the root name with the argv[0] of the tool so we know what
generated it.
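
As a rough illustration of such a name, the sketch below builds a root from the tool name, uid, and pid. The `<tool>.<uid>.<pid>` layout, the stripping of any leading path from argv[0], and the function name are all assumptions for the example; the commit message only states which pieces are included and that the tool name comes first.

```c
#include <stdio.h>
#include <string.h>

/* Build a per-instance session directory root name.
 * The "<tool>.<uid>.<pid>" layout is assumed for this sketch.
 * Returns 0 on success, -1 if the buffer is too small. */
static int demo_session_root(char *buf, size_t len, const char *argv0,
                             unsigned long uid, unsigned long pid)
{
    /* Assumption: use only the last path component of argv[0]. */
    const char *slash = strrchr(argv0, '/');
    const char *tool = (slash != NULL) ? slash + 1 : argv0;

    int n = snprintf(buf, len, "%s.%lu.%lu", tool, uid, pid);
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

In a real tool the uid and pid would come from `getuid()` and `getpid()`. Because the pid differs for each instance, two prte instances on the same node end up with distinct roots.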

Fix an error in PRRTE that assumed the job-level session was
a global name. It is not - it is different for each job, so
we need to track it by job. Have the prte_job_t destructor
call the session_dir_destroy function to remove it when
the job is complete.

Fix refcounts so the job object destructor gets called upon
job completion.
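
The interaction between refcounts and per-job cleanup can be sketched as below. Everything here is hypothetical: `demo_job_t` is not the layout of `prte_job_t`, and `demo_session_dir_destroy` only counts calls where the real function would remove the directory tree.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted job that owns its own session directory name. */
typedef struct {
    int   refcount;
    char *session_dir;      /* job-level session directory, tracked per job */
} demo_job_t;

static int demo_dirs_destroyed = 0;   /* counts cleanup calls, for illustration */

/* Stand-in for session_dir_destroy; a real version would remove the tree. */
static void demo_session_dir_destroy(const char *dir)
{
    (void) dir;
    demo_dirs_destroyed++;
}

/* Create a job holding one reference and a private copy of the path. */
static demo_job_t *demo_job_new(const char *session_dir)
{
    demo_job_t *job = malloc(sizeof(*job));
    if (job == NULL) {
        return NULL;
    }
    size_t n = strlen(session_dir) + 1;
    job->session_dir = malloc(n);
    if (job->session_dir == NULL) {
        free(job);
        return NULL;
    }
    memcpy(job->session_dir, session_dir, n);
    job->refcount = 1;
    return job;
}

static void demo_job_retain(demo_job_t *job)
{
    job->refcount++;
}

/* Drop one reference. The destructor logic runs only when the count
 * reaches zero. Returns 1 if the job was destroyed, 0 otherwise. */
static int demo_job_release(demo_job_t *job)
{
    if (--job->refcount > 0) {
        return 0;
    }
    demo_session_dir_destroy(job->session_dir);
    free(job->session_dir);
    free(job);
    return 1;
}
```

If some code path retains the job without a matching release, the count never reaches zero and the session directory is never removed, which is the kind of problem the refcount fix addresses.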

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 14dd818)

guard against possible segfault in prted

as it exits by removing unneeded activity

Signed-off-by: Howard Pritchard <howardp@lanl.gov>

pr feedback

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 025d5ab)

rhc54 and others added 8 commits February 25, 2024 14:18
@rhc54 rhc54 merged commit 675c524 into openpmix:v3.0 Feb 26, 2024
12 checks passed
@rhc54 rhc54 deleted the cmr30/up branch February 26, 2024 15:44

3 participants