Check the runtime version of PMIx #1982

Merged: 1 commit into openpmix:master, Jun 5, 2024

@rhc54 rhc54 commented May 18, 2024

It has been reported (and confirmed) that building against one version of PMIx and then running with another version will cause PRRTE to segfault. This isn't a universal rule. For example, one can switch v5.0 and master without a problem. However, switching v5.0 and v4.2 is a definite segfault.

A little playing indicates that at least for some PMIx series, it is possible to switch between subreleases within the series - i.e., v5.0.1 and v5.0.2. I would not consider this a guaranteed rule at this time.

For now, we check the runtime version of PMIx against the build version. If the major/minor values don't match, then we print an explanatory error message and error out.
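
A minimal sketch of the major/minor check described above, not the code merged in this PR. It assumes the runtime version is available as a string containing a dotted `major.minor` pair (in PRRTE this would presumably come from `PMIx_Get_version()`, with the build-time values taken from `PMIX_VERSION_MAJOR` and `PMIX_VERSION_MINOR`). The helper names `parse_major_minor` and `pmix_version_matches` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: find the first "<digits>.<digits>" pair in a
 * version string. Returns false if the string contains no such pair;
 * the outputs are only written on success. */
bool parse_major_minor(const char *str, int *major, int *minor)
{
    int maj, min;

    assert(NULL != major && NULL != minor);
    if (NULL == str) {
        return false;
    }
    for (const char *p = str; '\0' != *p; p++) {
        if (isdigit((unsigned char) *p) && 2 == sscanf(p, "%d.%d", &maj, &min)) {
            *major = maj;
            *minor = min;
            return true;
        }
    }
    return false;
}

/* Hypothetical check: returns true when the runtime major.minor equals
 * the build major.minor; otherwise prints an explanatory message to
 * stderr and returns false so the caller can error out. */
bool pmix_version_matches(const char *runtime_version,
                          int build_major, int build_minor)
{
    int major, minor;

    if (!parse_major_minor(runtime_version, &major, &minor)) {
        fprintf(stderr, "Could not determine the runtime PMIx version\n");
        return false;
    }
    if (major != build_major || minor != build_minor) {
        fprintf(stderr,
                "PMIx version mismatch: built against v%d.%d, running with v%d.%d\n",
                build_major, build_minor, major, minor);
        return false;
    }
    return true;
}
```

A caller would run something like `pmix_version_matches(PMIx_Get_version(), PMIX_VERSION_MAJOR, PMIX_VERSION_MINOR)` early in initialization and abort on `false`. The exact format of the runtime version string is an assumption here, which is why the parser only looks for the first dotted pair.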


rhc54 commented May 18, 2024

@jsquyres @bwbarrett I could use your help with this PR. This impacts OMPI as well since you are using PRRTE for your mpirun, which will exhibit the same behavior. I believe the root cause is the reuse of PMIx internals (e.g., the MCA base) in PRRTE. When we pull in the dynamic bits of the runtime PMIx, I think this causes memory corruption on the other pieces that were already present in PRRTE.

What I observe is that everything involving MCA base segfaults. The list of open components, list of MCA params - they are all corrupted. If I fix/skip one, the next one in line segfaults. So I don't believe it is possible to "fix" the situation.

Any suggestions would be much appreciated.


@jsquyres jsquyres left a comment


Hmm. I am pondering the situation... don't have any immediately obvious ideas.

Two review threads on src/runtime/prte_init.c (outdated, resolved)
It has been reported (and confirmed) that building against
one version of PMIx and then running with another version
will cause PRRTE to segfault. This isn't a universal rule.
For example, one can switch v5.0 and master without a
problem. However, switching v5.0 and v4.2 is a definite
segfault.

The root cause of the problem is a change in the layout
of the base pmix_object_t definition. This renders all
PMIx objects binary incompatible when crossing between
the v5 and v4 (and below) series.

Changing the v5 definition back to match v4 is an
overly complex task. The changes were required to
accommodate the new shared memory support that
was introduced in v5.

So instead, we check the runtime version of PMIx against
the build version. If the runtime version is incompatible
with the build version, then we print an explanatory
error message and error out.

Signed-off-by: Ralph Castain <rhc@pmix.org>
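
To illustrate why a layout change in a base object breaks binary compatibility, here is a small sketch. The two structs are hypothetical stand-ins and do not reproduce the real `pmix_object_t` from either series. They only show that inserting a field shifts the offsets of later fields, so code compiled against one layout reads the wrong memory when handed an object built with the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "old" base object: class pointer, then reference count. */
typedef struct {
    void *obj_class;
    int32_t ref_count;
} old_object_t;

/* Hypothetical "new" base object: an extra pointer (standing in for
 * shared-memory bookkeeping) inserted before the reference count. */
typedef struct {
    void *obj_class;
    void *mem_tracker;
    int32_t ref_count;
} new_object_t;

/* Reads ref_count the way code compiled against the OLD layout would:
 * at the old offset, regardless of how the object was actually built. */
int32_t read_ref_count_old_abi(const void *obj)
{
    int32_t value;

    assert(NULL != obj);
    memcpy(&value,
           (const char *) obj + offsetof(old_object_t, ref_count),
           sizeof(value));
    return value;
}
```

Passing an `old_object_t` to `read_ref_count_old_abi` returns its reference count, while passing a `new_object_t` returns bytes of `mem_tracker` instead. Every object derived from the base inherits the same shift, which is consistent with the observation that fixing one crash only exposes the next.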

dd

Signed-off-by: Ralph Castain <rhc@pmix.org>
@rhc54 rhc54 merged commit d02ad07 into openpmix:master Jun 5, 2024
9 checks passed
@rhc54 rhc54 deleted the topic/chk branch June 5, 2024 02:20