
Conversation

@jeking3 commented Nov 14, 2025

While moving a job from a small number of GPU nodes to a larger number of CPU nodes, I was able to reliably reproduce #11087 in my environment. In the debugger, I found that opal_common_ucx_mca_pmix_fence was spinning forever waiting to become fenced. Calls down into the UCX layer showed that it had no pending operations, no active endpoints, and no outstanding flushes. Given that UCX is the transport the processes rely on to synchronize in this case, it makes no sense to fence after disconnecting. Reversing the order of operations resolved the shutdown hang.

This fixes #11087
This might fix openucx/ucx#8738
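
For readers who have not looked at the patch itself, here is a minimal sketch of the reordering. The names opal_common_ucx_mca_pmix_fence and opal_common_ucx_del_procs come from this discussion; the wrapper functions and everything else below are hypothetical and only illustrate the call order, not the actual Open MPI source:

```c
/* Hedged sketch of the ordering change in this PR.  ucx_fence() and
 * ucx_disconnect() are hypothetical wrappers standing in for
 * opal_common_ucx_mca_pmix_fence() and opal_common_ucx_del_procs();
 * real signatures and error handling are omitted. */
void ucx_fence(void);       /* meet all ranks via the PMIx fence       */
void ucx_disconnect(void);  /* flush and close the local UCX endpoints */

void teardown_before_this_pr(void)
{
    ucx_disconnect();       /* endpoints gone, nothing left to make progress */
    ucx_fence();            /* observed spinning forever in #11087           */
}

void teardown_with_this_pr(void)
{
    ucx_fence();            /* all ranks meet while still connected */
    ucx_disconnect();       /* then tear down the endpoints         */
}
```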

@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b633c68: ucx: meet at barrier before disconnecting, not aft...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@jsquyres (Member)

@jeking3 Thanks for the PR! Can you add the sign-off message (e.g., via git commit --amend -s)?

https://docs.open-mpi.org/en/v5.0.x/contributing.html#open-source-contributions

@jsquyres requested review from bosilca and janjust on November 14, 2025, 19:01
Call pmix_fence while we still have connectivity, because
after we disconnect we may never get to being fenced.

Signed-off-by: Jim King <jimk@nvidia.com>
@jeking3 (Author) commented Nov 21, 2025

> @jeking3 Thanks for the PR! Can you add the sign-off message (e.g., via git commit --amend -s)?
>
> https://docs.open-mpi.org/en/v5.0.x/contributing.html#open-source-contributions

That was done, by the way.

@bosilca (Member) left a comment

I understand this PR seems to address an MPI_Finalize issue, but I don't think this is really the case. Instead, it is hiding the real root cause.

Let me delve a little into the logic here. The main reason for the PMIx fence was to give all processes time (provided by a fence over an external communication framework rather than a barrier) to close and clean up their connections/endpoints before returning from the call. To state this clearly: once any process returns from opal_common_ucx_del_procs, we have a guarantee that all processes have destroyed all their connections and joined the fence, which is a strong guarantee for the rest of the OMPI teardown.

With this change we are at the complete opposite: processes synchronize before starting to tear down their connections, so the return of a process no longer provides any global guarantee about the others. In that scenario there is no need for a fence at all; a simple barrier on the communicator being destroyed (MPI_COMM_WORLD, I think) would provide the same synchronization. Clearly this is the opposite of how this entire stage is expected to work.
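
To make the difference in guarantees concrete, here is a self-contained toy in which a pthread barrier stands in for the PMIx fence and a free() stands in for the UCX endpoint teardown; it is purely illustrative and is not Open MPI code:

```c
/* Toy illustration: barrier = PMIx fence, free() = endpoint teardown. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 4
static pthread_barrier_t fence;

/* Original order: tear down first, then fence.  When this returns,
 * every participant is known to have finished its teardown. */
static void *teardown_then_fence(void *resource)
{
    free(resource);                  /* "disconnect" */
    pthread_barrier_wait(&fence);    /* "fence": all teardowns are done */
    return NULL;
}

/* Order in this PR: fence first, then tear down.  Passing the fence says
 * nothing about whether the other participants have torn down yet. */
static void *fence_then_teardown(void *resource)
{
    pthread_barrier_wait(&fence);    /* everyone still holds resources here */
    free(resource);                  /* no global completion guarantee      */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&fence, NULL, NTHREADS);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, teardown_then_fence, malloc(64));
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fence_then_teardown, malloc(64));
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    pthread_barrier_destroy(&fence);
    return 0;
}
```

Compile with cc -pthread. In the first round a thread leaves the barrier only after every thread has freed its resource; in the second round a thread can leave the barrier long before its peers have freed theirs.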

@jeking3
Copy link
Author

jeking3 commented Nov 25, 2025

Thanks for the explanation. The change has resolved the shutdown issue across hundreds of runs, whereas previously it failed almost every time by spinning in an infinite loop. Perhaps the second call to opal_common_ucx_mca_pmix_fence is unnecessary, and perhaps even erroneous, given your description of opal_common_ucx_del_procs.

@bosilca
Copy link
Member

bosilca commented Nov 26, 2025

Adding more fences was never a solid way to fix conceptual bugs. We care about scale, and adding more fences on the least performant network is not desirable, especially at scale. I know a lot of people internally (including myself) who run similar jobs regularly and have never had any issues with this. We might need to dig a little more to understand the root cause.


Successfully merging this pull request may close these issues:

  • TCP transport hangs during cleanup
  • Hang in libucs on mpi_finalize
