-
Notifications
You must be signed in to change notification settings - Fork 853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the connect/accept/spawn code to better utilize PMIx Group support #12398
Conversation
The "old" method relies on PMIx publish/lookup followed by a call to PMIx_Connect. It then does a "next cid" method to get the next communicator ID, which has multiple algorithms including one that calls PMIx_Group. Simplify this by using PMIx_Group_construct in place of PMIx_Connect, and have it return the next communicator ID. This is more scalable and reliable than the prior method. Retain the "old" method for now as this is new code. Create a new MCA param "OMPI_MCA_dpm_enable_new_method" to switch between the two approaches. Default it to "true" for now for ease of debugging. NOTE: this includes an update to the submodule pointers for PMIx and PRRTE to obtain the required updates to those code bases. Signed-off-by: Ralph Castain <rhc@pmix.org>
It provides a barrier and gives both the host and the PMIx library a chance to cleanup the group. Retain the old "fence"-based algorithm as well for when the old connect/accept method is used. Signed-off-by: Ralph Castain <rhc@pmix.org>
Update to get the changes required to support this new dpm connect/accept/spawn method. Signed-off-by: Ralph Castain <rhc@pmix.org>
Did you figure out the modex data issue that was causing tcp btl to not work? |
I believe so. The initial problem was a bug in the group code that caused us not to store the data from the child job. Once I fixed that, I could pass Lisandro's test. However, once I started adding more procs and spreading them over multiple nodes, it failed again. I tracked it down to not getting the modex data from all participating procs. Currently, we collect the data from the procs that participate directly in the So when the parent and child jobs only have one process, then all is good. Once either has more than one process, then we are missing modex info. What I intend to do is have I'm tied up today with a medical procedure, but hope to get to it over the next few days. |
(while talking in the Tuesday meeting, we saw Ralph's note that he's still working on this, so we moved the PR to "draft" just to make sure we don't accidentally merge it too early) |
Ah yes - thanks for moving to draft! As I noted elsewhere, I am not actually proposing this for OMPI to use. Only using this to help drive some work on the PMIx/PRRTE group support and wanted to test it a bit more broadly. If someone else wants to utilize the capability, they are welcome to do so. Should have included that clarification here as well. Sorry for the confusion. |
Add a second method for doing connect/accept
The "old" method relies on PMIx publish/lookup followed by
a call to PMIx_Connect. It then does a "next cid" method
to get the next communicator ID, which has multiple algorithms
including one that calls PMIx_Group.
Simplify this by using PMIx_Group_construct in place of
PMIx_Connect, and have it return the next communicator ID.
This is more scalable and reliable than the prior method.
Retain the "old" method for now as this is new code. Create
a new MCA param "OMPI_MCA_dpm_enable_new_method" to switch
between the two approaches. Default it to "true" for now
for ease of debugging.
NOTE: this includes an update to the submodule pointers
for PMIx and PRRTE to obtain the required updates to
those code bases.
Signed-off-by: Ralph Castain rhc@pmix.org
Use PMIx_Group_destruct for disconnect
It provides a barrier and gives both the host
and the PMIx library a chance to cleanup the
group. Retain the old "fence"-based algorithm
as well for when the old connect/accept
method is used.
Signed-off-by: Ralph Castain rhc@pmix.org
Up the submodule pointers to include required support
Update to get the changes required to support this new
dpm connect/accept/spawn method.
Signed-off-by: Ralph Castain rhc@pmix.org