-
Notifications
You must be signed in to change notification settings - Fork 68
btl/openib: update rdmacm for dynamic add_procs #1038
Conversation
This commit adds the data necessesary for supporting dynamic add_procs to the rdma message (opal_process_name_t). The endpoint lookup function has been updated to match the code in udcm. Closes open-mpi/ompi#1468. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from open-mpi/ompi@645bd9d) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
|
@alinask Could you verify that this works for you? |
|
Test PASSed. |
|
@hjelmn , thanks for this fix! I tested on the master branch, where mpi_add_procs_cutoff default is set to 0. rdmacm, only per-peer QP: rdmacm, first QP is per-peer, rest are SRQ: udcm: Since the default of btl_openib_receive_queues is to only use SRQ QPs (in the master branch), rdmacm won't work without changing this mca parameter from the command line. I added a comment stating this in the FAQ but maybe a warning about this should be added to the code like @jsquyres suggested or the default should have a per-peer QP at the beginning like the release branch (v1.10) has? |
|
An enhancement of the warning "rdmacm CPC only supported when the first QP is a PP QP" is here: |
|
👍 bot:assign: @jsquyres Please merge. |
|
OMPIBot error: Label "reviewed" is already set on issue 1038. |
|
Am I understanding this PR correctly:
|
|
There is an error message if btl_base_verbose is set (open-mpi/ompi#1488) and the current default is no PP QP (before this PR). |
|
If we're still segv'ing by default, this seems like half a solution. |
|
The SEGV still needs to be fixed in r2. I plan to take a look later today. |
|
Kind of. The SEGV requires the user to specifically ask for rdmacm which is suboptimal on IB systems. I would prefer this get merged regardless of the warning message and r2 fix. |
|
Also, there is a really clear error message about openib not working. It calls out the CPCs tried. |
|
How long to get the 2nd segv fix? I'm only trying to prevent the "I promise to commit the 2nd half of the fix!" ...but then get too busy/forget to do it kind of scenario. |
This commit adds code to detect when procs are unreachable when using the dynamic add_procs functionality. Fixes #1501 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from open-mpi/ompi@9d5eeec) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
|
OMPIBot error: Label "-pushed-back" does not exist. |
|
bot:nolabel:pushed-back |
|
@jsquyres The SEGV is now fixed. |
|
Test PASSed. |
|
@alinask Can you confirm that this PR now fixes everything for you / review the PR? Thanks. |
|
sure, I'll check. |
|
@hjelmn says yes, it was pushed to master as well (he said this verbally on the call today). |
|
@alinask, please ack ASAP. We are trying to get this release out. |
|
I checked the v2.x branch with the patch from this PR with the command lines from above. When using rdmacm without a per-peer QP, the output is: So there is no segmentation fault anymore. Thanks @hjelmn ! |
|
Thanks @alinask. I'm thinking that's an implicit 👍 :-) |
|
@hppritcha Good to go. |
This commit adds the data necessesary for supporting dynamic add_procs
to the rdmacm message (opal_process_name_t). The endpoint lookup
function has been updated to match the code in udcm.
Closes open-mpi/ompi#1468.
:bot:assign: @jladd-mlnx
:bot:milestone:v2.0.0
:bot🏷️bug
Signed-off-by: Nathan Hjelm hjelmn@lanl.gov
(cherry picked from open-mpi/ompi@645bd9d)
Signed-off-by: Nathan Hjelm hjelmn@lanl.gov