New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
han bugfix: coll_han_allreduce_low_module does not set the correct module #11823
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: 0d10025: han: explicitly exclude sm/tuned when splitting co...
5403341: opal info_subscriber: update subscriber object s_i...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
opal/util/info_subscriber.c
Outdated
&& 0 != strncmp(updated_value, value_str->string, value_str->length)) { | ||
err = opal_info_set(object->s_info, iterator->ie_key->string, updated_value); | ||
} | ||
opal_infosubscribe_inform_subscribers(object, iterator->ie_key->string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading opal_infosubscribe_inform_subscribers
I do not think the return value has any significance. Hence we should always update object->s_info
0d10025
to
3b5a528
Compare
CI failure
I'm not sure if this is related - @jsquyres could you help me understand the problem? Thanks! |
The test is fixed and passed |
* This sub-communicator contains the ranks that share my node. | ||
*/ | ||
opal_info_set(&comm_info, "ompi_comm_coll_preference", "tuned,^han"); | ||
opal_info_set(&comm_info, "ompi_comm_coll_preference", "tuned,^han,^sm"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that this is necessary, but anyway if memory serves right, it has to be tuned,^han,sm
(only one ^
); needs checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related: #10335 Edit: FYI the patch that I have at the end of that issue (with |
93d3a67
to
996713d
Compare
Thanks all. I excluded the problematic commit. TBH I don't have a perfect solution to #10335, and I agree with @bosilca that this is more of a bandaid. But I think this patch does make things marginally better. I'm open to suggestions. |
Thank you for the work on this. I don't know if the The root problem that this PR is trying to address is the same as #10335 right? I mean, here the problem is |
I agree. The issue I experienced was merely a symptom. This patch should apply to all info in general. |
👋 @bosilca Could you kindly review this change? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what the problem is, but I don't understand the proposed solution. Why is there a need to duplicate the info before updating them ?
996713d
to
040ad7c
Compare
opal/util/info_subscriber.c
Outdated
if (0 < update_cnt) { | ||
err = opal_info_set(object->s_info, iterator->ie_key->string, | ||
iterator->ie_value->string); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what the problem is, but I don't understand the proposed solution. Why is there a need to duplicate the info before updating them ?
@bosilca To me the confusion is actually here - is opal_infosubscribe_change_info
supposed to also update object->s_info
or not. Correct me if I'm wrong, but the code reads to me that opal_infosubscribe_change_info
is only responsible for notifying the 'subscribers' of new_info
. L260-L262 somewhat looks out of place.
This leads me to think that setting object->s_info
(via opal_info_dup(info, &newcomp->super.s_info)
) should be separate from opal_infosubscribe_change_info
.
Please let me know what I missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After digging into the code it seems the root cause was a lack of validation and a lack of understanding of the impact of 6bd36a7 into the code base. This issue with the info is just one of the many other issues that got into the main, in exchange of being able to create communicators with a tag given by an external software stack (aka to create sessions).
As a result, I retract my original complain about duplicating the info. I still have an issue with the changes in opal/util/info_subscriber.c
.
@bosilca Do you have more thoughts on this? |
@bosilca Heartbeat in case it's buried. |
I did test this again and can again confirm that (as far as I can tell) it does solve the info-key/HAN problem. I'm also trying again to understand the info-subscriber code. I guess the big question here is what
Am I on point here? This PR here looks like it operates on the first assumption? (as my patch in #10335 did) Does the implementation of e.g. ompi/ompi/mpi/c/comm_set_info.c Line 49 in 3b6fe68
|
opal/util/info_subscriber.c
Outdated
updated_value = item->callback(object, key, updated_value); | ||
if (found_callback) { | ||
*found_callback = 1; | ||
if (item->callback(object, key, new_value)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are throwing away the value returned from the callback. Because we are assuming that the potentially updated key will be saved in new_value
? But seeing some instances where a subscriber is utilized (mca_pml_ob1_set_allow_overtake
, ompi_osc_ucx_set_no_lock_info
), the opposite appears to happen. (this looks to be a result of somewhat confusing variable naming in the original implementation?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wenduwan any thoughts on this? I'm looking at e.g. mca_pml_ob1_set_allow_overtake()
, which appears to be an info-subscriber callback (registered in mca_pml_ob1_add_comm()
):
ompi/ompi/mca/pml/ob1/pml_ob1.c
Lines 268 to 270 in bb7ecde
/* Register the subscriber alert for the mpi_assert_allow_overtaking info. */ | |
opal_infosubscribe_subscribe (&comm->super, "mpi_assert_allow_overtaking", | |
"false", mca_pml_ob1_set_allow_overtake); |
ompi/ompi/mca/pml/ob1/pml_ob1.c
Lines 208 to 243 in bb7ecde
static const char* | |
mca_pml_ob1_set_allow_overtake(opal_infosubscriber_t* obj, | |
const char* key, | |
const char* value) | |
{ | |
ompi_communicator_t *ompi_comm = (ompi_communicator_t *) obj; | |
bool allow_overtake_was_set = OMPI_COMM_CHECK_ASSERT_ALLOW_OVERTAKE(ompi_comm); | |
/* As we keep the out-of-sequence messages ordered by their sequence, as a receiver we | |
* can just move the previously considered out-of-order messages into the unexpected queue, | |
* and we maintain some form of logical consistency with the message order. | |
*/ | |
if (opal_str_to_bool(value)) { | |
if (!allow_overtake_was_set) { | |
ompi_comm->c_flags |= OMPI_COMM_ASSERT_ALLOW_OVERTAKE; | |
mca_pml_ob1_merge_cant_match(ompi_comm); | |
} | |
return "true"; | |
} | |
if (allow_overtake_was_set) { | |
/* However, in the case we are trying to turn off allow_overtake, it is not clear what | |
* should be done with the previous messages that are pending on our peers, nor with | |
* the messages currently in the network. Similarly, if one process turns off allow | |
* overtake, before any potential sender start sending valid sequence numbers there | |
* is no way to order the messages in a sensible order. | |
* The possible solution is cumbersome, it would force a network quiescence followed by | |
* a synchronization of all processes in the communicator, and then all peers will | |
* start sending messages starting with sequence number 0. | |
* A lot of code for minimal benefit, especially taking in account that the MPI standard | |
* does not define this. Instead, refuse to disable allow overtake, and at least the | |
* user has the opportunity to check if we accepted to change it. | |
*/ | |
return "true"; | |
} | |
return "false"; | |
} |
The modified info value is returned from the function. From what I understand we need to keep this returned value and pass it to opal_info_set()
in opal_infosubscribe_change_info()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gkatev Many apologies for missing your comment earlier.
I took another look at this change. I don't quite understand how the callback value can be useful in this case(and now I doubt if this change is even correct).
Consider this snippet:
OPAL_LIST_FOREACH (item, list, opal_callback_list_item_t) {
updated_value = item->callback(object, key, updated_value);
...
}
This code iterates over a list of subscribers, and invokes the callback with a mutable variable updated_value
, and after the loop updated_value
will be set to the last callback's return value. I have 2 problems/questions about this approach:
- Why would it invoke the callback with a changing
update_value
- this means the order of subscribers matters. I do not understand this. - Why would it only keep the last
updated_value
and ignore the others? Again this means the order of subscribers must be critical.
@bosilca @devreal Could you please give some background about the original purpose of opal_infosubscribe_inform_subscribers
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MPI info values are complicated to handle, because they need to reflect not only what the user intent but also what the library accept. Thus, in some cases the user should be able to investigate which of the info provided will be taken into consideration by the implementation. This mechanism is set in place in OMPI via the subscribers, when the user sets a value we inform the subscribers about it and then we set it with a flag to indicate if the it was accepted or not.
Let's add this into the next week's agenda, as it deserves a more in-depth discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to meeting agenda
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code iterates over a list of subscribers, and invokes the callback with a mutable variable
updated_value
, and after the loopupdated_value
will be set to the last callback's return value. I have 2 problems/questions about this approach:1. Why would it invoke the callback with a changing `update_value` - this means the order of subscribers matters. I do not understand this. 2. Why would it only keep the last `updated_value` and ignore the others? Again this means the order of subscribers must be critical.
Hmm I see yes. With the current state of the callbacks it generally seems like we have to keep and use the return value, but your concerns above do indeed raise deeper questions about how all this is supposed to work (and I personally don't really know 🙂),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it was never meant to support multiple subscribers on the same info key. Basically the idea is that an info object belongs, or could impact, a single subscriber. Which means that is a single subscriber that will modify the value, while everyone else might just look at it. The order might matter, but clearly we never had an issue with the ordering so far.
Let me try a pre-meeting recap of my understanding of the root of this specific issue, just to make sure that we are on the same page (or at least that I'm on the correct one).
It's not duplication per se, it's a copy into the new communicator. In the past, at 81917f2, Lines 935 to 939 in 81917f2
The way I understand it is that it's simply copying the passed info into the newly created comm. (the coll components then read this info during the (Similar situation in This changes in 6bd36a7, where Lines 963 to 967 in 6bd36a7
(Similarly in This is where the bug is introduced. The problem is that Now, perhaps I'll leave it up to you to discuss how it's all supposed to be in the grand scheme of things! |
This patch adds missing info duplication steps. Unused/unreferenced info will be removed in later calls to opal_info_remove_unreferenced Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
040ad7c
to
64db513
Compare
Indeed, @devreal and myself reviewed the info/subscriber code and come to the conclusion that is does not do what it was required to, neither by the MPI standard nor by the users. Forcefully adding the keys, as proposed in this patch will only solve the HAN issue, while in same time breaking the MPI requirements (because the subscribers infrastructure lost the ability to remove keys that will be ignored). We figured out what needs to be done (mostly undoing an old patch), and @devreal volunteered to do so. Meanwhile, I'm skeptical about merging this PR as it breaks as many things as it fixes. |
Currently
coll_han_allreduce_low_module
does not set the correct intra-node collectives module, i.e. tuned vs sm.The root cause is the
ompi_comm_coll_preference
hint is not correctly passed and stored in the new communicator'ss_info
(I'm not sure about the naming - does it mean subscriber info?). This patch addresses that issue.