Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

han bugfix: coll_han_allreduce_low_module does not set the correct module #11823

Closed
wants to merge 1 commit into from

Conversation

wenduwan
Copy link
Contributor

Currently coll_han_allreduce_low_module does not set the correct intra-node collectives module, i.e. tuned vs sm.

The root cause is the ompi_comm_coll_preference hint is not correctly passed and stored in the new communicator's s_info(I'm not sure about the naming - does it mean subscriber info?). This patch addresses that issue.

@github-actions
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

0d10025: han: explicitly exclude sm/tuned when splitting co...

  • check_signed_off: does not contain a valid Signed-off-by line

5403341: opal info_subscriber: update subscriber object s_i...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

&& 0 != strncmp(updated_value, value_str->string, value_str->length)) {
err = opal_info_set(object->s_info, iterator->ie_key->string, updated_value);
}
opal_infosubscribe_inform_subscribers(object, iterator->ie_key->string,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading opal_infosubscribe_inform_subscribers I do not think the return value has any significance. Hence we should always update object->s_info

@wenduwan
Copy link
Contributor Author

CI failure

 ++ /GIT/ompi-refs_pull_11823_merge/ompi/ompi_install/bin/mpicc -o /GIT/ompi-refs_pull_11823_merge/jenkins_scripts/jenkins/ompi/c11_test /GIT/ompi-refs_pull_11823_merge/jenkins_scripts/jenkins/ompi/c11_test.c /GIT/ompi/ompi_install/lib/liboshmem.so
  gcc: error: /GIT/ompi/ompi_install/lib/liboshmem.so: No such file or directory

I'm not sure if this is related - @jsquyres could you help me understand the problem? Thanks!

@B-a-S
Copy link
Contributor

B-a-S commented Jul 14, 2023

CI failure

 ++ /GIT/ompi-refs_pull_11823_merge/ompi/ompi_install/bin/mpicc -o /GIT/ompi-refs_pull_11823_merge/jenkins_scripts/jenkins/ompi/c11_test /GIT/ompi-refs_pull_11823_merge/jenkins_scripts/jenkins/ompi/c11_test.c /GIT/ompi/ompi_install/lib/liboshmem.so
  gcc: error: /GIT/ompi/ompi_install/lib/liboshmem.so: No such file or directory

I'm not sure if this is related - @jsquyres could you help me understand the problem? Thanks!

The test is fixed and passed

* This sub-communicator contains the ranks that share my node.
*/
opal_info_set(&comm_info, "ompi_comm_coll_preference", "tuned,^han");
opal_info_set(&comm_info, "ompi_comm_coll_preference", "tuned,^han,^sm");
Copy link
Contributor

@gkatev gkatev Jul 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that this is necessary, but anyway if memory serves right, it has to be tuned,^han,sm (only one ^); needs checking

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gkatev @wenduwan That is correct -- there should only be a single ^

@gkatev
Copy link
Contributor

gkatev commented Jul 14, 2023

related: #10335

Edit: FYI the patch that I have at the end of that issue (with opal_info_dup) works, but I really don't know if it's suitable/elegant

@wenduwan wenduwan force-pushed the info_subscriber branch 2 times, most recently from 93d3a67 to 996713d Compare July 14, 2023 14:48
@wenduwan
Copy link
Contributor Author

Thanks all. I excluded the problematic commit.
@gkatev thank you for pointing me to the issue - I wasn't aware of the conversation. I think your patch makes sense, so I included it in this PR(not trying to steal your thunder 😅).

TBH I don't have a perfect solution to #10335, and I agree with @bosilca that this is more of a bandaid. But I think this patch does make things marginally better.

I'm open to suggestions.

@gkatev
Copy link
Contributor

gkatev commented Jul 14, 2023

Thank you for the work on this. I don't know if the opal_info_dup patch is perfect, but it's been working without hiccups (but perhaps it's not robust and may spuriously fail??)

The root problem that this PR is trying to address is the same as #10335 right? I mean, here the problem is ompi_comm_coll_preference, but from what I understand there is nothing special about ompi_comm_coll_preference, and the issue should apply to all info keys (?).

@wenduwan
Copy link
Contributor Author

The root problem that this PR is trying to address is the same as #10335 right? I mean, here the problem is ompi_comm_coll_preference, but from what I understand there is nothing special about ompi_comm_coll_preference, and the issue should apply to all info keys (?).

I agree. The issue I experienced was merely a symptom. This patch should apply to all info in general.

@wenduwan
Copy link
Contributor Author

👋 @bosilca Could you kindly review this change? Thank you!

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what the problem is, but I don't understand the proposed solution. Why is there a need to duplicate the info before updating them ?

opal/util/info_subscriber.c Outdated Show resolved Hide resolved
opal/util/info_subscriber.c Outdated Show resolved Hide resolved
Comment on lines 260 to 262
if (0 < update_cnt) {
err = opal_info_set(object->s_info, iterator->ie_key->string,
iterator->ie_value->string);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what the problem is, but I don't understand the proposed solution. Why is there a need to duplicate the info before updating them ?

@bosilca To me the confusion is actually here - is opal_infosubscribe_change_info supposed to also update object->s_info or not. Correct me if I'm wrong, but the code reads to me that opal_infosubscribe_change_info is only responsible for notifying the 'subscribers' of new_info. L260-L262 somewhat looks out of place.

This leads me to think that setting object->s_info(via opal_info_dup(info, &newcomp->super.s_info)) should be separate from opal_infosubscribe_change_info.

Please let me know what I missed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what the extra information might be worth, there were opal_info_dup calls present before 6bd36a7:

/* Copy info if there is one */
if (info) {
newcomp->super.s_info = OBJ_NEW(opal_info_t);
opal_info_dup(info, &(newcomp->super.s_info));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After digging into the code it seems the root cause was a lack of validation and a lack of understanding of the impact of 6bd36a7 into the code base. This issue with the info is just one of the many other issues that got into the main, in exchange of being able to create communicators with a tag given by an external software stack (aka to create sessions).

As a result, I retract my original complain about duplicating the info. I still have an issue with the changes in opal/util/info_subscriber.c.

@wenduwan
Copy link
Contributor Author

@bosilca Do you have more thoughts on this?

@wenduwan
Copy link
Contributor Author

@bosilca Heartbeat in case it's buried.

@wenduwan wenduwan self-assigned this Nov 15, 2023
@wenduwan
Copy link
Contributor Author

@bosilca @gkatev Do you have more comments on this PR? We are currently blocked on this - the sm coll cannot be selected for HAN allreduce.

@gkatev
Copy link
Contributor

gkatev commented Jan 30, 2024

I did test this again and can again confirm that (as far as I can tell) it does solve the info-key/HAN problem.

I'm also trying again to understand the info-subscriber code. I guess the big question here is what opal_infosubscribe_change_info() is supposed to do? I can see two possibilities:

  • It goes through the keys in new_info, and if a subscriber decides to update the key, it updates its value in object->s_info.
  • It goes through the keys in new_info, and adds everything to object->s_info, whether as-is, or modified as dictated by a subscriber.
    • (If it's supposed to do this, the current implementation's "as-is" part is bugged from what I understand)

Am I on point here? This PR here looks like it operates on the first assumption? (as my patch in #10335 did)

Does the implementation of e.g. MPI_Comm_set_info(), which uses opal_infosubscribe_change_info(), help us arrive to either of the possibilities above?

opal_infosubscribe_change_info(&(comm->super), &(info->super));

updated_value = item->callback(object, key, updated_value);
if (found_callback) {
*found_callback = 1;
if (item->callback(object, key, new_value)) {
Copy link
Contributor

@gkatev gkatev Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are throwing away the value returned from the callback. Because we are assuming that the potentially updated key will be saved in new_value? But seeing some instances where a subscriber is utilized (mca_pml_ob1_set_allow_overtake, ompi_osc_ucx_set_no_lock_info), the opposite appears to happen. (this looks to be a result of somewhat confusing variable naming in the original implementation?)

Copy link
Contributor

@gkatev gkatev Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenduwan any thoughts on this? I'm looking at e.g. mca_pml_ob1_set_allow_overtake(), which appears to be an info-subscriber callback (registered in mca_pml_ob1_add_comm()):

/* Register the subscriber alert for the mpi_assert_allow_overtaking info. */
opal_infosubscribe_subscribe (&comm->super, "mpi_assert_allow_overtaking",
"false", mca_pml_ob1_set_allow_overtake);

static const char*
mca_pml_ob1_set_allow_overtake(opal_infosubscriber_t* obj,
const char* key,
const char* value)
{
ompi_communicator_t *ompi_comm = (ompi_communicator_t *) obj;
bool allow_overtake_was_set = OMPI_COMM_CHECK_ASSERT_ALLOW_OVERTAKE(ompi_comm);
/* As we keep the out-of-sequence messages ordered by their sequence, as a receiver we
* can just move the previously considered out-of-order messages into the unexpected queue,
* and we maintain some form of logical consistency with the message order.
*/
if (opal_str_to_bool(value)) {
if (!allow_overtake_was_set) {
ompi_comm->c_flags |= OMPI_COMM_ASSERT_ALLOW_OVERTAKE;
mca_pml_ob1_merge_cant_match(ompi_comm);
}
return "true";
}
if (allow_overtake_was_set) {
/* However, in the case we are trying to turn off allow_overtake, it is not clear what
* should be done with the previous messages that are pending on our peers, nor with
* the messages currently in the network. Similarly, if one process turns off allow
* overtake, before any potential sender start sending valid sequence numbers there
* is no way to order the messages in a sensible order.
* The possible solution is cumbersome, it would force a network quiescence followed by
* a synchronization of all processes in the communicator, and then all peers will
* start sending messages starting with sequence number 0.
* A lot of code for minimal benefit, especially taking in account that the MPI standard
* does not define this. Instead, refuse to disable allow overtake, and at least the
* user has the opportunity to check if we accepted to change it.
*/
return "true";
}
return "false";
}

The modified info value is returned from the function. From what I understand we need to keep this returned value and pass it to opal_info_set() in opal_infosubscribe_change_info().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gkatev Many apologies for missing your comment earlier.

I took another look at this change. I don't quite understand how the callback value can be useful in this case(and now I doubt if this change is even correct).

Consider this snippet:

OPAL_LIST_FOREACH (item, list, opal_callback_list_item_t) {
        updated_value = item->callback(object, key, updated_value);
        ...
}

This code iterates over a list of subscribers, and invokes the callback with a mutable variable updated_value, and after the loop updated_value will be set to the last callback's return value. I have 2 problems/questions about this approach:

  1. Why would it invoke the callback with a changing update_value - this means the order of subscribers matters. I do not understand this.
  2. Why would it only keep the last updated_value and ignore the others? Again this means the order of subscribers must be critical.

@bosilca @devreal Could you please give some background about the original purpose of opal_infosubscribe_inform_subscribers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MPI info values are complicated to handle, because they need to reflect not only what the user intent but also what the library accept. Thus, in some cases the user should be able to investigate which of the info provided will be taken into consideration by the implementation. This mechanism is set in place in OMPI via the subscribers, when the user sets a value we inform the subscribers about it and then we set it with a flag to indicate if the it was accepted or not.

Let's add this into the next week's agenda, as it deserves a more in-depth discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to meeting agenda

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code iterates over a list of subscribers, and invokes the callback with a mutable variable updated_value, and after the loop updated_value will be set to the last callback's return value. I have 2 problems/questions about this approach:

1. Why would it invoke the callback with a changing `update_value` - this means the order of subscribers matters. I do not understand this.

2. Why would it only keep the last `updated_value` and ignore the others? Again this means the order of subscribers must be critical.

Hmm I see yes. With the current state of the callbacks it generally seems like we have to keep and use the return value, but your concerns above do indeed raise deeper questions about how all this is supposed to work (and I personally don't really know 🙂),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it was never meant to support multiple subscribers on the same info key. Basically the idea is that an info object belongs, or could impact, a single subscriber. Which means that is a single subscriber that will modify the value, while everyone else might just look at it. The order might matter, but clearly we never had an issue with the ordering so far.

@gkatev
Copy link
Contributor

gkatev commented Apr 20, 2024

Let me try a pre-meeting recap of my understanding of the root of this specific issue, just to make sure that we are on the same page (or at least that I'm on the correct one).

Why is there a need to duplicate the info before updating them ?

It's not duplication per se, it's a copy into the new communicator.

In the past, at 81917f2, ompi_comm_split_type() did contain an opal_info_dup() call:

// Copy info if there is one.
newcomp->super.s_info = OBJ_NEW(opal_info_t);
if (info) {
opal_info_dup(info, &(newcomp->super.s_info));
}

The way I understand it is that it's simply copying the passed info into the newly created comm. (the coll components then read this info during the ompi_comm_activate() call)

(Similar situation in ompi_comm_dup_with_info(), L1039-1043)


This changes in 6bd36a7, where opal_info_dup() is replaced with opal_infosubscribe_change_info():

ompi_comm_assert_subscribe (newcomp, OMPI_COMM_ASSERT_LAZY_BARRIER);
ompi_comm_assert_subscribe (newcomp, OMPI_COMM_ASSERT_ACTIVE_POLL);
if (info) {
opal_infosubscribe_change_info(&newcomp->super, info);
}

(Similarly in ompi_comm_dup_with_info(), L1068-1072)

This is where the bug is introduced. The problem is that opal_infosubscribe_change_info() only sets values that get returned from subscribers, so keys like ompi_comm_coll_preference are never added to the new comm, and never get passed to coll base/components.


Now, perhaps opal_infosubscribe_change_info() is supposed to be doing something different than it currently is?? Like adding all info keys, whether as-is or as-modified by subscribers? Or maybe the opal_info_dup() calls were simply forgotten? Or perhaps the use case of collective components passing around info keys was not fully taken into account.

I'll leave it up to you to discuss how it's all supposed to be in the grand scheme of things!

This patch adds missing info duplication steps. Unused/unreferenced info will
be removed in later calls to opal_info_remove_unreferenced

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan
Copy link
Contributor Author

After face-to-face discussion with @bosilca and @devreal , we recognize that the info subscriber subsystem is bust and needs major rework. For this PR we only need to duplicate the info.

@bosilca
Copy link
Member

bosilca commented Apr 26, 2024

Indeed, @devreal and myself reviewed the info/subscriber code and come to the conclusion that is does not do what it was required to, neither by the MPI standard nor by the users. Forcefully adding the keys, as proposed in this patch will only solve the HAN issue, while in same time breaking the MPI requirements (because the subscribers infrastructure lost the ability to remove keys that will be ignored). We figured out what needs to be done (mostly undoing an old patch), and @devreal volunteered to do so.

Meanwhile, I'm skeptical about merging this PR as it breaks as many things as it fixes.

@wenduwan
Copy link
Contributor Author

@bosilca Ah thanks for the elaboration. I will cancel this PR then and follow up with @devreal .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants