Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct some operations modifying status.MPI_ERROR when it is disallowed #9128

Merged
merged 3 commits into from
Jul 22, 2021

Conversation

abouteiller
Copy link
Member

This is in response to issue #9032

Signed-off-by: Aurelien Bouteiller bouteill@icl.utk.edu

Copy link
Member

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we fix ucx pml in a separate PR?

@abouteiller
Copy link
Member Author

should we fix ucx pml in a separate PR?

I'll add it, I also found some similar issues in Portals, I can fix but not test as I don't have access to hardware.

by the standard.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

Cleanup

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
@abouteiller abouteiller force-pushed the bugfix/err-not-in-status branch 2 times, most recently from bffb449 to eba0892 Compare July 8, 2021 10:10
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
@abouteiller
Copy link
Member Author

bot:retest

@@ -67,7 +67,7 @@ int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status
if (MPI_PROC_NULL == source) {
*flag = 1;
if (MPI_STATUS_IGNORE != status) {
*status = ompi_request_empty.req_status;
OMPI_COPY_STATUS(status, ompi_request_empty.req_status, false);
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the memcheck below ? Why are we checking that the user passes a pointer to a struct of a size at least equal to MPI_Status ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I figured out what is going on. According to section 3.2.5 the MPI_ERROR might not be set for calls working on a single request. However, the MPI standard does not prohibit this, and on current architectures is faster to flush entire line of cache than having a branch with multiple instructions on the critical path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bosilca MPI-4.0 3.2.5 says:

In general, message-passing calls do not modify the value of the error code field of status variables. This field may be updated only by the functions in Section 3.7.5 which return multiple statuses. The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

This, to me, unambiguously says that MPI only updates MPI_ERROR from MPI_[TEST|WAIT][SOME|ALL|ANY].

I agree with @bosilca that this might mean losing a potentially useful cache line optimization. ☹️

That being said, MPI-4.0 3.7.3 says:

An empty status is a status which is set to return tag = MPI_ANY_TAG, source = MPI_ANY_SOURCE, error = MPI_SUCCESS...

So it seems like there's at least some cases of contradiction here: if we're supposed to set the empty status but it's a single completion function (e.g., MPI_TEST / MPI_WAIT), do we set MPI_ERROR or not?

Bottom line(s):

  • I think that in the cases where we're not setting the empty status, it's unambiguous: we should only be setting status.MPI_ERROR in the multi-completion functions MPI_[TEST|WAIT][SOME|ALL|ANY].
  • In the cases where we're setting the empty status from a single completion function, it's unclear what we're supposed to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says "may" which gives us some latitude of updating it or not.

Copy link
Member

@jsquyres jsquyres Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are parsing the text wrong. It has two operative sentences:

This field may be updated only by the functions in Section 3.7.5 which return multiple statuses.

This says that MPI_[TEST|WAIT][ALL|ANY|SOME] may update the field. And only those functions.

The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

This says that status.MPI_ERROR will be set only when MPI_[TEST|WAIT][ALL|ANY|SOME] returns MPI_ERROR_IN_STATUS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, he first part excludes everything except what is allowed by the sentences that follows. Still strange for the standard to be that precise on how to handle the error code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, he first part excludes everything except what is allowed by the sentences that follows. Still strange for the standard to be that precise on how to handle the error code.

Totally agree. Unfortunately, we're stuck with it. ☹️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we bring this to the forum, apparently the only rationale is to 'help implementations go faster'. which it definitely doesn't do, here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably wouldn't hurt.

@awlauria
Copy link
Contributor

is this ready to go in @jsquyres @bosilca ? Would like to get this into v5.0.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants