
request: return error code from free_fn of a generalized request #6751

Merged: 8 commits merged into pmodels:main from 2310_greq_free on Oct 30, 2023

Conversation

hzhou (Contributor) commented Oct 26, 2023

Pull Request Description

We are supposed to return the error code from the free_fn of a generalized request. This was broken in PR #6718

Also add the missing entry src/binding/c/c_binding.c to .gitignore.

Also fix a missing-prototypes warning in the mpif_h binding.

Fixes #6747
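
Illustration: a minimal sketch of the affected scenario, a generalized request whose free_fn returns an error code that MPI_Wait is then expected to propagate. The choice of MPI_ERR_OTHER and the MPI_ERRORS_RETURN setup are illustrative, not taken from this PR.

    #include <mpi.h>
    #include <stdio.h>

    static int query_fn(void *extra_state, MPI_Status *status)
    {
        /* an empty operation: zero elements, not cancelled */
        MPI_Status_set_elements(status, MPI_BYTE, 0);
        MPI_Status_set_cancelled(status, 0);
        status->MPI_SOURCE = MPI_UNDEFINED;
        status->MPI_TAG = MPI_UNDEFINED;
        return MPI_SUCCESS;
    }

    static int free_fn(void *extra_state)
    {
        /* the error code that MPI_Wait is supposed to return to the caller */
        return MPI_ERR_OTHER;
    }

    static int cancel_fn(void *extra_state, int complete)
    {
        return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* return error codes instead of aborting */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        MPI_Request req;
        MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &req);
        MPI_Grequest_complete(req);

        /* With the fix, rc carries free_fn's error code; the broken
         * code returned MPI_SUCCESS here. */
        int rc = MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("MPI_Wait returned %d (MPI_ERR_OTHER is %d)\n", rc, MPI_ERR_OTHER);

        MPI_Finalize();
        return 0;
    }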

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou force-pushed the 2310_greq_free branch 2 times, most recently from b4039da to 6e67e50 on October 26, 2023 21:59
hzhou (Contributor Author) commented Oct 27, 2023

test:mpich/ch3/tcp ✔️
test:mpich/ch4/ofi - 1 unrelated failure ch4-ofi-asan threads/pt2pt/mt_improbe_sendrecv_huge 2 TIMEOUT

hzhou requested a review from raffenet on October 27, 2023 14:15
hzhou added this to "Fixes" in the MPICH-4.2 Release project on Oct 27, 2023
Review thread on the changed code:

    array_of_requests[idx] = MPI_REQUEST_NULL;
    if (rc && array_of_statuses != MPI_STATUSES_IGNORE) {
        array_of_statuses[idx].MPI_ERROR = rc;
raffenet (Contributor) commented:

are the other status objects guaranteed to have MPI_SUCCESS in them at this point?

hzhou (Contributor Author) replied:

Ahh, a typo. array_of_statuses[idx] should be array_of_statuses[i].

> are the other status objects guaranteed to have MPI_SUCCESS in them at this point?

I am not sure I understand the question. Why would it matter?

raffenet (Contributor) replied:

The user is supposed to look in the status array to find out which operation has an error. So presumably all operations that succeeded need to have MPI_SUCCESS in the status object or else the user may think that they failed? That way this actual error stands out as the one needing handling.
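
For illustration, a sketch of that usage pattern; wait_and_report is a hypothetical helper, and MPI_ERRORS_RETURN is assumed so the error code reaches the caller:

    #include <mpi.h>
    #include <stdio.h>

    static void wait_and_report(int n, MPI_Request reqs[], MPI_Status statuses[])
    {
        int rc = MPI_Waitall(n, reqs, statuses);
        if (rc == MPI_ERR_IN_STATUS) {
            /* The MPI_ERROR fields are defined only in this case; requests
             * that succeeded must read MPI_SUCCESS so the real failure
             * stands out. */
            for (int i = 0; i < n; i++) {
                if (statuses[i].MPI_ERROR != MPI_SUCCESS)
                    fprintf(stderr, "request %d failed with error code %d\n",
                            i, statuses[i].MPI_ERROR);
            }
        }
    }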

hzhou (Contributor Author) replied:

I see. Ahh, so messy. I am tempted to let MPIR_Request_completion_processing always set the status->MPI_ERROR field. If we do that, do you think we can argue that we are not breaking the MPI standard?

hzhou (Contributor Author) commented Oct 27, 2023:

Quote from 3.2.5:

> In general, message-passing calls do not modify the value of the error code field of status variables. This field may be updated only by the functions in Section 3.7.5 which return multiple statuses. The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

> Rationale. The error field in status is not needed for calls that return only one status, such as MPI_WAIT, since that would only duplicate the information returned by the function itself. The current design avoids the additional overhead of setting it, in such cases. The field is needed for calls that return multiple statuses, since each request may have had a different failure. (End of rationale.)

hzhou (Contributor Author) commented Oct 27, 2023:

I think the wording such as "in general ... do not modify" and "may be updated only by ..." does not preclude us from setting the MPI_ERROR field. @raffenet What do you think?

The rationale essentially explains to the user why the MPI_ERROR field may not get updated and thus should not be relied on, but it does not say that the MPI_ERROR field must be left untouched.

raffenet (Contributor) replied:

Yes, I agree. I think we can always set it.

raffenet (Contributor) commented:

I read the text again because the commit modifying a bunch of tests gave me pause. I no longer think we can just blindly set the MPI_ERROR field in status. This sentence seems clear cut:

> The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

hzhou (Contributor Author) replied:

Okay, I'll revert the changes. Still, IMO this is a nonsense clause.

hzhou force-pushed the 2310_greq_free branch 6 times, most recently from 8f06200 to e0629ce on October 29, 2023 04:27
dalcinl (Contributor) commented Oct 29, 2023

The changes to this PR lead to a deadlock when running the mpi4py testsuite in singleton init mode.
https://github.com/mpi4py/mpi4py-testing/actions/runs/6681969488
I cancelled the runs and there are no logs to look at, sorry.

hzhou (Contributor Author) commented Oct 29, 2023

test:mpich/ch4/ofi
test:mpich/ch3/tcp

test:mpich/custom
netmod: ch4:ofi
compiler: intel

hzhou (Contributor Author) commented Oct 29, 2023

test:mpich/ch4/ofi
test:mpich/ch3/tcp

test:mpich/custom
netmod: ch4:ofi
compiler: intel

hzhou (Contributor Author) commented Oct 30, 2023 via email

dalcinl (Contributor) commented Oct 30, 2023

Now I get different failures, all related to MPI_Errhandler. Not sure if that is a consequence of this PR, or some other regression that accumulated over the last few days.
https://github.com/mpi4py/mpi4py-testing/actions/runs/6688933854

hzhou (Contributor Author) commented Oct 30, 2023

> Now I get different failures, all related to MPI_Errhandler. Not sure if that is a consequence of this PR, or some other regression that accumulated over the last few days. https://github.com/mpi4py/mpi4py-testing/actions/runs/6688933854

On an invalid errorhandler argument, we now return MPI_ERR_ERRHANDLER instead of MPI_ERR_ARG -- #6750

hzhou (Contributor Author) commented Oct 30, 2023

test:mpich/ch4/ofi - 2 TIMEOUT due to server overload
test:mpich/ch3/tcp ✔️

Commit messages:

Add MPIR_Request_free_return, which is the same as MPIR_Request_free but returns any error code encountered during free. Right now, it just returns the potential return code from the free_fn of a generalized request.

Make MPI_{Test,Wait}{,all,any,some} return the error code from the free_fn of a generalized request.

The generated C binding code is produced by maint/gen_binding_c.py.

The newly added routine mpi_alloc_mem_cptr_ is missing a declaration in mpi_fortimpl.h.

Typos from copy-pasting resulted in missing pmpi_base.mod rules.

In MPIC_Waitall, presetting the status_array assumes MPIR_Waitall will not modify the status fields for send requests. This is not a good assumption and makes the code tricky to maintain. Instead, we only call MPIR_Process_status on recv requests, by checking request kinds (see the sketch below).
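
A paraphrased sketch of that change; the real code uses MPICH-internal request types from private headers, so the type and field names below are stand-in assumptions, not the actual identifiers:

    #include <mpi.h>

    /* Stand-ins for the MPICH-internal request representation. */
    enum req_kind { REQ_KIND_SEND, REQ_KIND_RECV };
    struct req { enum req_kind kind; int mpi_error; };

    /* Rather than presetting the whole status array up front, consult
     * each request's kind and fill in a status only for recv requests,
     * whose statuses the caller actually reads. */
    static void process_recv_statuses(int count, struct req *reqs[],
                                      MPI_Status status_array[])
    {
        for (int i = 0; i < count; i++) {
            if (reqs[i]->kind == REQ_KIND_RECV)
                status_array[i].MPI_ERROR = reqs[i]->mpi_error;
        }
    }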

The test issues an MPI_Irecv first and then an MPI_Irsend, while the partner process only sends its message after receiving the MPI_Irsend message. The idea is that the MPI_Irsend should complete first, and thus MPI_Waitany should complete the MPI_Irsend first. However, it is possible that the progress inside MPI_Waitany completes both requests and returns the index for the MPI_Irecv first -- strictly speaking, this does not break the standard.

Even though waitany completes the Irsend first most of the time, sometimes, especially in the ubsan build, it completes both and returns the Irecv first, causing CI test failures.

Since this ordering is not strictly required by the standard, we relax the test.

Check the following 3 cases explicitly:
   * statuses is MPI_STATUSES_IGNORE: just return MPI_ERR_IN_STATUS.
   * mpi_error is already set to MPI_ERR_IN_STATUS.
   * mpi_error was MPI_SUCCESS: we need to initialize all statuses'
     MPI_ERROR fields to MPI_SUCCESS first.

A sketch of this logic follows.
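
A minimal sketch of those three cases; record_status_error is a hypothetical helper, not the actual MPICH code, with idx the failing request's index and rc its error code:

    #include <mpi.h>

    static int record_status_error(int count, MPI_Status array_of_statuses[],
                                   int idx, int rc, int mpi_error)
    {
        if (array_of_statuses == MPI_STATUSES_IGNORE) {
            /* Case 1: nowhere to record per-request errors. */
            return MPI_ERR_IN_STATUS;
        }
        if (mpi_error != MPI_ERR_IN_STATUS) {
            /* Case 3: first error encountered; initialize every MPI_ERROR
             * field so requests that succeeded read MPI_SUCCESS. */
            for (int i = 0; i < count; i++)
                array_of_statuses[i].MPI_ERROR = MPI_SUCCESS;
        }
        /* Case 2 needs no initialization: the fields are already set. */
        array_of_statuses[idx].MPI_ERROR = rc;
        return MPI_ERR_IN_STATUS;
    }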
hzhou (Contributor Author) commented Oct 30, 2023

test:mpich/ch4/ofi

hzhou merged commit 43befc9 into pmodels:main on Oct 30, 2023
2 of 5 checks passed
hzhou deleted the 2310_greq_free branch on October 30, 2023 21:24
Merging this pull request may close issue #6747: mpi4py: Regression in generalized requests.