
request: return error code from free_fn of a generalized request #6751

Merged: 8 commits merged into pmodels:main from 2310_greq_free on Oct 30, 2023

Conversation

hzhou (Contributor) commented Oct 26, 2023

Pull Request Description

We are supposed to return the error code from the free_fn of a generalized request. This was broken in PR #6718

Also add the missing entry src/binding/c/c_binding.c to .gitignore.

Also fix a missing-prototypes warning in the mpif_h binding.

Fixes #6747
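
Illustration: a minimal sketch of the affected scenario, a generalized request whose free_fn returns an error code that MPI_Wait is then expected to propagate. The choice of MPI_ERR_OTHER and the MPI_ERRORS_RETURN setup are illustrative, not taken from this PR.

    #include <mpi.h>
    #include <stdio.h>

    static int query_fn(void *extra_state, MPI_Status *status)
    {
        /* an empty operation: zero elements, not cancelled */
        MPI_Status_set_elements(status, MPI_BYTE, 0);
        MPI_Status_set_cancelled(status, 0);
        status->MPI_SOURCE = MPI_UNDEFINED;
        status->MPI_TAG = MPI_UNDEFINED;
        return MPI_SUCCESS;
    }

    static int free_fn(void *extra_state)
    {
        /* the error code that MPI_Wait is supposed to return to the caller */
        return MPI_ERR_OTHER;
    }

    static int cancel_fn(void *extra_state, int complete)
    {
        return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* return error codes instead of aborting */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        MPI_Request req;
        MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &req);
        MPI_Grequest_complete(req);

        /* With the fix, rc carries free_fn's error code; the broken
         * code returned MPI_SUCCESS here. */
        int rc = MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("MPI_Wait returned %d (MPI_ERR_OTHER is %d)\n", rc, MPI_ERR_OTHER);

        MPI_Finalize();
        return 0;
    }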

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou force-pushed the 2310_greq_free branch 2 times, most recently from b4039da to 6e67e50 on October 26, 2023 21:59
hzhou (Contributor Author) commented Oct 27, 2023

test:mpich/ch3/tcp ✔️
test:mpich/ch4/ofi - 1 unrelated failure ch4-ofi-asan threads/pt2pt/mt_improbe_sendrecv_huge 2 TIMEOUT

hzhou requested a review from raffenet on October 27, 2023 14:15
hzhou added this to "Fixes" in the MPICH-4.2 Release project on Oct 27, 2023
Review thread on the changed code:

    array_of_requests[idx] = MPI_REQUEST_NULL;
    if (rc && array_of_statuses != MPI_STATUSES_IGNORE) {
        array_of_statuses[idx].MPI_ERROR = rc;
raffenet (Contributor) commented:

are the other status objects guaranteed to have MPI_SUCCESS in them at this point?

hzhou (Contributor Author) replied:

Ahh, a typo. array_of_statuses[idx] should be array_of_statuses[i].

> are the other status objects guaranteed to have MPI_SUCCESS in them at this point?

I am not sure I understand the question. Why would it matter?

raffenet (Contributor) replied:

The user is supposed to look in the status array to find out which operation has an error. So presumably all operations that succeeded need to have MPI_SUCCESS in the status object or else the user may think that they failed? That way this actual error stands out as the one needing handling.
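
For illustration, a sketch of that usage pattern; wait_and_report is a hypothetical helper, and MPI_ERRORS_RETURN is assumed so the error code reaches the caller:

    #include <mpi.h>
    #include <stdio.h>

    static void wait_and_report(int n, MPI_Request reqs[], MPI_Status statuses[])
    {
        int rc = MPI_Waitall(n, reqs, statuses);
        if (rc == MPI_ERR_IN_STATUS) {
            /* The MPI_ERROR fields are defined only in this case; requests
             * that succeeded must read MPI_SUCCESS so the real failure
             * stands out. */
            for (int i = 0; i < n; i++) {
                if (statuses[i].MPI_ERROR != MPI_SUCCESS)
                    fprintf(stderr, "request %d failed with error code %d\n",
                            i, statuses[i].MPI_ERROR);
            }
        }
    }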

hzhou (Contributor Author) replied:

I see. Ahh, so messy. I am tempted to let MPIR_Request_completion_processing always set the status->MPI_ERROR field. If we do that, do you think we can argue that we are not breaking the MPI standard?

hzhou (Contributor Author) commented Oct 27, 2023:

Quote from 3.2.5:

> In general, message-passing calls do not modify the value of the error code field of status variables. This field may be updated only by the functions in Section 3.7.5 which return multiple statuses. The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

> Rationale. The error field in status is not needed for calls that return only one status, such as MPI_WAIT, since that would only duplicate the information returned by the function itself. The current design avoids the additional overhead of setting it, in such cases. The field is needed for calls that return multiple statuses, since each request may have had a different failure. (End of rationale.)

hzhou (Contributor Author) commented Oct 27, 2023:

I think the wording such as "in general ... do not modify" and "may be updated only by ..." does not preclude us from setting the MPI_ERROR field. @raffenet What do you think?

The rationale essentially explains to the user why the MPI_ERROR field may not get updated and thus should not be relied on, but it does not say that the MPI_ERROR field must be left untouched.

raffenet (Contributor) replied:

Yes, I agree. I think we can always set it.

raffenet (Contributor) commented:

I read the text again because the commit modifying a bunch of tests gave me pause. I no longer think we can just blindly set the MPI_ERROR field in status. This sentence seems clear cut:

> The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

hzhou (Contributor Author) replied:

Okay, I'll revert the changes. Still, IMO this is a nonsense clause.

hzhou force-pushed the 2310_greq_free branch 6 times, most recently from 8f06200 to e0629ce on October 29, 2023 04:27
dalcinl (Contributor) commented Oct 29, 2023

The changes to this PR lead to a deadlock when running the mpi4py testsuite in singleton init mode.
https://github.com/mpi4py/mpi4py-testing/actions/runs/6681969488
I cancelled the runs and there are no logs to look at, sorry.

hzhou (Contributor Author) commented Oct 29, 2023

test:mpich/ch4/ofi
test:mpich/ch3/tcp

test:mpich/custom
netmod: ch4:ofi
compiler: intel

hzhou (Contributor Author) commented Oct 29, 2023

test:mpich/ch4/ofi
test:mpich/ch3/tcp

test:mpich/custom
netmod: ch4:ofi
compiler: intel

hzhou (Contributor Author) commented Oct 30, 2023 via email

dalcinl (Contributor) commented Oct 30, 2023

Now I get different failures, all related to MPI_Errhandler. Not sure if that is a consequence of this PR, or some other regression that accumulated over the last few days.
https://github.com/mpi4py/mpi4py-testing/actions/runs/6688933854

hzhou (Contributor Author) commented Oct 30, 2023

> Now I get different failures, all related to MPI_Errhandler. Not sure if that is a consequence of this PR, or some other regression that accumulated over the last few days. https://github.com/mpi4py/mpi4py-testing/actions/runs/6688933854

On an invalid errorhandler argument, we now return MPI_ERR_ERRHANDLER instead of MPI_ERR_ARG -- #6750

hzhou (Contributor Author) commented Oct 30, 2023

test:mpich/ch4/ofi - 2 TIMEOUT due to server overload
test:mpich/ch3/tcp ✔️

Commit messages:

Add MPIR_Request_free_return, which is the same as MPIR_Request_free but returns any error code encountered during free. Right now, it just returns the potential return code from the free_fn of a generalized request.

Make MPI_{Test,Wait}{,all,any,some} return the error code from the free_fn of a generalized request.

The generated C binding code is produced by maint/gen_binding_c.py.

The newly added routine mpi_alloc_mem_cptr_ is missing a declaration in mpi_fortimpl.h.

Typos from copy-pasting resulted in missing pmpi_base.mod rules.

In MPIC_Waitall, presetting the status_array assumes MPIR_Waitall will not modify the status fields for send requests. This is not a good assumption and makes the code tricky to maintain. Instead, we only call MPIR_Process_status on recv requests, by checking request kinds (see the sketch below).
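
A paraphrased sketch of that change; the real code uses MPICH-internal request types from private headers, so the type and field names below are stand-in assumptions, not the actual identifiers:

    #include <mpi.h>

    /* Stand-ins for the MPICH-internal request representation. */
    enum req_kind { REQ_KIND_SEND, REQ_KIND_RECV };
    struct req { enum req_kind kind; int mpi_error; };

    /* Rather than presetting the whole status array up front, consult
     * each request's kind and fill in a status only for recv requests,
     * whose statuses the caller actually reads. */
    static void process_recv_statuses(int count, struct req *reqs[],
                                      MPI_Status status_array[])
    {
        for (int i = 0; i < count; i++) {
            if (reqs[i]->kind == REQ_KIND_RECV)
                status_array[i].MPI_ERROR = reqs[i]->mpi_error;
        }
    }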

The test issues an MPI_Irecv first and then an MPI_Irsend, while the partner process only sends its message after receiving the MPI_Irsend message. The idea is that the MPI_Irsend should complete first, and thus MPI_Waitany should complete the MPI_Irsend first. However, it is possible that the progress inside MPI_Waitany completes both requests and returns the index for the MPI_Irecv first -- strictly speaking, this does not break the standard.

Even though waitany completes the Irsend first most of the time, sometimes, especially in the ubsan build, it completes both and returns the Irecv first, causing CI test failures.

Since this ordering is not strictly required by the standard, we relax the test.

Check the following 3 cases explicitly:
   * statuses is MPI_STATUSES_IGNORE: just return MPI_ERR_IN_STATUS.
   * mpi_error is already set to MPI_ERR_IN_STATUS.
   * mpi_error was MPI_SUCCESS: we need to initialize all statuses'
     MPI_ERROR fields to MPI_SUCCESS first.

A sketch of this logic follows.
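
A minimal sketch of those three cases; record_status_error is a hypothetical helper, not the actual MPICH code, with idx the failing request's index and rc its error code:

    #include <mpi.h>

    static int record_status_error(int count, MPI_Status array_of_statuses[],
                                   int idx, int rc, int mpi_error)
    {
        if (array_of_statuses == MPI_STATUSES_IGNORE) {
            /* Case 1: nowhere to record per-request errors. */
            return MPI_ERR_IN_STATUS;
        }
        if (mpi_error != MPI_ERR_IN_STATUS) {
            /* Case 3: first error encountered; initialize every MPI_ERROR
             * field so requests that succeeded read MPI_SUCCESS. */
            for (int i = 0; i < count; i++)
                array_of_statuses[i].MPI_ERROR = MPI_SUCCESS;
        }
        /* Case 2 needs no initialization: the fields are already set. */
        array_of_statuses[idx].MPI_ERROR = rc;
        return MPI_ERR_IN_STATUS;
    }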
hzhou (Contributor Author) commented Oct 30, 2023

test:mpich/ch4/ofi

hzhou merged commit 43befc9 into pmodels:main on Oct 30, 2023
2 of 5 checks passed
hzhou deleted the 2310_greq_free branch on October 30, 2023 21:24
Merging this pull request may close issue #6747: mpi4py: Regression in generalized requests.