Error propagation in free_fn callback for generalized requests #11681

dalcinl · 2023-05-16T12:10:39Z

This is with branch v5.0.x, but I guess main has the same issue (currently, I'm unable to build main with internal pmix on Fedora 38, I'll submit another issue).

Reproducer

#include <stddef.h>
#include <mpi.h>

static int query_fn  (void *ctx, MPI_Status *s) { return MPI_SUCCESS; }
static int free_fn   (void *ctx) { return MPI_ERR_OTHER; }  // <-- RETURN WITH FAILURE !!!
static int cancel_fn (void *ctx, int c) { return MPI_SUCCESS;   }

int main(int argc, char *argv[])
{
  MPI_Request request;
  MPI_Init(&argc, &argv);
  MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);

  MPI_Grequest_complete(request);
  MPI_Wait(&request, MPI_STATUS_IGNORE);

  MPI_Finalize();
  return 0;
}

Expected behavior

The reproducer should abort (via default error handler).

Actual behavior

The reproducer runs to completion with success and no output.
The error code from free_fn is not propagated to MPI_Wait as required by the MPI standard.

The text was updated successfully, but these errors were encountered:

dalcinl · 2024-01-23T10:47:11Z

@jsquyres Could you please reopen this issue? Thins still do not work quite as I would expect.

Here you have another reproducer:

#include <stdio.h>
#include <mpi.h>

static int query_fn  (void *ctx, MPI_Status *s) { return MPI_SUCCESS; }
static int free_fn   (void *ctx) { return MPI_ERR_OTHER; }  // <-- RETURN WITH FAILURE !!!
static int cancel_fn (void *ctx, int c) { return MPI_SUCCESS;   }

int main(int argc, char *argv[])
{
  int ierr;
  MPI_Request request;
  MPI_Init(&argc, &argv);

  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);
  MPI_Grequest_complete(request);

  {
    ierr = MPI_Wait(&request, MPI_STATUS_IGNORE);
    printf("After Wait() - error: %d, active: %d\n", ierr, request!=MPI_REQUEST_NULL);
  }
  
  if (MPI_REQUEST_NULL != request) {
    ierr = MPI_Request_free(&request);
    printf("After Free() - error: %d, active: %d\n", ierr, request!=MPI_REQUEST_NULL);
  }
  
  MPI_Finalize();
  return 0;
}

I'm getting the following output:

$ mpicc test.c
$ ./a.out 
After Wait() - error: 16, active: 1
After Free() - error: 17, active: 1

As you can see, I'm still getting an error after MPI_Request_free. Additionally, the request variable is not set to MPI_REQUEST_NULL afterwards. This kind of behavior make error handling and handle lifetime management cumbersome.
IMHO, If the call to Wait() fails and does not fully deallocate the request object and set it to MPI_REQUEST_NULL on return (maybe this is the ideal behavior), then the call to Free() should not fail at all (the error has already been reported at Wait()) and of course the request should be fully deallocated and set to MPI_REQUEST_NULL.

bosilca · 2024-01-23T18:48:20Z

The grequest provided free function returned an error, why would you expect the request to be MPI_REQUEST_NULL ? For what we know the request was not freed.

dalcinl · 2024-01-23T19:02:16Z

The grequest provided free function returned an error, why would you expect the request to be MPI_REQUEST_NULL ? For what we know the request was not freed.

The free function is a user thing used to release user resources. If the free function somehow fails to release these user resources, there is nothing else to do with the user stuff. However, MPI could happily continue and release internal MPI stuff, and then, at the end, return an error code to signal the user free failure.

If you do not like/want the above behavior, then at least don't make MPI_Request_free() fail again. Why would MPI_Request_free() fail again? a) The user free function is not being called again b) the error was already reported and hopefully handled at the Wait() or Test() call. Even worse, after the MPI_Request_free(), the request handle is not set to MPI_REQUEST_NULL, making handle lifetime management hard.

Please look at my reproducer above.
Ideally, I would argue that the example should print a single line with

After Wait() - error: 16, active: 0

That is, the the wait call deallocates everything MPI-internal, then completes with error and sets request=MPI_REQUEST_NULL on return.

Again, if that is not possible, what I'm asking for is to at least make the example print

After Wait() - error: 16, active: 1
After Free() - error: 0, active: 0

that is, MPI_Request_free() deallocates everything MPI-internal, then completes successfully and sets request=MPI_REQUEST_NULL on return.

dalcinl · 2024-01-23T19:07:18Z

FWIW, the MPICH behavior is for MPI_Wait to set request=MPI_REQUEST_NULL and fail returning error. This is my preferred implementation from my previous comment. If you decide to do things differently, it would be just another instance of MPI implementation exhibiting different behavior.

EDIT: Same thing for MPI_Test.

jsquyres added Target: main Target: v5.0.x bug labels May 16, 2023

bosilca mentioned this issue May 16, 2023

Propagate the error from the generalize request free callback to the user #11683

Merged

bosilca self-assigned this Jun 24, 2023

jsquyres closed this as completed in #11683 Jan 19, 2024

jsquyres mentioned this issue Jan 19, 2024

v5.0.x: Propagate the error up to the user. #12253

Open

jsquyres added this to the v5.0.2 milestone Jan 19, 2024

dalcinl mentioned this issue Mar 5, 2024

Fix error handling for generalized requests #12392

Open

jsquyres reopened this Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error propagation in free_fn callback for generalized requests #11681

Error propagation in free_fn callback for generalized requests #11681

dalcinl commented May 16, 2023

dalcinl commented Jan 23, 2024

bosilca commented Jan 23, 2024

dalcinl commented Jan 23, 2024

dalcinl commented Jan 23, 2024 •

edited

Error propagation in free_fn callback for generalized requests #11681

Error propagation in free_fn callback for generalized requests #11681

Comments

dalcinl commented May 16, 2023

Reproducer

Expected behavior

Actual behavior

dalcinl commented Jan 23, 2024

bosilca commented Jan 23, 2024

dalcinl commented Jan 23, 2024

dalcinl commented Jan 23, 2024 • edited

dalcinl commented Jan 23, 2024 •

edited