Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error propagation in free_fn callback for generalized requests #11681

Open
dalcinl opened this issue May 16, 2023 · 4 comments · Fixed by #11683
Open

Error propagation in free_fn callback for generalized requests #11681

dalcinl opened this issue May 16, 2023 · 4 comments · Fixed by #11683

Comments

@dalcinl
Copy link
Contributor

dalcinl commented May 16, 2023

This is with branch v5.0.x, but I guess main has the same issue (currently, I'm unable to build main with internal pmix on Fedora 38, I'll submit another issue).

Reproducer

#include <stddef.h>
#include <mpi.h>

static int query_fn  (void *ctx, MPI_Status *s) { return MPI_SUCCESS; }
static int free_fn   (void *ctx) { return MPI_ERR_OTHER; }  // <-- RETURN WITH FAILURE !!!
static int cancel_fn (void *ctx, int c) { return MPI_SUCCESS;   }

int main(int argc, char *argv[])
{
  MPI_Request request;
  MPI_Init(&argc, &argv);
  MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);

  MPI_Grequest_complete(request);
  MPI_Wait(&request, MPI_STATUS_IGNORE);

  MPI_Finalize();
  return 0;
}

Expected behavior

The reproducer should abort (via default error handler).

Actual behavior

The reproducer runs to completion with success and no output.
The error code from free_fn is not propagated to MPI_Wait as required by the MPI standard.

@dalcinl
Copy link
Contributor Author

dalcinl commented Jan 23, 2024

@jsquyres Could you please reopen this issue? Thins still do not work quite as I would expect.

Here you have another reproducer:

#include <stdio.h>
#include <mpi.h>

static int query_fn  (void *ctx, MPI_Status *s) { return MPI_SUCCESS; }
static int free_fn   (void *ctx) { return MPI_ERR_OTHER; }  // <-- RETURN WITH FAILURE !!!
static int cancel_fn (void *ctx, int c) { return MPI_SUCCESS;   }

int main(int argc, char *argv[])
{
  int ierr;
  MPI_Request request;
  MPI_Init(&argc, &argv);

  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);
  MPI_Grequest_complete(request);

  {
    ierr = MPI_Wait(&request, MPI_STATUS_IGNORE);
    printf("After Wait() - error: %d, active: %d\n", ierr, request!=MPI_REQUEST_NULL);
  }
  
  if (MPI_REQUEST_NULL != request) {
    ierr = MPI_Request_free(&request);
    printf("After Free() - error: %d, active: %d\n", ierr, request!=MPI_REQUEST_NULL);
  }
  
  MPI_Finalize();
  return 0;
}

I'm getting the following output:

$ mpicc test.c
$ ./a.out 
After Wait() - error: 16, active: 1
After Free() - error: 17, active: 1

As you can see, I'm still getting an error after MPI_Request_free. Additionally, the request variable is not set to MPI_REQUEST_NULL afterwards. This kind of behavior make error handling and handle lifetime management cumbersome.
IMHO, If the call to Wait() fails and does not fully deallocate the request object and set it to MPI_REQUEST_NULL on return (maybe this is the ideal behavior), then the call to Free() should not fail at all (the error has already been reported at Wait()) and of course the request should be fully deallocated and set to MPI_REQUEST_NULL.

@bosilca
Copy link
Member

bosilca commented Jan 23, 2024

The grequest provided free function returned an error, why would you expect the request to be MPI_REQUEST_NULL ? For what we know the request was not freed.

@dalcinl
Copy link
Contributor Author

dalcinl commented Jan 23, 2024

The grequest provided free function returned an error, why would you expect the request to be MPI_REQUEST_NULL ? For what we know the request was not freed.

The free function is a user thing used to release user resources. If the free function somehow fails to release these user resources, there is nothing else to do with the user stuff. However, MPI could happily continue and release internal MPI stuff, and then, at the end, return an error code to signal the user free failure.

If you do not like/want the above behavior, then at least don't make MPI_Request_free() fail again. Why would MPI_Request_free() fail again? a) The user free function is not being called again b) the error was already reported and hopefully handled at the Wait() or Test() call. Even worse, after the MPI_Request_free(), the request handle is not set to MPI_REQUEST_NULL, making handle lifetime management hard.

Please look at my reproducer above.
Ideally, I would argue that the example should print a single line with

After Wait() - error: 16, active: 0

That is, the the wait call deallocates everything MPI-internal, then completes with error and sets request=MPI_REQUEST_NULL on return.

Again, if that is not possible, what I'm asking for is to at least make the example print

After Wait() - error: 16, active: 1
After Free() - error: 0, active: 0

that is, MPI_Request_free() deallocates everything MPI-internal, then completes successfully and sets request=MPI_REQUEST_NULL on return.

@dalcinl
Copy link
Contributor Author

dalcinl commented Jan 23, 2024

FWIW, the MPICH behavior is for MPI_Wait to set request=MPI_REQUEST_NULL and fail returning error. This is my preferred implementation from my previous comment. If you decide to do things differently, it would be just another instance of MPI implementation exhibiting different behavior.

EDIT: Same thing for MPI_Test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants