Fix yalla PML: MPI_Recv does not return MPI_ERR_TRUNCATE upon overflow #3260
Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
|
hmm...something odd with AWS jenkins. |
|
okay did a reboot of AWS jenkins. |
|
bot:retest |
|
bot:retest Fixed the webhooks, we think, so trying to rebuild again. |
|
bot:lanl:retest |
|
Why should MPI_Recv return error in this case? |
|
extracted from MPI_Recv() man page: Actually this is the behavior of the ob1 pml: if you set |
|
i'm not sure this behavior is consistent.. for example MPI_Irecv would not have a fatal error in this case, and fail only the particular receive operation. |
|
Actually we got an issue because our validation complained about a regression: they used to work with ob1 and used to get an MPI_ERR_TRUNCATE in case of recv overflow. |
|
@derbeyn yes. |
|
Ok, so now we have to decide which PML is behaving correctly... |
|
It is correct for an MPI_Recv to return MPI_ERR_TRUNCATE if the buffer that was passed was too small. MPI will then invoke the error handler on that communicator (which defaults to aborting). Same is essentially true for MPI_Irecv, except the error is not usually discovered until after MPI_Irecv returns. In that case, whenever the error is discovered (e.g., in a call to MPI_Test or MPI_Wait), MPI invokes the error handler. If you have a non-default error handler (i.e., one that does not abort), then MPI_ERR_TRUNCATE should be returned from MPI_Recv, and should be returned in the status.MPI_ERROR field of the status upon completion from MPI_Test/MPI_Wait. |
|
Jeff beat me to it, because I was getting the citation in the MPI standard... Which is the rationale in 3.2.5 of MPI-3.1. So the PML should set the error field in the internal status object (IIRC, it's been a while) as soon as it knows a truncate has occurred. The upper layers will invoke the error handler at the proper time. |
|
@bwbarrett ic, so this fix is good. Since ob1 returns the status, i think we can define that it's the responsibility of the PML to return the status. Comment on the patch: We can make PML_YALLA_SET_RECV_STATUS() return the value of rc instead of adding it as a parameter, so "int rc;" would not have to be defined everywhere, even when not used. |
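For illustration, the suggestion above (have the status-setting helper return the result code rather than take `rc` as an extra parameter) could be sketched roughly as follows. This is not the actual patch: the type `sketch_status_t`, the function `sketch_set_recv_status()`, and the `SKETCH_*` constants are hypothetical stand-ins, and the real PML_YALLA_SET_RECV_STATUS() is a macro working on MXM request fields that are not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical result codes; arbitrary values, not the MPI constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_TRUNCATE = 1 };

/* Hypothetical stand-in for a receive status object. */
typedef struct {
    int    source;
    int    tag;
    int    error;  /* plays the role of status.MPI_ERROR */
    size_t count;  /* number of bytes actually delivered */
} sketch_status_t;

/*
 * Detect truncation, record it in the status (when the caller supplied
 * one), and return the result code. Because the code is the return value,
 * call sites that ignore it do not need to declare an unused "int rc;".
 */
static int sketch_set_recv_status(sketch_status_t *status,
                                  int source, int tag,
                                  size_t sent_len, size_t recv_len)
{
    int rc = (sent_len > recv_len) ? SKETCH_ERR_TRUNCATE : SKETCH_SUCCESS;

    if (status != NULL) {
        status->source = source;
        status->tag    = tag;
        status->error  = rc;
        /* On overflow only recv_len bytes fit in the buffer. */
        status->count  = (sent_len > recv_len) ? recv_len : sent_len;
    }
    return rc;
}
```

A blocking receive path could then end with `return sketch_set_recv_status(status, src, tag, sent_len, recv_len);`, leaving the upper layers to invoke the error handler, while a path that does not care about the code simply drops the return value.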
|
sure, will change that. |
Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
|
@yosofe done. Sorry for the delay, but I've got a training during the whole week. |
|
Oops, sorry! I didn't correctly mention @yosefe in the last post, so maybe you didn't receive a notification? |
|
@derbeyn got it now. |
|
Please open PRs for other branches that need this fix. |
This commit fixes a bug in the following context:
OMPI_MCA_pml=yalla
OMPI_MCA_mtl=mxm
When MPI_Recv() is called with an overflow condition (the receive size is smaller than the sent size), it succeeds instead of returning MPI_ERR_TRUNCATE.
Signed-off-by: Nadia Derbey <nadia.derbey@atos.net>