Skip to content

Conversation

@derbeyn
Copy link
Contributor

@derbeyn derbeyn commented Mar 30, 2017

This commit fixes a bug in the following context:
OMPI_MCA_pml=yalla
OMPI_MCA_mtl=mxm
When calling MPI_Recv() with an overflow condition (size in the receive smaller than the sent size), MPI_Recv() succeeds instead of returning MPI_ERR_TRUNCATE

Signed-off-by: Nadia Derbey nadia.derbey@atos.net

Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
@hppritcha
Copy link
Member

hmm...something odd with AWS jenkins.
bot:lanl:retest

@hppritcha
Copy link
Member

okay did a reboot of AWS jenkins.
bot:lanl:retest

@mike-dubman
Copy link
Member

@yosefe - plz review
@derbeyn - thanks for a patch!

@hppritcha
Copy link
Member

bot:retest

@bwbarrett
Copy link
Member

bot:retest

Fixed the webhooks, we think, so trying to rebuild again.

@hppritcha
Copy link
Member

bot:lanl:retest

@yosefe
Copy link
Contributor

yosefe commented Mar 31, 2017

Why should MPI_Recv return error in this case?
The error should be examined by MPI_ERROR field of "status" as defined by MPI spec.

@derbeyn
Copy link
Contributor Author

derbeyn commented Mar 31, 2017

extracted from MPI_Recv() man page:
"The length of the received message must be less than or equal to the length of the receive buffer. An MPI_ERR_TRUNCATE is returned upon the overflow condition."

Actually this is the behavior of the ob1 pml: if you set
OMPI_MCA_pml=ob1
OMPI_MCA_btl=self,openib
and you run a test with a MPI_Recv() size that is smaller than the corresponding MPI_Send(), you will get the following error message
...
[btp8:18862] *** MPI_ERR_TRUNCATE: message truncated
[btp8:18862] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
...

@yosefe
Copy link
Contributor

yosefe commented Mar 31, 2017

i'm not sure this behavior is consistent.. for example MPI_Irecv would not have a fatal error in this case, and fail only the particular receive operation.
@jsquyres @rhc54 @hppritcha what do you think?

@derbeyn
Copy link
Contributor Author

derbeyn commented Mar 31, 2017

Actually we got an issue because our validation complained about a regression: they used to work with ob1 and used to get an MPI_ERR_TRUNCATE in case of recv overflow.
Now, they are using yalla and don't get the error anymore.
So, whatever the behavior, it should be the same for both ob1 and yalla. So 1 of them should be fixed, nomatter which one.
Do you agree with that?

@yosefe
Copy link
Contributor

yosefe commented Mar 31, 2017

@derbeyn yes.

@derbeyn
Copy link
Contributor Author

derbeyn commented Mar 31, 2017

Ok, so now we have to decide which PML is behaving correctly...

@jsquyres
Copy link
Member

It is correct for an MPI_Recv to return MPI_ERR_TRUNCATE if the buffer that was passed was too small. MPI will then invoke the error handler on that communicator (which defaults to aborting).

Same is essentially true for MPI_Irecv, except the error is not usually discovered until after MPI_Irecv returns. In that case, whenever the error is discovered (e.g., in a call to MPI_Test or MPI_Wait), MPI invokes the error handler.

If you have a non-default error handler (i.e., one that does not abort), then MPI_ERR_TRUNCATE should be returned from MPI_Recv, and should be returned in the status.MPI_ERROR field of the status upon completion from MPI_Test/MPI_Wait.

@bwbarrett
Copy link
Member

Jeff beat me to it, because I was getting the citation in the MPI standard... Which is the rationale in 3.2.5 of MPI-3.1.

So the PML should set the error field in the internal status object (IIRC, it's been a while) as soon as it knows a truncate has occurred. The upper layers will invoke the error handler at the proper time.

@yosefe
Copy link
Contributor

yosefe commented Apr 3, 2017

@bwbarrett ic, so this fix is good. Since ob1 returns the status, i think we can define that it's the responsibility of the PML to return the status.

Comment on the patch: We can make PML_YALLA_SET_RECV_STATUS() return the value of rc instead of adding it as a parameter, so "int rc;" would not have to be defined everywhere, even when not used.

@derbeyn
Copy link
Contributor Author

derbeyn commented Apr 4, 2017

sure, will change that.

Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
@derbeyn
Copy link
Contributor Author

derbeyn commented Apr 6, 2017

@yosofe done. Sorry for the delay, but I've got a training during the whole week.

@derbeyn
Copy link
Contributor Author

derbeyn commented Apr 10, 2017

Oops, sorry! Didn't correctly mention @yosefe in the last post. So may be you didn't receive a notification?

@yosefe
Copy link
Contributor

yosefe commented Apr 10, 2017

@derbeyn got it now.

@hppritcha
Copy link
Member

Please open PRs for other branches that need this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants