Conversation

@hjelmn
Member

@hjelmn hjelmn commented Jun 1, 2016

The request code was setting the request as pml_complete before
calling MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE. This was causing
MCA_PML_OB1_SEND_REQUEST_RETURN to be called twice in some cases. The
code now mirrors the recvreq code and only sets the request as pml
complete if the request has not already been freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
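
For illustration, here is a minimal, self-contained sketch of the ordering the commit message describes. The names below (toy_send_request_t, toy_mpi_complete, toy_request_return) are hypothetical stand-ins, not the actual ob1 structures or macros; the real send path also releases RDMA and buffered-send resources and runs under the pml's own synchronization:

```c
/* Illustrative sketch only -- stand-in names, not the ob1 code. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool req_pml_complete;   /* the pml layer is finished with the request */
    bool req_free_called;    /* the user already freed the request         */
} toy_send_request_t;

static void toy_mpi_complete(toy_send_request_t *req)   { (void)req; puts("mpi complete"); }
static void toy_request_return(toy_send_request_t *req) { (void)req; puts("request returned"); }

static void toy_send_request_pml_complete(toy_send_request_t *req)
{
    if (!req->req_pml_complete) {
        if (!req->req_free_called) {
            /* Request not freed yet: mark it pml-complete and signal MPI
             * completion; the eventual free is then responsible for the
             * one and only return of the request. */
            req->req_pml_complete = true;
            toy_mpi_complete(req);
        } else {
            /* Free already ran: this path now owns the single return. */
            toy_request_return(req);
        }
    }
}

int main(void)
{
    toy_send_request_t req = { .req_pml_complete = false, .req_free_called = false };
    toy_send_request_pml_complete(&req);   /* prints "mpi complete" */
    return 0;
}
```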

@hjelmn
Member Author

hjelmn commented Jun 1, 2016

@bosilca Found this race yesterday. Was causing hangs and crashes with the overlap test in thread-tests-1.1.

@hjelmn
Member Author

hjelmn commented Jun 1, 2016

@thananon Please take a look.

@thananon
Member

thananon commented Jun 1, 2016

👍

@hjelmn hjelmn merged commit d844442 into open-mpi:master Jun 1, 2016
@bosilca
Member

bosilca commented Jun 1, 2016

I don't think this is the right fix. Based on what you say in the log, it is possible that send_request_pml_complete is called multiple times, and that in some of these cases req_pml_complete is false. If this is the case then we have more serious problems than what you fixed with this patch as we will release the RDMA resources and the buffered data twice (and no good can ensue).

@hjelmn
Member Author

hjelmn commented Jun 1, 2016

@bosilca Not the case. I put messages in send_request_pml_complete and MCA_PML_OB1_SEND_REQUEST_RETURN and determined send_request_pml_complete is only called once for the request but MCA_PML_OB1_SEND_REQUEST_RETURN is called twice. It was returned by the MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE macro and then we attempted to return it again by calling MCA_PML_OB1_SEND_REQUEST_RETURN directly.

@thananon
Member

thananon commented Jun 1, 2016

@bosilca I don't think it is possible for send_request_pml_complete to get called multiple times.

@hjelmn
Member Author

hjelmn commented Jun 2, 2016

@bosilca I should mention that the hang/crash occurred while one thread was in MPI_Barrier() and another in MPI_Recv. The request that was returned twice was from the barrier.

@bosilca
Member

bosilca commented Jun 2, 2016

So the execution path is for small messages. I see that we are calling the MCA_PML_OB1_SEND_REQUEST_RETURN macro explicitly in that path. As there was a request, we can further assume the issue is not with inlined messages.

@bosilca
Member

bosilca commented Jun 2, 2016

@hjelmn can you identify the explicit call to MCA_PML_OB1_SEND_REQUEST_RETURN? For send requests I do not see how we can end up calling MCA_PML_OB1_SEND_REQUEST_RETURN twice (unless we do it in isend.c line 261, which would be b.a.d).

@hjelmn
Member Author

hjelmn commented Jun 2, 2016

I only traced the extra call as far as the MCA_PML_OB1_SEND_REQUEST_MPI_COMPLETE macro. That macro only calls ompi_request_complete(), so the second return must somehow come from that call. I detected it with messages printed before and after the macro call.

@hjelmn
Member Author

hjelmn commented Jun 2, 2016

What probably happened is that the thread calling send_request_pml_complete() set the request as pml and mpi complete and then got suspended. The other thread came in and called ompi_request_free(), which returned the request because it was pml complete. When the thread in send_request_pml_complete() resumed, it saw that free had been called and tried to return the request again.
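
To make that interleaving concrete, here is a toy, self-contained reproduction of the pattern using plain pthreads and a deliberate sleep to force the suspension; none of the names below are Open MPI code:

```c
/* Toy reproduction of the interleaving described above -- not Open MPI code.
 * The completer thread plays the pre-patch completion path: it marks the
 * request complete, "gets suspended" (a sleep), then sees that free was
 * called and returns the request. The freer thread plays ompi_request_free():
 * it sees the request is already complete and returns it too. Result: two
 * returns of one request. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool pml_complete = false;
static atomic_bool mpi_complete = false;
static atomic_bool free_called  = false;
static atomic_int  return_count = 0;

static void return_request(const char *who)
{
    int n = atomic_fetch_add(&return_count, 1) + 1;
    printf("%s returned the request (return #%d)\n", who, n);
}

static void *completer(void *arg)   /* models the pre-patch completion path */
{
    (void)arg;
    atomic_store(&pml_complete, true);   /* pml-complete set too early        */
    atomic_store(&mpi_complete, true);   /* request now visible as complete   */
    sleep(1);                            /* thread gets suspended here        */
    if (atomic_load(&free_called)) {
        return_request("completer");     /* second return: the bug            */
    }
    return NULL;
}

static void *freer(void *arg)       /* models ompi_request_free() in thread 2 */
{
    (void)arg;
    while (!atomic_load(&mpi_complete)) { /* spin until the request completes */ }
    atomic_store(&free_called, true);
    if (atomic_load(&pml_complete)) {
        return_request("freer");         /* first return: legitimate          */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, completer, NULL);
    pthread_create(&b, NULL, freer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("request returned %d time(s)\n", atomic_load(&return_count));
    return 0;
}
```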

@hjelmn
Member Author

hjelmn commented Jun 2, 2016

I bet the request lock prevented this race before and I probably only made it less likely. We probably want to set the request as pml complete only after it is mpi complete and free hasn't been called.
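
A rough sketch of that suggested ordering, with hypothetical stand-in names and a plain pthread mutex in place of the pml's own synchronization; whichever of the two paths runs second performs the single return:

```c
/* Sketch of the ordering suggested above (stand-in names, not ob1 code):
 * mark the request pml-complete only once MPI completion has been signalled
 * and only if free has not yet been called; the single return is performed
 * by whichever path runs second, decided under one lock. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;
    bool mpi_complete;   /* MPI-level completion has been signalled        */
    bool pml_complete;   /* pml layer is done; the free path owns the return */
    bool free_called;    /* the user has called free on the request        */
} toy_request_t;

/* Completion path (plays the role of send_request_pml_complete). */
static bool toy_pml_complete(toy_request_t *req)
{
    bool do_return;
    pthread_mutex_lock(&req->lock);
    req->mpi_complete = true;        /* MPI completion first                 */
    if (!req->free_called) {
        req->pml_complete = true;    /* the later free() will do the return  */
        do_return = false;
    } else {
        do_return = true;            /* free() already ran: we return it now */
    }
    pthread_mutex_unlock(&req->lock);
    return do_return;
}

/* Free path (plays the role of ompi_request_free). */
static bool toy_free_request(toy_request_t *req)
{
    bool do_return;
    pthread_mutex_lock(&req->lock);
    req->free_called = true;
    do_return = req->pml_complete;   /* return only if the pml is done       */
    pthread_mutex_unlock(&req->lock);
    return do_return;
}

int main(void)
{
    toy_request_t req = { .mpi_complete = false, .pml_complete = false, .free_called = false };
    pthread_mutex_init(&req.lock, NULL);

    /* Completion happens before free: only the free path returns the request. */
    printf("completion path returns the request? %d\n", toy_pml_complete(&req));
    printf("free path returns the request?       %d\n", toy_free_request(&req));

    pthread_mutex_destroy(&req.lock);
    return 0;
}
```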
