start bug fixes #1729

hjelmn · 2016-05-30T17:26:40Z

@bosilca I think I have the problem fixed. Took less work than I expected. In order to detect the difference between a new request (fresh from isend_init) and an incomplete buffered send both the cm and ob1 isend_init functions now mark the request as pml_complete. This allows start to accurately use pml_complete to decide if a new request is needed. I kept the remainder of the cleanup as we know that no receive request can be in this state. Let me know if you are happy with the changes. I want to attach this to the request rework as osc/pt2pt hangs without this

This replaces #1726.

There were several problems with the implementation of start in Open MPI: - There are no checks whatsoever on the state of the request(s) provided to MPI_Start/MPI_Start_all. It is erroneous to provide an active request to either of these calls. Since we are already looping over the provided requests there is little overhead in verifying that the request can be started. - Both ob1 and cm were always throwing away the request on the initial call to start and start_all with a particular request. Subsequent calls would see that the request was pml_complete and reuse it. This introduced a leak as the initial request was never freed. Since the only pml request that can be mpi complete but not pml complete is a buffered send the code to reallocate the request has been moved. To detect that a request is indeed mpi complete but not pml complete isend_init in both cm and ob1 now marks the new request as pml complete. - If a new request was needed the callbacks on the original request were not copied over to the new request. This can cause osc/pt2pt to hang as the incoming message callback is never called. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

Starting a new receive may cause a recursive call into the pt2pt frag receive function. If this happens and the prior request is on the garbage collection list it could cause problems. This commit moves the gc insert until after the new request has been posted. Signed-off-by: Nathan Hjelm <hjelmn@me.com>

bosilca · 2016-05-31T15:14:51Z

The code looks better than the previous version, but I did not had a chance to test it yet. I'm traveling and will not be able to check more thoughtfully before Thursday.

hjelmn · 2016-06-02T18:22:05Z

@bosilca Let me know if you are ok with this change. I would like to get it in before 9pm EDT tonight so it can run in MTT.

bosilca · 2016-06-02T23:06:27Z

@hjelmn the original way of testing for the request status was correct (looking at req_status being ACTIVE), while testing for MPI and PML completion can lead to issues in case the request is indeed MPI and PML complete but has never been waited on.

hjelmn · 2016-06-02T23:38:57Z

@bosilca I still check if the request is active on req_status. If it is active OMPI_ERR_REQUEST is returned because per MPI-3.1 pp77-78 a request should be inactive. I don't like the should wording (too weak) but I think the intent is must be inactive. Setting the request as pml_complete is just a way to differentiate between a new request (not yet started) and subsequent starts. On an initial start it is not ok to just mark the request as freed because it will never be waited on. Does that make sense?

bosilca · 2016-06-03T00:10:45Z

@hjelmn thanks for the explanation, the diff is becoming difficult to read. One last question: these requests are persistent, and you never downgrade them to more normal state. Thus, I don't see how they are released when you create a new one because of the previous one not being PML complete?

hjelmn · 2016-06-03T00:14:50Z

Yeah, it is enough code change to make the diffs a pain to read.

In most cases the request is re-used so it doesn't need to be freed on start. The only case where this is not true is buffered sends. In that case the old request is marked as freed if it is not pml complete.

bosilca · 2016-06-03T00:22:36Z

👍 Let's give it a try.

* mpi/start: fix bugs in cm and ob1 start functions There were several problems with the implementation of start in Open MPI: - There are no checks whatsoever on the state of the request(s) provided to MPI_Start/MPI_Start_all. It is erroneous to provide an active request to either of these calls. Since we are already looping over the provided requests there is little overhead in verifying that the request can be started. - Both ob1 and cm were always throwing away the request on the initial call to start and start_all with a particular request. Subsequent calls would see that the request was pml_complete and reuse it. This introduced a leak as the initial request was never freed. Since the only pml request that can be mpi complete but not pml complete is a buffered send the code to reallocate the request has been moved. To detect that a request is indeed mpi complete but not pml complete isend_init in both cm and ob1 now marks the new request as pml complete. - If a new request was needed the callbacks on the original request were not copied over to the new request. This can cause osc/pt2pt to hang as the incoming message callback is never called. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> * osc/pt2pt: add request for gc after starting a new request Starting a new receive may cause a recursive call into the pt2pt frag receive function. If this happens and the prior request is on the garbage collection list it could cause problems. This commit moves the gc insert until after the new request has been posted. Signed-off-by: Nathan Hjelm <hjelmn@me.com> (cherry picked from open-mpi/ompi@e968ddf) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* mpi/start: fix bugs in cm and ob1 start functions There were several problems with the implementation of start in Open MPI: - There are no checks whatsoever on the state of the request(s) provided to MPI_Start/MPI_Start_all. It is erroneous to provide an active request to either of these calls. Since we are already looping over the provided requests there is little overhead in verifying that the request can be started. - Both ob1 and cm were always throwing away the request on the initial call to start and start_all with a particular request. Subsequent calls would see that the request was pml_complete and reuse it. This introduced a leak as the initial request was never freed. Since the only pml request that can be mpi complete but not pml complete is a buffered send the code to reallocate the request has been moved. To detect that a request is indeed mpi complete but not pml complete isend_init in both cm and ob1 now marks the new request as pml complete. - If a new request was needed the callbacks on the original request were not copied over to the new request. This can cause osc/pt2pt to hang as the incoming message callback is never called. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> * osc/pt2pt: add request for gc after starting a new request Starting a new receive may cause a recursive call into the pt2pt frag receive function. If this happens and the prior request is on the garbage collection list it could cause problems. This commit moves the gc insert until after the new request has been posted. Signed-off-by: Nathan Hjelm <hjelmn@me.com>

hjelmn force-pushed the request_fixes branch from 41cc6f8 to b38c8b3 Compare May 30, 2016 17:40

hjelmn changed the title ~~start buf fixes~~ start bug fixes May 30, 2016

hjelmn force-pushed the request_fixes branch from b38c8b3 to abfbc6f Compare May 30, 2016 17:59

hjelmn and others added 2 commits May 30, 2016 12:12

hjelmn force-pushed the request_fixes branch from abfbc6f to 5a6ea4d Compare May 30, 2016 18:12

bosilca merged commit e968ddf into open-mpi:master Jun 3, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

start bug fixes #1729

start bug fixes #1729

Uh oh!

hjelmn commented May 30, 2016

Uh oh!

bosilca commented May 31, 2016

Uh oh!

hjelmn commented Jun 2, 2016

Uh oh!

bosilca commented Jun 2, 2016

Uh oh!

hjelmn commented Jun 2, 2016

Uh oh!

bosilca commented Jun 3, 2016

Uh oh!

hjelmn commented Jun 3, 2016

Uh oh!

bosilca commented Jun 3, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

start bug fixes #1729

start bug fixes #1729

Uh oh!

Conversation

hjelmn commented May 30, 2016

Uh oh!

bosilca commented May 31, 2016

Uh oh!

hjelmn commented Jun 2, 2016

Uh oh!

bosilca commented Jun 2, 2016

Uh oh!

hjelmn commented Jun 2, 2016

Uh oh!

bosilca commented Jun 3, 2016

Uh oh!

hjelmn commented Jun 3, 2016

Uh oh!

bosilca commented Jun 3, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants