Skip to content

Conversation

@jsquyres
Copy link
Member

@jsquyres jsquyres commented May 3, 2019

See #6633 for a full explanation.

NOTE: The cherry-pick of a16cf0e had to include an extra typedef in opal/threads/thread_usage.h from elsewhere in the tree (and not part of a16cf0e), clearly noted in the commit.

FYI @EmmanuelBRELLE

In case of using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the sent handle is necessary the handle in the frag but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request one.

I suggest to check if the handle in the fragment is valid and if not to
send the handle from the request.

Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
(cherry picked from commit e630046)
@ibm-ompi
Copy link

ibm-ompi commented May 3, 2019

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/90c02b30b0acdf60d49f6eff5f77e769

@ibm-ompi
Copy link

ibm-ompi commented May 3, 2019

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e21e4f2797e937a188a1f6df3cd01bed

The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.

Be more explicit on the comment.

The rdma_frag can only be freed once when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been trasnferred.

NOTE: Had to add a typedef for "opal_atomic_size_t" from master into
opal/threads/thread_usage.h into this cherry pick (it is in
opal/include/opal_stdatomic.h on master, but that file does not exist
here on the v4.0.x branch).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a16cf0e)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres jsquyres force-pushed the pr/v4.0.x/ob1-fixes branch from 97bb75c to 48f8243 Compare May 3, 2019 13:21
Copy link
Member

@kawashima-fj kawashima-fj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I confirmed it actually fixes the leak.

@bosilca You added WIP-DNM label in #6633 to add one more fix. I think you should add it to this PR too. Right?

@gpaulsen
Copy link
Member

gpaulsen commented May 17, 2019

Lets discuss on Tuesday. Perhaps we should pull this PR, and then do a separate PR for #6633 content later.

@gpaulsen
Copy link
Member

@hppritcha We didn't get to this PR in today's web-ex. Lets bump discussion to next week.

@hppritcha
Copy link
Member

hppritcha commented Jun 25, 2019

at 6/25/19 devel meeting we decided this can be merged in but it doesn't fix issues with the put protocol pipelining (it fixes a different OB1 PUT protocol issue).

@hppritcha hppritcha added the NEWS label Jun 26, 2019
@hppritcha hppritcha merged commit 6424857 into open-mpi:v4.0.x Jun 26, 2019
@jsquyres jsquyres deleted the pr/v4.0.x/ob1-fixes branch September 25, 2019 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants