This repository was archived by the owner on Sep 30, 2022. It is now read-only.

@hjelmn hjelmn commented Oct 22, 2014

This commit brings over two commits:

open-mpi/ompi@eed7b45

osc/rdma: fix issue identified by Berk Hess

osc/rdma uses counters to determine whether all messages have been received
before exiting synchronization calls. The problem is that the active
target counter only ever increases (it is never zeroed). If more than 2^31-1
messages are sent, the counter overflows (in itself this is not an
error), which causes test/wait to return before the communication
is complete. There is an additional error in the use of the fragment
flush function: if PSCW (post/start/complete/wait) synchronization is in
use, this function must not be called unless a post message has arrived.
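
To make the failure mode concrete, here is a minimal sketch of how a signed, ever-increasing counter can make a completion check pass too early. It is not the osc/rdma code: the function name `wait_would_return` and the `received >= expected` comparison are assumptions chosen only to illustrate the symptom described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion check (NOT the actual osc/rdma logic):
 * both counters are cumulative and signed, and the wait is considered
 * done once the received count has reached the expected count. */
static bool wait_would_return(int32_t received, int32_t expected)
{
    return received >= expected;
}

/* Example values, assuming two's-complement wraparound of the counter:
 *   received = INT32_MAX - 1   (no message of this epoch has arrived yet)
 *   expected = INT32_MIN + 1   (what INT32_MAX - 1 plus 3 wraps to)
 * wait_would_return(received, expected) is true although three
 * messages are still outstanding. */
```

With small counter values the same check behaves as intended, which is why a problem of this kind only shows up after a very large number of messages.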

Relevant mailing list thread: http://www.open-mpi.org/community/lists/devel/2014/10/16016.php

This commit fixes both issues. Tested against MTT and issue reproducer.

open-mpi/ompi@23dd3af

osc/rdma: use unsigned types for all counters

Some of the counters used by the "rdma" one-sided component are intended
to overflow. Since overflow behavior is undefined for signed integers in
C, whereas unsigned arithmetic wraps, it is safer to use unsigned integers here.
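
As a rough illustration of why unsigned counters tolerate overflow, the sketch below relies on the fact that unsigned arithmetic in C is performed modulo 2^N. The helper names `counter_add` and `outstanding` are hypothetical and do not come from the Open MPI source.

```c
#include <stdint.h>

/* Advance a cumulative counter by n; unsigned addition wraps modulo 2^32,
 * so this is well defined even when the counter passes UINT32_MAX. */
static uint32_t counter_add(uint32_t counter, uint32_t n)
{
    return (uint32_t)(counter + n);
}

/* Number of messages still outstanding. Unsigned subtraction is also
 * modulo 2^32, so the result stays correct across a wraparound as long
 * as fewer than 2^32 messages are outstanding at once. */
static uint32_t outstanding(uint32_t received, uint32_t expected)
{
    return (uint32_t)(expected - received);
}
```

For example, starting from `received = UINT32_MAX - 1` and expecting three more messages, `counter_add` wraps the expected count to 1, and `outstanding` still reports 3.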

hjelmn commented Oct 22, 2014

This replaces pull request #13. In addition to the original commit, this addresses the issue identified by @goodell.

@jsquyres jsquyres added this to the v1.8.4 milestone Oct 23, 2014
@jsquyres jsquyres added the bug label Oct 23, 2014
@jsquyres

Does it need a new review from @goodell?

goodell commented Oct 23, 2014

Looks OK to me, though I would prefer that if you are bringing over two commits, just bring over two commits. There's no reason to squash them into a single commit now that we have Git. That's the role of the merge commit that will be created when Ralph approves the PR.

osc/rdma: fix issue identified by Berk Hess
(cherry picked from commit open-mpi/ompi@eed7b45)

osc/rdma: use unsigned types for all counters
(cherry picked from commit open-mpi/ompi@23dd3af)

hjelmn commented Oct 24, 2014

@goodell Redid the branch with cherry-picks.

goodell commented Oct 24, 2014

Looks OK to me.

rhc54 pushed a commit that referenced this pull request Oct 24, 2014
osc/rdma: bring over one-sided fixes from master
@rhc54 rhc54 merged commit 57ec207 into open-mpi:v1.8 Oct 24, 2014
mike-dubman added a commit to mike-dubman/ompi-release that referenced this pull request Mar 5, 2015