@hjelmn hjelmn commented Sep 22, 2015

The osc/sm component was using a simple counter to determine if all
expected posts had arrived to start a PSCW access epoch. This is
incorrect as a post may arrive from a peer that isn't part of the
current start group. There are many ways this could have been fixed.
This commit adds an n^2 bitmap. When a process posts it sets a bit in
the bitmap associated with the access rank to indicate the post is
complete. The access rank checks for and clears the bits associated
with all the processes in the start group.

The bitmap requires comm_size ^ 2 bits of space. This should be
manageable as most nodes have relatively small numbers of processes. If
this changes, another algorithm can be implemented.
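
The post/start bookkeeping described above can be sketched roughly as follows. This is a minimal, single-process illustration using C11 atomics, not the code from this commit: the type `post_bitmap_t` and the functions `post_bitmap_init`, `post_bitmap_post`, `post_bitmap_try_start`, and `post_bitmap_free` are hypothetical names, the bitmap lives on the heap rather than in the shared-memory segment osc/sm would use, and the start side is written as a non-blocking "try" instead of a wait loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: one row per access rank, one bit per posting rank,
 * so the whole bitmap holds comm_size * comm_size bits (rounded up per row). */
typedef struct {
    size_t comm_size;
    size_t words_per_row;
    _Atomic uint64_t *bits;
} post_bitmap_t;

static int post_bitmap_init(post_bitmap_t *bm, size_t comm_size)
{
    size_t words;

    bm->comm_size = comm_size;
    bm->words_per_row = (comm_size + 63) / 64;
    words = comm_size * bm->words_per_row;
    bm->bits = malloc(words * sizeof(*bm->bits));
    if (NULL == bm->bits) {
        return -1;
    }
    for (size_t i = 0; i < words; ++i) {
        atomic_init(&bm->bits[i], 0);
    }
    return 0;
}

static void post_bitmap_free(post_bitmap_t *bm)
{
    free(bm->bits);
    bm->bits = NULL;
}

/* Post side: "poster" sets its own bit in the row owned by access_rank. */
static void post_bitmap_post(post_bitmap_t *bm, size_t access_rank, size_t poster)
{
    assert(access_rank < bm->comm_size && poster < bm->comm_size);
    _Atomic uint64_t *row = bm->bits + access_rank * bm->words_per_row;
    atomic_fetch_or(&row[poster / 64], UINT64_C(1) << (poster % 64));
}

/* Start side: succeed only if every rank in the start group has posted, then
 * clear exactly those bits. Bits set by ranks outside the group are left
 * untouched, which is what a plain counter cannot distinguish. */
static bool post_bitmap_try_start(post_bitmap_t *bm, size_t access_rank,
                                  const size_t *group, size_t group_size)
{
    assert(access_rank < bm->comm_size);
    _Atomic uint64_t *row = bm->bits + access_rank * bm->words_per_row;

    for (size_t i = 0; i < group_size; ++i) {
        assert(group[i] < bm->comm_size);
        uint64_t mask = UINT64_C(1) << (group[i] % 64);
        if (0 == (atomic_load(&row[group[i] / 64]) & mask)) {
            return false;
        }
    }
    for (size_t i = 0; i < group_size; ++i) {
        uint64_t mask = UINT64_C(1) << (group[i] % 64);
        atomic_fetch_and(&row[group[i] / 64], ~mask);
    }
    return true;
}
```

In this sketch only the access rank clears bits in its own row and posting ranks only set them, so the check-then-clear sequence does not need a lock. A real start would keep retrying until the call succeeds.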

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>


hjelmn commented Sep 22, 2015

Thanks to Steffen Christgau for catching this. I never tested the PSCW synchronization in osc/sm. The MTT tests all use the antiquated MPI_Win_create rather than MPI_Win_allocate, so they never cover the osc/sm path.

hjelmn commented Sep 22, 2015

Added commits to fix selection and fix osc/pt2pt pscw on 0 size groups.

hjelmn added a commit that referenced this pull request Sep 23, 2015
osc/sm: fix pscw synchronization
@hjelmn hjelmn merged commit b80cd56 into open-mpi:master Sep 23, 2015
@hjelmn hjelmn deleted the osc_sm_fix branch August 3, 2016 18:17
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
…-hcoll-coll_request-v1.10

BufFix for coll/hcoll: coll_request must be set to ACTIVE when alloced