Skip to content

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Jul 12, 2018

This commit updates the atomic fifo code to fix a consistency issue
observed on Power9 systems when builtin atomics are used. The cause
was two things: 1) a missing write memory barrier in fifo push, and 2)
a read ordering issue when reading the fifo head non-atomically. This
commit fixes both issues and appears to correct then inconsistency.

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov
(cherry picked from commit 8b09010)
Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

This commit updates the atomic fifo code to fix a consistency issue
observed on Power9 systems when builtin atomics are used. The cause
was two things: 1) a missing write memory barrier in fifo push, and 2)
a read ordering issue when reading the fifo head non-atomically. This
commit fixes both issues and appears to correct then inconsistency.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 8b09010)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn hjelmn added this to the v3.0.3 milestone Jul 12, 2018
@hjelmn hjelmn requested a review from gpaulsen July 12, 2018 16:47
@jsquyres jsquyres changed the title opal/fifo: fix 128-bit atomic fifo on Power9 v3.0.x: opal/fifo: fix 128-bit atomic fifo on Power9 Jul 12, 2018
Copy link
Member

@gpaulsen gpaulsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjelmn, one spot looks iffy (since there's an assignment to old_head.data.item (inside of opal_read_counted_pointer) followed by an asignment to a local variable item OF that value. If those two instructions happened in opposite order, we might have problems. Is the compiler free to re-arrange instructions before or after (or both) after inlining functions?

old_head.data.counter = lifo->opal_lifo_head.data.counter;
opal_atomic_rmb ();
old_head.data.item = item = (opal_list_item_t*)lifo->opal_lifo_head.data.item;
opal_read_counted_pointer (&lifo->opal_lifo_head, &old_head);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need an opal_atomic_rmb() here after the opal_read_counted_pointer(), or is there a read memory barrier even if that function gets inlined?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ok since there is a rmb in opal_read_counted_pointer and old_head.data.item is a data dependency for the next line. Shouldn't be re-ordered by the compiler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for taking a second look.
looks good to me.

@gpaulsen
Copy link
Member

I tested and verified this fixes the issue on Power9.

@bwbarrett bwbarrett merged commit 8be3c8c into open-mpi:v3.0.x Jul 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants