Skip to content

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Mar 17, 2016

This commit changes the opal_fifo_push code to use
opal_update_counted_pointer to set the head. This fixes a data race
that occurs because the read of the fifo head in opal_fifo_pop
requires two instructions. This combined with the non-atomic update in
opal_fifo_push can lead to an ABA issue that puts the fifo in an
inconsistant state.

There are other ways this problem could be fixed. One way would be to
introduce an opal_atomic_read_128 implementation. On x86_64 this would
have to use the cmpxchg16b instruction. Since this instruction would
have to be in the pop path (and always executed) it would be slower
than the fix in this commit.

Closes #1460.

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

This commit changes the opal_fifo_push code to use
opal_update_counted_pointer to set the head. This fixes a data race
that occurs because the read of the fifo head in opal_fifo_pop
requires two instructions. This combined with the non-atomic update in
opal_fifo_push can lead to an ABA issue that puts the fifo in an
inconsistant state.

There are other ways this problem could be fixed. One way would be to
introduce an opal_atomic_read_128 implementation. On x86_64 this would
have to use the cmpxchg16b instruction. Since this instruction would
have to be in the pop path (and always executed) it would be slower
than the fix in this commit.

Closes open-mpi#1460.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@adrianreber
Copy link
Member

I tried to run the old opal_fifo in a loop and it segfaulted during the 8th run. With @hjelmn's patch applied it has now ran over 140 times without a segfault. I keep it running for a few more hours

@hjelmn
Copy link
Member Author

hjelmn commented Mar 17, 2016

Mellanox jenkins failure looks unrelated.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 17, 2016

:bot:retest:

hjelmn added a commit that referenced this pull request Mar 18, 2016
opal/fifo: use atomics to set fifo head in opal_fifo_push
@hjelmn hjelmn merged commit c749d7d into open-mpi:master Mar 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

opal_fifo segfaults on x86_64

2 participants