opal/fifo: use atomics to set fifo head in opal_fifo_push #1032
This commit changes the opal_fifo_push code to use opal_update_counted_pointer to set the head. This fixes a data race that likely occurs because the read of the fifo head in opal_fifo_pop requires two instructions. This, combined with the non-atomic update in opal_fifo_push, can lead to an ABA issue that puts the fifo in an inconsistent state.

There are other ways this problem could be fixed. One would be to introduce an opal_atomic_read_128 implementation. On x86_64 this would have to use the cmpxchg16b instruction. Since that instruction would sit in the pop path (and always be executed), it would be slower than the fix in this commit.

Closes open-mpi/ompi#1460.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from open-mpi/ompi@dc00021)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
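For readers unfamiliar with counted pointers, here is a minimal C11 sketch of the idea behind updating the head with a compare-and-swap instead of a plain store. It is not the OPAL implementation: OPAL pairs a full pointer with a counter and needs a 128-bit compare-and-swap on 64-bit platforms, whereas this sketch assumes a 32-bit item index and a 32-bit counter packed into one 64-bit word so that it builds with plain `<stdatomic.h>`. All names (`counted_index_t`, `pack`, `update_counted_index`, `set_counted_index`, `aba_update_is_rejected`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counted "pointer": update counter in the high 32 bits,
 * item index in the low 32 bits, so one CAS covers both halves. */
typedef _Atomic uint64_t counted_index_t;

static inline uint64_t pack(uint32_t counter, uint32_t index) {
    return ((uint64_t) counter << 32) | index;
}
static inline uint32_t counter_of(uint64_t v) { return (uint32_t) (v >> 32); }
static inline uint32_t index_of(uint64_t v) { return (uint32_t) v; }

/* Install new_index only if the whole word still equals `expected`.
 * The counter is bumped on every successful update. */
static inline bool update_counted_index(counted_index_t *addr,
                                        uint64_t expected,
                                        uint32_t new_index) {
    uint64_t desired = pack(counter_of(expected) + 1, new_index);
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

/* Unconditional set, written as a CAS retry loop rather than a plain
 * store so that the counter always advances together with the index. */
static inline void set_counted_index(counted_index_t *addr,
                                     uint32_t new_index) {
    uint64_t old = atomic_load(addr);
    while (!update_counted_index(addr, old, new_index)) {
        old = atomic_load(addr);
    }
}

/* Single-threaded replay of an ABA interleaving: a reader snapshots the
 * head, the head changes and then returns to the same index, and the
 * reader's stale update must fail because the counter has moved on. */
bool aba_update_is_rejected(void) {
    counted_index_t head;
    atomic_init(&head, pack(0, 7));

    uint64_t stale = atomic_load(&head);  /* reader sees index 7 */
    set_counted_index(&head, 9);          /* head changes ... */
    set_counted_index(&head, 7);          /* ... and index 7 comes back */
    assert(index_of(atomic_load(&head)) == 7);

    return !update_counted_index(&head, stale, 3);
}
```

In this sketch `aba_update_is_rejected()` returns `true`: the index matches the stale snapshot, but the counter does not, so the compare-and-swap fails. Had the two intermediate sets been plain stores of the index alone, the counter would not have advanced and the stale update would have gone through.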
|
Targeting this for 2.0.1, as the fifo is not critical at this time. The change is small enough to slip into 2.0.0 if desired. |
|
Test PASSed. |
|
Re-assigning to @adrianreber -- I'm not qualified to check it, and he was the one able to reproduce the error. |
|
I was able to get a segfault after running the test 77 times on the 2.x branch (openmpi-v2.x-dev-1201-g7c91990). With the patch applied the test is now running 660 times without a segfault. Seems to be fixed. |
|
@adrianreber Can you give this a +1? |
|
Sure, 👍 |
|
@hppritcha Good to go. |
|
This has happened again on x86_64 on the v2.x branch in MTT over the last two days. |
|
@hjelmn Does this cause a re-evaluation as to whether this belongs in v2.0.0 vs. v2.0.1? |
|
Might as well. There is no harm in fixing the fifo as part of 2.0.0. |
|
@hppritcha I'm ok with moving this back to v2.0.0. Are you? |
|
Yes, I'm okay with that. |
:bot:assign: @jsquyres
:bot:label:bug
:bot:milestone:v2.0.1