Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Mar 18, 2016

This commit changes the opal_fifo_push code to use
opal_update_counted_pointer to set the head. This fixes a data race
that occurs possibly because the read of the fifo head in opal_fifo_pop
requires two instructions. This combined with the non-atomic update in
opal_fifo_push can lead to an ABA issue that puts the fifo in an
inconsistant state.

There are other ways this problem could be fixed. One way would be to
introduce an opal_atomic_read_128 implementation. On x86_64 this would
have to use the cmpxchg16b instruction. Since this instruction would
have to be in the pop path (and always executed) it would be slower
than the fix in this commit.

Closes open-mpi/ompi#1460.

:bot:assign: @jsquyres
:bot🏷️bug
:bot:milestone:v2.0.1

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

(cherry picked from open-mpi/ompi@dc00021)

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov

This commit changes the opal_fifo_push code to use
opal_update_counted_pointer to set the head. This fixes a data race
that occurs possibly because the read of the fifo head in opal_fifo_pop
requires two instructions. This combined with the non-atomic update in
opal_fifo_push can lead to an ABA issue that puts the fifo in an
inconsistant state.

There are other ways this problem could be fixed. One way would be to
introduce an opal_atomic_read_128 implementation. On x86_64 this would
have to use the cmpxchg16b instruction. Since this instruction would
have to be in the pop path (and always executed) it would be slower
than the fix in this commit.

Closes open-mpi/ompi#1460.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@dc00021)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@ompiteam-bot ompiteam-bot added this to the v2.0.1 milestone Mar 18, 2016
@hjelmn
Copy link
Member Author

hjelmn commented Mar 18, 2016

Putting this for 2.0.1 as the fifo is not critical at this time. The change is small enough to slip in for 2.0.0 if desired.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1451/ for details.

@jsquyres
Copy link
Member

Re-assigning to @adrianreber -- I'm not qualified to check it, and he was the one able to reproduce the error.

@jsquyres jsquyres assigned adrianreber and unassigned jsquyres Mar 19, 2016
@adrianreber
Copy link
Member

I was able to get a segfault after running the test 77 times on the 2.x branch (openmpi-v2.x-dev-1201-g7c91990). With the patch applied the test is now running 660 times without a segfault. Seems to be fixed.

@hjelmn
Copy link
Member Author

hjelmn commented Mar 25, 2016

@adrianreber Can you give this a +1?

@adrianreber
Copy link
Member

Sure, 👍

@jsquyres
Copy link
Member

@hppritcha Good to go.

@adrianreber
Copy link
Member

This has happened again on x86_64 on the v2x branch the last two days in MTT.

@jsquyres
Copy link
Member

@hjelmn Does this cause a re-evaluation as to whether this belongs in v2.0.0 vs. v2.0.1?

@hjelmn
Copy link
Member Author

hjelmn commented Apr 20, 2016

Might as well. There is no harm in fixing the fifo as part of 2.0.0.

@jsquyres
Copy link
Member

@hppritcha I'm ok with moving this back to v2.0.0. Are you?

@hppritcha
Copy link
Member

yes I"m okay with that.

@hppritcha hppritcha merged commit 49d060e into open-mpi:v2.x Apr 20, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

opal_fifo segfaults on x86_64

6 participants