-
Notifications
You must be signed in to change notification settings - Fork 929
opal/progress: make progress function registration mt safe #1734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@bosilca Fixes an occasional crash when running multi-threaded tests. Looking a lot better after this change. |
|
@thananon Can you have a look, too? |
|
@jsquyres I'm not familiar with this at all. This looks great to me but I will let people with more experience say that. |
opal/runtime/opal_progress.c
Outdated
| } | ||
|
|
||
| opal_atomic_lock(&progress_lock); | ||
| fprintf (stderr, "Registering callback %p\n", cb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This debugging statement should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Removed in my local branch. Debugging the crash now.
|
Looks like there is still a bug in this branch. Looking into it. |
d1fa482 to
ffaa2ec
Compare
|
@hjelmn I took a quick look at this patch and I don't see how you removed the potential bug. The opal_progress continues to use the callbacks array while the other might have free it (after replacing it with the newly allocated array). I think you have to declare the 2 callback arrays as volatile to solve at least part of the issue. Also, I would just double the size from the previous allocation instead of just adding 4 (and start with a default size of 8). |
|
Makes sense. I will make those changes. |
|
@bosilca It did help in my case but it was not an optimized build. volatile will help when optimized. |
ffaa2ec to
4bd5b70
Compare
|
Fixed. Found a typo in there as well. Will merge once jenkins finishes. |
This commit fixes a bug in opal progress registration that can cause crashes when a progress function is registered while another thread is in opal_progress(). Before this commit realloc is used to allocate more space for progress functions but it is possible for a thread in opal_progress() to try to read from the array that is freed by realloc before the array is re-assigned when realloc returns. To prevent this race use malloc + memcpy to fill the new array and atomically swap out the old and new array pointers. Per suggestion we now allocate a default of 8 slots for callbacks and double the current number when we run out of space. This commit also fixes leaking the callbacks_lp array. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
4bd5b70 to
2fad3b9
Compare
This commit fixes a bug in opal progress registration that can cause
crashes when a progress function is registered while another thread is
in opal_progress(). Before this commit realloc is used to allocate
more space for progress functions but it is possible for a thread in
opal_progress() to try to read from the array that is freed by realloc
before the array is re-assigned when realloc returns. To prevent this
race use malloc + memcpy to fill the new array and atomically swap out
the old and new array pointers.
Signed-off-by: Nathan Hjelm hjelmn@lanl.gov