@hjelmn
Member

hjelmn commented May 31, 2016

This commit fixes a bug in opal progress registration that can cause
crashes when a progress function is registered while another thread is
in opal_progress(). Before this commit, realloc was used to allocate
more space for progress functions, but a thread in opal_progress()
could read from the old array after realloc freed it and before the
array pointer was re-assigned on return from realloc. To prevent this
race, use malloc + memcpy to fill the new array and atomically swap
the old and new array pointers.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
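
For readers following along, the malloc + memcpy + atomic swap idea described above can be sketched roughly as follows. This is a minimal illustration using C11 atomics, not the code from this patch: the type `progress_cb_t`, the variables, and the helpers `register_cb`, `run_progress`, and `free_retired` are all hypothetical names, and the registering thread is assumed to hold a lock. Old arrays are parked on a retire list rather than freed right away, since a reader may still hold the previous pointer.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

typedef int (*progress_cb_t)(void);            /* hypothetical callback type */

/* Hypothetical registry state; these are not Open MPI's actual variables. */
static _Atomic(progress_cb_t *) callbacks = NULL;
static _Atomic size_t callbacks_len = 0;       /* slots in use */
static size_t callbacks_size = 0;              /* slots allocated (writer only) */

/* Replaced arrays wait here until no reader can still be using them.
 * Doubling from 8 overflows size_t long before 64 entries are retired. */
static progress_cb_t *retired[64];
static size_t n_retired = 0;

static int filler_cb(void) { return 0; }       /* pads unused slots */
static int example_cb(void) { return 1; }      /* sample callback: one event */

/* Register a callback. The caller is assumed to hold the registration lock. */
static int register_cb(progress_cb_t cb)
{
    size_t len = atomic_load(&callbacks_len);
    progress_cb_t *arr = atomic_load(&callbacks);

    if (len == callbacks_size) {
        size_t new_size = callbacks_size ? 2 * callbacks_size : 8;
        progress_cb_t *fresh = malloc(new_size * sizeof(*fresh));
        if (fresh == NULL) return -1;
        if (arr != NULL) memcpy(fresh, arr, len * sizeof(*fresh));
        for (size_t i = len; i < new_size; ++i) fresh[i] = filler_cb;

        /* Publish the new array in one step; readers see old or new, both valid. */
        progress_cb_t *old = atomic_exchange(&callbacks, fresh);
        callbacks_size = new_size;
        if (old != NULL) retired[n_retired++] = old;   /* defer the free */
        arr = fresh;
    }

    arr[len] = cb;                             /* fill the slot first ...   */
    atomic_store(&callbacks_len, len + 1);     /* ... then make it visible  */
    return 0;
}

/* Reader side: load the length before the pointer, so the array that is
 * loaded afterwards always has at least that many initialized slots. */
static int run_progress(void)
{
    size_t n = atomic_load(&callbacks_len);
    progress_cb_t *arr = atomic_load(&callbacks);
    int events = 0;
    for (size_t i = 0; i < n; ++i) events += arr[i]();
    return events;
}

/* Call only when no thread can be inside run_progress(), e.g. at shutdown. */
static void free_retired(void)
{
    while (n_retired > 0) free(retired[--n_retired]);
}
```

Deferring the free is one possible way to keep a reader from touching released memory; when it is actually safe to release the old array depends on how readers are tracked, which this sketch leaves to the caller.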

@hjelmn
Member Author

hjelmn commented May 31, 2016

@bosilca Fixes an occasional crash when running multi-threaded tests. Looking a lot better after this change.

@jsquyres
Member

@thananon Can you have a look, too?

@thananon
Member

@jsquyres I'm not familiar with this at all. This looks great to me but I will let people with more experience say that.

}

opal_atomic_lock(&progress_lock);
fprintf (stderr, "Registering callback %p\n", cb);
Member

This debugging statement should be removed.

Member Author

Yup. Removed in my local branch. Debugging the crash now.

@hjelmn
Member Author

hjelmn commented May 31, 2016

Looks like there is still a bug in this branch. Looking into it.

hjelmn force-pushed the progress_threading branch 2 times, most recently from d1fa482 to ffaa2ec, June 1, 2016 22:37

@bosilca
Member

bosilca commented Jun 2, 2016

@hjelmn I took a quick look at this patch and I don't see how you removed the potential bug. opal_progress continues to use the callbacks array while the other thread might have freed it (after replacing it with the newly allocated array). I think you have to declare the two callback arrays as volatile to solve at least part of the issue. Also, I would double the size of the previous allocation instead of just adding 4 (and start with a default size of 8).
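
To illustrate the sizing suggestion, here is a rough comparison of the two growth policies. It is a sketch only: the function names are hypothetical, and the add-4 policy is assumed to start from an empty array, which may not match the patch exactly.

```c
#include <stddef.h>

/* Policy from the earlier revision as described above: add 4 slots each time
 * (starting from an empty array is an assumption of this sketch). */
static size_t grow_by_four(size_t cap) { return cap + 4; }

/* Suggested policy: start with 8 slots, then double. */
static size_t grow_double(size_t cap) { return cap == 0 ? 8 : cap * 2; }

/* Count how many allocations are needed before n callbacks fit. */
static size_t allocations_needed(size_t n, size_t (*grow)(size_t))
{
    size_t cap = 0, allocs = 0;
    while (cap < n) {
        cap = grow(cap);
        ++allocs;
    }
    return allocs;
}
```

For 64 callbacks, `allocations_needed(64, grow_by_four)` gives 16 while `allocations_needed(64, grow_double)` gives 4. Since every growth step also copies the array and replaces the pointer that other threads read, fewer steps means fewer windows in which a reader can hold a stale array.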

@hjelmn
Member Author

hjelmn commented Jun 2, 2016

Makes sense. I will make those changes.

@hjelmn
Member Author

hjelmn commented Jun 2, 2016

@bosilca It did help in my case, but that was not an optimized build. volatile will help when optimized.

hjelmn force-pushed the progress_threading branch from ffaa2ec to 4bd5b70, June 2, 2016 02:21

@hjelmn
Member Author

hjelmn commented Jun 2, 2016

Fixed. Found a typo in there as well. Will merge once Jenkins finishes.

This commit fixes a bug in opal progress registration that can cause
crashes when a progress function is registered while another thread is
in opal_progress(). Before this commit, realloc was used to allocate
more space for progress functions, but a thread in opal_progress()
could read from the old array after realloc freed it and before the
array pointer was re-assigned on return from realloc. To prevent this
race, use malloc + memcpy to fill the new array and atomically swap
the old and new array pointers.

Per suggestion, we now allocate a default of 8 slots for callbacks and
double the current number when we run out of space.

This commit also fixes leaking the callbacks_lp array.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

hjelmn force-pushed the progress_threading branch from 4bd5b70 to 2fad3b9, June 2, 2016 02:57

hjelmn merged commit fc26d9c into open-mpi:master, Jun 2, 2016