-
Notifications
You must be signed in to change notification settings - Fork 860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vader/xpmem: fix finalize #6534
Conversation
- check the incremented reference count is > 1 - remove an extra semi column Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
perform various tests to make sure the object is valid Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
make sure mca_btl_vader_endpoint_xpmem_rcache_cleanup() is passed a valid reg object. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Refs. open-mpi#6524 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@hjelmn can you please have a look at this ? the first three commits help evidence the issue reported in #6524 |
I think we found the root cause for the bug. Will double-check and see if the patch can be pushed this week. |
Can one of the admins verify this patch? |
@hjelmn @ggouaillardet Hello. Do either of you have an update on this PR? Thanks. |
Is issue of [#7030] same problem? |
I think that correction #7030 is valid for this problem. if reg to be released is which has condition However, reg is append to the list just only one time by following process, Reg is freed just once in the OPAL_LIST_FOREACH_SAFE loop at mca_btl_vader_xpmem_cleanup_endpoint layer.
|
some bugs may be still.
|
@hjelmn Ping. |
Assertion error of Magic id is reproduced in the return_registration called from the MPI_Reduce_scatter.
|
I need to take a closer look at the issue to determine what the appropriate fix is. Can do that this week. |
SIGSEGV is reproduced too.
|
Something may be destroied after xpmem_detach,OBJ_RELEASE (reg) in the vader_return_registration?
|
There is a case that reg->rcache_context value is NULL at end of the vader_get_registation.
|
Planning to look at this today. |
Still looking. Would be easier if this reproduced in a small VM. |
Hah. I see the issue. Do not know how that bug went undetected for as long as it did. Fix incoming. |
@hppritcha Long story short. opal_interval_tree_t does not like it when multiple nodes have the same high and low values (which increases in likelihood with increased ppn). This is a design choice that is ok for the rcache usage but not the vader usage. I am working on a fix. There are additional minor bugs in vader but the opal_interval_tree_t one is what is triggering the crash at finalize. Should have the PRs up today or tomorrow. |
This commit fixes an issue discovered in the XPMEM registration cache. It was possible for a registration to be invalidated by multiple threads leading to a double-free situation or re-use of an invalidated registration. This commit fixes the issue by setting the INVALID flag on a registation when it will be deleted. The flag is set while iterating over the tree to take advantage of the fact that a registration can not be removed from the VMA tree by a thread while another thread is traversing the VMA tree. References open-mpi#6524 References open-mpi#7030 Closes open-mpi#6534 Signed-off-by: Nathan Hjelm <hjelmn@google.com>
This commit fixes an issue discovered in the XPMEM registration cache. It was possible for a registration to be invalidated by multiple threads leading to a double-free situation or re-use of an invalidated registration. This commit fixes the issue by setting the INVALID flag on a registation when it will be deleted. The flag is set while iterating over the tree to take advantage of the fact that a registration can not be removed from the VMA tree by a thread while another thread is traversing the VMA tree. References open-mpi#6524 References open-mpi#7030 Closes open-mpi#6534 Signed-off-by: Nathan Hjelm <hjelmn@google.com> (cherry picked from commit f86f805)
This commit fixes an issue discovered in the XPMEM registration cache. It was possible for a registration to be invalidated by multiple threads leading to a double-free situation or re-use of an invalidated registration. This commit fixes the issue by setting the INVALID flag on a registation when it will be deleted. The flag is set while iterating over the tree to take advantage of the fact that a registration can not be removed from the VMA tree by a thread while another thread is traversing the VMA tree. References open-mpi#6524 References open-mpi#7030 Closes open-mpi#6534 Signed-off-by: Nathan Hjelm <hjelmn@google.com> (cherry picked from commit f86f805)
No description provided.