-
Notifications
You must be signed in to change notification settings - Fork 931
Add support for recursive locks (revisited) #827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This is the backtrace that hinted we may actually have a use for recursive locks. |
|
A better fix will be to not use the malloc hooks for internal allocations. This may take some time to implement so this fix is being submitted as a stop-gap until that is ready. |
|
@bosilca Thoughts? |
|
I should also note that the second patch in this PR does indeed fix the hang with the openib btl and thread multiple. |
There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi#827 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
|
We now have a unique, but valid, use case for recursive mutexes. This condones their addition to OMPI code base. However, I wonder if the current lock/unlock in the mpool is as fine grain as we would expect. Why is the mpool using the rcache lock to protect it's own internal initialization? |
|
@bosilca The lock protects the rcache. One function ( |
|
It might be more correct if you put an #if OPAL_ENABLE_DEBUG around the three debug variables, m_lock_debug, m_lock_file, m_lock_line. |
|
@rolfv Ah, yes. Will fix that. |
|
@hjelmn that mutex protects more than just the rcache, as many of the mpool operations are also hidden behind. If there was a decision to share the rcache mutex then this was begging for the current issue to happen, as the mpool is indeed capable to recurse in both single-threaded and multi-threaded cases (due to the fact the allocating memory might result in a call to releasing to the OS some of the old mmaped memory arenas). Your patch (adding recursive mutexes) is correct as long as we assume that the libc unmap is called in the same thread as where the malloc was called. |
895a5c6 to
5999fdf
Compare
|
@rolfv fixed the optimized build |
|
@bosilca I will merge this fix and PR to v2.x. When I have a chance I will evaluate the mpool/rcache locking and see what can be done to remove the cause of the recursive locking. |
This new class is the same as the opal_mutex_t class but has a different constructor. This constructor adds the recursive flag to the mutex attributes for the lock. This class can be used where there may be re-enty into the lock from within the same thread. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There is currently a path through the grdma mpool and vma rcache that leads to deadlock. It happens during the rcache insert. Before the insert the rcache mutex is locked. During the call a new vma item is allocated and then inserted into the rcache tree. The allocation currently goes through the malloc hooks which may (and does) call back into the mpool if the ptmalloc heap needs to be reallocated. This callback tries to lock the rcache mutex which leads to the deadlock. This has been observed with multi-threaded tests and the openib btl. This change may lead to some minor slowdown in the rcache vma when threading is enabled. This will only affect larger message paths in some of the btls. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
5999fdf to
1d56007
Compare
|
@hjelmn the only viable solution is to avoid allocating any memory in the code protected by the mpool (in fact rcache) lock. By looking at the code it looks difficult to not to do so if we leave the rcache_new (and therefore the OBJ_NEW) deep inside the rcache_insert. However, if we split the rcahce usage in 2: allocation and insertion, then in the multi-threaded case we would need 1) special protections inside the rcache for the rcache objects manipulation and 2) special protection to avoid inserting the same element twice (with a sequence of find + new + insert from multiple threads). |
|
@bosilca Thanks, I will take a look at implementing a solution next month. |
Add support for recursive locks (revisited)
There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. Fixed in open-mpi/ompi@c101385. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi/ompi#827 master commit open-mpi/ompi@64e4419 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi#827 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi#827 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. Fixed in open-mpi/ompi@c101385. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi/ompi#827 master commit open-mpi/ompi@64e4419 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
…elease PML UCX: fix typo (following 7becc54).
No description provided.