Releasing the list items when list destructor is called #1004
Conversation

For cross reference: the prior PR was #944.

@bosilca How do you feel about this one?

bot:retest

This is indeed what was discussed. What is not clear here is the status of the old ticket, and how these two will integrate once merged into master.

@rppendya closed the original ticket, and so this is the complete replacement. We'll then need to gradually replace all the LIST_RELEASE calls with simple OBJ_RELEASE, but we figured that could occur over time. Meanwhile, the places where we "must have" this functionality can immediately use it, and the rest can be converted as we have time. Make sense?
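
[Editor's note: a minimal sketch of the consumer-side migration described above, assuming an embedded opal_list_t whose items were allocated with OBJ_NEW; my_list and the surrounding flow are illustrative, not taken from this PR.]

    /* Old idiom: callers drain the list by hand before destroying it. */
    opal_list_item_t *item;
    while (NULL != (item = opal_list_remove_first(&my_list))) {
        OBJ_RELEASE(item);      /* drop the list's reference on each item */
    }
    OBJ_DESTRUCT(&my_list);

    /* New idiom: once the destructor releases leftover items itself,
     * a single destruct call is enough. */
    OBJ_DESTRUCT(&my_list);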

I missed the fact that the old ticket had been closed. The approach makes perfect sense; thanks for clarifying.

@jsquyres I saw an error in Mellanox yesterday. I want to debug it now, and I am receiving a page-not-found error when I click the details button. Do I need to re-run the tests?

bot:retest

@rppendya Yes, the Mellanox Jenkins results go stale after a while. As @miked-mellanox did, you can issue the bot:retest command to have them re-run.

I can't find what exactly is wrong, but here are my thoughts. The following is the Mellanox error:

08:33:41 + taskset -c 2,3 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca spml yoda -mca pml ob1 -mca btl self,tcp /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem

The error occurs when OBJ_RELEASE() is called on each list item: opal_list_remove_first() returns an opal_list_item_t, and OBJ_RELEASE is called on that item. I can't find how hello_oshmem is related to a list. Any help here would be appreciated.
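
[Editor's note: for readers following along, a hedged sketch of the mechanism being debugged. With this PR, destroying a list walks and releases whatever is still on it, so a SIGSEGV here points at an item the list should not have owned. The function body below is illustrative of the approach, not copied from the patch.]

    static void opal_list_destruct(opal_list_t *list)
    {
        opal_list_item_t *item;
        /* Release every item still attached to the list; this crashes
         * if an item was already freed elsewhere or was never
         * heap-allocated in the first place. */
        while (NULL != (item = opal_list_remove_first(list))) {
            OBJ_RELEASE(item);
        }
    }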

I dug it out. The "hello_oshmem" is the application, not the object in question. It turns out this was caused by poorly written code in the oshmem area. Adding the following to your patch fixes the problem:

diff --git a/oshmem/mca/memheap/base/memheap_base_mkey.c b/oshmem/mca/memheap/base/memheap_base_mkey.c
index 3a6b544..d23764c 100644
--- a/oshmem/mca/memheap/base/memheap_base_mkey.c
+++ b/oshmem/mca/memheap/base/memheap_base_mkey.c
@@ -481,6 +481,7 @@ void memheap_oob_destruct(void)
         PMPI_Request_free(&r->recv_req);
     }
+    while (NULL != opal_list_remove_first(&memheap_oob.req_list)); // clear the list as these objects don't belong to us
     OBJ_DESTRUCT(&memheap_oob.req_list);
     OBJ_DESTRUCT(&memheap_oob.lck);
     OBJ_DESTRUCT(&memheap_oob.cond);
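
[Editor's note: a hedged sketch of why the drain fixes it. The requests on req_list live in storage the list does not own, so the new destructor must never OBJ_RELEASE them. The static pool below is hypothetical, invented only to illustrate "these objects don't belong to us".]

    static opal_list_item_t req_pool[16];   /* hypothetical static pool */

    OBJ_CONSTRUCT(&req_pool[i], opal_list_item_t);
    opal_list_append(&memheap_oob.req_list, &req_pool[i]);

    /* Without the drain, the new OBJ_DESTRUCT(&memheap_oob.req_list)
     * would OBJ_RELEASE each pool entry, eventually calling free() on
     * memory that was never malloc'd, which is the observed SIGSEGV.
     * Emptying the list first leaves the destructor nothing to release. */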

One suggestion for that patch: please spell out the empty loop body instead of using a trailing semicolon:

while (NULL != opal_list_remove_first(&memheap_oob.req_list)) {
    continue;
}
…tems caused a bug in the oshmem application. Fixing the bug with this patch

Looks like this just hung in the tests.

bot:retest
Releasing the list items when list destructor is called

Thanks Ralph, Jeff, and everyone else for patiently reviewing the pull request.

@rppendya Thank you for the fix!
As per the previous review comments from kewl and Ralph, the list destructor API will be changed to release the list items when the list destructor is called.
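
[Editor's note: for context, a hedged sketch of where that change lands in OPAL's class system. A class's destructor is bound once at declaration time, so every OBJ_RELEASE/OBJ_DESTRUCT of a list picks up the new behavior automatically. OBJ_CLASS_INSTANCE is the real macro; the exact parent class and function names are assumptions here.]

    OBJ_CLASS_INSTANCE(opal_list_t,          /* class being declared */
                       opal_object_t,        /* parent class (assumed) */
                       opal_list_construct,  /* constructor */
                       opal_list_destruct);  /* destructor: now drains and
                                              * releases any items left on
                                              * the list */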