Skip to content

Conversation

@rppendya
Copy link
Contributor

@rppendya rppendya commented Oct 9, 2015

As per the previous review comments from kewl and Ralph, list release destructor API will be changed to release list items when list destructor is called

@jsquyres
Copy link
Member

jsquyres commented Oct 9, 2015

For cross reference: the prior PR was #944.

@jsquyres
Copy link
Member

@bosilca How do you feel about this one?

@jsquyres
Copy link
Member

bot:retest

@bosilca
Copy link
Member

bosilca commented Oct 13, 2015

This is indeed what was discussed. What is not clear here is what's the status of the old ticket, and how these 2 will integrate once merged into the master?

@rhc54
Copy link
Contributor

rhc54 commented Oct 13, 2015

@rppendya closed the original ticket, and so this is the complete replacement. We'll then need to gradually replace all the LIST_RELEASE calls with simple OBJ_RELEASE, but we figured that could occur over time. Meanwhile, the places where we "must have" this functionality can immediately use it, and the rest can be converted as we have time.

Make sense?

@bosilca
Copy link
Member

bosilca commented Oct 13, 2015

I missed the fact that the old ticket has been closed. The approach makes perfect sense, thanks for clarifying.

@jsquyres
Copy link
Member

@rppendya @rhc54 There was a failure in the Mellanox Jenkins, and it looks like it could be a real failure (i.e., due to this PR). Can you check into it?

@rppendya
Copy link
Contributor Author

@jsquyres I saw an error in Mellanox yesterday. I want to debug it now and I am receiving page not found error when I click the details button. Do I need to re-run the tests?

@mike-dubman
Copy link
Member

bot:retest

@jsquyres
Copy link
Member

@rppendya Yes, the Mellanox Jenkins results go stale after a while. As @miked-mellanox did, you can issue the bot retest command and have them re-run.

@rppendya
Copy link
Contributor Author

I can't find what exactly is wrong. But here are my thoughts.

Following is the Mellanox error.

08:33:41 + taskset -c 2,3 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca spml yoda -mca pml ob1 -mca btl self,tcp /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem
08:33:43 Hello, world, I am 4 of 8
08:33:43 Hello, world, I am 6 of 8
08:33:43 Hello, world, I am 1 of 8
08:33:43 Hello, world, I am 3 of 8
08:33:43 Hello, world, I am 5 of 8
08:33:43 Hello, world, I am 7 of 8
08:33:43 Hello, world, I am 0 of 8
08:33:43 Hello, world, I am 2 of 8
08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion ((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed. 08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed.
08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion ((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed. 08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed.
08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion ((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed. 08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed.
08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion ((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed. 08:33:44 hello_oshmem: class/opal_list.c:108: opal_list_destruct: Assertion((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (it))->obj_magic_id' failed.

The error is occurring when the OBJ_RELEASE() is being called in each list item. opal_list_remove_first() returns opal_list_item_t. and OBJ_RELEASE is called on a list item
I did a little digging and found that magic id is a unique number assigned to every object during construct. It is set to zero when obj_release or obj_destruct is being called. So, I believe the above error happened because somewhere else a release is being called on these objects and these objects aren't removed from the list

I can't find how the hello_oshmem is being called from a list. Any help here would be appreciated

@rhc54
Copy link
Contributor

rhc54 commented Oct 20, 2015

I dug it out. The "hello_oshmem" is the application, not the object in question. Turns out this was caused by a poorly written code in the oshmem area. Adding the following to your patch fixes the problem:

diff --git a/oshmem/mca/memheap/base/memheap_base_mkey.c b/oshmem/mca/memheap/base/memheap_base_mkey.c
index 3a6b544..d23764c 100644
--- a/oshmem/mca/memheap/base/memheap_base_mkey.c
+++ b/oshmem/mca/memheap/base/memheap_base_mkey.c
@@ -481,6 +481,7 @@ void memheap_oob_destruct(void)
         PMPI_Request_free(&r->recv_req);
     }

+    while (NULL != opal_list_remove_first(&memheap_oob.req_list));  // clear the list as these objects don't belong to us
     OBJ_DESTRUCT(&memheap_oob.req_list);
     OBJ_DESTRUCT(&memheap_oob.lck);
     OBJ_DESTRUCT(&memheap_oob.cond);

@jsquyres
Copy link
Member

One suggestion to that patch: please add a { continue; } to that block to fit our defensive programming style. I.e., something like:

while (NULL != opal_list_remove_first(&memheap_oob.req_list)) {
    continue;
}

…tems

  caused a bug in oshmem application. Fixing the bug with this patch
@rhc54
Copy link
Contributor

rhc54 commented Oct 21, 2015

Looks like this just hung in the tests

bot:retest

rhc54 pushed a commit that referenced this pull request Oct 21, 2015
  Releasing the list items when list destructor is called
@rhc54 rhc54 merged commit 0bc5137 into open-mpi:master Oct 21, 2015
@rppendya
Copy link
Contributor Author

Thanks Ralph, Jeff and everyone else for patiently reviewing the pull request

@rhc54
Copy link
Contributor

rhc54 commented Oct 21, 2015

@rppendya Thank you for the fix!

jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
…neighbor_alltoallw_in_place

coll/libnbc: do not handle MPI_IN_PLACE in neighborhood collectives
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants