@hjelmn
Member

hjelmn commented Aug 3, 2015

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

@nysal
Member

nysal commented Aug 4, 2015

This looks good to me. I verified that the opal_fifo test now passes on ppc. Although unrelated to this patch, opal_lifo seems to hit a segmentation fault. Does the opal_lifo test pass for you?

@hjelmn
Member Author

hjelmn commented Aug 4, 2015

I couldn't reproduce the lifo crash in the virtual machine. I will take another look today to see if I can identify the issue.

hjelmn added a commit that referenced this pull request Aug 4, 2015
opal/fifo: add missing memory barrier
hjelmn merged commit 45a8e8d into open-mpi:master Aug 4, 2015
@nysal
Member

nysal commented Aug 4, 2015

Here's the stack:

Core was generated by `./.libs/opal_lifo '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000010000fbc in opal_atomic_swap_32 (addr=0x20, newval=1) at ../../opal/include/opal/sys/atomic_impl.h:49
49              old = *addr;
(gdb) bt
#0  0x0000000010000fbc in opal_atomic_swap_32 (addr=0x20, newval=1) at ../../opal/include/opal/sys/atomic_impl.h:49
#1  0x0000000010001340 in opal_lifo_pop_atomic (lifo=0x3fffca1d63d8) at ../../opal/class/opal_lifo.h:193
#2  0x0000000010001530 in thread_test (arg=0x3fffca1d63d8) at opal_lifo.c:50
#3  0x00001000001b89d8 in start_thread (arg=0x10000140f1a0) at pthread_create.c:314
#4  0x00001000002fef00 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104
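
One plausible reading of `addr=0x20` in frame #0 is that `item` itself was NULL, so `&item->item_free` evaluated to just the member's offset. The sketch below illustrates that arithmetic with a hypothetical struct; it is not the real `opal_list_item_t` definition, and the field names and layout are assumptions chosen so the flag lands at offset 0x20 on a typical LP64 platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the real opal_list_item_t. */
typedef struct layout_item {
    void               *header_ptr;    /* assumed object header: one pointer ... */
    int32_t             header_count;  /* ... plus a 32-bit counter (then padding) */
    struct layout_item *next;          /* assumed list links */
    struct layout_item *prev;
    int32_t             item_free;     /* the flag passed to the atomic swap */
} layout_item_t;

/* The address that &item->item_free would evaluate to if item were NULL
 * is simply the offset of the member within the struct. */
static size_t layout_item_free_offset(void)
{
    return offsetof(layout_item_t, item_free);
}
```

With 8-byte pointers this offset works out to 0x20, which matches the faulting address in the trace, but the match only shows the reading is consistent, not that it is the cause.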
static inline opal_list_item_t *opal_lifo_pop_atomic (opal_lifo_t* lifo)
{
    opal_list_item_t *item;
    while ((item = (opal_list_item_t *) lifo->opal_lifo_head.data.item) != &lifo->opal_lifo_ghost) {
        opal_atomic_rmb();

        /* ensure it is safe to pop the head */
        if (opal_atomic_swap_32((volatile int32_t *) &item->item_free, 1)) {
            continue;
        }

Is the read barrier above meant to order the loads of "item" and "item->item_free"? If so, it may not be needed: loads linked by a data dependency are executed in order on most architectures (Alpha is probably the exception). I tried adding a write barrier as shown below, and that seems to help. I also commented out the read barrier, but that's inconsequential to the segv issue:

--- a/opal/class/opal_lifo.h
+++ b/opal/class/opal_lifo.h
@@ -187,13 +187,14 @@ static inline opal_list_item_t *opal_lifo_pop_atomic (opal_lifo_t* lifo)
 {
     opal_list_item_t *item;
     while ((item = (opal_list_item_t *) lifo->opal_lifo_head.data.item) != &lifo->opal_lifo_ghost) {
-        opal_atomic_rmb();
+        //opal_atomic_rmb();

         /* ensure it is safe to pop the head */
         if (opal_atomic_swap_32((volatile int32_t *) &item->item_free, 1)) {
             continue;
         }

+        opal_atomic_wmb ();
         /* try to swap out the head pointer */
         if (opal_atomic_cmpset_ptr (&lifo->opal_lifo_head.data.item, item,
                                    (void *) item->opal_list_next)) {
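
The ordering idea in the diff above (claim the item, issue a barrier, then swap the head) can be sketched with C11 atomics. This is a minimal illustration under stated assumptions, not the OPAL implementation: every `sketch_*` name is hypothetical, `atomic_thread_fence(memory_order_release)` stands in for `opal_atomic_wmb()`, and the push routine is a simplified guess included only so the pop can be exercised.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the OPAL list item and LIFO types. */
typedef struct sketch_item {
    struct sketch_item *next;
    _Atomic int32_t     item_free;  /* 0 = poppable, 1 = claimed by a thread */
} sketch_item_t;

typedef struct {
    _Atomic(sketch_item_t *) head;
    sketch_item_t            ghost;  /* sentinel that marks the empty list */
} sketch_lifo_t;

static void sketch_lifo_init(sketch_lifo_t *lifo)
{
    lifo->ghost.next = &lifo->ghost;
    atomic_init(&lifo->ghost.item_free, 0);
    atomic_init(&lifo->head, &lifo->ghost);
}

/* Simplified push (an assumption): keep the item claimed until it is linked. */
static void sketch_lifo_push(sketch_lifo_t *lifo, sketch_item_t *item)
{
    atomic_store_explicit(&item->item_free, 1, memory_order_relaxed);
    sketch_item_t *old = atomic_load_explicit(&lifo->head, memory_order_relaxed);
    do {
        item->next = old;  /* on CAS failure, old is refreshed with the current head */
    } while (!atomic_compare_exchange_weak_explicit(&lifo->head, &old, item,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    atomic_store_explicit(&item->item_free, 0, memory_order_release);
}

static sketch_item_t *sketch_lifo_pop(sketch_lifo_t *lifo)
{
    sketch_item_t *item;
    while ((item = atomic_load_explicit(&lifo->head, memory_order_acquire))
           != &lifo->ghost) {
        /* Claim the head; a non-zero old value means another thread holds it. */
        if (atomic_exchange_explicit(&item->item_free, 1, memory_order_acquire)) {
            continue;
        }

        /* Plays the role of the added opal_atomic_wmb(): order the claim
         * before the attempt to swap out the head pointer. */
        atomic_thread_fence(memory_order_release);

        sketch_item_t *expected = item;
        if (atomic_compare_exchange_strong_explicit(&lifo->head, &expected,
                                                    item->next,
                                                    memory_order_acq_rel,
                                                    memory_order_relaxed)) {
            item->next = NULL;
            return item;  /* the popped item stays marked as claimed */
        }

        /* The head moved underneath us: drop the claim and retry. */
        atomic_store_explicit(&item->item_free, 0, memory_order_release);
    }
    return NULL;  /* empty list */
}
```

Pushing three items and popping them should return them in reverse order, with a final pop returning NULL. The sketch makes no claim about ABA safety or about matching the barriers used in the merged fix.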

@hjelmn
Member Author

hjelmn commented Aug 4, 2015

@nysal Beat you to it. See #771.

@nysal
Member

nysal commented Aug 4, 2015

Thanks Nathan. The fix works fine here.

@hjelmn
Member Author

hjelmn commented Aug 4, 2015

@nysal Great! Thanks for testing. I will open a PR for 2.x.

jsquyres added a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016