Conversation

@hjelmn commented Jul 14, 2015

No description provided.

@hjelmn commented Jul 14, 2015

These issues were first identified by @PHHargrove. Someone who knows PPC/Power should double-check before this goes into master. @jsquyres, who should I recruit to do the review?

@PHHargrove:

Nathan,

I see #if OPAL_HAVE_ATOMIC_LLSC_64 protecting calls to opal_atomic_{ll,sc}_ptr().
That seems to be unnecessarily excluding the possibility of using this on PPC32.

I think you should #define OPAL_HAVE_ATOMIC_LLSC_PTR 1 in the same place you define the 32- and 64-bit versions of opal_atomic_{ll,sc}_ptr() macros, and then use this new name to protect their use.
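
A minimal sketch of what that could look like, assuming the existing OPAL_HAVE_ATOMIC_LLSC_32/64 flags and the autoconf SIZEOF_VOID_P define (the actual header layout may differ):

/* Sketch only: define the pointer-width LL/SC macros for whichever width
 * the platform supports, and expose a single OPAL_HAVE_ATOMIC_LLSC_PTR
 * flag for the lifo/fifo code to test instead of OPAL_HAVE_ATOMIC_LLSC_64. */
#if SIZEOF_VOID_P == 8 && OPAL_HAVE_ATOMIC_LLSC_64
#define opal_atomic_ll_ptr(addr)        opal_atomic_ll_64 ((volatile int64_t *) (addr))
#define opal_atomic_sc_ptr(addr, value) opal_atomic_sc_64 ((volatile int64_t *) (addr), (int64_t) (value))
#define OPAL_HAVE_ATOMIC_LLSC_PTR 1
#elif SIZEOF_VOID_P == 4 && OPAL_HAVE_ATOMIC_LLSC_32
#define opal_atomic_ll_ptr(addr)        opal_atomic_ll_32 ((volatile int32_t *) (addr))
#define opal_atomic_sc_ptr(addr, value) opal_atomic_sc_32 ((volatile int32_t *) (addr), (int32_t) (value))
#define OPAL_HAVE_ATOMIC_LLSC_PTR 1
#else
#define OPAL_HAVE_ATOMIC_LLSC_PTR 0
#endif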

It doesn't look like the VM I gave you access to has multilib support (for "-m32").
However, I am going to give you a login on another VM that does.

-Paul

hjelmn force-pushed the ppc_fixes branch 2 times, most recently from c628f13 to ca518a0, on July 14, 2015 at 22:11
@hjelmn commented Jul 14, 2015

@PHHargrove Ok, should be fixed now. Tested with the second virtual machine and -m32.

@hjelmn commented Jul 14, 2015

To give others an idea of the difference in speed:

LL/SC implementation of opal_lifo:

openmpi-dev-2057-g8be2c97/test/class$ ./opal_lifo 
Single thread test. Time: 0 s 21287 us 21 nsec/poppush
Atomics thread finished. Time: 0 s 86097 us 86 nsec/poppush
Atomics thread finished. Time: 1 s 763620 us 1763 nsec/poppush
Atomics thread finished. Time: 1 s 788929 us 1788 nsec/poppush
Atomics thread finished. Time: 1 s 809007 us 1809 nsec/poppush
Atomics thread finished. Time: 1 s 819877 us 1819 nsec/poppush
Atomics thread finished. Time: 1 s 866348 us 1866 nsec/poppush
Atomics thread finished. Time: 1 s 869468 us 1869 nsec/poppush
Atomics thread finished. Time: 1 s 844646 us 1844 nsec/poppush
Atomics thread finished. Time: 1 s 872691 us 1872 nsec/poppush
All threads finished. Thread count: 8 Time: 1 s 880358 us 235 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

CAS implementation:

openmpi-dev-2057-g8be2c97/test/class$ ./opal_lifo 
Single thread test. Time: 0 s 21217 us 21 nsec/poppush
Atomics thread finished. Time: 0 s 121659 us 121 nsec/poppush
Atomics thread finished. Time: 4 s 986224 us 4986 nsec/poppush
Atomics thread finished. Time: 5 s 248917 us 5248 nsec/poppush
Atomics thread finished. Time: 5 s 313268 us 5313 nsec/poppush
Atomics thread finished. Time: 5 s 325973 us 5325 nsec/poppush
Atomics thread finished. Time: 5 s 367270 us 5367 nsec/poppush
Atomics thread finished. Time: 5 s 493431 us 5493 nsec/poppush
Atomics thread finished. Time: 5 s 511199 us 5511 nsec/poppush
Atomics thread finished. Time: 5 s 511118 us 5511 nsec/poppush
All threads finished. Thread count: 8 Time: 5 s 520995 us 690 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

@PHHargrove:

Nathan,

I tried to review the code, but my expertise falls short of being able to do so completely. However, I did find some things with the LL/SC based code that I believe might be incorrect. Hopefully somebody (from IBM?) can complete a more "accurate" review.

  1. There is no RMB in the LL, nor any WMB in the SC. So, I believe that in opal_fifo_pop_atomic() you might need an RMB between line 222 (where you set item) and line 233 (where you dereference it); a sketch of this placement follows the list below. Think of it as the "mate" to the WMB in opal_fifo_push_atomic().
  2. Same issue as (1) exists in opal_lifo_pop_atomic() - you might need an RMB between reading the item pointer and dereferencing it.
  3. A more general concern is the use of C instead of inline ASM for the lifo/fifo code between the LL and the SC. If I understand correctly, one either cannot safely issue any store instructions between the LL and SC, or must ensure they don't fall in the same "granule" as the location operated on by LL/SC. Since "granule" is implementation specific, it might be cache-line or page. So, if the compiler were to spill any variables between the opal_atomic_ll_ptr() and the opal_atomic_sc_ptr() calls, then the reservation established by the LL would be cleared, and I think the SC would fail indefinitely!
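
To illustrate items (1) and (2), here is a minimal sketch (not the actual Open MPI source) of where the read barrier would sit in an LL/SC pop loop, using the names that appear elsewhere in this discussion plus an assumed opal_atomic_rmb() barrier:

static inline opal_list_item_t *lifo_pop_llsc_sketch (opal_lifo_t *lifo)
{
    opal_list_item_t *item;

    do {
        item = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
        if (&lifo->opal_lifo_ghost == item) {
            return NULL;
        }

        /* the LL carries no acquire semantics, so order the load of the head
         * pointer before the dereference of item below (Paul's items 1 and 2) */
        opal_atomic_rmb ();
    } while (!opal_atomic_sc_ptr (&lifo->opal_lifo_head.data.item, item->opal_list_next));

    return item;
}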

All three comments are "maybes". You definitely need a POWER/PPC expert to look more closely at the three issues.

From a pragmatic point-of-view, I suggest that you separate this into 2 PRs: one for the new LL/SC based code and one for the one-line fix to add the missing WMB in opal_fifo_push_atomic(). The later should be uncontentious (and fixes a known bug).

-Paul

@bosilca commented Jul 16, 2015

Paul, I could not find any reference that suggests the reservation granularity is anything larger than a cache line. Can you provide a link, please?

@PHHargrove:

@bosilca,

First, I admit that the idea that it could be as large as a page is my own worst-case interpretation.
However, lacking any documentation that says "cache line", it is something I think deserves some caution.
If you have a link that better defines the granularity, I'd like to know.

I did a bit of poking around, which is described later starting at "FWIW some research...", and found that for a sampling of PPC platforms, the reservation granularity and d-cache line size match.

Even if it is only the cache-line size, it is not something that is necessarily known with 100% certainty at compile time. However, examining the Linux kernel source, it appears never to be larger than 128 bytes: it is fixed at that value for all PPC64 platforms, but could be 128 or 32 bytes for PPC32, and 16 or 64 bytes for some embedded CPUs. So, I guess one should be fine as long as structures are padded/guarded to ensure any writes between LL/SC pairs are no less than 128 bytes away from the word accessed via LL/SC.
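
If one wanted to be defensive about this, a padding scheme along these lines would keep the LL/SC target alone in a worst-case granule (the names and the GCC aligned attribute here are hypothetical, not taken from the Open MPI headers):

#define RESERVATION_GRANULE_MAX 128   /* conservative bound from the kernel sources discussed above */

/* hypothetical layout: keep the word targeted by larx/stcx. alone in its
 * granule so that ordinary stores issued nearby (stack spills, list-item
 * updates) cannot land in the same granule and clear the reservation */
struct padded_lifo_head {
    volatile void *item;
    char pad[RESERVATION_GRANULE_MAX - sizeof (void *)];
} __attribute__ ((aligned (RESERVATION_GRANULE_MAX)));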

As for a link to my source of information:

I am looking at "PowerPC® Microprocessor Family: The Programming Environments Manual for 64-bit Microprocessors", Version 3.0.

See section 4.2.6 and specifically the "Note" about halfway down page 173:

Note: The reservation granularity is implementation-dependent.

I searched that doc for every instance of "granul" (to catch "granule" and "granularity") and found nothing that was more concrete than "implementation-dependent".

See also "Appendix D. Synchronization Programming Examples" (starting on page 619).
There is discussion near the bottom of p619 and again in D.5 (a linked-list insertion example) of how livelock could occur if two next pointers are in the same granule. It was that example that I was reminded of when looking at Nathan's code.

FWIW some research seems to show that reservation granule and cache-line size match:

I can confirm that the granule is the d-cache line size on two POWER7 systems:

[phargrov@gcc1-power7 ~]$ od -tu4 /proc/device-tree/cpus/PowerPC,POWER7@0/{d-cache-line-size,reservation-granule-size}
0000000 128 128
0000010

{hargrove@vestalac2 ~}$ od -tu4 /proc/device-tree/cpus/PowerPC,POWER7@0/{d-cache-line-size,reservation-granule-size}
0000000 128 128
0000010

A little-endian Power8 also gives a granule size and d-cache block size of 128.

On an old G4 laptop I get no useful info (only 3 bytes!):
phargrov@pcp-k-421:~$ od -c /proc/device-tree/cpus/PowerPC,G4@0/reservation-granule-size
0000000 \0 \0 \0
0000004

The following are from Google searches for "device-tree":

Here is a portion of a device-tree found on the web showing 32-byte granularity, which matches the reported d-cache-block-size:

0 > dev /PowerPC,604<return>  ok
0 > .properties<return>
name                    PowerPC,604
device_type             cpu
reg                     00000000  00000000
cpu-version             00090202
clock-frequency         0BEBC200
timebase-frequency      00BEBC20
reservation-granularity 00000020
[...]
d-cache-block-size      00000020

A G5 reports the (expected for ppc64) value of 128 for both d-cache-block-size and reservation-granule-size.

A G3 and a G4 both report 32 for both values, which is what the Linux kernel source leads us to expect for the L1 cache line length on most PPC32 systems.

@hjelmn commented Jul 16, 2015

Both the fifo and lifo implementations intentionally have only a small number of loads/stores between the LL and the SC. The memory locations being read/written are on the stack and in the opal_list_item being pushed/popped. It is possible that we could get really unlucky and have either the stack addresses or the list item mapped to the same cache line. I do not know whether that is enough to cause the reservation on the fifo/lifo head pointer to be lost. Implementing the entire head update in assembly would certainly eliminate this possibility, but I would like to hear from a Power/PPC expert.
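
For reference, the push loop being discussed looks roughly like this (reconstructed from the hunks quoted later in this thread; treat it as a sketch rather than the exact source):

static inline opal_list_item_t *opal_lifo_push_atomic (opal_lifo_t *lifo,
                                                       opal_list_item_t *item)
{
    opal_list_item_t *next;

    do {
        /* LL on the head, then an ordinary store to item->opal_list_next and
         * a wmb (lwsync) before the SC -- these are the accesses that sit
         * between the reservation and the stdcx. */
        item->opal_list_next = next =
            (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
        opal_atomic_wmb ();
    } while (!opal_atomic_sc_ptr (&lifo->opal_lifo_head.data.item, item));

    return next;
}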

@hjelmn commented Jul 16, 2015

@PHHargrove I will move the missing memory barrier to another PR so that it gets in ASAP.

@goodell commented Jul 17, 2015

Hey Nathan, I haven't had time to review this PR yet, but I definitely echo Paul's concerns here. I'm about to head out for a short vacation, but I can probably review it on Wednesday if you'd like another set of eyeballs on it.

@hjelmn commented Jul 19, 2015

@goodell Yes, please take a look. I plan to do a little more research on the livelock issue. I am not convinced it will happen but I am willing to rewrite the code to mitigate the potential issue.

Inline review comment (Member):

I think it's pretty unconventional to have a barrier in the middle of a lwarx/stwcx. pair. The "third party" memory store to item->opal_list_next (which might clear the "reservation") and the wmb (which could also potentially mess with the reservation) are disconcerting.

Do you have a trustworthy source example of this style of unconventional lwarx/stwcx. usage?

Inline review comment (Member Author):

The barrier is meant to ensure the ordering between the store instruction for writing item->opal_list_next and the store conditional. I think there is an example of something like this in the PPC spec Paul pointed out.

I don't think the extra store will clear the reservation, though I could reorder the loop to do a normal read/store to update opal_list_next and then do a ldarx/stdcx. pair with a check between them that the current head is still the same as what was stored in the opal_list_next pointer.

Inline review comment (Member Author):

Actually, I can't reorder it that way. It would become essentially a CAS and would then have the ABA problem.

Thinking about this a little more: there is no ABA problem with the reordering, since this is the lifo push. ABA problems come up in pop.

@jsquyres:

Adding @gpaulsen and @dsolt from IBM. POWER is your architecture, guys -- can you have a look? Seems like there's quite a bit of debate about this.

@hjelmn commented Jul 24, 2015

@jsquyres Thanks. Given my understanding of how the reservations work on PPC/Power I think the code is ok but we really need experts to take a look.

If I can get this in (with modifications if necessary) I think it will be a good performance boost for these platforms with MPI_THREAD_MULTIPLE.

@gpaulsen:

I'll ask Nysal to take a look.

@gpaulsen:

Nysal's taking a look. It will take at least a few days for him to review and comment.

@nysal commented Aug 3, 2015

I've only looked at the LIFO so far. Here is the disassembly for opal_lifo_push_atomic:

    100017c0:   a8 48 00 7d     ldarx   r8,0,r9
    100017c4:   10 00 0a f9     std     r8,16(r10)
    100017c8:   ac 04 20 7c     lwsync
    100017cc:   ad 49 40 7d     stdcx.  r10,0,r9
    100017d0:   00 00 00 39     li      r8,0
    100017d4:   08 00 c2 40     bne-    100017dc <thread_test+0x15c>
    100017d8:   01 00 08 61     ori     r8,r8,1
    100017dc:   00 00 88 2f     cmpwi   cr7,r8,0
    100017e0:   e0 ff 9e 41     beq     cr7,100017c0 <thread_test+0x140>

This might lead to a livelock, as Paul already mentioned. How about changing the code a little to move the store before the LL/SC pair, as described in PowerPC Book 2 (Appendix B.3)? However, as you noted in one of the comments above, this would make the code similar to a compare-and-swap.

As a next step I'll gather some performance data on a Power8 node. The performance observed on the VM might not be really representative of the coherency overheads on a real system. I'll probably have to modify the test a little to set the thread affinity, but that shouldn't be too hard.

If there is a simple fix for the master branch, I'd prefer that it is applied first. That gives us more time to evaluate this change.

@hjelmn commented Aug 3, 2015

@nysal There is a fix that will go into master. The performance improvement was something I had been thinking about for some time; I finally had a chance to implement it.

@hjelmn commented Aug 3, 2015

See #770 for the fifo hang fix on its own. A memory barrier was missing.

@nysal commented Aug 7, 2015

I modified the opal_lifo test case to set the thread affinity:

--- a/test/class/opal_lifo.c
+++ b/test/class/opal_lifo.c
@@ -20,12 +20,15 @@
 #include <stdlib.h>
 #include <stdio.h>
 #include <stddef.h>
+#include <string.h>

 #include <sys/time.h>
+#include <sched.h>

 #define OPAL_LIFO_TEST_THREAD_COUNT 8
 #define ITERATIONS 1000000
 #define ITEM_COUNT 100
+#define NR_CPUS 256

 #if !defined(timersub)
 #define timersub(a, b, r) \
@@ -39,11 +42,38 @@
     } while (0)
 #endif

+static unsigned int cpu_affinities[NR_CPUS];
+static unsigned int next_aff = 0;
+static int use_affinity;
+pthread_mutex_t aff_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+static void set_affinity(void)
+{
+    cpu_set_t cpu_mask;
+    int cpu, ret;
+
+    if (!use_affinity)
+        return;
+
+    ret = pthread_mutex_lock(&aff_mutex);
+    cpu = cpu_affinities[next_aff++];
+    ret = pthread_mutex_unlock(&aff_mutex);
+    printf("Setting thread affinity to %d\n",cpu);
+    CPU_ZERO(&cpu_mask);
+    CPU_SET(cpu, &cpu_mask);
+    ret = sched_setaffinity(0, sizeof(cpu_mask), &cpu_mask);
+    if (ret) {
+        perror("Error in sched_setaffinity");
+        exit(-1);
+    }
+}
+
 static void *thread_test (void *arg) {
     opal_lifo_t *lifo = (opal_lifo_t *) arg;
     opal_list_item_t *item;
     struct timeval start, stop, total;
     double timing;
+    set_affinity();

     gettimeofday (&start, NULL);
     for (int i = 0 ; i < ITERATIONS ; ++i) {
@@ -82,7 +112,8 @@ int main (int argc, char *argv[]) {
     opal_lifo_t lifo;
     bool success;
     double timing;
-    int rc;
+    int rc, ctr = 0;
+    char *aff, *tmp;

     rc = opal_init_util (&argc, &argv);
     test_verify_int(OPAL_SUCCESS, rc);
@@ -151,7 +182,18 @@ int main (int argc, char *argv[]) {
     } else {
         test_failure (" lifo push/pop single-threaded with atomics");
     }
-
+    // Check if affinity was supplied
+    if(argc == 2) {
+        aff = strdup(argv[1]);
+        tmp = strtok(aff, ",");
+        while (tmp != NULL)
+        {
+          cpu_affinities[ctr++] = atoi(tmp);
+          tmp = strtok(NULL, ",");
+        }
+        free(aff);
+        use_affinity = 1;
+    }
     gettimeofday (&start, NULL);
     for (int i = 0 ; i < OPAL_LIFO_TEST_THREAD_COUNT ; ++i) {
         pthread_create (threads + i, NULL, thread_test, &lifo);

I ran the opal_lifo test on a Power8 system with 20 cores. SMT8 is enabled, so there is a total of 160 hardware threads. Here is the performance data for the current master branch (which includes the recent memory barrier fixes):

jnysal@c656f6n03:~/src/ompi/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11836 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 69695 us 69 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 6 s 740236 us 6740 nsec/poppush
Atomics thread finished. Time: 6 s 794044 us 6794 nsec/poppush
Atomics thread finished. Time: 6 s 822803 us 6822 nsec/poppush
Atomics thread finished. Time: 7 s 6026 us 7006 nsec/poppush
Atomics thread finished. Time: 7 s 8867 us 7008 nsec/poppush
Atomics thread finished. Time: 7 s 29986 us 7029 nsec/poppush
Atomics thread finished. Time: 7 s 30949 us 7030 nsec/poppush
Atomics thread finished. Time: 7 s 31042 us 7031 nsec/poppush
All threads finished. Thread count: 8 Time: 7 s 31284 us 878 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

With your performance improvements:

jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11836 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 46873 us 46 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 3 s 707453 us 3707 nsec/poppush
Atomics thread finished. Time: 3 s 843065 us 3843 nsec/poppush
Atomics thread finished. Time: 3 s 867249 us 3867 nsec/poppush
Atomics thread finished. Time: 3 s 972549 us 3972 nsec/poppush
Atomics thread finished. Time: 3 s 975923 us 3975 nsec/poppush
Atomics thread finished. Time: 3 s 979879 us 3979 nsec/poppush
Atomics thread finished. Time: 3 s 981764 us 3981 nsec/poppush
Atomics thread finished. Time: 3 s 982120 us 3982 nsec/poppush
All threads finished. Thread count: 8 Time: 3 s 982356 us 497 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

With the following patch on top of your improvements:

--- a/opal/class/opal_lifo.h
+++ b/opal/class/opal_lifo.h
@@ -29,6 +29,7 @@

 #include "opal/sys/atomic.h"
 #include "opal/threads/mutex.h"
+#include <poll.h>

 BEGIN_C_DECLS
 @@ -125,8 +132,13 @@ static inline opal_list_item_t *opal_lifo_push_atomic (opal_lifo_t *lifo,
 static inline opal_list_item_t *opal_lifo_pop_atomic (opal_lifo_t* lifo)
 {
     opal_list_item_t *item, *next;
-
+    int attempt = 0;
     do {
+        if(++attempt > 5)
+        {
+          (void) poll(NULL, 0, 1);  /* Wait for 1ms */
+          attempt = 0;
+        }
         item = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
         if (&lifo->opal_lifo_ghost == item) {
             return NULL;

I see the following:

jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56                           
Single thread test. Time: 0 s 11831 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45537 us 45 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 0 s 225177 us 225 nsec/poppush
Atomics thread finished. Time: 0 s 258867 us 258 nsec/poppush
Atomics thread finished. Time: 0 s 334699 us 334 nsec/poppush
Atomics thread finished. Time: 0 s 337626 us 337 nsec/poppush
Atomics thread finished. Time: 0 s 359259 us 359 nsec/poppush
Atomics thread finished. Time: 0 s 383136 us 383 nsec/poppush
Atomics thread finished. Time: 0 s 410698 us 410 nsec/poppush
Atomics thread finished. Time: 0 s 424042 us 424 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 424412 us 53 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

Thoughts?

@hjelmn commented Aug 7, 2015

I saw similar performance improvements for the test when using nanosleep. I was using it every time a conflict was detected instead of every 5th time. In this case either a poll(0) or a nanosleep() call has the benefit of changing the timing of the conflicting threads, thereby breaking the livelock.

Can you run the test again with the following in place of poll:

static struct timespec delay = {.tv_sec = 0, .tv_nsec = 10};
nanosleep (&delay, NULL);

and post the performance. I am curious to see if there is any difference.

@nysal commented Aug 7, 2015

nanosleep(10 ns):
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11842 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45544 us 45 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 24
Setting thread affinity to 16
Setting thread affinity to 8
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 0 s 754345 us 754 nsec/poppush
Atomics thread finished. Time: 0 s 773030 us 773 nsec/poppush
Atomics thread finished. Time: 0 s 774151 us 774 nsec/poppush
Atomics thread finished. Time: 0 s 849844 us 849 nsec/poppush
Atomics thread finished. Time: 0 s 853507 us 853 nsec/poppush
Atomics thread finished. Time: 0 s 854328 us 854 nsec/poppush
Atomics thread finished. Time: 0 s 855288 us 855 nsec/poppush
Atomics thread finished. Time: 0 s 856854 us 856 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 857022 us 107 nsec/poppush

nanosleep(100 ns):
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11834 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45621 us 45 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 0 s 772702 us 772 nsec/poppush
Atomics thread finished. Time: 0 s 780086 us 780 nsec/poppush
Atomics thread finished. Time: 0 s 805862 us 805 nsec/poppush
Atomics thread finished. Time: 0 s 858595 us 858 nsec/poppush
Atomics thread finished. Time: 0 s 867625 us 867 nsec/poppush
Atomics thread finished. Time: 0 s 869090 us 869 nsec/poppush
Atomics thread finished. Time: 0 s 868960 us 868 nsec/poppush
Atomics thread finished. Time: 0 s 870466 us 870 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 871151 us 108 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

nanosleep(1000 ns):
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11840 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45551 us 45 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 0 s 767050 us 767 nsec/poppush
Atomics thread finished. Time: 0 s 777560 us 777 nsec/poppush
Atomics thread finished. Time: 0 s 782619 us 782 nsec/poppush
Atomics thread finished. Time: 0 s 849595 us 849 nsec/poppush
Atomics thread finished. Time: 0 s 852778 us 852 nsec/poppush
Atomics thread finished. Time: 0 s 857903 us 857 nsec/poppush
Atomics thread finished. Time: 0 s 858255 us 858 nsec/poppush
Atomics thread finished. Time: 0 s 858200 us 858 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 858875 us 107 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

nanosleep(10000 ns):
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11063 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 42478 us 42 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 0 s 654218 us 654 nsec/poppush
Atomics thread finished. Time: 0 s 683836 us 683 nsec/poppush
Atomics thread finished. Time: 0 s 688924 us 688 nsec/poppush
Atomics thread finished. Time: 0 s 751148 us 751 nsec/poppush
Atomics thread finished. Time: 0 s 755851 us 755 nsec/poppush
Atomics thread finished. Time: 0 s 757428 us 757 nsec/poppush
Atomics thread finished. Time: 0 s 762109 us 762 nsec/poppush
Atomics thread finished. Time: 0 s 762410 us 762 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 763008 us 95 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

nanosleep(100000 ns):
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11838 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45554 us 45 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 0 s 409536 us 409 nsec/poppush
Atomics thread finished. Time: 0 s 423490 us 423 nsec/poppush
Atomics thread finished. Time: 0 s 435291 us 435 nsec/poppush
Atomics thread finished. Time: 0 s 502105 us 502 nsec/poppush
Atomics thread finished. Time: 0 s 512762 us 512 nsec/poppush
Atomics thread finished. Time: 0 s 513401 us 513 nsec/poppush
Atomics thread finished. Time: 0 s 523077 us 523 nsec/poppush
Atomics thread finished. Time: 0 s 524944 us 524 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 525426 us 65 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

nanosleep(1000000 ns):
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56
Single thread test. Time: 0 s 11833 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45529 us 45 nsec/poppush
Setting thread affinity to 0
Setting thread affinity to 8
Setting thread affinity to 16
Setting thread affinity to 24
Setting thread affinity to 32
Setting thread affinity to 40
Setting thread affinity to 48
Setting thread affinity to 56
Atomics thread finished. Time: 0 s 314969 us 314 nsec/poppush
Atomics thread finished. Time: 0 s 363788 us 363 nsec/poppush
Atomics thread finished. Time: 0 s 431147 us 431 nsec/poppush
Atomics thread finished. Time: 0 s 433865 us 433 nsec/poppush
Atomics thread finished. Time: 0 s 434452 us 434 nsec/poppush
Atomics thread finished. Time: 0 s 447049 us 447 nsec/poppush
Atomics thread finished. Time: 0 s 449957 us 449 nsec/poppush
Atomics thread finished. Time: 0 s 459706 us 459 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 459862 us 57 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

So the performance numbers with a 1 millisecond nanosleep are pretty close to what we see with poll. The 10 nanosecond sleep also gives a huge boost compared to the original performance.

@hjelmn commented Aug 7, 2015

OK, that's about what I expected. It might be interesting to see how they compare with more/less contention.

I will incorporate your patch into the LL/SC implementations of the lifo and fifo. It should meet less resistance than my earlier nanosleep patch because it actually addresses a potential bug, whereas mine only improved the performance of a unit test.

@nysal commented Aug 10, 2015

With higher contention, the results are even better. There seems to be an improvement in all cases, even with 2 threads. I reduced the verbosity of the test case, but there is no functional change.

A = Current master branch
B = With Nathan's improvements
C = With Nathan's improvements + poll(1ms)

20 THREADS

A:
jnysal@c656f6n03:~/src/ompi/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11840 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 69316 us 69 nsec/poppush
Atomics thread finished. Time: 33 s 10418 us 33010 nsec/poppush
Atomics thread finished. Time: 34 s 430475 us 34430 nsec/poppush
Atomics thread finished. Time: 34 s 832087 us 34832 nsec/poppush
Atomics thread finished. Time: 34 s 932993 us 34932 nsec/poppush
Atomics thread finished. Time: 34 s 942407 us 34942 nsec/poppush
Atomics thread finished. Time: 34 s 971405 us 34971 nsec/poppush
Atomics thread finished. Time: 35 s 294624 us 35294 nsec/poppush
Atomics thread finished. Time: 35 s 314571 us 35314 nsec/poppush
Atomics thread finished. Time: 35 s 328006 us 35328 nsec/poppush
Atomics thread finished. Time: 35 s 580015 us 35580 nsec/poppush
Atomics thread finished. Time: 35 s 639544 us 35639 nsec/poppush
Atomics thread finished. Time: 35 s 722749 us 35722 nsec/poppush
Atomics thread finished. Time: 35 s 741566 us 35741 nsec/poppush
Atomics thread finished. Time: 35 s 745827 us 35745 nsec/poppush
Atomics thread finished. Time: 35 s 752639 us 35752 nsec/poppush
Atomics thread finished. Time: 35 s 810157 us 35810 nsec/poppush
Atomics thread finished. Time: 35 s 811401 us 35811 nsec/poppush
Atomics thread finished. Time: 35 s 818095 us 35818 nsec/poppush
Atomics thread finished. Time: 35 s 817911 us 35817 nsec/poppush
Atomics thread finished. Time: 35 s 818737 us 35818 nsec/poppush
All threads finished. Thread count: 20 Time: 35 s 819096 us 1790 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

B:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,
128,136,144,152
Single thread test. Time: 0 s 11829 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 46623 us 46 nsec/poppush
Atomics thread finished. Time: 19 s 842993 us 19842 nsec/poppush
Atomics thread finished. Time: 19 s 907652 us 19907 nsec/poppush
Atomics thread finished. Time: 20 s 3194 us 20003 nsec/poppush
Atomics thread finished. Time: 20 s 31412 us 20031 nsec/poppush
Atomics thread finished. Time: 20 s 875488 us 20875 nsec/poppush
Atomics thread finished. Time: 20 s 918861 us 20918 nsec/poppush
Atomics thread finished. Time: 20 s 955191 us 20955 nsec/poppush
Atomics thread finished. Time: 20 s 975688 us 20975 nsec/poppush
Atomics thread finished. Time: 21 s 1498 us 21001 nsec/poppush
Atomics thread finished. Time: 21 s 28874 us 21028 nsec/poppush
Atomics thread finished. Time: 21 s 51209 us 21051 nsec/poppush
Atomics thread finished. Time: 21 s 68225 us 21068 nsec/poppush
Atomics thread finished. Time: 21 s 68264 us 21068 nsec/poppush
Atomics thread finished. Time: 21 s 192209 us 21192 nsec/poppush
Atomics thread finished. Time: 21 s 224935 us 21224 nsec/poppush
Atomics thread finished. Time: 21 s 243998 us 21243 nsec/poppush
Atomics thread finished. Time: 21 s 246623 us 21246 nsec/poppush
Atomics thread finished. Time: 21 s 249622 us 21249 nsec/poppush
Atomics thread finished. Time: 21 s 250788 us 21250 nsec/poppush
Atomics thread finished. Time: 21 s 251126 us 21251 nsec/poppush
All threads finished. Thread count: 20 Time: 21 s 251801 us 1062 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

C:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,
128,136,144,152
Single thread test. Time: 0 s 11847 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45576 us 45 nsec/poppush
Atomics thread finished. Time: 0 s 946065 us 946 nsec/poppush
Atomics thread finished. Time: 1 s 8005 us 1008 nsec/poppush
Atomics thread finished. Time: 1 s 39556 us 1039 nsec/poppush
Atomics thread finished. Time: 1 s 65126 us 1065 nsec/poppush
Atomics thread finished. Time: 1 s 107994 us 1107 nsec/poppush
Atomics thread finished. Time: 1 s 155843 us 1155 nsec/poppush
Atomics thread finished. Time: 1 s 192223 us 1192 nsec/poppush
Atomics thread finished. Time: 1 s 200349 us 1200 nsec/poppush
Atomics thread finished. Time: 1 s 203583 us 1203 nsec/poppush
Atomics thread finished. Time: 1 s 216128 us 1216 nsec/poppush
Atomics thread finished. Time: 1 s 216257 us 1216 nsec/poppush
Atomics thread finished. Time: 1 s 237534 us 1237 nsec/poppush
Atomics thread finished. Time: 1 s 246491 us 1246 nsec/poppush
Atomics thread finished. Time: 1 s 252498 us 1252 nsec/poppush
Atomics thread finished. Time: 1 s 254472 us 1254 nsec/poppush
Atomics thread finished. Time: 1 s 276449 us 1276 nsec/poppush
Atomics thread finished. Time: 1 s 411666 us 1411 nsec/poppush
Atomics thread finished. Time: 1 s 414204 us 1414 nsec/poppush
Atomics thread finished. Time: 1 s 442641 us 1442 nsec/poppush
Atomics thread finished. Time: 1 s 448772 us 1448 nsec/poppush
All threads finished. Thread count: 20 Time: 1 s 449707 us 72 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

4 THREADS

A:
jnysal@c656f6n03:~/src/ompi/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11837 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 69251 us 69 nsec/poppush
Atomics thread finished. Time: 1 s 819509 us 1819 nsec/poppush
Atomics thread finished. Time: 1 s 840988 us 1840 nsec/poppush
Atomics thread finished. Time: 1 s 842625 us 1842 nsec/poppush
Atomics thread finished. Time: 1 s 848337 us 1848 nsec/poppush
All threads finished. Thread count: 4 Time: 1 s 848652 us 462 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

B:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11847 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 46807 us 46 nsec/poppush
Atomics thread finished. Time: 0 s 961391 us 961 nsec/poppush
Atomics thread finished. Time: 0 s 969698 us 969 nsec/poppush
Atomics thread finished. Time: 0 s 983320 us 983 nsec/poppush
Atomics thread finished. Time: 0 s 983566 us 983 nsec/poppush
All threads finished. Thread count: 4 Time: 0 s 983872 us 245 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

C:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11831 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45554 us 45 nsec/poppush
Atomics thread finished. Time: 0 s 108960 us 108 nsec/poppush
Atomics thread finished. Time: 0 s 174198 us 174 nsec/poppush
Atomics thread finished. Time: 0 s 177497 us 177 nsec/poppush
Atomics thread finished. Time: 0 s 184874 us 184 nsec/poppush
All threads finished. Thread count: 4 Time: 0 s 185294 us 46 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

2 THREADS

A:
jnysal@c656f6n03:~/src/ompi/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11836 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 69203 us 69 nsec/poppush
Atomics thread finished. Time: 0 s 503064 us 503 nsec/poppush
Atomics thread finished. Time: 0 s 512194 us 512 nsec/poppush
All threads finished. Thread count: 2 Time: 0 s 512285 us 256 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

B:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11840 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 46496 us 46 nsec/poppush
Atomics thread finished. Time: 0 s 215489 us 215 nsec/poppush
Atomics thread finished. Time: 0 s 216294 us 216 nsec/poppush
All threads finished. Thread count: 2 Time: 0 s 216562 us 108 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

C:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11822 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45647 us 45 nsec/poppush
Atomics thread finished. Time: 0 s 92262 us 92 nsec/poppush
Atomics thread finished. Time: 0 s 92720 us 92 nsec/poppush
All threads finished. Thread count: 2 Time: 0 s 92998 us 46 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

1 THREAD

A:
jnysal@c656f6n03:~/src/ompi/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11837 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 69339 us 69 nsec/poppush
Atomics thread finished. Time: 0 s 69342 us 69 nsec/poppush
All threads finished. Thread count: 1 Time: 0 s 69498 us 69 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

B:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11837 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 46528 us 46 nsec/poppush
Atomics thread finished. Time: 0 s 46503 us 46 nsec/poppush
All threads finished. Thread count: 1 Time: 0 s 46677 us 46 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

C:
jnysal@c656f6n03:~/src/ompi-github-hjelmn/test/class$ ./.libs/opal_lifo 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152
Single thread test. Time: 0 s 11841 us 11 nsec/poppush
Atomics thread finished. Time: 0 s 45650 us 45 nsec/poppush
Atomics thread finished. Time: 0 s 45652 us 45 nsec/poppush
All threads finished. Thread count: 1 Time: 0 s 45790 us 45 nsec/poppush
SUPPORT: OMPI Test Passed: opal_lifo_t: (7 tests)

@nysal commented Aug 10, 2015

To avoid the store between the LL/SC pair, how about something like this?

--- a/opal/class/opal_lifo.h
+++ b/opal/class/opal_lifo.h
@@ -112,8 +114,12 @@ static inline opal_list_item_t *opal_lifo_push_atomic (opal_lifo_t *lifo,
     opal_list_item_t *next;

     do {
-        item->opal_list_next = next = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
-        opal_atomic_wmb ();
+load_head:
+        item->opal_list_next = lifo->opal_lifo_head.data.item;
+        opal_atomic_wmb();
+        next = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
+        if(next != item->opal_list_next)
+            goto load_head;
     } while (!opal_atomic_sc_ptr(&lifo->opal_lifo_head.data.item, item));

     return next;

The disassembly:

    10001b70:   18 00 5e e9     ld      r10,24(r30)
    10001b74:   10 00 49 f9     std     r10,16(r9)
    10001b78:   ac 04 20 7c     lwsync
    10001b7c:   a8 f8 40 7d     ldarx   r10,0,r31
    10001b80:   10 00 09 e9     ld      r8,16(r9)
    10001b84:   00 50 a8 7f     cmpd    cr7,r8,r10
    10001b88:   e8 ff 9e 40     bne     cr7,10001b70 <thread_test+0x1c0>
    10001b8c:   ad f9 20 7d     stdcx.  r9,0,r31
    10001b90:   00 00 40 39     li      r10,0
    10001b94:   08 00 c2 40     bne-    10001b9c <thread_test+0x1ec>
    10001b98:   01 00 4a 61     ori     r10,r10,1
    10001b9c:   00 00 8a 2f     cmpwi   cr7,r10,0
    10001ba0:   d0 ff 9e 41     beq     cr7,10001b70 <thread_test+0x1c0>

The performance is not very different (marginally better than before).

@hjelmn commented Aug 10, 2015

Essentially that is the same as using cswap. With a lifo there really isn't an ABA issue in push (only pop). You can try copying the cswap version of push over the LL/SC one and it should have nearly identical performance to the code you posted.

That is probably the right answer, though: use LL/SC to defeat ABA in pop and use cswap in push. That both avoids the potential livelock and improves performance on platforms that support the more powerful LL/SC instructions.
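
A rough sketch of the push side of that split, assuming a compare-and-swap helper along the lines of opal_atomic_cmpset_ptr (the exact name and signature are an assumption here); the pop side would keep the LL/SC loop with a backoff as in the patches above:

static inline opal_list_item_t *lifo_push_cas_sketch (opal_lifo_t *lifo,
                                                      opal_list_item_t *item)
{
    opal_list_item_t *next;

    do {
        next = (opal_list_item_t *) lifo->opal_lifo_head.data.item;
        item->opal_list_next = next;
        opal_atomic_wmb ();
        /* assumed CAS helper: succeeds only if the head is still == next;
         * no store sits inside a reservation here, so no livelock risk */
    } while (!opal_atomic_cmpset_ptr (&lifo->opal_lifo_head.data.item, next, item));

    return next;
}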

hjelmn changed the title from "PowerPC/Power lifo/fifo fixes" to "PowerPC/Power lifo/fifo improvements" on Aug 11, 2015
@jsquyres:

@hjelmn @nysal It looks like you have agreed on an approach. Does this PR need to be updated before it is merged?

@hjelmn commented Aug 15, 2015

Yup. I will push the update on Monday.

@hppritcha:

Should this be pushed back to v2.x?

@nysal commented Aug 17, 2015

@hjelmn @jsquyres I'll take a look once the update lands.
@hppritcha I vote for pushing this to v2.x once it's ready.

@hjelmn commented Aug 17, 2015

@nysal Should be good to go now. I prefer nanosleep over poll because it will shorten the delay when contention is hit within Open MPI. My lifo/fifo tests are much harder on the class than anything we will see in practice within Open MPI.

@nysal commented Aug 18, 2015

Using nanosleep is fine. There are some typos in the update. Do we want the nanosleep for fifo pop too?

Please make the following change:

--- a/opal/class/opal_lifo.h
+++ b/opal/class/opal_lifo.h
@@ -25,6 +25,7 @@
 #define OPAL_LIFO_H_HAS_BEEN_INCLUDED

 #include "opal_config.h"
+#include <time.h>
 #include "opal/class/opal_list.h"

 #include "opal/sys/atomic.h"
@@ -190,8 +191,8 @@ static inline void _opal_lifo_release_cpu (void)
      * is a performance improvement for the lifo test when this call is made on detection
      * of contention but it may not translate into actually MPI or application performance
      * improvements. */
-    static struct timespect interval = { .tv_sec = 0, .tv_nsec = 100 };
-    nanosleeep (&interval, NULL);
+    static struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 };
+    nanosleep (&interval, NULL);
 }

 /* Retrieve one element from the LIFO. If we reach the ghost element then the LIFO

@hjelmn commented Aug 18, 2015

Bah. Thanks for spotting the typos.

hjelmn added 3 commits August 18, 2015 14:01
This commit adds implementations of opal_atomic_ll_32/64 and
opal_atomic_sc_32/64. These atomics can be used to implement more
efficient lifo/fifo operations on supported platforms. The only
supported platform with this commit is powerpc/power.

This commit also adds an implementation of opal_atomic_swap_32/64 for
powerpc.

Tested with Power8.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
These instructions allow a more efficient implementation of the
opal_fifo_pop_atomic function.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds implementations for opal_atomic_lifo_pop and
opal_atomic_lifo_push that make use of the load-linked and
store-conditional instructions. These instructions allow for a more
efficient implementation on supported platforms.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn added a commit that referenced this pull request Aug 18, 2015
PowerPC/Power lifo/fifo improvements
hjelmn merged commit d2b3c9d into open-mpi:master on Aug 18, 2015
@hjelmn commented Aug 18, 2015

@nysal Adding the delay to pop would probably have some push-back. I may re-introduce that idea in another PR.

jsquyres added a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
hjelmn deleted the ppc_fixes branch on May 23, 2016 at 17:47