PowerPC/Power lifo/fifo improvements #720
Conversation
|
These issues were first identified by @PHHargrove. Someone who knows PPC/Power should double-check before this goes into master. @jsquyres who should I recruit to do the review? |
|
Nathan, It doesn't look like the VM I gave you access to has multilib support (for "-m32"). -Paul |
Force-pushed from c628f13 to ca518a0.
|
@PHHargrove Ok, should be fixed now. Tested with the second virtual machine and -m32. |
|
To give others an idea of the difference in speed:
LL/SC implementation of opal_lifo:
CAS implementation: |
|
Nathan, I tried to review the code, but my expertise falls short of being able to do so completely. However, I did find some things with the LL/SC based code that I believe might be incorrect. Hopefully somebody (from IBM?) can complete a more "accurate" review.
All three comments are "maybes". You definitely need a POWER/PPC expert to look more closely at the three issues. From a pragmatic point of view, I suggest that you separate this into two PRs: one for the new LL/SC based code and one for the one-line fix to add the missing WMB in opal_fifo_push_atomic(). The latter should be uncontentious (and fixes a known bug). -Paul |
|
Paul, I could not find any reference that suggests that the reservation granularity is anything larger than a cache line. Can you provide a link please? |
|
@bosilca, First, I admit that the idea that it could be as large as a page is my own worst-case interpretation. I did a bit of poking around, described later starting at "FWIW some research...", and found that for a sampling of PPC platforms the reservation granularity and d-cache line size match. Even if it is only a cache line, it is not something that is necessarily known with 100% certainty at compile time. However, examining the Linux kernel source, it appears never to be larger than 128 bytes: it is fixed at that value for all PPC64 platforms, can be 128 or 32 for PPC32, and 16 or 64 for some embedded CPUs. So I guess one should be fine as long as structures are padded/guarded to ensure any writes between LL/SC pairs land no less than 128 bytes away from the word accessed via LL/SC.

As for a link to my source of information: I am looking at the PowerPC® Microprocessor Family: The Programming Environments Manual. See section 4.2.6 and specifically the "Note" about half way down page 173.
I searched that doc for every instance of "granul" (to catch "granule" and "granularity") and found nothing more concrete than "implementation-dependent". See also "Appendix D. Synchronization Programming Examples" (starting on page 619).

FWIW, some research seems to show that reservation granule and cache-line size match. I can confirm that the granule is the d-cache line size on two POWER7 systems:

[phargrov@gcc1-power7 ~]$ od -tu4 /proc/device-tree/cpus/PowerPC,POWER7@0/{d-cache-line-size,reservation-granule-size}
{hargrove@vestalac2 ~}$ od -tu4 /proc/device-tree/cpus/PowerPC,POWER7@0/{d-cache-line-size,reservation-granule-size}

A little-endian Power8 also gives a granule size and d-cache block size of 128. On an old G4 laptop I get no useful info (only 3 bytes!).

The following are from Google searches for "device-tree": a portion of a device-tree found on the web shows 32-byte granularity, which matches the reported d-cache-block-size. A G5 reports the (expected for ppc64) value of 128 for both d-cache-block-size and reservation-granule-size. A G3 and a G4 both report 32 for both values, which is what the Linux kernel source leads us to expect for the L1 cache line length on most PPC32 systems. |
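A minimal sketch of the padding/guarding idea described above, assuming a worst-case 128-byte reservation granule; the type and macro names here are hypothetical, not Open MPI code:

#include <stdint.h>

/* Worst-case reservation granule reported for PPC64 in the Linux kernel
 * source; the smaller 32-byte granules seen on PPC32 are also covered. */
#define RESERVATION_GRANULE 128

/* Keep the word targeted by lwarx/ldarx alone in its granule so that no
 * other store issued between the LL and the SC can silently cancel the
 * reservation. */
typedef struct guarded_head {
    volatile intptr_t item;                               /* LL/SC target */
    uint8_t pad[RESERVATION_GRANULE - sizeof (intptr_t)]; /* guard space  */
} __attribute__ ((__aligned__ (RESERVATION_GRANULE))) guarded_head_t;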
|
Both the fifo and lifo implementations intentionally have only a small number of loads/stores between the LL and the SC. The memory locations being read/written are on the stack and in the opal_list_item being pushed/popped. It is possible that we could get really unlucky and have either the stack addresses or the list item mapped to the same cache line. I do not know if this is enough to cause the reservation to be lost on the fifo/lifo head pointer. Certainly implementing the entire head update in assembly would eliminate this possibility, but I would like to hear from a power/ppc expert. |
|
@PHHargrove I will move the missing memory barrier to another PR so that it gets in ASAP. |
|
Hey Nathan, I haven't had time to review this PR yet, but I definitely echo Paul's concerns here. I'm about to head out for a short vacation, but I can probably review it on Wednesday if you'd like another set of eyeballs on it. |
|
@goodell Yes, please take a look. I plan to do a little more research on the livelock issue. I am not convinced it will happen but I am willing to rewrite the code to mitigate the potential issue. |
opal/class/opal_lifo.h
Outdated
I think it's pretty unconventional to have a barrier in the middle of a lwarx/stwcx. pair. The "third party" memory store to item->opal_list_next (which might clear the "reservation") and the wmb (which could also potentially mess with the reservation) are disconcerting.
Do you have a trustworthy source example of this style of unconventional lwarx/stwcx. usage?
The barrier is meant to ensure the ordering between the store instruction for writing item->opal_list_next and the store conditional. I think there is an example of something like this in the PPC spec Paul pointed out.
I don't think the extra store will clear the reservation. Though I could reorder the loop to do a normal read/store to update opal_list_next. Then do a ldarx/stdcx. pair with a conditional between them that ensures the current head is the same as what was stored in the opal_list_next pointer.
Actually, I can't reorder it that way. It would become essentially CAS and would then have the ABA problem.
Thinking about this a little more: there is no ABA problem with the reordering since this is the lifo push; ABA problems come up in pop.
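For context, a minimal sketch of an LL/SC-based pop, assuming the opal_atomic_ll_ptr/opal_atomic_sc_ptr helpers added by this PR (the function name is hypothetical, and this is a sketch rather than the exact PR code). The point is that the store-conditional fails if any other thread touched the head after the load-linked, so a stale next pointer can never be stored:

static inline opal_list_item_t *sketch_lifo_pop_atomic (opal_lifo_t *lifo)
{
    opal_list_item_t *item, *next;

    do {
        /* load-linked establishes a reservation on the head */
        item = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
        if (&lifo->opal_lifo_ghost == item) {
            return NULL;
        }
        next = (opal_list_item_t *) item->opal_list_next;
        /* the SC fails if the head was modified since the LL, so there is
         * no window for the ABA problem a CAS-based pop would have */
    } while (!opal_atomic_sc_ptr (&lifo->opal_lifo_head.data.item, next));

    return item;
}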
|
@jsquyres Thanks. Given my understanding of how the reservations work on PPC/Power I think the code is ok but we really need experts to take a look. If I can get this in (with modifications if necessary) I think it will be a good performance boost for these platforms with MPI_THREAD_MULTIPLE. |
|
I'll ask Nysal to take a look |
|
Nysal's taking a look. Will take a few days at least for him to review and comment. |
|
I've only looked at the LIFO so far. Here is the disassembly for opal_lifo_push_atomic:
100017c0: a8 48 00 7d ldarx r8,0,r9
100017c4: 10 00 0a f9 std r8,16(r10)
100017c8: ac 04 20 7c lwsync
100017cc: ad 49 40 7d stdcx. r10,0,r9
100017d0: 00 00 00 39 li r8,0
100017d4: 08 00 c2 40 bne- 100017dc <thread_test+0x15c>
100017d8: 01 00 08 61 ori r8,r8,1
100017dc: 00 00 88 2f cmpwi cr7,r8,0
100017e0: e0 ff 9e 41 beq cr7,100017c0 <thread_test+0x140>

This might lead to a livelock, as Paul already mentioned. How about changing the code a little to move the store before the LL/SC pair, as mentioned in PowerPC Book 2 (Appendix B.3)? However, as you noted in one of the comments above, this will make the code similar to a compare-and-swap. As a next step I'll gather some performance data on a Power8 node; the performance observed on the VM might not be really representative of the coherency overheads on a real system. I'll probably have to modify the test a little to set the thread affinity, but that shouldn't be too hard. If there is a simple fix for the master branch, I'd prefer that it is applied first. That gives us more time to evaluate this change. |
|
@nysal There is a fix that will go into master. The performance improvement was something I have been thinking about for some time. Finally had a chance to implement it. |
|
See #770 for the fifo hang fix on its own. A memory barrier was missing. |
|
I modified the opal_lifo test case to set the thread affinity:
--- a/test/class/opal_lifo.c
+++ b/test/class/opal_lifo.c
@@ -20,12 +20,15 @@
#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
+#include <string.h>
#include <sys/time.h>
+#include <sched.h>
#define OPAL_LIFO_TEST_THREAD_COUNT 8
#define ITERATIONS 1000000
#define ITEM_COUNT 100
+#define NR_CPUS 256
#if !defined(timersub)
#define timersub(a, b, r) \
@@ -39,11 +42,38 @@
} while (0)
#endif
+static unsigned int cpu_affinities[NR_CPUS];
+static unsigned int next_aff = 0;
+static int use_affinity;
+pthread_mutex_t aff_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+static void set_affinity(void)
+{
+ cpu_set_t cpu_mask;
+ int cpu, ret;
+
+ if (!use_affinity)
+ return;
+
+ ret = pthread_mutex_lock(&aff_mutex);
+ cpu = cpu_affinities[next_aff++];
+ ret = pthread_mutex_unlock(&aff_mutex);
+ printf("Setting thread affinity to %d\n",cpu);
+ CPU_ZERO(&cpu_mask);
+ CPU_SET(cpu, &cpu_mask);
+ ret = sched_setaffinity(0, sizeof(cpu_mask), &cpu_mask);
+ if (ret) {
+ perror("Error in sched_setaffinity");
+ exit(-1);
+ }
+}
+
static void *thread_test (void *arg) {
opal_lifo_t *lifo = (opal_lifo_t *) arg;
opal_list_item_t *item;
struct timeval start, stop, total;
double timing;
+ set_affinity();
gettimeofday (&start, NULL);
for (int i = 0 ; i < ITERATIONS ; ++i) {
@@ -82,7 +112,8 @@ int main (int argc, char *argv[]) {
opal_lifo_t lifo;
bool success;
double timing;
- int rc;
+ int rc, ctr = 0;
+ char *aff, *tmp;
rc = opal_init_util (&argc, &argv);
test_verify_int(OPAL_SUCCESS, rc);
@@ -151,7 +182,18 @@ int main (int argc, char *argv[]) {
} else {
test_failure (" lifo push/pop single-threaded with atomics");
}
-
+ // Check if affinity was supplied
+ if(argc == 2) {
+ aff = strdup(argv[1]);
+ tmp = strtok(aff, ",");
+ while (tmp != NULL)
+ {
+ cpu_affinities[ctr++] = atoi(tmp);
+ tmp = strtok(NULL, ",");
+ }
+ free(aff);
+ use_affinity = 1;
+ }
gettimeofday (&start, NULL);
for (int i = 0 ; i < OPAL_LIFO_TEST_THREAD_COUNT ; ++i) {
pthread_create (threads + i, NULL, thread_test, &lifo);

I ran the opal_lifo test on a Power8 system with 20 cores. SMT8 is enabled, so there is a total of 160 hardware threads supported. Here is the performance data for the current master branch (includes the recent memory barrier fixes):

With your performance improvements:

With the following patch on top of your improvements:

--- a/opal/class/opal_lifo.h
+++ b/opal/class/opal_lifo.h
@@ -29,6 +29,7 @@
#include "opal/sys/atomic.h"
#include "opal/threads/mutex.h"
+#include <poll.h>
BEGIN_C_DECLS
@@ -125,8 +132,13 @@ static inline opal_list_item_t *opal_lifo_push_atomic (opal_lifo_t *lifo,
static inline opal_list_item_t *opal_lifo_pop_atomic (opal_lifo_t* lifo)
{
opal_list_item_t *item, *next;
-
+ int attempt = 0;
do {
+ if(++attempt > 5)
+ {
+ (void) poll(NULL, 0, 1); /* Wait for 1ms */
+ attempt = 0;
+ }
item = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
if (&lifo->opal_lifo_ghost == item) {
return NULL;

I see the following:

Thoughts? |
|
I saw similar performance improvements for the test when using nanosleep. I was using it every time a conflict was detected instead of every 5th time. In this case either a poll(0) or a nanosleep() has the benefit of changing the timing of the conflicting threads, thereby breaking the livelock. Can you run the test again with the following in place of poll and post the performance? I am curious to see if there is any difference. |
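A minimal sketch of the kind of nanosleep-based replacement being asked for here (the helper name is hypothetical; the interval values are the ones measured in the next comment):

#include <time.h>

/* hypothetical helper: sleep for the given number of nanoseconds when
 * contention is detected, in place of poll(NULL, 0, 1) */
static inline void backoff_nanosleep (long nsec)
{
    struct timespec interval = { .tv_sec = 0, .tv_nsec = nsec };
    nanosleep (&interval, NULL);
}

/* e.g. in opal_lifo_pop_atomic(), on every detected conflict:
 *     backoff_nanosleep (100);    // 10, 100, 1000, ... ns were measured */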
|
nanosleep(10 ns):
nanosleep(100 ns):
nanosleep(1000 ns):
nanosleep(10000 ns):
nanosleep(100000 ns):
nanosleep(1000000 ns):

So the performance numbers with a nanosleep of 1 millisecond are pretty close to what we see with poll. The 10-nanosecond sleep also gives us a huge boost compared to the original performance. |
|
OK, that's about what I expected. It might be interesting to see how they compare with more/less contention. I will incorporate your patch into the LL/SC implementations of the lifo and fifo. It should meet less resistance than my earlier nanosleep patch because it actually addresses a potential bug, whereas mine only improved the performance of a unit test. |
|
With higher contention, the results are even better. There seems to be an improvement in all cases, even with 2 threads. I reduced the verbosity of the test case, but there is no functional change.

A = Current master branch

20 THREADS
A:
B:
C:

4 THREADS
A:
B:
C:

2 THREADS
A:
B:
C:

1 THREAD
A:
B:
C: |
|
To avoid the store in between the LL/SC pair, how about something like this?
--- a/opal/class/opal_lifo.h
+++ b/opal/class/opal_lifo.h
@@ -112,8 +114,12 @@ static inline opal_list_item_t *opal_lifo_push_atomic (opal_lifo_t *lifo,
opal_list_item_t *next;
do {
- item->opal_list_next = next = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
- opal_atomic_wmb ();
+load_head:
+ item->opal_list_next = lifo->opal_lifo_head.data.item;
+ opal_atomic_wmb();
+ next = (opal_list_item_t *) opal_atomic_ll_ptr (&lifo->opal_lifo_head.data.item);
+ if(next != item->opal_list_next)
+ goto load_head;
} while (!opal_atomic_sc_ptr(&lifo->opal_lifo_head.data.item, item));
return next;

The disassembly:

The performance is not very different (marginally better than before). |
|
Essentially that is the same as using cswap. With a lifo there really isn't an ABA issue in push (only pop). You can try copying the cswap version of push over the LL/SC one and it should have nearly identical performance to the code you posted. That is probably the right answer, though: use LL/SC to defeat ABA in pop and use cswap in push. That both avoids the potential livelock and improves the performance on platforms that support the more powerful LL/SC instructions. |
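A sketch of the cswap-style push being suggested, assuming an opal_atomic_cmpset_ptr()-style compare-and-swap that returns non-zero on success (the function and helper names are assumptions, not the exact code that was merged):

static inline opal_list_item_t *sketch_lifo_push_atomic (opal_lifo_t *lifo,
                                                         opal_list_item_t *item)
{
    opal_list_item_t *next;

    do {
        next = (opal_list_item_t *) lifo->opal_lifo_head.data.item;
        item->opal_list_next = next;
        /* make the next pointer visible before the head can point at this
         * item */
        opal_atomic_wmb ();
        /* no ABA hazard on push: linking to whatever the head currently is
         * remains correct even if the head changed and changed back */
    } while (!opal_atomic_cmpset_ptr (&lifo->opal_lifo_head.data.item, next, item));

    return next;
}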
|
Yup. I will push the update on Monday. |
|
Should this be pushed back to v2.x? |
|
@hjelmn @jsquyres I'll take a look once the update lands. |
|
@nysal Should be good to go now. I prefer nanosleep over poll because it will shorten the delay when contention is hit within Open MPI. My lifo/fifo tests are much harder on the class than what we will see in practice within Open MPI. |
|
Using nanosleep is fine. There are some typos in the update. Do we want the nanosleep for fifo pop too? Please make the following change:
--- a/opal/class/opal_lifo.h
+++ b/opal/class/opal_lifo.h
@@ -25,6 +25,7 @@
#define OPAL_LIFO_H_HAS_BEEN_INCLUDED
#include "opal_config.h"
+#include <time.h>
#include "opal/class/opal_list.h"
#include "opal/sys/atomic.h"
@@ -190,8 +191,8 @@ static inline void _opal_lifo_release_cpu (void)
* is a performance improvement for the lifo test when this call is made on detection
* of contention but it may not translate into actually MPI or application performance
* improvements. */
- static struct timespect interval = { .tv_sec = 0, .tv_nsec = 100 };
- nanosleeep (&interval, NULL);
+ static struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 };
+ nanosleep (&interval, NULL);
}
/* Retrieve one element from the LIFO. If we reach the ghost element then the LIFO |
|
Bah. Thanks for spotting the typos. |
This commit adds implementations of opal_atomic_ll_32/64 and opal_atomic_sc_32/64. These atomics can be used to implement more efficient lifo/fifo operations on supported platforms. The only supported platform with this commit is powerpc/power. This commit also adds an implementation of opal_atomic_swap_32/64 for powerpc. Tested with Power8. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
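A sketch of how such ldarx/stdcx.-based wrappers are typically written with GCC inline assembly; it follows the instruction sequence visible in the disassembly earlier in this thread, but the names are hypothetical and this is not necessarily the exact committed code:

#include <stdint.h>

/* sketch of a 64-bit load-linked: load the value and establish a
 * reservation on addr */
static inline int64_t sketch_atomic_ll_64 (volatile int64_t *addr)
{
    int64_t ret;

    __asm__ __volatile__ ("ldarx   %0, 0, %1"
                          : "=&r" (ret)
                          : "r" (addr)
                          : "memory");
    return ret;
}

/* sketch of a 64-bit store-conditional: the store succeeds only if the
 * reservation from the matching ldarx is still held; returns 1 on success,
 * 0 on failure */
static inline int sketch_atomic_sc_64 (volatile int64_t *addr, int64_t value)
{
    int32_t ret;

    __asm__ __volatile__ ("stdcx.  %2, 0, %1 \n\t"
                          "li      %0, 0     \n\t"
                          "bne-    1f        \n\t"
                          "ori     %0, %0, 1 \n\t"
                          "1:"
                          : "=&r" (ret)
                          : "r" (addr), "r" (value)
                          : "cc", "memory");
    return ret;
}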
These instructions allow a more efficient implementation of the opal_fifo_pop_atomic function. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds implementations for opal_atomic_lifo_pop and opal_atomic_lifo_push that make use of the load-linked and store-conditional instructions. These instructions allow for a more efficient implementation on supported platforms. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
PowerPC/Power lifo/fifo improvements
|
@nysal Adding the delay to pop would probably have some push-back. I may re-introduce that idea in another PR. |