PMIx 1.2 dstore PMIx_Get Performance issue #207
@jjhursey we just published the tool that we used to measure the performance. Can you give it a try as well?
Also, what happens if you run 1.2 with dstore disabled? Does it show results comparable to 1.1.5/1.1.2?
I'm setting up the testbed to try to reproduce this.
I'm setting up your performance tool as well, but I probably won't get to run that data until Monday. This was on a ppc64le system with 20 cores x 8 hwthreads on each node (18 nodes total, sorry for the odd node scaling). The only thing that changed between the builds was the PMIx library that was linked in.
But have you tried v1.2 without dstore? Just to double-check that this is not something else that has changed there?
Also, we haven't tested on PPC before, and that may also be the reason for this unexpected behavior.
I've just remeasured the dstore performance on the 32-node, 20-core cluster using the tool that we are publishing now with #206, and here is the comparison (click on the image to scale). This data is perfectly consistent with what I saw several months ago and mentioned on one of the calls. It's still not clear to me why local Gets take longer, but since the number of local peers usually scales more slowly than the overall rank count, I've pushed that to my backlog. And here is a comparison of the total time to fetch the keys: you can see the ~30% improvement that I mentioned on the call. I'm going to fix my tool to make it more convenient - right now all procs just spit out their timings and you have to do the post-processing yourself. I'll update the PR. Since you can't share your tool, here are the next steps that I suggest trying:
Also, I just realized that you mentioned several commits. I don't know exactly how you are measuring, so it's hard to predict what went wrong, but here are a few things to keep in mind:
P.S. dstore may be improved to address the mentioned issues, but we need to decide whether this use case is worth addressing.
Today I tried on 2 of our 160-thread PPC machines, and I see that the poor local Get performance observed on Intel gets much, much worse on PPC, so I guess that this is what is causing your bad results. I was testing just 2 nodes, so the ratio of local Gets was really high, which has a serious influence.
Ok, I found the reason for the local-Get performance degradation and fixed it in #210 and #211. Now both local and remote access times are comparable (almost equal). It seems this bug was hiding the real performance of the dstore from me when I was originally measuring. Currently dstore outperforms the message-based approach for smaller core counts (less than 160 per node), but unfortunately it is slower on fully-packed nodes. I'm working on this now and will update later.
Might I suggest adding a README file to the performance tool directory that explains how to build and run the test? I'd be happy to give it a shot on my box, but I'm not sure I understand how you are building/executing the tests.
I will provide the README ASAP.
I fixed the problem with local keys in #213 and #212.
1.2.0 with dstore:
When I enable 4 threads per core, the results are comparable:
1.2.0 with dstore:
If all 8 threads are used, messaging looks better:
1.2.0 with dstore:
It seems to me that the reason is concurrency, and not the algorithms used in the dstore to fetch the data. If I run 80, 160, 320, and 640 procs on our Intel cluster with 1 thread per core, there is no difference in Get time between the two runs, and the time is ~66 us, while on PPC it reaches 1300 ms for the same number of procs - so this is not a scaling issue with regard to the access algorithm.
This could also be a NUMA effect. I'll check it and get back to you.
I ran some perf numbers yesterday on a single machine with a version of PR #217
Performance is good enough for v1.2. Additional improvements will likely go into v1.2.1.
Perhaps the problem here is that we are locking at too coarse a granularity. There is no reason to lock the entire dstore whenever we update or add one data element. If we lock only at the level of a given proc's data, or even better at the individual data element, I suspect the penalty would go away. After all, there is no need for a lock on data that has been completely output.
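To make that concrete, here is a minimal sketch of per-proc locking instead of one whole-store lock (the table layout and names are assumptions for illustration, not the actual dstore code):

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_PROCS 1024

/* One entry per rank, each with its own lock, instead of a single lock
 * serializing every access to the whole dstore. */
typedef struct {
    pthread_rwlock_t lock;   /* guards only this rank's blob */
    void  *blob;
    size_t size;
} proc_entry_t;

static proc_entry_t table[MAX_PROCS];

static void table_init(void)
{
    for (int i = 0; i < MAX_PROCS; i++)
        pthread_rwlock_init(&table[i].lock, NULL);
}

/* The server updates a single rank's entry under that rank's write lock... */
static void store_blob(int rank, void *blob, size_t size)
{
    pthread_rwlock_wrlock(&table[rank].lock);
    table[rank].blob = blob;
    table[rank].size = size;
    pthread_rwlock_unlock(&table[rank].lock);
}

/* ...so readers fetching other ranks' data never contend on a global lock. */
static int fetch_blob(int rank, void **blob, size_t *size)
{
    pthread_rwlock_rdlock(&table[rank].lock);
    *blob = table[rank].blob;
    *size = table[rank].size;
    pthread_rwlock_unlock(&table[rank].lock);
    return 0;
}
```

In the real dstore the table lives in shared memory, so the locks would have to be process-shared (or file/robust locks), but the granularity argument is the same.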
Just going through this code, there are some obvious inefficiencies (or maybe I'm just not understanding the intent). We seem to be packing a buffer when storing info into dstore, which makes no sense since we already had the buffer and could just save it, if that's the intent. When it comes to job-level data, there is no reason to separate it out by rank. We are also using a value array (instead of a pointer array) to hold the buffer, which may not be the best choice. I think the attempt to unify some of this code led to some unfortunate loss of optimization for both the messaging and dstore channels. A lot of this code was also inlined, which is probably not ideal as it is fairly long and has lots of mallocs going on. One option is to commit to dstore and streamline the code, eliminating unnecessary pack/unpack operations. Another would be to separate the code paths to accomplish the same goal. In going through the handshake, I think we can safely eliminate the locks altogether by simply "tagging" the client data objects with a flag to indicate that the data is readable, and having the server send a message when that is the case. If a client wants to access some rank's data and doesn't see the data-readable flag set to true, then it sends a message to the server asking for it. In the unusual case where data is modified, the server can send a "non-readable" message to its clients, wait to receive an "ack" from them, update the data, and then send a "readable" message. I know this is lower performance than locks, but it is the highly unusual case and should therefore be treated as an exception while we streamline the primary code path.
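A rough sketch of that flag-based handshake (the structure and helper names are hypothetical, not the actual PMIx dstore code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-rank data object living in shared memory. The server sets `readable`
 * (with release semantics) once the rank's data is fully written. */
typedef struct {
    atomic_bool readable;
    void  *blob;
    size_t size;
} rank_data_t;

/* Hypothetical slow path: ask the server for the data over the usual
 * messaging channel. Stubbed out here. */
static int request_from_server(int rank, void **blob, size_t *size)
{
    (void)rank; (void)blob; (void)size;
    return -1;   /* a real client would send a request message and wait */
}

/* Client fast path: if the flag is set, read shared memory directly with
 * no lock at all; otherwise fall back to a message to the server. */
static int get_rank_data(rank_data_t *rd, int rank, void **blob, size_t *size)
{
    if (atomic_load_explicit(&rd->readable, memory_order_acquire)) {
        *blob = rd->blob;
        *size = rd->size;
        return 0;
    }
    return request_from_server(rank, blob, size);
}
```

Modifications would then use the reverse handshake described above: clear the flag, notify the clients, wait for their acks, update the data, and set the flag again.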
I'd like to get the My assumption is that writing (via The other side of the overhead is interesting too (event lib overhead). I think what is happening there is that we have 1 process per hwthread - each process with 2 threads (main thread, pmix thread). So they are just fighting for the resources too aggressively. I tried playing with the
@rhc54 I think that the main problem here is thread shifting: ~100ms for _esh_get vs ~500 overall.
👍 for @jjhursey's suggestion about fast search. Add some internal locking to protect it, but do it.
I think that thread-shifting will be the issue with any event lib at max PPN where all hw-threads are busy. We can certainly try though.
Locking overhead is a known thing; this column of my benchmark was showing that:
This is the number of locks/s.
I slightly changed my locking perf benchmark (artpol84/poc@c0ed956), making each reader do as many locks as possible:
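For reference, this is the general shape of such a lock-throughput micro-benchmark (an illustrative sketch, not the actual artpol84/poc code): N reader threads acquire and release a shared read lock as fast as they can for a fixed interval, and the reported number is locks/s.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_bool stop = false;
static _Atomic long total_locks = 0;

/* Each reader spins, taking and dropping the shared read lock as fast as it can. */
static void *reader(void *arg)
{
    (void)arg;
    long n = 0;
    while (!atomic_load(&stop)) {
        pthread_rwlock_rdlock(&lock);   /* the operation being measured */
        pthread_rwlock_unlock(&lock);
        n++;
    }
    atomic_fetch_add(&total_locks, n);
    return NULL;
}

int main(void)
{
    enum { NREADERS = 8, SECONDS = 5 };
    pthread_t tid[NREADERS];

    for (int i = 0; i < NREADERS; i++)
        pthread_create(&tid[i], NULL, reader, NULL);
    sleep(SECONDS);
    atomic_store(&stop, true);
    for (int i = 0; i < NREADERS; i++)
        pthread_join(tid[i], NULL);

    printf("%.0f locks/s\n", (double)total_locks / SECONDS);
    return 0;
}
```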
After talking with @hjelmn, I'm going to bring over the OPAL atomics as they are a great deal faster. I'm still working on the framework for this, but it may take me a little longer than hoped as we have to account for backward compatibility.
Here is the result of the same run, but with each reader taking its own lock:
This is
Also, I think that atomics will be good for the independent lock implementation as well, so I'm looking forward to seeing them in PMIx.
As the baseline, here is the same run with no locks:
Just a side note: we have noticed a significant improvement in built-in atomics performance with gcc 6.2. In OMPI, at least, this puts us in sync with our own atomic support.
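For context, the "built-in atomics" here are the compiler-provided C11/__atomic operations; a trivial illustration of the mechanism difference versus a lock (example code, not taken from OPAL or PMIx):

```c
#include <stdatomic.h>
#include <pthread.h>

static _Atomic long counter_atomic = 0;
static long counter_locked = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

/* Compiler built-in: typically a single lock-prefixed add on x86 or an
 * LL/SC loop on POWER, with no kernel or mutex bookkeeping involved. */
static inline void bump_atomic(void)
{
    atomic_fetch_add_explicit(&counter_atomic, 1, memory_order_relaxed);
}

/* Mutex path: a full lock/unlock round trip for the same one-word update. */
static inline void bump_locked(void)
{
    pthread_mutex_lock(&mtx);
    counter_locked++;
    pthread_mutex_unlock(&mtx);
}
```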
@jjhursey I have some analysis related to your experiments. The main difference between
This means that you were experimenting with direct modex mode, so at least #316 was taking place, doubling the dstore overhead. I will push the fix forward. Also, I believe that for PAMI you need the full-modex mode in OMPI, because you pre-connect all the processes in MPI_Init and so can't benefit from direct modex. Here are the results of your benchmark for dstore/msg with full/direct modex; dstore shows the best result (for the full modex):
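For reference, a full modex is requested from the client side by attaching the PMIX_COLLECT_DATA directive to the fence, as in this sketch (the nspace handling is abbreviated; in real code it comes from PMIx_Init, and this mirrors the patch shown below):

```c
#include <pmix.h>
#include <stdbool.h>
#include <string.h>

/* Fence across the whole namespace and ask the server to collect and
 * deliver everyone's posted data up front (full modex), instead of
 * fetching keys on demand later (direct modex). */
static pmix_status_t full_modex_fence(const char *nspace)
{
    pmix_proc_t proc;
    pmix_info_t *info;
    bool collect = true;
    pmix_status_t rc;

    PMIX_PROC_CONSTRUCT(&proc);
    (void)strncpy(proc.nspace, nspace, PMIX_MAX_NSLEN);
    proc.rank = PMIX_RANK_WILDCARD;

    PMIX_INFO_CREATE(info, 1);
    (void)strncpy(info[0].key, PMIX_COLLECT_DATA, PMIX_MAX_KEYLEN);
    info[0].value.type = PMIX_BOOL;
    info[0].value.data.flag = collect;

    rc = PMIx_Fence(&proc, 1, info, 1);
    PMIX_INFO_FREE(info, 1);
    return rc;
}
```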
I was experimenting with pmix/master and ompi/master, and applied the following patch to your original benchmark to enable the full modex:

$ diff -Naur pmix_get_test.c_old pmix_get_test.c
--- pmix_get_test.c_old 2017-02-25 01:09:25.610528000 +0200
+++ pmix_get_test.c 2017-02-25 01:24:15.000000000 +0200
@@ -66,7 +66,7 @@
acc_commit = total_commit = 0;
ts_top = ts_start = GET_TS;
- ret = PMIx_Init(&my_proc);
+ ret = PMIx_Init(&my_proc, NULL, 0);
ts_end = GET_TS;
ts_init = ts_end - ts_start;
if( PMIX_SUCCESS != ret ) {
@@ -75,29 +75,29 @@
}
strncpy(proc.nspace, my_proc.nspace, PMIX_MAX_NSLEN);
- proc.rank = 0;
/*
* Emulate some gets that Open MPI often incurs at this point
*/
+ proc.rank = PMIX_RANK_WILDCARD;
get(my_proc, PMIX_LOCAL_RANK);
get(my_proc, PMIX_NODE_RANK);
- get(my_proc, PMIX_UNIV_SIZE);
- get(my_proc, PMIX_JOB_SIZE);
- get(my_proc, PMIX_APPNUM);
- get(my_proc, PMIX_LOCAL_SIZE);
- get(my_proc, PMIX_LOCAL_TOPO);
- get(my_proc, PMIX_LOCAL_PEERS);
- get(my_proc, PMIX_LOCAL_CPUSETS);
+ get(proc, PMIX_UNIV_SIZE);
+ get(proc, PMIX_JOB_SIZE);
+// get(my_proc, PMIX_APPNUM);
+ get(proc, PMIX_LOCAL_SIZE);
+ get(proc, PMIX_LOCAL_TOPO);
+ get(proc, PMIX_LOCAL_PEERS);
+// get(my_proc, PMIX_LOCAL_CPUSETS);
/*
* APP emulation
*/
start_timing = true;
PRINT_STATUS("APP Get...");
- get(my_proc, PMIX_JOB_SIZE);
- get(my_proc, PMIX_LOCAL_SIZE);
- get(my_proc, PMIX_LOCAL_PEERS);
+ get(proc, PMIX_JOB_SIZE);
+ get(proc, PMIX_LOCAL_SIZE);
+ get(proc, PMIX_LOCAL_PEERS);
// buildMapCache()
// foreach proc:
@@ -106,18 +106,22 @@
proc.rank = i;
get(proc, PMIX_HOSTNAME);
}
-
//
// PMIX_Fence()
//
+ bool value1 = true;
+ PMIX_INFO_CREATE(info, 1);
+ (void)strncpy(info->key, PMIX_COLLECT_DATA, PMIX_MAX_KEYLEN);
+ pmix_value_load(&info->value, &value1, PMIX_BOOL);
+ ninfo = 1;
+/*
PRINT_STATUS("APP Fence...");
- proc.rank = PMIX_RANK_WILDCARD;
ts_start = GET_TS;
PMIx_Fence(&proc, 1, info, ninfo);
ts_end = GET_TS;
total_fence++;
acc_fence += (ts_end - ts_start);
-
+*/
//
// Put aaa / bbb
//
@@ -196,7 +200,7 @@
* Cleanup
*/
ts_start = GET_TS;
- ret = PMIx_Finalize();
+ ret = PMIx_Finalize(NULL, 0);
ts_end = GET_TS;
ts_finalize = ts_end - ts_start;
if( PMIX_SUCCESS != ret ) {
@@ -259,7 +263,7 @@
acc_get += (ts_end - ts_start);
}
- if( 0 == strcmp(PMIX_UNIV_SIZE, key) ) {
+ if( 0 == strcmp(PMIX_JOB_SIZE, key) ) {
univ_size = value->data.uint32;
}
@jjhursey @gpaulsen
However, I guess that Spectrum MPI may have different defaults. Can you comment?
@artpol84 Thanks for the notes.
Do you remember why you were using an empty fence in your reproducer?
Followup:
All numbers below the
I didn't get the last comment.
If this was with respect to the empty fence, then I meant that it seems you were trying to keep your reproducer close to the SMPI pattern, so the fact that you used Fence without data collection may mean that there was no such flag in SMPI, or otherwise it may be just a typo.
I re-ran the performance measurement, changing the one Fence call as discussed. Here are the updated numbers from the application.

32 nodes x 20 ppn = 640 processes
Direct modex (original results)
Full modex

32 nodes x 80 ppn = 2560 processes
Direct modex
Full modex

32 nodes x 160 ppn = 5120 processes
Direct modex (original results)
Full modex

Summary
I'm going to close this again. The performance issue was with the original application, and that has been resolved. I think there are still places for improvement, but we can open separate tickets for those.
I just got back some performance results comparing PMIx 1.1.2, 1.1.5, and 1.2.0rc1 (with dstore enabled). I see a performance regression when performing large numbers of PMIx_Get calls at high PPN on a few nodes.

For the particular test code (which I cannot share, sorry), it makes a number of PMIx_Get calls combined with other PMIx_ calls (e.g., put, fence, commit - nothing too fancy). In the output below you can see the total time spent in PMIx_Get, then normalized per call. The diff is in comparison to 1.1.2. You can see an almost 3x performance loss from using dstore, which is completely unexpected. I want to do some more fine-tuned experimentation early next week, but maybe others can take a look before then.

I configured PMIx with
./configure --with-pic --with-platform=optimized --with-libevent=... --with-hwloc=... --with-devel-headers
(... checking if want shared memory datastore... yes ...)
Then used OMPI v2.0 with the PMIx external component, configured with the optimized platform file.
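Since the actual test code can't be shared, here is a hypothetical sketch of the measurement pattern described (per-call timing of PMIx_Get against every rank's posted data; the helper names and the choice of key are illustrative assumptions):

```c
#include <pmix.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Fetch one key from every rank in the job and accumulate the total and
 * per-call PMIx_Get time, similar to the numbers quoted in this issue. */
static void time_gets(const char *nspace, uint32_t job_size, const char *key)
{
    pmix_proc_t proc;
    double total = 0.0;
    uint32_t ngets = 0;

    PMIX_PROC_CONSTRUCT(&proc);
    (void)strncpy(proc.nspace, nspace, PMIX_MAX_NSLEN);

    for (uint32_t r = 0; r < job_size; r++) {
        pmix_value_t *val = NULL;
        proc.rank = r;

        double t0 = now_us();
        pmix_status_t rc = PMIx_Get(&proc, key, NULL, 0, &val);
        total += now_us() - t0;

        if (PMIX_SUCCESS == rc) {
            PMIX_VALUE_RELEASE(val);
            ngets++;
        }
    }
    printf("PMIx_Get total %.1f us, %.2f us/call over %u calls\n",
           total, ngets ? total / ngets : 0.0, ngets);
}
```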