PMIx 1.2 dstore PMIx_Get Performance issue #207

Closed · jjhursey opened this issue Nov 3, 2016 · 44 comments

jjhursey commented Nov 3, 2016

I just got back some performance results comparing PMIx 1.1.2, 1.1.5, and 1.2.0rc1 (with dstore enabled). I see a performance regression when performing large numbers of PMIx_Get calls at high PPN on a few nodes.

The particular test code (which I cannot share, sorry) makes a number of PMIx_Get calls combined with other PMIx calls (e.g., put, fence, commit -- nothing too fancy). In the output below you can see the total time spent in PMIx_Get, then normalized per call. The diff columns are relative to 1.1.2.

-------------------------------------------------------------------------------------
Nodes    NP       1.1.2  calls    per call       1.1.5  calls    per call    diff    1.2.0rc1  calls    per call    diff 
-------------------------------------------------------------------------------------
    1    80    0.158119    403    0.000392    0.156834    403    0.000389   0.99x    0.505656    403    0.001255   3.20x 
    2   160    0.399714    803    0.000498    0.404362    803    0.000504   1.01x    1.314171    803    0.001637   3.29x 
    3   240    0.656375   1203    0.000546    0.667607   1203    0.000555   1.02x    2.067067   1203    0.001718   3.15x 
    4   320    0.926069   1603    0.000578    0.952027   1603    0.000594   1.03x    2.843061   1603    0.001774   3.07x 
    5   400    1.222624   2003    0.000610    1.233513   2003    0.000616   1.01x    3.529636   2003    0.001762   2.89x 
    6   480    1.538374   2403    0.000640    1.560746   2403    0.000649   1.01x    4.208262   2403    0.001751   2.74x 
    8   640    2.095369   3203    0.000654    2.084200   3203    0.000651   0.99x    5.623388   3203    0.001756   2.68x 
   12   960    2.934969   4803    0.000611    2.995031   4803    0.000624   1.02x    8.383371   4803    0.001745   2.86x 
   16  1280    3.925197   6403    0.000613    3.992772   6403    0.000624   1.02x   11.404139   6403    0.001781   2.91x 
   18  1440    4.416280   7203    0.000613    4.494544   7203    0.000624   1.02x   12.785547   7203    0.001775   2.90x 

You can see an almost 3x performance loss when using dstore, which is completely unexpected. I want to do some more fine-tuned experimentation early next week, but maybe others can take a look before then.

I configured PMIx with

./configure --with-pic --with-platform=optimized --with-libevent=... --with-hwloc=... --with-devel-headers
...
checking if want shared memory datastore... yes
...

Then I used OMPI v2.0 with the external PMIx component, configured with the optimized platform file.
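
For anyone trying to reproduce the shape of the workload, the timed region is essentially the standard put/commit/fence/get cycle of the PMIx 1.x client API. This is only a minimal sketch of that pattern (not the actual test code, which I can't share; the key name is made up and error handling is trimmed):

#include <stdio.h>
#include <string.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, proc;
    pmix_value_t val, *ret_val;

    /* PMIx 1.x client init (the 2.x API adds an info array) */
    if (PMIX_SUCCESS != PMIx_Init(&myproc)) {
        fprintf(stderr, "PMIx_Init failed\n");
        return 1;
    }

    /* Publish a key for this rank (the real test publishes several) */
    val.type = PMIX_UINT32;
    val.data.uint32 = myproc.rank;
    PMIx_Put(PMIX_GLOBAL, "test.key", &val);
    PMIx_Commit();

    /* Collective exchange across the namespace */
    PMIX_PROC_CONSTRUCT(&proc);
    (void)strncpy(proc.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    proc.rank = PMIX_RANK_WILDCARD;
    PMIx_Fence(&proc, 1, NULL, 0);

    /* The timed part: the real test issues many PMIx_Get calls for
     * other ranks' keys; here we just fetch rank 0's key once */
    proc.rank = 0;
    if (PMIX_SUCCESS == PMIx_Get(&proc, "test.key", NULL, 0, &ret_val)) {
        PMIX_VALUE_RELEASE(ret_val);
    }

    PMIx_Finalize();
    return 0;
}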

artpol84 commented Nov 3, 2016

@jjhursey we just published the tool that we used to measure the performance. Can you give it a try as well?
What is the platform? PPC I guess?

artpol84 commented Nov 3, 2016

Also, what happens if you run 1.2 with dstore disabled? Does it show results comparable to 1.1.5/1.1.2?

artpol84 commented Nov 3, 2016

I'm setting up the testbed to try to reproduce this.

jjhursey commented Nov 3, 2016

I'm setting up your performance tool as well, but I probably won't get to run it until Monday.

This was on a ppc64le system with 20 cores x 8 hwthreads on each node (18 nodes total, sorry for the odd node scaling). The only thing that changed between the builds was the PMIx library that was linked in.

artpol84 commented Nov 3, 2016

But have you tried v1.2 without dstore? Just to double-check that this is not something else that has changed there?

artpol84 commented Nov 3, 2016

Also, we haven't tested on PPC before, and that may also be part of the reason for this unexpected behavior.

artpol84 commented Nov 4, 2016

I've just remeasured the dstore performance on the 32-node, 20-core cluster using the tool that we are publishing with #206, and here is the comparison:

[image]

This data is perfectly consistent with what I saw several months ago and referred to on one of the calls. It is still not clear to me why local Gets take longer, but since the number of local peers usually scales more slowly than the overall rank count, I had pushed that to my backlog.

And here is a comparison of the total time to fetch the keys:
[image]

You can see the ~30% improvement that I mentioned on the call.
Note that I wasn't using hardware threads, and it's not clear to me how you were using them. Was it 1 process per hw-thread?

I'm going to fix my tool to make it more convenient - right now all procs just dump their timings and you have to do the post-processing yourself. I'll update the PR.

Since you can't share your tool right now, here are the next steps that I suggest:

  1. I will try to test on the PPC cluster.
  2. You will try the updated perf tool, so we are on the same page with regard to how and what we are measuring.

artpol84 commented Nov 4, 2016

Also, I just realized that you mentioned several commits. I don't know exactly how you are measuring, so it's hard to predict what went wrong, but here are a few things to keep in mind:

  • Currently dstore doesn't garbage-collect, so duplicating keys, for example, will lead to significant fragmentation of the shared memory and may noticeably affect performance. Per the original design assumptions, PMIx is not supposed to be used in a mode where multiple commit/fence cycles are involved - or at least it is not expected to be performant in that case (see the sketch after this list).
  • Also, dstore is supposed to work well for a reasonable number of keys per process (which is the expected real use case). Pushing too many keys for the same rank can also decrease performance, and in conjunction with multiple commits the slowdown can be serious.
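
To make the first point concrete, this is roughly the kind of pattern that is expected to hurt with the current dstore (a hedged illustration, not code from the actual application; the key name is invented). Every re-publish of the same key leaves the previous copy behind in the rank's shared-memory segment:

#include <stdint.h>
#include <string.h>
#include <pmix.h>

/* Re-publishing the same key across multiple commit/fence cycles:
 * without garbage collection each iteration adds another copy of
 * "status.counter" to shared memory, fragmenting the segment. */
static void republish_loop(const pmix_proc_t *myproc, int iters)
{
    pmix_proc_t wild;
    pmix_value_t val;
    int i;

    PMIX_PROC_CONSTRUCT(&wild);
    (void)strncpy(wild.nspace, myproc->nspace, PMIX_MAX_NSLEN);
    wild.rank = PMIX_RANK_WILDCARD;

    for (i = 0; i < iters; i++) {
        val.type = PMIX_UINT32;
        val.data.uint32 = (uint32_t)i;
        PMIx_Put(PMIX_GLOBAL, "status.counter", &val);  /* same key every time */
        PMIx_Commit();
        PMIx_Fence(&wild, 1, NULL, 0);
    }
}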

artpol84 commented Nov 4, 2016

P.S. dstore could be improved to address the issues mentioned above, but we need to decide whether this use case is worth addressing.

artpol84 commented Nov 4, 2016

Today I tried on 2 of our 160-thread PPC machines, and the poor local Get performance observed on Intel gets much, much worse on PPC, so I guess that this is what is causing your bad results. I was testing just 2 nodes, so the ratio of local Gets was really high, which magnifies the effect.
I guess it's time to find out why this happens. I will check and get back to you.

artpol84 commented Nov 5, 2016

OK, I found the reason for the local-Get performance degradation and fixed it in #210 and #211. Now local and remote access times are comparable (almost equal).

It seems this bug was hiding the real performance of the dstore from me when I originally measured. Currently dstore outperforms the message-based approach for smaller core counts (less than 160 per node), but unfortunately it is slower on fully packed nodes.

I'm working on this now and will update later.

rhc54 commented Nov 5, 2016

Might I suggest adding a README file to the performance tool directory that explains how to build and run the test? I'd be happy to give it a shot on my box, but I'm not sure I understand how you are building/executing the tests.

artpol84 commented Nov 5, 2016

I will provide the README ASAP.

artpol84 commented Nov 5, 2016

I fixed the problem with local keys in #213 and #212.
With this fix I'm getting a 70% improvement. Basically, since the number of local ranks does not increase, the data published earlier is still relevant, but the local-access anomaly is gone.
For PPC the issue remains. For my configuration I have 2 nodes, each with two 10-core POWER8 processors, 8 threads per core.
If I use 80 procs with per-node mapping and thread binding (2 threads per core), dstore is 50% better:
1.2.0 with messages:

get: total 0.104085 avg loc 0.000129 rem 0.000131 all 0.000130 ; put: loc 0.000069 rem 0.000069 ; commit: 0.000072 ; fence 0.042906
get:           min loc 0.000127 rem 0.000130 (loc: 52, rem: 2)
get:           max loc 0.000133 rem 0.000136 (loc: 15, rem: 15)

1.2.0 with dstore:

get: total 0.057997 avg loc 0.000072 rem 0.000073 all 0.000072 ; put: loc 0.000069 rem 0.000069 ; commit: 0.000072 ; fence 0.049974
get:           min loc 0.000072 rem 0.000072 (loc: 2, rem: 2)
get:           max loc 0.000073 rem 0.000075 (loc: 0, rem: 0)

When I enable 4 threads per core, the results are comparable:
1.2.0 with messages:

get: total 0.319693 avg loc 0.000197 rem 0.000202 all 0.000200 ; put: loc 0.000069 rem 0.000069 ; commit: 0.000073 ; fence 0.105437
get:           min loc 0.000197 rem 0.000201 (loc: 3, rem: 3)
get:           max loc 0.000200 rem 0.000205 (loc: 2, rem: 2)

1.2.0 with dstore:

get: total 0.320799 avg loc 0.000191 rem 0.000210 all 0.000200 ; put: loc 0.000069 rem 0.000069 ; commit: 0.000073 ; fence 0.119404
get:           min loc 0.000161 rem 0.000162 (loc: 48, rem: 16)
get:           max loc 0.000247 rem 0.000322 (loc: 14, rem: 11)

If all 8 threads are used, messaging looks better:
1.2.0 with messages:

get: total 2.215393 avg loc 0.000681 rem 0.000703 all 0.000692 ; put: loc 0.000069 rem 0.000069 ; commit: 0.000071 ; fence 0.273074
get:           min loc 0.000456 rem 0.000697 (loc: 176, rem: 35)
get:           max loc 0.000686 rem 0.000942 (loc: 80, rem: 176)

1.2.0 with dstore:

get: total 3.543331 avg loc 0.001081 rem 0.001133 all 0.001107 ; put: loc 0.000069 rem 0.000069 ; commit: 0.000072 ; fence 0.331281
get:           min loc 0.000915 rem 0.000911 (loc: 16, rem: 32)
get:           max loc 0.001227 rem 0.001303 (loc: 13, rem: 147)

It seems to me that the reason is concurrency, not the algorithms the dstore uses to fetch the data. If I run 80, 160, 320, and 640 procs on our Intel cluster with 1 thread per core, there is no difference in Get time between the two runs and the time is ~66 us, while on PPC it reaches ~1300 us for the same number of procs - so this is not a scaling issue in the access algorithm.
Need to investigate this further.

artpol84 commented Nov 5, 2016

This could also be a NUMA effect. I'll check and get back to you.

artpol84 commented Nov 6, 2016

@rhc54 here you are - c926d85.
Let me know if you need more details (maybe just ping me on the hangout).

jjhursey (Member Author) commented:

I ran some perf numbers yesterday on a single machine with a version of PR #217

jjhursey (Member Author) commented:

Performance is good enough for v1.2. Additional improvements will likely go into v1.2.1.

@jjhursey jjhursey modified the milestones: v1.2.2, v1.2.1 Jan 24, 2017
@jjhursey jjhursey reopened this Feb 23, 2017
@jjhursey jjhursey modified the milestones: v1.2.2, v1.2.1 Feb 23, 2017

rhc54 commented Feb 23, 2017

Perhaps the problem here is that we are locking at too great a scale. There is no reason to lock the entire dstore whenever we update or add one data element. If we lock only at the level of a given proc's data, or even better at the individual data element, I suspect the penalty would go away? After all, there is no need for a lock on data that has been completely output.
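
To illustrate the granularity point, here is a rough sketch of what per-rank locking could look like (the types and function names are hypothetical, not the actual dstore code): a reader of rank A's data never contends with a writer updating rank B's data.

#include <pthread.h>
#include <string.h>

/* Hypothetical per-rank segment: one rwlock per rank's data instead
 * of a single lock over the whole store. */
typedef struct {
    pthread_rwlock_t lock;
    void *blob;            /* this rank's packed key/value data */
    size_t blob_size;
} rank_segment_t;

typedef struct {
    rank_segment_t *ranks;
    size_t nranks;
} dstore_table_t;

/* Reader path: lock only the segment of the rank being queried.
 * A real implementation would copy or refcount under the lock. */
static size_t fetch_blob(dstore_table_t *t, size_t rank, void *out, size_t max)
{
    size_t len;
    pthread_rwlock_rdlock(&t->ranks[rank].lock);
    len = t->ranks[rank].blob_size < max ? t->ranks[rank].blob_size : max;
    memcpy(out, t->ranks[rank].blob, len);
    pthread_rwlock_unlock(&t->ranks[rank].lock);
    return len;
}

/* Writer path: only readers of this one rank are blocked. */
static void store_blob(dstore_table_t *t, size_t rank, void *blob, size_t len)
{
    pthread_rwlock_wrlock(&t->ranks[rank].lock);
    t->ranks[rank].blob = blob;
    t->ranks[rank].blob_size = len;
    pthread_rwlock_unlock(&t->ranks[rank].lock);
}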

rhc54 commented Feb 23, 2017

Just going thru this code, there are some obvious inefficiencies (or maybe I'm just not understanding the intent). We seem to be packing a buffer when storing info into dstore, which makes no sense since we already had the buffer and could just save it, if that's the intent - when it comes to job-level data, there is no reason to separate it out by rank. We are also using a value array (instead of a pointer array) to hold the buffer, which may not be the best choice.

I think the attempt to unify some of this code led to some unfortunate loss of optimization for both the messaging and dstore channels. A lot of this code was also inlined, which is probably not ideal as it is fairly long and has lots of malloc's going on. One option is to commit to dstore and streamline the code, eliminating unnecessary pack/unpack operations. Another would be to separate the code paths to accomplish the same goal.

In going thru the handshake, I think we can safely eliminate the locks altogether by simply "tagging" the client data objects with a flag to indicate that the data is readable, and having the server send a message when that is the case. If a client wants to access some rank's data and doesn't see the data-readable flag as true, then it sends a message to the server asking for it.

In the unusual case where data is modified, the server can send a "non-readable" message to its clients, wait to recv an "ack" from them, update the data, and then send a "readable" message. I know this is lower performance than locks, but it is the highly unusual case and should therefore be treated as an exception while we streamline the primary code path.
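
A rough sketch of the shape of that handshake (the names and the request function are hypothetical, just to make the idea concrete; this is not proposed PMIx code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-rank data object visible to clients (e.g. in shared memory). */
typedef struct {
    atomic_bool readable;   /* set by the server once the data is complete */
    const void *blob;
    size_t blob_size;
} client_data_t;

/* Hypothetical request over the usual client/server channel. */
extern void request_data_from_server(size_t rank);

/* Client read path: no lock at all in the common case. */
static const void *get_rank_data(client_data_t *objs, size_t rank)
{
    if (atomic_load(&objs[rank].readable)) {
        return objs[rank].blob;        /* fast path: data is final */
    }
    request_data_from_server(rank);    /* slow path: ask the server */
    return NULL;                       /* caller waits for the reply */
}

/* Server update path (the unusual case): mark non-readable, wait for
 * client acks (not shown), update, then mark readable again. */
static void update_rank_data(client_data_t *obj, const void *blob, size_t len)
{
    atomic_store(&obj->readable, false);
    /* ... send "non-readable" to clients and wait for their acks ... */
    obj->blob = blob;
    obj->blob_size = len;
    atomic_store(&obj->readable, true);
    /* ... notify clients that the data is readable again ... */
}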

jjhursey (Member Author) commented:

I'd like to get the PMIx_Get path more streamlined for data that is already in the dstore - which is the common case for this benchmark. dstore grabs a read lock every time it searches for a key. I can devise a way to reduce the contention, but if we can eliminate the need for the read lock, that would be ideal.

My assumption is that writing (via put, fence, commit) is a relatively rare occurrence, but reading (get) is much more common. So some optimizations for that common case would help this app.

The other side of the overhead is interesting too (event lib overhead). I think what is happening there is that we have 1 process per hwthread - each process with 2 threads (main thread, pmix thread). So they are just fighting for the resources too aggressively. I tried playing with the usleep parameter in the PMIX_WAIT_FOR_COMPLETION and did not see any immediate benefit. But that train of thought got me thinking about trying to make PMIx_Get a bit more eager to avoid the event engine altogether.

artpol84 (Contributor) commented:

@rhc54 I think that the main problem here is thread shifting: ~100ms for _esh_get vs ~500 overall.

artpol84 commented Feb 23, 2017

👍 for @jjhursey's suggestion about fast search. Add some internal locking to protect it, but do it.

rhc54 commented Feb 23, 2017

@jjhursey I looked at your suggestion, but it may not work in the current architecture due to threading issues. Adding pthread_locks is a bad idea at this point. If libevent is the problem, then let's get @hjelmn's customized event library instead of revamping the architecture.

artpol84 commented Feb 23, 2017

I think that thread shifting will be an issue with any event lib at max PPN, where all hw-threads are busy. We can certainly try, though.

artpol84 commented Feb 23, 2017

Locking overhead is a known thing:
#273 (comment)

This column of my benchmark showed it:

readers    RO:lk/s
9          1228750
19         65024
39         22930
79         9085
159        4019

These are the numbers of read locks per second.

artpol84 (Contributor) commented:

I slightly changed my locking perf benchmark (artpol84/poc@c0ed956), making each reader do as many locks as possible:

$ ./eval.sh hwthread ./test_pthread_prio
readers WO:ovh  WO:lk/s RO:ovh[avg/min/max]     RO:lk/s RW:wovh RW:wtm  RW:rlk/s[avg/min/max]
9       4e-03   2266855 5e-02/5e-02/5e-02       198098  2e-03   3e-02   98220/73571/181544
19      5e-03   2211996 1e-01/1e-01/1e-01       75197   4e-03   5e-02   76962/64231/118359
39      4e-03   2230954 3e-01/2e-01/3e-01       34518   1e-02   1e-01   28306/26448/30231
79      5e-03   2142312 5e-01/4e-01/5e-01       20319   3e-02   1e-01   13664/10791/15935
159     5e-03   2158900 1e+00/8e-01/1e+00       8715    1e-01   3e-01   4138/3221/5817

RO:lk/s shows the number of locks per second for the read-only test. At the maximum reader count the rate is 8715, which means ~114 us per lock/unlock cycle. This is roughly consistent with what @jjhursey was observing; I guess in his case the contention was less intense.
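
For reference, the read-only part of that measurement boils down to something like the sketch below (a reconstruction of the idea, not the actual test_pthread_prio code): N reader threads hammer a single shared pthread_rwlock_t and we count completed lock/unlock cycles per second.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RUN_SECONDS 2

static pthread_rwlock_t shared_lock = PTHREAD_RWLOCK_INITIALIZER;
static volatile int stop = 0;

static void *reader(void *arg)
{
    unsigned long *count = arg;
    while (!stop) {
        pthread_rwlock_rdlock(&shared_lock);
        pthread_rwlock_unlock(&shared_lock);
        (*count)++;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int i, nreaders = (argc > 1) ? atoi(argv[1]) : 9;
    pthread_t *tids = calloc(nreaders, sizeof(*tids));
    unsigned long *counts = calloc(nreaders, sizeof(*counts));
    unsigned long total = 0;

    for (i = 0; i < nreaders; i++) {
        pthread_create(&tids[i], NULL, reader, &counts[i]);
    }
    sleep(RUN_SECONDS);
    stop = 1;
    for (i = 0; i < nreaders; i++) {
        pthread_join(tids[i], NULL);
        total += counts[i];
    }
    printf("readers=%d  avg RO locks/s per reader = %lu\n",
           nreaders, total / nreaders / RUN_SECONDS);
    free(tids);
    free(counts);
    return 0;
}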

rhc54 commented Feb 24, 2017

After talking with @hjelmn, I'm going to bring over the OPAL atomics as they are a great deal faster. I'm still working on the framework for this, but it may take me a little longer than hoped as we have to account for backward compatibility.

artpol84 commented Feb 24, 2017

Here is the result of the same run but with each reader taking its own lock:

$ ./eval.sh core ./test_pthread_prio
readers WO:ovh  WO:lk/s         RO:ovh[avg/min/max]     RO:lk/s 
9       0e+00   4294967295      9e-03/3e-03/1e-02       1165375
19      0e+00   4294967295      1e-02/4e-03/2e-02       800979
39      0e+00   4294967295      1e-02/4e-03/2e-02       981155
79      0e+00   4294967295      1e-02/8e-03/2e-02       813853
159     0e+00   4294967295      2e-02/8e-03/2e-02       625558

This is a 625558 / 8715 ≈ 71x improvement, and it would be easy to implement.
I was planning to use atomics + conditions as a next step, so having them available would be a good thing. That implementation will be more complicated, though, so I suggest optimizing step by step.
I'll create a PR with independent locks and we will re-evaluate what we see. If I am right, in Josh's case the overhead will drop from 7 s to 7 s / 70 = 0.1 s, locking won't be a major contributor anymore, and I'll concentrate on the other 10 s portion. Once we are done, I'll return to the atomic-based implementation.
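
The independent-locks variant is basically the classic per-reader ("big-reader") lock pattern. A hedged sketch of the idea (not the PR code): each reader only ever touches its own rwlock, so readers never contend with each other, while the rare writer has to take every reader's lock.

#include <pthread.h>

#define MAX_READERS 160

/* One rwlock per local reader (e.g. per local rank), padded so the
 * locks do not share cache lines. For a real dstore these would live
 * in shared memory and need PTHREAD_PROCESS_SHARED attributes. */
typedef struct {
    pthread_rwlock_t lock;
    char pad[128];
} padded_lock_t;

static padded_lock_t locks[MAX_READERS];

static void locks_init(int nreaders)
{
    for (int i = 0; i < nreaders; i++) {
        pthread_rwlock_init(&locks[i].lock, NULL);
    }
}

/* Reader i only touches locks[i]: no reader/reader contention. */
static void reader_lock(int i)   { pthread_rwlock_rdlock(&locks[i].lock); }
static void reader_unlock(int i) { pthread_rwlock_unlock(&locks[i].lock); }

/* The writer must take all reader locks, always in the same order to
 * avoid deadlock; this is the rare, expensive path. */
static void writer_lock_all(int nreaders)
{
    for (int i = 0; i < nreaders; i++) {
        pthread_rwlock_wrlock(&locks[i].lock);
    }
}

static void writer_unlock_all(int nreaders)
{
    for (int i = nreaders - 1; i >= 0; i--) {
        pthread_rwlock_unlock(&locks[i].lock);
    }
}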

artpol84 (Contributor) commented:

Also, I think that atomics will be good for the independent-lock implementation as well, so I'm looking forward to seeing them in PMIx.

artpol84 commented Feb 24, 2017

As a baseline, here is the same run with no locks:

$ ./eval.sh core ./test_dummy
readers WO:ovh  WO:lk/s         RO:ovh[avg/min/max]     RO:lk/s
9       0e+00   4294967295      6e-04/6e-04/6e-04       16.182.655
19      0e+00   4294967295      6e-04/6e-04/6e-04       16.190.060
39      0e+00   4294967295      1e-03/8e-04/1e-03       8.757.592
79      0e+00   4294967295      2e-03/2e-03/2e-03       5.525.658
159     0e+00   4294967295      3e-03/3e-03/4e-03       2.906.758

bosilca commented Feb 24, 2017

Just a side note: we have noticed a significant improvement in built-in atomic performance with gcc 6.2. In OMPI, at least, this puts them on par with our own atomic support.

artpol84 commented Feb 24, 2017

@jjhursey I have some analysis related to your experiments.

The main difference between pmix_perf and your app is that you are doing Fences without data collection, while pmix_perf requests data collection on the fence:
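
Roughly, the difference is whether the fence carries the PMIX_COLLECT_DATA directive. A minimal sketch of the full-modex fence (mirroring the patch at the end of this comment; error handling omitted):

#include <stdbool.h>
#include <string.h>
#include <pmix.h>

/* Full-modex fence: the PMIX_COLLECT_DATA directive asks the server to
 * collect and distribute all data published before the fence, so later
 * PMIx_Get calls are satisfied locally instead of via direct modex. */
static void full_modex_fence(const pmix_proc_t *my_proc)
{
    pmix_proc_t wildcard;
    pmix_info_t *info;

    PMIX_PROC_CONSTRUCT(&wildcard);
    (void)strncpy(wildcard.nspace, my_proc->nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;

    PMIX_INFO_CREATE(info, 1);
    (void)strncpy(info[0].key, PMIX_COLLECT_DATA, PMIX_MAX_KEYLEN);
    info[0].value.type = PMIX_BOOL;
    info[0].value.data.flag = true;

    PMIx_Fence(&wildcard, 1, info, 1);   /* full modex */
    /* PMIx_Fence(&wildcard, 1, NULL, 0) would be the direct-modex variant */
    PMIX_INFO_FREE(info, 1);
}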

This means that you were experimenting in direct-modex mode, so at least #316 was in play, doubling the dstore overhead. I will push to get that fix in.
Also, in the direct-modex case it is no surprise that the time grows with node count and PPN.

Also, I believe that for PAMI you need full-modex mode in OMPI, because you pre-connect all the processes in MPI_Init and so can't benefit from direct modex.

Here are the results of your benchmark for dstore/msg with full/direct modex. dstore shows the best result (for the full modex):
UPDATE: I was experimenting on 3 POWER8 nodes with 160 hw threads each, so ./run.sh 480 launches 3 fully packed POWER8 nodes.

# dstore/pthread/full_modex
$ ./run.sh 480
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
  0/480) Total Time: 1.583986
  0/480) Get       : 0.862791 ( 54.47%) [0.000257 /  3363]
  0/480) Put       : 0.000393 (  0.02%) [0.000066 /     6]
  0/480) Fence     : 0.695426 ( 43.90%) [0.347713 /     2]
  0/480) Commit    : 0.000297 (  0.02%) [0.000297 /     1]

#dstore/pthread/direct_modex
$ ./run.sh 480
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
  0/480) Total Time: 7.669019
  0/480) Get       : 7.076788 ( 92.28%) [0.002104 /  3363]
  0/480) Put       : 0.000356 (  0.00%) [0.000059 /     6]
  0/480) Fence     : 0.566812 (  7.39%) [0.283406 /     2]
  0/480) Commit    : 0.000236 (  0.00%) [0.000236 /     1]


# msg/full_modex
$ ./run.sh 480
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
  0/480) Total Time: 3.352131
  0/480) Get       : 2.193313 ( 65.43%) [0.000652 /  3363]
  0/480) Put       : 0.000359 (  0.01%) [0.000060 /     6]
  0/480) Fence     : 1.092014 ( 32.58%) [0.546007 /     2]
  0/480) Commit    : 0.000132 (  0.00%) [0.000132 /     1]

#msg/direct_modex
$ ./run.sh 480
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
  0/480) Total Time: 3.740087
  0/480) Get       : 2.626587 ( 70.23%) [0.000781 /  3363]
  0/480) Put       : 0.000365 (  0.01%) [0.000061 /     6]
  0/480) Fence     : 1.042588 ( 27.88%) [0.521294 /     2]
  0/480) Commit    : 0.000129 (  0.00%) [0.000129 /     1]

I was experimenting with pmix/master, ompi/master and applied the following patch to your original benchmark to enable full-modex:

$ diff -Naur pmix_get_test.c_old pmix_get_test.c
--- pmix_get_test.c_old 2017-02-25 01:09:25.610528000 +0200
+++ pmix_get_test.c     2017-02-25 01:24:15.000000000 +0200
@@ -66,7 +66,7 @@
     acc_commit = total_commit = 0;

     ts_top = ts_start = GET_TS;
-    ret = PMIx_Init(&my_proc);
+    ret = PMIx_Init(&my_proc, NULL, 0);
     ts_end = GET_TS;
     ts_init = ts_end - ts_start;
     if( PMIX_SUCCESS != ret ) {
@@ -75,29 +75,29 @@
     }

     strncpy(proc.nspace, my_proc.nspace, PMIX_MAX_NSLEN);
-    proc.rank = 0;

     /*
      * Emulate some gets that Open MPI often incurs at this point
      */
+    proc.rank = PMIX_RANK_WILDCARD;
     get(my_proc, PMIX_LOCAL_RANK);
     get(my_proc, PMIX_NODE_RANK);
-    get(my_proc, PMIX_UNIV_SIZE);
-    get(my_proc, PMIX_JOB_SIZE);
-    get(my_proc, PMIX_APPNUM);
-    get(my_proc, PMIX_LOCAL_SIZE);
-    get(my_proc, PMIX_LOCAL_TOPO);
-    get(my_proc, PMIX_LOCAL_PEERS);
-    get(my_proc, PMIX_LOCAL_CPUSETS);
+    get(proc, PMIX_UNIV_SIZE);
+    get(proc, PMIX_JOB_SIZE);
+//    get(my_proc, PMIX_APPNUM);
+    get(proc, PMIX_LOCAL_SIZE);
+    get(proc, PMIX_LOCAL_TOPO);
+    get(proc, PMIX_LOCAL_PEERS);
+//    get(my_proc, PMIX_LOCAL_CPUSETS);

     /*
      * APP emulation
      */
     start_timing = true;
     PRINT_STATUS("APP Get...");
-    get(my_proc, PMIX_JOB_SIZE);
-    get(my_proc, PMIX_LOCAL_SIZE);
-    get(my_proc, PMIX_LOCAL_PEERS);
+    get(proc, PMIX_JOB_SIZE);
+    get(proc, PMIX_LOCAL_SIZE);
+    get(proc, PMIX_LOCAL_PEERS);

     // buildMapCache()
     //  foreach proc:
@@ -106,18 +106,22 @@
         proc.rank = i;
         get(proc, PMIX_HOSTNAME);
     }
-
     //
     // PMIX_Fence()
     //
+    bool value1 = true;
+    PMIX_INFO_CREATE(info, 1);
+    (void)strncpy(info->key, PMIX_COLLECT_DATA, PMIX_MAX_KEYLEN);
+    pmix_value_load(&info->value, &value1, PMIX_BOOL);
+    ninfo = 1;
+/*
     PRINT_STATUS("APP Fence...");
-    proc.rank = PMIX_RANK_WILDCARD;
     ts_start = GET_TS;
     PMIx_Fence(&proc, 1, info, ninfo);
     ts_end = GET_TS;
     total_fence++;
     acc_fence += (ts_end - ts_start);
-
+*/
     //
     // Put aaa / bbb
     //
@@ -196,7 +200,7 @@
      * Cleanup
      */
     ts_start = GET_TS;
-    ret = PMIx_Finalize();
+    ret = PMIx_Finalize(NULL, 0);
     ts_end = GET_TS;
     ts_finalize = ts_end - ts_start;
     if( PMIX_SUCCESS != ret ) {
@@ -259,7 +263,7 @@
         acc_get += (ts_end - ts_start);
     }

-    if( 0 == strcmp(PMIX_UNIV_SIZE, key) ) {
+    if( 0 == strcmp(PMIX_JOB_SIZE, key) ) {
         univ_size = value->data.uint32;
     }

artpol84 (Contributor) commented:

@jjhursey @gpaulsen
I checked recent OMPI v2.x and the data-collection flag seems to be on by default:

           MCA pmix base: parameter "pmix_base_collect_data" (current value: "true", data source: default, level: 9 dev/all, type: bool)
                          Collect all data during modex
                          Valid values: 0: f|false|disabled|no, 1: t|true|enabled|yes

However, I guess Spectrum MPI may have different defaults. Can you comment?

jjhursey (Member Author) commented:

@artpol84 Thanks for the notes.

  • I'll try changing the PMIx_Fence and see if it helps. I think the full modex makes sense for this app.
  • We are not necessarily preconnecting everything in startup, but we are collecting all of the connectivity information at the moment.
  • In SMPI the default for pmix_base_collect_data is the same as OMPI.

artpol84 (Contributor) commented:

Do you remember why you were using an empty fence in your reproducer?

artpol84 commented Feb 25, 2017

Followup:

  • @jjhursey I think that the times we attributed to thread-shifting can sometimes include the time to request remote rank blobs, which is expected to have higher delays.
  • The 500 -> 185 improvement when you implemented direct access without thread-shifting needs further analysis, as I don't quite understand why you gain an improvement from that, but I haven't seen the code.
  • However, the fact that direct modex was used explains the errors you mentioned for both your lock-free and short-patch optimizations.
  • With direct modex, dstore has much higher overhead for the following reasons:
    • It has to execute the same path as with messaging - a message still needs to be sent, so we pay the messaging overhead.
    • We search the dstore twice at the first access because of the PMIx_Get logic refactoring in v1.2.2 (#316).
    • We search the dstore one more time when we receive the completion from the server saying that the data was published in the dstore.
    • In the messaging case we avoid 3 dstore lookups but have one larger message, which shouldn't matter much. We may want to revisit this; however, in the direct-modex case quite a few keys are expected to be retrieved, as opposed to the full-modex case.
  • Finally: in the direct-modex case you were dealing with reader/writer contention, as opposed to the reader-only contention we assumed originally.

jjhursey (Member Author) commented:

All numbers below the Benchmark Application header were from the benchmark that I posted. That way folks can try to reproduce.

artpol84 (Contributor) commented:

I didn't get the last comment.
If this is with respect to my comment about the '500 -> 180' optimization, then I meant that I haven't seen the optimization code in PMIx.

artpol84 (Contributor) commented:

If this was with respect to
'Do you remember why you were using empty fence in your reproducer?'

Then I meant that you seem to have been trying to keep your reproducer close to the SMPI pattern, so the fact that you used a Fence without data collection may mean that there is no such flag in SMPI, or it may just be a typo.

jjhursey (Member Author) commented:

I re-ran the performance measurement changing the one PMIx_Fence to be a full modex. That helped performance quite a bit.

Here are the updated numbers from the application.

32 nodes x 20 ppn = 640 processes

Direct modex (original results)

1.1.2        0.000192 x 3840 = 0.738289
1.2.1 wo     0.000217 x 3840 = 0.831681
1.2.1 with   0.000498 x 3840 = 1.911862

Full modex

1.1.2        0.000088 x 3840 = 0.33792
1.2.1 wo     0.000088 x 3840 = 0.33792
1.2.1 with   0.000065 x 3840 = 0.2496

32 nodes x 80 ppn = 2560 processes

Direct modex

1.1.2        0.000335 x 15360 = 5.1456
1.2.1 wo     0.000333 x 15360 = 5.11488
1.2.1 with   0.000647 x 15360 = 9.93792

Full modex

1.1.2        0.000204 x 15360 = 3.13344
1.2.1 wo     0.000202 x 15360 = 3.10272
1.2.1 with   0.000097 x 15360 = 1.48992

32 nodes x 160 ppn = 5120 processes

Direct modex (original results)

1.1.2        0.000642 x 30720 = 19.733954
1.2.1 wo     0.000660 x 30720 = 20.267527
1.2.1 with   0.001719 x 30720 = 52.802807

Full modex

1.1.2        0.000432 x 30720 = 13.27104
1.2.1 wo     0.000433 x 30720 = 13.30176
1.2.1 with   0.000255 x 30720 = 7.8336

Summary

  • Switching to a full modex helped PMIx_Get performance considerably. Thanks!
  • Using dstore with a fully loaded machine showed improvement over the messaging approach.
  • I did not have time to rerun the detailed analysis of where the hot spots are, even with the full modex, but I would guess there is room for further improvement. I'll try to do that at some point soon and open new tickets for places where I think we can improve things further.
  • There is no documentation that I could find on the difference between direct modex and full modex. The documentation for PMIx_Fence does not use those terms. It would be good to have some documentation on that somewhere searchable.
    • The original reason we were using the direct modex instead of the full modex was a miscommunication between the teams. I didn't even catch it when making the benchmark last week, even though it seems obvious in retrospect. Thanks for the help identifying that.

I'm going to close this again. The performance issue was with the original application, and that has been resolved. I think there are still places for improvement, but we can open separate tickets for those.
