Still, Too slow warp speed in MacOS #1955

xiaoqianWX · 2018-10-09T02:16:09Z

Principia Version: Descartes
MacOS Version: 10.13.6
KSP Version: 1.3.1
Performance improved by about 30-50% after the last update, but with only 3 spacecraft and RSS + RO + 16K textures I still get only 7 fps on the KSC screen with no time warp; before the update it was 4-5 fps. 10,000x is the highest warp rate I can use without dropping below 1 fps. On Windows 10, with the same computer, same GameData, and same KSP version, the frame rate reaches whatever cap I set in the settings.

xiaoqianWX · 2018-10-09T02:16:55Z

#1908

pleroy · 2018-10-14T12:38:40Z

Added support for benchmarking on Linux and macOS in #1960 and #1961.
On Windows:

Run on (4 X 3310 MHz CPU s)
10/14/18 14:04:33
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1   97897958 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/2   72153857 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/3   44194468 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/4   44230655 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/5   43977030 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

On Linux:

Run on (4 X 3361.62 MHz CPU s)
2018-10-14 12:34:57
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1  105878364 ns      77017 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/2   97582405 ns      69892 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/3   71511675 ns      66142 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/4   72273303 ns      57613 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/5   71218125 ns      64278 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

On macOS:

Run on (4 X 2500 MHz CPU s)
2018-10-14 14:52:30
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1   78955858 ns      63060 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/2   63007856 ns      52350 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/3  141527206 ns      45900 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/4  149050177 ns      44210 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/5  146021077 ns      44110 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

One of these is not like the others.

(Edited on 2018-11-12 to use the optimized numbers, not the debug numbers for macOS.)

pleroy · 2018-10-14T18:42:54Z

The benchmark in #1962 shows clear problems with std::mutex on macOS. According to these threads, macOS doesn't use spinlocks/futex but always locks a kernel object. That would explain the numbers we observe, in particular the considerable slowdown as the number of threads increases, since there is probably quite a bit of contention on the thread pool's queue.

I am not sure how to fix this: it seems intrinsic to macOS, and we are not going to come up with our own mutex implementation. One possibility would be to use absl::Mutex but that would be a major effort.
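The kind of contention described above (several workers taking one std::mutex around a shared queue) can be reproduced in miniature. The following is only a sketch, not the benchmark from #1962: the names `RunResult` and `DrainSharedQueue` are hypothetical, and the critical section is deliberately tiny so that elapsed time is dominated by locking overhead.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Result of one run: how many items were popped and how long it took.
struct RunResult {
  std::int64_t items_popped;
  std::chrono::nanoseconds elapsed;
};

// Hypothetical helper: fills a queue with `num_items` integers, then lets
// `num_threads` workers drain it, each taking the same std::mutex for every
// pop.  This mimics workers pulling tasks from a shared thread pool queue.
RunResult DrainSharedQueue(int const num_threads,
                           std::int64_t const num_items) {
  assert(num_threads > 0 && num_items >= 0);
  std::mutex lock;
  std::queue<std::int64_t> queue;
  for (std::int64_t i = 0; i < num_items; ++i) {
    queue.push(i);
  }

  // One counter per worker; each is only written while holding `lock`.
  std::vector<std::int64_t> popped(num_threads, 0);
  std::vector<std::thread> workers;
  auto const start = std::chrono::steady_clock::now();
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&lock, &queue, &popped, t] {
      for (;;) {
        std::lock_guard<std::mutex> guard(lock);
        if (queue.empty()) {
          return;
        }
        queue.pop();
        ++popped[t];
      }
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  auto const stop = std::chrono::steady_clock::now();

  std::int64_t total = 0;
  for (auto const count : popped) {
    total += count;
  }
  return {total,
          std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Calling `DrainSharedQueue(n, 1000000)` for n = 1 to 5 and comparing the `elapsed` values across operating systems should give a rough idea of how the platform's mutex behaves under contention; the absolute numbers will depend heavily on the machine and compiler flags.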

ts826848 · 2018-10-15T19:40:49Z

It appears there may be a way to make macOS mutexes unfair? From this Mozilla blog:

What’s more, the developers of OS X have recognized this and added a way to make their mutexes non-fair. In <pthread_spis.h>, there’s a OS X-only function, pthread_mutexattr_setpolicy_np. (pthread mutex attributes control various qualities of pthread mutexes: normal, recursively acquirable, etc.) This particular function, supported since OS X 10.7, enables setting the fairness policy of mutexes to either _PTHREAD_MUTEX_POLICY_FAIRSHARE (the default) or _PTHREAD_MUTEX_POLICY_FIRSTFIT.

The author saw roughly an order of magnitude improvement in a lock fairness benchmark ported from WebKit to use raw pthreads on macOS 10.10.5 on a Mac mini. I don't know what libc++ uses for std::mutex/std::shared_mutex on macOS, but if it's pthreads then there may be hope.
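For raw pthreads, setting the policy might look like the sketch below. The header name, the function pthread_mutexattr_setpolicy_np, and the constant _PTHREAD_MUTEX_POLICY_FIRSTFIT are taken from the quoted post and have not been checked against current SDKs, so the macOS-specific part is guarded and the code falls back to a default mutex elsewhere. `InitFirstFitMutex` and the `HAVE_MUTEX_POLICY_NP` macro are hypothetical names. Note that std::mutex has no interface for passing attributes, so this only helps code that owns its pthread_mutex_t.

```cpp
#include <cassert>
#include <pthread.h>

// Only pull in the macOS-specific header if it is actually present.
#if defined(__APPLE__) && defined(__has_include)
#if __has_include(<pthread_spis.h>)
#include <pthread_spis.h>
#define HAVE_MUTEX_POLICY_NP 1  // Hypothetical feature macro for this sketch.
#endif
#endif

// Initializes `mutex`, requesting the first-fit (non-fair) policy where the
// macOS-only attribute is available.  Returns 0 on success, otherwise the
// error code from pthreads.  `*first_fit_applied` reports whether the policy
// was actually set; it is always false on other platforms.
int InitFirstFitMutex(pthread_mutex_t* const mutex,
                      bool* const first_fit_applied) {
  assert(mutex != nullptr && first_fit_applied != nullptr);
  *first_fit_applied = false;

  pthread_mutexattr_t attributes;
  int error = pthread_mutexattr_init(&attributes);
  if (error != 0) {
    return error;
  }
#if defined(HAVE_MUTEX_POLICY_NP)
  // Assumption: name and constant as given in the quoted blog post.
  if (pthread_mutexattr_setpolicy_np(&attributes,
                                     _PTHREAD_MUTEX_POLICY_FIRSTFIT) == 0) {
    *first_fit_applied = true;
  }
#endif
  error = pthread_mutex_init(mutex, &attributes);
  pthread_mutexattr_destroy(&attributes);
  return error;
}
```

The mutex is then used with the usual pthread_mutex_lock/pthread_mutex_unlock calls and released with pthread_mutex_destroy.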

pleroy · 2018-11-12T21:08:25Z

After #1978, the (optimized) numbers for macOS are as follows:

2018-11-12 21:45:54
Running bin/benchmark
Run on (4 X 2500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1   79742933 ns      63410 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/2   61625369 ns      52760 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/3   55095339 ns      41910 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/4   55820635 ns      40140 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/5   55846511 ns      37670 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

Compare this with the macOS numbers above. We have 3 vessels. For 1 and 2 threads in the pool, the elapsed times are the same because there is no contention. For 3 threads or more, the native macOS mutex goes into contention and fairness madness, whereas the absl::Mutex implementation yields a performance independent of the number of threads. The bottom line is that the performance gain is around 2.7×. For processors with a larger number of cores, the difference should be even bigger when there are many vessels (which is nearly always the case because of asteroids).
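The gain can be checked directly from the two macOS tables. The snippet below is just that arithmetic: the elapsed times are copied from the Time columns of the earlier macOS run (std::mutex) and of the run after #1978 (absl::Mutex), and the names `kStdMutexNs`, `kAbslMutexNs`, `Speedup` and `PrintSpeedups` are made up for the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>

// Elapsed times in ns for pool sizes 1..5, copied from the macOS tables.
constexpr std::array<double, 5> kStdMutexNs = {
    78955858.0, 63007856.0, 141527206.0, 149050177.0, 146021077.0};
constexpr std::array<double, 5> kAbslMutexNs = {
    79742933.0, 61625369.0, 55095339.0, 55820635.0, 55846511.0};

// Ratio of elapsed times (std::mutex over absl::Mutex) for a pool of
// `threads` threads, with `threads` in 1..5.
double Speedup(std::size_t const threads) {
  assert(threads >= 1 && threads <= kStdMutexNs.size());
  return kStdMutexNs[threads - 1] / kAbslMutexNs[threads - 1];
}

// Prints one line per pool size.
void PrintSpeedups() {
  for (std::size_t threads = 1; threads <= kStdMutexNs.size(); ++threads) {
    std::printf("%zu thread(s): %.2fx\n", threads, Speedup(threads));
  }
}
```

For 1 and 2 threads the ratio stays close to 1, and for 4 threads it comes out at roughly 2.67, which matches the "around 2.7×" figure.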

The change will be in Erdős, to be released around December 7th. I am closing this bug. Please comment once Erdős is released on the performance gain (or not).

pleroy · 2018-12-06T18:53:17Z

Erdős is now available. When you have a chance, please tell us how the performance looks on macOS (our mac is a rather weak machine, so it's a bit hard to tell).

xiaoqianWX · 2018-12-18T14:48:15Z

@pleroy I've been testing Erdős on my MacBook Pro 13" for 3 days. Even on this machine I can now get up to 30 fps, and Principia has little impact on CPU usage compared to not having it installed. I am not at home at the moment, so I can't compare macOS and Windows yet, but the results so far are very good. Once I am home, I will test on the same machine I used for the earlier tests.
My MacBook Pro 13":
CPU: i7-7660U
Mem: LPDDR3 2133 MHz 8 GB
Graphics: Intel Iris Plus Graphics 640

pleroy · 2018-12-18T15:35:05Z

Thanks for the info @EthanWang706. Looking forward to hearing about the performance on your home machine.

pleroy added the bug label Oct 14, 2018

pleroy mentioned this issue Oct 14, 2018

A benchmark for ThreadPool #1962

Merged

pleroy mentioned this issue Oct 14, 2018

Not Normal Warp Speed In RSS #1908

Closed

pleroy mentioned this issue Nov 11, 2018

Replace all usages of std::mutex by absl::Mutex #1978

Merged

pleroy closed this as completed Nov 12, 2018

pleroy added this to the Erdős milestone Dec 8, 2018

rnlahaye mentioned this issue Feb 24, 2021

Principia is slow on macOS (possibly due to Unity's allocator) #2899

Closed
