Still, Too slow warp speed in MacOS #1955

Closed
xiaoqianWX opened this issue Oct 9, 2018 · 8 comments

@xiaoqianWX

Principia Version: Descartes
MacOS Version: 10.13.6
KSP Version: 1.3.1
The last update brought an improvement of about 30-50%, but with only 3 spacecraft and RSS+RO with 16K textures, I still get only 7 fps on the KSC screen with no time warp; before the update it was 4-5 fps. 10,000x is the fastest warp I can use without dropping below 1 fps. On Windows 10, with the same computer, the same GameData and the same KSP version, the frame rate reaches whatever limit I set in the settings.

@xiaoqianWX
Author

#1908

@pleroy
Member

pleroy commented Oct 14, 2018

Added support for benchmarking on Linux and macOS in #1960 and #1961.
On Windows:

Run on (4 X 3310 MHz CPU s)
10/14/18 14:04:33
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1   97897958 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/2   72153857 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/3   44194468 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/4   44230655 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/5   43977030 ns          0 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

On Linux:

Run on (4 X 3361.62 MHz CPU s)
2018-10-14 12:34:57
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1  105878364 ns      77017 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/2   97582405 ns      69892 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/3   71511675 ns      66142 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/4   72273303 ns      57613 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m
BM_EphemerisMultithreadingBenchmark/3/5   71218125 ns      64278 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m

On macOS:

Run on (4 X 2500 MHz CPU s)
2018-10-14 14:52:30
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1   78955858 ns      63060 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/2   63007856 ns      52350 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/3  141527206 ns      45900 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/4  149050177 ns      44210 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/5  146021077 ns      44110 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 

One of these is not like the others.

(Edited on 2018-11-12 to use the optimized numbers, not the debug numbers for macOS.)

@pleroy
Member

pleroy commented Oct 14, 2018

The benchmark in #1962 shows clear problems with std::mutex on macOS. According to these threads, macOS doesn't use spinlocks/futex but always locks a kernel object. That would explain the numbers that we observe, in particular the fact that things slow down considerably when the number of threads increases, as there is probably quite a bit of contention on the thread pool's queue.

I am not sure how to fix this: it seems intrinsic to macOS, and we are not going to come up with our own mutex implementation. One possibility would be to use absl::Mutex but that would be a major effort.
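To make the contention argument concrete, here is a minimal sketch of the kind of measurement involved. It is not the benchmark from #1962: the helper `MeasureContention` and the struct `ContentionResult` are hypothetical names introduced for illustration, and a shared counter stands in for the thread pool's queue. Every thread repeatedly takes the same `std::mutex`, so the elapsed time mostly reflects the cost of acquiring a contended lock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical result type: the final counter value and the wall-clock time.
struct ContentionResult {
  std::int64_t counter;
  std::chrono::nanoseconds elapsed;
};

// Hypothetical helper: `thread_count` threads each lock one shared std::mutex
// `iterations_per_thread` times and increment a shared counter under it.
inline ContentionResult MeasureContention(int thread_count,
                                          int iterations_per_thread) {
  assert(thread_count > 0 && iterations_per_thread >= 0);
  std::mutex lock;
  std::int64_t counter = 0;
  std::vector<std::thread> threads;
  threads.reserve(thread_count);

  auto const start = std::chrono::steady_clock::now();
  for (int i = 0; i < thread_count; ++i) {
    threads.emplace_back([&lock, &counter, iterations_per_thread] {
      for (int j = 0; j < iterations_per_thread; ++j) {
        // The critical section is tiny, so lock acquisition dominates.
        std::lock_guard<std::mutex> guard(lock);
        ++counter;
      }
    });
  }
  for (auto& thread : threads) {
    thread.join();
  }
  auto const stop = std::chrono::steady_clock::now();

  return {counter,
          std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Calling `MeasureContention(n, 100000)` for n = 1 to 5 and comparing the elapsed times across platforms should show whether the time per lock acquisition grows with the number of threads, which is the pattern described above for macOS.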

@ts826848
Contributor

It appears there may be a way to make macOS mutexes unfair? From this Mozilla blog:

What’s more, the developers of OS X have recognized this and added a way to make their mutexes non-fair. In <pthread_spis.h>, there’s a OS X-only function, pthread_mutexattr_setpolicy_np. (pthread mutex attributes control various qualities of pthread mutexes: normal, recursively acquirable, etc.) This particular function, supported since OS X 10.7, enables setting the fairness policy of mutexes to either _PTHREAD_MUTEX_POLICY_FAIRSHARE (the default) or _PTHREAD_MUTEX_POLICY_FIRSTFIT.

The author saw roughly an order-of-magnitude improvement in a lock fairness benchmark ported from WebKit to use raw pthreads on macOS 10.10.5 on a Mac mini. I don't know what libc++ uses for std::mutex/std::shared_mutex on macOS, but if it's pthreads then there may be hope.
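A rough sketch of what the quoted approach might look like with raw pthreads follows. The function `InitFirstFitMutex` is a hypothetical name. The constant `_PTHREAD_MUTEX_POLICY_FIRSTFIT` is the one named in the quote; the alternative spelling `PTHREAD_MUTEX_POLICY_FIRSTFIT_NP` is an assumption about newer SDKs, which is why both are guarded by preprocessor checks. On other platforms the code falls back to a default mutex. Note that this only applies to a mutex you create yourself, not to an existing `std::mutex`.

```cpp
#include <cassert>
#include <pthread.h>
#if defined(__APPLE__)
#include <pthread_spis.h>  // Declares pthread_mutexattr_setpolicy_np on macOS.
#endif

// Hypothetical helper: initializes *mutex, requesting the first-fit (unfair)
// policy where the platform offers it. Returns 0 on success, otherwise the
// pthread error code.
inline int InitFirstFitMutex(pthread_mutex_t* mutex) {
  assert(mutex != nullptr);
  pthread_mutexattr_t attributes;
  int status = pthread_mutexattr_init(&attributes);
  if (status != 0) {
    return status;
  }
#if defined(__APPLE__) && defined(PTHREAD_MUTEX_POLICY_FIRSTFIT_NP)
  // Assumed spelling of the constant in newer SDKs.
  status = pthread_mutexattr_setpolicy_np(&attributes,
                                          PTHREAD_MUTEX_POLICY_FIRSTFIT_NP);
#elif defined(__APPLE__) && defined(_PTHREAD_MUTEX_POLICY_FIRSTFIT)
  // Spelling of the constant as given in the quoted blog post.
  status = pthread_mutexattr_setpolicy_np(&attributes,
                                          _PTHREAD_MUTEX_POLICY_FIRSTFIT);
#endif
  if (status == 0) {
    status = pthread_mutex_init(mutex, &attributes);
  }
  pthread_mutexattr_destroy(&attributes);
  return status;
}
```

A mutex initialized this way is used with the usual `pthread_mutex_lock`, `pthread_mutex_unlock` and `pthread_mutex_destroy` calls. Whether it helps here depends on whether the contended locks can be replaced by such a mutex at all.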

@pleroy
Member

pleroy commented Nov 12, 2018

After #1978, the (optimized) numbers for macOS are as follows:

2018-11-12 21:45:54
Running bin/benchmark
Run on (4 X 2500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-------------------------------------------------------------------------------
Benchmark                                        Time           CPU Iterations
-------------------------------------------------------------------------------
BM_EphemerisMultithreadingBenchmark/3/1   79742933 ns      63410 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/2   61625369 ns      52760 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/3   55095339 ns      41910 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/4   55820635 ns      40140 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 
BM_EphemerisMultithreadingBenchmark/3/5   55846511 ns      37670 ns        100 +9.98765338665172644e+06 m +1.99941207398748249e+07 m +2.99993670217528194e+07 m 

Compare this with the macOS numbers above: we have 3 vessels; for 1 and 2 threads in the pool, the elapsed times are the same because there is no contention; for 3 threads or more, the native macOS mutex goes into contention and fairness madness, whereas the absl::Mutex implementation yields performance that is independent of the number of threads. The bottom line is that the performance gain is around 2.7×. For processors with a larger number of cores, the difference should be even bigger when there are many vessels (which is nearly always the case because of asteroids).
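The gain can be checked directly from the two macOS tables in this thread. The sketch below only does that arithmetic: the elapsed times are copied from the benchmark rows for 3, 4 and 5 threads, and the names `kNativeMutexNs`, `kAbslMutexNs` and `Speedup` are made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Elapsed times in ns for BM_EphemerisMultithreadingBenchmark/3/{3,4,5} on
// macOS, before #1978 (native mutex) and after it (absl::Mutex).
constexpr std::array<double, 3> kNativeMutexNs = {141527206.0, 149050177.0,
                                                  146021077.0};
constexpr std::array<double, 3> kAbslMutexNs = {55095339.0, 55820635.0,
                                                55846511.0};

// Ratio of the old elapsed time to the new one for the i-th row above.
inline double Speedup(std::size_t i) {
  assert(i < kNativeMutexNs.size());
  return kNativeMutexNs[i] / kAbslMutexNs[i];
}
```

The three ratios come out at roughly 2.57, 2.67 and 2.61, consistent with the "around 2.7×" figure when taking the 4-thread row.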

The change will be in Erdős, to be released around December 7th. I am closing this bug. Please comment once Erdős is released on the performance gain (or not).

@pleroy pleroy closed this as completed Nov 12, 2018
@pleroy
Member

pleroy commented Dec 6, 2018

Erdős is now available. When you have a chance, please tell us how the performance looks on macOS (our mac is a rather weak machine, so it's a bit hard to tell).

@pleroy pleroy added this to the Erdős milestone Dec 8, 2018
@xiaoqianWX
Author

xiaoqianWX commented Dec 18, 2018

@pleroy I've been testing Erdős on my 13" MacBook Pro for 3 days. Even on this machine I can now get up to 30 fps, and Principia has almost no impact on CPU usage compared to running without it. I am not at home currently, so I can't compare macOS and Windows, but the testing so far indicates a very good result. Once I am home, I will test on the same machine I used for the earlier tests.
My 13" MacBook Pro:
CPU: i7-7660U
Mem: LPDDR3 2133MHZ 8GB
Graphics: Intel Iris Plus Graphics 640

@pleroy
Member

pleroy commented Dec 18, 2018

Thanks for the info @EthanWang706. Looking forward to hearing about the performance on your home machine.
