Take into consideration SMT threads when blocking for L1 cache #54

Closed

antonblanchard commented Jul 20, 2018

SMT threads share an L1 cache, so to avoid spilling out of the
L1 cache we need to divide the cache size by the number of threads.

Performance on a POWER9 running in SMT4 mode improves 12% with this patch.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
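
The idea in this description can be sketched in a few lines. This is only an illustration in Python, not the patch itself: `per_thread_cache_bytes` is a hypothetical helper name, the 32 KiB value is an example, and clamping the thread count to at least 1 is an assumption made to keep the sketch safe on bad input.

```python
def per_thread_cache_bytes(cache_bytes: int, threads_per_core: int) -> int:
    """Return the share of a core-private cache available to each SMT thread."""
    if cache_bytes <= 0:
        raise ValueError("cache_bytes must be positive")
    # Assumption for this sketch: anything below 1 is treated as a non-SMT core.
    threads = max(1, threads_per_core)
    return cache_bytes // threads


l1_bytes = 32 * 1024  # example value only: a 32 KiB L1 data cache
for smt in (1, 2, 4):
    share_kib = per_thread_cache_bytes(l1_bytes, smt) // 1024
    print(f"SMT{smt}: {share_kib} KiB of L1 per thread")
```

With four threads per core, each thread would block for a quarter of the L1 cache instead of the whole cache.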

kimwalisch Jul 20, 2018

Owner

Thanks for your pull request!

I actually had the same idea (sieveSize = cacheSize / number of threads per CPU core) about a year ago. I implemented and tested it on my Intel Skylake CPU and, I believe, on other CPU architectures as well. My benchmark results at the time were mixed: the code ran slightly faster on some CPUs and slightly slower on others.

The conclusion I drew from those benchmarks was that it is tricky: dividing the cache size by the number of SMT threads per core is not reliably faster on all CPU architectures. Also, using a sieve size smaller than the CPU's cache size usually decreases performance slightly in single-thread mode. For these reasons I decided at the time that this code change was not worth it and I rejected the idea.

But maybe I am wrong and it is time to test and benchmark this again and more thoroughly. I remember having read that y-cruncher already takes into account SMT threads when doing cache blocking and apparently this is faster for y-cruncher.

primesieve currently uses both the L1 and the L2 cache sizes for prime sieving. The L1 cache size is used in the EratSmall algorithm and the L2 cache size is used in the EratMedium and EratBig algorithms. It is worth benchmarking whether dividing the L2 cache size by the number of SMT threads per CPU core is also faster.
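
One way to set up such a benchmark is to enumerate candidate (small sieve, large sieve) sizes by halving each cache size down to its per-thread share. The sketch below is an assumption about how such a grid could be generated, not primesieve code; `candidate_sizes` and `sieve_grid` are hypothetical names, and sizes are in KB.

```python
from itertools import product


def candidate_sizes(cache_kb: int, threads_per_core: int) -> list:
    """Halve cache_kb repeatedly until reaching cache_kb / threads_per_core."""
    floor_kb = max(1, cache_kb // max(1, threads_per_core))
    sizes = [cache_kb]
    while sizes[-1] // 2 >= floor_kb:
        sizes.append(sizes[-1] // 2)
    return sizes


def sieve_grid(l1_kb: int, l2_kb: int, threads_per_core: int) -> list:
    """All (small_sieve_kb, large_sieve_kb) pairs worth benchmarking."""
    small = candidate_sizes(l1_kb, threads_per_core)
    large = candidate_sizes(l2_kb, threads_per_core)
    return list(product(small, large))


# Example values: 32 KB L1, 256 KB L2, 2 SMT threads per core.
for small_kb, large_kb in sieve_grid(32, 256, 2):
    print(f"smallSieve={small_kb}KB, largeSieve={large_kb}KB")
```

For two threads per core this yields four combinations: the full and the halved size at each cache level.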

I will run extensive benchmarks on different CPU architectures over the next week or so and publish the results here when done. Then I will decide how to proceed with your pull request.

@kimwalisch kimwalisch self-assigned this Jul 20, 2018

antonblanchard Jul 20, 2018

Thanks for the information, I'll take a look at the L2 blocked benchmarks next.

FYI here are the results on POWER9 before and after my patch, for all SMT levels. Interesting that SMT2 is the best performing.

./primesieve 1e12 --quiet --time

Time in seconds:

SMT level       baseline        patched
1               14.728          14.899
2               15.362          14.040
4               19.489          17.458
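
For readers who want the relative change rather than raw seconds, the table above can be converted as follows. This is a small sketch; the numbers are copied from the table and `improvement_percent` is a hypothetical helper name.

```python
# (baseline seconds, patched seconds) per SMT level, copied from the table above.
results = {
    1: (14.728, 14.899),
    2: (15.362, 14.040),
    4: (19.489, 17.458),
}


def improvement_percent(baseline: float, patched: float) -> float:
    """Positive means the patched build is faster than the baseline."""
    return (baseline - patched) / baseline * 100.0


for smt, (baseline, patched) in results.items():
    print(f"SMT{smt}: {improvement_percent(baseline, patched):+.1f}%")
```

In this particular run the patch is roughly neutral at SMT1 (about 1% slower) and saves about 9% at SMT2 and about 10% at SMT4.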

kimwalisch Jul 21, 2018

Owner
$ ./benchmark.sh
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPU cores: 4
Number of threads: 8
L1 data cache size: 32K
L2 cache size: 256K

=== smallSieve=32KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e11:  5.052 sec
primesieve 1e12 --dist=1e11:  6.876 sec
primesieve 1e15 --dist=1e11: 11.165 sec
primesieve 1e17 --dist=1e11: 14.724 sec
primesieve 1e19 --dist=1e11: 24.839 sec # fastest

=== smallSieve=32KB, largeSieve=128KB ===

primesieve 1e0  --dist=1e11:  4.937 sec # fastest
primesieve 1e12 --dist=1e11:  6.718 sec # fastest
primesieve 1e15 --dist=1e11: 10.857 sec # fastest
primesieve 1e17 --dist=1e11: 14.090 sec # fastest
primesieve 1e19 --dist=1e11: 30.694 sec

=== smallSieve=16KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e11:  5.142 sec
primesieve 1e12 --dist=1e11:  7.195 sec
primesieve 1e15 --dist=1e11: 11.611 sec
primesieve 1e17 --dist=1e11: 14.573 sec
primesieve 1e19 --dist=1e11: 25.059 sec

=== smallSieve=16KB, largeSieve=128KB ===

primesieve 1e0  --dist=1e11:  4.953 sec
primesieve 1e12 --dist=1e11:  6.859 sec
primesieve 1e15 --dist=1e11: 10.974 sec
primesieve 1e17 --dist=1e11: 14.120 sec
primesieve 1e19 --dist=1e11: 30.764 sec

SMT cache blocking gives slightly better performance for largeSieve. For smallSieve SMT cache blocking does not meaningfully change performance.

kimwalisch Jul 21, 2018

Owner
$ ./benchmark.sh 1e12
AMD EPYC 7401P 24-Core Processor
CPU cores: 24
Number of threads: 48
L1 data cache size: 32K
L2 cache size: 512K

=== smallSieve=32KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e12: 15.559 sec
primesieve 1e12 --dist=1e12: 17.869 sec
primesieve 1e15 --dist=1e12: 29.413 sec
primesieve 1e17 --dist=1e12: 38.953 sec
primesieve 1e19 --dist=1e12: 62.289 sec # fastest

=== smallSieve=32KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e12: 16.056 sec
primesieve 1e12 --dist=1e12: 17.806 sec
primesieve 1e15 --dist=1e12: 29.948 sec
primesieve 1e17 --dist=1e12: 39.915 sec
primesieve 1e19 --dist=1e12: 68.782 sec

=== smallSieve=16KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e12: 15.930 sec
primesieve 1e12 --dist=1e12: 17.761 sec
primesieve 1e15 --dist=1e12: 28.868 sec # fastest
primesieve 1e17 --dist=1e12: 38.509 sec # fastest
primesieve 1e19 --dist=1e12: 63.463 sec

=== smallSieve=16KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e12: 15.521 sec # fastest
primesieve 1e12 --dist=1e12: 17.391 sec # fastest
primesieve 1e15 --dist=1e12: 28.996 sec
primesieve 1e17 --dist=1e12: 38.977 sec
primesieve 1e19 --dist=1e12: 68.323 sec

SMT cache blocking gives slightly better performance (1% - 2%).

kimwalisch Jul 21, 2018

Owner
$ ./benchmark.sh 1e12
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
CPU cores:  8
Number of threads: 16
L1 data cache size: 32K
L2 cache size: 1024K

=== smallSieve=32KB, largeSieve=1024KB ===

primesieve 1e0  --dist=1e12:  35.069 sec
primesieve 1e12 --dist=1e12:  40.516 sec
primesieve 1e15 --dist=1e12:  71.349 sec
primesieve 1e17 --dist=1e12:  96.797 sec
primesieve 1e19 --dist=1e12: 130.357 sec

=== smallSieve=32KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e12:  31.013 sec
primesieve 1e12 --dist=1e12:  33.267 sec # fastest
primesieve 1e15 --dist=1e12:  60.488 sec # fastest
primesieve 1e17 --dist=1e12:  75.310 sec # fastest
primesieve 1e19 --dist=1e12: 110.033 sec # fastest

=== smallSieve=16KB, largeSieve=1024KB ===

primesieve 1e0  --dist=1e12:  40.691 sec
primesieve 1e12 --dist=1e12:  44.387 sec
primesieve 1e15 --dist=1e12:  75.691 sec
primesieve 1e17 --dist=1e12: 100.728 sec
primesieve 1e19 --dist=1e12: 132.698 sec

=== smallSieve=16KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e12:  29.407 sec # fastest
primesieve 1e12 --dist=1e12:  34.170 sec
primesieve 1e15 --dist=1e12:  61.284 sec
primesieve 1e17 --dist=1e12:  83.831 sec
primesieve 1e19 --dist=1e12: 110.908 sec

The Intel(R) Xeon(R) Platinum 8124M is a special case because it is based on Intel's first CPU architecture with 1 megabyte of L2 cache per core.

SMT cache blocking gives significantly better performance (up to 30%).

kimwalisch Jul 21, 2018

Owner
$ ./benchmark.sh 
IBM POWER8
CPU cores: 10
Number of threads: 80
L1 data cache size: 64K
L2 cache size: 512K

=== smallSieve=64KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e11:  3.484 sec
primesieve 1e12 --dist=1e11:  5.362 sec
primesieve 1e15 --dist=1e11:  9.339 sec
primesieve 1e17 --dist=1e11: 13.011 sec
primesieve 1e19 --dist=1e11: 23.648 sec

=== smallSieve=64KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e11:  3.554 sec
primesieve 1e12 --dist=1e11:  4.830 sec
primesieve 1e15 --dist=1e11:  6.841 sec # fastest
primesieve 1e17 --dist=1e11: 11.821 sec
primesieve 1e19 --dist=1e11: 24.257 sec

=== smallSieve=64KB, largeSieve=128KB ===

primesieve 1e0  --dist=1e11:  3.514 sec
primesieve 1e12 --dist=1e11:  5.465 sec
primesieve 1e15 --dist=1e11:  9.310 sec
primesieve 1e17 --dist=1e11:  9.916 sec # fastest
primesieve 1e19 --dist=1e11: 23.459 sec # fastest

=== smallSieve=32KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e11:  3.759 sec
primesieve 1e12 --dist=1e11:  4.673 sec # fastest
primesieve 1e15 --dist=1e11:  9.805 sec
primesieve 1e17 --dist=1e11: 13.541 sec
primesieve 1e19 --dist=1e11: 27.200 sec

=== smallSieve=32KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e11:  3.389 sec # fastest
primesieve 1e12 --dist=1e11:  6.504 sec
primesieve 1e15 --dist=1e11:  9.158 sec
primesieve 1e17 --dist=1e11: 10.902 sec
primesieve 1e19 --dist=1e11: 26.091 sec

=== smallSieve=32KB, largeSieve=128KB ===

primesieve 1e0  --dist=1e11:  3.432 sec
primesieve 1e12 --dist=1e11:  4.829 sec
primesieve 1e15 --dist=1e11:  9.917 sec
primesieve 1e17 --dist=1e11: 12.107 sec
primesieve 1e19 --dist=1e11: 26.659 sec

=== smallSieve=16KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e11:  3.397 sec
primesieve 1e12 --dist=1e11:  4.799 sec
primesieve 1e15 --dist=1e11:  7.978 sec
primesieve 1e17 --dist=1e11: 11.363 sec
primesieve 1e19 --dist=1e11: 32.422 sec

=== smallSieve=16KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e11:  4.239 sec
primesieve 1e12 --dist=1e11:  6.568 sec
primesieve 1e15 --dist=1e11: 10.266 sec
primesieve 1e17 --dist=1e11: 13.909 sec
primesieve 1e19 --dist=1e11: 35.253 sec

=== smallSieve=16KB, largeSieve=128KB ===

primesieve 1e0  --dist=1e11:  3.653 sec
primesieve 1e12 --dist=1e11:  5.061 sec
primesieve 1e15 --dist=1e11:  9.989 sec
primesieve 1e17 --dist=1e11: 11.171 sec
primesieve 1e19 --dist=1e11: 31.780 sec

SMT cache blocking can significantly improve performance (up to 30%), but it is tricky to find the best settings for smallSieve and largeSieve.
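
Picking the best setting by eye gets tedious with this many combinations. The sketch below shows one way to extract the fastest configuration per start offset from output in the format above. It is an illustration only: `fastest_per_start` is a hypothetical helper, the regular expressions assume exactly the line layout shown in this thread, and the sample log contains just the first two POWER8 configurations.

```python
import re

HEADER = re.compile(r"=== smallSieve=(\d+)KB, largeSieve=(\d+)KB ===")
RESULT = re.compile(r"primesieve (\S+)\s+--dist=\S+:\s+([\d.]+) sec")

# First two POWER8 configurations from the output above.
SAMPLE_LOG = """\
=== smallSieve=64KB, largeSieve=512KB ===

primesieve 1e0  --dist=1e11:  3.484 sec
primesieve 1e12 --dist=1e11:  5.362 sec
primesieve 1e15 --dist=1e11:  9.339 sec
primesieve 1e17 --dist=1e11: 13.011 sec
primesieve 1e19 --dist=1e11: 23.648 sec

=== smallSieve=64KB, largeSieve=256KB ===

primesieve 1e0  --dist=1e11:  3.554 sec
primesieve 1e12 --dist=1e11:  4.830 sec
primesieve 1e15 --dist=1e11:  6.841 sec # fastest
primesieve 1e17 --dist=1e11: 11.821 sec
primesieve 1e19 --dist=1e11: 24.257 sec
"""


def fastest_per_start(log: str) -> dict:
    """Map each start offset to (seconds, (small_kb, large_kb)) of its fastest run."""
    best = {}
    config = None
    for line in log.splitlines():
        header = HEADER.search(line)
        if header:
            config = (int(header.group(1)), int(header.group(2)))
            continue
        result = RESULT.search(line)
        if result and config is not None:
            start, seconds = result.group(1), float(result.group(2))
            if start not in best or seconds < best[start][0]:
                best[start] = (seconds, config)
    return best


for start, (seconds, (small_kb, large_kb)) in fastest_per_start(SAMPLE_LOG).items():
    print(f"{start}: {seconds:.3f} sec with smallSieve={small_kb}KB, largeSieve={large_kb}KB")
```

Run on the full output, the same function would show that no single pair wins for every start offset.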

kimwalisch Jul 22, 2018

Owner

Here is my conclusion

Adding SMT cache blocking to primesieve would make it faster on nearly all CPUs with simultaneous multithreading (SMT).

SMT cache blocking would give small speedups (1% - 3%) for:

  • All Intel x64 consumer CPUs
  • All AMD CPUs

SMT cache blocking would give large speedups (up to 30%) for:

  • The latest Intel Skylake-X CPUs
  • IBM POWER CPUs

Doing SMT cache blocking for the L2 cache is more important than doing it for the L1 cache. There is one performance pitfall: above 1e18, SMT cache blocking will significantly degrade performance if the L2 cache is smaller than 1024 KB per CPU core. (In that case a performance-critical array inside EratBig will not fit in the L2 cache.)

I will give SMT cache blocking a try and implement it in a new branch (e.g. SmtCacheBlocking). The implementation has to be carefully designed as otherwise primesieve will run slower in some circumstances:

  • In single-thread mode, SMT cache blocking must be disabled.
  • Above 1e18 SMT cache blocking must be disabled if L2 cache size < 1024 KB.
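
The two conditions above could be expressed roughly like this. It is a sketch of the decision logic only, not the planned implementation: the function names are hypothetical, and interpreting "above 1e18" as the upper bound of the sieving interval (`stop`) is an assumption.

```python
L2_THRESHOLD_KB = 1024      # from the rule above: "L2 cache size < 1024 KB"
LARGE_RANGE_LIMIT = 10**18  # from the rule above: "above 1e18"


def use_smt_cache_blocking(threads: int, stop: int, l2_kb: int) -> bool:
    """Decide whether cache sizes should be divided by the SMT thread count."""
    if threads <= 1:
        return False  # single-thread mode: keep the full cache size
    if stop > LARGE_RANGE_LIMIT and l2_kb < L2_THRESHOLD_KB:
        return False  # L2 too small once shared, per the pitfall described above
    return True


def large_sieve_kb(l2_kb: int, threads_per_core: int, threads: int, stop: int) -> int:
    """L2-based sieve size in KB, divided per SMT thread only when it is safe."""
    if use_smt_cache_blocking(threads, stop, l2_kb):
        return l2_kb // max(1, threads_per_core)
    return l2_kb


# Example values: 256 KB L2, 2 SMT threads per core, 8 threads in use.
print(large_sieve_kb(256, 2, 8, 10**12))  # blocking enabled
print(large_sieve_kb(256, 2, 8, 10**19))  # blocking disabled above 1e18
```

Keeping the decision in one predicate makes it easy to benchmark both code paths on the same machine.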

I will publish new benchmarks here when done.

antonblanchard Jul 26, 2018

Interesting results! I tried all combinations of SMT levels, L1, L2 and L2 private settings on POWER9 and will attach them.

antonblanchard Jul 26, 2018

# sudo ppc64_cpu --smt=1
# ./run.sh
L1=8k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.641
./primesieve 1e12  --dist=1e11:   2.106
./primesieve 1e15  --dist=1e11:   3.264
./primesieve 1e17  --dist=1e11:   5.355
./primesieve 1e19  --dist=1e11:  16.440
L1=8k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.699
./primesieve 1e12  --dist=1e11:   2.218
./primesieve 1e15  --dist=1e11:   3.335
./primesieve 1e17  --dist=1e11:   5.452
./primesieve 1e19  --dist=1e11:  16.416
L1=8k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.692
./primesieve 1e12  --dist=1e11:   2.105
./primesieve 1e15  --dist=1e11:   3.279
./primesieve 1e17  --dist=1e11:   5.344
./primesieve 1e19  --dist=1e11:  16.225
L1=16k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.449
./primesieve 1e12  --dist=1e11:   1.916
./primesieve 1e15  --dist=1e11:   3.067
./primesieve 1e17  --dist=1e11:   4.541
./primesieve 1e19  --dist=1e11:  12.206
L1=16k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.370
./primesieve 1e12  --dist=1e11:   2.010
./primesieve 1e15  --dist=1e11:   2.974
./primesieve 1e17  --dist=1e11:   4.691
./primesieve 1e19  --dist=1e11:  12.660
L1=16k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.470
./primesieve 1e12  --dist=1e11:   1.909
./primesieve 1e15  --dist=1e11:   3.015
./primesieve 1e17  --dist=1e11:   4.666
./primesieve 1e19  --dist=1e11:  12.227
L1=32k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.249
./primesieve 1e12  --dist=1e11:   1.848
./primesieve 1e15  --dist=1e11:   2.800
./primesieve 1e17  --dist=1e11:   4.008
./primesieve 1e19  --dist=1e11:  10.004
L1=32k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.234
./primesieve 1e12  --dist=1e11:   1.750
./primesieve 1e15  --dist=1e11:   3.015
./primesieve 1e17  --dist=1e11:   4.023
./primesieve 1e19  --dist=1e11:  10.145
L1=32k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.323
./primesieve 1e12  --dist=1e11:   1.705
./primesieve 1e15  --dist=1e11:   2.768
./primesieve 1e17  --dist=1e11:   4.124
./primesieve 1e19  --dist=1e11:  10.040
L1=8k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.319
./primesieve 1e12  --dist=1e11:   1.764
./primesieve 1e15  --dist=1e11:   2.705
./primesieve 1e17  --dist=1e11:   3.786
./primesieve 1e19  --dist=1e11:   8.647
L1=8k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.343
./primesieve 1e12  --dist=1e11:   1.602
./primesieve 1e15  --dist=1e11:   2.777
./primesieve 1e17  --dist=1e11:   3.860
./primesieve 1e19  --dist=1e11:   7.953
L1=8k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.298
./primesieve 1e12  --dist=1e11:   1.678
./primesieve 1e15  --dist=1e11:   2.713
./primesieve 1e17  --dist=1e11:   3.707
./primesieve 1e19  --dist=1e11:   7.947
L1=16k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.128
./primesieve 1e12  --dist=1e11:   1.626
./primesieve 1e15  --dist=1e11:   2.662
./primesieve 1e17  --dist=1e11:   3.820
./primesieve 1e19  --dist=1e11:   8.215
L1=16k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.230
./primesieve 1e12  --dist=1e11:   1.515
./primesieve 1e15  --dist=1e11:   2.560
./primesieve 1e17  --dist=1e11:   3.644
./primesieve 1e19  --dist=1e11:   8.352
L1=16k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.278
./primesieve 1e12  --dist=1e11:   1.511 # overall fastest
./primesieve 1e15  --dist=1e11:   2.605
./primesieve 1e17  --dist=1e11:   3.542 # overall fastest
./primesieve 1e19  --dist=1e11:   7.725
L1=32k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.124 # overall fastest
./primesieve 1e12  --dist=1e11:   1.704
./primesieve 1e15  --dist=1e11:   2.608
./primesieve 1e17  --dist=1e11:   3.766
./primesieve 1e19  --dist=1e11:   8.507
L1=32k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.267
./primesieve 1e12  --dist=1e11:   1.577
./primesieve 1e15  --dist=1e11:   2.581
./primesieve 1e17  --dist=1e11:   3.564
./primesieve 1e19  --dist=1e11:   7.702 # overall fastest
L1=32k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.236
./primesieve 1e12  --dist=1e11:   1.604
./primesieve 1e15  --dist=1e11:   2.506 # overall fastest
./primesieve 1e17  --dist=1e11:   3.673
./primesieve 1e19  --dist=1e11:   7.979

# sudo ppc64_cpu --smt=2
# ./run.sh
L1=8k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.786
./primesieve 1e12  --dist=1e11:   1.913
./primesieve 1e15  --dist=1e11:   2.970
./primesieve 1e17  --dist=1e11:   5.815
./primesieve 1e19  --dist=1e11:  21.917
L1=8k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.565
./primesieve 1e12  --dist=1e11:   1.893
./primesieve 1e15  --dist=1e11:   2.937
./primesieve 1e17  --dist=1e11:   5.757
./primesieve 1e19  --dist=1e11:  21.576
L1=8k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.501
./primesieve 1e12  --dist=1e11:   1.789
./primesieve 1e15  --dist=1e11:   3.007
./primesieve 1e17  --dist=1e11:   5.815
./primesieve 1e19  --dist=1e11:  22.688
L1=16k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.444
./primesieve 1e12  --dist=1e11:   1.692
./primesieve 1e15  --dist=1e11:   2.741
./primesieve 1e17  --dist=1e11:   4.868
./primesieve 1e19  --dist=1e11:  16.216
L1=16k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.485
./primesieve 1e12  --dist=1e11:   1.775
./primesieve 1e15  --dist=1e11:   2.842
./primesieve 1e17  --dist=1e11:   4.962
./primesieve 1e19  --dist=1e11:  15.371
L1=16k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.360
./primesieve 1e12  --dist=1e11:   1.771
./primesieve 1e15  --dist=1e11:   3.003
./primesieve 1e17  --dist=1e11:   4.835
./primesieve 1e19  --dist=1e11:  15.954
L1=32k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.450
./primesieve 1e12  --dist=1e11:   1.906
./primesieve 1e15  --dist=1e11:   2.968
./primesieve 1e17  --dist=1e11:   4.520
./primesieve 1e19  --dist=1e11:  13.229
L1=32k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.475
./primesieve 1e12  --dist=1e11:   1.909
./primesieve 1e15  --dist=1e11:   2.787
./primesieve 1e17  --dist=1e11:   5.062
./primesieve 1e19  --dist=1e11:  12.962
L1=32k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.461
./primesieve 1e12  --dist=1e11:   1.921
./primesieve 1e15  --dist=1e11:   2.908
./primesieve 1e17  --dist=1e11:   4.381
./primesieve 1e19  --dist=1e11:  12.980
L1=8k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.263
./primesieve 1e12  --dist=1e11:   1.726
./primesieve 1e15  --dist=1e11:   2.708
./primesieve 1e17  --dist=1e11:   4.027
./primesieve 1e19  --dist=1e11:  10.173
L1=8k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.403
./primesieve 1e12  --dist=1e11:   1.718
./primesieve 1e15  --dist=1e11:   2.740
./primesieve 1e17  --dist=1e11:   4.294
./primesieve 1e19  --dist=1e11:  10.012
L1=8k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.619
./primesieve 1e12  --dist=1e11:   1.905
./primesieve 1e15  --dist=1e11:   3.018
./primesieve 1e17  --dist=1e11:   4.332
./primesieve 1e19  --dist=1e11:  10.420
L1=16k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.283
./primesieve 1e12  --dist=1e11:   1.759
./primesieve 1e15  --dist=1e11:   2.587
./primesieve 1e17  --dist=1e11:   4.295
./primesieve 1e19  --dist=1e11:  10.263
L1=16k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.444
./primesieve 1e12  --dist=1e11:   1.661
./primesieve 1e15  --dist=1e11:   2.848
./primesieve 1e17  --dist=1e11:   4.055
./primesieve 1e19  --dist=1e11:  10.121
L1=16k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.645
./primesieve 1e12  --dist=1e11:   1.987
./primesieve 1e15  --dist=1e11:   2.923
./primesieve 1e17  --dist=1e11:   4.125
./primesieve 1e19  --dist=1e11:  10.009
L1=32k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.517
./primesieve 1e12  --dist=1e11:   1.674
./primesieve 1e15  --dist=1e11:   2.897
./primesieve 1e17  --dist=1e11:   4.119
./primesieve 1e19  --dist=1e11:  10.635
L1=32k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.597
./primesieve 1e12  --dist=1e11:   1.855
./primesieve 1e15  --dist=1e11:   2.846
./primesieve 1e17  --dist=1e11:   4.252
./primesieve 1e19  --dist=1e11:  10.096
L1=32k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.641
./primesieve 1e12  --dist=1e11:   1.933
./primesieve 1e15  --dist=1e11:   2.978
./primesieve 1e17  --dist=1e11:   4.221
./primesieve 1e19  --dist=1e11:  10.199

# sudo ppc64_cpu --smt=4
# ./run.sh
L1=8k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.871
./primesieve 1e12  --dist=1e11:   2.341
./primesieve 1e15  --dist=1e11:   3.729
./primesieve 1e17  --dist=1e11:   7.558
./primesieve 1e19  --dist=1e11:  39.200
L1=8k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.875
./primesieve 1e12  --dist=1e11:   2.460
./primesieve 1e15  --dist=1e11:   3.736
./primesieve 1e17  --dist=1e11:   7.747
./primesieve 1e19  --dist=1e11:  42.139
L1=8k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.968
./primesieve 1e12  --dist=1e11:   2.431
./primesieve 1e15  --dist=1e11:   3.734
./primesieve 1e17  --dist=1e11:   7.668
./primesieve 1e19  --dist=1e11:  39.941
L1=16k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.822
./primesieve 1e12  --dist=1e11:   2.271
./primesieve 1e15  --dist=1e11:   3.692
./primesieve 1e17  --dist=1e11:   6.424
./primesieve 1e19  --dist=1e11:  25.715
L1=16k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.737
./primesieve 1e12  --dist=1e11:   2.410
./primesieve 1e15  --dist=1e11:   3.535
./primesieve 1e17  --dist=1e11:   6.459
./primesieve 1e19  --dist=1e11:  25.787
L1=16k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.955
./primesieve 1e12  --dist=1e11:   2.250
./primesieve 1e15  --dist=1e11:   3.564
./primesieve 1e17  --dist=1e11:   6.622
./primesieve 1e19  --dist=1e11:  25.313
L1=32k L2=128k private_l2=0
./primesieve 1e0  --dist=1e11:    1.821
./primesieve 1e12  --dist=1e11:   2.306
./primesieve 1e15  --dist=1e11:   3.480
./primesieve 1e17  --dist=1e11:   5.988
./primesieve 1e19  --dist=1e11:  18.983
L1=32k L2=256k private_l2=0
./primesieve 1e0  --dist=1e11:    1.791
./primesieve 1e12  --dist=1e11:   2.308
./primesieve 1e15  --dist=1e11:   3.540
./primesieve 1e17  --dist=1e11:   6.027
./primesieve 1e19  --dist=1e11:  19.278
L1=32k L2=512k private_l2=0
./primesieve 1e0  --dist=1e11:    1.781
./primesieve 1e12  --dist=1e11:   2.225
./primesieve 1e15  --dist=1e11:   3.489
./primesieve 1e17  --dist=1e11:   6.021
./primesieve 1e19  --dist=1e11:  19.015
L1=8k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.489 # smt4 fastest
./primesieve 1e12  --dist=1e11:   2.039 # smt4 fastest
./primesieve 1e15  --dist=1e11:   3.399
./primesieve 1e17  --dist=1e11:   5.326 # smt4 fastest
./primesieve 1e19  --dist=1e11:  14.308 # smt4 fastest
L1=8k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.804
./primesieve 1e12  --dist=1e11:   2.078
./primesieve 1e15  --dist=1e11:   3.524
./primesieve 1e17  --dist=1e11:   5.444
./primesieve 1e19  --dist=1e11:  14.416
L1=8k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.781
./primesieve 1e12  --dist=1e11:   2.111
./primesieve 1e15  --dist=1e11:   3.360 # smt4 fastest
./primesieve 1e17  --dist=1e11:   5.509
./primesieve 1e19  --dist=1e11:  14.707
L1=16k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.576
./primesieve 1e12  --dist=1e11:   2.092
./primesieve 1e15  --dist=1e11:   3.370
./primesieve 1e17  --dist=1e11:   5.483
./primesieve 1e19  --dist=1e11:  14.596
L1=16k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.852
./primesieve 1e12  --dist=1e11:   2.117
./primesieve 1e15  --dist=1e11:   3.506
./primesieve 1e17  --dist=1e11:   5.557
./primesieve 1e19  --dist=1e11:  14.471
L1=16k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.807
./primesieve 1e12  --dist=1e11:   2.346
./primesieve 1e15  --dist=1e11:   3.563
./primesieve 1e17  --dist=1e11:   5.559
./primesieve 1e19  --dist=1e11:  14.578
L1=32k L2=128k private_l2=1
./primesieve 1e0  --dist=1e11:    1.681
./primesieve 1e12  --dist=1e11:   2.118
./primesieve 1e15  --dist=1e11:   3.427
./primesieve 1e17  --dist=1e11:   5.582
./primesieve 1e19  --dist=1e11:  15.309
L1=32k L2=256k private_l2=1
./primesieve 1e0  --dist=1e11:    1.764
./primesieve 1e12  --dist=1e11:   2.185
./primesieve 1e15  --dist=1e11:   3.518
./primesieve 1e17  --dist=1e11:   5.657
./primesieve 1e19  --dist=1e11:  14.983
L1=32k L2=512k private_l2=1
./primesieve 1e0  --dist=1e11:    1.863
./primesieve 1e12  --dist=1e11:   2.162
./primesieve 1e15  --dist=1e11:   3.613
./primesieve 1e17  --dist=1e11:   5.630
./primesieve 1e19  --dist=1e11:  15.007
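
The "# smt4 fastest" annotations in the log above can be cross-checked mechanically. The following is only a minimal sketch: it assumes the log format shown here (a configuration line such as `L1=8k L2=128k private_l2=1` followed by `./primesieve` timing lines), and the function name `fastest_per_start` and both regular expressions are made up for this illustration; they are not part of primesieve or of the `run.sh` script.

```python
import re

# Assumed formats, taken from the log lines shown above.
CONFIG_RE = re.compile(r"^L1=(\d+)k L2=(\d+)k private_l2=([01])$")
TIMING_RE = re.compile(r"^\./primesieve (\S+)\s+--dist=\S+:\s+([\d.]+)")


def fastest_per_start(log_text):
    """Map each start value (e.g. '1e19') to (seconds, config line) of its fastest run."""
    best = {}
    config = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if CONFIG_RE.match(line):
            config = line  # remember which cache settings the next timings belong to
            continue
        match = TIMING_RE.match(line)
        if match and config is not None:
            start, seconds = match.group(1), float(match.group(2))
            if start not in best or seconds < best[start][0]:
                best[start] = (seconds, config)
    return best


# Two configurations copied from the SMT4 run above.
sample = """\
L1=32k L2=512k private_l2=0
./primesieve 1e17  --dist=1e11:   6.021
./primesieve 1e19  --dist=1e11:  19.015
L1=8k L2=128k private_l2=1
./primesieve 1e17  --dist=1e11:   5.326 # smt4 fastest
./primesieve 1e19  --dist=1e11:  14.308 # smt4 fastest
"""

for start, (seconds, config) in sorted(fastest_per_start(sample).items()):
    print(f"{start}: {seconds:.3f}s with {config}")
```

Feeding the full output of one SMT mode into `fastest_per_start` should pick out the same lines that are annotated by hand, as long as the log keeps this layout.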


kimwalisch Jul 28, 2018

Owner

Interesting results! I tried all combinations of SMT levels, L1, L2 and L2 private settings on POWER9 and will attach them.

So far I have implemented SMT cache blocking for the L2 cache only because on Intel and AMD CPUs this gives the best overall performance. The code for this is on the SmtCacheBlocking branch.

I have also added a new --cpu-info command-line option which shows which CPU properties primesieve has been able to detect. The new, more advanced CPU detection code has already been implemented for Linux and macOS. Windows support will be implemented in the next few days.

But after reviewing your POWER9 benchmark results it seems like POWER9 would gain most from L1 cache blocking so I will rerun more L1 cache blocking benchmarks and then decide if I will also implement L1 cache blocking in primesieve.

@antonblanchard Do you know any cloud service provider that offers IBM POWER9 servers that I could use? For POWER8 I am using nimbix.net but they are currently not offering POWER9 servers (will be available in a few weeks though).
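
As a rough illustration of the kind of CPU detection discussed above, the cache sizes and the number of logical CPUs sharing each cache can be read from Linux sysfs. This is a sketch under stated assumptions, not primesieve's actual detection code: the helper names (`parse_cache_size`, `count_cpu_list`, `read_cpu0_caches`) are hypothetical, and only the standard sysfs files `type`, `level`, `size` and `shared_cpu_list` under `/sys/devices/system/cpu/cpu0/cache/index*` are assumed to exist.

```python
from pathlib import Path


def parse_cache_size(text):
    """Convert a sysfs cache size string such as '32K' or '1M' to bytes."""
    text = text.strip().upper()
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def count_cpu_list(text):
    """Count the logical CPUs in a sysfs CPU list such as '0-3' or '0,2,4-7'."""
    total = 0
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            total += int(high) - int(low) + 1
        elif part:
            total += 1
    return total


def read_cpu0_caches(base="/sys/devices/system/cpu/cpu0"):
    """Return {level: (size_bytes, sharing_cpus)} for cpu0's data and unified caches.

    Returns an empty dict when the sysfs cache directory is not available.
    """
    cache_dir = Path(base, "cache")
    if not cache_dir.is_dir():
        return {}
    caches = {}
    for index in sorted(cache_dir.glob("index*")):
        try:
            if (index / "type").read_text().strip() == "Instruction":
                continue  # only data and unified caches matter for the sieve array
            level = int((index / "level").read_text())
            size = parse_cache_size((index / "size").read_text())
            sharing = count_cpu_list((index / "shared_cpu_list").read_text())
        except (OSError, ValueError):
            continue  # skip entries that are missing or malformed
        caches[level] = (size, sharing)
    return caches


if __name__ == "__main__":
    for level, (size, sharing) in sorted(read_cpu0_caches().items()):
        print(f"L{level}: {size // 1024} KiB shared by {sharing} logical CPU(s)")
```

If the L2 entry reports more logical CPUs than one core has SMT threads, the L2 is shared between cores, which is presumably the situation the `private_l2=0` setting in the benchmarks is meant to cover.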


kimwalisch Aug 14, 2018

Owner

After running lots of benchmarks (and staring at the timings for far too long) I have finally found a new default sieve size that performs better across a wide variety of different CPUs.

Initially @antonblanchard had the idea to use sieveSize = (L2CacheSize / threadsPerCore) in order to reduce the number of L2 cache misses on CPUs with simultaneous multi-threading. Indeed this runs slightly faster on all Intel CPUs and on IBM POWER9 CPUs.

But then I found that using sieveSize = (L2CacheSize / 2) also ran faster on Intel CPUs without simultaneous multi-threading. Now I realised that if I use the entire L2 cache for the sieve array then other frequently used arrays and data structures will not fit into the L2 cache and they will have to be loaded from the slower L3 cache or even worse from main memory.

Hence in order to improve primesieve's cache efficiency it is best to use a sieve size that is slightly smaller than the L2 cache size so that other important data structures can also fit into the L2 cache. Since primesieve requires a sieve size that is a power of 2 the new sieve size is:

  • sieveSize = (L2CacheSize / 2)
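
The two sizing rules discussed in this comment can be compared with a few lines of Python. This is only a sketch: the function names are hypothetical, and rounding down to a power of two is an assumption made here to satisfy the power-of-2 requirement; it is not taken from primesieve's source.

```python
def floor_pow2(n):
    """Largest power of two that is <= n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)


def smt_sieve_size(l2_cache_bytes, threads_per_core):
    """Initial idea: divide the L2 cache among the SMT threads of one core."""
    return floor_pow2(l2_cache_bytes // max(threads_per_core, 1))


def default_sieve_size(l2_cache_bytes):
    """Rule described above: half of the L2 cache, leaving room for other data."""
    return floor_pow2(l2_cache_bytes // 2)


if __name__ == "__main__":
    l2 = 512 * 1024  # example value: a 512 KiB L2 cache
    for threads in (1, 2, 4):
        print(f"threads per core={threads}: "
              f"smt rule={smt_sieve_size(l2, threads) // 1024} KiB, "
              f"L2/2 rule={default_sieve_size(l2) // 1024} KiB")
```

With two threads per core both rules give the same size; they differ only for cores without SMT (where the first rule would use the whole L2) and for cores with more than two threads.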

This change has now been implemented in the master branch (along with improved CPU detection and a new --cpu-info command-line option).

Thanks @antonblanchard for the initial idea and the detailed IBM POWER9 benchmark data.

@kimwalisch kimwalisch closed this Aug 14, 2018