Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

wolfpld · 2020-06-18T09:45:32Z

Details in #73.

wolfpld · 2020-06-18T11:03:39Z

Should be ready for merge.

martinus · 2020-06-18T11:31:24Z

thanks!

martinus · 2020-06-20T15:41:27Z

I just noticed a regression on my Intel i7-8700 with that PR. The first benchmark executed with -tc=bench_quick_overall_map_flat, which does mostly iterating, has degraded from 11.33 op/sec to 7.62 op/sec. So it seems intel is slowed down quite a bit with _tzcnt_u64.

Is there a way to detect ZEN architecture with a macro?

wolfpld · 2020-06-20T16:03:08Z

That's quite a hit :( I don't think there is a way to determine target arch using macros. Maybe let's make something that user can define themselves?

Where can you find this benchmark?

martinus · 2020-06-20T18:38:10Z

I compiled this repository with this:

cmake -G Ninja -DCMAKE_CXX_FLAGS="-O3 -march=native" -DRH_cxx_standard=17 -DCMAKE_BUILD_TYPE=Release ../..
ninja
./rh -ns -tc=bench_quick_overall_map_flat

This executes a few benchmarks, the first one relies heavily on the CLZ operation., which is done in the fastForward()

wolfpld · 2020-06-20T19:00:49Z

These are my results on Ryzen 3900X:

gcc on wsl:

bsf:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       52,786,200.00 |               18.94 |    0.2% |      0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       83,493,200.00 |               11.98 |    0.2% |      0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       48,770,500.00 |               20.50 |    0.2% |      0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       56,677,300.00 |               17.64 |    0.1% |      0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      444,000,500.00 |                2.25 |    0.1% |      4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      370,457,500.00 |                2.70 |    0.1% |      4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.112282

tzcnt:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       81,490,500.00 |               12.27 |    0.1% |      0.90 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       83,758,900.00 |               11.94 |    0.2% |      0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       49,213,700.00 |               20.32 |    0.1% |      0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       88,954,600.00 |               11.24 |    0.2% |      0.98 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      452,119,900.00 |                2.21 |    0.1% |      4.97 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      371,017,100.00 |                2.70 |    0.1% |      4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.13082

MSVC:

bsf:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|      104,920,500.00 |                9.53 |    0.2% |      1.15 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|       77,855,300.00 |               12.84 |    0.1% |      0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       62,434,500.00 |               16.02 |    0.1% |      0.69 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|      109,994,600.00 |                9.09 |    0.1% |      1.21 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      587,536,800.00 |                1.70 |    1.0% |      6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      410,914,400.00 |                2.43 |    0.3% |      4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.154391

tzcnt:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       86,302,500.00 |               11.59 |    0.1% |      0.95 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|       78,075,200.00 |               12.81 |    0.2% |      0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       62,216,300.00 |               16.07 |    0.1% |      0.68 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|       90,891,800.00 |               11.00 |    0.2% |      1.00 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      602,306,000.00 |                1.66 |    1.0% |      6.60 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      408,184,100.00 |                2.45 |    0.0% |      4.49 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.145191

This is very interesting. I was originally looking only at the MSVC results (this is what I can currently profile). It seems that __builtin_ctzll is doing something completely different, which is worthy of investigation and maybe backporting for MSVC.

martinus · 2020-06-20T19:17:11Z

Can you update and try this benchmark again? I've fixed this regression for me. Most of the time the whole 8 bytes are empty, so I don't always need to do the CLZ.

wolfpld · 2020-06-20T19:27:49Z

gcc:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       52,687,700.00 |               18.98 |    0.1% |      0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|      130,305,400.00 |                7.67 |    0.2% |      1.43 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       62,951,900.00 |               15.89 |    0.1% |      0.69 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       56,682,300.00 |               17.64 |    0.2% |      0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      444,239,600.00 |                2.25 |    0.0% |      4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      365,043,000.00 |                2.74 |    0.3% |      4.02 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.125849

MSVC:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       61,015,900.00 |               16.39 |    0.2% |      0.67 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|      135,066,500.00 |                7.40 |    0.9% |      1.49 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       89,823,200.00 |               11.13 |    0.1% |      0.99 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|       65,740,500.00 |               15.21 |    0.2% |      0.72 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      586,923,200.00 |                1.70 |    0.6% |      6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      409,865,900.00 |                2.44 |    0.1% |      4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.150686

Is the speed decrease in other tests caused by the new hash?

martinus · 2020-06-20T19:44:38Z

Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency? How did you compile with gcc?

wolfpld · 2020-06-20T19:53:04Z

Some more thoughts on that. I have checked fastForward() in compiler explorer and the code generated on gcc/clang is identical, regardless if _tzcnt_u64, or __builtin_ctzll is called. On MSVC the code differs.

While this is true when the function is looked at in isolation, putting things into more open context may open up some more optimization possibilities. clang is very good at it, MSVC is very bad at it. I think that doing the zero check beforehand could be one of these optimizations, which the compiler would do in __builtin_ctzll case, but didn't in the _tzcnt_u64 one. Overall the change is good, as it improves performance on MSVC.

Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency?

No. Didn't see a significant difference over multiple runs.

How did you compile with gcc?

Using the exact commands you pasted above.

martinus · 2020-06-21T11:36:44Z

I've again replaced the hash, this time with hardware CRC. On my PC it's about the same speed as the old hash. I think the benchmarks are not optimal though.

wolfpld · 2020-06-21T11:42:55Z

I think you forgot to push the changes.

Also, have you considered using xxh3? There's a very nice discussion of speed and latency of the algorithm, even with small inputs: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

martinus · 2020-06-21T13:38:08Z

xxh3 excellent, but pretty slow for just mixing 64bit values. I've done some work on finding a fast mixer here: https://github.com/martinus/better-faster-stronger-mixer

wolfpld · 2020-06-21T13:50:53Z

gcc:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       56,421,100.00 |               17.72 |    0.3% |      0.62 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       93,140,400.00 |               10.74 |    0.1% |      1.03 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       60,295,500.00 |               16.58 |    0.2% |      0.66 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       57,060,000.00 |               17.53 |    0.3% |      0.63 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      454,988,800.00 |                2.20 |    0.2% |      5.01 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      369,313,600.00 |                2.71 |    0.2% |      4.06 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.120346

MSVC doesn't compile.

__SSE4_2__ is not defined by MSVC.

Use tzcnt instead of bsf for better performance on ZEN/ZEN2.

4b990cd

martinus merged commit f2fa128 into martinus:master Jun 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

wolfpld commented Jun 18, 2020

wolfpld commented Jun 18, 2020

martinus commented Jun 18, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 21, 2020

wolfpld commented Jun 21, 2020

martinus commented Jun 21, 2020

wolfpld commented Jun 21, 2020

Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

Conversation

wolfpld commented Jun 18, 2020

wolfpld commented Jun 18, 2020

martinus commented Jun 18, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 20, 2020

wolfpld commented Jun 20, 2020

martinus commented Jun 21, 2020

wolfpld commented Jun 21, 2020

martinus commented Jun 21, 2020

wolfpld commented Jun 21, 2020