Skip to content
This repository has been archived by the owner on Dec 26, 2023. It is now read-only.

Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

Merged
merged 1 commit into from
Jun 18, 2020
Merged

Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

merged 1 commit into from
Jun 18, 2020

Conversation

wolfpld
Copy link
Contributor

@wolfpld wolfpld commented Jun 18, 2020

Details in #73.

@wolfpld
Copy link
Contributor Author

wolfpld commented Jun 18, 2020

Should be ready for merge.

@martinus
Copy link
Owner

thanks!

@martinus martinus merged commit f2fa128 into martinus:master Jun 18, 2020
@martinus
Copy link
Owner

I just noticed a regression on my Intel i7-8700 with that PR. The first benchmark executed with -tc=bench_quick_overall_map_flat, which does mostly iterating, has degraded from 11.33 op/sec to 7.62 op/sec. So it seems intel is slowed down quite a bit with _tzcnt_u64.

Is there a way to detect ZEN architecture with a macro?

@wolfpld
Copy link
Contributor Author

wolfpld commented Jun 20, 2020

That's quite a hit :( I don't think there is a way to determine target arch using macros. Maybe let's make something that user can define themselves?

Where can you find this benchmark?

@martinus
Copy link
Owner

I compiled this repository with this:

cmake -G Ninja -DCMAKE_CXX_FLAGS="-O3 -march=native" -DRH_cxx_standard=17 -DCMAKE_BUILD_TYPE=Release ../..
ninja
./rh -ns -tc=bench_quick_overall_map_flat

This executes a few benchmarks, the first one relies heavily on the CLZ operation., which is done in the fastForward()

@wolfpld
Copy link
Contributor Author

wolfpld commented Jun 20, 2020

These are my results on Ryzen 3900X:

gcc on wsl:

bsf:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       52,786,200.00 |               18.94 |    0.2% |      0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       83,493,200.00 |               11.98 |    0.2% |      0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       48,770,500.00 |               20.50 |    0.2% |      0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       56,677,300.00 |               17.64 |    0.1% |      0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      444,000,500.00 |                2.25 |    0.1% |      4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      370,457,500.00 |                2.70 |    0.1% |      4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.112282

tzcnt:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       81,490,500.00 |               12.27 |    0.1% |      0.90 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       83,758,900.00 |               11.94 |    0.2% |      0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       49,213,700.00 |               20.32 |    0.1% |      0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       88,954,600.00 |               11.24 |    0.2% |      0.98 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      452,119,900.00 |                2.21 |    0.1% |      4.97 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      371,017,100.00 |                2.70 |    0.1% |      4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.13082

MSVC:

bsf:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|      104,920,500.00 |                9.53 |    0.2% |      1.15 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|       77,855,300.00 |               12.84 |    0.1% |      0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       62,434,500.00 |               16.02 |    0.1% |      0.69 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|      109,994,600.00 |                9.09 |    0.1% |      1.21 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      587,536,800.00 |                1.70 |    1.0% |      6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      410,914,400.00 |                2.43 |    0.3% |      4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.154391

tzcnt:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       86,302,500.00 |               11.59 |    0.1% |      0.95 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|       78,075,200.00 |               12.81 |    0.2% |      0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       62,216,300.00 |               16.07 |    0.1% |      0.68 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|       90,891,800.00 |               11.00 |    0.2% |      1.00 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      602,306,000.00 |                1.66 |    1.0% |      6.60 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      408,184,100.00 |                2.45 |    0.0% |      4.49 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.145191

This is very interesting. I was originally looking only at the MSVC results (this is what I can currently profile). It seems that __builtin_ctzll is doing something completely different, which is worthy of investigation and maybe backporting for MSVC.

@martinus
Copy link
Owner

Can you update and try this benchmark again? I've fixed this regression for me. Most of the time the whole 8 bytes are empty, so I don't always need to do the CLZ.

@wolfpld
Copy link
Contributor Author

wolfpld commented Jun 20, 2020

gcc:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       52,687,700.00 |               18.98 |    0.1% |      0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|      130,305,400.00 |                7.67 |    0.2% |      1.43 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       62,951,900.00 |               15.89 |    0.1% |      0.69 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       56,682,300.00 |               17.64 |    0.2% |      0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      444,239,600.00 |                2.25 |    0.0% |      4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      365,043,000.00 |                2.74 |    0.3% |      4.02 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.125849

MSVC:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       61,015,900.00 |               16.39 |    0.2% |      0.67 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|      135,066,500.00 |                7.40 |    0.9% |      1.49 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       89,823,200.00 |               11.13 |    0.1% |      0.99 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|       65,740,500.00 |               15.21 |    0.2% |      0.72 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      586,923,200.00 |                1.70 |    0.6% |      6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      409,865,900.00 |                2.44 |    0.1% |      4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.150686

Is the speed decrease in other tests caused by the new hash?

@martinus
Copy link
Owner

Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency? How did you compile with gcc?

@wolfpld
Copy link
Contributor Author

wolfpld commented Jun 20, 2020

Some more thoughts on that. I have checked fastForward() in compiler explorer and the code generated on gcc/clang is identical, regardless if _tzcnt_u64, or __builtin_ctzll is called. On MSVC the code differs.

While this is true when the function is looked at in isolation, putting things into more open context may open up some more optimization possibilities. clang is very good at it, MSVC is very bad at it. I think that doing the zero check beforehand could be one of these optimizations, which the compiler would do in __builtin_ctzll case, but didn't in the _tzcnt_u64 one. Overall the change is good, as it improves performance on MSVC.

Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency?

No. Didn't see a significant difference over multiple runs.

How did you compile with gcc?

Using the exact commands you pasted above.

@martinus
Copy link
Owner

I've again replaced the hash, this time with hardware CRC. On my PC it's about the same speed as the old hash. I think the benchmarks are not optimal though.

@wolfpld
Copy link
Contributor Author

wolfpld commented Jun 21, 2020

I think you forgot to push the changes.

Also, have you considered using xxh3? There's a very nice discussion of speed and latency of the algorithm, even with small inputs: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

@martinus
Copy link
Owner

xxh3 excellent, but pretty slow for just mixing 64bit values. I've done some work on finding a fast mixer here: https://github.com/martinus/better-faster-stronger-mixer

@wolfpld
Copy link
Contributor Author

wolfpld commented Jun 21, 2020

gcc:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       56,421,100.00 |               17.72 |    0.3% |      0.62 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       93,140,400.00 |               10.74 |    0.1% |      1.03 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       60,295,500.00 |               16.58 |    0.2% |      0.66 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       57,060,000.00 |               17.53 |    0.3% |      0.63 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      454,988,800.00 |                2.20 |    0.2% |      5.01 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      369,313,600.00 |                2.71 |    0.2% |      4.06 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.120346

MSVC doesn't compile.

__SSE4_2__ is not defined by MSVC.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants