Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74
Should be ready for merge. |
thanks! |
I just noticed a regression on my Intel i7-8700 with that PR. The first benchmark executed with Is there a way to detect ZEN architecture with a macro? |
That's quite a hit :( I don't think there is a way to determine the target arch using macros. Maybe let's add something that users can define themselves? Where can I find this benchmark? |
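A user-defined switch like the one suggested here could look roughly like the sketch below. The macro `EXAMPLE_USE_TZCNT` and the function `example_count_trailing_zeros` are hypothetical names made up for illustration, not part of this PR. With GCC or clang, the `_tzcnt_u64` path typically also needs `-mbmi`.

```cpp
#include <cstdint>

// Hypothetical opt-in switch (an assumption, not the library's API):
// build with -DEXAMPLE_USE_TZCNT=1 to prefer tzcnt, e.g. on ZEN/ZEN2.
#ifndef EXAMPLE_USE_TZCNT
#    define EXAMPLE_USE_TZCNT 0
#endif

#if EXAMPLE_USE_TZCNT
#    include <immintrin.h> // _tzcnt_u64 (BMI1)
#elif defined(_MSC_VER)
#    include <intrin.h> // _BitScanForward64 (x64 only)
#endif

// Returns the index of the lowest set bit. The caller must guarantee
// x != 0, because the bsf-based paths are undefined for zero input.
inline int example_count_trailing_zeros(std::uint64_t x) {
#if EXAMPLE_USE_TZCNT
    return static_cast<int>(_tzcnt_u64(x));
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward64(&idx, x);
    return static_cast<int>(idx);
#else
    return __builtin_ctzll(x); // GCC/clang builtin
#endif
}
```

Defaulting the switch to off keeps the existing behaviour for everyone who does not ask for tzcnt, so Intel users would not see the regression unless they opt in.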
I compiled this repository with this:
This executes a few benchmarks. The first one relies heavily on the CLZ operation, which is done in the |
These are my results on a Ryzen 3900X:
gcc on WSL:
bsf:
tzcnt:
MSVC:
bsf:
tzcnt:
This is very interesting. I was originally looking only at the MSVC results (this is what I can currently profile). It seems that |
Can you update and try this benchmark again? I've fixed this regression for me. Most of the time the whole 8 bytes are empty, so I don't always need to do the CLZ. |
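The idea of skipping the bit scan while the whole 8-byte word is empty can be sketched as follows. This is only an illustration under stated assumptions, not the actual change: `example_first_nonzero_byte` is a hypothetical helper, the byte array is assumed to end with a non-zero sentinel so the loop terminates, the target is assumed to be little-endian, and `__builtin_ctzll` is a GCC/clang builtin.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: index of the first non-zero byte at or after pos.
// Assumes a non-zero sentinel byte exists before the end of the buffer
// and that every 8-byte read stays inside the buffer.
inline std::size_t example_first_nonzero_byte(const std::uint8_t* info, std::size_t pos) {
    std::uint64_t word;
    std::memcpy(&word, info + pos, sizeof(word));

    // Cheap comparison first: fully empty words never reach the bit scan.
    while (word == 0) {
        pos += sizeof(word);
        std::memcpy(&word, info + pos, sizeof(word));
    }

    // Only now count trailing zero bits. On little-endian targets,
    // dividing the bit index by 8 gives the byte offset inside the word.
    return pos + static_cast<std::size_t>(__builtin_ctzll(word)) / 8;
}
```

Because the word is known to be non-zero when the scan runs, the undefined zero-input case of bsf never occurs on this path.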
gcc:
MSVC:
Is the speed decrease in other tests caused by the new hash? |
Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency? How did you compile with gcc? |
Some more thoughts on that. I have checked While this is true when the function is looked at in isolation, putting things into more open context may open up some more optimization possibilities. clang is very good at it, MSVC is very bad at it. I think that doing the zero check beforehand could be one of these optimizations, which the compiler would do in
No. Didn't see a significant difference over multiple runs.
Using the exact commands you pasted above. |
I've again replaced the hash, this time with hardware CRC. On my PC it's about the same speed as the old hash. I think the benchmarks are not optimal though. |
I think you forgot to push the changes. Also, have you considered using xxh3? There's a very nice discussion of speed and latency of the algorithm, even with small inputs: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html |
xxh3 is excellent, but pretty slow for just mixing 64-bit values. I've done some work on finding a fast mixer here: https://github.com/martinus/better-faster-stronger-mixer |
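For readers unfamiliar with what "just mixing 64-bit values" means, here is a minimal sketch of such a mixer. It uses the well-known MurmurHash3 64-bit finalizer purely as an example. It is not the mixer from the linked repository or the hash used in this PR, and `example_mix64` is a hypothetical name.

```cpp
#include <cstdint>

// MurmurHash3's 64-bit finalizer (fmix64), shown as one example of a
// cheap mixer: a few shifts, xors and multiplies, no loop, no state.
inline std::uint64_t example_mix64(std::uint64_t h) {
    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;
    h *= UINT64_C(0xc4ceb9fe1a85ec53);
    h ^= h >> 33;
    return h;
}
```

Each step is invertible, so distinct inputs always produce distinct outputs. Note that zero maps to zero with this particular finalizer.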
gcc:
MSVC doesn't compile.
Details in #73.