rejigger bit counting intrinsics #898
Fix #867 Optimize software fallback routines. Delete some faulty (and dead?) MSVC big endian code.
Could you publish some benchmark and/or profile results?
The new little endian fallback is unlikely to show up in any "fair" benchmark. Static analysis (https://godbolt.org/z/cYqxTW) shows the new proposed method to have lower latency.
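For readers who cannot open the godbolt link, here is a minimal sketch of one branchless, table-free way to count the trailing zero bytes of a 64-bit value on a little endian machine. It illustrates the general multiply-based technique and is not necessarily the exact code proposed in this PR; the function name `trailing_zero_bytes64` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the PR.
 * Returns the number of zero bytes below the lowest non-zero byte of val.
 * When val is the XOR of two little endian loads, this is the number of
 * leading bytes (in memory order) that the two inputs have in common.
 * val must be non-zero. */
static unsigned trailing_zero_bytes64(uint64_t val)
{
    const uint64_t m = 0x0101010101010101ULL;
    assert(val != 0);
    val ^= val - 1;   /* set every bit up to and including the lowest set bit */
    val &= m - 1;     /* keep only bit 0 of bytes 1..7 */
    /* Multiplying by m sums those seven bits into the top byte (no carries,
     * since the sum is at most 7). */
    return (unsigned)((val * m) >> 56);
}
```

After the XOR step, bit 0 of byte k is set exactly when bytes 0 to k-1 are all zero, so the sum equals the trailing zero byte count. For instance, `trailing_zero_bytes64(0x0100)` evaluates to 1.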
LLVM-MCA assigns a latency of 5 cycles to the table lookup. I have no clue how a typical big endian machine behaves. The proposed 64-bit big endian method would be an "obvious" win. The proposed 32-bit big endian method has good reciprocal throughput - when unrolled on x64 - but |
Analysis is fine. This could be complemented with more dynamic measurements, with a tool like
This part of the code, counting bytes to extract the match length, is among the hottest ones. If I remember correctly, the difference between the hardware and software code paths for byte count was measurable. Such a measurement would also provide a way to ensure that the code is indeed correct.
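To make the role of the byte count concrete, here is a rough sketch of a word-at-a-time match length loop. It is an illustration only, not the library's actual code: `match_length` and `common_bytes_ref` are hypothetical names, and the byte counter is a deliberately slow reference that inspects bytes in memory order, so it gives the same answer on little and big endian machines. A fast routine could be checked against it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference counter (hypothetical name): number of leading zero bytes of
 * diff in memory order. Slow but endian-independent. */
static unsigned common_bytes_ref(uint64_t diff)
{
    unsigned char b[sizeof diff];
    unsigned n = 0;
    memcpy(b, &diff, sizeof diff);
    while (n < sizeof diff && b[n] == 0) n++;
    return n;
}

/* Length of the common prefix of a and b, at most limit bytes.
 * Compares 8 bytes at a time; the byte count runs once per match,
 * on the first word that differs. */
static size_t match_length(const unsigned char *a, const unsigned char *b,
                           size_t limit)
{
    size_t n = 0;
    while (n + 8 <= limit) {
        uint64_t x, y;
        memcpy(&x, a + n, 8);   /* unaligned-safe loads */
        memcpy(&y, b + n, 8);
        if (x != y) return n + common_bytes_ref(x ^ y);
        n += 8;
    }
    while (n < limit && a[n] == b[n]) n++;   /* byte-wise tail */
    return n;
}
```

Since the counter is called at the end of every match, its latency sits directly on the hot path, which is why a measurable difference between code paths is plausible.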
There is no such thing. If the benefit seems good enough, this code can be merged. Also, sometimes, some of these "improvements" introduce other concerns, such as complexity, code size or portability limitations.
make CFLAGS=-DLZ4_FORCE_SW_BITCOUNT LZ4_NbCommonBytes |
This looks good. For the big-endian one, Hence I was wondering if an alternative using less table space would be achievable?
MSVC debug mode complains
Nothing good comes to mind. The table size could be halved at the cost of 3 instructions (1 on the critical path):
Integrating into a feature branch, |