- Naively checking each bit
- Lookup table for each byte
- SWAR bit counting
I tested three different bit-counting strategies: a naive count of each bit, a lookup table for each byte, and SWAR bit counting (both sketched below). The lookup table outperformed SWAR bit counting by a small margin, and, as expected, the naive strategy performed significantly worse than the other two.
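For reference, here is roughly what the latter two strategies look like. This is a sketch rather than the exact code I benchmarked; the function and table names are illustrative:

```c
#include <stdint.h>

/* Lookup table: precompute the bit count of every byte once, then
 * sum the counts of the four bytes of the word. */
static uint8_t popcount_table[256];

static void init_popcount_table(void) {
    for (int i = 1; i < 256; ++i)
        popcount_table[i] = popcount_table[i >> 1] + (i & 1);
}

static uint8_t count_bits_lut(uint32_t n) {
    return popcount_table[n & 0xFF]
         + popcount_table[(n >> 8) & 0xFF]
         + popcount_table[(n >> 16) & 0xFF]
         + popcount_table[n >> 24];
}

/* SWAR: the classic branchless popcount that adds neighboring bit
 * fields in parallel, then sums the per-byte counts with a multiply. */
static uint8_t count_bits_swar(uint32_t n) {
    n = n - ((n >> 1) & 0x55555555);                /* 2-bit sums */
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333); /* 4-bit sums */
    n = (n + (n >> 4)) & 0x0F0F0F0F;                /* 8-bit sums */
    return (uint8_t)((n * 0x01010101) >> 24);       /* total lands in top byte */
}
```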
For the first time, I was bitten by branch prediction in the naive implementation of a bit counter.
The first implementation was the following:
```c
uint8_t count_bits(uint32_t n) {
    uint8_t counter = 0;
    for (; n; n >>= 1)
        if (n & 1)
            ++counter;
    return counter;
}
```
This proved to process a sorted array much faster than an unsorted one. A bit of googling took me to this SO question, where branch prediction is identified as the culprit.
Linux's `perf stat` tool indeed showed a higher branch-miss rate for the unsorted case:
| Case | Branch-miss rate |
|---|---|
| Sorted array | 0.01% |
| Constant array | 0.07% |
| Shuffled array | 0.18% |
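For context, a minimal harness along these lines reproduces the effect; the array size, timing method, and `perf` events are illustrative rather than my exact setup:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)

uint8_t count_bits(uint32_t n); /* implementation under test */

static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

int main(void) {
    uint32_t *data = malloc(N * sizeof *data);
    for (size_t i = 0; i < N; ++i)
        data[i] = (uint32_t)rand();        /* shuffled case */
    qsort(data, N, sizeof *data, cmp_u32); /* sorted case; skip for shuffled */

    uint64_t total = 0; /* accumulate so the work isn't optimized away */
    clock_t t0 = clock();
    for (size_t i = 0; i < N; ++i)
        total += count_bits(data[i]);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("total = %llu, time = %.3f s\n", (unsigned long long)total, secs);
    free(data);
    return 0;
}
```

Running the binary under something like `perf stat -e branches,branch-misses ./bench` then reports the branch-miss rates.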
I tried to fight branch misprediction with the following:
```c
uint8_t count_bits(uint32_t n) {
    uint8_t counter = 0;
    for (; n; n >>= 1)
        counter += (n & 1);
    return counter;
}
```
But much to my surprise, this didn't solve the discrepancy in processing time between sorted and unsorted arrays of `uint32_t`; `perf stat` yields exactly the same results. In fact, branch prediction also happens in the `for` loop whenever the termination condition `n` is evaluated: the loop runs for a data-dependent number of iterations, so its exit branch is just as unpredictable as the `if` was.
The third attempt was:
```c
uint8_t count_bits(uint32_t n) {
    uint8_t counter = 0;
    for (int i = 0; i < 32; ++i, n >>= 1)
        counter += (n & 1);
    return counter;
}
```
This successfully eliminates the discrepancy in processing time between sorted and unsorted arrays, at the cost of worse performance in the sorted case (in terms of total time, measured locally).
| Case | Branch-miss rate |
|---|---|
| Sorted array | 0.01% |
| Constant array | 0.07% |
| Shuffled array | 0.07% |
Manually unrolling the loop seemed to work though!
```c
uint8_t count_bits(uint32_t n) {
    uint8_t counter = 0;
    counter += (n & 1); n >>= 1;
    counter += (n & 1); n >>= 1;
    counter += (n & 1); n >>= 1;
    /* ...repeated 32 times in total... */
    return counter;
}
```
Using GCC's loop-unrolling attribute (`__attribute__((optimize("unroll-loops")))`) helped, but still performed worse than manually unrolling the loop.
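For completeness, applying the attribute looks like this (the function name is mine; the body is the fixed-iteration loop from the third attempt):

```c
/* Ask GCC to apply its loop unroller to just this function. */
__attribute__((optimize("unroll-loops")))
uint8_t count_bits_auto_unrolled(uint32_t n) {
    uint8_t counter = 0;
    for (int i = 0; i < 32; ++i, n >>= 1)
        counter += (n & 1);
    return counter;
}
```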
Of course, manual unrolling like this is not practical, so my workaround was:
```c
#define REPEAT4(x) x;x;x;x;
#define REPEAT2(x) x;x;
#define REPEAT32(x) REPEAT2(REPEAT4(REPEAT4(x)))

uint8_t count_bits(uint32_t n) {
    uint8_t counter = 0;
    REPEAT32(counter += (n & 1); n >>= 1;);
    return counter;
}
```
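A quick sanity check (with arbitrary test values) confirms the macro soup still expands to a correct 32-bit popcount:

```c
#include <assert.h>

int main(void) {
    assert(count_bits(0x00000000u) == 0);
    assert(count_bits(0xFFFFFFFFu) == 32);
    assert(count_bits(0x80000001u) == 2);
    assert(count_bits(0xF0F0F0F0u) == 16);
    return 0;
}
```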
Not pretty, but hey, it beats everything else I've tried!