Methods

Naively checking each bit
Lookup table for each byte
SWAR bit counting

Comparison

I tested three different bitcounting strategies: a Naive count of each bit, a lookup table for each byte, and SWAR bit counting. The Lookup Table outperformed SWAR bit counting by a small margin, and as expected, the naive bit counting strategy performed significantly worse than the other two.

Bitten by my CPU's Branch Prediction

For the first time, I was bitten by branch-prediction in the Naive implementation of a bit counter.

The first implementation was the following:

 uint8_t count_bits(uint32_t n) {
         
          uint8_t counter = 0;
          for( ; n; n>>=1 )
            if(n & 1)
              ++counter;
            
          return counter;
 }

This proved to process a sorted array much faster than an unsorted array. A bit of googling took me to this SO question where it is determined that branch prediction is the culprit.

Linux's perf stat tool showed indeed a higher percentage of branch-misses in for the unsorted case.

Case	Branch-miss rate
Sorted array	0.01%
Constant array	0.07%
Shuffled array	0.18%

I tried to fight branch misprediction with the following:

uint8_t count_bits(uint32_t n) {
        
         uint8_t counter = 0;
         for( ; n; n>>=1 )
           counter += (n & 1);
           
         return counter;
}

But much to my surprise this didn't solve the discrepancy in processing time for sorted and unsorted arrays of uint32_t. perf stat yields exactly the same results. In fact, branch prediction also happens in the for loop whenever the break condition n is evaluated.

The third attempt was:

uint8_t count_bits(uint32_t n) {
        
         uint8_t counter = 0;
         for(int i = 0; i < 32; ++i, n>>=1 )
           counter += (n & 1);
           
         return counter;
}

This successfully eliminates the discrepancies in processing time for sorted/unsorted arrays, at the cost of worse performance for the sorted case (in terms of total time, measured locally).

Case	Branch-miss rate
Sorted array	0.01%
Constant array	0.07%
Shuffled array	0.07%

Manually unrolling the loop seemed to work though!

uint8_t count_bits(uint32_t n) {
        
         uint8_t counter = 0;
         
         counter += (n & 1); n>>=1;
         counter += (n & 1); n>>=1;
         counter += (n & 1); n>>=1;
         (repeat 32 times in total...)
         
         return counter;
}

Using gcc flags for loop unrolling (__attribute__((optimize("unroll-loops")))) helped, but actually performed worse than manually unwinding the loop.

Of course, manual unwinding like this is not practical, so my workaround was:

#define REPEAT4(x) x;x;x;x;
#define REPEAT2(x) x;x;
#define REPEAT32(x) REPEAT2(REPEAT4(REPEAT4(x)))

uint8_t count_bits(uint32_t n) {
         
          uint8_t counter = 0;
          
          REPEAT32(counter += (n & 1);n>>=1;);
          
          return counter;
 }

Not pretty, but hey, it beats everything else I've tried!

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
main.cpp		main.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Methods

Comparison

Bitten by my CPU's Branch Prediction

About

Releases

Packages

Languages

miguel-r-s/BitCounting

Folders and files

Latest commit

History

Repository files navigation

Methods

Comparison

Bitten by my CPU's Branch Prediction

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages