Use Sparse bitsets instead of uint32 bitsets #15

Merged — 2 commits into master from sparsesets, Nov 17, 2019
Conversation

@mratsim (Owner) commented Nov 17, 2019

Timings:

Lazy

real    0m0.181s
user    0m6.076s
sys     0m0.007s

Eager

real    0m0.438s
user    0m15.242s
sys     0m0.014s

To be compared with #13

There is no performance difference.
Sparse sets are not limited to 32 victims, but they involve extra heap allocation/management and their memory footprint is much larger than a bitset's, so a comparison with a multi-limb bitset is needed.

@mratsim changed the title from “[Don't merge] Use Sparse bitsets instead of uint32 bitsets” to “Use Sparse bitsets instead of uint32 bitsets” on Nov 17, 2019
@mratsim (Owner, Author) commented Nov 17, 2019

After evaluating random bit picking on multi-limb bitsets, it seems that sparse sets are better, for the following reasons:

There is a way to do random selection on a bitset by selecting the set bit with a random rank r:
http://graphics.stanford.edu/~seander/bithacks.html#SelectPosFromMSBRank

  uint64_t v;          // Input value to find position with rank r.
  unsigned int r;      // Input: bit's desired rank [1-64].
  unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
  uint64_t a, b, c, d; // Intermediate temporaries for bit count.
  unsigned int t;      // Bit count temporary.

  // Do a normal parallel bit count for a 64-bit integer,                     
  // but store all intermediate steps.                                        
  // a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
  a =  v - ((v >> 1) & ~0UL/3);
  // b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
  b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
  // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
  c = (b + (b >> 4)) & ~0UL/0x11;
  // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
  d = (c + (c >> 8)) & ~0UL/0x101;
  t = (d >> 32) + (d >> 48);
  // Now do branchless select!                                                
  s  = 64;

  s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
  t  = (d >> (s - 16)) & 0xff;

  s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
  t  = (c >> (s - 8)) & 0xf;

  s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
  t  = (b >> (s - 4)) & 0x7;

  s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
  t  = (a >> (s - 2)) & 0x3;

  s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
  t  = (v >> (s - 1)) & 0x1;

  s -= ((t - r) & 256) >> 8;
  s = 65 - s;

or, if branching is fast on the target CPU:

  uint64_t v;          // Input value to find position with rank r.
  unsigned int r;      // Input: bit's desired rank [1-64].
  unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
  uint64_t a, b, c, d; // Intermediate temporaries for bit count.
  unsigned int t;      // Bit count temporary.

  // Do a normal parallel bit count for a 64-bit integer,                     
  // but store all intermediate steps.                                        
  // a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
  a =  v - ((v >> 1) & ~0UL/3);
  // b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
  b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
  // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
  c = (b + (b >> 4)) & ~0UL/0x11;
  // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
  d = (c + (c >> 8)) & ~0UL/0x101;
  t = (d >> 32) + (d >> 48);
  // Now do the select with branches, recomputing
  // the chunk count at each step.
  s  = 64;
  if (r > t) {s -= 32; r -= t;}
  t  = (d >> (s - 16)) & 0xff;
  if (r > t) {s -= 16; r -= t;}
  t  = (c >> (s - 8)) & 0xf;
  if (r > t) {s -= 8; r -= t;}
  t  = (b >> (s - 4)) & 0x7;
  if (r > t) {s -= 4; r -= t;}
  t  = (a >> (s - 2)) & 0x3;
  if (r > t) {s -= 2; r -= t;}
  t  = (v >> (s - 1)) & 0x1;
  if (r > t) s--;
  s = 65 - s;
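
For reference, the branchless variant can be wrapped into a self-contained, compilable function. This is a sketch: the name `select_rank64` is introduced here, and `~0UL` is written as `~0ULL` for portability; the logic is otherwise the bithacks code above, unchanged. The returned position is 1-based, counting from the most-significant bit.

```c
#include <stdint.h>

/* Position (1-based, from the most-significant bit) of the set bit with
 * rank r in v. Requires 1 <= r <= popcount(v). */
static unsigned select_rank64(uint64_t v, unsigned r) {
    uint64_t a, b, c, d; /* intermediate popcount steps */
    unsigned s, t;
    /* Parallel bit count, keeping all intermediate levels. */
    a = v - ((v >> 1) & ~0ULL / 3);
    b = (a & ~0ULL / 5) + ((a >> 2) & ~0ULL / 5);
    c = (b + (b >> 4)) & ~0ULL / 0x11;
    d = (c + (c >> 8)) & ~0ULL / 0x101;
    t = (unsigned)((d >> 32) + (d >> 48));
    /* Branchless binary search over the count tree. */
    s = 64;
    s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
    t = (unsigned)(d >> (s - 16)) & 0xff;
    s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
    t = (unsigned)(c >> (s - 8)) & 0xf;
    s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
    t = (unsigned)(b >> (s - 4)) & 0x7;
    s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
    t = (unsigned)(a >> (s - 2)) & 0x3;
    s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
    t = (unsigned)(v >> (s - 1)) & 0x1;
    s -= ((t - r) & 256) >> 8;
    return 65 - s;
}
```

To pick a random victim from a single limb, one would draw r uniformly in [1, popcount(v)] and, if a 0-based bit index is wanted, convert the result as 64 - s.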

However, when you need to select across multiple limbs, say 4 uint64 limbs for 256 workers, this becomes very tedious and probably requires a lot of instructions.
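
For illustration, the multi-limb case might look like the following. This is a sketch introduced here (not the PR's code); it uses GCC/Clang's `__builtin_popcountll` to skip limbs and a naive per-bit scan inside the chosen limb, so the work is no longer constant in the worker count.

```c
#include <stdint.h>
#include <stddef.h>

/* Position (1-based, counting from the MSB of limbs[0]) of the set bit with
 * rank r across n 64-bit limbs; limbs[0] holds the most-significant bits.
 * Returns 0 if fewer than r bits are set in total. */
static unsigned multi_limb_select(const uint64_t *limbs, size_t n, unsigned r) {
    for (size_t i = 0; i < n; i++) {
        unsigned cnt = (unsigned)__builtin_popcountll(limbs[i]);
        if (r <= cnt) {
            /* The rank-r bit lives in this limb: scan it from its MSB. */
            for (int bit = 63; bit >= 0; bit--) {
                if ((limbs[i] >> bit) & 1) {
                    if (--r == 0)
                        return (unsigned)(i * 64 + (63 - bit) + 1);
                }
            }
        }
        r -= cnt; /* skip this limb's set bits */
    }
    return 0; /* rank exceeds the total popcount */
}
```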

Uncompressing the bitset into an array of indices also requires either a lot of stack space when the maximum number of workers is high, or heap allocation, which is slow.

In comparison, whatever the number of workers, the number of operations on a sparse set is constant, and the space grows linearly with the number of workers, so low-core-count CPUs don't over-reserve for 256 workers. Even on a Network-on-Chip CPU with 1024 cores like Adapteva's, it takes sizeof(int16) * 1024 * 2 + 1 = 4097 bytes per core. This is a fair chunk of the L1 cache, but we can reasonably assume that the more cores there are, the more L1 cache there is per core (as such a CPU is probably higher-end).
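
The PR's actual implementation is in Nim; below is a C sketch of the classic Briggs-Torczon sparse-set scheme it relies on (the names `SparseSet`, `ss_add`, etc. are introduced here for illustration), matching the two int16-index arrays in the accounting above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORKERS 1024

/* Briggs-Torczon sparse set: O(1) insert, remove, membership test and
 * random pick, regardless of MAX_WORKERS. */
typedef struct {
    int16_t dense[MAX_WORKERS];  /* the members, tightly packed */
    int16_t sparse[MAX_WORKERS]; /* sparse[v] = index of v in dense */
    int16_t len;                 /* number of members */
} SparseSet;

static void ss_init(SparseSet *s) {
    s->len = 0;
    /* The classic scheme tolerates uninitialized sparse[]; we clear it to -1
     * anyway so every read is defined. */
    memset(s->sparse, 0xff, sizeof s->sparse);
}

static int ss_contains(const SparseSet *s, int16_t v) {
    int16_t i = s->sparse[v];
    return i >= 0 && i < s->len && s->dense[i] == v;
}

static void ss_add(SparseSet *s, int16_t v) {
    if (!ss_contains(s, v)) {
        s->dense[s->len] = v;
        s->sparse[v] = s->len++;
    }
}

static void ss_remove(SparseSet *s, int16_t v) {
    if (ss_contains(s, v)) {
        int16_t last = s->dense[--s->len]; /* move the last member */
        s->dense[s->sparse[v]] = last;     /* into the vacated slot */
        s->sparse[last] = s->sparse[v];
    }
}

/* Uniform random pick in constant time; requires a non-empty set. */
static int16_t ss_random_pick(const SparseSet *s) {
    return s->dense[rand() % s->len];
}
```

With this layout, victim selection is a single array index into `dense` — no rank-select machinery at all, which is the trade-off the timings above measure.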

@mratsim mratsim merged commit 5a8f182 into master Nov 17, 2019
@mratsim mratsim deleted the sparsesets branch November 30, 2019 13:56
@mratsim mratsim mentioned this pull request Jan 3, 2020