Use Sparse bitsets instead of uint32 bitsets #15

Merged — 2 commits into master from sparsesets, Nov 17, 2019
Conversation

@mratsim (Owner) commented Nov 17, 2019

Timings:

Lazy

real    0m0.181s
user    0m6.076s
sys     0m0.007s

Eager

real    0m0.438s
user    0m15.242s
sys     0m0.014s

To be compared with #13

There is no performance difference.
Sparse sets are not limited to 32 victims, but they involve extra heap allocation/management and their memory footprint is much larger than a bitset's, so a comparison with a multi-limb bitset is needed.

@mratsim changed the title from “[Don't merge] Use Sparse bitsets instead of uint32 bitsets” to “Use Sparse bitsets instead of uint32 bitsets” on Nov 17, 2019
@mratsim (Owner, Author) commented Nov 17, 2019

After evaluating random bit picking on multi-limb bitsets, it seems that sparse sets are better, for the following reasons:

There is a way to do random selection on a bitset by selecting the set bit with a random rank r:
http://graphics.stanford.edu/~seander/bithacks.html#SelectPosFromMSBRank

  uint64_t v;          // Input value to find position with rank r.
  unsigned int r;      // Input: bit's desired rank [1-64].
  unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
  uint64_t a, b, c, d; // Intermediate temporaries for bit count.
  unsigned int t;      // Bit count temporary.

  // Do a normal parallel bit count for a 64-bit integer,                     
  // but store all intermediate steps.                                        
  // a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
  a =  v - ((v >> 1) & ~0UL/3);
  // b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
  b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
  // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
  c = (b + (b >> 4)) & ~0UL/0x11;
  // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
  d = (c + (c >> 8)) & ~0UL/0x101;
  t = (d >> 32) + (d >> 48);
  // Now do branchless select!                                                
  s  = 64;

  s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
  t  = (d >> (s - 16)) & 0xff;

  s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
  t  = (c >> (s - 8)) & 0xf;

  s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
  t  = (b >> (s - 4)) & 0x7;

  s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
  t  = (a >> (s - 2)) & 0x3;

  s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
  t  = (v >> (s - 1)) & 0x1;

  s -= ((t - r) & 256) >> 8;
  s = 65 - s;

or, if branching is fast on the target CPU:

  uint64_t v;          // Input value to find position with rank r.
  unsigned int r;      // Input: bit's desired rank [1-64].
  unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
  uint64_t a, b, c, d; // Intermediate temporaries for bit count.
  unsigned int t;      // Bit count temporary.

  // Do a normal parallel bit count for a 64-bit integer,                     
  // but store all intermediate steps.                                        
  // a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
  a =  v - ((v >> 1) & ~0UL/3);
  // b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
  b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
  // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
  c = (b + (b >> 4)) & ~0UL/0x11;
  // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
  d = (c + (c >> 8)) & ~0UL/0x101;
  t = (d >> 32) + (d >> 48);
  // Now do the select with branches, recomputing
  // the chunk count at each step.
  s  = 64;
  if (r > t) {s -= 32; r -= t;}
  t  = (d >> (s - 16)) & 0xff;
  if (r > t) {s -= 16; r -= t;}
  t  = (c >> (s - 8)) & 0xf;
  if (r > t) {s -= 8; r -= t;}
  t  = (b >> (s - 4)) & 0x7;
  if (r > t) {s -= 4; r -= t;}
  t  = (a >> (s - 2)) & 0x3;
  if (r > t) {s -= 2; r -= t;}
  t  = (v >> (s - 1)) & 0x1;
  if (r > t) s--;
  s = 65 - s;
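
For reference, the branchless variant can be wrapped into a self-contained, compilable function. This is a sketch: the name `select_rank64` is introduced here, and `~0UL` is written as `~0ULL` for portability; the logic is otherwise the bithacks code above, unchanged. The returned position is 1-based, counting from the most-significant bit.

```c
#include <stdint.h>

/* Position (1-based, from the most-significant bit) of the set bit with
 * rank r in v. Requires 1 <= r <= popcount(v). */
static unsigned select_rank64(uint64_t v, unsigned r) {
    uint64_t a, b, c, d; /* intermediate popcount steps */
    unsigned s, t;
    /* Parallel bit count, keeping all intermediate levels. */
    a = v - ((v >> 1) & ~0ULL / 3);
    b = (a & ~0ULL / 5) + ((a >> 2) & ~0ULL / 5);
    c = (b + (b >> 4)) & ~0ULL / 0x11;
    d = (c + (c >> 8)) & ~0ULL / 0x101;
    t = (unsigned)((d >> 32) + (d >> 48));
    /* Branchless binary search over the count tree. */
    s = 64;
    s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
    t = (unsigned)(d >> (s - 16)) & 0xff;
    s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
    t = (unsigned)(c >> (s - 8)) & 0xf;
    s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
    t = (unsigned)(b >> (s - 4)) & 0x7;
    s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
    t = (unsigned)(a >> (s - 2)) & 0x3;
    s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
    t = (unsigned)(v >> (s - 1)) & 0x1;
    s -= ((t - r) & 256) >> 8;
    return 65 - s;
}
```

To pick a random victim from a single limb, one would draw r uniformly in [1, popcount(v)] and, if a 0-based bit index is wanted, convert the result as 64 - s.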

However, when you need to select across multiple limbs, say 4 uint64 limbs for 256 workers, this becomes very tedious and probably requires a lot of instructions.
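
For illustration, the multi-limb case might look like the following. This is a sketch introduced here (not the PR's code); it uses GCC/Clang's `__builtin_popcountll` to skip limbs and a naive per-bit scan inside the chosen limb, so the work is no longer constant in the worker count.

```c
#include <stdint.h>
#include <stddef.h>

/* Position (1-based, counting from the MSB of limbs[0]) of the set bit with
 * rank r across n 64-bit limbs; limbs[0] holds the most-significant bits.
 * Returns 0 if fewer than r bits are set in total. */
static unsigned multi_limb_select(const uint64_t *limbs, size_t n, unsigned r) {
    for (size_t i = 0; i < n; i++) {
        unsigned cnt = (unsigned)__builtin_popcountll(limbs[i]);
        if (r <= cnt) {
            /* The rank-r bit lives in this limb: scan it from its MSB. */
            for (int bit = 63; bit >= 0; bit--) {
                if ((limbs[i] >> bit) & 1) {
                    if (--r == 0)
                        return (unsigned)(i * 64 + (63 - bit) + 1);
                }
            }
        }
        r -= cnt; /* skip this limb's set bits */
    }
    return 0; /* rank exceeds the total popcount */
}
```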

Uncompressing the bitset into an array of indices also requires either a lot of stack space when the maximum number of workers is high, or heap allocation, which is slow.

In comparison, whatever the number of workers, the number of operations on a sparse set is constant, and the space grows linearly with the number of workers, so low-core-count CPUs don't over-reserve for 256 workers. Even on a Network-on-Chip CPU with 1024 cores like Adapteva's, it takes sizeof(int16) * 1024 * 2 + 1 = 4097 bytes per core. This is a fair chunk of the L1 cache, but we can reasonably assume that the more cores there are, the more L1 cache there is per core (as such a CPU is probably higher-end).
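
The PR's actual implementation is in Nim; below is a C sketch of the classic Briggs-Torczon sparse-set scheme it relies on (the names `SparseSet`, `ss_add`, etc. are introduced here for illustration), matching the two int16-index arrays in the accounting above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORKERS 1024

/* Briggs-Torczon sparse set: O(1) insert, remove, membership test and
 * random pick, regardless of MAX_WORKERS. */
typedef struct {
    int16_t dense[MAX_WORKERS];  /* the members, tightly packed */
    int16_t sparse[MAX_WORKERS]; /* sparse[v] = index of v in dense */
    int16_t len;                 /* number of members */
} SparseSet;

static void ss_init(SparseSet *s) {
    s->len = 0;
    /* The classic scheme tolerates uninitialized sparse[]; we clear it to -1
     * anyway so every read is defined. */
    memset(s->sparse, 0xff, sizeof s->sparse);
}

static int ss_contains(const SparseSet *s, int16_t v) {
    int16_t i = s->sparse[v];
    return i >= 0 && i < s->len && s->dense[i] == v;
}

static void ss_add(SparseSet *s, int16_t v) {
    if (!ss_contains(s, v)) {
        s->dense[s->len] = v;
        s->sparse[v] = s->len++;
    }
}

static void ss_remove(SparseSet *s, int16_t v) {
    if (ss_contains(s, v)) {
        int16_t last = s->dense[--s->len]; /* move the last member */
        s->dense[s->sparse[v]] = last;     /* into the vacated slot */
        s->sparse[last] = s->sparse[v];
    }
}

/* Uniform random pick in constant time; requires a non-empty set. */
static int16_t ss_random_pick(const SparseSet *s) {
    return s->dense[rand() % s->len];
}
```

With this layout, victim selection is a single array index into `dense` — no rank-select machinery at all, which is the trade-off the timings above measure.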

@mratsim mratsim merged commit 5a8f182 into master Nov 17, 2019
@mratsim mratsim deleted the sparsesets branch November 30, 2019 13:56
@mratsim mratsim mentioned this pull request Jan 3, 2020