
Use usize as the Block instead of u32 #74

Merged: 5 commits into petgraph:master from the usize branch, Jan 9, 2023

Conversation

@james7132 (Collaborator) commented Mar 11, 2022

Changed Block to be an alias of usize instead of u32, which should use roughly 2x fewer instructions for similar-length bitsets on 64-bit machines. It also lets us skip more bits at a time in sparse bitsets.

This partially addresses #73, though full SIMD support is probably more desirable here.
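
For context, a rough sketch of the shape of the change (not the exact diff; the Block and BITS names follow the snippets later in this thread):

    // Before: type Block = u32; each block held 32 bits.
    // After: blocks are pointer-width, so one load covers 64 bits on x86_64.
    type Block = usize;
    const BITS: usize = core::mem::size_of::<Block>() * 8; // 64 on 64-bit targets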

@james7132 changed the title from "Use usize as Block and optimize Intersection/Union for denser bitsets" to "Multiple Optimizations" on Mar 11, 2022
@james7132 (Collaborator, Author)

Reran the benchmarks. Some insights:

  • contains is definitely slower due to the extra on-the-stack-vs-heap-allocated check smallvec has to do (sketched below, after the table).
  • As expected, a nearly 2x speedup on 64-bit machines when iterating over sparser bitsets.
  • Moving the smallvec check outside of the inner loop of insert_range keeps its performance largely the same.
  • The slower time when iterating over dense bitsets is unexpected. Needs more investigation.
 name                                           master ns/iter  usize ns/iter  diff ns/iter   diff %  speedup
 bench_insert_range                             1,217           1,289                    72    5.92%   x 0.94
 bench_insert_range_using_loop                  793,870         1,141,606           347,736   43.80%   x 0.70
 bench_iter_ones_all_ones                       314,546         442,401             127,855   40.65%   x 0.71
 bench_iter_ones_all_zeros                      8,505           4,510                -3,995  -46.97%   x 1.89
 bench_iter_ones_using_contains_all_ones        544,155         543,391                -764   -0.14%   x 1.00
 bench_iter_ones_using_contains_all_zeros       543,840         516,121             -27,719   -5.10%   x 1.05
 bench_iter_ones_using_slice_directly_all_ones  416,640         446,340              29,700    7.13%   x 0.93
 bench_iter_ones_using_slice_directly_all_zero  8,513           4,267                -4,246  -49.88%   x 2.00
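
To illustrate the first bullet: a hypothetical sketch (not this crate's actual code) of why a smallvec-backed bitset penalizes contains. Every lookup must first branch on whether the blocks live inline or on the heap, and that branch sits on the hot path:

    // Hypothetical smallvec-style storage: the inline-vs-heap branch below is
    // exactly what a plain Vec-backed bitset avoids on every `contains` call.
    enum Storage {
        Inline([usize; 4]), // small sets stay on the stack
        Heap(Vec<usize>),   // larger sets spill to the heap
    }

    impl Storage {
        fn blocks(&self) -> &[usize] {
            match self {
                Storage::Inline(a) => a,
                Storage::Heap(v) => v,
            }
        }
    }

    fn contains(storage: &Storage, bit: usize) -> bool {
        const BITS: usize = usize::BITS as usize;
        storage
            .blocks()
            .get(bit / BITS)
            .map_or(false, |&b| b & (1 << (bit % BITS)) != 0)
    }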

@Calandiel

Feels like this should be behind a feature flag(?). For some applications, the performance of "contains" may be the main bottleneck.
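
A hypothetical feature gate along those lines (the feature name is illustrative, not an actual crate feature) could keep the old block width available:

    // Hypothetical: select the block width with a cargo feature.
    #[cfg(feature = "usize-blocks")]
    type Block = usize;
    #[cfg(not(feature = "usize-blocks"))]
    type Block = u32;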

@james7132 (Collaborator, Author)

@Calandiel I ended up removing the smallvec changes; it's indeed an unacceptable overhead. Reran the benchmarks; this seems to address the insert and all_ones cases.

 name                                           master ns/iter  usize ns/iter  diff ns/iter   diff %  speedup
 bench_insert_range                             1,260           1,212                   -48   -3.81%   x 1.04
 bench_insert_range_using_loop                  794,435         962,799             168,364   21.19%   x 0.83
 bench_iter_ones_all_ones                       314,486         422,858             108,372   34.46%   x 0.74
 bench_iter_ones_all_zeros                      8,503           4,277                -4,226  -49.70%   x 1.99
 bench_iter_ones_using_contains_all_ones        544,015         518,847             -25,168   -4.63%   x 1.05
 bench_iter_ones_using_contains_all_zeros       543,870         518,844             -25,026   -4.60%   x 1.05
 bench_iter_ones_using_slice_directly_all_ones  416,365         448,484              32,119    7.71%   x 0.93
 bench_iter_ones_using_slice_directly_all_zero  8,504           4,280                -4,224  -49.67%   x 1.99

@james7132 (Collaborator, Author)

@jrraymond mind taking a look? This is being used as a core component in Bevy's ECS scheduler, and this optimizes many of the operations behind it.

@jrraymond (Collaborator)

Thanks James. I see about the same speedups:

 name                                           usize ns/iter  deviation  u32 ns/iter  speedup
 bench_insert_range                             3,862          416        3,807         0.99
 bench_insert_range_using_loop                  2,089,174      17,710     2,165,208     1.04
 bench_iter_ones_all_ones                       710,358        4,944      793,514       1.12
 bench_iter_ones_all_zeros                      4,911          107        9,790         1.99
 bench_iter_ones_using_contains_all_ones        543,766        13,625     523,399       0.96
 bench_iter_ones_using_contains_all_zeros       543,237        5,127      523,860       0.96
 bench_iter_ones_using_slice_directly_all_ones  414,222        2,289      367,529       0.89
 bench_iter_ones_using_slice_directly_all_zero  4,909          99         9,797         2.00

How do you feel about splitting this up into smaller PRs: 1) u32 -> usize, 2) adding the zeros iterator, 3) moving the tests to a separate module?
As is, the PR is pretty huge.

@james7132 (Collaborator, Author)

@jrraymond I've updated this PR to only include the usize changes. Please take another look.

@jrraymond (Collaborator) commented Jan 7, 2023

Before I merge this, I would like to resolve the performance regressions on the contains benchmarks:

 name                                           old ns/iter (+/- dev)   new ns/iter (+/- dev)   pct change
 bench_insert_range                             3,915 (+/- 322)         3,828 (+/- 1,328)         2.22%
 bench_insert_range_using_loop                  2,087,287 (+/- 5,895)   2,087,487 (+/- 11,232)   -0.01%
 bench_iter_ones_all_ones                       791,916 (+/- 11,288)    709,237 (+/- 1,731)      10.44%
 bench_iter_ones_all_zeros                      9,804 (+/- 237)         4,905 (+/- 57)           49.97%
 bench_iter_ones_using_contains_all_ones        525,739 (+/- 7,121)     542,470 (+/- 6,671)      -3.18%
 bench_iter_ones_using_contains_all_zeros       524,045 (+/- 10,135)    542,270 (+/- 3,156)      -3.48%
 bench_iter_ones_using_slice_directly_all_ones  368,727 (+/- 6,322)     437,931 (+/- 2,819)     -18.77%
 bench_iter_ones_using_slice_directly_all_zero  9,798 (+/- 144)         4,906 (+/- 36)           49.93%

@jrraymond (Collaborator)

I've been looking at the generated assembly and the differences are exactly what I would expect: the usize version (left, vs. u32 on the right) loads qwords instead of dwords, increments the counter by 8 instead of 4, and shifts the indexes by different amounts.

   1  example::iter_ones_using_contains:                                                                                               |    1 example::iter_ones_using_contains:
    2         push    rbx                                                                                                              |    2         push    rbx
    3         mov     r9, qword ptr [rdi + 24]                                                                                         |    3         mov     r9, qword ptr [rdi + 24]
    4         test    r9, r9                                                                                                           |    4         test    r9, r9
    5         je      .LBB31_1                                                                                                         |    5         je      .LBB32_1
    6         mov     r8, qword ptr [rdi]                                                                                              |    6         mov     r8, qword ptr [rdi]
    7         mov     r11, qword ptr [rdi + 16]                                                                                        |    7         mov     r11, qword ptr [rdi + 16]
    8         cmp     r9, 1                                                                                                            |    8         cmp     r9, 1
    9         jne     .LBB31_8                                                                                                         |    9         jne     .LBB32_8
   10         xor     eax, eax                                                                                                         |   10         xor     eax, eax
   11         xor     edx, edx                                                                                                         |   11         xor     edx, edx
   12 .LBB31_4:                                                                                                                        |   12 .LBB32_4:
   13         test    r9b, 1                                                                                                           |   13         test    r9b, 1
   14         je      .LBB31_7                                                                                                         |   14         je      .LBB32_7
   15         mov     rsi, rdx                                                                                                         |   15         mov     rsi, rdx
   16         shr     rsi, 6                                                                                                           |   16         shr     rsi, 5
   17         cmp     rsi, r11                                                                                                         |   17         cmp     rsi, r11
   18         jae     .LBB31_7                                                                                                         |   18         jae     .LBB32_7
   19         mov     edi, 1                                                                                                           |   19         mov     edi, 1
   20         mov     ecx, edx                                                                                                         |   20         mov     ecx, edx
   21         shl     rdi, cl                                                                                                          |   21         shl     edi, cl
   22         and     rdi, qword ptr [r8 + 8*rsi]                                                                                      |   22         and     edi, dword ptr [r8 + 4*rsi]
   23         cmp     rdi, 1                                                                                                           |   23         cmp     edi, 1
   24         sbb     rax, -1                                                                                                          |   24         sbb     rax, -1
   25 .LBB31_7:                                                                                                                        |   25 .LBB32_7:
   26         pop     rbx                                                                                                              |   26         pop     rbx
   27         ret                                                                                                                      |   27         ret
   28 .LBB31_1:                                                                                                                        |   28 .LBB32_1:
   29         xor     eax, eax                                                                                                         |   29         xor     eax, eax
   30         pop     rbx                                                                                                              |   30         pop     rbx
   31         ret                                                                                                                      |   31         ret
   32 .LBB31_8:                                                                                                                        |   32 .LBB32_8:
   33         mov     r10, r9                                                                                                          |   33         mov     r10, r9
   34         and     r10, -2                                                                                                          |   34         and     r10, -2
   35         xor     eax, eax                                                                                                         |   35         xor     eax, eax
   36         xor     esi, esi                                                                                                         |   36         xor     esi, esi
   37         jmp     .LBB31_9                                                                                                         |   37         jmp     .LBB32_9
   38 .LBB31_13:                                                                                                                       |   38 .LBB32_13:
   39         mov     rsi, rdx                                                                                                         |   39         mov     rsi, rdx
   40         cmp     r10, rdx                                                                                                         |   40         cmp     r10, rdx
   41         je      .LBB31_4                                                                                                         |   41         je      .LBB32_4
   42 .LBB31_9:                                                                                                                        |   42 .LBB32_9:
   43         mov     rdi, rsi                                                                                                         |   43         mov     rdi, rsi
   44         shr     rdi, 6                                                                                                           |   44         shr     rdi, 5
   45         cmp     rdi, r11                                                                                                         |   45         cmp     rdi, r11
   46         jae     .LBB31_11                                                                                                        |   46         jae     .LBB32_11
   47         mov     ecx, esi                                                                                                         |   47         mov     ecx, esi
   48         and     cl, 62                                                                                                           |   48         and     cl, 30
   49         mov     edx, 1                                                                                                           |   49         mov     edx, 1
   50         shl     rdx, cl                                                                                                          |   50         shl     edx, cl
   51         and     rdx, qword ptr [r8 + 8*rdi]                                                                                      |   51         and     edx, dword ptr [r8 + 4*rdi]
   52         cmp     rdx, 1                                                                                                           |   52         cmp     edx, 1
   53         sbb     rax, -1                                                                                                          |   53         sbb     rax, -1
   54 .LBB31_11:                                                                                                                       |   54 .LBB32_11:
   55         lea     rdx, [rsi + 2]                                                                                                   |   55         lea     rdx, [rsi + 2]
   56         cmp     rdi, r11                                                                                                         |   56         cmp     rdi, r11
   57         jae     .LBB31_13                                                                                                        |   57         jae     .LBB32_13
   58         and     sil, 62                                                                                                          |   58         and     sil, 30
   59         or      sil, 1                                                                                                           |   59         or      sil, 1
   60         mov     ebx, 1                                                                                                           |   60         mov     ebx, 1
   61         mov     ecx, esi                                                                                                         |   61         mov     ecx, esi
   62         shl     rbx, cl                                                                                                          |   62         shl     ebx, cl
   63         and     rbx, qword ptr [r8 + 8*rdi]                                                                                      |   63         and     ebx, dword ptr [r8 + 4*rdi]
   64         cmp     rbx, 1                                                                                                           |   64         cmp     ebx, 1
   65         sbb     rax, -1                                                                                                          |   65         sbb     rax, -1
   66         jmp     .LBB31_13                                                                                                        |   66         jmp     .LBB32_13                                                 

@james7132 (Collaborator, Author) commented Jan 8, 2023

Thanks for the deep dive into this.

What's the policy on this crate for the use of unsafe? We can convert a &[usize] into an equivalent &[u32] and then use that to load the target u32. This would retain gains for the batch/set operations without causing a regression in contains, but would necessitate the use of more unsafe.

I've also, in the meantime, experimented with explicitly vectorized operations, using __m128i/__m256i over usize on targets that support it, and we can net a 2-4x speedup over the existing gains on those targets. I won't add it to this PR, but that also would add a significant amount of unsafe to the crate, so I wanted to ask for an opinion ahead of time.
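
For reference, a minimal sketch of what one such vectorized batch operation could look like (hypothetical, not the experiment's actual code; assumes x86_64, where SSE2 is always available):

    #[cfg(target_arch = "x86_64")]
    fn union_with_sse2(a: &mut [usize], b: &[usize]) {
        use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_or_si128, _mm_storeu_si128};
        let len = a.len().min(b.len());
        let pairs = len & !1; // two 64-bit blocks fit in one 128-bit register
        for i in (0..pairs).step_by(2) {
            // SAFETY: i and i + 1 are in bounds for both slices, and the
            // unaligned load/store intrinsics carry no alignment requirement.
            unsafe {
                let pa = a.as_mut_ptr().add(i) as *mut __m128i;
                let pb = b.as_ptr().add(i) as *const __m128i;
                let v = _mm_or_si128(_mm_loadu_si128(pa), _mm_loadu_si128(pb));
                _mm_storeu_si128(pa, v);
            }
        }
        for i in pairs..len {
            a[i] |= b[i]; // scalar tail for an odd trailing block
        }
    }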

@jrraymond (Collaborator)

I think we should use unsafe here if it is faster because the scope is narrow and the correctness is easily verifiable. I have no objection to unsafe if there aren't alternatives, so unsafe simd is fine with me.

I took a stab at this to see if it makes a difference:

        // Addresses the bitset as u32 sub-blocks so `contains` only has to
        // load a dword instead of a qword.
        type SubBlock = u32;
        const SUB_BLOCK_BITS: usize = std::mem::size_of::<SubBlock>() * 8;
        const SUB_BLOCKS_PER_BLOCK: usize = BITS / SUB_BLOCK_BITS;
        let block_ix = bit / BITS;
        let block_offset = bit % BITS;
        let sub_block_ix = block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS;
        let offset = bit % SUB_BLOCK_BITS;
        let blocks: &[usize] = &self.data;
        // Reinterpret the usize blocks as u32 sub-blocks. Note that sub-block
        // order within a block matches bit order only on little-endian targets.
        let sub_blocks: &[SubBlock] = unsafe {
            let p = blocks.as_ptr() as *const SubBlock;
            std::slice::from_raw_parts(p, blocks.len() * SUB_BLOCKS_PER_BLOCK)
        };
        match sub_blocks.get(sub_block_ix) {
            None => false,
            Some(b) => (b & (1 << offset)) != 0,
        }

The additional logic to compute the indexes only adds overhead :(. However, in my latest round of benchmarks, contains doesn't show a regression anymore (I think I upgraded nightly, so maybe something changed there).

 name                                           master (+/- var)        usize               u64 (w/ subblocks)   u32                 u16                 u8
 bench_insert_range                             2,469 (+/- 293)         2,859     -15.80%   2,766     -12.03%    1,978      19.89%   2,805     -13.61%   2,764       0.07%
 bench_insert_range_using_loop                  2,202,318 (+/- 32,273)  2,203,539  -0.06%   2,206,400  -0.19%    2,203,542  -0.06%   2,202,270   0.00%   2,205,833   0.03%
 bench_iter_ones_all_ones                       791,858 (+/- 3,819)     710,383    10.29%   709,991    10.34%    710,495    10.27%   709,458    10.41%   710,395    -0.06%
 bench_iter_ones_all_zeros                      9,782 (+/- 41)          4,910      49.81%   4,905      49.86%    4,908      49.83%   4,903      49.88%   4,901       0.08%
 bench_iter_ones_using_contains_all_ones        597,916 (+/- 20,551)    598,262    -0.06%   597,900     0.00%    599,137    -0.20%   666,197   -11.42%   665,485   -11.30%
 bench_iter_ones_using_contains_all_zeros       597,850 (+/- 6,897)     598,477    -0.10%   597,916    -0.01%    599,802    -0.33%   666,289   -11.45%   666,522   -11.47%
 bench_iter_ones_using_slice_directly_all_ones  367,193 (+/- 8,208)     438,308   -19.37%   438,979   -19.55%    449,346   -22.37%   443,264   -20.72%   438,887     0.02%
 bench_iter_ones_using_slice_directly_all_zero  9,794 (+/- 611)         4,905      49.92%   4,908      49.89%    4,915      49.82%   4,909      49.88%   4,914      -0.12%

I think the overhead of computing the additional offsets outweighs any benefits, although I'm sure you could improve my logic for calculating sub-block offsets:

usize::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 5
        cmp     rax, qword ptr [rdi + 16]
        jae     .LBB10_1
        mov     rcx, qword ptr [rdi]
        mov     eax, dword ptr [rcx + 4*rax]
        bt      eax, esi
        setb    al
        ret
        
sublocks::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 3
        mov     rcx, qword ptr [rdi + 16]
        shl     rcx, 3
        cmp     rax, rcx
        jae     .LBB9_1
        mov     rcx, qword ptr [rdi]
        movzx   eax, byte ptr [rcx + rax]
        and     esi, 7
        bt      eax, esi
        setb    al
        ret
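
For what it's worth, the two-step index computation collapses on its own: because the sub-blocks tile the blocks exactly, it reduces to a single division, which is the lone shr the compiler emits above:

    // Equivalent to block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS.
    let sub_block_ix = bit / SUB_BLOCK_BITS;
    let offset = bit % SUB_BLOCK_BITS;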

Given this I'd say this lgtm.

@james7132 (Collaborator, Author)

Ultimately, I think we should improve the benchmark coverage, and maybe convert the benchmarks to Criterion for more stable results.
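
A minimal sketch of what one of these benchmarks might look like ported to Criterion (assuming criterion as a dev-dependency; the set size and names are illustrative):

    use criterion::{black_box, criterion_group, criterion_main, Criterion};
    use fixedbitset::FixedBitSet;

    fn bench_iter_ones_all_ones(c: &mut Criterion) {
        // An all-ones bitset, mirroring the existing bench of the same name.
        let mut set = FixedBitSet::with_capacity(1_000_000);
        set.insert_range(..);
        c.bench_function("iter_ones_all_ones", |b| {
            b.iter(|| black_box(&set).ones().count())
        });
    }

    criterion_group!(benches, bench_iter_ones_all_ones);
    criterion_main!(benches);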

src/lib.rs Outdated
@@ -274,12 +274,13 @@ impl FixedBitSet {
    /// **Panics** if the range extends past the end of the bitset.
    #[inline]
    pub fn set_range<T: IndexRange>(&mut self, range: T, enabled: bool) {
        let slice = self.data.as_mut_slice();
Collaborator

was a small optimization?

Collaborator Author

I think this was an attempt to resolve the smallvec issues. It has since been reverted.

src/lib.rs Outdated
        while let Some(nxt) = self.iter.next() {
            if self.other.contains(nxt) {
                return Some(nxt);
        while self.bitset == 0 {
Collaborator

Does this yield a measurable speedup?

Collaborator Author

No, actually this only made it slower, though not by much. Ultimately I reverted it.

@jrraymond (Collaborator) commented Jan 9, 2023

lgtm. please squash commits then merge.

@james7132 changed the title from "Multiple Optimizations" to "Use usize as the Block instead of u32" on Jan 9, 2023
@james7132 merged commit 5cf099a into petgraph:master on Jan 9, 2023
@james7132 deleted the usize branch on January 9, 2023 at 18:35
@james7132 mentioned this pull request on Feb 17, 2023