
Use usize as the Block instead of u32 #74

Merged: 5 commits into petgraph:master from the usize branch, Jan 9, 2023

Conversation

@james7132 (Collaborator) commented Mar 11, 2022

Changed Block to be an alias of usize instead of u32, which should use roughly 2x fewer instructions for similar-length bitsets on 64-bit machines. It also lets us skip more bits at a time in sparse bitsets.

This partially addresses #73, though full SIMD support is probably more desirable here.
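
For context, a rough sketch of the shape of the change (not the exact diff; the Block and BITS names follow the snippets later in this thread):

    // Before: type Block = u32; each block held 32 bits.
    // After: blocks are pointer-width, so one load covers 64 bits on x86_64.
    type Block = usize;
    const BITS: usize = core::mem::size_of::<Block>() * 8; // 64 on 64-bit targets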

@james7132 changed the title from "Use usize as Block and optimize Intersection/Union for denser bitsets" to "Multiple Optimizations" on Mar 11, 2022
@james7132 (Collaborator, Author)

Reran the benchmarks. Some insights:

  • contains is definitely slower due to the extra on-the-stack-vs-heap-allocated check smallvec has to do (sketched below, after the table).
  • As expected, a nearly 2x speedup on 64-bit machines when iterating over sparser bitsets.
  • Moving the smallvec check outside of the inner loop of insert_range keeps its performance largely the same.
  • The slower time when iterating over dense bitsets is unexpected. Needs more investigation.
 name                                           master ns/iter  usize ns/iter  diff ns/iter   diff %  speedup
 bench_insert_range                             1,217           1,289                    72    5.92%   x 0.94
 bench_insert_range_using_loop                  793,870         1,141,606           347,736   43.80%   x 0.70
 bench_iter_ones_all_ones                       314,546         442,401             127,855   40.65%   x 0.71
 bench_iter_ones_all_zeros                      8,505           4,510                -3,995  -46.97%   x 1.89
 bench_iter_ones_using_contains_all_ones        544,155         543,391                -764   -0.14%   x 1.00
 bench_iter_ones_using_contains_all_zeros       543,840         516,121             -27,719   -5.10%   x 1.05
 bench_iter_ones_using_slice_directly_all_ones  416,640         446,340              29,700    7.13%   x 0.93
 bench_iter_ones_using_slice_directly_all_zero  8,513           4,267                -4,246  -49.88%   x 2.00
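
To illustrate the first bullet: a hypothetical sketch (not this crate's actual code) of why a smallvec-backed bitset penalizes contains. Every lookup must first branch on whether the blocks live inline or on the heap, and that branch sits on the hot path:

    // Hypothetical smallvec-style storage: the inline-vs-heap branch below is
    // exactly what a plain Vec-backed bitset avoids on every `contains` call.
    enum Storage {
        Inline([usize; 4]), // small sets stay on the stack
        Heap(Vec<usize>),   // larger sets spill to the heap
    }

    impl Storage {
        fn blocks(&self) -> &[usize] {
            match self {
                Storage::Inline(a) => a,
                Storage::Heap(v) => v,
            }
        }
    }

    fn contains(storage: &Storage, bit: usize) -> bool {
        const BITS: usize = usize::BITS as usize;
        storage
            .blocks()
            .get(bit / BITS)
            .map_or(false, |&b| b & (1 << (bit % BITS)) != 0)
    }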

@Calandiel

Feels like this should be behind a feature flag(?). For some applications, the performance of "contains" may be the main bottleneck.
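
A hypothetical feature gate along those lines (the feature name is illustrative, not an actual crate feature) could keep the old block width available:

    // Hypothetical: select the block width with a cargo feature.
    #[cfg(feature = "usize-blocks")]
    type Block = usize;
    #[cfg(not(feature = "usize-blocks"))]
    type Block = u32;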

@james7132 (Collaborator, Author)

@Calandiel I ended up removing the smallvec changes; it's indeed an unacceptable overhead. Reran the benchmarks; this seems to address the insert and all_ones cases.

 name                                           master ns/iter  usize ns/iter  diff ns/iter   diff %  speedup
 bench_insert_range                             1,260           1,212                   -48   -3.81%   x 1.04
 bench_insert_range_using_loop                  794,435         962,799             168,364   21.19%   x 0.83
 bench_iter_ones_all_ones                       314,486         422,858             108,372   34.46%   x 0.74
 bench_iter_ones_all_zeros                      8,503           4,277                -4,226  -49.70%   x 1.99
 bench_iter_ones_using_contains_all_ones        544,015         518,847             -25,168   -4.63%   x 1.05
 bench_iter_ones_using_contains_all_zeros       543,870         518,844             -25,026   -4.60%   x 1.05
 bench_iter_ones_using_slice_directly_all_ones  416,365         448,484              32,119    7.71%   x 0.93
 bench_iter_ones_using_slice_directly_all_zero  8,504           4,280                -4,224  -49.67%   x 1.99

@james7132 (Collaborator, Author)

@jrraymond mind taking a look? This is being used as a core component in Bevy's ECS scheduler, and this optimizes many of the operations behind it.

@jrraymond (Collaborator)

Thanks James. I see about the same speedups:

 name                                           usize ns/iter  deviation  u32 ns/iter  speedup
 bench_insert_range                             3,862          416        3,807         0.99
 bench_insert_range_using_loop                  2,089,174      17,710     2,165,208     1.04
 bench_iter_ones_all_ones                       710,358        4,944      793,514       1.12
 bench_iter_ones_all_zeros                      4,911          107        9,790         1.99
 bench_iter_ones_using_contains_all_ones        543,766        13,625     523,399       0.96
 bench_iter_ones_using_contains_all_zeros       543,237        5,127      523,860       0.96
 bench_iter_ones_using_slice_directly_all_ones  414,222        2,289      367,529       0.89
 bench_iter_ones_using_slice_directly_all_zero  4,909          99         9,797         2.00

How do you feel about splitting this up into smaller PRs: 1) u32 -> usize, 2) adding the zeros iterator, 3) moving the tests to a separate module?
As is, the PR is pretty huge.

@james7132 (Collaborator, Author)

@jrraymond I've updated this PR to only include the usize changes. Please take another look.

@jrraymond (Collaborator) commented Jan 7, 2023

Before I merge this, I would like to resolve the performance regressions on the contains benchmarks:

 name                                           old ns/iter (+/- dev)   new ns/iter (+/- dev)   pct change
 bench_insert_range                             3,915 (+/- 322)         3,828 (+/- 1,328)         2.22%
 bench_insert_range_using_loop                  2,087,287 (+/- 5,895)   2,087,487 (+/- 11,232)   -0.01%
 bench_iter_ones_all_ones                       791,916 (+/- 11,288)    709,237 (+/- 1,731)      10.44%
 bench_iter_ones_all_zeros                      9,804 (+/- 237)         4,905 (+/- 57)           49.97%
 bench_iter_ones_using_contains_all_ones        525,739 (+/- 7,121)     542,470 (+/- 6,671)      -3.18%
 bench_iter_ones_using_contains_all_zeros       524,045 (+/- 10,135)    542,270 (+/- 3,156)      -3.48%
 bench_iter_ones_using_slice_directly_all_ones  368,727 (+/- 6,322)     437,931 (+/- 2,819)     -18.77%
 bench_iter_ones_using_slice_directly_all_zero  9,798 (+/- 144)         4,906 (+/- 36)           49.93%

@jrraymond (Collaborator)

I've been looking at the generated assembly and the differences are exactly what I would expect: the usize version (left, vs. u32 on the right) loads qwords instead of dwords, increments the counter by 8 instead of 4, and shifts the indexes by different amounts.

   1  example::iter_ones_using_contains:                                                                                               |    1 example::iter_ones_using_contains:
    2         push    rbx                                                                                                              |    2         push    rbx
    3         mov     r9, qword ptr [rdi + 24]                                                                                         |    3         mov     r9, qword ptr [rdi + 24]
    4         test    r9, r9                                                                                                           |    4         test    r9, r9
    5         je      .LBB31_1                                                                                                         |    5         je      .LBB32_1
    6         mov     r8, qword ptr [rdi]                                                                                              |    6         mov     r8, qword ptr [rdi]
    7         mov     r11, qword ptr [rdi + 16]                                                                                        |    7         mov     r11, qword ptr [rdi + 16]
    8         cmp     r9, 1                                                                                                            |    8         cmp     r9, 1
    9         jne     .LBB31_8                                                                                                         |    9         jne     .LBB32_8
   10         xor     eax, eax                                                                                                         |   10         xor     eax, eax
   11         xor     edx, edx                                                                                                         |   11         xor     edx, edx
   12 .LBB31_4:                                                                                                                        |   12 .LBB32_4:
   13         test    r9b, 1                                                                                                           |   13         test    r9b, 1
   14         je      .LBB31_7                                                                                                         |   14         je      .LBB32_7
   15         mov     rsi, rdx                                                                                                         |   15         mov     rsi, rdx
   16         shr     rsi, 6                                                                                                           |   16         shr     rsi, 5
   17         cmp     rsi, r11                                                                                                         |   17         cmp     rsi, r11
   18         jae     .LBB31_7                                                                                                         |   18         jae     .LBB32_7
   19         mov     edi, 1                                                                                                           |   19         mov     edi, 1
   20         mov     ecx, edx                                                                                                         |   20         mov     ecx, edx
   21         shl     rdi, cl                                                                                                          |   21         shl     edi, cl
   22         and     rdi, qword ptr [r8 + 8*rsi]                                                                                      |   22         and     edi, dword ptr [r8 + 4*rsi]
   23         cmp     rdi, 1                                                                                                           |   23         cmp     edi, 1
   24         sbb     rax, -1                                                                                                          |   24         sbb     rax, -1
   25 .LBB31_7:                                                                                                                        |   25 .LBB32_7:
   26         pop     rbx                                                                                                              |   26         pop     rbx
   27         ret                                                                                                                      |   27         ret
   28 .LBB31_1:                                                                                                                        |   28 .LBB32_1:
   29         xor     eax, eax                                                                                                         |   29         xor     eax, eax
   30         pop     rbx                                                                                                              |   30         pop     rbx
   31         ret                                                                                                                      |   31         ret
   32 .LBB31_8:                                                                                                                        |   32 .LBB32_8:
   33         mov     r10, r9                                                                                                          |   33         mov     r10, r9
   34         and     r10, -2                                                                                                          |   34         and     r10, -2
   35         xor     eax, eax                                                                                                         |   35         xor     eax, eax
   36         xor     esi, esi                                                                                                         |   36         xor     esi, esi
   37         jmp     .LBB31_9                                                                                                         |   37         jmp     .LBB32_9
   38 .LBB31_13:                                                                                                                       |   38 .LBB32_13:
   39         mov     rsi, rdx                                                                                                         |   39         mov     rsi, rdx
   40         cmp     r10, rdx                                                                                                         |   40         cmp     r10, rdx
   41         je      .LBB31_4                                                                                                         |   41         je      .LBB32_4
   42 .LBB31_9:                                                                                                                        |   42 .LBB32_9:
   43         mov     rdi, rsi                                                                                                         |   43         mov     rdi, rsi
   44         shr     rdi, 6                                                                                                           |   44         shr     rdi, 5
   45         cmp     rdi, r11                                                                                                         |   45         cmp     rdi, r11
   46         jae     .LBB31_11                                                                                                        |   46         jae     .LBB32_11
   47         mov     ecx, esi                                                                                                         |   47         mov     ecx, esi
   48         and     cl, 62                                                                                                           |   48         and     cl, 30
   49         mov     edx, 1                                                                                                           |   49         mov     edx, 1
   50         shl     rdx, cl                                                                                                          |   50         shl     edx, cl
   51         and     rdx, qword ptr [r8 + 8*rdi]                                                                                      |   51         and     edx, dword ptr [r8 + 4*rdi]
   52         cmp     rdx, 1                                                                                                           |   52         cmp     edx, 1
   53         sbb     rax, -1                                                                                                          |   53         sbb     rax, -1
   54 .LBB31_11:                                                                                                                       |   54 .LBB32_11:
   55         lea     rdx, [rsi + 2]                                                                                                   |   55         lea     rdx, [rsi + 2]
   56         cmp     rdi, r11                                                                                                         |   56         cmp     rdi, r11
   57         jae     .LBB31_13                                                                                                        |   57         jae     .LBB32_13
   58         and     sil, 62                                                                                                          |   58         and     sil, 30
   59         or      sil, 1                                                                                                           |   59         or      sil, 1
   60         mov     ebx, 1                                                                                                           |   60         mov     ebx, 1
   61         mov     ecx, esi                                                                                                         |   61         mov     ecx, esi
   62         shl     rbx, cl                                                                                                          |   62         shl     ebx, cl
   63         and     rbx, qword ptr [r8 + 8*rdi]                                                                                      |   63         and     ebx, dword ptr [r8 + 4*rdi]
   64         cmp     rbx, 1                                                                                                           |   64         cmp     ebx, 1
   65         sbb     rax, -1                                                                                                          |   65         sbb     rax, -1
   66         jmp     .LBB31_13                                                                                                        |   66         jmp     .LBB32_13                                                 

@james7132 (Collaborator, Author) commented Jan 8, 2023

Thanks for the deep dive into this.

What's the policy on this crate for the use of unsafe? We can convert a &[usize] into an equivalent &[u32] and then use that to load the target u32. This would retain gains for the batch/set operations without causing a regression in contains, but would necessitate the use of more unsafe.

I've also, in the meantime, experimented with explicitly vectorized operations, using __m128i/__m256i over usize on targets that support it, and we can net a 2-4x speedup over the existing gains on those targets. I won't add it to this PR, but that also would add a significant amount of unsafe to the crate, so I wanted to ask for an opinion ahead of time.
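
For reference, a minimal sketch of what one such vectorized batch operation could look like (hypothetical, not the experiment's actual code; assumes x86_64, where SSE2 is always available):

    #[cfg(target_arch = "x86_64")]
    fn union_with_sse2(a: &mut [usize], b: &[usize]) {
        use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_or_si128, _mm_storeu_si128};
        let len = a.len().min(b.len());
        let pairs = len & !1; // two 64-bit blocks fit in one 128-bit register
        for i in (0..pairs).step_by(2) {
            // SAFETY: i and i + 1 are in bounds for both slices, and the
            // unaligned load/store intrinsics carry no alignment requirement.
            unsafe {
                let pa = a.as_mut_ptr().add(i) as *mut __m128i;
                let pb = b.as_ptr().add(i) as *const __m128i;
                let v = _mm_or_si128(_mm_loadu_si128(pa), _mm_loadu_si128(pb));
                _mm_storeu_si128(pa, v);
            }
        }
        for i in pairs..len {
            a[i] |= b[i]; // scalar tail for an odd trailing block
        }
    }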

@jrraymond (Collaborator)

I think we should use unsafe here if it is faster because the scope is narrow and the correctness is easily verifiable. I have no objection to unsafe if there aren't alternatives, so unsafe simd is fine with me.

I took a stab at this to see if it makes a difference:

        // Addresses the bitset as u32 sub-blocks so `contains` only has to
        // load a dword instead of a qword.
        type SubBlock = u32;
        const SUB_BLOCK_BITS: usize = std::mem::size_of::<SubBlock>() * 8;
        const SUB_BLOCKS_PER_BLOCK: usize = BITS / SUB_BLOCK_BITS;
        let block_ix = bit / BITS;
        let block_offset = bit % BITS;
        let sub_block_ix = block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS;
        let offset = bit % SUB_BLOCK_BITS;
        let blocks: &[usize] = &self.data;
        // Reinterpret the usize blocks as u32 sub-blocks. Note that sub-block
        // order within a block matches bit order only on little-endian targets.
        let sub_blocks: &[SubBlock] = unsafe {
            let p = blocks.as_ptr() as *const SubBlock;
            std::slice::from_raw_parts(p, blocks.len() * SUB_BLOCKS_PER_BLOCK)
        };
        match sub_blocks.get(sub_block_ix) {
            None => false,
            Some(b) => (b & (1 << offset)) != 0,
        }

The additional logic to compute the indexes only adds overhead :(. However, in my latest round of benchmarks, contains doesn't show a regression anymore (I think I upgraded nightly, so maybe something changed there).

 name                                           master (+/- var)        usize               u64 (w/ subblocks)   u32                 u16                 u8
 bench_insert_range                             2,469 (+/- 293)         2,859     -15.80%   2,766     -12.03%    1,978      19.89%   2,805     -13.61%   2,764       0.07%
 bench_insert_range_using_loop                  2,202,318 (+/- 32,273)  2,203,539  -0.06%   2,206,400  -0.19%    2,203,542  -0.06%   2,202,270   0.00%   2,205,833   0.03%
 bench_iter_ones_all_ones                       791,858 (+/- 3,819)     710,383    10.29%   709,991    10.34%    710,495    10.27%   709,458    10.41%   710,395    -0.06%
 bench_iter_ones_all_zeros                      9,782 (+/- 41)          4,910      49.81%   4,905      49.86%    4,908      49.83%   4,903      49.88%   4,901       0.08%
 bench_iter_ones_using_contains_all_ones        597,916 (+/- 20,551)    598,262    -0.06%   597,900     0.00%    599,137    -0.20%   666,197   -11.42%   665,485   -11.30%
 bench_iter_ones_using_contains_all_zeros       597,850 (+/- 6,897)     598,477    -0.10%   597,916    -0.01%    599,802    -0.33%   666,289   -11.45%   666,522   -11.47%
 bench_iter_ones_using_slice_directly_all_ones  367,193 (+/- 8,208)     438,308   -19.37%   438,979   -19.55%    449,346   -22.37%   443,264   -20.72%   438,887     0.02%
 bench_iter_ones_using_slice_directly_all_zero  9,794 (+/- 611)         4,905      49.92%   4,908      49.89%    4,915      49.82%   4,909      49.88%   4,914      -0.12%

I think the overhead of computing the additional offsets outweighs any benefits, although I'm sure you could improve my logic for calculating sub-block offsets:

usize::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 5
        cmp     rax, qword ptr [rdi + 16]
        jae     .LBB10_1
        mov     rcx, qword ptr [rdi]
        mov     eax, dword ptr [rcx + 4*rax]
        bt      eax, esi
        setb    al
        ret
        
sublocks::FixedBitSet::contains:
        mov     rax, rsi
        shr     rax, 3
        mov     rcx, qword ptr [rdi + 16]
        shl     rcx, 3
        cmp     rax, rcx
        jae     .LBB9_1
        mov     rcx, qword ptr [rdi]
        movzx   eax, byte ptr [rcx + rax]
        and     esi, 7
        bt      eax, esi
        setb    al
        ret
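
For what it's worth, the two-step index computation collapses on its own: because the sub-blocks tile the blocks exactly, it reduces to a single division, which is the lone shr the compiler emits above:

    // Equivalent to block_ix * SUB_BLOCKS_PER_BLOCK + block_offset / SUB_BLOCK_BITS.
    let sub_block_ix = bit / SUB_BLOCK_BITS;
    let offset = bit % SUB_BLOCK_BITS;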

Given this I'd say this lgtm.

@james7132 (Collaborator, Author)

Ultimately, I think we should improve the benchmark coverage, and maybe convert the benchmarks to Criterion for more stable results.
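
A minimal sketch of what one of these benchmarks might look like ported to Criterion (assuming criterion as a dev-dependency; the set size and names are illustrative):

    use criterion::{black_box, criterion_group, criterion_main, Criterion};
    use fixedbitset::FixedBitSet;

    fn bench_iter_ones_all_ones(c: &mut Criterion) {
        // An all-ones bitset, mirroring the existing bench of the same name.
        let mut set = FixedBitSet::with_capacity(1_000_000);
        set.insert_range(..);
        c.bench_function("iter_ones_all_ones", |b| {
            b.iter(|| black_box(&set).ones().count())
        });
    }

    criterion_group!(benches, bench_iter_ones_all_ones);
    criterion_main!(benches);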

src/lib.rs Outdated
@@ -274,12 +274,13 @@ impl FixedBitSet {
    /// **Panics** if the range extends past the end of the bitset.
    #[inline]
    pub fn set_range<T: IndexRange>(&mut self, range: T, enabled: bool) {
        let slice = self.data.as_mut_slice();
Collaborator

was a small optimization?

Collaborator Author

I think this was an attempt to resolve the smallvec issues. It has since been reverted.

src/lib.rs Outdated
        while let Some(nxt) = self.iter.next() {
            if self.other.contains(nxt) {
                return Some(nxt);
        while self.bitset == 0 {
Collaborator

Does this yield a measurable speedup?

Collaborator Author

No, actually this only made it slower, though not by much. Ultimately I reverted it.

@jrraymond (Collaborator) commented Jan 9, 2023

lgtm. please squash commits then merge.

@james7132 changed the title from "Multiple Optimizations" to "Use usize as the Block instead of u32" on Jan 9, 2023
@james7132 merged commit 5cf099a into petgraph:master on Jan 9, 2023
@james7132 deleted the usize branch on January 9, 2023 at 18:35
@james7132 mentioned this pull request on Feb 17, 2023