Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ones using trailing_zeros #18

Merged
merged 2 commits into from
Jan 23, 2021
Merged

Conversation

vks
Copy link
Contributor

@vks vks commented Feb 22, 2018

The iter_ones_all_zeros benchmark improved a lot, the results for iter_ones_all_ones are too noisy on my machine to tell. Will update when I have better numbers.

Fixes #17.

@vks
Copy link
Contributor Author

vks commented Feb 22, 2018

I ported the benchmarks to criterion (see #19), hoping to get less noisy results. I still don't trust them, but here they are:

iter ones: all zeros/default                        
                        time:   [10.350 us 10.395 us 10.440 us]
                        change: [-57.692% -57.491% -57.299%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

iter ones: all ones/default                        
                        time:   [433.47 us 433.99 us 434.56 us]
                        change: [+19.300% +21.654% +23.611%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

iter ones: random/default                        
                        time:   [447.96 us 448.54 us 449.20 us]
                        change: [-83.733% -83.682% -83.634%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe

The actual noise is at least 2%. The last benchmark is newly introduced and uses (uniformly) random bits. For the "all ones" benchmark, the slice-based implementation is best:

iter ones: all ones/slice directly                        
                        time:   [367.65 us 368.20 us 368.80 us]
                        change: [-1.6927% -1.4273% -1.1618%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

(Note that the change in performance is bogus, the code was not changed.)

@vks vks mentioned this pull request Feb 22, 2018
}
self.bitset = self.remaining_blocks[0];
Copy link

@akraines akraines Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of immediately setting bitset to remaining_blocks[0] and updating remaining blocks, why don't you search for the first none zero block and only then update remaining_blocks? (just to save some assignments)

Copy link

@akraines akraines Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(the PR has been opened for almost 3 years, so I guess nothing needs be done until the maintainer shows interest, but I figured I'd add the comment in case something comes of it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is faster, because it might mess up vectorization. Could be interesting to benchmark.

@bluss
Copy link
Member

bluss commented Jan 22, 2021

Looks good to me. Sorry for the long wait, I'm not sure if there was a reason for not including this

@vks
Copy link
Contributor Author

vks commented Jan 22, 2021

I rebased it on master.

return None;
}
while self.bitset == 0 {
if self.remaining_blocks.is_empty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can accomplish all of is_empty, [0] and [1..] in one method call - self.remaining_blocks.split_first() for a presumed minor win in terms of bounds checking overhead 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or for that matter, reading split_first's implmentation with slice patterns - but I think that's a much more recent feature if let [first, tail @ ..] = self { Some((first, tail)) } else { None } but it's good to read and use elsewhere if not here.

@bluss
Copy link
Member

bluss commented Jan 22, 2021

Looks good to me.

What you think @jrraymond? If you say go I can maybe look at releasing unless you want to do it

@jrraymond
Copy link
Collaborator

jrraymond commented Jan 23, 2021

lgtm as well. I don't mind releasing this morning.

I dumped the asm because I was curious to see if trailing zeroes does produce better asm and it does:

<old::Ones as core::iter::traits::iterator::Iterator>::next:
        mov     ecx, dword ptr [rdi + 32]
        mov     rsi, qword ptr [rdi]
        xor     eax, eax
.LBB8_1:
        mov     rdx, rsi
        mov     esi, ecx
        shr     ecx
        test    sil, 1
        jne     .LBB8_5
        lea     rsi, [rdx + 1]
        test    ecx, ecx
        jne     .LBB8_1
        mov     rsi, qword ptr [rdi + 24]
        test    rsi, rsi
        je      .LBB8_6
        mov     rdx, qword ptr [rdi + 16]
        add     rsi, -1
        mov     ecx, dword ptr [rdx]
        add     rdx, 4
        mov     qword ptr [rdi + 16], rdx
        mov     qword ptr [rdi + 24], rsi
        mov     rsi, qword ptr [rdi + 8]
        add     rsi, 1
        mov     qword ptr [rdi + 8], rsi
        shl     rsi, 5
        jmp     .LBB8_1
.LBB8_5:
        mov     dword ptr [rdi + 32], ecx
        lea     rax, [rdx + 1]
        mov     qword ptr [rdi], rax
        mov     eax, 1
.LBB8_6:
        ret

<new::Ones as core::iter::traits::iterator::Iterator>::next:
        mov     r8d, dword ptr [rdi + 24]
        test    r8d, r8d
        je      .LBB7_3
        mov     rdx, qword ptr [rdi]
        jmp     .LBB7_2
.LBB7_3:
        mov     rsi, qword ptr [rdi + 16]
        add     rsi, -1
        xor     eax, eax
.LBB7_4:
        cmp     rsi, -1
        je      .LBB7_5
        mov     rdx, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 8]
        mov     r8d, dword ptr [rcx]
        mov     dword ptr [rdi + 24], r8d
        add     rcx, 4
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi + 16], rsi
        add     rdx, 1
        mov     qword ptr [rdi], rdx
        add     rsi, -1
        test    r8d, r8d
        je      .LBB7_4
.LBB7_2:
        bsf     eax, r8d   ; bit scan forward \o/
        lea     esi, [r8 - 1]
        and     esi, r8d
        mov     dword ptr [rdi + 24], esi
        shl     rdx, 5
        or      rdx, rax
        mov     eax, 1
        ret
.LBB7_5:
        ret

@jrraymond jrraymond merged commit 1563358 into petgraph:master Jan 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement ones using trailing_zeros
4 participants