New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster newline count via bit twiddling #122

Merged
merged 5 commits into from Jan 5, 2017

Conversation

Projects
None yet
5 participants
@llogiq
Contributor

llogiq commented Sep 21, 2016

The alignment splitting technique as well as the idea to partially unroll the loop to 2脳wordsize was nicked from @BurntSushi 's rust-memchr crate, and the bit twiddling is classic Stanford. 馃槃

In my benchmarks, I see between 100% and 200% speedup.

@googlebot googlebot added the cla: yes label Sep 21, 2016

@BurntSushi

This comment has been minimized.

Show comment
Hide comment
@BurntSushi

BurntSushi Sep 21, 2016

Nice! One note: @bluss is the one responsible for the initial memchr implementation. I just started the crate. :-)

I also have a fast line counting routine that is similar to the one you have here. One difference is that count_zero_bytes is implemented this way:

(x.wrapping_sub(LO) & !x & HI).count_ones() as u64

On architectures that support it, this will emit a POPCNT instruction, which gives another huge perf boost. (With rustc, I believe you need to explicitly enable this with either RUSTFLAGS="-C target-cpu=native" or RUSTFLAGS="-C target-feature=+popcnt".)

Nice! One note: @bluss is the one responsible for the initial memchr implementation. I just started the crate. :-)

I also have a fast line counting routine that is similar to the one you have here. One difference is that count_zero_bytes is implemented this way:

(x.wrapping_sub(LO) & !x & HI).count_ones() as u64

On architectures that support it, this will emit a POPCNT instruction, which gives another huge perf boost. (With rustc, I believe you need to explicitly enable this with either RUSTFLAGS="-C target-cpu=native" or RUSTFLAGS="-C target-feature=+popcnt".)

@llogiq

This comment has been minimized.

Show comment
Hide comment
@llogiq

llogiq Sep 21, 2016

Contributor

Cool 馃槑 鈥 I'll be sure to benchmark that, but first I need to catch some sleep. 馃槾

Contributor

llogiq commented Sep 21, 2016

Cool 馃槑 鈥 I'll be sure to benchmark that, but first I need to catch some sleep. 馃槾

@llogiq

This comment has been minimized.

Show comment
Hide comment
@llogiq

llogiq Sep 23, 2016

Contributor

@BurntSushi see benchmark 鈥 I widened the loop to up to 8 words per run to amortize the bit counting better.

Contributor

llogiq commented Sep 23, 2016

@BurntSushi see benchmark 鈥 I widened the loop to up to 8 words per run to amortize the bit counting better.

@zdanz

This comment has been minimized.

Show comment
Hide comment
@zdanz

zdanz Sep 27, 2016

@llogiq @BurntSushi sorry if I'm missing something obvious, but it seems to me that both of your count_zero_bytes / count_eol implementations are relying on an algorithm which detects zero bytes -- it doesn't count them correctly. For example if you pass in 0x100, you will get back 8, even though there are only 7 zero bytes.

Is there higher-level code which corrects this? I confess that I didn't go over it carefully.

zdanz commented Sep 27, 2016

@llogiq @BurntSushi sorry if I'm missing something obvious, but it seems to me that both of your count_zero_bytes / count_eol implementations are relying on an algorithm which detects zero bytes -- it doesn't count them correctly. For example if you pass in 0x100, you will get back 8, even though there are only 7 zero bytes.

Is there higher-level code which corrects this? I confess that I didn't go over it carefully.

@BurntSushi

This comment has been minimized.

Show comment
Hide comment
@BurntSushi

BurntSushi Sep 27, 2016

@zdanz See this for an explanation: http://bits.stephan-brumme.com/null.html --- Note how it is called: https://github.com/BurntSushi/ripgrep/blob/0263a401f628ff910858b303b52e169559cd4ca9/src/search_stream.rs#L621

(There's no other higher level code that corrects this in ripgrep.)

@zdanz See this for an explanation: http://bits.stephan-brumme.com/null.html --- Note how it is called: https://github.com/BurntSushi/ripgrep/blob/0263a401f628ff910858b303b52e169559cd4ca9/src/search_stream.rs#L621

(There's no other higher level code that corrects this in ripgrep.)

@zdanz

This comment has been minimized.

Show comment
Hide comment
@zdanz

zdanz Sep 27, 2016

@BurntSushi Thanks for the reply! I did see the explanation at that link: note that it "detects" zero bytes. hasZeroByte there returns a bool.

I thought I might be missing the point when I looked at the search_stream code you pointed me to, so I cloned ripgrep and added this line to the top of main():

println!("zdanz: {}", search_stream::count_lines(&[1,0,1,1,1,1,1,1, 1,1,1,1,1,1,0,1], 1));

It prints out 16 when it should print out 14. Shouldn't it? (If you replace the 0's with 2's, then it does print out 14.)

EDIT: @llogiq similarly, when I use the bytecount crate to count the b'1' in "10111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101" (via as_bytes()) I get 128, when I should get 126. Change the 0's to 2's, and I do get 126. llogiq/bytecount#3

zdanz commented Sep 27, 2016

@BurntSushi Thanks for the reply! I did see the explanation at that link: note that it "detects" zero bytes. hasZeroByte there returns a bool.

I thought I might be missing the point when I looked at the search_stream code you pointed me to, so I cloned ripgrep and added this line to the top of main():

println!("zdanz: {}", search_stream::count_lines(&[1,0,1,1,1,1,1,1, 1,1,1,1,1,1,0,1], 1));

It prints out 16 when it should print out 14. Shouldn't it? (If you replace the 0's with 2's, then it does print out 14.)

EDIT: @llogiq similarly, when I use the bytecount crate to count the b'1' in "10111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101" (via as_bytes()) I get 128, when I should get 126. Change the 0's to 2's, and I do get 126. llogiq/bytecount#3

@BurntSushi

This comment has been minimized.

Show comment
Hide comment
@BurntSushi

BurntSushi Sep 27, 2016

@zdanz That's interesting. I've done quite a bit of out-of-band testing (by comparing the output of ripgrep with grep), and I haven't seen it get line counts wrong yet. But, you're example is pretty compelling. I'll put it on my list to check out. @llogiq This might be a good time to write a quickcheck property the checks the naive solution with the optimized version.

@zdanz That's interesting. I've done quite a bit of out-of-band testing (by comparing the output of ripgrep with grep), and I haven't seen it get line counts wrong yet. But, you're example is pretty compelling. I'll put it on my list to check out. @llogiq This might be a good time to write a quickcheck property the checks the naive solution with the optimized version.

@llogiq

This comment has been minimized.

Show comment
Hide comment
@llogiq

llogiq Sep 28, 2016

Contributor

I have a quickcheck in place, and the current version no longer exhibits the incorrect overflow behavior.

Contributor

llogiq commented Sep 28, 2016

I have a quickcheck in place, and the current version no longer exhibits the incorrect overflow behavior.

@llogiq

This comment has been minimized.

Show comment
Hide comment
@llogiq

llogiq Dec 7, 2016

Contributor

I just added features (similar to what ripgrep uses) to allow for SSE/AVX usage where available.

Contributor

llogiq commented Dec 7, 2016

I just added features (similar to what ripgrep uses) to allow for SSE/AVX usage where available.

@llogiq

This comment has been minimized.

Show comment
Hide comment
@llogiq

llogiq Dec 15, 2016

Contributor

@raphlinus What's the status on this? Is there something we need to do before merging?

Contributor

llogiq commented Dec 15, 2016

@raphlinus What's the status on this? Is there something we need to do before merging?

@llogiq

This comment has been minimized.

Show comment
Hide comment
@llogiq

llogiq Jan 5, 2017

Contributor

Again, is there something we need to do?

Contributor

llogiq commented Jan 5, 2017

Again, is there something we need to do?

@raphlinus raphlinus merged commit bc2d545 into google:master Jan 5, 2017

1 check passed

cla/google All necessary CLAs are signed
@raphlinus

This comment has been minimized.

Show comment
Hide comment
@raphlinus

raphlinus Jan 5, 2017

Collaborator

Sorry, I was delaying merging this because I was uneasy about unsafe code, but now I consider bytecount to be part of the "honorary part of rust's stdlib so it's ok if it has unsafe".

Collaborator

raphlinus commented Jan 5, 2017

Sorry, I was delaying merging this because I was uneasy about unsafe code, but now I consider bytecount to be part of the "honorary part of rust's stdlib so it's ok if it has unsafe".

@llogiq

This comment has been minimized.

Show comment
Hide comment
@llogiq

llogiq Jan 6, 2017

Contributor

Ah,I see 馃槃

It may relieve you to hear that bytecount has reduced its use of unsafe code to one line where we build the slice to aligned memory.

Contributor

llogiq commented Jan 6, 2017

Ah,I see 馃槃

It may relieve you to hear that bytecount has reduced its use of unsafe code to one line where we build the slice to aligned memory.

@llogiq llogiq deleted the llogiq:fast_newline branch Jan 6, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment