Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New constructor with blocks #42

Merged
merged 4 commits into from
Apr 16, 2020
Merged

New constructor with blocks #42

merged 4 commits into from
Apr 16, 2020

Conversation

luizirber
Copy link
Contributor

This PR adds a new constructor taking capacity and blocks. I needed this because loading big bitsets using .from_iter or .extend was slow, and with this change it is much faster because it avoids all the conversions to find what bit to set. (Incidentally, if anyone has good approaches to load data from disk that avoids this PR, I'm interested).

I really like all the other features in fixedbitset, and didn't want to lose them by switching to another crate... But I also understand that this is an invasive change, because it exposes internals and makes it harder to change how data is stored or accessed in the future.

Additional changes: maybe make the signature pub unsafe fn with_capacity_and_blocks(bits: usize, data: Vec<Block>) -> Self to indicate the caller is responsible for passing Vec<block> in the right format?

@bluss
Copy link
Member

bluss commented Apr 15, 2020

Reasonable suggestion, even if it does expose the fact that we use 32-bit blocks. Last chance to revisit that decision - is u64 better? The method signature would be able to be supported even if we switch to u64 internally, though.

/// Create a new **FixedBitSet** with a specific number of bits,
/// all initially clear.
pub fn with_capacity_and_blocks(bits: usize, data: Vec<Block>) -> Self
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any "format" issue that we can't check with an assertion?

the doc comment needs update, also mentioning when and why construction would panic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be just as good to accept an IntoIterator of Blocks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, this is how I'm using this PR: https://github.com/dib-lab/sourmash/blob/3c7de27fdfc23beedaa5173f370c93cb80a5a722/src/core/src/sketch/nodegraph.rs#L226L249

I thought the "format" issue would raise from using LittleEndian or BigEndian to build the blocks, but that's not really relevant to fixedbitset (because it only cares about it being u32), and it's only an issue when serializing the data from a buffer (a file, in my case).

On the panic side: I think failing makes sense, but a panic might be too harsh. Alternatives are:

  • not failing. Ignore extra blocks if too big, or put empty blocks if too short
  • changing the method to return a Result instead (but all current methods panic, so maybe this is not an issue).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panic fits the common pattern of out of bounds => panic.

But you are right, if it's too short, we could just fill out with the right number of blocks and if too long, we could just save those as extra capacity (this case needs a brief walk through the code to check if that's compatible) - sounds like a no-panic solution can work!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code looks a bit noisy, other things could add to your overhead there. Can you avoid calling individual read_u8 calls and so on?

Copy link
Contributor Author

@luizirber luizirber Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you are right, if it's too short, we could just fill out with the right number of blocks and if too long, we could just save those as extra capacity (this case needs a brief walk through the code to check if that's compatible) - sounds like a no-panic solution can work!

I removed the assert_eq! check and doing a .resize on the data now. Keeping the extra capacity involves changing other places, like the .as_slice method (which will return something potentially larger than the initialized capacity), and there is also the need to reset any extra capacity too.

I kept a failing test to discuss this:

let fb = FixedBitSet::with_capacity_and_blocks(1, vec![8u32, 24u32]);
assert!(!fb.contains(3));

since the capacity is 1, the .contains(3) call should be false. Right now it is true, because that first block is 8u32.

So, there is an extra step of setting any bit > capacity to 0 to make the test pass.

Or should it be panicking with out of bounds (since it is more than the capacity)?

Your code looks a bit noisy, other things could add to your overhead there. Can you avoid calling individual read_u8 calls and so on?

working on that, it was much, much worse...
(I was parsing everything with read_u8, and collecting set bits, and then doing a .from_iter before...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the capacity is 1, the .contains(3) call should be false. Right now it is true, because that first block is 8u32.

It's really a format error that the blocks contain something else than zero past the length. This is something fixedbitset assumes, and it is not 100% enforced in the interface.

It should probably not panic for oob - any bit outside our capacity is assumed to be not set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disabled the bits past capacity, thanks for the patience!

src/lib.rs Outdated
assert_eq!(blocks, data.len());
let (mut n_blocks, rem) = div_rem(bits, BITS);
n_blocks += (rem > 0) as usize;
let data: Vec<Block> = blocks.into_iter().collect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Here's our consolation - Vec::into_iter().collect() roundtrips without extra allocation/copying (in this case where the iterator is fresh, no elements taken).

@bluss
Copy link
Member

bluss commented Apr 16, 2020

Thank you, nice.

Feel free to poke me if this doesn't get released "soon".

@bluss bluss merged commit 611e8f4 into petgraph:master Apr 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants