New constructor with blocks #42
Reasonable suggestion, even if it does expose the fact that we use 32-bit blocks. Last chance to revisit that decision - is u64 better? The method signature could still be supported even if we switch to u64 internally, though.
/// Create a new **FixedBitSet** with a specific number of bits,
/// all initially clear.
pub fn with_capacity_and_blocks(bits: usize, data: Vec<Block>) -> Self
{
Is there any "format" issue that we can't check with an assertion?
The doc comment needs an update, and it should also mention when and why construction would panic.
Could it be just as good to accept an IntoIterator of Blocks?
For reference, this is how I'm using this PR: https://github.com/dib-lab/sourmash/blob/3c7de27fdfc23beedaa5173f370c93cb80a5a722/src/core/src/sketch/nodegraph.rs#L226L249
I thought the "format" issue would arise from using `LittleEndian` or `BigEndian` to build the blocks, but that's not really relevant to fixedbitset (because it only cares about it being `u32`), and it's only an issue when deserializing the data from a buffer (a file, in my case).
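To make the byte-order point concrete, here is a minimal sketch of decoding a raw buffer into 32-bit blocks in one pass, instead of one byte at a time. The function name `blocks_from_le_bytes` is hypothetical, and the little-endian layout and the zero-padding of a trailing partial block are assumptions about the file format, not something fixedbitset prescribes.

```rust
/// Decode a byte buffer into 32-bit blocks, reading every group of four
/// bytes as a little-endian integer (assumed on-disk layout).
/// Trailing bytes that do not fill a whole block are zero-padded.
pub fn blocks_from_le_bytes(bytes: &[u8]) -> Vec<u32> {
    let mut chunks = bytes.chunks_exact(4);
    let mut blocks: Vec<u32> = chunks
        .by_ref()
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();

    // Handle a final partial block, if the buffer length is not a multiple of 4.
    let rest = chunks.remainder();
    if !rest.is_empty() {
        let mut last = [0u8; 4];
        last[..rest.len()].copy_from_slice(rest);
        blocks.push(u32::from_le_bytes(last));
    }
    blocks
}
```

The resulting `Vec<u32>` could then be handed to a block-based constructor; a big-endian file would use `u32::from_be_bytes` instead.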
On the panic side: I think failing makes sense, but a panic might be too harsh. Alternatives are:
- not failing. Ignore extra blocks if too big, or put empty blocks if too short
- changing the method to return a Result instead (but all current methods panic, so maybe this is not an issue).
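The two alternatives can be sketched side by side. This is an illustration only: `lenient_blocks`, `strict_blocks`, and `BlockCountMismatch` are hypothetical names, and the code works on a plain `Vec<u32>` rather than on fixedbitset's own types.

```rust
const BITS: usize = 32;

/// Hypothetical error: the supplied block count does not match the capacity.
#[derive(Debug, PartialEq, Eq)]
pub struct BlockCountMismatch {
    pub expected: usize,
    pub found: usize,
}

/// Number of 32-bit blocks needed to hold `bits` bits.
pub fn blocks_for(bits: usize) -> usize {
    bits / BITS + (bits % BITS > 0) as usize
}

/// Alternative 1: never fail. Extra blocks are dropped and
/// missing blocks are filled with zeros.
pub fn lenient_blocks(bits: usize, mut data: Vec<u32>) -> Vec<u32> {
    data.resize(blocks_for(bits), 0);
    data
}

/// Alternative 2: report a mismatch to the caller instead of panicking.
pub fn strict_blocks(bits: usize, data: Vec<u32>) -> Result<Vec<u32>, BlockCountMismatch> {
    let expected = blocks_for(bits);
    if data.len() == expected {
        Ok(data)
    } else {
        Err(BlockCountMismatch { expected, found: data.len() })
    }
}
```

The lenient version keeps the constructor infallible, at the cost of silently discarding data when the caller passes too many blocks.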
Panic fits the common pattern of out of bounds => panic.
But you are right, if it's too short, we could just fill out with the right number of blocks and if too long, we could just save those as extra capacity (this case needs a brief walk through the code to check if that's compatible) - sounds like a no-panic solution can work!
Your code looks a bit noisy, other things could add to your overhead there. Can you avoid calling individual read_u8 calls and so on?
> But you are right, if it's too short, we could just fill out with the right number of blocks and if too long, we could just save those as extra capacity (this case needs a brief walk through the code to check if that's compatible) - sounds like a no-panic solution can work!
I removed the `assert_eq!` check and am now doing a `.resize` on the data. Keeping the extra capacity involves changing other places, like the `.as_slice` method (which would return something potentially larger than the initialized capacity), and any extra capacity would also need to be reset.
I kept a failing test to discuss this:
let fb = FixedBitSet::with_capacity_and_blocks(1, vec![8u32, 24u32]);
assert!(!fb.contains(3));
Since the capacity is 1, the `.contains(3)` call should be `false`. Right now it is `true`, because that first block is `8u32`.
So, there is an extra step of setting any bit past the capacity to 0 to make the test pass.
Or should it be panicking with out of bounds (since it is more than the capacity)?
> Your code looks a bit noisy, other things could add to your overhead there. Can you avoid calling individual read_u8 calls and so on?
Working on that, it was much, much worse... (I was parsing everything with `read_u8`, collecting set bits, and then doing a `.from_iter` before...)
> since the capacity is 1, the `.contains(3)` call should be `false`. Right now it is `true`, because that first block is `8u32`.
It's really a format error that the blocks contain something other than zero past the length. This is something fixedbitset assumes, and it is not 100% enforced in the interface.
It should probably not panic for oob - any bit outside our capacity is assumed to be not set.
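The behaviour discussed here, where blocks are resized to fit and any bit outside the capacity reads as unset, can be sketched as follows. `MiniBitSet` is a hypothetical stand-in written for this illustration, not fixedbitset's actual implementation; it assumes 32-bit blocks with bit `i` stored at position `i % 32` of block `i / 32`.

```rust
const BITS: usize = 32;

/// Hypothetical stand-in for a fixed-size bit set backed by 32-bit blocks.
pub struct MiniBitSet {
    data: Vec<u32>,
    length: usize,
}

impl MiniBitSet {
    /// Build a set of `bits` bits from pre-packed blocks without panicking:
    /// missing blocks are zero-filled, extra blocks are dropped, and bits
    /// past `bits` in the last block are cleared.
    pub fn with_capacity_and_blocks(bits: usize, blocks: impl IntoIterator<Item = u32>) -> Self {
        let n_blocks = bits / BITS + (bits % BITS > 0) as usize;
        let mut data: Vec<u32> = blocks.into_iter().collect();
        data.resize(n_blocks, 0);

        // Clear the unused high bits of the final, partially used block.
        let rem = bits % BITS;
        if rem > 0 {
            if let Some(last) = data.last_mut() {
                *last &= (1u32 << rem) - 1;
            }
        }
        MiniBitSet { data, length: bits }
    }

    /// Bits outside the capacity are reported as unset rather than panicking.
    pub fn contains(&self, bit: usize) -> bool {
        bit < self.length && (self.data[bit / BITS] >> (bit % BITS)) & 1 == 1
    }

    /// View the stored blocks.
    pub fn as_slice(&self) -> &[u32] {
        &self.data
    }
}
```

With this sketch, the failing test from the discussion passes: a set built with capacity 1 from the blocks `8u32, 24u32` keeps a single block whose stray bits are masked off, so `contains(3)` is `false`.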
I disabled the bits past capacity, thanks for the patience!
src/lib.rs
Outdated
assert_eq!(blocks, data.len());
let (mut n_blocks, rem) = div_rem(bits, BITS);
n_blocks += (rem > 0) as usize;
let data: Vec<Block> = blocks.into_iter().collect();
Thanks. Here's our consolation - Vec::into_iter().collect() roundtrips without extra allocation/copying (in this case where the iterator is fresh, no elements taken).
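A small sketch of that round trip, for readers who want to check it themselves. Both function names are hypothetical. Whether the buffer is actually reused is an implementation detail of the standard library rather than a documented guarantee, so the helper only reports what it observed.

```rust
/// Accept any iterator of blocks, as the revised constructor does.
pub fn collect_blocks<I: IntoIterator<Item = u32>>(blocks: I) -> Vec<u32> {
    blocks.into_iter().collect()
}

/// Pass a `Vec` through `into_iter().collect()` and report whether the
/// resulting vector still points at the original heap buffer.
pub fn roundtrip_reuses_buffer(v: Vec<u32>) -> (Vec<u32>, bool) {
    let before = v.as_ptr();
    let out = collect_blocks(v);
    let reused = out.as_ptr() == before;
    (out, reused)
}
```

Callers that already hold a `Vec<u32>` can therefore keep passing it directly, while other callers can pass any iterator of blocks.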
Thank you, nice. Feel free to poke me if this doesn't get released "soon".
This PR adds a new constructor taking `capacity` and `blocks`. I needed this because loading big bitsets using `.from_iter` or `.extend` was slow, and with this change it is much faster because it avoids all the conversions to find what bit to set. (Incidentally, if anyone has good approaches to loading data from disk that avoid the need for this PR, I'm interested.)

I really like all the other features in fixedbitset, and didn't want to lose them by switching to another crate... But I also understand that this is an invasive change, because it exposes internals and makes it harder to change how data is stored or accessed in the future.
Additional changes: maybe make the signature `pub unsafe fn with_capacity_and_blocks(bits: usize, data: Vec<Block>) -> Self` to indicate the caller is responsible for passing `Vec<Block>` in the right format?