Add support for building index from multiple bloom matrices with variable columns #2
Since bitarray 0.9.0, calling `numpy.where(bitarray)` no longer returns the positions of the set bits. Instead, it returns the indices of the nonzero bytes. For example, for `bitarray("0000000001000000")` the call returns `[1]` instead of `[9]`. This change fixes the issue by iterating over the bitarray and returning the list of indices at which the element is `True`.
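The difference between byte indices and bit indices can be sketched with the standard library alone. In this sketch a plain `bytes` object stands in for the bitarray's underlying buffer, the bit order is assumed to be most significant bit first, and both function names are hypothetical (they are not part of bitarray, numpy, or this PR).

```python
from typing import List


def nonzero_byte_indices(buf: bytes) -> List[int]:
    """Indices of nonzero bytes: the kind of result described as wrong above."""
    return [i for i, byte in enumerate(buf) if byte]


def set_bit_indices(buf: bytes) -> List[int]:
    """Indices of set bits, assuming most-significant-bit-first order."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(buf)
        for bit in range(8)
        if byte & (0x80 >> bit)
    ]


# Same bit pattern as "0000000001000000": two bytes, one set bit.
buf = bytes([0b00000000, 0b01000000])
print(nonzero_byte_indices(buf))  # [1]
print(set_bit_indices(buf))       # [9]
```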
The bitarray_bug branch contains the fix required to support the feature developed in this branch.
`numpy.where(numpy.unpackbits(bitarray))[0].tolist()` seems to perform better than `[index for index, bit in enumerate(bitarray) if bit]` and `list(itertools.compress(itertools.count(), bitarray))` for a `bitarray` that has a large number (thousands or more) of elements.
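For reference, the two pure-Python alternatives mentioned above can be sketched as follows. A plain list of 0/1 values stands in for the `bitarray` here (an assumption made so the sketch needs no third-party packages), and the function names are hypothetical. The numpy variant is left out, and no timing claim is made.

```python
from itertools import compress, count
from typing import Iterable, List


def indices_by_enumerate(bits: Iterable[int]) -> List[int]:
    """Collect the positions of truthy elements with a list comprehension."""
    return [index for index, bit in enumerate(bits) if bit]


def indices_by_compress(bits: Iterable[int]) -> List[int]:
    """Same result: compress() keeps the counter values whose selector is truthy."""
    return list(compress(count(), bits))


# Stand-in for bitarray("0000000001000000").
bits = [0] * 9 + [1] + [0] * 6
print(indices_by_enumerate(bits))  # [9]
print(indices_by_compress(bits))   # [9]
```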
- If we declare a bit array of size m, the bits are initialised with random values (which is why the tests sometimes pass and sometimes fail);
- we actually have to run `self.bitarray.setall(0)` after this command to ensure that the bitarray is built properly;
- this is also written in the docs of bitarray: https://github.com/ilanschnell/bitarray/blob/master/bitarray/__init__.py#L24-L25
- this fixes this issue
Just some minor suggestions (and one potential bug). As a more general suggestion, I think moving towards type annotations will help improve the readability of the code.
Amazing work on this, @Zhicheng-Liu. The code is really nice, and I learned a bunch of stuff. The tests cover almost all your changes, so I feel confident about merging this to master. I have just some minor comments, and I won't even request changes for this PR (you can either implement my comments or file them as issues to be solved later).
This pull request attempts to add support for building a single BIGSI index from multiple bloom matrices stored in different binary files, with different numbers of columns.
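As a rough illustration of combining bloom matrices with different column counts, here is a minimal sketch. It assumes that every matrix has the same number of rows and that merging means concatenating the columns row by row; the PR text does not spell this out. Nested lists stand in for the binary files, and `merge_bloom_matrices` is a hypothetical name, not the PR's actual implementation.

```python
from typing import List

Row = List[int]
Matrix = List[Row]


def merge_bloom_matrices(matrices: List[Matrix]) -> Matrix:
    """Concatenate matrices column-wise (assumed merge semantics)."""
    if not matrices:
        return []
    num_rows = len(matrices[0])
    if any(len(matrix) != num_rows for matrix in matrices):
        raise ValueError("all bloom matrices must have the same number of rows")
    merged: Matrix = []
    for row_index in range(num_rows):
        merged_row: Row = []
        for matrix in matrices:
            # Append this matrix's columns for the current row.
            merged_row.extend(matrix[row_index])
        merged.append(merged_row)
    return merged


# Three matrices with 3 rows each and 2, 1, and 3 columns respectively.
a = [[1, 0], [0, 1], [0, 0]]
b = [[1], [1], [0]]
c = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
merged = merge_bloom_matrices([a, b, c])
print(len(merged), len(merged[0]))  # 3 6
```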
The PR adds two new subcommands:
- `merge_blooms`: merges multiple smaller bloom matrices into one larger bloom matrix
- `large_build`: builds a single BIGSI index from multiple bloom matrices

The benefits of this PR are: