Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for building index from multiple bloom matrices with variable columns #2

Merged
merged 52 commits into from
Apr 15, 2020
Merged

Conversation

Zhicheng-Liu
Copy link

This pull request attempts to add support for building a single BIGSI index from multiple bloom matrices stored in different binary files, with different number of columns.

The PR adds two new subcommands:

  • merge_blooms merges multiple smaller bloom matrices into one larger bloom matrix
  • large_build builds a single BIGSI index from multiple bloom matrices

The benefits of this PR are:

  • building BIGSI index from multiple bloom matrices with various column sizes, instead of all single column bloom filters.
  • provides bloom filters merging before building an index. This allows building an index for a large number of samples with doing lots of index merging which are time consuming
  • low memory requirement as new functions do not load all the bloom filters into the memory. They read the bloom matrices row by row sequentially.

Phelimb and others added 30 commits May 30, 2019 10:27
Since bitarray 0.9.0, the `numpy.where(bitarray)` call does not return
the correct non zero positions for bits. Instead, it returns the index
of non zero bytes. For example, for `bitarray("0000000001000000")`, the
function call returns non zero positions `[1]`, instead of `[9]`, since
bitarray release 0.9.0.

This change fixes this issue by iterating over the bitarray and returning
the list of indices at which the element is `True`.
The bitarray_bug branch contains fix required to support the feature
developed in this branch.
Zhicheng-Liu and others added 14 commits April 6, 2020 09:42
`numpy.where(numpy.unpackbits(bitarray))[0].tolist()` seems to perform
better than `[index for index, bit in enumerate(bitarray) if bit]` and
`list(itertools.compress(itertools.count(), bitarray))` for a `bitarray`
that has large number (>1000s) of elements.
-If we declare a bit array of size m, the bits are initialiased with random values (that is why that sometimes tests work, sometimes they do not);
-we actually have to run self.bitarray.setall(0) after this command to ensure that the bitarray is built properly;
-this is also written in the docs of bitarray: https://github.com/ilanschnell/bitarray/blob/master/bitarray/__init__.py#L24-L25
-this fixes this issue
@bricoletc bricoletc removed their request for review April 7, 2020 16:53
Copy link
Member

@mbhall88 mbhall88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor suggestions (and one potential bug). As a more general suggestion I think moving towards using type annotations will help improve the readability of the code.

bigsi/__main__.py Outdated Show resolved Hide resolved
bigsi/__main__.py Outdated Show resolved Hide resolved
bigsi/__main__.py Outdated Show resolved Hide resolved
bigsi/cmds/merge_blooms.py Show resolved Hide resolved
bigsi/cmds/variant_search.py Outdated Show resolved Hide resolved
Copy link

@leoisl leoisl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work on this @Zhicheng-Liu . The code is really nice, I learned a bunch of stuff. The tests cover almost all your changes, so I feel confident about merging this to master. I have just some minor comments, I won't even request changes for this PR (you can either implement my comments, or put them as issues to be solved later).

bigsi/__main__.py Show resolved Hide resolved
bigsi/bloom/bit_matrix_reader.py Show resolved Hide resolved
bigsi/bloom/bit_matrix_writer.py Show resolved Hide resolved
bigsi/cmds/large_build.py Show resolved Hide resolved
bigsi/cmds/large_build.py Show resolved Hide resolved
@Zhicheng-Liu Zhicheng-Liu merged commit 12966ca into iqbal-lab-org:master Apr 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants