Refactor "kevlar collect" #36
Merged
To be clear, this all but eliminates the problems we've been seeing with some reported contigs having only 1 supporting read. There is still a very small handful with fewer supporting reads than expected, and it looks like we need to consider how we're going to handle low-complexity sequence.
Could use bloom filter with hash of sequence ids. Just sayin'
…--
Titus Brown, ctbrown@ucdavis.edu
On Feb 15, 2017, at 2:47 PM, Daniel Standage ***@***.***> wrote:
After some extensive testing of khmer's counttable and measuring the effects of the false positive rate on approximate abundance (as measured by khmer counttable) vs true abundance (as measured by exact method jellyfish), I've concluded that FPR is a much bigger factor than I previously thought.
The good news is that it's OK if the output of kevlar find includes k-mers whose abundance in the case(s) is drastically inflated (i.e. allocating more memory at this stage is unnecessary). The final stage (kevlar collect) operates on many fewer reads, and can achieve an FPR ≈0.0 with very limited memory.
Up until now, kevlar collect has been collecting reads into a khmer nodegraph, which tracks k-mer presence/absence but not abundance. We can change this to a khmer countgraph to double-check the abundance of each putatively novel k-mer, discarding those k-mers whose true abundance falls below the threshold. The bad news is that this requires us to be careful not to load the same read twice. This is a concern when kevlar find is run in banded mode, since the same read may appear multiple times in different outputs (annotated with different novel k-mers). However, even for human-sized data sets, storing read IDs in naive data structures (such as Python's dict or set) seems tractable even on a laptop.
This pull request:
- changes the nodegraph previously used by kevlar collect to a countgraph
- changes the one-pass procedure over the input files to two passes
  - first pass: loads all reads into the countgraph, making sure not to load the same read twice
  - second pass: loads all novel k-mers, careful to discard those whose true abundance is less than the threshold specified in kevlar find
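
For readers skimming the thread, here is a rough sketch of the two-pass idea described above. It is not the code in kevlar/collect.py: the function name `two_pass_collect`, its parameters, and the `(read_id, sequence)` input shape are all made up for illustration, and an exact `collections.Counter` stands in for the khmer countgraph. It also ignores reverse complements.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def two_pass_collect(read_batches, candidate_kmers, k, threshold):
    """Hypothetical helper illustrating the two passes.

    read_batches: iterable of batches, each an iterable of (read_id, sequence)
    candidate_kmers: putatively novel k-mers to re-check
    """
    counts = Counter()  # exact stand-in for a khmer countgraph (assumption)
    seen_ids = set()    # naive read-ID store, as suggested in the description

    # First pass: count k-mers, loading each read only once even if it
    # shows up in several batches (e.g. several banded outputs).
    for batch in read_batches:
        for read_id, seq in batch:
            if read_id in seen_ids:
                continue
            seen_ids.add(read_id)
            counts.update(kmers(seq, k))

    # Second pass: keep only candidates whose abundance meets the threshold.
    return {km for km in candidate_kmers if counts[km] >= threshold}


batch1 = [("r1", "ACGTAC"), ("r2", "CGTACG")]
batch2 = [("r1", "ACGTAC")]  # same read repeated in a second batch
result = two_pass_collect([batch1, batch2], ["ACGT", "CGTA", "TACG"], k=4, threshold=2)
print(sorted(result))  # ['CGTA']
```

Without the `seen_ids` check, the repeated read `r1` would push `ACGT` up to an abundance of 2 and it would wrongly pass the threshold, which is the double-loading problem the description warns about.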
Commit Summary
Two passes for filtering inflated abundances
New test
Added test for duplicated reads
Accidentally omitted test data files
Fixed problem with argument passing, added some informative debugging output
File Changes
M .travis.yml (2)
M kevlar/collect.py (114)
A tests/data/collect.alpha.txt (69)
A tests/data/collect.beta.1.txt (48)
A tests/data/collect.beta.2.txt (48)
M tests/test_collect.py (75)
Patch Links:
https://github.com/dib-lab/kevlar/pull/36.patch
https://github.com/dib-lab/kevlar/pull/36.diff
Now if only we could loosen the restrictions on strings khmer nodetable could consume as input...wait!