Refactor "kevlar collect" #36

Merged
merged 6 commits into from
Feb 15, 2017

standage
Collaborator

After extensive testing of khmer's counttable, comparing approximate abundance (as measured by khmer counttable) against true abundance (as measured exactly by jellyfish) at different false positive rates, I've concluded that FPR is a much bigger factor than I previously thought.

  • The good news is that it's OK if the output of kevlar find includes k-mers whose abundance in the case(s) is drastically inflated (i.e. allocating more memory at this stage is unnecessary). The final stage (kevlar collect) operates on far fewer reads, and can achieve an FPR ≈ 0.0 with very limited memory.
  • Up until now, kevlar collect has been collecting reads into a khmer nodegraph, which tracks k-mer presence/absence but not abundance. We can change this to a khmer countgraph to double-check the abundance of each putatively novel k-mer, discarding those k-mers whose true abundance falls below the threshold. The bad news is that this requires us to be careful not to load the same read twice. This is a concern when kevlar find is run in banded mode, since the same read may appear multiple times in different outputs (annotated with different novel k-mers). However, even for human-sized data sets, storing read IDs in naive data structures (such as Python's dict or set) seems tractable even on a laptop.
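To illustrate why a high FPR inflates abundances but never deflates them, here is a minimal sketch using a toy Count-Min sketch. It is not khmer's counttable: `ToyCountTable`, `inflated_fraction`, and the table sizes are hypothetical names and values chosen for illustration, and the exact counter is a plain `Counter` rather than jellyfish.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no reverse-complement handling)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class ToyCountTable:
    """Toy Count-Min sketch; a hypothetical stand-in, not khmer's API."""

    def __init__(self, table_sizes):
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # One independent hash per table, derived by salting with the table index.
        for n, table in enumerate(self.tables):
            digest = hashlib.sha256(f"{n}:{kmer}".encode()).digest()
            yield table, int.from_bytes(digest[:8], "big") % len(table)

    def add(self, kmer):
        for table, slot in self._slots(kmer):
            table[slot] += 1

    def get(self, kmer):
        # Collisions can only add to a slot, so the minimum across tables
        # is an upper bound on the true count.
        return min(table[slot] for table, slot in self._slots(kmer))


def inflated_fraction(reads, k, table_sizes):
    """Fraction of distinct k-mers whose approximate count exceeds the exact one."""
    exact = Counter()
    approx = ToyCountTable(table_sizes)
    for read in reads:
        for kmer in kmers(read, k):
            exact[kmer] += 1
            approx.add(kmer)
    inflated = sum(1 for kmer in exact if approx.get(kmer) > exact[kmer])
    return inflated / len(exact)
```

With tables that are tiny relative to the number of distinct k-mers, most counts come back inflated; with the same reads and much larger tables, the inflated fraction drops toward zero. This mirrors why a stage that sees far fewer reads can reach a negligible FPR with little memory.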

This pull request:

  • changes the nodegraph previously used by kevlar collect to a countgraph
  • changes the one-pass procedure over the input files to two passes
    • first pass: load all reads into the countgraph, making sure not to load the same read twice
    • second pass: load all novel k-mers, discarding those whose true abundance is less than the threshold specified in kevlar find
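The two-pass procedure above can be sketched roughly as follows. This is not the code in this pull request: the `collect` function and the `(read_id, sequence, novel_kmers)` record layout are assumptions made for illustration, and a plain `Counter` stands in for the khmer countgraph (so counts here are exact, and reverse complements are not collapsed).

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no reverse-complement handling)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def collect(batches, k, threshold):
    """Recount putatively novel k-mers over de-duplicated reads.

    batches: a list of lists of (read_id, sequence, novel_kmers) records,
    one inner list per input file (hypothetical layout). It is iterated
    twice, so it must not be a one-shot generator.
    """
    counts = Counter()  # stand-in for the countgraph
    seen_ids = set()    # read IDs already loaded

    # Pass 1: load every read exactly once, even if it appears in
    # several inputs annotated with different novel k-mers.
    for batch in batches:
        for read_id, sequence, _ in batch:
            if read_id in seen_ids:
                continue
            seen_ids.add(read_id)
            counts.update(kmers(sequence, k))

    # Pass 2: keep only the novel k-mers whose recounted abundance
    # meets the threshold.
    kept = {}
    for batch in batches:
        for _, _, novel_kmers in batch:
            for kmer in novel_kmers:
                if counts[kmer] >= threshold:
                    kept[kmer] = counts[kmer]
    return kept
```

The `seen_ids` set is what prevents a read that shows up in two banded outputs from being counted twice, which would otherwise push a k-mer over the threshold spuriously.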

@standage
Collaborator Author

To be clear, this all but eliminates the problems we've been seeing with some reported contigs having only 1 supporting read. A very small handful still have fewer supporting reads than expected, and it looks like we need to consider how we're going to handle low-complexity sequence.

@ctb
Collaborator

ctb commented Feb 15, 2017 via email

@standage
Collaborator Author

Now if only we could loosen the restrictions on the strings that khmer nodetable can consume as input... wait!
