Refactor "kevlar collect" #36

standage · 2017-02-15T19:47:49Z

After extensive testing of khmer's counttable, measuring how the false positive rate (FPR) affects approximate abundance (as reported by khmer's counttable) relative to true abundance (as reported by the exact counter jellyfish), I've concluded that FPR is a much bigger factor than I previously thought.
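To illustrate why a high FPR inflates approximate abundance, here is a minimal pure-Python sketch of a Count-Min-style counter. It is not khmer's implementation: the class `CountMinSketch`, the helper `overcounts`, and the SHA-256 based hashing are all assumptions made for illustration. The point it shows is that collisions can only push an estimate above the true count, never below it.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Toy approximate counter (hypothetical; not khmer's counttable)."""

    def __init__(self, width, depth):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one bucket per row by salting the hash with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, table in enumerate(self.tables):
            table[self._index(item, row)] += 1

    def get(self, item):
        # The minimum across rows is the least-inflated estimate.
        return min(
            table[self._index(item, row)]
            for row, table in enumerate(self.tables)
        )


def overcounts(items, width, depth):
    """Return {item: estimate - true count}; every value is >= 0."""
    sketch = CountMinSketch(width, depth)
    for item in items:
        sketch.add(item)
    true = Counter(items)
    return {item: sketch.get(item) - n for item, n in true.items()}
```

Shrinking `width` relative to the number of distinct items makes collisions, and therefore inflated counts, more likely, which is the memory/FPR trade-off described above.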

The good news is that it's OK if the output of kevlar find includes k-mers whose abundance in the case(s) is drastically inflated (i.e. allocating more memory at this stage is unnecessary). The final stage (kevlar collect) operates on many fewer reads, and can achieve an FPR ≈ 0.0 with very limited memory.

Up until now, kevlar collect has been collecting reads into a khmer nodegraph, which tracks k-mer presence/absence but not abundance. We can change this to a khmer countgraph to double-check the abundance of each putatively novel k-mer, discarding those k-mers whose true abundance falls below the threshold.

The bad news is that this requires us to be careful not to load the same read twice. This is a concern when kevlar find is run in banded mode, since the same read may appear multiple times in different outputs (annotated with different novel k-mers). However, even for human-sized data sets, storing read IDs in naive data structures (such as Python's dict or set) seems tractable even on a laptop.

This pull request:

- changes the nodegraph previously used by kevlar collect to a countgraph
- changes the one-pass procedure over the input files to two passes
  - first pass: loads all reads into the countgraph, making sure not to load the same read twice
  - second pass: loads all novel k-mers, taking care to discard those whose true abundance is less than the threshold specified in kevlar find
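The two passes listed above can be sketched roughly as follows. This is not the code in kevlar/collect.py: the function names, the `(read_id, sequence, novel_kmers)` record layout, and the use of `collections.Counter` as an exact stand-in for a khmer countgraph are assumptions for illustration, and the sketch ignores reverse complements.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def collect(records, k, threshold):
    """Return the novel k-mers whose recomputed abundance meets threshold.

    records: iterable of (read_id, sequence, novel_kmers) tuples
    (hypothetical layout); the same read may appear more than once
    with different annotated k-mers.
    """
    records = list(records)
    counts = Counter()  # stand-in for the countgraph
    seen = set()        # read IDs already loaded

    # First pass: count k-mers, loading each read only once.
    for read_id, seq, _ in records:
        if read_id in seen:
            continue
        seen.add(read_id)
        counts.update(kmers(seq, k))

    # Second pass: keep only novel k-mers that pass the threshold.
    kept = set()
    for _, _, novel in records:
        for kmer in novel:
            if counts[kmer] >= threshold:
                kept.add(kmer)
    return kept
```

Without the `seen` check, a read reported in two banded outputs would be counted twice, and a k-mer with a single supporting read could wrongly pass a threshold of 2.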


standage · 2017-02-15T19:52:57Z

To be clear, this all but eliminates the problems we've been seeing with some reported contigs having only 1 supporting read. There is still a very small handful with fewer supporting reads than expected, and it looks like we need to consider how we're going to handle low-complexity sequence.

ctb · 2017-02-15T20:09:37Z

Could use bloom filter with hash of sequence ids. Just sayin'

…

-- Titus Brown, ctbrown@ucdavis.edu
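A minimal sketch of the Bloom filter idea suggested here, in pure Python. Nothing below comes from kevlar or khmer: the `BloomFilter` class, its parameters, and the SHA-256 double-hashing scheme are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for tracking read IDs (hypothetical sketch)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes positions from one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        """Insert item; return True if it was possibly already present."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present
```

A read would be loaded only when `add(read_id)` returns `False`. The trade-off against a plain `set` is that a Bloom filter can report a false positive, in which case a read that was never loaded would be skipped and its k-mers undercounted.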


standage · 2017-02-15T20:18:07Z

Now if only we could loosen the restrictions on strings khmer nodetable could consume as input...wait!

standage added 5 commits February 14, 2017 21:28

Two passes for filtering inflated abundances

4b37e6c

New test

a39bf7c

Added test for duplicated reads

1a2e75d

Accidentally omitted test data files

abb5e56

Fixed problem with argument passing, added some informative debugging output

c4fc6a7

Fix pipe test

193706d

standage mentioned this pull request Feb 15, 2017

Track read IDs with a bloom filter in kevlar collect #37

Closed

standage merged commit 1cbfaae into refactor/find Feb 15, 2017

standage deleted the refactor/collect branch February 15, 2017 20:47

standage mentioned this pull request Aug 4, 2017

Handle contaminants and reference the same way using a unified "mask" interface #103

Merged
