adding stats to dedup #55

sergpolly · 2017-11-25T01:20:09Z

In order to address open2c/distiller-nf#80 we agreed to add stats to the dedup.

And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)

Technical questions:

1. Should we parse line_buffer[i] right next to where we write it to outstream?
2. Or should we instead fill cols_buffer[i] (currently used only when marking duplicates) for stats as well, and skip parsing line_buffer[i]?
3. Stats-related: should we change the stats API so that, instead of out_stat.add_pair(algn2, algn1, pair_type), we call, say, out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type), where algn={'chrom':c,'pos':p,'strand':s}?
   Such an API change would simplify the stats module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78
   The current signature was chosen to keep the parsing code simple, i.e. to avoid unpacking the alignment dictionaries there.
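A minimal sketch of the flattened signature proposed in the third question, next to a thin adapter for callers that still hold alignment dictionaries. The class name `PairCounter`, the counter keys, and the argument order are assumptions made for illustration; this is not the actual pairsam_stats code.

```python
from collections import Counter


class PairCounter:
    """Hypothetical stand-in for the stats object (not the pairsamtools class)."""

    def __init__(self):
        self.counts = Counter()

    def add_pair(self, chrom1, pos1, strand1, chrom2, pos2, strand2, pair_type):
        # The signature states exactly which alignment fields the stats use,
        # so nothing has to be unpacked from a dictionary in here.
        self.counts["total"] += 1
        self.counts["pair_types/" + pair_type] += 1
        if chrom1 == chrom2:
            self.counts["cis"] += 1
        else:
            self.counts["trans"] += 1


def add_pair_from_dicts(stats, algn1, algn2, pair_type):
    """Adapter for code that still builds {'chrom', 'pos', 'strand'} dicts."""
    stats.add_pair(
        algn1["chrom"], algn1["pos"], algn1["strand"],
        algn2["chrom"], algn2["pos"], algn2["strand"],
        pair_type,
    )


stats = PairCounter()
stats.add_pair("chr1", 100, "+", "chr1", 5000, "-", "UU")
add_pair_from_dicts(
    stats,
    {"chrom": "chr1", "pos": 200, "strand": "+"},
    {"chrom": "chr2", "pos": 300, "strand": "-"},
    "UU",
)
```

With an adapter like this, the parsing code could keep its dictionaries while dedup passes columns directly.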

@golobor, @nvictus, what do you think?


golobor · 2017-11-28T19:29:25Z

Awesome suggestions!

1. The suggested change in the stats interface is great, since it makes the API much more transparent: it's nice to show explicitly which characteristics of the alignment we actually use when we calculate statistics. It will also simplify re-using stats in other contexts.
2. Indeed, using cols_buffer instead of line_buffer to save one splitting operation is nice! The issue, however, is that we do not add unmapped lines to cols_buffer, so we'd have to add them in a separate command.
3. Finally, the question is where to calculate the duplicate statistics. The two options are:
   a) Keep it the way it is right now: calculate dup/dedup statistics in dedup and insert the final data into the stats object. Pro: it's a minimal change from what we have right now. Con: it spreads the calculation of statistics across different modules.
   b) Rewrite stats to calculate dup/dedup numbers inside it, by checking the pair_type flag (we tag duplicates with a DD flag). Pro: it nicely localizes the calculation of statistics in a single module. Cons: we won't be able to output only the dup/dedup statistics, and will have to calculate the full statistics every time we do dedup. Also, right now tagging duplicate reads with DD is optional and is performed only when we output duplicates into a separate file. With the proposed scheme we'd have to tag them every time, which is probably not a big deal.
Not sure what the optimal solution is. @nvictus ?..
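Option (b) could look roughly like the sketch below, where the stats object itself derives the duplicate count from the pair_type flag. The class name `DedupAwareCounter` and the counter keys are hypothetical; the only detail taken from the discussion is that duplicates carry the DD pair type, which the caller would then have to set on every duplicate.

```python
from collections import Counter


class DedupAwareCounter:
    """Hypothetical stats object that counts duplicates on its own (option b)."""

    DUP_TYPE = "DD"  # duplicates are tagged with DD, as described above

    def __init__(self):
        self.counts = Counter()

    def add_pair(self, chrom1, pos1, strand1, chrom2, pos2, strand2, pair_type):
        self.counts["total"] += 1
        if pair_type == self.DUP_TYPE:
            # dedup no longer keeps its own duplicate counter
            self.counts["total_dups"] += 1
        else:
            self.counts["total_nodups"] += 1


counter = DedupAwareCounter()
counter.add_pair("chr1", 100, "+", "chr1", 900, "-", "UU")
counter.add_pair("chr1", 100, "+", "chr1", 900, "-", "DD")  # duplicate of the first
counter.add_pair("chr2", 50, "-", "chr3", 70, "+", "UU")
```

The trade-off mentioned above is visible here: every pair has to pass through add_pair, so the full statistics are computed whenever dedup runs.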

sergpolly · 2017-12-05T00:55:34Z

@golobor So I've finally got around to doing something about this issue, but I started small and addressed only item (1) from your list.
I'm still digesting (2) and (3).

By (3b), do you mean we should get rid of
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and "hide" everything inside stats, by enabling add_pair to keep track of unmapped pairs and duplicates?
In this case, cols_buffer is not sufficient, as it does not contain "bad" pairs ...
Maybe in this case we can use the parsed cols list, since we'll be feeding all pairs to add_pair regardless of whether they are "normal" or not:
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L278

(3a), in that case, would imply keeping
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and instead using add_pair to keep track of "normal" pairs only, for which cols_buffer seems to be sufficient? But then we'd have to accumulate that buffer always, not just under if mark_dups: ...
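To make the (3b) variant concrete, here is a rough sketch in which every line is split once and all pairs, "normal" or not, are passed to an add_pair-style callback. The column positions and the helper name `feed_stats` are assumptions for illustration and are not taken from pairsam_dedup.py.

```python
# Assumed column positions in a tab-separated pairs line (hypothetical).
COL_C1, COL_P1, COL_C2, COL_P2, COL_S1, COL_S2, COL_PTYPE = 1, 2, 3, 4, 5, 6, 7


def feed_stats(lines, add_pair, sep="\t"):
    """Split each line once and hand every pair to add_pair; return the pair count."""
    n_pairs = 0
    for line in lines:
        cols = line.rstrip("\n").split(sep)
        add_pair(
            cols[COL_C1], int(cols[COL_P1]), cols[COL_S1],
            cols[COL_C2], int(cols[COL_P2]), cols[COL_S2],
            cols[COL_PTYPE],
        )
        n_pairs += 1
    return n_pairs


seen = []
n = feed_stats(
    [
        "r1\tchr1\t10\tchr1\t500\t+\t-\tUU\n",
        "r2\t!\t0\t!\t0\t-\t-\tNN\n",  # an unmapped pair is fed to stats as well
    ],
    lambda *fields: seen.append(fields),
)
```

Under (3a), the same loop would skip the "bad" pairs and dedup would keep counting them itself.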

sergpolly · 2017-12-05T01:27:50Z

I personally favor (3b): the code would be more consistent, with stats living in one place. It would also make it easier to address https://github.com/mirnylab/pairsamtools/issues/54
by relying on a rigid structure of the stats object; by rigid I mean with the unmapped and dups files predefined. @nvictus, @golobor, what do you think?

golobor · 2017-12-08T03:55:10Z

Re: (3b), yes, your understanding is right: calculate all statistics via add_stats.
I too now think that we should go with (3b), b/c it keeps the code tidy(er).

sergpolly referenced this issue in dekkerlab/pairsamtools Dec 5, 2017

stats API updated: algn=chr,pos,strand

1a3c9bd

sergpolly mentioned this issue Dec 5, 2017

stats API updated: algn=chr,pos,strand #57

Merged

sergpolly mentioned this issue Dec 15, 2017

Dedup stats #58

Merged

sergpolly closed this as completed Dec 22, 2017
