Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding stats to dedup #55

Closed
sergpolly opened this issue Nov 25, 2017 · 4 comments
Closed

adding stats to dedup #55

sergpolly opened this issue Nov 25, 2017 · 4 comments

Comments

@sergpolly
Copy link
Member

In order to address open2c/distiller-nf#80 we agreed to add stats to the dedup.

And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)

Techincal Qs:

  • should we parse line_buffer[i] right next to where we're writing it to outstream?
  • should we instead activate cols_buffer[i] (used for mark duplicates) for stats and skip parsing line_buffer[i]?
  • stats-related: should we change stats-API, and instead of out_stat.add_pair(algn2, algn1, pair_type), say out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type), where algn={'chrom':c,'pos':p,'strand':s}?
    Such an API change would simplify stats module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78
    It was written the way it is right now, in order to please and simplify parse-code, i.e. avoid unpacking align dictionaries there.

What do you @golobor , @nvictus guys think ?

@golobor
Copy link
Member

golobor commented Nov 28, 2017

awesome suggestions!

  1. the suggested change in the stats interface is great, since it makes the API much more transparent - it's nice to show explicitly which characteristics of the alignment we actually use when we calculate statistics. It will simplify re-using stats in other contexts.

  2. indeed, using cols_buffer instead of line_buffer to save one splitting operation is nice!
    The issue, however, is that we do not add unmapped lines to cols_buffer, so we'd have to add it in a separate command.

  3. finally, the question is where to calculate the duplicate statistics.
    The two options are:
    a) keep it the same way it is right now - calculate dup/dedup statistics in dedup and insert the final data into the stats object. pro: it's a minimal change from what we have right now, cons: it spreads calculation of statistics across different modules
    b) rewrite stats to calculate dup/dedup numbers inside it, by checking the pair_type flag (we tag duplicates with a DD flag). pros: it nicely localizes statistics calculation in a single module, cons: we won't be able to only output dup/dedup statistics, and will have to calculate the full statistics every time we do dedup. Also, right now tagging of duplicate reads with DD is optional and is performed only when we output duplicates into a separate file. With the proposed scheme, we'd have to tag them everytime, which, probably is not a big deal.
    Not sure what the optimal solution is. @nvictus ?..

@sergpolly
Copy link
Member Author

sergpolly commented Dec 5, 2017

@golobor So I've finally got to something about this issue, but I started small and addressed only (1)-item from your list.
I'm still digesting (2) and (3)

With (3b) do you mean we should get rid of
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and "hide" everything inside stats, by enabling add_pair to keep track of unmapped/dups ?
In this case, cols_buffer is not sufficient as it does not contain "bad" pairs ...
Maybe in this case we can use parsed cols list, since we'll be feeding all pairs to add_pairs regardless if it's "normal" or not :
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L278

(3a) in that case, imply keeping
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and instead using add_pair to keep track of "normal" pairs only, for which cols_buffer seems to be sufficient? But we'd have to accumulate that buffer not just for if mark_dups:, but rather always ...

@sergpolly
Copy link
Member Author

I personally favor (3b) - code would be more consistent, kind of. Stats belong to one place, kind of - would be easier to address https://github.com/mirnylab/pairsamtools/issues/54
by relying on a rigid structure of stats object, by rigid I mean with the unmapped and dups files predefined @nvictus , @golobor what do you think?

@golobor
Copy link
Member

golobor commented Dec 8, 2017

re: 3(b), yes, your understanding is right, calculate all statistics via add_stats.
I too now think that we should with (3b), b/c it keeps the code tidy(er).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants