Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collect mapq stats in the pairs-stats if possible #80

Closed
sergpolly opened this issue Feb 4, 2020 · 6 comments
Closed

collect mapq stats in the pairs-stats if possible #80

sergpolly opened this issue Feb 4, 2020 · 6 comments

Comments

@sergpolly
Copy link
Member

sergpolly commented Feb 4, 2020

collect mapq stats in our stats if possible - would be useful/helpful for #78 - as there seems to be no easy drop-in mapping quality summary collector ...

maybe check samtools module though - https://multiqc.info/docs/#samtools

@sergpolly sergpolly changed the title collect mapq stats in the stats if possible collect mapq stats in the pairs-stats if possible Feb 4, 2020
@agalitsyna agalitsyna mentioned this issue Apr 6, 2022
31 tasks
@agalitsyna
Copy link
Member

pairtools select + pairtools stats combination covers this use case, see example distiller version that implements filters: https://github.com/open2c/distiller-nf/blob/filter_stats/distiller.nf

@Phlya
Copy link
Member

Phlya commented Apr 28, 2022

I actually think it shouldn't be hard to implement, and the output would be less confusing than separate select + stats, which produces fewer total reads for mapq30, for example, and then requires stitching two separate files to get proper stats in one place... Either hardcode splitting by different mapq values, or for arbitrary filtering criteria (thinking the distiller filters!). Shall we try to squeeze it in this release?

@Phlya Phlya reopened this Apr 28, 2022
@agalitsyna
Copy link
Member

Yes, if you see an easy way to do it, than there are no restrictions!

The first problem that I anticipate is that we have a hard-coded formatting of stats dictionary into either tsv or yaml. With addition of variable output depending on mapq we'll probably have to invent something more general than what we have right now:

def flatten(self):

The second problem is that we probably want not only mapq filter but filters on other columns as well. My idea that pairtools stats can take a set of columns and conditions how to split non-catgorical columns into groups (maybe a groupby expression?). If there is a pythonic way to explain how to group the contacts, that will be general enough to perform any sophisticated type of filtering.

@Phlya
Copy link
Member

Phlya commented Apr 28, 2022

Formatting for saving is indeed annoying, and I actually think we should organize the original dictionary as we want to save it, and at least with YAML we could just dump it as it is...

Re different filters: I would just use query() to subset pairs, it should be able to use exactly the same expression as the pairtools select argument I think. I'm not sure how to best pass it in the CLI to give names to the filters as well...

@agalitsyna
Copy link
Member

Some of the fields in stats dictionary are numpy arrays that have to be formatted before posting into YAML. Just a warning if you decide to change it.

Re different filters: query is not the same as groupby. I guess by different stats for mapq you meant that we want separate stats for mapq<30 and mapq>30, right?
If you need only query, then it's good, you may re-use the pairtools select engine:

# Compile the filtering expression:
match_func = compile(condition, "<string>", "eval")
# Columns filtration rule:
if remove_columns:
column_scheme = [input_columns.index(COL) for COL in updated_columns]
for line in body_stream:
COLS = line.rstrip().split(pairsam_format.PAIRSAM_SEP)
# Evaluate filtering expression:
filter_passed = eval(match_func)

@Phlya
Copy link
Member

Phlya commented Jul 26, 2022

This is now implemented with arbitrary filters in pairtools stats!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants