Endogenous DNA (%) calculation #522
Comments
From my email where this discussion started: While I agree a better calculation is possible, it is not so trivial to calculate:
An easier thing to implement, I think, would be an optional module to do length filtering on BAM files after mapping. A user could then turn off length filtering during AdapterRemoval (which is already possible). Therefore, all (merged) reads are kept for mapping and so will be correctly recorded in the output BAM file and subsequent flagstat output. The endogenous DNA can then be calculated on a post-filter flagstat, something that is already compatible with our endogenous calculator (endorS.py). Essentially we would just need to extend the samtools_filter process that already exists. Would this be sufficient?
Hmm, I admit that the example I posted was for single-end data, for which the case may be clearer. But even for paired-end data I don't understand why we can't simply take the number of read pairs as the raw number of molecules sequenced. That is literally what it is: every pair is a molecule, and in the end you want to know what percentage of those map confidently onto the genome (i.e. are longer than 30bp and map). Why does this require non-trivial outputs from AdapterRemoval? I would leave all that in and simply divide by the number of sequenced molecules instead of the number of length-filtered and merged molecules.
OK, fair enough. Let's take read pairs, which makes it only slightly less trivial, except:
To be clear, I'm not against this, but I personally don't have the time to write all of that for what is really an edge case (the one you described privately); at least, it is pretty rare that you would lose so many of your input reads. If you're losing so many reads you should find out why (as this is unusual!), and re-run the pipeline with a reduced length filtering parameter.
Also - forgot to add: the writer of the script will need to make sure it works with the output of other adapter clipping tools, in case we add other tools in the future. Or they could parse the output of 8 FastQC files.
Hmm, I'm pretty confused why that simple division would require so much custom scripting. Maybe I should talk to you in private. I certainly didn't want to suggest you should write tons of new scripts. I thought it would just be a different variable to choose, for which even I could submit a PR, but obviously I'm missing something.
Ah ok. I see where the confusion is. Please have a read up on how MultiQC works and then we can talk again about other possibilities. I've just thought about it from a different angle, and maybe we could ask Aida to add it to endorS.py... at least then there is a foundation. But collecting and summing across the different sequencing set-ups will still remain a complicating factor.
OK, thanks for acting so quickly on this, @jfy133, and this PR is definitely a solution. I'm skeptical though about this being an opt-in behaviour, and I can't fully judge whether it would be safe to make it the standard. I talked to @TCLamnidis briefly in the popgen meeting, and he seemed to lean towards a custom script that just re-calculates the endogenous DNA the way we want it to. That does seem the safer option to me, because it wouldn't change the processing itself, just this one calculation. Perhaps we should discuss this at some point. I just don't know whether generally switching off length filtering with AdapterRemoval isn't too big a change for most people. It's definitely good to have this option, though, so I think the PR can go through no problem.
Ok! The reason I would rather keep the defaults as they currently are (and make the new calculation system opt-in) is that this is what the wider user group expects from long-term usage of EAGER1. You're proposing unilaterally changing the calculation of quite a major value reported in aDNA studies here, which could cause quite a shock (even if the original calculation is not 'stable', so to say). I also spoke to Thiseas, and it still isn't as trivial as it might seem, due to the very dynamic inputs that go into such a custom script (and custom scripts are frowned upon to a certain level in nf-core). Happy to talk, but basically the new implementation as above should do essentially the same thing as Kay's pipeline: you get a reported endogenous DNA before/after the length filtering too, as he does. However! Importantly, despite it being opt-in, there is no reason why you can't make your own profile which changes the defaults for the human side. See here: https://github.com/nf-core/configs/blob/master/conf/pipeline/eager/shh.config where I've already made a bare-bones one for human. You're free to update it according to your own wishes.
OK, you're making very convincing points. Perhaps that's indeed the best way for now. We will advertise this new option then to the popgen-team and will test it. Thanks! |
When calculating the Endogenous DNA (%) as reported in the "General Statistics" table in the MultiQC output, Eager divides the number of mapped reads by the total number of reads after AdapterRemoval. However, AdapterRemoval also performs size filtering. This means that the ratio of mapped over "total" reads only uses a subset of all reads, leading to an inflated estimate.
In a recent test we did, we had a very poorly preserved sample where, out of >5 million reads sequenced, only 150K passed AdapterRemoval, mostly because the reads were very short. The outcome was that Eager reported an Endogenous DNA % of 5%, which is quite good. However, when dividing by the actual raw number of reads sequenced, the result would be 0.144%, which is very poor and just good enough for performing in-solution SNP capture.
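The inflation can be reproduced with a few lines of Python. The counts below are illustrative approximations of the case above (the exact mapped-read and raw-read counts are assumptions, chosen to be consistent with the reported 5% and 0.144% figures):

```python
# Illustrative counts approximating the reported case (assumed, not exact):
raw_reads = 5_200_000    # total reads sequenced (pre-AdapterRemoval)
post_ar_reads = 150_000  # reads surviving AdapterRemoval length filtering
mapped_reads = 7_500     # reads mapping to the reference (assumed)

# Current Eager calculation: mapped / total reads after AdapterRemoval
endogenous_current = 100 * mapped_reads / post_ar_reads

# Proposed calculation: mapped / raw total reads sequenced
endogenous_proposed = 100 * mapped_reads / raw_reads

print(f"current:  {endogenous_current:.1f}%")   # 5.0%
print(f"proposed: {endogenous_proposed:.3f}%")  # 0.144%
```

The same number of mapped reads thus yields a roughly 35-fold difference in the reported percentage, purely depending on which denominator is used.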
Long story short: I think we should think a bit about which definition of "Endogenous DNA (%)" is most reasonable. I would argue we should go by the principle of "Least Surprise". And in this case, a number of 5% was certainly very surprising to me, given the poor preservation of that sample, and I would have preferred to be reported 0.144%, i.e. the number of mapped length-filtered reads divided by the total sequenced reads. This definition makes the most sense economically, as it is literally the fraction of sequenced reads that can be used for analysis as mapped reads. A counter-argument would be that the total number of sequenced reads might include artefacts from the library build, such as adapter dimers, and hence the resulting estimate wouldn't really be the "biological" estimate of endogenous DNA. However, I think that argument doesn't really hold, since "artefacts" like adapter dimers are real, and should justifiably reduce the endogenous DNA of that library, just as - say - contamination by bacteria or other environmental DNA would decrease the endogenous DNA.
So I would vote for changing the definition of "Endogenous DNA (%)" to the more conservative ratio of
`mapped reads (post-AdapterRemoval) / total reads (pre-AdapterRemoval)`
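As a minimal sketch of the proposed definition (the function name and interface here are hypothetical, not the actual endorS.py code), the calculation reduces to a single division over two counts:

```python
def endogenous_pct(mapped_post_filter: int, total_pre_adapterremoval: int) -> float:
    """Proposed Endogenous DNA (%): mapped reads surviving length
    filtering, divided by the raw number of sequenced reads.
    (Hypothetical helper; not the actual endorS.py implementation.)"""
    if total_pre_adapterremoval <= 0:
        raise ValueError("total read count must be positive")
    return 100.0 * mapped_post_filter / total_pre_adapterremoval

# With counts approximating the example in this issue (assumed values):
print(round(endogenous_pct(7_500, 5_200_000), 3))  # 0.144
```

The only non-trivial part, as discussed above, is obtaining the pre-AdapterRemoval total consistently across the different sequencing set-ups (single-end vs. paired-end, multiple lanes), not the division itself.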