Endogenous DNA (%) calculation #522

stschiff · 2020-07-21T12:32:59Z

When calculating the Endogenous DNA (%) as reported in the "General Statistics" Table in the MultiQC Output, Eager divides the number of mapped over total reads after AdapterRemoval. AdapterRemoval also performs size-filtering, however. This means that the ratio of mapped over "total" reads only uses a subset of all reads, leading to an inflated estimate.

In a recent test we did, we had a very poorly preserved sample, where out of >5 million reads sequenced, only 150K passed AdapterRemoval, mostly because reads were very short. The outcome was that Eager reported an Endogenous DNA % of 5%, which is quite good. However, when divided over the actual raw number of reads sequenced, the result would be 0.144%, which is very poor and just good enough for performing in-solution SNP-Capture.

Long story short: I think we should think a bit what definition of "Endogenous DNA (%)" is most reasonable. I would argue we should go by the principle of "Least Surprise". And in this case, a number of 5% was certainly very surprising to me, given the poor preservation of that sample, and I would have preferred to have been reported 0.144%, i.e. the ratio of mapped length-filtered reads divided over total sequenced reads. This definition makes economically most sense, as it is literally the fraction of sequenced reads that can be used for analysis as mapped reads. A counter-argument would be that the total number of sequenced reads might include artefacts from the library building, such as Adapter-Dimers or so, and hence the out coming estimate wouldn't really be the "biological" estimate of Endogenous DNA. However, I think that argument doesn't really count, since "artefacts" like Adapter-Dimers are real, and should justifiably reduce the Endogenous DNA of that library, just as - say - contamination by bacteria or other environmental DNA would decrease the Endogenous DNA.

So I would vote for changing the definition of "Endogenous DNA (%)" to the more conservative ratio of

mapped reads (post-AdapterRemoval) / total reads (pre-AdapterRemoval)

The text was updated successfully, but these errors were encountered:

jfy133 · 2020-07-21T12:51:40Z

From my email where this discussion started:

While I agree it is possible a better calculation, it is not so trival to calculate:

If you have paired-end reads, what is the pre-AdapterRemoval reads? Each F/R read distinct (pre-merging)? Then you're inflating your denominator massively and doesn't comapre to what what you're actually mapping (you're doubling the number of 'molecules' actually present)
AdapterRemoval does not report statistics consistently and this can be hard to reconstruct.
We could calculate this e.g. by taking only forward read and summing e.g. from FastQC, but then we will need a special custom script to calculate this and again isn't trivial in how nextflow pipelines work. It again is also not necessarily matching what is physically going into mapping.

An easier thing I think would be to implement would be to have an optional module to do length filtering on BAM files after mapping. Then a user can turn off length filtering during AdapterRemoval (which is already possible).

Therefore, all (merged) read are then kept for mapping so will be correctly recorded in the output BAM file and subsequent flagstat output. The endogenous DNA can then be calculated on a post-filter flagstat, - something that is already compatible with out endogenous calculator (endorS.py).

Essentially we would just need to extend the samtools_filter process that already exists.

Would this be sufficient?

stschiff · 2020-07-21T13:00:26Z

Hmm, I admit that the example I posted was for single-end data for which the case may be clearer. But even for paired-end data I don't understand why we can't simply take the number of read pairs as the raw number of molecules sequenced. That is literally what it is: Every pair is a molecule, and in the end you want to know what percentage of those map confidently onto the genome (i.e. are longer than 30bp and map).

Why does this require non-trivial outputs from AdapterRemoval? I would leave all that in and simply divide by the number of sequenced molecules instead of the number of length-filtered and merged molecules.

jfy133 · 2020-07-21T13:14:35Z

OK fair enough. Lets take read pairs, which makes it slightly less trivial except:

It's non-trivial because AdapterRemoval reports input reads slightly differently depending on PE/SE input.
Secondly people want to merge PE/SE data of the same library, so you also have collect and sum these
You have to merge two different types of log files (adapterremoval settings and BWA flagstat). But because merging logs from different tools isn't the MultiQC philosphy, we will have to make a custom script to take this
You have to make sure the script you produce will generate a JSON that is acceptable to MultiQC.
This also adds a lot of 'custom' scripts which isn't really in the nf-core philosphy (unless whoever writes this script is willing to make a proper tool and put it on bioconda).

To be clear, I'm not against this but I personally don't have the time to write all of that for what is actually really an edge case (the one you described privately), as least it is pretty rare that you would loose so many of your input reads.

If you're losing so many reads you should find out why (as this is unusual!), and re-run the pipeline with a reduced length filtering parameter.

jfy133 · 2020-07-21T13:17:21Z

Also - forgot to add, the writer of the script will need to make sure it will work with the output of other adapter clipping tools, in case in the future we add other tools. Or they could parse the output of 8 FastQC files.

stschiff · 2020-07-21T13:43:23Z

Hmm, I'm pretty confused why that simple division would require so much custom scripting. Maybe I should talk to you in private. I certainly didn't want to suggest you should write tons of new scripts. I thought it would just be a different variable to choose, for which even I could submit a PR, but obviously I'm missing something.

jfy133 · 2020-07-21T14:42:26Z

Ah ok. I see where the confusion is. Please have a read up on how MultiQC works and they we can talk again about other possibilities.

I've just thought about it from a different angle and maybe we could ask Aida to add it to endorS.py... at least then there is a foundation. But collecting and summing across the different sequencing set ups will still remain a complicating factor.

stschiff · 2020-07-24T07:15:51Z

OK, thanks for acting so quickly on this, @jfy133, and this PR is definitely a solution. I'm skeptical though that this would mean that it's an opt-in behaviour, and I can't completely oversee whether it's safe to make this the standard. I talked to @TCLamnidis briefly in popgen-meeting, and he seemed to lean towards a custom script that just re-calculates the endogenous DNA the way we want it to. That does seem to be the safer option to me, because it wouldn't change the processing itself, just this one calculation.

Perhaps we should discuss this at some point. I just don't know whether generally switching off length filtering with Adapter-Removal isn't too big a change for most people. It's definitely good to have this option, though, so I think the PR can go through no problem.

jfy133 · 2020-07-24T07:39:19Z

Ok!

The reason I would rather keep the defaults as is currently (and make the new calculation system opt-in), as this is what the wider user group expect from the long-term usage of running with EAGER1. You're proposing unilaterally changing the calculation of quite a major value reported in aDNA studies here, so could cause quite a shock (even if the original calculaution is not 'stable' so to say). I also spoke to Thiseas, and it still isn't as trivial as it might seem, to the very dynamic input that go into such a custom script (and custom scripts are frowned upon to a certain level if nf-core).

Happy to talk, but basically the new implementation as above should do basically exactly the same thing as Kay's pipeline. You get a reported endogneous DNA before/after the length filtering too as he does.

However! Importantly, despite it being opt-in there is no reason why you can't make your own profile which changes the defaults for the human side. See here: https://github.com/nf-core/configs/blob/master/conf/pipeline/eager/shh.config where I've already made a bare-bones one for human. You're free to update accordingly to your own wishes.

stschiff · 2020-07-24T09:02:15Z

OK, you're making very convincing points. Perhaps that's indeed the best way for now. We will advertise this new option then to the popgen-team and will test it. Thanks!

apeltzer added enhancement New feature or request help wanted Extra attention is needed question Further information is requested labels Jul 21, 2020

jfy133 mentioned this issue Jul 22, 2020

Post mapping length filter #525

Merged

8 tasks

jfy133 closed this as completed Jul 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Endogenous DNA (%) calculation #522

Endogenous DNA (%) calculation #522

stschiff commented Jul 21, 2020

jfy133 commented Jul 21, 2020

stschiff commented Jul 21, 2020 •

edited

jfy133 commented Jul 21, 2020 •

edited

jfy133 commented Jul 21, 2020 •

edited

stschiff commented Jul 21, 2020

jfy133 commented Jul 21, 2020

stschiff commented Jul 24, 2020

jfy133 commented Jul 24, 2020

stschiff commented Jul 24, 2020

Endogenous DNA (%) calculation #522

Endogenous DNA (%) calculation #522

Comments

stschiff commented Jul 21, 2020

jfy133 commented Jul 21, 2020

stschiff commented Jul 21, 2020 • edited

jfy133 commented Jul 21, 2020 • edited

jfy133 commented Jul 21, 2020 • edited

stschiff commented Jul 21, 2020

jfy133 commented Jul 21, 2020

stschiff commented Jul 24, 2020

jfy133 commented Jul 24, 2020

stschiff commented Jul 24, 2020

stschiff commented Jul 21, 2020 •

edited

jfy133 commented Jul 21, 2020 •

edited

jfy133 commented Jul 21, 2020 •

edited