Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Endogenous DNA (%) calculation #522

Closed
stschiff opened this issue Jul 21, 2020 · 9 comments · Fixed by #525
Closed

Endogenous DNA (%) calculation #522

stschiff opened this issue Jul 21, 2020 · 9 comments · Fixed by #525
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested

Comments

@stschiff
Copy link

When calculating the Endogenous DNA (%) as reported in the "General Statistics" Table in the MultiQC Output, Eager divides the number of mapped over total reads after AdapterRemoval. AdapterRemoval also performs size-filtering, however. This means that the ratio of mapped over "total" reads only uses a subset of all reads, leading to an inflated estimate.

In a recent test we did, we had a very poorly preserved sample, where out of >5 million reads sequenced, only 150K passed AdapterRemoval, mostly because reads were very short. The outcome was that Eager reported an Endogenous DNA % of 5%, which is quite good. However, when divided over the actual raw number of reads sequenced, the result would be 0.144%, which is very poor and just good enough for performing in-solution SNP-Capture.

Long story short: I think we should think a bit what definition of "Endogenous DNA (%)" is most reasonable. I would argue we should go by the principle of "Least Surprise". And in this case, a number of 5% was certainly very surprising to me, given the poor preservation of that sample, and I would have preferred to have been reported 0.144%, i.e. the ratio of mapped length-filtered reads divided over total sequenced reads. This definition makes economically most sense, as it is literally the fraction of sequenced reads that can be used for analysis as mapped reads. A counter-argument would be that the total number of sequenced reads might include artefacts from the library building, such as Adapter-Dimers or so, and hence the out coming estimate wouldn't really be the "biological" estimate of Endogenous DNA. However, I think that argument doesn't really count, since "artefacts" like Adapter-Dimers are real, and should justifiably reduce the Endogenous DNA of that library, just as - say - contamination by bacteria or other environmental DNA would decrease the Endogenous DNA.

So I would vote for changing the definition of "Endogenous DNA (%)" to the more conservative ratio of

mapped reads (post-AdapterRemoval) / total reads (pre-AdapterRemoval)

@apeltzer apeltzer added enhancement New feature or request help wanted Extra attention is needed question Further information is requested labels Jul 21, 2020
@jfy133
Copy link
Member

jfy133 commented Jul 21, 2020

From my email where this discussion started:

While I agree it is possible a better calculation, it is not so trival to calculate:

  1. If you have paired-end reads, what is the pre-AdapterRemoval reads? Each F/R read distinct (pre-merging)? Then you're inflating your denominator massively and doesn't comapre to what what you're actually mapping (you're doubling the number of 'molecules' actually present)
  2. AdapterRemoval does not report statistics consistently and this can be hard to reconstruct.
  3. We could calculate this e.g. by taking only forward read and summing e.g. from FastQC, but then we will need a special custom script to calculate this and again isn't trivial in how nextflow pipelines work. It again is also not necessarily matching what is physically going into mapping.

An easier thing I think would be to implement would be to have an optional module to do length filtering on BAM files after mapping. Then a user can turn off length filtering during AdapterRemoval (which is already possible).

Therefore, all (merged) read are then kept for mapping so will be correctly recorded in the output BAM file and subsequent flagstat output. The endogenous DNA can then be calculated on a post-filter flagstat, - something that is already compatible with out endogenous calculator (endorS.py).

Essentially we would just need to extend the samtools_filter process that already exists.

Would this be sufficient?

@stschiff
Copy link
Author

stschiff commented Jul 21, 2020

Hmm, I admit that the example I posted was for single-end data for which the case may be clearer. But even for paired-end data I don't understand why we can't simply take the number of read pairs as the raw number of molecules sequenced. That is literally what it is: Every pair is a molecule, and in the end you want to know what percentage of those map confidently onto the genome (i.e. are longer than 30bp and map).

Why does this require non-trivial outputs from AdapterRemoval? I would leave all that in and simply divide by the number of sequenced molecules instead of the number of length-filtered and merged molecules.

@jfy133
Copy link
Member

jfy133 commented Jul 21, 2020

OK fair enough. Lets take read pairs, which makes it slightly less trivial except:

  1. It's non-trivial because AdapterRemoval reports input reads slightly differently depending on PE/SE input.
  2. Secondly people want to merge PE/SE data of the same library, so you also have collect and sum these
  3. You have to merge two different types of log files (adapterremoval settings and BWA flagstat). But because merging logs from different tools isn't the MultiQC philosphy, we will have to make a custom script to take this
  4. You have to make sure the script you produce will generate a JSON that is acceptable to MultiQC.
  5. This also adds a lot of 'custom' scripts which isn't really in the nf-core philosphy (unless whoever writes this script is willing to make a proper tool and put it on bioconda).

To be clear, I'm not against this but I personally don't have the time to write all of that for what is actually really an edge case (the one you described privately), as least it is pretty rare that you would loose so many of your input reads.

If you're losing so many reads you should find out why (as this is unusual!), and re-run the pipeline with a reduced length filtering parameter.

@jfy133
Copy link
Member

jfy133 commented Jul 21, 2020

Also - forgot to add, the writer of the script will need to make sure it will work with the output of other adapter clipping tools, in case in the future we add other tools. Or they could parse the output of 8 FastQC files.

@stschiff
Copy link
Author

Hmm, I'm pretty confused why that simple division would require so much custom scripting. Maybe I should talk to you in private. I certainly didn't want to suggest you should write tons of new scripts. I thought it would just be a different variable to choose, for which even I could submit a PR, but obviously I'm missing something.

@jfy133
Copy link
Member

jfy133 commented Jul 21, 2020

Ah ok. I see where the confusion is. Please have a read up on how MultiQC works and they we can talk again about other possibilities.

I've just thought about it from a different angle and maybe we could ask Aida to add it to endorS.py... at least then there is a foundation. But collecting and summing across the different sequencing set ups will still remain a complicating factor.

@jfy133 jfy133 mentioned this issue Jul 22, 2020
8 tasks
@stschiff
Copy link
Author

OK, thanks for acting so quickly on this, @jfy133, and this PR is definitely a solution. I'm skeptical though that this would mean that it's an opt-in behaviour, and I can't completely oversee whether it's safe to make this the standard. I talked to @TCLamnidis briefly in popgen-meeting, and he seemed to lean towards a custom script that just re-calculates the endogenous DNA the way we want it to. That does seem to be the safer option to me, because it wouldn't change the processing itself, just this one calculation.

Perhaps we should discuss this at some point. I just don't know whether generally switching off length filtering with Adapter-Removal isn't too big a change for most people. It's definitely good to have this option, though, so I think the PR can go through no problem.

@jfy133
Copy link
Member

jfy133 commented Jul 24, 2020

Ok!

The reason I would rather keep the defaults as is currently (and make the new calculation system opt-in), as this is what the wider user group expect from the long-term usage of running with EAGER1. You're proposing unilaterally changing the calculation of quite a major value reported in aDNA studies here, so could cause quite a shock (even if the original calculaution is not 'stable' so to say). I also spoke to Thiseas, and it still isn't as trivial as it might seem, to the very dynamic input that go into such a custom script (and custom scripts are frowned upon to a certain level if nf-core).

Happy to talk, but basically the new implementation as above should do basically exactly the same thing as Kay's pipeline. You get a reported endogneous DNA before/after the length filtering too as he does.

However! Importantly, despite it being opt-in there is no reason why you can't make your own profile which changes the defaults for the human side. See here: https://github.com/nf-core/configs/blob/master/conf/pipeline/eager/shh.config where I've already made a bare-bones one for human. You're free to update accordingly to your own wishes.

@stschiff
Copy link
Author

OK, you're making very convincing points. Perhaps that's indeed the best way for now. We will advertise this new option then to the popgen-team and will test it. Thanks!

@jfy133 jfy133 closed this as completed Jul 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants