correcting low cov reads from heterozygous genomes #7

Closed
dcopetti opened this issue Feb 8, 2019 · 5 comments

@dcopetti

dcopetti commented Feb 8, 2019

Hello,
I have about 20x (per allele) PromethION data from a highly heterozygous plant genome. I also have plenty of short-read data to align to it for error correction.
I wonder whether FMLRC will wash away small allelic variants (substitutions and indels) during error correction: I want to remove errors but keep the allelic variation so that I can assemble the two alleles separately. We know this is already possible with Illumina; I want to do it with long reads now.
Have you ever tried your tool with low-coverage (~20x raw data) ONT reads from a heterozygous genome? Do you have any suggestions (k and K size, T, ...) on how not to lose allelic variation?
thanks!

@holtjma
Owner

holtjma commented Feb 8, 2019

Hello,

We've never explicitly tested this, but I can give you some information on what I would expect to happen.

For FMLRC, the coverage of your long reads doesn't actually matter, since each read is corrected individually. Instead, the outcome will depend largely on the short-read data you're using. FMLRC looks for evidence of the k-mer sequences in the short reads, so if a particular allele is absent from the short-read data (or present only at a low frequency), FMLRC will treat it as a sequencing error and will most likely correct it to an allele that is present in the short-read data. However, if you have multiple alleles and those alleles are present at the required thresholds, then FMLRC should recognize each allele as a valid k/K-mer. Does that make sense?

As for suggested parameters, I don't have any reason to believe one value for -k or -K will outperform another from a heterozygosity perspective. However, -m and -f both influence what FMLRC will consider a supported k/K-mer. If you were using something like 20x short-read data on heterozygous samples, then I would likely recommend lowering to -m 3 (indicating at least 3 reads required for a valid k/K-mer) simply because you are expecting fewer reads per allele and the default of -m 5 is possibly too high for some situations. Again, we never performed any tests on that type of data, so this is all based on my expectations for how the algorithm would perform.
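To make the reasoning about -m concrete, here is a rough back-of-the-envelope sketch. It is not something FMLRC computes. It assumes that k-mer counts at a heterozygous site follow a Poisson distribution, that the short reads are 150 bp, and that k-mer sizes of 21 and 59 are used purely for illustration; the function names are hypothetical.

```python
import math

def kmer_depth(base_coverage: float, read_len: int, k: int) -> float:
    """Expected k-mer depth for a given base coverage (standard approximation)."""
    return base_coverage * (read_len - k + 1) / read_len

def prob_below(min_count: int, depth: float) -> float:
    """P(count < min_count) if counts were Poisson(depth). A modelling assumption."""
    return sum(math.exp(-depth) * depth**i / math.factorial(i)
               for i in range(min_count))

# Hypothetical inputs: 20x total short-read coverage split over two alleles.
per_allele_coverage = 20 / 2
read_len = 150

for k in (21, 59):  # illustrative k-mer sizes
    depth = kmer_depth(per_allele_coverage, read_len, k)
    for m in (5, 3):
        p = prob_below(m, depth)
        print(f"k={k} m={m}: expected depth {depth:.1f}, "
              f"P(allele k-mer count < m) = {p:.3f}")
```

Under these assumptions, the probability that a genuine allele-specific k-mer falls below the threshold is smaller for -m 3 than for -m 5, and it grows with larger k-mers because fewer reads span them.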

Let me know if you have any more questions!

@holtjma
Owner

holtjma commented Feb 13, 2019

Closing due to inactivity. Feel free to reopen if you have more questions.

@holtjma holtjma closed this as completed Feb 13, 2019
@dcopetti
Author

dcopetti commented May 2, 2019

Hello,
Another question: how are indels dealt with in FMLRC? I mean, if the ONT raw read has a 5 bp insertion that introduces a new (rare) k-mer, how is that region corrected? Assuming that the 5 new k-mers will be at low frequency in the BWT index.
Even leaving them uncorrected should be OK, assuming that indel errors occur randomly: other overlapping error-corrected reads will not have that insertion and will drive the consensus.
Does that make sense?

@holtjma
Owner

holtjma commented May 2, 2019

In the code, indels and single-base changes are indistinguishable. When there are multiple possible corrections to choose from, we calculate the edit distance between the uncorrected sequence and each candidate correction.
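As an illustration of that selection step, here is a minimal sketch that ranks candidate corrections by Levenshtein edit distance to the uncorrected sequence. This is not FMLRC's code: the function names are hypothetical, and choosing the candidate with the smallest distance (first one wins on ties) is an assumption about how the distance is used.

```python
from typing import Sequence

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def pick_correction(original: str, candidates: Sequence[str]) -> str:
    """Assumed rule: keep the candidate closest to the uncorrected sequence."""
    return min(candidates, key=lambda c: edit_distance(original, c))

uncorrected = "ACGTTTGCA"
candidates = ["ACGTTGCA",    # one base deleted
              "ACGAAAGCA"]   # three bases substituted
print(pick_correction(uncorrected, candidates))
```

Because an indel and a substitution each count as one edit, the sketch treats them the same way, which mirrors the point that the two kinds of change are indistinguishable.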

The short answer is that any k-mer that is not solid (i.e. present) in the short read BWT will be treated as an error, even if that same k-mer block occurs hundreds or thousands of times in the long read data (remember, each long read is handled independently).
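A toy sketch of this per-read check, assuming a plain Python `Counter` of short-read k-mers as a stand-in for the short-read BWT (the real index and lookup are different, and all names here are hypothetical). It also shows that a single insertion of length i can disturb up to k + i - 1 overlapping k-mers, not just i of them.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer in the short reads (stand-in for BWT count queries)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def non_solid_positions(long_read, short_counts, k, min_count):
    """Start positions of k-mers in one long read that lack short-read support."""
    return [i for i in range(len(long_read) - k + 1)
            if short_counts[long_read[i:i + k]] < min_count]

# Toy data: five identical error-free "short reads" covering a small genome.
genome = "ACGTACCTGATTCAGGCTAAGTCCATGACTTGCAAGCTTACGGATCCATAGCTGAACTGTCA"
k, min_count = 7, 5
short_counts = kmer_counts([genome] * 5, k)

# One long read carrying a 5 bp insertion that is absent from the short reads.
ins_pos, ins = 30, "GGGGG"
long_read = genome[:ins_pos] + ins + genome[ins_pos:]

flagged = non_solid_positions(long_read, short_counts, k, min_count)
print(f"{len(flagged)} non-solid k-mers (upper bound {k + len(ins) - 1}), "
      f"start positions {flagged[0]}..{flagged[-1]}")
```

Only the long read itself and the short-read counts enter the check, so how often the insertion appears across other long reads has no effect on the result.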

Currently, solid is defined using two parameters:

  1. -m INT (default: 5) - this is the absolute minimum for a k-mer to be considered solid; any count less than this will be considered an error that needs correcting.
  2. -f FLOAT (default: 0.10) - this creates a dynamic minimum based on the read. Given a read, we first calculate all k-mer counts for that read, and then calculate the median of all counts greater than the absolute minimum (the -m parameter above). Then, we calculate a second minimum, min2 = median * f. Any counts less than that second minimum are also considered errors that need correction.

So if the short 5-bp insertion meets the above requirements in your short-read dataset, then I would not expect FMLRC to correct it, because it will not consider those k-mers to be errors.
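The two thresholds above can be sketched as follows. This is a minimal reading of the description, not FMLRC's actual code: the function name is hypothetical, counts equal to -m are treated as meeting the absolute minimum, and a read with no supported k-mers is assumed to be flagged entirely.

```python
from statistics import median

def solid_flags(counts, m=5, f=0.10):
    """Return True for each k-mer count that passes both minimums.

    counts: short-read counts of the k-mers along one long read.
    m: absolute minimum (-m); f: fraction of the per-read median (-f).
    """
    supported = [c for c in counts if c >= m]   # assumption: >= m counts as supported
    if not supported:
        return [False] * len(counts)            # assumption: nothing is solid
    min2 = median(supported) * f                # dynamic, per-read minimum
    threshold = max(m, min2)
    return [c >= threshold for c in counts]

# Moderate depth: median 40, min2 = 4, so the absolute minimum of 5 decides.
print(solid_flags([40, 42, 38, 3, 0, 41, 39]))
# -> [True, True, True, False, False, True, True]

# High depth: median 400, min2 = 40, so a count of 30 is flagged despite passing -m.
print(solid_flags([400, 420, 30, 410, 390]))
# -> [True, True, False, True, True]
```

The second case is the one to watch with heterozygous data: an allele that is well above -m can still be treated as an error if its count is small relative to the rest of the read.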

@dcopetti
Author

dcopetti commented May 2, 2019

clear now, thanks!
