correcting low cov reads from heterozygous genomes #7

dcopetti · 2019-02-08T14:51:17Z

Hello,
I have about 20x (per allele) PromethION data of a highly heterozygous plant genome. I also have plenty of short read data to align to it for the error correction.
I wonder whether FMLRC will wash away the small allelic variants (substitutions and indels) at the error correction step: I want to remove errors but keep the allelic variation so that I can assemble the two alleles separately - we know that this is already possible with Illumina, and I want to do it with long reads now.
Did you ever try your tool with low (~20x raw data) ONT coverage from a heterozygous genome? Do you have any suggestions (k and K size, T, ...) on how not to lose allelic variation?
thanks!

holtjma · 2019-02-08T18:40:34Z

Hello,

We've never explicitly tested this, but I can give you some information on what I would expect to happen.

For FMLRC, the coverage of your long reads doesn't actually matter, since each read is corrected individually. Instead, the outcome will largely depend on the short-read data you're using. FMLRC looks for evidence of the k-mer sequences in the short reads, so if a particular allele is absent from the short-read data (or present only at a low frequency), FMLRC will treat it as a sequencing error and will most likely correct it to an allele that is present in the short-read data. However, if you have multiple alleles and they are all present at the required thresholds, then FMLRC should recognize each allele as a valid k/K-mer. Does that make sense?

As for suggested parameters, I don't have any reason to believe one value for -k or -K will outperform another from a heterozygosity perspective. However, -m and -f both influence what FMLRC will consider a supported k/K-mer. If you were using something like 20x short-read data on heterozygous samples, then I would likely recommend lowering to -m 3 (indicating at least 3 reads required for a valid k/K-mer) simply because you are expecting fewer reads per allele and the default of -m 5 is possibly too high for some situations. Again, we never performed any tests on that type of data, so this is all based on my expectations for how the algorithm would perform.
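To put rough numbers on the `-m` suggestion, here is a back-of-the-envelope sketch. It is not part of FMLRC: it assumes k-mer counts are Poisson-distributed, and the read length (150 bp), k (21), and per-allele coverage (10x, i.e. half of 20x) are hypothetical inputs. It only models the absolute `-m` cutoff, not the `-f` fraction.

```python
import math

def kmer_coverage(read_coverage, read_len, k):
    """Expected k-mer coverage for a given base coverage (standard approximation)."""
    return read_coverage * (read_len - k + 1) / read_len

def prob_below_min(mean_count, m):
    """P(count < m) when the count is Poisson-distributed with the given mean."""
    return sum(
        math.exp(-mean_count) * mean_count ** i / math.factorial(i)
        for i in range(m)
    )

# Hypothetical inputs: 20x short reads on a heterozygous sample, so ~10x per allele.
per_allele = kmer_coverage(10.0, read_len=150, k=21)
for m in (5, 3):
    p = prob_below_min(per_allele, m)
    print(f"-m {m}: chance an allele-specific k-mer falls below the cutoff ~ {p:.3f}")
```

Under these assumptions, a lower `-m` reduces the chance that a genuine allele-specific k-mer is mistaken for an error, at the cost of accepting more k-mers that come from short-read errors.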

Let me know if you have any more questions!

holtjma · 2019-02-13T03:39:14Z

Closing due to inactivity. Feel free to reopen if you have more questions.

dcopetti · 2019-05-02T12:51:17Z

Hello,
Another question: how are indels dealt with in FMLRC? I mean, if the ONT raw read has a 5 bp insertion that introduces new (rare) k-mers, how is that region corrected? I am assuming that the new k-mers spanning the insertion will be at low frequency in the BWT index.
Even leaving them uncorrected should be OK, under the assumption that indel errors occur randomly. In that case, other overlapping error-corrected reads will not have that insertion and will drive the consensus.
Does that make sense?

holtjma · 2019-05-02T13:38:31Z

In the code, indels and single-base changes are not distinguished from each other. When there are multiple possible corrections to choose from, we calculate the edit distance between the uncorrected sequence and each candidate correction.
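As an illustration of that idea only (this is not the FMLRC source), the sketch below computes a plain Levenshtein distance, where substitutions, insertions, and deletions all cost 1, and picks the candidate closest to the uncorrected sequence. The selection rule (take the minimum) and the function names `edit_distance` and `pick_correction` are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]

def pick_correction(uncorrected, candidates):
    """Return the candidate with the smallest edit distance to the uncorrected sequence."""
    return min(candidates, key=lambda cand: edit_distance(uncorrected, cand))

# Hypothetical sequences: the first candidate removes one inserted base,
# the second differs by two substitutions.
uncorrected = "ACGTAACGT"
candidates = ["ACGTACGT", "ACCTTACGT"]
best = pick_correction(uncorrected, candidates)
```

Because an indel and a substitution carry the same cost here, a one-base deletion and a one-base substitution are equally "cheap" corrections.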

The short answer is that any k-mer that is not solid (i.e. present) in the short read BWT will be treated as an error, even if that same k-mer block occurs hundreds or thousands of times in the long read data (remember, each long read is handled independently).

Currently, solid is defined using two parameters:

-m INT (default: 5) - this is the absolute minimum count for a k-mer to be considered solid; any count less than this will be considered an error that needs correcting
-f FLOAT (default: 0.10) - this creates a dynamic minimum based on the read. Given a read, we first calculate all k-mer counts for that read, and then calculate the median of all counts greater than the absolute minimum (the -m parameter above). Then, we calculate a second minimum, min2 = median*f. Any counts less than that second minimum are also considered errors that need correction.

So if the short 5-bp insertion is present in your short-read dataset at counts that meet both requirements above, then I would not expect FMLRC to correct it, because it will not consider those k-mers to be errors.
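The two thresholds described above can be sketched roughly as follows. This is a simplified illustration, not the actual FMLRC implementation: the function names are hypothetical, the counts are made up, and the handling of counts exactly equal to `-m` when taking the median is an assumption.

```python
from statistics import median

def solid_threshold(counts, m=5, f=0.10):
    """Effective minimum count for a k-mer in this read to be treated as solid."""
    # Assumption: counts equal to m are included when computing the median.
    above_min = [c for c in counts if c >= m]
    if not above_min:
        return float(m)
    min2 = median(above_min) * f  # dynamic, per-read minimum
    return max(float(m), min2)

def flag_errors(counts, m=5, f=0.10):
    """True where a k-mer count falls below either minimum."""
    threshold = solid_threshold(counts, m, f)
    return [c < threshold for c in counts]

# Made-up short-read counts for consecutive k-mers of one long read.
low_cov = [40, 38, 42, 2, 1, 3, 41, 39]   # median 40 -> min2 = 4, so -m 5 decides
high_cov = [400, 380, 420, 30, 410]       # median 400 -> min2 = 40, so -f decides

low_flags = flag_errors(low_cov)
high_flags = flag_errors(high_cov)
```

In the second read, a k-mer seen 30 times is still flagged, because it is rare relative to the rest of that read. This is why a minor allele at low frequency in the short reads can be corrected away even when its count is above `-m`.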

dcopetti · 2019-05-02T14:24:53Z

clear now, thanks!

holtjma closed this as completed Feb 13, 2019

holtjma mentioned this issue Jul 1, 2021

Do heterozygous variants remain in long reads after correction? HudsonAlpha/fmlrc2#12

Closed
