Thank you for providing herro.

I've tested it on four ONT datasets, one of which is public.
The number of reads and nucleotides retained after correction is very variable, ranging in my case from 2% to 20% of the reads and from 2% to 60% of the nucleotides. What do these metrics look like in the cases you've tested?
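For context, these percentages are simply the read and base counts of the corrected output divided by those of the raw input; a minimal way to get both sets of counts, assuming `seqkit` is available (file names are placeholders):

```bash
# Report num_seqs and sum_len for the raw input and the corrected reads;
# the retained fractions are the ratios of the corrected values to the raw values.
seqkit stats -T raw_reads.fastq herro_corrected.fasta
```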
The set which did not work has the highest coverage. Is there a coverage limit to respect?
For the sets which did work, I tried a hifiasm (v0.19.8) assembly, but in all cases the metrics were poor. The hifiasm log shows that there are remaining errors which are not removed by the three correction cycles. For example, for the public dataset (https://www.ncbi.nlm.nih.gov/bioproject/781898), I look at the number of k-mers found only once in the read set, which correspond to errors.
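A minimal sketch of one way to obtain that count outside of the hifiasm log, assuming `jellyfish` is installed (the k-mer size, hash size, and file names are placeholders):

```bash
# Count canonical 21-mers in the corrected reads and report how many occur
# exactly once (a rough proxy for remaining errors).
jellyfish count -C -m 21 -s 2G -t 16 -o kmer_counts.jf corrected.fasta
jellyfish histo kmer_counts.jf | awk '$1 == 1 {print $2}'
```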
Compared to other assemblies, this k-mer error count stays very high, whereas it should drop quickly over the correction cycles. And when I extract contig coverages from the GFA file, they are very low, while they should be around 10.
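A minimal sketch of how the per-contig depths can be pulled out, assuming recent hifiasm output where the GFA S lines carry an `rd:i:` read-depth tag (the file name is a placeholder):

```bash
# Print contig name and the rd:i (read depth) tag from a hifiasm GFA.
awk '$1 == "S" {
    depth = "NA"
    for (i = 4; i <= NF; i++) if ($i ~ /^rd:i:/) depth = substr($i, 6)
    print $2 "\t" depth
}' asm.bp.p_ctg.gfa
```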
Have you seen this before? What could I change to improve correction or assembly?
For the human ultra-long datasets we tested with, the loss in nucleotides after correction was generally less than 2% of the input (which had already been filtered to remove reads shorter than 10 kbp and reads below Q10).
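As an illustration of that kind of pre-filter (not part of the original reply; `chopper` is one option among several and the file names are placeholders):

```bash
# Sketch: drop reads shorter than 10 kbp or with mean quality below Q10
# before correction. Any equivalent length/quality filter works as well.
chopper -l 10000 -q 10 < raw_reads.fastq > filtered_reads.fastq
```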
Reads or segments for which not enough overlaps are found are not corrected and are discarded. Retaining more reads in the output (and getting higher quality for the corrected reads) therefore depends on ensuring that overlaps can be found. There are a few things that can be done about this:
- Try to use longer reads if possible; the ultra-long human datasets we tested with the most generally have an N50 of ~100 kbp.
- Tweak the minimap2 parameters to retain more overlaps. As mentioned in HERRO's preprint, we used a different set of minimap2 parameters for non-UL datasets (see the preprint for the exact values) and achieved better results. In fact, I would also add `-f 0.005` to the parameters in `create_batched_alignments.sh` (which was mostly tested on UL reads) and reduce `-m`. Generally, a lower `-m` means more overlaps are retained but a higher running time, although the value that best balances retained overlaps against running time seems to vary between different kinds of datasets; see the sketch after this list for where these flags sit.
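As a rough illustration only (this is not the actual content of `create_batched_alignments.sh`; the preset, thread count, and `-m` value below are placeholders), the flags would go into the all-vs-all minimap2 call along these lines:

```bash
# Illustrative all-vs-all overlap call only; the real parameters live in
# scripts/create_batched_alignments.sh. -f keeps a larger fraction of
# repetitive minimizers, and a lower -m (placeholder value here) keeps
# more overlaps at the cost of running time.
minimap2 -c -x ava-ont -t 32 -f 0.005 -m 1000 reads.fastq reads.fastq > overlaps.paf
```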