Thank you for providing herro.

I've tested it on four ONT datasets, one of which is public.
The number of reads and nucleotides retained after correction is very variable, ranging in my case from 2% to 20% of the reads and from 2% to 60% of the nucleotides. What do these metrics look like in the cases you've tested?
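For context, these percentages are simply the read and base counts of the corrected output divided by those of the raw input; a minimal way to get both sets of counts, assuming `seqkit` is available (file names are placeholders):

```bash
# Report num_seqs and sum_len for the raw input and the corrected reads;
# the retained fractions are the ratios of the corrected values to the raw values.
seqkit stats -T raw_reads.fastq herro_corrected.fasta
```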
The set which did not work has the highest coverage. Is there a coverage limit to respect?
For the sets which did work, I tried a hifiasm (v0.19.8) assembly, but in all cases the metrics were poor. The hifiasm log shows that there are remaining errors which are not removed by the three correction cycles. For example, for the public dataset (https://www.ncbi.nlm.nih.gov/bioproject/781898), I look at the number of k-mers found only once in the read set, which correspond to errors.
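A minimal sketch of one way to obtain that count outside of the hifiasm log, assuming `jellyfish` is installed (the k-mer size, hash size, and file names are placeholders):

```bash
# Count canonical 21-mers in the corrected reads and report how many occur
# exactly once (a rough proxy for remaining errors).
jellyfish count -C -m 21 -s 2G -t 16 -o kmer_counts.jf corrected.fasta
jellyfish histo kmer_counts.jf | awk '$1 == 1 {print $2}'
```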
Compared to other assemblies, this k-mer error count stays very high, whereas it should drop quickly over the correction cycles. And when I extract contig coverages from the GFA file, they are very low, while they should be around 10.
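A minimal sketch of how the per-contig depths can be pulled out, assuming recent hifiasm output where the GFA S lines carry an `rd:i:` read-depth tag (the file name is a placeholder):

```bash
# Print contig name and the rd:i (read depth) tag from a hifiasm GFA.
awk '$1 == "S" {
    depth = "NA"
    for (i = 4; i <= NF; i++) if ($i ~ /^rd:i:/) depth = substr($i, 6)
    print $2 "\t" depth
}' asm.bp.p_ctg.gfa
```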
Have you seen this before? What could I change to improve correction or assembly?
For the human ultra-long datasets we tested with, the loss in nucleotides after correction was generally less than 2% of the input (which had already been filtered to remove reads shorter than 10 kbp and reads below Q10).
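As an illustration of that kind of pre-filter (not part of the original reply; `chopper` is one option among several and the file names are placeholders):

```bash
# Sketch: drop reads shorter than 10 kbp or with mean quality below Q10
# before correction. Any equivalent length/quality filter works as well.
chopper -l 10000 -q 10 < raw_reads.fastq > filtered_reads.fastq
```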
Reads or segments for which not enough overlaps are found are not corrected and are discarded. Retaining more reads in the output (and getting higher quality for the corrected reads) therefore depends on ensuring that overlaps can be found. There are a few things that can be done about this:
- Try to use longer reads if possible; the ultra-long human datasets we tested with the most generally have an N50 of ~100 kbp.
- Tweak the minimap2 parameters to retain more overlaps. As mentioned in HERRO's preprint, we used a different set of minimap2 parameters for non-UL datasets (see the preprint for the exact values) and achieved better results. In fact, I would also add `-f 0.005` to the parameters in `create_batched_alignments.sh` (which was mostly tested on UL reads) and reduce `-m`. Generally, a lower `-m` means more overlaps are retained but a higher running time, although the value that best balances retained overlaps against running time seems to vary between different kinds of datasets; see the sketch after this list for where these flags sit.
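As a rough illustration only (this is not the actual content of `create_batched_alignments.sh`; the preset, thread count, and `-m` value below are placeholders), the flags would go into the all-vs-all minimap2 call along these lines:

```bash
# Illustrative all-vs-all overlap call only; the real parameters live in
# scripts/create_batched_alignments.sh. -f keeps a larger fraction of
# repetitive minimizers, and a lower -m (placeholder value here) keeps
# more overlaps at the cost of running time.
minimap2 -c -x ava-ont -t 32 -f 0.005 -m 1000 reads.fastq reads.fastq > overlaps.paf
```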