
Poor assembly metrics with hifiasm v0.19.8 #35

Open
chklopp opened this issue May 29, 2024 · 1 comment

chklopp commented May 29, 2024

Thank you for providing herro.

I've tested it on four ONT datasets, one of which is public.
The number of reads and nucleotides retained after correction is highly variable, ranging in my case from 2 to 20% of the reads and 2 to 60% of the nucleotides. What do these metrics look like in the cases you've tested?
The set which did not work has the highest coverage. Is there a coverage limit to respect?
For the sets which did work, I tried a hifiasm (v0.19.8) assembly, but in all cases the metrics were poor. The hifiasm log shows that there are remaining errors which are not removed by the three correction cycles.

For example, for the public dataset (data available at https://www.ncbi.nlm.nih.gov/bioproject/781898), the counts below are the numbers of k-mers found only once in the read set, i.e. errors:

grep 'ha_hist_line' slurm-7727010.out | grep ' 1:'
[M::ha_hist_line]     1: ****************************************************************************************************> 52175410
[M::ha_hist_line]     1: ****************************************************************************************************> 45429496
[M::ha_hist_line]     1: ****************************************************************************************************> 41842772
[M::ha_hist_line]     1: ****************************************************************************************************> 39899872

Compared to other assemblies this k-mer error count stays very high; it should drop quickly with the correction cycles.
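To check whether the singleton (error) k-mer count is actually dropping between correction rounds, a small helper like the one below could be used. This is a sketch: the function name is mine, and it assumes the [M::ha_hist_line] log format shown above, with the count as the last whitespace-separated field.

```shell
# Sketch: summarize singleton (count == 1) k-mer totals per hifiasm
# correction round and the relative drop between consecutive rounds.
summarize_singletons() {
  grep 'ha_hist_line' | grep ' 1:' | awk '{
    n = $NF
    if (prev) {
      printf "round %d: %d singletons (%.1f%% drop)\n", NR, n, 100 * (prev - n) / prev
    } else {
      printf "round %d: %d singletons\n", NR, n
    }
    prev = n
  }'
}
```

Usage: `summarize_singletons < slurm-7727010.out`. On the numbers above, the drop from round to round is only a few percent, whereas on a well-behaved dataset one would expect a much steeper decline.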
And when I extract contig coverages from the GFA file they are very low, while they should be around 10.

awk '/^S/{print $2"\t"$4"\t"$5}' hifiasm_0.19.8_no_HiC.bp.hap1.p_ctg.gfa \
| sed 's/LN:i://;s/rd:i://' | more
h1tg000001l 114415 6
h1tg000002l 1935040 3
h1tg000003l 485308 3
h1tg000004l 113763 0
h1tg000005l 54120 0
h1tg000006l 82359 0
h1tg000007l 3376377 2
h1tg000008l 505683 2
h1tg000009l 1826044 2
h1tg000010l 4045620 2
h1tg000011l 151854 1
h1tg000012l 172642 0
h1tg000013l 75530 0
h1tg000014l 82829 0
h1tg000015l 71160 0
h1tg000016l 944815 1
h1tg000017l 357347 3
h1tg000018l 207160 8
h1tg000019l 510563 5
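A length-weighted mean of the rd:i values can serve as a quick sanity check on overall coverage. This is a sketch: the function name is mine, and it assumes LN:i: (length) and rd:i: (coverage) sit in columns 4 and 5 of the S lines, as in the awk one-liner above.

```shell
# Sketch: length-weighted mean contig coverage from a hifiasm GFA.
# For data at ~10x, the result should come out around 10.
gfa_mean_cov() {
  awk '/^S/ {
    len = $4; cov = $5
    sub(/LN:i:/, "", len)   # strip the length tag prefix
    sub(/rd:i:/, "", cov)   # strip the read-depth tag prefix
    total += len
    weighted += len * cov
  }
  END { if (total) printf "%.2f\n", weighted / total }' "$1"
}
```

Usage: `gfa_mean_cov hifiasm_0.19.8_no_HiC.bp.hap1.p_ctg.gfa`. On the contigs listed above, the result is far below the expected ~10.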

Have you seen this before?
What could I change to improve correction or assembly?

dehui333 (Collaborator) commented Aug 7, 2024

Hi,

For the human ultra-long datasets that we tested with, the loss in nucleotides after correction was generally less than 2% of the input (which had already been filtered to remove reads shorter than 10 kbp or below Q10).

Reads or segments for which not enough overlaps are found are left uncorrected and discarded. Retaining more reads in the output (and getting higher quality for the corrected reads) therefore depends on ensuring that overlaps can be found. A few things can help:

  1. Try to use longer reads if possible - the ultra-long human datasets we tested most extensively generally have an N50 of ~100 kbp.
  2. Tweak the minimap2 parameters to retain more overlaps - as mentioned in HERRO's preprint, for non-UL datasets we used

minimap2 -K8g -cx ava-ont -k21 -w14 -f 0.005 -e100 -r150 -z200 -m1500 -t${num_threads} --dual=yes $reads $reads

and achieved better results. In fact, I would also add that -f 0.005 to the parameters in create_batched_alignments.sh (which was mostly tested on UL reads) and reduce the -m. A lower -m generally means more overlaps retained, at the cost of a higher running time; the value that best balances the two seems to vary between different kinds of datasets.
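The modified command could be assembled as a dry run first. This sketch is purely illustrative: the variable names and the -m value of 1000 are my assumptions, not values taken from create_batched_alignments.sh; only the minimap2 flags themselves come from the command quoted above.

```shell
# Sketch: the non-UL all-vs-all overlap command with the suggested tweaks
# applied: -f 0.005 included, and -m lowered from 1500 (1000 here is an
# illustrative value; lower -m keeps more overlaps but costs runtime).
num_threads=32          # illustrative thread count
reads=reads.fastq.gz    # illustrative input file
m_value=1000
cmd="minimap2 -K8g -cx ava-ont -k21 -w14 -f 0.005 -e100 -r150 -z200 -m${m_value} -t${num_threads} --dual=yes ${reads} ${reads}"
echo "$cmd"  # dry run; remove the echo to actually execute
```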

Best,
Dehui
