Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long unitigging overlapper step #770

Closed
jychoilab opened this issue Jan 31, 2018 · 5 comments
Closed

Long unitigging overlapper step #770

jychoilab opened this issue Jan 31, 2018 · 5 comments

Comments

@jychoilab
Copy link

jychoilab commented Jan 31, 2018

Hello

Sorry to ask a non-issue related question but I wanted to ask a general question with using canu for plant genome assembly using nanopore data.
We have a~100X coverage for our plant using nanopore and I used canu with the following default paramters:

canu -p plant -d plant_default_param genomeSize=380m maxMemory=400g maxThreads=48 corMaxEvidenceErate=0.15 -nanopore-raw plant.fastq.gz

I haven't had problems and right now its on the unitigging overlapper step and there's a couple of jobs that are basically hanging for close to 2 weeks. I'm hoping they will eventually finish at some point but I have another 100X nanopore data from a different individual and was wondering how to avoid the long hang of the overlapper step.
I was thinking reads with low quality may be a problem, or non-sense reads those that are tandem repeats (close to 10kb in length) of short K-mers that are not biological but noise generated from the sequencing reaction may be causing the hang? Are there any recommended pre quality control steps one should do on the nanopore reads before using canu to assemble them?

Thank you

@skoren
Copy link
Member

skoren commented Feb 1, 2018

This is the same as the trimming jobs running slowly, see issue #521 for suggestions. Your best option is probably to run the assembly with overlapper=mhap utgReAlign=true. I'd let this assembly finish and then run a new one with the above option and compare the results. If the assemblies are similar contiguity and quality, switch to the faster option for your subsequent individuals.

@brianwalenz
Copy link
Member

Did it run fine through the trimming step?

overlapper=mhap is essentially the correct solution (and we're going to switch to that at some point), but I'm curious why it's only getting stuck here.

Can you share the *.histogram files in the trimming/0-mercounts and unitigging/0-mercounts directories?

What is the number in *.ms22.estMerThresh.out? Kmers with count above this are not used for seeding overlaps. It could be slow because it's aligning a loooooong overlap, or it could be slow because it's looking too hard for repeat overlaps.

And if not, just close the issue.

@jychoilab
Copy link
Author

Hi Brian
The trimming step was ok and it was the overlapping step that was the problem. I'll try with the overlapper=mhap option as well to see if it makes a difference.

The number I got for the trimming/0-mercounts/basmati334.ms22.estMerThresh.out:
1950
And unitigging/0-mercounts/basmati334.ms22.estMerThresh.out:
1850

The histogram files are:
trimming.ms22.histogram.txt
unitigging.ms22.histogram.txt

@jychoilab
Copy link
Author

Its been a while but I thought I would share my results with the community on using the overlapper=mhap option that was suggested.
So when I use overlapper=mhap option for my plant genome, the assembly time is very quick taking about ~4days to assemble a plant genome thats expected to be 340Mbp. But the assembly stats looked Ok-ish

--   contigs:      3592 sequences, total length 296475883 bp (including 357 repeats of total length 25356880 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  357230 sequences, total length 2607305874 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      325853            85    38309683
--     20      212771           233    76200279
--     30      150598           447   114134174
--     40      110947           744   152083070
--     50       82361          1144   190001421
--     60       59019          1689   228017353
--     70       39440          2477   266038087

But when I ran canu with default params it look almost 10times as longer ie. about a month but the assembly stats looks as following:

--   contigs:      3057 sequences, total length 356564119 bp (including 301 repeats of total length 5405095 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  77924 sequences, total length 577972789 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      893221            33    38192837
--     20      657741            83    76573300
--     30      492260           150   114008458
--     40      375851           238   152024830
--     50      292354           352   190032017
--     60      218501           504   228054553
--     70      146478           716   266069700
--     80       83034          1057   304071404
--     90       23487          1893   342007095

Seems like for my genome I can't cut corners and will have to just do it the default way.

@skoren
Copy link
Member

skoren commented Apr 5, 2018

Thanks for the update, when you ran the fast option did you use just overlapper=mhap or overlapper=mhap utgReAlign=true? The latter would work better if you didn't already try it.

@skoren skoren closed this as completed Apr 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants