very slow overlapInCore #1222
I was about to say there's no useful logging, and was reading the code to figure out how to add some, when I discovered the never-recently-if-ever-used -l (-ell) option. This option, default UINT64_MAX, claims to control how many overlaps to output per read end. When there are more than this many potential overlaps, comments claim overlaps will be processed positive-diagonal longest to shortest, then negative-diagonal longest to shortest, until that many overlaps are found (per end, not in total).

WARNING! Unused code path! It's not a big code path (at the bottom of overlapInCore-Process_String_Overlaps.C; -l is G.Frag_Olap_Limit), but it has almost certainly never been used since I refactored this code.

Some experiments with minimum overlap length last fall convinced me that long-read assembly doesn't suffer (at all) from losing overlaps to repeats at the ends of reads. Setting -l to 2x, 3x, 5x, 10x (1x?) coverage is probably safe (assuming -l actually works).

If you want to wait a bit, we can probably run a parameter sweep on dmel or arabidopsis and see what happens.
Dear Brian and Sergey,

Canu has been in the overlapInCore task for 10 days and is currently processing batches 6 to 9 out of 200. Thank you.

Starting 'obtovl' concurrent execution on Mon Jan 21 14:47:13 2019 with 554.65 GB free disk space (200 processes; 4 concurrently)
/canu_out/trimming/1-overlapper/001$ ls -ilthr
@KarinaMartins, you've got only 40 CPUs for a 1 Gb genome; we recommend using a cluster when possible for an assembly of that size. The issue is that each of the overlap jobs wants to use 8 CPUs, so you're only running 4 at a time. Are you running the 1.8 release (what does canu --version say)? You can try re-starting from trimming and adding
Thank you, Sergey. I'll do as you recommended.
I'm also seeing some very slow overlapInCore runtimes for a cluster job. At first I thought this was due to my reads being sorted by position in chromosome, so I modified my pipeline to randomly shuffle reads before running. This is for CLR human data.

When I attach to a running overlapInCore in gdb, it is in a step aligning long-ish (30-40 kb) centromeric reads, in both the shuffled and ordered runs.

Depending on the overlapInCore format, I can swap in some pretty quick overlapping code I wrote last summer.
I didn't think read order should affect runtime, as I said in #1202. I think you can adjust the threshold to filter more repetitive k-mers in these reads to speed up the computation (and potentially adjust the error rate down). It's easy to substitute another program; canu already has this option via the
Does -fast create options to speed up the falconsense step? After getting through the overlapInCore step with the options you provided, this is also taking quite some time.
No, -fast is independent of falcon_sense, unfortunately. Falcon_sense can usually be sped up by letting it run on scratch disk (via staging in canu, or by just copying the data), as it does a lot of random access.
Thanks! I'll give that a try. Most of the jobs run at 100% CPU over NFS, but it won't hurt to have them work on scratch.
Hi,

I have 54 Gb of nanopore fastq data. My server has 28 CPUs and 500 GB RAM, and the process has been running for 5 days. Right now, canu is still running. I wonder if this is normal; the other methods I tested never demanded this much time.

Thanks in advance,
Yes, this step can take a while with ONT data. There is the
Ok, thanks for your reply.
Following up on #1202: the error correction finished, but now it is stuck in trimming. Symptoms are similar to before, where most batches finish right away but some have taken weeks in overlapInCore. Is there a way to monitor which reads it is processing and exclude any that take a suspiciously long time? I also launched a separate job to trim and assemble with the corrected reads using -fast mode as a backup, but it would be nice to push this forward too.

Thanks!
Mike