Failure at overlap stage of trimming on slurm; multiple retries #1098

Closed
jmodlis opened this issue Sep 21, 2018 · 9 comments
jmodlis commented Sep 21, 2018

Hi,

I've had an assembly that I am trying to run on slurm that has failed a couple of times now and I've been unsuccessful at restarting it. At first, it was failing at this overlap store stage. I deleted the files listed below and now it seems to be failing at the next stage. See second canu.out below. Do you have any ideas what is going on?

Thanks,
Jen

End of canu.out from previous attempt

-- BEGIN TRIMMING
--
--
-- Overlap store sorting jobs failed, tried 2 times, giving up.
--   job trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0024 FAILED.
--   job trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0039 FAILED.
--   job trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0042 FAILED.
--   job trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0046 FAILED.
--   job trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0058 FAILED.
--   job trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0190 FAILED.
--   job trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0231 FAILED.

End of canu.out

ERROR:
ERROR:  Failed with exit code 134.  (rc=34304)
ERROR:

ABORT:
ABORT: Canu 1.7.1
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to build index for overlap store.
ABORT:
ABORT: Disk space available:  577.96 GB
ABORT:

End of trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/logs/3-index.err:

Created ovStore '.' with 36351749415 overlaps for reads from 1 to 2450140.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Failed with 'Aborted'; backtrace (libbacktrace):
AS_UTL/AS_UTL_stackTrace.C::97 in _Z17AS_UTL_catchCrashiP7siginfoPv()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
/opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libstdc++-v3/libsupc++/vterminate.cc::95 in _ZN9__gnu_cxx27__verbose_terminate_handlerEv()
/opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libstdc++-v3/libsupc++/eh_terminate.cc::47 in _ZN10__cxxabiv111__terminateEPFvvE()
/opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libstdc++-v3/libsupc++/eh_terminate.cc::57 in _ZSt9terminatev()
/opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libstdc++-v3/libsupc++/eh_throw.cc::93 in __cxa_throw()
/opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libstdc++-v3/libsupc++/new_op.cc::54 in _Znwm()
stores/ovStoreHistogram.C::605 in _ZN16ovStoreHistogram3addEPS_()
stores/ovStoreWriter.C::607 in _ZN13ovStoreWriter14mergeHistogramEv()
stores/ovStoreIndexer.C::138 in main()
(null)::0 in (null)()
(null)::0 in (null)()
./scripts/3-index.sh: line 18: 87713 Aborted                 (core dumped) $bin/ovStoreIndexer -O . -F 290

skoren (Member) commented Sep 21, 2018

Could you give more details on which files you erased to complete the store building?

You may simply be running out of memory, but it is also possible that some of the intermediate outputs are corrupt (which is why I asked about the erased files). The first thing to try is to restart Canu with the option gridOptionsExecutive="--mem=16g" and see if that succeeds.
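
For context on the memory theory: the indexer log above ends in std::bad_alloc followed by "Aborted", and exit code 134 is what a shell reports for a process killed by signal 6 (128 + 6). A minimal sketch of that arithmetic, assuming rc=34304 is a wait-style status with the exit code in the high byte (an assumption, although 34304 = 134 * 256 is consistent with it):

```shell
# Decode the numbers reported in canu.out.
# Assumption: rc is a wait-style status whose high byte is the exit code.
rc=34304
exit_code=$((rc >> 8))            # high byte of the status
signal_num=$((exit_code - 128))   # shells report 128 + signal number
echo "exit code: $exit_code, signal number: $signal_num"
kill -l "$signal_num"             # prints the name of that signal
```

An abort right after a failed allocation fits either explanation given above (too little memory, or corrupt input that makes the indexer request an unreasonable amount), so this does not distinguish between them.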

jmodlis (Author) commented Sep 21, 2018

I deleted these files..

trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0024 FAILED.
trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0039 FAILED.
trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0042 FAILED.
trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0046 FAILED.
trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0058 FAILED.
trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0190 FAILED.
trimming/Penst_canu_asm1_hap.ovlStore.BUILDING/0231 FAILED.

And then 56 accidentally...

Reran with your suggestion and it gives me the same error.

skoren (Member) commented Sep 25, 2018

Those aren't file names but job information; Canu doesn't use file names with spaces. However, running rm on those names shouldn't have removed any necessary files.
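
To make the distinction concrete, here is a small sketch that pulls only the numeric job IDs out of the "FAILED" lines in a saved log, without touching any files. The function name `failed_jobs` and the log path passed to it are hypothetical; the pattern is based solely on the line format quoted earlier in this thread.

```shell
# Print the job numbers from lines such as
#   --   job trimming/<prefix>.ovlStore.BUILDING/0024 FAILED.
# $1 is the path to a saved copy of canu.out (hypothetical location).
failed_jobs() {
  sed -n 's|^-- *job .*/\([0-9][0-9]*\) FAILED\. *$|\1|p' "$1"
}
```

Running `failed_jobs canu.out` would list one job number per line, which can then be used to look up the matching logs instead of deleting anything.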

Can you try increasing the memory even more, gridOptionsExecutive="--mem=120g" and confirm the submitted canu job is requesting that much memory from slurm? Can you also run ls -lha on the store folder and post the results?
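
One possible way to do both checks from the command line is sketched below. The helper name `requested_mem`, the job ID, and the output file name are placeholders, and the exact format Slurm prints for `%m` varies between versions.

```shell
# Report the memory Slurm has recorded for a queued or running job.
# $1 is the Slurm job ID (placeholder; take the real one from squeue).
requested_mem() {
  # -h: no header, -j: select this job, -o '%m': minimum memory requested
  squeue -h -j "$1" -o '%m'
}

# Example, to be run on the cluster (left commented out here):
# requested_mem 1234567
# ls -lha trimming/Penst_canu_asm1_hap.ovlStore.BUILDING > store_listing.txt
```

If the value reported is lower than what was passed in gridOptionsExecutive, the option is probably not reaching the submitted job.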

jmodlis (Author) commented Sep 26, 2018

Hi,

The new job is in the queue, but in the meantime, here are the contents of the trimming/Penst_canu_asm1_hap.ovlStore.BUILDING folder (the listing was very long, so I redirected it to a file):
folder.contents.txt

And the contents of Penst_canu_asm1_hap.gkpStore which is in the top level of the assembly directory.

total 3.9G
drwxrwsr-x 2 jlm50 omicscore 8.0K Aug 27 12:50 .
drwxrwsr-x 6 jlm50 omicscore 8.0K Sep 26 16:11 ..
-rw-rw-r-- 1 jlm50 omicscore 3.6G Aug 27 12:50 blobs
-rw-rw-r-- 1 jlm50 omicscore  44M Aug 27 12:50 errorLog
-rw-rw-r-- 1 jlm50 omicscore   88 Aug 27 12:50 info
-rw-rw-r-- 1 jlm50 omicscore  435 Aug 27 12:50 info.txt
-rw-rw-r-- 1 jlm50 omicscore  336 Aug 27 12:50 libraries
-rw-rw-r-- 1 jlm50 omicscore  200 Aug 27 12:50 libraries.txt
-rw-rw-r-- 1 jlm50 omicscore  380 Aug 27 12:50 load.dat
-rw-rw-r-- 1 jlm50 omicscore  12M Aug 27 12:50 readlengths-obt.dat
-rw-rw-r-- 1 jlm50 omicscore  628 Aug 27 12:50 readlengths-obt.gp
-rw-rw-r-- 1 jlm50 omicscore  30K Aug 27 12:50 readlengths-obt.lg.svg
-rw-rw-r-- 1 jlm50 omicscore  29K Aug 27 12:50 readlengths-obt.sm.svg
-rw-rw-r-- 1 jlm50 omicscore 163M Aug 27 12:50 readNames.txt
-rw-rw-r-- 1 jlm50 omicscore  94M Aug 27 12:50 reads

Also, the file names in my previous post above didn't have " FAILED" in them, although maybe that would have been amusing. That was from a log file somewhere and I forgot to remove the "FAILED" part. :)

Thank you!
Jen

skoren (Member) commented Oct 2, 2018

It looks like part of your output is corrupt or missing. Job 56 is the one that was accidentally deleted, so its data was lost:

-rw-rw-r--  1 jlm50 omicscore    0 Sep 21 14:09 0056
-rw-rw-r--  1 jlm50 omicscore   24 Sep 21 14:09 0056.evalueLen
-rw-rw-r--  1 jlm50 omicscore   24 Sep 21 14:09 0056.index
-rw-rw-r--  1 jlm50 omicscore   64 Sep 21 14:09 0056.info
-rw-rw-r--  1 jlm50 omicscore    8 Sep 21 14:09 0056.overlapScores

I'm guessing this is causing your crash. Do you still have the ovb files under 1-overlapper/results? If so, the easiest would be to remove the asm.ovlStore.BUILDING folder completely and let canu re-build it. You could also check if you still have files named slice*56 in the bucket* folders (ls bucket*/slice*56). If you have those, you can just remove all files named 0056.* and re-run 2-sort.sh 56.
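
The two checks described above could be scripted roughly as follows. This is only a sketch: the function names are made up, the four-digit naming of store pieces is inferred from the listing in this thread, and the slice pattern simply mirrors the `bucket*/slice*56` glob (so, like that glob, it would also match a name ending in 156).

```shell
# List numbered store pieces (e.g. 0056) that exist but are zero bytes long.
# $1 = the ovlStore.BUILDING directory.
empty_pieces() {
  find "$1" -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9]' -size 0
}

# Count slice files for one sort job across all bucket* folders.
# $1 = the ovlStore.BUILDING directory, $2 = job number, e.g. 56.
slice_count() {
  find "$1" -path '*/bucket*/slice*'"$2" -type f | wc -l
}
```

A piece reported by `empty_pieces` together with a non-zero `slice_count` for the same job suggests that re-running just that sort job is enough; a count of zero means the inputs for that job are gone.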

jmodlis (Author) commented Oct 2, 2018

I have ovb files through 0042 in 1-overlapper/001

And that is where the slice/bucket files stop too...

I'm not sure where that leaves me.

skoren (Member) commented Oct 2, 2018

The number of bucket folders is the same as the number of ovb files but that doesn't necessarily equal the number of slice files. Each bucket folder should have multiple slice files and multiple buckets will have the same slice.

If you don't have the slice files for job 56, then you'll have to re-build the store by removing the whole folder as I suggested. The issue is one of the intermediate outputs got deleted so the only way to recover it is to re-run.
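
Before removing the folder, it may be worth confirming that the overlapper outputs needed for the rebuild are still present. A hedged sketch follows; the function name `rebuild_ready`, the directory arguments, and the assumption that overlapper outputs have ".ovb" in their file names are all illustrative rather than taken from Canu's documentation.

```shell
# Succeed only if at least one ovb file exists under the overlapper
# directory and the store directory to be removed is actually there.
# $1 = overlapper directory, $2 = ovlStore.BUILDING directory.
rebuild_ready() {
  n=$(find "$1" -type f -name '*.ovb*' | wc -l)
  [ "$n" -gt 0 ] && [ -d "$2" ]
}

# Example (paths are placeholders; left commented out on purpose):
# if rebuild_ready trimming/1-overlapper trimming/<prefix>.ovlStore.BUILDING; then
#   rm -r trimming/<prefix>.ovlStore.BUILDING
# fi
```

After the folder is removed, restarting Canu with the same command as before should trigger the rebuild.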

jmodlis (Author) commented Oct 2, 2018

Great, thank you! It is re-starting now. I will keep you posted.

skoren (Member) commented Dec 3, 2018

Idle.
