
SLURM metagenome run - increased batMemory from restart not recognised #1281

Closed
caity-s opened this issue Mar 15, 2019 · 6 comments

caity-s commented Mar 15, 2019

Hello,

Firstly, thank you for a fantastic tool - it has worked very well so far. However, I have run into some problems with a couple of our larger metagenomes.

I first ran the Canu 1.8 command (see below) on our HPC slim nodes, with batMemory=200, and bogart failed due to memory limits.

When I increased batMemory to 450 and switched to the fat nodes, it failed at the same step, and looking at the canu.out file (with the restart time stamp) it seems the memory increase was not registered: bogart was still running with batMemory=200, trying the same command again. However, unitigger.jobSubmit-01.sh did seem to recognise the changed arguments (see below).

This is my first time using an HPC cluster, so I am not sure whether the problem is:

  1. SLURM and user memory limits,
  2. a specific Canu output (config or .sh) that I should delete so that the right file is regenerated for the restart - I am hoping you can suggest which file to delete if that is the case, or
  3. files missing for the restart because the HPC does not keep the intermediates after a failure.

Here is the sbatch file I ran after increasing batMemory and switching from slim to fat nodes:

#! /bin/bash
#
#SBATCH --account aauhpc_fat      # account
#SBATCH --nodes 1                  # number of nodes
#SBATCH --time 24:00:00            # max time (HH:MM:SS)
#SBATCH --mail-type all            # types of mail notifications received
#SBATCH --mail-user email@uni # mail address

FASTQPATH=/gpfs/gss1/work/aauhpc/csingleton/data/Kalu_18-Q3-R12-55_np_nptrim.fq;
# Genome assembly using CANU V. 1.8 # metagenome input from https://github.com/marbl/canu/issues/634
/gpfs/gss1/work/aauhpc/caitys/software/canu-1.8/Linux-amd64/bin/canu -p Kalu_18-Q3-R12-55_np -d Kalu_18-Q3-R12-55_np corMinCoverage=0 corOutCoverage=all corMhapSensitivity=high correctedErrorRate=0.105 genomeSize=5m corMaxEvidenceCoverageLocal=10 corMaxEvidenceCoverageGlobal=10 oeaMemory=32 redMemory=32 batMemory=450 gridOptionsJobName=CMSKalunCANU gridOptions="--account aauhpc_fat --time=24:00:00 --mail-user email@uni --mail-type FAIL" -nanopore-raw $FASTQPATH

Here is the error:

--
-- Bogart failed, tried 2 times, giving up.
--

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT: Disk space available:  14802.304 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (unitigging/4-unitigger/unitigger.err):
ABORT:
ABORT:   
ABORT:   Lengths:
ABORT:     Minimum read          0 bases
ABORT:     Minimum overlap       500 bases
ABORT:   
ABORT:   Overlap Error Rates:
ABORT:     Graph                 0.105 (10.500%)
ABORT:     Max                   0.105 (10.500%)
ABORT:   
ABORT:   Deviations:
ABORT:     Graph                 6.000
ABORT:     Bubble                6.000
ABORT:     Repeat                3.000
ABORT:   Edge Confusion:
ABORT:     Absolute              2100
ABORT:     Percent               200.0000
ABORT:   
ABORT:   Unitig Construction:
ABORT:     Minimum intersection  500 bases
ABORT:     Maxiumum placements   2 positions
ABORT:   
ABORT:   Debugging Enabled:
ABORT:     (none)
ABORT:   
ABORT:   ==> LOADING AND FILTERING OVERLAPS.
ABORT:   
ABORT:   ReadInfo()-- Using 1819405 reads, no minimum read length used.
ABORT:   
ABORT:   OverlapCache()-- limited to 204800MB memory (user supplied).
ABORT:   
ABORT:   OverlapCache()--      13MB for read data.
ABORT:   OverlapCache()--      69MB for best edges.
ABORT:   OverlapCache()--     180MB for tigs.
ABORT:   OverlapCache()--      48MB for tigs - read layouts.
ABORT:   OverlapCache()--      69MB for tigs - error profiles.
ABORT:   OverlapCache()--   51200MB for tigs - error profile overlaps.
ABORT:   OverlapCache()--       0MB for other processes.
ABORT:   OverlapCache()-- ---------
ABORT:   OverlapCache()--   51616MB for data structures (sum of above).
ABORT:   OverlapCache()-- ---------
ABORT:   OverlapCache()--      34MB for overlap store structure.
ABORT:   OverlapCache()--  153148MB for overlap data.
ABORT:   OverlapCache()-- ---------
ABORT:   OverlapCache()--  204800MB allowed.
ABORT:   OverlapCache()--
ABORT:   OverlapCache()-- Retain at least 5620 overlaps/read, based on 2810.40x coverage.
ABORT:   OverlapCache()-- Initial guess at 5516 overlaps/read.
ABORT:   OverlapCache()--
ABORT:   OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.
ABORT:

The coverage is very high... the other metagenome that failed is at 7000x coverage relative to the genomeSize=5m setting. The sample is ~50 Gbp of nanopore data.

The error from 4-unitigger/unitigger.err (the time stamp, as with canu.out, indicates it is from the restart):

==> LOADING AND FILTERING OVERLAPS.

ReadInfo()-- Using 1819405 reads, no minimum read length used.

OverlapCache()-- limited to 204800MB memory (user supplied).

OverlapCache()--      13MB for read data.
OverlapCache()--      69MB for best edges.
OverlapCache()--     180MB for tigs.
OverlapCache()--      48MB for tigs - read layouts.
OverlapCache()--      69MB for tigs - error profiles.
OverlapCache()--   51200MB for tigs - error profile overlaps.
OverlapCache()--       0MB for other processes.
OverlapCache()-- ---------
OverlapCache()--   51616MB for data structures (sum of above).
OverlapCache()-- ---------
OverlapCache()--      34MB for overlap store structure.
OverlapCache()--  153148MB for overlap data.
OverlapCache()-- ---------
OverlapCache()--  204800MB allowed.
OverlapCache()--
OverlapCache()-- Retain at least 5620 overlaps/read, based on 2810.40x coverage.
OverlapCache()-- Initial guess at 5516 overlaps/read.
OverlapCache()--
OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.

Config section of the SLURM output:

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_111' (from 'java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 504 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found  64 hosts with  24 cores and  498 GB memory under Slurm control.
-- Found 520 hosts with  24 cores and   61 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     10 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6 GB   12 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     4 GB    8 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8 GB    1 CPU   (overlap store sorting)
-- Grid:  red       32 GB    4 CPUs  (read error detection)
-- Grid:  oea       32 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      450 GB    4 CPUs  (contig construction with bogart)
-- Grid:  gfa        8 GB    4 CPUs  (GFA alignment and processing)

And the contents of unitigger.jobSubmit-01.sh (note --mem-per-cpu=115200m × 4 CPUs = 460800m, i.e. the new 450 GB batMemory was registered here):

#!/bin/sh

sbatch \
  --mem-per-cpu=115200m --cpus-per-task=4 --account aauhpc_fat --time=24:00:00 --mail-user user@email --mail-type FAIL -o unitigger.%A_%a.out \
  -D `pwd` -J "bat_Kalu_18-Q3-R12-55_np_CMSKalunCANU" \
  -a 1-1 \
  ./unitigger.sh 0 \
> ./unitigger.jobSubmit-01.out 2>&1

And unitigger.out, just in case:

Found perl:
   /usr/bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /usr/bin/java
   openjdk version "1.8.0_111"

Found canu:
   /gpfs/gss1/work/aauhpc/csingleton/software/canu-1.8/Linux-amd64/bin/canu
   Canu 1.8

Running job 1 based on SLURM_ARRAY_TASK_ID=1 and offset=0.
bogart appears to have failed. No Kalu_18-Q3-R12-55_np.ctgStore or Kalu_18-Q3-R12-55_np.utgStore found.

Thank you in advance for any suggestions,

Caitlin

Also, in case the canu and bogart logs from the restart (in canu-logs) are useful:

bogart log:

Canu v1.8 (+0 commits) r0 .

Current Working Directory:
/gpfs/gss1/work/aauhpc/csingleton/CANU_MGP1000/Kalu_18-Q3-R12-55_np/unitigging/4-unitigger

Command:
/gpfs/gss1/work/aauhpc/csingleton/software/canu-1.8/Linux-amd64/bin/bogart \
  -S ../../Kalu_18-Q3-R12-55_np.seqStore \
  -O ../Kalu_18-Q3-R12-55_np.ovlStore \
  -o ./Kalu_18-Q3-R12-55_np \
  -gs 5000000 \
  -eg 0.105 \
  -eM 0.105 \
  -mo 500 \
  -dg 6 \
  -db 6 \
  -dr 3 \
  -ca 2100 \
  -cp 200 \
  -threads 4 \
  -M 200 \
  -unassembled 2 0 1.0 0.5 3

Canu log:

###
###  Reading options from '/gpfs/gss1/work/aauhpc/csingleton/software/canu-1.8/Linux-amd64/bin/canu.defaults'
###

# Add site specific options (for setting up Grid or limiting memory/threads) here.

###
###  Reading options from the command line.
###

corMinCoverage=0
corOutCoverage=all
corMhapSensitivity=high
correctedErrorRate=0.105
genomeSize=5m
corMaxEvidenceCoverageLocal=10
corMaxEvidenceCoverageGlobal=10
oeaMemory=32
redMemory=32
batMemory=450
gridOptionsJobName=CMSKalunCANU
gridOptions=--account aauhpc_fat --time=24:00:00 --mail-user user@email --mail-type FAIL
canuIteration=2

caity-s commented Mar 15, 2019

Can't believe I missed it, but I just found issue #1253, so I will try deleting 4-unitigger.

skoren commented Mar 15, 2019

Yes, that was going to be my suggestion. In general, the scripts aren't re-generated if they already exist, so new parameters don't overwrite the old ones.
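
A minimal sketch of that workaround (the directory path follows the layout in the logs above; the submit-script name is illustrative):

# remove the stale step directory so canu regenerates its scripts
# with the new batMemory on the next pass
rm -r Kalu_18-Q3-R12-55_np/unitigging/4-unitigger
# then resubmit the same job; canu resumes from the completed steps
sbatch canu_run.sbatch    # hypothetical name for the sbatch file shown above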

caity-s commented Mar 15, 2019

Hello skoren, brilliant, thanks - it is happily fat bogarting now:

Command:
/gpfs/gss1/work/aauhpc/csingleton/software/canu-1.8/Linux-amd64/bin/bogart \
  -S ../../Kalu_18-Q3-R12-55_np.seqStore \
  -O ../Kalu_18-Q3-R12-55_np.ovlStore \
  -o ./Kalu_18-Q3-R12-55_np \
  -gs 5000000 \
  -eg 0.105 \
  -eM 0.105 \
  -mo 500 \
  -dg 6 \
  -db 6 \
  -dr 3 \
  -ca 2100 \
  -cp 200 \
  -threads 4 \
  -M 450 \
  -unassembled 2 0 1.0 0.5 3

It is great that it restarts so smoothly - that gives a lot of flexibility for the tricky samples.

Sorry for the issue duplication! And thanks again!

skoren closed this as completed Mar 15, 2019

caity-s commented Mar 15, 2019

Sergey,

I'd really appreciate your advice on memory. Unfortunately, for the really big metagenome (7000x coverage at genomeSize=5m, 50 Gbp of data), even batMemory=498 wasn't enough... do you have any suggestions? Our grid nodes don't get bigger than 500 GB, so I get the "No available machine configuration can run this task" error if I try upping it. Would batThreads somehow spread the memory requirement across nodes, or does it all have to happen on one node at this point?

Otherwise, we have a single server with 1 TB, so moving the data there might be the best option - though maybe this sample is just too big...

Thank you!

brianwalenz commented:

The issue is from this bit of logging:

ABORT:   OverlapCache()-- Retain at least 5620 overlaps/read, based on 2810.40x coverage.
ABORT:   OverlapCache()-- Initial guess at 5516 overlaps/read.
ABORT:   OverlapCache()--
ABORT:   OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.

where bogart thinks, based on a genome size of 5 Mbp, that you've really got 2810x coverage, and so it wants to load at least 'gobs and gobs' of overlaps per read.

A simple fix would be to increase the genome size supplied to bogart (-gs 5000000). Increasing to 50 Mbp would reduce 'coverage' to 281x, and bogart would require only 552 overlaps per read. It picks the 'strongest' overlaps, and just ignores any additional overlaps.

The genome size here is used only to compute coverage and N50 statistics in logs.
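
As a back-of-envelope check of that scaling (a sketch; it assumes the coverage figure is simply total overlap bases divided by genome size, consistent with the numbers in the log above):

awk 'BEGIN {
  bases = 2810.40 * 5e6;                # bases implied by 2810.40x at -gs 5000000 (~14 Gbp)
  print "coverage at -gs 50000000:", bases / 5e7;   # ~281x
  print "min overlaps/read:", 5620 / 10;            # ~562: the "retain at least" figure drops tenfold
}'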

And since we don't rewrite scripts, you can edit unitigger.sh and just restart canu. ;-)
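
A sketch of that edit (back up the script first; the sed pattern assumes -gs appears exactly as in the bogart command above, and the path follows the directory layout from the logs):

cd Kalu_18-Q3-R12-55_np/unitigging/4-unitigger
cp unitigger.sh unitigger.sh.bak
# bump bogart's genome size from 5 Mbp to 50 Mbp in the existing script
sed -i 's/-gs 5000000 /-gs 50000000 /' unitigger.sh
# then restart canu with the original command; the edited script is reused as-is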

caity-s commented Mar 18, 2019

That's really awesome, thanks so much Brian :D
