
Memory allocation problem on a slurm system #148

Closed
Dictionary2b opened this issue Dec 27, 2023 · 12 comments

@Dictionary2b

Hello,
I was running the workflow on an extensive data set (over 800 samples) on a Slurm platform (UPPMAX), using the GATK approach with intervals. I got an error message like this:
rule create_cov_bed:
input: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
output: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed
jobid: 2558
benchmark: benchmarks/GCA_009792885.1/covbed/lark20231207_benchmark.txt
reason: Missing output files: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed; Input files updated by another job: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
wildcards: refGenome=GCA_009792885.1, prefix=lark20231207
resources: mem_mb=448200, mem_mib=427437, disk_mb=448200, disk_mib=427437, tmpdir=

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/./profiles/slurm/slurm-submit.py", line 59, in
print(slurm_utils.submit_job(jobscript, **sbatch_options))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 131, in submit_job
raise e
File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 129, in submit_job
res = subprocess.check_output(["sbatch"] + optsbatch_options + [jobscript])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--partition=core', '--time=7-00:00:00', '--ntasks=8', '--output=logs/slurm/slurm-%j.out', '--account=naiss2023-5-278', '--mem=448200', '/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/.snakemake/tmp.fi5cgg1q/snakejob.create_cov_bed.2558.sh']' returned non-zero exit status 1.
Error submitting jobscript (exit code 1):

Select jobs to execute...

There is not enough memory on the Slurm system. I'm not sure where the issue is, whether I didn't request enough memory in my settings or the HPC simply cannot provide that much memory. Should I change the source code to request a larger memory allocation for the workflow from the beginning (e.g. more than 1000 nodes)?

Also, the whole workflow is not running very fast; the bam2vcf jobs report about 2% progress every 24 hours, which will clearly exceed the time limit of the Snakemake Slurm job. Could you please give some suggestions on this? Thanks in advance.

Best,
Zongzhuang

@cademirch
Collaborator

Hi Zongzhuang, sorry you're running into this. Could you provide your config, resource config, and the command line used to run snpArcher? The create_cov_bed step is pretty memory intensive, so that could be the cause of the error you posted. As for slow progress, that could be due to a number of things outside of snpArcher's control, such as HPC queue and resource limits. However, please post the items above and we can try to diagnose.

@Dictionary2b
Author

@cademirch Thanks for your suggestion! Here are the config, the resources file, and the bash script (.sh) I used to run snpArcher. They are all archived in this zip file.

I asked UPPMAX support, and they suggested requesting a fat-node partition job with at least 512 GB (-C 512 GB) instead of the core partition I was using, which provides at most 128 GB. Is this something I can change in the cluster-config file? Also, since there are only a few fat nodes on the cluster, is it possible to have the workflow submit only the memory-intensive jobs to the node partition and keep the other jobs with the previous settings? I'm also unsure whether I need to kill the current process and restart it to apply the changes.

Zongzhuang_config.zip

@cademirch
Collaborator

Hi Zongzhuang,

I took a look and your configs look OK to me. One thing I would suggest is using the --slurm option when executing Snakemake instead of the profile. See these docs for more details: https://snakemake.readthedocs.io/en/v7.32.3/executing/cluster.html#executing-on-slurm-clusters

As for submitting certain rules to specific partitions, this is possible; the docs above detail how. I would suggest creating a YAML profile in which you can define which rules go to which partition. I'll provide an example here:

uppmax_example_profile.yaml

slurm: True # same as `--slurm` on the command line
jobs: 1000 # number of jobs to run concurrently
use-conda: True
# other command line options can be set here as well
default-resources:
  slurm_partition: <Your partition name here> # default partition for all rules
  slurm_account: <Your slurm account> # if applicable on your cluster
set-resources:
  create_cov_bed:
    slurm_partition: <Your big partition name here> # overrides the default partition for this rule
  # ... you can specify partitions for other rules by following this pattern

Then, when you run snpArcher, you can do so with this profile. Let me know if this helps!
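For reference, a minimal sketch of the invocation, assuming the YAML above is saved as ./profiles/uppmax_example/config.yaml (a hypothetical path; Snakemake's --profile option expects the directory containing config.yaml):

snakemake --snakefile workflow/Snakefile --profile ./profiles/uppmax_example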

@Dictionary2b
Author

Hi Cade,

Many thanks for your explanation!

I'm unsure whether using the --slurm option without the --profile option will work in this case. The cpus-per-task issue on the Slurm system still seems to persist; it is now partially solved by an edit in profiles/slurm/slurm_utils.py.

If I do still have to use the --profile option, can I modify profiles/slurm/config.yaml (or something in cluster_config.yml?) to submit certain rules to specific partitions?

@cademirch
Collaborator

Okay, sorry, I didn't realize that issue as well. The shell script you sent above looks like this:

❯ cat run_pipeline_zongz1123.sh
#!/bin/bash
#SBATCH -A naiss2023-5-278
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 10-00:00:00
#SBATCH -J snpArcher
#SBATCH -e snpArcher_%A_%a.err # File to which STDERR will be written
#SBATCH -o snpArcher_%A_%a.out
#SBATCH --mail-type=all
#SBATCH --mail-user=dictionary2b@gmail.com
module load conda/latest

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
mamba activate snparcher
snakemake --snakefile workflow/Snakefile --profile ./profiles/slurm

You would edit the file .profiles/slurm to include this:

default-resources:
  slurm_partition: <Your partition name here> # default partition for all rules
  slurm_account: <Your slurm account> # if applicable on your cluster
set-resources:
  create_cov_bed:
    slurm_partition: <Your big partition name here> # overrides the default partition for this rule
  # ... you can specify partitions for other rules by following this pattern

Let me know if this makes sense and is helpful!

@Dictionary2b
Author

Thanks, Cade. .profiles/slurm is a directory containing both config.yaml and cluster_config.yml, so I can't find the right place to include the code exactly as you suggested. To my understanding, if I want to add a specific resource setting for certain jobs, I would need to add it to cluster_config.yml so that it looks like this:

__default__:
    partition: "snowy"
    time: 7-00:00:00
    partition: core
    ntasks: 8
    output: "logs/slurm/slurm-%j.out"
    account: naiss2023-5-278

create_cov_bed:
    partition: "snowy"
    time: 7-00:00:00
    partition: node
    nodes: 1
    ntasks: 8
    constraint: mem512GB
    output: "logs/slurm/slurm-%j.out"
    account: naiss2023-5-278

Do I understand you correctly? Sorry for the misunderstanding!

@cademirch
Collaborator

Ah, my apologies, I messed up what I posted. I believe what you have is correct. It's a bit confusing between the two main ways to run Slurm with Snakemake. I will look more into the issue you posted above as well, now that I have a Slurm cluster to test on.

@brian-arnold
Collaborator

Hello!
Just to follow up on this discussion, how is memory being determined for this rule? Is it modifiable and capable of being run with lower memory? Looking at the create_cov_bed rule, I don't see any resources section.

The discussion above could be a potential solution, but when we run this rule it tries to request a ton of memory (1,700 GB) that may not exist on any node of our computing cluster (see the error message below saying "Requested node configuration is not available").

If it's useful information, we're using low-depth human samples (~325) mapped to the hg38 genome, which is quite complete.

Sincerely,
Brian

[Wed Feb 7 13:22:58 2024]
rule create_cov_bed:
input: results/hg38/summary_stats/all_cov_sumstats.txt, results/hg38/callable_sites/all_samples.d4
output: results/hg38/callable_sites/past_and_turk_callable_sites_cov.bed
jobid: 1634
benchmark: benchmarks/hg38/covbed/past_and_turk_benchmark.txt
reason: Missing output files: results/hg38/callable_sites/past_and_turk_callable_sites_cov.bed
wildcards: refGenome=hg38, prefix=past_and_turk
resources: mem_mb=1743686, mem_mib=1662909, disk_mb=1743686, disk_mib=1662909, tmpdir=
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
File "/Genomics/ayroleslab2/emma/snpArcher/profiles/slurm/slurm-submit.py", line 59, in
print(slurm_utils.submit_job(jobscript, **sbatch_options))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Genomics/ayroleslab2/emma/snpArcher/profiles/slurm/slurm_utils.py", line 131, in submit_job
raise e
File "/Genomics/ayroleslab2/emma/snpArcher/profiles/slurm/slurm_utils.py", line 129, in submit_job
res = subprocess.check_output(["sbatch"] + optsbatch_options + [jobscript])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Genomics/argo/users/emmarg/.conda/envs/snparcher/lib/python3.12/subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Genomics/argo/users/emmarg/.conda/envs/snparcher/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--time=9000', '--nodes=1', '--mem=1743686', '--output=logs/slurm/slurm-%j.out', '--cpus-per-task=1', '/Genomics/ayroleslab2/emma/snpArcher/past_and_turk/.snakemake/tmp.lh6xpm8u/snakejob.create_cov_bed.1634.sh']' returned non-zero exit status 1.
Error submitting jobscript (exit code 1):

@tsackton
Contributor

tsackton commented Feb 8, 2024

We have seen this a number of times - the default memory specification seems to go off the rails for a reason we don't yet understand.

One solution is to define mem_mb = <some other reasonable number> directly in the Snakemake rule, in the resources section.
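
For example, a minimal sketch of what that could look like (not the actual rule body; the 64000 value is an arbitrary placeholder you would size to your data and to what your nodes can provide):

rule create_cov_bed:
    # ... existing input, output, and shell directives stay unchanged ...
    resources:
        mem_mb = 64000  # placeholder value; overrides the input-size-based default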

We are hoping to debug this but so far haven't tracked down the problem.

@cademirch
Collaborator

I think this may be happening since create_cov_bed is not defined in the resources yaml, so Snakemake comes up with a default:
https://github.com/snakemake/snakemake/blob/0998cc57cbd02c38d1a3bbf1662c8b23b7601e20/snakemake/resources.py#L11-L16
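
For reference, the defaults at that link amount to roughly the following (paraphrased; check the pinned Snakemake version for the exact expressions):

mem_mb = max(2 * input.size_mb, 1000)
disk_mb = max(2 * input.size_mb, 1000)

So with a multi-hundred-GB all_samples.d4 as input, the computed mem_mb ends up in the hundreds of thousands of MB, which matches the requests in the logs above. Defining an explicit mem_mb for create_cov_bed, either in the resources yaml or as a rule/profile override, takes precedence over this default.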

@Erythroxylum

Hello, this issue has caused failures in the qc module in 2 of 3 otherwise successful runs. I have defined

resources:
    mem_mb = 16000

in the workflow/modules/qc Snakefile before the 'run' or 'shell' directive in every rule, but the error and job failure persist. The Snakefile and err file are attached.

As you say, Cade, the first error is for create_cov_bed, which is not a rule in this Snakefile.
err336.txt
Snakefile.txt

@Dictionary2b
Author

Thanks, Cade. The workflow has now finished properly. Defining the specific resource allocation for each job in cluster_config.yml, as I did, was the solution. :)
