
Memory allocation problem on a slurm system #148

Closed
Dictionary2b opened this issue Dec 27, 2023 · 12 comments

@Dictionary2b

Hello,
I was running the workflow on an extensive data set (over 800 samples) on a Slurm platform (UPPMAX), using the GATK approach with intervals. I got an error message like this:
rule create_cov_bed:
input: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
output: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed
jobid: 2558
benchmark: benchmarks/GCA_009792885.1/covbed/lark20231207_benchmark.txt
reason: Missing output files: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed; Input files updated by another job: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
wildcards: refGenome=GCA_009792885.1, prefix=lark20231207
resources: mem_mb=448200, mem_mib=427437, disk_mb=448200, disk_mib=427437, tmpdir=

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/./profiles/slurm/slurm-submit.py", line 59, in
print(slurm_utils.submit_job(jobscript, **sbatch_options))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 131, in submit_job
raise e
File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 129, in submit_job
res = subprocess.check_output(["sbatch"] + optsbatch_options + [jobscript])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--partition=core', '--time=7-00:00:00', '--ntasks=8', '--output=logs/slurm/slurm-%j.out', '--account=naiss2023-5-278', '--mem=448200', '/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/.snakemake/tmp.fi5cgg1q/snakejob.create_cov_bed.2558.sh']' returned non-zero exit status 1.
Error submitting jobscript (exit code 1):

Select jobs to execute...

There is not enough memory on the Slurm system. I'm not sure where the issue is, whether I didn't request enough memory in my settings or the HPC simply cannot provide that much memory. Should I change the source code to request a larger memory allocation for the workflow from the beginning (e.g. more than 1000 nodes)?

Also, the whole workflow is not running very fast; the bam2vcf jobs report about 2% progress every 24 hours, which will clearly exceed the time limit of the Snakemake Slurm job. Could you please give some suggestions on this? Thanks in advance.

Best,
Zongzhuang

@cademirch
Collaborator

Hi Zongzhuang, sorry you're running into this. Could you provide your config, resource config, and the command line used to run snpArcher? The create_cov_bed step is pretty memory intensive, so that could be the cause of the error you posted. As for slow progress, that could be due to a number of things outside of snpArcher's control, such as HPC queue and resource limits. However, please post the items above and we can try to diagnose.

@Dictionary2b
Author

@cademirch Thanks for your suggestion! Here are the config, the resources file, and the bash script (.sh) I used to run snpArcher. They are all archived in this zip file.

I asked UPPMAX support, and they suggested requesting a fat-node partition job with at least 512 GB (-C 512 GB) instead of the core partition I was using, which provides at most 128 GB. Is this something I can change in the cluster-config file? Also, since there are only a few fat nodes on the cluster, is it possible to have the workflow submit only the memory-intensive jobs to the node partition and keep the other jobs with the previous settings? I'm also unsure whether I need to kill the current process and restart it to apply the changes.

Zongzhuang_config.zip

@cademirch
Collaborator

Hi Zongzhuang,

I took a look and your configs look OK to me. One thing I would suggest is using the --slurm option when executing Snakemake instead of the profile. See these docs for more details: https://snakemake.readthedocs.io/en/v7.32.3/executing/cluster.html#executing-on-slurm-clusters

As for submitting certain rules to specific partitions, this is possible; the docs above detail how. I would suggest creating a YAML profile in which you can define which rules go to which partition. I'll provide an example here:

uppmax_example_profile.yaml

slurm: True # same as `--slurm` on the command line
jobs: 1000 # number of jobs to run concurrently
use-conda: True
# other command line options can be set here as well
default-resources:
  slurm_partition: <Your partition name here> # default partition for all rules
  slurm_account: <Your slurm account> # if applicable on your cluster
set-resources:
  create_cov_bed:
    slurm_partition: <Your big partition name here> # overrides the default partition for this rule
  # ... you can specify partitions for other rules by following this pattern

Then, when you run snpArcher, you can do so with this profile. Let me know if this helps!
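For reference, a minimal sketch of the invocation, assuming the YAML above is saved as ./profiles/uppmax_example/config.yaml (a hypothetical path; Snakemake's --profile option expects the directory containing config.yaml):

snakemake --snakefile workflow/Snakefile --profile ./profiles/uppmax_example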

@Dictionary2b
Author

Hi Cade,

Many thanks for your explanation!

I'm unsure whether using the --slurm option without the --profile option will work in this case. The cpus-per-task issue on the Slurm system still seems to persist; it is now partially solved by an edit in profiles/slurm/slurm_utils.py.

If I do still have to use the --profile option, can I modify profiles/slurm/config.yaml (or something in cluster_config.yml?) to submit certain rules to specific partitions?

@cademirch
Collaborator

Okay, sorry, I didn't realize that issue as well. The shell script you sent above looks like this:

❯ cat run_pipeline_zongz1123.sh
#!/bin/bash
#SBATCH -A naiss2023-5-278
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 10-00:00:00
#SBATCH -J snpArcher
#SBATCH -e snpArcher_%A_%a.err # File to which STDERR will be written
#SBATCH -o snpArcher_%A_%a.out
#SBATCH --mail-type=all
#SBATCH --mail-user=dictionary2b@gmail.com
module load conda/latest

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
mamba activate snparcher
snakemake --snakefile workflow/Snakefile --profile ./profiles/slurm

You would edit the file .profiles/slurm to include this:

default-resources:
  slurm_partition: <Your partition name here> # default partition for all rules
  slurm_account: <Your slurm account> # if applicable on your cluster
set-resources:
  create_cov_bed:
    slurm_partition: <Your big partition name here> # overrides the default partition for this rule
  # ... you can specify partitions for other rules by following this pattern

Let me know if this makes sense and is helpful!

@Dictionary2b
Author

Thanks, Cade. .profiles/slurm is a directory containing both config.yaml and cluster_config.yml, so I can't find the right place to include the code exactly as you suggested. To my understanding, if I want to add a specific resource setting for certain jobs, I would need to add it to cluster_config.yml so that it looks like this:

__default__:
    partition: "snowy"
    time: 7-00:00:00
    partition: core
    ntasks: 8
    output: "logs/slurm/slurm-%j.out"
    account: naiss2023-5-278

create_cov_bed:
    partition: "snowy"
    time: 7-00:00:00
    partition: node
    nodes: 1
    ntasks: 8
    constraint: mem512GB
    output: "logs/slurm/slurm-%j.out"
    account: naiss2023-5-278

Do I understand you correctly? Sorry for the misunderstanding!

@cademirch
Collaborator

Ah, my apologies, I messed up what I posted. I believe what you have is correct. It's a bit confusing between the two main ways to run Slurm with Snakemake. I will look more into the issue you posted above as well, now that I have a Slurm cluster to test on.

@brian-arnold
Collaborator

Hello!
Just to follow up on this discussion, how is memory being determined for this rule? Is it modifiable and capable of being run with lower memory? Looking at the create_cov_bed rule, I don't see any resources section.

The discussion above could be a potential solution, but when we run this rule it tries to request a ton of memory (1,700 GB) that may not exist on any node of our computing cluster (see the error message below saying "Requested node configuration is not available").

If it's useful information, we're using low-depth human samples (~325) mapped to the hg38 genome, which is quite complete.

Sincerely,
Brian

[Wed Feb 7 13:22:58 2024]
rule create_cov_bed:
input: results/hg38/summary_stats/all_cov_sumstats.txt, results/hg38/callable_sites/all_samples.d4
output: results/hg38/callable_sites/past_and_turk_callable_sites_cov.bed
jobid: 1634
benchmark: benchmarks/hg38/covbed/past_and_turk_benchmark.txt
reason: Missing output files: results/hg38/callable_sites/past_and_turk_callable_sites_cov.bed
wildcards: refGenome=hg38, prefix=past_and_turk
resources: mem_mb=1743686, mem_mib=1662909, disk_mb=1743686, disk_mib=1662909, tmpdir=
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
File "/Genomics/ayroleslab2/emma/snpArcher/profiles/slurm/slurm-submit.py", line 59, in
print(slurm_utils.submit_job(jobscript, **sbatch_options))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Genomics/ayroleslab2/emma/snpArcher/profiles/slurm/slurm_utils.py", line 131, in submit_job
raise e
File "/Genomics/ayroleslab2/emma/snpArcher/profiles/slurm/slurm_utils.py", line 129, in submit_job
res = subprocess.check_output(["sbatch"] + optsbatch_options + [jobscript])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Genomics/argo/users/emmarg/.conda/envs/snparcher/lib/python3.12/subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Genomics/argo/users/emmarg/.conda/envs/snparcher/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--time=9000', '--nodes=1', '--mem=1743686', '--output=logs/slurm/slurm-%j.out', '--cpus-per-task=1', '/Genomics/ayroleslab2/emma/snpArcher/past_and_turk/.snakemake/tmp.lh6xpm8u/snakejob.create_cov_bed.1634.sh']' returned non-zero exit status 1.
Error submitting jobscript (exit code 1):

@tsackton
Contributor

tsackton commented Feb 8, 2024

We have seen this a number of times - the default memory specification seems to go off the rails for a reason we don't yet understand.

One solution is to define mem_mb = <some other reasonable number> directly in the Snakemake rule, in the resources section.
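
For example, a minimal sketch of what that could look like (not the actual rule body; the 64000 value is an arbitrary placeholder you would size to your data and to what your nodes can provide):

rule create_cov_bed:
    # ... existing input, output, and shell directives stay unchanged ...
    resources:
        mem_mb = 64000  # placeholder value; overrides the input-size-based default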

We are hoping to debug this but so far haven't tracked down the problem.

@cademirch
Collaborator

I think this may be happening since create_cov_bed is not defined in the resources yaml, so Snakemake comes up with a default:
https://github.com/snakemake/snakemake/blob/0998cc57cbd02c38d1a3bbf1662c8b23b7601e20/snakemake/resources.py#L11-L16
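
For reference, the defaults at that link amount to roughly the following (paraphrased; check the pinned Snakemake version for the exact expressions):

mem_mb = max(2 * input.size_mb, 1000)
disk_mb = max(2 * input.size_mb, 1000)

So with a multi-hundred-GB all_samples.d4 as input, the computed mem_mb ends up in the hundreds of thousands of MB, which matches the requests in the logs above. Defining an explicit mem_mb for create_cov_bed, either in the resources yaml or as a rule/profile override, takes precedence over this default.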

@Erythroxylum

Hello, this issue has caused failures in the qc module in 2 of 3 otherwise successful runs. I have defined

resources:
    mem_mb = 16000

in the workflow/modules/qc Snakefile before the 'run' or 'shell' directive in every rule, but the error and job failure persist. The Snakefile and err file are attached.

As you say, Cade, the first error is for create_cov_bed, which is not a rule in this Snakefile.
err336.txt
Snakefile.txt

@Dictionary2b
Author

Thanks, Cade. The workflow has now finished properly. Defining the specific resource allocation for each job in cluster_config.yml, as I did, was the solution. :)
