Memory allocation problem on a slurm system #148
Hi Zongzhuang, sorry for your issues with this. Could you provide your config, resource config, and the command line used to run snpArcher? The create_cov_bed step is pretty memory intensive, so that could be the cause of the error in your posted log. As for slow progress, that could be down to a number of things outside of snpArcher's control, such as HPC queue and resource limits. Please post those details above, and we can try to diagnose.
@cademirch Thanks for your suggestion! Here are the config, resources, and the bash script (.sh) I used to run snpArcher; they are all archived in this zip file. UPPMAX support suggested requesting a fat node partition job of at least 512 GB (-C 512 GB) instead of using the core partition as I did, which has at most 128 GB. Is this something I can change in the cluster-config file? Also, since there are only a few fat nodes on the cluster, is it possible to make the workflow submit only the memory-intensive jobs to the fat node partition and keep the previous settings for the others? I'm also unsure whether I need to kill the current process and restart it to apply the changes.
Hi Zonghuang, I took a look and your configs look OK to me. One thing I will suggest is using the […] As for submitting certain rules to specific partitions, this is possible; the docs above detail how. I would suggest creating a YAML profile in which you define which rules go to which partition. I'll provide an example here:
```yaml
slurm: True                 # same as `--slurm` on the command line
jobs: 1000                  # number of jobs to run concurrently
use-conda: True
# other wanted command line options can be set here
default-resources:
  slurm_partition: <your partition name here>   # default partition for all rules
  slurm_account: <your slurm account>           # if applicable on your cluster
set-resources:
  create_cov_bed:
    slurm_partition: <your big partition name here>  # overrides the default partition for this rule
  # ...you can specify partitions for other rules by following this pattern
```

Then when you run snpArcher you can do so with this profile. Let me know if this helps!
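To make the override semantics concrete, here is a small illustrative sketch (plain Python, with placeholder partition names `core` and `fat`, not anything from the actual cluster) of how `set-resources` takes precedence over `default-resources` for a given rule:

```python
# Hypothetical profile contents, mirrored as a Python dict for illustration only.
profile = {
    "slurm": True,
    "jobs": 1000,
    "use-conda": True,
    "default-resources": {"slurm_partition": "core", "slurm_account": "my_account"},
    "set-resources": {"create_cov_bed": {"slurm_partition": "fat"}},
}

def partition_for(rule):
    """Resolve which SLURM partition a rule would be submitted to."""
    override = profile["set-resources"].get(rule, {})
    return override.get("slurm_partition", profile["default-resources"]["slurm_partition"])

print(partition_for("create_cov_bed"))  # fat  (per-rule override wins)
print(partition_for("fastqc"))          # core (falls back to the default)
```

The point is that only the memory-hungry rule needs an entry under `set-resources`; everything else inherits the defaults, so the scarce fat nodes are only requested where needed.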
Hi Cade, many thanks for your explanation! I'm unsure whether using the […]. If it is the case that I still have to use the […].
Okay, sorry, I didn't realize that issue also. So the shell script you sent above looks like this: […]
You would edit the file […]
Let me know if this makes sense and is helpful!
Thanks, Cade.
Do I understand you correctly? Sorry for the misunderstanding!
Ah, my apologies, I messed up what I posted. I believe what you have is correct. It's a bit confusing between the two main ways to run SLURM with Snakemake. Now that I have a slurm cluster to test on, I will also look more into the issue you posted above.
Hello! The discussion above could be a potential solution, but when we run this rule it tries to request a huge amount of memory (1,700 GB) that may not exist on any node of our computing cluster (error message below says "Requested node configuration is not available"). If it's useful information, we're using low-depth human samples (~325) mapped to the hg38 genome, which is quite complete. Sincerely,
[Wed Feb 7 13:22:58 2024]
We have seen this a number of times: the default memory specification seems to go off the rails for a reason we don't yet understand. One workaround is to define `mem_mb` as some other reasonable number directly in the `resources` section of the Snakemake rule. We are hoping to debug this but so far haven't tracked down the problem.
I think this may be happening since […]
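The workaround described above (pinning `mem_mb` to a fixed, reasonable number in a rule's `resources`) can also be generalized by capping whatever estimate is computed. The sketch below is illustrative only: the cap, base, and per-MB scaling are hypothetical numbers, not snpArcher's actual defaults.

```python
# Illustrative sketch: derive mem_mb from input size with retry scaling,
# but never request more memory than the largest node can provide.
NODE_MAX_MB = 128_000  # hypothetical 128 GB node; set this to your cluster's limit

def mem_mb(input_size_mb, attempt=1, base_mb=4_000, mb_per_input_mb=2):
    """Estimate memory for a rule: linear in input size, scaled per retry, capped."""
    estimate = (base_mb + mb_per_input_mb * input_size_mb) * attempt
    return min(estimate, NODE_MAX_MB)

print(mem_mb(1_000))               # 6000: small input stays modest
print(mem_mb(500_000, attempt=2))  # 128000: a runaway estimate is clamped to the cap
```

In a Snakefile the same idea is usually written as a callable, e.g. `resources: mem_mb=lambda wildcards, input, attempt: ...`, so Snakemake re-evaluates the estimate on each retry while the cap keeps it schedulable.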
Hello, this issue has caused failures in the qc module in 2 of 3 otherwise successful runs. I have defined […]
in the workflow/modules/qc/Snakemake file before the `run` or `shell` directive in every rule, but the error and job failure persist. Snakefile and err file attached. As you say, Cade, the first error is for create_cov_bed, which is not a rule in this Snakefile.
Thanks, Cade. The workflow has now finished appropriately. Defining the specific resource allocation of each job in […]
Hello,
I was running the workflow for a large data set (over 800 samples) on a slurm platform (UPPMAX). I used the GATK approach with intervals. I got an error message like this:
```
rule create_cov_bed:
    input: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
    output: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed
    jobid: 2558
    benchmark: benchmarks/GCA_009792885.1/covbed/lark20231207_benchmark.txt
    reason: Missing output files: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed; Input files updated by another job: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
    wildcards: refGenome=GCA_009792885.1, prefix=lark20231207
    resources: mem_mb=448200, mem_mib=427437, disk_mb=448200, disk_mib=427437, tmpdir=

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
  File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/./profiles/slurm/slurm-submit.py", line 59, in <module>
    print(slurm_utils.submit_job(jobscript, **sbatch_options))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 131, in submit_job
    raise e
  File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 129, in submit_job
    res = subprocess.check_output(["sbatch"] + optsbatch_options + [jobscript])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--partition=core', '--time=7-00:00:00', '--ntasks=8', '--output=logs/slurm/slurm-%j.out', '--account=naiss2023-5-278', '--mem=448200', '/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/.snakemake/tmp.fi5cgg1q/snakejob.create_cov_bed.2558.sh']' returned non-zero exit status 1.
Error submitting jobscript (exit code 1):
Select jobs to execute...
```
There is not enough memory on the slurm system. I'm not sure where the issue lies: whether I didn't request enough memory in my settings, or whether the HPC simply cannot provide that much memory. Should I perhaps change the source code to ask for a larger memory allocation for the workflow from the beginning (e.g. more than 1000 nodes)?
Besides, the whole workflow is not running very fast either; the bam2vcf jobs report about 2% progress every 24 hours and will obviously exceed the time limit of the snakemake slurm jobs. Could you please give some suggestions on this? Thanks in advance.
Best,
Zongzhuang