Error with big dataset and slurm #198
Open · max-hence opened this issue May 29, 2024 · 4 comments

@max-hence

Dear snparcher developers,

I encounter errors when I run the pipeline on full-size datasets.
I get this kind of message during the bam2gvcf, gvcf2DB, DB2vcf or concat_gvcfs rules.

Error in rule DB2vcf:
    message: SLURM-job '12919779' failed, SLURM status is: 'FAILED'For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2590
    input: results/GCA_015227805.2/genomics_db_import/DB_L0224.tar, results/GCA_015227805.2/data/genome/GCA_015227805.2.fna, results/GCA_015227805.2/data/genome/GCA_015227805.2.fna.fai, results/GCA_015227805.2/data/genome/GCA_015227805.2.dict
    output: results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz, results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz.tbi
    log: logs/GCA_015227805.2/gatk_genotype_gvcfs/0224.txt, /scratch/mbrault/snpcalling/hrustica/.snakemake/slurm_logs/rule_DB2vcf/GCA_015227805.2_0224/12919779.log (check log file(s) for error details)
    conda-env: /scratch/mbrault/snpcalling/hrustica/.snakemake/conda/040e922e8494c7bc027131fb77bc2d6d_
    shell:
        
        tar -xf results/GCA_015227805.2/genomics_db_import/DB_L0224.tar
        gatk GenotypeGVCFs             --java-options '-Xmx180000m -Xms180000m'             -R results/GCA_015227805.2/data/genome/GCA_015227805.2.fna             --heterozygosity 0.005             --genomicsdb-shared-posixfs-optimizations true             -V gendb://results/GCA_015227805.2/genomics_db_import/DB_L0224             -O results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz             --tmp-dir <TBD> &> logs/GCA_015227805.2/gatk_genotype_gvcfs/0224.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 12919779

Trying to restart job 2590.

Here are the logs linked to this job:

I gave an example here with a detailed log, but most of the time, for the bam2gvcf or concat_gvcfs rules, the logs are empty and I can't find any clue to understand the error.

At first sight I would have said it's a memory error, because the job can restart and work the second or third time.
But sometimes the job fails even with a huge amount of memory.
And what worries me is that I recently tried on another cluster with a more recent SLURM version and, even though I get the same errors, after the 2nd or 3rd try the jobs end up being successful and the pipeline runs to the end.

My main question is then: is your pipeline set up for a specific version of SLURM?
Or do I need to tune the "minNmer", "num_gvcf_intervals" and "db_scatter_factor" parameters to improve the handling of big datasets?

Tell me if you need more information.

Thanks a lot !

Maxence Brault

@cademirch
Collaborator

Hi Maxence,

Thanks for opening such a detailed issue.

I think it is unlikely that SLURM versions are the culprit here. Given you got the workflow to work on a different cluster, I suspect it could be related to the tmpdir setting. You should check with your cluster admins/docs to see where they suggest writing temp files on your cluster.

It's also possible something is wrong with how resources are being specified. It seems like you're using mem_mb_per_cpu to specify memory; however, unless you've modified the workflow rules to use that resource, they might still be requesting just mem_mb.
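
As a rough sketch only (the exact rule names, values, and profile keys depend on your Snakemake version and setup, so treat everything below as placeholders rather than recommendations), keeping memory on the standard mem_mb resource in the profile would look roughly like this:

# Sketch of the resource section of a Snakemake SLURM profile (config.yaml),
# keeping memory on the standard mem_mb resource rather than mem_mb_per_cpu.
# All values here are placeholders.
default-resources:
  - mem_mb=8000            # per-job memory, in MB
  - runtime=720            # walltime, in minutes
set-resources:
  - DB2vcf:mem_mb=180000   # raise memory only for the heavy GATK rules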

@max-hence
Author

Hi Cade,

Thanks a lot for your quick answer. It's good to know better where the problem comes from. I'll ask the cluster admins if they have an explanation. However, I doubt it comes from the tmpdir, as my whole pipeline runs in a /scratch directory made to handle heavy temporary files.

Yes, on the cluster I'm using, the mem_mb argument produces this error:
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

I bypassed the error by replacing mem_mb with mem_mb_per_cpu, but it may have made things worse. It is also an error I don't get on another SLURM cluster.

Do you think setting minNmer, num_gvcf_intervals and db_scatter_factor could also improve the way memory is handled between jobs and cluster nodes? If so, do you have recommendations for good values?

Thanks again,

Maxence Brault

@tsackton
Contributor

The default temporary directory is not where the workflow runs; it is whatever your system settings are, which is probably /tmp on the compute node. I notice that you don't have anything set for the "bigtmp" option in your config. You might try setting this to snpArcher-tmp/ or something similar (note no leading slash, so it is created as a directory in the working directory you run the command from).
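
For example, assuming the option lives in the workflow's config YAML, something like this (the directory name is just an example):

# Hypothetical snpArcher config entry; snpArcher-tmp/ is just an example name.
# No leading slash, so the directory is created inside the working directory
# the workflow is launched from.
bigtmp: "snpArcher-tmp/"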

There might also be memory issues with using mem_mb_per_cpu. Can you share your slurm profile file? This is where these parameters would be set and might help us see if there are specific problems.

@max-hence
Author

Thank you for your help. I'll try setting the bigtmp option.

Here is the slurm profile file: slurm_profile.txt

Earlier, I had changed mem_mb to mem_mb_per_cpu and I think most of the errors were coming from there, but it seems that just adding --mem=<n>G to the main sbatch command solved my incompatibility problem between the mem_mb_per_cpu and mem_mb arguments...
