Error with big dataset and slurm #198
Open · max-hence opened this issue May 29, 2024 · 4 comments

@max-hence

Dear snparcher developers,

I encounter errors when I run the pipeline on full-size datasets.
I get this kind of message during the bam2gvcf, gvcf2DB, DB2vcf or concat_gvcfs rules.

Error in rule DB2vcf:
    message: SLURM-job '12919779' failed, SLURM status is: 'FAILED'For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2590
    input: results/GCA_015227805.2/genomics_db_import/DB_L0224.tar, results/GCA_015227805.2/data/genome/GCA_015227805.2.fna, results/GCA_015227805.2/data/genome/GCA_015227805.2.fna.fai, results/GCA_015227805.2/data/genome/GCA_015227805.2.dict
    output: results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz, results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz.tbi
    log: logs/GCA_015227805.2/gatk_genotype_gvcfs/0224.txt, /scratch/mbrault/snpcalling/hrustica/.snakemake/slurm_logs/rule_DB2vcf/GCA_015227805.2_0224/12919779.log (check log file(s) for error details)
    conda-env: /scratch/mbrault/snpcalling/hrustica/.snakemake/conda/040e922e8494c7bc027131fb77bc2d6d_
    shell:
        
        tar -xf results/GCA_015227805.2/genomics_db_import/DB_L0224.tar
        gatk GenotypeGVCFs             --java-options '-Xmx180000m -Xms180000m'             -R results/GCA_015227805.2/data/genome/GCA_015227805.2.fna             --heterozygosity 0.005             --genomicsdb-shared-posixfs-optimizations true             -V gendb://results/GCA_015227805.2/genomics_db_import/DB_L0224             -O results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz             --tmp-dir <TBD> &> logs/GCA_015227805.2/gatk_genotype_gvcfs/0224.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 12919779

Trying to restart job 2590.

Here are the logs linked to this job:

I gave an example here with a detailed log, but most of the time, for the bam2gvcf or concat_gvcfs rules, the logs are empty and I can't find any clue to understand the error.

At first sight I would have said it's a memory error, because the job can restart and work the second or third time.
But sometimes the job fails even with a huge amount of memory.
And what worries me is that I recently tried on another cluster with a more recent SLURM version and, even though I get the same errors, after the 2nd or 3rd try the jobs end up being successful and the pipeline runs to the end.

My main question is then: is your pipeline set up for a specific version of SLURM?
Or do I need to tune the "minNmer", "num_gvcf_intervals" and "db_scatter_factor" parameters to improve the handling of big datasets?

Tell me if you need more information.

Thanks a lot !

Maxence Brault

@cademirch
Collaborator

Hi Maxence,

Thanks for opening such a detailed issue.

I think it is unlikely that SLURM versions are the culprit here. Given you got the workflow to work on a different cluster, I suspect it could be related to the tmpdir setting. You should check with your cluster admins/docs to see where they suggest writing temp files on your cluster.

It's also possible something is wrong with how resources are being specified. It seems like you're using mem_mb_per_cpu to specify memory; however, unless you've modified the workflow rules to use that resource, they might still be requesting just mem_mb.
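
As a rough sketch only (the exact rule names, values, and profile keys depend on your Snakemake version and setup, so treat everything below as placeholders rather than recommendations), keeping memory on the standard mem_mb resource in the profile would look roughly like this:

# Sketch of the resource section of a Snakemake SLURM profile (config.yaml),
# keeping memory on the standard mem_mb resource rather than mem_mb_per_cpu.
# All values here are placeholders.
default-resources:
  - mem_mb=8000            # per-job memory, in MB
  - runtime=720            # walltime, in minutes
set-resources:
  - DB2vcf:mem_mb=180000   # raise memory only for the heavy GATK rules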

@max-hence
Author

Hi Cade,

Thanks a lot for your quick answer. It's good to know better where the problem comes from. I'll ask the cluster admins if they have an explanation. However, I doubt it comes from the tmpdir, as my whole pipeline runs in a /scratch directory made to handle heavy temporary files.

Yes, on the cluster I'm using, the mem_mb argument produces this error:
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

I bypassed the error by replacing mem_mb with mem_mb_per_cpu, but it may have made things worse. It is also an error I don't get on another SLURM cluster.

Do you think setting minNmer, num_gvcf_intervals and db_scatter_factor could also improve the way memory is handled between jobs and cluster nodes? If so, do you have recommendations for good values?

Thanks again,

Maxence Brault

@tsackton
Contributor

The default temporary directory is not where the workflow runs; it is whatever your system settings are, which is probably /tmp on the compute node. I notice that you don't have anything set for the "bigtmp" option in your config. You might try setting this to snpArcher-tmp/ or something similar (note no leading slash, so it is created as a directory in the working directory you run the command from).
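
For example, assuming the option lives in the workflow's config YAML, something like this (the directory name is just an example):

# Hypothetical snpArcher config entry; snpArcher-tmp/ is just an example name.
# No leading slash, so the directory is created inside the working directory
# the workflow is launched from.
bigtmp: "snpArcher-tmp/"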

There might also be memory issues with using mem_mb_per_cpu. Can you share your slurm profile file? This is where these parameters would be set and might help us see if there are specific problems.

@max-hence
Author

Thank you for your help. I'll try setting the bigtmp option.

Here is the slurm profile file: slurm_profile.txt

Earlier, I had changed mem_mb to mem_mb_per_cpu and I think most of the errors were coming from there, but it seems that just adding --mem=<n>G to the main sbatch command solved my incompatibility problem between the mem_mb_per_cpu and mem_mb arguments...
