jobs failing with sigbus and unknown userid errors #377

Closed
mniederhuber opened this issue Jan 24, 2024 · 1 comment
Labels
bug Something isn't working

mniederhuber commented Jan 24, 2024

Description of the bug

This may be two separate issues, but I'm having jobs fail either with a Java Runtime Environment SIGBUS (0x7) error or with FATAL: Couldn't determine user account information: user: unknown userid [some number].

In both cases, the process that fails and the file associated with the error change with each rerun.
On subsequent reruns, the process and sample that failed previously complete, but a new process and sample error out. The pipeline can eventually be completed after multiple reruns.
The pipeline runs fine with the test profile.

From what I can figure out, the Java Runtime Environment error has something to do with the JVM running out of memory, but I'm stumped on the userid error.
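If it really is JVM memory pressure, I assume the workaround would be to raise the memory request for the failing process via the custom config, something like the sketch below (the BOWTIE2_ALIGN selector is only an example; it would be whichever process actually fails):

// hypothetical process-level override in config/slurm.config --
// 'BOWTIE2_ALIGN' is just an example selector, not necessarily the process that fails
process {
    withName: '.*:BOWTIE2_ALIGN' {
        memory = 32.GB   // larger memory request for the failing task
    }
}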

I have made a small modification to the modules.config file to add the --nomodel param to MACS, but I am not seeing any errors from MACS, and again, the test profile runs as expected.
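For reference, the modules.config change is roughly the sketch below (quoting from memory; the process name is assumed to be MACS2_CALLPEAK in this release, and the existing ext.args contents may differ):

// rough sketch of the modules.config edit -- process name and existing
// ext.args values may not match the actual file exactly
process {
    withName: 'MACS2_CALLPEAK' {
        ext.args = '--nomodel'
    }
}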

Any help would be greatly appreciated!

Command used and terminal output

#!/bin/bash
#SBATCH --mem=8G
#SBATCH -t 8:00:00
#SBATCH -n 1
#SBATCH -o var/log/chip-%j.out
#SBATCH -e var/log/chip-%j.err

module load nextflow

nextflow -log var/log/.chipseq run nf-core/chipseq \
	-profile singularity \
	-c config/slurm.config \
	-resume \
	-params-file config/chip_params.yaml

Relevant files

bug.zip

System information

Nextflow: 23.04.2
Hardware: HPC
Executor: slurm
Container: Singularity
OS: RHEL8
nf-core/chipseq: 2.0.0

mniederhuber added the bug label Jan 24, 2024
mniederhuber (Author) commented

It turns out this is almost certainly an issue on the HPC side: Slurm admin settings were redirecting jobs submitted to one partition to a different partition.
By default, Nextflow polls job status only within the partition the job was submitted to, so when it can't find the job there, things get messed up.
We were able to fix this by adding the following to our config file:

executor {
    name = "slurm"
    queueGlobalStatus = true
}

This tells Nextflow to poll for job status globally rather than only within the submitted partition.
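For anyone hitting the same thing, the relevant parts of our config/slurm.config now look roughly like the sketch below (the partition name is just an example, not our actual queue):

// sketch of config/slurm.config with the workaround in place --
// 'general' is an example partition name
process {
    executor = 'slurm'
    queue    = 'general'
}

executor {
    name              = 'slurm'
    queueGlobalStatus = true   // query job status across all partitions
}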

It would be great if this were either the default behavior in Nextflow, or if there were a more informative error message when a job has been redirected.
