Grid executor reports invalid queue status #1045
Comments
Looking at the log, it looks like NF cannot retrieve the job exit status.
This may happen if the shared file system is not consistent, e.g. because it uses an overly aggressive caching policy.
When I go to look, that file is in fact there, though obviously it wasn't at run time. So there's some sort of delay. Is there a workaround? Is exitStatusReadTimeoutMillis it?
Indeed, it is meant for that, but it looks like you have already set it to more than 8 min, so I'm not sure that increasing it will help. You can try .
Okay, thanks. I have set it to stupidly long timescales, will see what happens.
@pditommaso I have also hit this issue multiple times recently, using SGE. The job was still running on SGE, but NF executed qstat and fetched an empty result, so Nextflow assumed the job had finished and tried to read its exit code. Since the exitcode file did not exist, NF threw a ProcessException and called resumeOrDie, if we specify
@Crabime If
@pditommaso I think there might be some problem,
I may have spotted a problem related to the
@pinin4fjords @Crabime Does the uploaded patch solve the empty queue status problem?
Sorry @pditommaso I've been away. The issue occurred non-deterministically for me, so it might take a while to crop up again anyway, but I'm running with the patch now so I'll let you know.
@pditommaso So far I've not seen this error again.
Excellent.
@pditommaso we were just checking in to see if this has been reported, awesome to see that it appears to be addressed! We'll check this out as well.
@pditommaso should we grab the
Not completely sure it's solved; it was a bit random for me anyway. If you just run with NXF_VER as above, NF should download the update itself.
@cjfields snapshots are not uploaded as GH releases. You can use it as mentioned above. @pinin4fjords include the log file if it still occurs.
@pditommaso thx, got snapshot mixed up w/ 'edge'
I'll upload the March edge release next week, including this patch.
In a nutshell, the grid executor uses a process builder to fetch the queue status using bjobs or a similar command. Since the output is read by a separate thread, after the process terminates it is necessary to wait for the output consumer thread as well, otherwise the fetched output can be incomplete.
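The race described above can be sketched in plain Java. This is a minimal illustration, not Nextflow's actual code: the QueueStatusFetcher class is hypothetical, and echo stands in for bjobs/qstat.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class QueueStatusFetcher {

    public static String fetch(String... command) throws Exception {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        // Consume stdout on a separate thread so the child process
        // cannot block on a full pipe buffer.
        Thread consumer = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null)
                    output.append(line).append('\n');
            } catch (Exception e) {
                // stream closed on process exit; nothing to do
            }
        });
        consumer.start();

        int exit = process.waitFor();
        // The fix: also wait for the consumer thread. Without this join,
        // the caller may read `output` before the reader thread has
        // drained the pipe, and see a truncated (e.g. empty) status.
        consumer.join();

        if (exit != 0)
            throw new IllegalStateException("command failed with exit code " + exit);
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // `echo` is a stand-in for `bjobs`/`qstat` in this sketch
        System.out.print(fetch("echo", "JOBID STATUS"));
    }
}
```

Without the `consumer.join()`, the code is racy: `waitFor()` only guarantees the child has exited, not that its buffered output has been fully read.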
We have seen this error again in v19.07.0 on LSF. Here is the log:
Hi there, are there any updates on this? I am working with SLURM (Nextflow version 22.04.5.5708) and I am getting the same issue. The job is finished but the exitcode file has not been generated yet, for an unknown reason. Nextflow assigns exit status '-' to the job and decides it is an error(?). Also, if that job is a candidate for 'retry', it will be resubmitted and I will end up with the same file twice in the tracing system, causing a file collision error downstream.
Here, I understand that it has submitted the first job (jobId: 14658419) and is waiting for it to finish (status is RUNNING), but then at some point it retries the job (jobId: 14658420) even though the first one is still running. However, by the time this second job finishes, the first one has also finished (see last line), which causes the downstream error below. I also added the channel dumps, where the file appears to be replicated (likely because it was submitted and finished twice). Downstream error:
The channel dump for one of the duplicated files looks like this:
For now, the only workaround I have found is to increase the exitReadTimeout setting and remove the 'retry'. But this is not optimal when submitting lots of samples, i.e. it limits Nextflow's scalability, because the retry is quite handy when weird errors happen in the cluster. Any help would be appreciated. Thanks.
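For reference, this workaround can be expressed in nextflow.config roughly as follows. The timeout value is an arbitrary example (the documented default for executor.exitReadTimeout is 270 sec), and the retry settings are shown commented out to match the workaround described above:

```groovy
// nextflow.config -- example values only
executor {
    // wait longer for the .exitcode file to appear on the shared
    // file system before declaring the job failed
    exitReadTimeout = '30 min'
}

process {
    // retry disabled as a workaround, since a retried job can race
    // the original and produce duplicate outputs
    // errorStrategy = 'retry'
    // maxRetries = 2
}
```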
Hi, @pditommaso We are running into this same issue on our side. Was wondering if the snapshot was ever merged into the release branch. Thank you.
This was solved by 5229e93. If you are still experiencing a problem, please open a new issue, including the nextflow log file.
Bug report
Expected behavior and actual behavior
Expect the executor to wait until the job completes and report success or failure. It should not later find completed output and an exit status of 0 for 'failed' jobs.
What seems to happen with some long-running jobs is that the LSF executor decides the job has failed. When I go to the job directory (some time later) I then find the output is there and the exit code file says '0'.
I'm unsure if increasing exitStatusReadTimeoutMillis will help. Is this the same issue as #927?
Steps to reproduce the problem
Difficult to say; it's a fairly non-deterministic error, and I haven't been able to reproduce it with simple 'sleep' tests. But jobs of more than a couple of minutes seem to trigger the issue.
Program output
The erroring part of .nextflow.log is:
Full log to be attached below.
Environment
Additional context
nextflow.log