Grid executor reports invalid queue status #1045
Comments
Looking at the log, it looks like NF cannot retrieve the job exit status.
This may happen if the shared file system is not consistent, e.g. because it uses an overly aggressive caching policy.
When I go to look, that file is in fact there, though obviously it wasn't at run time. So there's some sort of delay. Is there a workaround? Is exitStatusReadTimeoutMillis it?
Indeed, it is meant for that, but it looks like you have already set it to more than 8 min, so I'm not sure that increasing it will help. You can try .
Okay, thanks. I have set it to stupidly long timescales, will see what happens.
@pditommaso I have also hit this issue multiple times recently, using SGE. The job was still running on SGE, but NF executed qstat and fetched an empty result, so Nextflow assumed the job had finished and tried to read its exit code. Since the exitcode file did not exist, NF threw a ProcessException and called resumeOrDie, if we specify
@Crabime If
@pditommaso I think there might be some problem,
I may have spotted a problem related to the
@pinin4fjords @Crabime Does the uploaded patch solve the empty queue status problem?
Sorry @pditommaso I've been away. The issue occurred non-deterministically for me, so it might take a while to crop up again anyway, but I'm running with the patch now so I'll let you know.
@pditommaso So far I've not seen this error again.
Excellent.
@pditommaso we were just checking in to see if this has been reported, awesome to see that it appears to be addressed! We'll check this out as well.
@pditommaso should we grab the
Not completely sure it's solved; it was a bit random for me anyway. If you just run with NXF_VER as above, NF should download the update itself.
@cjfields snapshots are not uploaded as GH releases. You can use it as mentioned above. @pinin4fjords include the log file if it still occurs.
@pditommaso thx, got snapshot mixed up w/ 'edge'
I'll upload the March edge release next week, including this patch.
In a nutshell, the grid executor uses a process builder to fetch the queue status using bjobs or a similar command. Since the output is read by a separate thread, after the process terminates it is necessary to wait for the output consumer thread as well, otherwise the fetched output can be incomplete.
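The race described above can be sketched in plain Java. This is a minimal illustration, not Nextflow's actual code: the QueueStatusFetcher class is hypothetical, and echo stands in for bjobs/qstat.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class QueueStatusFetcher {

    public static String fetch(String... command) throws Exception {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        // Consume stdout on a separate thread so the child process
        // cannot block on a full pipe buffer.
        Thread consumer = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null)
                    output.append(line).append('\n');
            } catch (Exception e) {
                // stream closed on process exit; nothing to do
            }
        });
        consumer.start();

        int exit = process.waitFor();
        // The fix: also wait for the consumer thread. Without this join,
        // the caller may read `output` before the reader thread has
        // drained the pipe, and see a truncated (e.g. empty) status.
        consumer.join();

        if (exit != 0)
            throw new IllegalStateException("command failed with exit code " + exit);
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // `echo` is a stand-in for `bjobs`/`qstat` in this sketch
        System.out.print(fetch("echo", "JOBID STATUS"));
    }
}
```

Without the `consumer.join()`, the code is racy: `waitFor()` only guarantees the child has exited, not that its buffered output has been fully read.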
We have seen this error again in v19.07.0 on LSF. Here is the log:
Hi there, are there any updates on this? I am working with SLURM (Nextflow version 22.04.5.5708) and I am getting the same issue. The job is finished but the exitcode file has not been generated yet, for an unknown reason. Nextflow assigns exit status '-' to the job and decides it is an error(?). Also, if that job is a candidate for 'retry', it will be resubmitted and I will end up with the same file twice in the tracing system, causing a file collision error downstream.
Here, I understand that it has submitted the first job (jobId: 14658419) and is waiting for it to finish (status is RUNNING), but then at some point it retries the job (jobId: 14658420) even though the first one is still running. However, by the time this second job finishes, the first one has also finished (see last line), which causes the downstream error below. I also added the channel dumps, where the file appears to be replicated (likely because it was submitted and finished twice). Downstream error:
The channel dump for one of the duplicated files looks like this:
For now, the only workaround I have found is to increase the exitReadTimeout setting and remove the 'retry'. But this is not optimal when submitting lots of samples, i.e. it limits Nextflow's scalability, because the retry is quite handy when weird errors happen in the cluster. Any help would be appreciated. Thanks.
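For reference, this workaround can be expressed in nextflow.config roughly as follows. The timeout value is an arbitrary example (the documented default for executor.exitReadTimeout is 270 sec), and the retry settings are shown commented out to match the workaround described above:

```groovy
// nextflow.config -- example values only
executor {
    // wait longer for the .exitcode file to appear on the shared
    // file system before declaring the job failed
    exitReadTimeout = '30 min'
}

process {
    // retry disabled as a workaround, since a retried job can race
    // the original and produce duplicate outputs
    // errorStrategy = 'retry'
    // maxRetries = 2
}
```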
Hi, @pditommaso We are running into this same issue on our side. Was wondering if the snapshot was ever merged into the release branch. Thank you.
This was solved by 5229e93. If you are still experiencing a problem, please open a new issue, including the nextflow log file.
Bug report
Expected behavior and actual behavior
Expect the executor to wait until the job completes and report success or failure. It should not later find completed output and an exit status of 0 for 'failed' jobs.
What seems to happen with some long-running jobs is that the LSF executor decides the job has failed. When I go to the job directory (some time later) I then find the output is there and the exit code file says '0'.
I'm unsure if increasing exitStatusReadTimeoutMillis will help. Is this the same issue as #927?
Steps to reproduce the problem
Difficult to say; it's a fairly non-deterministic error, and I haven't been able to reproduce it with simple 'sleep' tests. But jobs of more than a couple of minutes seem to trigger the issue.
Program output
The erroring part of .nextflow.log is:
Full log to be attached below.
Environment
Additional context
nextflow.log