Grid executor reports invalid queue status #1045

Closed
pinin4fjords opened this issue Feb 24, 2019 · 27 comments

@pinin4fjords (Contributor)
Bug report

Expected behavior and actual behavior

I expect the executor to wait until the job completes and then report success or failure. I should not later find completed output and an exit status of 0 for 'failed' jobs.

What seems to happen with some long-running jobs is that the LSF executor decides the job has failed. When I go to the job directory (some time later) I then find the output is there and the exit code file says '0'.

I'm unsure if increasing exitStatusReadTimeoutMillis will help. Is this the same issue as #927?
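(For reference, the timeout that shows up in the log as exitStatusReadTimeoutMillis is controlled from the executor scope of nextflow.config. A minimal sketch, assuming the standard executor.exitReadTimeout option and a purely illustrative value; whether raising it actually helps here is exactly the open question:)

```groovy
// nextflow.config -- example value only
executor {
    // How long Nextflow keeps polling for the .exitcode file after the job
    // disappears from the queue (surfaced in the log as exitStatusReadTimeoutMillis).
    exitReadTimeout = '900 sec'
}
```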

Steps to reproduce the problem

Difficult to say; it's a fairly non-deterministic error, and I haven't been able to reproduce it with simple 'sleep' tests. But jobs of more than a couple of minutes seem to trigger the issue.

Program output

The failing section of .nextflow.log is:

...
Feb-23 23:13:20.369 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 3137621; id: 1321; name: download_fastqs (1319); status: RUNNING; exit: -; error: -; workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f started: 1550962950317; exited: -; ] -- exitStatusReadTimeoutMillis: 500000; delta: 500045
Current queue status:
>   (empty)

Content of workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f
> total 32
> drwxrwxr-x 2 <user> <group> 4096 Feb 23 23:02 .
> drwxr-xr-x 6 <user> <group> 4096 Feb 23 23:02 ..
> -rw-r--r-- 1 <user> <group>    0 Feb 23 23:02 .command.begin
> -rw-r--r-- 1 <user> <group>    0 Feb 23 23:02 .command.err
> -rw-r--r-- 1 <user> <group>  118 Feb 23 23:03 .command.log
> -rw-r--r-- 1 <user> <group>  118 Feb 23 23:03 .command.out
> -rw-r--r-- 1 <user> <group> 2183 Feb 23 23:02 .command.run
> -rw-r--r-- 1 <user> <group>  197 Feb 23 23:02 .command.sh
> -rw-r--r-- 1 <user> <group> 3548 Feb 23 23:02 .command.stub
> -rw-r--r-- 1 <user> <group>  158 Feb 23 23:13 .command.trace

Feb-23 23:13:20.370 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 3137621; id: 1321; name: download_fastqs (1319); status: COMPLETED; exit: -; error: -; workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f started: 1550962950317; exited: -; ]
Feb-23 23:13:20.430 [Task monitor] INFO  nextflow.processor.TaskProcessor - [72/1f67ab] NOTE: Process `download_fastqs (1319)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
...

Full log to be attached below.

Environment

  • Nextflow version: 19.01.0.5050
  • Java version: openjdk 10.0.2 2018-07-17
  • Operating system: Linux

Additional context

nextflow.log

@pditommaso (Member)

Looking at the log, it looks like NF cannot retrieve the job's .exitcode file.

Feb-23 23:13:20.369 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 3137621; id: 1321; name: download_fastqs (1319); status: RUNNING; exit: -; error: -; workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f started: 1550962950317; exited: -; ] -- exitStatusReadTimeoutMillis: 500000; delta: 500045
Current queue status:
>   (empty)

Content of workDir: <path>/work/E-MTAB-5530/danio_rerio/smartseq/72/1f67ab72a847f8e5191d78d5a5ec3f
> total 32
> drwxrwxr-x 2 <user> <group> 4096 Feb 23 23:02 .
> drwxr-xr-x 6 <user> <group> 4096 Feb 23 23:02 ..
> -rw-r--r-- 1 <user> <group>    0 Feb 23 23:02 .command.begin
> -rw-r--r-- 1 <user> <group>    0 Feb 23 23:02 .command.err
> -rw-r--r-- 1 <user> <group>  118 Feb 23 23:03 .command.log
> -rw-r--r-- 1 <user> <group>  118 Feb 23 23:03 .command.out
> -rw-r--r-- 1 <user> <group> 2183 Feb 23 23:02 .command.run
> -rw-r--r-- 1 <user> <group>  197 Feb 23 23:02 .command.sh
> -rw-r--r-- 1 <user> <group> 3548 Feb 23 23:02 .command.stub
> -rw-r--r-- 1 <user> <group>  158 Feb 23 23:13 .command.trace

This may happen if the shared file system is not consistent, for example because it uses an overly aggressive caching policy.

@pinin4fjords (Contributor, Author)

When I go and look, that file is in fact there, though obviously it wasn't at run time. So there's some sort of delay. Is there a workaround? Is exitStatusReadTimeoutMillis it?

@pditommaso (Member)

Indeed, it is meant for that, but it looks like you have already set it to > 8 min; I'm not sure that increasing it will help. You can try.

@pinin4fjords (Contributor, Author)

Okay, thanks. I have set it to stupidly long timescales, will see what happens.

@Crabime commented Feb 26, 2019

@pditommaso I have also run into this issue multiple times recently, using SGE. The job was still running on SGE, but NF executed qstat and got an empty result, so Nextflow assumed the job had finished and tried to read its exit code; since the .exitcode file did not exist, NF threw a ProcessFailedException and called resumeOrDie. If errorStrategy 'retry' is specified, NF resubmits the same job. Here is the snapshot:

[qstat screenshot]

Here are the error log snippets:

Feb-26 13:08:05.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] getting queue status > cmd: qstat
Feb-26 13:08:05.781 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > cmd exit: 0
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  55366 0.50000 nf-QC_(lib chenqi       r     02/26/2019 09:22:57 first.q@bio03.wutbiolab.com        1        

Feb-26 13:08:05.784 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status >
  job:  55366: RUNNING

Feb-26 13:08:05.785 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:10.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2706)
Feb-26 13:08:10.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:10.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:10.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:15.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2707)
Feb-26 13:08:15.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:15.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:15.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:20.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2708)
Feb-26 13:08:20.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:20.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:20.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:25.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2709)
Feb-26 13:08:25.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:25.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:25.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:30.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2710)
Feb-26 13:08:30.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:30.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:30.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:35.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2711)
Feb-26 13:08:35.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:35.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:35.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:40.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2712)
Feb-26 13:08:40.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:40.725 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:40.726 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:45.722 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2713)
Feb-26 13:08:45.725 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:45.726 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:45.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:50.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2714)
Feb-26 13:08:50.726 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:50.727 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:50.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:08:55.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2715)
Feb-26 13:08:55.727 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:08:55.727 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:08:55.728 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:09:00.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2716)
Feb-26 13:09:00.726 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:00.727 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:09:00.727 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > jobId `55366` active status: true
Feb-26 13:09:05.723 [Task monitor] TRACE n.processor.TaskPollingMonitor - Scheduler queue size: 1 (iteration: 2717)
Feb-26 13:09:05.728 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:05.729 [Task monitor] TRACE nextflow.executor.GridTaskHandler - JobId `55366` exit file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode - lastModified: null - size: null
Feb-26 13:09:05.731 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] getting queue status > cmd: qstat
Feb-26 13:09:05.785 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > cmd exit: 0

Feb-26 13:09:05.787 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status >
  (empty)
Feb-26 13:09:05.789 [Task monitor] TRACE n.executor.AbstractGridExecutor - [SGE] queue status > map does not contain jobId: `55366`
Feb-26 13:09:05.791 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:05.792 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exist status for process TaskHandler[jobId: 55366; id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); status: RUNNING; exit: -; error: -; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972 started: 1551144180132; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 5725168
Current queue status:
>   (null)

Content of workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972
null
Feb-26 13:09:05.797 [Task monitor] TRACE g.r.m.CustomMetaClassCreationHandle - Registering custom meta-class for: class java.io.FileReader
Feb-26 13:09:05.850 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - clear complete job 55366 result : true
Feb-26 13:09:05.851 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972/.exitcode
Feb-26 13:09:05.852 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 55366; id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); status: COMPLETED; exit: -; error: -; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972 started: 1551144180132; exited: -; ]
Feb-26 13:09:05.853 [Task monitor] TRACE nextflow.processor.TaskProcessor - finalizing process > QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003) -- id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); type: SCRIPTLET; exit: -; error: null; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972
Feb-26 13:09:05.880 [Task monitor] TRACE nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: id: 1; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); type: SCRIPTLET; exit: -; error: null; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/fc/64e8091dafa56c400b141093f58972
  error [nextflow.exception.ProcessFailedException]: Process `QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)` terminated for an unknown reason -- Likely it has been terminated by the external system
Feb-26 13:09:05.924 [Task monitor] INFO  nextflow.processor.TaskProcessor - [fc/64e809] NOTE: Process `QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
Feb-26 13:09:05.928 [Task monitor] TRACE nextflow.processor.TaskHandler - task stats: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); start: 1551144180132; complete: 1551157745794; realtime: 13565662 [13565662]; 
Feb-26 13:09:05.972 [pool-2-thread-1] TRACE nextflow.processor.TaskProcessor - [QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)] Cacheable folder=/mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a -- exists=false; try=2; shouldTryCache=false
Feb-26 13:09:05.973 [pool-2-thread-1] TRACE nextflow.processor.TaskProcessor - [QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003)] actual run folder: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a
Feb-26 13:09:05.974 [pool-2-thread-1] TRACE nextflow.processor.TaskDispatcher - Scheduling process: id: 2; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); type: SCRIPTLET; exit: -; error: null; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a
Feb-26 13:09:05.976 [pool-2-thread-1] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a/.exitcode
Feb-26 13:09:05.977 [pool-2-thread-1] TRACE n.processor.TaskPollingMonitor - Scheduled task > TaskHandler[jobId: null; id: 2; name: QC (lib-S033_gooalgene-A_MU-2_BHVMV2CCXY_S1_L003); status: NEW; exit: -; error: -; workDir: /mnt/beegfs/chenqi/project/marco_fish/02survey/work/e9/17d9bf496fc81c269f6546e1f32f5a started: -; exited: -; ]
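To make the behaviour in the log above easier to follow, here is a minimal, self-contained sketch of the decision being described (illustrative only, not the actual Nextflow source; names such as pollTask and Outcome are hypothetical): while qstat still lists the job the task stays RUNNING, and once the job vanishes from the queue the .exitcode file has to appear within the exit-read timeout, otherwise the task is declared terminated for an unknown reason.

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical names, for illustration only.
enum Outcome { RUNNING, COMPLETED, UNKNOWN_TERMINATION }

Outcome pollTask(boolean listedByQueue, Path exitFile, long waitedMillis, long exitReadTimeoutMillis) {
    if( listedByQueue )
        return Outcome.RUNNING                // qstat/bjobs still reports the job
    if( Files.exists(exitFile) )
        return Outcome.COMPLETED              // .exitcode is visible: read the real exit status
    if( waitedMillis < exitReadTimeoutMillis )
        return Outcome.RUNNING                // grace period: keep waiting for the file
    return Outcome.UNKNOWN_TERMINATION        // logged as "terminated for an unknown reason"
}

// An empty queue status plus a missing .exitcode past the timeout produces the spurious failure.
assert pollTask(false, Paths.get('/tmp/missing/.exitcode'), 5_725_168, 270_000) == Outcome.UNKNOWN_TERMINATION
```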

@pditommaso (Member)

@Crabime If qstat does not return the job status, NF assumes the job has terminated and therefore kills it after a while.

@Crabime commented Feb 27, 2019

@pditommaso I think there might be a problem here: qstat really did return output, but ProcessBuilder returned an empty result. Maybe it's a bug related to ProcessBuilder.

@pditommaso (Member)

I may have spotted a problem related to the ProcessBuilder. Please try the snapshot I've just uploaded, using the command below:

NXF_VER=19.03.0-SNAPSHOT nextflow run .. [your options]


@pditommaso (Member)

@pinin4fjords @Crabime Does the uploaded patch solve the empty queue status problem?

@pinin4fjords (Contributor, Author)

Sorry @pditommaso I've been away.

The issue occurred non-deterministically for me, so it might take a while to crop up again anyway, but I'm running with the patch now so I'll let you know.

@pinin4fjords (Contributor, Author)

@pditommaso So far I've not seen this error again.

@pditommaso (Member)

Excellent.

@cjfields commented Mar 6, 2019

@pditommaso we were just checking in to see whether this had been reported; awesome to see that it appears to be addressed! We'll check this out as well.

@cjfields commented Mar 6, 2019

@pditommaso should we grab the 19.03.0-SNAPSHOT snapshot from the GitHub master branch? I'm not seeing a release tagged: https://github.com/nextflow-io/nextflow/releases

@pinin4fjords (Contributor, Author)

I'm not completely sure it's solved; it was a bit random for me anyway.

If you just run with NXF_VER as above, NF should download the update itself.

@pditommaso (Member)

@cjfields snapshots are not uploaded as GH releases. You can use it as mentioned above.

@pinin4fjords include the log file if it still occurs.

@cjfields commented Mar 7, 2019

@pditommaso thx, got snapshot mixed up w/ 'edge'

@pditommaso (Member)

I'll upload the March edge release next week, including this patch.

@pditommaso pditommaso added this to the v19.04.0 milestone Mar 7, 2019
@pditommaso pditommaso added the bug label Mar 13, 2019
@pditommaso pditommaso changed the title from 'LSF executor - incorrect failure with "Failed to get exit status for process" and "terminated for an unknown reason"' to 'Grid executor reports invalid queue status' on Mar 13, 2019
pditommaso added a commit that referenced this issue Mar 13, 2019
In a nutshell, the grid executor uses a process builder to fetch the queue status using bjobs or a similar command. Since the output is read by a separate thread, after the process terminates it is necessary to wait for the output consumer thread, otherwise the fetched output can be incomplete.
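The pattern described in the commit message looks roughly like the following sketch (illustrative only, not the actual patch; the fetchQueueStatus name is made up): stdout of the status command is drained on a separate thread, and after waitFor() returns, that consumer thread is joined before the output is parsed, so a fast-exiting bjobs/qstat can no longer yield a truncated or empty result.

```groovy
// Minimal sketch of "wait for the output consumer thread"; details simplified.
def fetchQueueStatus(List<String> command) {
    def process = new ProcessBuilder(command).redirectErrorStream(true).start()
    def output = new StringBuilder()

    // Drain stdout on a separate thread so the child cannot block on a full pipe.
    def consumer = Thread.start {
        process.inputStream.eachLine { line -> output.append(line).append('\n') }
    }

    int exitCode = process.waitFor()
    // The crucial step: join the consumer before using the output, otherwise
    // the status text may still be only partially read, or even empty.
    consumer.join()

    return [exitCode, output.toString()]
}

def (code, status) = fetchQueueStatus(['qstat'])
println "qstat exit=$code\n$status"
```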
@gongyixiao

We have seen this error again in v19.07.0 on LSF.

Here are the .nextflow.log and one set of .command.* and .exitcode files for a failed job.

nextflow.log
command.log
exitcode.txt

@RaqManzano

Hi there,

Are there any updates on this? I am working with SLURM (Nextflow version 22.04.5.5708) and I am getting the same issue. The job has finished, but the .exitcode file has not been generated yet for an unknown reason. Nextflow assigns exit status '-' to the job and decides it is an error(?). Also, if that job is a candidate for 'retry', it will be resubmitted and I end up with the same file twice in the tracing system, causing a file collision error downstream.
See an extract of the log here:

Feb-20 20:29:52.779 [Task submitter] INFO  nextflow.Session - [3a/448801] Submitted process > MUTECT2 (TUMOUR_vs_NORMAL)
~> TaskHandler[jobId: 14658419; id: 443; name: MUTECT2 (TUMOUR_vs_NORMAL); status: RUNNING; exit: -; error: -; workDir: /rds/project/rds-nRVsLqKmyyw/work/SAMPLE1/work/3a/4488012214c7916e3d7305119e3733 started: 1676925348312; exited: -; ]
~> TaskHandler[jobId: 14658419; id: 443; name: MUTECT2 (TUMOUR_vs_NORMAL); status: RUNNING; exit: -; error: -; workDir: /rds/project/rds-nRVsLqKmyyw/work/SAMPLE1/work/3a/4488012214c7916e3d7305119e3733 started: 1676925348312; exited: 2023-02-20T20:48:13Z; ]
~> TaskHandler[jobId: 14658420; id: 446; name: MUTECT2 (TUMOUR_vs_NORMAL); status: RUNNING; exit: -; error: -; workDir: /rds/project/rds-nRVsLqKmyyw/work/SAMPLE1/work/3a/4488012214c7916e3d7305119e3733 started: 1676925348314; exited: 2023-02-20T20:48:13Z; ]
Feb-20 20:50:49.211 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 14658419; id: 443; name: MUTECT2 (TUMOUR_vs_NORMAL); status: COMPLETED; exit: 0; error: -; workDir: /rds/project/rds-nRVsLqKmyyw/work/SAMPLE1/work/3a/4488012214c7916e3d7305119e3733 started: 1676925348312; exited: 2023-02-20T20:48:13Z; ]

Here, I understand that it has submitted the first job (jobId: 14658419) and is waiting for it to finish (status is RUNNING), but at some point it retries the job (jobId: 14658420) even though the first one is still running. However, by the time this second job finishes, the first one has also finished (see the last line), and this causes the downstream error below. I have also added the channel dumps in which the file appears to be replicated (likely because it was submitted and finished twice).

Downstream error:

Feb-20 21:14:48.823 [Actor Thread 38] DEBUG nextflow.processor.TaskProcessor - Process MERGEMUTECTSTATS > collision check staging file names: [...,TUMOUR_vs_NORMAL.chr4_1-190214555.vcf.gz.stats:2, TUMOUR_vs_NORMAL.chrY_1-57227415.vcf.gz.stats:2, TUMOUR_vs_NORMAL.chr18_1-80373285.vcf.gz.stats:1, TUMOUR_vs_NORMAL.chrX_1-156040895.vcf.gz.stats:1, TUMOUR_vs_NORMAL.chr22_1-50818468.vcf.gz.stats:1, TUMOUR_vs_NORMAL.chr19_1-58617616.vcf.gz.stats:2, ...]

The channel dump for one of the duplicated files looks like this:

Feb-20 20:50:49.239 [Actor Thread 46] INFO  nextflow.extension.DumpOp - [DUMP: mutect2_vcf_branch] [['id':'TUMOUR_vs_NORMAL', 'normal_id':'NORMAL', 'num_intervals':24, 'patient':'SAMPLE1', 'status':1, 'tumor_id':'TUMOR'], /rds/project/rds-nRVsLqKmyyw/work/SAMPLE1/work/3a/4488012214c7916e3d7305119e3733/TUMOUR_vs_NORMAL.chr4_1-190214555.vcf.gz]
Feb-20 20:50:49.263 [Actor Thread 29] INFO  nextflow.extension.DumpOp - [DUMP: mutect2_vcf_branch] [['id':'TUMOUR_vs_NORMAL', 'normal_id':'NORMAL', 'num_intervals':24, 'patient':'SAMPLE1', 'status':1, 'tumor_id':'TUMOR'], /rds/project/rds-nRVsLqKmyyw/work/SAMPLE1/work/3a/4488012214c7916e3d7305119e3733/TUMOUR_vs_NORMAL.chr4_1-190214555.vcf.gz]

For now, the only workaround I have found is to increase the exitReadTimeout setting and remove the 'retry'. But this is not optimal when submitting lots of samples, i.e. it limits Nextflow's scalability, because the retry is quite handy when weird errors happen on the cluster. Any help would be appreciated. Thanks.
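A hedged sketch of the workaround described above, for anyone hitting the same thing (the values and the MUTECT2 selector are illustrative; this mitigates the race but does not fix it):

```groovy
// nextflow.config -- illustrative values only
executor {
    // Give the shared filesystem more time to expose .exitcode before the
    // task is declared failed.
    exitReadTimeout = '30 min'
}

process {
    withName: 'MUTECT2' {
        // Dropping 'retry' avoids resubmitting a task that may in fact still
        // be running, which is what produces the duplicated output files.
        errorStrategy = 'terminate'
    }
}
```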

@santy-ttam

Hi @pditommaso, we are running into this same issue on our side. I was wondering whether the snapshot was ever merged into the release branch. Thank you.

@pditommaso (Member)

This was solved by 5229e93. If you are still experiencing a problem, please open a new issue, including the Nextflow log file.
