Skip to content

nextflow hanging with slurm executor #361

@msmootsgi

Description

@msmootsgi

Nextflow seems to be hanging after the pipeline fails when running on a slurm cluster. Here is the full .nextflow.log. This particular pipeline has a workflow.onComplete block that also isn't being called.

May-31 19:17:58.816 [main] DEBUG nextflow.cli.Launcher - $> /usr/local/bin/nextflow run http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git -resume -hub gitlab -r aws_changes -latest -params-file /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/params.yaml -process.executor slurm
May-31 19:17:58.964 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 0.24.4
May-31 19:17:59.521 [main] DEBUG nextflow.scm.AssetManager - Repository URL: http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git; Project: SGI-Pipelines/GeneFlagging; Hub provider: gitlab
May-31 19:17:59.529 [main] INFO  nextflow.cli.CmdRun - Pulling SGI-Pipelines/GeneFlagging ...
May-31 19:17:59.533 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> http://git.l.synthgeno.global/api/v3/projects/SGI-Pipelines%2FGeneFlagging
May-31 19:17:59.884 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> http://git.l.synthgeno.global/api/v3/projects/SGI-Pipelines%2FGeneFlagging/repository/files?file_path=nextflow.config&ref=master
May-31 19:18:00.018 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> http://git.l.synthgeno.global/api/v3/projects/SGI-Pipelines%2FGeneFlagging/repository/files?file_path=main.nf&ref=master
May-31 19:18:00.139 [main] DEBUG nextflow.scm.AssetManager - Pulling SGI-Pipelines/GeneFlagging -- Using remote clone url: http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git
May-31 19:18:01.244 [main] INFO  nextflow.cli.CmdRun -  downloaded from http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git
May-31 19:18:01.675 [main] DEBUG nextflow.scm.AssetManager - Git config: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/.git/config; branch: master; remote: origin; url: http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git
May-31 19:18:01.675 [main] INFO  nextflow.cli.CmdRun - Launching `SGI-Pipelines/GeneFlagging` [big_jennings] - revision: e8dc1db4ec [aws_changes]
May-31 19:18:01.686 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/nextflow.config
May-31 19:18:01.690 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/nextflow.config
May-31 19:18:01.825 [main] DEBUG nextflow.config.ConfigBuilder - Setting config profile: 'standard'
May-31 19:18:01.874 [main] WARN  nextflow.config.ConfigBuilder - It seems you never run this project before -- Option `-resume` is ignored
May-31 19:18:01.936 [main] DEBUG nextflow.Session - Session uuid: 57c9f9a1-cc45-41cd-be69-4bc9818b2b3e
May-31 19:18:01.936 [main] DEBUG nextflow.Session - Run name: big_jennings
May-31 19:18:01.936 [main] DEBUG nextflow.Session - Executor pool size: 4
May-31 19:18:01.951 [main] DEBUG nextflow.cli.CmdRun -
  Version: 0.24.4 build 4341
  Modified: 22-05-2017 11:18 UTC
  System: Linux 4.4.0-64-generic
  Runtime: Groovy 2.4.10 on OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11
  Encoding: UTF-8 (UTF-8)
  Process: 4001@ip-172-20-22-81 [172.20.22.81]
  CPUs: 4 - Mem: 15.7 GB (14.3 GB) - Swap: 0 (0)
May-31 19:18:02.043 [main] DEBUG nextflow.Session - Work-dir: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work [nfs]
May-31 19:18:02.044 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/bin
May-31 19:18:02.428 [main] DEBUG nextflow.Session - Session start invoked
May-31 19:18:02.432 [main] DEBUG nextflow.processor.TaskDispatcher - Dispatcher > start
May-31 19:18:02.433 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/trace.tsv
May-31 19:18:02.446 [main] DEBUG nextflow.script.ScriptRunner - > Script parsing
May-31 19:18:10.194 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
May-31 19:18:10.197 [main] WARN  nextflow.script.ScriptBinding - Access to undefined parameter `clean_up` -- Initialise it to a default value eg. `params.clean_up = some_value`
May-31 19:18:10.234 [main] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: S3FileSystemProvider
May-31 19:18:10.245 [main] DEBUG nextflow.file.FileHelper - AWS S3 config details: {region=us-west-2}
May-31 19:18:10.989 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:10.989 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.000 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.002 [main] INFO  nextflow.executor.Executor - [warm up] executor > slurm
May-31 19:18:11.009 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'slurm' > capacity: 100; pollInterval: 5s; dumpInterval: 5m
May-31 19:18:11.012 [main] DEBUG nextflow.processor.TaskDispatcher - Starting monitor: TaskPollingMonitor
May-31 19:18:11.013 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: slurm)
May-31 19:18:11.016 [main] DEBUG nextflow.executor.Executor - Invoke register for executor: slurm
May-31 19:18:11.017 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.056 [main] DEBUG nextflow.Session - >>> barrier register (process: fetch_blast_nr_db)
May-31 19:18:11.069 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fetch_blast_nr_db -- maxForks: 4
May-31 19:18:11.091 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.091 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.091 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.092 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.092 [main] DEBUG nextflow.Session - >>> barrier register (process: fetch_pfam_db)
May-31 19:18:11.095 [Actor Thread 2] DEBUG nextflow.processor.TaskProcessor - <fetch_blast_nr_db> Poison pill arrived
May-31 19:18:11.102 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <fetch_blast_nr_db> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.110 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fetch_pfam_db -- maxForks: 4
May-31 19:18:11.128 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.128 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.128 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.129 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.129 [main] DEBUG nextflow.Session - >>> barrier register (process: fetch_superfam_db)
May-31 19:18:11.136 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fetch_superfam_db -- maxForks: 4
May-31 19:18:11.138 [Actor Thread 4] DEBUG nextflow.processor.TaskProcessor - <fetch_pfam_db> Poison pill arrived
May-31 19:18:11.138 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <fetch_pfam_db> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.157 [Actor Thread 6] DEBUG nextflow.processor.TaskProcessor - <fetch_superfam_db> Poison pill arrived
May-31 19:18:11.162 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.162 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.162 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <fetch_superfam_db> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.163 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.163 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.164 [main] DEBUG nextflow.Session - >>> barrier register (process: order_by_gc)
May-31 19:18:11.172 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > order_by_gc -- maxForks: 4
May-31 19:18:11.197 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.198 [Actor Thread 8] DEBUG nextflow.processor.TaskProcessor - <order_by_gc> Poison pill arrived
May-31 19:18:11.203 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <order_by_gc> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.198 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.205 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.205 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.206 [main] DEBUG nextflow.Session - >>> barrier register (process: contig_stats)
May-31 19:18:11.224 [Actor Thread 9] DEBUG nextflow.processor.TaskProcessor - Copying to process workdir foreign file: s3:///sgi-pipeline-dev/geneflagging_data/contigs.fa
May-31 19:18:11.240 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > contig_stats -- maxForks: 4
May-31 19:18:11.269 [Actor Thread 10] DEBUG nextflow.processor.TaskProcessor - <contig_stats> Poison pill arrived
May-31 19:18:11.271 [Actor Thread 12] DEBUG nextflow.processor.TaskProcessor - Copying to process workdir foreign file: s3:///sgi-pipeline-dev/geneflagging_data/contigs.fa
May-31 19:18:11.278 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <contig_stats> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.288 [Actor Thread 3] INFO  nextflow.processor.TaskProcessor - [skipping] Stored process > fetch_blast_nr_db
May-31 19:18:11.293 [Actor Thread 7] INFO  nextflow.processor.TaskProcessor - [skipping] Stored process > fetch_superfam_db
May-31 19:18:11.293 [Actor Thread 5] INFO  nextflow.processor.TaskProcessor - [skipping] Stored process > fetch_pfam_db
May-31 19:18:11.295 [Actor Thread 1] DEBUG nextflow.processor.TaskProcessor - <fetch_pfam_db> Sending poison pills and terminating process
May-31 19:18:11.298 [Actor Thread 13] DEBUG nextflow.processor.TaskProcessor - <fetch_superfam_db> Sending poison pills and terminating process
May-31 19:18:11.299 [Actor Thread 6] DEBUG nextflow.processor.TaskProcessor - <fetch_superfam_db> After stop
May-31 19:18:11.299 [Actor Thread 2] DEBUG nextflow.processor.TaskProcessor - <fetch_blast_nr_db> After stop
May-31 19:18:11.299 [Actor Thread 14] DEBUG nextflow.processor.TaskProcessor - <fetch_blast_nr_db> Sending poison pills and terminating process
May-31 19:18:11.301 [Actor Thread 4] DEBUG nextflow.processor.TaskProcessor - <fetch_pfam_db> After stop
May-31 19:18:11.303 [Actor Thread 1] DEBUG nextflow.Session - <<< barrier arrive (process: fetch_pfam_db)
May-31 19:18:11.304 [Actor Thread 14] DEBUG nextflow.Session - <<< barrier arrive (process: fetch_blast_nr_db)
May-31 19:18:11.308 [Actor Thread 13] DEBUG nextflow.Session - <<< barrier arrive (process: fetch_superfam_db)
May-31 19:18:11.892 [Actor Thread 9] DEBUG nextflow.executor.GridTaskHandler - Launching process > order_by_gc (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/c2/2abdaf28c7057602e6edbf6dd85444
May-31 19:18:12.072 [Actor Thread 12] DEBUG nextflow.executor.GridTaskHandler - Launching process > contig_stats (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/c3/9e29e54e126749a4cda9a366422f24
May-31 19:18:12.073 [Actor Thread 9] INFO  nextflow.Session - [c2/2abdaf] Submitted process > order_by_gc (1)
May-31 19:18:12.090 [Actor Thread 8] DEBUG nextflow.processor.TaskProcessor - <order_by_gc> After stop
May-31 19:18:12.181 [Actor Thread 12] INFO  nextflow.Session - [c3/9e29e5] Submitted process > contig_stats (1)
May-31 19:18:12.185 [Actor Thread 10] DEBUG nextflow.processor.TaskProcessor - <contig_stats> After stop
May-31 19:23:16.031 [Thread-3] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 2 -- first: TaskHandler[jobId: 2; id: 4; name: order_by_gc (1); status: RUNNING; exit: -; workDir: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/c2/2abdaf28c7057602e6edbf6dd85444 started: 1496258571038; exited: -; ]
May-31 19:23:26.058 [Thread-3] WARN  nextflow.processor.TaskProcessor - Process `order_by_gc (1)` terminated with an error exit status (125) -- Execution is retried (1)
May-31 19:23:26.077 [pool-2-thread-1] DEBUG nextflow.executor.GridTaskHandler - Launching process > order_by_gc (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/88/a767bf3eb8c490cc267e6e25ec21b2
May-31 19:23:26.201 [pool-2-thread-1] INFO  nextflow.Session - [88/a767bf] Re-submitted process > order_by_gc (1)
May-31 19:23:26.203 [Thread-3] WARN  nextflow.processor.TaskProcessor - Process `contig_stats (1)` terminated with an error exit status (125) -- Execution is retried (1)
May-31 19:23:26.220 [pool-2-thread-2] DEBUG nextflow.executor.GridTaskHandler - Launching process > contig_stats (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/23/d2a1c73986ef90e0e685cf32def4df
May-31 19:23:26.329 [pool-2-thread-2] INFO  nextflow.Session - [23/d2a1c7] Re-submitted process > contig_stats (1)
May-31 19:24:01.085 [Thread-3] ERROR nextflow.processor.TaskProcessor - Error executing process > 'order_by_gc (1)'

Caused by:
  Process `order_by_gc (1)` terminated with an error exit status (125)

Command executed [/tools/nextflow/assets/SGI-Pipelines/GeneFlagging/templates/orderFastaByGc.py]:

  #!/usr/bin/env python

  from gene_annotation.fasta.sortFastaByGc import sortFastaByGc


  def main():

      inputFile = 'contigs.fa'
      outputFile = 'contigsByGc.fa'

      sortFastaByGc(inputFile, outputFile)


  if __name__ == '__main__':
      main()

Command exit status:
  125

Command output:
  (empty)

Command error:
  Unable to find image 'dockreg-dev01.awsv.l.synthgeno.global/compbio/gene_annotation:0.1.0' locally
  docker: Error response from daemon: Get https://dockreg-dev01.awsv.l.synthgeno.global/v1/_ping: dial tcp 172.17.0.30:443: getsockopt: no route to host.
  See 'docker run --help'.

Work dir:
  /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/88/a767bf3eb8c490cc267e6e25ec21b2

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
May-31 19:24:01.088 [Actor Thread 13] DEBUG nextflow.processor.TaskProcessor - <order_by_gc> Sending poison pills and terminating process
May-31 19:24:01.088 [Actor Thread 13] DEBUG nextflow.Session - <<< barrier arrive (process: order_by_gc)
May-31 19:24:01.097 [Thread-3] DEBUG nextflow.Session - Session aborted -- Cause: Process `order_by_gc (1)` terminated with an error exit status (125)
May-31 19:24:01.119 [Actor Thread 13] DEBUG nextflow.processor.TaskProcessor - <contig_stats> Sending poison pills and terminating process
May-31 19:24:01.120 [Actor Thread 13] DEBUG nextflow.Session - <<< barrier arrive (process: contig_stats)
May-31 19:24:01.122 [Thread-3] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: slurm)
~

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions