
Using nextflow with moab and msub #1224

Closed
MJKampmann opened this issue Jul 12, 2019 · 34 comments

@MJKampmann

I would like to use nextflow on a cluster which uses moab cluster suite as a workload manager to run a RNA-seq analysis pipeline.
In Moab, jobs are submitted using the msub command; the underlying resource manager is Torque.
Is there a way to implement Moab as an executor in Nextflow?
Thank you

@pditommaso
Member

pditommaso commented Jul 12, 2019

It would not be too hard to add it. Could you provide a command-line example to submit a job, delete it, and check the queue status?

@MJKampmann
Author

Sure,
for example for a job named test running the script job.sh
msub -q single -N test -l nodes=1:ppn=1,walltime=3:00:00,pmem=5000mb job.sh
Resources are requested with -l; here the job requires 1 core, 3 h walltime, and 5000 MB memory. -q specifies the queue.
Jobs are deleted by
mjobctl -c <job-id>
And the queue status can be checked with showq
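The three commands above cover essentially everything a grid executor needs: submit, cancel, and poll. As a rough sketch (the function names here are my own, purely illustrative, not Nextflow's), the command lines could be assembled like this:

```python
import shlex

def submit_cmd(name, queue, cpus, walltime, mem_mb, script):
    """Build the msub command line, mirroring the example above."""
    resources = f"nodes=1:ppn={cpus},walltime={walltime},pmem={mem_mb}mb"
    return ["msub", "-q", queue, "-N", name, "-l", resources, script]

def cancel_cmd(job_id):
    """Build the mjobctl cancellation command line."""
    return ["mjobctl", "-c", job_id]

def status_cmd():
    """showq lists active, eligible, and blocked jobs."""
    return ["showq"]

# Reproduces the msub invocation shown above.
print(shlex.join(submit_cmd("test", "single", 1, "3:00:00", 5000, "job.sh")))
```

An executor would run these via a subprocess and parse the output; building the argument list separately keeps that logic testable without a Moab installation.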

@pditommaso
Member

To me it looks very similar to PBS. Have you checked whether there's a qsub command on your cluster?

@MJKampmann
Author

Yes, it is very similar to PBS. It is not possible to submit jobs with qsub on the cluster.

@pditommaso
Member

Then copy and paste here the exact output of the following commands:

msub help

msub -h

msub submission

Submit a job and include the command and its exact output.

showq help

showq -h

showq example

Copy and paste here the exact output of showq reporting at least a couple of jobs.

showq job status codes

Can you find all the possible job status codes that can be reported by showq and report them here?

mjobctl help

mjobctl -h

@MJKampmann
Author

Showq Active Jobs

showq, two active jobs:

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

881196             hd_ow424    Running    16     2:59:58  Fri Jul 12 11:41:49
881197             hd_ow424    Running    16     2:59:58  Fri Jul 12 11:41:49

2 active jobs          32 of 11920 processors in use by local jobs (0.27%)
                        491 of 705 nodes active      (69.65%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 blocked jobs

Total jobs:  2

Showq help

showq -h :

Usage: showq [FLAGS]
  --about
  --help
  --host=<SERVERHOSTNAME>
  --loglevel=<LOGLEVEL>
  --port=<SERVERPORT>
  --timeout=<SECONDS>
  --version
  --xml

  --blocking

  -b // BLOCKED JOBS
  -c // COMPLETED QUEUE
  -g // DISPLAY SRC. PEER GRID NAME
  -i // IDLE QUEUE
  -l // LOCAL/REMOTE VIEW
  -n // DISPLAY USER/ALTERNATE JOB NAMES
  -N // DISPLAY NODE/TASK USAGE BY JOB
  -o <ORDER> // DISPLAY ACTIVE JOBS IN SPECIFIED SORT ORDER:
             //  REMAINING REVERSEREMAINING JOB USER STATE STARTTIME
  -p <PARTITION> // PARTITION
  -r // ACTIVE QUEUE
  -R <RSVID> // show jobs in reservation
  -s // DISPLAY WORKLOAD SUMMARY
  -S // SYSTEM JOBS
  -v // VERBOSE
  -w {user,group,acct,class,qos,jobgroup}=<VAL> // where constraint

mjobctl help

mjobctl --help:

Usage: mjobctl [FLAGS]
  --about
  --help
  --host=<SERVERHOSTNAME>
  --loglevel=<LOGLEVEL>
  --port=<SERVERPORT>
  --timeout=<SECONDS>
  --version
  --xml

  -c <JOBID>[,<JOBID> ...] // CANCEL
  -e <JOBID> // RERUN (TORQUE RM only)
  -F <JOBID>[,<JOBID> ...] // FORCE CANCEL
  -C <JOBID> // CHECKPOINT
  { -h | -u } [<TYPE>] <JOBID> // HOLD
     <TYPE>=user|system|batch|defer|ALL
  -m <ATTR>{=|+=}<VAL> <JOBID> // MODIFY
  -N [signal=]<SIGID> <JOBID> // NOTIFY
  -p [+=|-=] <VAL> <JOBID> // MODIFY SYSTEM PRIORITY
  -q {diag|hostlist|starttime|wiki|json} { ALL | <JOBID> } [ --flags=COMPLETED ] // QUERY
  -r <JOBID> // RESUME
  -R <JOBID> // REQUEUE
  -s <JOBID> // SUSPEND
  -w <ATTR>=<VAL> // WHERE
  -x <JOBID> // EXECUTE

  <ATTR>={account|advres|allocnodelist|awduration|class|eeduration|env|flags|gres|group|hostlist|jobid|jobname|maxmem|messages|minstarttime|nodecount|qos|releasetime|reqreservation|rmxstring|state|sysprio|tpn|trig|user|userprio|var|wclimit}

msub help

msub --help:

Usage: msub [FLAGS] [<CMDFILE> [<ARG>] [<ARG>]...]
  --about
  --help
  --host=<SERVERHOSTNAME>
  --loglevel=<LOGLEVEL>
  --port=<SERVERPORT>
  --timeout=<SECONDS>
  --version
  --xml

 DATA STAGING
  --stagein=<STAGEIN-SPEC>
  --stageinsize=<STAGEINSIZE-SPEC>
  --stageinfile=<FILENAME>
  --stageout=<STAGEOUT-SPEC>
  --stageoutsize=<STAGEOUTSIZE-SPEC>
  --stageoutfile=<FILENAME>

 Workflow Job IDs
  --workflowjobids

  [-a date_time] [-A account_string] [-b retry_count] [-c interval] [-C directive_prefix] [-d initdir]
  [-e errorpath] [-F "<args>"] [-h] [-I] [-j join] [-k keep] [-l resource_list] [-L task] [-m mail_options]
  [-M user_list] [-n] [-N name] [-o path] [-p priority] [-q destination] [-r c] [-t jobarrays] [-S path_list]
  [-u user_list] [-v variable_list] [-V] [-w path] [-W additional_attributes] [-z] [script]

  -L tasks=#[:lprocs=#|all][:{usecores|usethreads|allowthreads}]
            [:place={node|socket|numanode|core|thread}[=#]][:memory=#]
            [:swap=#][:maxtpn=#][:gpus=#[:<mode>]][:mics=#]
            [:gres=<gres>][:feature=<feature>][[:{cpt|cgroup_per_task}]|[:{cph|cgroup_per_host}]]

@pditommaso
Member

Good. I would also need the job submission output and the complete list of possible job statuses.

@MJKampmann
Author

Job submission example:
msub -N testjob -l 'walltime=00:10:00,nodes=1:ppn=1,pmem=500mb' -o 'output.txt' test.sh
Output:
output.txt

@pditommaso
Member

It looks like there's an XML output option. Could you please include the msub and showq output with the --xml option specified?

@MJKampmann
Author

MJKampmann commented Jul 12, 2019

So in showq, jobs are either active, eligible, or blocked.
Active jobs have a status of either Running or Starting.

Blocked Jobs can be in the following states:

State | Description
-- | --
Idle | Job violates a fairness policy. Use diagnose -q for more      information.
UserHold | A user hold is in place.
SystemHold | An administrative or system hold is in place.
BatchHold | A scheduler batch hold is in place (used when the job cannot be run      because the requested resources are not available in the system or because     the resource manager has repeatedly failed in attempts to start the job).
Deferred | A scheduler defer hold is in place (a temporary hold used when a job      has been unable to start after a specified number of attempts. This hold     is automatically removed after a short period of time).
NotQueued | Job is in the resource manager state NQ (indicating that the job's controlling scheduling daemon is unavailable).
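An executor ultimately has to fold these Moab states into a few coarse categories. As an illustrative sketch based only on the states listed above (the category names are hypothetical, not Nextflow's actual status model):

```python
# Hypothetical coarse-graining of the Moab states listed above.
# The category names on the right are illustrative only.
MOAB_STATES = {
    "Running":    "ACTIVE",
    "Starting":   "ACTIVE",
    "Idle":       "PENDING",
    "UserHold":   "HELD",
    "SystemHold": "HELD",
    "BatchHold":  "HELD",
    "Deferred":   "HELD",
    "NotQueued":  "UNKNOWN",
}

def classify(state: str) -> str:
    """Map a raw Moab job state to a coarse executor category."""
    return MOAB_STATES.get(state, "UNKNOWN")
```

Defaulting unrecognized states to an "unknown" bucket keeps the poller robust if the scheduler ever reports a state not covered by this table.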

@MJKampmann
Author

MJKampmann commented Jul 12, 2019

showq --xml:

<Data><Object>queue</Object><cluster LocalActiveNodes="479" LocalAllocProcs="16" LocalConfigNodes="747" LocalIdleNodes="229" LocalIdleProcs="3910" LocalUpNodes="708" LocalUpProcs="11968" RemoteActiveNodes="0" RemoteAllocProcs="0" RemoteConfigNodes="0" RemoteIdleNodes="0" RemoteIdleProcs="0" RemoteUpNodes="0" RemoteUpProcs="0" time="1562925910"/><queue count="1" option="active"><job AWDuration="1392" Account="bw18k008" Class="single" DRMJID="881196.admin2" GJID="881196" Group="hd_hd" JobID="881196" JobName="JOBNAME.deeptools_bamCompare.cell=HMG007,treat=repl1,chip=PU1,scale=SES,ratio=subtract" MasterHost="m13s0703" PAL="torque" ReqAWDuration="10800" ReqNodes="1" ReqProcs="16" RsvStartTime="1562924509" RunPriority="99" StartPriority="99" StartTime="1562924509" StatPSDed="22271.200000" StatPSUtl="1829.560100" State="Running" SubmissionTime="1562924505" SuspendDuration="0" User="hd_ow424"/></queue><queue count="0" option="eligible"/><queue count="0" option="blocked"/></Data>

Update:

<?xml version="1.0" encoding="UTF-8"?>
<Data>
   <Object>queue</Object>
   <cluster LocalActiveNodes="479" LocalAllocProcs="16" LocalConfigNodes="747" LocalIdleNodes="229" LocalIdleProcs="3910" LocalUpNodes="708" LocalUpProcs="11968" RemoteActiveNodes="0" RemoteAllocProcs="0" RemoteConfigNodes="0" RemoteIdleNodes="0" RemoteIdleProcs="0" RemoteUpNodes="0" RemoteUpProcs="0" time="1562925910" />
   <queue count="1" option="active">
      <job AWDuration="1392" Account="bw18k008" Class="single" DRMJID="881196.admin2" GJID="881196" Group="hd_hd" JobID="881196" JobName="JOBNAME.deeptools_bamCompare.cell=HMG007,treat=repl1,chip=PU1,scale=SES,ratio=subtract" MasterHost="m13s0703" PAL="torque" ReqAWDuration="10800" ReqNodes="1" ReqProcs="16" RsvStartTime="1562924509" RunPriority="99" StartPriority="99" StartTime="1562924509" StatPSDed="22271.200000" StatPSUtl="1829.560100" State="Running" SubmissionTime="1562924505" SuspendDuration="0" User="hd_ow424" />
   </queue>
   <queue count="0" option="eligible" />
   <queue count="0" option="blocked" />
</Data>
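This XML is straightforward to poll programmatically. A minimal sketch, assuming the job elements always carry JobID and State attributes as in the output above (the sample here is a trimmed version of that output):

```python
import xml.etree.ElementTree as ET

# Trimmed version of the showq --xml output shown above; only the
# attributes relevant for status polling are kept.
SHOWQ_XML = """<Data><Object>queue</Object>
<queue count="1" option="active">
  <job JobID="881196" State="Running" User="hd_ow424"/>
</queue>
<queue count="0" option="eligible"/>
<queue count="0" option="blocked"/>
</Data>"""

def parse_showq(xml_text):
    """Map each JobID to its State across all showq queues."""
    root = ET.fromstring(xml_text)
    return {job.get("JobID"): job.get("State")
            for queue in root.iter("queue")
            for job in queue.iter("job")}

print(parse_showq(SHOWQ_XML))  # {'881196': 'Running'}
```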

@MJKampmann
Author

And for
msub -N testjob -l 'walltime=00:10:00,nodes=1:ppn=1,pmem=500mb' -o 'output.txt' --xml test.sh
I get
<Data><job JobID="881218"/></Data>
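So the submitted job ID can be recovered directly from the XML rather than by scraping plain-text output. A minimal sketch of that parsing step, using the exact output shown above:

```python
import xml.etree.ElementTree as ET

def parse_msub_jobid(xml_text):
    """Extract the JobID attribute from msub --xml submission output."""
    job = ET.fromstring(xml_text).find("job")
    return job.get("JobID") if job is not None else None

print(parse_msub_jobid('<Data><job JobID="881218"/></Data>'))  # 881218
```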

@pditommaso
Member

pditommaso commented Jul 12, 2019

Excellent. Regarding job status: is there no status distinguishing a successful completion from a failed one?

@MJKampmann
Author

No, the job status is just given as completed; errors are listed in a separate error file.

@pditommaso
Member

I see. Can you also include an example of the error file?

@MJKampmann
Author

If no errors are raised the error file is empty; otherwise it contains the errors raised by the program.
You can set the error file path with msub -N testjob -l 'walltime=00:10:00,nodes=1:ppn=1,pmem=500mb' -o 'output.txt' -e 'error.txt' test.sh
Here is an example where test.sh ends with an open quotation mark, so a warning is raised:
error.txt

@MJKampmann
Author

Here is a different example. TrimGalore is executed as part of a snakemake pipeline to process the reads in a FASTQ file:
run_trimGalore.cell=HMG017,treat=repl1,chip=PU1,reps=rep1.txt

@pditommaso
Member

I've pushed a possible implementation. There's little chance that it will work on the first try, but if you can manage to test it, it should not be too difficult.

To compile and test it, do the following:

  1. clone this project and check out the maob-executor branch with this command:

    git clone -b maob-executor https://github.com/nextflow-io/nextflow.git
    
  2. compile and assemble the executable:

    make compile pack
    cp build/releases/nextflow-19.08.0-SNAPSHOT-all ./nextflow
    chmod +x ./nextflow
    ./nextflow info
    
  3. use the above binary in place of the stock nextflow launcher, adding this setting to your nextflow.config file:

     process.executor='moab'
    
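For completeness, the executor setting can sit alongside the usual resource directives. A sketch of a nextflow.config using the queue and resource values from the msub example earlier in this thread (the values are illustrative, not requirements):

```groovy
// nextflow.config — values taken from the msub example in this thread
process {
    executor = 'moab'
    queue    = 'single'
    cpus     = 1
    memory   = '5000 MB'
    time     = '3h'
}
```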

@MJKampmann
Author

Hello, I tried the implementation.
I tried to launch the pipeline with the command
nextflow run ~/my-pipelines/nf-core/rnaseq-master
but I get the following error message:

N E X T F L O W  ~  version 19.08.0-SNAPSHOT
Launching `~my-pipelines/nf-core/rnaseq-master/main.nf` [agitated_poitras] - revision: 653dedd4d2
Unable to parse config file: '~/my-pipelines/nf-core/rnaseq-master/nextflow.config'

  Compile failed for sources FixedSetSources[name='/groovy/script/Script93D6D5AA4C3D38DC945138B35C17E9B0']. Cause: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
  /groovy/script/Script93D6D5AA4C3D38DC945138B35C17E9B0: 13: unexpected char: '#' @ line 13, column 3.
       # executer = 'pbs'
       ^

  1 error

@pditommaso
Member

# is not a valid comment character in the config file.

@MJKampmann
Author

MJKampmann commented Jul 14, 2019

Right, fixed that.
Now I just get the following output:

N E X T F L O W  ~  version 19.08.0-SNAPSHOT
Launching `~/my-pipelines/nf-core/rnaseq-master/main.nf` [exotic_shockley] - revision: 653dedd4d2
originalHostName

The runs' logfile looks as follows:
nextflow.log.txt

@pditommaso
Member

pditommaso commented Jul 16, 2019

Oh, never seen such error before:

ERROR nextflow.cli.Launcher - @unknown
java.lang.NoSuchFieldError: originalHostName
	at java.net.InetAddress.init(Native Method)
	at java.net.InetAddress.<clinit>(InetAddress.java:277)
	at java.net.PlainSocketImpl.initProto(Native Method)
	at java.net.PlainSocketImpl.<clinit>(PlainSocketImpl.java:45)
	at java.net.Socket.setImpl(Socket.java:503)
	at java.net.Socket.<init>(Socket.java:84)
	at javax.net.ssl.SSLSocket.<init>(SSLSocket.java:145)
	at sun.security.ssl.BaseSSLSocketImpl.<init>(BaseSSLSocketImpl.java:61)
	at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:524)
	at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:72)
	at sun.net.www.protocol.https.HttpsClient.createSocket(HttpsClient.java:409)

Please, include the script and config files you are using.

update: which version of Java are you using?

@MJKampmann
Author

Hello,
yes, I think the error occurred because I had installed a different version of Java than the one running on the cluster by default. Now I am using:
java -version

openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)

and the error doesn't occur, and I can launch nextflow.
However, there appears to be a different issue.
When I launch the pipeline, jobs get submitted to the cluster and also run, but they complete after a couple of seconds and no output files are created. I have attached the .nextflow.log file:
nextflow_log_new.txt

@pditommaso
Member

You need to investigate why it's failing. As the error message suggests, try:

  #
  #  Detailed information about the job (available <24h after job exit):
  #    checkjob -v 884646
  #    checkjob -v -v 884646 

@MJKampmann
Author

The job terminated because of the following error:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

However, I can activate the conda environment with
conda activate <environment-name>
Also if I use
conda init --all
no action is taken.

@pditommaso
Member

Try the following:

  • change to the failing task work dir
  • edit the script .command.run
  • add the directive #MSUB -S /bin/bash
  • try to execute again the job using the command: msub .command.run

Does it solve the problem?

@MJKampmann
Author

MJKampmann commented Jul 16, 2019

No, it does not solve the problem. I have attached a job file from a snakemake pipeline that worked, as an example.
jobfile-796879.txt
If I use #!/bin/sh in the script, the error
/var/spool/torque/mom_priv/jobs/884845.admin2.SC: line 33: syntax error near unexpected token `<'
is returned instead.

@MJKampmann
Author

Using the script:

#!/bin/bash
source activate <environment>

the environments are activated
while

#!/bin/bash
conda activate <environment>

fails with the error message as seen above.
Also, if I modify .command.run accordingly, the job is executed.

@pditommaso
Member

I'm not sure I understand; can you include both .command.run scripts (the original that doesn't work, and the one you modified and executed successfully)?

@MJKampmann
Author

Sure.
This is the original script
command.run.txt
The conda environment is activated in line 274 with conda activate; this is where the execution always failed.
In the modified script I changed that to source activate, which is the command used to activate conda environments in older versions of conda.
command.run_modified.txt
Using the modified script the job is executed.

@pditommaso
Member

pditommaso commented Jul 18, 2019

I see. source activate is a legacy Conda activation style and is no longer supported by NF; see 2bdc925.

@MJKampmann
Author

I have now executed the pipeline on the cluster and everything else appears to be working; job submission etc. works fine.

@pditommaso
Member

Nice to read that.

@pditommaso pditommaso added this to the v19.07.0 milestone Jul 27, 2019
@pditommaso
Member

pditommaso commented Jul 27, 2019

I've included this feature in the latest stable 19.07.0.
