Syntax enhancement aka DLS-2 #984

Closed
pditommaso opened this issue Dec 27, 2018 · 114 comments

@pditommaso
Member

commented Dec 27, 2018

This is a request for comments on the implementation of a modules feature for Nextflow.

This feature allows NF processes to be defined in the main script or in a separate library file, and then invoked one or multiple times like any other routine, passing the requested input channels as arguments.

Process definition

The syntax for the definition of a process is nearly identical to the usual one; it only requires the use of processDef instead of process and the omission of the from/into declarations. For example:

processDef index {
    tag "$transcriptome.simpleName"

    input:
    file transcriptome 

    output:
    file 'index' 

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

The semantics and supported features remain identical to the current process. See a complete example here.

Process invocation

Once a process is defined it can be invoked like any other function in the pipeline script. For example:

transcriptome = file(params.transcriptome)
index(transcriptome)

Since index defines an output channel, its return value can be assigned to a channel variable that can be used as usual, e.g.:

transcriptome = file(params.transcriptome)
index_ch = index(transcriptome)
index_ch.println()

If the process produces two (or more) output channels, the multiple assignment syntax can be used to get a reference to each output channel.
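
For illustration only, a minimal sketch of the multiple-output case. It assumes the draft processDef keyword from this proposal and that a process with two outputs returns its channels in declaration order. The process foo and the names delta_ch and gamma_ch are placeholders, not part of the implementation:

```groovy
// Draft keyword from this proposal; only meaningful with the modules-draft snapshot
processDef foo {
  input:
    val alpha
  output:
    val delta
    val gamma
  script:
    delta = alpha
    gamma = 'world'
    "echo $alpha"
}

// Groovy multiple assignment: one variable per output channel,
// assumed here to follow the order of the output declarations
(delta_ch, gamma_ch) = foo('Hello')

delta_ch.println()
gamma_ch.println()
```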

Process composition

The result of a process invocation can be passed to another process like any other function, eg:

processDef foo {
  input: 
    val alpha
  output: 
    val delta
    val gamma
  script:
    delta = alpha
    gamma = 'world'
    "some_command_here"
}

processDef bar {
  input:
    val xx
    val yy 
  output:
    stdout()
  script:
    "another_command_here"        
}

bar(foo('Hello'))

Process chaining

Processes can also be invoked as custom operators. For example a process foo taking one input channel can be invoked as:

ch_input1.foo()

when taking two channels as:

ch_input1.foo(ch_input2)

This allows the chaining of built-in operators and processes together eg:

Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .into { read_pairs_ch; read_pairs2_ch }

index(transcriptome_file)
    .quant(read_pairs_ch)
    .mix(fastqc(read_pairs2_ch))
    .collect()
    .multiqc(multiqc_file)

See the complete script here.

Library file

A library is just a NF script containing one or more processDef declarations. Then the library can be imported using the importLibrary statement, eg:

importLibrary 'path/to/script.nf'

Relative paths are resolved against the project baseDir variable.

Test it

You can try the current implementation using the version 19.0.0.modules-draft2-SNAPSHOT, e.g.

NXF_VER=19.0.0.modules-draft2-SNAPSHOT nextflow run rnaseq-nf -r modules

Open points

  1. When a process is defined in a library file, should it be possible to access the params values? Currently it's possible, but I think this is not a good idea because it makes the library depend on the script params, making it very fragile.

  2. How to pass parameters to a process defined in a library file, for example memory and cpus settings? It could be done using the config file as usual; still, I expect there could be the need to parametrise the process definition and specify the parameters at invocation time.

  3. Should a namespace be used when defining the processes in a library? What if two or more processes have the same name in different library files?

  4. One or many processes per library file? Currently any number of processes can be defined. I'm starting to think that it would be better to allow the definition of only one process per file. This would simplify the reuse across different pipelines and the import in tools such as dockstore, and it would make the dependencies of the pipeline more intelligible.

  5. Remote library file? Not sure it's a good idea to be able to import remotely hosted files, e.g. http://somewhere/script.nf. Remote paths tend to change over time.

  6. Should a version number be associated with the process definition? How to use or enforce it?

  7. How to test process components? Ideally it should be possible to include the required container in the process definition and unit test each process independently.

  8. How to chain a process returning multiple channels?

@LukeGoodsell

commented Dec 27, 2018

Fantastic stuff, Paolo! I've tried it out and played with having set-based inputs and outputs and it works nicely so far. I also note that this will make unit testing individual processes far easier!

My opinions on the points you raise:

  1. Imported process should be entirely isolated from other code -- i.e. no access to mutable globals like params (is workflow mutable?) -- to prevent long-range, unintended effects. However, it'd be useful to use the params global within the imported processes. Perhaps at process invocation the params variable can be set. E.g.:

     index(transcriptome_file)
         .quant(read_pairs_ch)
         .mix(fastqc(read_pairs2_ch, params: [outdir: 'my_out_dir']))
         .collect()
         .multiqc(multiqc_file)
    

    Personally, I'd always want the params object to be null unless otherwise specified, and to use params: params if I need to pass the global parameters, but perhaps a config value could specify whether it should take the global params value or null by default?

  2. I would favour the config file options being inherited from the importing workflow, and other variables set at process invocation as described for params above.

  3. Absolutely, we need different namespaces - I can imagine there being multiple processes from different packages sharing the same name. Importing each individual process would be onerous (see answer to q4 below), so namespacing will be essential. Perhaps we can declare something analogous to a package at the head of each library file, and then call package.namespace?

  4. I think it would be very burdensome to have to import each individual process separately: we have many, many processes and specifying them each would be tiresome and prone to error. Much better would be to have namespacing and then have users specify the namespace and process name - process names could much more easily be unique within a single namespace.

  5. I would never use remote file loading, but it is very convenient for one-off scripts. The more stable solution would be to have a package repo, or to be able to import an entire git repo's nextflow scripts. E.g.:

     importPackageFromGithub 'nf-core/nextflow-common'
    
  6. I would version code at the package level rather than script level. As with my above answers, this reduces the amount of repeated code. Therefore, within a single project/repo, the user wouldn't specify version numbers for importing individual scripts. I also wouldn't apply version numbers within scripts (again, to reduce duplication) but only at the package level.

  7. Unit testing might be out-of-scope here. However, the approach you've implemented so far means that it is easy to call individual processes with arbitrary inputs and act on outputs in any way desired. I would therefore hope to be able to write JUnit (or similar) tests for individual processes (or sets of processes) and be able to run them multiple times, with different parameters and configuration settings.

  8. I would favour having an additional parameter to the process call specifying the destination of each output channel. The first null value indicates the channel should be used in the current chain. Unhandled channels should raise an exception. E.g.:

     myProcess(inputChannelA, inputChannelB, outputs: [outputChannel1, null, OutputChannel3])
         .subscribe { println "outputChannel2: ${it}" }
    
@fstrozzi

Contributor

commented Jan 2, 2019

Hello! Tried this new feature and looks amazing, thank you !

Coming to your points:

  1. I think params values should not be accessible but on the other side I'd second the idea of defining those needed params values at import time and for the current session. Without it, libraries re-usability will be hampered imo.

  2. I think the ideal would be to have something like

 index(transcriptome_file)
     .quant(read_pairs_ch, task: [cpus: 4, memory: '8 GB'])

where the task specific parameters can be defined at execution time, similarly to what could be done with params.

  3. Yes, absolutely.

  4. One process per library will definitely lead to a jungle of files to be imported/managed either locally or from a remote repository. I see the point of re-usability but it will make much more sense for the end-users to have a library which is scope specific (i.e. QC, or Salmon or even Chip-Seq or Metagenomics) and then import and combine single processes at run-time using namespaces.

  5. That could be interesting, but I will only allow few "trusted" repositories to pull from, where code is checked and verified. It could be on GitHub or under nf-core and Nextflow URLs.

  6. Only on the library itself, and versioning should be linked to a repository in my opinion. It should be something like Conda for instance, so no version specified means take the most recent version. If thinking about a Git repository, then libraries versions could be the tags (I find commit hashes cumbersome to use, but maybe it's just me).

  7. I would not enforce unit testing here but hopefully, as already stated, this new feature will provide a much simpler common ground to implement testing for both libraries and pipelines using one of the many testing libraries available in Java or Groovy.

  8. Unsure here, from one side I think Luke's idea is interesting as flagging one specific output channel to be passed to the next process is very useful. From the other side, I think processes having multiple output channels can be also branching points in the DAG and so you need to deal explicitly with the remaining output channels too, and this will break the "chain" of processes anyway.

@winni2k

Contributor

commented Jan 4, 2019

Great stuff indeed!

In regards to point 3: I also think that namespacing will be invaluable. I really like Python's semantics in this regard (import fastqc from qctools and import qctools). However, using the point as suggested earlier (for example qctools.fastqc) would conflict with chaining. Perhaps the double colon semantics could work in that case instead? (qctools::fastqc)

@aunderwo

commented Jan 9, 2019

Conversion from a monolithic script to a slim main.nf with imported processes is perfect!! Barriers were minimal.

I would second not having access to params without passing them explicitly, but I would need some way of accessing them since many of my processes have a conditional that executes a different variant of a script depending on a param.

If it were possible to use process rather than processDef it would be cleaner but I can live with that difference. Perhaps the keyword moduleProcess would be more explicit.

@minillinim

commented Jan 15, 2019

First, this looks awesome. I'm working with a few people to build some pretty complex NF stuff and this type of thing should make our lives much much easier. 🎉

As for RFC:

  1. I don't think the modules should be given any access to the params object. It just encourages bad habits. If the user really wants globals then they could just define them via the config file.

  2. Would it be possible to expose an API to the object / class (or create one) that actual config files get boiled down to? Then each process could work out its config in the usual way or we could do something like this:

// define the process (assuming that param ordering under 'input:' matches the ordering used when calling)
processDef assemble_minia {
    input:
    file reads
    val prefix

    output:
    file "${prefix}asm.gfa"

    script:
    """
    minia -kmer-size 31 -in $reads -out ${prefix}asm
    """
}

And then when we use it:

// load a config file -> all values in this file override any previously set values
reads = file(params.reads)
assemble_minia.load_config("/path/to/file/or/similar")
assem_ch = assemble_minia(reads, "with_custom_config")

Or just update config values individually:

// change the container only - specifically override one value
// NOTE: accessing "params" values outside of processDef
assemble_minia.set_container("${params.docker_repository}company/minia:${params.minia_old_commit}")
old_assem_ch = assemble_minia(reads, "old_version")

assemble_minia.set_container("${params.docker_repository}company/minia:${params.minia_new_commit}")
new_assem_ch = assemble_minia(reads, "new_version")

@pditommaso pditommaso added this to DSL enhancements in On going activity Jan 16, 2019

@pditommaso pditommaso removed this from DSL enhancements in On going activity Jan 16, 2019

@pditommaso

Member Author

commented Jan 24, 2019

Thanks a lot for all your comments! I've uploaded another snapshot introducing some refinements and suggestions you provided:

NXF_VER=19.0.0.modules-draft3-SNAPSHOT nextflow info 

Main changes:

  1. I've realised that adding the processDef keyword could be confusing and, above all, not strictly necessary. In this version, when process is used in the main script it works as usual; when it's used in a module definition file, it defines a process for later invocation, and therefore from/into should not be used.

  2. importLibrary has been replaced by require, which is a bit more readable.

  3. Parameters. I agree with you that modules should be isolated from command line parameters. At the same time I think there should be a way to inject options into a module component when it's referenced. This would allow the parametrisation of the inner tasks. In the last snapshot I've added the possibility to specify a map of values when the module is referenced via the require statement, e.g.

    require 'module-file.nf', params: [ foo: val1, bar: val2 ]
    

Then in module-file.nf we can have the usual syntax for params as in the main script:

   params.foo = 'x'
   params.bar = 'y'

   process something {
    ''' 
    your_command --here
    '''
   }

  4. Namespace. It can be useful, but I don't think it's dramatically urgent. I think we can add it in a separate iteration.

  5. Remote module repository. The idea is tempting; it could work along the same lines as the nextflow pull command. The module is downloaded from a Git repository and a commit ID or tag can be used to identify a specific version. For example:

    require from: 'nf-core/common-lib', revision: <tag-name>
    

These are the main points. In the next iteration I will try to extend the module concept to also allow the definition of custom functions that can be imported both in the config and in the script context.

@aunderwo

commented Jan 24, 2019

Thanks for the update @pditommaso.

To clarify the injection of params into modules: if you wanted to inject params that have been passed as arguments to the nextflow run command, would you do something like the below, to have default values that can be overridden by args on the nextflow run command line and then passed on to the module?

params.foo = false
params.bar = 50

require 'module-file.nf', params: [ foo: params.foo , bar: params.bar ]
@pditommaso

Member Author

commented Jan 24, 2019

Yes, exactly like that, you can even do

require 'module-file.nf', params: params

Though having to pass them in either of these ways is the only thing I don't like about this approach.

@pditommaso pditommaso pinned this issue Jan 30, 2019

@mes5k

commented Jan 31, 2019

Of course you release this feature after I can't use nextflow anymore. Sigh. :)

I think this feature looks great. Reading through this it seems like this only lets you separate and reuse the definition of single processes, but it doesn't have a way of collecting or aggregating multiple processes into single entity (like a subworkflow). Is that right? Have you given any thought to that or is that still future work?

Regardless, I think this is awesome and I'll continue to wish I was using nextflow instead of what I'm using now...

@pditommaso

Member Author

commented Jan 31, 2019

@mes5k Ah-ah, you have to come back to NF!

but it doesn't have a way of collecting or aggregating multiple processes into single entity (like a subworkflow)

This approach is extremely flexible and the idea is to use a similar mechanism also for sub-workflows.

@mes5k

commented Feb 1, 2019

Awesome! So happy to hear that you're working on this. Will definitely make the job of selling nextflow internally easier!

@pditommaso

Member Author

commented Feb 11, 2019

Uploaded 19.0.0.modules-draft4-SNAPSHOT, which allows the definition of custom functions and nested require inclusions. You can see it in action in this pipeline: CRG-CNAG/CalliNGS-NF@1cad86b

However still not happy, I'll try experimenting with the ability to define subworkflows.

@blacky0x0

commented Feb 18, 2019

@pditommaso does this feature relate to #238 and also #777, #844? I guess, yes.
Please, keep in mind and consider also the following features:

  • dry-run or plan to see the end graph structure;
  • print output channels(variables) if the value can be inferred and doesn't have dependencies;
  • execution of specified file, module or process to be able to run isolated part;
  • syntax checking for *.nf files;

It makes sense to allow running a target process or module of a very large script separately, as a portion of the work. Just look at the definition of targeting in the Terraform tool. It makes it possible to uniquely refer to each module, resource or data source within any provider context by its fully qualified name. So, examples of a CLI for NF could be written as:

nextflow run -target=process.1A_prepare_genome_samtools
nextflow run -target=module.'rnaseq.nf'.fastqc
nextflow plan -target=process.1A_prepare_genome_samtools
nextflow plan -target=module.'rnaseq.nf'.fastqc

Besides introducing the modules feature to extract common code to a separate file, I hope it will lead to the implementation of the features described above, because they are useful and desired.

@pditommaso

Member Author

commented Feb 18, 2019

#238 yes, the others are out of the scope of this enhancement.

@blacky0x0

commented Feb 18, 2019

@pditommaso let's assume that the feature is done and can be released as experimental.
Let's simply add an extra -enable-modules option which will enable the new module feature. It will preserve backward compatibility and allow end users to test this feature. It's a compromise when you need a new release and feedback. For example, the experimental -XX:+UnlockExperimentalVMOptions flag for Java 11 in release notes.

@pditommaso

Member Author

commented Feb 18, 2019

That's the plan.

@blacky0x0

commented Feb 19, 2019

It is worth adding a version designation to the nf script to help the end user identify the version and to produce clear error descriptions. For example:

apiVersion: "nextflow.io/v19.0.0-M4-modules"
   or
dslVersion: "nextflow.io/v19.0.0-M4-modules"

where M stands for milestone.

@pditommaso

Member Author

commented Feb 20, 2019

Ok, just uploaded 19.0.0.modules-draft5-SNAPSHOT. Things are starting to become exciting: it's now possible to define a subworkflow, either in the module script or in the main script, composing the defined processes, e.g.

process foo {
   /your_command/ 
}

process bar {
  /another_command/
}

workflow sub1 {
  foo()
  bar()  
}

Then invoke it as a function, i.e. sub1(). Sub-workflows can have parameters like a regular function, e.g.

workflow sub1(ch_x, ch_y) {
  foo(ch_x)
  bar(ch_y)  
}

The output of the last invoked process (bar) is implicitly the output of the sub-workflow and it can be referenced in the outer scope as sub1.output.

In the main script an anonymous workflow can be defined; it is supposed to be the application entry-point and is therefore implicitly executed, e.g.

fasta  = Channel.fromPath(...)
reads = Channel.fromFilePairs(...)
workflow {
  sub1( fasta, reads )
}

Bonus (big one): within a workflow scope the same channel can be used as input in different processes (finally!)

@mes5k

commented Feb 21, 2019

Hi @pditommaso I've started experimenting and I'm having a hard time getting something working. I'm getting this error:

[master]$ NXF_VER=19.0.0.modules-draft5-SNAPSHOT nextflow run main.nf
N E X T F L O W  ~  version 19.0.0.modules-draft5-SNAPSHOT
Launching `main.nf` [boring_kare] - revision: 66747d681c
ERROR ~ No such variable: x

 -- Check script 'main.nf' at line: 8 or see '.nextflow.log' file for more details

With this code: https://github.com/mes5k/new_school_nf

Can you point me in the right direction?

@pditommaso

Member Author

commented Feb 21, 2019

The processes can only be defined in the module script (to keep compatibility with existing code).

In the main script there must be a workflow to enable the new syntax. Finally, the operator-like syntax was removed because I realised that it was useful only in a restricted set of examples and generated confusion in most cases. Your example should be written as:

   to_psv(to_tsv(gen_csv(ch1)))

or

gen_csv(ch1)
to_tsv(gen_csv.output)
to_psv(to_tsv.output)
@mes5k

commented Feb 21, 2019

Awesome, thanks! My first example is now working.

My next experiment was to see if I could import an entire workflow. I can't tell from your comments whether that's something that's supported or whether I've just got a mistake in my code.

@aunderwo

commented Feb 21, 2019

Is it possible to assign module process outputs to a variable so that you can do something like

modules.nf

process foo {
    input:
    file(x)

    output:
    file(y)

   script:
    .....
}

process bar {
    input:
    file(a)

    output:
    file(b)

   script:
    .....
}

main.nf

require 'modules.nf'

workflow {
  Channel
    .from('1.txt', '2.txt', '3.txt')
    .set{ ch1 }

  foo_output = foo(ch1)
  bar_output = bar(foo_output)

  bar_output.view()
}
@pditommaso

Member Author

commented Feb 21, 2019

Yes, but it's not necessary. The process can be accessed as a variable to retrieve the output value, i.e.

workflow {
  Channel
    .from('1.txt', '2.txt', '3.txt')
    .set{ ch1 }

  foo(ch1)
  bar(foo.output)
  bar.output.view()
}
@rspreafico-vir

commented Apr 27, 2019

If the aliasing strategy works, that is perfect for me. Thanks for addressing it!

@rspreafico-vir

commented Apr 28, 2019

The aliasing should be supported by draft10 already, correct? 'cause I am trying it but turning this

include fastqc from 'modules/fastqc'

into this

include fastqc as fastqc_raw from 'modules/fastqc'

produces the following error

ERROR ~ Unexpected error [NullPointerException]

 -- Check script 'main.nf' at line: 6 or see '.nextflow.log' file for more details
@pditommaso

Member Author

commented Apr 29, 2019

@rspreafico-vir You need to clone and to build the master branch or use the 19.05.0-SNAPSHOT version.

@rspreafico-vir

commented Apr 29, 2019

This works great with 19.05.0-SNAPSHOT. Thank you!!

@gerlachry

commented May 9, 2019

Is there any current plan for when this might be officially released?

@pditommaso

Member Author

commented May 9, 2019

@aunderwo

commented May 9, 2019

@pditommaso Is this available on the 19.04.1 release?

@rspreafico-vir

commented May 9, 2019

Nope, kindly see a few comments up. Requires 19.05.0-SNAPSHOT

@aunderwo

commented May 9, 2019

Thanks @rspreafico-vir. I am on the point of submitting something to nf-core and would dearly love it to be using DSL-2!

@rspreafico-vir

commented May 9, 2019

DSL-2 will be great for nf-core! It is easy to envision a carefully crafted library of modules, one per tool, in nf-core. In addition to being great for nf-core pipelines, such nf-core modules would be useful per se for end users.

@drpatelh

commented May 9, 2019

@aunderwo Finally! I remember talking to you about this at the NF conference last year 😄 Really looking forward to this functionality being added to nf-core. Be nice to create a standardised set of modules for the community.

@arnaudbore

commented May 21, 2019

Should we re-open this issue ? Very excited about the new release!

@pditommaso pditommaso unpinned this issue May 22, 2019

@pditommaso

Member Author

commented May 22, 2019

@pachiras

Contributor

commented Jun 3, 2019

Thank you Paolo! This makes our research a lot easier!

sivkovic added a commit to sivkovic/nextflow that referenced this issue Jun 6, 2019

Syntax enhancement aka DLS-2 nextflow-io#984
This commit implements a major enhancenent for Nextflow DLS
that provides support for:
- module libraries and processes inclusion
- ability to use an outout channel multiple times as input
- implicit process output variable
- pipe style process and operator compositon

sivkovic added a commit to sivkovic/nextflow that referenced this issue Jun 6, 2019

Syntax enhancement aka DLS-2 nextflow-io#984
This commit implements a major enhancenent for Nextflow DLS
that provides support for:
- module libraries and processes inclusion
- ability to use an outout channel multiple times as input
- implicit process output variable
- pipe style process and operator compositon

Signed-off-by: Ivkovic <sinisa.ivkovic@gmail.com>
@rspreafico-vir

commented Jun 11, 2019

Hi @pditommaso , just a question. Do you think that the publishDir operator could come soon to DSL-2? If not I will spend the time to write a little script that parses the trace file and collects all the output for me. But won't do if that feature is coming soon.

@madkinsz

commented Jun 11, 2019

Hi @pditommaso , just a question. Do you think that the publishDir operator could come soon to DSL-2? If not I will spend the time to write a little script that parses the trace file and collects all the output for me. But won't do if that feature is coming soon.

Just a +1 that I've been hunting through the documentation for this feature.

@pditommaso

Member Author

commented Jun 12, 2019

The publishDir is still an open point; consider that you can implement something similar using subscribe and copyTo, for example:

your_channel.subscribe { file-> file.copyTo('/some/path') }

Regarding the publishDir operator, I'm thinking we should have a more declarative approach to declare outputs in the workflow definition context. Also the proposal of named outputs #1181 could be relevant in this context.

@JonathanCSmith

commented Jun 12, 2019

Hi - couple of questions. Can sub-workflows have outputs? Can we specify which channels are output specifically? If not, is it a planned feature? Perhaps supporting a similar syntax to process (e.g. input & output declaration) would be useful?

@rspreafico-vir

commented Jun 13, 2019

Thanks for your suggestion @pditommaso , the copyTo alternative is working great!

@pditommaso

Member Author

commented Jun 13, 2019

@rspreafico-vir nice to read that. @JonathanCSmith actually that was the first implementation I tried, but I was not convinced because the semantics are different. I'll open a separate issue to discuss it.

@mes5k

commented Jun 13, 2019

Isn't the channel output by the last process in a sub-workflow the implicit output of that sub-workflow?

@rspreafico-vir the problem I had in the past with copyTo is that it doesn't transparently handle filesystem and S3 paths the way the publishDir directive does. Or did. It's been a while, so it's possible that has been changed.

@rspreafico-vir

commented Jun 13, 2019

@mes5k You are right about copyTo, it's not as straightforward as publishDir. With S3, I noticed that if the output path does not exist and you are trying to copy a folder, then Nextflow creates the output path but it copies the folder content into the specified path, as opposed to the folder itself. However, if the path already exists, then the folder is copied as a whole and becomes a subfolder in the output path, which is the intended behavior. I think this is due to the following statement from the Nextflow docs:

While Linux tools often treat paths ending with a slash (e.g. /some/path/name/) as directories, and those not (e.g. /some/path/name) as regular files, Nextflow (due to its use of the Java files API) views both these paths as the same file system object. If the path exists, it is handled according to its actual type (i.e. as a regular file or as a directory). If the path does not exist, it is treated as a regular file, with any missing parent directories created automatically.

I have not tried the behavior with local execution, just AWS Batch with S3. The other catch is that copyTo overwrites files (the intended behavior), but when trying to copy a folder, the content of the folder is not overwritten (it throws an error).

For now I am fixing the two catches above by first checking that the output folder does not exist, and then by creating it myself before calling copyTo. Since S3 has no concept of folders, creating a folder here really means creating a hidden empty file.
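
A rough sketch of that stopgap, for anyone wanting to try the same thing. It only relies on the standard Nextflow file helpers (file, exists, mkdirs, copyTo). params.outdir and the source glob are placeholder names, and the S3 behaviour is as described above rather than something this snippet verifies:

```groovy
// Hypothetical parameter naming the publish location (local path or s3:// URI)
params.outdir = 'results'

def outDir = file(params.outdir)

// Catch 2: copyTo does not overwrite the content of an existing folder,
// so refuse to start when the target is already there
if( outDir.exists() )
    exit 1, "Output folder already exists: ${outDir}"

// Catch 1: create the target up front, so that a copied folder lands
// inside it as a sub-folder instead of having its content flattened
outDir.mkdirs()

Channel
    .fromPath('outputs/*', type: 'any')   // placeholder for the real output channel
    .subscribe { item -> item.copyTo(outDir) }
```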

The other way around it would be to create a process that takes as an input all the pipeline outputs (generated from different processes), with the intention of consolidating all files in a single subfolder of the work folder. However, this does not work in S3 as symlinks to input files are only maintained in the EC2 instance, but not in S3. I have not tried specifying that inputs should be copied rather than symlinked by this process with DSL-2 yet, if that works it could provide a workaround.

@JonathanCSmith

commented Jun 19, 2019

I ended up 'faking' a process that would ensure my desired channel is the last output of a sub-workflow which should work as a stopgap for now.

I have, however, encountered another issue (which may just be a result of my becoming accustomed to DSL2). I have created a basic process with a file as an input.

process customProcess {

    input:
        file(item)

    output:
        set file(item), file('output.txt')

    """
        less $item > 'output.txt'
    """
}

I have called the process successfully with a channel containing a list of files using the syntax:

output = channelWithListOfFiles | customProcess

whereas:

output = customProcess(channelWithListOfFiles)

did not work. In addition, I attempted to reuse the "customProcess" but I received the following error:

Process `customProcess` declares 2 input channels but 1 were specified

Pseudocode for the subworkflow is as follows:

output = channelWithListOfFiles | customProcess
output2 = channelWithListOfFiles2 | customProcess

Is this expected? For now I can overcome by duplicating the process but personally I would expect processes to be re-useable?

@pachiras

Contributor

commented Jun 21, 2019

Hello, Paolo,

I have a question.
Recently I have found Channel.create() is not supported under DSL2

$ nextflow run . --input 'test/input.yaml'
N E X T F L O W  ~  version 19.06.0-SNAPSHOT
Launching `./main.nf` [fabulous_goodall] - revision: c0a9d64eb6
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Channel `create` method is not supported any more

If that's so, how do you rewrite Channel.choice() example code shown in the reference document?

source = Channel.from 'Hello world', 'Hola', 'Hello John'
queue1 = Channel.create()
queue2 = Channel.create()

source.choice( queue1, queue2 ) { a -> a =~ /^Hello.*/ ? 0 : 1 }

queue1.subscribe { println it }
@pditommaso

Member Author

commented Jun 24, 2019

@pachiras interesting point, please report as a separate issue and let's continue the discussion there.

@pditommaso

Member Author

commented Jun 24, 2019

I'm locking this thread. Please open a new issue for any DSL2 problem or general discussion. Thanks!

@nextflow-io nextflow-io locked as resolved and limited conversation to collaborators Jun 24, 2019
