-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor ext config as params #308
Conversation
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
|
SRATOOLS_PREFETCH.out.sra, | ||
ch_ncbi_settings, | ||
ch_dbgap_key, | ||
params.sratools_fasterqdump_args ?: '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should always be defined, as it's in nextflow.config
, so no need for an elvis operator, no?
Or is this being defensive in case of this being a shared subworkflow and pipeline authors forgetting to add the param into the Nextflow config file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, the fallback isn't really needed here. I think my intent was that the process should define a default value for this input e.g. val args, defaultValue: ''
(or String args = ''
once we have static types 😄 ). But that isn't supported yet so I put it here instead
I personally do agree that the |
As Nicolas said, one of the reasons to use this was because of being able to use the I did initially suggest also supplying args as params too to make it less confusing to the user, but it was left out. nf-core/rnaseq#701, but again it didn't support closures from the command line. Edit: Slack chat also suggests appending the meta map controlled stuff to the args, but then that means |
This feels like circling back to the very attempt at nf-core/modules. Another issue was that the passing around of all modules_args through several places in the pipeline was a bit painful: https://github.com/nf-core/sarek/blob/8ae205c171fb53132b2205b231a779347341a9f0/main.nf |
nf-core subworkflows are also written such that all inputs come in through the channels (i.e. |
Thank you all for the historical context I was missing. It's interesting how similar it is to the old DSL1 approach. My only difference would be to treat each args setting as its own param instead of bundling them in a map.
I would handle dynamic args like this: ASPERA_CLI (
ch_sra_reads.aspera,
'era-fasp',
ch_sra_reads.aspera.map { meta, fastq -> "${meta.key}" }
) You could customize it further with params, e.g. to toggle between static and dynamic args. I agree that setting a param to a closure is an anti-pattern, so I'm not claiming it would be as powerful as the ext config. But I would also ask how much flexibility do you actually need? Are you just adding a lot of flexibility because you can, or are you addressing real use cases that can't be solved in other ways?
I think it's better to be transparent about the complexity of the workflow, rather than try to hide it with shortcuts like ext. It forces you to confront the true complexity of your pipeline and consider more broadly whether the complexity is justified and how else you might address it in the pipeline code. Admittedly, this sort of dynamic ext probably shouldn't have been allowed in the first place, and I can understand how it might be hard to like any other solution when ext config already exists. But at the end of the day, these ext args are inputs, not config. I know this because you (Mahesh) requested that ext settings invalidate the cache: nextflow-io/nextflow#2382 . Process directives normally don't affect the cache because they define how a task is executed rather than what is executed and produced. I am also thinking about this in the context of all the other language improvements we are working on -- formal script and config grammar, record types, incorporating the params schema, workflow publish definition, etc. All of these efforts work together to make the language much simpler and make many bad patterns unnecessary. I suspect that the use of ext config is driven in part by a desire to avoid the pain and messiness of writing pipeline code... so if the pipeline code is made simpler, maybe the transition to params / inputs doesn't seem so bad. |
I just want to say that supplying process args through params is something I think we should be doing. I'm just wary of making the pipelines more difficult to read, as that's a big factor in getting new potential contributors. That tool specific parameters can come through Currently setting Any workflow specific optional options (say, tool flags necessary for RNAseq data, but are not mandatory for the nf-core module ) can still be mistakenly changed by the user. Passing them as a process input would help mitigate this, since the workflow developer can hard code these extra options into the workflow. As you show, it's fairly clear using a
At the time, I was just looking for ways to cut down the code, because it was very hard for me to follow what was going where. I'm sure we must have discussed passing these args as an Sorry for the rambling. So in summary: |
It's a fair point about the readability of the pipeline code. I think that proper IDE tooling and better static type support will help a lot here, so many of these changes I'm pushing for are predicated on the overall developer experience being much better. I also have some ideas to simplify the channel logic, potentially by a lot, but that is longer term. I'm working on a "visionary" PR for fetchngs which combines a bunch of different language improvements that are in development so that I can get a clearer picture of everything, and maybe also show you guys where I'm trying to go. As for this PR, it's something you can implement in your pipelines today, if you want. It's totally up to you whether you'd rather have this complexity in the pipeline code / params or in the config. But I will say, while I don't think we will remove anything anytime soon, we aren't going to make the config approach any easier. The ext directive, config selectors, those things aren't gonna get any prettier, but the pipeline code will. |
Sorry more ramblings related to this. So one reason why
is so nice to use is that I can see exactly where this input is applied. Using Concrete issue: nf-core/rnaseq#1111 I'm not familiar with the current state of nf-core/rnaseq, but as a contributor, I had an easy time to trace why the flag didn't work (and could provide a low effort solution). In the parameters as process inputs solution this may have been hard-coded as a string input to the process. This means the user has no way to change this unless they make an issue, make a PR, change the code themselves, or just leave it. I'm not sure how often this occurs, but more generally, many users also prefer to be able to find what they need through the documentation (sidenote: it would be monumental for a developer to code for every param possibility too). I see colleagues that will get as far as they can with a workflow, and if they can't achieve something, they'll run the tool outside the workflow with the flags they want. They won't make an issue, PR, or interact with developers to get the feature they need added because it's a time cost. Another thing is parameter naming. The process selectors are nice because the code enforces a 1-to-1 relationship. It would be nice if we could achieve something similar through a schema or some config feature that could achieve a similar kind of behaviour. I would like to be able to see where my param is likely going based on it's name, and there be some mechanism to warn if there isn't a one to one correspondence like in the case of the Sorry, there's a few things to untangle here. |
Process selectors are a convenient syntax sugar and a quick fix for this particular use case, but it's also very brittle -- the selector name is easy to get wrong, the priority resolution is easy to get wrong, dynamic behavior is very limited because you aren't in the pipeline code yet. I wonder whether the quick fix that it provides is offset by poor developer experience that it creates over the long term. But I agree that the traceability is nice, and I would support making it easier to trace params through the pipeline code. I would like to do exactly this with the workflow DAG -- show each param and how it connects to processes, without necessarily showing all of the glue logic it has to go through. With the IDE tooling, I could see both (1) side-by-side code with DAG preview that updates as you type and (2) hover over a param to see the processes that it connects to. Right now you have to run the pipeline (even if in preview mode) just to render the DAG. But if we can produce the DAG just from parsing the script, we can get better information (e.g. about the params) and do it in real-time (i.e. live preview). |
Can you document somewhere that
doesn't unsync the inputs?
I'm not sure what you mean exactly by this, but you've also got to remember that this small change creates a lot of maintenance burden (also part of developer experience imo) on the contributors here. Short term, the core team has to organise the change across the modules, which includes documentation, tooling, and trying to think what might break. Then manage all the PR's to the modules and subworkflows as well as finally rolling it out in all the pipelines. Then there's the long term maintenance of dealing with anything that is missing from the conversion. It took several months and a lot of work from everyone to roll out what I proposed back then. The possibility for passing through ext.args has to be removed ( DSL3? ) if you want to force the switch or more people really need to speak up in favour of passing through params given the work it entails. Thanks for taking the time to make the improvements. I like the ideas that are coming forward and I'm looking forward to seeing the DSL3 info. |
Here's my proof-of-concept for DSL2+ / DSL3 with fetchngs: #309 Code changes like this always take a long time, but I'm in no hurry. We aren't going to drop support for |
Co-authored-by: Adam Talbot <12817534+adamrtalbot@users.noreply.github.com>
Pieter Moris just (re-)discovered an interesting problem of passing command-line options on the command-line:
It's solved by using a params-file or config file, but there are a lot who will pass using the command-line. |
This is because the Nextflow CLI inserts
But I think this will be fully solved when we incorporate the parameter schema into core Nextflow. Then the CLI will know that this param is a string and take the following string argument instead of treating it as a boolean. |
Folding into #312 |
See nextflow-io/nextflow#4510 (comment)
tl;dr -- instead of implementing module config for the sole purpose of using
ext
as a shortcut to inject process inputs, let's just make them process inputs and control them with pipeline params as needed.An interesting case with fetchngs is that the default args for
SRATOOLS_FASTERQDUMP
are overwritten by the subworkflow that uses it. For now I just left the default args in a comment, but it could also be hardcoded in the script body as a fallback value. Long term, it could be a default value with the declared input (as I have commented), likely as part of nextflow-io/nextflow#4553 .Between this refactor and replacing
publishDir
with the workflow publish DSL, we wouldn't need the module config at all, at least for the use cases we have gathered so far.