A plugin system for pre-submission hooks to modify samples and change yaml representations #285
Ok, I've implemented this. I really like this idea. I have only one complaint, which I raised in #288. With this approach, you'd use this kind of a pipeline interface for a CWL pipeline:
and you'd use this kind of approach for an @afrendeiro sample yaml pipeline:
The only question is: should the basic yaml approach be built-in, or should we require you to specify it explicitly? Arguments for doing it anyway: maybe it's nice to have those yaml files as a record of what was run. At the moment I'm leaning toward following the 'explicit is better than implicit' rule of thumb, but I could be persuaded otherwise. Thoughts?
@afrendeiro I'd like to know if you have any objections to requiring the pipeline interface for any pipelines that want the sample yaml to include it. Or, should we simply make that run all the time?
No objections, but I would prefer if it were done automatically.
Yeah, that's better for backwards compatibility, but in that case, would there be a way to turn it off? Or would it always produce that file, and then could just produce others as well?
Hmm...possibly, but there's nothing tying those together, really. I don't really like that idea.
Well, I had proposed a revision that would generalize that concept here: #288 (comment). What do you think of that? I guess the issue is that this particular key, if made generic, would just be a value to be used. I'm just trying to make the thing generic and modular, so it's easier to accommodate this functionality as "just another way" rather than as a special case, I guess. What about just making it so that if there is no entry in…
You're right, it should be a general thing. I warmed up to the idea of it. Are there other things that would go under it? I guess the CWL one, right? Although the "basic" part is a bit weird; maybe something like "write_sample_yaml". In the end, could it maybe be something like this?

```yaml
pre_submit:
  - looper.write_sample_yaml
  - looper.write_sample_cwl
```
What you wrote is almost exactly how it is now! You just changed the names of the built-in functions. I just meant "basic" was the built-in yaml; the cwl one is also a yaml. I think you're confused about the terminology there; it's a yaml file that's used as input to parameterize a cwl run -- it's not a cwl file. So I might do

```yaml
pre_submit:
  - looper.write_sample_yaml
  - looper.write_sample_cwl
```

as built-in options, which makes sense to me. Then, users can write their own stuff and add it like

```yaml
pre_submit:
  - my_package.my_function
```

The downside here is that existing tools that require the sample yamls will have to update the pipeline interface to be compatible with the new looper version. So, we break backwards compatibility. The remaining questions are: …
Ah sorry I didn't see the example above again... Sorry I'm distracted :/
Is this just…
Yeah, that makes sense to me. But if it is really just… I'm okay with it.
I think I'm using it in a few cases. Can't remember exactly though...
Should it be possible to put a script in there? Maybe instead of a list of python functions, we make it operate the same way the… For example, the `dynamic_variables_command_template` is essentially taking as input some sample attributes and modifying the compute namespace before submitting. … if we made the…
Yeah, I guess compute variables could be populated this way if there are functions that take the sample as input and add those attributes. I'm assuming the called functions are allowed to modify the sample object, right? (I don't even think there's a mechanism to prevent that in Python, since everything is mutable by default.)
Well, if it's happening as a script rather than a function, then it can't. This is already how it works with the dynamic compute variables: the script prints out variables, which are then read in by looper and used to modify the compute variables.
Oh I see, I wasn't familiar with that. Well, that script could return a YAML/JSON version of the sample too, I guess.
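As a sketch of how that could work (the helper name and merge behavior are hypothetical, not looper's actual implementation), looper could run the hook command, parse JSON from its stdout, and merge the result into the namespaces, mirroring how the dynamic compute variables are read back in:

```python
import json
import subprocess


def run_pre_submit_script(command, namespaces):
    """Hypothetical sketch: run a hook command, read JSON from its
    stdout, and merge each returned namespace into looper's state."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    for name, updates in json.loads(result.stdout).items():
        # Only the namespaces the script printed are touched.
        namespaces.setdefault(name, {}).update(updates)
    return namespaces
```

For example, a script that prints `{"compute": {"mem": "6000"}}` would update only the compute namespace and leave the others alone.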
Two ideas for how to implement this that check all these boxes:

**Option 1**

The `pre_submit` section gets a sub-section per namespace:

```yaml
pre_submit:
  project:
    python_functions:
      - ...
    command_template: null
  sample:
    python_functions:
      - ...
    command_template:
      - ...
  pipeline:
    python_functions:
      - ...
    command_template:
      - ...
  looper:
    python_functions:
      - ...
    command_template:
      - ...
  compute:
    python_functions:
      - ...
    command_template:
      - ...
```

So, for example:

```yaml
pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  sample:
    python_function:
      - looper.write_sample_yaml
    command_template:
      - dir/build-sample-params.py -a {sample.attr_a} -b {sample.attr_b}
  compute:
    command_template:
      - build-compute-params.py -a {sample.asset}
command_template: >
  cwl-runner {pipeline.path} {sample.yaml_file}
```

The … it returns:

```json
{
  "mem": "6000"
}
```

You can use these sections to modify any of the looper namespaces.

**Option 2**

There are no sub-sections for namespaces, but all namespaces are included in a dict:

```yaml
pre_submit:
  python_functions:
    - ...
  command_template:
    - ...
```

For example:

```yaml
pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  python_function:
    - looper.write_sample_yaml
  command_template:
    - dir/build-sample-params.py -a {sample.attr_a} -b {sample.attr_b}
    - build-compute-params.py -a {sample.asset}
command_template: >
  cwl-runner {pipeline.path} {sample.yaml_file}
```

In this case the external script would return:

```json
{
  "sample": {},
  "compute": {
    "mem": "6000"
  }
}
```

The function would have to return a dict with a …
Currently the …

```python
def write_sample_yaml(sample, subcon=None):
    """
    Produce a complete, basic yaml representation of the sample

    :param peppy.Sample sample: A sample object
    """
    _LOGGER.info("Calling write_sample_yaml plugin.")
    sample.to_yaml(subcon._get_sample_yaml_path(sample))
    return sample
```

For the more universal version we'd do something like:

```python
def write_sample_yaml(sample, compute, looper, pipeline):
    _LOGGER.info("Calling write_sample_yaml plugin.")
    sample.to_yaml(subcon._get_sample_yaml_path(sample))
    return {
        "sample": sample,
        "compute": compute,
        "looper": looper,
        "pipeline": pipeline,
    }
```

Or maybe:

```python
def write_sample_yaml(namespaces):
    """
    Produce a complete, basic yaml representation of the sample

    :param dict namespaces: A dict with namespaces: sample, pipeline, looper, compute
    """
    sample = namespaces["sample"]
    compute = namespaces["compute"]
    looper = namespaces["looper"]
    pipeline = namespaces["pipeline"]
    _LOGGER.info("Calling write_sample_yaml plugin.")
    sample.to_yaml(subcon._get_sample_yaml_path(sample))
    return {
        "sample": sample,
        "compute": compute,
        "looper": looper,
        "pipeline": pipeline,
    }
```
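For what it's worth, the dispatch side of any of these variants could resolve the dotted names listed under `pre_submit` with `importlib`. A minimal sketch (the function name and calling convention are illustrative, not looper's actual API):

```python
import importlib


def run_pre_submit_hooks(function_names, namespaces):
    """Illustrative sketch: resolve each dotted name like
    'looper.write_sample_yaml' to a callable and apply it in order."""
    for dotted in function_names:
        module_name, func_name = dotted.rsplit(".", 1)
        hook = getattr(importlib.import_module(module_name), func_name)
        result = hook(namespaces)
        if result:  # a hook may return updated namespaces
            namespaces.update(result)
    return namespaces
```

This is why the `my_package.my_function` form works for user-supplied plugins: any importable module can provide a hook.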
I like the idea of a single `pre_submit` section. As for implementation/format selection: from this comment I'd prefer "Option 2", for the reasons you named there. And I don't think the script from… From this comment I don't have a preference, but again, I like that all the namespace handling could happen in a single hook function. So, if we implemented that, the problem we talked about in #288 goes away? The target YAML path for the sample can be created from any namespace attributes in a hook function.
I agree -- I didn't mean it would return an empty dict; that was just a placeholder. The function and the script would each return only whatever they wanted to update; null/missing entries would simply not be updated.
Yeah. For these functions, the advantage of the signature being…
I think it changes but doesn't get solved completely; I still think we'd want to write a built-in function that accepted some sample attributes and gave a decent path, or something. With this new approach, I think it becomes easier, because now those functions will have access to all the namespace variables directly -- that's what you mean, right?
I'm trying to implement this and found a potential problem: there's no easy way to make the scripts used in… Should we allow only one command? Or do this:

```yaml
pre_submit:
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.pre_submit.path2} -a {sample.asset}"
```

We would need to assume that every…

```yaml
pre_submit:
  script_paths: [my_path1, path2]
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.pre_submit.path2} -a {sample.asset}"
```

That gets a bit complex, and it feels like the "one command only" option makes the most sense... IDK
Should we add a new… So I guess instead of this:

```yaml
path: path/to/main/script.py
command_template: "{pipeline.path} ..."
pre_submit:
  script_paths: [my_path1, path2]
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.pre_submit.path2} -a {sample.asset}"
```

I'm proposing this:

```yaml
script_paths:
  main_path: path/to/main/script.py
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
command_template: "{pipeline.main_path} ..."
pre_submit:
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.path2} -a {sample.asset}"
```
good thought. Maybe we can make it even more versatile and name this section "paths" -- this might be useful beyond scripts. For example, some configuration or sample input files could be shipped with pipelines and embedded in command templates. |
Yeah, I wasn't sold on… I guess it would be more straightforward than my original proposal if you referred to them as…
Actually, as of now it's not a YAML list but a mapping:

```yaml
paths:
  key: x
```

And thanks to that we can refer to these paths with… It's already implemented and seems to work; you can give it a try.
Well, we could deprecate it; that is, provide a warning but allow it for now? I think the same path should be followed for the `dynamic_variables_command_template`, which will be superseded.
Wow, great, thanks! Are you referring to just the…
No, I don't use them currently, so it should be okay to adjust without bumping up against anything I've done.
@stolarczyk I noticed: it seems that you're requiring the return value to be the whole set of namespaces, like this:

I'd rather just have to return only the ones I'm updating. For example, in this example I'm only updating…

but when you change to that, you get an error:

Almost like it's only populating the sample namespace if it's run through the function.
I just pushed an update with an example. You can try this:

I changed…
You're right, this is how it works, and that's what I wrote above:

I thought that it's not a pain to return the entire namespaces dict, since it's the input to the hook function. But I don't feel strongly about it.
Well, I think users will commonly not be aware of all these namespaces, so it will be simpler if they can return only the things that they updated. This is what I meant when I wrote:

That seems clearer to me, if it's not harder to implement. Do you see any disadvantages?
It's a bit more complex, since otherwise the namespaces just become what's returned from the hook function. But I implemented it the other way for the…

I don't see any disadvantages. Conversely, I see an advantage in the same-output-type requirement.
Ok, sounds good. I was thinking it might be as simple as just calling the…
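The partial-update merge discussed above could indeed be quite small. A hedged sketch (the function name is illustrative; looper's actual code may differ): a hook returns only the namespaces it touched, and only those entries get replaced:

```python
def apply_hook_result(namespaces, result):
    """Merge a hook's (possibly partial) return value into the
    namespaces: only namespaces the hook actually returned are
    replaced; null/missing entries are left untouched."""
    for name, updated in (result or {}).items():
        if updated is not None:
            namespaces[name] = updated
    return namespaces
```

A hook returning `None`, an empty dict, or just `{"compute": {...}}` would all behave sensibly under this rule.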
Just realized in writing the docs... the new… is not the same thing as…
Do we need to generalize the…

```yaml
pipeline_name: PIPELINE1
pipeline_type: sample
paths:
  main: pipelines/pipeline1.py
  hook1: pipelines/hook1.py
output_paths:
  sample_yaml_path: "{looper.output_dir}/{sample.sample_name}.yaml"
  some_other_path: "{sample.sample_name}.txt"
  sample_yaml_cwl: "{sample.sample_name}_cwl.yaml"
input_schema: https://schema.databio.org/pep/2.0.0.yaml
pre_submit:
  python_function:
    - looper.write_sample_yaml_basic
  command_template:
    - "{pipeline.paths.hook1} --sample-name {sample.sample_name} --attr 10"
command_template: >
  {pipeline.paths.main} --sample-name {sample.sample_name} --genome {sample.genome}
```

Then you can refer to those variables from within your plug-in function with…

Then a follow-up question:

```yaml
paths:
  input:
    main: pipelines/pipeline1.py
  output:
    sample_yaml_path: "{sample.sample_name}.yaml"
```

An alternative idea: should we eliminate the implied relative interpretation and simply force you to state the relative folder if you want to use it? They are already available in some namespace variables:

Yet a third idea is to just keep it how it is, where…

Thoughts?
I thought that the point of the plugin functions was, among other things, to build the path to the sample yaml and write it, since all the namespaces are available as input.
That it gives control to the pipeline runner, I suppose, rather than the pipeline author. In other words, if I'm just using a pipeline, I may want things to be put in a different output space. I shouldn't have to write code to do that; it should be configurable. That's why we originally made the sample output path configurable, right? And right now that's no longer configurable, right?
Oh I see... I considered the plugin creation the new "configuration". I think this might be a source of confusion and potential problems. As a user, I still need to go over the plugin code and make sure it uses the path I'm expecting it to (…
Yeah, that makes some sense. I guess I see the plugins as an intermediate step. They're not quite "built-in" (although the current ones are in fact built-in), but not quite as configurable as just tweaking a piface... so I don't expect there to be tons of plugins, and I figure they should be reused. The plugin author would have to say "to use this plugin, you must/may provide the following…". Essentially, we make the plugins configurable.
Actually, this is basically already how it is... the plugin has access to the… The question here is just: do we need a special category for output file attributes that are assumed relative? Or do we need to provide a special class of attributes that are populated from templates by the namespace variables? This is what we currently lack.
I think I like this option the most:

It's both least confusing to use and seems easiest to implement; we'd just need to allow the values under `paths` to be templates.
I've just implemented the approach to solve this from the comment above, and now I was able to do:

```yaml
pipeline_name: PIPELINE1
pipeline_type: sample
paths:
  main: "{looper.piface_dir}/pipelines/pipeline1.py"
  hook1: "{looper.piface_dir}/pipelines/hook1.py"
input_schema: https://schema.databio.org/pep/2.0.0.yaml
pre_submit:
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.paths.hook1} --sample-name {sample.sample_name} --attr 10"
command_template: >
  {pipeline.paths.main} --sample-name {sample.sample_name} --genome {sample.genome}
```
nice! just curious, is it required to have those template values in quotes? |
Yes, when the value starts with curly braces. Otherwise it would be parsed as a mapping.
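This is standard YAML behavior, easy to verify with PyYAML (assuming it is installed; looper's own parsing may go through a different loader):

```python
import yaml

# Quoted: the value is a plain string, as intended.
quoted = yaml.safe_load('main: "{looper.piface_dir}/pipelines/pipeline1.py"')
assert quoted["main"] == "{looper.piface_dir}/pipelines/pipeline1.py"

# Unquoted: the leading "{" opens a YAML flow mapping, and the
# trailing "/pipelines/..." after the closing brace makes the
# document unparseable.
try:
    yaml.safe_load("main: {looper.piface_dir}/pipelines/pipeline1.py")
    parsed = True
except yaml.YAMLError:
    parsed = False
```

So quoting any value that begins with `{` is the safe habit in these pipeline interface files.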
Ok, so this is what we have. Old way:

New way:
Ok, one issue is that the…

You have to do:

How can we reconcile the need to accommodate multiple hooks with the need to make the interface to…
The syntax should become:

```yaml
pre_submit:
  python_functions:
  command_templates:
```

The functionality doesn't need to change.
@afrendeiro we've finalized the design and implemented this now. So, this will introduce some breaking changes for some of your pipelines, which will now be required to specify that they need to use that plugin in the new way. It should be as easy as adding:

The output file path can be customized using `var_templates.sample_yaml_path`. If this parameter is not provided, the file will be saved as `{looper.output_dir}/submission/{sample.sample_name}_sample.yaml`.
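As an illustration of how that default template resolves (a sketch only -- the namespace objects here are hypothetical stand-ins, and looper's actual rendering uses its own machinery), Python's `str.format` with attribute access produces the documented path:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the looper and sample namespaces.
looper = SimpleNamespace(output_dir="/project/results")
sample = SimpleNamespace(sample_name="frog_1")

template = "{looper.output_dir}/submission/{sample.sample_name}_sample.yaml"
path = template.format(looper=looper, sample=sample)
# path == "/project/results/submission/frog_1_sample.yaml"
```

Overriding `var_templates.sample_yaml_path` in the pipeline interface simply substitutes a different template into the same rendering step.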
Related to #284 and #283.
We originally wrote a sample yaml file. Now I think that format should change (#284). And, we're wanting to write a slightly tweaked version for CWL inputs (#283). What if there's other stuff we want to do with the sample? Well -- what if we instead implemented a plugin system?
looper, before job submission, would call a `pre_submit()` hook function where we currently run the `to_yaml` process. Users could write python packages that would provide functions that would be called by `pre_submit`. Plugin functions must: …

For example, a cwl plugin would be a package called `looper_cwl`, which provides a function called `add_cwl_yaml()`. Then, in the pipeline interface, you'd specify which plugin functions to run, with `pre_submit_plugin`, like this:

In this case, the cwl plugin would provide a new attribute, called `cwl_yaml`, so you can use `{sample.cwl_yaml}` in the command template as shown. This function would also write the cwl-version yaml to the file that would be specified by the `.cwl_yaml` attribute.

We could start by implementing: