A plugin system for pre-submission hooks to modify samples and change yaml representations #285
Ok, I've implemented this. I really like this idea. I have only one complaint, which I raised in #288. With this approach, you'd use this kind of a pipeline interface for a CWL pipeline:
and you'd use this kind of approach for an @afrendeiro sample yaml pipeline:
The only question is: should the basic yaml approach be built-in, or should we require you to specify it explicitly? Arguments for doing it anyway: maybe it's nice to have those yaml files as a record of what was run. At the moment I'm leaning toward following the 'explicit is better than implicit' rule of thumb, but I could be persuaded otherwise. Thoughts?
@afrendeiro I'd like to know if you have any objections to requiring the pipeline interface for any pipelines that want the sample yaml to include it. Or, should we simply make that run all the time?
No objections, but I would prefer if it were done automatically.
Yeah, that's better for backwards compatibility, but in that case, would there be a way to turn it off? Or would it always produce that file, and then could just produce others as well?
Hmm...possibly, but there's nothing tying those together, really. I don't really like that idea.
Well, I had proposed a revision that would generalize that concept here: #288 (comment). What do you think of that? I guess the issue is that this particular key, if made generic, would just be a value to be used. I'm just trying to make the thing generic and modular, so it's easier to accommodate this functionality as "just another way" rather than as a special case, I guess. What about just making it so that if there is no entry in…
You're right, it should be a general thing. I warmed up to the idea of it. Are there other things that would go under it? I guess the CWL one, right? Although the "basic" part is a bit weird; maybe something like "write_sample_yaml". In the end, could it maybe be something like this?

```yaml
pre_submit:
  - looper.write_sample_yaml
  - looper.write_sample_cwl
```
What you wrote is almost exactly how it is now! You just changed the names of the built-in functions. I just meant "basic" was the built-in yaml; the cwl one is also a yaml. I think you're confused about the terminology there; it's a yaml file that's used as input to parameterize a cwl run -- it's not a cwl file. So I might do

```yaml
pre_submit:
  - looper.write_sample_yaml
  - looper.write_sample_cwl
```

as built-in options, which makes sense to me. Then, users can write their own stuff and add it like

```yaml
pre_submit:
  - my_package.my_function
```

The downside here is that existing tools that require the sample yamls will have to update the pipeline interface to be compatible with the new looper version. So, we break backwards compatibility. The remaining questions are: …
Ah sorry I didn't see the example above again... Sorry I'm distracted :/
Is this just…
Yeah, that makes sense to me. But if it is really just… I'm okay with it.
I think I'm using it in a few cases. Can't remember exactly though...
Should it be possible to put a script in there? Maybe instead of a list of python functions, we make it operate the same way the… For example, the `dynamic_variables_command_template` is essentially taking as input some sample attributes and modifying the compute namespace before submitting. … if we made the…
Yeah, I guess compute variables could be populated this way if there are functions that take the sample as input and add those attributes. I'm assuming the called functions are allowed to modify the sample object, right? (I don't even think there's a mechanism to prevent that in Python, since everything is mutable by default.)
Well, if it's happening as a script rather than a function, then it can't. This is already how it works with the dynamic compute variables: the script prints out variables, which are then read in by looper and used to modify the compute variables.
Oh I see, I wasn't familiar with that. Well, that script could return a YAML/JSON version of the sample too, I guess.
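As a sketch of how that could work (the helper name and merge behavior are hypothetical, not looper's actual implementation), looper could run the hook command, parse JSON from its stdout, and merge the result into the namespaces, mirroring how the dynamic compute variables are read back in:

```python
import json
import subprocess


def run_pre_submit_script(command, namespaces):
    """Hypothetical sketch: run a hook command, read JSON from its
    stdout, and merge each returned namespace into looper's state."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    for name, updates in json.loads(result.stdout).items():
        # Only the namespaces the script printed are touched.
        namespaces.setdefault(name, {}).update(updates)
    return namespaces
```

For example, a script that prints `{"compute": {"mem": "6000"}}` would update only the compute namespace and leave the others alone.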
Two ideas for how to implement this that check all these boxes:

**Option 1**

The `pre_submit` section gets a sub-section per namespace:

```yaml
pre_submit:
  project:
    python_functions:
      - ...
    command_template: null
  sample:
    python_functions:
      - ...
    command_template:
      - ...
  pipeline:
    python_functions:
      - ...
    command_template:
      - ...
  looper:
    python_functions:
      - ...
    command_template:
      - ...
  compute:
    python_functions:
      - ...
    command_template:
      - ...
```

So, for example:

```yaml
pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  sample:
    python_function:
      - looper.write_sample_yaml
    command_template:
      - dir/build-sample-params.py -a {sample.attr_a} -b {sample.attr_b}
  compute:
    command_template:
      - build-compute-params.py -a {sample.asset}
command_template: >
  cwl-runner {pipeline.path} {sample.yaml_file}
```

The … it returns:

```json
{
  "mem": "6000"
}
```

You can use these sections to modify any of the looper namespaces.

**Option 2**

There are no sub-sections for namespaces, but all namespaces are included in a dict:

```yaml
pre_submit:
  python_functions:
    - ...
  command_template:
    - ...
```

For example:

```yaml
pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
path: wc-tool.cwl  # relative to this pipeline_interface.yaml file
pre_submit:
  python_function:
    - looper.write_sample_yaml
  command_template:
    - dir/build-sample-params.py -a {sample.attr_a} -b {sample.attr_b}
    - build-compute-params.py -a {sample.asset}
command_template: >
  cwl-runner {pipeline.path} {sample.yaml_file}
```

In this case the external script would return:

```json
{
  "sample": {},
  "compute": {
    "mem": "6000"
  }
}
```

The function would have to return a dict with a …
Currently the …

```python
def write_sample_yaml(sample, subcon=None):
    """
    Produce a complete, basic yaml representation of the sample

    :param peppy.Sample sample: A sample object
    """
    _LOGGER.info("Calling write_sample_yaml plugin.")
    sample.to_yaml(subcon._get_sample_yaml_path(sample))
    return sample
```

For the more universal version we'd do something like:

```python
def write_sample_yaml(sample, compute, looper, pipeline):
    _LOGGER.info("Calling write_sample_yaml plugin.")
    sample.to_yaml(subcon._get_sample_yaml_path(sample))
    return {
        "sample": sample,
        "compute": compute,
        "looper": looper,
        "pipeline": pipeline,
    }
```

Or maybe:

```python
def write_sample_yaml(namespaces):
    """
    Produce a complete, basic yaml representation of the sample

    :param dict namespaces: A dict with namespaces: sample, pipeline, looper, compute
    """
    sample = namespaces["sample"]
    compute = namespaces["compute"]
    looper = namespaces["looper"]
    pipeline = namespaces["pipeline"]
    _LOGGER.info("Calling write_sample_yaml plugin.")
    sample.to_yaml(subcon._get_sample_yaml_path(sample))
    return {
        "sample": sample,
        "compute": compute,
        "looper": looper,
        "pipeline": pipeline,
    }
```
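For what it's worth, the dispatch side of any of these variants could resolve the dotted names listed under `pre_submit` with `importlib`. A minimal sketch (the function name and calling convention are illustrative, not looper's actual API):

```python
import importlib


def run_pre_submit_hooks(function_names, namespaces):
    """Illustrative sketch: resolve each dotted name like
    'looper.write_sample_yaml' to a callable and apply it in order."""
    for dotted in function_names:
        module_name, func_name = dotted.rsplit(".", 1)
        hook = getattr(importlib.import_module(module_name), func_name)
        result = hook(namespaces)
        if result:  # a hook may return updated namespaces
            namespaces.update(result)
    return namespaces
```

This is why the `my_package.my_function` form works for user-supplied plugins: any importable module can provide a hook.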
I like the idea of a single `pre_submit` section. As for implementation/format selection: from this comment I'd prefer "Option 2", for the reasons you named there. And I don't think the script from… From this comment I don't have a preference, but again, I like that all the namespace handling could happen in a single hook function. So, if we implemented that, the problem we talked about in #288 goes away? The target YAML path for the sample can be created from any namespace attributes in a hook function.
I agree -- I didn't mean it would return an empty dict; that was just a placeholder. The function and the script would each return only whatever they wanted to update; null/missing entries would simply not be updated.
Yeah. For these functions, the advantage of the signature being…
I think it changes but doesn't get solved completely; I still think we'd want to write a built-in function that accepted some sample attributes and gave a decent path, or something. With this new approach, I think it becomes easier, because now those functions will have access to all the namespace variables directly -- that's what you mean, right?
I'm trying to implement this and found a potential problem: there's no easy way to make the scripts used in… Should we allow only one command? Or do this:

```yaml
pre_submit:
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.pre_submit.path2} -a {sample.asset}"
```

We would need to assume that every…

```yaml
pre_submit:
  script_paths: [my_path1, path2]
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.pre_submit.path2} -a {sample.asset}"
```

That gets a bit complex, and it feels like the "one command only" option makes the most sense... IDK
Should we add a new… So I guess instead of this:

```yaml
path: path/to/main/script.py
command_template: "{pipeline.path} ..."
pre_submit:
  script_paths: [my_path1, path2]
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.pre_submit.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.pre_submit.path2} -a {sample.asset}"
```

I'm proposing this:

```yaml
script_paths:
  main_path: path/to/main/script.py
  my_path1: dir/build-sample-params.py
  path2: build-compute-params.py
command_template: "{pipeline.main_path} ..."
pre_submit:
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.my_path1} -a {sample.attr_a} -b {sample.attr_b}"
    - "{pipeline.path2} -a {sample.asset}"
```
good thought. Maybe we can make it even more versatile and name this section "paths" -- this might be useful beyond scripts. For example, some configuration or sample input files could be shipped with pipelines and embedded in command templates. |
Yeah, I wasn't sold on… I guess it would be more straightforward than my original proposal if you referred to them as…
Actually, as of now it's not a YAML list but a mapping:

```yaml
paths:
  key: x
```

And thanks to that we can refer to these paths with… It's already implemented and seems to work; you can give it a try.
Well, we could deprecate it; that is, provide a warning but allow it for now? I think the same path should be followed for the `dynamic_variables_command_template`, which will be superseded.
Wow, great, thanks! Are you referring to just the…
No, I don't use them currently, so it should be okay to adjust without bumping up against anything I've done.
@stolarczyk I noticed: it seems that you're requiring the return value to be the whole set of namespaces, like this:

I'd rather just have to return only the ones I'm updating. For example, in this example I'm only updating…

but when you change to that, you get an error:

Almost like it's only populating the sample namespace if it's run through the function.
I just pushed an update with an example. You can try this:

I changed…
You're right, this is how it works, and that's what I wrote above:

I thought that it's not a pain to return the entire namespaces dict, since it's the input to the hook function. But I don't feel strongly about it.
Well, I think users will commonly not be aware of all these namespaces, so it will be simpler if they can return only the things that they updated. This is what I meant when I wrote:

That seems clearer to me, if it's not harder to implement. Do you see any disadvantages?
It's a bit more complex, since otherwise the namespaces just become what's returned from the hook function. But I implemented it the other way for the…

I don't see any disadvantages. Conversely, I see an advantage in the same-output-type requirement.
Ok, sounds good. I was thinking it might be as simple as just calling the…
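The partial-update merge discussed above could indeed be quite small. A hedged sketch (the function name is illustrative; looper's actual code may differ): a hook returns only the namespaces it touched, and only those entries get replaced:

```python
def apply_hook_result(namespaces, result):
    """Merge a hook's (possibly partial) return value into the
    namespaces: only namespaces the hook actually returned are
    replaced; null/missing entries are left untouched."""
    for name, updated in (result or {}).items():
        if updated is not None:
            namespaces[name] = updated
    return namespaces
```

A hook returning `None`, an empty dict, or just `{"compute": {...}}` would all behave sensibly under this rule.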
Just realized in writing the docs... the new… is not the same thing as…
Do we need to generalize the…

```yaml
pipeline_name: PIPELINE1
pipeline_type: sample
paths:
  main: pipelines/pipeline1.py
  hook1: pipelines/hook1.py
output_paths:
  sample_yaml_path: "{looper.output_dir}/{sample.sample_name}.yaml"
  some_other_path: "{sample.sample_name}.txt"
  sample_yaml_cwl: "{sample.sample_name}_cwl.yaml"
input_schema: https://schema.databio.org/pep/2.0.0.yaml
pre_submit:
  python_function:
    - looper.write_sample_yaml_basic
  command_template:
    - "{pipeline.paths.hook1} --sample-name {sample.sample_name} --attr 10"
command_template: >
  {pipeline.paths.main} --sample-name {sample.sample_name} --genome {sample.genome}
```

Then you can refer to those variables from within your plug-in function with…

Then a follow-up question:

```yaml
paths:
  input:
    main: pipelines/pipeline1.py
  output:
    sample_yaml_path: "{sample.sample_name}.yaml"
```

An alternative idea: should we eliminate the implied relative interpretation and simply force you to state the relative folder if you want to use it? They are already available in some namespace variables:

Yet a third idea is to just keep it how it is, where…

Thoughts?
I thought that the point of the plugin functions was, among other things, to build the path to the sample yaml and write it, since all the namespaces are available as input.
That it gives control to the pipeline runner, I suppose, rather than the pipeline author. In other words, if I'm just using a pipeline, I may want things to be put in a different output space. I shouldn't have to write code to do that; it should be configurable. That's why we originally made the sample output path configurable, right? And right now that's no longer configurable, right?
Oh I see... I considered the plugin creation the new "configuration". I think this might be a source of confusion and potential problems. As a user, I still need to go over the plugin code and make sure it uses the path I'm expecting it to (…
Yeah, that makes some sense. I guess I see the plugins as an intermediate step. They're not quite "built-in" (although the current ones are in fact built-in), but not quite as configurable as just tweaking a piface... so I don't expect there to be tons of plugins, and I figure they should be reused. The plugin author would have to say "to use this plugin, you must/may provide the following…". Essentially, we make the plugins configurable.
Actually, this is basically already how it is... the plugin has access to the… The question here is just: do we need a special category for output file attributes that are assumed relative? Or do we need to provide a special class of attributes that are populated from templates by the namespace variables? This is what we currently lack.
I think I like this option the most:

It's both least confusing to use and seems easiest to implement; we'd just need to allow the values under `paths` to be templates.
I've just implemented the approach to solve this from the comment above, and now I was able to do:

```yaml
pipeline_name: PIPELINE1
pipeline_type: sample
paths:
  main: "{looper.piface_dir}/pipelines/pipeline1.py"
  hook1: "{looper.piface_dir}/pipelines/hook1.py"
input_schema: https://schema.databio.org/pep/2.0.0.yaml
pre_submit:
  python_function:
    - looper.write_sample_yaml
  command_template:
    - "{pipeline.paths.hook1} --sample-name {sample.sample_name} --attr 10"
command_template: >
  {pipeline.paths.main} --sample-name {sample.sample_name} --genome {sample.genome}
```
nice! just curious, is it required to have those template values in quotes? |
Yes, when the value starts with curly braces. Otherwise it would be parsed as a mapping.
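This is standard YAML behavior, easy to verify with PyYAML (assuming it is installed; looper's own parsing may go through a different loader):

```python
import yaml

# Quoted: the value is a plain string, as intended.
quoted = yaml.safe_load('main: "{looper.piface_dir}/pipelines/pipeline1.py"')
assert quoted["main"] == "{looper.piface_dir}/pipelines/pipeline1.py"

# Unquoted: the leading "{" opens a YAML flow mapping, and the
# trailing "/pipelines/..." after the closing brace makes the
# document unparseable.
try:
    yaml.safe_load("main: {looper.piface_dir}/pipelines/pipeline1.py")
    parsed = True
except yaml.YAMLError:
    parsed = False
```

So quoting any value that begins with `{` is the safe habit in these pipeline interface files.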
Ok, so this is what we have. Old way:

New way:
Ok, one issue is that the…

You have to do:

How can we reconcile the need to accommodate multiple hooks with the need to make the interface to…
The syntax should become:

```yaml
pre_submit:
  python_functions:
  command_templates:
```

The functionality doesn't need to change.
@afrendeiro we've finalized the design and implemented this now. So, this will introduce some breaking changes for some of your pipelines, which will now be required to specify that they need to use that plugin in the new way. It should be as easy as adding:

The output file path can be customized using `var_templates.sample_yaml_path`. If this parameter is not provided, the file will be saved as `{looper.output_dir}/submission/{sample.sample_name}_sample.yaml`.
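As an illustration of how that default template resolves (a sketch only -- the namespace objects here are hypothetical stand-ins, and looper's actual rendering uses its own machinery), Python's `str.format` with attribute access produces the documented path:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the looper and sample namespaces.
looper = SimpleNamespace(output_dir="/project/results")
sample = SimpleNamespace(sample_name="frog_1")

template = "{looper.output_dir}/submission/{sample.sample_name}_sample.yaml"
path = template.format(looper=looper, sample=sample)
# path == "/project/results/submission/frog_1_sample.yaml"
```

Overriding `var_templates.sample_yaml_path` in the pipeline interface simply substitutes a different template into the same rendering step.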
Related to #284 and #283.
We originally wrote a sample yaml file. Now I think that format should change (#284). And, we're wanting to write a slightly tweaked version for CWL inputs (#283). What if there's other stuff we want to do with the sample? Well -- what if we instead implemented a plugin system?
looper, before job submission, would call a `pre_submit()` hook function where we currently run the `to_yaml` process. Users could write python packages that would provide functions that would be called by `pre_submit`. Plugin functions must: …

For example, a cwl plugin would be a package called `looper_cwl`, which provides a function called `add_cwl_yaml()`. Then, in the pipeline interface, you'd specify which plugin functions to run, with `pre_submit_plugin`, like this:

In this case, the cwl plugin would provide a new attribute, called `cwl_yaml`, so you can use `{sample.cwl_yaml}` in the command template as shown. This function would also write the cwl-version yaml to the file that would be specified by the `.cwl_yaml` attribute.

We could start by implementing: