pipeline interface specification #244

Closed
5 of 7 tasks
nsheff opened this issue Apr 6, 2020 · 9 comments

Contributor

nsheff commented Apr 6, 2020

We discussed the final updates to complete the new pipeline interface format. The new format contains exactly 1 pipeline per interface, which can contain two components, sample_pipeline (formerly pipelines:...) and project_pipeline (formerly collators:...). These new sections have only 1 pipeline, so we drop the pipeline key that was used in protocol mappings.

Here is the list of updates:

  • Remove protocol mappings and collator mappings.
  • The bioconductor section is external to this PI spec and should therefore be its own section.
  • Collapse demo command template args onto a single line.
  • Rename schema to input_schema.
  • Eliminate the name key and move it to a top-level attribute called pipeline_name (or just name?).
  • Remove the old compute resources functionality.
  • Define a jsonschema schema that can validate these pipeline interface files.

example:

pipeline_name: pepatac
sample_pipeline:
  command_template: ...
  ...
project_pipeline:
  command_template: ...
  ...
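
A rough sketch of how the layout above might be checked until a real jsonschema schema exists. It uses plain Python rather than a schema library. The function name check_piface and the rule that at least one of the two sections must be present are assumptions for illustration, not part of the spec.

```python
def check_piface(piface):
    """Return a list of problems found in a parsed pipeline interface (a dict)."""
    problems = []
    # Top-level pipeline_name replaces the old per-pipeline name key.
    if not isinstance(piface.get("pipeline_name"), str):
        problems.append("pipeline_name must be a string")
    # Assumption: an interface needs at least one of the two sections.
    sections = [s for s in ("sample_pipeline", "project_pipeline") if s in piface]
    if not sections:
        problems.append("need sample_pipeline and/or project_pipeline")
    for section in sections:
        body = piface[section]
        if not isinstance(body, dict) or "command_template" not in body:
            problems.append(f"{section} needs a command_template")
    return problems
```

An empty list means the file has the expected shape; anything else is a human-readable list of what is missing.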

looper config

The protocol mappings (and collator mappings) functionality will be moved into a new looper configuration file. This file operates like a refgenie configuration file, which can be passed by looper -c config.yaml or via the $LOOPER env var. It is optional. The advantage of moving the mappings to a separate file is that they behave as a connector between projects and pipelines, which seems more appropriate to be configured by the user running the analysis than by the pipeline author (or the project author). I am still thinking about the format for the looper config mappings.
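
A minimal sketch of how the optional config could be located. The function name select_looper_config is hypothetical, and the precedence (the -c argument wins over $LOOPER) is an assumption; the comment above only says both routes exist.

```python
import os


def select_looper_config(cli_path=None, env=None):
    """Pick the looper config path, or None when no config is given.

    Assumed precedence: explicit -c argument first, then the LOOPER env var.
    """
    env = os.environ if env is None else env
    return cli_path or env.get("LOOPER") or None
```

Returning None rather than raising keeps the config optional, as described.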

@stolarczyk stolarczyk self-assigned this Apr 6, 2020
stolarczyk added a commit to databio/schema.databio.org that referenced this issue Apr 7, 2020
Member

stolarczyk commented Apr 7, 2020

Needed a sample looper config to build the LooperConfig class around and went with this for now:

protocol_mapping:
  bedstat: $CODE/bedstat/pipeline_interface_newer.yaml
  bedstat1: /Users/mstolarczyk/code/bedstat/pipeline_interface_newer.yaml

Contributor Author

nsheff commented Apr 7, 2020

The looper config spec gets a bit complicated. I had this as an idea:

sample_pipeline_mappings:
  mapping_attribute: "protocol"
  mappings:
    PRO-seq: relative/to/peppro/pipeline_interface.yaml
    atac-seq: relative/to/pepatac/pipeline_interface.yaml

But can it allow multiples?

sample_pipeline_mappings:
  mapping_attribute: "protocol"
  mappings:
    PRO-seq: [/path/to/peppro/pipeline_interface.yaml, /pipeline.yaml]
    atac-seq: /path/to/pepatac/pipeline_interface.yaml

Would be nice if we could list lots of keys that have the same behavior all together. Also, what if protocol is different for different projects/pipelines? So maybe we want:

sample_pipeline_mappings:
  - mapping_attribute: "protocol"
    keys: [PRO-seq, pro-seq, GRO-seq, gro-seq]
    interfaces:  [path/to/peppro/pipeline_interface.yaml, someone/elses/pipeline.yaml]
  - mapping_attribute: "protocol"
    keys: [ATAC, ...]
    interfaces: relative/path/to/pepatac/pipeline_interface.yaml

Would we want those keyed?

sample_pipeline_mappings:
  runon:
    mapping_attribute: "protocol"
    keys: [PRO-seq, pro-seq, GRO-seq, gro-seq]
    interfaces:  [path/to/peppro/pipeline_interface.yaml, someone/elses/pipeline.yaml]
  accessibility:
    mapping_attribute: "protocol"
    keys: [ATAC, ...]
    interfaces: path/to/pepatac/pipeline_interface.yaml

No real utility to those keys, I guess, just makes the thing easier to understand perhaps?

Would we want to filter on more than just 1 attribute? Maybe use the implied attribute conditional ideas?

sample_pipeline_mappings:
  runon:
    interfaces:  [path/to/peppro/pipeline_interface.yaml, someone/elses/pipeline.yaml]
    if: 
      "protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]
      "species": "human"

or maybe

sample_pipeline_mappings:
  runon:
    interfaces:  [path/to/peppro/pipeline_interface.yaml, someone/elses/pipeline.yaml]
    conditions: 
      "protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]
      "species": "human"

Contributor Author

nsheff commented Apr 7, 2020

generalizing this, isn't this basically just a case of implied_attributes, except it spans projects? perhaps then, this could be generalized into some kind of attribute imply functionality in peppy (or a third-party plugin) that is global. So, how could this work... if you had a generic project that looked like:

...
sample_modifiers:
  imply:
    if:
      protocol: [x,y,z]
    then:
      pipeline_interface: [p1, p2, p3]

Then, you'd just import this project in every one of your projects, and voila, you've accomplished the global protocol mappings. This doesn't quite work as is, because the pipeline interface is a project attribute, not a sample attribute. But in the end, isn't this happening at the sample level anyway? Is the 'pipeline interface' really even a project-level attribute at all? it seems it only is for the project-level analysis. Looper could, in theory, look for the pipeline_interface attribute at the sample level instead of at the project level, at least for sample processing.
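
To make the if/then idea concrete, here is a small sketch of how an imply rule could be applied to a single sample. It is not peppy's implementation; the function name imply and the dict-based sample are assumptions for illustration.

```python
def imply(sample, rules):
    """Return a copy of sample with attributes added by every matching rule."""
    result = dict(sample)
    for rule in rules:
        # Every attribute under "if" must match; a list means "any of these".
        matched = all(
            result.get(attr) in (vals if isinstance(vals, list) else [vals])
            for attr, vals in rule["if"].items()
        )
        if matched:
            result.update(rule["then"])
    return result
```

With the rule above, a sample whose protocol is y would gain pipeline_interface: [p1, p2, p3], while a sample with any other protocol is returned unchanged.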

project-level stuff

Which leads to another point: there's some murkiness for me still in the sample-level mappings vs. the project-level mappings. How would this system deal with project_pipeline_mappings?

if project pipelines used a pipeline_interface attribute at project level, then perhaps we could use the idea we had previously: a 'promoted_attributes' section of the project that contains sample-level attributes that are constant across the project. But what would you do for projects where each sample wants a different pipeline? Well, these projects are less likely to make use of a collator anyway.

Member

stolarczyk commented Apr 8, 2020

> But in the end, isn't this happening at the sample level anyway? Is the 'pipeline interface' really even a project-level attribute at all? it seems it only is for the project-level analysis. Looper could, in theory, look for the pipeline_interface attribute at the sample level instead of at the project level, at least for sample processing.

I believe our initial reasoning was to make peppy as simple as possible; putting pipeline_interface in the looper section of the config was an explicit indication that this is outside of the PEP spec and that peppy is unaware of the looper connection.

But if we wanted to break that rule, what you're saying is really straightforward and clear: look for the sample-level pipeline interface in Sample.pipeline_interface and use it, and similarly for the project level in Project.pipeline_interface. In the sample case we neatly reuse the powerful system of sample modifiers (imply, in most cases) to set the paths to pipeline interfaces. Obviously, this is not possible at the project level, but I think pipeline interface selection is not going to be that complex there, so it could just be set explicitly in the config.
This shortcoming could be resolved by the "promoted attributes" idea. We would need to un-overload the attribute and look for interfaces in Sample.sample_pipeline_interface and Project.project_pipeline_interface, though. Generally, I'm not a fan of this promoting concept, as I expect it would happen without the user's intent in 90% of cases... And the append modifier concept would get somewhat unclear.

In case we don't want to do that, I like the config format pasted below the most.

sample_pipeline_mappings:
  - mapping_attribute: "protocol"
    keys: [PRO-seq, pro-seq, GRO-seq, gro-seq]
    interfaces:  [path/to/peppro/pipeline_interface.yaml, someone/elses/pipeline.yaml]
  - mapping_attribute: "protocol"
    keys: [ATAC, ...]
    interfaces: relative/path/to/pepatac/pipeline_interface.yaml

Contributor Author

nsheff commented Apr 8, 2020

> make peppy as simple as possible and putting pipeline_interface in the looper section of the config was an explicit indication that this is outside of the PEP spec and that peppy is unaware of looper connection.

are you referring to the idea of project-spanning implied attributes? peppy would still be unaware of the looper connection, so I don't believe this would violate that principle -- I don't really see where you're proposing this connection is happening. this would be no different than it currently is: looper looks for looper-specific settings in the PEP, independent of peppy. I'm just saying that the looper config idea can almost already be solved with existing peppy functionality, and we wouldn't even need to code separate functionality for a looper config. with a small improvement to peppy functionality it might actually be possible in practice. I'll think a bit more about this.

> Generally, I'm not a fan of this promoting concept, as I expect this would happen without user's intent in 90% of cases

I'm not sold on it either. but I don't think it would promote to Project.project_pipeline_interface... peppy would automatically add: Project.promoted_attributes.x ... for all sample attributes. So I don't think it would cause any problems with any user-specified project stuff. it would just be available for use by any tools that wanted an automatic project-level summary of samples, for cases where it's simple to do so. like looper wants... to be clear: I think promoted_attributes could be a new peppy feature that does not tie it to looper at all, it could be used for anything.
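
A sketch of what computing promoted attributes might look like: keep only the sample attributes that every sample has with the same value. The function name and the list-of-dicts representation are assumptions; nothing like this exists in peppy yet.

```python
def promoted_attributes(samples):
    """Return attributes whose value is identical across all samples."""
    if not samples:
        return {}
    first, rest = samples[0], samples[1:]
    return {
        attr: value
        for attr, value in first.items()
        # Keep the attribute only if every other sample agrees on it.
        if all(attr in s and s[attr] == value for s in rest)
    }
```

An attribute that varies between samples, such as sample_name, simply does not appear in the result, so nothing user-specified at the project level is touched.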

> In case we don't want to do that, I like the config format pasted below the most.

I like this one, too. it has the weakness of only allowing 1 attribute to filter on. is that ok? probably.

but I also really like this construction, which is less verbose:

"protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]

not sure how to fit that in. how about this hybrid?

sample_pipeline_mappings:
  - interfaces:  [path/to/peppro/pipeline_interface.yaml, someone/elses/pipeline.yaml]
    filters:
      "protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]
  - interfaces: relative/path/to/pepatac/pipeline_interface.yaml
    filters:
      "protocol": [ATAC, ...]

it's both more concise and more powerful...
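
A sketch of how the hybrid format could be evaluated for one sample. It assumes that all attributes under filters must match (AND), that a list of values means any of them (OR), and that interfaces may be a single path or a list. The helper names are hypothetical.

```python
def as_list(value):
    """Wrap a scalar in a list; copy lists and tuples."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def interfaces_for_sample(sample, mappings):
    """Collect interface paths from every mapping entry the sample satisfies."""
    selected = []
    for entry in mappings:
        filters = entry.get("filters", {})
        # AND across attributes, OR within each attribute's value list.
        if all(sample.get(attr) in as_list(vals) for attr, vals in filters.items()):
            for path in as_list(entry["interfaces"]):
                if path not in selected:  # keep order, drop duplicates
                    selected.append(path)
    return selected
```

Filtering on a second attribute, such as species, needs no format change under these assumptions; it is just another key under filters.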

Member

stolarczyk commented Apr 8, 2020

> I'm not sold on it either. but I don't think it would promote to Project.project_pipeline_interface... peppy would automatically add: Project.promoted_attributes.x ... for all sample attributes

Alright, that's great. I'd say that's the way to go then, if we can spare another configuration file (looper) and not limit the flexibility.

Does that mean that for sample pipelines we look for the pipeline interface location in Sample.pipeline_interface and for project pipelines in Project.promoted_attributes.pipeline_interface, but select the project-level pipeline within this file? What if Sample.pipeline_interface is not constant across all the samples and therefore there's no promoted attribute? Do we look in Project.looper.pipeline_interface as well? This location should be prioritized anyway (?)

Contributor Author

nsheff commented Apr 8, 2020

I think I have made some conceptual progress...

One difficulty is that removing protocol_mappings from the piface sort of complicated a looper need. looper needs 2 pieces of information: 1) for a project, which pifaces to consider; then, 2) for each sample in that project, which of the pifaces (formerly pipelines) to use for it. So, we need to provide both of these pieces of info.

We do the first with looper.pipeline_interfaces in the project, as before. The second, without protocol_mappings, we discussed moving this into a looper config or something. But here's another way that doesn't require a looper config: What if the PEP provides both?

So I could add this to my PEP:

looper:
  pipeline_interfaces: [x.yaml, y.yaml, z.yaml]
  protocol_mappings:
    - interfaces:  [x.yaml]
      filters:
        "protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]
    - interfaces: y.yaml
      filters:
        "protocol": [ATAC, ...]

Ok. Looper has all the information it needs -- the pifaces to consider, and how to match which samples to which pipelines. But this gets annoying to put protocol_mappings in every PEP... especially because they often are constant across projects, which is why we originally had them in the pipeline interface. Hence the idea of the looper config. But what if we used an imported PEP to supply these globally instead of a looper config?

Making a global protocol mappings

So I'd make a project with just the mappings, name it path/to/global/protocol_mappings.yaml,

looper:
  protocol_mappings:
    - interfaces:  [x.yaml]
      filters:
        "protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]
    - interfaces: y.yaml
      filters:
        "protocol": [ATAC, ...]

and then say in our PEPs:

name: My ATAC-seq project
pep_type: ATAC-seq
pep_config: 2.0.0
imports:
  - path/to/global/protocol_mappings.yaml
looper:
  pipeline_interfaces: [x.yaml, y.yaml, z.yaml]

Now we have a global mappings, and no looper config needed. A nice thing about this is it makes it clearer how projects can import whatever mappings they want... so you could have global_mappings_v1.yaml, and then test_mappings.yaml, and change the import (even with an amendment), to test something else. no global config used.
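
A sketch of how an imported PEP and the importing PEP might be combined. The precedence (the importing project's own keys win, nested sections are merged key by key) is an assumption made for illustration, as is the function name merge_configs.

```python
def merge_configs(base, override):
    """Recursively merge two config dicts; values in override take priority."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides define this section: merge it key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Merging the global mappings file with the project above would yield a looper section containing both protocol_mappings (from the import) and pipeline_interfaces (from the project).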

But wait, there's more! Can we also move the pipeline interfaces globally, so I can just say "ATAC-seq projects automatically use these pipelines"?

Making pipeline interfaces global with implied_project_attributes

Ok, here's the idea fleshed out a bit more. Say peppy offered a project_modifiers.imply, which is exactly the same as the sample_modifiers version but operates on a project. Then, I could do:

name: global looper pipeline interfaces
pep_config: 2.0.0
project_modifiers:
  imply:
    if:
      pep_type: [ATAC, ATAC-seq, DNAse, blah, blah, blah]
    then:
      looper.pipeline_interfaces: [x.yaml, y.yaml, z.yaml]

Here, I put this under imply because the interfaces change depending on the project.
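
The then clause uses a dotted key, looper.pipeline_interfaces. One way a project-level imply could interpret that is as a path into the nested project config. This interpretation and the helper name set_dotted are assumptions, sketched here:

```python
def set_dotted(config, dotted_key, value):
    """Set config["a"]["b"] = value for dotted_key "a.b", creating sections."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(part, {})
    node[leaf] = value
    return config
```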

Now in my project, I do:

name: My ATAC-seq project
pep_type: ATAC-seq
pep_config: 2.0.0
imports:
  - /path/to/looper_pipeline_interfaces.yaml

And I just add this imports directive to each of my PEPs.

Putting them together

Both of these can work simultaneously. Here's a combined global_settings_pep.yaml, which does both:

name: global looper interfaces and protocol mappings PEP
pep_config: 2.0.0
looper:
  protocol_mappings:
    - interfaces:  [x.yaml]
      filters:
        "protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]
    - interfaces: y.yaml
      filters:
        "protocol": [ATAC, ...]
project_modifiers:
  imply:
    if:
      pep_type: [ATAC, ATAC-seq, DNAse, blah, blah, blah]
    then:
      looper.pipeline_interfaces: [x.yaml, y.yaml, z.yaml]

Now my project PEP says:

name: My ATAC-seq project
pep_type: ATAC-seq
pep_config: 2.0.0
imports:
  - /path/to/global_settings_pep.yaml

It automatically inherits the correct looper.pipeline_interfaces for this project type, and also the correct protocol mappings.

It's kind of interesting that pipeline_interfaces gets implied, because it depends on the project type, but protocol_mappings doesn't -- it would be possible to put protocol_mappings under an imply as well, if they were different depending on the project. But I guess I figure the mappings are probably pretty universal... Similarly, it would be possible to just put the pipeline_interfaces global, and then let the protocol mappings sort out which actual pipelines get run on which samples.

Now, for a thought experiment, imagine we didn't do project_modifiers.imply and just stuck protocol_mappings and pipeline_interfaces in a global PEP that we imported into all our PEPs. this would kind of already work... but it would really require that every sample have a 'protocol' and that every pipeline was correctly mapped. so, pipelines like sra_convert that just run on everything wouldn't fit into this. therefore, it's kind of important that we be able to map individual interfaces to individual projects. But the imply capability for interfaces is really a separate feature; it's independent of the original problem we were trying to solve with protocol_mappings. I guess I'm saying that problem is basically already solved by PEP imports.

what about promoted attributes?

the promoted attributes idea is not required for this, it's independent. if we also implemented promoted attributes (which I'm not really sure about yet), then it could eliminate the need to put pep_type: ATAC-seq, and you could imply based on project.promoted_attributes.protocol instead. I'm not sure that's worth implementing.

final thoughts

what's cool about this is that: nothing is hard coded or required for looper or peppy. someone else could do something totally different. this is just a system that we can build within the looper/peppy framework because it's so powerful and modular.

Contributor Author

nsheff commented Apr 8, 2020

Related thought:

> looper needs 2 pieces of information: 1) for a project, which pifaces to consider; then, 2) for each sample in that project, which of those pifaces (formerly pipelines) to use for it. So, we need to provide both of these pieces of info.

Before, it made sense to have the mappings because a piface could have >1 pipeline. Now, it's almost starting to feel redundant... since we're mapping protocols to pifaces instead of pipelines, do we really need to map the project to a piface, too? So, could we eliminate pipeline_interfaces and just have protocol_mappings?

The answer is: we could have... except we implemented the collators. So now, the protocol mappings suffices for what we were previously using pipeline_interfaces for, but with collators, we introduced a new need for them.

BUT... can we infer the collator need from the protocol mappings? Probably. So...we would:

  1. Remove looper.pipeline_interfaces in favor of looper.protocol_mappings (which now points to interface files instead of pipelines).
  2. Rename protocol_mappings: since the attribute to filter on is now flexible (no longer just protocol), it should be called something else. maybe pipeline_interfaces ? 😄 or pipeline_mapping ? piface_mapping ? interface_mapping ?
  3. Implement a default for no mappings required.
  4. when running a project_pipeline, with runp, we would simply run every collator for any mapping that has at least 1 sample in the project. Now, no pipeline_interfaces section required.

complex:

looper:
  interface_mapping:
    - interfaces:  [x.yaml]
      filters:
        "protocol": [PRO-seq, pro-seq, GRO-seq, gro-seq]
    - interfaces: y.yaml
      filters:
        "protocol": [ATAC, ...]

simple:

looper:
  interface_mapping:
    - interfaces:  [x.yaml]

no filters/conditions required!

maybe even allow a collapsed version, which becomes exactly what pipeline_interfaces was? simplest:

looper:
  interface_mapping: [x.yaml]
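
If all three spellings were allowed, looper would probably want to normalize them to one internal shape first. A sketch, assuming the complex form (a list of entries with interfaces and optional filters) is the canonical one; the function name is hypothetical.

```python
def normalize_interface_mapping(mapping):
    """Expand the 'simple' and 'simplest' forms into the 'complex' form."""
    entries = []
    for item in mapping:
        if isinstance(item, str):
            # Simplest form: a bare path means "no filters, one interface".
            item = {"interfaces": [item]}
        interfaces = item["interfaces"]
        if isinstance(interfaces, str):
            interfaces = [interfaces]
        entries.append(
            {"interfaces": list(interfaces), "filters": dict(item.get("filters", {}))}
        )
    return entries
```

An empty filters dict then naturally means "applies to every sample", which covers the default for cases where no mappings are required.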

Member

stolarczyk commented Apr 8, 2020

Sample-level pipelines selection

Eventually, we decided to look for sample-level pipelines in Sample.pipeline_interfaces. This is a list of sources, which can be both paths and URLs. This allows us to utilize the power of sample_modifiers to accomplish complex pipeline-sample mappings and still keep it very simple if the complexity is not required. See the examples pasted below:

pipeline_interfaces column already exists in the CSV

pep_version: 2.0.0
sample_table: path/to/samples.csv 

pipeline_interfaces attribute is added to every sample

pep_version: 2.0.0
sample_table: path/to/samples.csv

sample_modifiers:
  append:
    pipeline_interfaces: "test.yaml"

extreme case, where we use a remote pipeline interface based on a Sample.protocol value:

pep_version: 2.0.0
sample_table: path/to/samples.csv

sample_modifiers:
  imply:
    - if:
        protocol: [PRO-seq, pro-seq, GRO-seq, gro-seq] # OR
      then:
        pipeline: peppro
  append:
    pipeline_interfaces: s # every sample points at the derive source key "s"
  derive:
    attributes: [pipeline_interfaces]
    sources: 
      s: [https://piface.databio.org/{pipeline}.yaml, https://piface.databio.org/{pipeline}1.yaml]
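
The derive step above amounts to filling the {pipeline} placeholder from the sample's own attributes. A sketch of that substitution, with a hypothetical function name and a plain dict standing in for the sample; whether a derived attribute may expand to a list of sources is itself part of the proposal rather than confirmed behavior.

```python
def derive_interfaces(sample, templates):
    """Fill {attribute} placeholders in each source template from the sample."""
    return [template.format_map(sample) for template in templates]
```

For a sample where imply has set pipeline to peppro, the two templates expand to https://piface.databio.org/peppro.yaml and https://piface.databio.org/peppro1.yaml.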

Project-level pipelines selection

Project-level pipelines are selected based on the sources that were mapped to the samples within the project (but the project_pipeline section within each file is used), or overridden by Project.looper.pipeline_interfaces, which is also the only way to set a project-level pipeline interface for a project with no samples. Like this:

pep_version: 2.0.0
...
looper:
  pipeline_interfaces: [/path/to/pro_piface.yaml, https://piface.databio.org/peppro.yaml]
  pipeline_interfaces_key: pipeline_interfaces

The special key we use in looper could be configured in Project.looper.pipeline_interfaces_key, or in a looper config in the future.

Note: default value for pipeline_interfaces_key up for debate
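
A sketch of the project-level selection rule described above: an explicit Project.looper.pipeline_interfaces wins, otherwise the sources are collected from the samples under the configured key. The function name, the dict-based project and samples, and the default key value are assumptions (the default is still up for debate).

```python
def project_interfaces(project, samples, default_key="pipeline_interfaces"):
    """Return the interface sources to use for project-level pipelines."""
    looper = project.get("looper", {})
    # An explicit project-level setting overrides anything from the samples.
    if looper.get("pipeline_interfaces"):
        return list(looper["pipeline_interfaces"])
    key = looper.get("pipeline_interfaces_key", default_key)
    sources = []
    for sample in samples:
        value = sample.get(key, [])
        if isinstance(value, str):
            value = [value]
        for source in value:
            if source not in sources:  # keep order, drop duplicates
                sources.append(source)
    return sources
```

A project with no samples and no explicit setting yields an empty list, which matches the point that the override is the only way to run a project pipeline in that case.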
