How should we define what a pipeline produces? #237
Comments
The output schema could actually also subsume the corresponding section from peppro.
I think this would be a separate file from the outputs schema above. But it's interesting that we added these descriptive attributes here organically. Why wouldn't we also want these for the sample-level outputs?
See #216 for a similar, simpler proposal. Advantages of going to the schema over the simpler yaml proposal are similar to those listed above.
A generic tool could use this schema to produce sample objects with populated paths. The tool would take this schema, plus either a PEP (to provide a whole project) or a sample dict (if it's just one sample), and produce either a project or a sample object with all of the path attributes populated. This simple tool could be implemented in both R and Python. Our pipelines could use this tool at the sample level to instantiate sample objects, then use the attribute names instead of encoding any paths; perego would use this tool (at the project level) to get both sample files and project files for display.
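A minimal Python sketch of the kind of generic tool described here, assuming an output schema that attaches a path template to each output attribute (the function name, schema layout, and attribute names are all hypothetical):

```python
from typing import Any, Dict


def populate_sample_paths(schema: Dict[str, Any], sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the sample dict with output path attributes added.

    For every property in the schema that carries a 'path' template, fill the
    template with the sample's existing attributes and store the result under
    that property's name.
    """
    populated = dict(sample)  # don't mutate the caller's dict
    for attr, spec in schema.get("properties", {}).items():
        template = spec.get("path")
        if template is None:
            continue  # ordinary validation-only property; nothing to populate
        populated[attr] = template.format(**sample)
    return populated


# Hypothetical schema fragment: each property names an output and its path template.
output_schema = {
    "properties": {
        "smooth_bw": {"type": "string", "path": "results/{sample_name}/{sample_name}_smooth.bw"},
        "peaks_bed": {"type": "string", "path": "results/{sample_name}/{sample_name}_peaks.bed"},
    }
}

sample = {"sample_name": "frog_1", "genome": "frg123"}
print(populate_sample_paths(output_schema, sample))
```

A project-level wrapper would just loop this over every sample in a PEP and fill in any project-level paths the same way.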
All for building in the additional flexibility now. I usually find myself glad later on for some up-front investment. For me, the future desire for that flexibility behaves in a way similar to Hofstadter's law for time required.
Re: Advantages of output schema
Re: disadvantages
Does this now subsume all functionality previously accomplished by the sample-level outputs?
Yes, I believe it does.
How should we define what a pipeline produces?
Related to: #32, #61, #94, #201, #216
Originally, the inputs were specified in the pipeline interface; now we've switched to a schema, which is used to validate a PEP for input into a pipeline. Here's an example: https://schema.databio.org/pipelines/pepatac.yaml
The schema is superior because it's decoupled from looper and can now be used with eido to validate the PEP. It's also reusable across pipelines. So, what about outputs? Right now, the pipeline specifies what outputs it produces in the pipeline interface, like so:
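For illustration, a hypothetical outputs section in that pipeline-interface style (the keys and path templates are made up, not taken from an actual pipeline interface), shown here parsed with PyYAML:

```python
import yaml  # PyYAML, assumed available

# Hypothetical pipeline interface fragment: a flat mapping from an output key
# to a path template that gets filled in with sample attributes.
piface = yaml.safe_load("""
outputs:
  smooth_bw: "results/{sample_name}/{sample_name}_smooth.bw"
  peaks_bed: "results/{sample_name}/{sample_name}_peaks.bed"
""")
print(piface["outputs"]["smooth_bw"])
```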
This is very similar to what we used to do with required attributes for input. Is there therefore similar value in abstracting this concept out to a schema of some sort? What would it look like?
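A minimal sketch of what such an output schema could look like, keeping the standard jsonschema keywords and adding a custom path term per property (all names, descriptions, and templates here are illustrative, not taken from the actual pepatac schema):

```python
import yaml  # PyYAML, assumed available

# Hypothetical output schema: ordinary jsonschema structure, plus a custom
# 'path' term that a populating tool (not the validator) would interpret.
output_schema = yaml.safe_load("""
description: Outputs produced by a hypothetical ATAC-seq pipeline
properties:
  smooth_bw:
    type: string
    description: Smoothed signal track
    path: "results/{sample_name}/{sample_name}_smooth.bw"
  peaks_bed:
    type: string
    description: Called peaks
    path: "results/{sample_name}/{sample_name}_peaks.bed"
""")
```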
Here, we extend the jsonschema vocabulary with a new term called 'path'. This is not used to validate objects at all; it's actually just used internally to populate them (which is our extension). It's exactly what we're already doing with outputs. So, what's the advantage of switching to a schema like this? (For one thing, it could subsume the summary_results section.) I'm not totally sold on this but want to throw it out there for comments.
Disadvantages:
It only adds utility in the case of someone visualizing output with looper summarize/perago.
It adds utility for visualization, and also potentially for validating at a collate step.
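As a sketch of that collate-step idea, the same kind of schema could drive a simple completeness check, verifying that every declared output actually exists once a sample has finished (the function and schema layout are hypothetical, matching the sketches above):

```python
import os
from typing import Any, Dict, List


def missing_outputs(schema: Dict[str, Any], sample: Dict[str, Any]) -> List[str]:
    """Return the declared output paths that do not exist on disk for this sample."""
    missing = []
    for attr, spec in schema.get("properties", {}).items():
        template = spec.get("path")
        if template is None:
            continue  # property has no path template; nothing to check
        path = template.format(**sample)
        if not os.path.exists(path):
            missing.append(path)
    return missing


# At a collate step, e.g.:
#   for s in samples:
#       assert not missing_outputs(output_schema, s), f"incomplete sample: {s['sample_name']}"
```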