
How should we define what a pipeline produces? #237

Closed · nsheff opened this issue Mar 19, 2020 · 7 comments

nsheff commented Mar 19, 2020

Related to: #32, #61, #94, #201, #216

How should we define what a pipeline produces?

Originally, the inputs were specified in the pipeline interface; we have since switched to a schema, which is used to validate a PEP as input to a pipeline. Here's an example: https://schema.databio.org/pipelines/pepatac.yaml

description: A PEP for ATAC-seq samples for the PEPATAC pipeline.
imports: http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name: 
          type: string
          description: "Name of the sample"
        organism: 
          type: string
          description: "Organism"
        protocol: 
          type: string
          description: "Must be an ATAC-seq or DNAse-seq sample"
          enum: ["ATAC", "ATAC-SEQ", "ATAC-seq", "DNase", "DNase-seq"]
        genome:
          type: string
          description: "Refgenie genome registry identifier"
        read_type:
          type: string
          description: "Is this single or paired-end data?"
          enum: ["SINGLE", "PAIRED"]
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_input_attrs:
        - read1
      input_attrs:
        - read1
        - read2
      required:
        - sample_name
        - protocol
        - read1
        - genome
required:
  - samples

The schema is superior because it's decoupled from looper: it can now be used with eido to validate the PEP, and it's reusable across pipelines.
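
To make this concrete, here is a minimal sketch of that validation step in Python. It assumes eido's validate_project and peppy's Project APIs; the PEP config filename is hypothetical:

import eido
import peppy

# Load a PEP; "pep_config.yaml" is a hypothetical project config path.
prj = peppy.Project("pep_config.yaml")

# Validate against the pipeline's input schema; raises a jsonschema
# ValidationError if the PEP does not satisfy the schema.
eido.validate_project(prj, "https://schema.databio.org/pipelines/pepatac.yaml")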

So, what about outputs? Right now, the pipeline specifies the outputs it produces in the pipeline interface, like so:

pipelines:
  pepatac.py:
    name: PEPATAC
    path: pipelines/pepatac.py
    schema: https://schema.databio.org/pipelines/pepatac.yaml
    ...
    outputs:
      smooth_bw: "aligned_{sample.genome}/{sample.sample_name}_smooth.bw"
      exact_bw:  "aligned_{sample.genome}_exact/{sample.sample_name}_exact.bw"
      aligned_bam: "aligned_{sample.genome}/{sample.sample_name}_sort.bam"

This is very similar to what we used to do with required attributes for input. Is there, then, similar value in abstracting this concept into a schema of some sort? What would it look like?

description: Sample objects produced by the PEPATAC pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        smooth_bw: 
          type: string
          description: "Smoothed bigwig output file"
          path: "aligned_{sample.genome}/{sample.sample_name}_smooth.bw"

Here, we extend the jsonschema vocabulary with a new term called 'path'. This term is not used to validate objects at all; it's just used internally to populate them (that's our extension). It's exactly what we're already doing with outputs. So, what's the advantage of switching to a schema like this?

  1. this schema could then also be used to validate the output of the pipeline, building on json-schema (see the sketch at the end of this comment)
  2. the input schema and output schema are parallel
  3. it adds some self-documentation to what these items are, which would then be useful later, for example, in the looper summarize/perego output
  4. it adds the possibility of project-level outputs, not just sample-level outputs (which, in fact, I just realized is exactly the summary_results section)
  5. we would refer to it as an external file, which decouples it from the pipeline interface, so it could in theory be re-used (though this would also be satisfied by a simple file that listed outputs as key-value pairs)
  6. (see below) it standardizes the outputs and summary results into one format

I'm not totally sold on this but want to throw it out there for comments.

Disadvantages:

  1. it's more complex than a simple key-value pair for outputs
  2. it only adds utility when someone is visualizing output with looper summarize/perego, and potentially when validating at a collate step
  3. it requires some new tooling, not yet entirely specified, to realize the validation advantages
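
To sketch advantage 1: standard JSON Schema validators ignore unrecognized keywords, so our 'path' extension wouldn't interfere with validation, and the same file could serve both population and validation. A minimal, hypothetical example (the schema filename and the sample dict are made up):

import yaml
from jsonschema import validate

# Load the output schema; "output_schema.yaml" is a hypothetical filename.
with open("output_schema.yaml") as f:
    schema = yaml.safe_load(f)

# A sample object as the pipeline might produce it (hypothetical values).
produced = {"samples": [{"smooth_bw": "aligned_hg38/frog_1_smooth.bw"}]}

# The validator ignores the custom 'path' keyword; this raises a
# ValidationError if the produced object doesn't match the schema.
validate(instance=produced, schema=schema)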

nsheff commented Mar 19, 2020

The output schema could actually also subsume the summary_results section, which has, in fact, already reproduced almost everything in it:

from peppro:

    summary_results:
      - library_complexity_file:
        caption: "Library complexity file"
        description: "Plots each sample's library complexity on a single plot."
        thumbnail_path: "summary/{name}_libComplexity.png"
        path: "summary/{name}_libComplexity.pdf"
      - counts_table:
        caption: "Gene counts table"
        description: "Combines all sample gene count files into a project level gene counts table."
        thumbnail_path: "summary/{name}_countData.csv"
        path: "summary/{name}_countData.csv"

caption actually has an official term in jsonschema, called title. So in an output schema, this could look very similar:

description: Summary results produced by the PEPATAC pipeline.
properties:
  library_complexity_file:
    title: "Library complexity file"
    type: image (?)
    description: "Plots each sample's library complexity on a single plot."
    thumbnail_path: "summary/{name}_libComplexity.png"
    path: "summary/{name}_libComplexity.pdf"
  counts_table:
    title: "Gene counts table"
    type: string
    description: "Combines all sample gene count files into a project level gene counts table."
    thumbnail_path: "summary/{name}_countData.csv"
    path: "summary/{name}_countData.csv"

I think this would be a separate file from the outputs schema above. But it's interesting that we added these descriptive attributes here organically. Why wouldn't we also want these for the sample-level outputs? We should make them match.
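
As a rough sketch of the self-documentation payoff, a reporting tool (looper summarize, perego) could iterate the schema's properties and render one report entry per item from title, description, and the path fields. Everything here is assumed for illustration, not an existing API:

import yaml

# "summary_schema.yaml" is a hypothetical filename for the schema above.
with open("summary_schema.yaml") as f:
    schema = yaml.safe_load(f)

for key, spec in schema["properties"].items():
    # 'title' and 'description' document the item; the path templates use
    # format-style placeholders like {name}.
    print(f"{spec['title']} ({key}): {spec['description']}")
    print("  file:      " + spec["path"].format(name="myproject"))
    print("  thumbnail: " + spec["thumbnail_path"].format(name="myproject"))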


nsheff commented Mar 19, 2020

See #216 for a similar, simpler proposal. Advantages of going to the schema over the simpler yaml proposal are similar to those listed above:

  1. schema could be used to validate the output of the pipeline
  2. the input schema and output schema are parallel
  3. more info is possible (description, title, type, etc.), which adds some self-documentation to what these items are; that would be useful later, for example, in the looper summarize/perego output.


nsheff commented Mar 19, 2020

A generic tool could use this schema to produce sample objects with populated paths. The tool would take this schema, plus either a PEP (to provide a whole project), or a sample dict (if it's just one sample), and produce either a project or a sample object with all of the path attributes populated. This simple tool could be implemented in both R and Python.
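
Here's a minimal sketch of what the core of such a tool could look like in Python, assuming format-style 'path' templates as in the output schema above; the function name and inlined schema fragment are illustrative, not an existing API:

from types import SimpleNamespace

def populate_sample_paths(sample: dict, schema: dict) -> dict:
    # Return a copy of `sample` with one new attribute per schema property
    # that carries a 'path' template, rendered against the sample's attributes.
    props = schema["properties"]["samples"]["items"]["properties"]
    populated = dict(sample)
    ns = SimpleNamespace(**sample)  # enables {sample.attr} lookups in templates
    for attr, spec in props.items():
        if "path" in spec:
            populated[attr] = spec["path"].format(sample=ns)
    return populated

# Hypothetical inlined schema fragment mirroring the example above.
schema = {"properties": {"samples": {"items": {"properties": {
    "smooth_bw": {
        "type": "string",
        "description": "Smoothed bigwig output file",
        "path": "aligned_{sample.genome}/{sample.sample_name}_smooth.bw",
    }}}}}}

sample = {"sample_name": "frog_1", "genome": "hg38"}
print(populate_sample_paths(sample, schema))
# {'sample_name': 'frog_1', 'genome': 'hg38',
#  'smooth_bw': 'aligned_hg38/frog_1_smooth.bw'}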

Our pipelines could use this tool at the sample level to instantiate sample objects, then refer to outputs by attribute name instead of encoding any paths.

perego would use this tool (at the project level) to get both sample files and project files for display.


vreuter commented Mar 24, 2020

> See #216 for a similar, simpler proposal. Advantages of going to the schema over the simpler yaml proposal are similar to those listed above:
>
> 1. schema could be used to validate the output of the pipeline
>
> 2. the input schema and output schema are parallel
>
> 3. more info is possible (description, title, type, etc.), which adds some self-documentation to what these items are; that would be useful later, for example, in the looper summarize/perego output.

All for building in the additional flexibility now. I usually find myself glad later on for some up-front investment. For me, the future desire for that flexibility behaves in a way similar to Hofstadter's law for time required.


jpsmith5 commented Mar 24, 2020

Re: Advantages of output schema

  1. Validate the output of the pipeline in what sense? Confirming a run of the pipeline successfully completed, by the presence of the expected output?
  2. Yes, logical.
  3. Good!
  4. Yes, these two sections do feel linked, and in the long run, particularly with regard to the proposed collate functionality, it will be more appropriate to have a standardized way of defining pipeline output at both the sample and project levels.
  5. Makes sense. The pipeline interface then feels more like a structure linking the various parts together, while each of those parts becomes independent and modular. Ideally that makes future-proofing and upgrading easier.
  6. Yup.

Re: disadvantages

  1. It doesn't feel significantly more complex, particularly since the input has already moved in this direction. If neither had gone down this path, it might feel that way, but as it stands I don't think that's really a concern. If anything, not going this direction is the disadvantage, because then you need to understand two approaches to handling I/O.
  2. Is this a disadvantage?
  3. Right.
  • As I touch on in disadvantage 1, it feels more straightforward for future users/pipeline builders to treat input and output the same, versus using a simplified yaml approach.
  • And I do like the idea of just using attribute names to refer to pipeline files, as opposed to constructing the paths within the pipeline. It would make the pipeline code simpler and, I would think, conceptually more straightforward to interpret.


nsheff commented Apr 6, 2020

Does this now subsume all functionality previously accomplished by the sample-level outputs and project-level summary_results sections?

stolarczyk commented:

Yes, I believe it does.
