
How should we define what a pipeline produces? #237

Closed · nsheff opened this issue Mar 19, 2020 · 7 comments

nsheff commented Mar 19, 2020

Related to: #32, #61, #94, #201, #216

How should we define what a pipeline produces?

Originally, the inputs were specified in the pipeline interface; we have since switched to a schema, which is used to validate a PEP as input to a pipeline. Here's an example: https://schema.databio.org/pipelines/pepatac.yaml

description: A PEP for ATAC-seq samples for the PEPATAC pipeline.
imports: http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name: 
          type: string
          description: "Name of the sample"
        organism: 
          type: string
          description: "Organism"
        protocol: 
          type: string
          description: "Must be an ATAC-seq or DNAse-seq sample"
          enum: ["ATAC", "ATAC-SEQ", "ATAC-seq", "DNase", "DNase-seq"]
        genome:
          type: string
          description: "Refgenie genome registry identifier"
        read_type:
          type: string
          description: "Is this single or paired-end data?"
          enum: ["SINGLE", "PAIRED"]
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_input_attrs:
        - read1
      input_attrs:
        - read1
        - read2
      required:
        - sample_name
        - protocol
        - read1
        - genome
required:
  - samples

The schema is superior because it's decoupled from looper: it can now be used with eido to validate the PEP, and it's reusable across pipelines.
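
To make this concrete, here is a minimal sketch of that validation step in Python. It assumes eido's validate_project and peppy's Project APIs; the PEP config filename is hypothetical:

import eido
import peppy

# Load a PEP; "pep_config.yaml" is a hypothetical project config path.
prj = peppy.Project("pep_config.yaml")

# Validate against the pipeline's input schema; raises a jsonschema
# ValidationError if the PEP does not satisfy the schema.
eido.validate_project(prj, "https://schema.databio.org/pipelines/pepatac.yaml")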

So, what about outputs? Right now, the pipeline specifies the outputs it produces in the pipeline interface, like so:

pipelines:
  pepatac.py:
    name: PEPATAC
    path: pipelines/pepatac.py
    schema: https://schema.databio.org/pipelines/pepatac.yaml
    ...
    outputs:
      smooth_bw: "aligned_{sample.genome}/{sample.sample_name}_smooth.bw"
      exact_bw:  "aligned_{sample.genome}_exact/{sample.sample_name}_exact.bw"
      aligned_bam: "aligned_{sample.genome}/{sample.sample_name}_sort.bam"

This is very similar to what we used to do with required attributes for input. Is there, then, similar value in abstracting this concept into a schema of some sort? What would it look like?

description: Sample objects produced by the PEPATAC pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        smooth_bw: 
          type: string
          description: "Smoothed bigwig output file"
          path: "aligned_{sample.genome}/{sample.sample_name}_smooth.bw"

Here, we extend the jsonschema vocabulary with a new term called 'path'. This term is not used to validate objects at all; it's just used internally to populate them (that's our extension). It's exactly what we're already doing with outputs. So, what's the advantage of switching to a schema like this?

  1. this schema could then also be used to validate the output of the pipeline, building on json-schema (see the sketch at the end of this comment)
  2. the input schema and output schema are parallel
  3. it adds some self-documentation to what these items are, which would then be useful later, for example, in the looper summarize/perego output
  4. it adds the possibility of project-level outputs, not just sample-level outputs (which, in fact, I just realized is exactly the summary_results section)
  5. we would refer to it as an external file, which decouples it from the pipeline interface, so it could in theory be re-used (though this would also be satisfied by a simple file that listed outputs as key-value pairs)
  6. (see below) it standardizes the outputs and summary results into one format

I'm not totally sold on this but want to throw it out there for comments.

Disadvantages:

  1. it's more complex than a simple key-value pair for outputs
  2. it only adds utility when someone is visualizing output with looper summarize/perego, and potentially when validating at a collate step
  3. it requires some new tooling, not yet entirely specified, to realize the validation advantages
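
To sketch advantage 1: standard JSON Schema validators ignore unrecognized keywords, so our 'path' extension wouldn't interfere with validation, and the same file could serve both population and validation. A minimal, hypothetical example (the schema filename and the sample dict are made up):

import yaml
from jsonschema import validate

# Load the output schema; "output_schema.yaml" is a hypothetical filename.
with open("output_schema.yaml") as f:
    schema = yaml.safe_load(f)

# A sample object as the pipeline might produce it (hypothetical values).
produced = {"samples": [{"smooth_bw": "aligned_hg38/frog_1_smooth.bw"}]}

# The validator ignores the custom 'path' keyword; this raises a
# ValidationError if the produced object doesn't match the schema.
validate(instance=produced, schema=schema)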

nsheff commented Mar 19, 2020

The output schema could actually also subsume the summary_results section, which has, in fact, already reproduced almost everything in it:

from peppro:

    summary_results:
      - library_complexity_file:
        caption: "Library complexity file"
        description: "Plots each sample's library complexity on a single plot."
        thumbnail_path: "summary/{name}_libComplexity.png"
        path: "summary/{name}_libComplexity.pdf"
      - counts_table:
        caption: "Gene counts table"
        description: "Combines all sample gene count files into a project level gene counts table."
        thumbnail_path: "summary/{name}_countData.csv"
        path: "summary/{name}_countData.csv"

caption actually has an official term in jsonschema, called title. So in an output schema, this could look very similar:

description: Summary results produced by the PEPATAC pipeline.
properties:
  library_complexity_file:
    title: "Library complexity file"
    type: image (?)
    description: "Plots each sample's library complexity on a single plot."
    thumbnail_path: "summary/{name}_libComplexity.png"
    path: "summary/{name}_libComplexity.pdf"
  counts_table:
    title: "Gene counts table"
    type: string
    description: "Combines all sample gene count files into a project level gene counts table."
    thumbnail_path: "summary/{name}_countData.csv"
    path: "summary/{name}_countData.csv"

I think this would be a separate file from the outputs schema above. But it's interesting that we added these descriptive attributes here organically. Why wouldn't we also want these for the sample-level outputs? We should make them match.
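
As a rough sketch of the self-documentation payoff, a reporting tool (looper summarize, perego) could iterate the schema's properties and render one report entry per item from title, description, and the path fields. Everything here is assumed for illustration, not an existing API:

import yaml

# "summary_schema.yaml" is a hypothetical filename for the schema above.
with open("summary_schema.yaml") as f:
    schema = yaml.safe_load(f)

for key, spec in schema["properties"].items():
    # 'title' and 'description' document the item; the path templates use
    # format-style placeholders like {name}.
    print(f"{spec['title']} ({key}): {spec['description']}")
    print("  file:      " + spec["path"].format(name="myproject"))
    print("  thumbnail: " + spec["thumbnail_path"].format(name="myproject"))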


nsheff commented Mar 19, 2020

See #216 for a similar, simpler proposal. Advantages of going to the schema over the simpler yaml proposal are similar to those listed above:

  1. schema could be used to validate the output of the pipeline
  2. the input schema and output schema are parallel
  3. more info is possible (description, title, type, etc.), which adds some self-documentation to what these items are; that would be useful later, for example, in the looper summarize/perego output.


nsheff commented Mar 19, 2020

A generic tool could use this schema to produce sample objects with populated paths. The tool would take this schema, plus either a PEP (to provide a whole project), or a sample dict (if it's just one sample), and produce either a project or a sample object with all of the path attributes populated. This simple tool could be implemented in both R and Python.
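
Here's a minimal sketch of what the core of such a tool could look like in Python, assuming format-style 'path' templates as in the output schema above; the function name and inlined schema fragment are illustrative, not an existing API:

from types import SimpleNamespace

def populate_sample_paths(sample: dict, schema: dict) -> dict:
    # Return a copy of `sample` with one new attribute per schema property
    # that carries a 'path' template, rendered against the sample's attributes.
    props = schema["properties"]["samples"]["items"]["properties"]
    populated = dict(sample)
    ns = SimpleNamespace(**sample)  # enables {sample.attr} lookups in templates
    for attr, spec in props.items():
        if "path" in spec:
            populated[attr] = spec["path"].format(sample=ns)
    return populated

# Hypothetical inlined schema fragment mirroring the example above.
schema = {"properties": {"samples": {"items": {"properties": {
    "smooth_bw": {
        "type": "string",
        "description": "Smoothed bigwig output file",
        "path": "aligned_{sample.genome}/{sample.sample_name}_smooth.bw",
    }}}}}}

sample = {"sample_name": "frog_1", "genome": "hg38"}
print(populate_sample_paths(sample, schema))
# {'sample_name': 'frog_1', 'genome': 'hg38',
#  'smooth_bw': 'aligned_hg38/frog_1_smooth.bw'}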

Our pipelines could use this tool at the sample level to instantiate sample objects, then refer to outputs by attribute name instead of encoding any paths.

perego would use this tool (at the project level) to get both sample files and project files for display.


vreuter commented Mar 24, 2020

> See #216 for a similar, simpler proposal. Advantages of going to the schema over the simpler yaml proposal are similar to those listed above:
>
> 1. schema could be used to validate the output of the pipeline
>
> 2. the input schema and output schema are parallel
>
> 3. more info is possible (description, title, type, etc.), which adds some self-documentation to what these items are; that would be useful later, for example, in the looper summarize/perego output.

All for building in the additional flexibility now. I usually find myself glad later on for some up-front investment. For me, the future desire for that flexibility behaves in a way similar to Hofstadter's law for time required.


jpsmith5 commented Mar 24, 2020

Re: Advantages of output schema

  1. Validate the output of the pipeline in what sense? Confirming a run of the pipeline successfully completed, by the presence of the expected output?
  2. Yes, logical.
  3. Good!
  4. Yes, these two sections do feel linked, and in the long run, particularly with regard to the proposed collate functionality, it will be more appropriate to have a standardized way of defining pipeline output at both the sample and project levels.
  5. Makes sense. The pipeline interface then feels more like a structure linking the various parts together, while each of those parts becomes independent and modular. Ideally that makes future-proofing and upgrading easier.
  6. Yup.

Re: disadvantages

  1. It doesn't feel significantly more complex, particularly since the input has already moved in this direction. If neither had gone down this path, it might feel that way, but as it stands I don't think that's really a concern. If anything, not going this direction is the disadvantage, because then you need to understand two approaches to handling I/O.
  2. Is this a disadvantage?
  3. Right.
  • As I touch on in disadvantage 1, it feels more straightforward for future users/pipeline builders to treat input and output the same, versus using a simplified yaml approach.
  • And I do like the idea of just using attribute names to refer to pipeline files, as opposed to constructing the paths within the pipeline. It would make the pipeline code simpler and, I would think, conceptually more straightforward to interpret.


nsheff commented Apr 6, 2020

Does this now subsume all functionality previously accomplished by the sample-level outputs and project-level summary_results sections?

stolarczyk commented:

Yes, I believe it does.
