sample_structure yaml #216

nsheff · 2019-08-26T12:58:21Z

Related to #201, #61

What if a pipeline had a sample representation in yaml format that listed all of its attributes, both input and output?

The pipeline itself would find this useful to refer to the sample paths and other things; the interface would find this useful as well (it could replace and extend the outputs section).

For example, a sample_structure.yaml file would be produced for each pipeline. It would look like:

sample_structure:
  cutadapt_report: "cutadapt_folder/{sample.sample_name}_cutadapt.txt"

or, if the pipeline author wants more structure:

sample_structure:
  qc_results:
    cutadapt_report: "cutadapt_folder/{sample.sample_name}_cutadapt.txt"
  alignments:
    bt2_aligned: "{sample.genome}_aligned.bam"
    ...

So, it's totally flexible. The pipeline would then use this file (it accompanies the pipeline) and would provide sample.cutadapt_report or sample.alignments.bt2_aligned, which would give the populated strings for the current sample (produced with str.format(sample=sample), since the templates refer to {sample.attribute}). This is superior to the current mode of defining sample subclasses with attributes because it is not tied to Python, and...
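
A minimal sketch of how that population step might work, assuming the yaml has already been parsed into a nested dict. The `populate` helper and the sample values here are hypothetical, for illustration only; none of this is existing peppy or looper API:

```python
from types import SimpleNamespace


def populate(structure, sample):
    """Recursively fill each template string using attributes of `sample`.

    Nested mappings become nested namespaces, so the result supports
    attribute access such as result.alignments.bt2_aligned.
    """
    filled = {}
    for key, value in structure.items():
        if isinstance(value, dict):
            filled[key] = populate(value, sample)
        else:
            # Templates use {sample.attr}, so pass the sample by keyword.
            filled[key] = value.format(sample=sample)
    return SimpleNamespace(**filled)


# Stand-in for the parsed contents under the sample_structure key (assumed values).
sample_structure = {
    "qc_results": {
        "cutadapt_report": "cutadapt_folder/{sample.sample_name}_cutadapt.txt",
    },
    "alignments": {
        "bt2_aligned": "{sample.genome}_aligned.bam",
    },
}

# Hypothetical sample; a real one would come from the project's sample table.
sample = SimpleNamespace(sample_name="frog_1", genome="hg38")

paths = populate(sample_structure, sample)
print(paths.qc_results.cutadapt_report)  # cutadapt_folder/frog_1_cutadapt.txt
print(paths.alignments.bt2_aligned)      # hg38_aligned.bam
```

Because the templates are plain strings, the same file could be populated by an equivalent helper in R or any other language.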

It's useful also for the pipeline interface, which would add a new path:

protocol mappings: ...
pipelines:
  pname:
    path: path/to/pipeline.py
    sample_structure: path/to/sample_structure.yaml
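
As a rough sketch, a tool that has already parsed the pipeline interface could locate the structure file like this. The `sample_structure_path` helper is hypothetical, and resolving the path relative to the interface file's folder is an assumption, not something settled in this proposal:

```python
import os


def sample_structure_path(piface, piface_file, pipeline_name):
    """Return the sample_structure file for a pipeline, or None if absent.

    Assumption: relative paths are resolved against the folder that
    contains the pipeline interface file.
    """
    entry = piface["pipelines"][pipeline_name]
    relative = entry.get("sample_structure")
    if relative is None:
        return None
    base = os.path.dirname(piface_file)
    return os.path.normpath(os.path.join(base, relative))


# Stand-in for the parsed pipeline interface shown above.
piface = {
    "pipelines": {
        "pname": {
            "path": "path/to/pipeline.py",
            "sample_structure": "path/to/sample_structure.yaml",
        }
    }
}

# The interface file location is a made-up example path.
resolved = sample_structure_path(piface, "/repo/pipeline_interface.yaml", "pname")
print(resolved)
```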

Downstream tools that understand the pipeline interface (in any language) also now know about the sample structure. We would use this in place of the current outputs section. Would it be useful for anything else? One case is an R package that wants to read in a peak file. Right now, we're having to hard-code these kinds of outputs. For example:

https://github.com/databio/pepatac/blob/a0e4347b199c91bbfd7d994b0705da2ca8d51015/BiocProject/readPepatacPeakBeds.R#L11

The outputs section solves this, I suppose. This is just a more universal solution that would also solve it at the Python level. A disadvantage of sticking it directly in the piface (like the current outputs approach) is that the pipeline itself can't use it... unless it became aware of the pipeline interface. But given that the piface is conceptually external to the pipeline, dividing these seems to add flexibility and make more sense.

This concept could be built into the peppy Sample object.


nsheff · 2020-03-24T23:48:11Z

Closed in favor of #237

nsheff mentioned this issue Aug 26, 2019

cutadapt_report should not need to be defined in 2 locations databio/peppro#38

Open

nsheff modified the milestones: 0.13, 0.14 Jan 28, 2020

nsheff mentioned this issue Mar 19, 2020

How should we define what a pipeline produces? #237

Closed

nsheff closed this as completed Mar 24, 2020
