sample_structure yaml #216

Closed
nsheff opened this issue Aug 26, 2019 · 1 comment

nsheff commented Aug 26, 2019

Related to #201, #61

What if a pipeline had a sample representation in YAML format that listed all of its attributes, both input and output?

The pipeline itself would find this useful for referring to sample paths and other attributes; the interface would find it useful as well (it could replace and extend the outputs section).

For example, a sample_structure.yaml file would be produced for each pipeline. It would look like:

sample_structure:
  cutadapt_report: "cutadapt_folder/{sample.sample_name}_cutadapt.txt"

or, if the pipeline author wants more structure:

sample_structure:
  qc_results:
    cutadapt_report: "cutadapt_folder/{sample.sample_name}_cutadapt.txt"
  alignments:
    bt2_aligned: "{sample.genome}_aligned.bam"
    ...

So, it's totally flexible. The pipeline would then use this file (which accompanies the pipeline) and would provide sample.cutadapt_report or sample.alignments.bt2_aligned, which give the populated strings for the current sample (produced with something like str.format(sample=sample), since the templates refer to {sample.…} fields). This is superior to the current mode of defining Sample subclasses with attributes because it is not tied to Python, and...
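A minimal sketch of how the templates could be populated, not an existing peppy or looper feature. It assumes the YAML has already been parsed into a dict (written inline here to stay dependency-free), that templates refer to the sample through a keyword named `sample`, and that nested sections become attribute-style namespaces. The `populate` helper and the sample values `frog_1` and `hg38` are hypothetical.

```python
from types import SimpleNamespace

# Mirrors the nested sample_structure example above, written as a Python dict.
# In practice this would be parsed from sample_structure.yaml.
SAMPLE_STRUCTURE = {
    "qc_results": {
        "cutadapt_report": "cutadapt_folder/{sample.sample_name}_cutadapt.txt",
    },
    "alignments": {
        "bt2_aligned": "{sample.genome}_aligned.bam",
    },
}


def populate(structure, sample):
    """Recursively format every template string against `sample` (hypothetical helper)."""
    populated = {}
    for key, value in structure.items():
        if isinstance(value, dict):
            # Nested sections become namespaces so they support dotted access.
            populated[key] = SimpleNamespace(**populate(value, sample))
        else:
            # "{sample.sample_name}" is resolved by attribute lookup on `sample`.
            populated[key] = value.format(sample=sample)
    return populated


# Stand-in for a real Sample object; only attribute access is needed.
sample = SimpleNamespace(sample_name="frog_1", genome="hg38")
paths = SimpleNamespace(**populate(SAMPLE_STRUCTURE, sample))

print(paths.qc_results.cutadapt_report)  # cutadapt_folder/frog_1_cutadapt.txt
print(paths.alignments.bt2_aligned)      # hg38_aligned.bam
```

Here the populated values live on a separate `paths` object; attaching them to the sample itself, as in sample.alignments.bt2_aligned, would be a small variation on the same idea.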

It's useful also for the pipeline interface, which would add a new path:

protocol mappings: ...
pipelines:
  pname:
    path: path/to/pipeline.py
    sample_structure: path/to/sample_structure.yaml
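As a rough sketch of how a downstream tool might locate the file from a parsed pipeline interface: the helper name `sample_structure_path` is hypothetical, and resolving a relative path against the directory of the interface file is an assumption, not something this proposal specifies. The interface is written as a plain dict here rather than loaded from YAML.

```python
import os


def sample_structure_path(piface, piface_file, pipeline_name):
    """Return the sample_structure path for one pipeline, or None if it has none.

    Assumption: a relative path is interpreted relative to the interface file.
    """
    entry = piface["pipelines"][pipeline_name]
    declared = entry.get("sample_structure")
    if declared is None:
        return None
    if os.path.isabs(declared):
        return declared
    return os.path.normpath(os.path.join(os.path.dirname(piface_file), declared))


# Parsed form of the interface snippet above (the protocol mapping is omitted).
piface = {
    "pipelines": {
        "pname": {
            "path": "path/to/pipeline.py",
            "sample_structure": "path/to/sample_structure.yaml",
        }
    }
}

# The interface file location is made up for illustration.
resolved = sample_structure_path(piface, "/opt/pipelines/pipeline_interface.yaml", "pname")
print(resolved)
```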

Downstream tools that understand the pipeline interface (in any language) would also now know about the sample structure. We would use this in place of the current outputs section. Would it be useful for anything else? One case is an R package that wants to read in a peak file. Right now, we have to hard-code these kinds of outputs. For example:

https://github.com/databio/pepatac/blob/a0e4347b199c91bbfd7d994b0705da2ca8d51015/BiocProject/readPepatacPeakBeds.R#L11

The outputs section solves this, I suppose. This is just a more universal solution that would also solve it at the Python level. A disadvantage of sticking it directly in the piface (like the current outputs approach) is that the pipeline itself can't use it, unless it became aware of the pipeline interface. But given that the piface is conceptually external to the pipeline, separating the two seems to add flexibility and make more sense.

This concept could be built into the peppy Sample object.


nsheff commented Mar 24, 2020

Closed in favor of #237

nsheff closed this as completed Mar 24, 2020