
Enable user-provided sample cohorts/datasets #361

Open
MattWellie opened this issue May 11, 2023 · 5 comments
Labels
core Changes to the cpg-workflows api

Comments

@MattWellie
Contributor

Based on the messy-as-heck workflow sketched out in #356, we would need to run a number of partial workflows with specific stopping and restarting points. The workflow I initially had in mind would require 4-5 separate 'workflows' to be run, each needing a specific list of samples defined in config. Basically it would be awful to operate and fragile AF.

Instead, I drew a picture of how this multi-batch madness would slot neatly into a Dataset/CohortStage workflow, where the manual data-aggregation steps in #356 are swapped out for a CohortStage process running on all the combined DatasetStage outputs.

[Workflow diagram (screenshot, 2023-05-11): the grey, blue, and red stage groups described below]

Grey: a separate workflow, operating on a strict list of samples. One SampleStage, two CohortStages. This generates a file containing all the present sample IDs grouped into batches. Those batched sample groups are then fed back into production-pipelines as 'Datasets'.

Blue: each batch runs these stages in parallel as a 'Dataset'.

Red: CohortStage(s); collects results from the various Batches, aggregates them, and does something with the result.

The first CohortStage would be a chunk of custom code which reads in all the FilterBatch results and merges them into a single file; this is then fed into every Dataset batch for the Genotyping stage.

The second CohortStage takes a list of all individual Dataset VCFs as input, and creates one CohortVCF.
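As a rough illustration of that first CohortStage, the merge step itself could be as small as the sketch below. This is hypothetical only: it assumes each batch's FilterBatch results can be read as a line-based file from a known path, and the function name and file format are invented for the example; the real stage would wrap something like this in the production-pipelines CohortStage machinery.

    # Hypothetical sketch of the "merge FilterBatch results" step (not the real
    # FilterBatch output format): collect one result file per Dataset/batch and
    # merge them into a single deduplicated file for the Genotyping stage.
    from pathlib import Path

    def merge_filter_batch_results(batch_result_files: list[Path], merged_out: Path) -> Path:
        """Union the per-batch FilterBatch result lines into one sorted file."""
        merged: set[str] = set()
        for result_file in batch_result_files:
            merged.update(
                line.strip() for line in result_file.read_text().splitlines() if line.strip()
            )
        merged_out.write_text('\n'.join(sorted(merged)) + '\n')
        return merged_out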


I guess what I'm trying to say is....

  • Can we swap out Metamist for an alternative (file-based) provider of Cohort/Dataset groups? (A rough sketch of what that could look like follows this list.)
  • If we can't currently, how long would it take to introduce that behaviour?
  • Is there a much better way of doing this that I'm not currently aware of?
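To make the first bullet concrete: a file-based provider could be as small as a JSON file mapping batch/'dataset' names to sample IDs, plus a tiny loader, instead of a Metamist query. The file shape and names below are invented purely for illustration; this is not an existing cpg-workflows config format.

    import json
    from pathlib import Path

    # Hypothetical cohorts.json, invented for illustration:
    # {"batch_01": ["CPGAAAA", "CPGBBBB"], "batch_02": ["CPGCCCC"]}

    def load_cohorts(path: Path) -> dict[str, list[str]]:
        """Read a user-provided mapping of 'dataset' name -> sample IDs."""
        cohorts = json.loads(path.read_text())
        assert all(isinstance(ids, list) and ids for ids in cohorts.values())
        return cohorts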

If this approach is viable, then I'd like to hold off on any more GATK-SV work until then, because frankly this is the only way to make this messy basket of stages runnable in a respectable way.

@MattWellie
Contributor Author

Tagging @vivbak

@MattWellie
Contributor Author

Having dug into the code a bit more, I don't think any of this is possible (?)
From the production-pipelines definition of a Dataset:

    Each `dataset` at the CPG corresponds to
    * a GCP project: https://github.com/populationgenomics/team-docs/tree/main/storage_policies
    * a Pulumi stack: https://github.com/populationgenomics/analysis-runner/tree/main/stack
    * a metamist project

Even if we had a flexible way of grouping samples into arbitrary datasets, there's no easy way to spontaneously create everything a dataset needs, including a system user and GCP location to write data specific to that dataset. Something as simple as where a dataset would write its output is not simple at all.

@MattWellie
Contributor Author

@vivbak's suggestion: a new type of Stage, sandwiched between the Cohort (across Datasets) and Dataset (single-project) levels. This way we could form cohorts from across multiple projects, and designate a non-project location to write data to.

e.g. running a pipeline using seqr as the dataset, we could define multiple custom sub-cohorts from samples across the dataset (somehow), and results from the SandwichStage would be written into the dataset bucket (maybe), with paths generated from a hash of all the samples involved so that results from different sub-cohorts are stored in parallel.
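A minimal sketch of that hash-based path idea, assuming the hash is taken over the sorted sample IDs of the sub-cohort; the bucket name and directory layout here are placeholders, not an agreed convention:

    import hashlib

    def sub_cohort_prefix(dataset_bucket: str, sample_ids: list[str]) -> str:
        """Stable output prefix per sub-cohort, so parallel runs don't collide."""
        cohort_hash = hashlib.sha256(','.join(sorted(sample_ids)).encode()).hexdigest()[:12]
        return f'{dataset_bucket}/sub_cohorts/{cohort_hash}'

    # e.g. sub_cohort_prefix('gs://cpg-seqr-main', ['CPGAAAA', 'CPGBBBB'])
    # -> 'gs://cpg-seqr-main/sub_cohorts/<12-hex-char hash>'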

BRB, I'm going to go and copyright SandwichStage. It's mine.

@illusional
Contributor

Be keen to discuss this more (potentially Tuesday in person would be a good opportunity?). IMO, we can define a very small MVP that is basically "a fixed set of sequencing-groups at a point in time".

Consequences:

  • A cohort stage ALWAYS runs on a cohort, and it can just generate a cohort at the start of any run where you don't specify one explicitly
  • Cohorts can be nested, which solves the batching concern.
    • We should define what that semantically means for stages (whether it takes the sum of all nested sequencing-groups, etc; see the sketch after this list)
  • We can add extra information if analyses were computed through a cohort stage (rather than just the straight list of samples).
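To illustrate what nesting plus "sum of all" semantics might look like, a toy model follows; this is an illustration only, not metamist's actual data model or schema:

    from dataclasses import dataclass, field

    @dataclass
    class Cohort:
        """Toy model: a cohort is a fixed set of sequencing-groups, optionally nested."""
        name: str
        sequencing_groups: list[str] = field(default_factory=list)
        children: list['Cohort'] = field(default_factory=list)

        def all_sequencing_groups(self) -> set[str]:
            # 'Sum of all' semantics: the union of this cohort's own groups and its children's.
            groups = set(self.sequencing_groups)
            for child in self.children:
                groups |= child.all_sequencing_groups()
            return groups

    # e.g. Cohort('all', children=[Cohort('batch_01', ['CPGAAAA']), Cohort('batch_02', ['CPGBBBB'])])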

I don't really like building something outside metamist when I think it should be first class in metamist, and it should be very quick to MVP (1-2 days to get something workable). The only issue I see is that there's no point building this before the sequencing-group changes get released, as otherwise we'd face a challenging migration for no real reason.

@MattWellie
Contributor Author

If we can grab time on Tuesday in person that would be grand, else happy to find a zoom slot in the calendar.

Agreed about holding off until after sequencing-groups; this can be fudged in other ways until then.

@vivbak vivbak added the core Changes to the cpg-workflows api label Apr 17, 2024