
Enable user-provided sample cohorts/datasets #361

Open
MattWellie opened this issue May 11, 2023 · 5 comments
Labels
core Changes to the cpg-workflows api

Comments

@MattWellie
Contributor

Based on the messy-as-heck workflow sketched out in #356, we would need to run a number of partial workflows with specific stopping and restarting points. The workflow I initially had in mind would require 4-5 separate 'workflows' to be run, each needing a specific list of samples defined in config. Basically it would be awful to operate and fragile AF.

Instead, I drew a picture of how this multi-batch madness would slot neatly into a Dataset/CohortStage workflow, where the manual data-aggregation steps in #356 are swapped out for a CohortStage process running on all the combined DatasetStage outputs.

[Workflow diagram (screenshot, 2023-05-11): the grey, blue, and red stage groups described below]

Grey: a separate workflow, operating on a strict list of samples. One SampleStage, two CohortStages. This generates a file containing all the present sample IDs grouped into batches. Those batched sample groups are then fed back into production-pipelines as 'Datasets'.

Blue: each batch runs these stages in parallel as a 'Dataset'.

Red: CohortStage(s); collects results from the various Batches, aggregates them, and does something with the result.

The first CohortStage would be a chunk of custom code which reads in all the FilterBatch results and merges them into a single file; this is then fed into every Dataset batch for the Genotyping stage.

The second CohortStage takes a list of all individual Dataset VCFs as input, and creates one CohortVCF.
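As a rough illustration of that first CohortStage, the merge step itself could be as small as the sketch below. This is hypothetical only: it assumes each batch's FilterBatch results can be read as a line-based file from a known path, and the function name and file format are invented for the example; the real stage would wrap something like this in the production-pipelines CohortStage machinery.

    # Hypothetical sketch of the "merge FilterBatch results" step (not the real
    # FilterBatch output format): collect one result file per Dataset/batch and
    # merge them into a single deduplicated file for the Genotyping stage.
    from pathlib import Path

    def merge_filter_batch_results(batch_result_files: list[Path], merged_out: Path) -> Path:
        """Union the per-batch FilterBatch result lines into one sorted file."""
        merged: set[str] = set()
        for result_file in batch_result_files:
            merged.update(
                line.strip() for line in result_file.read_text().splitlines() if line.strip()
            )
        merged_out.write_text('\n'.join(sorted(merged)) + '\n')
        return merged_out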


I guess what I'm trying to say is....

  • Can we swap out Metamist for an alternative (file-based) provider of Cohort/Dataset groups? (A rough sketch of what that could look like follows this list.)
  • If we can't currently, how long would it take to introduce that behaviour?
  • Is there a much better way of doing this that I'm not currently aware of?
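To make the first bullet concrete: a file-based provider could be as small as a JSON file mapping batch/'dataset' names to sample IDs, plus a tiny loader, instead of a Metamist query. The file shape and names below are invented purely for illustration; this is not an existing cpg-workflows config format.

    import json
    from pathlib import Path

    # Hypothetical cohorts.json, invented for illustration:
    # {"batch_01": ["CPGAAAA", "CPGBBBB"], "batch_02": ["CPGCCCC"]}

    def load_cohorts(path: Path) -> dict[str, list[str]]:
        """Read a user-provided mapping of 'dataset' name -> sample IDs."""
        cohorts = json.loads(path.read_text())
        assert all(isinstance(ids, list) and ids for ids in cohorts.values())
        return cohorts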

If this approach is viable, then I'd like to hold off on any more GATK-SV work until then, because frankly this is the only way to make this messy basket of stages runnable in a respectable way.

@MattWellie
Contributor Author

Tagging @vivbak

@MattWellie
Contributor Author

Having dug into the code a bit more, I don't think any of this is possible (?)
From the production-pipelines definition of a Dataset:

    Each `dataset` at the CPG corresponds to
    * a GCP project: https://github.com/populationgenomics/team-docs/tree/main/storage_policies
    * a Pulumi stack: https://github.com/populationgenomics/analysis-runner/tree/main/stack
    * a metamist project

Even if we had a flexible way of grouping samples into arbitrary datasets, there's no easy way to spontaneously create everything a dataset needs, including a system user and GCP location to write data specific to that dataset. Something as simple as where a dataset would write its output is not simple at all.

@MattWellie
Contributor Author

@vivbak's suggestion: a new type of Stage, sandwiched between the Cohort (across Datasets) and Dataset (single-project) levels. This way we could form cohorts from across multiple projects, and designate a non-project location to write data to.

e.g. running a pipeline using seqr as the dataset, we could define multiple custom sub-cohorts from samples across the dataset (somehow), and results from the SandwichStage would be written into the dataset bucket (maybe), with paths generated from a hash of all the samples involved so that results from different sub-cohorts are stored in parallel.
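A minimal sketch of that hash-based path idea, assuming the hash is taken over the sorted sample IDs of the sub-cohort; the bucket name and directory layout here are placeholders, not an agreed convention:

    import hashlib

    def sub_cohort_prefix(dataset_bucket: str, sample_ids: list[str]) -> str:
        """Stable output prefix per sub-cohort, so parallel runs don't collide."""
        cohort_hash = hashlib.sha256(','.join(sorted(sample_ids)).encode()).hexdigest()[:12]
        return f'{dataset_bucket}/sub_cohorts/{cohort_hash}'

    # e.g. sub_cohort_prefix('gs://cpg-seqr-main', ['CPGAAAA', 'CPGBBBB'])
    # -> 'gs://cpg-seqr-main/sub_cohorts/<12-hex-char hash>'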

BRB, I'm going to go and copyright SandwichStage. It's mine.

@illusional
Contributor

Be keen to discuss this more (potentially Tuesday in person would be a good opportunity?). IMO, we can define a very small MVP that is basically "a fixed set of sequencing-groups at a point in time".

Consequences:

  • A cohort stage ALWAYS runs on a cohort, and it can just generate a cohort at the start of any run where you don't specify one explicitly
  • Cohorts can be nested, which solves the batching concern.
    • We should define what that semantically means for stages (whether it takes the sum of all nested sequencing-groups, etc; see the sketch after this list)
  • We can add extra information if analyses were computed through a cohort stage (rather than just the straight list of samples).
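To illustrate what nesting plus "sum of all" semantics might look like, a toy model follows; this is an illustration only, not metamist's actual data model or schema:

    from dataclasses import dataclass, field

    @dataclass
    class Cohort:
        """Toy model: a cohort is a fixed set of sequencing-groups, optionally nested."""
        name: str
        sequencing_groups: list[str] = field(default_factory=list)
        children: list['Cohort'] = field(default_factory=list)

        def all_sequencing_groups(self) -> set[str]:
            # 'Sum of all' semantics: the union of this cohort's own groups and its children's.
            groups = set(self.sequencing_groups)
            for child in self.children:
                groups |= child.all_sequencing_groups()
            return groups

    # e.g. Cohort('all', children=[Cohort('batch_01', ['CPGAAAA']), Cohort('batch_02', ['CPGBBBB'])])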

I don't really like building something outside metamist when I think it should be first class in metamist, and it should be very quick to MVP (1-2 days to get something workable). The only issue I see is that there's no point building this before the sequencing-group changes get released, as otherwise we'd face a challenging migration for no real reason.

@MattWellie
Contributor Author

If we can grab time on Tuesday in person that would be grand, else happy to find a zoom slot in the calendar.

Agreed about holding off until after sequencing-groups; this can be fudged in other ways until then.

@vivbak vivbak added the core Changes to the cpg-workflows api label Apr 17, 2024