Enable user-provided sample cohorts/datasets #361
Tagging @vivbak
Having dug into the code a bit more, I don't think any of this is possible (?)
Even if we had a flexible way of grouping samples into arbitrary datasets, there's no easy way to spontaneously create everything a dataset needs, including a system user and GCP location to write data specific to that dataset. Something as simple as where a dataset would write its output is not simple at all.
@vivbak's suggestion: a new type of Stage, sandwiched between the Cohort (across Datasets) and Dataset (single-project) levels. This way we could support e.g. running a pipeline on an arbitrary grouping of samples. BRB, I'm going to go and copyright SandwichStage. It's mine.
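A minimal sketch of what that intermediate level could look like. All names here (`Dataset`, `SubCohort`, `Cohort`) are hypothetical and not taken from the actual production-pipelines code:

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A single-project grouping, as production-pipelines has today."""
    name: str
    sample_ids: list[str] = field(default_factory=list)


@dataclass
class SubCohort:
    """The proposed in-between ('sandwich') level: a fixed set of samples
    drawn from one or more Datasets, defined at pipeline-run time."""
    name: str
    datasets: list[Dataset] = field(default_factory=list)

    def all_sample_ids(self) -> list[str]:
        # Flatten the member datasets into one sample list.
        return [s for d in self.datasets for s in d.sample_ids]


@dataclass
class Cohort:
    """The top level, spanning all sub-cohorts in a run."""
    sub_cohorts: list[SubCohort] = field(default_factory=list)
```

The point of the sketch is only that stages could then target any of the three levels, not just the two extremes.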
Be keen to discuss this more (potentially Tuesday in person would be a great topic?). IMO, we can define a very small MVP that is basically "a fixed set of sequencing-groups at a point in time". Consequences: I don't really like building something outside metamist when I think it should be first class there, and it should be very quick to MVP (1-2 days to get something workable). The only issue I see is that there's no point building this before the sequencing-group changes get released, as then it's challenging to migrate for no real reason.
If we can grab time on Tuesday in person that would be grand, else happy to find a zoom slot in the calendar. Agreed about holding off until after sequencing-groups; this can be fudged in other ways until then.
Based on the messy-as-heck workflow sketched out in #356, we want to do a number of partial workflows, with specific stopping and restarting points. The workflow I initially had in mind would require 4/5 separate 'workflows' to be run, each needing a specific list of samples defined in config. Basically it would be awful to operate and fragile AF.
Instead of that, I drew a picture of how this multi-batch madness would slot neatly into a Dataset/CohortStage workflow, where the manual data-aggregation steps in #356 are swapped out for a CohortStage process running on all the combined DatasetStage outputs.
Grey: a separate workflow operating on a strict list of samples (one SampleStage, two CohortStages). This generates a file containing all the present sample IDs grouped into batches. Those batched sample groups are then fed back into production-pipelines as 'Datasets'.
Blue: each batch runs these stages in parallel as a 'Dataset'.
Red: CohortStage(s) that collect results from the various batches, aggregate them, and do something with the combined output.
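The hand-off from the grey workflow could be consumed along these lines. This is a hedged sketch that assumes the batching file is a simple JSON mapping of batch name to sample IDs; the real file format and loader aren't specified in this issue:

```python
import json
from pathlib import Path


def load_batches(batch_file: Path) -> dict[str, list[str]]:
    """Read the batching output of the grey workflow and return
    batch-name -> sample-ID lists, ready to be treated as pseudo-Datasets.
    Assumes a plain JSON object; purely illustrative."""
    with open(batch_file) as fh:
        batches = json.load(fh)
    # Sanity check: no sample should land in two batches.
    seen: set[str] = set()
    for name, samples in batches.items():
        dupes = seen.intersection(samples)
        if dupes:
            raise ValueError(f'samples assigned to multiple batches: {dupes}')
        seen.update(samples)
    return batches
```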
The first CohortStage would be a chunk of custom code that reads in all the FilterBatch results and merges them into a single file, which is then fed into every Dataset batch for the Genotyping stage.
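As a sketch of that merge step, assuming purely for illustration that each per-batch FilterBatch output can be reduced to one record ID per line (the real GATK-SV FilterBatch outputs are richer than this):

```python
def merge_filter_results(result_files: list[str]) -> set[str]:
    """Combine per-batch FilterBatch outputs into one merged set.
    Illustrative only: assumes one record ID per line per file."""
    merged: set[str] = set()
    for path in result_files:
        with open(path) as fh:
            # Union across batches, skipping blank lines.
            merged.update(line.strip() for line in fh if line.strip())
    return merged
```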
The second CohortStage takes a list of all the individual Dataset VCFs as input and creates one cohort VCF.
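That second CohortStage could amount to a `bcftools merge` over the per-batch VCFs. A sketch that only builds the command (the function name and wrapping are hypothetical, and in the real pipeline this would run inside a Hail Batch job):

```python
def cohort_merge_cmd(dataset_vcfs: list[str], out_vcf: str) -> list[str]:
    """Build a `bcftools merge` command combining per-batch VCFs into a
    single cohort VCF. Assumes inputs are bgzipped and tabix-indexed,
    as bcftools merge requires."""
    if len(dataset_vcfs) < 2:
        raise ValueError('bcftools merge needs at least two input VCFs')
    # -m all: merge all record types; -Oz: bgzipped VCF output.
    return ['bcftools', 'merge', '-m', 'all', '-Oz', '-o', out_vcf, *dataset_vcfs]
```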
I guess what I'm trying to say is....
If this approach is viable, I'd like to hold off on any more GATK-SV work until then, because frankly this is the only way to make this messy basket of stages runnable in a respectable way.