GATK-SV re-jigging #356

Closed · MattWellie opened this issue May 1, 2023 · 1 comment

MattWellie commented May 1, 2023

Context: https://github.com/broadinstitute/gatk-sv#cohort-mode

GATK-SV works end-to-end as of PR #345. That version takes all samples in a project, runs variant calling per-sample for all SV callers, then runs all batch cluster/filtering steps on that same sample group. This is an OK first draft, but it needs to be substantially restructured so we can run in the GATK-recommended ways. We need to split the e2e workflow into multiple sub-workflows:

  1. Sample evidence only - a workflow including just the first two stages, GatherSampleEvidence and EvidenceQC. This should write files to locations independent of the pipeline output folder, so we can re-run it any time we grab new samples to top up any missing results.
    1a. Batching - parse the results of EvidenceQC and split the samples which ran in this round into homogeneous batches based on coverage and dosage score. This will be based on a script passed over by the Broad, and will run as the third step of this first workflow.
  2. We're using a single trained gCNV model at the moment, pulled from the reference data. Should we be re-training the model here?
  3. For each batch of samples grouped in 1a, the GatherBatchEvidence, ClusterBatch, GenerateBatchMetrics, and FilterBatch stages should run continuously to generate all relevant output files.
  4. MANUAL - merge the FilterBatch outputs from all separate batches into one set of files.
  5. GenotypeBatch runs individually per batch, using the combined FilterBatch results from all batches. This must use the exact same group of samples as 3., but does not run continuously with that step.
  6. VCF polishing - a final workflow incorporating all GenotypeBatch results (taking lists of files for each relevant input) to generate a final cleaned VCF per cohort.

Note: MANUAL here doesn't really mean manual per se; due to the way the pipeline runs, these will be stopping points where a non-pipeline step needs to run in order to set up subsequent steps. A rough sketch of the intended structure is below.
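Very roughly, I'm imagining the split looking something like this (illustrative only: `SubWorkflow`, the `per_batch`/`continuous` flags, and the `Batching`/`MakeCohortVcf` stage names are placeholders, not the real stages API):

```python
from dataclasses import dataclass

@dataclass
class SubWorkflow:
    name: str
    stages: list[str]
    per_batch: bool   # run once per 1a sample batch, or once per cohort?
    continuous: bool  # do the stages run back-to-back without a stop?

run_order = [
    # 1 + 1a: per-sample evidence, QC, then batching (batching script TBC)
    SubWorkflow('sample_evidence',
                ['GatherSampleEvidence', 'EvidenceQC', 'Batching'],
                per_batch=False, continuous=True),
    # 3: per-batch evidence through filtering, run continuously
    SubWorkflow('batch_filtering',
                ['GatherBatchEvidence', 'ClusterBatch',
                 'GenerateBatchMetrics', 'FilterBatch'],
                per_batch=True, continuous=True),
    # 4 (MANUAL): FilterBatch outputs get merged across batches here
    # 5: genotyping per batch, against the merged FilterBatch results
    SubWorkflow('genotyping', ['GenotypeBatch'],
                per_batch=True, continuous=False),
    # 6: cohort-wide VCF polishing over all GenotypeBatch outputs
    SubWorkflow('vcf_polishing', ['MakeCohortVcf'],
                per_batch=False, continuous=True),
]
```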

VCF locations

Individual VCF/coverage files should be generated in a per-project location, not a per-pipeline-run location.

Currently these per-sample VCFs are written to an output path relative to the pipeline output folder, e.g. gs://cpg-broad-rgp-test/gatk_sv/gathersampleevidence. Don't do that; otherwise each pipeline run will need to regenerate all individual VCFs, which is a bit silly. Rewrite the SampleStage output paths based on the logic which already exists for gVCF/CRAM locations.
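A minimal sketch of what that could look like. The helper name and the `gatk_sv_evidence` prefix are assumptions; the real version should reuse the existing gVCF/CRAM location logic so paths stay stable across pipeline runs:

```python
def sample_evidence_path(dataset_bucket: str, sample_id: str, caller: str) -> str:
    """Per-sample, per-caller output path, independent of any pipeline run,
    so a top-up run can check for and skip already-written evidence."""
    # e.g. gs://cpg-broad-rgp-test/gatk_sv_evidence/SAMPLE1/manta.vcf.gz
    return f'{dataset_bucket}/gatk_sv_evidence/{sample_id}/{caller}.vcf.gz'
```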

Batch Generation...

Not too much guidance here, other than binning samples with similar median coverage. We should also be open to removing outlier samples completely, though I'm not sure whether we need to make a best effort to include all samples. A rough binning sketch is below.
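Rough sketch only, assuming we have per-sample median coverage out of EvidenceQC. The batch size and the outlier rule are placeholders; the real logic will come from the script the Broad passed over (step 1a above):

```python
import statistics

def make_batches(median_cov: dict[str, float],
                 batch_size: int = 100,
                 max_mads: float = 5.0) -> list[list[str]]:
    # drop gross coverage outliers: > max_mads median-absolute-deviations out
    med = statistics.median(median_cov.values())
    mad = statistics.median(abs(v - med) for v in median_cov.values()) or 1.0
    kept = {s: v for s, v in median_cov.items() if abs(v - med) / mad <= max_mads}

    # sort by coverage so each chunk contains samples of similar depth
    ordered = sorted(kept, key=kept.get)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```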

MattWellie commented
This will need an overhaul when we move to SpicyCohorts, but for now it should all be live
