GATK-SV re-jigging #356

Closed · MattWellie opened this issue May 1, 2023 · 1 comment

MattWellie commented May 1, 2023

Context: https://github.com/broadinstitute/gatk-sv#cohort-mode

GATK-SV works end-to-end as of PR #345. That version takes all samples in a project, runs variant calling per-sample for all SV callers, then runs all batch cluster/filtering steps on that same sample group. This is an OK first draft, but it needs to be substantially restructured so we can run in the GATK-recommended ways. We need to split the e2e workflow into multiple sub-workflows:

  1. Sample evidence only - a workflow including just the first two stages, GatherSampleEvidence and EvidenceQC. This should write files to locations independent of the pipeline output folder, so we can re-run it any time we grab new samples to top up any missing results.
    1a. Batching - parse the results of EvidenceQC and split the samples which ran in this round into homogeneous batches based on coverage and dosage score. This will be based on a script passed over by the Broad, and will run as the third step of this first workflow.
  2. We're using a single trained gCNV model at the moment, pulled from the reference data. Should we be re-training the model here?
  3. For each batch of samples grouped in 1a, the GatherBatchEvidence, ClusterBatch, GenerateBatchMetrics, and FilterBatch stages should run continuously to generate all relevant output files.
  4. MANUAL - merge the FilterBatch outputs from all separate batches into one set of files.
  5. GenotypeBatch runs individually per batch, using the combined FilterBatch results from all batches. This must use the exact same group of samples as 3., but does not run continuously with that step.
  6. VCF polishing - a final workflow incorporating all GenotypeBatch results (taking lists of files for each relevant input) to generate a final cleaned VCF per cohort.

Note: MANUAL here doesn't really mean manual per se; due to the way the pipeline runs, these will be stopping points where a non-pipeline step needs to run in order to set up subsequent steps. A rough sketch of the intended structure is below.
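Very roughly, I'm imagining the split looking something like this (illustrative only: `SubWorkflow`, the `per_batch`/`continuous` flags, and the `Batching`/`MakeCohortVcf` stage names are placeholders, not the real stages API):

```python
from dataclasses import dataclass

@dataclass
class SubWorkflow:
    name: str
    stages: list[str]
    per_batch: bool   # run once per 1a sample batch, or once per cohort?
    continuous: bool  # do the stages run back-to-back without a stop?

run_order = [
    # 1 + 1a: per-sample evidence, QC, then batching (batching script TBC)
    SubWorkflow('sample_evidence',
                ['GatherSampleEvidence', 'EvidenceQC', 'Batching'],
                per_batch=False, continuous=True),
    # 3: per-batch evidence through filtering, run continuously
    SubWorkflow('batch_filtering',
                ['GatherBatchEvidence', 'ClusterBatch',
                 'GenerateBatchMetrics', 'FilterBatch'],
                per_batch=True, continuous=True),
    # 4 (MANUAL): FilterBatch outputs get merged across batches here
    # 5: genotyping per batch, against the merged FilterBatch results
    SubWorkflow('genotyping', ['GenotypeBatch'],
                per_batch=True, continuous=False),
    # 6: cohort-wide VCF polishing over all GenotypeBatch outputs
    SubWorkflow('vcf_polishing', ['MakeCohortVcf'],
                per_batch=False, continuous=True),
]
```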

VCF locations

Individual VCF/coverage files should be generated in a per-project location, not a per-pipeline-run location.

Currently these per-sample VCFs are written to an output path relative to the pipeline output folder, e.g. gs://cpg-broad-rgp-test/gatk_sv/gathersampleevidence. Don't do that; otherwise each pipeline run will need to regenerate all individual VCFs, which is a bit silly. Rewrite the SampleStage output paths based on the logic which already exists for gVCF/CRAM locations.
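A minimal sketch of what that could look like. The helper name and the `gatk_sv_evidence` prefix are assumptions; the real version should reuse the existing gVCF/CRAM location logic so paths stay stable across pipeline runs:

```python
def sample_evidence_path(dataset_bucket: str, sample_id: str, caller: str) -> str:
    """Per-sample, per-caller output path, independent of any pipeline run,
    so a top-up run can check for and skip already-written evidence."""
    # e.g. gs://cpg-broad-rgp-test/gatk_sv_evidence/SAMPLE1/manta.vcf.gz
    return f'{dataset_bucket}/gatk_sv_evidence/{sample_id}/{caller}.vcf.gz'
```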

Batch Generation...

Not too much guidance here, other than binning samples with similar median coverage. We should also be open to removing outlier samples completely, though I'm not sure whether we need to make a best effort to include all samples. A rough binning sketch is below.
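Rough sketch only, assuming we have per-sample median coverage out of EvidenceQC. The batch size and the outlier rule are placeholders; the real logic will come from the script the Broad passed over (step 1a above):

```python
import statistics

def make_batches(median_cov: dict[str, float],
                 batch_size: int = 100,
                 max_mads: float = 5.0) -> list[list[str]]:
    # drop gross coverage outliers: > max_mads median-absolute-deviations out
    med = statistics.median(median_cov.values())
    mad = statistics.median(abs(v - med) for v in median_cov.values()) or 1.0
    kept = {s: v for s, v in median_cov.items() if abs(v - med) / mad <= max_mads}

    # sort by coverage so each chunk contains samples of similar depth
    ordered = sorted(kept, key=kept.get)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```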

MattWellie commented
This will need an overhaul when we move to SpicyCohorts, but for now it should all be live
