GATK-SV works end-to-end as of PR #345. That version takes all samples in a project, runs variant calling per-sample for all SV callers, then runs all batch clustering/filtering steps on that same sample group. This is an OK first draft, but it needs substantial changes before we can run in the GATK-recommended ways. We need to split the end-to-end workflow into multiple sub-workflows:
1. **Sample evidence only** - a workflow containing just the first two stages, `GatherSampleEvidence` and `EvidenceQC`. This should write files to locations independent of the pipeline output folder, so we can re-run it any time we grab new samples to top up any missing results.
1a. **Batching** - parse the results of `EvidenceQC` and split the samples run in this round into homogeneous batches based on coverage and dosage score. This will be based on a script passed over by the Broad, and will run as the third step of this first workflow.
We're using a single trained gCNV model at the moment, pulled from the reference data. Should we be re-running the model training here?
2. For each batch of samples grouped in 1a, run the `GatherBatchEvidence`, `ClusterBatch`, `GenerateBatchMetrics`, and `FilterBatch` stages continuously to generate all relevant output files.
3. **MANUAL** - merge the `FilterBatch` outputs from all separate batches into one set of files.
4. `GenotypeBatch` runs individually per batch, using the combined `FilterBatch` results from all batches. This must use the exact same group of samples as the per-batch stages above, but does not run continuously with them.
5. **VCF polishing** - a final workflow incorporating the results of all `GenotypeBatch` runs (this will take lists of files for each relevant input), generating a final cleaned VCF per cohort.
Note: **MANUAL** here doesn't really mean manual per se; due to the way the pipeline runs, these are stopping points where a non-pipeline step needs to run in order to set up subsequent steps.
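To make the proposed grouping concrete, the split above could be sketched as a simple stage map. The dict and key names are ours, and `MakeCohortVcf` is an assumed name for the final polishing stage, not something fixed by this issue:

```python
# Sketch of the proposed sub-workflow split. Stage names come from
# GATK-SV; the grouping mirrors the numbered list above.
SUB_WORKFLOWS = {
    # Per-sample work plus batching; re-runnable as new samples arrive.
    'sample_evidence': ['GatherSampleEvidence', 'EvidenceQC', 'Batching'],
    # Run continuously, once per batch from the Batching step.
    'per_batch': [
        'GatherBatchEvidence',
        'ClusterBatch',
        'GenerateBatchMetrics',
        'FilterBatch',
    ],
    # MANUAL stopping point sits here: merge FilterBatch outputs
    # across batches before genotyping.
    'genotyping': ['GenotypeBatch'],
    'vcf_polishing': ['MakeCohortVcf'],  # assumed stage name
}
```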
**VCF locations**
Individual VCF/coverage files should be generated in a per-project location, not a per-pipeline-run location. Currently these per-sample VCFs are written to an output path relative to the pipeline output folder, e.g. `gs://cpg-broad-rgp-test/gatk_sv/gathersampleevidence`. Don't do that, otherwise each pipeline run will need to regenerate all individual VCFs, which is a bit silly. Rewrite the `SampleStage` output paths based on the logic that already exists for gVCF/CRAM locations.
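As a sketch, a run-independent path builder might look like the following. The bucket suffix and folder layout here are assumptions modelled loosely on the example path above; the real implementation should reuse whatever logic builds the gVCF/CRAM locations:

```python
# Sketch: per-project (run-independent) output paths for per-sample
# evidence files. Keyed only by project and sample ID, never by a
# pipeline-run ID, so re-runs find and reuse existing outputs.
def sample_evidence_path(project: str, sample_id: str, suffix: str) -> str:
    # 'cpg-{project}-main' and the 'gatk_sv/evidence' prefix are
    # illustrative, not the real bucket convention.
    return f'gs://cpg-{project}-main/gatk_sv/evidence/{sample_id}/{sample_id}.{suffix}'
```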
**Batch generation**
There isn't much guidance here beyond binning samples with similar median coverage. We should also be open to removing outlier samples entirely, though I'm not sure whether we need to make a best effort to keep all samples.
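A minimal sketch of that binning idea, assuming we have per-sample median coverage from `EvidenceQC`. The MAD-based outlier cutoff and batch size are illustrative defaults we'd pick ourselves, not values prescribed by GATK-SV or the Broad script:

```python
from statistics import median


def make_batches(
    coverages: dict[str, float],
    batch_size: int = 100,
    mad_cutoff: float = 3.0,
) -> list[list[str]]:
    """Bin samples with similar median coverage; drop extreme outliers.

    `coverages` maps sample ID -> median coverage. Samples more than
    `mad_cutoff` MADs from the cohort median are removed entirely.
    """
    med = median(coverages.values())
    mad = median(abs(c - med) for c in coverages.values()) or 1.0
    kept = {s: c for s, c in coverages.items() if abs(c - med) / mad <= mad_cutoff}
    # Sorting by coverage makes adjacent samples, and hence each
    # fixed-size chunk, homogeneous in coverage.
    ordered = sorted(kept, key=kept.get)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

The real version would also need to fold in the dosage score (and possibly sex and PCR status), so treat this purely as a starting shape.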
Context: https://github.com/broadinstitute/gatk-sv#cohort-mode