This repository has been archived by the owner on Apr 4, 2024. It is now read-only.
Pipeline version v3.0
Summary
Major overhaul of all somatic analyses (SNVs, INDELs, implied purity, BAF, copy numbers and structural variation (SVs)).
Improvements to somatic SNV / Indel calling
- The consensus method has been replaced by Strelka-only calling with custom post-processing. The Mutect, Freebayes and Varscan callers have been removed from the pipeline.
- BQSR is now run prior to somatic calling (to improve Strelka precision).
- Strelka REPEAT filter is switched off to improve INDEL sensitivity.
- A new post-calling filter is added to Strelka output: variants in low-confidence regions are hard filtered unless they have > 10% AF and a Strelka somatic score > 20.
- A soft PON (pool of normals) filter is applied to the final Strelka output to improve precision.
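The low-confidence-region rule above can be read as a simple predicate. The following is a minimal sketch, assuming each variant has already been annotated with its tumor AF, its Strelka somatic score and a flag for low-confidence regions. The `Variant` class, its field names and the function name are hypothetical and are not taken from the pipeline code.

```python
from dataclasses import dataclass

# Thresholds as stated in the release notes; the constant names are illustrative.
MIN_AF = 0.10
MIN_SOMATIC_SCORE = 20


@dataclass
class Variant:
    """Hypothetical, already-annotated variant record."""
    allele_frequency: float          # tumor AF as a fraction between 0.0 and 1.0
    somatic_score: int               # Strelka somatic score
    in_low_confidence_region: bool   # True if the variant lies in a low-confidence region


def passes_low_confidence_filter(variant: Variant) -> bool:
    """Return True if the variant survives the low-confidence-region hard filter."""
    if not variant.in_low_confidence_region:
        # The rule only applies to low-confidence regions.
        return True
    # Both thresholds must be exceeded (strictly) for the variant to be kept.
    return (variant.allele_frequency > MIN_AF
            and variant.somatic_score > MIN_SOMATIC_SCORE)


kept = passes_low_confidence_filter(Variant(0.25, 35, True))      # both thresholds exceeded
dropped = passes_low_confidence_filter(Variant(0.25, 15, True))   # somatic score too low
```

Both comparisons are strict, matching the "> 10%" and "> 20" wording of the rule.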
Improvements to somatic SV calling
- ‘Alfredi’ (BPI) is run on Manta output and performs the following functions:
  - Applies a set of 8 filters to Manta output to remove obvious false positives / improve precision.
  - Determines the accurate break point of each variant.
  - Calculates an AF for each breakpoint end of each variant.
Tumor purity / BAF / copy numbers
- ‘PURPLE’ (PURity & PLoidy Estimator) replaces FreeC as the primary copy number tool
Key features of PURPLE:
- ‘COBALT’ (COunt BAm Lines of Tumor) counts the number of reads per kb window for both normal and tumor.
- GC bias is fit for both normal and tumor.
- ‘AMBER’ (A Minipileup Baf EstimatoR) calculates BAF for a set of HC common heterozygous SNPs.
- A set of candidate copy number breakpoints is determined using a PCF (piecewise constant fitting) algorithm on tumor, normal and BAF.
- Sample ploidy and purity are jointly fit by minimising a penalty function using an integer ploidy and minor allele ploidy model.
- Absolute copy number is determined for each segment.
- Candidate breakpoints are smoothed into a set of final copy number breakpoints.
- CIRCOS and QC plots are produced.
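To give a feel for fitting purity by minimising a penalty over integer copy numbers, here is a deliberately simplified toy. It is not PURPLE's penalty function and makes several assumptions that do not come from these notes: tumor/normal depth ratios are already normalised so that a diploid segment has ratio 1.0, only purity is fitted (the normalisation that PURPLE fits jointly as ploidy is taken as known), and BAF and minor allele ploidy are ignored. All function and variable names are hypothetical.

```python
# Toy purity fit under the assumptions stated above; not PURPLE's actual model.

def implied_copy_number(ratio: float, purity: float) -> float:
    """Tumor copy number implied by a depth ratio at a given purity.

    Rearranged from ratio = (purity * cn + (1 - purity) * 2) / 2,
    i.e. a mix of tumor cells (cn copies) and diploid normal cells.
    """
    return 2.0 + 2.0 * (ratio - 1.0) / purity


def penalty(ratios, weights, purity, diploid_weight=0.05):
    """Weighted distance to the nearest integer copy number, plus a small
    illustrative cost for every copy gained or lost relative to diploid.

    The second term breaks ties between purities that all yield integers.
    """
    total = 0.0
    for ratio, weight in zip(ratios, weights):
        cn = implied_copy_number(ratio, purity)
        nearest = round(cn)
        total += weight * (abs(cn - nearest) + diploid_weight * abs(nearest - 2))
    return total


def fit_purity(ratios, weights):
    """Grid search over purity from 10% to 100% in 1% steps."""
    grid = [i / 100 for i in range(10, 101)]
    return min(grid, key=lambda p: penalty(ratios, weights, p))


# Four equally weighted segments simulated at purity 0.6
# with true copy numbers 1, 2, 3 and 4.
segment_ratios = [0.7, 1.0, 1.3, 1.6]
segment_weights = [1.0, 1.0, 1.0, 1.0]
best_purity = fit_purity(segment_ratios, segment_weights)
print(f"best purity: {best_purity:.2f}")
```

Without the small diploid term, purities of 0.3 and 0.2 would also map every segment in this toy data to an integer copy number, which is why some form of tie-breaking penalty is needed.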
Other changes
- Introduce damage estimator tool to estimate DNA damage.
- BQSR no longer produces a QC report, cutting out 40% of total runtime.
- BQSR writes its BAM using lower zip compression leading to faster compute time but bigger recalibrated BAM files.
- A 26 SNP filter is added for changes to SNP check design.
- Health checker is now run as part of the pipeline.
- CPCT Slicing is removed as it is no longer used.
Changes to versions
- dx_tracks updated for KG from v1 to v1.2.1
Quality
- To assess the quality of the pipeline, we run the following checks:
  - Determine germline precision & sensitivity on an internally sequenced NA12878.
  - Determine somatic precision & sensitivity on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878, against 100% NA24385 as the reference sample.
Germline precision & sensitivity
Type | Config | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|---|
SNV | KG | GATK | 3115316 | 10421 | 38943 | 99.7% | 98.8% | 0.0% | 0.0% |
Somatic precision & sensitivity
Type | Algo | TP | FP | FN | Prec | Sens | Δ Prec | Δ Sens |
---|---|---|---|---|---|---|---|---|
INDEL | Strelka | 74432 | 577 | 22184 | 99.2% | 77.0% | -0.4% | 7.1% |
SNV | Strelka | 969786 | 1247 | 38058 | 99.9% | 96.2% | 0.1% | 0.0% |
Note: The soft PON filter has not been applied to the GIAB-mix results, as this is not possible due to the nature of the sample.
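The Prec and Sens columns follow from the TP, FP and FN counts via the standard definitions: precision = TP / (TP + FP) and sensitivity = TP / (TP + FN). The short sketch below recomputes them from the counts in the tables above. The function names and row labels are illustrative only.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are true positives."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true variants that were called."""
    return tp / (tp + fn)


# (TP, FP, FN) counts copied from the germline and somatic tables.
rows = {
    "germline SNV (GATK)": (3115316, 10421, 38943),
    "somatic INDEL (Strelka)": (74432, 577, 22184),
    "somatic SNV (Strelka)": (969786, 1247, 38058),
}

for name, (tp, fp, fn) in rows.items():
    print(f"{name}: precision {precision(tp, fp):.1%}, "
          f"sensitivity {sensitivity(tp, fn):.1%}")
```

Rounded to one decimal place, these reproduce the percentages reported in the tables.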