This repository has been archived by the owner on Apr 4, 2024. It is now read-only.

Pipeline version v3.0

@kduyvesteyn kduyvesteyn released this 15 Sep 12:21
· 340 commits to master since this release

Summary
Major overhaul of all somatic analyses (SNVs, INDELs, implied purity, BAF, copy numbers and structural variation (SVs)).

Improvements to somatic SNV / Indel calling

  • The consensus method has been replaced by Strelka-only calling with custom post-processing. The Mutect, Freebayes and Varscan callers have been removed from the pipeline.
  • BQSR is now run prior to somatic calling (to improve Strelka precision).
  • Strelka REPEAT filter is switched off to improve INDEL sensitivity.
  • A new post-calling filter is added to the Strelka output: variants in low-confidence regions are hard-filtered unless they have AF > 10% and a Strelka somatic score > 20.
  • A soft PON (pool of normals) filter is applied to the final Strelka output to improve precision.
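
The low-confidence region rule above can be sketched as a small predicate. This is a minimal illustration only, not the pipeline's actual post-processing code: the `SomaticCall` class, its field names and the function name are hypothetical, and the sketch assumes AF is expressed as a fraction between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class SomaticCall:
    """Hypothetical, simplified view of one Strelka call (not the real VCF model)."""
    allele_fraction: float           # tumor AF as a fraction, e.g. 0.12 for 12%
    somatic_score: int               # Strelka somatic score of the call
    in_low_confidence_region: bool   # True if the call lies in a low-confidence region


def passes_low_confidence_filter(call, min_af=0.10, min_score=20):
    """Keep calls outside low-confidence regions; inside them, require
    AF > 10% AND somatic score > 20 (both thresholds are strict)."""
    if not call.in_low_confidence_region:
        return True
    return call.allele_fraction > min_af and call.somatic_score > min_score


if __name__ == "__main__":
    calls = [
        SomaticCall(0.05, 40, False),  # outside low-confidence regions: kept
        SomaticCall(0.25, 35, True),   # passes both thresholds: kept
        SomaticCall(0.25, 15, True),   # score too low: filtered
        SomaticCall(0.08, 35, True),   # AF too low: filtered
    ]
    for call in calls:
        print(call, "->", "PASS" if passes_low_confidence_filter(call) else "FILTERED")
```

Both conditions must hold for a call in a low-confidence region to survive, so a high score alone does not rescue a low-AF call.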

Improvements to somatic SV calling

  • ‘Alfredi’ (BPI) is run on Manta output and performs the following functions:
    • Applies a set of 8 filters to Manta output to remove obvious false positives / improve precision.
    • Determines the accurate break point of each variant.
    • Calculates an AF for each breakpoint end of each variant.

Tumor purity / BAF / copy numbers

  • ‘PURPLE’ (PURity & PLoidy Estimator) replaces FreeC as the primary copy number tool.
    Key features of PURPLE:
    • ‘COBALT’ (COunt BAm Lines of Tumor) counts the number of reads per kb window for both normal and tumor.
    • GC bias is fit for both normal and tumor.
    • ‘AMBER’ (A Minipileup Baf EstimatoR) calculates BAF for a set of HC common heterozygous SNPs.
    • A set of candidate copy number breakpoints is determined using a PCF (piecewise constant fitting) algorithm on tumor, normal and BAF.
    • Sample ploidy and purity are jointly fit by minimising a penalty function using an integer ploidy and minor allele ploidy model.
    • Absolute copy number is determined for each segment.
    • Candidate breakpoints are smoothed into a set of final copy number breakpoints.
    • CIRCOS and QC plots are produced.
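
The two raw inputs described above (read counts per window and BAF at heterozygous SNPs) can be illustrated as follows. This is a rough sketch under stated assumptions, not the COBALT or AMBER implementation: the function names are hypothetical, the "kb window" is read as 1,000 bases, read positions are 0-based start coordinates, and BAF is taken as the alternate-allele depth divided by the total depth at a SNP.

```python
from collections import Counter

WINDOW_SIZE = 1000  # assumption: "kb window" interpreted as 1,000 bases


def reads_per_window(read_starts, window_size=WINDOW_SIZE):
    """Count reads per fixed-size window, keyed by the window start (0-based)."""
    return Counter((pos // window_size) * window_size for pos in read_starts)


def baf(ref_depth, alt_depth):
    """B-allele frequency at one SNP: alt depth over total depth.
    Returns None when the site has no coverage."""
    total = ref_depth + alt_depth
    if total == 0:
        return None
    return alt_depth / total


if __name__ == "__main__":
    tumor_reads = [5, 999, 1000, 2500]   # toy read start positions
    print(dict(reads_per_window(tumor_reads)))
    print(baf(ref_depth=30, alt_depth=10))
```

In the real tools these values are computed for both tumor and normal and then feed the segmentation and purity fit; the sketch only shows the per-window and per-SNP quantities.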

Other changes

  • Introduce damage estimator tool to estimate DNA damage.
  • BQSR no longer produces a QC report, cutting out 40% of total runtime.
  • BQSR writes its BAM using lower zip compression leading to faster compute time but bigger recalibrated BAM files.
  • A 26 SNP filter is added for changes to SNP check design.
  • Health checker is now run as part of the pipeline.
  • CPCT Slicing is removed as it is no longer used.

Changes to versions

  • dx_tracks updated for KG from v1 to v1.2.1

Quality

  • For assessing the quality of the pipeline we do the following checks:
    • Determine germline precision & sensitivity on an internally sequenced NA12878
    • Determine somatic precision & sensitivity on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878, against 100% of NA24385 as reference sample.
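
As a worked illustration of the GIAB-mix design: a variant that is private to NA12878 looks like a somatic variant when the mix is compared against pure NA24385. The sketch below estimates the AF such a variant would be expected to show. It assumes a diploid region, ideal mixing at the stated 30% fraction, and NA24385 being homozygous reference at the site; the function name is hypothetical.

```python
def expected_af(mix_fraction, alt_copies, copy_number=2):
    """Expected allele fraction of a variant carried only by the minor
    component of the mix (assumes the other component is homozygous
    reference and both have the same copy number at the site)."""
    return mix_fraction * alt_copies / copy_number


if __name__ == "__main__":
    na12878_fraction = 0.30  # 30% NA12878 in the mix, per the release notes
    print("heterozygous in NA12878:", expected_af(na12878_fraction, alt_copies=1))
    print("homozygous in NA12878:  ", expected_af(na12878_fraction, alt_copies=2))
```

Under these assumptions, heterozygous NA12878-private variants sit at roughly 15% AF and homozygous ones at roughly 30% AF in the mixed sample.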

Germline precision & sensitivity

Type  Config  Algo  TP       FP     FN     Prec   Sens   Δ Prec  Δ Sens
SNV   KG      GATK  3115316  10421  38943  99.7%  98.8%  0.0%    0.0%

Somatic precision & sensitivity

Type   Algo     TP      FP    FN     Prec   Sens   Δ Prec  Δ Sens
INDEL  Strelka  74432   577   22184  99.2%  77.0%  -0.4%   7.1%
SNV    Strelka  969786  1247  38058  99.9%  96.2%  0.1%    0.0%

Note: The soft PON filter has not been applied to the GIAB-mix results, as that is not possible due to the nature of the sample.
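
For reference, the Prec and Sens columns are consistent with the usual definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN). The short check below recomputes them from the TP, FP and FN counts in the tables; the function names and the row labels are illustrative only.

```python
def precision(tp, fp):
    """Fraction of reported calls that are true positives."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Fraction of true variants that were called."""
    return tp / (tp + fn)


# (TP, FP, FN) counts copied from the tables above
rows = {
    "germline SNV (GATK)": (3115316, 10421, 38943),
    "somatic INDEL (Strelka)": (74432, 577, 22184),
    "somatic SNV (Strelka)": (969786, 1247, 38058),
}

if __name__ == "__main__":
    for name, (tp, fp, fn) in rows.items():
        print(f"{name}: Prec {precision(tp, fp):.1%}, Sens {sensitivity(tp, fn):.1%}")
```

Rounded to one decimal place, the recomputed values match the percentages reported in the tables.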