Releases · hartwigmedical/pipeline-perl

04 Jan 13:00

kduyvesteyn

v4.8

2616d8f

Pipeline version 4.8 Latest


Summary

BAM links are now also generated in the final links.json when running from FASTQ (bug fix).
An additional config and parameterisation have been added to run PURPLE in SHALLOW_MODE.


27 Dec 10:38

kduyvesteyn

v4.7

8eb3c21

Pipeline version 4.7

Summary

Run post stats prior to indel realignment (bug fix)


20 Dec 15:01

kduyvesteyn

v4.5

09a7848

Pipeline version 4.5

Summary
This release has been made in preparation for pipeline v5, which is built on a completely new architecture and infrastructure. This release only contains some cleanups and bug fixes compared to v4.4.

Various resources and JARs used by the pipeline can be found on https://resources.hartwigmedicalfoundation.nl.

Improvements

Added a GRIDSS somatic filter step which filters the raw GRIDSS output down to a filtered VCF (using the GRIDSS PON).
The GRIDSS filtered VCF is fed into PURPLE, which uses the structural variants as usual but also tries to recover structural variants that were not previously called.

Cleanups

We generated a new amber BAF BED file to filter for likely heterozygous germline positions. This new BED file effectively leads to more BAF points, plus this file is now publicly shared on our resources page.
Manta and BPI have been removed
FastQC has been removed
The mappability tracks HDR file (used to annotate somatic variants with a mappability score) has been changed (bug fix).

Version changes

Purple to v2.17
New Rlibs dependencies (mainly for GRIDSS somatic filter), not publicly available. Tested on Rscript version v3.5.0

Somatic precision & sensitivity

The somatic precision and sensitivity of SNVs and Indels are determined on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878 against 100% of NA24385 as reference sample. Results are identical to pipeline v4.0:

Type	Algo	TP	FP	FN	Prec	Sens	Δ Prec	Δ Sens
INDEL	Strelka	74360	641	22412	99.1%	76.8%	0%	0%
SNV	Strelka	955590	1253	38084	99.9%	96.2%	0%	0%
MNV	Strelka	6868	21	0	99.7%	100.0%	0%	0%
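
The precision and sensitivity columns can be recomputed from the TP, FP and FN counts. The sketch below assumes the usual definitions (precision = TP / (TP + FP), sensitivity = TP / (TP + FN)); the class and function names are illustrative and not part of the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """True positive, false positive and false negative counts for one row."""
    tp: int
    fp: int
    fn: int

    @property
    def precision(self) -> float:
        # Fraction of called variants that are in the truth set.
        return self.tp / (self.tp + self.fp)

    @property
    def sensitivity(self) -> float:
        # Fraction of truth-set variants that were called.
        return self.tp / (self.tp + self.fn)


def as_percent(value: float) -> str:
    """Format a fraction as a percentage with one decimal."""
    return f"{value * 100:.1f}%"


# Counts copied from the table above.
RESULTS = {
    "INDEL": Counts(tp=74360, fp=641, fn=22412),
    "SNV": Counts(tp=955590, fp=1253, fn=38084),
    "MNV": Counts(tp=6868, fp=21, fn=0),
}

for variant_type, counts in RESULTS.items():
    print(variant_type, as_percent(counts.precision), as_percent(counts.sensitivity))
```

Rounded to one decimal, the recomputed values agree with the reported percentages.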


18 Oct 14:40

kduyvesteyn

v4.4

1b719d5

Pipeline version 4.4

Upgrade of GRIDSS to v2.0.1


06 Sep 09:12

kduyvesteyn

v4.3

5ae5010

Pipeline version 4.3

Configuration changes in GRIDSS compared to pipeline v4.2


28 Aug 13:20

kduyvesteyn

v4.2

3222169

Pipeline version 4.2

Summary

Compared to v4.0, this release upgrades GRIDSS from v1.8.0 to v1.9.0.
Various improvements to the GRIDSS somatic SV calling algorithm have been made based on 163 GRIDSS runs done with pipeline v4.0, and have been released as part of GRIDSS v1.9.0.

Other changes

We retain the metrics generated by the GRIDSS PreProcess steps. These metrics used to be cleaned up after a successful v4.0 run but can be useful for debugging.
BPI is upgraded from v1.6 to v1.7 (bug fix release)
Amber is upgraded from v1.5 to v1.6 (bug fix release)


22 Jul 07:07

kduyvesteyn

v4.0

0a7c440

Pipeline version 4.0

Summary
Many minor changes to all somatic algorithms plus addition of GRIDSS structural variant caller.
Removal of KG pipeline and removal of tumor GATK calling.

Various resources and JARs used by the pipeline can be found on https://resources.hartwigmedicalfoundation.nl.

Improvements to somatic SNV / Indel calling

To improve sensitivity, variants on known pathogenic locations are retained all the way through Strelka if they are called by the initial Strelka (raw) caller. The list used by HMF can be found on the resources page and is based on CIViC, CGI and OncoKB, appended with a few promoter positions in the TERT gene.
Post-Strelka, variants are annotated with a mapping probability based on information known about the mappability of positions in the ref genome.
Switched from Germline PON v1.1 to Germline PON v2.0
Added a Somatic PON which filters out specific Strelka artefacts.
Added MNV merging. Variants that potentially affect the same codon(s) are checked for phasing and merged if they are phased. This is done within the Strelka Post Process JAR.
Cosmic annotation has been adjusted such that the COSMIC ID for every transcript affected by a variant is included, not just a random single COSMIC ID. Information is provided in the INFO to pick the COSMIC ID for a specific transcript.
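
As an illustration of the MNV merging idea, the sketch below merges directly adjacent SNVs that share a phase set. This is a simplification under stated assumptions: the actual Strelka Post Process JAR considers variants that may affect the same codon(s), whereas this example only handles directly adjacent SNVs, and the `Variant` type and `phase_set` field are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal SNV/MNV record on a single chromosome."""
    pos: int
    ref: str
    alt: str
    phase_set: Optional[int] = None  # None means phasing is unknown


def merge_phased_adjacent(variants: List[Variant]) -> List[Variant]:
    """Merge directly adjacent variants that share a known phase set."""
    merged: List[Variant] = []
    for variant in sorted(variants, key=lambda v: v.pos):
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and variant.phase_set is not None
            and variant.phase_set == previous.phase_set
            and variant.pos == previous.pos + len(previous.ref)
        ):
            # Same haplotype and contiguous: combine into a single MNV.
            merged[-1] = Variant(
                previous.pos,
                previous.ref + variant.ref,
                previous.alt + variant.alt,
                previous.phase_set,
            )
        else:
            merged.append(variant)
    return merged


calls = [
    Variant(100, "A", "T", phase_set=1),
    Variant(101, "C", "G", phase_set=1),
    Variant(105, "G", "A", phase_set=1),
    Variant(106, "T", "C"),  # unphased, so it is left alone
]
print(merge_phased_adjacent(calls))
```

Here the calls at positions 100 and 101 become one MNV (AC to TG), while the unphased call at 106 is not merged with its neighbour at 105.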

Added GRIDSS as an additional somatic structural variant caller

GRIDSS is implemented next to Manta/BPI and our intention is to eventually replace Manta/BPI since we expect it to perform better across our cohort of samples. All documentation on GRIDSS can be found on https://github.com/PapenfussLab/gridss.

Other changes

Germline calling is now only performed on the reference sample and hence the germline VCF contains the calls for just one sample.
Every final VCF (germline, somatic, sv, etc) is gzipped and a tabix index is provided along with the gzipped VCF.
The kinship test to detect sample swaps is replaced by a test based on BAF scores. The main reason is that kinship penalises het-to-hom transitions, which happen in relation to the degree of LOH. Using BAFs, we can detect sample swaps by observing a mean BAF that significantly deviates from 0.5, which is independent of degree of LOH in the tumor.
The QC checks are now run as part of the pipeline while they previously used to be a post-pipeline step.
KG configuration is no longer supported, but there is an INI to analyse just a single sample. This INI runs the algorithms that would normally be run on the reference sample of a somatic pair of samples.
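
The BAF-based sample swap test described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function name and the tolerance of 0.03 are assumptions chosen for the example.

```python
from statistics import mean
from typing import Sequence


def mean_baf_deviates(
    bafs: Sequence[float], expected: float = 0.5, tolerance: float = 0.03
) -> bool:
    """Return True if the mean BAF is further than `tolerance` from `expected`.

    `tolerance` is a hypothetical threshold, not a value taken from the pipeline.
    """
    if not bafs:
        raise ValueError("at least one BAF value is required")
    return abs(mean(bafs) - expected) > tolerance


# With LOH, heterozygous sites shift symmetrically (e.g. 0.2 and 0.8),
# so the mean stays close to 0.5.
loh_sample = [0.2, 0.8, 0.25, 0.75]

# With a swapped sample, many sites that were expected to be heterozygous
# are homozygous, pulling the mean away from 0.5.
swapped_sample = [0.0, 0.0, 1.0, 0.0, 0.5, 0.0]

print(mean_baf_deviates(loh_sample))
print(mean_baf_deviates(swapped_sample))
```

The first call reports no deviation despite strong LOH, which is the property that motivated the move away from the kinship test.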

New tool versions

GRIDSS introduced at version v1.8.0 (using bwa v0.7.17)

Version changes

Purple v1.2 to v2.14
Cobalt v1.0 to v1.4
Amber v1.0 to v1.5
BPI v1.2 to v1.6
Strelka Post Process v1.0 to v1.4
HealthChecker v2.1 to v2.4
GATK v3.4.46 to v3.8
snpEff v4.1h to v4.3s

Quality

Since we don't have a KG pipeline anymore we don't report germline precision and sensitivity.

Somatic precision & sensitivity

Type	Algo	TP	FP	FN	Prec	Sens	Δ Prec	Δ Sens
INDEL	Strelka	74360	641	22412	99.1%	76.8%	-0.1%	-0.2%
SNV	Strelka	955590	1253	38084	99.9%	96.2%	0%	0%
MNV	Strelka	6868	21	0	99.7%	100.0%	-	-

Note: The differences compared to v3 are entirely attributable to changes in the way we measure the above numbers. Applying the same method to v3 and v4 yields no differences, which is as expected since we made no changes that significantly affect either sensitivity or precision.

In addition, to measure exact false positive rate, we analyse a sample against itself in roughly 30x/100x coverage. With pipeline v4.0 release we find 136 false positives in total across the whole genome (109 SNVs and 27 INDELs).


15 Sep 12:21

kduyvesteyn

v3.0

a5ea26f

Pipeline version v3.0

Summary
Major overhaul of all somatic analyses (SNVs, INDELs, implied purity, BAF, copy numbers and structural variation (SVs)).

Improvements to somatic SNV / Indel calling

The consensus method has been replaced by Strelka-only calling with custom post processing. Mutect, Freebayes and Varscan callers have been removed from the pipeline.
BQSR is now run prior to somatic calling (to improve Strelka precision).
Strelka REPEAT filter is switched off to improve INDEL sensitivity.
A new post calling filter is added to Strelka output: variants in the low confidence regions are hard filtered unless they have > 10% AF and Strelka Somatic score > 20.
A soft PON (pool of normals) filter is applied to the final Strelka output to improve precision.
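
The low confidence region filter in the list above amounts to a simple rule. The sketch below uses the thresholds stated there (AF > 10% and Strelka somatic score > 20); the function and parameter names are hypothetical.

```python
def passes_low_confidence_filter(
    in_low_confidence_region: bool,
    allele_frequency: float,
    somatic_score: float,
) -> bool:
    """Return True if a variant survives the low confidence region filter.

    Variants outside low confidence regions are not affected by this filter.
    Inside those regions, a variant is kept only if it has both AF > 10%
    and a Strelka somatic score > 20.
    """
    if not in_low_confidence_region:
        return True
    return allele_frequency > 0.10 and somatic_score > 20


print(passes_low_confidence_filter(True, 0.25, 35))   # strong call, kept
print(passes_low_confidence_filter(True, 0.25, 15))   # low score, filtered
print(passes_low_confidence_filter(True, 0.05, 35))   # low AF, filtered
print(passes_low_confidence_filter(False, 0.05, 15))  # outside region, kept
```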

Improvements to somatic SV calling

‘Alfredi’ (BPI) is run on Manta output and performs the following functions:
- Applies a set of 8 filters to Manta output to remove obvious false positives / improve precision.
- Determines the accurate break point of each variant.
- Calculates an AF for each breakpoint end on each variant.

Tumor purity / BAF / copy numbers

‘PURPLE’ (PURity & PLoidy Estimator) replaces FreeC as the primary copy number tool
Key features of PURPLE:
- ‘COBALT’ (COunt BAm Lines of Tumor) counts the number of reads per kb window for both normal and tumor.
- GC bias is fit for both normal and tumor.
- ‘AMBER’ (A Minipileup Baf EstimatoR) calculates BAF for a set of HC common heterozygous SNPs.
- A set of candidate copy number breakpoints is determined using a PCF (piecewise constant fitting) algorithm on tumor, normal and BAF.
- Sample ploidy and purity are jointly fit by minimising a penalty function using an integer ploidy and minor allele ploidy model.
- Absolute copy number is determined for each segment.
- Candidate breakpoints are smoothed into a set of final copy number breakpoints.
- CIRCOS and QC plots are produced.
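
The read counting step attributed to COBALT in the list above can be illustrated with a small sketch. It assumes non-overlapping windows of 1000 bases and 1-based read start positions on a single chromosome; both are assumptions for this example, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

WINDOW_SIZE = 1000  # assumed window width in bases


def window_start(position: int, window_size: int = WINDOW_SIZE) -> int:
    """Return the 1-based start of the window containing a 1-based position."""
    return (position - 1) // window_size * window_size + 1


def count_reads_per_window(
    read_starts: Iterable[int], window_size: int = WINDOW_SIZE
) -> Dict[int, int]:
    """Count reads per window, keyed by the window's 1-based start position."""
    return dict(Counter(window_start(p, window_size) for p in read_starts))


# Toy read start positions; a real run would take these from a BAM file.
tumor_counts = count_reads_per_window([1, 500, 1000, 1001, 2500])
print(tumor_counts)  # {1: 3, 1001: 1, 2001: 1}
```

The same counting would be applied to the normal sample, so that the two sets of window counts can be compared after GC bias correction.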

Other changes

Introduce damage estimator tool to estimate DNA damage.
BQSR no longer produces a QC report, cutting out 40% of total runtime.
BQSR writes its BAM using lower zip compression leading to faster compute time but bigger recalibrated BAM files.
A 26 SNP filter is added for changes to SNP check design.
Health checker is now run as part of the pipeline.
CPCT Slicing is removed as it is no longer used.

Changes to versions

dx_tracks updated for KG from v1 to v1.2.1

Quality

For assessing the quality of the pipeline we do the following checks:
- Determine germline precision & sensitivity on an internally sequenced NA12878
- Determine somatic precision & sensitivity on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878, against 100% of NA24385 as reference sample.
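
To see why the GIAB-mix works as a somatic benchmark, consider the allele frequency at which a variant private to NA12878 would be expected to appear in the mix. The arithmetic below assumes diploid genotypes and an exact 70/30 mix, both simplifications; the function name is illustrative.

```python
def expected_allele_frequency(
    sample_fraction: float, alt_copies: int, ploidy: int = 2
) -> float:
    """Expected AF in the mix for a variant present only in one sample.

    sample_fraction: share of the mix contributed by the sample carrying the variant.
    alt_copies: number of alternate alleles in that sample (1 = het, 2 = hom).
    """
    return sample_fraction * alt_copies / ploidy


NA12878_FRACTION = 0.30  # taken from the 70% / 30% mix described above

het_af = expected_allele_frequency(NA12878_FRACTION, alt_copies=1)
hom_af = expected_allele_frequency(NA12878_FRACTION, alt_copies=2)
print(f"heterozygous in NA12878: {het_af:.2f}")
print(f"homozygous in NA12878:   {hom_af:.2f}")
```

Under these assumptions, heterozygous NA12878-only variants sit around 15% AF and homozygous ones around 30%, which resembles subclonal or low-purity somatic variants.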

Germline precision & sensitivity

Type	Config	Algo	TP	FP	FN	Prec	Sens	Δ Prec	Δ Sens
SNV	KG	GATK	3115316	10421	38943	99.7%	98.8%	0.0%	0.0%

Somatic precision & sensitivity

Type	Algo	TP	FP	FN	Prec	Sens	Δ Prec	Δ Sens
INDEL	Strelka	74432	577	22184	99.2%	77.0%	-0.4%	7.1%
SNV	Strelka	969786	1247	38058	99.9%	96.2%	0.1%	0.0%

Note: The soft PON filter has not been applied to the GIAB-mix results, as that is not possible due to the nature of the sample.


02 Mar 10:38

kduyvesteyn

v1.12

160eaff

Pipeline version 1.12

Summary

Somatic calling improved through improved Strelka and Freebayes filtering
FreeC now uses GC-normalization and produces BAF analysis.
First implementation of SV-calling using Manta.

Improvements to CNV calling

FreeC copy number output is now based on GC content normalisation and assesses BAF, to remove the earlier observed "wave effect".
Many corner cases solved and new tool version put in place.

Improvements to somatic calling

Freebayes normalisation and filtering is substantially improved:
- Calls without normal coverage are no longer added to the final VCF.
- No SNPs with length > 1 due to INDELs on the same line/position
- Improvement of left-aligned, single-padded INDEL representation
Improved Strelka filtering by new allelic frequency-based filtering:
- We accept lower quality variants provided they have sufficient frequency

Technical changes

Fixes in tools
- Fix a corner-case where VarScan would error due to normal input being longer than tumor input for a given chromosome.
- Fix a bug in GATK where BQSR statistics are not flushed before producing the report.
Completely standardise job creation and submission
- All jobs have their name, template, job ID, script name, log files and .done files standardised and unified; a single function is responsible for submitting jobs to SGE
- As a consequence, every job has its own .done file, which is more granular but more files to delete when re-running
- Greater job re-use, for e.g. concatenating VCFs.
- Backwards compatibility with previously-inconsistent .done file names is provided, so re-running previous samples/parts of samples should work seamlessly
Cleanup of template structure
- Provided helper functions and consistency around standard operations like logging timing, logging status to dashboard, validating that output files exist etc.
- For the most part templates should be like functions: they are told their inputs and where to store their output.
Reduce implicit (potentially inconsistent) duplication of paths/filenames.
Take qsub options from a file instead of the command line, eliminating the need to fit hold job IDs into the maximum command-line length.
Fix status reporting to dashboard.
Validate that FASTQ name does not begin with a dash (likely to be interpreted as a switch to commands).
Remove secondary (unused) mappability track configuration in FREEC
Add an INI option to retain the recalibrated BAM file (useful for testing/experiments).
links.json now has relative paths (relative to the run directory). This makes the paths portable as the run dir is copied around.
New additions to the extras.tar in the portal:
- Final ini used when running the pipeline
- ExonCov preferred transcripts
- Germline and potentially somatic SVs from Manta (depending on INI used).
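
The switch to relative paths in links.json, mentioned in the list above, can be illustrated as follows. The directory layout and the JSON keys are hypothetical; only the idea of storing paths relative to the run directory comes from the release notes.

```python
import json
import posixpath
from typing import Dict


def relative_links(run_dir: str, absolute_paths: Dict[str, str]) -> Dict[str, str]:
    """Rewrite absolute artefact paths so they are relative to the run directory."""
    return {
        name: posixpath.relpath(path, start=run_dir)
        for name, path in absolute_paths.items()
    }


# Hypothetical run directory and artefacts.
run_dir = "/data/runs/run1"
artefacts = {
    "ref_bam": "/data/runs/run1/sampleR/mapping/sampleR.bam",
    "somatic_vcf": "/data/runs/run1/somaticVariants/sampleR_sampleT.vcf",
}

links = relative_links(run_dir, artefacts)
print(json.dumps(links, indent=2))
```

Because the stored paths no longer mention `/data/runs`, they remain valid when the whole run directory is copied elsewhere.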

Tools & version changes

FreeC upgraded to 10.3
FreeC BAF uses dbSNP v149 sliced using CytoscanHD positions
ExonCov upgraded to 2.1.3
Manta introduced at 1.0.3
BCFtools 1.3.1 used by Freebayes post process

Quality

For assessing the quality of the pipeline we do the following checks:
- Determine germline precision & sensitivity on an internally sequenced NA12878
- Determine somatic precision & sensitivity on an internally sequenced GIAB-mix of 70% NA24385 and 30% NA12878, against 100% of NA24385 as reference sample.

Germline precision & sensitivity

Type	Config	Algo	TP	FP	FN	Prec	Sens	Δ Prec	Δ Sens
SNP	KG	GATK	3115316	10425	38943	99.7%	98.8%	0.0%	0.0%

Somatic precision & sensitivity

Type	Algo	TP	FP	FN	Prec	Sens	Δ Prec	Δ Sens
INDEL	Freebayes	66002	494	30614	99.3%	68.3%	0.4%	1.7%
INDEL	Strelka	67300	275	29316	99.6%	69.7%	0.2%	19.6%
INDEL	Varscan	63576	624	33040	99.0%	65.8%	0.0%	0.0%
SNP	Freebayes	936784	1014	71060	99.9%	93.0%	0.0%	0.2%
SNP	Mutect	931973	5948	75871	99.4%	92.5%	0.0%	0.0%
SNP	Strelka	969316	2068	38528	99.8%	96.2%	0.2%	2.8%
SNP	Varscan	899598	832	108246	99.9%	89.3%	0.0%	0.0%


22 Dec 13:47

kduyvesteyn

v1.11

b35050e

Pipeline version 1.11

Summary

Release with final fixes and validation for KG

Changes for KG

Add CallableLoci functionality, enable in KG.ini
Enable ExonCov in KG.ini (disabled in v1.10)
Recalibrated BAM should not be the final BAM for BQSR runs due to size/lack of FASTQ recoverability: do not link it and delete it on success
Added the ability to bundle some pipeline artefacts as "extras" that are themselves linked (in links.json) as a single archive file, and use this for a selection of extra files for KG

Technical Changes

Code has been rewritten to adhere to standard Perl structure and automated tests have been added; build status is viewable via travis-ci

Validation

Regression for our somatic pipeline succeeded with no change in BAM and no regression in precision/sensitivity for germline or somatic VCF
For validation of single sample pipeline we use data for NA12878, internally known as VAL-S00025. This sample has a truthset of 3154259 variants, and we achieve the following results:

Type	Algo	TP	FP	FN	Prec	Sens	Δ Prec	Δ Sens
ANY	GATK	3115315	10425	38944	99.7%	98.8%	-	-
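
As a consistency check on the row above, the true positives and false negatives should add up to the truth set size, and sensitivity can then be read as TP divided by the truth set. A minimal sketch, with variable names chosen for illustration:

```python
TRUTH_SET_SIZE = 3154259  # truth set size for NA12878 quoted above
TP, FP, FN = 3115315, 10425, 38944  # counts from the table row

# Every truth-set variant is either found (TP) or missed (FN).
truth_set_is_consistent = TP + FN == TRUTH_SET_SIZE

sensitivity = TP / TRUTH_SET_SIZE
precision = TP / (TP + FP)

print(truth_set_is_consistent)
print(f"sensitivity: {sensitivity * 100:.1f}%")
print(f"precision:   {precision * 100:.1f}%")
```

The counts are consistent with the stated truth set, and the rounded values match the reported 98.8% and 99.7%.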

