NGS DNA best practice pipeline for Illumina sequencing - alignment, variant calling, annotation and QC

NGS_DNA pipeline

Manual

The manual on installation and use is available at https://molgenis.gitbooks.io/ngs_dna

Preprocessing

During the first preprocessing steps of the pipeline, PhiX reads are inserted into each sample to create control SNPs in the dataset. Subsequently, the Illumina quality encoding is checked and QC metrics are calculated using FastQC [1].
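
As an illustration of the encoding check, the sketch below guesses whether FASTQ quality strings use a Phred+33 or a Phred+64 offset. It assumes that "Illumina encoding" refers to the quality score offset; the function name `guess_phred_offset` and the cut-off values are illustrative choices based on the commonly documented ASCII ranges, not the pipeline's own implementation.

```python
def guess_phred_offset(quality_lines):
    """Guess the Phred offset (33 or 64) from FASTQ quality strings.

    Heuristic (an assumption of this sketch, not the pipeline's logic):
    - characters below ASCII 59 only occur in Phred+33 data;
    - characters above ASCII 74 only occur in Phred+64 data;
    - anything else is ambiguous and returns None.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None  # every character falls in the range shared by both encodings


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#5"]))   # contains '#', so Phred+33
    print(guess_phred_offset(["hhhhdd"]))   # contains 'h', so Phred+64
```

A real check would sample many reads, because a handful of high-quality reads can fall entirely inside the range that both encodings share.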

Alignment to a reference genome

The BWA-MEM algorithm of the Burrows-Wheeler Aligner (BWA) [2] is used to align the sequence data to a reference genome, resulting in a SAM (Sequence Alignment Map) file. The reads in the SAM file are sorted with Sambamba [3], resulting in a sorted BAM file. When multiple lanes are used during sequencing, all lane BAMs are merged into a single sample BAM using Sambamba. Duplicates of the same read pair are then marked in the (merged) BAM file, again using Sambamba.
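
The order of the alignment steps can be sketched as a small planner that only builds the command lines and executes nothing. File names, the thread count and the function name `plan_alignment` are hypothetical; the pipeline's own protocols may pass additional options (read groups, memory limits, temporary directories) that are not shown here.

```python
def plan_alignment(sample, lanes, reference, threads=4):
    """Return the alignment commands as (argv, stdout_path) pairs.

    `lanes` maps a lane name to a (fastq_R1, fastq_R2) tuple.
    `stdout_path` is the file that should receive the command's standard
    output (bwa mem writes SAM to stdout), or None if the tool writes
    its own output file.
    """
    if not lanes:
        raise ValueError("at least one lane is required")

    steps = []
    sorted_bams = []
    for lane, (fq1, fq2) in sorted(lanes.items()):
        sam = f"{sample}_{lane}.sam"
        bam = f"{sample}_{lane}.bam"
        sorted_bam = f"{sample}_{lane}.sorted.bam"
        # Align the read pairs of one lane; SAM goes to stdout.
        steps.append((["bwa", "mem", "-t", str(threads), reference, fq1, fq2], sam))
        # Convert SAM to BAM, then sort by coordinate.
        steps.append((["sambamba", "view", "-S", "-f", "bam", "-o", bam, sam], None))
        steps.append((["sambamba", "sort", "-o", sorted_bam, bam], None))
        sorted_bams.append(sorted_bam)

    if len(sorted_bams) > 1:
        # Several lanes: merge the lane BAMs into one sample BAM.
        merged = f"{sample}.merged.bam"
        steps.append((["sambamba", "merge", merged] + sorted_bams, None))
    else:
        merged = sorted_bams[0]

    # Mark duplicate read pairs in the (merged) BAM.
    steps.append((["sambamba", "markdup", merged, f"{sample}.dedup.bam"], None))
    return steps


if __name__ == "__main__":
    example = {"L1": ("s1_L1_R1.fq.gz", "s1_L1_R2.fq.gz"),
               "L2": ("s1_L2_R1.fq.gz", "s1_L2_R2.fq.gz")}
    for argv, stdout_path in plan_alignment("s1", example, "reference.fa"):
        suffix = f" > {stdout_path}" if stdout_path else ""
        print(" ".join(argv) + suffix)
```

Returning the commands as data keeps the sketch testable without BWA or Sambamba installed; each `argv` list could be handed to `subprocess.run` on a system where the tools are available.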

Variant discovery

The GATK [4] HaplotypeCaller estimates the most likely genotypes and allele frequencies in an alignment using a Bayesian likelihood model for every position of the genome, regardless of whether a variant was detected at that site. This information can later be used in the project-based genotyping step, in which all samples in the project are analysed jointly. The joint analysis yields a posterior probability of a variant allele at each site. SNPs and small indels are written to a VCF file, along with information such as genotype quality, allele frequency, strand bias and read depth for each SNP or indel. Based on quality thresholds from the GATK "best practices" [5], the SNPs and indels are filtered and marked as Lowqual or Pass, resulting in a final VCF file.
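
The final filtering step can be illustrated with a small threshold filter. The cut-offs below are the generic hard-filter values that GATK documentation commonly recommends for SNPs; they are an assumption of this sketch, and the thresholds actually configured in this pipeline may differ. The names `SNP_FAIL_RULES` and `filter_label` are hypothetical.

```python
# Each rule returns True when the annotation value FAILS the filter.
# Values follow the generic GATK hard-filter recommendations for SNPs
# (assumed here; not taken from this pipeline's parameter files).
SNP_FAIL_RULES = {
    "QD": lambda v: v < 2.0,               # quality by depth
    "FS": lambda v: v > 60.0,              # Fisher strand bias
    "MQ": lambda v: v < 40.0,              # RMS mapping quality
    "MQRankSum": lambda v: v < -12.5,      # mapping quality rank sum
    "ReadPosRankSum": lambda v: v < -8.0,  # read position rank sum
}


def filter_label(info):
    """Return "LowQual" if any present annotation fails its rule, else "PASS".

    `info` is a dict of numeric INFO annotations for one SNP.
    Annotations that are missing from `info` are skipped.
    """
    for key, fails in SNP_FAIL_RULES.items():
        value = info.get(key)
        if value is not None and fails(value):
            return "LowQual"
    return "PASS"


if __name__ == "__main__":
    print(filter_label({"QD": 12.3, "FS": 1.2, "MQ": 60.0}))  # PASS
    print(filter_label({"QD": 1.5, "FS": 1.2, "MQ": 60.0}))   # LowQual
```

Indels are normally filtered with a separate set of thresholds, so a fuller version would select the rule table by variant type.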

References

1. Andrews S (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
2. Li H, Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform.
3. Tarasov A et al. (2015). Sambamba: Fast processing of NGS alignment formats.
4. McKenna A et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
5. Van der Auwera GA et al. (2013). From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.