Genotyping pipeline for short-read Illumina data generated from whole-genome sequencing
Scripts for processing raw reads into a set of high-quality genotypes, as in "Genomic signatures of extensive inbreeding in Isle Royale wolves, a population on the threshold of extinction" by Robinson et al. (2019). The pipeline is adapted from the GATK3 Best Practices.
- Java 1.8
- Reference genome assembly in fasta format and associated index files (see https://gatkforums.broadinstitute.org/gatk/discussion/2798/howto-prepare-a-reference-for-use-with-bwa-and-gatk)
Convert raw read data to an unmapped BAM file.
Mark Illumina adapter sequences in the unmapped BAM file.
Efficiently align reads to a reference genome.
Mark (and optionally remove) duplicate reads.
5. RemoveBadReads (optional)
Remove reads that have low mapping quality or that do not align as a proper pair (as indicated by their SAM flags). This step is optional; it reduces the size of the BAM file by eliminating reads that are not wanted in downstream processing.
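
As an illustration of the filtering logic described above, here is a minimal Python sketch. It is not the script used in the pipeline. The flag bit values come from the SAM specification, but the function name `keep_read`, the default threshold `min_mapq=30`, and the optional `excluded_flags` mask are assumptions chosen for the example.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_PAIRED = 0x1        # read is paired in sequencing
FLAG_PROPER_PAIR = 0x2   # read is mapped in a proper pair
FLAG_UNMAPPED = 0x4      # read is unmapped


def keep_read(flag: int, mapq: int, min_mapq: int = 30,
              excluded_flags: int = FLAG_UNMAPPED) -> bool:
    """Return True if a read should be kept.

    min_mapq and excluded_flags are illustrative defaults (assumptions),
    not values taken from the pipeline.
    """
    required = FLAG_PAIRED | FLAG_PROPER_PAIR
    # All required bits must be set.
    if (flag & required) != required:
        return False
    # None of the excluded bits may be set.
    if (flag & excluded_flags) != 0:
        return False
    # Mapping quality must reach the threshold.
    return mapq >= min_mapq
```

For example, a read with flag 99 (paired, proper pair, mate on the reverse strand, first in pair) and mapping quality 60 is kept, while the same read with mapping quality 10 is dropped.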
Recalibrate base quality scores until the reported and empirical base quality scores converge.
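
To make "reported" versus "empirical" quality concrete, the sketch below converts an observed mismatch rate into a Phred-scaled score using Q = -10 log10(p). It is a simplified illustration and not GATK's recalibration model. The function names and the pseudocount smoothing (+1 error, +2 observations) are assumptions made for the example.

```python
import math


def phred(error_probability: float) -> float:
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_probability)


def empirical_quality(n_observations: int, n_mismatches: int) -> float:
    """Estimate the empirical quality for a group of bases.

    The pseudocounts avoid an infinite score when no mismatches are
    observed; this smoothing is an assumption of this sketch.
    """
    return phred((n_mismatches + 1) / (n_observations + 2))


def quality_shift(reported_q: float, n_observations: int, n_mismatches: int) -> float:
    """Difference between empirical and reported quality (empirical - reported)."""
    return empirical_quality(n_observations, n_mismatches) - reported_q
```

If bases reported at Q40 show 9 mismatches in 9,998 observations, the empirical quality is about Q30, so those bases would be adjusted downward by roughly 10. Recalibration has converged when such shifts are close to zero.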
Generate a gVCF file for each BAM file.
Generate a VCF file from gVCF files.
9. TrimAlternates and VariantAnnotator
Remove alleles from the VCF file that don't appear in any genotypes, and add desired annotations to the INFO field.
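
The allele-trimming idea can be sketched in Python as follows. This is an illustration of the concept and not the GATK implementation. The function name `trim_unused_alts` and the data layout (ALT alleles as a list of strings, genotypes as tuples of allele indices with `None` for missing calls) are assumptions. Per-allele annotations such as AD or PL would also need to be subset in a real VCF, which this sketch does not handle.

```python
from typing import List, Optional, Tuple

Genotype = Tuple[Optional[int], ...]


def trim_unused_alts(alts: List[str],
                     genotypes: List[Genotype]) -> Tuple[List[str], List[Genotype]]:
    """Drop ALT alleles that no genotype carries and renumber the rest.

    Allele index 0 is REF; index i > 0 refers to alts[i - 1].
    """
    # Collect the ALT indices that actually occur in some genotype.
    used = sorted({a for gt in genotypes for a in gt if a is not None and a > 0})
    # Map old allele indices to new, consecutive ones (REF stays 0).
    remap = {0: 0}
    for new_index, old_index in enumerate(used, start=1):
        remap[old_index] = new_index
    new_alts = [alts[i - 1] for i in used]
    new_genotypes = [
        tuple(None if a is None else remap[a] for a in gt) for gt in genotypes
    ]
    return new_alts, new_genotypes
```

With ALT alleles `["A", "T", "G"]` and genotypes `[(0, 1), (3, 3), (None, None)]`, the unused allele `T` is removed and the `3/3` genotype becomes `2/2`.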
10. Variant Effect Predictor
Add mutation impact to the INFO field.
11. VariantFiltration and custom filtering
Apply site- and genotype-level filters.
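
The distinction between the two filter levels can be sketched in Python as follows. A genotype-level filter masks individual calls, while a site-level filter flags the whole record. All thresholds, filter names (`LowQual`, `HighMissing`), and function names here are hypothetical and are not the values used in the pipeline.

```python
from typing import List, Optional, Tuple

Genotype = Tuple[int, ...]


def mask_genotype(gt: Genotype, depth: int, gq: int,
                  min_depth: int = 6, max_depth: int = 100,
                  min_gq: int = 20) -> Optional[Genotype]:
    """Genotype-level filter: return None (missing) if the call fails.

    The depth and GQ thresholds are illustrative assumptions.
    """
    if depth < min_depth or depth > max_depth or gq < min_gq:
        return None
    return gt


def failed_site_filters(qual: float, n_called: int, n_samples: int,
                        min_qual: float = 30.0,
                        max_missing: float = 0.2) -> List[str]:
    """Site-level filter: return the names of the filters the site fails.

    An empty list means the site passes. Thresholds are assumptions.
    """
    failed = []
    if qual < min_qual:
        failed.append("LowQual")
    if n_samples > 0 and (n_samples - n_called) / n_samples > max_missing:
        failed.append("HighMissing")
    return failed
```

In this sketch the genotype-level filter would be applied first, so that the count of called genotypes passed to the site-level filter reflects the masked calls.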