HipSTR

Haplotype-based imputation, phasing and genotyping of STRs

Author: Thomas Willems twillems@mit.edu

License: GNU v2

Introduction

Short tandem repeats (STRs) are highly repetitive genomic sequences comprised of repeated copies of an underlying motif. Prevalent in most organisms' genomes, STRs are of particular interest because they mutate much more rapidly than most other genomic elements. As a result, they're extremely informative for genomic identification, ancestry inference and genealogy.

Despite their utility, STRs are particularly difficult to genotype. The repetitive sequence responsible for their high mutability also results in frequent alignment errors that can complicate and bias downstream analyses. In addition, PCR stutter errors often result in reads that contain additional or fewer repeat copies than the true underlying genotype.

HipSTR was specifically developed to deal with these errors in the hopes of obtaining more robust STR genotypes. In particular, it accomplishes this by:

Learning locus-specific PCR stutter models using an [EM algorithm] (http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
Mining candidate STR alleles from population-scale sequencing data
Utilizing phased SNP haplotypes to genotype, phase and/or impute STRs
Employing a specialized hidden Markov model to align reads to candidate sequences while accounting for stutter

Installation

HipSTR requires a standard c++ compiler, CMake as well as Java version 1.7 or later. To obtain HipSTR and all of its associated submodules, use:

% git clone --recursive https://github.com/tfwillems/HipSTR.git

To build, use Make:

% cd HipSTR
% make

On Mac, before running Make, change the line in vcflib/smithwaterman/Makefile from LDFLAGS=-Wl,-s to LDFLAGS=-Wl

Quick Start

To run HipSTR in its most broadly applicable mode, run it on all samples concurrently using the syntax:

./HipSTR --bams          run1.bam,run2.bam,run3.bam,run4.bam
         --fasta         /data/
         --regions       str_regions.bed
         --stutter-out   stutter_models.txt
         --str-vcf       str_calls.vcf.gz

--bam : a comma-separated list of BAM files generated by any indel-sensitive aligner (e.g. lobSTR or BWA-MEM) and sorted and indexed using samtools
--regions : a BED file containing the coordinates for each STR region of interest
--fasta : the directory containing [FASTA files] (https://en.wikipedia.org/wiki/FASTA_format) for each chromosome in the BED file. In the above usage example, if str_regions.bed contains chr1, chr2, and chr10, the corresponding files would be /data/chr1.fa, /data/chr2.fa and /data/chr10.fa. Alternatively, you can supply the path for a single FASTA file containing all of the relevant sequences

For each region in str_regions.bed, HipSTR will:

Learn locus-specific stutter models and output them to stutter_models.txt
Use the stutter model and haplotype-based alignment algorithm to genotype each individual
Output the resulting STR genotypes to str_calls.vcf.gz, a [bgzipped] (http://www.htslib.org/doc/tabix.html) VCF file. This VCF will contain calls for each sample in any of the BAM files' read groups.

In-depth usage

HipSTR has a variety of usage options designed to accomodate scenarios in which the sequencing data varies in terms of the number of samples and the coverage. Most scenarios will fall into one of the following categories:

200 or more low-coverage (~5x) samples
- Sufficient reads for stutter estimation
- Sufficient reads to detect candidate STR alleles
- Use de novo stutter estimation + STR calling with de novo allele generation
50 or more high-coverage (~30x) samples
- Sufficient reads for stutter estimation
- Sufficient reads to detect candidate STR alleles
- Use de novo stutter estimation + STR calling with de novo allele generation
Handful of low-coverage (~5x) samples
- Insufficient reads for stutter estimation
- Insufficient reads to detect candidate STR alleles
- Use external stutter models + STR calling with a reference panel
Handful of high-coverage (~30x) samples
- Insufficient samples for stutter estimation
- Sufficient reads to detect candidate STR alleles
- Use external stutter models + STR calling with de novo allele generation

1. De novo stutter estimation + STR calling with de novo allele generation

This mode is identical to the one suggested in the Quick Start section as it suits most applications. HipSTR will output the learned stutter models to stutter_models.txt and the STR genotypes in bgzipped VCF format to str_calls.vcf.gz

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-out      stutter_models.txt
         --str-vcf          str_calls.vcf.gz

2. External stutter models + STR calling with de novo allele generation

The sole difference in this mode is that we no longer output stutter models using the --stutter-out option and instead input them from a file using the --stutter-in file. For more details on the stutter model file format, see below. For humans, we've provided a file containing stutter models for each STR locus under PCR or PCR-free conditions at FILL ME IN

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-in       ext_stutter_models.txt
         --str-vcf          str_calls.vcf.gz

3. External stutter models + STR calling with a reference panel

This mode is very similar to mode #2, except that we provide an additional VCF file containing known STR genotypes at each locus using the --str-vcf option. HipSTR will not identify any additional candidate STR alleles in the BAMs when this option is specified, so it's best to use a VCF that contains STR genotypes for a wide range of populations and individuals. For humans, we've provided such a file based on the 1000 Genomes Project at FILL ME IN

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-in       ext_stutter_models.txt
         --ref-vcf          ref_strs.vcf.gz
         --str-vcf          str_calls.vcf.gz

STR imputation

Speed

HipSTR doesn't currently have multi-threaded support, but there are several options available to accelerate analyses:

Analyze each chromosome in parallel using the --chrom option. For example, --chrom chr2 will only genotype the BED regions on chr2
Utilize the length-based genotyper instead of the sequence-based genotyper using the --len-genotyper flag. This will vastly accelerate analyses, but it only considers the sizes of indels in reads independently. As a result, it is incapable of determining the exact sequence of an STR and is susceptible to alignment errors
If you have hundreds of BAM files, we recommend that you merge them into a more manageable number (10-100) using the samtools merge command. Large numbers of BAMs can lead to slow disk IO and poor performance

Call Filtering

Additional Usage Options

Option	Description
--viz-out <aln_viz.html.gz>	Output a bgzipped file containing Needleman-Wunsch alignments for each locus. The resulting file can be readily visualized with VizAln
--haploid-chrs <list_of_chroms>	Comma separated list of chromosomes to treat as haploid By default, all chromosomes are treated as diploid
--no-rmdup	Don't remove PCR duplicates. By default, they'll be removed
--snp-vcf <phased_snps.vcf.gz>	Bgzipped VCF file containing phased SNP genotypes for samples that are being genotyped. These SNPs will be used to physically phase any STRs when a read or its mate pair overlaps a heterozygous site Always use this option if you have available phased SNP genotypes
--bam-samps <list_of_read_groups>	Comma separated list of samples in same order as BAM files. Assign each read to the sample corresponding to its file. By default, each read must have an RG flag and the associated sample is used instead
--bam-lbs <list_of_read_groups>	Comma separated list of libraries in same order as BAM files. Assign each read to the library corresponding to its file. By default, each read must have an RG flag and the associated library is used instead NOTE: This option is required when --bam-samps has been specified

This list is comprised of the most useful and frequently used additional options, but is not all encompassing. For a complete list of options, please type either ./HipSTR or ./HipSTR --help

Alignment Visualization

When deciphering and inspecting STR calls, it's extremely useful to visualize the supporting reads. HipSTR facilitates this through the --viz-out option, which writes a bgzipped file containing alignments for each call that can be readily visualized using the VizAln command included in HipSTR main directory. If you're interested in visualizing alignments, you first need to index the file using tabix. For example, if you ran HipSTR with the option --viz-out alns.html.gz, you should use the command

tabix -p bed aln.html.gz

to generate a [tabix] (http://www.htslib.org/doc/tabix.html) index for the file so that we can rapidly extract alignments for a locus of interest. This command only needs to be run once after the file has been generated.

You could then visualize the calls for sample ERR218433 at locus chr1 51639636 using the command

./VizAln aln.html.gz chr1 51639636 ERR218433

This command will automatically open a rendering of the alignments in your browser and might look something like: The top bar represents the reference sequence and the red text indicates the name of the sample and its associated call at the locus. The remaining rows indicate the alignment for each read used in genotyping. In this particular example, 7 reads have a 4bp deletion, 9 reads have a 8bp deletion and 1 read has a 12bp deletion. The solitary 12bp deletion is likely the result of PCR stutter and HipSTR therefore genotypes this sample as -4 | -8

If we wanted to inspect all calls for the same locus, we could use the command

./VizAln aln.html.gz chr1 51639636

NOTE: Because the --viz-out file can become fairly large if you're genotyping thousands of loci or thousands of samples, in some scenarios it may be best to rerun HipSTR using this option on the subset of loci in which you're interested.

File Formats

BAM files

HipSTR requires BAM files produced by any indel-sensitive aligner. These files must have been sorted by position using the samtools sort command and then indexed using samtools index. To associate a read with its sample of interest, HipSTR uses read group information in the BAM header lines. These @RG lines must contain an ID field, an LB field indicating the library and an SM field indicating the sample. For example, if a BAM contained the following header line

@RG ID:lobSTR;RUN1 LB:ERR044603 SM:HG01914

an alignment with the RG tag

RG:Z:lobSTR;RUN1

will be associated with sample HG01914 and library ERR044603. In this manner, HipSTR can analyze BAMs containing more than one sample and/or more than one library and can handle BAMs in which a single sample's reads are spread across multiple files.

Alternatively, if your BAM files lack RG information, you can use the --bam-samps and -bam-lbs flags to specify the sample and library associated with each BAM. In this setting, however, a BAM can only contain a single library and a single read group. For example, the command

./HipSTR --bams             run1.bam,run2.bam,run3.bam,run4.bam
         --fasta            /data/
         --regions          str_regions.bed
         --stutter-out      stutter_models.txt
         --str-vcf          str_calls.vcf.gz
         --bam-samps        SAMPLE1,SAMPLE1,SAMPLE2,SAMPLE3
         --bam-lbs          LIB1,LIB2,LIB3,LIB4

essentially tells HipSTR to associate all the reads in the first two BAMS with SAMPLE1, all the reads in the third file with SAMPLE2 and all the reads in the last BAM with SAMPLE3.

STR region BED file

The BED file containing each STR region of interest is a tab-delimited file comprised of 5 required columns and one optional column:

The name of the chromosome on which the STR is located
The start position of the STR on its chromosome
The end position of the STR on its chromosome
The motif length (i.e. the number of bases in the repeat unit)
The number of copies of the repeat unit in the reference allele

The 6th column is optional and contains the name of the STR locus, which will be written to the ID column in the VCF. Below is an example file which contains 5 STR loci

NOTE: The table header is for descriptive purposes. The BED file should not have a header

CHROM	START	END	MOTIF_LEN	NUM_COPIES	NAME
chr1	13784267	13784306	4	10	GATA27E01
chr1	18789523	18789555	3	11	ATA008
chr2	32079410	32079469	4	15	AGAT117
chr17	38994441	38994492	4	12	GATA25A04
chr17	5529940	55299992	4	13	AAT245

For humans, we've provided a BED file containing STR loci for hg19 at FILL ME IN

For other model organisms, we recommend that you

Use Tandem Repeats Finder or other tools to scan the reference genome for STRs
Reformat the resulting output to conform with the format outlined above

VCF file

For more information on the VCF file format, please see the VCF spec. For filtering and parsing VCFs, we recommend the fantastic python package PyVCF

INFO fields

INFO fields contains statistics about each genotyped locus in the VCF. The INFO fields reported by HipSTR primarily describe the learned/supplied stutter model for the locus and its reference coordinates and sequence characteristics.

FIELD	DESCRIPTION
INFRAME_PGEOM	Parameter for in-frame geometric step size distribution
INFRAME_UP	Probability that stutter causes an in-frame increase in obs. STR size
INFRAME_DOWN	Probability that stutter causes an in-frame decrease in obs. STR size
OUTFRAME_PGEOM	Parameter for out-of-frame geometric step size distribution
OUTFRAME_UP	Probability that stutter causes an out-of-frame increase in obs. STR size
OUTFRAME_DOWN	Probability that stutter causes an out-of-frame decrease in obs. STR size
BPDIFFS	Base pair difference of each alternate allele from the reference allele
START	Inclusive start coodinate for the repetitive portion of the reference allele
END	Inclusive end coordinate for the repetitive portion of the reference allele
PERIOD	Length of STR motif
AC	Alternate allele counts
NSKIP	Number of samples not genotyped due to various issues

FORMAT fields

FORMAT fields contain information about the genotype for each sample at the locus. In addition to the most probable phased genotype (GT), HipSTR reports information about the posterior likelihood of this genotype (PQ) and its unphased analog (Q). HipSTR also reports the expected base pair difference from the reference for the genotype by marginalizing over all genotype probabilities (BPDOSE), a value that may be useful in association studies.

FIELD	DESCRIPTION
GT	Genotype
GB	Base pair differences of genotype from reference
Q	Posterior probability of unphased genotype
PQ	Posterior probability of phased genotype
DP	Read depth
DSNP	Number of reads with SNP phasing information
PDP	Fractional reads supporting each haploid genotype
BPDOSE	Posterior mean base pair difference from reference
ALLREADS	Base pair difference observed in each read
PALLREADS	Expected base pair diff in each read based on haplotype alignment probs
GL	log-10 genotype likelihoods
PL	Phred-scaled genotype likelihoods

Stutter model

To model PCR stutter artifacts, we assume that there are three types of stutter artifacts:

In-frame changes: Change the size of the STR in the read by multiples of the repeat unit. For instance, if the repeat motif is AGAT, in-frame changes could lead to differences of -8, -4, 4, 8, and so on.
Out-of-frame changes: Change the size of the STR by non-multiples of the repeat unit. For instance, if the repeat motif is AGAT, out-of-frame changes could lead to differences of -3, -2, -1, 1, 2, 3 and so on.
No stutter change: The size of the STR in the read is the same as the size of the underlying STR.

Stutter model files contain the information necessary to model each of these artifacts in a tab-delimited BED-like format with exactly 9 columns. An example of such a file is as follows:

CHROM	START	END	IGEOM	IDOWN	IUP	OGEOM	ODOWN	OUP
chr1	13784267	13784306	0.95	0.05	0.01	0.9	0.01	0.001
chr1	18789523	18789555	0.8	0.01	0.05	0.9	0.001	0.001
chr2	32079410	32079469	0.9	0.01	0.01	0.9	0.001	0.001
chr17	38994441	38994492	0.9	0.001	0.001	0.9	0.001	0.001
chr17	5529940	55299992	0.95	0.01	0.01	0.9	0.001	0.001

NOTE: The table header is for descriptive purposes. The stutter file should not have a header

Each of the stutter parameters is defined as follows:

VARIABLE	DESCRIPTION
IDOWN	Probability that in-frame changes decrease the size of the observed STR allele
IUP	Probability that in-frame changes increase the size of the observed STR allele
ODOWN	Probability that out-of-frame changes decrease the size of the observed STR allele
OUP	Probability that out-of-frame changes increase the size of the observed STR allele
IGEOM	Parameter governing geometric step size distribution for in-frame changes
OGEOM	Paramter governing geometric step size distribution for out-of-frame changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

HipSTR

Author: Thomas Willems twillems@mit.edu

License: GNU v2

Introduction

Installation

Quick Start

In-depth usage

1. De novo stutter estimation + STR calling with de novo allele generation

2. External stutter models + STR calling with de novo allele generation

3. External stutter models + STR calling with a reference panel

STR imputation

Speed

Call Filtering

Additional Usage Options

Alignment Visualization

File Formats

BAM files

STR region BED file

VCF file

INFO fields

FORMAT fields

Stutter model

Files

README.md

Latest commit

History

README.md

File metadata and controls

HipSTR

Author: Thomas Willems twillems@mit.edu

License: GNU v2

Introduction

Installation

Quick Start

In-depth usage

1. De novo stutter estimation + STR calling with de novo allele generation

2. External stutter models + STR calling with de novo allele generation

3. External stutter models + STR calling with a reference panel

STR imputation

Speed

Call Filtering

Additional Usage Options

Alignment Visualization

File Formats

BAM files

STR region BED file

VCF file

INFO fields

FORMAT fields

Stutter model