Haplotype-based imputation, phasing and genotyping of STRs
Author: Thomas Willems twillems@mit.edu
Short tandem repeats (STRs) are highly repetitive genomic sequences comprised of repeated copies of an underlying motif. Prevalent in most organisms' genomes, STRs are of particular interest because they mutate much more rapidly than most other genomic elements. As a result, they're extremely informative for genomic identification, ancestry inference and genealogy.
Despite their utility, STRs are particularly difficult to genotype. The repetitive sequence responsible for their high mutability also results in frequent alignment errors that can complicate and bias downstream analyses. In addition, PCR stutter errors often result in reads that contain additional or fewer repeat copies than the true underlying genotype.
HipSTR was specifically developed to deal with these errors in the hopes of obtaining more robust STR genotypes. In particular, it accomplishes this by:
- Learning locus-specific PCR stutter models using an [EM algorithm] (http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
- Mining candidate STR alleles from population-scale sequencing data
- Utilizing phased SNP haplotypes to genotype, phase and/or impute STRs
- Employing a specialized hidden Markov model to align reads to candidate sequences while accounting for stutter
HipSTR requires a standard c++ compiler, CMake as well as Java version 1.7 or later. To obtain HipSTR and all of its associated submodules, use:
% git clone --recursive https://github.com/tfwillems/HipSTR.git
To build, use Make:
% cd HipSTR
% make
On Mac, before running Make, change the line in vcflib/smithwaterman/Makefile from LDFLAGS=-Wl,-s
to LDFLAGS=-Wl
To run HipSTR in its most broadly applicable mode, run it on all samples concurrently using the syntax:
./HipSTR --bams run1.bam,run2.bam,run3.bam,run4.bam
--fasta /data/
--regions str_regions.bed
--stutter-out stutter_models.txt
--str-vcf str_calls.vcf.gz
- --bam : a comma-separated list of BAM files generated by any indel-sensitive aligner (e.g. lobSTR or BWA-MEM) and sorted and indexed using samtools
- --regions : a BED file containing the coordinates for each STR region of interest
- --fasta : the directory containing [FASTA files] (https://en.wikipedia.org/wiki/FASTA_format) for each chromosome in the BED file. In the above usage example, if str_regions.bed contains chr1, chr2, and chr10, the corresponding files would be /data/chr1.fa, /data/chr2.fa and /data/chr10.fa. Alternatively, you can supply the path for a single FASTA file containing all of the relevant sequences
For each region in str_regions.bed, HipSTR will:
- Learn locus-specific stutter models and output them to stutter_models.txt
- Use the stutter model and haplotype-based alignment algorithm to genotype each individual
- Output the resulting STR genotypes to str_calls.vcf.gz, a [bgzipped] (http://www.htslib.org/doc/tabix.html) VCF file. This VCF will contain calls for each sample in any of the BAM files' read groups.
HipSTR has a variety of usage options designed to accomodate scenarios in which the sequencing data varies in terms of the number of samples and the coverage. Most scenarios will fall into one of the following categories:
- 200 or more low-coverage (~5x) samples
- Sufficient reads for stutter estimation
- Sufficient reads to detect candidate STR alleles
- Use de novo stutter estimation + STR calling with de novo allele generation
- 50 or more high-coverage (~30x) samples
- Sufficient reads for stutter estimation
- Sufficient reads to detect candidate STR alleles
- Use de novo stutter estimation + STR calling with de novo allele generation
- Handful of low-coverage (~5x) samples
- Insufficient reads for stutter estimation
- Insufficient reads to detect candidate STR alleles
- Use external stutter models + STR calling with a reference panel
- Handful of high-coverage (~30x) samples
- Insufficient samples for stutter estimation
- Sufficient reads to detect candidate STR alleles
- Use external stutter models + STR calling with de novo allele generation
This mode is identical to the one suggested in the Quick Start section as it suits most applications. HipSTR will output the learned stutter models to stutter_models.txt and the STR genotypes in bgzipped VCF format to str_calls.vcf.gz
./HipSTR --bams run1.bam,run2.bam,run3.bam,run4.bam
--fasta /data/
--regions str_regions.bed
--stutter-out stutter_models.txt
--str-vcf str_calls.vcf.gz
The sole difference in this mode is that we no longer output stutter models using the --stutter-out option and instead input them from a file using the --stutter-in file. For more details on the stutter model file format, see below. For humans, we've provided a file containing stutter models for each STR locus under PCR or PCR-free conditions at FILL ME IN
./HipSTR --bams run1.bam,run2.bam,run3.bam,run4.bam
--fasta /data/
--regions str_regions.bed
--stutter-in ext_stutter_models.txt
--str-vcf str_calls.vcf.gz
This mode is very similar to mode #2, except that we provide an additional VCF file containing known STR genotypes at each locus using the --str-vcf option. HipSTR will not identify any additional candidate STR alleles in the BAMs when this option is specified, so it's best to use a VCF that contains STR genotypes for a wide range of populations and individuals. For humans, we've provided such a file based on the 1000 Genomes Project at FILL ME IN
./HipSTR --bams run1.bam,run2.bam,run3.bam,run4.bam
--fasta /data/
--regions str_regions.bed
--stutter-in ext_stutter_models.txt
--ref-vcf ref_strs.vcf.gz
--str-vcf str_calls.vcf.gz
HipSTR doesn't currently have multi-threaded support, but there are several options available to accelerate analyses:
- Analyze each chromosome in parallel using the --chrom option. For example, --chrom chr2 will only genotype the BED regions on chr2
- Utilize the length-based genotyper instead of the sequence-based genotyper using the --len-genotyper flag. This will vastly accelerate analyses, but it only considers the sizes of indels in reads independently. As a result, it is incapable of determining the exact sequence of an STR and is susceptible to alignment errors
- If you have hundreds of BAM files, we recommend that you merge them into a more manageable number (10-100) using the
samtools merge
command. Large numbers of BAMs can lead to slow disk IO and poor performance
Option | Description |
---|---|
--viz-out <aln_viz.html.gz> | Output a bgzipped file containing Needleman-Wunsch alignments for each locus. The resulting file can be readily visualized with VizAln |
--haploid-chrs <list_of_chroms> | Comma separated list of chromosomes to treat as haploid By default, all chromosomes are treated as diploid |
--no-rmdup | Don't remove PCR duplicates. By default, they'll be removed |
--snp-vcf <phased_snps.vcf.gz> | Bgzipped VCF file containing phased SNP genotypes for samples that are being genotyped. These SNPs will be used to physically phase any STRs when a read or its mate pair overlaps a heterozygous site Always use this option if you have available phased SNP genotypes |
--bam-samps <list_of_read_groups> | Comma separated list of samples in same order as BAM files. Assign each read to the sample corresponding to its file. By default, each read must have an RG flag and the associated sample is used instead |
--bam-lbs <list_of_read_groups> | Comma separated list of libraries in same order as BAM files. Assign each read to the library corresponding to its file. By default, each read must have an RG flag and the associated library is used instead NOTE: This option is required when --bam-samps has been specified |
This list is comprised of the most useful and frequently used additional options, but is not all encompassing. For a complete list of options, please type either ./HipSTR
or ./HipSTR --help
When deciphering and inspecting STR calls, it's extremely useful to visualize the supporting reads. HipSTR facilitates this through the --viz-out option, which writes a bgzipped file containing alignments for each call that can be readily visualized using the VizAln command included in HipSTR main directory. If you're interested in visualizing alignments, you first need to index the file using tabix.
For example, if you ran HipSTR with the option --viz-out alns.html.gz
, you should use the command
tabix -p bed aln.html.gz
to generate a [tabix] (http://www.htslib.org/doc/tabix.html) index for the file so that we can rapidly extract alignments for a locus of interest. This command only needs to be run once after the file has been generated.
You could then visualize the calls for sample ERR218433 at locus chr1 51639636 using the command
./VizAln aln.html.gz chr1 51639636 ERR218433
This command will automatically open a rendering of the alignments in your browser and might look something like: The top bar represents the reference sequence and the red text indicates the name of the sample and its associated call at the locus. The remaining rows indicate the alignment for each read used in genotyping. In this particular example, 7 reads have a 4bp deletion, 9 reads have a 8bp deletion and 1 read has a 12bp deletion. The solitary 12bp deletion is likely the result of PCR stutter and HipSTR therefore genotypes this sample as -4 | -8
If we wanted to inspect all calls for the same locus, we could use the command
./VizAln aln.html.gz chr1 51639636
NOTE: Because the --viz-out file can become fairly large if you're genotyping thousands of loci or thousands of samples, in some scenarios it may be best to rerun HipSTR using this option on the subset of loci in which you're interested.
HipSTR requires BAM files produced by any indel-sensitive aligner. These files must have been sorted by position using the samtools sort
command and then indexed using samtools index
. To associate a read with its sample of interest, HipSTR uses read group information in the BAM header lines. These @RG lines must contain an ID field, an LB field indicating the library and an SM field indicating the sample. For example, if a BAM contained the following header line
@RG ID:lobSTR;RUN1 LB:ERR044603 SM:HG01914
an alignment with the RG tag
RG:Z:lobSTR;RUN1
will be associated with sample HG01914 and library ERR044603. In this manner, HipSTR can analyze BAMs containing more than one sample and/or more than one library and can handle BAMs in which a single sample's reads are spread across multiple files.
Alternatively, if your BAM files lack RG information, you can use the --bam-samps and -bam-lbs flags to specify the sample and library associated with each BAM. In this setting, however, a BAM can only contain a single library and a single read group. For example, the command
./HipSTR --bams run1.bam,run2.bam,run3.bam,run4.bam
--fasta /data/
--regions str_regions.bed
--stutter-out stutter_models.txt
--str-vcf str_calls.vcf.gz
--bam-samps SAMPLE1,SAMPLE1,SAMPLE2,SAMPLE3
--bam-lbs LIB1,LIB2,LIB3,LIB4
essentially tells HipSTR to associate all the reads in the first two BAMS with SAMPLE1, all the reads in the third file with SAMPLE2 and all the reads in the last BAM with SAMPLE3.
The BED file containing each STR region of interest is a tab-delimited file comprised of 5 required columns and one optional column:
- The name of the chromosome on which the STR is located
- The start position of the STR on its chromosome
- The end position of the STR on its chromosome
- The motif length (i.e. the number of bases in the repeat unit)
- The number of copies of the repeat unit in the reference allele
The 6th column is optional and contains the name of the STR locus, which will be written to the ID column in the VCF. Below is an example file which contains 5 STR loci
NOTE: The table header is for descriptive purposes. The BED file should not have a header
CHROM | START | END | MOTIF_LEN | NUM_COPIES | NAME |
---|---|---|---|---|---|
chr1 | 13784267 | 13784306 | 4 | 10 | GATA27E01 |
chr1 | 18789523 | 18789555 | 3 | 11 | ATA008 |
chr2 | 32079410 | 32079469 | 4 | 15 | AGAT117 |
chr17 | 38994441 | 38994492 | 4 | 12 | GATA25A04 |
chr17 | 5529940 | 55299992 | 4 | 13 | AAT245 |
For humans, we've provided a BED file containing STR loci for hg19 at FILL ME IN
For other model organisms, we recommend that you
- Use Tandem Repeats Finder or other tools to scan the reference genome for STRs
- Reformat the resulting output to conform with the format outlined above
For more information on the VCF file format, please see the VCF spec. For filtering and parsing VCFs, we recommend the fantastic python package PyVCF
INFO fields contains statistics about each genotyped locus in the VCF. The INFO fields reported by HipSTR primarily describe the learned/supplied stutter model for the locus and its reference coordinates and sequence characteristics.
FIELD | DESCRIPTION |
---|---|
INFRAME_PGEOM | Parameter for in-frame geometric step size distribution |
INFRAME_UP | Probability that stutter causes an in-frame increase in obs. STR size |
INFRAME_DOWN | Probability that stutter causes an in-frame decrease in obs. STR size |
OUTFRAME_PGEOM | Parameter for out-of-frame geometric step size distribution |
OUTFRAME_UP | Probability that stutter causes an out-of-frame increase in obs. STR size |
OUTFRAME_DOWN | Probability that stutter causes an out-of-frame decrease in obs. STR size |
BPDIFFS | Base pair difference of each alternate allele from the reference allele |
START | Inclusive start coodinate for the repetitive portion of the reference allele |
END | Inclusive end coordinate for the repetitive portion of the reference allele |
PERIOD | Length of STR motif |
AC | Alternate allele counts |
NSKIP | Number of samples not genotyped due to various issues |
FORMAT fields contain information about the genotype for each sample at the locus. In addition to the most probable phased genotype (GT), HipSTR reports information about the posterior likelihood of this genotype (PQ) and its unphased analog (Q). HipSTR also reports the expected base pair difference from the reference for the genotype by marginalizing over all genotype probabilities (BPDOSE), a value that may be useful in association studies.
FIELD | DESCRIPTION |
---|---|
GT | Genotype |
GB | Base pair differences of genotype from reference |
Q | Posterior probability of unphased genotype |
PQ | Posterior probability of phased genotype |
DP | Read depth |
DSNP | Number of reads with SNP phasing information |
PDP | Fractional reads supporting each haploid genotype |
BPDOSE | Posterior mean base pair difference from reference |
ALLREADS | Base pair difference observed in each read |
PALLREADS | Expected base pair diff in each read based on haplotype alignment probs |
GL | log-10 genotype likelihoods |
PL | Phred-scaled genotype likelihoods |
To model PCR stutter artifacts, we assume that there are three types of stutter artifacts:
- In-frame changes: Change the size of the STR in the read by multiples of the repeat unit. For instance, if the repeat motif is AGAT, in-frame changes could lead to differences of -8, -4, 4, 8, and so on.
- Out-of-frame changes: Change the size of the STR by non-multiples of the repeat unit. For instance, if the repeat motif is AGAT, out-of-frame changes could lead to differences of -3, -2, -1, 1, 2, 3 and so on.
- No stutter change: The size of the STR in the read is the same as the size of the underlying STR.
Stutter model files contain the information necessary to model each of these artifacts in a tab-delimited BED-like format with exactly 9 columns. An example of such a file is as follows:
CHROM | START | END | IGEOM | IDOWN | IUP | OGEOM | ODOWN | OUP |
---|---|---|---|---|---|---|---|---|
chr1 | 13784267 | 13784306 | 0.95 | 0.05 | 0.01 | 0.9 | 0.01 | 0.001 |
chr1 | 18789523 | 18789555 | 0.8 | 0.01 | 0.05 | 0.9 | 0.001 | 0.001 |
chr2 | 32079410 | 32079469 | 0.9 | 0.01 | 0.01 | 0.9 | 0.001 | 0.001 |
chr17 | 38994441 | 38994492 | 0.9 | 0.001 | 0.001 | 0.9 | 0.001 | 0.001 |
chr17 | 5529940 | 55299992 | 0.95 | 0.01 | 0.01 | 0.9 | 0.001 | 0.001 |
NOTE: The table header is for descriptive purposes. The stutter file should not have a header
Each of the stutter parameters is defined as follows:
VARIABLE | DESCRIPTION |
---|---|
IDOWN | Probability that in-frame changes decrease the size of the observed STR allele |
IUP | Probability that in-frame changes increase the size of the observed STR allele |
ODOWN | Probability that out-of-frame changes decrease the size of the observed STR allele |
OUP | Probability that out-of-frame changes increase the size of the observed STR allele |
IGEOM | Parameter governing geometric step size distribution for in-frame changes |
OGEOM | Paramter governing geometric step size distribution for out-of-frame changes |