DOCUMENTATION

NAME
    bam2mpg - Most probable genotype program for predicting variants and
    genotypes from alignments of short reads in BAM format.

SYNOPSIS
    bam2mpg *ref.fasta* *aln.sorted.bam*

DESCRIPTION
    This script uses samtools to process a BAM formatted file
    (http://samtools.sourceforge.net) and call genotypes and confidence
    scores across a covered region.

    For a set of aligned allele observations, the MPG ("Most Probable
    Genotype") algorithm is used to calculate the posterior probability of
    every possible diploid genotype (or single-allele genotypes for regions
    specified with the --single_copy option, e.g., on the non-PAR regions of
    the X and Y chromosome in a male). The statistical model uses base
    quality scores to calculate the probability of base-calling errors, and
    assumes a single prior probability for any non-homozygous-reference
    genotype.

MPG INPUT
    The first argument to bam2mpg is the path of a fasta-formatted file for
    the reference sequence. This fasta file must have a corresponding
    samtools index file with the same name except for an appended ".fai".

    The second argument to bam2mpg is the path of a BAM-formatted file of
    aligned sequencing reads. This file must be sorted and indexed using
    samtools prior to running bam2mpg.

MPG OUTPUT
    If no vcf file is specified (either by --snv_vcf or --div_vcf options,
    see below), and no mpg file is specified (using the --mpg option, see
    below), the standard output of the program will contain "MPG" lines with
    nine tab-separated fields. These fields are:

    variant type
         The variant type can be "MPG_SNV", which indicates a single base
         change at the position specified by the second and third fields, or
         "MPG_DIV" which indicates a deletion or insertion occurring between
         the "flanking" positions separated by a colon in the third field.

    chromosome
         This is the name of the entry in the fasta reference sequence
         passed as the first argument (and of the matching reference entry
         in the BAM file).

    position
         For an SNV, the position reported is the actual position of the
         nucleotide change. For DIV's, this field contains a colon-separated
         pair of positions, which represent the flanking positions
         surrounding the largest variable region in the sequence. So, for
         example, in a variable-length run of T's, the flanking positions
         would be the positions of the non-T characters outside of the run,
         and the alleles reported in the fourth and fifth fields would be
         the T's between these flanking positions.

    reference allele
         This is the base or bases seen in the reference sequence either at
         the specified position (for an SNV) or between the reported
         flanking positions (for a DIV). When the flanking positions are
         adjacent, so there are no bases between them, a "*" is reported to
         enable splitting on white space rather than tabs.

    genotype
         The genotype reported is the genotype with the highest posterior
         probability according to Bayes theorem, given the observed reads
         and quality scores, according to the program's error model. For
         SNV's, the two alleles are concatenated, so, for example "AT"
         indicates one A and one T. For DIV's, the two alleles are separated
         by a colon, with a "*" indicating an allele of zero bases.

         When the --single_copy option is used, single allele genotypes are
         reported, so no colon-separation is used in the DIV. This lack of a
         colon in MPG_DIV genotypes, as well as a single-character genotype
         at MPG_SNV positions, is what distinguishes "single_copy" output.

    MPG score
         The MPG score field contains the difference between the natural
         logarithms of the most probable and second most probable genotype's
         probabilities, and is therefore an indicator of the probability the
         reported genotype is correct. So, for example, a score of 10 would
         imply that the reported genotype was approximately 22,000 times as
         probable as the next most probable genotype. Since bam2mpg will
         call genotypes at any base position, variant or not, we recommend
         using a score cutoff of 10 to avoid a high level of false positive
         predictions of variation.

    ref/non-ref field
         This field is 0 for a homozygous-reference genotype, and 1 for
         anything else, allowing easy extraction of non-reference genotype
         lines with "awk".

    coverage
         This field reports the number of reads used to calculate the most
         probable genotype. It does not include bases that have been
         filtered for quality (with --qual_filter) or reads beyond the 200
         maximum reads the program allows.

    MPV score
         The MPV score reports the difference between the natural logarithms
         of the most probable and the homozygous reference genotype, and is
         therefore an indicator of the probability that the position's true
         genotype is something other than homozygous reference (although not
         a phred-scaled indicator).

MPG OPTIONS
    --region *chr*
    --region *chr:start-end*
         This option specifies a region as a reference entry optionally
         followed by a position range, and causes variants to be called only
         in that region.

    --qual_filter *minimum_quality*
         This option specifies a minimum base quality score necessary for a
         base to be included in the calculation for a particular aligned
         position. Bases with quality scores below this value will be
         completely ignored. At GTB, bam2mpg is almost always run with
         --qual_filter 20. (Default: 0)

    --single_copy *chr1:start1-end1,chr2:start2-end2*
         This option specifies regions for which only a single copy exists
         in the genome, so that only one allele is expected to be seen. The
         regions should be comma-separated without spaces, and in the same
         format as expected by the --region option.

    --indels
         This flag option causes the script to skip SNV predictions and only
         report DIV variants.

    --no_indels
         This flag option causes the script to skip DIV predictions and only
         report SNV variants.

    --old_indels
         This flag option causes the script to use averaged quality scores
         across indels rather than inflated scores which are increased
         proportional to the length of the indel. NOTE: This option is
         currently inactive as "old_indels" has become the default until I
         investigate the new indel calling more carefully.

    --only_nonref
         This flag option causes the script to only print lines that predict
         genotypes that are non homozygous reference.

    --align_bias *bias_value*
         This option specifies an additional expected percentage of aligned
         bases that are expected to be the reference allele due to bias in
         the alignment favoring the reference base. For example, if the
         alignment bias has value .05, mpg will expect a GT heterozygous
         position with reference base "G" to have roughly 55% G's aligned at
         that position, and 45% T's. It can also be used to tilt the
         expected percentages due to included probe sequence, which will
         always be reference, but in the long run it would be better to have
         a position-dependent alignment bias that only changed these
         expected values where the probes are located. (Default: 0)

    --bam_filter *'bam filter options'*
         Specifies filters to apply to the "samtools view" command, to limit
         the SAM alignments that are processed by bam2mpg. Potential
         filtering options include the mapping quality (-qNN), "proper pair"
         flag (-f2), and duplicate flag (-F1024). The default is to exclude
         unmapped reads (-F4); if you supply your own filter, be sure to
         include this value in the "-F" flag, e.g. by adding 4+512+1024=1540
         to exclude unmapped reads, QC failures, and duplicate reads. It is
         advisable to always use quotes around the options, and is required
         when there are spaces between options, e.g. '-q20 -F 1540'.

    --ds_coverage *min_bases*
         This option specifies a minimum number of bases that must be seen
         on each strand for that base's counts to be included in the
         probability calculation. For example, if -ds_coverage is specified
         as 1, and an aligned "T" is observed multiple times on the forward
         strand, but never on the reverse strand, no T's will be included in
         the calculation because T was not seen at least once on the reverse
         strand. This option is dangerous in that it can artificially
         amplify scores by eliminating errors, so its use is discouraged.

    --use_seed *seed value*
         This option sets the random seed to the specified value at the
         beginning of the run. Perl's rand function is used for choosing a
         subset of reads at a position when the depth of coverage exceeds
         200.

    --nocache
         This flag option prevents bam2mpg from caching the SNV genotype
         calls it sees at each site. This caching is meant to speed up
         run-time when genotypes are being called genome-wide, but because
         the caching doesn't consider exact base qualities, it can lead to
         slightly inaccurate scoring (but generally not actual genotype
         errors, especially when --qual_filter is used).

    --sam_filter *"unix command"*
         The specified command is applied as a filter between the "samtools
         view" and "samtools pileup" commands. This means that your command
         should expect SAM format, and should print SAM format lines for
         alignments that pass your filter criteria. Default is no additional
         filtering (but see "--bam_filter"). If your filter can be
         accomplished with options to "samtools view", using "--bam_filter"
         will be more efficient.

    --pileup_filter *"pileup options"*
         The specified options are appended to the call to "samtools
         mpileup". By default, -B is applied, since it looks as if BAQ
         quality scores hurt sensitivity. Anything specified with the
         --pileup_filter option will supersede the -B option (i.e., -B will
         not be applied if --pileup_filter is used, and -B is not included
         in the option.

    --snv_vcf *"path of SNV VCF file to write"*
         This option specifies the path of a VCF file to be written
         containing MPG genotypes in VCF format. If the filename ends in
         ".gz" or ".bgz", the program will attempt to pipe the VCF string to
         "bzip2" before printing.

    --div_vcf *"path of DIV VCF file to write"*
         This option specifies the path of a VCF file to be written
         containing MPG genotypes in VCF format. If the filename ends in
         ".gz" or ".bgz", the program will attempt to pipe the VCF string to
         "bzip2" before printing.

    --mpg *"path of MPG-formatted output file to write"*
         This option specifies the path of an MPG file to be written
         containing genotypes in MPG format. If the filename ends in ".gz",
         the program will attempt to pipe the VCF string to "gzip" before
         printing.

AUTHOR
     Nancy F. Hansen - nhansen@mail.nih.gov

LEGAL
    This software/database is "United States Government Work" under the
    terms of the United States Copyright Act. It was written as part of the
    authors' official duties for the United States Government and thus
    cannot be copyrighted. This software/database is freely available to the
    public for use without a copyright notice. Restrictions cannot be placed
    on its present or future use.

    Although all reasonable efforts have been taken to ensure the accuracy
    and reliability of the software and data, the National Human Genome
    Research Institute (NHGRI) and the U.S. Government does not and cannot
    warrant the performance or results that may be obtained by using this
    software or data. NHGRI and the U.S. Government disclaims all warranties
    as to performance, merchantability or fitness for any particular
    purpose.

    In any work or product derived from this material, proper attribution of
    the authors as the source of the software or data should be made, using
    "NHGRI Genome Technology Branch" as the citation.