-
Notifications
You must be signed in to change notification settings - Fork 1
/
DOCUMENTATION
250 lines (212 loc) · 12.5 KB
/
DOCUMENTATION
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
NAME
bam2mpg - Most probable genotype program for predicting variants and
genotypes from alignments of short reads in BAM format.
SYNOPSIS
bam2mpg *ref.fasta* *aln.sorted.bam*
DESCRIPTION
This script uses samtools to process a BAM formatted file
(http://samtools.sourceforge.net) and call genotypes and confidence
scores across a covered region.
For a set of aligned allele observations, the MPG ("Most Probable
Genotype") algorithm is used to calculate the posterior probability of
every possible diploid genotype (or single-allele genotypes for regions
specified with the --single_copy option, e.g., on the non-PAR regions of
the X and Y chromosome in a male). The statistical model uses base
quality scores to calculate the probability of base-calling errors, and
assumes a single prior probability for any non-homozygous-reference
genotype.
MPG INPUT
The first argument to bam2mpg is the path of a fasta-formatted file for
the reference sequence. This fasta file must have a corresponding
samtools index file with the same name except for an appended ".fai".
The second argument to bam2mpg is the path of a BAM-formatted file of
aligned sequencing reads. This file must be sorted and indexed using
samtools prior to running bam2mpg.
MPG OUTPUT
If no vcf file is specified (either by --snv_vcf or --div_vcf options,
see below), and no mpg file is specified (using the --mpg option, see
below), the standard output of the program will contain "MPG" lines with
nine tab-separated fields. These fields are:
variant type
The variant type can be "MPG_SNV", which indicates a single base
change at the position specified by the second and third fields, or
"MPG_DIV" which indicates a deletion or insertion occurring between
the "flanking" positions separated by a colon in the third field.
chromosome
This is the name of the entry in the fasta reference sequence
passed as the first argument (and of the matching reference entry
in the BAM file).
position
For an SNV, the position reported is the actual position of the
nucleotide change. For DIV's, this field contains a colon-separated
pair of positions, which represent the flanking positions
surrounding the largest variable region in the sequence. So, for
example, in a variable-length run of T's, the flanking positions
would be the positions of the non-T characters outside of the run,
and the alleles reported in the fourth and fifth fields would be
the T's between these flanking positions.
reference allele
This is the base or bases seen in the reference sequence either at
the specified position (for an SNV) or between the reported
flanking positions (for a DIV). When the flanking positions are
adjacent, so there are no bases between them, a "*" is reported to
enable splitting on white space rather than tabs.
genotype
The genotype reported is the genotype with the highest posterior
probability according to Bayes theorem, given the observed reads
and quality scores, according to the program's error model. For
SNV's, the two alleles are concatenated, so, for example "AT"
indicates one A and one T. For DIV's, the two alleles are separated
by a colon, with a "*" indicating an allele of zero bases.
When the --single_copy option is used, single allele genotypes are
reported, so no colon-separation is used in the DIV. This lack of a
colon in MPG_DIV genotypes, as well as a single-character genotype
at MPG_SNV positions, is what distinguishes "single_copy" output.
MPG score
The MPG score field contains the difference between the natural
logarithms of the most probable and second most probable genotype's
probabilities, and is therefore an indicator of the probability the
reported genotype is correct. So, for example, a score of 10 would
imply that the reported genotype was approximately 22,000 times as
probable as the next most probable genotype. Since bam2mpg will
call genotypes at any base position, variant or not, we recommend
using a score cutoff of 10 to avoid a high level of false positive
predictions of variation.
ref/non-ref field
This field is 0 for a homozygous-reference genotype, and 1 for
anything else, allowing easy extraction of non-reference genotype
lines with "awk".
coverage
This field reports the number of reads used to calculate the most
probable genotype. It does not include bases that have been
filtered for quality (with --qual_filter) or reads beyond the 200
maximum reads the program allows.
MPV score
The MPV score reports the difference between the natural logarithms
of the most probable and the homozygous reference genotype, and is
therefore an indicator of the probability that the position's true
genotype is something other than homozygous reference (although not
a phred-scaled indicator).
MPG OPTIONS
--region *chr*
--region *chr:start-end*
This option specifies a region as a reference entry optionally
followed by a position range, and causes variants to be called only
in that region.
--qual_filter *minimum_quality*
This option specifies a minimum base quality score necessary for a
base to be included in the calculation for a particular aligned
position. Bases with quality scores below this value will be
completely ignored. At GTB, bam2mpg is almost always run with
--qual_filter 20. (Default: 0)
--single_copy *chr1:start1-end1,chr2:start2-end2*
This option specifies regions for which only a single copy exists
in the genome, so that only one allele is expected to be seen. The
regions should be comma-separated without spaces, and in the same
format as expected by the --region option.
--indels
This flag option causes the script to skip SNV predictions and only
report DIV variants.
--no_indels
This flag option causes the script to skip DIV predictions and only
report SNV variants.
--old_indels
This flag option causes the script to use averaged quality scores
across indels rather than inflated scores which are increased
proportional to the length of the indel. NOTE: This option is
currently inactive as "old_indels" has become the default until I
investigate the new indel calling more carefully.
--only_nonref
This flag option causes the script to only print lines that predict
genotypes that are non homozygous reference.
--align_bias *bias_value*
This option specifies an additional expected percentage of aligned
bases that are expected to be the reference allele due to bias in
the alignment favoring the reference base. For example, if the
alignment bias has value .05, mpg will expect a GT heterozygous
position with reference base "G" to have roughly 55% G's aligned at
that position, and 45% T's. It can also be used to tilt the
expected percentages due to included probe sequence, which will
always be reference, but in the long run it would be better to have
a position-dependent alignment bias that only changed these
expected values where the probes are located. (Default: 0)
--bam_filter *'bam filter options'*
Specifies filters to apply to the "samtools view" command, to limit
the SAM alignments that are processed by bam2mpg. Potential
filtering options include the mapping quality (-qNN), "proper pair"
flag (-f2), and duplicate flag (-F1024). The default is to exclude
unmapped reads (-F4); if you supply your own filter, be sure to
include this value in the "-F" flag, e.g. by adding 4+512+1024=1540
to exclude unmapped reads, QC failures, and duplicate reads. It is
advisable to always use quotes around the options, and is required
when there are spaces between options, e.g. '-q20 -F 1540'.
--ds_coverage *min_bases*
This option specifies a minimum number of bases that must be seen
on each strand for that base's counts to be included in the
probability calculation. For example, if -ds_coverage is specified
as 1, and an aligned "T" is observed multiple times on the forward
strand, but never on the reverse strand, no T's will be included in
the calculation because T was not seen at least once on the reverse
strand. This option is dangerous in that it can artificially
amplify scores by eliminating errors, so its use is discouraged.
--use_seed *seed value*
This option sets the random seed to the specified value at the
beginning of the run. Perl's rand function is used for choosing a
subset of reads at a position when the depth of coverage exceeds
200.
--nocache
This flag option prevents bam2mpg from caching the SNV genotype
calls it sees at each site. This caching is meant to speed up
run-time when genotypes are being called genome-wide, but because
the caching doesn't consider exact base qualities, it can lead to
slightly inaccurate scoring (but generally not actual genotype
errors, especially when --qual_filter is used).
--sam_filter *"unix command"*
The specified command is applied as a filter between the "samtools
view" and "samtools pileup" commands. This means that your command
should expect SAM format, and should print SAM format lines for
alignments that pass your filter criteria. Default is no additional
filtering (but see "--bam_filter"). If your filter can be
accomplished with options to "samtools view", using "--bam_filter"
will be more efficient.
--pileup_filter *"pileup options"*
The specified options are appended to the call to "samtools
mpileup". By default, -B is applied, since it looks as if BAQ
quality scores hurt sensitivity. Anything specified with the
--pileup_filter option will supersede the -B option (i.e., -B will
not be applied if --pileup_filter is used, and -B is not included
in the option.
--snv_vcf *"path of SNV VCF file to write"*
This option specifies the path of a VCF file to be written
containing MPG genotypes in VCF format. If the filename ends in
".gz" or ".bgz", the program will attempt to pipe the VCF string to
"bzip2" before printing.
--div_vcf *"path of DIV VCF file to write"*
This option specifies the path of a VCF file to be written
containing MPG genotypes in VCF format. If the filename ends in
".gz" or ".bgz", the program will attempt to pipe the VCF string to
"bzip2" before printing.
--mpg *"path of MPG-formatted output file to write"*
This option specifies the path of an MPG file to be written
containing genotypes in MPG format. If the filename ends in ".gz",
the program will attempt to pipe the VCF string to "gzip" before
printing.
AUTHOR
Nancy F. Hansen - nhansen@mail.nih.gov
LEGAL
This software/database is "United States Government Work" under the
terms of the United States Copyright Act. It was written as part of the
authors' official duties for the United States Government and thus
cannot be copyrighted. This software/database is freely available to the
public for use without a copyright notice. Restrictions cannot be placed
on its present or future use.
Although all reasonable efforts have been taken to ensure the accuracy
and reliability of the software and data, the National Human Genome
Research Institute (NHGRI) and the U.S. Government does not and cannot
warrant the performance or results that may be obtained by using this
software or data. NHGRI and the U.S. Government disclaims all warranties
as to performance, merchantability or fitness for any particular
purpose.
In any work or product derived from this material, proper attribution of
the authors as the source of the software or data should be made, using
"NHGRI Genome Technology Branch" as the citation.