This is another attempt to write a fast enough experimental ancient DNA damage aware short read mapper. This work depends on the rust-bio crate (Köster, 2016).
mapAD uses pure backtracking on top of the bidirectional FMD-index (Li, 2012). Central algorithmic ideas are inspired by BWA-backtrack (Li & Durbin, 2009). Improved algorithms and error models will be incorporated step by step as needed.
Ancient DNA damage models can be included via the SequenceDifferenceModel
trait.
The default (and only) impl is based on Udo Stenzel's ANFO/r-candy.
Besides Rust, no additional dependencies are needed to compile.
-
Install/update Rust (locally, to the current user's home directory):
curl https://sh.rustup.rs -sSf | sh
-
Git clone mapAD or simply download a release zip file:
git clone https://github.com/mpieva/mapAD.git && cd mapAD
-
Build:
cargo build --release
The resulting binary file
mapad
is now in the subfoldertarget/release/
. -
Run!
cd target/release/
./mapad index --reference /path/to/reference/hg19.fasta
- For increased performance on modern CPUs the compiler can make use of advanced SIMD instructions if you enable AVX2 and FMA like this (recommended). Please note that the resulting binary will not run on CPUs that don't support these features.
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release
or this (not recommended; reduced portability)
RUSTFLAGS="-C target-cpu=native" cargo build --release
Please note that the resulting binary is not necessarily portable between different CPU architectures.
- The number of
v
s passed to the program determines the level of verbosity:
mapad -vv index ...
or mapad -vv map ...
mapAD
can be built as a fully static binary withmusl
libc:- Install target platform requirements:
rustup target add x86_64-unknown-linux-musl
- Install
musl-gcc
on the building system (Ubuntu:sudo apt install musl-tools
)
RUSTFLAGS="-C target-feature=+crt-static -C link-self-contained=yes" cargo build --release --target x86_64-unknown-linux-musl
- Install target platform requirements:
The subprograms mapad index
and mapad map
will index the reference and map reads to it, respectively.
Adding the --help
flag will print a list of available and required command line options.
./mapad index --reference /path/to/reference/hg19.fasta
will store six index files in the directory of the input
FASTA (/path/to/reference/hg19.fasta{.tbw, .tle, .toc, .tpi, .trt, .tsa}
).
The scoring model is derived from Udo
Stenzel's ANFO/r-candy (Green et al., 2010
; SOM3).
The symbols
Double-stranded library preparation: The probability of a
position
Effective deamination
probabilities:
Sequencing errors and evolution (
A | C | G | T | |
---|---|---|---|---|
A | ||||
C | ||||
G | ||||
T |
All of the above probabilities are
Tests have shown that we can achieve good sensitivity and specificity allowing -p 0.03
mismatches and relatively high
deamination parameters (see "50% Deamination Parameters" below).
Over-specification of damage parameters does not seem to have a significant negative impact on alignment accuracy.
The following example aligns reads to an existing index of the hg19 reference. These damage settings cause C -> T mismatches on both 5'- and 3'-ends to be free (no penalty). The penalties for those substitutions of course increase as the center of a read is approached.
./mapad -vv map \
--threads 32 `# Number of threads to use (runs on all available cores when set to 0).` \
--library single_stranded `# Library preparation protocol (single- or double-stranded)` \
-p 0.03 `# Allowed mismatches under `-D` base error rate (similar to BWA backtrack)` \
-f 0.5 `# Five-prime overhang parameter` (generic overhang parameter when "--library" is set to "double_stranded") \
-t 0.5 `# Three-prime overhang parameter` (not used if "--library" is set to "double_stranded") \
-d 0.02 `# Deamination rate in double-stranded parts` \
-s 1.0 `# Deamination rate in single-stranded overhangs` \
-i 0.001 `# InDel rate (corresponds to gap open penalty)` \
-x 0.5 `# Gap extension penalty as a fraction of the repr. mismatch penalty` \
--reads "${input_bam}" \
--reference "/path/to/reference/hg19.fasta" `# Prefix of index files` \
--output "${output_bam}"
The following example starts a dispatcher node and then spawns multi-threaded workers on SGE cluster nodes that have more than 30GB of free RAM. Start the dispatcher:
./mapad -v map \
--dispatcher \
# ... (see local example)
Spawn workers:
qsub -N "mapAD_worker" -pe "smp" 1-32 -t 1-128 -l "h_vmem=30G,s_vmem=30G,virtual_free=30G,mem_free=30G,class=*" -j "y" -R "y" -b "y" ./mapad -vv worker --threads 8 --host $(hostname)
Mapping qualities are comparable with those produced by BWA
. However, an alignment that maps equally well to two
positions in the genome would be assigned a MAPQ of 3 by mapAD
, whereas BWA
would assign a MAPQ of 0. To filter out
reads mapping to multiple positions a MAPQ threshold of > 5-10 roughly corresponds to a BWA
-specific threshold of > 0.
Here,
- Unique (best alignment maps to one position):
$1$ - Pseudo-unique (best alignment maps to one position, but, with worse score, also to others): $\frac{\text{AS}\text{best}}{\text{AS}\text{best} + \sum{\text{AS}_\text{subopt} |\text{subopt}|}}$
- Non-unique (best alignment maps to multiple
positions):
$\frac{1}{|\text{best}|}$
Mapping quality is defined as the PHRED-scaled probability that an alignment is incorrect. Hence the above probabilities
are PHRED-scaled, and, for better compatibility with BWA
, confined to the interval
A recommended equivalent to a mapping quality threshold of 25 for BWA mapped data is 20 for mapAD output.
mapAD uses BAM auxiliary data fields to report suboptimal alignments in a bwa aln
-like fashion. X0
: Number of best
hits (multi-mapped alignment), X1
: Number of suboptimal alignments, XA
: 5 best suboptimal alignments in the format
chr,(+|-)pos,cigar,MD,NM,num_of_hits,AS
, XT
: Type of mapping ((N|U|R)
), XS
: Best suboptimal alignment score.
- Memory consumption of both mapping and indexing (see Hardware Requirements)
- No awareness of paired-end sequencing (pairs need to be merged before mapping)
- No seeding (it's not very effective for short reads, but could easily be implemented for longer ones. Probably
with less negative impact on (aDNA-)sensitivity than seeding in
BWA
).