DNApi (version: 1.1)
de novo adapter prediction (iterative) algorithm for small RNA sequencing data. DNApi requires Python (2 or 3) under a Linux/Unix environment.
DNApi accept (un)compressed FASTQ files or redirected standard input
stdin) as an input. You can simply run:
$ python dnapi.py <fastq>
$ <process-generates-fastq> | python dnapi.py -
To see the detailed usage, type:
$ python dnapi.py [-h | --help]
DNApi can predict most 3′ adapters correctly with the default parameters. However, if you want to tweak the parameters or want to run other prediction modes, see [prediction modes and parameters] (https://github.com/jnktsj/DNApi#prediction-modes-and-parameters) for mode defail.
For quick examples and case studies, see: Examples
For other useful utilities, see: Utilities
If you want to integrate the adapter prediction algorithm into your program, see: API
Of course, (sadly) there are some limitations on 3′ adapter prediction although DNApi gives near-perfect results. For the information, see: Limitations
Prediction modes and parameters
The package covers three ways (hereafter modes) to predict adapters.
The prediction algorithm needs two main parameters
(k-mer lengths) and
(filtering ratio for less abundant kmers). The default is iterative
-k 9:11:2 and
-r 1.2:1.4:0.1. The default setting
already works well on any small RNA libraries, but you can tweak the
-r (For more detail, see
Iterative mode runs the algorithm multiple times with different combinations of k and R and refines the ranks of predicted adapter candidates in subsequent iterations.
$ python3 dnapi.py -k 9:11:2 -r 1.2:1.4:0.1 <fastq>
Single mode runs a single adapter prediction algorithm with a specific combination of k and R.
$ python3 dnapi.py -k 9 -r 1.4 <fastq>
Exhaustive mode exhaustively searches an optimal 3´ adapter by
running the algorithm multiple times with different combinations of
k and R to obtain a non-redundant list of adapter candidates, and
incorporating adapter removal and read mapping. Only this mode can
judge whether input libraries are already clean (i.e. the 3′ adapter
sequences are already removed). To turn on this mode, you need to run
(For more detail, see
$ python3 dnapi.py --map-command "<command>" <fastq>
You can also incorporate
-r. The default setting is
-k 9:11:2 and
This mode also outputs cleansed reads in FASTA format if the input FASTQ is not processed. The reads in the output FASTA are non-redundant, and the read counts are written in FASTA headers.
If a 3′adapter sequence is specified with
only executes quality control using a given genome mapping
$ python3 dnapi.py --map-command "<command>" --adapter-seq SEQ1 [SEQ2 SEQ3...] <fastq>
DNApi judges the input FASTQ quality is poor when the mapping rate is below 20%.
Adapter prediction parameters
-k [KMER_BEG:KMER_END:INCREMENT | KMER_LEN]
K-mer(s) to predict a 3′ adapter in the input FASTQ. When you specify
the longer argument with ":", DNApi performs iterative mode. In the
KMER_BEG is the smallest k-mer to start,
is the largest k-mer to end, and
INCREMENT is an interval of the
k-mers. The default is
9:11:2, i.e., from 9mer to 11mer in a 2nt
interval (k = 9, 11). When you specify a single k-mer length
KMER_LEN with a single ratio (see
-r below), DNApi runs single
-r [RATIO_BEG:RATIO_END:INTCREMENT | RATIO]
Cutoff ratio(s) for filtering less frequent k-mers. For each k-mer, a
ratio of the frequency of the most abundant k-mer to the frequency of
a target k-mer will be computed. If a ratio is lower than the cutoff
-r, the k-mer with the ratio will be discarded. As in
-k for iterative search,
RATIO_BEG is the smallest ratio
RATIO_END is the largest ratio to end, and
an interval of the ratios. The default is
1.2:1.4:0.1, i.e., from
1.2 to 1.4 in a 0.1 interval (r = 1.2, 1.3, 1.4). When you specify a
RATIO with a single k-mer length, DNApi runs single
This option shows other predicted 3′adapter candidates (if any).
Exhaustive adapter search with mapping process
COMMAND is the genome mapping command to be tested.
For this argument, any read mapping software package can be used.
The requirements for this argument are:
- Specify FASTA as the input read format
- Specify the input read filename as
- Specify SAM as the output format for the mapping results
- Specify the output SAM filename as
COMMANDas a string in the command for DNApi
For example, when you want to use Bowtie as a mapping engine, the entire command line for DNApi will be:
$ python3 dnapi.py "/path_to/bowtie /path_to/genome_index -p8 -v0 -k1 -S -f @in > @out" <FASTQ>
Bowtie options used:
-p <int>: Number of
-v <int>: Number of
-k <int>: report up to
-S: SAM output
-f: FASTA input
The results will be printed in standard output (
stdout). The length
of predicted 3′ adapter sequences will be the 3′ adapter prefix match
length specified by
--prefix-match + 5nt.
Subsampling fraction of reads in an input FASTQ for exhaustive mode.
In the default, DNApi uses all reads (
Small read sets can make DNApi faster (For more detail, see [Tips for
making the exhaustive search mode faster]
Output directory for cleansed reads after a computation of
exhaustive mode. If the input FASTQ is not processed, DNApi removes
predicted 3′ adapters from the reads and generates a FASTA file
containing cleansed reads. In the default setting, DNApi creates the
output in the current directory as
Suppress the output of the report and the cleansed reads, and only display report on the screen.
Place for the temporary directory. DNApi creates a temporary directory
during a computation of exhaustive mode. In the default setting, the
program makes the directory in
Evaluation of 3′ adapter candidates
--adapter-seq SEQ [SEQ ...]
A list of 3′ adapter(s) for quality control. When the option is specified, DNApi maps the processed reads after clipping each 3′ adapter in every run and checks the genome mapping rate.
Adapter removal parameters
3′ adapter prefix match length. DNApi only considers perfect adapter matches. The default is 7nt. This option affects the length of predicted 3′ adapter sequences in the final output.
Minimum read length to keep for mapping. Extracted small RNA reads
will be discarded if the lengths are shorter than the specified
--min-len. The default is 16nt.
Maximum read length to keep for mapping. Extracted small RNA reads
will be discarded if the lengths are longer than the specified
--max-len. The default is 36nt.
Trim specified number of bases from 5′ ends after adapter removal. This option will be combined with the adapter clipping process to trim down specific number of bases additionally.
Trim specified number of bases from 3′ ends after adapter removal. This option can be combined with the adapter clipping process to trim down specific number of bases additionally.
You can access the adapter prediction algorithm once you import
dnapilib.apred in your python
the core function for iterative and single adapter prediciton. It
takes four arguments: FASTQ file name, filtering ratio, k-mer size,
and subsampling read count. As the result, the two functions return
the list of tuples containing predicted 3′ adapters and the assembly
scores. The returned list is sorted by the scores.
from dnapilib.apred import adapter_prediction from dnapilib.apred import iterative_adapter_prediction # [iterative mode] # iterative_adapter_prediction(FASTQ, ratios, k-mers, subsample_read_count, length_to_print=12) iterative_result = iterative_adapter_prediction("examples/good.fq", [1.2, 1.3, 1.4], [9, 11], 50000) # [single mode] # adapter_prediction(FASTQ, ratio, k-mer, subsample_read_count) single_result = adapter_prediction("examples/good.fq", 1.4, 9, 50000) # all predicted adapters print(iterative_result) # >> [('TGGAATTCTCGG', 200.0)] print(single_result) # >> [('TGGAATTCTCGGGTGCCAAGGAACTCC', 100.0)] # predicted adapter with the highest score print(iterative_result) # >> 'TGGAATTCTCGGG' print(single_result) # >> 'TGGAATTCTCGGGTGCCAAGGAACTCC'
In addition to DNApi, there are potentially useful three programs in
qual-offset.pyestimates ASCII-encoded quality score offsets of FASTQ files.
qual-trim.pytrims low quality bases in input FASTQ reads. The quality trimming algorithm in the program is the same as the one in BWA.
to-fasta.pyremoves specified 5′ and/or 3′ adapter sequences, merges identical reads while retainig the counts, and writes the collapsed reads as FASTA in standard output (
To see the usage for each program, type:
$ python3 <program-name> [-h | --help]
DNApi has a few limitations on 3′ adapter prediction:
- Poly(A) or other low-complexity 3′ adapters can't be predicted due to the low-complexity k-mer filtering step.
- Prediction accuracy will drop if gel-extracted lengths of RNAs are long enough to be sequenced, i.e. if few 3′ adapters are in FASTQ.
- DNApi can't do demultiplexing.
If you use DNApi in your publications, please cite:
Tsuji J, Weng Z. (2016) DNApi: A De Novo Adapter Prediction Algorithm for Small RNA Sequencing Data. PLoS One 11(10):e0164228. [article]