-
Notifications
You must be signed in to change notification settings - Fork 1
barapost prober
barapost-prober.py -- This script is designed for taxonomic classification of nucleotide sequences by finding the most similar sequence in NCBI Nucleotide database using BLAST web service.
Processing entire data set with "barapost-prober.py" is not expedient, since servers are in public use and handling a request can linger for a long time.
Therefore the main goal of this script is to submit a relatively small probing batch (see -b
option) of sequences to NCBI BLAST service and discover, what Genbank records can be downloaded and used as reference sequences in a database stored on local machine. Further classification will be performed by "barapost-local.py" on local machine.
This script processes FASTQ and FASTA (as well as .fastq.gz
and .fasta.gz
) files.
If you have your own FASTA files that can be used as reference sequences database to align against, you can omit "barapost-prober.py" step and go to "barapost-local.py" (see -l
option in "barapost-local.py" description.
- Input FASTA and FASTQ files should be specified as positional arguments (see examples below).
- Input files must have different names. I.e. files
.../dir1/reads.fastq
and.../dir2/reads.fastq
are not allowed.
- if no input files are specified, "barapost-prober.py" processes all FASTQ and FASTA files in working directory;
- algorithm (see
-a
option):0
(zero, i.e. megaBlast); - packet forming mode (see
-c
option):0
(zero); - packet size: (see
-p
option): 100 sequences for packet forming mode0
, 20,000 base pairs for packet forming mode1
; - probing batch size (see
-b
option): 200 sequences; - database slices (see
-g
option): wholenr/nt
database, i.e. no slices; - output directory (
-o
option): directory namedbarapost_result
nested in working directory; - prober sends sequences intact
(i.e. does not prune them before submission (see
-x
option));
-h (--help) --- show help message.
'-h' -- brief, '--help' -- full;
-v (--version) --- show version;
-d (--indir) --- directory which contains FASTQ or FASTA
files meant to be processed.
I.e. all FASTQ and FASTA files in this direcory will be processed;
-o (--outdir) --- output directory.
Default value: `barapost_result`;
-c (--packet-mode) --- indicates how prober calculates size of packet.
See section "Example of how mode 1 works" below for details.
Values:
0 -- packet size equals number of sequences in it [default mode];
1 -- packet size equals sum of lengths of sequences in it;
Mode 1 performs better if input sequences are binned by length within input file;
-p (--packet-size) --- size of the packet (see option '-c' above for definition).
It implies that "barapost-prober.py" can perform
multiple requests during single run.
Value: positive integer number.
Default value:
100 (sequences) for mode 0;
20,000 (base pairs) for mode 1;
-a (--algorithm) --- BLASTn algorithm to use for alignment.
Available values: 0 for megaBlast, 1 for discoMegablast, 2 for blastn.
Default is 0 (megaBlast);
* -g (--organisms) --- TaxIDs of organisms to align your sequences against.
I.e. with this options you can specify 'nr/nt' database slices.
Format of value (see Examples #3 and #4 below.):
<organism1_taxid>,<organism2_taxid>...
Spaces are NOT allowed.
Default value is full 'nr/nt' database, i.e. no slices.
You can find your Taxonomy IDs here:
'https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi';
-b (--probing-batch-size) --- total number of sequences that
will be submitted to BLAST server during 'barapost-prober.py' run.
You can specify '-b all' to process all your sequeces by 'barapost-prober.py'.
Value: positive integer number.
Default value is 200;
-x (--max-seq-len) --- maximum length of a sequence that
prober submits to NCBI BLAST service.
If specified, prober will prune your sequences before
submission in order to spare NCBI servers.
This feature is disabled by default;
* - Functionality of -g
option is totally equal to "Organism" text boxes on this BLASTn page:
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome.
-
"hits_to_download.tsv"
file in root of output directory. In this file you can find: accessions and names of Genbank records to be further downloaded by "barapost-local.py" in order to create a database on your local machine. -
"classification.tsv"
file in directory corresponding to a particular input file (for example, for input filesomething.fastq
result directory name will besomething
). In this file you can find:- IDs of classified sequences;
- classification (usually lineage);
- accession(s) of best hit(s) divided with
&&
; - some alignment statistics;
- quality information (for FASTQ files).
If packet forming mode 1
is set, prober will add sequences to each packet until sum of ther length reaches (or exceeds) specified size of packet.
prober takes preventive pruning (see -x
option) into account when calculating packet size in mode 1
: lengths of pruned sequences are considered.
Example:
- Input fasta file:
>sequence_1_length_6_bp
ACTGAA
>sequence_2_length_5_bp
GTGTA
>sequence_3_length_4_bp
TAGC
>sequence_4_length_4_bp
GGAC
-
Packet size is
5
(-p 5
), mode is1
(-c 1
),-x
is not enabled. -
Packets for submission:
- Packet #1. Length of
sequence_1
is greater than packet size -- packet contains this sequence only:
>sequence_1_length_101_bp ACTGAA
- Packet #2. Length of
sequence_2
equals packet size -- packet contains this sequence only:
>sequence_2_length_100_bp GTGTA
- Packet #3. Length of
sequence_3
is less than packet size, so prober will addsequence_4
to this packet, and then sum of their lengths will exceed packet size, making it ready for submission:
>sequence_3_length_99_bp TAGC >sequence_4_length_99_bp GGAC
- Packet #1. Length of
Note for Windows users: run py -3 barapost-prober.py
in Windows console. barapost-prober.py
won't work.
- Process all FASTA and FASTQ files in working directory with default settings: barapost-prober.py
barapost-prober.py
- Process one file with default settings:
barapost-prober.py reads.fastq
- Process a FASTQ file and a FASTA file with discoMegablast, packet size of 100 sequences. Search only among Erwinia sequences (551 is Erwinia taxID):
barapost-prober.py reads_1.fastq.gz some_sequences.fasta -a discoMegablast -p 100 -g 551
- Process all FASTQ and FASTA files in directory named
some_dir
. Process 300 sequences, packet size is 100 sequnces (3 packets will be sent). Search only among Escherichia (taxID 561) and viral (taxID 10239) sequences:
barapost-prober.py -d some_dir -g 561,10239 -o outdir -b 300 -p 100