
BLAST and extract the sequences

Patrick Douglas edited this page Mar 28, 2019 · 5 revisions

The SeqsExtractor-blast-and-extract option runs a BLAST search with NCBI BLAST+ and then extracts every query sequence whose hits against the subject database fall within a user-specified percentage of identity.

EXAMPLE: After the BLAST run, the tabular BLAST output is used to extract from your query dataset only the sequences whose hits match at a specific percentage of identity, such as 100%.
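
The filtering idea described above can be sketched in Python. This is a minimal illustration, not SeqsExtractor's actual implementation. It assumes the standard 12-column tabular BLAST output (-outfmt 6), where column 1 is the query ID and column 3 is the percentage of identity. The function names ids_in_identity_range and extract_fasta, and the sample data, are hypothetical.

```python
import io


def ids_in_identity_range(tabular_lines, low, high):
    """Return query IDs that have at least one hit with low <= pident <= high."""
    keep = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, pident = fields[0], float(fields[2])
        if low <= pident <= high:
            keep.add(query_id)
    return keep


def extract_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every FASTA record whose ID is in wanted_ids."""
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record ID is the header text up to the first whitespace.
            header = line[1:].split()
            writing = bool(header) and header[0] in wanted_ids
        if writing:
            yield line


# Tiny in-memory example (made-up data, for illustration only).
blast_tab = io.StringIO(
    "seq1\tsubjA\t100.00\t50\t0\t0\t1\t50\t1\t50\t1e-25\t93.5\n"
    "seq2\tsubjB\t85.00\t40\t6\t0\t1\t40\t1\t40\t1e-10\t50.1\n"
)
query_fasta = io.StringIO(">seq1 example\nACGT\n>seq2 example\nTTTT\n")

selected = ids_in_identity_range(blast_tab, 90.0, 100.0)
extracted = "".join(extract_fasta(query_fasta, selected))
print(extracted, end="")
```

With real data, the two StringIO objects would be replaced by open file handles for the BLAST tabular output and the query FASTA file.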

The required command line looks like the one below:

USAGE:
Example: SeqsExtractor-blast-and-extract -i query.fa -o /home/user/test -b n -d mouse.preformated.blastdb.fa -p 90-100 -e 1e-20 -t 10 -a '-max_target_seqs 1'

Required arguments:
-i <string> | Query fasta
-o <string> | Output directory
-b <n/x>    | Blast+ algorithm (n or x)
-d <string> | Pre-formatted Blast+ database
-p <string> | Pct. of identity to extract sequences

Optional arguments:
-e <string>     | E-value. Default: 1e-3
-t <integer>    | Number of threads. Default: all available threads
-a <string>     | Blast+ optional parameters. E.g. '-max_target_seqs 1 -import_search_strategy filename' (Use between quotes!)

Example commandline:

SeqsExtractor-blast-and-extract -i M.musculus_NCBI_entire_genome.fasta -o /home/user/test -b x -d Mus_musculus_uniprot_swisprot.fasta -p 90-100 -e 1e-20 -t 10 -a '-max_target_seqs 1'

NOTE: This option requires a preformatted BLAST database. To create that set of database files, use a command like the ones below.

Example for BLASTx

makeblastdb -in name_of_your_database_to_BLAST.fasta -dbtype prot

Example for BLASTn

makeblastdb -in name_of_your_database_to_BLAST.fasta -dbtype nucl
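
The relationship between the -b choice and the database type can be sketched as follows. This is only an illustration under the assumption that a BLASTn search (n) needs a nucleotide database and a BLASTx search (x) needs a protein database, as in the two examples above. The helper name makeblastdb_command is hypothetical and is not part of SeqsExtractor.

```python
def makeblastdb_command(fasta_path, algorithm):
    """Build the makeblastdb argument list for BLASTn ('n') or BLASTx ('x')."""
    dbtypes = {"n": "nucl", "x": "prot"}
    if algorithm not in dbtypes:
        raise ValueError("algorithm must be 'n' or 'x'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtypes[algorithm]]


cmd = makeblastdb_command("name_of_your_database_to_BLAST.fasta", "x")
print(" ".join(cmd))
# To actually run it (requires BLAST+ installed and the FASTA file present):
#   import subprocess; subprocess.run(cmd, check=True)
```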

Commandline explained

Input FASTA file:

Enter the fasta file to be used as a query

-i /home/me/M.musculus_NCBI_entire_genome.fasta

Output directory to save all results:

Enter the directory where all results will be saved

-o 

Blast+ algorithm to use

Available options: x (BLASTx) or n (BLASTn)

-b x

or

-b n

Blast+ preformated database:

Here you need to provide a preformatted BLAST+ database

-d Mus_musculus_uniprot_swisprot.fasta

Percentage of identity to extract Sequences:

Now you can choose a specific percentage of identity to extract your sequences. All available options are listed below:

10      to get only the sequences that match with 10% identity
20      to get only the sequences that match with 20% identity
30      to get only the sequences that match with 30% identity
40      to get only the sequences that match with 40% identity
50      to get only the sequences that match with 50% identity
60      to get only the sequences that match with 60% identity
70      to get only the sequences that match with 70% identity
80      to get only the sequences that match with 80% identity
90      to get only the sequences that match with 90% identity
100     to get only the sequences that match with 100% identity
10-100  to get only the sequences that match with 10% to 100% identity
20-100  to get only the sequences that match with 20% to 100% identity
30-100  to get only the sequences that match with 30% to 100% identity
40-100  to get only the sequences that match with 40% to 100% identity
50-100  to get only the sequences that match with 50% to 100% identity
60-100  to get only the sequences that match with 60% to 100% identity
70-100  to get only the sequences that match with 70% to 100% identity
80-100  to get only the sequences that match with 80% to 100% identity
90-100  to get only the sequences that match with 90% to 100% identity

Or type all to apply no filter and get all sequences that match in the BLAST search.

Example:

-p 90-100

This will extract the sequences that match with 90% to 100% identity
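
One way to picture how a -p value maps to a filter is the sketch below. It is not taken from the SeqsExtractor source. The function name parse_identity_option is hypothetical, and the handling of a single value (treated here as an exact match) is an assumption, since this page does not say how fractional identities such as 90.5% are handled.

```python
def parse_identity_option(value):
    """Translate a -p value into a (low, high) percent-identity range."""
    if value == "all":
        return (0.0, 100.0)  # no filtering: every hit falls inside this range
    if "-" in value:
        low_text, high_text = value.split("-", 1)
        low, high = float(low_text), float(high_text)
    else:
        # ASSUMPTION: a single value N selects hits with exactly N% identity.
        low = high = float(value)
    if not 0.0 <= low <= high <= 100.0:
        raise ValueError(f"invalid identity range: {value!r}")
    return (low, high)


print(parse_identity_option("90-100"))
print(parse_identity_option("100"))
print(parse_identity_option("all"))
```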

Optional.

Expected value (E-value) that you want to use in the BLAST search.

Example:

-e 1e-20

If you do not use this option, the default value (1e-3) is used

Number of CPU threads that you want to use in the BLAST search.

NOTE: On Linux Mint/Ubuntu, the command nproc shows the number of threads available on the machine

Example:

-t 12

If you do not use this option, SeqsExtractor automatically uses the maximum number of threads available on the machine
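
The default described above can be mimicked in Python as a rough sketch. The function name resolve_threads is hypothetical. Note that os.cpu_count() reports the logical CPUs of the machine, which can differ from what nproc prints when the process is restricted to a subset of CPUs.

```python
import os


def resolve_threads(requested=None):
    """Return the requested thread count, or all logical CPUs when omitted."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        return available
    if requested < 1:
        raise ValueError("thread count must be at least 1")
    return requested


print(resolve_threads(12))  # explicit value, as with -t 12
print(resolve_threads())    # falls back to the machine's logical CPU count
```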

Here you can pass additional BLAST parameters. Write them as you would on the BLAST command line (each starting with a dash, separated by spaces) and enclose the whole list in single quotes.

Example:

-a '-max_target_seqs 1 -max_hsps 1'
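
As a rough picture of why the quotes matter, the sketch below splits a quoted -a string into separate arguments and appends them to a BLAST+ command. How SeqsExtractor really assembles its BLAST call is not documented on this page, so the function build_blast_command and the choice of -outfmt 6 are assumptions. The flags -query, -db, -evalue, -num_threads and -outfmt are standard BLAST+ options.

```python
import shlex


def build_blast_command(query, database, algorithm, evalue="1e-3",
                        threads=1, extra=""):
    """Assemble a blastn/blastx argument list from SeqsExtractor-style options."""
    programs = {"n": "blastn", "x": "blastx"}
    command = [
        programs[algorithm],
        "-query", query,
        "-db", database,
        "-evalue", evalue,
        "-num_threads", str(threads),
        "-outfmt", "6",  # ASSUMPTION: tabular output is what gets filtered
    ]
    # shlex.split turns the single quoted string into separate arguments.
    command.extend(shlex.split(extra))
    return command


blast_cmd = build_blast_command(
    "query.fa", "mouse.preformated.blastdb.fa", "n",
    evalue="1e-20", threads=10, extra="-max_target_seqs 1",
)
print(" ".join(blast_cmd))
```

Without the surrounding quotes, the shell would hand -max_target_seqs and 1 to SeqsExtractor as two separate arguments instead of one -a value.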

The final screen lists the names of the files that are stored in the output directory.

