A total of 30 utilities/options for common operations on fasta sequences are currently organized under 6 main subcommands. The utilities range from searching a large set of fasta sequences for specific entries by ID, by a text string in the defline, or by a sequence motif, to splitting a large sequence file into small chunks, reporting summary statistics of a genome assembly, and filtering sequences by length, gap size, or redundancy in ID or sequence. Furthermore, because fatools can read input from stdin and write output to stdout, multiple operations can be chained sequentially in a single command line using the pipe (|). Make sure you have Python 2 installed to run fatools.
Typing 'fatools' displays the list of subcommands; typing 'fatools <subcommand>' displays the detailed utilities/options for that subcommand.
convert
-r print sequence in reverse complement.
-N convert all non-ACGT letters to N.
-R remove all non-ACGT letters.
-U to upper case
-u to lower case
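The convert operations above can be sketched in a few lines of Python. This is an illustrative sketch written for Python 3, not code taken from fatools; the function names are hypothetical, and it assumes that lowercase a/c/g/t count as valid bases, which the option descriptions do not state.

```python
import re

# Translation table for complementing bases; letters other than ACGT/acgt
# are left unchanged (an assumption, not documented fatools behavior).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence (compare -r)."""
    return seq.translate(COMPLEMENT)[::-1]


def mask_non_acgt(seq):
    """Replace every non-ACGT letter with N (compare -N)."""
    return re.sub(r"[^ACGTacgt]", "N", seq)


def remove_non_acgt(seq):
    """Delete every non-ACGT letter (compare -R)."""
    return re.sub(r"[^ACGTacgt]", "", seq)


print(reverse_complement("AAGC"))  # GCTT
```

Upper and lower case conversion (-U, -u) corresponds to Python's built-in `str.upper()` and `str.lower()`.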
extract
-F N extract the first N fasta entries. If N is larger than the total number of entries,
then print all entries.
-S N extract from the Nth entry to the last entry.
-L N extract last N sequence entries. If N is larger than the total number of entries, then print all entries.
Use -S N -F M for entries from N to M; use -F N and -L N to extract both the first and last N entries.
-f N extract first N bp, prints the entire sequence if N is larger than the total length.
-s N extract sequence up to N bp.
-l N extract last N bp, prints the entire sequence if N is larger than the total length.
Use -s N -f M for sequence from N to M bp; use -f N and -l N to extract both the first and last N bp as
one sequence separated by a space.
Note: The -f, -s, and -l options were designed for working with a single long sequence, even though they will
work for multiple sequences by applying the same operation to all sequences.
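The first/start/last semantics described above map naturally onto slicing. The following is a minimal sketch in Python 3, assuming positions are 1-based and ranges are inclusive (the option text suggests this but does not state it explicitly). The function names are hypothetical and are not part of fatools.

```python
def first_n(items, n):
    """First n items; returns everything if n exceeds the length."""
    return items[:n]


def from_nth(items, n):
    """From the n-th item (1-based) to the end."""
    return items[n - 1:]


def last_n(items, n):
    """Last n items; returns everything if n exceeds the length."""
    return items[-n:] if n > 0 else items[:0]


def nth_to_mth(items, n, m):
    """Items n through m, 1-based and inclusive."""
    return items[n - 1:m]


entries = ["seq1", "seq2", "seq3", "seq4", "seq5"]
print(nth_to_mth(entries, 2, 4))      # ['seq2', 'seq3', 'seq4']
print(nth_to_mth("ACGTACGT", 3, 5))   # GTA
```

Because Python lists and strings slice the same way, the same helpers cover both the entry-level case (a list of fasta entries) and the base-level case (a single sequence string).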
filter
-g N skip sequences with N or more Ns.
-r 1/2 1: skip redundant entries based on ID; 2: keep redundant entries by adding
a serial number to the identical IDs to make each ID unique.
-R 1/N skip redundant entries based on sequence.
1: use the entire sequence; N: use only the first and last N bases.
-l N skip sequences shorter than N bp.
-L N skip sequences longer than N bp.
use -l N -L M for sequences with length from N to M bp (inclusive).
With any of these options, '-e' can be added to print the skipped entries to STDERR, which can be captured using 2>[skipped.fa].
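To make the filtering rules concrete, here is a small Python 3 sketch that works on (defline, sequence) pairs. It is an illustration under assumptions, not the fatools implementation: the function names are hypothetical, the ID is assumed to be the part of the defline before the first whitespace, and skipped records are returned as a second list rather than written to STDERR.

```python
def filter_by_length(records, min_len=None, max_len=None):
    """Split records into (kept, skipped) by inclusive length bounds."""
    kept, skipped = [], []
    for defline, seq in records:
        too_short = min_len is not None and len(seq) < min_len
        too_long = max_len is not None and len(seq) > max_len
        if too_short or too_long:
            skipped.append((defline, seq))
        else:
            kept.append((defline, seq))
    return kept, skipped


def drop_duplicate_ids(records):
    """Keep only the first record seen for each ID."""
    seen, kept = set(), []
    for defline, seq in records:
        seq_id = defline.split()[0]  # assumes a non-empty defline
        if seq_id not in seen:
            seen.add(seq_id)
            kept.append((defline, seq))
    return kept


def drop_duplicate_sequences(records):
    """Keep only the first record seen for each distinct full sequence."""
    seen, kept = set(), []
    for defline, seq in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((defline, seq))
    return kept
```

For example, `filter_by_length(records, min_len=100, max_len=250)` mirrors the idea of combining -l and -L to keep sequences within a length range.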
report
-f print fasta entries as in the input.
-F print fasta entries with all sequence in one line.
-n print sequences without the defline.
-d print deflines in short form (part before the first space).
-D print deflines in the original form.
-c print the total number of fasta entries in the input.
-l print short defline +[\t] length.
-L print original defline +[\t] length.
-s print sequence summary statistics including N50.
-S print sequence summary statistics plus detailed gap info.
Use -h with -s or -S to suppress the header above the output.
Use -H to print parameters in human friendly form.
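For readers unfamiliar with N50, the sketch below shows the usual definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the total bases. It is a Python 3 illustration with hypothetical function names and toy lengths; how fatools itself computes or rounds its statistics is not shown here.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no sequences given


def summarize(lengths):
    """Basic summary statistics for a non-empty list of sequence lengths."""
    total = sum(lengths)
    return {
        "count": len(lengths),
        "total_bp": total,
        "mean": total / float(len(lengths)),
        "min": min(lengths),
        "max": max(lengths),
        "n50": n50(lengths),
    }


print(n50([2, 3, 4, 5, 6]))  # 5
```

In the toy example the total is 20 bases, so the threshold is 10; the running total reaches 11 after the lengths 6 and 5, which makes the N50 equal to 5.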
search
-s string: search for entries containing "string" in the sequence.
-d string: search for entries containing "string" in the defline. The default is an exact match; use "/string" to search for entries with "string" as part of the ID.
-F file: search for sequences based on a list of IDs in the file (one ID/line).
Can use -D to specify delimiter in the defline. Default is space or '|' or end of line;
use -i to specify the field number, default is 1.
-1 print only the 1st match for -d and -s.
-v use with -s, -d or -F to negate the search.
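The ID-list lookup with a delimiter and field number can be pictured as follows. This Python 3 sketch is an assumption-laden illustration, not the fatools code: the function names are hypothetical, the default delimiter is taken to be a space or '|' as described above, fields are numbered from 1, and the `invert` flag stands in for the idea behind -v.

```python
import re


def defline_field(defline, delimiter=None, field=1):
    """Return the field-th piece (1-based) of a defline split on the delimiter."""
    pattern = r"[ |]" if delimiter is None else re.escape(delimiter)
    parts = re.split(pattern, defline)
    if 0 < field <= len(parts):
        return parts[field - 1]
    return ""


def search_by_ids(records, wanted_ids, delimiter=None, field=1, invert=False):
    """Select (defline, sequence) records whose ID field is in wanted_ids.

    With invert=True, select the records whose ID is NOT in the list.
    """
    wanted = set(wanted_ids)
    return [
        (defline, seq)
        for defline, seq in records
        if (defline_field(defline, delimiter, field) in wanted) != invert
    ]


print(defline_field("NP_001245510.1 notch, isoform B [Drosophila melanogaster]"))
# NP_001245510.1
```

Storing the wanted IDs in a set keeps each lookup constant-time, which matters when both the ID list and the fasta file are large.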
split
-G N split each of the sequences in the input file as non-gap fragments.
"N" is the number of consecutive N bases; default is 1.
Use -G N with -t to print just the gap positions.
-n N split the input sequences into chunks, each containing N fasta entries (the last chunk may contain fewer).
-N N split the input sequences into N chunks, each containing equal number of entries (last one may be smaller).
-M N split the input sequences into chunks at ~N MB (million bp) in size (last chunk may be smaller).
-o file: prefix for output files (serial numbers added to prefix; required).
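The splitting strategies can be sketched as below. This is a Python 3 illustration with hypothetical function names, not the fatools implementation. It assumes a gap is a run of at least `min_gap` N characters in either case, and that "N chunks" is realized by rounding the chunk size up, so very small inputs may yield fewer than the requested number of chunks.

```python
import re


def split_at_gaps(seq, min_gap=1):
    """Split a sequence into non-gap fragments at runs of >= min_gap Ns."""
    return [frag for frag in re.split(r"[Nn]{%d,}" % min_gap, seq) if frag]


def chunks_of_size(entries, size):
    """Consecutive chunks of `size` entries; the last chunk may be smaller."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def into_n_chunks(entries, n_chunks):
    """About n_chunks chunks of near-equal size; the last one may be smaller."""
    if not entries:
        return []
    size = -(-len(entries) // n_chunks)  # ceiling division
    return chunks_of_size(entries, size)


print(split_at_gaps("ACGTNNNNACGTNAC", min_gap=2))  # ['ACGT', 'ACGTNAC']
print(chunks_of_size(list(range(7)), 3))            # [[0, 1, 2], [3, 4, 5], [6]]
```

With `min_gap=2`, the single N near the end is not treated as a gap, so it stays inside the second fragment.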
- Open the terminal and navigate to where you have downloaded fatools.
- Find where Python 2 is installed on your system:
    which python
  Usually you would get something like /usr/bin/python.
- Copy this output to the beginning of fatools as the first line:
    #!/usr/bin/python
- Run the following command to make the script executable:
    chmod +x fatools
- Add fatools to your bin directory or any other directory included in your $PATH.
- You should be able to run fatools from anywhere!
- On Windows, type 'control panel' in the Windows search bar
- Go to System and Security > System > Advanced System settings > Environment variables
- Under system variables, select 'Path' and click 'Edit' and then 'New'
- Add the path of where you have fatools located and click OK
- You should be able to run fatools from anywhere!
Navigate to the exampleFiles directory in this repository. There you will find a fasta file (exampleFasta.fa) and a file containing a list of IDs (IDlist.txt) from exampleFasta.fa.
To extract the fasta sequences from exampleFasta.fa based on the list of IDs:
fatools search -F IDlist.txt exampleFasta.fa
>NP_001245510.1 notch, isoform B [Drosophila melanogaster]
NNMQSQRSRRRSRAPNTWICFWINKMHAVASLPASLPLLLLTLAFANLPNTVRGTDTALVAASCTSVGCQNG
GTCVTQLNGKTYCACDSHYVGDYCEHRNPCNSMRCQNGGTCQVTFRNGRPGISCKCPLGFDESLCEIAVP
NACDHVTCLNGGTCQLKTLEEYTCACANGYTGERCETKNLCASSPCRNGATCTALAGSSSFTCSCPPGFT...
>DY343456.1 Macropus rufus BRCA1 (BRCA1) gene, partial cds
CAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAGTCTGGATGAAAGTAAGGAAATATGTAGTGCTGGA
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGTAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCA
>FY343456.1 Macropus rufus BRCA1 (BRCA1) gene, partial cds
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGCAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAG
TCTGGATGAAAGTAAGGAAATATGTAGTGCTGGAAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCAT
To report summary statistics:
fatools report -s exampleFasta.fa
Total 2,868 bps from qualified 5 sequences (5 total); length average: 573 (210-1262) bp; N50: 698 bp
To keep only the fasta sequences that are no longer than a given maximum length (here 250 bp):
python fatools filter -L250 exampleFasta.fa
>DY343456.1
CAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAGTCTGGATGAAAGTAAGGAAATATGTAGTGCTGGA
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGTAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCA
>FY343456.1
TGTGGCACAGATGCTCGTGCCACCTCATTACTTCCTGAAACCACCAGCTTATCGCCCAACACAGACCGAA
TGAATGTAGAAAAGGCTGAACTCTGTAATAAAAGCAAACAGCCTGGCTTAGCAAAAAACCAACAGAGCAG
TCTGGATGAAAGTAAGGAAATATGTAGTGCTGGAAAGACCCTGGGTGCCCATGAGCTGAATGCCCATCAT
You can combine multiple utilities using pipe "|". Let's say you want to see the short defline and the length of the first 3 fasta sequences in a fasta file.
fatools extract -F3 sequenceTesting2.txt | fatools report -l -
AY211956.1Macropus(BRCA1)gene,partialcds 698
NP_001245510.1 1262
DY343456.1 210
Or extract the sequences for a list of IDs from a large sequence set and then search those sequences for a specific sequence motif:
fatools search -F IDlist.txt exampleFasta.fa | fatools search -s AAATAAA -