Skip to content

A small collection of one-liners in a one bash script for fasta files processing.

License

Notifications You must be signed in to change notification settings

pavlohrab/process_fasta

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

Process fasta script

Visitor count GitHub GitHub Repo stars

This script is just a collection of one-liners, with which I was processing fasta sequences frequently. Therefore I made one single script out of them. All the one-liners are freely available on different forums.

Dependencies

The script is inthended to be run under Linux, using only the basic bash functionality. The only one, that need to be installed is a perl programming language (must be available for pretty much any Linux distro) for adding numbers to duplicated headers. Also awk and sed must be installed (they should be installed by default, so no need to worry)

Usage

The only mandatory flag is -i <input-file>. All other flags are optional. The general usage example is as such:

sh process_fasta.sh -i <input-file> ....

Features

The full list of options for fasta processing is available under sh process_fasta.sh -h:

-i|--input        - input fasta file.

-o|--output       - output, processed fasta file. 

-l|--linearize    - linearize fasta file to a format one sequence per line. 
                   Mandatory for other manipulations.

-a|--add_numbers  - add numbers to duplicated fasta headers. Numbers are 
                   added with an underscore. Must have perl installed.

-d|--delete       - provide a txt file with fasta headers of sequences 
                   that must be deleted. One header per line. Partial headers
                   are also recognized (they do not have to be a full match)
                   Can be deleted those headers, which contain certain word or 
                   phase (txt file then contain this word or phrase)
   
-s|--spaces       - provide a character to replaces spaces with in a fasta
                   headers. Please use single quotes. Example: '_'

-c|--concatenate  - concatenate all the sequence in a multifasta file. 
                   Filename will be use as a fasta header.

--extract_header  - extract the 'subheader' from fasta headers. Use the 
                   chosen delimiter and column numbers (use the delimiter
                   to divide the header to columns). Must to provide
                   --extract_delim and --extract_column flags
   
--extract_delim   - provide a delimiter (any character) to divide the fasta
                   headers. Part of --extract_header flag
   
--extract_column  - provide a column number to extract the part of fasta
                   header. Part of --extract_header flag
                       
--replace         - specify the characters to replace in the fasta sequence.
                   The characters to replace should be defined with 
                   double quotes with spaces between individual replacements. 
                   The length of replacements in this flag should be matched 
                   with the length in --replace_to. Else the character will 
                   be deleted (replaced to empty)

                   Example: ... --replace "\[" "AdpA" --replace_to "_" "_"
   
--replace_to      - specify the characters replace to in the fasta headers.
                   The characters to replace should be defined with 
                   double quotes with spaces between individual replacements.
                   The length of replacement in this flag should be matched 
                   with the length in --replace. Else th replace_to would
                   not be used (not enough characters to replace)

                   Example: ... --replace "Streptomyces" --replace_to "" "_"
                   (here Streptomyces will be deleted, replaced with "")

--add_filename    - add filename to the fasta headers. Please specify "start"
                   or "end". If no options are provided the filename will 
                   be added to the end of a header

--add_delim       - specify the delimeter when adding filename. The default
                   is empty string (concatenate filename and fasta header)

--split_fasta     - split the multifasta file to the indifidual ones. Fasta
                   header is used as filename. Splitted is done in the end,
                   so all modifications are preserved.

-h|--help         - print this message and exit

-v|--version      - print version and exit. 

About

A small collection of one-liners in a one bash script for fasta files processing.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages