Collection of Python libraries to parse bioinformatics files, or perform common tasks related to annotation and comparative genomics.
The following modules are available as generic bioinformatics handling methods.
- apps
  - Wrapper for BLAST+, LASTZ, LAST, BWA, BOWTIE2, CLC, CDHIT, CAP3, etc.
- formats
  - Currently supports .bed format, .blast output, .fasta format, .fastq format, .gff format, obo format (ontology), .psl format (UCSC blat, GMAP, etc.), .sam format (read mapping), etc.
The following third-party Python packages are used by some routines in the library. These dependencies are not mandatory, since only a few modules use them.
Various scripts also import other Python modules here and there. The simplest approach is to install each one with pip install when you see an ImportError.
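To find missing optional modules before a script fails, the check can be done up front. The following is a minimal sketch using only the standard library. The function name `missing_modules` and the module names in the example list are hypothetical placeholders, not a list taken from this package. Note that the name passed to pip install sometimes differs from the importable module name.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: replace with the modules your scripts actually import.
wanted = ["numpy", "matplotlib"]
for name in missing_modules(wanted):
    print(f"missing: {name} (try: pip install {name})")
```

Only top-level names are checked here; dotted names such as `pkg.sub` would need the parent package to be importable first.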
The easiest way is to install it via PyPI:
To install the development version:
pip install git+git://github.com/orionzhou/maize.git
Alternatively, if you want to install manually:
cd ~/code # or any directory of your choice
git clone git://github.com/orionzhou/maize.git
export PYTHONPATH=~/code:$PYTHONPATH
Replace ~/code above with any directory you like, but it must contain maize. To avoid setting PYTHONPATH every time, add the export command to your .bashrc or .bash_profile.
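For a one-off session where editing shell startup files is not convenient, the same effect can be approximated from inside Python. This is a sketch under assumptions: `add_parent_to_path` is a hypothetical helper, not part of the package, and it only checks that the chosen directory contains a `maize` subdirectory before putting it on `sys.path`.

```python
import os
import sys


def add_parent_to_path(parent, package="maize"):
    """Put `parent` on sys.path if it contains a `package` directory.

    Returns True when the directory qualifies, False otherwise.
    """
    parent = os.path.abspath(os.path.expanduser(parent))
    if not os.path.isdir(os.path.join(parent, package)):
        return False
    if parent not in sys.path:
        # Insert at the front so this checkout wins over other copies.
        sys.path.insert(0, parent)
    return True
```

Calling `add_parent_to_path("~/code")` before importing the package mirrors the export line above, but only for the current interpreter.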
In addition, a few modules might ask for the locations of external programs if the executables cannot be found in your PATH. The external programs that are often used are:
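Whether a given program is already reachable can be checked before running a module. Below is a minimal sketch using the standard library's `shutil.which`. The helper name `locate_programs` is hypothetical, and the executable names are illustrative guesses based on the tools wrapped by the apps module, not a list from this package.

```python
import shutil


def locate_programs(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


# Assumed executable names, for illustration only.
for name, path in locate_programs(["blastn", "lastz", "bwa", "bowtie2"]).items():
    print(f"{name}: {path or 'not found on PATH'}")
```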
Most of the scripts in this package contain multiple actions. Take the fasta and gff scripts as examples:
usage: fasta [-h]
             {size,desc,clean,extract,split,tile,merge,gaps,rename,rmdot,cleanid,2aln,translate}
             ...

fasta utilities

optional arguments:
  -h, --help            show this help message and exit

available commands:
  {size,desc,clean,extract,split,tile,merge,gaps,rename,rmdot,cleanid,2aln,translate}
    size                Report length for each sequence
    desc                Report description for each sequence
    clean               Remove irregular chararacters
    extract             retrieve fasta sequences
    split               run pyfasta to split a set of fasta records evenly
    tile                create sliding windows that tile the entire sequence
    merge               merge multiple fasta files and update IDs
    gaps                report gap ('N's) locations in fasta sequences
    rename              rename/normalize sequence IDs, merge short
                        scaffolds/contigs
    rmdot               replace periods (.) in an alignment fasta by dashes
                        (-)
    cleanid             clean sequence IDs in a fasta file
    2aln                convert fasta alignment file to clustal format
    translate           translate nucleotide seqs to amino acid seqs

usage: gff [-h]
           {summary,filter,fix,fixboundaries,fixpartials,index,extract,cluster,chain,format,note,splicecov,picklong,2gtf,2tsv,2bed12,2fas,fromgtf,merge}
           ...

gff utilities

optional arguments:
  -h, --help            show this help message and exit

available commands:
  {summary,filter,fix,fixboundaries,fixpartials,index,extract,cluster,chain,format,note,splicecov,picklong,2gtf,2tsv,2bed12,2fas,fromgtf,merge}
    summary             print summary stats for features of different types
    filter              filter the gff file based on Identity and Coverage
    fix                 fix gff fields using various options
    fixboundaries       fix boundaries of parent features by range chaining
                        child features
    fixpartials         fix 5/3 prime partial transcripts, locate nearest in-
                        frame start/stop
    index               index gff db
    extract             extract contig or features from gff file
    cluster             cluster transcripts based on shared splicing structure
    chain               fill in parent features by chaining children
    format              format gff file, change seqid, etc.
    note                extract certain attribute field for each feature
    splicecov           extract certain attribute field for each feature
    picklong            pick longest transcript
    2gtf                convert gff3 to gtf format
    2tsv                convert gff3 to tsv format
    2bed12              convert gff3 to bed12 format
    2fas                extract feature (e.g. CDS) seqs and concatenate
    fromgtf             convert gtf to gff3 format
    merge               merge several gff files into one

To run any action, invoke the module with the action name:
python -m maize.formats.fasta size
python -m maize.formats.gff fix
This will tell you the options and arguments it expects.
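The "one script, many actions" layout shown in the help output can be reproduced with argparse subcommands. The sketch below is an illustration only and is not the package's actual implementation: `fasta_sizes`, `cmd_size`, `build_parser`, and `main` are hypothetical names, and only a `size`-like action is modelled, taking the ID as the first whitespace-delimited token of each header line.

```python
import argparse


def fasta_sizes(lines):
    """Return (sequence ID, length) pairs from an iterable of FASTA lines."""
    sizes = []
    seqid, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seqid is not None:
                sizes.append((seqid, length))
            # The ID is the first token after '>'; empty headers get "".
            seqid, length = (line[1:].split() or [""])[0], 0
        elif seqid is not None:
            length += len(line)
    if seqid is not None:
        sizes.append((seqid, length))
    return sizes


def cmd_size(args):
    """Action handler: print one 'ID<TAB>length' line per sequence."""
    with open(args.fasta) as handle:
        for seqid, length in fasta_sizes(handle):
            print(f"{seqid}\t{length}")


def build_parser():
    """Build a parser with one subcommand per action."""
    parser = argparse.ArgumentParser(prog="fasta", description="fasta utilities")
    sub = parser.add_subparsers(
        title="available commands", dest="command", required=True
    )
    sp = sub.add_parser("size", help="Report length for each sequence")
    sp.add_argument("fasta", help="input FASTA file")
    sp.set_defaults(func=cmd_size)  # dispatch target for this action
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and run the chosen action."""
    args = build_parser().parse_args(argv)
    args.func(args)

# Example call (the file name is a placeholder): main(["size", "genome.fa"])
```

Each action gets its own subparser, so asking for help on a single action lists only the options and arguments that action expects.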