This repository has been archived by the owner on Dec 18, 2022. It is now read-only.

orionzhou/maize

maize utility libraries

Collection of Python libraries to parse bioinformatics files, or perform common tasks related to annotation and comparative genomics.

Contents

The following modules are available as generic bioinformatics handling methods.

  • apps

    • Wrappers for BLAST+, LASTZ, LAST, BWA, BOWTIE2, CLC, CDHIT, CAP3, etc.
  • formats

    • Currently supports .bed format, .blast output, .fasta format, .fastq format, .gff format, .obo format (ontology), .psl format (UCSC blat, GMAP, etc.), .sam format (read mapping), etc.
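
Because the apps wrappers call external executables, it can help to check which of them are visible on your PATH before running a wrapper. The following is a minimal sketch using only the standard library; the helper name `find_executables` is hypothetical, and the executable names are assumptions about a typical install, not names taken from this library.

```python
import shutil


def find_executables(names):
    """Map each program name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Assumed executable names for illustration; adjust to your system.
    programs = ["blastn", "lastz", "bwa", "bowtie2"]
    for name, path in find_executables(programs).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Any program reported as not found would need to be installed, or its location supplied when the module asks for it.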

Dependencies

The following third-party Python packages are used by some routines in the library. These dependencies are optional, since only a few modules need them.

Various scripts also import other Python modules. The simplest approach is to install each one with pip install when you encounter an ImportError.

Installation

The easiest way is to install it via PyPI:

To install the development version:

pip install git+https://github.com/orionzhou/maize.git

Alternatively, if you want to install manually:

cd ~/code  # or any directory of your choice
git clone https://github.com/orionzhou/maize.git
export PYTHONPATH=~/code:$PYTHONPATH

Replace ~/code above with any directory you like, as long as it is the directory that contains the maize folder. To avoid setting PYTHONPATH every time, add the export command to your .bashrc or .bash_profile.
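
To confirm that the PYTHONPATH setting took effect, you can ask Python where it would load the package from. This is a minimal sketch using the standard library; the helper name `locate_package` is hypothetical and is not part of maize.

```python
import importlib.util


def locate_package(name="maize"):
    """Return the locations Python would import `name` from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Packages expose their directories; plain modules only have an origin.
    return list(spec.submodule_search_locations or [spec.origin])


if __name__ == "__main__":
    locations = locate_package("maize")
    if locations is None:
        print("maize is not importable; check PYTHONPATH")
    else:
        print("maize found at:", ", ".join(locations))
```

If the package is reported as not importable, the directory on PYTHONPATH is probably the maize folder itself rather than its parent.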

In addition, a few modules might ask for the locations of external programs if the executables cannot be found in your PATH. The external programs that are often used are:

Most of the scripts in this package contain multiple actions. Taking fasta and gff as examples:

usage: fasta [-h]
             {size,desc,clean,extract,split,tile,merge,gaps,rename,rmdot,cleanid,2aln,translate}
             ...

fasta utilities

optional arguments:
  -h, --help            show this help message and exit

available commands:
  {size,desc,clean,extract,split,tile,merge,gaps,rename,rmdot,cleanid,2aln,translate}
    size                Report length for each sequence
    desc                Report description for each sequence
    clean               Remove irregular chararacters
    extract             retrieve fasta sequences
    split               run pyfasta to split a set of fasta records evenly
    tile                create sliding windows that tile the entire sequence
    merge               merge multiple fasta files and update IDs
    gaps                report gap ('N's) locations in fasta sequences
    rename              rename/normalize sequence IDs, merge short
                        scaffolds/contigs
    rmdot               replace periods (.) in an alignment fasta by dashes
                        (-)
    cleanid             clean sequence IDs in a fasta file
    2aln                convert fasta alignment file to clustal format
    translate           translate nucleotide seqs to amino acid seqs

usage: gff [-h]
           {summary,filter,fix,fixboundaries,fixpartials,index,extract,cluster,chain,format,note,splicecov,picklong,2gtf,2tsv,2bed12,2fas,fromgtf,merge}
           ...

gff utilities

optional arguments:
  -h, --help            show this help message and exit

available commands:
  {summary,filter,fix,fixboundaries,fixpartials,index,extract,cluster,chain,format,note,splicecov,picklong,2gtf,2tsv,2bed12,2fas,fromgtf,merge}
    summary             print summary stats for features of different types
    filter              filter the gff file based on Identity and Coverage
    fix                 fix gff fields using various options
    fixboundaries       fix boundaries of parent features by range chaining
                        child features
    fixpartials         fix 5/3 prime partial transcripts, locate nearest in-
                        frame start/stop
    index               index gff db
    extract             extract contig or features from gff file
    cluster             cluster transcripts based on shared splicing structure
    chain               fill in parent features by chaining children
    format              format gff file, change seqid, etc.
    note                extract certain attribute field for each feature
    splicecov           extract certain attribute field for each feature
    picklong            pick longest transcript
    2gtf                convert gff3 to gtf format
    2tsv                convert gff3 to tsv format
    2bed12              convert gff3 to bed12 format
    2fas                extract feature (e.g. CDS) seqs and concatenate
    fromgtf             convert gtf to gff3 format
    merge               merge several gff files into one

You can then run any action as follows:

python -m maize.formats.fasta size

python -m maize.formats.gff fix

This will tell you the options and arguments it expects.
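
The "one script, many actions" layout shown in the usage output above can be sketched with argparse subparsers. This is an illustration of the general pattern only, not the maize implementation: the functions `sequence_lengths`, `size`, `build_parser`, and `main` are hypothetical, and only a single `size` action is wired up.

```python
import argparse


def sequence_lengths(path):
    """Return a list of (sequence ID, length) pairs from a FASTA file."""
    results = []
    name, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    results.append((name, length))
                fields = line[1:].split()
                # The ID is the first whitespace-delimited token of the header.
                name, length = (fields[0] if fields else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        results.append((name, length))
    return results


def size(args):
    """Hypothetical 'size' action: print one ID and length per line."""
    for name, length in sequence_lengths(args.fasta):
        print(f"{name}\t{length}")


def build_parser():
    parser = argparse.ArgumentParser(prog="fasta", description="fasta utilities")
    subparsers = parser.add_subparsers(title="available commands", dest="command")
    subparsers.required = True  # running without an action is an error
    sp = subparsers.add_parser("size", help="Report length for each sequence")
    sp.add_argument("fasta", help="input FASTA file")
    sp.set_defaults(func=size)  # each action stores its own handler
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    args.func(args)

# In a script, call main() under an `if __name__ == "__main__":` guard.
```

With this layout, adding an action means registering one more subparser and handler, and running the script with an action name plus `-h` prints that action's own options.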