Skip to content
/ cdseq Public

Extract coding sequence(s) from a genome and annotation file

License

Notifications You must be signed in to change notification settings

glarue/cdseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 

Repository files navigation

Dependencies

This script needs the biogl module to function properly. If you use (or can get) pip, you can simply do

python3 -m pip install biogl

to add the package to a location reachable by your Python installation.

Otherwise, you can clone the biogl repo and source it locally (to run from anywhere, you'll need to add it to your PYTHONPATH environment variable, a process that varies by OS):

git clone https://github.com/glarue/biogl.git

Usage info

usage: cdseq [-h] [-t] [-e] [-i] [-c] [-m] [--tag tag] [-n] [-l]
             [--introns] [--truncate_introns int_length] [--print_arg_info]
             genome annotation

Extract transcript/coding sequences from a genome using an annotation file

positional arguments:
  genome                genome file in FASTA format
  annotation            annotation file in GFF[3]/GTF format

optional arguments:
  -h, --help            show this help message and exit
  -t, --translate       translate the output sequence (default: False)
  -e, --exon            use exons instead of CDS entries to define coding
                        sequence (default: False)
  -i, --isoforms        allow multiple isoforms per gene, instead of only
                        longest (default: False)
  -c, --coord_based_isoforms
                        detect isoforms by overlapping coordindates in
                        addition to shared parent (useful for annotations
                        without gene entries) (default: False)
  -m, --minimal_headers
                        omit detailed coordinate information from output
                        headers (default: False)
  --tag tag             prepend headers with specified <tag> (default:
                        None)
  -n, --non_coding      include non-coding (intronic, UTR, etc.) sequence;
                        uses only the coordinates of the transcript itself
                        (incompatible with --introns) (default: False)
  -l, --leave_lowercase
                        leave lowercase genomic sequence intact in output
                        (ignored if --introns) (default: False)
  --introns             include intron sequences in lowercase (incompatible
                        with -n) (default: False)
  --truncate_introns int_length
                        truncate intron sequences to length {int_length}
                        (or {int_length} -1 if odd) (default: None)
  --print_arg_info      print commented argument information before
                        sequence records (default: False)

About

Extract coding sequence(s) from a genome and annotation file

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages