NCBI-Hackathons/Master_gff3_parser

Squidstream (an amazing tool that does wonderful stuff)—Documentation

Generic Feature Format version 3 (GFF3) is a file format commonly used in bioinformatics applications. Institutions follow different naming conventions for the sequence identifier (seqid) column of GFF3, so files from different sources can use different seqids for the same genomic sequence. Other file formats, such as GTF, BED, SAM, and BAM, also carry sequence identifiers. Squidstream is an easy-to-use command-line tool that converts the reference names of chromosomes, scaffolds, and contigs in these file formats to the corresponding seqid from NCBI’s RefSeq database. Because GFF3 files are a common input to many bioinformatics tools and pipelines, Squidstream keeps these inputs consistent by converting every sequence ID in a file to the desired ID format with a single command.

Figure 1. Examples of NCBI, UCSC, and RefSeq GFF3 files.

Sequence Identifier Conversion Examples:

  • Convert the RefSeq IDs in an annotation to UCSC IDs for use in UCSC Genome Browser tracks
  • Convert to NCBI IDs to search the KEGG GENES database
  • Convert RefSeq IDs to GenBank IDs
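
The conversions listed above come down to replacing the seqid in the first column of each feature line. The following is a minimal sketch of that idea, not Squidstream's actual implementation: the `UCSC_TO_REFSEQ` table and the `convert_gff3_line` function are hypothetical names introduced here for illustration, and the two mappings shown are the GRCh38 RefSeq accessions for chromosomes 1 and X.

```python
# Hypothetical lookup table for illustration only; where Squidstream gets
# its mappings and how it stores them is not described in this README.
UCSC_TO_REFSEQ = {
    "chr1": "NC_000001.11",  # GRCh38 chromosome 1
    "chrX": "NC_000023.11",  # GRCh38 chromosome X
}


def convert_gff3_line(line, mapping):
    """Return a GFF3 line with its seqid (column 1) translated via mapping."""
    # Directive, comment, and blank lines are passed through unchanged.
    if line.startswith("#") or not line.strip():
        return line
    # Split off the first tab-separated column and keep the rest verbatim.
    seqid, sep, rest = line.partition("\t")
    # Unknown identifiers are left untouched rather than guessed.
    return mapping.get(seqid, seqid) + sep + rest


example = "chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tID=gene1\n"
converted = convert_gff3_line(example, UCSC_TO_REFSEQ)
print(converted, end="")
```

Note that `##sequence-region` directives also name a seqid; this sketch passes them through unchanged, so a complete converter would need to handle them separately.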

Squidstream was built in Python and runs from the command line. Users provide the input file, the specific reference genome, and the desired name of the output file.

A summary of the seqconv commands is provided below.

Command    Description
convert    Converts sequence IDs

Links to file format descriptions: GFF3, SAM, BED, GFF/GTF
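
The text formats linked above store the sequence name in different columns, which is why a converter has to know the format it is reading. Below is a small sketch of that lookup, assuming tab-separated input and data lines only (no SAM `@` header lines, no BED `track` or `browser` lines). `SEQID_COLUMN` and `extract_seqid` are hypothetical names for illustration and are not part of Squidstream. BAM is binary and is not covered here.

```python
# 0-based index of the tab-separated column that holds the sequence name.
SEQID_COLUMN = {
    "gff3": 0,  # seqid
    "gtf": 0,   # seqname
    "bed": 0,   # chrom
    "sam": 2,   # RNAME (third column of an alignment line)
}


def extract_seqid(line, fmt):
    """Return the sequence identifier from one data line of the given format."""
    fields = line.rstrip("\n").split("\t")
    return fields[SEQID_COLUMN[fmt]]


bed_line = "chr1\t100\t200\n"
sam_line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n"
print(extract_seqid(bed_line, "bed"))
print(extract_seqid(sam_line, "sam"))
```

SAM files also name sequences in their `@SQ` header lines (the `SN` tag) and in the RNEXT column, so a full conversion of a SAM file would need to update those as well.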

Installation

Linux:

python setup.py install

OSX:

python setup.py install

Squidstream Workflow:

Figure 2. Squidstream workflow.
