GFF and its descendants are the most flexible and feature-rich annotation formats. That flexibility brings considerable complexity, which can make them tricky to parse. This document describes how to work with one of GFF's descendants, GFF3.
In the original GFF specification, much of the file structure was deliberately left unspecified to maximize flexibility. For example, it was up to users to decide:
- whether coordinates were 0-indexed or 1-indexed, and fully-closed or half-open
- what feature types can be represented
- what parent-child relationships exist (if any) between feature types
- how the ninth column, which can contain arbitrary text data, including encoded attributes and values, should be formatted
Because of this, several flavors of GFF exist today, each adding its own specifications. Two of the most common are GTF2 and GFF3.
GFF3 adds additional constraints to the original GFF format:
- coordinates are 1-indexed and fully-closed
- a ninth column contains key-value pairs of arbitrary attributes
- features can have parent-child relationships, specified by the Parent attribute in the ninth column. All features that have children additionally must have an ID attribute defined in the ninth column. Values of a feature's Parent attribute must match the ID of the parent feature.
For more detail, see the GFF3 specification.
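The constraints above can be illustrated with a minimal sketch that splits one GFF3 line into its nine columns, parses the ninth column into a dictionary, and converts the 1-indexed, fully-closed coordinates into 0-indexed, half-open ones. The helper name `parse_gff3_line` and the sample line are hypothetical and are not part of plastid:

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict (hypothetical helper, not plastid API)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, found %d" % len(fields))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = fields

    # Column 9: semicolon-separated key=value pairs; values may be URL-escaped
    attr = {}
    for pair in attr_text.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attr[key.strip()] = unquote(value.strip())

    return {
        "seqid": seqid,
        "type": ftype,
        # GFF3 is 1-indexed and fully-closed; convert to 0-indexed, half-open
        "start": int(start) - 1,
        "end": int(end),
        "strand": strand,
        "attr": attr,
    }


# Made-up example line
line = "chrI\tdemo\texon\t101\t200\t.\t+\t.\tID=exon1;Parent=mRNA1"
feature = parse_gff3_line(line)
print(feature["start"], feature["end"], feature["attr"]["Parent"])  # 100 200 mRNA1
```

In the half-open form, `end - start` gives the feature length directly (100 bases here). The sketch treats each attribute value as a single string; a fuller parser would also need to split comma-separated values, which GFF3 permits for attributes such as Parent.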
Features in GFF3 files can be hierarchical, in that they can have parents and children defined in their ninth column. Usually, features of a given type can only have children of specified types. For example, an mRNA feature can have children of type exon and CDS, but not gene. A gene feature, however, could have an mRNA child.
In addition, skipping levels of hierarchy is permitted. So, a gene feature could have direct children that are exon or CDS types, and an mRNA feature might not be explicitly represented.
A specification of accepted parent-child relationships is called a feature ontology. The GFF3 standard deliberately does not specify which ontology to use, which adds flexibility to the format but complicates handling.
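As a rough illustration, a feature ontology can be thought of as a mapping from each parent type to the child types it accepts. The tiny mapping below is hypothetical and covers only the types mentioned above; real ontologies are far larger. Because skipping levels is permitted, the check accepts any descendant type rather than only direct children:

```python
# Hypothetical, heavily simplified ontology: parent type -> allowed child types
ALLOWED_CHILDREN = {
    "gene": {"mRNA"},
    "mRNA": {"exon", "CDS"},
}


def descendants(parent_type, ontology):
    """Return every type reachable from parent_type, so skipped levels are accepted."""
    found = set()
    stack = [parent_type]
    while stack:
        for child in ontology.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def is_valid_child(parent_type, child_type, ontology=ALLOWED_CHILDREN):
    """True if child_type may appear below parent_type in this ontology."""
    return child_type in descendants(parent_type, ontology)


print(is_valid_child("gene", "mRNA"))  # True
print(is_valid_child("gene", "exon"))  # True: the mRNA level is skipped
print(is_valid_child("mRNA", "gene"))  # False
```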
Only continuous features (for example exons, but not multi-exon transcripts) can be represented directly in GFF formats.
Discontinuous features, like transcripts or gapped alignments, must be represented as a set of related continuous components. In GFF3, relationships between a feature and its components can be represented in two ways:
- by a shared value in the Parent attribute
- by sharing an ID attribute
For example, a transcript can be represented as a group of exons and coding regions, whose Parent attributes match the ID of the complex feature:
TODO example
Or, a transcript can be represented as a group of exons that share an ID:
TODO example
Some research groups prefer one representation over another, while others use both, even in the same file.
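To make the two representations concrete, here is a minimal sketch that groups component features first by their Parent attribute and, failing that, by a shared ID. The records, identifiers, and coordinates are all made up for illustration, and `group_components` is a hypothetical helper rather than anything plastid provides:

```python
from collections import defaultdict

# Hypothetical records as (type, start, end, attributes)
# Transcript A: components point at a parent feature through Parent
by_parent = [
    ("mRNA", 101, 900, {"ID": "mRNA_A"}),
    ("exon", 101, 300, {"Parent": "mRNA_A"}),
    ("exon", 601, 900, {"Parent": "mRNA_A"}),
]
# Transcript B: no parent line; the pieces simply share one ID
by_shared_id = [
    ("exon", 2001, 2200, {"ID": "tx_B"}),
    ("exon", 2501, 2800, {"ID": "tx_B"}),
]


def group_components(records):
    """Group component features by Parent when present, otherwise by shared ID."""
    # IDs that are named as a Parent belong to parent features, not components
    has_children = {attr["Parent"] for _, _, _, attr in records if "Parent" in attr}
    groups = defaultdict(list)
    for ftype, start, end, attr in records:
        if "Parent" in attr:
            groups[attr["Parent"]].append((ftype, start, end))
        elif "ID" in attr and attr["ID"] not in has_children:
            groups[attr["ID"]].append((ftype, start, end))
    return dict(groups)


groups = group_components(by_parent + by_shared_id)
for name, parts in sorted(groups.items()):
    print(name, parts)
```

Both transcripts end up as a named list of exons, even though the file encodes them differently. The sketch assumes each feature has at most one parent, which real GFF3 files do not guarantee.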
Reading GFF3 files in plastid
plastid offers two ways to read GFF3 files:
- reading them line-by-line, yielding each continuous feature as a separate item (via GFF3_Reader)
- reading them processively, reconstructing transcripts from their constituent exons and coding regions (via GFF3_TranscriptAssembler)
GFF3_Reader parses each line of a GFF3 file and returns a single-segment SegmentChain corresponding to the feature described by the line:
>>> from plastid.readers.gff import GFF3_Reader
>>> reader = GFF3_Reader(open("some_file.gff"))
>>> for feature in reader:
...     pass  # do something with the current feature
Attributes described in the ninth column of the GFF3 file are placed into the attr dictionary of each returned feature:
>>> feature.attr
{ "some_key" : "some_value", "some_other_key" : "some_other_value", ... }
Reconstructing transcripts from GFF3 files is tricky because:
- relationships between features can be represented either by common Parent attributes or by shared IDs
- GFF3 allows any ontology to be used, so the relationships between parent and child feature types are not always clear
GFF3_TranscriptAssembler takes care of these two problems by:
- first attempting to assemble transcripts by matching the Parent attributes of their component exons and coding regions, and then attempting a match by shared ID attributes
- assuming that the GFF3 file follows version 2.5.3 of the ontology defined by the Sequence Ontology Project. This ontology is used by many model organism databases, including SGD, FlyBase, and WormBase.
The assembler behaves as an iterator, which assembles groups of transcripts lazily:
>>> from plastid.readers.gff import GFF3_TranscriptAssembler
>>> reader = GFF3_TranscriptAssembler("some_file.gff")
>>> for transcript in reader:  # transcripts are assembled from features when necessary
...     pass  # do something with the current transcript
Any malformed/unparsable GFF3 lines are kept in the rejected property:
>>> reader.rejected
[] # list of strings, corresponding to bad GFF3 lines
A GFF3 assembler must keep many subfeatures in memory until it is sure that it has parsed all of the components necessary to reconstruct a given transcript. This guarantee can be made by any of the following signals:
at which point, the assembler can purge feature components from memory and return a batch of transcripts.
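The buffering behavior can be sketched as a generator that holds feature lines until a flush signal arrives. The two signals used here are assumptions chosen for illustration: the `###` directive, which the GFF3 specification defines as marking that all forward references seen so far have been resolved, and the end of input. The function name is hypothetical, and the sketch only batches raw lines, whereas a real assembler would go on to build transcripts from each batch:

```python
def assemble_in_batches(lines):
    """Yield lists of buffered feature lines each time it is safe to flush.

    Assumed flush signals for this sketch: the '###' directive and end of input.
    """
    buffered = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "###":
            # Everything buffered so far is complete; release it and free memory
            if buffered:
                yield buffered
                buffered = []
        elif line and not line.startswith("#"):
            # Ordinary feature line: keep it until a flush signal arrives
            buffered.append(line)
    if buffered:
        # End of input: whatever remains can be released
        yield buffered


# Made-up input lines
demo = [
    "##gff-version 3",
    "chrI\tdemo\tmRNA\t101\t900\t.\t+\t.\tID=mRNA_A",
    "chrI\tdemo\texon\t101\t300\t.\t+\t.\tParent=mRNA_A",
    "###",
    "chrI\tdemo\texon\t2001\t2200\t.\t+\t.\tID=tx_B",
]
batches = list(assemble_in_batches(demo))
print([len(batch) for batch in batches])  # [2, 1]
```

In a file with no intermediate flush signals, the whole file lands in a single batch, so every feature line stays in memory until the end of input.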
Because potentially many features must be held in memory before any transcripts can be returned, assembling transcripts from GFF3 files can require substantially more memory than reading the same data represented as pre-assembled transcripts line-by-line from a BED or BigBed file.
For this reason, plastid includes a script called reformat_transcripts (plastid.bin.reformat_transcripts), which can interconvert a number of annotation formats, including GFF3.
- The GFF3 specification for a full description of the file format
- The Sequence Ontology Project feature schema
- The reformat_transcripts script (plastid.bin.reformat_transcripts)