SEP | 011 |
---|---|
Title | Specify DNA / RNA topology with type fields |
Authors | Raik Gruenberg (raik.gruenberg at gmail com) |
Editor | Raik Gruenberg |
Type | Data Model |
SBOL Version | 2.1 |
Replace | 008 (SEP #8) |
Status | Accepted |
Created | 22-Oct-2016 |
Last modified | |
Issue | #26 |
SBOL currently is missing any clear means to distinguish between circular (e.g. plasmid) and linear DNA constructs. Also double- and single-stranded DNA or RNA is not distinguished. This proposal describes a best-practice and new validation rules for encoding topology and strand information in the ComponentDefinition.type field.
Note from the editors: This SEP changes the original proposal (SEP #8) such that no changes to the data model are required. It was therefore implemented in SBOL 2.1 without a vote. SEP #8 may nevertheless be discussed and revived for future SBOL releases.
Engineered DNA constructs are currently mostly deposited, used and exchanged as circular, double-stranded, plasmid DNA. However, there are also many applications for linear constructs which are easily and cheaply accessible through PCR or gene synthesis. Moreover, even circular plasmid DNA may be cut and stored or exchanged as linear DNA to aid the construction of new vectors. Likewise, natural or engineered genomes may be circular (bacterial chromosomes) or linear (eukaryotic chromosomes). Both linear and circular DNA or RNA molecules can, moreover, be either double-stranded (ds) or single stranded. RNA, typically, is used as single-stranded molecule whereas DNA is, mostly, utilized double-stranded (dsDNA). However, oligonucleotides (primers) are an example of single-stranded DNA (ssDNA) which, however, can also be complemented into double-stranded fragments.
Knowing whether a given construct is supposed to have circular or linear topology is of relevance both for experimental design and for basic sequence analysis. Nevertheless, unlike the much older genbank format, SBOL so far lacks the possibility to distinguish between linear and circular DNA. Likewise, the distinction between single-stranded and double-stranded has important consequences but is not currently conveyed by SBOL.
(2.1) Topology
Any ComponentDefinition classified as DNA (via its type
field) is
RECOMMENDED to encode circular/linear topology information in an additional
type
field. This (topology) type
field MUST point to either of two possible
Sequence Ontology terms:
- SO:0000987 -- a.k.a. 'linear' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000987]
- SO:0000988 -- a.k.a. 'circular' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000988]
There is no default topology for DNA. Lack of topology information is only
acceptable for DNA records without or with incomplete sequence information. Lack
of topology information is not acceptable for DNA ComponentDefinition
records with fully specified sequence -- unless the topology is genuinely
unknown, in which case the sequence is, in fact, not fully specified.
The topology type
field is OPTIONAL for RNA ComponentDefinition
records. The
default assumption for RNA records is linear topology.
Any other ComponentDefinition
record (protein, small molecule, etc) MUST NOT
have a type
field with DNA topology information.
(2.2) Topology Validation Rules
If ComponentDefinition
has type 'DNA' and has a sequence
record but does not have a type
field pointing to either 'linear' or 'circular' (SO:0000987 or 988), this should trigger a warning.
If ComponentDefinition
has any type other than 'DNA' or 'RNA' and has a type
field pointing to SO:0000987 or 988, this MUST be considered an error.
(2.3) Double / Single stranded
Any ComponentDefinition
classified as DNA or RNA (via its type
field)
MAY have strand information encoded in an additional type
field pointing to
either of two possible Sequence Ontology terms:
- SO:0000984 -- a.k.a. 'single' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000984]
- SO:0000985 -- a.k.a. 'double' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000985]
In absence of this field, the default strand information assumed for DNA is 'double-stranded' and for RNA is 'single-stranded'.
Any other ComponentDefinition
record (protein, small molecule, etc) MUST NOT
have any type
field pointing to these SO terms.
(2.4) Strand validation rules
If ComponentDefinition
has any type other than 'DNA' or 'RNA' and has a type
field pointing to SO:0000984 or 985, this MUST be considered an error.
(2.5) Consequences for software tools
circular
topology (SO:0000988) instructs software to interpret the beginning
/ end position of a given sequence (be it DNA or RNA) as arbitrary so that sequence
features may be mapped or identified across this junction.
double
stranded (SO:0000985) instructs software to apply sequence searches to
both strands (i.e. sequence and reverse complement of sequence).
This is such a basic feature of molecular biology that it is difficult to find an example where it is, in fact, not needed. By way of example, consider popular sequence design software such as Benchling, CLC, VectorNTI or others. In these cases, declaring a sequence as circular changes the view and the searching behaviour of the sequence editor.
Example for double-stranded / single-stranded:
While the pattern 'ATG' is not found within the sequence TTCCTAC
, the reverse complement of this sequence does contain ATG
. Whether or not a given sequence is considered double-stranded therefore decides whether or not it matches a given pattern.
Example for circular / linear:
While no 'ATG' start codon is found within the linear sequence GAATCATCATAT
, this sequence does contain a start codon if considered circular.
No backwards compatibility issues are foreseen. However, the fields will be missing from older records. Software then should fall back to making an educated guess or assuming linear topology.
This was, in fact, the original suggestion (See SEP #8). It was retracted in order to make topology and strand information already available in SBOL 2.1. Depending on experience, this issue may be re-visited before finalizing SBOL 3.0.
Declaration of dedicated fields for this information would have advantages:
- more explicit (rather than several
type
fields pointing to cryptic SO terms, the attribute name would clarify what type of information can be expected) - lower risk that topology and strand information is omitted
- less or less complex validation rules
- software would not have to interpret a collection of type fields even though it is only interested in, for example, telling apart circular plasmid DNA from DNA fragments
- if non-standard SO terms are used, software could, at least, still know whether the unknown term refers to
topology
,strand
ortype
Disadvantage:
- the two fields would only apply to 'RNA' and 'DNA' ComponentDefinition, validation rules would still be needed.
- This may be more confusing for software developers than additional type fields.
I originally proposed to make fields required at the data model level. However, since topology and strand information makes little or even no sense for ComponentDefinitions describing proteins or small molecules, this was dropped (following debate on sbol-dev). As pointined out at COMBINE, CompentDefinition without sequence information cannot be required to have topology information either. Moreover, topology information may genuinely unknown. For all these reasons, the field cannot be required. It is recommended as best practice.
The original SEP proposed boolean values (circular=True/False). This was critized on the grounds that we, so far, are not using boolean values anywhere in SBOL and that we should tie things in with SO instead.
SO offers terms for all 8 combinations of circular/linear and double/single stranded and DNA/RNA. An example is
the term circular_double_stranded_DNA_chromosome
. Using these terms was considered less elegant. (1) software likely would have to decompose these terms again, (2) the 'chromosome' semantics does not fit many SBOL applications.
It would be possible to formulate topology through sequence constraints. However, sequence constraints have a different use case -- they describe incomplete designs that may, for example, be communicated to services for automated design or library construction. Standard sequence editors will likely not support any sequence constraints framework in the near future. By contrast, the current standard use case for SBOL is (or should be) the communication of a complete sequence, e.g. for a plasmid deposited at AddGene or a fragment described on parts.igem.org.
Also protein or even small / medium-sized molecules can feature circular topologies and topology was originally envisioned to apply to all types of ComponentDefinition. However, in case of proteins, circularity is exceedingly rare and 'double-stranded' does not apply at all. Furthermore, the SO terms seem to be more or less explicitely defined for nucleic acid polymers. Explicitely excluding these fields from all non-RNA or DNA ComponenDefinitions therefore seems reasonable.
This SEP replaces SEP #8.
To the extent possible under law,
SBOL developers
have waived all copyright and related or neighboring rights to
SEP 011.
This work is published from:
United States.