SEP 008 -- Specify DNA / RNA topology

SEP	008
Title	Specify DNA / RNA topology
Authors	Raik Gruenberg (raik.gruenberg at gmail com)
Editor	Raik Gruenberg
Type	Data Model
SBOL Version	2.2
Status	Deferred
Created	03-Sep-2016
Last modified	22-Oct-2016
Issue	#23

Abstract

SBOL currently is missing any means to distinguish between circular (e.g. plasmid) and linear DNA constructs. Also double- and single-stranded DNA or RNA cannot be distinguished.

Note from the editors: This proposal was deferred in favor of a short-term solution described in SEP #11.

Rationale

Engineered DNA constructs are currently mostly deposited, used and exchanged as circular, double-stranded, plasmid DNA. However, there are also many applications for linear constructs which are easily and cheaply accessible through PCR or gene synthesis. Moreover, even circular plasmid DNA may be cut and stored or exchanged as linear DNA to aid the construction of new vectors. Likewise, natural or engineered genomes may be circular (bacterial chromosomes) or linear (eukaryotic chromosomes). Both linear and circular DNA or RNA molecules can, moreover, be either double-stranded (ds) or single stranded. RNA, typically, is used as single-stranded molecule whereas DNA is, mostly, utilized double-stranded (dsDNA). However, oligonucleotides (primers) are an example of single-stranded DNA (ssDNA) which, however, can also be complemented into double-stranded fragments.

Knowing whether a given construct is supposed to have circular or linear topology is of crucial relevance both for experimental design and for basic sequence analysis. Nevertheless, unlike the much older genbank format, SBOL so far lacks the possibility to distinguish between linear and circular DNA. Likewise, the distinction between single-stranded and double-stranded has important consequences but is not currently conveyed by SBOL.

Specification

(2.1) ComponentDefinition should be amended by a new field topology pointing to either of two possible Sequence Ontology terms:

SO:0000987 -- a.k.a. 'linear' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000987]
SO:0000988 -- a.k.a. 'circular' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000988]

(2.2) A new validation rule needs to be implemented that makes this an required field for ComponentDefinitions of type 'DNA' or 'RNA'. By contrast, the field is not allowed for descriptions of proteins, small molecules or other types of entities.

(2.3) ComponentDefinition should be amended by a new field strand pointing to either of two possible Sequence Ontology terms:

SO:0000984 -- a.k.a. 'single' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000984]
SO:0000985 -- a.k.a. 'double' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000985]

(2.4) A new validation rule needs to be implemented that makes this an required field for ComponentDefinitions of type 'DNA' or 'RNA'. By contrast, the field is not allowed for descriptions of proteins, small molecules or other types of entities.

(2.5) Consequences for software tools

topology = circular (SO:0000988) instructs software to interpret the beginning / end position of a given sequence (be it DNA or RNA) as arbitrary so that sequence features may be mapped or identified across this junction.

strand = double (SO:0000985) instructs software to apply sequence searches to both strands (i.e. sequence and reverse complement of sequence).

(2.6) Missing fields

There are no default values for topology or strand. Absence of either of these fields should be treated as genuine lack of information.

Example or Use Case

This is such a basic feature of molecular biology that it is difficult to find an example where it is, in fact, not needed. By way of example, consider popular sequence design software such as Benchling, CLC, VectorNTI or others. In these cases, declaring a sequence as circular changes the view and the searching behaviour of the sequence editor.

Example for double-stranded / single-stranded:

While the pattern 'ATG' is not found within the sequence TTCCTAC, the reverse complement of this sequence does contain ATG. Whether or not a given sequence is considered double-stranded therefore decides whether or not it matches a given pattern.

Example for circular / linear:

While no 'ATG' start codon is found within the linear sequence GAATCATCATAT, this sequence does contain a start codon if considered circular.

Backwards Compatibility

No backwards compatibility issues are foreseen. However, the fields will be missing from older records. Software then should fall back to making an educated guess or assuming linear topology.

Discussion

5.1 re-using `type` field for topology and strand

(This is the solution ultimately adopted for SBOL 2.1)

Several devs suggest to not introduce any new fields but, instead, to attach topology and strand features using the existing DnaComponent.type field. type is already supposed to point to SO terms and more than one type field can be added per ComponentDefinition.

Advantage of re-using type:

no change of data model, only validation rules need update
may make life slightly easier for reasoning software (less fields to consider)
fast implementation in terms of spec and no need for a vote

Disadvantage:

higher risk that topology and strand information is not actually given
complex validation (if one type field points to DNA, ensure there are two more type fields, one of which ...)
software would have to interpret a collection of type fields even if it is only interested in, for example, telling apart DNA and protein parts
undefined result if type fields point to neither of the four given SO terms; non-reasoning software could much easier recover if it, at least, is still told whether the unknown term refers to topology, strand or type

Conversely, issues with the proposal of new fields topology, strand (e.g. pointed out by Chris):

Having two fields in CD that only make sense for some types complicates the data model unnecessarily and actually complicates software development as well, since this means not just the library needs new validation rules but the software also using the library needs to be aware of new fields. This is particularly problematic for software that is not using SBOL internally as its data model, and it will increase the chance of data loss. [Chris]

Making these fields required renders files of previous versions invalid. This would therefore become an SBOL 3.0 change, not a SBOL 2.1 change.

5.2 optional versus required fields

I originally proposed to make fields required at the data model level. However, since topology and strand information makes little or even no sense for ComponentDefinitions describing proteins or small molecules, this was dropped (following debate on sbol-dev). Two new validation rules are proposed instead.

5.3 boolean value instead of SO term

The original SEP proposed boolean values (circular=True/False). This was critized on the grounds that we, so far, are not using boolean values anywhere in SBOL and that we should tie things in with SO instead.

5.4 Combine `topology` and `strand` into one field

SO offers terms for all 8 combinations of circular/linear and double/single stranded and DNA/RNA. An example is the term circular_double_stranded_DNA_chromosome. Using these terms was considered less elegant though. (1) software likely would have to decompose these terms again, (2) the 'chromosome' semantics does not fit many SBOL applications.

5.5 Formulation through sequence constraints

It would be possible to formulate topology through sequence constraints. However, sequence constraints have a different use case -- they describe incomplete designs that may, for example, be communicated to services for automated design or library construction. Standard sequence editors will likely not support any sequence constraints framework in the near future. By contrast, the current standard use case for SBOL is (or should be) the communication of a complete sequence, e.g. for a plasmid deposited at AddGene or a fragment described on parts.igem.org.

5.6 Application to non-DNA / RNA entities

Also protein or even small / medium-sized molecules can feature circular topologies and topology was originally envisioned to apply to all types of ComponentDefinition. However, in case of proteins, circularity is exceedingly rare and 'double-stranded' does not apply at all. Furthermore, the SO terms seem to be more or less explicitely defined for nucleic acid polymers. Explicitely excluding these fields from all non-RNA or DNA ComponenDefinitions therefore seems reasonable.

Competing SEPs

This SEP was formerly replaced by SEP 011 (SEP #11).

References

Copyright

To the extent possible under law, SBOL developers have waived all copyright and related or neighboring rights to SEP 008. This work is published from: United States.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sep_008.md

sep_008.md

SEP 008 -- Specify DNA / RNA topology

Abstract

5.1 re-using `type` field for topology and strand

5.2 optional versus required fields

5.3 boolean value instead of SO term

5.4 Combine `topology` and `strand` into one field

5.5 Formulation through sequence constraints

5.6 Application to non-DNA / RNA entities

References

Copyright

Files

sep_008.md

Latest commit

History

sep_008.md

File metadata and controls

SEP 008 -- Specify DNA / RNA topology

Abstract

5.1 re-using type field for topology and strand

5.2 optional versus required fields

5.3 boolean value instead of SO term

5.4 Combine topology and strand into one field

5.5 Formulation through sequence constraints

5.6 Application to non-DNA / RNA entities

References

Copyright

5.1 re-using `type` field for topology and strand

5.4 Combine `topology` and `strand` into one field