Skip to content

Latest commit

 

History

History
214 lines (149 loc) · 10.9 KB

sep_011.md

File metadata and controls

214 lines (149 loc) · 10.9 KB

SEP 011 -- Specify DNA / RNA topology with type fields

SEP 011
Title Specify DNA / RNA topology with type fields
Authors Raik Gruenberg (raik.gruenberg at gmail com)
Editor Raik Gruenberg
Type Data Model
SBOL Version 2.1
Replace 008 (SEP #8)
Status Accepted
Created 22-Oct-2016
Last modified
Issue #26

Abstract

SBOL currently is missing any clear means to distinguish between circular (e.g. plasmid) and linear DNA constructs. Also double- and single-stranded DNA or RNA is not distinguished. This proposal describes a best-practice and new validation rules for encoding topology and strand information in the ComponentDefinition.type field.

Note from the editors: This SEP changes the original proposal (SEP #8) such that no changes to the data model are required. It was therefore implemented in SBOL 2.1 without a vote. SEP #8 may nevertheless be discussed and revived for future SBOL releases.

  1. Rationale

Engineered DNA constructs are currently mostly deposited, used and exchanged as circular, double-stranded, plasmid DNA. However, there are also many applications for linear constructs which are easily and cheaply accessible through PCR or gene synthesis. Moreover, even circular plasmid DNA may be cut and stored or exchanged as linear DNA to aid the construction of new vectors. Likewise, natural or engineered genomes may be circular (bacterial chromosomes) or linear (eukaryotic chromosomes). Both linear and circular DNA or RNA molecules can, moreover, be either double-stranded (ds) or single stranded. RNA, typically, is used as single-stranded molecule whereas DNA is, mostly, utilized double-stranded (dsDNA). However, oligonucleotides (primers) are an example of single-stranded DNA (ssDNA) which, however, can also be complemented into double-stranded fragments.

Knowing whether a given construct is supposed to have circular or linear topology is of relevance both for experimental design and for basic sequence analysis. Nevertheless, unlike the much older genbank format, SBOL so far lacks the possibility to distinguish between linear and circular DNA. Likewise, the distinction between single-stranded and double-stranded has important consequences but is not currently conveyed by SBOL.

  1. Specification

(2.1) Topology

Any ComponentDefinition classified as DNA (via its type field) is RECOMMENDED to encode circular/linear topology information in an additional type field. This (topology) type field MUST point to either of two possible Sequence Ontology terms:

There is no default topology for DNA. Lack of topology information is only acceptable for DNA records without or with incomplete sequence information. Lack of topology information is not acceptable for DNA ComponentDefinition records with fully specified sequence -- unless the topology is genuinely unknown, in which case the sequence is, in fact, not fully specified.

The topology type field is OPTIONAL for RNA ComponentDefinition records. The default assumption for RNA records is linear topology.

Any other ComponentDefinition record (protein, small molecule, etc) MUST NOT have a type field with DNA topology information.

(2.2) Topology Validation Rules

If ComponentDefinition has type 'DNA' and has a sequence record but does not have a type field pointing to either 'linear' or 'circular' (SO:0000987 or 988), this should trigger a warning.

If ComponentDefinition has any type other than 'DNA' or 'RNA' and has a type field pointing to SO:0000987 or 988, this MUST be considered an error.

(2.3) Double / Single stranded

Any ComponentDefinition classified as DNA or RNA (via its type field) MAY have strand information encoded in an additional type field pointing to either of two possible Sequence Ontology terms:

In absence of this field, the default strand information assumed for DNA is 'double-stranded' and for RNA is 'single-stranded'.

Any other ComponentDefinition record (protein, small molecule, etc) MUST NOT have any type field pointing to these SO terms.

(2.4) Strand validation rules

If ComponentDefinition has any type other than 'DNA' or 'RNA' and has a type field pointing to SO:0000984 or 985, this MUST be considered an error.

(2.5) Consequences for software tools

circular topology (SO:0000988) instructs software to interpret the beginning / end position of a given sequence (be it DNA or RNA) as arbitrary so that sequence features may be mapped or identified across this junction.

double stranded (SO:0000985) instructs software to apply sequence searches to both strands (i.e. sequence and reverse complement of sequence).

  1. Example or Use Case

This is such a basic feature of molecular biology that it is difficult to find an example where it is, in fact, not needed. By way of example, consider popular sequence design software such as Benchling, CLC, VectorNTI or others. In these cases, declaring a sequence as circular changes the view and the searching behaviour of the sequence editor.

Example for double-stranded / single-stranded:

While the pattern 'ATG' is not found within the sequence TTCCTAC, the reverse complement of this sequence does contain ATG. Whether or not a given sequence is considered double-stranded therefore decides whether or not it matches a given pattern.

Example for circular / linear:

While no 'ATG' start codon is found within the linear sequence GAATCATCATAT, this sequence does contain a start codon if considered circular.

  1. Backwards Compatibility

No backwards compatibility issues are foreseen. However, the fields will be missing from older records. Software then should fall back to making an educated guess or assuming linear topology.

  1. Discussion

5.1 creating new fields topology and strand

This was, in fact, the original suggestion (See SEP #8). It was retracted in order to make topology and strand information already available in SBOL 2.1. Depending on experience, this issue may be re-visited before finalizing SBOL 3.0.

Declaration of dedicated fields for this information would have advantages:

  • more explicit (rather than several type fields pointing to cryptic SO terms, the attribute name would clarify what type of information can be expected)
  • lower risk that topology and strand information is omitted
  • less or less complex validation rules
  • software would not have to interpret a collection of type fields even though it is only interested in, for example, telling apart circular plasmid DNA from DNA fragments
  • if non-standard SO terms are used, software could, at least, still know whether the unknown term refers to topology, strand or type

Disadvantage:

  • the two fields would only apply to 'RNA' and 'DNA' ComponentDefinition, validation rules would still be needed.
  • This may be more confusing for software developers than additional type fields.

5.2 optional versus required fields

I originally proposed to make fields required at the data model level. However, since topology and strand information makes little or even no sense for ComponentDefinitions describing proteins or small molecules, this was dropped (following debate on sbol-dev). As pointined out at COMBINE, CompentDefinition without sequence information cannot be required to have topology information either. Moreover, topology information may genuinely unknown. For all these reasons, the field cannot be required. It is recommended as best practice.

5.3 boolean value instead of SO term

The original SEP proposed boolean values (circular=True/False). This was critized on the grounds that we, so far, are not using boolean values anywhere in SBOL and that we should tie things in with SO instead.

5.4 Combine topology and strand into one field

SO offers terms for all 8 combinations of circular/linear and double/single stranded and DNA/RNA. An example is the term circular_double_stranded_DNA_chromosome. Using these terms was considered less elegant. (1) software likely would have to decompose these terms again, (2) the 'chromosome' semantics does not fit many SBOL applications.

5.5 Formulation through sequence constraints

It would be possible to formulate topology through sequence constraints. However, sequence constraints have a different use case -- they describe incomplete designs that may, for example, be communicated to services for automated design or library construction. Standard sequence editors will likely not support any sequence constraints framework in the near future. By contrast, the current standard use case for SBOL is (or should be) the communication of a complete sequence, e.g. for a plasmid deposited at AddGene or a fragment described on parts.igem.org.

5.6 Application to non-DNA / RNA entities

Also protein or even small / medium-sized molecules can feature circular topologies and topology was originally envisioned to apply to all types of ComponentDefinition. However, in case of proteins, circularity is exceedingly rare and 'double-stranded' does not apply at all. Furthermore, the SO terms seem to be more or less explicitely defined for nucleic acid polymers. Explicitely excluding these fields from all non-RNA or DNA ComponenDefinitions therefore seems reasonable.

  1. Competing SEPs

This SEP replaces SEP #8.

References

Copyright

CC0
To the extent possible under law, SBOL developers have waived all copyright and related or neighboring rights to SEP 011. This work is published from: United States.