Minimal information to represent an assembly #51

@manulera

Tagging @BjornFJohansson @jakebeal @dgruano for input.

As described in more detail in pydna-group/pydna#165, I have recently implemented an alternative Assembly class based on the original pydna class. The fully documented source code can be seen here, but in essence there are three types of inputs:

  • A list of sequences (as Dseqrecord objects)
  • A function to find common substrings among pairs of sequences (by default it finds common substrings anywhere, but it can be replaced by functions that only find common substrings at the edges of the sequences, or even by one that finds common restriction sites)
  • Constraints (how long a common substring must be to be considered, whether all fragments must be used, etc.)
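
As a rough illustration of the second input, a function that finds common substrings anywhere in a pair of sequences could look like the sketch below. It works on plain strings rather than Dseqrecord objects, and the name `common_substrings`, its signature and its return format are made up for this example; they are not pydna's API.

```python
def common_substrings(a: str, b: str, min_length: int) -> list[tuple[int, int, int]]:
    """Return maximal common substrings as (start_a, start_b, length).

    Hypothetical helper: comparison is case-insensitive and positions are 0-based.
    """
    a, b = a.upper(), b.upper()
    # prev[j] holds the length of the match ending at a[i - 1] and b[j - 1]
    prev = [0] * (len(b) + 1)
    matches = []
    for i in range(len(a)):
        curr = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            curr[j + 1] = prev[j] + 1
            # Report the match only when it cannot be extended to the right
            at_end = i + 1 == len(a) or j + 1 == len(b)
            if at_end or a[i + 1] != b[j + 1]:
                length = curr[j + 1]
                if length >= min_length:
                    matches.append((i - length + 1, j - length + 1, length))
        prev = curr
    return matches


matches = common_substrings("AacgatCAtgctccaa", "TtgctccTAAattctgc", min_length=6)
```

The `min_length` argument plays the role of the length constraint from the third input. A function restricted to the edges of the sequences would keep the same return format and simply discard matches that do not touch an end.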

Representing the join of two fragments

An assembly is then represented as a list of "joins" between fragments, where each join is represented as (u, v, loc_u, loc_v).

  • u and v are integers, representing the index (1-based) of a joined fragment in the input list. The sign of the index represents the orientation of the fragment: positive for forward orientation, negative for reverse orientation.
  • loc_u and loc_v are the locations of the common substring in u and v, respectively.

For example, joining the left part of fragment 1 and the right part of fragment 2 through their homology, as shown below:

1 AacgatCAtgctccaa                      ......
          ||||||            ==> AacgatCAtgctccTAAattctgc
2        TtgctccTAAattctgc

would be represented as (1, 2, [8:14](+), [1:7](+)). Here, locations are written as biopython does, but any representation would be fine.

If fragment 2 were reverse-complemented in the input given to the assembly, the same join would be represented as (1, -2, [8:14](+), [1:7](+)). The strand in the location is not strictly necessary, so it could be omitted.
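
To make the tuple concrete, here is a minimal sketch that applies a single join to plain strings. The names `Join`, `oriented` and `apply_join` are hypothetical, locations are stored as 0-based (start, end) pairs without the strand, and I am assuming that locations refer to each fragment after it has been oriented according to the sign of its index, which is how I read the two examples above.

```python
from typing import NamedTuple


class Join(NamedTuple):
    u: int                    # 1-based index of the left fragment, sign = orientation
    v: int                    # 1-based index of the right fragment, sign = orientation
    loc_u: tuple[int, int]    # (start, end) of the common substring in u
    loc_v: tuple[int, int]    # (start, end) of the common substring in v


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def oriented(fragments: list[str], index: int) -> str:
    """Return the fragment in the orientation encoded by the sign of index."""
    seq = fragments[abs(index) - 1]
    return seq if index > 0 else reverse_complement(seq)


def apply_join(fragments: list[str], join: Join) -> str:
    """Join u and v through their common substring (assumed semantics)."""
    left = oriented(fragments, join.u)
    right = oriented(fragments, join.v)
    (start_u, end_u), (start_v, end_v) = join.loc_u, join.loc_v
    if left[start_u:end_u].upper() != right[start_v:end_v].upper():
        raise ValueError("locations do not point to a common substring")
    # Keep u up to the end of the overlap, then v after the overlap
    return left[:end_u] + right[end_v:]


frag1 = "AacgatCAtgctccaa"
frag2 = "TtgctccTAAattctgc"
product = apply_join([frag1, frag2], Join(1, 2, (8, 14), (1, 7)))
# Same join when fragment 2 is given reverse-complemented in the input
product_rc = apply_join([frag1, reverse_complement(frag2)], Join(1, -2, (8, 14), (1, 7)))
```

Under these assumptions both calls give the product drawn in the diagram, AacgatCAtgctccTAAattctgc.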

Representing an assembly

An assembly can then be represented as a list of input fragments and a tuple of joins as described above, like this:

  • Linear: ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'))
  • Circular: ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'), (3, 1, '3[12:17](+):1[1:6](+)'))

Note that the first and last fragment are the same in a circular assembly.
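
A small sketch of how this property could be used: walking the joins gives the order of fragments, and comparing the first and last fragment tells linear and circular assemblies apart. The function names `fragment_path` and `is_circular` are hypothetical, and the location strings are carried along without being parsed.

```python
def fragment_path(assembly):
    """Return the signed fragment indices in the order they are joined."""
    path = [assembly[0][0]]
    for u, v, *_ in assembly:
        # Each join must start at the fragment where the previous one ended
        if u != path[-1]:
            raise ValueError(f"join ({u}, {v}) does not continue from fragment {path[-1]}")
        path.append(v)
    return path


def is_circular(assembly):
    """An assembly is circular when it ends on the fragment it started from."""
    return assembly[0][0] == assembly[-1][1]


linear = ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'))
circular = linear + ((3, 1, '3[12:17](+):1[1:6](+)'),)
```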

De-duplication

The same sequence output of an assembly can be described in several ways:

  • Linear outputs can be described in forward and reverse orientation
  • Circular outputs can be described in forward and reverse orientation, and all their circular permutations

To avoid such duplicates, the following constraints are applied:

  • Linear assemblies: the first fragment is in the forward orientation.
  • Circular assemblies: the first fragment is in the forward orientation, and has the smallest index in the input fragment list.
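
The two constraints can be sketched as a filter that keeps a single description per output. `is_canonical` is a hypothetical name, and the assemblies are tuples of joins whose first two elements are the signed fragment indices, as in the examples above.

```python
def is_canonical(assembly, circular):
    """Return True if the assembly is the one description kept after de-duplication."""
    first_u = assembly[0][0]
    # Both cases: the first fragment must be in the forward orientation
    if first_u < 0:
        return False
    if circular:
        # Circular case: the first fragment must also have the smallest index
        smallest = min(abs(join[0]) for join in assembly)
        return first_u == smallest
    return True


circular = ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'), (3, 1, '3[12:17](+):1[1:6](+)'))
# A circular permutation of the same assembly, starting from fragment 2
rotated = circular[1:] + circular[:1]
```

Here `circular` passes the check while `rotated` is discarded, since it starts at fragment 2 instead of fragment 1.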

Use cases

Based on pydna's current uses, plus a few more:

  • Gibson assembly
  • Homologous recombination
  • Representation of ligation of fragments with sticky overhangs (algorithm should return the location of compatible overhangs)
  • One step restriction-ligation (the algorithm could return common substrings based on the cutsite of restriction enzymes provided by the user)

Feedback

This is meant to be the minimal information that could then be translated into SBOL format. Are there any limitations or possible improvements? Feel free to leave your thoughts.
