Description
Tagging @BjornFJohansson @jakebeal @dgruano for input.
As described in more detail in pydna-group/pydna#165, I have recently implemented an alternative Assembly class based on the original pydna class. The fully documented source code can be seen here, but in essence there are three types of inputs:
- A list of sequences (as Dseqrecord objects)
- A function to find common substrings among pairs of sequences (by default, common substrings anywhere, but it can accept functions that only find common substrings at the edges of the sequences, or one could also think of finding common restriction sites)
- Constraints (how long the common substring must be for it to be considered, whether all fragments should be used, etc.)
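As a rough illustration of the second input, the sketch below is a hypothetical stand-in for such a function. The name common_substrings, its signature, and its return format are assumptions made for this example, not the pydna API. It works on plain strings and finds every maximal, case-insensitive common substring of at least min_length bases anywhere in a pair of sequences, which corresponds to the default behaviour described above.

```python
def common_substrings(seq_u: str, seq_v: str, min_length: int):
    """Hypothetical helper: return [((start_u, end_u), (start_v, end_v)), ...]
    for every maximal common substring of at least min_length characters.
    Coordinates are 0-based and end-exclusive, like Python slices."""
    u, v = seq_u.upper(), seq_v.upper()
    matches = []
    # prev and curr are consecutive rows of the longest-common-suffix table:
    # curr[j] is the length of the longest common suffix of u[:i] and v[:j].
    prev = [0] * (len(v) + 1)
    for i in range(1, len(u) + 1):
        curr = [0] * (len(v) + 1)
        for j in range(1, len(v) + 1):
            if u[i - 1] != v[j - 1]:
                continue
            curr[j] = prev[j - 1] + 1
            # Record the match only if it cannot be extended to the right.
            at_end = i == len(u) or j == len(v)
            if (at_end or u[i] != v[j]) and curr[j] >= min_length:
                length = curr[j]
                matches.append(((i - length, i), (j - length, j)))
        prev = curr
    return matches


matches = common_substrings("AacgatCAtgctccaa", "TtgctccTAAattctgc", min_length=6)
print(matches)  # [((8, 14), (1, 7))]
```

A function restricted to sequence edges, or one driven by restriction sites, could return locations in the same shape, so the rest of the assembly logic would not need to know which strategy produced them.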
Representing the join of two fragments
An assembly is then represented as a list of "joins" between fragments, where each join is represented as (u, v, loc_u, loc_v).
- u and v are integers, representing the index (1-based) of a joined fragment in the input list. The sign of the node key represents the orientation of the fragment: positive for forward orientation, negative for reverse orientation.
- loc_u and loc_v are the locations of the common substring in u and v.
For example, the joining of the left part of fragment 1 and the right part of fragment 2 through their homology, as shown below,
1  AacgatCAtgctccaa                 ......
           ||||||           ==>  AacgatCAtgctccTAAattctgc
2         TtgctccTAAattctgc
would be represented as (1, 2, [8:14](+), [1:7](+)). Here, locations are represented as Biopython does, but any representation would be fine.
If fragment 2 in the input given to the assembly were reverse complemented, the same join would be represented as (1, -2, [8:14](+), [1:7](+)). The strand in the location is not strictly necessary, so it could be omitted.
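The join from the example can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the Join tuple and the helper names are hypothetical, locations are stored as 0-based, end-exclusive (start, end) pairs instead of Biopython location objects, the strand is omitted, and locations are interpreted in the coordinates of the fragment after it has been oriented according to the sign of its key.

```python
from typing import NamedTuple, Sequence, Tuple


class Join(NamedTuple):
    """Hypothetical container for one join (u, v, loc_u, loc_v)."""
    u: int                  # 1-based fragment index; negative means reverse complement
    v: int                  # same convention, for the second fragment
    loc_u: Tuple[int, int]  # (start, end) of the common substring in u
    loc_v: Tuple[int, int]  # (start, end) of the common substring in v


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def oriented(fragments: Sequence[str], key: int) -> str:
    """Return the fragment selected by a signed, 1-based key."""
    seq = fragments[abs(key) - 1]
    return seq if key > 0 else reverse_complement(seq)


def apply_join(fragments: Sequence[str], join: Join) -> str:
    """Join two fragments through their common substring."""
    seq_u, seq_v = oriented(fragments, join.u), oriented(fragments, join.v)
    (start_u, end_u), (start_v, end_v) = join.loc_u, join.loc_v
    # Both locations must point at the same (case-insensitive) sequence.
    if seq_u[start_u:end_u].upper() != seq_v[start_v:end_v].upper():
        raise ValueError("locations do not describe a common substring")
    # Keep u up to the overlap, then v from the overlap onwards.
    return seq_u[:start_u] + seq_v[start_v:]


fragments = ["AacgatCAtgctccaa", "TtgctccTAAattctgc"]
product = apply_join(fragments, Join(1, 2, (8, 14), (1, 7)))
print(product)  # AacgatCAtgctccTAAattctgc

# Same join when fragment 2 is supplied reverse complemented in the input.
flipped_input = [fragments[0], reverse_complement(fragments[1])]
same_product = apply_join(flipped_input, Join(1, -2, (8, 14), (1, 7)))
```

With this interpretation, same_product equals product, which matches the statement that only the sign of the node key changes when an input fragment is flipped.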
Representing an assembly
An assembly can then be represented as a list of input fragments and a tuple of joins as described above, like this:
- Linear: ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'))
- Circular: ((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'), (3, 1, '3[12:17](+):1[1:6](+)'))
Note that a circular assembly starts and ends at the same fragment: the last join leads back to the first fragment.
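To make the two representations concrete, here is a minimal sketch of how a product sequence could be derived from a tuple of joins. Several things are assumptions for illustration: the assemble function is hypothetical, joins are written as (u, v, loc_u, loc_v) tuples with (start, end) pairs rather than the string form used in the list above, and the sequence of fragment 3 is made up so that it is consistent with the locations shown (fragments 1 and 2 are taken from the earlier example).

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


def oriented(fragments, key):
    """Return the fragment selected by a signed, 1-based key."""
    seq = fragments[abs(key) - 1]
    return seq if key > 0 else reverse_complement(seq)


def assemble(fragments, joins, circular=False):
    """Hypothetical helper: build the product sequence from a tuple of joins."""
    pieces = []
    # In a circular assembly the first fragment starts contributing at the
    # overlap described by the last join; in a linear one, at position 0.
    start = joins[-1][3][0] if circular else 0
    for u, v, loc_u, loc_v in joins:
        # Keep the current fragment up to where its overlap with v begins.
        pieces.append(oriented(fragments, u)[start:loc_u[0]])
        start = loc_v[0]
    if not circular:
        # The last fragment of a linear assembly runs to its end.
        pieces.append(oriented(fragments, joins[-1][1])[start:])
    return "".join(pieces)


fragments = [
    "AacgatCAtgctccaa",
    "TtgctccTAAattctgc",
    "GattctgcGGCCacgatA",  # made-up fragment 3, for illustration only
]

linear_joins = ((1, 2, (8, 14), (1, 7)), (2, 3, (10, 17), (1, 8)))
circular_joins = linear_joins + ((3, 1, (12, 17), (1, 6)),)

linear_product = assemble(fragments, linear_joins)
circular_product = assemble(fragments, circular_joins, circular=True)
print(linear_product)    # AacgatCAtgctccTAAattctgcGGCCacgatA
print(circular_product)  # acgatCAtgctccTAAattctgcGGCC
```

The circular product is written starting at the overlap between fragments 3 and 1; any rotation of that string describes the same molecule.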
De-duplication
The same sequence output of an assembly can be described in several ways:
- Linear outputs can be described in forward and reverse orientation
- Circular outputs can be described in forward and reverse orientation, and all their circular permutations
To prevent duplicates, the following constraints are applied:
- Linear assemblies: the first fragment is in the forward orientation.
- Circular assemblies: the first fragment is in the forward orientation, and has the smallest index in the input fragment list.
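These two constraints can be sketched as a simple filter. The function below is a hypothetical illustration, not the implementation in the class: it looks only at the path of signed fragment keys (the u of each join, plus the final v for linear assemblies) and assumes that each fragment is used at most once in a circular assembly, so that the smallest index is unambiguous.

```python
def is_canonical(path, circular):
    """Hypothetical check: is this the single description we keep?

    path is the sequence of signed, 1-based fragment keys in assembly order.
    """
    first = path[0]
    if first < 0:
        # The first fragment must be in the forward orientation.
        return False
    if circular:
        # The first fragment must also have the smallest input index.
        return first == min(abs(key) for key in path)
    return True


# The six equivalent descriptions of one circular assembly of fragments 1, 2, 3:
# three rotations in each of the two orientations.
circular_descriptions = [
    (1, 2, 3), (2, 3, 1), (3, 1, 2),
    (-3, -2, -1), (-2, -1, -3), (-1, -3, -2),
]
kept_circular = [p for p in circular_descriptions if is_canonical(p, circular=True)]
print(kept_circular)  # [(1, 2, 3)]

# The two equivalent descriptions of one linear assembly.
linear_descriptions = [(1, 2, 3), (-3, -2, -1)]
kept_linear = [p for p in linear_descriptions if is_canonical(p, circular=False)]
print(kept_linear)  # [(1, 2, 3)]
```

In both cases exactly one description of each product survives the filter.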
Use cases
Based on pydna's current uses, and some more:
- Gibson assembly
- Homologous recombination
- Representation of ligation of fragments with sticky overhangs (algorithm should return the location of compatible overhangs)
- One-step restriction-ligation (the algorithm could return common substrings based on the cut sites of restriction enzymes provided by the user)
Feedback
This is meant to be the minimal information that could then be translated into SBOL format. Do you see any limitations, or improvements that could be made? Feel free to leave your thoughts.