Minimal information to represent an assembly

Tagging @BjornFJohansson @jakebeal @dgruano for input.

As described in more detail in https://github.com/BjornFJohansson/pydna/issues/165, I have recently implemented an alternative `Assembly` class based on the original pydna class. The fully documented source code can be seen [here](https://github.com/manulera/ShareYourCloning_backend/blob/master/assembly2.py), but in essence there are three types of inputs:

- A list of sequences (as `Dseqrecord` objects)
- A function to find common substrings among pairs of sequences (by default, common substrings anywhere, but it can accept functions that only find common substrings at the edges of the sequences, or one could also think of functions that find common restriction sites)
- Constraints (how long the common substring must be to be considered, whether all fragments must be used, etc.)
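To make the second and third inputs concrete, here is a minimal sketch of what such a function could look like. It is not the function used in `assembly2.py` or pydna: the name `common_substrings`, its `limit` argument, and the `(start_a, start_b, length)` return format are all hypothetical, and the comparison is assumed to be case-insensitive.

```python
def common_substrings(a: str, b: str, limit: int) -> list[tuple[int, int, int]]:
    """Naive O(len(a) * len(b)) search for maximal common substrings.

    Returns (start_in_a, start_in_b, length) for every shared substring
    of at least `limit` bases that cannot be extended in either direction.
    Hypothetical helper, for illustration only.
    """
    a, b = a.lower(), b.lower()  # assumption: case carries no meaning
    matches = []
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Length of the common run ending at a[i-1] and b[j-1]
                curr[j] = prev[j - 1] + 1
                # The run is maximal if it cannot be extended to the right
                at_end = i == len(a) or j == len(b)
                if (at_end or a[i] != b[j]) and curr[j] >= limit:
                    n = curr[j]
                    matches.append((i - n, j - n, n))
        prev = curr
    return matches


hits = common_substrings("AacgatCAtgctccaa", "TtgctccTAAattctgc", limit=5)
```

A function restricted to the edges of the sequences could filter these hits by position, which is one way the "constraints" input could be applied.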

## Representing the join of two fragments

An assembly is then represented as a list of "joins" between fragments, where each join is represented as `(u, v, loc_u, loc_v)`.

* `u` and `v` are integers giving the 1-based index of each joined fragment in the input list. The sign of the index represents the orientation of the fragment: positive for forward orientation, negative for reverse orientation.
* `loc_u` and `loc_v` are the locations of the common substring in `u` and `v`, respectively.

For example, the joining of the left part of fragment 1 and the right part of fragment 2 through their homology, as shown below

```
1 AacgatCAtgctccaa                      ......
          ||||||            ==> AacgatCAtgctccTAAattctgc
2        TtgctccTAAattctgc
```

would be represented as `(1, 2, [8:14](+), [1:7](+))`. Here, locations are written as Biopython does (0-based start, exclusive end), but any representation would be fine.

If fragment 2 were reverse-complemented in the input given to the assembly, then the same join would be represented as `(1, -2, [8:14](+), [1:7](+))`. The strand in the location is not strictly necessary, so it could be omitted.
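As a rough illustration of this representation, the sketch below stores a join as a named tuple and rebuilds the product of the example above. The names `Join`, `oriented` and `join_product` are hypothetical, locations are reduced to plain `(start, end)` tuples instead of Biopython objects, and it assumes that locations refer to each fragment after it has been put in the orientation given by the sign of its index.

```python
from typing import NamedTuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


class Join(NamedTuple):
    u: int                  # signed, 1-based index of the first fragment
    v: int                  # signed, 1-based index of the second fragment
    loc_u: tuple[int, int]  # (start, end) of the common substring in u
    loc_v: tuple[int, int]  # (start, end) of the common substring in v


def oriented(fragments: list[str], index: int) -> str:
    """Fragment in the orientation encoded by the sign of `index`."""
    seq = fragments[abs(index) - 1]
    return seq if index > 0 else reverse_complement(seq)


def join_product(fragments: list[str], join: Join) -> str:
    seq_u = oriented(fragments, join.u)
    seq_v = oriented(fragments, join.v)
    overlap_u = seq_u[join.loc_u[0]:join.loc_u[1]]
    overlap_v = seq_v[join.loc_v[0]:join.loc_v[1]]
    if overlap_u.lower() != overlap_v.lower():
        raise ValueError("locations do not mark a common substring")
    # Everything in u before the overlap, then v from the overlap onwards
    return seq_u[:join.loc_u[0]] + seq_v[join.loc_v[0]:]


fragments = ["AacgatCAtgctccaa", "TtgctccTAAattctgc"]
product = join_product(fragments, Join(1, 2, (8, 14), (1, 7)))

# Same join when fragment 2 is supplied reverse-complemented
flipped = [fragments[0], reverse_complement(fragments[1])]
same_product = join_product(flipped, Join(1, -2, (8, 14), (1, 7)))
```

Under these assumptions both calls return the sequence shown in the diagram.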

## Representing an assembly

An assembly can then be represented as a list of input fragments and a tuple of joins as described above, like this:

- Linear: `((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'))`
- Circular: `((1, 2, '1[8:14](+):2[1:7](+)'), (2, 3, '2[10:17](+):3[1:8](+)'), (3, 1, '3[12:17](+):1[1:6](+)'))`

Note that the first and last fragments are the same in a circular assembly.
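The following sketch shows one way a sequence could be rebuilt from such a tuple of joins. It is only an illustration: it uses `(u, v, loc_u, loc_v)` tuples with `(start, end)` locations rather than the string form above, the function names are hypothetical, and fragment 3 is an invented sequence chosen to be consistent with the example locations.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def oriented(fragments, index):
    """Fragment in the orientation encoded by the sign of `index`."""
    seq = fragments[abs(index) - 1]
    return seq if index > 0 else seq.translate(_COMPLEMENT)[::-1]


def is_circular(joins):
    # In a circular assembly the first and last fragment are the same
    return joins[0][0] == joins[-1][1]


def assemble(fragments, joins):
    """Concatenate fragment slices; each overlap is taken from the later fragment."""
    n = len(joins)
    for (_, v, _, _), (next_u, _, _, _) in zip(joins, joins[1:]):
        if v != next_u:
            raise ValueError("consecutive joins must share a fragment")
    circular = is_circular(joins)
    pieces = []
    if not circular:
        # Linear: start with the first fragment up to its first overlap
        first_u, _, (start_u, _), _ = joins[0]
        pieces.append(oriented(fragments, first_u)[:start_u])
    for i, (_, v, _, (start_v, _)) in enumerate(joins):
        if circular or i + 1 < n:
            # Stop where the overlap with the next fragment begins
            end = joins[(i + 1) % n][2][0]
        else:
            end = None  # last fragment of a linear assembly
        pieces.append(oriented(fragments, v)[start_v:end])
    return "".join(pieces)


fragments = [
    "AacgatCAtgctccaa",
    "TtgctccTAAattctgc",
    "GattctgcGGCCacgatT",  # invented for this example
]
linear = ((1, 2, (8, 14), (1, 7)), (2, 3, (10, 17), (1, 8)))
circular = linear + ((3, 1, (12, 17), (1, 6)),)

linear_seq = assemble(fragments, linear)
circular_seq = assemble(fragments, circular)
```

In this sketch the circular sequence starts at the overlap between fragments 1 and 2. That origin is an arbitrary choice.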

## De-duplication

The same sequence output of an assembly can be described in several ways:
* Linear outputs can be described in forward and reverse orientation
* Circular outputs can be described in forward and reverse orientation, and all their circular permutations

To prevent duplication, the following constraints are applied:

- Linear assemblies: the first fragment is in the forward orientation.
- Circular assemblies: the first fragment is in the forward orientation, and has the smallest index in the input fragment list.
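These two constraints can be checked on the tuple of joins alone. The sketch below is an assumption about how that might be done, not the check used in `assembly2.py`. `is_canonical` and `rotate_to_smallest` are hypothetical names, and only the first two elements of each join (`u` and `v`) are inspected.

```python
def is_canonical(joins):
    """True if the assembly satisfies the de-duplication constraints."""
    first = joins[0][0]
    if first < 0:
        return False  # the first fragment must be in forward orientation
    circular = joins[0][0] == joins[-1][1]
    if circular:
        # The first fragment must also have the smallest index
        return first == min(abs(u) for u, *_ in joins)
    return True


def rotate_to_smallest(joins):
    """Circular permutation starting at the fragment with the smallest index.

    Does not handle orientation: an assembly whose smallest fragment is
    reversed would additionally need to be reverse-complemented.
    """
    k = min(range(len(joins)), key=lambda i: abs(joins[i][0]))
    return joins[k:] + joins[:k]


circular = (
    (1, 2, (8, 14), (1, 7)),
    (2, 3, (10, 17), (1, 8)),
    (3, 1, (12, 17), (1, 6)),
)
permuted = circular[1:] + circular[:1]  # same product, different starting point
```

Here `circular` passes the check, `permuted` does not, and rotating `permuted` recovers `circular`.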

## Use cases

Based on pydna's current uses, and some more:

* Gibson assembly
* Homologous recombination
* Representation of ligation of fragments with sticky overhangs (`algorithm` should return the location of compatible overhangs)
* One-step restriction-ligation (the algorithm could return common substrings based on the cut sites of restriction enzymes provided by the user)

## Feedback

This is meant to be the minimal information that could then be translated into SBOL format. Do you see any limitations or possible improvements? Feel free to leave your thoughts.

