# Algorithms for Assembly

In the fourth and final week of this course, we study algorithms for the assembly problem. As the "Third Law of Assembly" states, repetitive sequences make assembly difficult. We tackle this problem by building on the work of the previous weeks.

In [1]:
from typing import Optional, Sequence
import itertools
import unittest

# Homework

## Question 1

In a practical, we saw the `shortest_common_superstring` function (copied below along with `overlap`), which finds the shortest common superstring of a set of strings.

It's possible for there to be multiple different shortest common superstrings for the same set of input strings.

What is the length of the shortest common superstring of the following strings?

"CCT", "CTT", "TGC", "TGG", "GAT", "ATT"

## Question 2

How many different shortest common superstrings are there for the input strings given in the previous question?

In [2]:
def overlap(a: str, b: str, min_length:int=3) -> int:
    """Return length of the longest suffix of 'a' matching
    a prefix of 'b' that is at least 'min_length' characters
    long. If no such overlap exists, return 0.

    Parameters
    ----------
    a : str
        String to test its suffix
    b : str
        String to test its prefix
    min_length : int
        Minimum length of match

    Returns
    -------
    int
        Longest overlap between suffix of 'a' with prefix of 'b'. Zero (0) otherwise
    """
    start = 0  # start all the way at the left
    # MGH addition, min_length must be at least 1
    if min_length < 1:
        raise ValueError("min_length must be a positive integer")
    # MGH addition, edge case if len(b) < min_length, then should return 0
    if len(b) < min_length:
        return 0
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1  # move just past previous match


def shortest_common_superstring(ss: Sequence[str]) -> Optional[str]:
    """Find the shortest common superstring of the given strings by
    brute force. The strings must all be the same length. The method
    tries all N! orderings, where N is the number of input strings.

    Parameters
    ----------
    ss : Sequence[str]
        Sequence of strings

    Returns
    -------
    Optional[str]
        Shortest superstring or None if no input
    """
    shortest_sup: Optional[str] = None
    # MGH addition, no input strings means there is no superstring
    if len(ss) == 0:
        return shortest_sup
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]  # superstring starts as first string
        for i in range(len(ss)-1):
            # overlap adjacent strings A and B in the permutation
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            # add non-overlapping portion of B to superstring
            sup += ssperm[i+1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup  # found shorter superstring
    return shortest_sup  # return shortest


class ShortestSuperStringTestCase(unittest.TestCase):

    def test_found_superstring(self):
        strings = [
            "ACGGTACGAGC",
            "GAGCTTCGGA",
            "GACACGG"
        ]
        super_string = shortest_common_superstring(strings)
        expected_super_string = "GACACGGTACGAGCTTCGGA"
        self.assertEqual(
            super_string,
            expected_super_string
        )

        strings = "ABC", "BCA", "CAB"
        expected_super_strings = ["ABCAB", "BCABC", "CABCA"]
        for input_combo in itertools.permutations(strings, 3):
            super_string = shortest_common_superstring(input_combo)
            self.assertIn(
                super_string,
                expected_super_strings
            )

    def test_no_super_string(self):
        empty_strings = ["", "", ""]
        super_string = shortest_common_superstring(empty_strings)
        expected_super_string = ""
        self.assertEqual(
            super_string,
            expected_super_string
        )

        no_strings = []
        self.assertIsNone(shortest_common_superstring(no_strings))


res = unittest.main(argv=[''], verbosity=3, exit=False)
assert len(res.result.failures) == 0

test_found_superstring (__main__.ShortestSuperStringTestCase) ... ok
test_no_super_string (__main__.ShortestSuperStringTestCase) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.014s

OK
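
The merging step inside `shortest_common_superstring` can be made concrete with a small sketch that builds the superstring for one fixed ordering, using the strings from the test case above. The helpers `longest_overlap` and `merge_in_order` are hypothetical names introduced only for this illustration. They are simplified stand-ins, not the functions defined above.

```python
from typing import Sequence


def longest_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    # Hypothetical helper: try the longest candidate first.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(strings: Sequence[str]) -> str:
    """Glue the strings together in the given order, overlapping neighbours."""
    if not strings:
        return ""
    sup = strings[0]
    for prev, nxt in zip(strings, strings[1:]):
        # Append only the part of nxt not already covered by prev's suffix.
        sup += nxt[longest_overlap(prev, nxt):]
    return sup


given_order = ["ACGGTACGAGC", "GAGCTTCGGA", "GACACGG"]
better_order = ["GACACGG", "ACGGTACGAGC", "GAGCTTCGGA"]
print(len(merge_in_order(given_order)))   # 22
print(merge_in_order(better_order))       # GACACGGTACGAGCTTCGGA (length 20)
```

The two orderings give superstrings of different lengths, which is why the brute-force search has to try every permutation.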


In [3]:
inputs = [
    "CCT",
    "CTT",
    "TGC",
    "TGG",
    "GAT",
    "ATT"
]

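# shortest_common_superstring keeps only the first shortest superstring
# it meets, so feed it every ordering of the inputs and collect the
# distinct results.
common_ss: set[str] = set()
for permut_input in itertools.permutations(inputs):
    scs = shortest_common_superstring(permut_input)
    if scs is not None:
        common_ss.add(scs)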


In [5]:
print([len(ss) for ss in common_ss])
print("There are %d scs" % len(common_ss))

[11, 11, 11, 11]
There are 4 scs


The answer to Q1 is that the shortest common superstring has length __11__.

The answer to Q2 is that there are __4__ different shortest common superstrings.
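
The count above comes from re-running `shortest_common_superstring` on every ordering of the inputs. A more direct alternative, sketched here under the assumption that ties should simply be collected during a single pass over the permutations, is shown below. The names `suffix_prefix_overlap` and `all_shortest_superstrings` are hypothetical and are not part of the course code.

```python
import itertools
from typing import Optional, Sequence


def suffix_prefix_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_shortest_superstrings(ss: Sequence[str]) -> set[str]:
    """Collect every distinct shortest superstring found by brute force."""
    best: set[str] = set()
    best_len: Optional[int] = None
    if not ss:
        return best
    for perm in itertools.permutations(ss):
        sup = perm[0]
        for prev, nxt in zip(perm, perm[1:]):
            sup += nxt[suffix_prefix_overlap(prev, nxt):]
        if best_len is None or len(sup) < best_len:
            # Strictly shorter: discard earlier candidates.
            best, best_len = {sup}, len(sup)
        elif len(sup) == best_len:
            # Tie: keep it alongside the others.
            best.add(sup)
    return best


found = all_shortest_superstrings(["CCT", "CTT", "TGC", "TGG", "GAT", "ATT"])
print(sorted(found))
print("There are %d scs of length %d" % (len(found), len(next(iter(found)))))
```

This visits the N! permutations once rather than once per input ordering, and it should agree with the result above: four superstrings, each of length 11.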