# Algorithms for Assembly

In the fourth and final week of this course, we learn about algorithms for the assembly problem. However, as the "Third Law of Assembly" dictates, repetitive sequences make assembly very difficult. We tackle this problem by building on the work from the previous weeks.

In [22]:
from collections import Counter
import itertools
from pathlib import Path
from typing import Optional, Sequence
import unittest

from Bio.SeqIO import parse, SeqRecord
import numpy as np

# Homework

## Question 1

In a practical, we saw the `shortest_common_superstring` function (copied below along with `overlap`) for finding the shortest common superstring of a set of strings.

It's possible for there to be multiple different shortest common superstrings for the same set of input strings.

What is the length of the shortest common superstring of the following strings?

"CCT", "CTT", "TGC", "TGG", "GAT", "ATT"

## Question 2

How many different shortest common superstrings are there for the input strings given in the previous question?

In [2]:
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Return length of the longest suffix of 'a' matching
    a prefix of 'b' that is at least 'min_length' characters
    long. If no such overlap exists, return 0.

    Parameters
    ----------
    a : str
        String to test its suffix
    b : str
        String to test its prefix
    min_length : int
        Minimum length of match

    Returns
    -------
    int
        Longest overlap between suffix of 'a' with prefix of 'b'. Zero (0) otherwise
    """
    start = 0  # start all the way at the left
    # MGH addition, min length must be positive
    if min_length < 1:
        raise ValueError("min_length must be a positive integer")
    # MGH addition, edge case if len(b) < min_length, then should return 0
    if len(b) < min_length:
        return 0
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1  # move just past previous match


def shortest_common_superstring(ss: Sequence[str]) -> Optional[str]:
    """Using a brute-force search, find the shortest common superstring
    of given strings, which must be the same length. The complexity of
    the method is O(N!) where N is the number of input strings.

    Parameters
    ----------
    ss : Sequence[str]
        Sequence of strings

    Returns
    -------
    Optional[str]
        Shortest superstring or None if no input
    """
    shortest_sup: Optional[str] = None
    # MGH addition, return None when there are no input strings
    if len(ss) == 0:
        return shortest_sup
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]  # superstring starts as first string
        for i in range(len(ss)-1):
            # overlap adjacent strings A and B in the permutation
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            # add non-overlapping portion of B to superstring
            sup += ssperm[i+1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup  # found shorter superstring
    return shortest_sup  # return shortest


class ShortestSuperStringTestCase(unittest.TestCase):

    def test_found_superstring(self):
        strings = [
            "ACGGTACGAGC",
            "GAGCTTCGGA",
            "GACACGG"
        ]
        super_string = shortest_common_superstring(strings)
        expected_super_string = "GACACGGTACGAGCTTCGGA"
        self.assertEqual(
            super_string,
            expected_super_string
        )

        strings = "ABC", "BCA", "CAB"
        expected_super_strings = ["ABCAB", "BCABC", "CABCA"]
        for input_combo in itertools.permutations(strings, 3):
            super_string = shortest_common_superstring(input_combo)
            self.assertIn(
                super_string,
                expected_super_strings
            )

    def test_no_super_string(self):
        empty_strings = ["", "", ""]
        super_string = shortest_common_superstring(empty_strings)
        expected_super_string = ""
        self.assertEqual(
            super_string,
            expected_super_string
        )

        no_strings = []
        self.assertIsNone(shortest_common_superstring(no_strings))


res = unittest.main(argv=[''], verbosity=3, exit=False)
# wasSuccessful() also catches tests that raised errors, not only failed assertions
assert res.result.wasSuccessful()

test_found_superstring (__main__.ShortestSuperStringTestCase) ... ok
test_no_super_string (__main__.ShortestSuperStringTestCase) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.014s

OK


In [3]:
inputs = [
    "CCT",
    "CTT",
    "TGC",
    "TGG",
    "GAT",
    "ATT"
]

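# shortest_common_superstring keeps the first shortest superstring it finds,
# so ties are broken by input order. Re-running it on every ordering of the
# inputs lets different shortest common superstrings surface.
common_ss: set[str] = set()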
for permut_input in itertools.permutations(inputs, len(inputs)):
    scs = shortest_common_superstring(list(permut_input))
    if scs is not None:
        common_ss.add(scs)


In [5]:
print([len(ss) for ss in common_ss])
print("There are %d scs" % len(common_ss))

[11, 11, 11, 11]
There are 4 scs


The answer to Q1 is that the shortest common superstring has length __11__.

The answer to Q2 is that there are __4__ distinct shortest common superstrings.
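
As a cross-check, the same two answers can be obtained in a single pass over the permutations by keeping every superstring that ties for the minimum length. The sketch below is self-contained; `overlap_len` and `all_shortest_superstrings` are hypothetical helper names, not functions from the course material, and the approach assumes equal-length inputs where no string contains another.

```python
import itertools


def overlap_len(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if none)."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def all_shortest_superstrings(strings):
    """Merge every ordering with maximal overlaps and keep all minimal-length results."""
    best = set()
    best_len = None
    for perm in itertools.permutations(strings):
        sup = perm[0]
        for prev, nxt in zip(perm, perm[1:]):
            # append only the part of nxt not already covered by prev
            sup += nxt[overlap_len(prev, nxt):]
        if best_len is None or len(sup) < best_len:
            best, best_len = {sup}, len(sup)  # strictly shorter: start over
        elif len(sup) == best_len:
            best.add(sup)  # tie: another shortest superstring
    return best


found = all_shortest_superstrings(["CCT", "CTT", "TGC", "TGG", "GAT", "ATT"])
print(sorted(found))
print(len(found), {len(s) for s in found})
```

This visits the 720 orderings once instead of once per input ordering, and it does not depend on how ties are broken.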

## Question 3

Download this FASTQ file containing synthetic sequencing reads from a mystery virus:

https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ads1_week4_reads.fq

All the reads are the same length (100 bases) and are exact copies of substrings from the forward strand of the virus genome.  You don't have to worry about sequencing errors, ploidy, or reads coming from the reverse strand.

Assemble these reads using one of the approaches discussed, such as greedy shortest common superstring.  Since there are many reads, you might consider ways to make the algorithm faster, such as the one discussed in the programming assignment in the previous module.

How many As are there in the full, assembled genome?

Hint: the virus genome you are assembling is exactly 15,894 bases long

In [7]:
!wget https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ads1_week4_reads.fq
!mkdir -p week4hw
!mv ads1_week4_reads.fq week4hw

--2022-12-06 18:20:52--  https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ads1_week4_reads.fq
Resolving d28rh4a8wq0iu5.cloudfront.net (d28rh4a8wq0iu5.cloudfront.net)... 108.156.200.29, 108.156.200.104, 108.156.200.204, ...
Connecting to d28rh4a8wq0iu5.cloudfront.net (d28rh4a8wq0iu5.cloudfront.net)|108.156.200.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 395781 (387K) [video/m2ts]
Saving to: ‘ads1_week4_reads.fq’


2022-12-06 18:20:53 (1.69 MB/s) - ‘ads1_week4_reads.fq’ saved [395781/395781]



In [11]:
synthetic_reads_fq = Path("week4hw/ads1_week4_reads.fq")
with synthetic_reads_fq.open("r") as fh:
    synthetic_reads: list[SeqRecord] = list(parse(fh, "fastq"))
    synthetic_reads_str: list[str] = [str(read.seq) for read in synthetic_reads]

Let's see how synthetic this data is.

In [20]:
quality_counter = Counter()
for syn_read in synthetic_reads:
    letter_anno = syn_read.letter_annotations["phred_quality"]
    quality_counter.update(letter_anno)

print(quality_counter)

Counter({40: 188100})


Every base in every read has a Phred quality of 40, which corresponds to an error probability of $10^{-Q/10} = 10^{-4}$. This data is clearly fake/synthetic!
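
The Phred conversion used here is $P = 10^{-Q/10}$. A minimal sketch of it follows, with hypothetical helper names; the character decoding assumes the Sanger (Phred+33) encoding, which is what Biopython's `"fastq"` format expects.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to a base-call error probability."""
    return 10 ** (-q / 10)


def phred33_char_to_q(ch: str) -> int:
    """Decode one FASTQ quality character, assuming Phred+33 encoding."""
    return ord(ch) - 33


print(phred_to_error_prob(40))   # error probability for Q = 40
print(phred33_char_to_q("I"))    # 'I' is ASCII 73, so Q = 40
```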

Let's implement the greedy shortest common superstring method. It is much faster than the brute-force search, but it is not guaranteed to return the shortest superstring.

In [70]:
def pick_maximal_overlap(
    reads: Sequence[str],
    min_length: int
) -> tuple[str, str, int]:
    """Find the pair of reads with the longest suffix/prefix overlap.

    Parameters
    ----------
    reads : Sequence[str]
        Read sequences
    min_length : int
        Minimum required overlap length

    Returns
    -------
    tuple[str, str, int]
        Read whose suffix overlaps, read whose prefix overlaps, and the
        overlap length. Returns ("", "", 0) if no pair overlaps by at
        least 'min_length'.
    """
    best_read_a, best_read_b = "", ""
    best_overlap_len = 0
    for read_a, read_b in itertools.permutations(list(reads), 2):
        overlap_len = overlap(read_a, read_b, min_length)
        if overlap_len > best_overlap_len:
            best_read_a, best_read_b = read_a, read_b
            best_overlap_len = overlap_len
    return best_read_a, best_read_b, best_overlap_len


def greedy_shortest_common_superstring(
    reads: Sequence[str],
    min_length: int
) -> Optional[str]:
    """Greedily merge the pair of reads with the longest overlap until
    no pair overlaps by at least 'min_length', then concatenate what is
    left. The result is a common superstring but is not guaranteed to
    be the shortest one.

    Parameters
    ----------
    reads : Sequence[str]
        Sequence of strings
    min_length : int
        Minimum overlap length required to merge two reads

    Returns
    -------
    Optional[str]
        Greedy superstring or None if no input
    """
    scs: Optional[str] = None
    reads = list(reads)
    if len(reads) == 0:
        return scs
    read_a, read_b, overlap_len = pick_maximal_overlap(reads, min_length)
    while overlap_len > 0:
        reads.remove(read_a)
        reads.remove(read_b)
        replacement = read_a + read_b[overlap_len:]
        reads.append(replacement)
        read_a, read_b, overlap_len = pick_maximal_overlap(reads, min_length)
    scs = "".join(reads)
    return scs


class GreedyShortestCommonSuperstringTestCase(unittest.TestCase):

    def test_greedy_scs(self):

        strings = "ABC", "BCA", "CAB"
        expected_super_strings = ["ABCAB", "BCABC", "CABCA"]
        for input_combo in itertools.permutations(strings, 3):
            greedy_scs = greedy_shortest_common_superstring(input_combo, 2)
            self.assertIn(
                greedy_scs,
                expected_super_strings
            )



res = unittest.main(argv=[''], verbosity=3, exit=False)
# wasSuccessful() also catches tests that raised errors, not only failed assertions
assert res.result.wasSuccessful()

test_greedy_scs (__main__.GreedyShortestCommonSuperstringTestCase) ... ok
test_found_superstring (__main__.ShortestSuperStringTestCase) ... ok
test_no_super_string (__main__.ShortestSuperStringTestCase) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.013s

OK
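
The quality counter above saw 188,100 values at 100 bases per read, so there are 1,881 reads, and comparing every ordered pair costs roughly 1,881 × 1,880 ≈ 3.5 million `overlap` calls per merge round. One way to cut this down, in the spirit of the k-mer index from the previous module, is to compare a read only against reads that contain its length-k suffix. The following is a rough, self-contained sketch under that assumption; all function names are hypothetical, and rebuilding the index on every round is a simplification rather than an optimized design.

```python
from collections import defaultdict


def suffix_prefix_overlap(a: str, b: str, min_length: int) -> int:
    """Longest suffix of a equal to a prefix of b, or 0 if shorter than min_length."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def build_kmer_index(reads, k):
    """Map each k-mer to the set of indices of the reads that contain it."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    return index


def best_overlap_indexed(reads, k):
    """Return (i, j, overlap) for the pair with the longest overlap of at least k."""
    index = build_kmer_index(reads, k)
    best = (-1, -1, 0)
    for i, read_a in enumerate(reads):
        # any read overlapping read_a's suffix by >= k must contain its last k bases
        for j in index.get(read_a[-k:], ()):
            if i == j:
                continue
            olen = suffix_prefix_overlap(read_a, reads[j], k)
            if olen > best[2]:
                best = (i, j, olen)
    return best


def greedy_scs_indexed(reads, k):
    """Greedy merge loop that uses the k-mer index to find candidate pairs."""
    reads = list(reads)
    while True:
        i, j, olen = best_overlap_indexed(reads, k)
        if olen == 0:
            break  # no remaining pair overlaps by at least k
        merged = reads[i] + reads[j][olen:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return "".join(reads)


print(greedy_scs_indexed(["ACGGTACGAGC", "GAGCTTCGGA", "GACACGG"], 3))
```

Ties between equally long overlaps may be broken differently than in `pick_maximal_overlap`, so on repetitive input the two greedy variants are not guaranteed to return the same string.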
