# Week 2

The Boyer-Moore algorithm itself is not examined in this notebook; it is explored separately as stand-alone, buildable executables in C and C++. It is neat to see that Boyer-Moore ships with the C++17 standard library (as `std::boyer_moore_searcher`), and the C implementation is not too hard either.

This notebook instead uses an implementation provided by the course instructors. A nice feature of the module is that we can run its unit tests ourselves!
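
For intuition about what this preprocessing buys, here is a minimal sketch of the bad character rule, one of the shift heuristics Boyer-Moore consults at a mismatch. The function name `bad_character_shift` is hypothetical and this is not the instructors' code: it rescans the pattern on every call, whereas a preprocessor such as `BoyerMoore` would typically precompute a lookup table so the same answer comes back in constant time.

```python
def bad_character_shift(pat: str, mismatch_index: int, text_char: str) -> int:
    """Shift suggested by the bad character rule (illustrative sketch only)."""
    # Rightmost occurrence of the mismatched text character to the left of
    # the mismatch position; str.rfind returns -1 when there is none.
    rightmost = pat.rfind(text_char, 0, mismatch_index)
    # Slide the pattern so that occurrence lines up with the text character,
    # or entirely past the text character when it does not occur.
    return mismatch_index - rightmost


# "word" mismatches at its last position (index 3) against different text characters
print(bad_character_shift("word", 3, "r"))  # 'r' sits at index 2 -> shift 1
print(bad_character_shift("word", 3, "w"))  # 'w' sits at index 0 -> shift 3
print(bad_character_shift("word", 3, "e"))  # 'e' is absent       -> shift 4
```

The larger the shift, the more alignments are skipped outright, which is what the alignment counts later in this notebook measure.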

In [27]:
from pathlib import Path
import unittest

from Bio import SeqIO

from src.bm_preproc import BoyerMoore

In [None]:
# This is a part of the assignment. The BoyerMoore object preprocesses the pattern


In [81]:
!pytest src/bm_preproc.py::TestBoyerMoorePreproc -v

platform darwin -- Python 3.10.8, pytest-7.2.0, pluggy-1.0.0 -- /Users/mhogan/Documents/algorithms-genomic-sequencing/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/mhogan/Documents/algorithms-genomic-sequencing
plugins: anyio-3.6.2
collected 12 items

src/bm_preproc.py::TestBoyerMoorePreproc::test_big_l_prime_1 PASSED      [  8%]
src/bm_preproc.py::TestBoyerMoorePreproc::test_big_l_prime_2 PASSED      [ 16%]
src/bm_preproc.py::TestBoyerMoorePreproc::test_good_suffix_match_mismatch_1 PASSED [ 25%]
src/bm_preproc.py::TestBoyerMoorePreproc::test_good_suffix_table_1 PASSED [ 33%]
src/bm_preproc.py::TestBoyerMoorePreproc::test_good_suffix_table_2 PASSED [ 41%]
src/bm_preproc.py::TestBoyerMoorePreproc::test_n_1 PASSED                [ 50%]
src/bm_preproc.py::TestBoyerMoorePreproc::test_n_2 PASS

Neat! All the unit tests passed, so we do not have to worry about differences between Python 2 and Python 3.

In [1]:
!wget http://d28rh4a8wq0iu5.cloudfront.net/ads1/data/chr1.GRCh38.excerpt.fasta
!mv chr1.GRCh38.excerpt.fasta week2hw

--2022-11-28 14:04:37--  http://d28rh4a8wq0iu5.cloudfront.net/ads1/data/chr1.GRCh38.excerpt.fasta
Resolving d28rh4a8wq0iu5.cloudfront.net (d28rh4a8wq0iu5.cloudfront.net)... 108.156.200.25, 108.156.200.104, 108.156.200.204, ...
Connecting to d28rh4a8wq0iu5.cloudfront.net (d28rh4a8wq0iu5.cloudfront.net)|108.156.200.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 810105 (791K) [application/octet-stream]
Saving to: ‘chr1.GRCh38.excerpt.fasta’


2022-11-28 14:04:38 (12.9 MB/s) - ‘chr1.GRCh38.excerpt.fasta’ saved [810105/810105]



## Task

Implement versions of the naive exact matching and Boyer-Moore algorithms that additionally count and return (a) the number of character comparisons performed and (b) the number of alignments tried. Roughly speaking, these measure how much work the two different algorithms are doing.
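
As a quick sanity check on these two counters: for a text of length n and a non-empty pattern of length m, the naive algorithm always tries exactly n - m + 1 alignments and performs between 1 and m comparisons per alignment. A small sketch (the helper name `naive_count_bounds` is made up for illustration) turns that into bounds any correct naive implementation should respect:

```python
def naive_count_bounds(text_len: int, pat_len: int) -> tuple[int, int, int]:
    """Return (alignments, min_comparisons, max_comparisons) for naive matching.

    Assumes a non-empty pattern (pat_len >= 1).
    """
    # One alignment per starting offset; none if the pattern is longer than the text
    alignments = max(text_len - pat_len + 1, 0)
    # At least one and at most pat_len comparisons happen per alignment
    return alignments, alignments, alignments * pat_len


text = "there would have been a time for such a word"
alignments, fewest, most = naive_count_bounds(len(text), len("word"))
print(alignments, fewest, most)  # 41 41 164
```

Boyer-Moore has no such fixed alignment count, because the number of alignments it tries depends on how far each mismatch lets it shift.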

In [25]:
def naive(
    pat: str,
    ref: str,
    full: bool = False
):
    """Find all alignments of a pattern against a reference string
    using naive exact matching

    Parameters
    ----------
    pat : str
        Pattern string
    ref : str
        Reference string
    full : bool
        Return full comparison (default=False)

    Returns
    -------
    occurrences : list[int]
        Alignment offsets
    alignments : int
        Number of alignments tried. Returned only if `full` is True.
    comparisons : int
        Number of character comparisons performed. Returned only if `full` is True.

    """
    comparisons, alignments = 0, 0
    occurrences: list[int] = []
    for i in range(len(ref) - len(pat) + 1):  # loop over alignments
        alignments += 1
        match = True
        for j in range(len(pat)):  # loop over characters
            is_char_match = ref[i + j] == pat[j]
            comparisons += 1
            if not is_char_match:
                match = False
                break
        if match:
            occurrences.append(i)  # all chars matched; record
    if not full:
        return occurrences
    return occurrences, alignments, comparisons


def boyer_moore(
    pat: str,
    p_bm: BoyerMoore,
    tex: str,
    full: bool = False
):
    """Run a pattern search using the Boyer-Moore algorithm

    Parameters
    ----------
    pat : str
        Pattern
    p_bm : BoyerMoore
        Preprocessor for the pattern
    tex : str
        Text to search
    full : bool
        Return full comparison (default=False)

    Returns
    -------
    occurrences : list[int]
        Alignment offsets
    alignments : int
        Number of alignments tried. Returned only if `full` is True.
    comparisons : int
        Number of character comparisons performed. Returned only if `full` is True.

    """
    index_i = 0
    comparisons, alignments = 0, 0
    occurrences: list[int] = []
    while index_i < len(tex) - len(pat) + 1:
        alignments += 1
        shift = 1
        mismatched = False
        for index_j in range(len(pat) - 1, -1, -1):
            is_char_match = pat[index_j] == tex[index_i + index_j]
            comparisons += 1
            if not is_char_match:
                skip_bc = p_bm.bad_character_rule(index_j, tex[index_i + index_j])
                skip_gs = p_bm.good_suffix_rule(index_j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(index_i)
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)
        index_i += shift
    if not full:
        return occurrences
    return occurrences, alignments, comparisons


In [26]:
class NaiveWithCountsTestCase(unittest.TestCase):
    """Test the occurrences, alignments, and character comparisons of the Naive algorithm"""

    def test_short_patterns(self):
        p_1 = 'word'
        t_1 = 'there would have been a time for such a word'
        naive_results_1 = naive(p_1, t_1, full=True)
        self.assertListEqual(
            list(naive_results_1),
            [[40], 41, 46]
        )

        p_2 = 'needle'
        t_2 = 'needle need noodle needle'
        naive_results_2 = naive(p_2, t_2, full=True)
        self.assertListEqual(
            list(naive_results_2),
            [[0, 19], 20, 35]
        )


class BoyerMooreWithCountsTestCase(unittest.TestCase):
    """Test the occurrences, alignments, and character comparisons of the Boyer-Moore algorithm"""

    def test_short_patterns(self):
        lowercase_alphabet = (
            "".join([chr(index) for index in range(ord("a"), ord("z")+1)])  # letters
            + " "  # empty space
        )

        p_1 = "word"
        t_1 = "there would have been a time for such a word"
        p_bm_1 = BoyerMoore(p_1, lowercase_alphabet)
        bm_results_1 = boyer_moore(p_1, p_bm_1, t_1, full=True)
        self.assertListEqual(
            list(bm_results_1),
            [[40], 12, 15]
        )

        p_2 = "needle"
        t_2 = "needle need noodle needle"
        p_bm_2 = BoyerMoore(p_2, lowercase_alphabet)
        bm_results_2 = boyer_moore(p_2, p_bm_2, t_2, full=True)
        self.assertListEqual(
            list(bm_results_2),
            [[0, 19], 5, 18]
        )


res = unittest.main(argv=[''], verbosity=3, exit=False)
# wasSuccessful() is False for both failed assertions and unexpected errors
assert res.result.wasSuccessful()

test_short_patterns (__main__.BoyerMooreWithCountsTestCase) ... ok
test_short_patterns (__main__.NaiveWithCountsTestCase) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.017s

OK


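Let's now read in the FASTA file containing the excerpt of human chromosome 1. Note that the variable `alu_sequences` below holds this chromosome excerpt; the Alu-derived sequence is the query pattern used in the quiz.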

In [28]:
with Path("week2hw/chr1.GRCh38.excerpt.fasta").open(mode="r") as fh:
    alu_sequences: SeqIO.SeqRecord = list(SeqIO.parse(fh, "fasta"))[0]

# Quiz

## Q1
How many alignments does the naive exact matching algorithm try when matching the string
`GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG`
(derived from human Alu sequences) to the excerpt of human chromosome 1?  (Don't consider reverse complements.)


First, verify that the sequence loaded into memory is from humans (Homo sapiens) and is from chromosome 1.

In [33]:
print(str(alu_sequences))

ID: CM000663.2_excerpt
Name: CM000663.2_excerpt
Description: CM000663.2_excerpt EXCERPT FROM CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly
Number of features: 0
Seq('TTGAATGCTGAAATCAGCAGGTAATATATGATAATAGAGAAAGCTATCCCGAAG...AGG')


Now run the query

In [34]:
p_1 = "GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG"
_, align_1, _ = naive(p_1, str(alu_sequences.seq), full=True)
print(f"Number of alignments: {align_1}")

Number of alignments: 799954
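
This count can be cross-checked without rerunning the search, since naive matching tries one alignment per starting offset, n - m + 1 in total. Working backwards from the reported number suggests the excerpt holds 800,000 bases; that length is inferred from the output above rather than read from the file, so treat it as an assumption.

```python
pattern = "GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG"
reported_alignments = 799_954  # value printed above

# alignments = n - m + 1, so n = alignments + m - 1
inferred_text_length = reported_alignments + len(pattern) - 1
print(len(pattern), inferred_text_length)  # 47 800000
```

In the notebook itself, `len(alu_sequences.seq)` should report the same length if this reasoning holds.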


## Q2
How many character comparisons does the naive exact matching algorithm try when matching the string
`GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG`
(derived from human Alu sequences) to the excerpt of human chromosome 1?  (Don't consider reverse complements.)

In [35]:
p_2 = "GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG"
_, _, comp_2 = naive(p_2, str(alu_sequences.seq), full=True)
print(f"Number of comparisons: {comp_2}")

Number of comparisons: 984143
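
Dividing this by the number of alignments from Q1 gives the average work per alignment. The two figures below are copied from the outputs above:

```python
naive_alignments = 799_954   # reported in Q1
naive_comparisons = 984_143  # reported in Q2

comparisons_per_alignment = naive_comparisons / naive_alignments
print(f"{comparisons_per_alignment:.2f} comparisons per alignment")  # 1.23
```

An average this close to 1 means most alignments are rejected at the first or second character, far short of the 47 comparisons a full match needs.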


## Q3
How many alignments does Boyer-Moore try when matching the string
`GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG`
(derived from human Alu sequences) to the excerpt of human chromosome 1?  (Don't consider reverse complements.)

In [37]:
p_3 = "GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG"
bm_indexer_3 = BoyerMoore(p_3)
_, align_3, _ = boyer_moore(p_3, bm_indexer_3, str(alu_sequences.seq), full=True)
print(f"Number of alignments: {align_3}")

Number of alignments: 127974
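
To put this number next to the naive result from Q1, a short calculation (using only the figures reported above) shows how many alignments Boyer-Moore avoided on this input:

```python
naive_alignments = 799_954  # reported in Q1
bm_alignments = 127_974     # reported in Q3

reduction_factor = naive_alignments / bm_alignments
skipped_fraction = 1 - bm_alignments / naive_alignments
print(f"{reduction_factor:.2f}x fewer alignments, {skipped_fraction:.0%} skipped")
# 6.25x fewer alignments, 84% skipped
```

These ratios apply to this pattern and this excerpt only; other patterns and texts will shift by different amounts.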
