# Burrows Wheeler Transform - using FM-index
---

## Implement BetterBWMatching http://rosalind.info/problems/ba9m/

<br>

If you implemented BWMATCHING in “Implement BWMatching”, you probably found the algorithm to be slow. The reason for its sluggishness is that **updating the pointers top and bottom is time-intensive, since it requires examining every symbol in LastColumn between top and bottom at each step**. 

To improve *BWMATCHING*, we introduce a function $Count_{symbol}(i, LastColumn)$, which returns the number of occurrences of symbol in the first i positions of LastColumn.

For example, $Count_{n}(10, \text{"smnpbnnaaaaa\#a"}) = 3$ and $Count_{a}(4, \text{"smnpbnnaaaaa\#a"}) = 0$.
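
As a minimal sketch of this definition, the function below computes the count by slicing and reproduces the two values above. The name `count_symbol` is an illustrative choice rather than anything from the original text, and slicing costs time proportional to i, so a practical index would precompute these counts instead.

```python
def count_symbol(symbol, i, last_column):
    """Number of occurrences of symbol in the first i positions of last_column."""
    # Naive version: scans last_column[:i] on every call.
    return last_column[:i].count(symbol)


last_column = "smnpbnnaaaaa#a"  # '#' is the end-of-text marker, as in the example

print(count_symbol("n", 10, last_column))  # 3
print(count_symbol("a", 4, last_column))   # 0
```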

The pointer-update lines of BWMATCHING (highlighted in green in the original pseudocode) can be compactly described without the First-to-Last mapping by the following two lines:

    top ← position of symbol with rank Count_symbol(top, LastColumn) + 1 in FirstColumn
    bottom ← position of symbol with rank Count_symbol(bottom + 1, LastColumn) in FirstColumn
    

Define FirstOccurrence(symbol) as the first position of symbol in FirstColumn. If Text = "panamabananas#", then FirstColumn is "#aaaaaabmnnnps", and the array holding all values of FirstOccurrence is [0, 1, 7, 8, 9, 12, 13] (for the symbols #, a, b, m, n, p, s in that order). For DNA strings of any length, the array FirstOccurrence contains only five elements: one each for A, C, G, T, and the end-of-text marker.
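
A short sketch of how FirstOccurrence could be derived from LastColumn alone, assuming FirstColumn is simply LastColumn sorted. The helper name `first_occurrence` is hypothetical, and `#` stands in for the end-of-text marker as in the example above.

```python
def first_occurrence(last_column):
    """Map each symbol to its first position in FirstColumn (the sorted LastColumn)."""
    positions = {}
    for i, symbol in enumerate(sorted(last_column)):
        # setdefault keeps the first index seen and ignores later ones.
        positions.setdefault(symbol, i)
    return positions


last_column = "smnpbnnaaaaa#a"  # BWT of "panamabananas#"
first_column = "".join(sorted(last_column))
fo = first_occurrence(last_column)

print(first_column)       # #aaaaaabmnnnps
print(list(fo.values()))  # [0, 1, 7, 8, 9, 12, 13]
```

Because the dictionary has one entry per distinct symbol, its size depends only on the alphabet, not on the length of the text.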

The two lines of pseudocode from the previous step can now be rewritten as follows:

    top ← FirstOccurrence(symbol) + Count_symbol(top, LastColumn)
    bottom ← FirstOccurrence(symbol) + Count_symbol(bottom + 1, LastColumn) − 1

In the process of simplifying the green lines of pseudocode from *BWMATCHING*, we have also eliminated the need for both FirstColumn and LastToFirst, resulting in a more efficient algorithm called *BETTERBWMATCHING*.

```
BETTERBWMATCHING(FirstOccurrence, LastColumn, Pattern, Count)
    top ← 0
    bottom ← |LastColumn| − 1
    while top ≤ bottom
        if Pattern is nonempty
            symbol ← last letter in Pattern
            remove last letter from Pattern
            if positions from top to bottom in LastColumn contain an occurrence of symbol
                top ← FirstOccurrence(symbol) + Count_symbol(top, LastColumn)
                bottom ← FirstOccurrence(symbol) + Count_symbol(bottom + 1, LastColumn) − 1
            else
                return 0
        else
            return bottom − top + 1
```
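
The pseudocode above can be transcribed almost line for line into Python. This is a sketch for checking the logic, not an efficient implementation: the function name `better_bw_matching` is illustrative, bounds are inclusive as in the pseudocode, and Count is computed naively by slicing rather than read from a precomputed table.

```python
def better_bw_matching(last_column, pattern):
    """Count occurrences of pattern in the text whose BWT is last_column."""
    # FirstOccurrence: first index of each symbol in the sorted last column.
    first_occurrence = {}
    for i, s in enumerate(sorted(last_column)):
        first_occurrence.setdefault(s, i)

    def count(symbol, i):
        # Naive Count_symbol(i, LastColumn): O(i) per call, for clarity only.
        return last_column[:i].count(symbol)

    top, bottom = 0, len(last_column) - 1  # inclusive bounds
    while top <= bottom:
        if not pattern:
            return bottom - top + 1
        symbol, pattern = pattern[-1], pattern[:-1]
        if symbol not in last_column[top:bottom + 1]:
            return 0
        new_top = first_occurrence[symbol] + count(symbol, top)
        bottom = first_occurrence[symbol] + count(symbol, bottom + 1) - 1
        top = new_top
    return 0


# BWT of "panamabananas#", with '#' as the end-of-text marker.
print(better_bw_matching("smnpbnnaaaaa#a", "ana"))  # 3
print(better_bw_matching("smnpbnnaaaaa#a", "xyz"))  # 0
```

For "ana" the pointers move from (0, 13) to (1, 6), then (9, 11), then (3, 5), which leaves three matching rows.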

**Implement BetterBWMatching**

**Given:** A string BWT(Text), followed by a collection of strings Patterns.

**Return:** A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text.

**Sample Dataset**

    GGCGCCGC$TAGTCACACACGCCGTA
    ACC CCG CAG

**Sample Output**

    1 2 1

In [4]:
import time
from IPython.display import display, Markdown, Latex

In [5]:
class BurrowsWheeler: 
    """Counts pattern occurrences in a text, given only the text's BWT string"""

    def __init__(self, bw):
        """Precomputes first_occurence for each symbol and the cumulative symbol count arrays"""

        self.bw = bw
        self.first_occur = self.first_occurence(bw)
        self.count = self.get_count(bw)

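    def pattern_match(self, Pattern):
        """Count occurrences of Pattern using first_occur and count lookups (constant time per symbol)"""

        # Half-open interval [top, bottom) of rows that match the suffix read so far.
        top = 0
        bottom = len(self.bw)
        # Extend the match one symbol at a time, from the end of Pattern backwards.
        for symbol in reversed(Pattern):
            if symbol not in self.first_occur:
                return 0  # symbol never occurs in the text
            top = self.first_occur[symbol] + self.count[symbol][top]
            bottom = self.first_occur[symbol] + self.count[symbol][bottom]
            if top >= bottom:
                return 0  # interval is empty: no match
        return bottom - top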

    @staticmethod
    def first_occurence(bw):
        """Map each symbol to the index where it first occurs in the first column, i.e. sorted(bw)"""
        first_occur = {}
        for i, symbol in enumerate(sorted(bw)):
            # Keep only the first (smallest) index seen for each symbol.
            if symbol not in first_occur:
                first_occur[symbol] = i
        return first_occur

    @staticmethod
    def get_count(bw):
        """count[s][i] is the number of occurrences of symbol s in bw[:i], for every s in the alphabet"""
        alpha = sorted(set(bw))
        count = {symbol: [0] for symbol in alpha}
        for symbol in bw:
            for s in alpha:
                # Extend every running total; only the matching symbol's total grows.
                count[s].append(count[s][-1] + (symbol == s))
        return count


In [6]:
import time

start=time.time()
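with open("data/BetterBWMatching.txt") as infile:
    infile.readline()                      # skip the header line before the input
    Text = infile.readline().strip()       # holds BWT(Text), despite the variable name
    Patterns = infile.readline().strip().split(' ')
    infile.readline()                      # skip the header line before the expected output
    Solution = infile.readline().strip()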

print('parsing text file, time(s):', round(time.time()-start,4))
print('len text:', len(Text))
print('len Patterns:', len(Patterns))

start = time.time()
bw = BurrowsWheeler(Text)
print('initializing bwt object, time:',round(time.time()-start,4))

start = time.time()
matches = [bw.pattern_match(pattern) for pattern in Patterns]
print('new algo with FirstOccurence and Count[symbol][i] -> time (s):', round(time.time()-start,4))

Matches_array =' '.join(str(i) for i in matches)
print("Matches_array == Solution?", Matches_array == Solution)

parsing text file, time(s): 0.002
len text: 10001
len Patterns: 3785
initializing bwt object, time: 0.0181
new algo with FirstOccurence and Count[symbol][i] -> time (s): 0.1009
Matches_array == Solution? True
