# Burrows-Wheeler Transforms and Pattern Matching (BA. 6 wk 2)

---

<br>

### BWT Package
---


* [ ] Encode BWT - ba9i
* [ ] Decode BWT - ba9j
* [ ] Last to First Array - ba9k
* [ ] Pattern Match in BWT - ba9l

<br>

---

### 1. Encoding text into Burrows-Wheeler Transform (bwt) (BA9I)
<hr/>


Rosalind BA9I
- A deliberately naive algorithm: it stores every cyclic rotation of the text, so it builds a matrix of size $|text|^2$ and saves no space.


<br>

In [222]:
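def naive_bwt(text):
    """Return the BWT of text by sorting all of its cyclic rotations."""
    # Append the end-of-text sentinel if it is missing (also safe for empty input).
    if not text.endswith('$'):
        text += '$'

    # Every cyclic rotation of the text: the |text| x |text| matrix.
    cycles = [text[i:] + text[:i] for i in range(len(text))]

    # The transform is the last column of the sorted rotation matrix.
    return ''.join(cycle[-1] for cycle in sorted(cycles))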

In [210]:
# with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/bwt.txt') as file:
#    text = file.readline().strip()
text = 'panamabananas'
text

'panamabananas'

In [211]:
bwt = naive_bwt(text)
bwt

'smnpbnnaaaaa$a'
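
The rotation matrix built by `naive_bwt` can be avoided. A minimal sketch, assuming a helper named `bwt_from_rotation_indices` (a hypothetical name, not part of the original notebook): it sorts the rotation start positions with a comparator that reads characters of `text` in place, so only a list of indices is stored. It is still slow, since each comparison may scan the whole text; it only illustrates the space saving.

```python
from functools import cmp_to_key


def bwt_from_rotation_indices(text):
    """BWT without materialising the rotation matrix (illustrative sketch)."""
    if not text.endswith('$'):
        text += '$'
    n = len(text)

    def compare(i, j):
        # Compare the rotations starting at i and j one character at a time.
        for k in range(n):
            a, b = text[(i + k) % n], text[(j + k) % n]
            if a != b:
                return -1 if a < b else 1
        return 0

    order = sorted(range(n), key=cmp_to_key(compare))
    # The last symbol of the rotation starting at i is text[i - 1].
    return ''.join(text[i - 1] for i in order)


print(bwt_from_rotation_indices('panamabananas'))  # smnpbnnaaaaa$a
```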

---

### 2.  Decoding BWT (BA9J)
<hr/>

Rosalind BA9J

Decode the original text from its Burrows-Wheeler transform.

In [212]:
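def decode_bwt(bwt):
    """Recover the original text (without '$') from its BWT."""

    def label(column):
        # Tag each symbol with its occurrence number, e.g. ('a', 3) is the third 'a'.
        counts = {}
        labelled = []
        for symbol in column:
            counts[symbol] = counts.get(symbol, 0) + 1
            labelled.append((symbol, counts[symbol]))
        return labelled

    first = label(sorted(bwt))  # first column of the BWT matrix
    last = label(bwt)           # last column of the BWT matrix

    decoded = ''
    symbol = ('$', 1)

    # In the row whose last symbol is `symbol`, the first symbol is the
    # character that follows it in the text.
    while len(decoded) < len(last):
        symbol = first[last.index(symbol)]
        decoded += symbol[0]
    return decoded[:-1]  # drop the trailing '$'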

In [213]:
bwt

'smnpbnnaaaaa$a'

In [214]:
# with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba9j.txt') as file:
#     bwt = file.readline().strip()
decoded = decode_bwt(bwt)
decoded

'panamabananas'
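
`decode_bwt` calls `last.index(symbol)` at every step, which is a linear scan, so decoding is quadratic in the length of the transform. A possible speed-up, sketched here under the assumption that the transform contains exactly one `'$'`: one stable sort of the positions gives, for each row of the first column, the position of the same character in the last column. The name `decode_bwt_fast` is hypothetical.

```python
def decode_bwt_fast(bwt):
    """Invert a BWT with a single stable sort (illustrative sketch)."""
    # order[row] is the last-column position of the character that sits in
    # the first column at `row`; the stable sort keeps equal symbols in order.
    order = sorted(range(len(bwt)), key=lambda i: bwt[i])

    row = bwt.index('$')  # the row whose last symbol is '$' is the text itself
    decoded = []
    for _ in range(len(bwt) - 1):  # every symbol except the final '$'
        row = order[row]
        decoded.append(bwt[row])
    return ''.join(decoded)


print(decode_bwt_fast('smnpbnnaaaaa$a'))  # panamabananas
```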

---       

### 3. BW Matching 
---


#### Subsection: Last-to-First Array (BA9K)
---

The Last-to-First array, denoted LastToFirst(i), answers the following question: given a symbol at position i in LastColumn, what is its position in FirstColumn?

Last-to-First Mapping Problem

    Given: 
        A string Transform and an integer i.

    Return: 
        The position LastToFirst(i) in FirstColumn in the Burrows-Wheeler matrix if LastColumn = Transform.

    Sample Dataset
        T$GACCA
        3
    Sample Output
        1

---       
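The sample dataset can be worked through with a short sketch. It assumes a helper named `last_to_first` (a hypothetical name) that sorts the positions of the transform stably, so that the k-th occurrence of a symbol in LastColumn maps to its k-th occurrence in FirstColumn.

```python
def last_to_first(transform):
    """Return the Last-to-First array of a BWT string (illustrative sketch)."""
    # Stable sort: positions holding equal symbols keep their relative order.
    order = sorted(range(len(transform)), key=lambda i: transform[i])

    mapping = [0] * len(transform)
    for first_pos, last_pos in enumerate(order):
        mapping[last_pos] = first_pos
    return mapping


# Sample dataset: Transform = 'T$GACCA', i = 3
print(last_to_first('T$GACCA')[3])  # 1
```

Position 3 of `T$GACCA` holds the first `A`, and the first `A` sits at row 1 of the sorted first column `$AACCGT`.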

In [215]:
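def process_bwt(bwt):
    """Label the last column and build the Last-to-First array."""
    counts = {symbol: 0 for symbol in bwt}
    last_col = []

    # Tag each symbol with its occurrence number so that repeats are distinct.
    for symbol in bwt:
        counts[symbol] += 1
        last_col.append((symbol, counts[symbol]))

    # Sort once and record the first-column row of every labelled symbol.
    first_row = {x: row for row, x in enumerate(sorted(last_col))}
    Last_to_first = [first_row[x] for x in last_col]

    return last_col, Last_to_first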


In [216]:
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba9k.txt') as file:
    bwt = file.readline().strip()
    i = int(file.readline().strip())
last_col, Last_to_first = process_bwt(bwt)
Last_to_first[i]

714

---       

### BW Matching Algorithm (BA9L)
---

#### BW matching pseudocode

We are now ready to describe BWMatching, an algorithm that counts the total number of matches of Pattern in Text, where the only information that we are given is FirstColumn and LastColumn in addition to the Last-to-First mapping. The pointers top and bottom are updated by the two LastToFirst assignments in the following pseudocode.


    BWMatching(LastColumn, Pattern, LastToFirst)
        top ← 0
        bottom ← |LastColumn| − 1
        while top ≤ bottom
            if Pattern is nonempty
                symbol ← last letter in Pattern
                remove last letter from Pattern
                if positions from top to bottom in LastColumn contain an occurrence of symbol
                    topIndex ← first position of symbol among positions from top to bottom in LastColumn
                    bottomIndex ← last position of symbol among positions from top to bottom in LastColumn
                    top ← LastToFirst(topIndex)
                    bottom ← LastToFirst(bottomIndex)
                else
                    return 0
            else
                return bottom − top + 1
<br>

#### Code Challenge: Implement BWMatching.

    Input:
        A string BWT(Text), followed by a collection of Patterns.
    Output:
        A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text

    Sample Input:

        "TCCTCTATGAGATCCTATTCTATGAAACCTTCA$GACCAAAATTCTCCGGC"
        ['CCT', 'CAC', 'GAG', 'CAG', 'ATC']

    Sample Output:

        2 1 1 0 1

        # bwt = 'smnpbnnaaaaa$a'
        # Patterns = ['ana']
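
The pseudocode above can be transcribed almost line for line. The sketch below keeps its inclusive `top`/`bottom` bounds and works on the raw transform string; the names `last_to_first` and `bw_matching` are hypothetical, and slicing a window on every step is chosen for readability, not speed.

```python
def last_to_first(transform):
    """Last-to-First array via one stable sort of the positions."""
    order = sorted(range(len(transform)), key=lambda i: transform[i])
    mapping = [0] * len(transform)
    for first_pos, last_pos in enumerate(order):
        mapping[last_pos] = first_pos
    return mapping


def bw_matching(last_column, pattern, mapping):
    """Count occurrences of pattern, following the BWMatching pseudocode."""
    top, bottom = 0, len(last_column) - 1
    while top <= bottom:
        if not pattern:
            return bottom - top + 1
        # Take and remove the last letter of the pattern.
        symbol, pattern = pattern[-1], pattern[:-1]
        window = last_column[top:bottom + 1]
        if symbol not in window:
            return 0
        top_index = top + window.index(symbol)      # first occurrence in window
        bottom_index = top + window.rindex(symbol)  # last occurrence in window
        top = mapping[top_index]
        bottom = mapping[bottom_index]
    return 0


bwt = 'smnpbnnaaaaa$a'  # BWT of 'panamabananas$'
mapping = last_to_first(bwt)
print(bw_matching(bwt, 'ana', mapping))  # 3
```

The three matches are the occurrences of `ana` at positions 1, 7 and 9 of `panamabananas`.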


In [217]:

def BW_match(bwt_tup, Last_to_first, Pattern):
    """Count the matches of Pattern using the labelled last column."""
    # Rows top .. bottom-1 (bottom is exclusive) begin with the suffix matched so far.
    top = 0
    bottom = len(bwt_tup)

    while top < bottom:

        if len(Pattern) >= 1:
            symbol = Pattern[-1]
        else:
            return bottom - top

        # Rows in the current window whose last-column symbol equals `symbol`.
        hits = [i for i in range(top, bottom) if bwt_tup[i][0] == symbol]
        if not hits:
            return 0

        # Move the window to the first-column rows of the first and last hit.
        top = Last_to_first[hits[0]]
        bottom = Last_to_first[hits[-1]] + 1
        Pattern = Pattern[:-1]

    return 0

In [218]:
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba9l.txt') as file:
    bwt = file.readline().strip()
    Patterns = file.readline().strip().split()

last_col, Last_to_first = process_bwt(bwt)
matches =[]
for Pattern in Patterns:
    matches.append(BW_match(last_col, Last_to_first, Pattern))


with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba9l_out.txt','w') as outfile:
    outfile.write(' '.join(str(i) for i in matches))

In [219]:
' '.join(str(i) for i in matches)

'0 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 2 0 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0 2 0 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 2 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1

---

<br>

### That is the end of Course 6 week 2.
---

#### So far...

* [x] Encode BWT - ba9i
* [x] Decode BWT - ba9j
* [x] Last to First Array - ba9k
* [x] Pattern Match in BWT - ba9l

<br>

---

<br>

### Next notebook: BA9-3 (Course 6, Week 3), better BWT matching algorithms
---