## Module 2: Read Alignment Problem 

- Goal: matching that is faster than the naive algorithm and that tolerates approximate matches.
- Boyer-Moore
- Indexing
- Pigeonhole principle - allows us to look for both exact and approximate occurrences

In [12]:
""" Added modules to sys.path """
# import sys
# print(sys.path)
# sys.path.append('/Users/arshmeetkaur/Genomic_Data_Science/course 3 /modules')

' Added modules to sys.path '

In [19]:
# Necessary modules

import modules.all_functions 
from modules.all_functions import * 

import modules.boyer_moore
from modules.boyer_moore import * 

import modules.index 
from modules.index import * 

import math 
import bisect 

## Boyer-Moore Basics

## Diversion: Repetitive Elements

## Practical: Implementing Boyer Moore 

In [28]:
"""
GCTAGCTC
TCAA 

expect
bad character rule: shift 2 
good suffix rule: shift 1 (only suffix matching would be 'A' 
from the lower string.)
"""
p = 'TCAA'
""" Need to create a BoyerMoore Object """
p_bm = BoyerMoore(p)
""" Calculate shifts as per the bad character rule.
Bad character at index 2, in T it's 'T'
We would expect a shift of 2. """
print(p_bm.bad_character_rule(2, 'T'))
print(p_bm.good_suffix_rule(2))

2
1


In [16]:
""" 
AAGTAG
AAGGA
mismatch at 3

bad character rule: 4 
good suffix rule: 3 
"""

p = 'AAGGA' 

p_bm = BoyerMoore(p)
print(p_bm.bad_character_rule(3, 'T'))
print(p_bm.good_suffix_rule(3))

4
3


In [29]:
""" 
GCTAGCTC 
ACTA

With bad character, expect shift 1 

with good suffix, expect 3: 
Need A in GCTA to match with A from the start of ACTA. 
"""

p2 = 'ACTA' 
p_bm2 = BoyerMoore(p2)

print(p_bm2.bad_character_rule(0, 'G'))
print(p_bm2.good_suffix_rule(0))

1
3


In [30]:
"""
GCTAGCTA
GCTA

Case: when the pattern matches entirely. 
This is essentially a special case of the good suffix rule, where the matched suffix is all of P. 
Here no proper prefix of P is also a suffix of P, so match_skip shifts by the full length of P. 
"""

p3 = 'GCTA'
p3_bm = BoyerMoore(p3)
print(p3_bm.match_skip())

4
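
The bad character shifts above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the course's `BoyerMoore` class (which precomputes a lookup table); the function name `bad_character_shift` is hypothetical. It assumes the shift is the distance from the mismatch index to the rightmost occurrence of the mismatched text character to its left in P, or past the mismatch entirely if that character is absent.

```python
def bad_character_shift(p, i, c):
    """Shift suggested by the bad character rule (hypothetical helper).

    p: pattern, i: index in p where the mismatch happened,
    c: the character of T that caused the mismatch.
    """
    # Rightmost occurrence of c in p strictly left of i; rfind gives -1 if absent,
    # which makes the shift i + 1 (move P entirely past the bad character).
    rightmost = p.rfind(c, 0, i)
    return i - rightmost


print(bad_character_shift('TCAA', 2, 'T'))   # 2
print(bad_character_shift('AAGGA', 3, 'T'))  # 4
print(bad_character_shift('ACTA', 0, 'G'))   # 1
```

The three calls mirror the three bad character examples worked through in this section.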


## Preprocessing

No preprocessing is needed for the naive exact matching algorithm. 

For the Boyer Moore algorithm, when you're given P, you can preprocess: make lookup tables for bad character and good suffix rules. 

You can reuse the preprocessed P object over any T. 
The preprocessing for P may be computationally expensive, but we only have to do it once for many tasks. 

What about preprocessing T? 
If we want to solve many problems where we match different Ps against 
a single T, we would want to preprocess T once and store the result for reuse. 

Preprocessing T = offline
Not preprocessing T = online

Boyer-Moore and the naive exact matching algorithm are both online. 

For the read alignment problem, we need to preprocess T (reference genome). It'll be computationally expensive once, but for the most part, we're working with a static reference genome. 

## Indexing and k-mer index 

Preprocess T: 

Indexing a book (ordering): 
- index is a list of key terms and pages in the book where the term is mentioned. 
- allows you to "query" the index: look up where the key term occurs. 

Indexing by grocery store (grouping): 
- look for x in the x section. 

Indexing DNA like a book (Merging): 

C G T G C G T G C
k-mer index: take every substring of length k from the genome. Store it and the offset where it occurs in the genome. Alphabetize the index by the first, then second, then third, etc characters. 

5-mer index: 
C G T G C: 0, 4
G C G T G: 3 
G T G C G: 1
T G C G T: 2 

How do you use this? 
P: G T G C G T
Take the first 5-mer of P: G T G C G
Look it up in the 5-mer index to get the index hit: offset 1 
Verification: align P at offset 1 in T and check the remaining character of P: P[5] = T[6] = 'T' (0-based)

Any 5-mer from P should work. 
If any 5-mer of P is missing from the index, P cannot occur in T, since the index contains every 5-mer of T. 
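
The 5-mer example above can be sketched in a few lines. This is an illustrative version, not the course's `Index` class; the helper names `build_sorted_kmer_index` and `query_first_kmer` are made up for this note, and the lookup is a plain linear scan to keep the focus on the index-then-verify idea.

```python
def build_sorted_kmer_index(t, k):
    """Return an alphabetically sorted list of (k-mer, offset) pairs for t."""
    return sorted((t[i:i + k], i) for i in range(len(t) - k + 1))


def query_first_kmer(p, t, index, k):
    """Find occurrences of p in t using the first k-mer of p as the key."""
    key = p[:k]
    # Linear scan for clarity; a real index would use binary search or hashing.
    hits = [offset for kmer, offset in index if kmer == key]
    # Verification: the whole of p must match t at each index hit.
    return [h for h in hits if t[h:h + len(p)] == p]


t = 'CGTGCGTGC'
index = build_sorted_kmer_index(t, 5)
for kmer, offset in index:
    print(kmer, offset)
print(query_first_kmer('GTGCGT', t, index, 5))  # [1]
```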

## Ordered structures for Indexing

multimap - associates key with values
one key can have multiple values 

Data structures to implement multimap 

By ordering: 

<img src="a_kmer.png" alt="Local Image" width="300">

- make a table of the kmer and associated offset 
- sort the table by the alphabetical order of kmer strings.
- have to use binary search **O(log n)** to find the kmer so you can get the offset value. 

Python provides functions that help with binary search: 
bisect module
<img src="bisect.png" alt="Local Image" width="300">

- We could use the bisect_left function to find where a given kmer string would be inserted in our sorted index. All actual instances of that kmer, if any, sit at that position and immediately after it, which lets us collect every occurrence. 

In [32]:
import bisect 
# takes a list sorted in ascending order, and an item 
# returns the leftmost position where the item could be inserted while keeping the list sorted
bisect.bisect_left([1,2], 2) # 2 would be inserted at index 1
# another way to look at it: any existing 2s start at index 1

1
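
Applied to a k-mer index, the same idea might look like the sketch below. The function names are hypothetical. It assumes the index is a sorted list of `(kmer, offset)` tuples and searches for `(kmer, -1)`, which sorts before every real entry for that k-mer because offsets are never negative.

```python
import bisect


def build_sorted_kmer_index(t, k):
    """Sorted list of (k-mer, offset) pairs for every k-mer in t."""
    return sorted((t[i:i + k], i) for i in range(len(t) - k + 1))


def lookup(index, kmer):
    """Return all offsets stored for kmer, found by binary search."""
    # (kmer, -1) sorts just before the first real (kmer, offset) entry.
    i = bisect.bisect_left(index, (kmer, -1))
    offsets = []
    # Entries for the same k-mer are contiguous, so walk forward until the key changes.
    while i < len(index) and index[i][0] == kmer:
        offsets.append(index[i][1])
        i += 1
    return offsets


index = build_sorted_kmer_index('CGTGCGTGC', 5)
print(lookup(index, 'CGTGC'))  # [0, 4]
print(lookup(index, 'AAAAA'))  # []
```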

## Hash Tables for Indexing

The basic shtick: 

<img src="hash_table.png" alt="Local Image" width="300">

- hash_index = hash_function('kmer string')
- Each hash index is an index in an array of "buckets" 
- Each "bucket" is a linked list of entries, List<Entry<k-mer,offsets>>, where each entry is (key: 'kmer string', value: offsets). 
- There can occasionally be collisions, where distinct kmers end up in the same bucket. 

- When we want to query the hash table, we apply the hash_function to the kmer string again to find its bucket.
We want a list of the offsets where the kmer string occurs, so we walk the linked list and collect the offset values for every Entry whose key == kmer_string. 

- python dictionaries are implementations of a hash map, so we don't have to worry about implementing a hashmap 
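
A dictionary-backed k-mer index can be sketched as follows. This is not the `Index` class from the course module; `build_kmer_dict` and `query` are hypothetical names, and the query only uses the first k-mer of p before verifying.

```python
from collections import defaultdict


def build_kmer_dict(t, k):
    """Map each k-mer of t to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(t) - k + 1):
        index[t[i:i + k]].append(i)
    return index


def query(p, t, index, k):
    """Look up the first k-mer of p, then verify the full pattern at each hit."""
    # .get avoids inserting an empty list for k-mers that are not in the index.
    return [i for i in index.get(p[:k], []) if t[i:i + len(p)] == p]


t = 'GCTACGATCTACAATCTA'
index = build_kmer_dict(t, 2)
print(query('TCTA', t, index, 2))  # [7, 14]
```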

<img src="dict.png" alt="Local Image" width="300">

## Implementing a kmer index table

In [33]:
# implemented in the index module, tested here. 
t = 'GCTACGATCTACAATCTA'
p = 'TCTA'

table = Index(t, 2) # creates an index table
query_index(p, t, table) 

[7, 14]

## Variations on kmer index

Two variations: 

Variation 1: 

What if you only took the kmers at even offsets? 
- why do this: the index table is smaller, so it uses less memory and is faster to search.
- concern: an occurrence of p may start at an odd offset, so a single kmer of p may get no hit. 
- fix: query with more kmers from p.
Take two kmers from p, one at an even offset and one at an odd offset, like so: 

<img src="each_2.png" alt="Local Image" width="300">

If you do every third kmer, do the first 3 kmers of p.
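
A rough sketch of this variation, assuming a dictionary index and hypothetical helper names (`build_sparse_index`, `query_sparse`). The index keeps only k-mers whose offset in t is a multiple of `interval`, and the query tries the first `interval` k-mers of p so that one of them is guaranteed to line up with an indexed offset.

```python
def build_sparse_index(t, k, interval):
    """Index only the k-mers of t that start at multiples of `interval`."""
    index = {}
    for i in range(0, len(t) - k + 1, interval):
        index.setdefault(t[i:i + k], []).append(i)
    return index


def query_sparse(p, t, index, k, interval):
    """Try the first `interval` k-mers of p, then verify each candidate."""
    hits = set()
    for j in range(interval):
        if j + k > len(p):
            break  # p is too short to supply another k-mer
        for i in index.get(p[j:j + k], []):
            start = i - j  # where p would begin in t for this hit
            if start >= 0 and t[start:start + len(p)] == p:
                hits.add(start)
    return sorted(hits)


t = 'GCTACGATCTACAATCTA'
print(query_sparse('TCTA', t, build_sparse_index(t, 2, 2), 2, 2))  # [7, 14]
print(query_sparse('TCTA', t, build_sparse_index(t, 2, 3), 2, 3))  # [7, 14]
```

With `interval=2`, the occurrence at offset 7 is only found through the second k-mer of p, which is why a single k-mer is no longer enough.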

<img src="each_n.png" alt="Local Image" width="300">

Variation 2: 

Use subsequences. 

AACCGGTT 
AAGT 
- subsequence of a string S = a string of characters that also occur in S, in the same order (not necessarily contiguous). 
- a substring is always a subsequence

1. Take a subsequence of a "shape" (2 chars, skip 1, 1 char, skip 1, 2 chars)- basically some sort of pattern 
2. Take the same subsequence "shape" for each offset 
3. When you want to query P, extract a subsequence of the same "shape" and look up that string. 

Advantages: Increases the specificity of the index, so a hit leads to a successful verification more often. 
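
The three steps above could be sketched like this. The shape (2 chars, skip 1, 1 char, skip 1, 2 chars) is encoded as the relative positions `(0, 1, 3, 5, 6)`; the names `SHAPE`, `extract`, `build_shape_index`, and `query_shape` are all hypothetical, and the test strings are made up for illustration.

```python
SHAPE = (0, 1, 3, 5, 6)  # 2 chars, skip 1, 1 char, skip 1, 2 chars


def extract(s, start, shape=SHAPE):
    """Pull out the subsequence of s with the given shape, beginning at start."""
    return ''.join(s[start + d] for d in shape)


def build_shape_index(t, shape=SHAPE):
    """Index the shaped subsequence starting at every offset of t."""
    span = shape[-1] + 1  # number of text positions the shape covers
    index = {}
    for i in range(len(t) - span + 1):
        index.setdefault(extract(t, i, shape), []).append(i)
    return index


def query_shape(p, t, index, shape=SHAPE):
    """Extract the same shape from the start of p, look it up, then verify."""
    if len(p) < shape[-1] + 1:
        return []  # p is shorter than the shape, so no key can be extracted
    key = extract(p, 0, shape)
    return [i for i in index.get(key, []) if t[i:i + len(p)] == p]


t = 'AACCGGTTAACCGGTT'
index = build_shape_index(t)
print(extract('CCGGTTAA', 0))             # CCGTA
print(query_shape('CCGGTTAA', t, index))  # [2]
```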

## Genome Indexes Used in Research


Indexing is super duper powerful.
How can we fit the genome into an index? 

k-mer indexing
every-n-kmer indexing
subsequences

Suffix index:

<img src="suffix_table.png" alt="Local Image" width="300">

make a list of all suffixes of the genome. 
- idea: take all the suffixes and put them in alphabetical order.
- now you can use binary search to find all suffixes which have a certain P as a prefix.

is this data structure too big? 
- genome of length n => n(n+1)/2 characters of space needed.
                - n = 3B => about 4.5 × 10^18 characters. not practical.
- but you can fix the size issue: instead of storing the suffixes, store a list of integers, each the offset of one suffix. The integers are sorted by the alphabetical order of the suffixes they point to. 

<img src="index_array.png" alt="Local Image" width="300">

- advantage: the array of offsets has one entry per genome position, so its size grows like n instead of n^2 

<img src="compression_algos.png" alt="Local Image" width="300">

Suffix array - uses ordering, like a book index

Suffix tree- uses grouping

FM Index- Based on Burrows Wheeler Transform for compression
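
The suffix array idea (store sorted offsets, then binary search for suffixes that start with P) can be sketched as below. This is a toy version with hypothetical names: the construction sorts by slicing out whole suffixes, which is fine for short strings but far too slow and memory-hungry for a real genome.

```python
def build_suffix_array(t):
    """Offsets of all suffixes of t, ordered alphabetically by suffix."""
    # Naive construction for illustration only; real tools use specialised algorithms.
    return sorted(range(len(t)), key=lambda i: t[i:])


def find_occurrences(p, t, sa):
    """Binary search the suffix array for suffixes that have p as a prefix."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare p with the first len(p) characters of the middle suffix.
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # All suffixes starting with p are contiguous from position lo onward.
    while lo < len(sa) and t[sa[lo]:sa[lo] + len(p)] == p:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


t = 'GCTACGATCTACAATCTA'
sa = build_suffix_array(t)
print(find_occurrences('TCTA', t, sa))  # [7, 14]
```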

## Approximate Matching


- all the algos tried so far are exact. 
- we want approx matching algos because reads don't align perfectly with the ref genome.
- reasons for differences: 
1. sequencing errors
2. not all genomes of a species are the same (differences between humans, for example)

Types of genetic mutations: 
1. mismatch / substitution
2. Insertion or deletion (frameshift)

Want to be able to quantify how different two sequences are: the distance between them. 
For strings X and Y...
- Hamming distance: # of substitutions needed to turn one into the other.
  - assumes strings of equal length
- Levenshtein distance / edit distance: minimum # of edits (substitution, insertion, deletion) needed to turn one string into the other.
  - The edit distance between the empty string and a string of length 3 is 3.
  - For two equal-length strings, the Hamming distance can equal the edit distance (the edit distance is never larger).
- Stick to Hamming distance for now. 

- Adapt the naive exact matching algo to allow approximate matching: tolerate up to a set number of mismatched bases between P and the aligned part of T, i.e. allow some Hamming distance. 

In [36]:
print(naive_hamming('GGGG', 'ACTGG', 3)) # allows 3 mismatches, should give [0, 1]

[0, 1]
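
For reference, the two distances defined above can be computed with the short sketch below. Both function names are hypothetical and separate from the course module; the edit distance uses the standard dynamic-programming recurrence, keeping only one row of the table at a time.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError('Hamming distance needs strings of equal length')
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y):
    """Minimum number of substitutions, insertions, and deletions turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, 1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,               # delete a from x
                           cur[j - 1] + 1,            # insert b into x
                           prev[j - 1] + (a != b)))   # match or substitute
        prev = cur
    return prev[-1]


print(hamming_distance('GGGG', 'ACTG'))  # 3
print(edit_distance('', 'ACT'))          # 3
print(hamming_distance('ACGT', 'CGTA'))  # 4
print(edit_distance('ACGT', 'CGTA'))     # 2
```

The last pair shows a case where the edit distance is smaller than the Hamming distance, because one deletion plus one insertion shifts the whole string.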


## Approximate Matching: Pigeonhole Principle

Wanted = a way to apply exact matching algos to approx matching. 

Approach: Want to match P with T. 
1. Partition P into U and V: if P occurs in T with a Hamming distance of 1, either U or V appears with no edits. 

General: if P occurs in T with up to k edits, at least one of k+1 partitions of P appears in T with 0 edits 

<img src="edits.png" alt="Local Image" width="300">

<img src="partitions.png" alt="Local Image" width="300">

Essentially: To find occurrences of P in T with up to k edits: 
1. Partition P into k+1 pieces. 
2. Use an exact matching algo on each piece. 
3. At least one of those pieces must match exactly wherever P occurs with up to k edits. 

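Pigeonhole principle: if you have 10 pigeons to put in 9 holes, one hole has to have at least 2 pigeons. Ours is the flipped version: k edits (pigeons) spread over k+1 partitions (holes) leave at least one partition with no edit. 

1. Divide P into k+1 partitions. 
2. Run an exact matching algorithm on each partition. 
3. For each exact hit, run a verification step: line up the rest of P around the hit in T and check that the other partitions match with at most k mismatches in total. 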
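
The three steps can be sketched end to end without the Boyer-Moore module. In this sketch `str.find` stands in as the exact matcher, mismatches are counted as Hamming distance only, and all function names are hypothetical.

```python
def partition(p, k):
    """Split p into k+1 contiguous pieces, returned as (start, end) index pairs."""
    if len(p) < k + 1:
        raise ValueError('p needs at least k+1 characters')
    size = len(p) // (k + 1)
    # The last piece absorbs any leftover characters.
    return [(i * size, len(p) if i == k else (i + 1) * size) for i in range(k + 1)]


def exact_occurrences(piece, t):
    """All offsets where piece occurs in t (str.find used as the exact matcher)."""
    hits, i = [], t.find(piece)
    while i != -1:
        hits.append(i)
        i = t.find(piece, i + 1)
    return hits


def approximate_match_sketch(p, t, k):
    """Offsets where p occurs in t with at most k mismatches."""
    matches = set()
    for start, end in partition(p, k):
        for m in exact_occurrences(p[start:end], t):
            offset = m - start  # where the whole of p would begin in t
            if offset < 0 or offset + len(p) > len(t):
                continue  # p would hang off either end of t
            # Verification: count mismatches across the full alignment.
            mismatches = sum(a != b for a, b in zip(p, t[offset:offset + len(p)]))
            if mismatches <= k:
                matches.add(offset)
    return sorted(matches)


print(approximate_match_sketch('GGGG', 'ACTGG', 3))    # [0, 1]
print(approximate_match_sketch('ACGT', 'ACGAACGT', 1))  # [0, 4]
```

Using a set for the results matters because the same offset can be reached through more than one exactly matching piece.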

<img src="pidge_hole_map.png" alt="Local Image" width="300">

## Implementing Approximate Matching 

In [95]:
def approximate_match_official(p,t,n):
    """ This is their code, I used my own algorithm later because couldn't understand how start and end worked."""
    segment_length = int(round(len(p) / (n+1)))
    all_matches = set()
    for i in range(n+1):
        start = i*segment_length
        end = min((i+1)*segment_length, len(p))
        # print(p[start:end])
        p_bm = BoyerMoore(p[start:end], alphabet='ACGT')
        matches = boyer_moore(p[start:end], p_bm, t)
        # Extend matching segments to see if whole p matches
        for m in matches:
            if m < start or m-start+len(p) > len(t):
                continue
            mismatches = 0
            for j in range(0, start):
                if not p[j] == t[m-start+j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            for j in range(end, len(p)):
                if not p[j] == t[m-start+j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            if mismatches <= n:
                all_matches.add(m - start)
    return list(all_matches)

In [97]:
def approximate_match(p, t, k): 
    """ 
    Steps to approximate matching: 
    1. Split P into k + 1 substrings. 
    2. Search for each substring of P in T with Boyer-Moore.
    3. If a substring has at least one exact match, inspect the rest of P around each match. 
    4. If the rest of P matches with at most k mismatches, record the offset in T where P starts. 
    """

    # split P string into k+1 substrings 
    # e.g. len(p) = 32 and k = 2 gives 32 // 3 = 10 
    length_of_partition = len(p) // (k+1) 

    actual_matches = set() # making it a set eliminates possibility of duplicates.

    # Compare each substring of P to T using the Boyer Moore algorithm.
    for i in range(k+1): 
        # instead of creating substrings, use indices 
        start = i * length_of_partition
        end = start + length_of_partition 
        if (len(p) - end < length_of_partition):
            end = len(p) 
        
        # apply the Boyer Moore algorithm to each substring of P. 
        p_bm = BoyerMoore(p[start:end])
        possible_matches = boyer_moore(p[start:end], p_bm, t)
        
        if len(possible_matches) == 0: 
            continue 

        # for every possible match index, check if p really matches. 
        for m in possible_matches:

            # skip hits where p would start before the beginning of t or run past the end of t
            if m < start or m-start+len(p) > len(t):
                continue

            mismatches = 0 
            
            # check from the start of p to the start of the matching fragment 
            for j in range(start):
                if p[j] != t[m - start + j]:
                    mismatches += 1
                    # break 
                    if mismatches > k: 
                        break 
            
            # check from the end of the matching fragment to end of p 
            for j in range(end, len(p)):

                if p[j] != t[m - start + j]:
                    mismatches += 1
                    if mismatches > k: 
                        break 

            # if the number of mismatches is tolerated, add the offset of p in t to the set 
            if mismatches <= k:
                actual_matches.add(m - start)

    return list(actual_matches)

In [99]:
p = 'CCCTAAATTC'
t = 'CACTTAATTT'
print(approximate_match(p, t, 2))
print(approximate_match_official(p, t, 2))

[]
[]


In [106]:
p = 'TAATTAA'
p_bm = BoyerMoore(p)
#print("Bad character:", p_bm.bad_character_rule(4, 'T'))
print("Good suffix:", p_bm.good_suffix_rule(3))

Good suffix: 4
