# Week 3 - Sequence Alignment

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">

This week we'll look at some of the alignment algorithms discussed in lectures.

</div>

## Jupyter Shortcuts

You can find a User Guide with a description of features at http://jupyterlab.readthedocs.io

The main interface we'll be using is the Jupyter Notebook interface. If you prefer to run just Jupyter Notebook itself instead of Jupyter Lab, that is fine. 

Some useful hotkeys are:

* Shift-Enter : execute the code in the current cell
* Enter : edit the current cell
* ESC : stop editing a cell and return to "command mode" to use other hotkeys
* m : Turn the current cell into a Markdown cell
* y : Turn the current cell into a code cell
* a : add a new cell above
* b : add a new cell below
* dd : delete the current cell
* c : copy the current cell
* v : paste the copied cell(s)

You can also execute the command `?` or `help()` to get help on any function. For instance, `sorted?` or `help(sorted)`.

## Setup

We will use numpy to create arrays and Biopython to handle DNA sequences.

In [None]:
import numpy as np
from Bio import SeqIO
from Bio.Seq import Seq

In [None]:
import os
import requests
from IPython.display import HTML

# Handy function to fetch our data files
def fetch_file(url, outpath='.'):
    response = requests.get(url)
    if response.status_code == 200:
        print('File found!')
        # Get the filename from the URL
        filename = os.path.basename(url).split('?', 1)[0]
        # Construct the filepath using the specified directory and filename
        filepath = os.path.join(outpath, filename)
        # Create the directory if it doesn't exist
        if not os.path.exists(outpath):
            print(f'Creating output dir: {outpath}')
            os.makedirs(outpath)
        # Check if the file already exists in the specified directory
        if os.path.exists(filepath):
            print(f'{filename} already exists in {outpath}. Skip download.')
        else:
            with open(filepath, 'wb') as f:
                f.write(response.content)
            print(f'Saved to: {filepath}')
    else:
        print(f'File not found: Code {response.status_code}')

In [None]:
# Load stylesheet
HTML(requests.get('https://raw.githubusercontent.com/melbournebioinformatics/COMP90014/main/data/2023/style/custom.css').text)

In [None]:
# Fetch helper functions
data = ['alignment_functions.py']

for filename in data:
    url = f'https://github.com/melbournebioinformatics/COMP90014/blob/main/data/2023/Workshop_03/src/{filename}?raw=true'
    fetch_file(url,outpath='src')

## Sequence data 

We'll read in some real data to play with: nucleotide sequences for the gene encoding insulin in mice and its human homologue.

We'll use nucleotide sequence rather than protein sequence, because the substitution matrix is very important when aligning protein sequences, and we won't implement substitution matrices today.

In [None]:
# Fetch test data
data = ['Homo_sapiens_INS_203_sequence.fa','Mus_musculus_Ins2_205_sequence.fa']

for filename in data:
    url = f'https://github.com/melbournebioinformatics/COMP90014/blob/main/data/2023/Workshop_03/data/{filename}?raw=true'
    fetch_file(url,outpath='data')


We can inspect the contents of these files using the `cat` tool.

In [None]:
# Look at this file with the Linux cat command
!cat data/Homo_sapiens_INS_203_sequence.fa

In [None]:
!cat data/Mus_musculus_Ins2_205_sequence.fa

To work with these sequences in Python we will import the contents of the fasta files using the Biopython library:

In [None]:
# Import with biopython as SeqRecord objects
human_ins_record = SeqIO.read('data/Homo_sapiens_INS_203_sequence.fa', "fasta")
mouse_ins_record = SeqIO.read('data/Mus_musculus_Ins2_205_sequence.fa', "fasta")

In [None]:
# The SeqRecord contains metadata from the fasta file
print(type(human_ins_record))
print(human_ins_record)

We only need the sequences today, so we will extract the `seq` attribute from each SeqRecord.

In [None]:
# Extract sequences 
human_ins = human_ins_record.seq
mouse_ins = mouse_ins_record.seq

Biopython Seq objects behave much like normal Python strings: they support indexing, slicing, and `len()`.

In [None]:
print(human_ins)

# Edit distance 

## Exercise 1: Hamming distance



<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Edit the Hamming distance function below so that it returns the correct Hamming distance for two strings `s1` and `s2`.
</div>

In [None]:
def hamming(s1, s2):
    """
    Calculate the Hamming distance between strings s1 and s2.
    The strings must be the same length.
    """
    ### BEGIN SOLUTION
    
    '''
    # Long form solution
    
    if len(s1) != len(s2):
        print('Input strings must be equal length')
        return
    
    hamming_distance = 0
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            hamming_distance += 1
    
    return hamming_distance
    
    '''

    # Compact solution
    assert len(s1) == len(s2), "Input sequences must be same length"
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

    ### END SOLUTION

Think also: what will your function do if the strings are of different length? What *should* it do?
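
One possible answer, sketched below, is to refuse mismatched lengths explicitly. The name `hamming_strict` is hypothetical and not part of the workshop code. Two facts motivate the explicit check: `zip()` silently stops at the end of the shorter string, and `assert` statements are skipped when Python runs with the `-O` flag.

```python
def hamming_strict(s1, s2):
    """Hamming distance that raises ValueError for strings of unequal length."""
    if len(s1) != len(s2):
        # Raise instead of printing, so callers cannot mistake a failure for a result
        raise ValueError(
            f"Input sequences must be the same length (got {len(s1)} and {len(s2)})"
        )
    # Count the positions at which the two strings differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_strict("GATTACA", "GACTATA"))  # prints 2

try:
    hamming_strict("happiness", "applying")
except ValueError as err:
    print("Rejected:", err)
```

Raising an exception is only one design choice; whether an error or some other behaviour is most appropriate depends on how the function will be used.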

In [None]:
# Should return 2
hamming("GATTACA","GACTATA")

In [None]:
# Should return 6
hamming("tuesday","sundays")

In [None]:
# These strings are of different length!
hamming("happiness","applying")

<div style="color: rgb(0,96,100); background: rgb(178,235,242); border: solid 1px rgb(77,208,225); padding: 10px;">
<b>Question 1:</b> Hamming distance is equal to:
    
A. The total number of positions at which two strings of equal length are different.

B. The longest shared substring between two sequences.
    
C. The total number of character substitutions, insertions, or deletions required to convert one string into another.
    </div>
    
=== BEGIN MARK SCHEME ===

A. The total number of positions at which two strings of equal length are different.

=== END MARK SCHEME ===

YOUR ANSWER HERE

## Exercise 2: Levenshtein distance

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Edit the `lev` function below to calculate Levenshtein distance recursively. You can use the costs:
* 1 for an indel
* 1 for a mismatch
* 0 for a match

This is the same function as shown during lectures, but try to implement it without looking back at the slides.

</div>

In [None]:
def lev(a,b):
    """
    Recursively calculate Levenshtein distance between strings a and b.
    """
    if len(a)==0:
        return len(b)
    if len(b)==0:
        return len(a)
    # Add code below to calculate the distance in terms of 
    # lev(a,b[1:]), lev(a[1:],b), and lev(a[1:],b[1:])
    # We're interested in the incremental cost of aligning the current characters,
    # a[0] and b[0]
    
    ### BEGIN SOLUTION
    
    if a[0]==b[0]:
        mismatch_cost = 0
    else:
        mismatch_cost = 1
    return min( lev(a[1:],b[1:]) + mismatch_cost, # Case 1: Replace on first character
                lev(a[1:],b) + 1, # Case 2: Remove the first character
                lev(a,b[1:]) + 1 # Case 3: Insert first character
               )

    ### END SOLUTION

Try inserting a print statement at the top of your code to show the argument values each time the function is called.
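
As an alternative to printing every call, the sketch below counts them. It assumes a hypothetical variant named `lev_counted` that takes an extra `counter` dictionary; apart from the counting, it follows the same recurrence as `lev`. Running it on progressively longer prefixes gives a feel for how quickly the work grows.

```python
def lev_counted(a, b, counter):
    """Recursive Levenshtein distance that tallies its calls in counter['calls']."""
    counter["calls"] += 1
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)
    mismatch_cost = 0 if a[0] == b[0] else 1
    return min(lev_counted(a[1:], b[1:], counter) + mismatch_cost,  # match or mismatch
               lev_counted(a[1:], b, counter) + 1,                  # drop first char of a
               lev_counted(a, b[1:], counter) + 1)                  # drop first char of b


# Compare the number of calls for prefixes of increasing length
for n in range(1, 8):
    counter = {"calls": 0}
    distance = lev_counted("GATTACA"[:n], "GACTATA"[:n], counter)
    print(f"prefix length {n}: distance {distance}, calls {counter['calls']}")
```

The call count depends only on the lengths of the two strings, not on their contents, because every non-empty pair always makes three recursive calls.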

In [None]:
# Should return 2
lev("GATTACA","GACTATA")

In [None]:
# Should return 4
lev("tuesday","sundays")

In [None]:
# Should return 6
lev("happiness","applying")

<div style="color: rgb(0,96,100); background: rgb(178,235,242); border: solid 1px rgb(77,208,225); padding: 10px;">
<b>Question 2:</b> What is the time complexity class of the recursive Levenshtein Distance algorithm?


A. Quadratic
    
B. Exponential
    
C. Linear
    
D. Constant

</div>

=== BEGIN MARK SCHEME ===

B. Exponential

=== END MARK SCHEME ===


YOUR ANSWER HERE

## Exercise 3: Levenshtein distance with Dynamic Programming

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">

<b>Challenge:</b> Find the Levenshtein distance of two sequences using a dynamic programming approach.

You can use the costs:
* 1 for an indel
* 1 for a mismatch
* 0 for a match

- [ ] Initialise the scoregrid as a numpy array
- [ ] Populate the first row and column with cumulative indel scores
- [ ] Fill the rest of the matrix row by row, starting from the top left
- [ ] Select the minimum-scoring operation from {insertion, deletion, match, mismatch} at each step

</div>

In [None]:
import numpy as np

def levenshtein_distance(str1, str2):
    """
    Calculate the Levenshtein distance between two strings using dynamic programming.

    Parameters:
    str1 (str): The first input string.
    str2 (str): The second input string.

    Returns:
    int: The Levenshtein distance between the two input strings.
    """

    ### BEGIN SOLUTION
    
    len_str1 = len(str1)
    len_str2 = len(str2)
    
    # Initialize a matrix to store distances
    dist_matrix = np.zeros((len_str1 + 1, len_str2 + 1), dtype=int)
    
    # Initialize the first row and column of the matrix
    # Note: This bit only works because our penalty score is +1
    #       Think about how you would do this for indel penalties > 1
    for i in range(len_str1 + 1):
        dist_matrix[i, 0] = i
    for j in range(len_str2 + 1):
        dist_matrix[0, j] = j
    
    
    # Fill in the rest of the matrix
    for i in range(1, len_str1 + 1):
        for j in range(1, len_str2 + 1):
            if str1[i - 1] == str2[j - 1]:
                cost = 0
            else:
                cost = 1
            dist_matrix[i, j] = min(dist_matrix[i - 1, j] + 1,       # Deletion
                                    dist_matrix[i, j - 1] + 1,       # Insertion
                                    dist_matrix[i - 1, j - 1] + cost) # Substitution
    
    # The value at the bottom right corner of the matrix is the Levenshtein distance
    levenshtein_distance = dist_matrix[len_str1, len_str2]
    ### END SOLUTION
    
    return levenshtein_distance

In [None]:
# Test your function!
str1 = "kitten"
str2 = "sitting"
distance = levenshtein_distance(str1, str2)
print(f"Levenshtein distance between '{str1}' and '{str2}': {distance}")
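
The note in the solution about indel penalties greater than 1 can be explored with a small sketch. The function name `edit_distance_weighted` and its parameters `indel_cost` and `mismatch_cost` are assumptions made for illustration, and plain nested lists are used instead of a numpy array to keep the example dependency-free. The key change is that cell `i` on each edge holds `i * indel_cost` rather than `i`.

```python
def edit_distance_weighted(str1, str2, indel_cost=1, mismatch_cost=1):
    """Edit distance by dynamic programming with configurable costs."""
    rows, cols = len(str1) + 1, len(str2) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Edges: aligning a prefix against the empty string costs one indel per character
    for i in range(rows):
        dist[i][0] = i * indel_cost
    for j in range(cols):
        dist[0][j] = j * indel_cost

    # Interior: cheapest of deletion, insertion, and match/mismatch
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if str1[i - 1] == str2[j - 1] else mismatch_cost
            dist[i][j] = min(dist[i - 1][j] + indel_cost,
                             dist[i][j - 1] + indel_cost,
                             dist[i - 1][j - 1] + cost)
    return dist[-1][-1]


print(edit_distance_weighted("kitten", "sitting"))                # prints 3
print(edit_distance_weighted("kitten", "sitting", indel_cost=2))  # prints 4
```

With the default costs this reduces to the ordinary Levenshtein distance.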

# Alignment scores, global and local alignment 

Here is a function which implements a recursive alignment function like `lev()`, but returns an alignment score rather than an edit distance. Notice that it uses `max()` rather than `min()`, as we're trying to find the maximum possible score, not the minimum possible edit distance.

Avoid looking at the function below until you've done Exercise 2 above as it may give away the answer.

In [None]:
def align_recursive(a, b, indel_score=-1, match_score=2, mismatch_score=-1):
    """
    Recursively calculate alignment score between strings a and b,
    using supplied scores for matches, mismatches and indels.
    """
    if len(a)==0:
        return indel_score * len(b)
    if len(b)==0:
        return indel_score * len(a)
    if a[0]==b[0]:
        match_mismatch_score = match_score
    else:
        match_mismatch_score = mismatch_score
    return max(align_recursive(a[1:],b) + indel_score,
               align_recursive(a,b[1:]) + indel_score,
               align_recursive(a[1:],b[1:]) + match_mismatch_score)

Notice that you can now change the scoring system for indels, matches, and mismatches.

In [None]:
align_recursive("GATTACA","GACTATA")

In [None]:
align_recursive("GATTACA","GACTATA",match_score=5)

Of course, this function is recursive and will be slow for large strings. `align_recursive(human_ins, mouse_ins)` is not practical to run.
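
To see why the plain recursion is slow, note that it recomputes the same suffix pairs many times. A minimal sketch of one remedy, caching each sub-result with `functools.lru_cache` from the standard library, is shown below. The name `align_memoised` and the inner helper `score` are hypothetical; they index into the strings rather than slicing them so that the cache keys stay small.

```python
from functools import lru_cache


def align_memoised(a, b, indel_score=-1, match_score=2, mismatch_score=-1):
    """Alignment score with the same recurrence as align_recursive, but cached."""
    a, b = str(a), str(b)

    @lru_cache(maxsize=None)
    def score(i, j):
        # Best score for aligning the suffixes a[i:] and b[j:]
        if i == len(a):
            return indel_score * (len(b) - j)
        if j == len(b):
            return indel_score * (len(a) - i)
        diagonal = match_score if a[i] == b[j] else mismatch_score
        return max(score(i + 1, j) + indel_score,
                   score(i, j + 1) + indel_score,
                   score(i + 1, j + 1) + diagonal)

    return score(0, 0)


print(align_memoised("GATTACA", "GACTATA"))  # prints 8
```

Each `(i, j)` pair is now computed once, so the work grows with the product of the two lengths. The function is still recursive, though, so sequences hundreds of bases long are likely to hit Python's default recursion limit; filling out a grid iteratively avoids that problem.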

## Global alignment: Needleman-Wunsch

To do global alignment with the Needleman-Wunsch algorithm, we need two steps:

1. Fill out the grid of alignment scores. This is enough to give the final alignment score.
2. Trace-back from the bottom-right corner of the grid to get the actual alignment of the strings.

Here, we've provided a function to do the traceback and an incomplete function to calculate the alignment score grid. Complete the `calculate_scoregrid()` function so that it correctly fills out the grid of scores.

For traceback, we have two options:
* Keep track of which cell(s) was/were the origin of the best score(s) for each given cell, and use this information for traceback. This increases storage requirements by a constant factor (i.e. they are still O(N^2)).
* Or, during traceback, calculate which cell(s) could have been the origin of the best score(s) for each cell. This increases the computational cost of traceback by a constant factor (i.e. it is still O(N)).

In this case, the provided traceback function will work out which path to follow, so you don't need to keep track of the path as you calculate the scores.
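
The second option above can be sketched as follows. This is not the provided `traceback` function, whose implementation isn't shown here; `traceback_sketch` is a hypothetical stand-in that returns the two aligned strings directly. It assumes the grid is indexed as `grid[i][j]` with `i` running over `a` and `j` over `b`, and it prefers the diagonal move whenever the diagonal neighbour explains the current score.

```python
def traceback_sketch(a, b, grid, indel=-1, match=2, mismatch=-1):
    """Recover one global alignment by re-deriving each step from the scores."""
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = match if a[i - 1] == b[j - 1] else mismatch
            if grid[i][j] == grid[i - 1][j - 1] + diag:
                # Diagonal move: a[i-1] is aligned with b[j-1]
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and grid[i][j] == grid[i - 1][j] + indel:
            # a[i-1] is aligned with a gap in b
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        elif j > 0 and grid[i][j] == grid[i][j - 1] + indel:
            # b[j-1] is aligned with a gap in a
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
        else:
            raise ValueError("Score grid is inconsistent with the scoring system")
    # The alignment was built back to front, so reverse it
    return "".join(reversed(out_a)), "".join(reversed(out_b))


# Hand-filled grid for a="GA", b="A" with indel=-1, match=2, mismatch=-1
grid = [[0, -1],
        [-1, -1],
        [-2, 1]]
print(traceback_sketch("GA", "A", grid))  # prints ('GA', '-A')
```

When several moves explain a cell equally well, this sketch reports only one of the optimal alignments.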

## Exercise 4: Needleman-Wunsch


<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Complete the `calculate_scoregrid()` function to calculate the scores needed for global alignment via the Needleman-Wunsch algorithm.
</div>

In [None]:
# A version with scores rather than costs, which can be specified
# Indels are scored per-base
def calculate_scoregrid(a, b,
                        indel_score=-1, match_score=2, mismatch_score=-1):
    """
    Given two strings a and b, calculate the maximum score grid, using
    specified scores for indels, matches and mismatches. Return the grid.
    Grid row and column 0 correspond to "before" the start of each string,
    so grid indexes are offset by 1 from string indexes. That is,
    grid position [1,1] represents the result of matching a[0] to b[0].
    """
    # The grid needs to be 1 bigger in each direction than the string lengths
    X = len(a)+1
    Y = len(b)+1
    scoregrid = np.zeros((X,Y), int)
    
    # You need to:
    # * initialise the top edge of grid, i.e. scoregrid[x,0] for all x, with indel scores
    # * initialise the left edge of grid, i.e. scoregrid[0,y] for all y, with indel scores
    # * loop over x and y, filling out each cell of the grid by looking for the
    #   maximum possible score from each of the three earlier cells
    
    ### BEGIN SOLUTION

    # Fill out indel scores along the top and left edges
    # It's fine to do this with two for loops instead
    scoregrid[:,0] = list(range(0,indel_score*X,indel_score))
    scoregrid[0,:] = list(range(0,indel_score*Y,indel_score))
    for x in range(1,X):
        for y in range(1,Y):
            # Since we filled out the edges first and are working our way along each row,
            # we can assume that the three cells contributing to (x,y) are already filled out
            if a[x-1]==b[y-1]: # Because our scoregrid is padded with zeros, coords are 1-indexed, whereas the sequences are 0-indexed
                diagonal_score = match_score
            else:
                diagonal_score = mismatch_score
            # Note maximum score, not minimum cost!
            score = max(scoregrid[x-1,y] + indel_score, # Left
                        scoregrid[x,y-1] + indel_score, # Up
                        scoregrid[x-1,y-1] + diagonal_score # Diagonal
                       )
            scoregrid[x,y] = score
            
    ### END SOLUTION
    
    return scoregrid

In [None]:
# Pre-defined functions to get the traceback given a correct scoregrid
# Use help(traceback) or help(get_alignment) to see how to call them
from src.alignment_functions import traceback, get_alignment

If `calculate_scoregrid()` works correctly, the below will work:

In [None]:
a = "GATTACA"
b = "GACTATA"

In [None]:
# Once you've implemented calculate_scoregrid, this should show the correct
# values instead of all zeroes
scoregrid = calculate_scoregrid(a,b)
scoregrid

In [None]:
print("Alignment score:",scoregrid[-1,-1])

In [None]:
# If the score grid isn't correct and consistent with the scoring system
# and the strings, traceback won't be able to find a path and will give an error
trace = traceback(a,b,scoregrid)
aligned_string_a, aligned_string_b = get_alignment(trace)
print(aligned_string_a)
print(aligned_string_b)

Try aligning the cDNA strings `human_ins` and `mouse_ins`.

<div style="color: rgb(0,96,100); background: rgb(178,235,242); border: solid 1px rgb(77,208,225); padding: 10px;">
<b>Question 3:</b> Which is true of the traceback procedure for a Needleman-Wunsch global-alignment matrix?

A. Always moves diagonally if the current cell corresponds to the score we would get from applying a match or mismatch score to the diagonal neighbour.
    
B. Always selects the highest neighbouring score at each step.
    
C. Always selects the minimum neighbouring score to the current cell.
    
</div>

=== BEGIN MARK SCHEME ===

A. Always moves diagonally if the current cell corresponds to the score we would get from applying a match or mismatch score to the diagonal neighbour.

=== END MARK SCHEME ===

YOUR ANSWER HERE

## Exercise 5: Local Alignment


<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Change your `calculate_scoregrid()` function to perform local instead of global alignment. You can import the `traceback_local()` function to help you test your result.
</div>

In [None]:
from src.alignment_functions import traceback_local

In [None]:
# A version with scores rather than costs, which can be specified
# Indels are scored per-base
def calculate_scoregrid_local(a, b,
                        indel_score=-1, match_score=2, mismatch_score=-1):
    """
    Given two strings a and b, calculate the maximum score grid, using
    specified scores for indels, matches and mismatches. Return the grid.
    Grid row and column 0 correspond to "before" the start of each string,
    so grid indexes are offset by 1 from string indexes. That is,
    grid position [1,1] represents the result of matching a[0] to b[0].
    """
    # The grid needs to be 1 bigger in each direction than the string lengths
    X = len(a)+1
    Y = len(b)+1
    scoregrid = np.zeros((X,Y), int)

    ### BEGIN SOLUTION

    # Fill out indel scores along the top and left edges; this will be zeros for local alignment
    # It's fine to do this with two for loops instead
    scoregrid[:,0] = [0] * X
    scoregrid[0,:] = [0] * Y
    for x in range(1,X):
        for y in range(1,Y):
            # Since we filled out the edges first and are working our way along each row,
            # we can assume that the three cells contributing to (x,y) are already filled out
            if a[x-1]==b[y-1]:
                diagonal_score = match_score
            else:
                diagonal_score = mismatch_score
            # Note maximum score, not minimum cost!
            # The only addition for local alignment is the extra 0 option, so the score never drops below 0
            score = max(scoregrid[x-1,y] + indel_score,
                        scoregrid[x,y-1] + indel_score,
                        scoregrid[x-1,y-1] + diagonal_score,
                        0)
            scoregrid[x,y] = score

    ### END SOLUTION
    
    return scoregrid

In [None]:
# Test your function
a = 'happily'
b = 'applying'


scoregrid_local = calculate_scoregrid_local(a, b)
scoregrid_local

In [None]:
# If the score grid isn't correct and consistent with the scoring system
# and the strings, traceback won't be able to find a path and will give an error
trace = traceback_local(a,b,scoregrid_local)
aligned_string_a, aligned_string_b = get_alignment(trace)
print(aligned_string_a)
print(aligned_string_b)
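
For comparison, here is a self-contained sketch of the whole local-alignment procedure. It is not the provided `traceback_local` function; `local_align_sketch` is a hypothetical name, and it uses nested lists rather than a numpy array. It fills the grid with scores floored at 0, starts the traceback from the largest cell, and stops as soon as it reaches a cell containing 0.

```python
def local_align_sketch(a, b, indel=-1, match=2, mismatch=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    grid = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the grid, remembering where the largest score occurs
    for i in range(1, rows):
        for j in range(1, cols):
            diag = match if a[i - 1] == b[j - 1] else mismatch
            grid[i][j] = max(0,
                             grid[i - 1][j - 1] + diag,
                             grid[i - 1][j] + indel,
                             grid[i][j - 1] + indel)
            if grid[i][j] > best:
                best, best_pos = grid[i][j], (i, j)

    # Trace back from the largest cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and grid[i][j] > 0:
        diag = match if a[i - 1] == b[j - 1] else mismatch
        if grid[i][j] == grid[i - 1][j - 1] + diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif grid[i][j] == grid[i - 1][j] + indel:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(local_align_sketch("happily", "applying"))  # prints (9, 'appily', 'app-ly')
```

If the largest score occurs in more than one cell, this sketch starts from the first one it encounters; other implementations may choose differently. With a numpy array, the position of the largest cell could instead be found with `np.unravel_index(np.argmax(scoregrid), scoregrid.shape)`.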

<div style="color: rgb(0,96,100); background: rgb(178,235,242); border: solid 1px rgb(77,208,225); padding: 10px;">
<b>Question 4:</b> The traceback procedure for a local-alignment matrix begins from which position?

A. Smallest value
    
B. Top-left
    
C. Largest value
    
D. Bottom-right
    
</div>

=== BEGIN MARK SCHEME ===

C. Largest value

=== END MARK SCHEME ===

YOUR ANSWER HERE