<div style="background: rgb(255,165,0); border: solid 1px rgb(129,199,132); padding: 10px;">    

<h1>Week 3 - Sequence Alignment</h1>

</div>

This week we'll look at some of the alignment algorithms discussed in lectures.

**Completing this Tutorial**

The README.md file in this repository has been updated.

Here are the new instructions for working on tutorials/assignments to avoid any merge conflicts. 

1. Open the COMP90014_2025 cloned folder using VS Code.
2. Navigate to the 'Source Control' tab.
3. Click the three dots '...' next to the 'CHANGES' heading.
4. Select 'Pull' to sync updates.
5. ***Copy the updated week tutorial / assignment folder to a separate location on your computer [NEW INSTRUCTION]***
6. ***Finally, open VS Code to the new folder [NEW INSTRUCTION]***

By copying the tutorial / assignment folder into a new location (outside the COMP90014_2025 folder), any updates to the official material won't conflict with changes you have made. This is because the copied folder is no longer within your local version of the COMP90014_2025 repository. 

**Requirements**

Once you have copied the week 3 folder into a new location... 

Create a virtual environment (venv) in the week 3 folder using Python's built-in `venv` module,

`python -m venv venv` 

Activate the environment,

`source venv/bin/activate`

Install the requirements for this tutorial,

`pip install ipykernel biopython numpy`

Then select the new `venv` environment as your notebook kernel. 
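
To confirm that the notebook kernel really is the new environment, a quick check along these lines can be run in the first cell. This is only a sketch: `missing_packages` is a hypothetical helper, and it assumes the packages are imported as `Bio`, `numpy` and `ipykernel`.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter path should point inside the venv folder created above.
print("Interpreter:", sys.executable)
print("Missing packages:", missing_packages(["Bio", "numpy", "ipykernel"]))
```

An empty list means all three packages are visible to the kernel.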

## Exercise 1: Hamming distance

Hamming distance is an ungapped measure of string distance. 

For sequences of the same length, we iterate over each position and note whether the letters match or mismatch. 

The returned distance score is the total number of mismatches. 

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Task</b> 

Edit the Hamming distance function below so that it returns the correct Hamming distance for two strings `a` and `b`.
</div>

In [None]:
def hamming(a, b):
    """
    Calculate the Hamming distance between strings a and b.
    The strings must be the same length.
    """
    # YOUR CODE HERE
    raise NotImplementedError

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;"></div>

The cells below provide test input to check your code. 

In [4]:
# Should return 1
hamming("GATTACA","GACTACA")

1

In [5]:
# Should return 2
hamming("GATTACA","GACTACT")

2

In [6]:
# What will your function do if the strings are of different length? 
# What should it do?
hamming("happiness", "applying")

Input strings must be equal length


When we shift either sequence by adding or removing letters, Hamming distance increases dramatically. 

This is problematic for biological sequences, as indels are a common form of genetic variation, and introduce these shifts. 

In the example below, inserting a base at the start of s2 (a single edit) results in a disproportionately high distance, as all downstream bases are affected by the shift. 

We can rescue this effect by adding a corresponding gap to s1. 

In [7]:
print(hamming("GATTACA","GATTACA"))    # Should return 0
print(hamming("GATTACA","TGATTAC"))    # Should return 6
print(hamming("-GATTAC","TGATTAC"))    # Should return 1

0
6
1
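
To see where the mismatches fall, the comparison can be drawn position by position. The sketch below uses a hypothetical helper, `mismatch_marks`, which is not part of the tutorial code; it marks matches with `|` and mismatches with `x`.

```python
def mismatch_marks(a, b):
    """Return a string with '|' where a and b match and 'x' where they differ."""
    if len(a) != len(b):
        raise ValueError("Input strings must be equal length")
    return "".join("|" if x == y else "x" for x, y in zip(a, b))


# Shifted by one base: almost every position mismatches.
print(mismatch_marks("GATTACA", "TGATTAC"))  # xxx|xxx
# With a gap added to the first sequence, only the gap position mismatches.
print(mismatch_marks("-GATTAC", "TGATTAC"))  # x||||||
```

Counting the `x` characters gives the same 6 and 1 reported above.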


## Exercise 2: Levenshtein distance

Clearly, we need to be able to handle gaps. 

We need to move away from simple matches/mismatches and instead try to identify the smallest number of 'edits' which would transform one sequence into another. 

These edits include inserting / deleting characters. 

The idea of transformations and edits is well suited to sequence alignment, as we're effectively counting the evolutionary events which separate two sequences. 

Levenshtein distance is a formalisation of this idea, and can be implemented using two strategies: recursion, and dynamic programming.

Levenshtein distance builds upon Hamming distance by considering that a shift at the current location may result in a better alignment of the remaining sequence. 


In [8]:
# at position 3, what would return the best score?
# if keeping same alignment, only add an edit if characters don't match. 
# if adding a gap, need to add an edit (as an edit was made!).

# any sequence length difference will count towards the distance.
pos = 3
seq1 = "GATACA"
seq2 = "GATTACA"
#          ^

# Don't introduce a gap. 
# A CA 
# T ACA
# ^
this_dist = 1 if seq1[pos] != seq2[pos] else 0
future_dist = hamming('CA.', 'ACA')
print('inline: ', this_dist + future_dist)

# Introduce a gap in seq1. 
# - ACA 
# T ACA
# ^
this_dist = 1
future_dist = hamming('ACA', 'ACA')
print('seq1 gap: ', this_dist + future_dist)

# Introduce a gap in seq2
# A CA 
# - TACA
# ^
this_dist = 1 
future_dist = hamming('CA..', 'TACA')
print('seq2 gap: ', this_dist + future_dist)


inline:  4
seq1 gap:  1
seq2 gap:  4


This poses an issue. 

If we test 3 alignments at a given base, ***for each*** of these alignments, the next base also requires 3 alignments. 

This is inherently recursive. 


<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Task</b>

Edit the `lev` function below to calculate Levenshtein distance recursively. As we are counting the number of edits required to transform one string into the other, use the following costs:

- 1 for an indel
- 1 for a mismatch
- 0 for a match

This is the same function as shown during lectures, but try to implement it without looking back at the slides.
</div>

In [None]:
def lev(a, b):
    """
    Recursively calculate Levenshtein distance between strings a and b.
    """

    # YOUR CODE HERE
    raise NotImplementedError

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;"></div>


### Python Debugger

If we add breakpoints at the base cases, we can see how the distance is calculated step by step.

In [None]:
lev("AT","T")

1

Try inserting a print statement at the top of your code to show the argument values each time the function is called.

In [30]:
print(lev("GATTACA","GAATACA")) # should return 1
print(lev("-GATTAC","TGATTAC")) # should return 1

1
1


In [31]:
# Should return 4
lev("tuesday","sundays")

4

In [32]:
# Should return 6
lev("happiness","applying")

6

## Exercise 3: Levenshtein distance with Dynamic Programming

During the recursive Levenshtein algorithm, each function call created 3 more function calls. <br>
This results in a huge number of operations, and is inefficient as we are repeating many calculations. 
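
To get a feel for the scale, we can count the calls without computing any distances. The sketch below assumes a recursion that stops as soon as either string is empty and otherwise makes three further calls; `count_calls` is a hypothetical helper, not the `lev` function itself. It prints the number of calls next to the number of distinct subproblems, (m + 1) × (n + 1).

```python
def count_calls(m, n):
    """Count calls made by a three-way recursion on strings of length m and n."""
    if m == 0 or n == 0:
        return 1  # base case: a single call with no further recursion
    # One call for this level, plus the three recursive branches.
    return (1
            + count_calls(m - 1, n)       # gap in one string
            + count_calls(m, n - 1)       # gap in the other string
            + count_calls(m - 1, n - 1))  # match or mismatch


for size in (2, 4, 6):
    print(size, count_calls(size, size), (size + 1) ** 2)
```

The call count grows far faster than the number of distinct subproblems, which is the gap that dynamic programming closes.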

Using an example from above, we can see why. <br>
Let's imagine that we don't consider shifts. Our recursive Levenshtein becomes quite simple (essentially Hamming distance).
- score += 1 if bases don't match 
- return score + lev(seq1[i+1:], seq2[i+1:])

For example, if we provide the strings 'GAATACA' and 'GATTACA' as input, we would have the following calls: 
```python
0 + lev('AATACA', 'ATTACA')
0 + 0 + lev('ATACA', 'TTACA')
0 + 0 + 1 + lev('TACA', 'TACA')
0 + 0 + 1 + 0 + lev('ACA', 'ACA')
0 + 0 + 1 + 0 + 0 + lev('CA', 'CA')
0 + 0 + 1 + 0 + 0 + 0 + lev('A', 'A')
```

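Once shifts are allowed, each of these suffix pairs (for example `lev('ACA', 'ACA')`) is reached along many different paths and recomputed every time. We can use dynamic programming to reduce the number of calculations we do. 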

Dynamic programming approaches break one large problem into a series of local subproblems. <br>
In this setup, the solution to a given subproblem depends only on the solutions to a small number of previously solved subproblems. 

To adapt our Levenshtein distance algorithm we will need to keep track of previous solutions. <br>
Additionally, we will need to calculate the current solution based on previous solutions, so will need to think about prefixes rather than suffixes. 

To implement this, we can create a matrix to keep track of already calculated Levenshtein distances.  <br>
We will place Levenshtein distances between all prefixes of string1 and all prefixes of string2 in this matrix.  <br>
To calculate the next distance, we therefore only need the relevant previous solutions, which are in the matrix. 

<img src="https://raw.githubusercontent.com/melbournebioinformatics/COMP90014_2024/master/tutorials/media/week3/levenshtein.png" width="600">
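
The matrix can also be described as a recurrence. The notation here is an assumption introduced for illustration: $a$ is string1, $b$ is string2, $D_{i,j}$ is the Levenshtein distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, and $[a_i \neq b_j]$ is 1 when the two characters differ and 0 when they match.

$$
D_{i,0} = i, \qquad D_{0,j} = j, \qquad
D_{i,j} = \min
\begin{cases}
D_{i-1,j} + 1 & \text{(indel: move down from the cell above)} \\
D_{i,j-1} + 1 & \text{(indel: move right from the cell to the left)} \\
D_{i-1,j-1} + [a_i \neq b_j] & \text{(match or mismatch: move diagonally)}
\end{cases}
$$

The distance between the full strings is the bottom-right cell, $D_{|a|,|b|}$.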

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">

<b>Challenge:</b> Find the Levenshtein distance of two sequences using a dynamic programming approach. 

You can use the costs:
* 1 for an indel
* 1 for a mismatch
* 0 for a match

- [ ] Initialise the scoregrid as a numpy array
- [ ] Populate the first row and column with cumulative indel scores
- [ ] Fill the rest of the matrix row by row, starting from the top left
- [ ] Select the minimum scoring operation from {insertion, deletion, match, mismatch} at each step.

</div>

In [None]:
import numpy as np

def levenshtein_distance(str1, str2):
    """
    Calculate the Levenshtein distance between two strings using dynamic programming.

    Parameters:
    str1 (str): The first input string.
    str2 (str): The second input string.

    Returns:
    int: The Levenshtein distance between the two input strings.
    """

    len_str1 = len(str1)
    len_str2 = len(str2)
    
    # Initialize a matrix to store distances
    dist_matrix = np.zeros((len_str1 + 1, len_str2 + 1), dtype=int)
    
    # YOUR CODE HERE 
    raise NotImplementedError

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;"></div>

In [34]:
# Test your function!
str1 = "kitten"
str2 = "sitting"
distance = levenshtein_distance(str1, str2)
print(f"Levenshtein distance between '{str1}' and '{str2}': {distance}")

Levenshtein distance between 'kitten' and 'sitting': 3


## Exercise 4: Pairwise Alignment 

Use the Bio.Align.PairwiseAligner class from Biopython to perform alignment. <br>
See the cell below for an example of using PairwiseAligner. 
- Call the .align() function to do alignment.
- The .align() function returns an iterable of the optimal alignments, which all share the same best score; index `[0]` gives the first of them.
- The score for an alignment can be retrieved via the .score property.
- Documentation can be seen at https://biopython.org/docs/dev/api/Bio.Align.html if needed. 


In [105]:
import Bio.Align
aligner = Bio.Align.PairwiseAligner()
aligner.mode = 'local'
alignments = aligner.align("GAACTC", "GAAC")
alignment = alignments[0]
print(alignment)

target            0 GAAC 4
                  0 |||| 4
query             0 GAAC 4



If we want to see all the alignments and their scores:

In [106]:
for alignment in alignments:
    print(alignment)
    print("Alignment score:", alignment.score)

target            0 GAAC 4
                  0 |||| 4
query             0 GAAC 4

Alignment score: 4.0
target            0 GAACTC 6
                  0 |||--| 6
query             0 GAA--C 4

Alignment score: 4.0


The default aligner doesn't penalise gaps. We can specify separate penalties for opening and extending a gap, because biologically one large gap is more likely than several small gaps.

In [109]:
import Bio.Align
aligner = Bio.Align.PairwiseAligner()
aligner.mode = 'local'
aligner.open_gap_score = -2
aligner.extend_gap_score = -1
alignments = aligner.align("GAACTC", "GAAC")
for alignment in alignments:
    print(alignment)
    print("Alignment score:", alignment.score)

target            0 GAAC 4
                  0 |||| 4
query             0 GAAC 4

Alignment score: 4.0


Now we have only one alignment. `Bio.Align.PairwiseAligner()` can also do global alignment with similar syntax.

In [110]:
import Bio.Align
aligner = Bio.Align.PairwiseAligner()
aligner.mode = 'global'
aligner.open_gap_score = -2
aligner.extend_gap_score = -1
alignments = aligner.align("GAACTC", "GAAC")
for alignment in alignments:
    print(alignment)
    print("Alignment score:", alignment.score)

target            0 GAACTC 6
                  0 |||--| 6
query             0 GAA--C 4

Alignment score: 1.0
target            0 GAACTC 6
                  0 ||||-- 6
query             0 GAAC-- 4

Alignment score: 1.0
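
The score of 1.0 can be checked by hand. The sketch below assumes that each match scores 1 (as the four-match local alignment scoring 4.0 suggests) and that a gap of length k costs the open score for its first position plus the extend score for each further position. `gap_score` is a hypothetical helper, not a Biopython function.

```python
def gap_score(length, open_score=-2, extend_score=-1):
    """Score of one contiguous gap: open once, then extend for each extra position."""
    if length <= 0:
        return 0
    return open_score + (length - 1) * extend_score


# Four matches and one gap of length 2, as in both global alignments above.
matches = 4
print(matches * 1 + gap_score(2))  # 1

# One long gap is cheaper than two separate single-base gaps.
print(gap_score(2))      # -3
print(2 * gap_score(1))  # -4
```

This is why separate open and extend scores favour a single large gap over several small ones.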


<div style="background: rgb(255,165,0); border: solid 1px rgb(129,199,132); padding: 10px;">    

<h1>Extension Activities</h1>

</div>

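From here, Levenshtein distance can be turned into Needleman-Wunsch global alignment by replacing the edit costs with a scoring scheme (rewards for matches, penalties for mismatches and gaps) and maximising the score instead of minimising the distance. 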

Local (Smith-Waterman) and semi-global alignment are also similar, but have a few extra tweaks. 

## Global alignment: Needleman-Wunsch

To do global alignment with the Needleman-Wunsch algorithm (dynamic programming), we need the following: 

1. Fill out the grid of alignment scores. This is enough to give the final ***alignment score***.
2. Fill out a separate grid which keeps track of the arrows (for each cell, did we come from diagonal, above, or left).
3. Perform traceback (using the arrows grid) to get the actual alignment of the strings.

Item 1) alone is enough to calculate the alignment score.

Items 2) and 3) are left for you to implement in the extension exercises below.
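
As a rough illustration of item 1), the score grid can be filled as sketched below. The scores used (match +1, mismatch -1, gap -1) are assumptions chosen for this example, and `nw_score` is a hypothetical name. Plain Python lists are used so the block runs without NumPy.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b with a linear gap penalty (illustrative scores)."""
    n, m = len(a), len(b)
    grid = [[0] * (m + 1) for _ in range(n + 1)]

    # First column and row: aligning a prefix against nothing costs one gap per character.
    for i in range(1, n + 1):
        grid[i][0] = i * gap
    for j in range(1, m + 1):
        grid[0][j] = j * gap

    # Each remaining cell takes the best of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = grid[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = grid[i - 1][j] + gap
            left = grid[i][j - 1] + gap
            grid[i][j] = max(diag, up, left)

    return grid[n][m]


print(nw_score("GATTACA", "GATTACA"))  # 7
print(nw_score("GAACTC", "GAAC"))      # 2
```

Compared with Levenshtein distance, the only structural changes are the scoring values and the use of `max` in place of `min`.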

## Extension Exercise 1: Tracking Matrix

To trace back the actual alignment after calculating the Levenshtein distance, we need a separate matrix that stores, for each cell, the direction we arrived from. In this exercise the tracking matrix holds integers: `0 for diagonal, 1 for above, and 2 for left`. In the expected output below, the top-left cell is `-1` because it has no predecessor.
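
For a single cell, choosing the direction might look like the sketch below. `choose_direction` is a hypothetical helper, and breaking ties in the order diagonal, above, left is an assumption; check it against the expected tracking matrix in this exercise.

```python
def choose_direction(diag, up, left):
    """Return (best cost, direction code) for one cell: 0 diagonal, 1 above, 2 left."""
    candidates = [diag, up, left]  # list index doubles as the direction code
    best = min(candidates)
    # index() returns the first position holding the minimum, so ties go to the lower code.
    return best, candidates.index(best)


print(choose_direction(3, 5, 3))  # (3, 0): diagonal wins the tie with left
print(choose_direction(3, 2, 4))  # (2, 1): coming from above is cheapest
```

Each argument is the cost of reaching the cell by that move, meaning the neighbouring cell's distance plus the cost of the step.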

Copy your levenshtein_distance() implementation into the cell below as a starting point for this challenge.

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Complete the `levenshtein_tracking_matrix()` function to keep track of the arrows for each cell in a separate matrix.

<b>Note:</b> Instead of returning the distance (the bottom right cell), we are returning the whole matrix as well as the tracking matrix.
</div>

In [None]:
def levenshtein_tracking_matrix(str1, str2):
    """
    Calculate the Levenshtein distance between two strings using dynamic programming.

    Parameters:
    str1 (str): The first input string.
    str2 (str): The second input string.

    Returns:
    tuple of numpy.ndarray: The distance matrix of Levenshtein distance and the tracking matrix
    """
    len_str1 = len(str1)
    len_str2 = len(str2)
    
    # Initialize a matrix to store distances and the tracking matrix
    dist_matrix = np.zeros((len_str1 + 1, len_str2 + 1), dtype=int)
    # Integer, 0 for diagonal, 1 for above, and 2 for left
    tracking_matrix = np.zeros((len_str1 + 1, len_str2 + 1), dtype=int)

    # Your code HERE
    raise NotImplementedError
    
    # Instead of returning the distance, return the matrix
    return dist_matrix, tracking_matrix

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;"></div>

In [112]:
# Once you've implemented levenshtein_tracking_matrix, this should show the correct
# values instead of all zeroes
a = "GATTACA"
b = "GACTATA"
dist_matrix, tracking_matrix = levenshtein_tracking_matrix(a, b)
# Expected output:
"""
Distance Matrix:
[[0 1 2 3 4 5 6 7]
 [1 0 1 2 3 4 5 6]
 [2 1 0 1 2 3 4 5]
 [3 2 1 1 1 2 3 4]
 [4 3 2 2 1 2 2 3]
 [5 4 3 3 2 1 2 2]
 [6 5 4 3 3 2 2 3]
 [7 6 5 4 4 3 3 2]]

Tracking Matrix, 0 for diagonal, 1 for above, and 2 for left
[[-1  2  2  2  2  2  2  2]
 [ 1  0  2  2  2  2  2  2]
 [ 1  1  0  2  2  0  2  0]
 [ 1  1  1  0  0  2  0  2]
 [ 1  1  1  0  0  0  0  2]
 [ 1  1  0  0  1  0  2  0]
 [ 1  1  1  0  1  1  0  0]
 [ 1  1  0  1  0  0  0  0]]
"""

print("Distance Matrix:")
print(dist_matrix)
print("\nTracking Matrix, 0 for diagonal, 1 for above, and 2 for left", )
print(tracking_matrix)

Distance Matrix:
[[0 1 2 3 4 5 6 7]
 [1 0 1 2 3 4 5 6]
 [2 1 0 1 2 3 4 5]
 [3 2 1 1 1 2 3 4]
 [4 3 2 2 1 2 2 3]
 [5 4 3 3 2 1 2 2]
 [6 5 4 3 3 2 2 3]
 [7 6 5 4 4 3 3 2]]

Tracking Matrix, 0 for diagonal, 1 for above, and 2 for left
[[-1  2  2  2  2  2  2  2]
 [ 1  0  2  2  2  2  2  2]
 [ 1  1  0  2  2  0  2  0]
 [ 1  1  1  0  0  2  0  2]
 [ 1  1  1  0  0  0  0  2]
 [ 1  1  0  0  1  0  2  0]
 [ 1  1  1  0  1  1  0  0]
 [ 1  1  0  1  0  0  0  0]]


## Extension Exercise 2: Global Alignment

With the tracking matrix, we can trace back from the bottom-right corner, and construct the actual alignment pattern.
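
Each direction code consumes characters differently. The sketch below shows this in isolation, without the traceback itself. It assumes str1 runs down the rows and str2 across the columns, so that 'above' uses up a character of str1 and 'left' a character of str2. `apply_moves` is a hypothetical helper, and the list of moves is written by hand in start-to-end order.

```python
def apply_moves(str1, str2, moves):
    """Build the two aligned strings from direction codes listed start to end."""
    i = j = 0
    out1, out2 = [], []
    for move in moves:
        if move == 0:    # diagonal: consume one character from each string
            out1.append(str1[i])
            out2.append(str2[j])
            i += 1
            j += 1
        elif move == 1:  # above: consume str1 only, gap in str2
            out1.append(str1[i])
            out2.append("-")
            i += 1
        elif move == 2:  # left: consume str2 only, gap in str1
            out1.append("-")
            out2.append(str2[j])
            j += 1
        else:
            raise ValueError(f"Unknown direction code: {move}")
    return "".join(out1), "".join(out2)


print(apply_moves("happily", "applying", [1, 0, 0, 0, 2, 2, 0, 0, 0]))
# ('happ--ily', '-applying')
```

A traceback visits the cells from the bottom-right backwards, so it produces these moves in reverse order.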

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Complete the `traceback_global_alignment()` function to perform global alignment.


Parameters:
* str1 (str): The first input string.
* str2 (str): The second input string.

Returns:
- tuple of str: The aligned versions of str1 and str2 with '-' for gaps.

Hints:
- [ ] Start from the bottom-right corner.
- [ ] Stop once you reach the top-left cell; along the top row or left column only gap moves remain.
</div>

In [None]:
def traceback_global_alignment(str1, str2):
    """
    Reconstruct the global (Needleman-Wunsch) alignment from the distance and tracking matrices.

    Parameters:
    str1 (str): The first input string.
    str2 (str): The second input string.

    Returns:
    tuple of str: The aligned versions of str1 and str2 with '-' for gaps.
    """

    dist_matrix, tracking_matrix = levenshtein_tracking_matrix(str1, str2)

    # Your code HERE
    raise NotImplementedError

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;"></div>

In [117]:
# Test your function
a = 'happily'
b = 'applying'
aligned = traceback_global_alignment(a, b)
aligned

('happ--ily', '-applying')