### EECS 730 Project - 3 - without affine gap

#### Steps followed
1. We have two input FASTA files, file1.fasta and file2.fasta, containing the input nucleotide sequences.
2. We read the sequences from the two files into a list and perform the steps below.
3. We use dynamic programming (the Needleman-Wunsch algorithm) to determine the global alignment of the two sequences.
4. Fill the score matrix (called the substitution matrix in this notebook) and the traceback matrix from the two input sequences.
5. Determine the alignment by tracing back from the last entry of the substitution matrix to the first, using the traceback matrix as the reference for each move.
6. Calculate the score while determining the alignment, and print the matrices, the alignment, and the score to the console.
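
Steps 3 to 5 can be sketched on a toy pair of sequences. The block below is a minimal pure-Python illustration, not the notebook's NumPy implementation: the function name `needleman_wunsch` is hypothetical, and the default scores (match 1, mismatch -2, gap -3) are assumed to mirror the scoring parameters set in the main logic cell.

```python
def needleman_wunsch(s0, s1, match=1, mismatch=-2, gap=-3):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_s0, aligned_s1). s0 runs along the columns
    and s1 along the rows, as in the notebook.
    """
    nrows, ncols = len(s1) + 1, len(s0) + 1
    score = [[0] * ncols for _ in range(nrows)]
    trace = [['x'] * ncols for _ in range(nrows)]

    # First row and column: only gaps are possible
    for j in range(1, ncols):
        score[0][j], trace[0][j] = gap * j, 'l'
    for i in range(1, nrows):
        score[i][0], trace[i][0] = gap * i, 'u'

    # Fill both matrices with the recurrence
    for i in range(1, nrows):
        for j in range(1, ncols):
            diag = score[i - 1][j - 1] + (match if s1[i - 1] == s0[j - 1] else mismatch)
            left = score[i][j - 1] + gap
            up = score[i - 1][j] + gap
            best = max(diag, left, up)
            score[i][j] = best
            # Ties prefer diagonal, then left, then up
            trace[i][j] = 'd' if best == diag else ('l' if best == left else 'u')

    # Trace back from the last cell to the first
    a0, a1 = [], []
    i, j = nrows - 1, ncols - 1
    while trace[i][j] != 'x':
        move = trace[i][j]
        if move == 'd':
            a0.append(s0[j - 1]); a1.append(s1[i - 1]); i -= 1; j -= 1
        elif move == 'l':
            a0.append(s0[j - 1]); a1.append('-'); j -= 1
        else:
            a0.append('-'); a1.append(s1[i - 1]); i -= 1
    return score[-1][-1], ''.join(reversed(a0)), ''.join(reversed(a1))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

For "ACGT" against "AGT", the only optimal alignment pairs three matches with one gap, so the score is 3 * 1 + (-3) = 0.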

#### Instructions to run
1. Install Biopython. It is used only for testing in the last cell.
2. To skip the test, comment out the Biopython imports and the version print in the first cell, and do not run the last cell.
3. Set the variable 'path' to the folder that contains the input files.
4. Change the file names in the variables 'f1' and 'f2' as needed, then check the sequences printed to the console.
5. Once the sequences look correct, run each remaining cell to get the alignment and score.
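
As an alternative to commenting out imports by hand (instruction 2), the Biopython dependency could be made optional in code. This is only a sketch: the helper `biopython_available` and the flag `HAVE_BIOPYTHON` are hypothetical names that do not appear in the notebook.

```python
import importlib.util


def biopython_available():
    """Return True if the 'Bio' package can be imported in this environment."""
    return importlib.util.find_spec("Bio") is not None


# Hypothetical flag: the testing cell would run only when this is True
HAVE_BIOPYTHON = biopython_available()

if HAVE_BIOPYTHON:
    import Bio
    print('The Biopython version is {}..'.format(Bio.__version__))
else:
    print('Biopython not found; skipping the testing cell.')
```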

#### Import relevant packages

In [1]:
# import packages
import os
import Bio
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Align
import numpy as np
#import pandas as pd
#import dask.dataframe as dd
#from dask.multiprocessing import get
#from pydna.common_sub_strings import terminal_overlap
#from pydna.assembly import Assembly
#from pydna.dseqrecord import Dseqrecord

# Print versions
print('The Biopython version is {}..'.format(Bio.__version__))

The Biopython version is 1.78..


#### Create the paths for reference files

In [2]:
# Set the local paths for data
path = r'C:\Users\pmspr\Documents\HS\MS\Sem 4\EECS 730\Bioinformatics\Project 3\Docs'
f1 = os.path.join(path, 'file1.fasta')
f2 = os.path.join(path, 'file2.fasta')
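seq = []

def read_fasta_sequence(file_path):
    # Skip header lines ('>') and join the rest, so a sequence wrapped over
    # several lines is read as one string (assumes one record per file)
    with open(file_path) as handle:
        return ''.join(line.strip() for line in handle if not line.startswith('>'))

# Read file1.fasta
seq.append(read_fasta_sequence(f1))

# Read file2.fasta
seq.append(read_fasta_sequence(f2))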

print('Sequence 1 - {}'.format(seq[0]))
print('Sequence 2 - {}'.format(seq[1]))

Sequence 1 - ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCCGGACCCAAACCCCACCCCTCACTCTGCTTCTCCCCGCAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCC
Sequence 2 - ACTCTTCTGGTCCCCACAGACTCAGCCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCCGGACCCAAACCCCACCCCTCACCCCGCAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCC


#### Main Logic

In [3]:
# Determine the lengths.
ls0 = len(seq[0]); ncols = ls0 + 1
ls1 = len(seq[1]); nrows = ls1 + 1

# Create numpy arrays for each input sequence
# Index 0 holds a placeholder '*' so that sequence positions line up with matrix indices
# (dtype=object replaces np.object, which was removed in NumPy 1.24)
s0_arr = np.array(['*'] + list(seq[0]), dtype=object)
#print(s0_arr)

s1_arr = np.array(['*'] + list(seq[1]), dtype=object)
#print(s1_arr)

# Set the scoring parameters
match_score = 1.0
mismatch_score = -2.0
gap_score = -3

# Initialize the matrices with default values
sub_arr = np.zeros(shape=(nrows,ncols),dtype=np.int64)
sub_arr[0,1:] = np.linspace(gap_score, gap_score*ls0, ls0,dtype=np.int64)
sub_arr[1:,0] = np.linspace(gap_score, gap_score*ls1, ls1,dtype=np.int64)
# 'x' marks the origin, 'l' a move to the left, 'u' a move up
trace_arr = np.full((nrows, ncols), 'x', dtype=object)
trace_arr[0,1:] = 'l'
trace_arr[1:,0] = 'u'

# Traverse the matrices and fill them using the Needleman-Wunsch recurrence
# sub_arr[i,j] = max(sub_arr[i-1,j-1] + match/mismatch, sub_arr[i,j-1] + gap, sub_arr[i-1,j] + gap)
# trace_arr[i,j] = direction of that maximum: 'd' (diagonal), 'l' (left), 'u' (up)
directions = ['d', 'l', 'u']
neighbors = np.array([0,0,0], dtype=np.int64)
for i in range(1,nrows):
    for j in range(1,ncols):
        if(s1_arr[i] == s0_arr[j]):
            neighbors[0] = sub_arr[i-1,j-1] + match_score
        else:
            neighbors[0] = sub_arr[i-1,j-1] + mismatch_score
        neighbors[1] = sub_arr[i,j-1] + gap_score
        neighbors[2] = sub_arr[i-1,j] + gap_score
        # np.argmax returns the first maximum, so ties prefer 'd', then 'l', then 'u'
        best = np.argmax(neighbors)
        sub_arr[i,j] = neighbors[best]
        trace_arr[i,j] = directions[best]

print('Substitution Matrix:')
print(sub_arr)
print()
print('Trace back Matrix:')
print(trace_arr)

# Determine the alignment using the traceback matrix
# Determine the score using the scoring parameters
i = trace_arr.shape[0] - 1
j = trace_arr.shape[1] - 1
a0 = []; a1 = []; a2 = []
score = 0
while(trace_arr[i,j] != 'x'):
    # elif ensures exactly one move is handled per iteration
    if(trace_arr[i,j] == 'd'):
        a0.append(s0_arr[j])
        a1.append(s1_arr[i])
        # A diagonal move is either a match or a mismatch; score and mark it accordingly
        if(s0_arr[j] == s1_arr[i]):
            a2.append('|')
            score = score + match_score
        else:
            a2.append('.')
            score = score + mismatch_score
        i = i-1; j = j-1
    elif(trace_arr[i,j] == 'l'):
        a0.append(s0_arr[j])
        a1.append('-')
        a2.append(' ')
        score = score + gap_score
        j = j-1
    elif(trace_arr[i,j] == 'u'):
        a0.append('-')
        a1.append(s1_arr[i])
        a2.append(' ')
        score = score + gap_score
        i = i-1
    #print(i,j)
as0 = ''.join(a0[::-1]); as1 = ''.join(a1[::-1]); as2 = ''.join(a2[::-1])

# Print the alignment to console
print()
print('Global Alignment without Affine gap with score = {}:'.format(score))
print(as0)
print(as2)
print(as1)

Substitution Matrix:
[[   0   -3   -6 ... -858 -861 -864]
 [  -3    1   -2 ... -854 -857 -860]
 [  -6   -2    2 ... -850 -853 -856]
 ...
 [-798 -794 -790 ...  206  203  200]
 [-801 -797 -793 ...  203  207  204]
 [-804 -800 -796 ...  200  204  208]]

Trace back Matrix:
[['x' 'l' 'l' ... 'l' 'l' 'l']
 ['u' 'd' 'l' ... 'l' 'l' 'l']
 ['u' 'u' 'd' ... 'd' 'd' 'd']
 ...
 ['u' 'u' 'd' ... 'd' 'd' 'd']
 ['u' 'u' 'd' ... 'd' 'd' 'd']
 ['u' 'u' 'd' ... 'd' 'd' 'd']]

Global Alignment without Affine gap with score = 208.0:
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCCGGACCCAAACCCCACCCCTCACTCTGCTTCTCCCCGCAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCC
|||||||||||||||||||||||    ||      ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
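
One way to cross-check the reported score is to recompute it directly from the two aligned strings, independently of the traceback. The sketch below assumes the aligned strings use '-' for gaps (as `as0` and `as1` do in the main logic cell); the function name `score_alignment` is hypothetical, and its defaults are assumed to match the notebook's scoring parameters.

```python
def score_alignment(aligned0, aligned1, match=1.0, mismatch=-2.0, gap=-3.0):
    """Score two gapped strings of equal length column by column."""
    if len(aligned0) != len(aligned1):
        raise ValueError('Aligned sequences must have the same length')
    total = 0.0
    for a, b in zip(aligned0, aligned1):
        if a == '-' or b == '-':
            total += gap        # a gap in either sequence
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


# Toy check: three matches and one gap
print(score_alignment("ACGT", "A-GT"))
```

In the notebook, calling `score_alignment(as0, as1)` should reproduce the value in the bottom-right cell of the substitution matrix if the traceback is consistent with the fill step.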

#### Testing

In [4]:
# Using biopython
# Sequence alignment without affine gap
from Bio.Align import substitution_matrices
aligner_noaffine = Align.PairwiseAligner()
aligner_noaffine.match_score = 1.0
aligner_noaffine.mismatch_score = -2.0
aligner_noaffine.gap_score = -3
alignments = aligner_noaffine.align(seq[0], seq[1])
print(aligner_noaffine.substitution_matrix)
for al in alignments:
    print('Below Alignment has a score:{}'.format(al.score))
    print(al)

None
Below Alignment has a score:208.0
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCCGGACCCAAACCCCACCCCTCACTCTGCTTCTCCCCGCAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCC
|||||||||||||||||||||||||------||-|---||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||-|--|--|-----|||||||||||||||||||||||||||||||||||||||||||
ACTCTTCTGGTCCCCACAGACTCAG------CC-A---TGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCCGGACCCAAACCCCACCCCTCAC-C--C--C-----GCAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCC

Below Alignment has a score:208.0
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACA

ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCCGGACCCAAACCCCACCCCTCACTCTGCTTCTCCCCGCAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCC
||||||||||||||||||||||||----|----|--||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||----|--|---|-|||||||||||||||||||||||||||||||||||||||||||
ACTCTTCTGGTCCCCACAGACTCA----G----C--CATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCTCCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCCGGACCCAAACCCCACCCCTCAC----C--C---C-GCAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCC

Below Alignment has a score:208.0
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGC

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
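
The rate-limit message above appears because the testing loop prints every co-optimal alignment, and there can be a very large number of them. Rather than raising the notebook limit, the loop could stop after the first few. The sketch below uses `itertools.islice`, which works on any iterable; the helper name `first_items` is hypothetical, and it is assumed that the object returned by `aligner_noaffine.align` yields alignments lazily.

```python
from itertools import islice


def first_items(iterable, limit=3):
    """Return at most `limit` items from an iterable without consuming the rest."""
    return list(islice(iterable, limit))


# Demonstration on a generator that would be far too long to print in full
print(first_items((n * n for n in range(10**9)), limit=3))

# In the testing cell this could be used as (assumption, not run here):
# for al in first_items(alignments, limit=2):
#     print('Below Alignment has a score:{}'.format(al.score))
#     print(al)
```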

