## Part 1

1) For both Smith-Waterman and Needleman-Wunsch algorithms:
    a) What are the parameters and variables required for algorithm initialization, execution, and termination?
    b) What quantities are returned?
    c) What is the runtime complexity?
    
    Smith-Waterman:
        a) Parameters:
            Initialization:
                a) seq1: first protein sequence (length M)
                b) seq2: second protein sequence (length N)
                c) smat: substitution (scoring) matrix
                d) gap opening and gap extension penalties
            Execution:
                a) scores: M x N matrix holding the best alignment score ending at each cell
                b) traceback: M x N matrix recording the cell from which each score was computed
            Termination:
                a) result: optimal alignment string
        b) Returns: the highest-scoring local alignment (sub-alignment) and its score
        c) Complexity: O(M x N) time and space

    Needleman-Wunsch:
        a) Parameters:
            Initialization:
                a) seq1: first protein sequence (length M)
                b) seq2: second protein sequence (length N)
                c) smat: substitution (scoring) matrix
                d) gap opening and gap extension penalties
            Execution:
                a) scores: M x N matrix holding the best alignment score ending at each cell
                b) traceback: M x N matrix recording the cell from which each score was computed
            Termination:
                a) result: optimal alignment string
        b) Returns: the global alignment spanning the full length of both sequences, and its score
        c) Complexity: O(M x N) time and space

2) What functionalities in initialization, execution and termination are shared between these algorithms? Which are not shared?

    Initialization: Both methods initialize an M x N matrix, where M is the length of seq1 and N is the length of seq2. They differ in the first row and first column: SW sets them to 0, whereas NW fills them with cumulative gap penalties.
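
    The difference in initialization can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `init_scores`, the single linear `gap` penalty, and the extra border row and column (giving an (M+1) x (N+1) matrix) are all assumptions made for the example.

```python
def init_scores(m, n, gap=-2, local=False):
    """Build a (m+1) x (n+1) score matrix with its border initialized.

    Assumption: one linear gap penalty `gap` per gapped position.
    """
    rows, cols = m + 1, n + 1
    scores = [[0] * cols for _ in range(rows)]
    if not local:
        # Global (NW): a leading gap of length k costs k * gap.
        for i in range(rows):
            scores[i][0] = gap * i
        for j in range(cols):
            scores[0][j] = gap * j
    # Local (SW): the border stays at 0, so an alignment may start anywhere.
    return scores
```

    Calling `init_scores(2, 3)` gives a penalized border for NW, while `init_scores(2, 3, local=True)` leaves every border cell at 0 for SW.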
    
    Execution: Both methods compute the score of each cell in the same way, by taking the max of the following three scores:
        a) Gap from left to right (score of the left cell plus a gap penalty)
        b) Gap from top to bottom (score of the cell above plus a gap penalty)
        c) Match or mismatch (score of the diagonal cell plus the substitution score)
    However, SW adds a fourth option: the cell is reset to 0 whenever all three scores would be negative.
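
    The per-cell update described above might look like the sketch below. The name `cell_score` and the single linear `gap` penalty are assumptions for illustration; `sub` stands for the substitution score looked up in the scoring matrix.

```python
def cell_score(diag, up, left, sub, gap=-2, local=False):
    """Score one cell from its three neighbours.

    diag, up, left: scores of the diagonal, upper and left neighbour cells.
    sub: substitution score for the two residues at this cell.
    gap: linear gap penalty (an assumption of this sketch).
    """
    candidates = [
        diag + sub,  # match or mismatch
        up + gap,    # gap from top to bottom
        left + gap,  # gap from left to right
    ]
    if local:
        # Smith-Waterman only: never let a cell drop below 0.
        candidates.append(0)
    return max(candidates)
```

    With the same neighbours, `cell_score(0, -2, -2, -3)` is negative under NW, whereas passing `local=True` returns 0.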
    
    Termination: Both methods use the traceback matrix to recover an alignment. NW starts at the bottom-right cell, traces back all the way to the top-left cell, and returns the resulting string. SW starts at the cell with the maximum score, traces back until a cell with score 0 is reached, and returns the resulting string.
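
    A traceback shared by both methods could be sketched as below. The encoding is an assumption of this example: each cell of `trace` holds "diag", "up" or "left", or None where the walk should stop (the top-left cell for NW, a cell reset to 0 for SW). Only the start cell differs between the two algorithms.

```python
def traceback(trace, seq1, seq2, start):
    """Walk back through `trace` from `start` and rebuild the two aligned strings.

    trace: (len(seq1)+1) x (len(seq2)+1) matrix of "diag", "up", "left" or None.
    start: (i, j) of the bottom-right cell for NW, or of the max score for SW.
    """
    i, j = start
    top, bottom = [], []
    while trace[i][j] is not None:
        move = trace[i][j]
        if move == "diag":    # residues aligned to each other
            i, j = i - 1, j - 1
            top.append(seq1[i])
            bottom.append(seq2[j])
        elif move == "up":    # residue of seq1 aligned to a gap
            i -= 1
            top.append(seq1[i])
            bottom.append("-")
        else:                 # "left": residue of seq2 aligned to a gap
            j -= 1
            top.append("-")
            bottom.append(seq2[j])
    # The walk runs end to start, so reverse before joining.
    return "".join(reversed(top)), "".join(reversed(bottom))
```

    For NW the call would use `start=(len(seq1), len(seq2))`; for SW, `start` would be the coordinates of the largest entry in the score matrix.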

3) How does affine-gap based alignment differ from linear-gap alignment in terms of implementation?
    
    Affine gap alignment requires you to keep track of the gap status of the previous cell (gap start, match/mismatch, gap extension) when computing the score of the current cell, because opening a gap and extending one are penalized differently. To return the optimal alignment you need two extra score matrices, one for gaps in seq1 and one for gaps in seq2, each of which stores the max score of either opening or extending a gap at the current cell. The final traceback then has to move between the original score matrix and these two extra matrices to recover the alignment.
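
    One common way to arrange the three matrices is sketched below for the global, score-only case. Several things here are assumptions rather than the project's code: the names `M`, `X` and `Y`, the simple match/mismatch scores used instead of a substitution matrix, and the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend` (some implementations charge `gap_open + k * gap_extend` instead).

```python
NEG_INF = float("-inf")

def affine_global_score(seq1, seq2, match=1, mismatch=-1,
                        gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gaps, using three matrices."""
    m, n = len(seq1), len(seq2)
    # M: seq1[i-1] aligned to seq2[j-1]
    # X: seq1[i-1] aligned to a gap; Y: seq2[j-1] aligned to a gap
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # A match/mismatch may follow any of the three states.
            M[i][j] = sub + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap from M or extend an existing one.
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend)
    return max(M[m][n], X[m][n], Y[m][n])
```

    With these defaults, aligning "AAAA" to "AA" favours one gap of length 2 over two separate gaps, which is the behaviour a linear gap penalty cannot express.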

In [None]:
# Print the raw contents of one FASTA file; the context manager closes it afterwards
with open("./sequences/prot-0088.fa", "r") as f:
    print(f.read())

In [None]:
from align.algs import SmithWaterman

# Local alignment of the two test sequences using BLOSUM50
sw = SmithWaterman("/Users/mtsui1/Documents/Classes/Algs/Project1/scoring_matrices/BLOSUM50.mat")
scores, alignment = sw.align("./sequences/test1.fa", "./sequences/test2.fa")
print(scores)
print(alignment)

## Part 2

In [1]:
# Read Pospairs and Negpairs into lists of [path1, path2] pairs

def read_pairs(path):
    """Return one list of whitespace-separated fields per line of `path`."""
    with open(path, "r") as f:
        return [line.split() for line in f]

pos_pairs = read_pairs("./scoring_matrices/Pospairs.txt")
neg_pairs = read_pairs("./scoring_matrices/Negpairs.txt")

In [13]:
from align.algs import *

In [14]:
sw = SmithWaterman("/Users/mtsui1/Documents/Classes/Algs/Project1/scoring_matrices/BLOSUM50.mat")

In [15]:
sw.set_gap_open(-11)
sw.set_gap_extend(-3)

In [5]:
len(pos_pairs)

50

In [6]:
len(neg_pairs)

50

In [None]:
pos_pairs[0][0]

In [7]:
# Smith-Waterman score (no alignment string) for every positive pair
pos_scores = {}
for pair in pos_pairs:
    print(pair)
    score = sw.align(pair[0], pair[1], return_alignment=False)
    pos_scores[(pair[0], pair[1])] = score

['sequences/prot-0004.fa', 'sequences/prot-0008.fa']
['sequences/prot-0031.fa', 'sequences/prot-0034.fa']
['sequences/prot-0047.fa', 'sequences/prot-0050.fa']
['sequences/prot-0065.fa', 'sequences/prot-0069.fa']
['sequences/prot-0077.fa', 'sequences/prot-0082.fa']
['sequences/prot-0098.fa', 'sequences/prot-0102.fa']
['sequences/prot-0124.fa', 'sequences/prot-0125.fa']
['sequences/prot-0166.fa', 'sequences/prot-0177.fa']
['sequences/prot-0170.fa', 'sequences/prot-0173.fa']
['sequences/prot-0175.fa', 'sequences/prot-0178.fa']
['sequences/prot-0187.fa', 'sequences/prot-0190.fa']
['sequences/prot-0240.fa', 'sequences/prot-0242.fa']
['sequences/prot-0271.fa', 'sequences/prot-0275.fa']
['sequences/prot-0298.fa', 'sequences/prot-0300.fa']
['sequences/prot-0326.fa', 'sequences/prot-0327.fa']
['sequences/prot-0354.fa', 'sequences/prot-0355.fa']
['sequences/prot-0357.fa', 'sequences/prot-0360.fa']
['sequences/prot-0361.fa', 'sequences/prot-0364.fa']
['sequences/prot-0369.fa', 'sequences/prot-037

In [8]:
pos_scores

{('sequences/prot-0004.fa', 'sequences/prot-0008.fa'): 42.0,
 ('sequences/prot-0031.fa', 'sequences/prot-0034.fa'): 49.0,
 ('sequences/prot-0047.fa', 'sequences/prot-0050.fa'): 37.0,
 ('sequences/prot-0065.fa', 'sequences/prot-0069.fa'): 79.0,
 ('sequences/prot-0077.fa', 'sequences/prot-0082.fa'): 69.0,
 ('sequences/prot-0098.fa', 'sequences/prot-0102.fa'): 34.0,
 ('sequences/prot-0124.fa', 'sequences/prot-0125.fa'): 50.0,
 ('sequences/prot-0166.fa', 'sequences/prot-0177.fa'): 31.0,
 ('sequences/prot-0170.fa', 'sequences/prot-0173.fa'): 31.0,
 ('sequences/prot-0175.fa', 'sequences/prot-0178.fa'): 34.0,
 ('sequences/prot-0187.fa', 'sequences/prot-0190.fa'): 32.0,
 ('sequences/prot-0240.fa', 'sequences/prot-0242.fa'): 36.0,
 ('sequences/prot-0271.fa', 'sequences/prot-0275.fa'): 53.0,
 ('sequences/prot-0298.fa', 'sequences/prot-0300.fa'): 41.0,
 ('sequences/prot-0326.fa', 'sequences/prot-0327.fa'): 120.0,
 ('sequences/prot-0354.fa', 'sequences/prot-0355.fa'): 32.0,
 ('sequences/prot-0357.

In [None]:
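# `from align.algs import *` exposes the class directly, so no `algs.` prefix is needed.
# A separate name keeps the SmithWaterman instance in `sw` intact.
nw = NeedlemanWunsch("/Users/mtsui1/Documents/Classes/Algs/Project1/scoring_matrices/BLOSUM50.mat")
#nw.set_gap_open(-11)
#nw.set_gap_extend(-3)
score, alignment = nw.align("sequences/test1.fa", "sequences/test2.fa")
print(score)
print(alignment)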

In [9]:
# Smith-Waterman score (no alignment string) for every negative pair
neg_scores = {}
for pair in neg_pairs:
    print(pair)
    score = sw.align(pair[0], pair[1], return_alignment=False)
    neg_scores[(pair[0], pair[1])] = score

['sequences/prot-0018.fa', 'sequences/prot-0198.fa']
['sequences/prot-0035.fa', 'sequences/prot-0071.fa']
['sequences/prot-0056.fa', 'sequences/prot-0826.fa']
['sequences/prot-0075.fa', 'sequences/prot-0561.fa']
['sequences/prot-0093.fa', 'sequences/prot-0091.fa']
['sequences/prot-0110.fa', 'sequences/prot-0613.fa']
['sequences/prot-0131.fa', 'sequences/prot-0547.fa']
['sequences/prot-0148.fa', 'sequences/prot-0369.fa']
['sequences/prot-0166.fa', 'sequences/prot-0839.fa']
['sequences/prot-0186.fa', 'sequences/prot-0014.fa']
['sequences/prot-0206.fa', 'sequences/prot-0313.fa']
['sequences/prot-0228.fa', 'sequences/prot-0121.fa']
['sequences/prot-0244.fa', 'sequences/prot-0817.fa']
['sequences/prot-0262.fa', 'sequences/prot-0540.fa']
['sequences/prot-0286.fa', 'sequences/prot-0641.fa']
['sequences/prot-0305.fa', 'sequences/prot-0785.fa']
['sequences/prot-0326.fa', 'sequences/prot-0712.fa']
['sequences/prot-0346.fa', 'sequences/prot-0063.fa']
['sequences/prot-0363.fa', 'sequences/prot-034

In [10]:
neg_scores

{('sequences/prot-0018.fa', 'sequences/prot-0198.fa'): 38.0,
 ('sequences/prot-0035.fa', 'sequences/prot-0071.fa'): 28.0,
 ('sequences/prot-0056.fa', 'sequences/prot-0826.fa'): 25.0,
 ('sequences/prot-0075.fa', 'sequences/prot-0561.fa'): 32.0,
 ('sequences/prot-0093.fa', 'sequences/prot-0091.fa'): 40.0,
 ('sequences/prot-0110.fa', 'sequences/prot-0613.fa'): 41.0,
 ('sequences/prot-0131.fa', 'sequences/prot-0547.fa'): 45.0,
 ('sequences/prot-0148.fa', 'sequences/prot-0369.fa'): 20.0,
 ('sequences/prot-0166.fa', 'sequences/prot-0839.fa'): 24.0,
 ('sequences/prot-0186.fa', 'sequences/prot-0014.fa'): 20.0,
 ('sequences/prot-0206.fa', 'sequences/prot-0313.fa'): 44.0,
 ('sequences/prot-0228.fa', 'sequences/prot-0121.fa'): 19.0,
 ('sequences/prot-0244.fa', 'sequences/prot-0817.fa'): 24.0,
 ('sequences/prot-0262.fa', 'sequences/prot-0540.fa'): 40.0,
 ('sequences/prot-0286.fa', 'sequences/prot-0641.fa'): 60.0,
 ('sequences/prot-0305.fa', 'sequences/prot-0785.fa'): 34.0,
 ('sequences/prot-0326.f