# Worksheet PCALG

## Class AlignmentSequences
This class implements recursively three alignment algorithms:
1. Global alignment (Needleman-Wunsch inspired).
2. Local alignment (Smith-Waterman inspired) 
3. Longest common substring.(the search for the longest common sequence can also be considered a type of alignment).

In [443]:
"""This module shows alternative recursive implementations of global sequence alignments:
    Global alignment (Needleman-Wunsch based)
    Local (Smith-Waterman based)
    Finding of the longest common substring. 
Todo:
    * Return all the solutions of the alignments. Now it only returns one solution
    * Control of errors
    * Implement multi-alignments
    * Implement heuristic algorithms
"""
import time
import sys

MIN = -sys.maxsize - 1
COMPAC = 100000
"""int: Constant to compact max score."""
SCORE_MATCH = 2
"""int: Default match score."""
SCORE_NO_MATCH = -3
"""int: Default no match score."""
SCORE_GAP_INI = -10
"""int: Default gap ini in affine gap penalty."""
SCORE_GAP_CONT = -2
"""int: Default gap continuation in affine gap penalty"""
DEFAULT_SUBST_MATRIX = {('A', 'A'): 0, ('A', 'C'): 1, ('A', 'G'): 1, ('A', 'T'): 1, ('C', 'A'): 1, ('C', 'C'): 0, ('C', 'G'): 1, ('C', 'T'): 1, ('G', 'A'): 1, ('G', 'C'): 1, ('G', 'G'): 0, ('G', 'T'): 1, ('T', 'A'): 1, ('T', 'C'): 1, ('T', 'G'): 1, ('T', 'T'): 0}
"""dict: Default substitution matrix (for "ACGT" common nucleotide alphabet)"""

sys.setrecursionlimit(5000)

class AlignSequences:
    """Recursive implementation of global, local and long substring alignments methods.

    Attributes:
        sequences (list of str): Contains the wo sequences to align. The first
            one (index 0) is the query sequence (BLAST concept) or bottom sequence on alignment prints
            or vertical sequence in the common graphical representation of score matrix.
        len_seq0 (int): Sequence 0 length.
        len_seq1 (int): Sequence 1 length.
        mode (str): Computation mode:
            'GLOBAL'           Global Alignment
            'LOCAL'            Local Alignment
            'LONG_SUBSTRING'   Obtain long common substring
        score_match (int): Score of match characters.
        score_no_match (int): Score of no match characters.
        score_gap_ini (int): Score of gap init.
        score_gap_cont (int): Score of gap continuation.
        score (int): Score of last computed alignment.
        gaps (int): Number of gaps of the last computed alignment.
        matches (int): Number of matches of the last computed alignment.
        unmatches (int): Number of unmatches of the last computed alignment.
        align_seq0 (str): Sequence 0 with the gaps necessary for the alignment.
        align_seq1 (str): Sequence 1 with the gaps necessary for the alignment.
        matching (str): Printable line with the align relations ('|, '.', ' ') between
            both align_seq, necessary for printing the alignment.
        ini_time (int): Initial time of computation, for profiling purposes
        finish_time (int): Final time of computation, for profiling purposes
        score_store (dict of tuple int): Store of scores, for each calculated cell with tuple (i,j,g) 
            where i is the coordinate of the bottom sequence, j the coordinate of the top sequence 
            and g has the value 1 if the cell is a gap init cell and 0 if it's a gap continuation.
            For a explanation of calculared cell see align method.
        matches_store (dict of tuple int): Store of the number of matches in the calculated cell
        gaps_store (dict of tuple int): Store of the number of gaps in the calculated cell
        max_score_index (tuple of int): Cell coordinate tuple of the cell with the maximun score
        max_score (int): maximum computed score
        forward_arrow (dict of str): Store of the optimal displacements accomplished at a cell 
            to guarantee an optimal score: 'v' vertical (down), 'h' horizontal (rigth), 
            'd' diagonal.
        stacks (list of lost of str): Stacks of sequences related to principal sequences in a msa
        
    """
   
    def __init__(self, sequences, mode="ALIGN", score_match=SCORE_MATCH, score_no_match=SCORE_NO_MATCH,\
                 score_gap_ini=SCORE_GAP_INI, score_gap_cont=SCORE_GAP_CONT, subst_matrix={}, stacks=[[],[]]):
        """Init parameters of alignment"""
        self.set_sequences(sequences)
        self.set_stacks(stacks)
        self.len_seq0 = len(self.sequences[0])
        self.len_seq1 = len(self.sequences[1])
        self.init_stores()
        self.set_scores(score_match, score_no_match, score_gap_ini, score_gap_cont)
        self.set_mode(mode)
        self.score = 0
        self.matches = 0
        self.unmatches = 0
        self.gaps = 0
        self.align_seq0 = ""
        self.align_seq1 = ""
        self.matching = ""
        self.ini_time = 0
        self.finish_time = 0
        self.set_subst_matrix(subst_matrix)
       
    def init_stores(self):
        """Init dictionary that store temp data of the alignment"""
        self.score_store = {}
        self.matches_store = {}
        self.gaps_store = {}
        self.max_score_index = (0, 0, 0)
        self.max_score = 0
        self.forward_arrow = {}
    
    def set_sequences(self, sequences):
        """Update the target sequences of the alignment"""
        self.sequences = sequences
        
    def set_stacks(self, stacks=[[],[]]):
        """Update the stacks for msa"""
        self.stacks = stacks
        
    def set_subst_matrix(self, subst_matrix={}):
        """Update the score matrix"""
        self.subst_matrix = subst_matrix
    
    def set_scores(self, score_match=SCORE_MATCH, score_no_match=SCORE_NO_MATCH,\
                   score_gap_ini=SCORE_GAP_INI, score_gap_cont=SCORE_GAP_CONT):
        """Update the weigth scores of the alignment"""
        self.score_match = score_match
        self.score_no_match = score_no_match
        self.score_gap_ini = score_gap_ini
        self.score_gap_cont = score_gap_cont
    
    def set_mode(self, mode="ALIGN"):
        """Set computation mode"""
        self.mode = mode
        
    def forward_track(self, index):
        """Calc alignments in forward direction.
        
            The alignment strings are calculated from init cell (0,0) in global
            alignments or maximum score cell in local alignments.
            
            In local mode it's necessary to extend the alignments (local) to the total length of 
            the sequences to show the location of the alignment, and in order to compare with 
            BioPython outputs.
            
            Args:
                index (tuple of int): Cell coordinates of the starting cell

            Returns:
                string: align sequence 0 (bottom) for printing purposes
                string: align sequence 1 (top) for printing purposes
                tuple of int: Coordinates of the last cell
        """
        ret_align_seq0, ret_align_seq1 = "", ""
        (i, j, gap_ini) = index
        ret_final_pos = (self.len_seq0, self.len_seq1)
        while i < self.len_seq0 or j < self.len_seq1:
            if self.mode == "LOCAL" and self.score_store[(i, j, gap_ini)] == 0: 
                ret_final_pos = (i, j)
                break
            arrow = self.forward_arrow[(i, j, gap_ini)]
            if arrow == "d":
                ret_align_seq0 += self.sequences[0][i]
                ret_align_seq1 += self.sequences[1][j]
                i, j, gap_ini = i + 1, j + 1, 1
            elif arrow == "h":
                ret_align_seq0 += "-"
                ret_align_seq1 += self.sequences[1][j]
                i, j, gap_ini = i , j + 1, 0
            elif arrow == "v":
                ret_align_seq0 += self.sequences[0][i]
                ret_align_seq1 += "-"
                i, j, gap_ini = i + 1, j, 0
        #compute the complete align in local mode
        if self.mode == "LOCAL":
            ret_align_seq0 = self.sequences[0][0:index[0]] +\
                             ret_align_seq0 + self.sequences[0][ret_final_pos[0]:]
            ret_align_seq1 = self.sequences[1][0:index[1]] +\
                             ret_align_seq1 + self.sequences[1][ret_final_pos[1]:]
            diff_pos_ini = index[1] - index[0]
            if diff_pos_ini > 0:
                ret_align_seq0 = '-' * diff_pos_ini + ret_align_seq0
            else:
                ret_align_seq1 = '-' * -diff_pos_ini + ret_align_seq1
            diff_len = len(ret_align_seq1) - len(ret_align_seq0)
            if diff_len > 0:
                ret_align_seq0 += '-' * diff_len
            else:
                ret_align_seq1 += '-' * -diff_len
        return ret_align_seq0, ret_align_seq1, ret_final_pos
    
    def calc_matching(self, align_seq0, align_seq1, ini_pos=(), final_pos=()):
        """Calc matching string
        
            The matching string is the string line to print between the top and
            bottom alignment strings. It contains the match (|), no match (.) and
            gap ( ) indicators.
            
            Args:
                align_seq0 (string): Bottom sequence
                align_seq1 (string): Top sequence
                ini_pos (tuple of int): Initial cell coordinates
                final_pos (tuple of int): Final cell coordinates

            Returns:
                string: Matching string
                
        """
        count = 0
        ret_matching = ""
        diff_pos_ini = ini_pos[1] - ini_pos[0]
        if diff_pos_ini > 0:
            delta_pos = diff_pos_ini
        else:
            delta_pos = 0
        for n, (i, j) in enumerate(zip(align_seq0, align_seq1)):
            if self.mode == "LOCAL" and not (n >= ini_pos[0] + delta_pos and n < final_pos[0] + delta_pos):
                ret_matching += ' '
            else:
                if i == j: ret_matching += '|'
                elif i != j and i != '-' and j != '-': ret_matching += '.'
                else: ret_matching += ' '
            count += 1
        return ret_matching
        
    def store(self, key, score, matches, gaps):
        """Store info related to a computed cell
        The maximum score is computed having into account the number of matches, if there are
        most than one solution. If the score are equal, the path with more matches is selected.
             Args:
                key (tuple of int): Cell coordinates
                score (int): Cell score
                matches (int): Cell matches
                gaps (int): Cell gaps
        """
        self.score_store[key] = score
        super_score = score * COMPAC + 10 * matches
        if super_score > self.max_score: 
            self.max_score_index = key
            self.max_score = super_score
        self.matches_store[key] = matches
        self.gaps_store[key] = gaps
    
    def calc_score_binary(self, seq_0, seq_1, i, j):
        """Compute alignment scores for two sequences
        If there are a substitution matrix (actually dictionary) defined, 
        the scores are computed from the dictionary.
            Args:
                seq_0 (int): Sequence 0
                seq_1 (int): Sequence 1
                i (int): Sequence 0 index
                j (int): Sequence 1 index
        """
        if self.subst_matrix:
            subst_matrix_index = (seq_0[i], seq_1[j])
            subst_matrix_index_swap = (seq_1[j], seq_0[i])
            if subst_matrix_index in self.subst_matrix:
                matrix_score = self.subst_matrix[subst_matrix_index]
            elif subst_matrix_index_swap in self.subst_matrix:
                matrix_score = self.subst_matrix[subst_matrix_index_swap]
        # There could be gaps in msa, in this case we penalize as minimun value 
        # of score matrix (TODO)
        # if not subst matrix we penalize it as initial gap position
        # it's not a CLUSTALW like implementation
        if seq_0[i] == "-" or  seq_1[j] == '-':
            inc_matches = 0
            if self.subst_matrix:
                inc_score = self.score_gap_ini + self.score_gap_cont
            else:
                inc_score = self.score_gap_ini + self.score_gap_cont
        else:
            if seq_0[i] == seq_1[j]:
                if self.subst_matrix:
                    inc_score = matrix_score
                else:
                    inc_score = self.score_match
                inc_matches = 1
            else:
                if self.subst_matrix:
                    inc_score = matrix_score
                else:
                    inc_score = self.score_no_match
                inc_matches = 0
        return inc_score, inc_matches
    
    def calc_score(self, i, j):
        """Compute alignment scores.
        If there are stacks associated with the sequence, we compute the score weigthing the
        scores of the stacks. Stacks contains also the guiding sequences.
            Args:
                i (int): Sequence 0 index
                j (int): Sequence 1 index
        """
        if self.stacks == [[],[]]:
            return self.calc_score_binary(self.sequences[0], self.sequences[1], i, j)
        else:
            computed_score = 0
            computed_matches = 0
            nvalues = 0
            for seq_0 in self.stacks[0]:
                for seq_1 in self.stacks[1]:
                    score, matches = self.calc_score_binary(seq_0, seq_1, i, j)
                    computed_score += score
                    computed_matches += matches
                    nvalues += 1
            ret_score = computed_score /  nvalues
            ret_matches = computed_matches / nvalues
            return ret_score, ret_matches
            
    def align(self, i=0, j=0, ini_gap=1):
        """Recursive align of sequences
        For each cell, which coordinates are (i, j, ini_gap), calc the maximum score path from
        three alternative displacements:
        
        1) To (i + 1, j + 1, 1), that is, matching or no matching the seq0(i) and seq1(i) characters.
        This is a diagonal displacement. 
        2) To (i, j + 1, 0), that is, seting a gap in seq0 and advance seq1. Horizontal displacement.
        3) To (i + 1, j, 0), that is, seting a gap in seq1 and advance seq0. Vertical displacement.
        
        The scores of these displacements are calculated adding the score ot the target cells 
        (that are computed recursively) and the matrix, default of gap scores in each case.
        
        The score, matches, gaps and forward_arrow are stored at related dictionary entry based on 
        coordinates (i, j, ini_gap), all of them asociated to the maximum score of 
        the three possible paths starting from the cell, avoiding recomputation 
        of the cell if it's called from another recuersive path.
        
        Each cell has a third score coordinate, because a cell could be called from a cell with yet 
        has a gap (only from horizontal or vertical prior displacement) or from a cell with has a match/no match.
        Then we need to store two scores, matches, gaps and forward_arrows related to the two possible 
        cell incarnations at coordinates (i, j, 0) and (i, j, 1).
        
        We store matches and gaps in order to have one aditional criterion to tiebreaker 
        if some of the scores are equal. We are using this aproach in local alignment computation. If two
        scores are equal we choose the solucion with the greatest number of matches.
        
        We store the displacement directions in forward_arrow dict to compute the alignment. 
        It's posible to avoid this, using only the score information, but we have let this aproach 
        as proof of concept and for clarity in the algorithm.
        
        In this scenario we observe that the differences between the global, 
        local and long substring algorithms are minimal.
        
        Local algorithm:

            Starting from the global algorithm, which would be the most general, 
            the local algorithm only changes two aspects:
                1. Rejection of the roads with negative values of the score, equaling these values 
                to 0, that is, not letting previous alignments of poor quality affect the final result.
                
                2. Use as cell of beginning of the alignment the one with the highest scores. 
                In our implementation we also take into account the number of matches, 
                as we have already mentioned.

        Finally, but outside the algorithm of alignment itself (at forward_track and matching methods)
        it only remains to extend the alignment obtained to show its location within the chains to be aligned.

        Search algorithm of the long common substring:

            Modify the global algorithm in the following aspects:
                1. Only computes matches between characters or gaps in one or another initial sequence.
        
         Args:
                i (int): Sequence 0 index
                j (int): Sequence 1 index
                ini_gap (int): 1 if gap initiation, 0 if gap continuation
        """
        score_diag, score_hor, score_ver = MIN, MIN, MIN
        matches_diag, matches_hor, matches_ver = MIN, MIN, MIN
        gaps_diag, gaps_hor, gaps_ver = MIN, MIN, MIN
        #align and advance seq0 and seq1
        #in long_substring mode only matches are processed
        if i < self.len_seq0 and j < self.len_seq1 and\
        (self.mode != "LONG_SUBSTRING" or self.sequences[0][i] == self.sequences[1][j]):
            inc_score, inc_matches = self.calc_score(i, j)
            key = (i + 1, j + 1, 1)
            if key in self.score_store:
                score_diag, matches_diag, gaps_diag = \
                self.score_store[key] + inc_score, self.matches_store[key] + inc_matches, self.gaps_store[key]
            else:
                score, matches, gaps = self.align(i + 1, j + 1, 1)
                self.store(key, score, matches, gaps)
                score_diag, matches_diag, gaps_diag = score + inc_score, matches + inc_matches, gaps
        #don't align and gap in seq0 (advance seq1)
        if j < self.len_seq1:
            gap_score = self.score_gap_cont + ini_gap * self.score_gap_ini
            key = (i, j + 1, 0)
            if key in self.score_store:
                score_hor, matches_hor, gaps_hor = self.score_store[key] + gap_score,\
                self.matches_store[key], self.gaps_store[key] + 1
            else:
                score, matches, gaps = self.align(i, j + 1, 0)
                self.store(key, score, matches, gaps)
                score_hor, matches_hor, gaps_hor = score + gap_score, matches, gaps + 1
        #don't align and gap in seq1 (advance seq0)
        if i < self.len_seq0:
            gap_score = self.score_gap_cont + ini_gap * self.score_gap_ini
            key = (i + 1, j, 0)
            if key in self.score_store:
                score_ver, matches_ver, gaps_ver =\
                self.score_store[key] + gap_score, self.matches_store[key], self.gaps_store[key] + 1
            else:
                score, matches, gaps = self.align(i + 1, j, 0)
                self.store(key, score, matches, gaps)
                score_ver, matches_ver, gaps_ver = score + gap_score, matches, gaps + 1
        #choose the high score path
        matcher_diag, matcher_hor, matcher_ver = score_diag, score_hor, score_ver
        if i < self.len_seq0 or j < self.len_seq1:
            if self.mode == "LOCAL" and matcher_diag < 0 and matcher_hor < 0 and matcher_ver < 0:
                score_diag, score_hor, score_ver = 0, 0, 0
                #matcher_diag, matcher_hor, matcher_ver  = 0, 0, 0
            if matcher_diag > matcher_hor and matcher_diag > matcher_ver:
                ret_score, ret_matches, ret_gaps, ret_arrow =\
                score_diag, matches_diag, gaps_diag, "d"
            elif matcher_hor > matcher_ver:
                ret_score, ret_matches, ret_gaps, ret_arrow =\
                score_hor, matches_hor, gaps_hor, "h"
            else:
                ret_score, ret_matches, ret_gaps, ret_arrow =\
                score_ver, matches_ver, gaps_ver, "v"
        else:
            ret_score, ret_matches, ret_gaps, ret_arrow =\
            0, 0, 0, ""
        self.forward_arrow[(i, j, ini_gap)] = ret_arrow
        if i == 0 and j == 0:
            self.store((0, 0, 1), ret_score, ret_matches, ret_gaps)
            if self.mode in ["GLOBAL", "LONG_SUBSTRING"]: self.max_score_index = (0, 0, 1)
            else: ret_score = self.max_score // COMPAC
            ret_matches = self.matches_store[self.max_score_index]
            ret_gaps = self.gaps_store[self.max_score_index]
            
        return ret_score, ret_matches, ret_gaps

    def compute(self, mode="LOCAL", silent=False):
        """Calc alignment
            Args:
                mode (str): Type of algorithm (local, global or long substring)
                silent (bool): If true don't show alignment output
        """
        self.ini_time = time.time()
        self.init_stores()
        self.set_mode(mode)
        self.score, self.matches, self.gaps = self.align()
        self.align_seq0, self.align_seq1, final_pos = self.forward_track(self.max_score_index)
        self.matching = self.calc_matching(self.align_seq0, self.align_seq1, self.max_score_index, final_pos)
        self.unmatches = self.matching.count('.')
        self.gaps = self.matching.count(' ')
        self.finish_time = time.time()
        if not silent:
            self.view()
    
    def get_len_long_common_substring(self):
        """Getter for the len of the common substring
        That is equal to the number of matches of the alignment        
        """
        return self.matches
    
    def get_long_common_substring(self):
        """Returns the longest common substring
        whitout alignment (positional) information
        """
        long_common_substring = ""
        for (char, match_char) in zip(self.align_seq1, self.matching):
            if match_char == '|':
                long_common_substring += char
        return long_common_substring
        
    def view(self):
        """Prints the alignment data"""
        #unmatches = self.matching.count('.')
        #gaps = self.matching.count(' ')
        if self.matching:
            gap_groups = self.matching.count('| ') + self.matching.count('. ') + self.matching[0].count(' ')
        else:
            gap_groups = 0
        print(" ")
        if self.mode == "LOCAL":
            print("### AlignSequences. Local alignment (Smith-Waterman)")
        elif self.mode == "LONG_SUBSTRING":
            print("### AlignSequences. Long substring finder")
        else:
            print("### AlignSequences. Global alignment (Needleman-Wunsch)")
        if self.subst_matrix:
            print("\tUsing score matrix")
        print(self.align_seq1)
        print(self.matching)
        print(self.align_seq0)
        print("\tScore:", self.score)
        print("\tSimilarity (wo gaps):", self.matches / (self.matches + self.unmatches))
        print("\tDistance (wo gaps):", self.unmatches / (self.matches + self.unmatches))
        print("\tDistance:", self.unmatches / (self.matches + self.unmatches + self.gaps))
        print("\tInit index:", self.max_score_index)
        print("\tMatches:", self.matches, " Unmatches:", self.unmatches, " Gaps:", self.gaps, " Gap groups:", gap_groups)
        #simple scoring verification todo: apply to matrix
        if not self.subst_matrix:
            print("\tScore verified:", self.matches * self.score_match + unmatches * self.score_no_match \
                  + gaps * self.score_gap_cont + gap_groups * self.score_gap_ini)
        print("\tFinish. Execution milliseconds:", round((self.finish_time - self.ini_time) * 1000))
        print("\tScore Dictionary Size", len(list(self.score_store.keys())))
        
    def edit_distance(self, score_match=0, score_no_match=-1, score_gap_ini=0, score_gap_cont=-1):
        """Calculates an edit distance as requested in questions 1 and 3
        It's the same computation as a global alignment with -1 penalities applied to
        score_gap_cont and score_nomatch and 0 in score_match and score_gap_ini
            
            Args:
                score_match (int): Score of match characters.
                score_no_match (int): Score of no match characters.
                score_gap_ini (int): Score of gap init.
                score_gap_cont (int): Score of gap continuation.
        """
        self.set_scores(score_match, score_no_match, score_gap_ini, score_gap_cont)
        self.compute("GLOBAL", True)
        return abs(self.score)

## Exercise 1
In this exercise you will test the Needleman-Wunsch algorithm on a short
sequence parts of hemoglobin (PDB code 1AOW) and myoglobin 1 (PDB code 1AZI). Here
you will align the sequence **HGSAQVKGHG** to the sequence **KTEAEMKASEDLKKHGT**.
The two sequences are arranged in a matrix in Table 1. 

The sequences start at the upper right corner, the initial gap penalties are listed at each offset starting position. The gap penalty is considered as -8.
 

The similarity scores Si,j from looking up the matches comes from the **BLOSUM40** table. The
table is shown here labeled with 1-letter amino acid codes:

### Response
As I have developed my version of global alignement, very similar to Needleman-Wunsch algorithm, I'm going to use this.
Tne algorithm is in the class `AlignSequences` that I include on top.

In [406]:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.SubsMat import MatrixInfo

s1 = "HGSAQVKGHG"
s2 = "KTEAEMKASEDLKKHGT"
mode = "GLOBAL"
matrix = MatrixInfo.blosum40

align = AlignSequences([s1, s2])
align.set_scores(score_match = 0, score_no_match = 0, score_gap_ini = 0, score_gap_cont = -8)
align.set_subst_matrix(matrix)
align.compute(mode.upper(), silent = False)
        

 
### AlignSequences. Global alignment (Needleman-Wunsch)
	Using score matrix
KTEAEMKASEDLKKHGT
  ..  .| . .|.|| 
--HG--SA-Q-VKGHG-
	Score: -21
	Similarity (wo gaps): 0.4
	Distance (wo gaps): 0.6
	Distance: 0.35294117647058826
	Init index: (0, 0, 1)
	Matches: 4  Unmatches: 6  Gaps: 7  Gap groups: 5
	Finish. Execution milliseconds: 4
	Score Dictionary Size 368


It's OK?
I'm going to compare against **biopython**. By means of the function `test_alignment`. This function compare `AlignSequences` results with the alignments of Biopython. This software returns all the solutions of maximun score. My software not yet and so we test that my result is into the set of Biopython ones.
I do the test with several BLOSUM matrices.

In [586]:
import re

failed = 0
passed = 0
launched = 0

def test_alignment(number, s1, s2, verbose=False, tipus="local", matrix={},\
                          score_match=2, score_no_match=-3, score_gap_ini=-5, score_gap_cont=-2):
    """Comparisons of global and local alignments between Biopython and AlignSequences implementation.
    
    Args:
        number (int): The number(identifier) of the test.
        s1 (str): Query string to align.
        s2 (str): Subject string to align
        verbose (bool): If True print outputs, default False
        tipus (str) : If 'local' the alignment is local (Smith), if 'global' Waterman.
        matrix (dict of int) : Substitution matrix
        score_match (int): Score of character match
        score_no_match (int): Score of character no match
        score_gap_ini (int): Score of gap initiation
        score_gap_cont (int): Score of gap continuation
        
    """
    global failed, passed, launched
    try:
        launched += 1
        align = AlignSequences([s1, s2])
        align.set_scores(score_match, score_no_match, score_gap_ini, score_gap_cont)
        if matrix != {}:
            method = getattr(pairwise2.align, tipus + 'ds')
            alignments = method(s2, s1, matrix,\
                                             score_gap_ini + score_gap_cont, score_gap_cont)
            align.set_subst_matrix(matrix)
        else:
            method = getattr(pairwise2.align, tipus + 'ms')
            alignments = method(s2, s1, score_match, score_no_match,\
                                              score_gap_ini + score_gap_cont, score_gap_cont)
        
        align.compute(tipus.upper(), silent=not verbose)
        m = re.match(r".*Score=([-1234567890]*)", format_alignment(*alignments[0]).replace("\n", ""))
        score = int(m.group(1))
        
        #search AlignSequences alignment in all possibles alignments fron Biopython
        found = False
        for a in alignments:
            if verbose: 
                print()
                print("BioPython alignment:")
                print(format_alignment(*a))
            if (align.align_seq0 == a[1]):
                    found = True
                    break             
        assert(align.score == score)
        print ("Passed test %s: scores are equal '%s'" % (number, align.score ))
        assert(found)
        print ("Passed test %s: alignments are equal '%s'" % (number, align.align_seq0 ))
        passed += 1
    
    except AssertionError:
        print ("Failed test %s: alignments differ: \nBiopython:\n'%s'\nScore = %s \
        \nAlignSequences\n'%s'\nScore = %s"\
               % (number, alignments[0][1], score, align.align_seq0, align.score ))
        failed += 1
        exit(1)

prot1 = "HGSAQVKGHG"
prot2 = "KTEAEMKASEDLKKHGT"

test_alignment(1, prot1, prot2, False, "global", MatrixInfo.blosum80, 0, 0, 0, -8)
test_alignment(2, prot1, prot2, False, "global", MatrixInfo.blosum62, 0, 0, 0, -8)
test_alignment(3, prot1, prot2, True, "global", MatrixInfo.blosum40, 0, 0, 0, -8)
# prot1 = "PPGVKSDCAS"
# prot2 = "PADGVKDCAS"
# prot1 = "PPGVKSDCAS"
# prot3 = "PPDGKSDS"
# test_alignment(1, prot1, prot2, True, "global", MatrixInfo.blosum40, 0, 0, -10, -8)
# test_alignment(2, prot1, prot3, True, "global", MatrixInfo.blosum40, 0, 0, -10, -8)


print(" ")
if launched == passed: print('Passed All Test')
else: print("ERROR: There are failed tests")

Passed test 1: scores are equal '-31'
Passed test 1: alignments are equal '--HGS--A-Q-VKGHG-'
Passed test 2: scores are equal '-32'
Passed test 2: alignments are equal '--HG--SA-Q-VKGHG-'
 
### AlignSequences. Global alignment (Needleman-Wunsch)
	Using score matrix
KTEAEMKASEDLKKHGT
  ..  .| . .|.|| 
--HG--SA-Q-VKGHG-
	Score: -21
	Similarity (wo gaps): 0.4
	Distance (wo gaps): 0.6
	Distance: 0.35294117647058826
	Init index: (0, 0, 1)
	Matches: 4  Unmatches: 6  Gaps: 7  Gap groups: 5
	Finish. Execution milliseconds: 2
	Score Dictionary Size 368

BioPython alignment:
KTEAEMKASEDLKKHGT
  ..  .| . .|.|| 
--HG--SA-Q-VKGHG-
  Score=-21

Passed test 3: scores are equal '-21'
Passed test 3: alignments are equal '--HG--SA-Q-VKGHG-'
 
Passed All Test


## Exercise 2 
Given the multiple sequence set:
- S1: PPGVKSDCAS
- S2: PADGVKDCAS
- S3: PPDGKSDS
- S4: GADGKDCCS
- S5: GADGKDCAS

Use the popular progressive alignment method to globally align the previous set of sequences.
Generate the guide tree by neighbord-joining. Compare your outcome (alignment) with the
one from Clustal-W.

### Fasta files for the exercise

In [487]:
%%writefile exercise2_sequences.fasta
>s0
PPGVKSDCAS
>s1
PADGVKDCAS
>s2
PPDGKSDS
>s3
GADGKDCCS
>s4
GADGKDCAS

Overwriting exercise2_sequences.fasta


In [548]:
%%writefile globin_sequences.fasta
>myoglobine_human
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVL
TALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFR
KDMASNYKELGFQG
>myoglobine_chimp
GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLT
ALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLHSKHPGDFGADAQGAMNKALELFRK
DMASNYKELGFQG
>myoglobine_gorilla
GLSDGEWQLVLNVWGKVEADISGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLT
ALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRK
DMASNYKELGFQG
>myoglobine_orang-oetang
GLSDGEWQLVLNVWGKVEADIPSHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLT
ALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISESIIQVLQSKHPGDFGADAQGAMNKALELFRK
DMASNYKELGFQG
>myoglobine_babboon
GLSDGEWQLVLNVWGKVEADIPSHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLT
ALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLELISESIIQVLQSKHPGDFGADAQGAMNKALELFRN
DMAAKYKELGFQG
>myoglobine_pig
MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGNTVL
TALGGILKKKGHHEAELTPLAQSHATKHKIPVKYLEFISEAIIQVLQSKHPGDFGADAQGAMSKALELFR
NDMAAKYKELGFQG
>myoglobine_beaver
GLSDGEWQLVLHVWGKVEADLAGHGQEVLIRLFKGHPETLEKFNKFKHIKSEDEMKASEDLKKHGVTVLT
ALGGVLKKKGHHEAEIKPLAQSHATKHKIPIKYLEFISEAIIHVLQSKHPGBFGADABGAMNKALELFRK
DIAAKYKELGFQG
>myoglobine_rabbit
GLSDAEWQLVLNVWGKVEADLAGHGQEVLIRLFHTHPETLEKFDKFKHLKSEDEMKASEDLKKHGNTVLT
ALGAILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISEAIIHVLHSKHPGDFGADAQAAMSKALELFRN
DIAAQYKELGFQG
>myoglobine_horse
GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLT
ALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHPGDFGADAQGAMTKALELFRN
DIAAKYKELGFQG
>myoglobine_chicken
MGLSDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGLKTPDQMKGSEDLKKHGATVL
TQLGKILKQKGNHESELKPLAQTHATKHKIPVKYLEFISEVIIKVIAEKHAADFGADSQAAMKKALELFR
NDMASKYKEFGFQG
>myoglobine_alligator
MELSDQEWKHVLDIWTKVESKLPEHGHEVIIRLLQEHPETQERFEKFKHMKTADEMKSSEKMKQHGNTVF
TALGNILKQKGNHAEVLKPLAKSHALEHKIPVKYLEFISEIIVKVIAEKYPADFGADSQAAMRKALELFR
NDMASKYKEFGYQG
>myoglobine_lizard
MGLSDEEWKKVVDIWGKVEPDLPSHGQEVIIRMFQNHPETQDRFAKFKNLKTLDEMKNSEDLKKHGTTVL
TALGRILKQKGHHEAEIAPLAQTHANTHKIPIKYLEFICEVIVGVIAEKHSADFGADSQEAMRKALELFR
NDMASRYKELGFQG
>myoglobine_carp
MHDAELVLKCWGGVEADFEGTGGEVLTRLFKQHPETQKLFPKFVGIASNELAGNAAVKAHGATVLKKLGE
LLKARGDHAAILKPLATTHANTHKIALNNFRLITEVLVKVMAEKAGLDAGGQSALRRVMDVVIGDIDTYY
KEIGFAG
>myoglobine_tuna
MADFDAVLKCWGPVEADYTTIGGLVLTRLFKEHPDTQKLFPKFAGIAQADLAGNAAISAHGATVLKKLGE
LLKAKGSHASILKPMANSHATKHKIPINNFKLISEVLVKVMQEKAGLDAGGQTALRNVMGIIIADLEANY
KELGFTG

Overwriting globin_sequences.fasta


### MSA Clustal progressive alignment. Methods.

In [581]:
MIN_SCORE = 0

def pairwise_align(s1, s2, matrix, mode, score_gap_ini=0, score_gap_cont=-8):
    align = AlignSequences([s1, s2])
    align.set_scores(0, 0, score_gap_ini, score_gap_cont)
    align.set_subst_matrix(matrix)
    align.compute(mode.upper(), silent = True)
    return round((align.matches) * 100 / (align.matches + align.unmatches + 0.01))
    #return align.score
    
def msa_align_UPGMA(sequences, matrix={}, mode="GLOBAL", score_gap_ini=0, score_gap_cont=-8):
    """
    """
    tree = {} #initial tree
    guide_tree = [] #guided tree, pairs to align in sequence
    max_score = MIN_SCORE
    max_score_position = ()
    for i in range(0, len(sequences)):
        for j in range(0 , i):
            if (i,j) not in tree:
                score = pairwise_align(sequences[i], sequences[j], matrix, mode,\
                                score_gap_ini, score_gap_cont)
                tree[(i,j)] = score
                if score >= max_score:
                    max_score = score
                    max_score_position = (i,j)
    
    print(tree)
    len_tree = len(sequences)
    guide_tree_nodes = {}
    # Generate guide tree. At every step we compute another row averaging the
    # most closer rows and removing all their row coordinates from the tree
    while len(tree) > 0:
        (imax, jmax) = max_score_position
        guide_tree.append((imax, jmax, len_tree))
        guide_tree_nodes[imax] = True
        guide_tree_nodes[jmax] = True
        guide_tree_nodes[len_tree] = False
        
        # Average scores from i,j rows into new row in new_row_pos
        for j in range(0, len_tree):
            if j in [imax, jmax]:
                continue
            nscores = 0.0;
            for coordinate in [(imax, j), (j, imax), (jmax, j), (j, jmax)]:
                if coordinate in tree:
                    score = tree[coordinate]
                    nscores += 1
                    if (len_tree, j) not in tree:
                        tree[(len_tree, j)] = score
                    else:
                        tree[(len_tree, j)] += score
            if nscores > 0:
                tree[(len_tree, j)] = tree[(len_tree, j)] / nscores
            
        # Tree cleaning and calc max score
        max_score = MIN_SCORE
        max_score_position = ()
        for i in range(0, len_tree + 1):
            for j in range(0, len_tree + 1):
                if (i,j) in tree:
                    if i == imax or i == jmax or j == imax or j == jmax:
                        del(tree[(i,j)])
                    else:
                        if tree[(i,j)] >= max_score:
                            max_score =  tree[(i,j)]
                            max_score_position = (i,j)
        
        len_tree += 1

    return guide_tree, guide_tree_nodes

def msa_align_UPGMA_from_fasta(file, matrix={}, mode="GLOBAL", score_gap_ini=0, score_gap_cont=-8):
    sequences = list(readFasta(file).values())
    sequence_names = list(readFasta(file).keys())
    print(sequence_names)
    guide_tree, guide_tree_nodes = msa_align_UPGMA(sequences, matrix, mode, score_gap_ini, score_gap_cont)
    return guide_tree, guide_tree_nodes, sequences

def readFasta(file):
    """ Reads all sequences of a FASTA file 
        returns a dictionary """  
    ret_seqs = {}
    seq = ""
    key_found = False
    with open(file, 'r') as f:
        key = ""
        for line in f:
            line = line.replace('\n', '')
            if len(line) > 0:
                if line[0] == ">":
                    if key_found:
                        ret_seqs[key] = seq
                    key_found = True
                    key = line[1:].split(" ")[0]
                    seq = ""
                elif key_found:
                    seq += line
    if key_found:
        ret_seqs[key] = seq
    return ret_seqs

def pairwise_align_distance(s1, s2, matrix, mode, score_gap_ini=0, score_gap_cont=-8):
    align = AlignSequences([s1, s2])
    align.set_scores(0, 0, score_gap_ini, score_gap_cont)
    align.set_subst_matrix(matrix)
    align.compute(mode.upper(), silent = True)
    return round(100 - (align.matches * 100) / (align.matches + align.unmatches + 0.01))

def q(i, j, nseq, n, dmatrix):
    d = 0
    d = (nseq - 2) * dmatrix[(i,j)]
    for k in range(0, n):
        if (i,k) in dmatrix:
            d -= dmatrix[(i,k)]
        if (j,k) in dmatrix:
            d -= dmatrix[(j,k)]
    return d

def calc_qmatrix(nseq, n, dmatrix):
    qmatrix = {}
    for (i,j) in dmatrix:
        qmatrix[(i,j)] = q(i, j, nseq, n, dmatrix)
    return qmatrix

def smallest_q(qmatrix):
    sq = ()
    min_sq = - MIN
    for key in qmatrix.keys():
        if qmatrix[key] <  min_sq:
            min_sq = qmatrix[key]
            sq = key
    return sq

def djoin(joined_pair, nseq, n, dmatrix):
    (i, j) = joined_pair
    d_i_1 = dmatrix[(i,j)] / 2.0
    d_i_2 = 0
    for k in range(0, n):
        if (i,k) in dmatrix:
            d_i_2 += dmatrix[(i,k)]
        if (j,k) in dmatrix:
            d_i_2 -= dmatrix[(j,k)]
    d_i = d_i_1 - d_i_2 / (2*(nseq - 2))
    d_j = dmatrix[(i,j)] - d_i
    return d_i, d_j

def dnjoin(k, joined_pair, dmatrix):
    (i, j) = joined_pair
    d_k = 0
    if (i,k) in dmatrix:
        d_k += dmatrix[(i,k)]
    if (j,k) in dmatrix:
        d_k += dmatrix[(j,k)]
    d_k = (d_k - dmatrix[(i,j)]) / 2.0
    return d_k

def recalc_dmatrix(joined_pair, n, dmatrix):
    (i, j) = joined_pair
    # Recalculate distances
    for k in range(0, n):
        if (i,k) in dmatrix and (j,k) in dmatrix:
            dmatrix[(n + 1, k)] = dnjoin(k, joined_pair, dmatrix)
            dmatrix[(k, n + 1)] = dmatrix[(n + 1, k)]
    # Remove joined rows from dmatrix
    for k in range(0, n + 1):
        for l in range(0, n + 1):
            if k == i or k == j or l == i or l== j:
                if (k, l) in dmatrix:
                    del(dmatrix[(k, l)])   
    return

def msa_align_NJ(sequences, matrix={}, mode="GLOBAL", score_gap_ini=0, score_gap_cont=-8):
    dmatrix = {} #initial distance matrix
    n = len(sequences)
    for i in range(0, n):
        for j in range(0 , i):
            if (i,j) not in dmatrix:
                distance = pairwise_align_distance(sequences[i], sequences[j], matrix, mode,\
                                score_gap_ini, score_gap_cont)
                dmatrix[(i,j)] = distance
                dmatrix[(j,i)] = distance
    nseq = n
    new_nodes = n - 2
    guide_tree = [] #guided tree, pairs to align in sequence
    guide_tree_nodes = {} #guided tree rooted nodes to complete
    for i in range(0, n):
        guide_tree_nodes[i] = False
    while new_nodes > 0:
        qmatrix = calc_qmatrix(nseq, n, dmatrix)
        (joined_i, joined_j) = smallest_q(qmatrix)
        #print("JOIN:", (joined_i, joined_j))
        guide_tree.append((joined_i, joined_j, n + 1))
        guide_tree_nodes[joined_i] = True
        guide_tree_nodes[joined_j] = True
        guide_tree_nodes[n + 1] = False
        recalc_dmatrix((joined_i, joined_j), n, dmatrix)
        n += 1
        nseq -= 1
        new_nodes -= 1
    # Root the tree
    print("DMATRIX:", dmatrix)
    rooting_tuple = []
    for node in guide_tree_nodes:
        if not guide_tree_nodes[node]:
            rooting_tuple.append(node)
    rooting_tuple.append(n + 1)
    guide_tree_nodes[n + 1] = True
    print("Rooting tuple:", rooting_tuple)
    if len(rooting_tuple) == 3:
        guide_tree.append(rooting_tuple)
    assert len(rooting_tuple) == 3
    return guide_tree, guide_tree_nodes

def gapping(a, a_gapped, b_stack):
    ini_a_gapped = a_gapped
    b_gapped_stack = []
    for b in b_stack:
        b_gapped = ""
        a_gapped = ini_a_gapped
        for k, (i, j) in enumerate(zip(a, b)):
            index = a_gapped.index(i)
            a_gapped = a_gapped[index + 1:]
            #print("a_gapped", a_gapped )
            #print(i, j, index)
            b_gapped += "-" * index + j
            #print("b_gapped", b_gapped )
        b_gapped += b[k + 1:]
        b_gapped_stack.append(b_gapped)
    return b_gapped_stack

def pairwise_align_msa_step(stack_0, stack_1, matrix, mode="GLOBAL", score_match=0,\
                            score_no_match=0, score_gap_ini=0, score_gap_cont=-8):
    align = AlignSequences([stack_0[0], stack_1[0]])
    align.set_scores(score_match, score_no_match, score_gap_ini, score_gap_cont)
    align.set_subst_matrix(matrix)
    align.set_stacks([stack_0, stack_1])
    align.compute(mode.upper(), silent = True)
    # align_seq0 align_seq1 are the seq0 and seq1 alignments
    # we need to deduce the rest of alignments. 
    # what we do is perform the same gap insertions, if any, as the first sequence of the stacks
    stack_0_gapped = gapping(stack_0[0], align.align_seq0, stack_0)
    stack_1_gapped = gapping(stack_1[0], align.align_seq1, stack_1)
    return stack_0_gapped, stack_1_gapped

def msa_align_NJ_from_fasta(file, matrix={}, mode="GLOBAL", score_gap_ini=0, score_gap_cont=-8):
    sequences = list(readFasta(file).values())
    sequence_names = list(readFasta(file).keys())
    print(sequence_names)
    guide_tree, guide_tree_nodes = msa_align_NJ(sequences, matrix, mode, score_gap_ini, score_gap_cont)
    return guide_tree, guide_tree_nodes, sequences

def do_msa_from_fasta(file, method="NJ", mode="GLOBAL", matrix={}, score_gap_ini =-10,\
             score_gap_cont=-5, verbose=False):
    """
    """
    if method == "NJ":
        guide_tree, guide_tree_nodes, sequences = msa_align_NJ_from_fasta(file, matrix, \
                                                           mode, score_gap_ini, score_gap_cont) 
    else: #UPGMA
        guide_tree, guide_tree_nodes, sequences = msa_align_UPGMA_from_fasta(file, matrix, mode,\
                                                                              score_gap_ini, score_gap_cont)
    sequences_store = {}
    verbose = False
    for i in guide_tree_nodes.keys():
        if i < len(sequences):
            sequences_store[i] = [sequences[i]]
            
    if verbose: print(sequences_store)
    for (i ,j ,k) in guide_tree:
        stack_i = sequences_store[i]
        stack_j = sequences_store[j]
        if verbose: print("Stack i", i, stack_i)
        if verbose: print("Stack j", j, stack_j)
        stack_0, stack_1 = pairwise_align_msa_step(stack_i, stack_j, matrix)
        sequences_store[k] = []
        if verbose: print("==================")
        for s in stack_0:
            if verbose: print(s)
            sequences_store[k].append(s)
        for s in stack_1:
            if verbose: print(s)
            sequences_store[k].append(s)
        if verbose: print("==================")
        if verbose: print("New stack:", k, sequences_store[k])
    alignment = sequences_store[k]
    return guide_tree, alignment


### Obtain MSA

In [582]:
#sequences = ["AAA", "AAG", "AKK", "KKK", "KGG"]
#sequences = ["PPGVKSDCAS", "PADGVKDCAS", "PPDGKSDS", "GADGKDCCS", "GADGKDCAS"]
guide_tree, align = do_msa_from_fasta("exercise2_sequences.fasta", method = "UPGMA", mode = "GLOBAL",\
                 matrix = MatrixInfo.blosum62, score_gap_ini = -10, 
                 score_gap_cont = -5, verbose = False)

print("# Guide Tree UPGMA:", guide_tree)
print("# Alignment:")
for s in align:
    print(s)
    print()
print()

guide_tree, align = do_msa_from_fasta("exercise2_sequences.fasta", method = "NJ", mode = "GLOBAL",\
                 matrix = MatrixInfo.blosum62, score_gap_ini = -10, 
                 score_gap_cont = -5, verbose = False)

print()
print("# Guide Tree NJ:", guide_tree)
print("# Alignment:")
for s in align:
    print(s)
    print()
print()

['s0', 's1', 's2', 's3', 's4']
{(1, 0): 50, (2, 0): 75, (2, 1): 62, (3, 0): 44, (3, 1): 78, (3, 2): 50, (4, 0): 55, (4, 1): 89, (4, 2): 50, (4, 3): 89}
# Guide Tree UPGMA: [(4, 3, 5), (5, 1, 6), (2, 0, 7), (7, 6, 8)]
# Alignment:
PPDGKSD--S

PPGVKSDCAS

GADG-KDCAS

GADG-KDCCS

PADGVKDCAS


['s0', 's1', 's2', 's3', 's4']
DMATRIX: {(8, 3): 9.625, (3, 8): 9.625}
Rooting tuple: [3, 8, 9]

# Guide Tree NJ: [(2, 0, 6), (6, 1, 7), (7, 4, 8), [3, 8, 9]]
# Alignment:
GADGK-DCCS

PPDGKSD--S

PPGVKSDCAS

PADGVKDCAS

GADGK-DCAS




### Obtain MSA mioglobines

In [585]:
guide_tree, align = do_msa_from_fasta("globin_sequences.fasta", method = "UPGMA", mode = "GLOBAL",\
                 matrix = MatrixInfo.blosum95, score_gap_ini = -10, 
                 score_gap_cont = -5, verbose = False)
print()
print("# Guide Tree UPGMA:", guide_tree)
print("# Alignment:")
for s in align:
    print(s)
    print()
print()

guide_tree, align = do_msa_from_fasta("globin_sequences.fasta", method = "NJ", mode = "GLOBAL",\
                 matrix = MatrixInfo.blosum95, score_gap_ini = -10, 
                 score_gap_cont = -5, verbose = False)
print()
print("# Guide Tree NJ:", guide_tree)
print("# Alignment:")
for s in align:
    print(s)
    print()
print()

['myoglobine_human', 'myoglobine_chimp', 'myoglobine_gorilla', 'myoglobine_orang-oetang', 'myoglobine_babboon', 'myoglobine_pig', 'myoglobine_beaver', 'myoglobine_rabbit', 'myoglobine_horse', 'myoglobine_chicken', 'myoglobine_alligator', 'myoglobine_lizard', 'myoglobine_carp', 'myoglobine_tuna']
{(1, 0): 99, (2, 0): 99, (2, 1): 99, (3, 0): 99, (3, 1): 98, (3, 2): 98, (4, 0): 96, (4, 1): 95, (4, 2): 95, (4, 3): 97, (5, 0): 94, (5, 1): 93, (5, 2): 93, (5, 3): 93, (5, 4): 94, (6, 0): 90, (6, 1): 90, (6, 2): 90, (6, 3): 90, (6, 4): 90, (6, 5): 90, (7, 0): 90, (7, 1): 90, (7, 2): 90, (7, 3): 89, (7, 4): 90, (7, 5): 92, (7, 6): 89, (8, 0): 88, (8, 1): 89, (8, 2): 88, (8, 3): 88, (8, 4): 89, (8, 5): 91, (8, 6): 88, (8, 7): 90, (9, 0): 77, (9, 1): 76, (9, 2): 76, (9, 3): 76, (9, 4): 76, (9, 5): 77, (9, 6): 71, (9, 7): 76, (9, 8): 76, (10, 0): 66, (10, 1): 65, (10, 2): 65, (10, 3): 65, (10, 4): 65, (10, 5): 66, (10, 6): 62, (10, 7): 66, (10, 8): 63, (10, 9): 72, (11, 0): 72, (11, 1): 72, (11, 2

### Verification against CLUSTALW

In [584]:
%%bash
~/clustalw2  -OUTPUTTREE=fasta -NEGATIVE -INFILE=globin_sequences.fasta -OUTORDER=ALIGN -STATS=align.log -TREE -ALIGN -CLUSTERING=UPGMA -OUTFILE=align.fasta -OUTPUT=FASTA -MATRIX=BLOSUM -TYPE=PROTEIN -PWGAPOPEN=10 -PWGAPEXT=5 
cat globin_sequences.fasta
cat align.fasta
cat align.log




 CLUSTAL 2.1 Multiple Sequence Alignments


Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: myoglobine_human          154 aa
Sequence 2: myoglobine_chimp          153 aa
Sequence 3: myoglobine_gorilla        153 aa
Sequence 4: myoglobine_orang-oetang   153 aa
Sequence 5: myoglobine_babboon        153 aa
Sequence 6: myoglobine_pig            154 aa
Sequence 7: myoglobine_beaver         153 aa
Sequence 8: myoglobine_rabbit         153 aa
Sequence 9: myoglobine_horse          153 aa
Sequence 10: myoglobine_chicken        154 aa
Sequence 11: myoglobine_alligator      154 aa
Sequence 12: myoglobine_lizard         154 aa
Sequence 13: myoglobine_carp           147 aa
Sequence 14: myoglobine_tuna           147 aa
Start of Pairwise alignments
Aligning...

Sequences (1:2) Aligned. Score:  99
Sequences (1:3) Aligned. Score:  99
Sequences (1:4) Aligned. Score:  98
Sequences (1:5) Aligned. Score:  96
Sequences (1:6) Aligned. Score:  93
Sequences (1:7) Aligned. Scor


Unknown OUTPUT TREE type: fasta


## Outputs

In [18]:
%%bash
#cd /Users/nandoide/Desktop/uni/STRBI.practical
jupyter nbconvert --to=latex --template=~/report.tplx worksheet.ipynb
pdflatex -shell-escape worksheet
jupyter nbconvert --to html_with_toclenvs worksheet.ipynb

This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018) (preloaded format=pdflatex)
 \write18 enabled.
entering extended mode
(./worksheet.tex
LaTeX2e <2018-04-01> patch level 2
Babel <3.18> and hyphenation patterns for 84 language(s) loaded.
(/usr/local/texlive/2018/texmf-dist/tex/latex/base/article.cls
Document Class: article 2014/09/29 v1.4h Standard LaTeX document class
(/usr/local/texlive/2018/texmf-dist/tex/latex/base/size11.clo))
(/usr/local/texlive/2018/texmf-dist/tex/latex/placeins/placeins.sty)
(/usr/local/texlive/2018/texmf-dist/tex/latex/amsfonts/amssymb.sty
(/usr/local/texlive/2018/texmf-dist/tex/latex/amsfonts/amsfonts.sty))
(/usr/local/texlive/2018/texmf-dist/tex/latex/amsmath/amsmath.sty
For additional information on amsmath, use the `?' option.
(/usr/local/texlive/2018/texmf-dist/tex/latex/amsmath/amstext.sty
(/usr/local/texlive/2018/texmf-dist/tex/latex/amsmath/amsgen.sty))
(/usr/local/texlive/2018/texmf-dist/tex/latex/amsmath/amsbsy.sty)
(/usr/local/texlive/20

[NbConvertApp] Converting notebook worksheet.ipynb to latex
[NbConvertApp] Writing 54427 bytes to worksheet.tex
[NbConvertApp] Converting notebook worksheet.ipynb to html_with_toclenvs
[NbConvertApp] Writing 397499 bytes to worksheet.html
