# Zika virus Project
## by Jason Anggada
This notebook is an experiment in aligning Zika virus sequences using the Needleman-Wunsch and Smith-Waterman algorithms.

## Table of Contents

1. [Import Libraries](#import-libraries)
2. [Extract Data](#extract-data)
3. [Cluster Data](#cluster-data)
4. [Global Alignment](#global-alignment)
5. [Local Alignment](#local-alignment)
6. [Execution](#execution)
7. [Result](#result)  
 7.1 [First Virus](#first-virus)  
 7.2 [Second Virus](#second-virus)

<a id="import-libraries"></a>

## 1. Import Libraries
Import libraries used in the experiment.

In [16]:
# Import Libraries
%matplotlib inline
import re
import matplotlib.pyplot as plt
import numpy as np
import Bio.SeqIO as SeqIO
import time
from sklearn.cluster import KMeans, k_means
from ipy_table import *
from IPython.display import HTML

<a id="extract-data"></a>

## 2. Extract Data
Parse the raw GenBank records and keep the fields needed for the alignment. Records missing a host, country, or collection date are skipped.

In [2]:
# Extract Data
all_virus = []
indonesian_virus = []
foreign_virus = []
for record in SeqIO.parse("sequence.gb", "genbank"):
    # Keep only records whose first feature carries all three qualifiers
    if all(key in record.features[0].qualifiers for key in ('host', 'country', 'collection_date')):
        
        host = record.features[0].qualifiers['host'][0]
        host = host.split(';')[0]
        location = record.features[0].qualifiers['country'][0]
        location = location.split(':')[0]
        year = record.features[0].qualifiers['collection_date'][0]
        year = year.split('-')[-1]
        
        row = {
            'id'          :record.id,
            'name'        :record.name,
            'description' :record.description,
            'host'        :host,
            'location'    :location,
            'year'        :year,
            'seq'         :record.seq,
        }
        
        if location == 'Indonesia':
            indonesian_virus.append(row)
            
        else:
            foreign_virus.append(row)
        
        all_virus.append(row)
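
The three `split` calls in the cell above can be tried in isolation. The sketch below wraps them in a helper named `clean_qualifiers`, which is a hypothetical name and not part of the notebook, and the qualifier strings are made-up examples rather than values from `sequence.gb`.

```python
def clean_qualifiers(host, country, collection_date):
    """Reduce raw GenBank qualifier strings to host, country and year."""
    host = host.split(';')[0]              # drop anything after the first ';'
    location = country.split(':')[0]       # keep the country, drop the region
    year = collection_date.split('-')[-1]  # last dash-separated piece
    return host, location, year

# Hypothetical qualifier values, for illustration only
print(clean_qualifiers('Homo sapiens; female', 'Indonesia: Jambi', '12-Dec-2014'))
# ('Homo sapiens', 'Indonesia', '2014')
```

Taking the last dash-separated piece yields the year only when the year comes last, as in `12-Dec-2014` or `2014`. A date written as `2014-12-31` would yield `'31'`, so the input format is worth checking.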

<a id="cluster-data"></a>

## 3. Cluster Data
Cluster the sequence lengths into two groups with the k-means algorithm, then take the midpoint of the two centroids as the length threshold.

In [28]:
# Get Centroids
seq_len_kmeans = [[len(row['seq'])] for row in all_virus]
seq_centroid, seq_cluster, seq_cluster_error = k_means(seq_len_kmeans, n_clusters=2)
print(seq_centroid)

[[ 10614.50588235]
 [   902.625     ]]


In [29]:
# Get Threshold from the Centroids
threshold = (seq_centroid[0] + seq_centroid[1])/2
print(threshold)

[ 5758.56544118]
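
The threshold is simply the midpoint of the two centroids: (10614.50588235 + 902.625) / 2 ≈ 5758.5654. To show the idea without scikit-learn, the sketch below is a simplified pure-Python two-cluster k-means on one-dimensional data. The function name `two_means_threshold` and the sample lengths are assumptions for illustration; it is a stand-in for `k_means`, not the same implementation.

```python
def two_means_threshold(lengths, iterations=20):
    """Split 1-D values into two clusters and return the centroid midpoint."""
    low, high = float(min(lengths)), float(max(lengths))  # initial centroids
    for _ in range(iterations):
        # Assign every value to its nearest centroid
        short = [x for x in lengths if abs(x - low) <= abs(x - high)]
        long_ = [x for x in lengths if abs(x - low) > abs(x - high)]
        if not long_:  # all values identical: nothing to separate
            break
        # Move each centroid to the mean of its cluster
        low, high = sum(short) / len(short), sum(long_) / len(long_)
    return (low + high) / 2

# Made-up sequence lengths: two short fragments and two near-complete genomes
print(two_means_threshold([400, 1000, 10000, 11000]))
# 5600.0
```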


<a id="global-alignment"></a>

## 4. Global Alignment
Global alignment using the Needleman-Wunsch algorithm, scoring +1 for a match, -1 for a mismatch, and -1 per gap position.

In [5]:
# Global Alignment
def global_alignment(v,w):
    # Backtrack Enum
    R_UP = 1
    R_LEFT = 2
    R_DIAG = 3

    # Gap Cost
    d = -1
    
    # Matrix Initialization
    F = [[0 for x in range(len(v)+1)] for y in range(len(w)+1)]
    B = [[0 for x in range(len(v)+1)] for y in range(len(w)+1)]

    # Matrix Fill
    for i in range(1,len(w)+1):
        F[i][0] = d*i
        B[i][0] = R_UP
        
    for j in range(1,len(v)+1):
        F[0][j] = d*j
        B[0][j] = R_LEFT
        
    for i in range(1,len(w)+1):
        for j in range(1,len(v)+1):
            match_score = 1 if v[j-1] == w[i-1] else -1
            Match = F[i-1][j-1] + match_score
            Insert = F[i-1][j] + d
            Delete = F[i][j-1] + d
            F[i][j] = max(Match, Insert, Delete)
            if F[i][j] == Match:
                B[i][j] = R_DIAG
            elif F[i][j] == Insert:
                B[i][j] = R_UP
            elif F[i][j] == Delete:
                B[i][j] = R_LEFT
    
    # Backtrack Initialization
    vr = ""
    wr = ""
    i = len(w)
    j = len(v)
    match = 0
    mismatch = 0
    gap = 0
    
    # Backtrack
    while B[i][j] > 0:
        if B[i][j] == R_DIAG:
            vr = v[j-1] + vr
            wr = w[i-1] + wr
            if (v[j-1] == w[i-1]):
                match += 1
            else:
                mismatch += 1
            i = i-1
            j = j-1
        elif B[i][j] == R_LEFT:
            vr = v[j-1] + vr
            wr = '-' + wr
            gap += 1
            j = j-1
        elif B[i][j] == R_UP:
            vr = '-' + vr
            wr = w[i-1] + wr
            gap += 1
            i = i-1

    score = F[len(w)][len(v)]
    length = len(vr)
    
    return score, length, match, mismatch, gap
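
`global_alignment` keeps the full score and backtrack matrices, which it needs to count matches, mismatches, and gaps. When only the score matters, two rows are enough. The sketch below is a score-only variant under the same +1 / -1 / -1 scheme; `nw_score` is a hypothetical name, and it does not replace the function above because it cannot recover the alignment itself.

```python
def nw_score(v, w, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score using two rolling rows (no traceback)."""
    # Row 0: aligning a prefix of v against nothing costs one gap per base
    prev = [gap * j for j in range(len(v) + 1)]
    for i in range(1, len(w) + 1):
        curr = [gap * i]  # column 0: prefix of w against nothing
        for j in range(1, len(v) + 1):
            diag = prev[j - 1] + (match if v[j - 1] == w[i - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(nw_score("ACGT", "AGT"))
# 2  (three matches and one gap)
```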

<a id="local-alignment"></a>

## 5. Local Alignment
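Local alignment using the Smith-Waterman algorithm with the same scoring (+1 match, -1 mismatch, -1 per gap position). Cell scores are floored at 0, and the backtrack starts from the highest-scoring cell.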

In [6]:
# Local Alignment
def local_alignment(v,w):
    # Backtrack Enum
    R_UP = 1
    R_LEFT = 2
    R_DIAG = 3

    # Gap Cost
    d = -1
    
    # Matrix Initialization
    max_val = -1
    max_row = 0
    max_col = 0
    F = [[0 for x in range(len(v)+1)] for y in range(len(w)+1)]
    B = [[0 for x in range(len(v)+1)] for y in range(len(w)+1)]

    # Matrix Fill
    for i in range(1,len(w)+1):
        for j in range(1,len(v)+1):
            match_score = 1 if v[j-1] == w[i-1] else -1
            Match = F[i-1][j-1] + match_score
            Insert = F[i-1][j] + d
            Delete = F[i][j-1] + d
            F[i][j] = max(Match, Insert, Delete, 0)
            if F[i][j] > max_val:
                max_val = F[i][j]
                max_row = i
                max_col = j
            if F[i][j] == 0:
                continue
            elif F[i][j] == Match:
                B[i][j] = R_DIAG
            elif F[i][j] == Insert:
                B[i][j] = R_UP
            elif F[i][j] == Delete:
                B[i][j] = R_LEFT

    # Backtrack Initialization
    vr = ""
    wr = ""
    i = max_row
    j = max_col
    match = 0
    mismatch = 0
    gap = 0
    
    # Backtrack
    while B[i][j] > 0:
        if B[i][j] == R_DIAG:
            vr = v[j-1] + vr
            wr = w[i-1] + wr
            if (v[j-1] == w[i-1]):
                match += 1
            else:
                mismatch += 1
            i = i-1
            j = j-1
        elif B[i][j] == R_LEFT:
            vr = v[j-1] + vr
            wr = '-' + wr
            gap += 1
            j = j-1
        elif B[i][j] == R_UP:
            vr = '-' + vr
            wr = w[i-1] + wr
            gap += 1
            i = i-1
            
    score = max_val
    length = len(vr)
    
    return score, length, match, mismatch, gap
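
The two differences from the global version, the floor at 0 and the use of the best cell anywhere in the matrix, are easiest to see on a toy pair. The sketch below is a score-only variant with two rolling rows; `sw_score` is a hypothetical name, and the sequences are made up for illustration.

```python
def sw_score(v, w, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score using two rolling rows (no traceback)."""
    best = 0
    prev = [0] * (len(v) + 1)  # first row is all zeros in local alignment
    for i in range(1, len(w) + 1):
        curr = [0]             # first column is zero as well
        for j in range(1, len(v) + 1):
            diag = prev[j - 1] + (match if v[j - 1] == w[i - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)  # the answer may sit in any cell
        prev = curr
    return best

# Only the shared core "ACGT" contributes; the flanking bases are ignored
print(sw_score("TTTACGTTT", "GGACGTGG"))
# 4
```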

<a id="execution"></a>

## 6. Execution
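Align each Indonesian sequence against every foreign sequence, using local alignment when the foreign sequence is longer than the threshold and global alignment otherwise, then write one pipe-delimited row per pair to `result.txt`.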

In [7]:
# Write result.txt
start_time = time.time() # execution time
result = ""

for indo in indonesian_virus:
    for foreign in foreign_virus:
        virus_time = time.time()
        score = ""
        alignment = ""
        
        if len(foreign['seq']) > threshold:
            alignment = "Local"
            score, length, match, mismatch, gap = local_alignment(indo['seq'], foreign['seq'])
        else:
            alignment = "Global"
            score, length, match, mismatch, gap = global_alignment(indo['seq'], foreign['seq'])
        
        # One pipe-delimited row per pair, ending with the time spent on this pair
        fields = [indo['id'], foreign['id'], foreign['location'], alignment,
                  length, match, mismatch, gap, score, time.time() - virus_time]
        result += "|".join(str(field) for field in fields) + "\n"

print("Total time elapsed: %s seconds" % (time.time() - start_time))
with open('result.txt', 'w') as f:
    f.write(result)

Total time elapsed: 3464.707650899887 seconds


<a id="result"></a>

## 7. Result
Display the results in tables, one per Indonesian virus, sorted by score in descending order.

In [12]:
# Read result.txt
with open("result.txt", "r") as f:
    lines = f.readlines()

result = []

for line in lines:
    line_split = line.split('|')
    detail = (
        '<a href="http://www.ncbi.nlm.nih.gov/nuccore/' + line_split[0] + '">' + line_split[0] + "</a>",
        '<a href="http://www.ncbi.nlm.nih.gov/nuccore/' + line_split[1] + '">' + line_split[1] + "</a>",
        line_split[2],
        line_split[3],
        int(line_split[4]),
        int(line_split[5]),
        int(line_split[6]),
        int(line_split[7]),
        int(line_split[8]),
        round(float(line_split[9]),2)
    )
    result.append(detail)

In [14]:
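# Split the rows in half, one half per Indonesian virus
# (assumes exactly two Indonesian sequences, each compared with the same foreign set)
first_virus = result[:len(result)//2]
second_virus = result[len(result)//2:]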

In [25]:
# Sort the result
sorted_first = sorted(first_virus, key=lambda x: x[8], reverse=True)
sorted_second = sorted(second_virus, key=lambda x: x[8], reverse=True)

<a id="first-virus"></a>

### 7.1 First Virus

In [26]:
# First virus
header = ['Indo Virus','Foreign Virus','Location','Alignment', 'Length', 'Match', 'Mismatch', 'Gap', 'Score', 'Time (s)']
# Build a new list so that re-running the cell does not insert the header twice
T = [header] + sorted_first

i = 1
S = "<table>"
for row in T:
    color = ""
    if i == 1:
        color = 'LightGray'
    elif i % 2 == 0:
        color = 'Ivory'
    else:
        color = 'AliceBlue'
    S += "<tr style='background-color:" + color +";'>"
    for column in row:
        if i == 1:
            S += "<th>" + str(column) + "</th>"
        else:
            S += "<td>" + str(column) + "</td>"
    S += "</tr>"
    i += 1
S += "</table>"

HTML(S)

| Indo Virus | Foreign Virus | Location | Alignment | Length | Match | Mismatch | Gap | Score | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| KU179098.1 | KF993678.1 | Canada | Local | 1148 | 1140 | 8 | 0 | 1132 | 24.78 |
| KU179098.1 | KU509998.3 | Haiti | Local | 1148 | 1139 | 9 | 0 | 1130 | 26.99 |
| KU179098.1 | KX051563.1 | USA | Local | 1148 | 1139 | 9 | 0 | 1130 | 26.67 |
| KU179098.1 | KJ776791.1 | French Polynesia | Local | 1148 | 1139 | 9 | 0 | 1130 | 25.89 |
| KU179098.1 | KU647676.1 | Martinique | Local | 1148 | 1138 | 10 | 0 | 1128 | 25.95 |
| KU179098.1 | KU497555.1 | Brazil | Local | 1148 | 1138 | 10 | 0 | 1128 | 26.85 |
| KU179098.1 | KU501217.1 | Guatemala | Local | 1148 | 1138 | 10 | 0 | 1128 | 25.34 |
| KU179098.1 | KU501216.1 | Guatemala | Local | 1148 | 1138 | 10 | 0 | 1128 | 25.48 |
| KU179098.1 | KX377337.1 | Puerto Rico | Local | 1148 | 1137 | 11 | 0 | 1126 | 26.78 |
| KU179098.1 | KU820897.3 | Colombia | Local | 1148 | 1137 | 11 | 0 | 1126 | 26.31 |
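
Because both alignment functions score +1 per match and -1 per mismatch or gap position, each row should satisfy Score = Match - Mismatch - Gap and Length = Match + Mismatch + Gap. The sketch below checks this for the first three rows of the table; the helper name `expected_score` is an assumption for illustration.

```python
def expected_score(match, mismatch, gap):
    """Score implied by the +1 / -1 / -1 scheme used in both alignment functions."""
    return match - mismatch - gap

# (length, match, mismatch, gap, score) copied from the first three table rows
rows = [
    (1148, 1140, 8, 0, 1132),
    (1148, 1139, 9, 0, 1130),
    (1148, 1139, 9, 0, 1130),
]

for length, match, mismatch, gap, score in rows:
    assert expected_score(match, mismatch, gap) == score
    assert match + mismatch + gap == length

print(expected_score(1140, 8, 0))
# 1132
```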


<a id="second-virus"></a>

### 7.2 Second Virus

In [27]:
# Second virus
header = ['Indo Virus','Foreign Virus','Location','Alignment', 'Length', 'Match', 'Mismatch', 'Gap', 'Score', 'Time (s)']
# Build a new list so that re-running the cell does not insert the header twice
T = [header] + sorted_second

i = 1
S = "<table>"
for row in T:
    color = ""
    if i == 1:
        color = 'LightGray'
    elif i % 2 == 0:
        color = 'Ivory'
    else:
        color = 'AliceBlue'
    S += "<tr style='background-color:" + color +";'>"
    for column in row:
        if i == 1:
            S += "<th>" + str(column) + "</th>"
        else:
            S += "<td>" + str(column) + "</td>"
    S += "</tr>"
    i += 1
S += "</table>"

HTML(S)

| Indo Virus | Foreign Virus | Location | Alignment | Length | Match | Mismatch | Gap | Score | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| KF258813.1 | KF993678.1 | Canada | Local | 402 | 397 | 5 | 0 | 392 | 9.85 |
| KF258813.1 | EU545988.1 | Micronesia | Local | 402 | 397 | 5 | 0 | 392 | 9.92 |
| KF258813.1 | KX101064.1 | Brazil | Local | 402 | 396 | 6 | 0 | 390 | 9.01 |
| KF258813.1 | KU866423.2 | China | Local | 402 | 396 | 6 | 0 | 390 | 8.78 |
| KF258813.1 | KU647676.1 | Martinique | Local | 402 | 396 | 6 | 0 | 390 | 8.95 |
| KF258813.1 | KX280026.1 | Brazil | Local | 402 | 396 | 6 | 0 | 390 | 9.24 |
| KF258813.1 | KU955593.1 | Cambodia | Local | 402 | 396 | 6 | 0 | 390 | 9.24 |
| KF258813.1 | KU681082.3 | Philippines | Local | 402 | 396 | 6 | 0 | 390 | 9.15 |
| KF258813.1 | KX253996.1 | China | Local | 402 | 396 | 6 | 0 | 390 | 9.61 |
| KF258813.1 | KX197192.1 | Brazil | Local | 402 | 396 | 6 | 0 | 390 | 9.78 |
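
The cells in 7.1 and 7.2 repeat the same rendering loop. One possible way to share it is a small helper that returns the HTML string; `render_table` is a hypothetical name, and the sketch aims to produce the same markup as the loops above without modifying its inputs.

```python
def render_table(header, rows):
    """Return an HTML table with a gray header row and alternating row colors."""
    parts = ["<table>"]
    for i, row in enumerate([header] + list(rows), start=1):
        if i == 1:
            color, tag = 'LightGray', 'th'   # header row
        else:
            color, tag = ('Ivory' if i % 2 == 0 else 'AliceBlue'), 'td'
        cells = "".join("<%s>%s</%s>" % (tag, column, tag) for column in row)
        parts.append("<tr style='background-color:%s;'>%s</tr>" % (color, cells))
    parts.append("</table>")
    return "".join(parts)

# Tiny made-up table, for illustration only
print(render_table(['Match', 'Score'], [(397, 392), (396, 390)]))
```

In the notebook, the result could then be displayed with `HTML(render_table(header, sorted_first))` and `HTML(render_table(header, sorted_second))`.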
