### EECS 730 Project - 2

#### Steps followed
1. The human genome file is huge (around 3 GB), so extracting a sequence by reading the whole file would be time consuming.
2. Extracted the sequence for a specific chromosome into a temporary FASTA file.
3. For the given chromosome, the steps below are followed to extract the protein sequence.
4. Extracted sequences for every exon frame, except for frame = -1.
5. For the negative strand -
<br>a. Read the frames, start and end positions in reverse order.
<br>b. Using the exon start and end positions, extract the sequence.
<br>c. Gather the sequences for each exon frame into a list and join them into a single sequence.
<br>d. Compute the starting position of the nucleotide sequence using the coding region end position (cdsEnd from the annotation table).
<br>e. Take the substring of the sequence from this new start position.
6. For the positive strand -
<br>a. Read the frames, start and end positions in normal order.
<br>b. Using the exon start and end positions, extract the sequence.
<br>c. Gather the sequences for each exon frame into a list and join them into a single sequence.
<br>d. Compute the starting position of the nucleotide sequence using the coding region start position (cdsSt from the annotation table).
<br>e. Take the substring of the sequence from this new start position.
7. Translate the extracted sequence into a protein sequence using the codon table. Stop the translation once a stop codon is read.
8. Write the protein sequence to an output file in the format below.

">name:name2 <br>
"Protein sequence

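The positive-strand steps (6 through 8) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the helper names (`join_exons`, `coding_offset`, `translate`, `fasta_record`), the toy chromosome string, the record names `NM_TOY` and `TOYGENE`, and the 0-based, end-exclusive exon coordinates are all hypothetical. The codon table is the standard genetic code.

```python
# Standard genetic code, indexed by base order T, C, A, G (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def join_exons(chrom_seq, exon_starts, exon_ends):
    """Slice each exon (assumed 0-based, end-exclusive) and join the pieces."""
    return "".join(chrom_seq[s:e] for s, e in zip(exon_starts, exon_ends))


def coding_offset(exon_starts, exon_ends, cds_start):
    """Number of joined-exon bases that lie before cds_start."""
    return sum(max(0, min(e, cds_start) - s) for s, e in zip(exon_starts, exon_ends))


def translate(nucleotides):
    """Translate codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(nucleotides) - 2, 3):
        amino_acid = CODON_TABLE.get(nucleotides[i:i + 3].upper(), "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def fasta_record(name, name2, protein):
    """Format one output record as '>name:name2' followed by the protein."""
    return ">{}:{}\n{}\n".format(name, name2, protein)


# Toy positive-strand example: two exons separated by a lowercase intron.
chrom = "GGATGGCCgtaagTTTTAAGG"
starts, ends = [0, 13], [8, 21]
mrna = join_exons(chrom, starts, ends)
cds = mrna[coding_offset(starts, ends, cds_start=2):]
print(fasta_record("NM_TOY", "TOYGENE", translate(cds)))
```

Unknown or incomplete codons map to `X` here, which is a choice made for this sketch only.
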
#### Import relevant packages

In [97]:
# import packages
import os
import Bio
from Bio import SeqIO
from Bio.Seq import Seq
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get
from pydna.common_sub_strings import terminal_overlap
from pydna.assembly import Assembly
from pydna.dseqrecord import Dseqrecord

# Print versions
print('The Biopython version is {}..'.format(Bio.__version__))

The Biopython version is 1.78..


#### Create the paths for reference files

In [98]:
# Set the local paths for data
path = r'C:\Users\pmspr\Documents\HS\MS\Sem 4\EECS 730\Bioinformatics\Project 2\Docs'
reads = os.path.join(path, 'HW2_reads.fasta')
output = os.path.join(path,'sequence_assembly.txt')

#### Important methods

In [133]:
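# This method derives the shortest path for the given list of sequences
def getcontig(seqlist,seq):
    # Wrap at most the first 9 reads of the group as pydna Dseqrecord objects
    dseq = tuple(Dseqrecord(seq[s]) for i, s in enumerate(seqlist) if i < 9)
    # limit=49 is passed to Assembly as the overlap threshold
    x = Assembly(dseq, limit=49)
    contigs = x.assemble_linear()
    # Return the top (Watson) strand of the first contig; None if nothing assembled
    if len(contigs) > 0:
        return contigs[0].seq.watson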

In [134]:
# This method verifies if there is an overlap of certain threshold between sequences.
# Threshold = 50bp according to project instructions
def compare_overlap(s1, s2):
    overlaps = terminal_overlap(s1,s2, limit=49)
    if len(overlaps) > 0:
        return 'y'
    else:
        return 'n'
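
To make the overlap test concrete, here is a dependency-free sketch of an end-to-end (suffix/prefix) overlap check. It is not pydna's implementation of `terminal_overlap`: it only considers exact matches, the function names are hypothetical, and treating `min_length` as an inclusive threshold is an assumption of this sketch.

```python
def suffix_prefix_overlap(left, right, min_length=50):
    """Return the length of the longest suffix of `left` that equals a prefix
    of `right`, or 0 if no such overlap reaches `min_length` bases."""
    longest = min(len(left), len(right))
    for length in range(longest, min_length - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def reads_overlap(a, b, min_length=50):
    """True if the reads overlap end to end in either order."""
    return (suffix_prefix_overlap(a, b, min_length) > 0
            or suffix_prefix_overlap(b, a, min_length) > 0)


# Toy reads sharing the 6 bases "GATTAC"; the threshold is lowered to match.
r1 = "TTTCCGGATTAC"
r2 = "GATTACAAAGGG"
print(suffix_prefix_overlap(r1, r2, min_length=5))
print(reads_overlap(r2, r1, min_length=5))
print(suffix_prefix_overlap(r1, r2, min_length=50))
```

Scanning from the longest candidate downward returns the longest qualifying overlap first, at the cost of quadratic work per pair in the worst case.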
    

In [135]:
# This method derives the contigs for sequences with overlap
def contigs(seq):
    lr = list(range(len(seq)))  # indices of reads not yet assigned to a contig
    contigList = []
    while lr:
        i1 = min(lr)
        # collect every remaining read that overlaps read i1
        c1 = sorted(i for i in lr if compare_overlap(seq[i1],seq[i]) == 'y')
        contigList.append(getcontig(c1,seq))
        lr = list(set(lr) - set(c1))
    return contigList
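
The grouping step in `contigs` can be illustrated on toy data without pydna. In this sketch, `group_by_seed_overlap` and `shares_end` are hypothetical names, and `shares_end` is a stand-in overlap test (an exact 3-base end match) rather than the 50 bp check used in the project. Unlike the loop above, the seed read is always placed in its own group.

```python
def group_by_seed_overlap(reads, overlaps):
    """Partition read indices into groups: repeatedly take the smallest
    remaining index as the seed and collect every remaining read that
    overlaps it (the seed always joins its own group)."""
    remaining = list(range(len(reads)))
    groups = []
    while remaining:
        seed = remaining[0]  # remaining stays sorted, so this is the minimum
        group = [i for i in remaining
                 if i == seed or overlaps(reads[seed], reads[i])]
        groups.append(group)
        remaining = [i for i in remaining if i not in group]
    return groups


def shares_end(a, b, k=3):
    """Hypothetical stand-in for the overlap test: exact k-base end match."""
    return a[-k:] == b[:k] or b[-k:] == a[:k]


reads = ["AAACCC", "CCCGGG", "TTTAAA", "GAGAGA"]
print(group_by_seed_overlap(reads, shares_end))
```

Each group contains only reads that overlap the seed directly; reads connected to the seed through an intermediate read land in a later group.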

#### Main logic

In [132]:
# Gather all the sequences from the input fasta file
seq = []
with open(reads) as genome:
    for line in genome:
        if(line[0].strip() != '>'):
            seq.append(line.strip())
print('Total number of reads in file is {}..'.format(len(seq)))

# Derive assembled DNA sequence from the individual contigs
while (len(seq) > 1): 
    contigList = contigs(seq)
    seq = contigList


print(seq[0])

Total number of reads in file is 127..
CCCTGTCTACCACCCAGACTATCGTGTAGTTCTGCCTGTTCCGTAAGTCGTAGATTGCTATCCTGGAAATCATCGTGCTCAGGATGTTAATATCTAGCGTCCTACGTTACGAGTTGGCAGATGACAGATCGTAGTCGTGGTAAGGGGCATTGCCGCTTGTGACCCAGTTCGCGTGCCTAGCAGCACTCCAAAATAAAGTTTACAGTACCGTCCGGACGGCAGAACTGTCCTCTAGATCGTCCTAACGCCTTAGTCGAATCCCTTGCCGTCGGTAACCACTGAATAAACTACGCGTTAGGACTTTGTCAGACGCGAGGAGCTAGTAGGAGGACAAATCAGCAAACGACCCTGAATTGAACAATGTGAGTAGGTATAACTGTGCTTGTATGACGTCCCGTTCGGTCGTTCTTGAGCAACTTCGGCCAGTGCATGCTATGGGGGAAGCTATGAATTCTATGTTGGAACTTGGGCCCGGCATAGTAGTTTATGCCTGTGGACCGGTGTTGAGTGTATCTGCTGGACCCCGGCGCGTTCACCTGTCCACATCTAATCCAAACATATACTATTGGTATTTGAGCGTCTCACAACGACATCGACTGGTATTAGACACCTACCAGGAACAACCAATCGGTTTAGATGACGCACAGCCACGGACAGCCTCTGTTGCTTGAGCAGTCCCAAAGTGCGTACCTGAAGCCTGCCAAAACGTAGCCTAGGCAAATGCCCGTCGTCTTGCTCATAACTCCTTGGGACTGGCGTATCCATAAATAATCCATTCGATTCCTTGAGAGTTCCACATTAGAGACTTATCCATCGAGGATCAGGCCAAATCCGCGAGACCCGACCGAGATCAAGTATAACTCATTACGCGTGGTGTGGTTGCGGCCCACCCTTATCGTGAGCCAGTTGTTGGATATACCCCTGGGCGGGCCTAAAGCTCCGCAACGAACACCCCCTCCGC

In [73]:
from pydna.common_sub_strings import terminal_overlap
s1 = seq[0]; s2 = seq[4]
overlaps = terminal_overlap(s1,s2, limit=49)
if len(overlaps) > 0:
    ind = overlaps[0]
    s1_st = ind[0]; s2_st = ind[1]; o_len = ind[2]
    print(s1[s1_st: s1_st + o_len])
    print(s2[s2_st: s2_st + o_len])

In [74]:
from pydna.assembly import Assembly
from pydna.dseqrecord import Dseqrecord
dseq = tuple(Dseqrecord(s) for i, s in enumerate(seq) if i < 4)
s1 = Dseqrecord(seq[0]); s2 = Dseqrecord(seq[4])
x = Assembly((s1,s2), limit=49)
#x = Assembly(dseq, limit=49)
contigs = x.assemble_linear()
print(len(contigs))
print(contigs[0].seq.watson)#detailed_figure()
#contigs[0].detailed_figure()

0


IndexError: list index out of range

In [49]:
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global'
score = aligner.score(s1, s2)
alignments = aligner.align(s1, s2)
print(score)
cnt = 0
print(len(alignments))
print(alignments[8383287])
# for al in sorted(alignments):
#     print(al)
#     if (al.score == score):
#         cnt += 1
# print(cnt)        

ValueError: sequence has unexpected format

In [1]:
def score(sequence1,sequence2,offset):
    start_of_overlap = max(0-offset,0)
    end_of_overlap = min([len(sequence2)-offset, len(sequence2), len(sequence1)-offset])
    total_score = 0
    for position in range(start_of_overlap, end_of_overlap):
        if sequence2[position] == sequence1[position+offset]:
            total_score = total_score + 1
    return total_score

In [2]:
def find_best_offset(sequence1,sequence2):
    lowest_offset = 1-len(sequence2)
    highest_offset = len(sequence1)
    all_offsets = []
    for offset in range(lowest_offset,highest_offset):
        # add the 4-tuple for this offset
        all_offsets.append((score(sequence1,sequence2,offset),offset,sequence2,sequence1))
    return max(all_offsets)

In [3]:
def find_best_match(sequence,others):
    all_matches = []
    for sequence2 in others:
        if sequence2 != sequence:
            all_matches.append(find_best_offset(sequence,sequence2))
    return max(all_matches)

In [4]:
def consensus(score,offset,sequence1,sequence2):
    sequence2_left_overhang = sequence2[0:max(0,offset)]
    sequence2_right_overhang = sequence2[len(sequence1)+offset:]
    return sequence2_left_overhang + sequence1 + sequence2_right_overhang
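
A small worked example may help in reading the offset convention: an offset of `k` means the first base of the second sequence sits on top of index `k` of the first. The sketch below is self-contained and uses hypothetical names (`overlap_window`, `matches_at`, `merge_at`). Bounding the window by `len(b)` and `len(a) - offset` is an assumption of this sketch about how the overlap should be delimited.

```python
def overlap_window(len_a, len_b, offset):
    """Index range in b that lies on top of a when b[0] sits at a[offset]."""
    start = max(-offset, 0)
    end = min(len_b, len_a - offset)
    return start, end


def matches_at(a, b, offset):
    """Count identical bases inside the overlap window."""
    start, end = overlap_window(len(a), len(b), offset)
    return sum(1 for p in range(start, end) if b[p] == a[p + offset])


def merge_at(a, b, offset):
    """Lay b over a at the given offset; b's bases are used inside the overlap.
    Assumes 1 - len(b) <= offset < len(a), so the two sequences touch."""
    left_overhang = a[:max(0, offset)]
    right_overhang = a[len(b) + offset:]
    return left_overhang + b + right_overhang


a = "ACGTACGGA"
b = "CGGATTT"
# Try every offset at which the sequences share at least one position.
best_score, best_offset = max(
    (matches_at(a, b, off), off) for off in range(1 - len(b), len(a))
)
print(best_score, best_offset)
print(merge_at(a, b, best_offset))
```

Here the last four bases of `a` (`CGGA`) match the first four of `b`, so the best offset is 5 and the merged sequence is `ACGTACGGATTT`.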

In [5]:
def assemble(sequence, others):
    # remember, best_matching_other is a 4-tuple
    best_matching_other = find_best_match(sequence, others)
    # the * expands the elements of the tuple so we can use them as arguments to consensus()
    consensus_sequence = consensus(*best_matching_other)
    if len(others) == 1:
        return consensus_sequence
    else:
        # get the third element of the best_matching_other tuple, which is the sequence
        best_matching_sequence = best_matching_other[2]
        others.remove(best_matching_sequence)
        return assemble(consensus_sequence, others)

In [16]:
contig = assemble(seq[0],seq[1:])
print(contig)

TGTCGACAAGCCAGTCACTATGTAAGCGAACCACCATAATTGATCGACGATAAAGTGACGCGTCCATGCTCATGTATTTATATGACGGCCAAAAATGGAGATATTATAGTCGACCAAGTATTGGCGTCGAACAACCGCGCCCTGCAGAATCCCAAGATTCGCCAGGCGGCGAACGAGGCCTACGGGCAACGGGTTATACTTAGCTGCAACCAACGCCTTTCCACATGTTTGAGAACCACCATAATTGATCGACGATAAAGTGACGCGTCCATGCTCATGTATTTATATGATCACCTGCGATCGGGCCCGGTGTGTTCATATACGATGCCTCTCCACTTGTCGACAAGCAAAATAAAGTTTACAGTACCGTCCGGACGGCAGAACTGTCCTCTAGATCGTCCTAACGCCTTAGTCGAATCCCTTGCCGTCGGTAACCACTGAATAAACTACGCGTTAGGACTTTGTCAGACGCGAGGAGCTAGTAGGAGGACAAATCAGCAAACGACCCTGAATTGAACAATGTGAGTAGGTATAACTGTGCTTGTATGACGTCCCGTTCGGTCGTTCTTGAGCAACTTCGGCCAGTGCATGCTATGGGGGAAGCTATGAATTCTATGTTGGAACTTGGGCCCGGCATAGTAGTTTATGCCTGTGGACCGGTGTTGAGTGTATCTGCTGGACCCCGGCGCGTTCACCTGTCCACATCTAATCCAAACATATACTATTGGTATTTGAGCGTCTCACAACGACATCGACTGGTATTAGACACCTACCAGGAACAACCAATCGGTTTAGATGACGCACAGCCACGGACAGCCTCTGTTGCTTGAGCAGTCCCAAAGTGCGTACCTGAAGCCTGCCAAAACGTAGCCTAGGCAAATGCCCGTCGTCTTGCTCATAACTCCTTGGGACTGGCGTATCCATAAATAATCCATTCGATTCCTTGAGAGTTCCACATTAGAGACTTATCCATCGAGGATCAGGCCAAATCCGCGAGACC