Problem

After identifying the exons and introns of an RNA string, we only need to delete the introns and concatenate the exons to form a new string ready for translation.

Given: A DNA string s (of length at most 1 kbp) and a collection of substrings of s acting as introns. All strings are given in FASTA format.

Return: A protein string resulting from transcribing and translating the exons of s. (Note: Only one solution will exist for the dataset provided.)

Sample Dataset

>Rosalind_10
ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG
>Rosalind_12
ATCGGTCGAA
>Rosalind_15
ATCGGTCGAGCGTGT
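
The FASTA input above can also be read without Biopython. The following is a minimal sketch, not the notebook's own method: `parse_fasta` is a hypothetical helper name, and it assumes every record starts with a `>` header line and that a sequence may span several lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines are joined later
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>Rosalind_10
ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG
>Rosalind_12
ATCGGTCGAA
>Rosalind_15
ATCGGTCGAGCGTGT
"""

records = parse_fasta(sample)
dna = records[0][1]                        # first record is the DNA string s
introns = [seq for _, seq in records[1:]]  # remaining records are the introns
print(introns)
```

The first record is treated as s and the rest as introns, which mirrors the order in which the dataset is given.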

Sample Output 

MVYIADKQHVASREAYGHMFKVCA  
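
The sample can be checked by hand with a short pure-Python sketch of the three steps: splice, transcribe, translate. This is an illustration under stated assumptions rather than the solution used in this notebook: the function names `splice`, `transcribe`, and `translate` are hypothetical, introns are removed one at a time with `str.replace` (which assumes they do not overlap and that removing one does not create or destroy a match for another), and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (third base varies fastest); "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def splice(dna, introns):
    """Remove the first occurrence of each intron (assumes no overlaps)."""
    for intron in introns:
        dna = dna.replace(intron, "", 1)
    return dna


def transcribe(dna):
    """Transcribe the coding strand into RNA by replacing T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


s = ("ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGG"
     "TCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG")
introns = ["ATCGGTCGAA", "ATCGGTCGAGCGTGT"]

protein = translate(transcribe(splice(s, introns)))
print(protein)  # MVYIADKQHVASREAYGHMFKVCA
```

Removing the two introns leaves 75 nucleotides, that is 24 coding codons followed by the stop codon UAG, which matches the 24-residue sample output.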


In [21]:
from Bio import SeqIO

path = "C:/Users/niina/Google Drive/mdp_bioinformatics/rosalind/bioinformatics stronghold/"
file = "rosalind_splc.txt"
f = path + file

# Read the FASTA records: the first is the DNA string, the rest are introns
sequences = [record.seq for record in SeqIO.parse(f, "fasta")]
s = sequences[0]
introns = sequences[1:]

# Locate each intron in s and sort the (start, length) pairs by start position.
# A list is used instead of a dict so that duplicate introns are not collapsed.
spans = []
for intron in introns:
    pos = s.find(intron)
    if pos == -1:
        raise ValueError(f"Intron not found in s: {intron}")
    spans.append((pos, len(intron)))
spans.sort()

s_new = s[:0]  # empty Seq; exons will be appended to this
start = 0  # first exon starts at position zero
for pos, length in spans:
    s_new = s_new + s[start:pos]  # exon preceding this intron
    start = pos + length  # next exon starts right after the intron
s_new = s_new + s[start:]  # add final exon

# Seq.translate accepts DNA directly; to_stop=True ends the protein at the first stop codon
print(s_new.translate(to_stop=True))

MTKASCRPWLHTRGGARRQSFAGKVITGLLLLRLPYQLYQEPSTISEAYVDISITFLLSLNGRYTSLVVCLFPYRAPMEVIDRGSHGEDKTILQMDGVYASILGVRLITVLYIGGCQHVLCTSGTPKWFIAQGQLHLSGYEGQCHSRRSIDDPGGGVLTGYFGPLTKGSDYIQVARRHGTNPRP


