# RNA Splicing
**Given:** A DNA string *s* (of length at most 1 kbp) and a collection of substrings of *s* acting as introns. All strings are given in FASTA format.

**Return:** A protein string resulting from transcribing and translating the exons of *s*. (Note: Only one solution will exist for the dataset provided.)


In [1]:
%%file RNA_Codon_Table.txt
UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G 

Overwriting RNA_Codon_Table.txt
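
The same 64-entry table can also be built in memory, which avoids depending on `RNA_Codon_Table.txt` being present. The following is a minimal sketch and not part of the original solution: `BASES`, `AMINO_ACIDS` and `build_codon_table` are hypothetical names. It assumes codons are enumerated with the bases in the order U, C, A, G at every position, and it uses `*` as a stand-in for `Stop`.

```python
from itertools import product

# Assumed base order at each codon position (first base varies slowest).
BASES = "UCAG"

# One letter per codon in that order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)

def build_codon_table():
    """Return a dict mapping each RNA codon to an amino acid letter or 'Stop'."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return {
        codon: ("Stop" if amino_acid == "*" else amino_acid)
        for codon, amino_acid in zip(codons, AMINO_ACIDS)
    }

codon_table = build_codon_table()
print(len(codon_table), codon_table["AUG"], codon_table["UGA"])  # 64 M Stop
```

The resulting dict has the same shape as one parsed from the file, so it could be passed wherever a `codon_table` argument is accepted.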


# Sample Dataset

In [2]:
%%file Sample_Dataset.txt
>Rosalind_10
ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG
>Rosalind_12
ATCGGTCGAA
>Rosalind_15
ATCGGTCGAGCGTGT



Overwriting Sample_Dataset.txt


# Sample Output

In [3]:
%%file Sample_Output.txt
MVYIADKQHVASREAYGHMFKVCA



Overwriting Sample_Output.txt


# Solution

In [4]:
class FastaRecord:
    def __init__(self, name, sequence=""):
        self.name = name
        self.sequence = sequence

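def parseFastaFile(fasta_file_path):
    "Parse a FASTA file into a list of FastaRecord objects, joining wrapped sequence lines."
    fasta_records = []
    
    # The with-block closes the file even if parsing raises an exception.
    with open(fasta_file_path, 'r') as fasta_file:
        for line in fasta_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. trailing newlines in the dataset
            if line.startswith(">"):
                fasta_records.append(FastaRecord(line[1:]))
            else:
                fasta_records[-1].sequence += line
    
    return fasta_records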
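
Sequences in FASTA files are often wrapped over several lines, as in the downloaded dataset further down, so a parser has to join every line up to the next `>` header. Below is a minimal sketch of that behaviour on in-memory text. `parse_fasta_text` and the record names `seq_a` and `seq_b` are hypothetical and used only for illustration.

```python
def parse_fasta_text(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between or after records
        if line.startswith(">"):
            records.append([line[1:], ""])  # start a new record
        else:
            records[-1][1] += line  # append a wrapped sequence line
    return [(name, sequence) for name, sequence in records]

# The first record is wrapped over two lines and followed by a blank line.
example = ">seq_a\nATGC\nGGTA\n\n>seq_b\nTTAA\n"
records = parse_fasta_text(example)
print(records)  # [('seq_a', 'ATGCGGTA'), ('seq_b', 'TTAA')]
```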


In [5]:
def transcribeDNAToRNA(input_string):
    "Given a DNA string t having length at most 1000 nt, return the transcribed RNA string of t."
    
    output_string = input_string.replace("T","U")
    
    return output_string

In [6]:
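def parseCodonTable(codon_table_file_path = "RNA_Codon_Table.txt"):
    "Read whitespace-separated 'codon amino_acid' pairs into a dict mapping each codon to its amino acid (or 'Stop')."
    # The with-block makes sure the file handle is closed after reading.
    with open(codon_table_file_path, 'r') as codon_table_file:
        codon_table_file_tokens = codon_table_file.read().split()
    # Tokens alternate: codon, amino acid, codon, amino acid, ...
    codons = codon_table_file_tokens[0::2]
    amino_acids = codon_table_file_tokens[1::2]
    codon_table = dict(zip(codons, amino_acids))
    return codon_table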


In [7]:
def translateRNA(input_string, codon_table = None, start_codon = "AUG"):
    "Given an RNA string s corresponding to a strand of mRNA (of length at most 10 kbp), return the protein string encoded by s."
    
    # Load the table at call time; a default of parseCodonTable() would read the file once, when the function is defined.
    if codon_table is None:
        codon_table = parseCodonTable()
    start_index = input_string.find(start_codon)
    if start_index == -1:
        return ""  # no start codon, so there is nothing to translate
    
    amino_acids = []
    # Stop two characters short of the end so an incomplete trailing codon is never looked up.
    for i in range(start_index, len(input_string) - 2, 3):
        amino_acid = codon_table[input_string[i:i+3]]
        if amino_acid == "Stop":
            break
        amino_acids.append(amino_acid)
    
    return "".join(amino_acids)


In [8]:
s = "ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG"
introns = ["ATCGGTCGAA", "ATCGGTCGAGCGTGT"]
for intron in introns:
    s = s.replace(intron, ' ')
    
"".join(s.split())


'ATGGTCTACATAGCTGACAAACAGCACGTAGCATCTCGAGAGGCATATGGTCACATGTTCAAAGTTTGCGCCTAG'
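
Replacing each intron with a space, rather than with an empty string, keeps the flanking exons apart until every intron has been removed. Otherwise two exons could join and form a new match for a later intron. The sketch below illustrates the difference on a made-up toy sequence. `splice_naive`, `splice_with_placeholder`, `toy_sequence` and `toy_introns` are hypothetical names. Since the problem guarantees a unique solution, this situation may never arise in an actual dataset.

```python
def splice_naive(sequence, introns):
    """Delete each intron directly; neighbouring exons are joined immediately."""
    for intron in introns:
        sequence = sequence.replace(intron, "")
    return sequence

def splice_with_placeholder(sequence, introns):
    """Replace each intron with a space, then drop the spaces at the end."""
    for intron in introns:
        sequence = sequence.replace(intron, " ")
    return "".join(sequence.split())

# Toy data: removing "GG" from "ACGGTT..." would create a new "CT" at the junction.
toy_sequence = "ACGGTTCTA"
toy_introns = ["GG", "CT"]

naive_result = splice_naive(toy_sequence, toy_introns)
placeholder_result = splice_with_placeholder(toy_sequence, toy_introns)
print(naive_result)        # ATA   (the junction "CT" was removed as well)
print(placeholder_result)  # ACTTA (only the original "GG" and "CT" were removed)
```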

In [9]:
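def spliceRNA(input_string, introns):
    "Given an RNA string s corresponding to a strand of mRNA, and a collection of RNA strings corresponding to introns of s, return the exonic RNA of s."
    # Mark each removed intron with a space so that the exons on either side cannot join and form a spurious match for a later intron.
    for intron in introns:
        input_string = input_string.replace(intron, ' ')
    
    # Dropping the placeholders concatenates the remaining exons.
    output_string = "".join(input_string.split())
    return output_string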

In [10]:
def transcribeDNAToRNAThenSpliceAndTranslateRNA(input_string, introns, codon_table = None, start_codon = "AUG"):
    "Given a DNA string s (of length at most 1 kbp) and a collection of substrings of s acting as introns, return a protein string resulting from transcribing and translating the exons of s."
    
    # Load the codon table at call time; a default of parseCodonTable() would read the file once, when this function is defined.
    if codon_table is None:
        codon_table = parseCodonTable()
    
    # Transcribe the gene and its introns, so that splicing happens on RNA.
    transcribed_rna = transcribeDNAToRNA(input_string)
    transcribed_intronic_rna = [transcribeDNAToRNA(intron) for intron in introns]
    
    spliced_rna = spliceRNA(transcribed_rna, transcribed_intronic_rna)
    
    translated_protein = translateRNA(spliced_rna, codon_table, start_codon)
    
    return translated_protein
    

def transcribeDNAToRNAThenSpliceAndTranslateRNAFromFileToFile(input_file_path, output_file_path, codon_table_file_path = "RNA_Codon_Table.txt", start_codon = "AUG"):
    "Wraps transcribeDNAToRNAThenSpliceAndTranslateRNA to read from input_file_path and write to output_file_path"
    
    fasta_records = parseFastaFile(input_file_path)
    genomic_dna = fasta_records[0].sequence
    intronic_dna = [fasta_record.sequence for fasta_record in fasta_records[1:]]
    
    codon_table = parseCodonTable(codon_table_file_path)
    
    translated_protein = transcribeDNAToRNAThenSpliceAndTranslateRNA(genomic_dna, intronic_dna, codon_table, start_codon)
    
    # The with-block closes the output file even if the write fails.
    with open(output_file_path, 'w') as output_file:
        output_file.write("%s\n" % translated_protein)


# Test Solution

In [11]:
transcribeDNAToRNAThenSpliceAndTranslateRNAFromFileToFile("Sample_Dataset.txt", "Test_Output.txt")

In [12]:
%%bash
echo Sample_Output.txt
md5sum Sample_Output.txt
cat Sample_Output.txt

Sample_Output.txt
f8806c6a147f43be93beae4df6221db3  Sample_Output.txt
MVYIADKQHVASREAYGHMFKVCA


In [13]:
%%bash
echo Test_Output.txt
md5sum Test_Output.txt
cat Test_Output.txt

Test_Output.txt
f8806c6a147f43be93beae4df6221db3  Test_Output.txt
MVYIADKQHVASREAYGHMFKVCA


In [14]:
%%bash
if [ "$(md5sum Sample_Output.txt | cut -f1 -d' ')" = "$(md5sum Test_Output.txt | cut -f1 -d' ')" ]
then
    echo Sample output matches test output.
else
    echo Sample output does not match test output.
fi

Sample output matches test output.
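
The same comparison can be made from Python, which may be convenient where `md5sum` is not available. The following is a sketch under stated assumptions: `outputs_match` is a hypothetical helper, and the block writes two temporary files of its own so that it runs without the notebook's `Sample_Output.txt` and `Test_Output.txt`.

```python
import filecmp
import os
import tempfile

def outputs_match(expected_path, actual_path):
    """Return True if the two files have identical contents."""
    # shallow=False forces a byte-by-byte comparison instead of comparing os.stat() signatures.
    return filecmp.cmp(expected_path, actual_path, shallow=False)

# Stand-in files for illustration; the real check would pass the notebook's output paths.
with tempfile.TemporaryDirectory() as tmp_dir:
    expected_path = os.path.join(tmp_dir, "expected.txt")
    actual_path = os.path.join(tmp_dir, "actual.txt")
    for path in (expected_path, actual_path):
        with open(path, "w") as handle:
            handle.write("MVYIADKQHVASREAYGHMFKVCA\n")
    same = outputs_match(expected_path, actual_path)

print(same)  # True
```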


# Downloaded Dataset

In [15]:
%%bash
cp ~/Downloads/rosalind_splc.txt ./
cat rosalind_splc.txt

>Rosalind_6882
ATGTGGGCCCGCTATTCGTGTCTTATTGTTCGGGAGATACACAGAACTCAACGTTGACAT
CGCGGAGTGCTATTTGCCAAAACCCGCAGGATTGGATTTTCATCCGGTTGTACTCGGCTC
TAGGACATATCGTCTGCCCTAGGAAGGGTCTACGAGTGCCTTCACATAAGACAAGCGTAG
GCATCTGGGGATACTAGATCGATCCCCTTAGGATTCAATCTTCAAGCCAATTAGCAGCTG
CTGCCGGCGGGACTTAACTAAACATTACGCAGTGCAAGCGCGTGAGTCACTCGTACCGTC
TACGCACTAGAAGTCTCCGTCACAAGCAAGACCAGCACTCCCTGTGTGATGTACGAAACT
ATGGTCTTACCTGCAGGCGAGTCGCATTTTCGACTCTACATCCGCTACCATGCAGACACC
ATCACATAAGATCGCCAGTCCGGCACAGATACCAGATCCTAAGTAGCGCAACTCCCCTGT
CTGTGTGCGAGACGTTAAAACTAATCCTTGTCACTCGGGCGTGCTGAGTTGAGGCAACCG
TGTAAGCGTCCCGACGCTAAAGATGGAATGTCAGTAAAGCCTTTCCAGTTCGGTGGTCGA
TTTCTTCCTTTGGACGGTCCGCTTAGCAGTTAATGTGCATCTACAAAACGATGCAGGTGA
GAGTACAGGTAATCAGCCGGTACCGCCCTATCCCAGCCGCTTTAACGACAACCTGATCGC
GGAAAGTTATGCTGTTTCGGCTTGTCTACCGGGTCGAATATTCGCCCCCGACGGTGATTG
GCAAGATAAGGGACACCGATACGTGTCCAGCGTTCTTGACCCATCGTCGAAAGGTTCTCT
TCTCATCTCGCACATTAATTGCCGCGGCGAACGGCAGAGGTACCGGTCCTCCAATACTGT
ATAGGCCTATCCCGTTCCACACCTAACTGTGCACCCCAGGAGGCAACTGCTGTCTAA
>Rosalind_55

# Solution to Downloaded Dataset

In [16]:
transcribeDNAToRNAThenSpliceAndTranslateRNAFromFileToFile("rosalind_splc.txt", "Solution_Output.txt")

In [17]:
%%bash
cat Solution_Output.txt

MWARYSCLIVREIIAECYLPKPAGLDFHSALGRVYEDPLRIQSSSQRVSHSYRLRTRSLRHKQDQQASRIFDSTSATMQTSSATPLSVCETLKLILVTRWNVSKAFPVRHLQNDAGESTGNQPVPPYPDRGKLCCFGLHRYVSSVLDPSSKGSLLGERQRYRSSRSTPNCAPQEATAV
