# RNA Splicing

## Genes are Discontiguous

In “Transcribing DNA into RNA”, we mentioned that a strand of DNA is copied into a strand of RNA during transcription, but we neglected to mention how transcription is achieved.

In the nucleus, an enzyme (i.e., a molecule that accelerates a chemical reaction) called RNA polymerase (RNAP) initiates transcription by breaking the bonds joining complementary bases of DNA. It then creates a molecule called precursor mRNA, or pre-mRNA, by using one of the two strands of DNA as a template strand: moving down the template strand, when RNAP encounters the next nucleotide, it adds the complementary base to the growing RNA strand, with the proviso that uracil must be used in place of thymine.

Because RNA is constructed based on complementarity, the second strand of DNA, called the coding strand, is identical to the new strand of RNA except for the replacement of thymine with uracil (recall “Transcribing DNA into RNA”).
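The relationship between the template strand, the coding strand, and the new RNA can be illustrated with a minimal Python sketch. The function names and the six-base example strands are assumptions made for illustration only; the template is written 3'→5' so that it lines up base-for-base with the coding strand.

```python
# Base pairing used when RNA is built from a DNA template:
# A on the template pairs with U in RNA, T with A, C with G, G with C.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_from_template(template_3_to_5: str) -> str:
    """Build RNA by pairing each template base with its RNA complement.

    The template is assumed to be written 3'->5', so the RNA reads 5'->3'.
    """
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_3_to_5)


def transcribe_from_coding(coding_5_to_3: str) -> str:
    """Shortcut: the RNA equals the coding strand with T replaced by U."""
    return coding_5_to_3.replace("T", "U")


coding = "ATGGTC"    # hypothetical coding strand, 5'->3'
template = "TACCAG"  # its complement, written 3'->5'

print(transcribe_from_template(template))  # AUGGUC
print(transcribe_from_coding(coding))      # AUGGUC
```

Both routes give the same RNA string, which is why sequence analysis can usually work from the coding strand alone.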

After RNAP has created several nucleotides of RNA, the first separated complementary DNA bases then bond back together. The overall effect is very similar to a pair of zippers traversing the DNA double helix, unzipping the two strands and then quickly zipping them back together while the strand of pre-mRNA is produced.

For that matter, it is not the case that an entire substring of DNA is transcribed into RNA and then translated into a peptide one codon at a time. In reality, a pre-mRNA is first chopped into smaller segments called introns and exons; for the purposes of protein translation, the introns are thrown out, and the exons are glued together sequentially to produce a final strand of mRNA. This cutting and pasting process is called splicing, and it is facilitated by a collection of RNA and proteins called a spliceosome. The fact that the spliceosome is made of RNA and proteins despite regulating the splicing of RNA to create proteins is just one manifestation of a molecular chicken-and-egg scenario that has yet to be fully resolved.

In terms of DNA, the exons deriving from a gene are collectively known as the gene's coding region.

## Problem

After identifying the exons and introns of an RNA string, we only need to delete the introns and concatenate the exons to form a new string ready for translation.

### Given:

A DNA string s (of length at most 1 kbp) and a collection of substrings of s acting as introns. All strings are given in FASTA format.

### Return: 

A protein string resulting from transcribing and translating the exons of s. (Note: Only one solution will exist for the dataset provided.)

## Sample Dataset

```
>Rosalind_10
ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG
>Rosalind_12
ATCGGTCGAA
>Rosalind_15
ATCGGTCGAGCGTGT
```

## Sample Output

```
MVYIADKQHVASREAYGHMFKVCA
```

# Solution 1 - find

Parts of code:
* Convert fasta into sequences
* Remove intron sequence from precursor sequence
* Convert mature sequence into protein

In [1]:
# Import the Biopython pieces used below
from Bio import SeqIO
from Bio.Seq import Seq

In [3]:
# Function to load a FASTA file; the first record is the precursor
# and the remaining records are the introns
def getSeq(path):
    # Parse the FASTA file into a list of SeqRecord objects
    # (each record exposes .id and .seq)
    seq_records = list(SeqIO.parse(path, "fasta"))

    # Return the list of records
    return seq_records

In [4]:
# Function to make an array of strings from the sequence records
## first element: the immature (precursor) sequence
## remainder: the introns
def makeRNAarray(bioObject):
    rnaArray = []
    for seq in bioObject:
        rnaArray.append(str(seq.seq))

    return rnaArray

In [5]:
# Function to remove the introns
def removeIntrons(rnaArr):
    # The first string is the precursor; every later string is an intron
    mRNA = rnaArr[0]
    for intron in rnaArr[1:]:
        # Note: str.replace deletes every occurrence of the intron,
        # not only the first one
        mRNA = mRNA.replace(intron, "")
    return mRNA

In [6]:
# getProtein Function
def getProtein(path):
    ## Load sequences
    rna = makeRNAarray(getSeq(path))
    ## Create mature messenger RNA
    mRNA = removeIntrons(rna)
    ## Translate to peptides
    peptides = Seq(mRNA).translate()
    ## Print protein string without the trailing stop symbol (*)
    print(str(peptides).rstrip('*'))

In [7]:
#removeIntrons(makeRNAarray(getSeq("ros22test.fasta")))
getProtein("ros22test.fasta")

MVYIADKQHVASREAYGHMFKVCA


In [8]:
getProtein("rosalind_splc.txt")

MAPGIVGRLRTRTHHLPEVLGHTYSGRLVPGFSPSSTSPRRKRTKCIDTDVHVPHCHHPRYRVLRPFKKTTRGGHTMSNIQGHVRAFTEPIPSRRPSQTYYRTVRNSPKVRLGSCRSGTAVFYLEGGRCRLPRRPVPLRIPQRLLGSVAYNTLNNDLQQIERADQKCLLLSRPRAALQVMDCHLRTYTRIG


# Rosalind Explanation

Program core: the spliceosome

* In real life, the spliceosome is a complex of RNA and proteins that processes the pre-mRNA into mature mRNA by cutting out the introns.
* In this program, it is a function that takes the first sequence of the FASTA file as the pre-mRNA and all following sequences as the introns.
* Taking the introns one by one, it splits the sequence into the part before the intron and the part after it.
* Those two parts are joined into one string again.
* The process is then repeated for each remaining intron on the shortened sequence from the previous round.
* For the translation of the final "mature mRNA", a dictionary that has the codons as keys can be used, e.g. "ATG": "M" (methionine, the start codon).

## Pseudo code

function spliceosome on pre-mRNA + list of introns:

```
sequence = pre-mRNA
looping over all introns:
    find start position of intron in sequence
    find length of said intron
    sequence before = sequence[0 : start of intron]
    sequence after = sequence[start of intron + length of intron : end of sequence]
    sequence = join sequence before & sequence after
```

return final assembled sequence
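
The pseudo code can be sketched in plain Python with `str.find` and slicing. This is a minimal illustration rather than the notebook's own code: the function name `splice` is an assumption, and the choice to skip an intron that is not found is one possible design, not something the problem statement prescribes.

```python
def splice(pre_mrna: str, introns: list[str]) -> str:
    """Remove the first occurrence of each intron, one intron at a time."""
    sequence = pre_mrna
    for intron in introns:
        start = sequence.find(intron)
        if start == -1:
            # Intron not present in the current sequence; skip it (assumed policy).
            continue
        end = start + len(intron)
        # Join the part before the intron with the part after it.
        sequence = sequence[:start] + sequence[end:]
    return sequence


sample = (
    "ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG"
)
introns = ["ATCGGTCGAA", "ATCGGTCGAGCGTGT"]

mature = splice(sample, introns)
print(len(sample), len(mature))  # 100 75
```

Unlike `str.replace`, this removes only the first occurrence of each intron, and each search runs on the sequence already shortened by the previous rounds.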

## Any advanced information

In the setting of the problem, the sequences are given as DNA. To convert them into RNA, replace each T with U (for uracil). The sugar backbone is also different (ribose instead of deoxyribose), but this does not influence the sequence analysis.
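
As an illustration of the T-to-U conversion followed by a dictionary-based translation, here is a small sketch that does not depend on Biopython. The names `CODON_TABLE`, `transcribe` and `translate` are assumptions for this example. The table is the standard genetic code, generated from the conventional U, C, A, G ordering instead of being typed out codon by codon, and the input is the sample dataset's sequence with both introns already removed.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Convert a DNA coding sequence to RNA by replacing T with U."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """Translate RNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Sample dataset after removing both introns (exons only).
mature_dna = (
    "ATGGTCTACATAGCTGACAAACAGCACGTAGCA"
    "TCTCGAGAGGCATATGGTCACATG"
    "TTCAAAGTTTGCGCCTAG"
)
print(translate(transcribe(mature_dna)))  # MVYIADKQHVASREAYGHMFKVCA
```

Stopping at the first stop codon means no trailing `*` has to be stripped afterwards.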

## Refinements

1. Check if the introns overlap. If so, give the alternative splicing proteins.
2. Separate the code into one function that works on the Biopython sequence objects and one that loads and parses the file.
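
For the first refinement, one possible way to detect overlapping introns is to locate every occurrence of each intron in the precursor and compare the resulting intervals. The sketch below is only a starting point under that assumption; the helper names `find_all` and `overlapping_introns` are hypothetical, and it reports overlaps without yet enumerating the alternative proteins.

```python
from itertools import combinations


def find_all(sequence: str, pattern: str) -> list[int]:
    """Return every start index of pattern in sequence, including overlapping hits."""
    starts = []
    pos = sequence.find(pattern)
    while pos != -1:
        starts.append(pos)
        pos = sequence.find(pattern, pos + 1)
    return starts


def overlapping_introns(pre_mrna: str, introns: list[str]) -> list[tuple]:
    """Return pairs of (intron, start) hits whose half-open intervals overlap."""
    hits = [
        (intron, start, start + len(intron))
        for intron in introns
        for start in find_all(pre_mrna, intron)
    ]
    overlaps = []
    for (a, a_start, a_end), (b, b_start, b_end) in combinations(hits, 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if a_start < b_end and b_start < a_end:
            overlaps.append(((a, a_start), (b, b_start)))
    return overlaps


print(overlapping_introns("AAACCCGGGTTT", ["CCCGG", "GGGTT"]))
# [(('CCCGG', 3), ('GGGTT', 6))]
print(overlapping_introns("AAACCCGGGTTT", ["AAA", "TTT"]))
# []
```

An empty result means the introns can be removed in any order with the same outcome; a non-empty result marks the cases where different removal choices could lead to different proteins.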